Maxime Alvarez

656 posts

@qu3tzalify

VLA/Robot Foundation Models PhD student @Matsuo_Lab & Robot Foundation Model Engineer @telexistenceinc

Tokyo, Japan · Joined November 2011
887 Following · 163 Followers
Pinned Tweet
Maxime Alvarez@qu3tzalify·
My very first paper is being published in the ML Evaluation Workshop at ICLR 2022 (as a poster)! We take a look at the evaluation of anomaly detection methods and show that differences in the evaluation protocol give misleading results. arxiv.org/abs/2204.09825
Stephanie Chan@scychan_brains

Apparent progress in ML research doesn't always map to real progress - it often isn't generalizable, usable or meaningful. Tomorrow at the ML Evaluation Workshop @iclr_conf, join our many distinguished speakers in discussing and improving this situation! iclr.cc/Conferences/20…

Chris Paxton@chris_j_paxton·
One really cool thing from this report: retaining a third-party firm to do benchmarking. if this was done double-blind it would be essentially the best possible setup for knowing which models are best. hope to see these practices in future model releases.
Ai2@allen_ai

Robotics models often struggle outside controlled environments. Ours is built to work in real ones. Today we're launching MolmoAct 2, which can assist with a host of chores & lab tasks, plus the MolmoAct 2-Bimanual YAM dataset—the largest open robotics dataset of its kind. 🧵

Nikolaus West@NikolausWest·
The view with many episodes is coming as experimental in the next release so not surprising you don’t know how to do that with current Rerun! Loading times that are 10x the episode length definitely sounds like something is off. Usual suspects would be poor chunk compaction (run rerun rrd compact on your rrds or change the microbatcher settings when logging), or something with image / video compression. That said in that video the viewer is connected to our cloud platform and doing a lot of optimized stuff to only stream in the visible data from the cloud etc so not directly comparable to loading the whole recording file.
Kojiro Aoumi@AoumiKojir27045·
@jianlanluo I don't understand why it is made in human form; that just raises the cost. A box with some "arm" is enough.
Jianlan Luo@jianlanluo·
Excited to share LWD: Learning While Deploying. Our robots learn while doing real tasks—restocking groceries, brewing Gongfu tea, making cocktails, making juice, and packing shoes. Deployment is no longer just evaluation; it becomes the training loop. 🧵
Cheng Qian@qiancheng1231·
📅 Invited speakers/panelists:
Sergey Levine (UC Berkeley)
Azalia Mirhoseini (Stanford / DeepMind)
Siva Reddy (McGill / Mila)
Graham Neubig (CMU / All Hands AI)
Asli Celikyilmaz (Meta)
Yu Su (OSU)
Manos Koukoumidis (Oumi)
Jona Ruthardt@jonaruthardt·
@qu3tzalify @giffmana @gaur_manu But it goes further: SteerViT also steers the dense representations. Consider this example where specifying a certain person leads to clear separation from other people in the PCA feature visualization. This helps semantic discrimination in downstream tasks (e.g., segmentation).
Manu Gaur@gaur_manu·
Pretrained ViTs like DINOv2 or CLIP are great, but they produce fixed, generic representations that encode the most salient visual concepts (e.g., "cat"). In human vision, prior priming with language changes how people parse an image. We believe visual encoders should do the same 🚨 Introducing Steerable Visual Representations, a new family of visual features you can steer with text towards specific visual concepts.
Maxime Alvarez@qu3tzalify·
@giffmana @gaur_manu Interesting, it seems that the claim that DINOv2 "encode the most salient visual concepts" only is a bit off? That's true if you use the CLS token because it was fine-tuned for that, but if you do like the SAM papers and use all the tokens from DINO, you have encoded everything?
Lucas Beyer (bl16)@giffmana·
@gaur_manu Ha nice, i was thinking about this (steering ViT) a couple years ago but didn't end up with anything nice & good, I was missing the idea to put (rough) masks as the loss target. Good idea!
Maxime Alvarez@qu3tzalify·
@GeneralistAI The wording is confusing: on one side you say "our own (world’s largest) robotics pretraining dataset", and on the other you say "The pretraining dataset contains no robot data". Is this only for GEN 0, or for GEN 1 as well? Do you not consider Fast UMI data to be robotic data?
Maxime Alvarez@qu3tzalify·
@CharlesXu0124 Super interesting! I have a question though: when a human intervenes, why does the intervention replace the VLA actions and not the policy actions? Is the assumption that the failure comes from the base VLA and not the policy?
Maxime Alvarez@qu3tzalify·
@xxunhuang latent actions using the smaller set of actions for weak-supervised learning
Xun Huang@xxunhuang·
It's clear that combining lots of unlabeled video with a smaller set of action-labeled video data is the way to go. The remaining question is how to combine them: Inverse dynamics? Joint prediction? World model? My guess is all approaches will be useful, but for different goals.
Standard Intelligence@si_pbc

Computer use models shouldn't learn from screenshots. We built a new foundation model that learns from video like humans do. FDM-1 can construct a gear in Blender, find software bugs, and even drive a real car through San Francisco using arrow keys.

Maxime Alvarez@qu3tzalify·
@liu730chaoqi What about decoding the separate tree branches in parallel? (Maybe just through masking)
Chaoqi Liu@liu730chaoqi·
Love this. I’ve actually been thinking along similar lines. While I'm not an expert, I'd love to share some of my thoughts, could be badly wrong : (
Let’s take a closer look at what’s really going on during action generation, in particular along the action dimension, i.e., a[:, this axis].
Binning performs per-dimension, per-time-step prediction. That is, in the autoregressive loop, we predict a[t, i+1] given the observation o and previous actions a[1...t, 1...i]. When the kinematic structure is simple, like a robotic arm along a path, this ordering is straightforward: just follow the link order from base to end-effector. But for more complex structures, like dexterous hands or humanoids, where the kinematic structure forms a tree rather than a path, the ordering becomes ambiguous. Should we predict the left arm before the right arm? Does that choice affect performance? It’s unclear (or is it? please share papers if I missed any).
FAST does per-dim-chunk prediction. The BPE-style tokenizer introduces its own heuristic grouping, sometimes combining a[:, 2, 3, 4], sometimes just a[:, 3], and so on. This also carries ambiguity in ordering, similar to binning.
Learned tokenizers, in contrast, seem less prone to this issue. They compress the full action trajectory H_a x D_a into a latent space H_l x D_l, where the compression along the dimension axis (from D_a to D_l) can often be handled by a linear projection. Each latent token then models all action dimensions jointly, so there’s no need to decide if dim-abc should come before dim-def.
While I haven’t worked extensively with dexterous hands or humanoids yet, my intuition is that learned tokenizers are likely better suited for such complex kinematics. That said, if we do want to explicitly incorporate kinematic structure into the tokenization process, it’s certainly possible. In fact, some works already explore this direction; GET-ZERO (get-zero-paper.github.io) is a good example.
Would love to discuss and learn more.
Qi Lv@Aopolin

Nice work! Further, instead of merely modeling what policy model learn, should we consider more about the kinematics or dynamic of the robot system, and integrate them into the action tokenization?
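The per-dimension binning described in the thread above can be sketched as follows. This is a minimal illustration, not any particular VLA's tokenizer: the bin count, the [-1, 1] action range, and the names bin_value, unbin_value, tokenize, and detokenize are all assumptions. The order argument makes the ambiguity explicit: for a tree-shaped kinematic structure, any permutation of the action dimensions is a valid flattening, and an autoregressive model sees a different token sequence for each one.

```python
from typing import List, Optional

NUM_BINS = 256          # assumed number of bins per action dimension
LOW, HIGH = -1.0, 1.0   # assumed normalized action range


def bin_value(x: float) -> int:
    """Map one continuous action value to a discrete bin index."""
    x = min(max(x, LOW), HIGH)                      # clip into the range
    idx = int((x - LOW) / (HIGH - LOW) * NUM_BINS)
    return min(idx, NUM_BINS - 1)                   # HIGH lands in the last bin


def unbin_value(idx: int) -> float:
    """Return the center of a bin as the decoded action value."""
    width = (HIGH - LOW) / NUM_BINS
    return LOW + (idx + 0.5) * width


def tokenize(chunk: List[List[float]], order: Optional[List[int]] = None) -> List[int]:
    """Flatten an H x D action chunk into tokens: time step major,
    dimensions visited in `order` within each time step."""
    dim = len(chunk[0])
    order = list(range(dim)) if order is None else order
    return [bin_value(step[i]) for step in chunk for i in order]


def detokenize(tokens: List[int], order: List[int]) -> List[List[float]]:
    """Invert tokenize for the same dimension order."""
    dim = len(order)
    chunk = []
    for start in range(0, len(tokens), dim):
        step = [0.0] * dim
        for pos, i in enumerate(order):
            step[i] = unbin_value(tokens[start + pos])
        chunk.append(step)
    return chunk
```

For example, tokenize(chunk, order=[1, 0]) emits dimension 1 before dimension 0 at every time step. The decoded actions are the same either way, but the order in which the model conditions on them changes.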

underscore advait patel@_advaitpatel·
@chrisbarber
innate (consumer robotics startup)
defense tech (anduril and saronic off the top of my head)
meta also has a robotics/humanoid division
bedrock robotics (autonomous construction via imitation learning)
i would rename "robots for homes" as "humanoids" and put matic under product
Chris Barber@chrisbarber·
robotics labs and startups list draft - which ones am i missing, and which categorizations are terrible and need updating?

labs
- pi (most people felt like this was the best team on the research side. also more of a general foundation model company, very ambitious.)
- also tesla (but xai is making the models?)
- gdm robotics
- uma
- nvidia robotics
- intrinsic
- rai institute, kind of
- genesis ai
- general intuition
- generalist
- dyna
- skild

ai for science incl lab automation
- periodic labs
- prometheus
- lila sciences
- futurehouse (tbd)
- medra

industrial
- halcyon
- boston dynamics (hyundai)
- unitree
- agibot
- ubtech
- amazon robotics
- also tesla
- also dyna
- dynatronics
- stealth
- sanctuary
- agility
- covariant (acq)
- humanoid
- figure
- apptronik

robots for homes
- 1x
- also tesla
- prosper
- sunday
- matic
- bot company

product
- orchard robotics (farming)
- path ai (manufacturing)
- ironsite (construction)

different axes to look at robotics labs
- precustomer vs post customer
- manufacturing vs home automation vs something else
- specialist vs generalist
- simulation training vs real world training
- lab vs product focused

thanks to friends who suggested companies and ways to improve organizations

which am i missing, and which categorizations need updating
Chen Sun 🤖@ChenSun92·
This is a paper so beautiful and simple that I think it should have been invented way earlier🌹🚨 To see its shine: by replacing the standard single token generation with a multiplex token that aggregates K sampled embeddings at every timestep, multiplex thinking implicitly enables exploration of an exponential state space, but it does so via compression rather than enumeration. But why can this possibly work? In lower dimensions, averaging tokens e.g. "Left" and "Right" gets you mush. But in the high-dimensional space, averaging vectors creates a superposition. Then, by feeding the multiplex vector back into the transformer again and again autoregressively, the model can reason $K^L$ paths (give or take) simultaneously in a "holographic" way (at least until interference eventually hits). This seems like such a very natural successor to CoT, and amazingly, works straight out of the box 🧙‍♂️ Hindsight is 20:20 of course, but again, surprised why no one thought of this before?
Yao Tang@tyao923

𝗧𝗵𝗶𝗻𝗸 𝘄𝗶𝗱𝗲𝗿. 𝗧𝗵𝗶𝗻𝗸 𝘀𝗵𝗼𝗿𝘁𝗲𝗿. 🚀 𝗜𝗻𝘁𝗿𝗼𝗱𝘂𝗰𝗶𝗻𝗴 𝗠𝘂𝗹𝘁𝗶𝗽𝗹𝗲𝘅 𝗧𝗵𝗶𝗻𝗸𝗶𝗻𝗴: token-wise branch-and-merge reasoning for LLMs. 💸 Discrete CoT is costly. 🎛️ Existing continuous tokens often clash with 𝗼𝗻-𝗽𝗼𝗹𝗶𝗰𝘆 𝗥𝗟 𝗲𝘅𝗽𝗹𝗼𝗿𝗮𝘁𝗶𝗼𝗻. 🎥 𝗠𝘂𝗹𝘁𝗶𝗽𝗹𝗲𝘅 𝗧𝗵𝗶𝗻𝗸𝗶𝗻𝗴, a sampling-based continuous reasoning paradigm:
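As a rough illustration of the idea described in the thread (aggregating K sampled token embeddings into one input vector per step), here is a minimal sketch. It assumes a plain unweighted mean over K samples drawn with replacement from the next-token distribution; the actual aggregation rule in the Multiplex Thinking paper may differ, and the function name, toy vocabulary, and embedding values are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def multiplex_embedding(
    probs: Sequence[float],
    embedding_table: Sequence[Sequence[float]],
    k: int,
    rng: random.Random,
) -> Tuple[List[int], List[float]]:
    """Sample k token ids from `probs` and average their embeddings.

    Assumption: unweighted mean of the sampled embeddings.
    """
    token_ids = rng.choices(range(len(probs)), weights=probs, k=k)
    dim = len(embedding_table[0])
    merged = [0.0] * dim
    for token_id in token_ids:
        for j in range(dim):
            merged[j] += embedding_table[token_id][j] / k
    return token_ids, merged


# Toy 3-token vocabulary with 2-dimensional embeddings (illustrative values).
table = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
ids, vec = multiplex_embedding([0.5, 0.5, 0.0], table, k=4, rng=random.Random(0))
```

With K = 1 this reduces to ordinary single-token sampling. Larger K keeps several candidate continuations in one vector, which is the "superposition" the thread refers to; in a real model the merged vector would be fed back as the next input embedding.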

Maxime Alvarez@qu3tzalify·
@YuXiang_IRVL If it’s with gloves with sensors is it still human data? I kind of assumed human data meant with vision only
Genrobot.AI@GenrobotAI·
THE LARGEST OPEN-SOURCE EMBODIED AI DATASET IS COMING.🔥🔥🔥 1Wh RealOmni-Open Dataset 🚀🚀🚀 Launching soon on @huggingface
Alex Dimakis@AlexGDimakis·
@qu3tzalify I don't have class materials on this exact topic online yet, but we will be adding data science pitfalls next semester when I teach this course. There is a lot of great material, e.g. from Fall25: eecs189.org/fa25/
Alex Dimakis@AlexGDimakis·
A paper was recently published in Science on the highest levels of human performance across athletics, science, math and music. I think the paper makes some classical statistics mistakes that still fool many smart people.
The paper "Recent discoveries on the acquisition of the highest levels of human performance" by Gullich et al. claims: "In summary, when comparing performers across the highest levels of achievement, the evidence suggests that eventual peak performance is negatively associated with early performance."
The paper makes two mistakes: the base-rate fallacy and missing Berkson's paradox (aka collider bias).
1. Base-rate fallacy: The study simply says that the very top performers at a young age are not identical to the very top adults (as one would expect, since there are *many many more non-elite young candidates*). Still, elite young performers are 40 times more likely to be among the top adults compared to the general population. This is acknowledged in the paper, but on pages 6-7, a bit buried in the technical analysis and not sufficiently discussed in the abstract or conclusions.
2. Berkson's paradox (collider bias): The paper claims "Across the highest adult performance levels, peak performance is negatively correlated with early performance." This is a classic example of Berkson's paradox. Here is a simplified example to understand this: Assume that to be a successful actor you have to be either extremely good looking or extremely talented. Assume also that talent and looks are independent in the population. Among successful actors, however, you will observe a negative correlation between looks and talent. This doesn't mean anything beyond the selection process and should not be extrapolated.
My favorite example-joke of this is that basketball points scored is negatively associated with height among NBA players (because to be an NBA player you have to be very tall OR very good at scoring). From this, I extrapolated that since I'm 5'7, I will be scoring 80+ points per NBA game. (I include my slide from my lecture on data science and statistics pitfalls)
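The actor example can be checked with a short simulation. The sketch below assumes both traits are independent standard normal variables and that selection requires either trait to exceed a threshold of 1.5; those numbers, the sample size, and the function names are illustrative assumptions, not taken from the paper or the lecture slide.

```python
import math
import random
from typing import List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_selection(
    n: int = 20000, threshold: float = 1.5, seed: int = 0
) -> Tuple[float, float]:
    """Return (correlation in the population, correlation among the selected)."""
    rng = random.Random(seed)
    # Independent traits: no relationship exists in the full population.
    looks = [rng.gauss(0.0, 1.0) for _ in range(n)]
    talent = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Selection on a collider: admitted if EITHER trait is extreme.
    selected = [
        (l, t) for l, t in zip(looks, talent) if l > threshold or t > threshold
    ]
    r_population = pearson(looks, talent)
    r_selected = pearson([l for l, _ in selected], [t for _, t in selected])
    return r_population, r_selected


if __name__ == "__main__":
    r_pop, r_sel = simulate_selection()
    print(f"population r = {r_pop:+.3f}, selected r = {r_sel:+.3f}")
```

In the full population the correlation stays near zero, while among the selected group it is clearly negative, even though nothing about the individuals changed. The negative association is produced entirely by the OR-style selection rule.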