Maxime Alvarez

656 posts

@qu3tzalify

VLA/Robot Foundation Models PhD student @Matsuo_Lab & Robot Foundation Model Engineer @telexistenceinc

Tokyo, Japan · Joined November 2011

887 Following · 163 Followers

Pinned Tweet

Maxime Alvarez@qu3tzalify·28 Apr

My very first paper is being published in the ML Evaluation Workshop at ICLR 2022 (as a poster)! We take a look at the evaluation of anomaly detection methods and show that differences in the evaluation protocol give misleading results. arxiv.org/abs/2204.09825

Stephanie Chan@scychan_brains

Apparent progress in ML research doesn't always map to real progress - it often isn't generalizable, usable or meaningful. Tomorrow at the ML Evaluation Workshop @iclr_conf, join our many distinguished speakers in discussing and improving this situation! iclr.cc/Conferences/20…

Maxime Alvarez@qu3tzalify·6 May

@chris_j_paxton @DJiafei What is RoboArena missing in that case? Maybe consistency?

Chris Paxton@chris_j_paxton·6 May

One really cool thing from this report: retaining a third-party firm to do benchmarking. if this was done double-blind it would be essentially the best possible setup for knowing which models are best. hope to see these practices in future model releases.

Ai2@allen_ai

Robotics models often struggle outside controlled environments. Ours is built to work in real ones. Today we're launching MolmoAct 2, which can assist with a host of chores & lab tasks, plus the MolmoAct 2-Bimanual YAM dataset—the largest open robotics dataset of its kind. 🧵

Maxime Alvarez retweeted

Take Ohkawa@tkhkaeio·5 May

My intern mentees are glowing since I joined AIRoA. I'm looking for self-motivated students who are interested in the following topics: tkhkaeio.github.io/contact/recrui… @airoa_org

Maxime Alvarez@qu3tzalify·2 May

@NikolausWest Looking forward to it! Thanks I'll try that.

Nikolaus West@NikolausWest·2 May

The view with many episodes is coming as experimental in the next release so not surprising you don’t know how to do that with current Rerun! Loading times that are 10x the episode length definitely sounds like something is off. Usual suspects would be poor chunk compaction (run rerun rrd compact on your rrds or change the microbatcher settings when logging), or something with image / video compression. That said in that video the viewer is connected to our cloud platform and doing a lot of optimized stuff to only stream in the visible data from the cloud etc so not directly comparable to loading the whole recording file.

Nikolaus West@NikolausWest·30 Apr

x.com/i/article/2049…

Maxime Alvarez@qu3tzalify·2 May

@AoumiKojir27045 @jianlanluo Cool, enjoy having 200 robots for 200 different tasks 👍

Kojiro Aoumi@AoumiKojir27045·1 May

@jianlanluo I don't understand why it's made in human form; that just raises the cost. A box with some "arm" is enough.

Jianlan Luo@jianlanluo·30 Apr

Excited to share LWD: Learning While Deploying. Our robots learn while doing real tasks—restocking groceries, brewing Gongfu tea, making cocktails, making juice, and packing shoes. Deployment is no longer just evaluation; it becomes the training loop. 🧵

Maxime Alvarez@qu3tzalify·29 Apr

@MingchenZhuge @AilingZeng81332 @tikgiau @shirleyrz_ @sherryyangML @sthuyan @_yunzhong @vikasc @SchmidhuberAI @Wenyi_AI_Wang @dmitrii_tech @PiotrPiekosAI Is the workshop recorded or streamed online?

Mingchen Zhuge@MingchenZhuge·27 Apr

Thanks to all the organizers: @AilingZeng81332 @tikgiau @shirleyrz_ @sherryyangML @sthuyan @_yunzhong @vikasc @SchmidhuberAI and friends coming for help today @Wenyi_AI_Wang @dmitrii_tech @PiotrPiekosAI We truly appreciate all the authors who submitted their papers to our workshop. Special thanks to the 352 reviewers who completed their reviews. And thanks to @TencentHunyuan @Meta @BAAIBeijing @KAUST_News for their sponsorship.

Maxime Alvarez@qu3tzalify·29 Apr

@qiancheng1231 @SFResearch @jiqizhixin @TencentGlobal @Oumi_PBC @emrecanacikgoz @HongruWang007 @9LdROhjZE56jSh9 @ManlingLi_ @YunNungChen @JiahaoQiu99 @CaimingXiong @hengjinlp @tur_gokhan @dilekhakkanitur @philiptorr @kamfai_kfw @seawan @MengdiWang10 Is the workshop recorded or streamed online?

Cheng Qian@qiancheng1231·5 Feb

@SFResearch @jiqizhixin @TencentGlobal @Oumi_PBC Also great thanks to the organizing team + advisory board @emrecanacikgoz @HongruWang007 @9LdROhjZE56jSh9 @ManlingLi_ @YunNungChen @Guanhua_Chen @JiahaoQiu99 @CaimingXiong @hengjinlp @tur_gokhan @dilekhakkanitur @philiptorr @kamfai_kfw @seawan @MengdiWang10

Cheng Qian@qiancheng1231·5 Feb

📅 Invited speakers/panelists:
Sergey Levine (UC Berkeley)
Azalia Mirhoseini (Stanford / DeepMind)
Siva Reddy (McGill / Mila)
Graham Neubig (CMU / All Hands AI)
Asli Celikyilmaz (Meta)
Yu Su (OSU)
Manos Koukoumidis (Oumi)

Maxime Alvarez@qu3tzalify·23 Apr

@YifeiDong314159 @Kingchou007Li @s_zhanyi @LujieYang0 ICRA 2026 Workshop on Manipulation Robustness's website says the deadline is AoE (UTC-12) but OpenReview submissions just closed with a deadline of UTC-0! Which one is correct?
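For reference, a minimal sketch of why the two readings of the deadline differ by 12 hours. The calendar date below is a placeholder chosen for illustration, not the workshop's actual deadline, and the `deadline_in_utc` helper is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# "Anywhere on Earth" is the UTC-12 offset.
AOE = timezone(timedelta(hours=-12), "AoE")


def deadline_in_utc(year, month, day, tz):
    """Return 23:59 on the given date in timezone tz, expressed in UTC."""
    local = datetime(year, month, day, 23, 59, tzinfo=tz)
    return local.astimezone(timezone.utc)


# Placeholder date (assumption): the real deadline date is not stated above.
aoe_deadline = deadline_in_utc(2026, 1, 1, AOE)
utc_deadline = deadline_in_utc(2026, 1, 1, timezone.utc)
gap = aoe_deadline - utc_deadline

print("AoE deadline in UTC:", aoe_deadline.isoformat())
print("UTC deadline in UTC:", utc_deadline.isoformat())
print("Difference:", gap)
```

Under these assumptions, a form that closes at 23:59 UTC shuts 12 hours before an AoE deadline for the same calendar date would.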

Maxime Alvarez@qu3tzalify·13 Apr

@jonaruthardt @giffmana @gaur_manu Thank you for the explanation!

Jona Ruthardt@jonaruthardt·13 Apr

@qu3tzalify @giffmana @gaur_manu But it goes further: SteerViT also steers the dense representations. Consider this example where specifying a certain person leads to clear separation from other people in the PCA feature visualization. This helps semantic discrimination in downstream tasks (e.g., segmentation).

Manu Gaur@gaur_manu·10 Apr

Pretrained ViTs like DINOv2 or CLIP are great, but they produce fixed, generic representations that encode the most salient visual concepts (e.g., "cat"). In human vision, prior priming with language changes how people parse an image. We believe visual encoders should do the same 🚨 Introducing Steerable Visual Representations, a new family of visual features you can steer with text towards specific visual concepts.

Maxime Alvarez@qu3tzalify·11 Apr

@giffmana @gaur_manu Interesting, it seems that the claim that DINOv2 "encode the most salient visual concepts" only is a bit off? That's true if you use the CLS token because it was fine-tuned for that, but if you do like the SAM papers and use all the tokens from DINO, you have encoded everything?

Lucas Beyer (bl16)@giffmana·10 Apr

@gaur_manu Ha nice, i was thinking about this (steering ViT) a couple years ago but didn't end up with anything nice & good, I was missing the idea to put (rough) masks as the loss target. Good idea!

Maxime Alvarez@qu3tzalify·3 Apr

@GeneralistAI The wording is so confusing: on one side you say "our own (world’s largest) robotics pretraining dataset" and on the other you say "The pretraining dataset contains no robot data". Is this only for GEN 0 or GEN 1 as well? Do you not consider Fast UMI data as robotic data?

Generalist@GeneralistAI·2 Apr

8/ Read the full post, along with videos of robots completing dexterous tasks 100s of times in a row, for hours 👇 generalistai.com/blog/apr-02-20…

Maxime Alvarez@qu3tzalify·24 Mar

@CharlesXu0124 Super interesting! I have a question though: when a human intervenes, why does it replace the VLA actions and not the policy actions? Is the assumption that the failure comes from the base VLA and not the policy?

Charles Xu@CharlesXu0124·20 Mar

Pushing online RL to the next level -- exposing RL Token from the π-0.6 model for online RL achieves superhuman performance with as little as 15 minutes of data.

Physical Intelligence@physical_int

We developed an RL method for fine-tuning our models for precise tasks in just a few hours or even minutes. Instead of training the whole model, we add an “RL token” output to π-0.6, our latest model, which is used by a tiny actor and critic to learn quickly with RL.

Maxime Alvarez@qu3tzalify·25 Feb

@xxunhuang latent actions using the smaller set of actions for weak-supervised learning

Xun Huang@xxunhuang·25 Feb

It's clear that combining lots of unlabeled video with a smaller set of action-labeled video data is the way to go. The remaining question is how to combine them: Inverse dynamics? Joint prediction? World model? My guess is all approaches will be useful, but for different goals.

Standard Intelligence@si_pbc

Computer use models shouldn't learn from screenshots. We built a new foundation model that learns from video like humans do. FDM-1 can construct a gear in Blender, find software bugs, and even drive a real car through San Francisco using arrow keys.

Maxime Alvarez@qu3tzalify·9 Feb

@liu730chaoqi What about decoding the separate tree branches in parallel? (Maybe just through masking)

Chaoqi Liu@liu730chaoqi·8 Feb

Love this. I’ve actually been thinking along similar lines. While I'm not an expert, I'd love to share some of my thoughts, could be badly wrong : (
Let’s take a closer look at what’s really going on during action generation, in particular, along the action dimension, i.e., a[:, this axis].
Binning performs per-dimension, per-time-step prediction. That is, in the autoregressive loop, we predict a[t, i+1] given the observation o and previous actions a[1...t, 1...i]. When the kinematic structure is simple, like a robotic arm along a path, this ordering is straightforward: just follow the link order from base to end-effector. But for more complex structures, like dexterous hands or humanoids, where the kinematic structure forms a tree rather than a path, the ordering becomes ambiguous. Should we predict the left arm before the right arm? Does that choice affect performance? It’s unclear (or is it? please share some paper if I missed them).
FAST does per-dim-chunk prediction. The BPE-style tokenizer introduces its own heuristic grouping, sometimes combining a[:, 2, 3, 4], sometimes just a[:, 3], and so on. This also carries ambiguity in ordering, similar to binning.
Learned tokenizers, in contrast, seem less prone to this issue. They compress the full action trajectory H_a x D_a into a latent space H_l x D_l, where the compression along the dimension axis (from D_a to D_l) can often be handled by a linear projection. Each latent token then models all action dimensions jointly, so there’s no need to decide if dim-abc should come before dim-def.
While I haven’t worked extensively with dexterous hands or humanoids yet, my intuition is that learned tokenizers are likely better suited for such complex kinematics.
That said, if we do want to explicitly incorporate kinematic structure into the tokenization process, it’s certainly possible. In fact, some works already explore this direction: GET-ZERO (get-zero-paper.github.io) is a good example. Would love to discuss and learn more.

Qi Lv@Aopolin

Nice work! Further, instead of merely modeling what policy model learn, should we consider more about the kinematics or dynamic of the robot system, and integrate them into the action tokenization?
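The ordering ambiguity Chaoqi Liu describes for per-dimension binning can be sketched in a few lines. This is an illustrative toy, not the tokenizer of any specific model: the `bin_token` and `tokenize` helpers, the [-1, 1] action range and the 256 bins are all assumptions.

```python
def bin_token(value, low=-1.0, high=1.0, n_bins=256):
    """Map one continuous action value to a discrete bin index (assumed uniform bins)."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * n_bins)
    return min(index, n_bins - 1)  # the upper bound falls into the last bin


def tokenize(actions, dim_order):
    """Flatten an [H][D] action chunk into one token sequence.

    Time steps are visited in order; within each step, action dimensions
    are visited in dim_order. An autoregressive decoder would then predict
    the tokens in exactly this sequence.
    """
    return [bin_token(step[d]) for step in actions for d in dim_order]


# Hypothetical chunk: 2 time steps, 3 action dimensions.
chunk = [[-1.0, 0.0, 1.0], [0.5, -0.5, 0.0]]

path_order = tokenize(chunk, [0, 1, 2])  # e.g. base-to-end-effector order
alt_order = tokenize(chunk, [2, 0, 1])   # an equally valid order for a tree

print(path_order)
print(alt_order)
```

Both sequences contain the same tokens, but the conditioning structure differs: each token is predicted given a different set of predecessors. For a kinematic tree, nothing in the binning scheme itself picks one order over the other.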

Maxime Alvarez@qu3tzalify·21 Jan

@_advaitpatel @chrisbarber Hm, humanoids for factories and heavy duty would have wildly different requirements than home humanoids

underscore advait patel@_advaitpatel·14 Jan

@chrisbarber innate (consumer robotics startup)
defense tech (anduril and saronic off the top of my head)
meta also has a robotics/humanoid division
bedrock robotics (autonomous construction via imitation learning)
i would rename "robots for homes" as "humanoids" and put matic under product

Chris Barber@chrisbarber·14 Jan

robotics labs and startups list draft - which ones am i missing, and which categorizations are terrible and need updating?
labs
- pi (most people felt like this was the best team on the research side. also more of a general foundation model company, very ambitious.)
- also tesla (but xai is making the models?)
- gdm robotics
- uma
- nvidia robotics
- intrinsic
- rai institute, kind of
- genesis ai
- general intuition
- generalist
- dyna
- skild
ai for science incl lab automation
- periodic labs
- prometheus
- lila sciences
- futurehouse (tbd)
- medra
industrial
- halcyon
- boston dynamics (hyundai)
- unitree
- agibot
- ubtech
- amazon robotics
- also tesla
- also dyna
- dynatronics
- stealth
- sanctuary
- agility
- covariant (acq)
- humanoid
- figure
- apptronik
robots for homes
- 1x
- also tesla
- prosper
- sunday
- matic
- bot company
product
- orchard robotics (farming)
- path ai (manufacturing)
- ironsite (construction)
different axes to look at robotics labs
- precustomer vs post customer
- manufacturing vs home automation vs something else
- specialist vs generalist
- simulation training vs real world training
- lab vs product focused
thanks to friends who suggested companies and ways to improve organizations
which am i missing, and which categorizations need updating

Maxime Alvarez@qu3tzalify·21 Jan

@ChenSun92 @tyao923 How is interference not hit really quickly?

Chen Sun 🤖@ChenSun92·20 Jan

This is a paper so beautiful and simple that I think it should have been invented way earlier🌹🚨 To see its shine: by replacing the standard single token generation with a multiplex token that aggregates K sampled embeddings at every timestep, multiplex thinking implicitly enables exploration of an exponential state space, but it does so via compression rather than enumeration. But why can this possibly work? In lower dimensions, averaging tokens e.g. "Left" and "Right" gets you mush. But in the high-dimensional space, averaging vectors creates a superposition. Then, by feeding the multiplex vector back into the transformer again and again autoregressively, the model can reason $K^L$ paths (give or take) simultaneously in a "holographic" way (at least until interference eventually hits). This seems like such a very natural successor to CoT, and amazingly, works straight out of the box 🧙‍♂️ Hindsight is 20:20 of course, but again, surprised why no one thought of this before?

Yao Tang@tyao923

𝗧𝗵𝗶𝗻𝗸 𝘄𝗶𝗱𝗲𝗿. 𝗧𝗵𝗶𝗻𝗸 𝘀𝗵𝗼𝗿𝘁𝗲𝗿. 🚀 𝗜𝗻𝘁𝗿𝗼𝗱𝘂𝗰𝗶𝗻𝗴 𝗠𝘂𝗹𝘁𝗶𝗽𝗹𝗲𝘅 𝗧𝗵𝗶𝗻𝗸𝗶𝗻𝗴: token-wise branch-and-merge reasoning for LLMs. 💸 Discrete CoT is costly. 🎛️ Existing continuous tokens often clash with 𝗼𝗻-𝗽𝗼𝗹𝗶𝗰𝘆 𝗥𝗟 𝗲𝘅𝗽𝗹𝗼𝗿𝗮𝘁𝗶𝗼𝗻. 🎥 𝗠𝘂𝗹𝘁𝗶𝗽𝗹𝗲𝘅 𝗧𝗵𝗶𝗻𝗸𝗶𝗻𝗴, a sampling-based continuous reasoning paradigm:
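Chen Sun's point about averaging in low versus high dimensions can be checked with a small toy experiment. This is only a sketch of the superposition intuition, not the Multiplex Thinking method itself: the random "vocabulary", the uniform average over K vectors and all names here are assumptions, and the paper's actual aggregation may differ.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a random direction by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_vector(vectors):
    """Uniform average of equally sized vectors (assumed aggregation rule)."""
    k = len(vectors)
    return [sum(column) / k for column in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def separation(dim, vocab_size=50, k=4, seed=0):
    """Average the first k vocabulary vectors, then compare similarities.

    Returns (lowest similarity to a mixed-in vector,
             highest similarity to a vector that was not mixed in).
    """
    rng = random.Random(seed)
    vocab = [random_unit_vector(dim, rng) for _ in range(vocab_size)]
    mixed = mean_vector(vocab[:k])
    scores = [cosine(mixed, v) for v in vocab]
    return min(scores[:k]), max(scores[k:])


low_in, low_out = separation(dim=2)
high_in, high_out = separation(dim=2048)

print(f"dim=2:    mixed-in min {low_in:.2f}, other max {low_out:.2f}")
print(f"dim=2048: mixed-in min {high_in:.2f}, other max {high_out:.2f}")
```

In 2 dimensions some unrelated vector typically looks more similar to the average than the vectors that were mixed in, which is the "mush" case. In 2048 dimensions random vectors are nearly orthogonal, so every mixed-in vector stays clearly more similar to the average than any outsider. How many such averages can be chained before interference dominates is the open question raised in the reply above.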

Maxime Alvarez@qu3tzalify·18 Jan

@YuXiang_IRVL If it’s with gloves with sensors is it still human data? I kind of assumed human data meant with vision only

Yu Xiang@YuXiang_IRVL·17 Jan

Temporal and spatial alignment between glove and camera data. It is not an easy task, but it’s coming together. Human data for robots. #Robotics

Yu Xiang@YuXiang_IRVL

From egocentric glove data to a MANO hand model, replayed in Rerun.

Maxime Alvarez@qu3tzalify·30 Dec

@q17224 @GenrobotAI @huggingface I was so confused, thank you

Jiabin@q17224·30 Dec

@GenrobotAI @huggingface FYI, W in Chinese means 10k, so 1Wh means 10k hours

Genrobot.AI@GenrobotAI·29 Dec

THE LARGEST OPEN-SOURCE EMBODIED AI DATASET IS COMING.🔥🔥🔥 1Wh RealOmni-Open Dataset 🚀🚀🚀 Launching soon on @huggingface

Maxime Alvarez@qu3tzalify·27 Dec

@AlexGDimakis Thank you!

Alex Dimakis@AlexGDimakis·22 Dec

@qu3tzalify I don't have class materials on this exact topic online yet, but we will be adding data science pitfalls next semester when I teach this course. There is a lot of great material, e.g. from Fall25: eecs189.org/fa25/

Alex Dimakis@AlexGDimakis·22 Dec

A paper was recently published in Science on the highest levels of human performance across athletics, science, math and music. I think the paper makes some classical statistics mistakes that still fool many smart people.
The paper "Recent discoveries on the acquisition of the highest levels of human performance" by Gullich et al. claims: "In summary, when comparing performers across the highest levels of achievement, the evidence suggests that eventual peak performance is negatively associated with early performance."
The paper makes two mistakes: the base-rate fallacy and missing Berkson's paradox (aka collider bias).
1. Base-rate fallacy: The study says simply that the very top at a young age are not identical with the very top adults. (As one would expect, since there are *many many more non-elite young candidates*.) Still, elite young performers are 40 times more likely to be in the top adults compared to the general population. This is acknowledged in the paper, but on pages 6-7, a bit buried in the technical analysis and not sufficiently discussed in the abstract or conclusions.
2. Berkson's paradox (collider bias): The paper claims "Across the highest adult performance levels, peak performance is negatively correlated with early performance." This is a classic example of Berkson's paradox. Here is a simplified example to understand this: Assume that to be a successful actor you have to be either extremely good looking or extremely talented. Assume also that talent and looks are independent in the population. However, among successful actors you will observe a negative correlation between looks and talent. This doesn't mean anything beyond the selection process and should not be extrapolated.
My favorite example-joke of this is that basketball points scored is negatively associated with height among NBA players (because to be an NBA player you have to be very tall OR be very good at scoring).
From this, I extrapolated that since I'm 5'7, I will be scoring 80+ points per NBA game. (I include my slide from my lecture on data science and statistics pitfalls)
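The actor example can be reproduced with a short simulation. This is a minimal sketch under stated assumptions: looks and talent are drawn as independent standard normal scores, "successful" means either score exceeds an arbitrary threshold of 1.5, and the `simulate` and `pearson` helpers are hypothetical names, not anything from the paper or the lecture.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate(n=20000, threshold=1.5, seed=0):
    """Return (correlation in the population, correlation among the selected)."""
    rng = random.Random(seed)
    # Looks and talent are independent by construction.
    people = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
    # Selection rule (assumed): extremely good looking OR extremely talented.
    selected = [(looks, talent) for looks, talent in people
                if looks > threshold or talent > threshold]
    return pearson(*zip(*people)), pearson(*zip(*selected))


population_r, selected_r = simulate()
print(f"correlation in the whole population: {population_r:+.3f}")
print(f"correlation among selected 'actors': {selected_r:+.3f}")
```

The population correlation stays near zero while the selected group shows a clearly negative one, even though nothing links the two traits. The negative association is produced entirely by the OR-style selection rule.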
