Marius Memmel

130 posts

@memmelma

Robotics PhD student @UW, previously intern @NVIDIA, intern @Bosch_AI, @EPFL, @TUDarmstadt, @DHBW

Seattle, WA · Joined April 2021
566 Following · 498 Followers
Pinned Tweet
Marius Memmel @memmelma
There’s a discussion going on rn about two recent robotic reward models: TOPReward⛰️ and Robometer🌡️. Which one is better? It depends entirely on your objective! Here is a deep dive into the conceptual differences, strengths, and weaknesses of both. 🧵👇
Marius Memmel retweeted
Abhishek Gupta @abhishekunique7
Excited to share the project that has surprised me the most in the last year! Large-scale RL in simulation, with no demos and no reward engineering, can solve dynamic, dexterous, and contact-rich tasks. The learned behaviors are reactive, forceful, and use the environment for recovery in ways that are extremely challenging to bake in or teleoperate! You can play with the policies yourself to see: weirdlabuw.github.io/omnireset/ And the learned behavior transfers to real-world robots from RGB camera inputs! So what’s the trick? Using simulator resets carefully! Let’s unpack (1/10)
Marius Memmel retweeted
Patrick Yin @patrickhyin
We’re releasing OmniReset, a framework for training robot policies using large-scale RL and diverse resets for contact-rich, dexterous manipulation. OmniReset pushes the frontier of robustness and dexterity, without any reward engineering or demonstrations. Try the policies yourself in our interactive simulator! weirdlabuw.github.io/omnireset/ (1/N 🧵)
Marius Memmel @memmelma
Hi Shirui, thanks for the clarification. The plot above shows that TOPReward fails to detect task success ("put the pear on the red plate") and hallucinates progress for the unrelated language instructions. So even with the raw logits (which we use for the trajectory ranking and confusion matrix evaluations), TOPReward performs poorly at comparing progress across multiple trajectories.

To be useful as a reward model for RL, there must be a clear separation between successful and unsuccessful/unrelated trajectories. Basically, the problem is that during RL exploration, the agent often fails to solve the task or solves an unrelated task (e.g., "put the strawberry on the red plate"). If these trajectories get rewards comparable to a successful one, the Q/V values will be similar for all of these trajectories. As such, the policy will likely extract behaviors that solve the wrong task.
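To make the Q/V-value argument concrete, here is a minimal sketch with made-up numbers (my own illustration; the reward curves are not actual TOPReward or Robometer outputs) of how poorly separated per-step rewards collapse the discounted returns that Q/V functions would fit:

```python
import numpy as np

GAMMA = 0.99
T = 50  # trajectory length

def discounted_return(rewards, gamma=GAMMA):
    # Return of a trajectory: sum_t gamma^t * r_t.
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Three rollouts typical of RL exploration: one solves the task, one fails,
# one solves an unrelated task ("put the strawberry on the red plate").
# Reward model A hallucinates progress, so all three look similar.
model_a = {
    "success":   np.linspace(0.0, 1.0, T),
    "failure":   np.linspace(0.0, 0.8, T),  # hallucinated progress
    "unrelated": np.linspace(0.0, 0.9, T),  # hallucinated progress
}
# Reward model B clearly separates success from everything else.
model_b = {
    "success":   np.linspace(0.0, 1.0, T),
    "failure":   np.full(T, 0.05),
    "unrelated": np.full(T, 0.05),
}

for name, model in [("hallucinating", model_a), ("well-separated", model_b)]:
    returns = {k: round(discounted_return(v), 1) for k, v in model.items()}
    print(name, returns)
# Under the hallucinating model the returns (and hence the fitted Q/V values)
# are nearly identical, so the policy gets little signal to prefer the
# behavior that actually solves the commanded task.
```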
Shirui @ChinSengi
@memmelma Hi Marius, author of TOPReward here. Actually, our method does allow comparison across multiple trajectories with the raw token probability; it is the model's internal belief of the progress of the task. Also, our advantage-aligned BC does not use the normalized progress metric.
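For readers following along, here is one way a cross-trajectory comparison on raw token probabilities could look. This is my illustration with made-up logits, not TOPReward's actual implementation:

```python
import numpy as np

def success_token_prob(success_logit, other_logits):
    # Softmax entry for the success/progress token, i.e. the raw
    # probability the VLM assigns to it at a given frame.
    logits = np.concatenate([[success_logit], other_logits])
    return np.exp(success_logit) / np.exp(logits).sum()

# Hypothetical per-frame success-token logits for two rollouts.
other = np.array([0.0, -1.0])  # logits of competing tokens (made up)
traj_a = [success_token_prob(s, other) for s in np.linspace(-2.0, 3.0, 10)]
traj_b = [success_token_prob(s, other) for s in np.linspace(-2.0, 0.0, 10)]

# No per-trajectory normalization is applied, so the raw probabilities
# remain directly comparable across rollouts.
print("final belief A:", round(traj_a[-1], 3), "| final belief B:", round(traj_b[-1], 3))
```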
Marius Memmel retweeted
Tyler Han @TylerHan19
Animals can’t learn by being tele-operated. But they do learn by observing and interacting with the world around them. So, why don’t robots learn this way? Excited to release “Planning from Observation and Interaction” for real-world observational learning on robots! 🧵(1/12)
Marius Memmel @memmelma
Do off-the-shelf VLMs hallucinate progress? While playing around with TOPReward⛰️, I’ve noticed some patterns that I didn't see with Robometer🌡️ and hypothesized that “off-the-shelf VLMs hallucinate progress”.

To test this hypothesis, I evaluated Robometer🌡️ and TOPReward⛰️ (Qwen3-VL-8B-Instruct) with different language instructions on the same video of the robot “placing a banana on a red plate”. While TOPReward⛰️ predicts increasing progress for the correct language instruction (”put the banana on the red plate”), it predicts oscillating progress for the always-true instruction (”put the pear on the red plate”) and the completely OOD instructions (”fold the pink towel” and “make a left turn with the car”). In contrast, the correct and OOD instructions are clearly separated in the Robometer🌡️ evaluation!

These findings support the above recommendation to use TOPReward⛰️ as a simple baseline for filtering offline trajectories and Robometer🌡️ as a reward model for reinforcement learning 🦾
[two plots: per-frame progress predictions under each instruction, for TOPReward⛰️ and Robometer🌡️]
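The probe itself is simple to reproduce. Here is a minimal sketch of the setup (the predict_progress stub is a hypothetical placeholder, not the actual TOPReward or Robometer interface): score one fixed video under the correct, wrong-object, and OOD instructions, then compare the resulting progress curves.

```python
import numpy as np

def predict_progress(frames: np.ndarray, instruction: str) -> np.ndarray:
    # Per-frame progress in [0, 1]. Placeholder body so the sketch runs;
    # swap in the actual reward model (e.g., a VLM queried frame by frame).
    return np.zeros(len(frames))

instructions = [
    "put the banana on the red plate",  # correct
    "put the pear on the red plate",    # always-true / wrong object
    "fold the pink towel",              # OOD
    "make a left turn with the car",    # OOD
]

video = np.zeros((120, 224, 224, 3), dtype=np.uint8)  # stand-in for real frames

curves = {ins: predict_progress(video, ins) for ins in instructions}

# A useful reward model should separate the correct instruction from the rest,
# e.g. via the gap between final-frame progress estimates.
correct = curves[instructions[0]][-1]
best_other = max(c[-1] for ins, c in curves.items() if ins != instructions[0])
print(f"separation at final frame: {correct - best_other:.3f}")
```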
Marius Memmel retweeted
Jesse Zhang @Jesse_Y_Zhang
A reward model that works, zero-shot, across robots, tasks, and scenes? Introducing Robometer: Scaling general-purpose robotic reward models with 1M+ trajectories. Enables zero-shot: online/offline/model-based RL, data retrieval + IL, automatic failure detection, and more! 🧵 (1/12)
Marius Memmel retweeted
Avinandan Bose @avibose22
🚨 Check out our latest work on inference-time cold-start personalization. Memory is promising, but what if a user queries a task out of domain from their chat history? In-session multi-turn interactions provide a great opportunity to surface user needs via strategic elicitation 🔁
Stella Li @StellaLisy

Personalization assumes you need history with a user. What if you don't? Cold-start is hard: each task & user has many preference dimensions, but each user only cares about a few. A few strategic questions is all you need, if u know how preferences correlate across the population👉🏻🧵

Marius Memmel retweeted
Nicholas Pfaff @NicholasEPfaff
Meet SceneSmith: An agentic system that generates entire simulation-ready environments from a single text prompt. VLM agents collaborate to build scenes with dozens of objects per room, articulated furniture, and full physics properties. We believe environment generation is no longer the bottleneck for scalable robot training and evaluation in simulation. Website: scenesmith.github.io 👇🧵(1/8)
Marius Memmel retweeted
Abhishek Gupta @abhishekunique7
Check out new work from @EntongSu on RL finetuning of pretrained flow policies with residual flow steering. The motivation is simple - steering input diffusion noise can struggle to handle higher-dexterity problems like multi-fingered hands, because the base policy may not cover the requisite behavior. In these cases, you want local adjustments to the behavior with an action-space residual, along with global adjustments to the strategy by modulating input noise for a flow policy. The solution is frustratingly simple, but works really well for high-dexterity manipulation - especially with dexterous multi-fingered hands. Check out @EntongSu's thread for more details, and our website / paper. entongsu.github.io/rfs/ arxiv.org/abs/2602.01789
Entong Su @EntongSu

Pretrained diffusion/flow policies are powerful — but brittle at deployment. We introduce RFS, a data-efficient RL framework that:
• steers latent noise for global adaptation
• applies residual actions for precise local correction
Works in sim and real-world dexterous manipulation 🖐️🤖
👉📄 Paper + videos: entongsu.github.io/rfs/

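A minimal sketch of my reading of this two-level adaptation (the names, dimensions, and the frozen base_flow interface are all assumptions, not the released RFS code): the RL-trained parts steer the noise fed into a frozen pretrained flow policy (global strategy) and add a small action-space residual (local correction).

```python
import torch

class ResidualFlowSteering(torch.nn.Module):
    # Toy two-level adapter around a frozen pretrained flow policy.
    def __init__(self, base_flow, obs_dim, act_dim, latent_dim):
        super().__init__()
        self.base_flow = base_flow  # frozen; maps (obs, noise) -> action
        # Global adaptation: learned shift of the input noise.
        self.noise_head = torch.nn.Linear(obs_dim, latent_dim)
        # Local adaptation: learned action-space residual.
        self.residual_head = torch.nn.Linear(obs_dim, act_dim)

    def forward(self, obs):
        z = torch.randn(obs.shape[0], self.noise_head.out_features)
        z_steered = z + self.noise_head(obs)   # modulate the latent noise
        with torch.no_grad():                  # base policy stays frozen
            base_action = self.base_flow(obs, z_steered)
        return base_action + 0.1 * self.residual_head(obs)  # bounded correction

# Usage with a dummy stand-in for the pretrained flow policy.
dummy_flow = lambda obs, z: torch.tanh(z[:, :8])            # act_dim = 8
policy = ResidualFlowSteering(dummy_flow, obs_dim=32, act_dim=8, latent_dim=16)
actions = policy(torch.randn(4, 32))                        # batch of 4 observations
print(actions.shape)  # torch.Size([4, 8])
```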
Marius Memmel retweeted
Entong Su @EntongSu
Pretrained diffusion/flow policies are powerful — but brittle at deployment. We introduce RFS, a data-efficient RL framework that:
• steers latent noise for global adaptation
• applies residual actions for precise local correction
Works in sim and real-world dexterous manipulation 🖐️🤖
👉📄 Paper + videos: entongsu.github.io/rfs/
Marius Memmel retweeted
Jiafei Duan @DJiafei
Why do generalist robotic models fail when a cup is moved just two inches to the left? It’s not a lack of motor skill, it’s an alignment problem. Today, we introduce VLS: Vision-Language Steering of Pretrained Robot Policies, a training-free framework that guides robot behavior in real time. Check out the project: vision-language-steering.github.io/webpage/ 👇🧵 (Watch till the end: VLS runs uncut, steering pretrained policies across long-horizon tasks.)
Marius Memmel retweeted
Avinandan Bose @avibose22
🚨 New Blog Alert: Adaptive Intelligence (and Why LLMs Lack It) Lately I've been obsessing over a simple question: why do AI assistants feel harder to use even as they get smarter? You know the feeling. You wanted something specific. Your AI didn't ask, it just "thought for 57 seconds" and confidently solved a problem you never had in mind. Now you're 10 "thought for a minute" deep, feeling unheard, trying to steer it back to what you actually meant. That feeling? It's a real capability gap. 🧵👇
Marius Memmel retweeted
Andrew Wagenmaker @ajwagenmaker
How should we pretrain a policy from demonstrations to ensure it is an effective initialization for RL finetuning, while preserving the performance of the pretrained policy itself? We propose Posterior Behavioral Cloning (PostBC)! (1/11)