Ludvig Erikson Brangstrup

22 posts

@BrangstrupL

Founder of Qualia https://t.co/eyu2QCyRwm

Joined December 2025
47 Following, 3 Followers
Roberto (@robertorobotics)
@vai_viswanathan Mainly intervention data, with advantage labels and additional stage labels etc. Intervention data is most important.
Roberto (@robertorobotics)
With my custom architecture I now achieve on average 60% progress on my gearbox assembly task. The original average progress of pi0.5 base (after 10k fine-tuning) was 14%. Trajectory of progress is looking good. Now I have to reduce the cycle time of iterating on a base policy.
Ludvig Erikson Brangstrup retweeted
Lakshita Dodeja (@lakshitadodeja)
Can BC policies be quickly improved through real world experience? Our new #RSS2026 paper proposes Q2RL, a method that bridges BC and RL for on-robot learning. Q2RL improves BC policies by up to 3.75x with just 1-2 hours of online interaction! So when life gives you BC, make Q-functions! 🍋 Details in thread 🧵
Ludvig Erikson Brangstrup retweeted
LeRobot (@LeRobotHF)
🤖 Adding new RL algorithms to LeRobot just got much easier.

Demo: HIL-SERL training with a SAC-based RL algorithm on an SO-100 for a hole-in-hand peg-in-hole task. Sparse reward, only 30 offline demos mixed with live robot experience, and ~1 hour of online training with human interventions only when the policy fails. The bottom graph tracks intervention rate: high at the start, steadily dropping as the policy improves.

The refactor separates algorithm logic from training infrastructure:
• RLAlgorithm owns learning logic
• RLTrainer handles orchestration
• DataMixer combines rollouts, demos, interventions, and future data sources

Adding an RL algorithm now looks much closer to adding a policy: one algorithm file, one config, one registry entry.

SAC is first. RLT, RECAP, ConRFT, QC-FQL, DSRL, and VLA RL fine-tuning next! @Thom_Wolf @ClementDelangue
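The split between learning logic, orchestration, and data mixing described above can be pictured with a small, library-agnostic sketch. Nothing below is LeRobot's actual API: `ToyDataMixer`, `ToyAlgorithm`, `ToyTrainer`, the source names, and the sampling weights are all hypothetical, and the "update" only counts samples instead of running SAC.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyDataMixer:
    """Hypothetical mixer: draws a batch across named data sources by weight."""
    sources: dict   # source name -> list of transitions
    weights: dict   # source name -> relative sampling weight
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def sample(self, batch_size):
        # Only draw from sources that currently hold data.
        names = [name for name in self.sources if self.sources[name]]
        probs = [self.weights[name] for name in names]
        picks = self.rng.choices(names, weights=probs, k=batch_size)
        return [(name, self.rng.choice(self.sources[name])) for name in picks]


class ToyAlgorithm:
    """Hypothetical stand-in for the learning logic (no real gradient step)."""

    def __init__(self):
        self.updates = 0

    def update(self, batch):
        self.updates += 1
        # Report how many samples came from each source, in place of a loss.
        counts = {}
        for name, _ in batch:
            counts[name] = counts.get(name, 0) + 1
        return counts


class ToyTrainer:
    """Hypothetical orchestration loop: pull a batch, hand it to the algorithm."""

    def __init__(self, algorithm, mixer):
        self.algorithm = algorithm
        self.mixer = mixer

    def train(self, steps, batch_size):
        return [self.algorithm.update(self.mixer.sample(batch_size))
                for _ in range(steps)]


mixer = ToyDataMixer(
    sources={"demos": list(range(30)), "rollouts": list(range(100)), "interventions": []},
    weights={"demos": 0.5, "rollouts": 0.25, "interventions": 0.25},
)
trainer = ToyTrainer(ToyAlgorithm(), mixer)
history = trainer.train(steps=3, batch_size=8)
```

In this toy layout, swapping the algorithm means replacing only `ToyAlgorithm`; the trainer and mixer stay untouched, which is the property the tweet attributes to the refactor.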
Ludvig Erikson Brangstrup retweeted
Weikai Huang (@weikaih04)
Traditional VLAs perceive the world in 2D through a ViT, while humans perceive it in 3D. Introducing MolmoAct 2, a fully open-source VLA that first reasons in 3D space and then acts, beating Pi0.5 on nearly all benchmarks. We open-sourced all the data, code, and models. Huge shout-out to project leads @hq_fang and @DJiafei.
Ai2 (@allen_ai)

Robotics models often struggle outside controlled environments. Ours is built to work in real ones. Today we're launching MolmoAct 2, which can assist with a host of chores & lab tasks, plus the MolmoAct 2-Bimanual YAM dataset—the largest open robotics dataset of its kind. 🧵

Ludvig Erikson Brangstrup retweeted
Robots Digest 🤖 (@robotsdigest)
Ever fine-tuned a VLA policy on a small demo dataset, only for it to suddenly stop listening to new instructions? This paper calls it lock-in. The model just repeats what it saw during training, like always picking bread even when you say apple. Low-data post-training quietly kills steerability. The fix? DeLock is surprisingly simple and clever.
Ludvig Erikson Brangstrup retweeted
Robots Digest 🤖 (@robotsdigest)
VLLR tackles one of the hardest problems in robotics: long-horizon reward design.
→ LLMs decompose tasks into subgoals
→ VLMs estimate progress (for value init)
→ Policy self-certainty gives dense intrinsic reward
Result: up to +56% success without manual reward engineering
Ludvig Erikson Brangstrup retweeted
pdawg (@prathamgrv)
I made a Claude Code skill that turns any arXiv paper into working code. Every line traces back to the paper section it came from, and any implementation detail the paper skips is flagged, not assumed. Open-sourcing it: github.com/PrathamLearnsT…
Ludvig Erikson Brangstrup retweeted
Robots Digest 🤖 (@robotsdigest)
It's called EgoNav. The robot learned to:
• Wait for doors to open
• Avoid glass walls that sensors can't detect
• Yield to people walking by
• Re-route when furniture moves
Ludvig Erikson Brangstrup retweeted
Robots Digest 🤖 (@robotsdigest)
A person strapped on a backpack and walked around campus for 5 hours. That footage just trained a humanoid robot to navigate unseen buildings, crowds, and glass walls with zero robot data AND zero finetuning.
Ludvig Erikson Brangstrup retweeted
Max Fu (@letian_fu)
Robotics: coding agents’ next frontier. So how good are they? We introduce CaP-X: an open-source framework and benchmark for coding agents, where they write code for robot perception and control, execute it on sim and real robots, observe the outcomes, and iteratively improve code reliability. From @NVIDIA @Berkeley_AI @CMU_Robotics @StanfordAILab capgym.github.io 🧵
Ludvig Erikson Brangstrup retweeted
Robots Digest 🤖 (@robotsdigest)
VLAs struggle with contact-rich manipulation because vision isn’t enough. TacVLA adds tactile sensing to VLA policies with contact-aware gating, so touch is used only when contact happens. Result: much better disassembly, picking, and robustness under occlusion.
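One way to read "touch is used only when contact happens" is as a soft gate on the tactile features. The sketch below is an illustration under that assumption, not TacVLA's architecture: the sigmoid gate, the `threshold` and `sharpness` values, and the additive fusion are all hypothetical choices.

```python
import math


def contact_gate(force, threshold=0.5, sharpness=10.0):
    """Hypothetical gate in (0, 1): near 0 without contact, near 1 with contact."""
    return 1.0 / (1.0 + math.exp(-sharpness * (force - threshold)))


def fuse(vision_feat, tactile_feat, force):
    """Add tactile features to vision features, scaled by the contact gate."""
    gate = contact_gate(force)
    return [v + gate * t for v, t in zip(vision_feat, tactile_feat)]


free_space = fuse([1.0, 2.0], [10.0, 10.0], force=0.0)   # tactile mostly ignored
in_contact = fuse([1.0, 2.0], [10.0, 10.0], force=1.0)   # tactile mostly included
```

With the gate nearly closed in free space, noisy tactile readings barely perturb the vision features; once the measured force crosses the threshold, the tactile signal passes through almost unchanged.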
Ludvig Erikson Brangstrup retweeted
Brian Roemmele (@BrianRoemmele)
Boom! A World Model trained on 44,000 hours of human videos! A partnership of @NVIDIARobotics, UC Berkeley, HKUST, and UT Austin just released the open-source DreamDojo, a foundation world model for robots trained on the largest video dataset to date for world model pretraining.
Ludvig Erikson Brangstrup retweeted
Turing Post (@TheTuringPost)
A simple recipe for Continual Reinforcement Learning (CRL) for VLA models:

Sequential Fine-Tuning (Seq. FT) = Large pretrained VLA model + LoRA + on-policy RL

Researchers from @UTAustin found that this combination helps to:
- prevent catastrophic forgetting
- learn new tasks easily (strong plasticity)
- keep strong zero-shot abilities

In many cases, it even beats more complicated continual learning methods.

Here is how it works:

Btw, in this article we explain why LoRA is especially important for fine-tuning now: turingpost.com/p/beyondrl
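The LoRA ingredient of the recipe above can be sketched in a few lines. This is a generic illustration of low-rank adaptation (frozen weight W plus a scaled update B·A), not code from the paper or the article; the class name `LoRALinear`, the initialization scale, and the pure-Python matrices are assumptions made to keep the example self-contained.

```python
import random


def matmul(a, b):
    """Plain nested-list matrix product: (n x k) times (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight                      # frozen pretrained weight (out x in)
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the update is zero at init.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        delta = matmul(self.B, self.A)            # (out x rank) @ (rank x in)
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(identity, rank=1, alpha=2.0)
output_at_init = layer.forward([1.0, 2.0, 3.0, 4.0])
```

Because B starts at zero, the adapted layer initially reproduces the pretrained one exactly, and only the small A and B matrices would be updated during RL fine-tuning. That small trainable footprint is one plausible reason the base model's abilities are easier to preserve.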
Ludvig Erikson Brangstrup retweeted
Zhuokai Zhao (@zhuokaiz)
AMI Labs just raised $1.03B. World Labs raised $1B a few weeks earlier. Both are betting on world models. But almost nobody means the same thing by that term. Here are, in my view, five categories of world models.

---

1. Joint Embedding Predictive Architecture (JEPA)

Representatives: AMI Labs (@ylecun), V-JEPA 2

The central bet here is that pixel reconstruction alone is an inefficient objective for learning the abstractions needed for physical understanding. LeCun has been saying this for years: predicting every pixel of the future is intractable in any stochastic environment. JEPA sidesteps this by predicting in a learned latent space instead.

Concretely, JEPA trains an encoder that maps video patches to representations, then a predictor that forecasts masked regions in that representation space, not in pixel space. This is a crucial design choice. A generative model that reconstructs pixels is forced to commit to low-level details (exact texture, lighting, leaf position) that are inherently unpredictable. By operating on abstract embeddings, JEPA can capture "the ball will fall off the table" without having to hallucinate every frame of it falling.

V-JEPA 2 is the clearest large-scale proof point so far. It's a 1.2B-parameter model pre-trained on 1M+ hours of video via self-supervised masked prediction (no labels, no text). The second training stage is where it gets interesting: just 62 hours of robot data from the DROID dataset is enough to produce an action-conditioned world model that supports zero-shot planning. The robot generates candidate action sequences, rolls them forward through the world model, and picks the one whose predicted outcome best matches a goal image. This works on objects and environments never seen during training.

The data efficiency is the real technical headline. 62 hours is almost nothing. It suggests that self-supervised pre-training on diverse video can bootstrap enough physical prior knowledge that very little domain-specific data is needed downstream. That's a strong argument for the JEPA design: if your representations are good enough, you don't need to brute-force every task from scratch.

AMI Labs is LeCun's effort to push this beyond research. They're targeting healthcare and robotics first, which makes sense given JEPA's strength in physical reasoning with limited data. But this is a long-horizon bet: their CEO has openly said commercial products could be years away.

---

2. Spatial Intelligence (3D World Models)

Representative: World Labs (@drfeifei)

Where JEPA asks "what will happen next," Fei-Fei Li's approach asks "what does the world look like in 3D, and how can I build it?"

The thesis is that true understanding requires explicit spatial structure (geometry, depth, persistence, and the ability to re-observe a scene from novel viewpoints), not just temporal prediction. This is a different bet from JEPA: rather than learning abstract dynamics, you learn a structured 3D representation of the environment that you can manipulate directly.

Their product Marble generates persistent 3D environments from images, text, video, or 3D layouts. "Persistent" is the key word: unlike a video generation model that produces a linear sequence of frames, Marble's outputs are actual 3D scenes with spatial coherence. You can orbit the camera, edit objects, export meshes. This puts it closer to a 3D creation tool than to a predictive model, which is deliberate.

For context, this builds on a lineage of neural 3D representation work (NeRFs, 3D Gaussian Splatting) but pushes toward generation rather than reconstruction. Instead of capturing a real scene from multi-view photos, Marble synthesizes plausible new scenes from sparse inputs. The challenge is maintaining physical plausibility (consistent geometry, reasonable lighting, sensible occlusion) across a generated world that never existed.

---

3. Learned Simulation (Generative Video + Latent-Space RL)

Representatives: Google DeepMind (Genie 3, Dreamer V3/V4), Runway GWM-1

This category groups two lineages that are rapidly converging: generative video models that learn to simulate interactive worlds, and RL agents that learn world models to train policies in imagination.

The video generation lineage. DeepMind's Genie 3 is the purest version: text prompt in, navigable environment out, 24 fps at 720p, with consistency for a few minutes. Rather than relying on an explicit hand-built simulator, it learns interactive dynamics from data. The key architectural property is autoregressive generation conditioned on user actions: each frame is generated based on all previous frames plus the current input (move left, look up, etc.). This means the model must maintain an implicit spatial memory: turn away from a tree and turn back, and it needs to still be there. DeepMind reports consistency up to about a minute, which is impressive but still far from what you'd need for sustained agent training.

Runway's GWM-1 takes a similar foundation (autoregressive frame prediction built on Gen-4.5) but splits into three products: Worlds, Robotics, and Avatars. The split into Worlds / Avatars / Robotics suggests the practical generality problem is still being decomposed by action space and use case.

The RL lineage. The Dreamer series has the longer intellectual history. The core idea is clean: learn a latent dynamics model from observations, then roll out imagined trajectories in latent space and optimize a policy via backpropagation through the model's predictions. The agent never needs to interact with the real environment during policy learning.

Dreamer V3 was the first AI to get diamonds in Minecraft without human data. Dreamer 4 did the same purely offline, with no environment interaction at all. Architecturally, Dreamer 4 moves from Dreamer’s earlier recurrent-style lineage to a more scalable transformer-based world-model recipe, and introduced "shortcut forcing", a training objective that lets the model jump from noisy to clean predictions in just 4 steps instead of the 64 typical in diffusion models. This is what makes real-time inference on a single H100 possible.

These two sub-lineages used to feel distinct: video generation produces visual environments, while RL world models produce trained policies. But Dreamer 4 blurred the line: humans can now play inside its world model interactively, and Genie 3 is being used to train DeepMind's SIMA agents. The convergence point is that both need the same thing: a model that can accurately simulate how actions affect environments over extended horizons.

The open question for this whole category is one LeCun keeps raising: does learning to generate pixels that look physically correct actually mean the model understands physics? Or is it pattern-matching appearance? Dreamer 4's ability to get diamonds in Minecraft from pure imagination is a strong empirical counterpoint, but it's also a game with discrete, learnable mechanics; the real world is messier.

---

4. Physical AI Infrastructure (Simulation Platform)

Representative: NVIDIA Cosmos

NVIDIA's play is: don't build the world model, build the platform everyone else uses to build theirs.

Cosmos launched at CES January 2025 and covers the full stack: data curation pipeline (process 20M hours of video in 14 days on Blackwell, vs. 3+ years on CPU), a visual tokenizer with 8x better compression than prior SOTA, model training via NeMo, and deployment through NIM microservices.

The pre-trained world foundation models are trained on 9,000 trillion tokens from 20M hours of real-world video spanning driving, industrial, robotics, and human activity data. They come in two architecture families: diffusion-based (operating on continuous latent tokens) and autoregressive transformer-based (next-token prediction on discretized tokens). Both can be fine-tuned for specific domains.

Three model families sit on top of this. Predict generates future video states from text, image, or video inputs: essentially video forecasting that can be post-trained for specific robot or driving scenarios. Transfer handles sim-to-real domain adaptation, which is one of the persistent headaches in physical AI: your model works great in simulation but breaks in the real world due to visual and dynamics gaps. Reason (added at GTC 2025) brings chain-of-thought reasoning over physical scenes: spatiotemporal awareness, causal understanding of interactions, video Q&A.

---

5. Active Inference

Representative: VERSES AI (Karl Friston)

This is the outlier on the list: not from the deep learning tradition at all, but from computational neuroscience. Karl Friston's Free Energy Principle says intelligent systems continuously generate predictions about their environment and act to minimize surprise (technically: variational free energy, an upper bound on surprise). Where standard RL is usually framed around reward maximization, active inference frames behavior as minimizing variational / expected free energy, which blends goal-directed preferences with epistemic value. This leads to natural exploration behavior: the agent is drawn to situations where it's uncertain, because resolving uncertainty reduces free energy.

VERSES built AXIOM (Active eXpanding Inference with Object-centric Models) on this foundation. The architecture is fundamentally different from neural network world models. Instead of learning a monolithic function approximator, AXIOM maintains a structured generative model where each entity in the environment is a discrete object with typed attributes and relations. Inference is Bayesian: beliefs are probability distributions that get updated via message passing, not gradient descent. This makes it interpretable (you can inspect what the agent believes about each object), compositional (add a new object type without retraining), and extremely data-efficient.

In their robotics work, they've shown a hierarchical multi-agent setup where each joint of a robot arm is its own active inference agent. The joint-level agents handle local motor control while higher-level agents handle task planning, all coordinating through shared beliefs in a hierarchy. The whole system adapts in real time to unfamiliar environments without retraining: you move the target object and the agent re-plans immediately, because it's doing online inference, not executing a fixed policy.

They shipped a commercial product (Genius) in April 2025, and the AXIOM benchmarks against RL baselines are competitive on standard control tasks while using orders of magnitude less data.

---

imo, these five categories aren't really competing; they're solving different sub-problems. JEPA compresses physical understanding. Spatial intelligence reconstructs 3D structure. Learned simulation trains agents through generated experience. NVIDIA provides the picks and shovels. Active inference offers a fundamentally different computational theory of intelligence. My guess is the lines between them blur fast.
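The zero-shot planning loop described under category 1 (generate candidate action sequences, roll them through the world model, keep the one whose predicted outcome is closest to the goal) can be sketched with plain random shooting. This is a toy illustration, not V-JEPA 2's planner: `toy_world_model` is a hypothetical 2D stand-in for a learned latent predictor, the goal is a point instead of an encoded goal image, and the real system may use a more refined sampling scheme.

```python
import random


def toy_world_model(state, action):
    """Hypothetical stand-in for a learned latent predictor: next = state + action."""
    return (state[0] + action[0], state[1] + action[1])


def rollout(state, actions):
    """Roll a whole action sequence forward through the model."""
    for action in actions:
        state = toy_world_model(state, action)
    return state


def distance(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5


def plan(state, goal, horizon=5, num_candidates=256, seed=0):
    """Random shooting: sample sequences, score predicted end states, keep the best."""
    rng = random.Random(seed)
    best_actions, best_cost = None, float("inf")
    for _ in range(num_candidates):
        actions = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(horizon)]
        cost = distance(rollout(state, actions), goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


start, goal = (0.0, 0.0), (2.0, -1.0)
actions, cost = plan(start, goal)
```

In a receding-horizon setup, only the first action of the best sequence would be executed before re-planning from the new observation. The planner never needs task-specific training, only a model that predicts outcomes well enough to rank candidates.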
Ludvig Erikson Brangstrup retweeted
alphaXiv (@askalphaxiv)
“Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning”

Continual learning in robotics usually wipes out old knowledge. But this paper shows that big pretrained vision-language-action robot policies don’t catastrophically forget like smaller behavior cloning models. So by keeping a tiny replay buffer while learning new tasks, they can preserve and sometimes even improve old skills, making lifelong robot learning look far more achievable than we thought.
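A tiny replay buffer of the kind mentioned above can be sketched as follows. This is a generic illustration, not the paper's implementation: the class name `TinyReplayBuffer`, the reservoir-sampling retention rule, and the `replay_fraction` parameter are assumptions.

```python
import random


class TinyReplayBuffer:
    """Keeps a small uniform sample of old-task data via reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(sample)
        else:
            # Replace a stored item with probability capacity / seen.
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.items[slot] = sample

    def mixed_batch(self, new_samples, replay_fraction):
        """Swap a fraction of the new-task batch for replayed old-task samples."""
        num_replay = min(len(self.items), int(len(new_samples) * replay_fraction))
        keep = len(new_samples) - num_replay
        return new_samples[:keep] + self.rng.sample(self.items, num_replay)


buffer = TinyReplayBuffer(capacity=4)
for i in range(100):
    buffer.add(("old_task", i))

new_batch = [("new_task", i) for i in range(8)]
batch = buffer.mixed_batch(new_batch, replay_fraction=0.25)
```

Here only four old samples are stored out of a hundred seen, and each training batch on the new task carries two of them, so the old task keeps contributing a small share of every update.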
Ludvig Erikson Brangstrup retweeted
Lukas Ziegler (@lukas_m_ziegler)
These robots pack trucks like Tetris in under half a second! 📦

@DexterityRobots just released Foresight, a system that helps robots understand the physical world well enough to stack boxes in trucks automatically.

Loading delivery trucks efficiently is incredibly difficult. Each box is a different size and weight. You need to stack them tightly so they don't fall, but also make sure the robot can reach each position, and both arms should work at the same time without bumping into each other.

There are nearly infinite ways boxes can arrive, up to 400 possible places to put each box, and the robot is packing multiple walls of freight simultaneously. Foresight solves each placement decision in under 400 milliseconds.

It figures out where to place each box by considering density (pack tightly), stability (boxes don't collapse), reachability (robot can access positions), and dual-arm parallelism. It predicts how each placement affects the entire truck.

The system runs in actual warehouses right now. It works across six different applications, four types of robots, and five different gripper designs. It has learned from over 100 million real placements in production.

Warehouse operators can see why the system makes each decision. The system is designed safety-first, so humans understand what's happening and can intervene if needed.

Think of it this way: Imagine playing Tetris where the pieces arrive in random order, you have to stack them in 3D space inside a truck, they have real weight and can fall over, and you have two robot arms to coordinate.

~~

♻️ Join the weekly robotics newsletter, and never miss any news → ziegler.substack.com
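The four criteria named above (density, stability, reachability, dual-arm parallelism) suggest a multi-objective choice over candidate positions. The sketch below is only an illustration of that idea, not Dexterity's algorithm: the `Placement` fields, the weights, the treatment of reachability as a hard constraint, and the sample scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    name: str
    density: float       # 0..1, how tightly the box packs
    stability: float     # 0..1, how unlikely the stack is to collapse
    reachable: bool      # whether the arm can access the position
    parallelism: float   # 0..1, how well it lets both arms work at once


# Hypothetical weights; a real system would tune or learn these.
WEIGHTS = {"density": 0.4, "stability": 0.4, "parallelism": 0.2}


def score(placement):
    """Weighted sum of soft criteria; unreachable positions are ruled out."""
    if not placement.reachable:
        return float("-inf")
    return (WEIGHTS["density"] * placement.density
            + WEIGHTS["stability"] * placement.stability
            + WEIGHTS["parallelism"] * placement.parallelism)


def best_placement(candidates):
    return max(candidates, key=score)


candidates = [
    Placement("tight_but_wobbly", 0.9, 0.4, True, 0.5),
    Placement("balanced", 0.7, 0.9, True, 0.8),
    Placement("perfect_but_unreachable", 1.0, 1.0, False, 1.0),
]
choice = best_placement(candidates)
```

Because every candidate carries its per-criterion scores, the reason for a decision can be read off directly, which matches the claim that operators can see why each placement was chosen.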
Ludvig Erikson Brangstrup retweeted
Olivier Duchenne (@inventorOli)
At Mistral, robots keep their workspace tidy. Autonomous, 1x. Robostral WMa1. wip 🚧