Stephen James

508 posts

Stephen James banner
Stephen James

Stephen James

@stepjamUK

CEO @Neuracore_AI | Assistant Professor @imperialcollege | ex-Director of Dyson Robot Learning Lab | Postdoc @UCBerkeley w/ @pabbeel | PhD ICL w/ @ajdDavison

London, England Katılım Ocak 2010
290 Takip Edilen7.3K Takipçiler
Sabitlenmiş Tweet
Stephen James
Stephen James@stepjamUK·
𝗔𝗳𝘁𝗲𝗿 𝟭𝟬+ 𝘆𝗲𝗮𝗿𝘀 𝗶𝗻 𝗿𝗼𝗯𝗼𝘁 𝗹𝗲𝗮𝗿𝗻𝗶𝗻𝗴, from my PhD at Imperial to Berkeley to building the Dyson Robot Learning Lab, one frustration kept hitting me: 𝗪𝗵𝘆 𝗱𝗼 𝗜 𝗵𝗮𝘃𝗲 𝘁𝗼 𝗿𝗲𝗯𝘂𝗶𝗹𝗱 𝘁𝗵𝗲 𝘀𝗮𝗺𝗲 𝗶𝗻𝗳𝗿𝗮𝘀𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲 𝗼𝘃𝗲𝗿 𝗮𝗻𝗱 𝗼𝘃𝗲𝗿 𝗮𝗴𝗮𝗶𝗻? 𝗧𝗵𝗲 𝗽𝗮𝘁𝘁𝗲𝗿𝗻 𝗜 𝗸𝗲𝗽𝘁 𝘀𝗲𝗲𝗶𝗻𝗴: • New robotics team starts • Spends 6 months building data collection pipeline • Spends another 3 months debugging synchronization issues • Finally starts collecting task-specific data • Realizes their infrastructure choices limit their flexibility • Starts over 𝗧𝗵𝗶𝘀 𝗶𝘀 𝘁𝗵𝗲 𝘄𝗵𝗼𝗹𝗲 𝗽𝗼𝗶𝗻𝘁 𝗼𝗳 𝗿𝗼𝗯𝗼𝘁 𝗹𝗲𝗮𝗿𝗻𝗶𝗻𝗴: Robot learning is fundamentally data-driven. Whether you're picking strawberries or assembling electronics, the core infrastructure needs are identical. That's actually why I was so interested in pursuing data-driven robotics over a decade ago. 𝗬𝗼𝘂 𝗮𝗹𝘄𝗮𝘆𝘀 𝗻𝗲𝗲𝗱: • Multi-sensor data synchronization across different frequencies • Flexible storage that works with future algorithms • Visualization tools to understand your data • The ability to experiment with different temporal resolutions • Robust logging that captures everything you might need later The trend towards AI in robotics is growing, with robots needing to process and analyze large amounts of sensor data to manage variability and unpredictability in real environments. 𝗕𝘂𝘁 𝗲𝘃𝗲𝗿𝘆 𝘁𝗲𝗮𝗺 𝗯𝘂𝗶𝗹𝗱𝘀 𝘁𝗵𝗶𝘀 𝗳𝗿𝗼𝗺 𝘀𝗰𝗿𝗮𝘁𝗰𝗵. Imagine if every web developer had to build their own database, web server, and deployment pipeline before writing their first line of application code. 𝗧𝗵𝗶𝘀 𝗶𝘀 𝘄𝗵𝘆 𝗜 𝗳𝗼𝘂𝗻𝗱𝗲𝗱 𝗡𝗲𝘂𝗿𝗮𝗰𝗼𝗿𝗲. Instead of every robotics team spending months on infrastructure, we provide the common tools that let you go from "I have a robot" to "I'm shipping intelligent robot behaviors" in days, not months. 𝗧𝗵𝗲 𝗿𝗲𝗮𝗹 𝗶𝗻𝗻𝗼𝘃𝗮𝘁𝗶𝗼𝗻 𝗶𝗻 𝗿𝗼𝗯𝗼𝘁𝗶𝗰𝘀 𝘄𝗼𝗻'𝘁 𝗰𝗼𝗺𝗲 𝗳𝗿𝗼𝗺 𝗲𝘃𝗲𝗿𝘆𝗼𝗻𝗲 𝗿𝗲𝗯𝘂𝗶𝗹𝗱𝗶𝗻𝗴 𝘁𝗵𝗲 𝘀𝗮𝗺𝗲 𝗽𝗹𝘂𝗺𝗯𝗶𝗻𝗴. 𝗜𝘁'𝗹𝗹 𝗰𝗼𝗺𝗲 𝗳𝗿𝗼𝗺 𝘁𝗲𝗮𝗺𝘀 𝘄𝗵𝗼 𝗰𝗮𝗻 𝗳𝗼𝗰𝘂𝘀 𝗲𝗻𝘁𝗶𝗿𝗲𝗹𝘆 𝗼𝗻 𝘄𝗵𝗮𝘁 𝗺𝗮𝗸𝗲𝘀 𝘁𝗵𝗲𝗶𝗿 𝗮𝗽𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻 𝘂𝗻𝗶𝗾𝘂𝗲. Robot learning shouldn't be bottlenecked by infrastructure. It should be bottlenecked by creativity. What's the longest you've spent building infrastructure before getting to the actual robotics problem you wanted to solve?
English
19
87
748
56.2K
Stephen James retweetledi
Neuracore
Neuracore@Neuracore_AI·
That's a wrap on our inaugural sponsored hackathon! Congratulations to the winners of the "Best Use of Neuracore" award at the Oxford Edge and Oxford Artificial Intelligence Society Hackathon this weekend. Well done to Sarthak Das from the robot learning team at Neuracore for his effort on-site supporting teams and presenting the award!
Neuracore tweet mediaNeuracore tweet media
English
0
1
11
412
Stephen James
Stephen James@stepjamUK·
Most simulation benchmarks for VLAs cannot tell you whether their numbers map to reality. REALM can: p < 0.001 correlation with real-world rollouts across 7 manipulation skills and 5 perturbations. The sim-to-real gap has been the central reason I have argued for collecting real data wherever possible. Most simulation benchmarks tell you something, but you cannot tell whether that something maps to reality. REALM, from Martin Sedlacek and the team at CTU Prague and Amsterdam, takes that problem seriously. The team built a simulation environment designed to correlate with real-world performance, and then validated it. Pearson values close to identity on task progression curves. Attention maps from π0 show 0.85 cosine similarity between matched real and simulated frames. They did not skip the validation step. They led with it. That changes what the simulation results actually mean. Across 15 perturbation factors covering visual, semantic, and behavioural variation, π0, π0-FAST, and GR00T N1.5 all show noticeable performance drops under semantic perturbations despite their internet-pretrained VLM backbones. All show sensitivity to camera viewpoint despite training on DROID's unusually diverse viewpoint distribution. The hardest axis of generalisation is across objects and their properties, not across skills. Reliability under perturbation is low across all three models. If the sim correlates with reality at the level REALM demonstrates, these are not simulation artefacts. They are real failure modes that real teams should be planning around. Two things this tells us. Validated simulation has a role in evaluation that it does not yet have in training. The cost of running thousands of perturbed rollouts in the real world is prohibitive. If REALM's correlation holds up across more task families, sim-based evaluation could become a serious tool for surfacing failure modes that ad-hoc real-world testing misses. The failure pattern across all three tested models also points back at the same place it always does. Pretraining buys you semantic grounding and skill primitives. It does not buy you robustness. The next generation of training data needs to focus on demonstrations where the object, scene, and viewpoint move underneath the skill, not on more demonstrations of the same skill on the same object. Paper link in comments.
English
3
8
76
6.8K
Stephen James
Stephen James@stepjamUK·
Excited to have @Neuracore_AI powering in this weekends Hackathon hosted by The Oxford Edge and @OxfordAI!
Neuracore@Neuracore_AI

This weekend we're powering the Oxford Hardware / Physical AI Hackathon at @UniofOxford, with free access to the Neuracore platform for every participant. Hosted by The Oxford Edge and @OxfordAI with hardware from @FoundryRobotics, @Quanser and @huggingface LeRobot. Sensor kits from Atech. Coding credits from @AnthropicAI and @Cursor. If you're going, come find us!

English
0
0
5
1.3K
Stephen James
Stephen James@stepjamUK·
Full fine-tuning is undoing the priors you spent the pretraining budget to build. That's the case PriorVLA is making, and the new paper from the team at CAS, Dexmal and collaborators is one of the cleaner demonstrations I have seen of the problem. Here's what happens. You take a pretrained VLA. You fine-tune on your downstream task. In-distribution evaluation looks fine. Then you test out-of-distribution and the model falls over. The pretraining gave you broad priors across diverse data. Fine-tuning pulled those priors toward the narrow patterns of your training set. The model effectively forgot what it knew. PriorVLA's response is to stop updating the pretrained action expert during fine-tuning. Freeze it, treat it as a read-only prior source, and train a parallel adaptation expert alongside it. Scene priors get pulled from the VLM, motor priors from the frozen expert, both routed into the adaptation expert via learned queries. Only 25% of the parameters a full fine-tune would touch actually get updated. The headline numbers: 11 points over π0.5 on RoboTwin 2.0-Hard, 99.1% average on LIBERO, 81% in-distribution and 57% out-of-distribution across 8 real-world tasks on two embodiments with standard data. The number that actually matters: with 10 demonstrations per task, PriorVLA beats π0.5 by 24 points in-distribution and 22 points out-of-distribution. A 24-point lift from 10 demos is the kind of sample efficiency that maps to how real teams ship robots, where you cannot collect thousands of demonstrations per skill. The broader implication is that we have been treating fine-tuning as if pretraining is just a smarter random initialisation. It isn't. Pretrained VLAs encode structure that downstream training overwrites unless you actively preserve it. Whether the right answer is frozen experts, LoRA-style adapters, or something else, the question of how to adapt without forgetting is now a first-class problem in the VLA stack. Credit: @CAS__Science Paper link in comments.
English
2
13
81
5.3K
Stephen James
Stephen James@stepjamUK·
We kicked off our Robotics in Europe series last week with @IvanTregear and the @KAIKAKU_AI team over on Neuracore. The most underrated takeaway: operators don't need convincing on robotics. They need the layer underneath it. No databases, no analytics, no sensing means no foundation for automation to act on. The arm is the easy part. This is exactly why at @Neuracore_AI we stand with the teams building robotics today, ready for the learning-driven era ahead. Full episode and series live on our YouTube channel.
Neuracore@Neuracore_AI

𝗥𝗲𝘀𝘁𝗮𝘂𝗿𝗮𝗻𝘁𝘀 𝗱𝗼𝗻'𝘁 𝗻𝗲𝗲𝗱 𝗰𝗼𝗻𝘃𝗶𝗻𝗰𝗶𝗻𝗴 𝗼𝗻 𝗿𝗼𝗯𝗼𝘁𝗶𝗰𝘀. 𝗧𝗵𝗲𝘆 𝗻𝗲𝗲𝗱 𝘁𝗵𝗲 𝗹𝗮𝘆𝗲𝗿 𝘁𝗵𝗮𝘁 𝘀𝗶𝘁𝘀 𝘂𝗻𝗱𝗲𝗿𝗻𝗲𝗮𝘁𝗵 𝗶𝘁. @IvanTregear, CTO of @KAIKAKU_AI speaks on the misconception that operators are tech resistant. Most are eager to deploy. The real blocker is foundational: no databases, no analytics, no sensing layer for automation to act on. Head to the link in comments to watch our new series exploring Robotics in Europe.

English
0
0
2
599
Stephen James
Stephen James@stepjamUK·
𝗦𝗽𝗲𝗰𝗶𝗮𝗹𝗶𝘀𝘁 𝗽𝗼𝗹𝗶𝗰𝗶𝗲𝘀 𝗵𝗮𝘃𝗲 𝗵𝗶𝘀𝘁𝗼𝗿𝗶𝗰𝗮𝗹𝗹𝘆 𝗼𝘂𝘁𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗲𝗱 𝗴𝗲𝗻𝗲𝗿𝗮𝗹𝗶𝘀𝘁 𝗺𝗼𝗱𝗲𝗹𝘀. 𝗣𝗵𝘆𝘀𝗶𝗰𝗮𝗹 𝗜𝗻𝘁𝗲𝗹𝗹𝗶𝗴𝗲𝗻𝗰𝗲 𝗷𝘂𝘀𝘁 𝗿𝗲𝗹𝗲𝗮𝘀𝗲𝗱 𝗮 𝗴𝗲𝗻𝗲𝗿𝗮𝗹𝗶𝘀𝘁 𝘁𝗵𝗮𝘁 𝗺𝗮𝘁𝗰𝗵𝗲𝘀 𝗮𝗻𝗱 𝗶𝗻 𝘀𝗼𝗺𝗲 𝘁𝗮𝘀𝗸𝘀 𝗯𝗲𝗮𝘁𝘀 𝘁𝗵𝗲𝗺. π0.7 is a 5B parameter VLA trained on a mixture that would normally poison a policy: expert demonstrations, suboptimal autonomous rollouts, failure cases, egocentric human video, and web-scale multimodal data. The usual outcome is mode averaging and performance collapse. π0.7 avoids that by conditioning every training episode on multimodal context that describes not just what to do, but how it was done. Three conditioning signals in the prompt: • Detailed language instructions. Not "fold laundry" but "grasp the left sleeve with the left gripper, fold the shirt in half vertically." • Subgoal images. A separate world model generates near-future multi-view targets, and the policy learns inverse dynamics from current observation to subgoal. • Episode metadata. Quality, execution speed, mistake flags. This is what lets the model learn from suboptimal data without copying its mistakes. Results out of the box, against the RL specialist baselines from π*0.6: • Laundry folding (T-shirts, shorts): matches specialist throughput and success rate • Diverse laundry (hardest items): exceeds specialist throughput by 20%+ • Espresso machine: matches specialist performance • Box building: exceeds specialist throughput Single checkpoint, no task-specific post-training, no RL fine-tuning. The cross-embodiment result is more interesting. Shirt folding on a UR5e bimanual system, no task-specific data on that robot: 85.6% task progress, 80% success rate. Expert human teleoperators attempting the same task on UR5e for the first time scored 90.9% progress, 80.6% success. The policy lands in the same range as skilled humans attempting an unfamiliar embodiment cold. The UR5e arms are longer, heavier, and have different morphology than the source robot. π0.7 adapts its strategy. On the source robot, operators use tilted end-effector grasps. On UR5e, the policy switches to vertical grasps suited to the arm kinematics. Two things this implies for the broader VLA stack: The bottleneck for generalist VLAs has never really been the architecture. It's been the training mixture. If episode metadata is enough to make heterogeneous, partly-failed data productive, the cost curve for foundation model training in robotics changes meaningfully. Every team currently throwing away their autonomous rollout data is leaving training signal on the floor. The open question is how far world-model-generated subgoals scale. World model fidelity has been the soft underbelly of similar approaches before, and zero-shot cross-embodiment is a strong demonstration but a narrow one. Great work as always from the team at @physical_int. Paper link in comments.
English
2
5
52
12.4K
Stephen James
Stephen James@stepjamUK·
Looking forward to joining this panel next Tuesday to talk about what it actually takes to get robots out of simulation and into the real world at scale. If you're in London and working in robotics, come join us!
Neuracore@Neuracore_AI

One week to go. Next Tuesday, our Founder and CEO Stephen James takes the stage at STIQ Ltd's Robotics & Automation networking event, one of London's sharpest gatherings for roboticists, operators, buyers and investors. Stephen joins the panel with Obinna Njoku (@PepsiCo), Mark Slack (@CMRSurgical), Caroline France (@SciTechgovuk) and Oana Andreea Jinga (@dexoryHQ), hosted by Tom Andersson, to talk about how we're building the cloud-native infrastructure that closes the gap between simulation and real-world robot deployment. 📍 20 Primrose St, London ⏰ Tuesday 19 May, 17:30 to 21:00 Sign up here: luma.com/avzbz7v0?tk=cU…

English
0
1
8
1.1K
Stephen James
Stephen James@stepjamUK·
𝗗𝘆𝗻𝗮𝗺𝗶𝗰 𝗺𝗮𝗻𝗶𝗽𝘂𝗹𝗮𝘁𝗶𝗼𝗻 𝗶𝘀 𝗼𝗻𝗲 𝗼𝗳 𝘁𝗵𝗲 𝗵𝗮𝗿𝗱𝗲𝗿 𝗳𝗮𝗶𝗹𝘂𝗿𝗲 𝗺𝗼𝗱𝗲𝘀 𝗳𝗼𝗿 𝗩𝗟𝗔𝘀, 𝗮𝗻𝗱 𝘁𝗵𝗲 𝗽𝗲𝗿𝗰𝗲𝗽𝘁𝗶𝗼𝗻-𝗲𝘅𝗲𝗰𝘂𝘁𝗶𝗼𝗻 𝗴𝗮𝗽 𝗶𝘀 𝗽𝗮𝗿𝘁 𝗼𝗳 𝘄𝗵𝘆. Static tasks tolerate 200ms inference delays. Dynamic tasks expose them. An object rolling at 0.5 m/s travels 10 cm during a 200ms inference cycle. By the time the action executes, the observation is stale. For continuous motion tasks, this becomes a real constraint. NTU S-Lab has released DynamicVLA, a 0.4B model that targets exactly this. It is built on a FastViT convolutional encoder and SmolLM2-360M truncated to 16 layers. Inference completes in around 226ms on an RTX A6000. The interesting part is not the raw speed, which is comparable to baselines. It is how the architecture handles latency. Continuous Inference overlaps prediction and execution, triggering new inference when the previous completes rather than waiting for action chunk exhaustion. Latent-Aware Action Streaming discards outdated actions and overwrites stale predictions with recent ones to keep what the policy is doing aligned with what the world is doing. 𝗗𝗢𝗠 𝗯𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸 𝗿𝗲𝘀𝘂𝗹𝘁𝘀 𝗿𝗲𝗽𝗼𝗿𝘁𝗲𝗱 𝗶𝗻 𝘁𝗵𝗲 𝗽𝗮𝗽𝗲𝗿: • 47% success against 13.6% baseline • 71.6% closed-loop reactivity against 28.3% for SmolVLA • 30.3% success without continuous inference and action streaming, a 36% degradation What is also worth noting is the data pipeline. Human teleoperators cannot react quickly enough to objects moving at 0.75 m/s, so the team replaces demonstrations with automated state-machine controllers. Dual RGB views handle 6D pose and velocity tracking in real time. The training data is generated by a system that can keep up with the task, not a human trying to. Inference lag is a real constraint once you move past pick-and-place demos, and work like this is a meaningful step. But the harder question for production robotics is still upstream. Can you actually collect enough data, at the right quality, to make these systems reliable in unconstrained environments. Architecture matters. The data path matters more. Nice work from the team at @NTUsg Lab. Paper and project page in comments below. #RobotLearning #VLA
English
3
23
193
18.5K
Stephen James
Stephen James@stepjamUK·
𝗬𝗼𝘂𝗿 𝗿𝗼𝗯𝗼𝘁 𝗶𝘀 𝗹𝗼𝘀𝗶𝗻𝗴 𝗮𝗹𝗺𝗼𝘀𝘁 𝟵𝟬 𝗺𝗶𝗻𝘂𝘁𝗲𝘀 𝗽𝗲𝗿 𝗱𝗮𝘆 𝘄𝗮𝗶𝘁𝗶𝗻𝗴 𝗳𝗼𝗿 𝗱𝗮𝘁𝗮 𝘁𝗼 𝗹𝗼𝗴. That is a minimum of 500 hours per year of operational time lost to storage and sync. Many robot data pipelines couple logging with execution. In these setups, the control thread handles both movement commands and data serialisation in the same process. This forces a choice: slow down the control loop to accommodate logging overhead, or accept jitter and dropped frames. 10,000 picks per day. Each action generates RGB frames, depth maps, and joint positions that must be serialised and stored. The cumulative overhead adds up: slower cycle times, inconsistent sampling rates, or data loss during network hiccups. The robot writes sensor data to a lock-free ring buffer in shared memory and continues execution immediately. A separate daemon process reads from the buffer, serialises to storage, and handles cloud upload without blocking the control loop. This architecture creates two critical capabilities: 1. Recording and uploading decouple. The daemon uploads yesterday's data to cloud storage while you record today's demonstrations. The robot never waits. The recording pipeline never blocks robot execution. 2. Network failures do not stop recording. If your warehouse WiFi drops during a recording session, the robot keeps working and data keeps logging locally. The daemon uploads everything once the network recovers. You lose zero data. This is the difference between hobbyist tools and production-grade systems that companies rely on. Your data pipeline does not interfere with robot operation. Your robot does not wait for cloud sync. Your recording continues regardless of network reliability. Faster cycle time. Reliable data collection. Zero execution blocking.
Stephen James tweet media
English
1
1
17
1.3K
Stephen James retweetledi
Neuracore
Neuracore@Neuracore_AI·
𝗡𝗲𝘄 𝗗𝗮𝘁𝗮𝘀𝗲𝘁 | 𝗦𝘁𝗮𝗻𝗳𝗼𝗿𝗱 𝗛𝗬𝗗𝗥𝗔 𝗶𝘀 𝗻𝗼𝘄 𝗹𝗶𝘃𝗲 𝗼𝗻 𝗡𝗲𝘂𝗿𝗮𝗰𝗼𝗿𝗲 The HYDRA dataset from @Stanford University is officially available for exploration and training. HYDRA introduces a hybrid approach to imitation learning, dynamically switching between sparse waypoints and dense low-level actions to reduce distribution shift. • 𝗟𝗼𝗻𝗴-𝗛𝗼𝗿𝗶𝘇𝗼𝗻 𝗧𝗮𝘀𝗸𝘀: Expert demonstrations for complex real-world manipulation, including making coffee and toasting bread. • 𝗛𝘆𝗯𝗿𝗶𝗱 𝗔𝗰𝘁𝗶𝗼𝗻 𝗦𝗽𝗮𝗰𝗲: 570 high-fidelity recordings featuring mode labels and waypoint annotations. • 𝗙𝗿𝗮𝗻𝗸𝗮 𝗣𝗮𝗻𝗱𝗮 𝗙𝗼𝘂𝗻𝗱𝗮𝘁𝗶𝗼𝗻: 84 GB of synchronized 6-DOF end-effector poses, joint positions, and RGB vision. • 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗲𝗱 𝗳𝗼𝗿 𝗗𝗲𝘅𝘁𝗲𝗿𝗶𝘁𝘆: Built to train policies that balance coarse waypoints with fine-grained control. Explore the dataset: neuracore.com/cb634296-ada0-…
English
0
3
15
1.1K
Stephen James
Stephen James@stepjamUK·
Most open VLA models are not really open. They release weights and call it reproducibility. The training data is withheld. The training code is withheld. The deployment pipeline is withheld. You get a checkpoint file and a paper. You cannot verify the data quality. You cannot reproduce the training run. You cannot adapt it to your robot without starting from scratch. Researchers from Allen AI released MolmoAct2, the first VLA that is open. Weights, training code, complete datasets.  • MolmoAct2-BimanualYAM Dataset: 720 hours of teleoperated trajectories across 28 real-world tasks, the largest open bimanual dataset available.  • MolmoAct2-SO100/101 Dataset: 38,059 episodes curated from 1,222 public datasets.  • MolmoAct2-DROID Dataset: Quality-filtered Franka trajectories with re-annotated instructions. The system deploys out-of-the-box on three platforms spanning the low-to-medium cost range. Bimanual YAM, SO-100/101, DROID Franka. No additional fine-tuning required. The backbone is Molmo2-ER, trained on a 3.3M sample corpus for embodied reasoning: metric distance estimation, free space detection, cross-view object tracking, scene geometry reconstruction. The skills general-purpose VLMs do not test. Results Look Promissing 63.8% average across 13 embodied reasoning benchmarks. Outperforms GPT-5 and Gemini Robot ER-1.5 on 9 of 13 tasks. Outperforms π0.5 across 7 simulation and real-world benchmarks. The architecture uses per-layer KV conditioning between the VLM and a flow-matching action expert trained with DiT-style transformers. This bridges discrete reasoning tokens to continuous control trajectories while exposing the attention state the VLM itself uses. This is the deployment model NeuraCore advocates for: standardized ecosystems with reproducible training data. Custom infrastructure for every embodiment is technical debt that prevents fleet scaling. Nice work from @hq_fang, @DJiafei, and the team at @allen_ai
English
2
18
116
10.3K