

Dexmal

@Dexmal_AI
Build Intelligent, Useful and Trustworthy Robots to Make Our Life Better



Realtime-VLA FLASH tackles one of the biggest deployment bottlenecks for diffusion-based VLAs: inference latency.

The key idea is speculative inference for flow-matching VLAs. A lightweight draft model predicts an action chunk, while the main model’s Action Expert verifies it in parallel using flow-consistency checks instead of running full denoising every replanning round.

This lets the system replace many expensive 58 ms full inference rounds with speculative rounds as fast as 7.8 ms, reducing average latency to 19.1 ms and achieving a 3.04× speedup on LIBERO while largely preserving success rate.

Interesting systems insight: they profile π0 and show VLM prefill is compute-bound, while Action Denoise is memory-bound. FLASH exploits this by reusing KV cache and parallelizing verification instead of repeatedly running sequential denoising.
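A quick sanity check on the latency figures quoted in the post above. This is only a back-of-the-envelope sketch. It assumes every round is either a full 58 ms round or a speculative round at a constant 7.8 ms, which is a simplification since 7.8 ms is quoted as the best case. The function names are hypothetical and the paper may account for latency differently.

```python
# Figures quoted in the post (milliseconds).
FULL_ROUND_MS = 58.0    # full inference round
SPEC_ROUND_MS = 7.8     # speculative round ("as fast as"; treated as constant here)
AVG_LATENCY_MS = 19.1   # reported average latency


def implied_speculative_fraction(full_ms, spec_ms, avg_ms):
    """Solve avg = p * spec + (1 - p) * full for p, the share of speculative rounds.

    Assumption: a two-type mixture of rounds with fixed latencies.
    """
    return (full_ms - avg_ms) / (full_ms - spec_ms)


def speedup(full_ms, avg_ms):
    """Speedup relative to running a full round every time."""
    return full_ms / avg_ms


p = implied_speculative_fraction(FULL_ROUND_MS, SPEC_ROUND_MS, AVG_LATENCY_MS)
s = speedup(FULL_ROUND_MS, AVG_LATENCY_MS)

print(f"implied share of speculative rounds: {p:.1%}")
print(f"implied speedup: {s:.2f}x")
```

Under these assumptions the quoted numbers are consistent with each other: 58 / 19.1 rounds to the reported 3.04×, and roughly three out of four rounds would need to be speculative. Because real speculative rounds can be slower than 7.8 ms, that share is better read as a lower bound.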

Full fine-tuning is undoing the priors you spent the pretraining budget to build. That's the case PriorVLA is making, and the new paper from the team at CAS, Dexmal and collaborators is one of the cleaner demonstrations I have seen of the problem.

Here's what happens. You take a pretrained VLA. You fine-tune on your downstream task. In-distribution evaluation looks fine. Then you test out-of-distribution and the model falls over. The pretraining gave you broad priors across diverse data. Fine-tuning pulled those priors toward the narrow patterns of your training set. The model effectively forgot what it knew.

PriorVLA's response is to stop updating the pretrained action expert during fine-tuning. Freeze it, treat it as a read-only prior source, and train a parallel adaptation expert alongside it. Scene priors get pulled from the VLM, motor priors from the frozen expert, both routed into the adaptation expert via learned queries. Only 25% of the parameters a full fine-tune would touch actually get updated.

The headline numbers: 11 points over π0.5 on RoboTwin 2.0-Hard, 99.1% average on LIBERO, 81% in-distribution and 57% out-of-distribution across 8 real-world tasks on two embodiments with standard data.

The number that actually matters: with 10 demonstrations per task, PriorVLA beats π0.5 by 24 points in-distribution and 22 points out-of-distribution. A 24-point lift from 10 demos is the kind of sample efficiency that maps to how real teams ship robots, where you cannot collect thousands of demonstrations per skill.

The broader implication is that we have been treating fine-tuning as if pretraining is just a smarter random initialisation. It isn't. Pretrained VLAs encode structure that downstream training overwrites unless you actively preserve it. Whether the right answer is frozen experts, LoRA-style adapters, or something else, the question of how to adapt without forgetting is now a first-class problem in the VLA stack.
Credit: @CAS__Science Paper link in comments.
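
To make the "freeze the prior, train something next to it" idea concrete, here is a deliberately tiny sketch. It is not PriorVLA's architecture: it swaps the learned-query routing described above for a plain residual adapter on a one-weight linear model, and every name and number in it (`FROZEN_W`, `train_adapter`, the toy data) is hypothetical. The only point is that the pretrained weight is read during training but never written.

```python
def train_adapter(frozen_w, data, lr=0.1, epochs=200):
    """Fit a residual adapter on top of a frozen weight with gradient descent.

    Prediction: y_hat = frozen_w * x + adapter_w * x
    Only adapter_w is updated; frozen_w is read-only (the "prior").
    """
    adapter_w = 0.0  # adapter starts as a no-op, so behaviour starts at the prior
    for _ in range(epochs):
        grad = 0.0
        for x, y in data:
            pred = frozen_w * x + adapter_w * x
            # d/d(adapter_w) of the squared error (pred - y)^2
            grad += 2.0 * (pred - y) * x
        adapter_w -= lr * grad / len(data)
    return adapter_w


# Hypothetical "pretrained" weight: the prior maps x to 2.0 * x.
FROZEN_W = 2.0

# Hypothetical downstream task whose true mapping is 2.5 * x.
data = [(x, 2.5 * x) for x in (-1.0, -0.5, 0.5, 1.0)]

adapter_w = train_adapter(FROZEN_W, data)

print(f"frozen weight after training:  {FROZEN_W}")
print(f"adapter weight after training: {adapter_w:.4f}")
```

The adapter absorbs the gap between the prior and the downstream task while the frozen weight stays exactly where pretraining left it. Setting the adapter back to zero recovers the original pretrained behaviour, which is the property full fine-tuning gives up.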





