Aurionpro
Drop 3/7: #LexsiLabsResearch

Making RL more efficient. We made RL on math learn faster by changing one thing: token credit assignment.

When we train LLMs with GRPO/DAPO-style reinforcement learning on reasoning tasks, we accidentally reward everything equally, including fluff like "Let me solve this step by step…" and the actual step that makes the answer correct.

What if we could identify and train harder on the parts of reasoning that causally produce the right answer?

Our approach: Counterfactual Importance Weighting

We do a lightweight "counterfactual audit" of reasoning:
➡️ Mask one reasoning span (a calc step/equation)
➡️ Measure how much the probability of the correct final answer drops
➡️ Use that drop as a per-token weight in the policy-gradient loss

What we found
📊 Critical calculation chains are ~11× more likely to matter than scaffolding
📊 3.5% of spans are distractors (removing them improves answer probability)

#RLHF #LLM #Reasoning #CausalInference #Optimization #GSM8K #AIResearch
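A minimal sketch of what such a counterfactual audit could look like, assuming a HuggingFace causal LM. This is not Lexsi's released code: `gpt2` is a stand-in model, masking is approximated by deleting a span, and the helper names (`answer_logprob`, `span_importance`) are ours.

```python
# Hypothetical sketch of a counterfactual audit, not the authors' code.
# Assumes: masking = deleting a span; importance = drop in answer log-prob.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def answer_logprob(prompt: str, reasoning: str, answer: str) -> float:
    """Log-probability of the answer tokens given prompt + reasoning."""
    prefix_ids = tok(prompt + reasoning, return_tensors="pt").input_ids
    answer_ids = tok(answer, return_tensors="pt").input_ids
    ids = torch.cat([prefix_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    logps = torch.log_softmax(logits[0, :-1], dim=-1)  # next-token preds
    start = prefix_ids.shape[1] - 1                    # first answer token
    targets = ids[0, 1:]
    return logps[start:].gather(-1, targets[start:, None]).sum().item()

def span_importance(prompt: str, spans: list[str], answer: str) -> list[float]:
    """Per-span score: how much masking the span hurts the answer."""
    base = answer_logprob(prompt, " ".join(spans), answer)
    scores = []
    for i in range(len(spans)):
        masked = " ".join(s for j, s in enumerate(spans) if j != i)
        scores.append(base - answer_logprob(prompt, masked, answer))
    return scores  # > 0: span helps the answer; < 0: span is a distractor

prompt = "Q: 3 pens cost $6. How much is 1 pen?\nA: "
spans = ["Let me solve this step by step.", "6 / 3 = 2.", "So one pen costs $2."]
print(span_importance(prompt, spans, " The answer is 2."))
```

In GRPO-style training, each span's score would then be broadcast to the tokens inside that span and used to scale the per-token policy-gradient loss; spans with negative scores are the distractors the post describes.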

Drop 2/7: #LexsiLabsResearch

Policy optimization has become the post-training workhorse for improving LLMs' reasoning after pretraining. GRPO and its growing family of variants (Dr. GRPO, GSPO, off-policy GRPO, GTPO, etc.) have improved accuracy, tamed instability, and generally made RL-style post-training much more "production-shaped".

But there's a quiet constant hiding inside almost all of these methods: no matter how fancy the reward shaping or variance tricks get, we almost always regularize with the KL divergence.

That's… odd. KL isn't a law of physics. It's a choice, and choices have consequences: the divergence defines the geometry of your policy updates, which directly affects accuracy, stability, and even verbosity.

So we asked: what if KL is just the default, not the best choice?

What we did (GBMPO):

We're releasing Group-Based Mirror Policy Optimization (GBMPO), our attempt to pry open that "KL-only" door and treat regularization geometry as a first-class design dimension, not a default setting. GBMPO replaces KL with flexible Bregman divergences (mirror-descent style), turning "divergence choice" into a real design knob.

Two routes:
1️⃣ ProbL2 (hand-designed): L2 distance in probability space
2️⃣ Neural Mirror Maps (learned): a small network learns task-specific geometry
(i) NM-GRPO: single run, random init
(ii) NM-GRPO-ES: meta-initialized with evolution strategies for extra stability + efficiency

#PolicyOptimization #RL #LLMReasoning #LexsiLabsParis #LexsiLabs
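To make "divergence choice as a design knob" concrete, here is a hedged sketch of a generic Bregman-divergence regularizer. The formula D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q> and the ProbL2 instance follow from the post; the function names and the numerical checks are ours, and the learned Neural Mirror Map route would swap in a small (e.g. input-convex) network for phi, which we do not show.

```python
# Hedged sketch of Bregman-divergence regularization in the spirit of
# GBMPO; not the released implementation.
import torch

def bregman_divergence(p, q, phi):
    """D_phi(p || q) = phi(p) - phi(q) - <grad phi(q), p - q>.
    p, q: probability tensors over the vocabulary (last dim sums to 1)."""
    q = q.detach().requires_grad_(True)
    phi_q = phi(q)
    (grad_q,) = torch.autograd.grad(phi_q.sum(), q, create_graph=True)
    return phi(p) - phi_q - ((p - q) * grad_q).sum(-1)

# phi = negative entropy  =>  D_phi recovers the usual KL(p || q).
neg_entropy = lambda p: (p * p.clamp_min(1e-12).log()).sum(-1)

# phi = 0.5 * ||p||^2     =>  D_phi is ProbL2: 0.5 * ||p - q||^2.
half_sq_norm = lambda p: 0.5 * (p * p).sum(-1)

p = torch.softmax(torch.randn(4, 10), dim=-1)
q = torch.softmax(torch.randn(4, 10), dim=-1)

kl_ref = (p * (p.clamp_min(1e-12).log() - q.clamp_min(1e-12).log())).sum(-1)
print(torch.allclose(bregman_divergence(p, q, neg_entropy), kl_ref, atol=1e-5))
l2_ref = 0.5 * ((p - q) ** 2).sum(-1)
print(torch.allclose(bregman_divergence(p, q, half_sq_norm), l2_ref, atol=1e-6))
```

Plugging D_phi(pi_theta, pi_ref) in where the KL penalty normally sits is what turns the update geometry into a tunable choice rather than a default.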

Drop 1/7: Safety controls without inference-time hooks 🧠⚙️

Modern LLM deployments need selective refusal at scale, but most safety controls still depend on inference-time interventions (runtime hooks, gating logic, per-generation overhead).
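For contrast, a toy illustration of the runtime pattern the post argues against; every name here is a hypothetical stand-in, and the point is only that each request pays for extra gating calls on top of generation.

```python
# Illustrative only: the inference-time gating pattern the post argues
# against, not Lexsi's approach. All functions are hypothetical stand-ins.
def safety_flag(text: str) -> bool:
    # Stand-in for a learned refusal classifier (an extra model call).
    return "explosive" in text.lower()

def generate(prompt: str) -> str:
    # Stand-in for the actual LLM call.
    return f"[model output for: {prompt!r}]"

def guarded_generate(prompt: str) -> str:
    # Two extra classifier passes per generation: the per-request
    # overhead that baking refusal into the weights would remove.
    if safety_flag(prompt):
        return "I can't help with that."
    out = generate(prompt)
    return "I can't help with that." if safety_flag(out) else out

print(guarded_generate("How do I make an explosive?"))
print(guarded_generate("Summarize the GBMPO post."))
```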