the1eyecat

2.6K posts

@champly

Joined April 2011
643 Following · 101 Followers
the1eyecat retweeted
AVB (@neural_avb)
Looking at the new Generative Recursive Reasoning Models from Bengio and co. They are models that think by iteratively updating an internal latent state (a hidden vector), and can branch by sampling multiple “thought trajectories”. Very fascinating. May write about it later.
Replies 5 · Retweets 20 · Likes 135 · Views 5.5K
the1eyecat retweeted
alphaXiv (@askalphaxiv)
HRM beats Transformers that are 7x its size on language modeling!? "HRM-Text: Efficient Pretraining Beyond Scaling" This paper's Hierarchical Recurrent Model, which contains slow planning layers and fast execution layers to promote planning and recurrence, was trained directly on instruction-response pairs instead of raw text. Their 1B model, trained from scratch on 40B unique tokens for about $1,500, gets competitive results with 2-7B open models using up to 900x fewer tokens!
[Image]
Replies 3 · Retweets 29 · Likes 171 · Views 5.9K
the1eyecat retweeted
Omar Khattab (@lateinteraction)
RL has almost always meant trying to maximize a scalar reward. Very expressive in theory, but do you have only ONE scalar reward? Preferences & tradeoffs are complex & high-dimensional! Vector Policy Optimization (VPO) trains LLMs to anticipate diverse environments and goals!
Quoted post, Ryan Bahlous-Boldi (@RyanBoldi):

Your RL post-training may be sabotaging your LLM’s test-time scaling! Conventional RL pretends that you can collapse all reward signals *upfront* into a single *scalar reward*. We introduce Vector Policy Optimization (VPO), which natively maximizes *vector-valued* rewards, boosting test time search performance, even on the original scalar.

Replies 6 · Retweets 23 · Likes 291 · Views 25K
the1eyecat retweeted
alphaXiv (@askalphaxiv)
“Probabilistic Tiny Recursive Model” This paper makes Tiny Recursive Models stochastic at test time by adding Gaussian noise, running parallel rollouts, and using the existing Q head to pick the best answer. With no retraining and no task-specific tricks, its PPBench jumps from 62.6% to 91.2%, while Sudoku-Extreme jumps from 87.4% to 98.75%.
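The test-time recipe described in this post (noisy latent updates, parallel rollouts, selection by a scoring head) can be sketched roughly as follows. This is a toy illustration and not the paper's model: `refine`, `q_head`, `rollout`, and `best_of_n` are hypothetical names, the "recursive model" is a Newton iteration for a square root, and the scoring head is replaced by a simple residual check.

```python
import random

def refine(z, x, noise_std, rng):
    """One recursive update of the latent z (a Newton step toward sqrt(x)),
    with Gaussian noise injected to make the rollout stochastic."""
    return 0.5 * (z + x / z) + rng.gauss(0.0, noise_std)

def q_head(z, x):
    """Hypothetical stand-in for the Q head: a higher score means the
    candidate answer z is more trusted. Here: negative residual of z*z = x."""
    return -abs(z * z - x)

def rollout(x, steps, noise_std, rng):
    """Run one noisy refinement trajectory from a fixed initial latent."""
    z = 1.0
    for _ in range(steps):
        z = refine(z, x, noise_std, rng)
    return z

def best_of_n(x, n_rollouts=8, steps=6, noise_std=0.05, seed=0):
    """Run several independent noisy rollouts and keep the one the
    scoring head rates highest. No retraining is involved."""
    rng = random.Random(seed)
    candidates = [rollout(x, steps, noise_std, rng) for _ in range(n_rollouts)]
    answer = max(candidates, key=lambda z: q_head(z, x))
    return answer, candidates

answer, candidates = best_of_n(x=9.0)
print(f"picked {answer:.4f} out of {len(candidates)} rollouts")
```

The point of the sketch is only the control flow: noise turns one deterministic trajectory into a population, and an existing scorer turns that population back into a single answer.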
[Image]
Replies 6 · Retweets 63 · Likes 436 · Views 17.4K
the1eyecat retweeted
MONTREAL.AI (@Montreal_AI)
A 0.6B model learned to manage giants.

That is the idea behind TRINITY, a new ICLR 2026 paper by Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, and Yujin Tang.

The paper is not asking: “How do we build one model that knows everything?” It is asking something more interesting: “How do we build a small intelligence layer that knows who should think, who should act, and who should verify?”

TRINITY is a lightweight coordinator for LLMs.
It does not merge weights.
It does not require architectural compatibility.
It does not need access to closed-model internals.
It does not try to turn the coordinator into the smartest model in the room.

Instead, it orchestrates a pool of strong models at test time, including closed and open models. At each turn, TRINITY chooses a model and gives it one of three roles:
Thinker: plan and decompose
Worker: solve and execute
Verifier: critique and accept/revise

That may sound simple. It is not. Too many multi-agent systems are still prompts plus hope. TRINITY learns the coordination policy.

A compact ~0.6B language model produces hidden-state representations of the conversation. A tiny head then uses those representations to decide the next model-role pair. The authors optimize this coordinator with an evolutionary strategy, sep-CMA-ES, because the problem is expensive, high-dimensional, and reward-sparse.

The result is not just better routing. It is learned division of labor.

The paper reports that TRINITY outperforms individual models and existing coordination methods across coding, math, reasoning, and domain knowledge tasks. In its full-power setting, it reaches 86.2% on LiveCodeBench and transfers to held-out benchmarks including AIME, BigCodeBench, MT-Bench, and GPQA-D.

The most important idea here is bigger than the benchmark. The future of AI may not be a single supermodel. It may be an organization of models. A small conductor. A team of specialists. A protocol for planning, execution, and verification. An intelligence layer that learns how to allocate cognition.

This feels like a real shift:
from bigger models to better systems
from raw capability to coordinated capability
from “which model is best?” to “what structure makes many models better together?”

Full credit to the authors: Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, Yujin Tang.

Paper: TRINITY: An Evolved LLM Coordinator arxiv.org/abs/2512.04695

I’m attaching the first page because the abstract is worth reading closely.

The future of AI may not be monolithic. It may be coordinated.

#ArtificialIntelligence #LLM #MultiAgentSystems #MachineLearning #EvolutionaryAlgorithms
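The "tiny head picks the next model-role pair" step can be pictured with a minimal sketch. Everything below is assumed for illustration: the model names, the three-dimensional hidden state, and the hand-written weights are hypothetical, and the paper's head is trained with sep-CMA-ES rather than fixed by hand.

```python
from itertools import product

MODELS = ["model_a", "model_b"]            # hypothetical model pool
ROLES = ["Thinker", "Worker", "Verifier"]  # the three roles named in the post
ACTIONS = list(product(MODELS, ROLES))     # every (model, role) pair

def head_logits(hidden, weights):
    """Linear head: one logit per (model, role) pair, computed as a dot
    product between that pair's weight row and the hidden state."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

def choose_action(hidden, weights):
    """Pick the (model, role) pair with the highest logit."""
    logits = head_logits(hidden, weights)
    best = max(range(len(ACTIONS)), key=logits.__getitem__)
    return ACTIONS[best]

# Hypothetical hidden state summarizing the conversation so far.
hidden = [1.0, 0.0, -1.0]

# Hypothetical head weights, one row per entry of ACTIONS (same order).
weights = [
    [0.1, 0.0, 0.0],
    [0.0, 0.2, 0.3],
    [0.5, 0.0, -0.4],
    [0.2, 0.1, 0.0],
    [0.0, 0.0, 0.1],
    [0.3, 0.3, 0.3],
]

print(choose_action(hidden, weights))
```

Because the head is this small, a gradient-free optimizer only has to search over the head's weights, which is one plausible reading of why an evolutionary strategy is workable here.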
[Image]
Replies 5 · Retweets 29 · Likes 149 · Views 7.8K
the1eyecat retweeted
Huaxiu Yao (@HuaxiuYaoML)
Every memory system for LLM agents evolves what it stores. None evolves how it retrieves. 🧬 EvolveMem is out, now shipping inside the SimpleMem v0.3.0 update. Powered by AutoResearch: the system researches its own retrieval, treating the full retrieval config as a structured action space and running a closed loop: evaluate ➜ diagnose ➜ propose ➜ validate ➜ repeat. 🔬 From a minimal baseline, 7 autonomous rounds produce a retrieval policy that beats the strongest published baseline by +25.7% on LoCoMo and +18.9% on MemBench. 🧬 It discovers entirely new retrieval dimensions not present in the original design, all integrated into the unified SimpleMem package. 📄 Paper: arxiv.org/abs/2605.13941 💻 Code: github.com/aiming-lab/Sim… Led by @itsJiaqiLiu, @XinyeYee with contributions from @richardxp888, @ZhengBerkeley, @cihangxie
[Image]
Replies 6 · Retweets 46 · Likes 197 · Views 12K
the1eyecat retweeted
Chongrui Ye (@chongrui28836)
Introducing Auto-Dreamer 🧠💤

A research counterpart to @AnthropicAI's "Dreaming" for Claude Managed Agents, exploring the same idea: agents that consolidate their experience offline into compact, reusable memory. We trained the consolidator with RL, shrinking the active memory bank 6-11× while gaining task success.

💥 Key Result: Beats 10 memory baselines on ScienceWorld + ALFWorld + WebArena, including RL-trained writers Mem-α and UMEM, with an order-of-magnitude smaller bank.

Auto-Dreamer is a two-timescale memory system inspired by complementary learning systems:
- A fast Writer appends entries online after each trajectory.
- A slow Consolidator wakes every k sessions and rewrites a region of the bank into compact synthesized entries via tool-use rollouts.
- Trained with GRPO + a counterfactual utility reward that scores entries by how much they actually help downstream retrieval.

Trained only on ScienceWorld, the consolidator transfers zero-shot to ALFWorld and WebArena, best in class on both.

Paper: arxiv.org/abs/2605.20616
Code release coming soon.
Advised by @youjiaxuan @McAuleyLabUCSD @GeLiuSaber
🧵 More figures below ↓
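The two-timescale structure (fast appends, slow periodic rewrites) can be sketched in a few lines. This is only a structural illustration under loud assumptions: `MemoryBank` is a hypothetical class, each write is treated as one session, and the consolidator here is a plain merge of entries that share a topic, whereas the post describes an RL-trained consolidator that rewrites the bank through tool-use rollouts.

```python
class MemoryBank:
    """Toy two-timescale memory: append on every session, consolidate every k."""

    def __init__(self, consolidate_every):
        self.entries = []            # list of (topic, note) pairs
        self.k = consolidate_every   # sessions between consolidations
        self.sessions = 0

    def write(self, topic, note):
        """Fast path: append one entry after a trajectory (one session here)."""
        self.entries.append((topic, note))
        self.sessions += 1
        if self.sessions % self.k == 0:
            self.consolidate()

    def consolidate(self):
        """Slow path stand-in: merge entries with the same topic and drop
        duplicate notes, keeping first-seen order."""
        merged = {}
        for topic, note in self.entries:
            merged.setdefault(topic, []).append(note)
        self.entries = [
            (topic, "; ".join(dict.fromkeys(notes)))
            for topic, notes in merged.items()
        ]

bank = MemoryBank(consolidate_every=3)
bank.write("door", "needs key")
bank.write("door", "needs key")
bank.write("stove", "turn knob")  # third session triggers consolidation
print(bank.entries)
```

The bank shrinks only on the slow path, which mirrors the claim that the active memory stays compact while the writer keeps appending.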
[Image]
Replies 14 · Retweets 41 · Likes 228 · Views 32.9K
the1eyecat retweeted
DAIR.AI (@dair_ai)
NEW paper worth reading. A full agentic workflow can be distilled into model weights and run at roughly 100x lower inference cost while preserving near-frontier task quality. The workflow includes multi-step LLM calls, tool invocations, intermediate scratchpads, and decision structure. Instead of expressing all of that at runtime through a framework, the paper amortizes the behavior into a compiled model through targeted distillation. This is the strongest economic argument for agent compilation so far. Runtime loops are flexible, but expensive. Compiled workflows trade some flexibility for a massive inference-cost reduction. Paper: arxiv.org/abs/2605.22502 Learn to build effective AI agents in our academy: academy.dair.ai
[Image]
Replies 18 · Retweets 47 · Likes 228 · Views 12.8K
the1eyecat retweeted
AVB (@neural_avb)
New version of fast-rlm out today (v1.14)

New features in this release:
- Input to RLM need not be a string; it can be any Python dictionary
- Output schema declaration -> RLM is guaranteed to return output in your designed structured output
- Agents can call subagents with explicit structured I/O contracts
- Explicitly state what LLM to use for planning and execution tasks
- Better log capturing, suitable for post-training
- Explicitly prompting subagents at max depth differently than root/intermediate subagents
- Minor bug fixes
- Cost and usage statistics can be viewed from the opentu based terminal log viewer

Try: `pip install fast-rlm`
Check the repo in comments for all features
Replies 4 · Retweets 15 · Likes 147 · Views 18.7K
the1eyecat retweeted
Victor M (@victormustar)
llama.cpp with MTP support makes local models fast enough to use as daily drivers 🚀 Qwen3.6-27B dense generation (on A10G): From 25 tok/s → 45 tok/s (+78%). Two flags on llama-server: --spec-type draft-mtp --spec-draft-n-max 2
Quoted post, Georgi Gerganov (@ggerganov):

llama.cpp adds MTP for the Qwen3.6 family This is a significant milestone for the local AI ecosystem. The performance jump with these changes is massive and elevates local inference on commodity hardware further. Special thanks to Aman Gupta for leading this development! github.com/ggml-org/llama…

Replies 40 · Retweets 125 · Likes 1.2K · Views 168.3K
the1eyecat retweeted
Sapient Intelligence (@Sapient_Int)
Introducing HRM-Text. An ultra-lean 1B-parameter reasoning language model designed to deliver strong general performance with a fraction of the data, compute, and infrastructure. Trained on just 40B structured tokens, HRM-Text achieves competitive performance while using ~1/1000 of the training data of comparable models. The kicker? The full model trains in roughly one day on a $1,000 budget. This opens the door to a new generation of AI that is powerful, accessible, and radically easier to adapt. Theories and research concepts once deemed too expensive to test are officially back in the game. Sapient Intelligence invites you to help us shape a new paradigm for general intelligence.
Replies 150 · Retweets 447 · Likes 3.1K · Views 484.1K
the1eyecat retweeted
Unsloth AI (@UnslothAI)
Qwen3.6 now runs 2x faster with MTP GGUFs! Run locally on just 18GB RAM. ⚡️ MTP enables Qwen3.6 to generate ~1.4–2.2× faster with no accuracy change. Qwen3.6-27B MTP runs at 160 tokens/s. 35B-A3B reaches 240 t/s. GGUFs: huggingface.co/unsloth/Qwen3.… Guide: unsloth.ai/docs/models/qw…
[Image]
Replies 130 · Retweets 299 · Likes 2.5K · Views 135.5K
the1eyecat retweeted
Dan Kornas (@DanKornas)
Distributed AI research is easier to understand when the experiments are public.

AGI is a living GitHub research repository for the Hyperspace peer-to-peer AI network, where autonomous agents run experiments, gossip findings with peers, and push results back to GitHub.

It helps you study how a decentralized AI network is being assembled by showing the pieces in one place: pods, distributed training, raw network snapshots, blockchain coordination, and node capabilities.

Key features:
• Private AI pods – small groups can pool machines into a shared AI cluster through the CLI
• Distributed training path – nodes train locally, then share compressed weight deltas over the P2P network
• Raw network snapshots – hourly JSON snapshots expose CRDT leaderboard state without extra narrative
• Multi-domain experiments – agents run work across ML, search ranking, finance, skills/tools, and causes
• Node capability map – inference, research, storage, embeddings, memory, orchestration, validation, and relay roles are documented

It’s open-source (MIT license). Link in the reply 👇
[Image]
Replies 2 · Retweets 12 · Likes 47 · Views 2.4K
the1eyecat retweeted
Berryxia.AI (@berryxia)
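Folks, embodied AI is finally starting to look credible! The next real frontier for embodied AI is here. HuggingPapers just shared a major survey: "World Action Models: The Next Frontier in Embodied AI". It is the first paper to systematically define "World Action Models (WAMs)". The core idea of WAMs: embodied foundation models that jointly predict future world states and generate real, executable actions. These are no longer language models that only "think about it"; they are agents that can understand the physical world, predict how it changes, and act. The paper systematically reviews the architecture designs, data ecosystems, and evaluation protocols of current WAMs, and includes a complete development timeline for 2024-2026 that makes the picture clear at a glance. Project page: openmoss.github.io/Awesome-WAM/ Paper: huggingface.co/papers/2605.12… If you work on robotics, embodied agents, physical-world AI, or world models, this survey arrives at just the right time.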
[Image]
Quoted post, DailyPapers (@HuggingPapers):

World Action Models: The Next Frontier in Embodied AI The first systematic survey defining WAMs as embodied foundation models that jointly predict future states and generate actions, covering architectures, data ecosystems, and evaluation protocols.

Replies 10 · Retweets 45 · Likes 176 · Views 18.7K
the1eyecat retweeted
Zhuokai Zhao (@zhuokaiz)
Qwen3, GLM-5, and MiMo all use on-policy distillation (OPD) in post-training. Thinking Machines also wrote it up as a cheap alternative to RL. But in practice it is surprisingly brittle to make work, much more so than SFT or RL.

Three recent papers [1, 2, 3] helped me make sense of why. The mechanism is consistent across the failure modes they describe, and it's worth understanding before running another OPD experiment.

OPD looks like "match the teacher distribution." But in practice, the update is driven by a very small set of next-token choices at each generation step: mostly just the handful of tokens that both the student and teacher think are plausible as the next token. Once that small set breaks, OPD breaks.

The real object OPD is learning on

At every generation step, the model has a huge vocabulary. 150K tokens, maybe more. But almost all of the probability mass sits on a tiny number of tokens. One paper [1] shows that the overlapping high-probability tokens between teacher and student carry around 97–99% of the total probability mass. So although OPD is written as reverse-KL over a full vocabulary, most of the useful learning signal comes from a tiny local menu of next-token options.

Here is what I mean by "the handful of tokens." For a given prefix like:

"Let's solve this step by step. First, we…"

the student may think the next token should be one of:

"need", "can", "have", "find", "know", "compute", …

The teacher has its own version of this menu. OPD works when these two menus mostly overlap, and when the teacher puts higher probability on better choices inside that menu. It fails for a few different reasons: the menus don't overlap, the menu drifts somewhere bad mid-training, we only look at one item on the menu instead of the whole thing, or the per-position learning signals end up pulling the model in inconsistent directions.

Four things can go wrong. The first two are about whether the menu is in good shape. The third is about whether we look at the whole menu or just one item from it. The fourth is about whether the signals across positions combine into a useful update.

1. The student and teacher are thinking in different "languages"

A stronger teacher does not necessarily make a better OPD teacher. Li et al. [1] show this very clearly: a 7B teacher can outperform a 1.5B model on benchmarks, but still fail to improve the 1.5B student through OPD.

Why? Because benchmark accuracy measures final answers. OPD trains on next-token probabilities. A stronger model may solve the same problem through a different reasoning path: different intermediate steps, different phrasing, different proof structure, different local token choices. So when the student writes its own partial solution, the teacher may not assign useful probability to the student's next natural steps. The teacher is better overall, but not necessarily helpful on the student's current path.

The most interesting experiment is the "reverse distillation", where they take a 1.5B model that was improved by RL (a student that has already moved beyond its original base behavior) and try to distill it back using two teachers: the original pre-RL 1.5B model and a larger 7B model from the same family. Both teachers pull the RL-improved student backward. The student loses its RL gains and regresses toward the older behavior.

This sounds surprising at first. But the explanation is simple: OPD does not know that the student's RL behavior is better unless the teacher's token probabilities support it. If the teacher still prefers the old reasoning pattern, OPD will train the student back toward that pattern. So the RL gains disappear not because the teacher is "weak" in benchmark terms, but because the teacher is giving token-level supervision for a behavior the student has already moved past.

Benchmark gap does not tell you whether OPD will work. Token-level compatibility does.

2. Repetition becomes locally rewarding

Even if OPD starts well, it can still collapse. The most striking failure mode is when training looks fine for a while, then within roughly 30 steps the model starts producing much longer outputs, stops terminating, repetition spikes, and accuracy collapses.

The mechanism is counterintuitive at first but makes sense once you see it. In sampled-token OPD, the reward for a token is roughly the teacher's log-probability minus the student's log-probability on that token. So if the teacher gives a token much higher probability than the student does, that token receives a large positive signal.

Now imagine the student starts repeating itself. In practice this looks less like coherent sentences repeating and more like degenerate loops, something like "wait, wait, wait, wait, wait" filling the rest of the context. This prefix is bad globally. But locally, it is very predictable. A strong teacher is often very confident about predictable text. Once the loop has gone on for a while, the teacher can assign high probability to the next repeated token. The student may be less confident than the teacher. So the repeated token gets a large positive log-ratio. That means OPD accidentally rewards continuing the repetition.

Before repetition starts, repeated tokens are rare, so they don't matter much. But once repetition appears, those tokens become frequent. And because they also receive large positive advantages, they start dominating the update. Luo et al. [2] measure repeated tokens getting 4 to 9 times larger advantage than normal tokens after collapse. Then the loop reinforces itself:

more repetition → more predictable prefix → higher teacher confidence → larger positive signal on repeated tokens → even more repetition

This is different from the usual length-bias issue in RL. It's more specific: a broken prefix creates locally high-reward repeated tokens, and OPD faithfully amplifies them.

3. We often only look at one item on the menu

The clean objective would compare the teacher and student distributions over multiple possible next tokens. But many public OPD recipes, including the ones used industrially, use a cheaper version: let the student generate one token, then ask whether the teacher assigned this exact token higher or lower probability than the student. If teacher probability is higher, push the student toward it. If lower, push the student away. That is the sampled-token log-ratio. It is cheap because you only score the token the student actually sampled. You do not need to compare the full vocabulary.

There is a real reason for this design choice. Full sequence-level reverse-KL is noisy for long generations because an early token update gets entangled with many future rewards. Token-level OPD avoids that by giving each token its own local feedback. That gives much better variance scaling with length [3]: worst-case variance grows as O(T²) for token-level instead of O(T⁴) for sequence-level. So for long reasoning traces, token-level feedback is attractive.

The problem is that "one sampled token" is a very noisy view of the teacher's actual next-token preference. At a given step, the teacher may have a whole cluster of reasonable next tokens. But sampled-token OPD only checks the one token the student happened to pick. This creates three problems.

First, the student samples tokens from its own distribution, so on most positions the student's probability exceeds the teacher's and the log-ratio is negative. The reward is computed as teacher minus student, and the student is picking tokens where its own log-prob is near its highest, meaning the subtraction is almost always against the student's strongest values. Positive signal only shows up when the student happens to sample a token the teacher likes even more than the student does, which is the minority case.

Second, if the student drifts into weird prefixes, the teacher's local probabilities may no longer reflect global quality.

Third, tokenization and special-token differences can create fake disagreements. The student and teacher may represent the same text with different token boundaries, so a single-token comparison can look terrible even when the underlying string is fine.

The fix proposed in [3] is simple: don't guess the teacher's local preference from one sampled token. Instead, take the teacher's top-k next tokens, renormalize both teacher and student probabilities over that set, and compute reverse-KL there. It's still cheap: you only need the top-k logits, not the whole vocabulary. But it changes the supervision from "did the teacher like this one sampled token?" to "among the teacher's plausible next tokens, does the student put probability mass in the same places?" That is a much better local learning signal. They also add top-p sampling during rollout, so the student is less likely to wander into extremely low-probability prefixes, and mask special tokens to avoid fake tokenization mismatches.

4. Even good token signals may not add up

This is the least developed of the four failure modes, and possibly the most important. Li et al. [1] compare a successful OPD setup against a failing one and find something strange. The failing teacher's per-token advantages are actually larger than the working teacher's, but the gradient norms are smaller. And the failing teacher's sequence-level reward can still distinguish correct from incorrect rollouts, comparable to the working case. So the reward signal is globally informative. It just does not produce useful gradients.

What's going on is that OPD computes a learning signal at every token position in the rollout, and then sums them into one gradient update. Each position's contribution is a vector in parameter space pointing in some direction. If the per-position vectors mostly point in the same direction, they add up to a big coherent push. But if they point in different directions, they partially cancel when summed, and the model barely moves even though each individual position had something to say.

The empirical fingerprint is consistent with the second case: large per-token advantages but small gradient norms after summing, that is, individually strong signals that cancel out when combined. The successful teacher shows the opposite pattern: smaller per-token advantages, larger gradient norms after summing, that is, weaker individual signals that reinforce each other when combined.

The paper that raises this leaves it as a hypothesis, but the empirical fingerprint (large per-token advantages, small gradient norms, informative sequence-level reward) is specific enough that it should be testable.

What this explains

A lot of practical OPD fixes start to look related.

SFT cold start helps because it moves the student closer to the teacher's reasoning style before OPD begins.
Teacher-aligned prompts help because they put the student in regions where the teacher gives more reliable feedback.
KL regularization helps because it prevents the student from drifting too quickly into weird generations.
Mixture distillation helps because it keeps some clean reference trajectories in training, so the rollout distribution does not become fully self-generated garbage.
Top-k matching helps because it stops pretending one sampled token is enough to represent the teacher's local preference.

These look like different tricks. But they are all trying to protect the same thing: the small set of plausible next tokens where teacher and student can actually communicate.

The ceiling, and the real open question

The more interesting part is long-horizon reasoning. The deeper the student gets into its own generated solution, the more likely the prefix is something the teacher would not have written. And once the teacher is judging prefixes outside its own natural distribution, its token probabilities become less reliable. One paper [1] shows this directly: teacher continuation advantage drops sharply as the student prefix gets longer, from +0.37 at 1K prefix to +0.02 at 16K.

That is a bad sign for long-CoT and agentic OPD, because those are exactly the settings where the student spends many steps inside its own partially-generated world. OPD works best when the teacher and student stay close enough that teacher probabilities remain meaningful. Long-horizon agentic training pushes in the opposite direction.

This is the open question I would most want to see investigated. The failure modes in sections 1-3 are diagnosable and have proposed fixes. Section 4 is a hypothesis about local gradient structure that should be testable. But the long-horizon ceiling is different: it is about whether OPD's core assumption (the teacher knows something useful about the student's next token) can hold at all when the student is operating many steps inside its own generated world.

My current takeaway

OPD isn't really a full-vocabulary distribution-matching problem. It's a fragile communication protocol between teacher and student through a tiny local menu of next-token choices. When that menu overlaps, stays clean, and produces gradients that add up, OPD works. When it doesn't, OPD quietly trains the student in the wrong direction.

References
[1] Li et al. Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe. arxiv.org/abs/2604.13016
[2] Luo et al. Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models. arxiv.org/abs/2604.08527
[3] Fu et al. Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes. arxiv.org/abs/2603.25562
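To make the contrast in section 3 of the thread concrete, here is a minimal sketch of the two local signals: the sampled-token log-ratio and the reverse-KL over the teacher's top-k tokens after renormalization. The function names, the four-token vocabulary, and the probabilities are all made up for illustration; this follows the thread's verbal description and is not code from any of the cited papers.

```python
import math

def to_logp(probs):
    """Convert a token -> probability dict into token -> log-probability."""
    return {tok: math.log(p) for tok, p in probs.items()}

def sampled_token_log_ratio(teacher_logp, student_logp, token):
    """Sampled-token signal: teacher log-prob minus student log-prob,
    evaluated only on the one token the student actually sampled."""
    return teacher_logp[token] - student_logp[token]

def topk_reverse_kl(teacher_logp, student_logp, k):
    """Reverse KL, KL(student || teacher), restricted to the teacher's
    top-k tokens, with both distributions renormalized over that set."""
    top = sorted(teacher_logp, key=teacher_logp.get, reverse=True)[:k]

    def renorm(logp):
        log_z = math.log(sum(math.exp(logp[t]) for t in top))
        return {t: logp[t] - log_z for t in top}

    t_logp, s_logp = renorm(teacher_logp), renorm(student_logp)
    return sum(math.exp(s_logp[t]) * (s_logp[t] - t_logp[t]) for t in top)

# Hypothetical next-token "menus" for one prefix.
teacher = to_logp({"need": 0.50, "can": 0.30, "find": 0.15, "wait": 0.05})
student = to_logp({"need": 0.20, "can": 0.60, "find": 0.10, "wait": 0.10})

# The student most likely samples its own favorite, "can", where its
# probability exceeds the teacher's, so the log-ratio comes out negative.
signal = sampled_token_log_ratio(teacher, student, "can")

# The top-k variant compares the whole local menu instead of one item.
loss = topk_reverse_kl(teacher, student, k=3)

print(f"sampled-token log-ratio: {signal:.3f}")
print(f"top-3 reverse KL: {loss:.3f}")
```

In this toy case the single sampled token only says "less of this one", while the top-k loss also carries the information that mass should move toward "need", which is the difference the thread is pointing at.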
Replies 5 · Retweets 82 · Likes 470 · Views 66.9K
the1eyecat retweeted
alphaXiv (@askalphaxiv)
“Self-Distilled Agentic RL” Agent RL learns from sparse trajectory rewards, while self-distillation gives dense token guidance. But in multi-turn agents, naive distillation can break because privileged teacher signals get noisy as trajectories drift. The key idea of this paper is to keep GRPO as the main optimizer and use a token-level gate to decide when self-distillation should matter. This method trusts positive teacher signals more than negative ones, letting agents internalize retrieved skills without needing those skills at test time. It improves GRPO by +9.4 on ALFWorld, +7.0 on Search-QA, and +10.2 on WebShop-Acc.
[Image]
Replies 3 · Retweets 58 · Likes 431 · Views 56K
the1eyecat retweeted
DAIR.AI (@dair_ai)
// Harnessing Agentic Evolution //

Pay attention to this one if you run iterative agentic search loops. (bookmark it)

AEvo splits the self-improvement loop into two jobs:
> One proposes the next candidate.
> The other watches what worked, what failed, and edits the procedure that proposes future candidates.

Past runs (candidates, feedback, traces, failures) become memory the meta-agent reads from.

Achieves 26% relative gain over the strongest evolution baseline on agentic and reasoning benchmarks. SOTA on three open-ended optimization tasks under the same iteration budget.

If you are accumulating agentic search logs you never use, this is how to feed them back into the search procedure itself.

Paper: arxiv.org/abs/2605.13821
Learn to build effective AI agents in our academy: academy.dair.ai
[Image]
Replies 8 · Retweets 62 · Likes 319 · Views 16K