Epsilon Guanlin Lee

2.2K posts

@Epsilon_Lee

PhD, MLer, CLer (NLPer), ML Engineer at https://t.co/gX6Lem59Co, a believer in interpretability research of AI/ML/NNs

Beijing, Chengdu · Joined October 2016
4.2K Following · 316 Followers
Epsilon Guanlin Lee retweeted
Palash Shah @palashshah
turns out that building evals is super challenging even now. i thought a lot of it was table stakes, but it has only become harder now that agents are more complex than ever! going to start tweeting more about how i design evals, especially to create autonomous improvement loops!
Epsilon Guanlin Lee retweeted
Zhihao Jia @JiaZhihao
🚀Introducing Motus Tracing: open-source observability for AI agents. Without traces, an agent is a black box that burns tokens. Yet most agent observability and tracing stacks today live behind accounts and subscription tiers. Motus Tracing is fully open source. Capture every model call, tool call, sandbox interaction, sub-agent action, retry, and error, for any agent framework. One unified interface from development to production. Same spans for debugging, evals, and Learning Agents. Blog: lithosai.com/blog/motus-age… Code: github.com/lithos-ai/motus @lithos_ai
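The span model it describes is easy to picture in miniature. Below is a generic sketch of that pattern; the decorator, span fields, and `TRACE` buffer are invented for illustration and are not Motus's actual API:

```python
import functools
import json
import time
import uuid

TRACE = []  # in-memory span buffer; a real tracer would export these

def traced(span_type):
    """Wrap any agent step (model call, tool call, sub-agent, ...) in a span."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            span = {"id": str(uuid.uuid4()), "type": span_type,
                    "name": fn.__name__, "start": time.time()}
            try:
                result = fn(*args, **kwargs)
                span["status"], span["output"] = "ok", result
                return result
            except Exception as e:                 # errors become spans too
                span["status"], span["error"] = "error", repr(e)
                raise
            finally:
                span["end"] = time.time()
                TRACE.append(span)
        return wrapper
    return decorator

@traced("tool_call")
def search(query: str) -> str:
    return f"results for {query!r}"

search("agent observability")
print(json.dumps(TRACE, indent=2))   # same spans usable for debugging or evals
```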
Epsilon Guanlin Lee retweeted
Souradip Chakraborty @SOURADIPCHAKR18
🚨Typical RL algorithms and on-policy distillation methods are blind samplers: they use privileged info to score rollouts, but not to *find* them. We ask: can we use privileged info to *actively sample* the rollouts RL wishes it could stumble upon with compute? ⤵️ Pedagogical RL
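A toy illustration of the blind-vs-privileged distinction (not the paper's method; the random-walk environment, `reward`, and the 0.8 bias are all invented): a sampler that conditions on privileged info surfaces high-reward rollouts that blind on-policy sampling almost never hits.

```python
import random

def reward(path, goal):            # privileged scorer: did we reach the goal?
    return 1.0 if path and path[-1] == goal else 0.0

def blind_rollout(n_steps):        # standard on-policy sampling: random walk
    x, path = 0, []
    for _ in range(n_steps):
        x += random.choice([-1, 1])
        path.append(x)
    return path

def privileged_rollout(n_steps, goal):
    """Toy 'pedagogical' sampler: the goal (privileged info) biases each step,
    surfacing high-reward rollouts a blind policy rarely stumbles upon."""
    x, path = 0, []
    for _ in range(n_steps):
        toward = 1 if x < goal else -1
        x += toward if random.random() < 0.8 else -toward
        path.append(x)
    return path

goal, n, trials = 10, 20, 1000
blind = sum(reward(blind_rollout(n), goal) for _ in range(trials)) / trials
guided = sum(reward(privileged_rollout(n, goal), goal) for _ in range(trials)) / trials
print(f"goal-hit rate: blind={blind:.3f}  privileged={guided:.3f}")
```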
Epsilon Guanlin Lee retweeted
Lakshya A Agrawal @LakshyAAAgrawal
Learning from rich textual feedback (errors, traces, partial reasoning) beats scalar reward alone for LLM optimization. GEPA demonstrated this for context-space optimization (prompts and agent harnesses), delivering frontier results at a fraction of the cost of RL. But context-only optimization is bounded by the base model's capability ceiling; weight updates can reach further.

Very excited about this new line of work on Fast-Slow Training (FST), which interleaves context and model weight optimization! The idea is a clean division of labor between two interleaved loops:

🔹 Fast loop (context): GEPA reads rich rollout feedback, updating the context layer. The context becomes a fast-updating scratchpad of what the model needs to know about this task, right now.
🔹 Slow loop (model parameters): RL updates the model's parameters conditioned on the evolving context. Because the prompt already carries task-specific nuances, the model parameters are freed from absorbing them and focus on what actually generalizes across tasks and pushes the frontier.

⦁ 3× more sample-efficient than RL on math, code, and physics reasoning
⦁ ~70% lower KL divergence from base at matched accuracy
⦁ Plasticity preserved: FST checkpoints respond better to additional RL on new tasks than RL-only ones
⦁ Continual learning across changing tasks (HoVer → CodeIO → Physics) where RL stalls the moment the task switches

FST is a direction towards:
⦁ Addressing RL's pain points: entropy collapse, sparse rewards, long-horizon exploration
⦁ Providing a clean channel for rich feedback into weight updates
⦁ Demonstrating model-harness co-evolution
⦁ Discovery: using fast context updates for broad exploration, while leveraging a continually improving model.

Check out the full thread below:
Kusha Sareen @KushaSareen

Can LLMs adapt continually without losing base skills? Fast-Slow Training (FST) pairs "slow" weights with "fast" context. FST vs. RL:
• 3x more sample-efficient
• Higher performance ceiling
• Less KL drift (better plasticity)
• Continual learning: succeeds where RL stalls
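A minimal sketch of the two interleaved loops, with stand-in update rules (the real fast loop is GEPA's reflective prompt optimization and the real slow loop is an RL gradient step; `rollout`, the reward shape, and the update cadence here are invented):

```python
def fast_context_update(context, feedback):
    """Fast loop: fold rich textual rollout feedback into the context,
    which acts as a quickly-updating scratchpad for the current task."""
    return context + [f"lesson: {feedback}"]

def slow_weight_update(weights, reward, lr=0.05):
    """Slow loop: nudge parameters conditioned on the evolving context,
    so weights absorb what generalizes rather than task-specific detail."""
    return {k: v + lr * reward for k, v in weights.items()}

def rollout(weights, context, task):
    """Stand-in rollout returning (scalar reward, rich textual feedback)."""
    score = min(1.0, 0.1 * len(context) + weights["skill"])
    feedback = "solved" if score >= 1.0 else f"failed a step of {task}"
    return score, feedback

weights, context = {"skill": 0.0}, []
for step in range(12):
    r, fb = rollout(weights, context, task="HoVer")
    context = fast_context_update(context, fb)       # every rollout (fast)
    if (step + 1) % 4 == 0:                          # every few rollouts (slow)
        weights = slow_weight_update(weights, r)
print(f"context entries: {len(context)}, weights: {weights}")
```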

Epsilon Guanlin Lee retweeted
Mingyu_Jin19 @fnruji316625
Does mechanistic interpretability really find *the* circuit? Our new paper, "All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs" (accepted at ICML 2026), suggests the answer may be: not always. A common implicit assumption in mechanistic interpretability is that a model's behavior is explained by the circuit — a sparse, canonical, almost-unique mechanism. Instead, for the same LLM task, we find multiple circuits/sheaves that are:
✅ faithful
✅ sparse
✅ structurally different
✅ low-overlap
This means a discovered circuit may not be the unique mechanism behind a behavior, but one realization among many possible mechanisms. We call for rethinking how circuit/sheaf discovery results should be interpreted and evaluated. Huge thanks to my amazing collaborators: @frankniujc, @YutongYin774638, and @zhaoran_wang
Paper: arxiv.org/abs/2605.12671
#MechanisticInterpretability #LLM #AI #MachineLearning
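The claim implies a simple sanity check: two discovered circuits can each be sparse and faithful while sharing almost no edges. A toy sketch with invented edge sets and scores (not the paper's data or algorithm):

```python
def jaccard(a: set, b: set) -> float:
    """Overlap between two circuits viewed as edge sets."""
    return len(a & b) / len(a | b)

full_edges = {(l, h) for l in range(12) for h in range(12)}  # 144 toy "edges"

circuit_a = {(0, 3), (2, 7), (5, 1), (8, 8), (10, 4)}   # toy discovery run A
circuit_b = {(0, 3), (1, 9), (4, 2), (7, 7), (11, 6)}   # toy discovery run B

faithfulness = {"A": 0.96, "B": 0.94}   # toy: task performance recovered

for name, c in [("A", circuit_a), ("B", circuit_b)]:
    print(f"circuit {name}: sparsity={1 - len(c)/len(full_edges):.2%}, "
          f"faithfulness={faithfulness[name]:.2f}")
print(f"overlap (Jaccard) = {jaccard(circuit_a, circuit_b):.2f}")  # low
```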
Epsilon Guanlin Lee retweeted
Ofir Press @OfirPress
We're now entering the super-human stage of AI: Instead of training/evaluating AI on tasks that *a* human had previously solved in a few hours, we need to challenge AI to complete in a few hours tasks that previously took entire *teams* of humans *years* to accomplish. 🧵⬇️
Epsilon Guanlin Lee retweeted
Yuandong Tian @tydsh
Today we launch Recursive. We are building AI that discovers knowledge automatically and improves itself recursively, an open-ended process that will fundamentally change how science and technology advance. Our 25 top researchers and engineers in San Francisco and London bring diverse expertise spanning agentic AI scientists, architecture and algorithm design, world models, optimization, and interpretability, united by a shared conviction that this is the most important problem we could be working on today. If you are interested in joining, please send your resume to talent@recursive.com. Follow us at @Recursive_SI!
Recursive @Recursive_SI

x.com/i/article/2054…

Epsilon Guanlin Lee retweeted
jietang @jietang
Recent thoughts:

The Shift to Long-Horizon Tasks
The most likely breakthrough this year will be in long-horizon tasks. We are moving toward a stage where Large Language Models (LLMs) learn to complete extended, complex missions by interacting with Agent environments. This is perhaps where the true value of LLMs lies. Take cybersecurity as an example: imagine a model that continuously hunts for software bugs and vulnerabilities. While it sounds like a search process, it's actually the model learning the high-level intuition and methodology of a professional hacker. Unlike humans, AI can run 24/7 without fatigue. It could potentially find exploits at a much higher frequency and claim bounties on platforms like HackerOne or BugCrowd. It sounds fun, but fundamentally, it's a revolution that displaces the hacker. If even hackers are being "disrupted," one can only imagine the impact on general programmers.

From One-Person to None-Person Companies
Building on long-horizon capabilities, Autonomous Agent Systems (AAS) will inevitably become the next frontier. Last year, we were discussing the rise of the "One Person Company" (OPC). I didn't expect us to move so quickly toward the "None Person Company" (NPC). It's an ironic twist—we might all end up as NPCs in this new ecosystem.

Engineering the Impossible: Memory and Learning
To realize the vision above, we must solve three technical pillars: Memory, Continual Learning, and Self-Judging. I used to think these would require massive paradigm shifts and years of research. However, the pressure from both the technical and application sides is so intense that we are seeing these capabilities emerge through ingenious engineering "tricks":
Memory: Long context windows (1M+) and RAG have significantly bridged the gap.
Continual Learning: While true continual learning remains difficult, the release cycles are shrinking. Global models are updated monthly; domestic models are catching up. If we reach weekly updates by next year, it will effectively function as continual learning.
Self-Judging: This remains the most elusive, yet models like Opus 4.7 are already demonstrating early self-correction and judgment capabilities.

The Self-Evolving Endgame
The most difficult—and most promising—path is Self-Evolution. The current wave is incredibly fierce. I suspect that models like Claude may have already achieved a baseline for self-training: writing their own code, cleaning their own data, generating synthetic data, and then training on it. It might "waste" some compute, but it saves the most precious resources: human labor and time. In the LLM era, speed is everything. Rapid iteration is what creates the cognitive gap between leaders and followers. Claude's rumored 2-million-chip cluster for next year is likely dedicated to exactly this: autonomous model self-training.

Technical Summary:
1M Context: Necessary baseline.
Memory & Continual Learning: Prerequisites, likely solved first via "tricky" engineering.
Harnessing Environments: The breakthrough point.
Self-Judging: The tipping point.
Full Self-Training: The endgame.

Redefining AGI and the Industry
If this is the road to AGI, then AGI's definition should be the sum of all human collective intelligence, not just an individual's intelligence. It must possess the creative capacity to produce something as profound as the "Theory of Relativity"—meeting the bar set by Hassabis. During this transition, every APP will need to be reconstructed as AI-native. In fact, we might move past the concept of APPs entirely. The most significant challenge will be the reconstruction of the operating system itself. In the future, you won't see a traditional desktop; you will see an LLM OS, where applications are "generated on demand." This challenges the 80-year-old Von Neumann architecture and represents a total upheaval of the computer science industry.

The Irreversible Wave
From completing long-horizon tasks to fully autonomous operations, every sector—Security, Finance, Law, E-commerce—will be reshaped. Many friends have reached out lately, asking how to transform their enterprises to keep pace with AI. But few truly realize that this irreversible process has already begun. As this massive technical wave hits, we must be prepared to act, but we must also start thinking seriously about how to regulate it.
Epsilon Guanlin Lee retweeted
alex zhang @a1zhang
RLM arXiv paper update: depth>1 results, more comparisons, more training, and more error analysis! We add depth=2/3 experiments, where the RLM now has access to recursive RLM calls. This is also a feature of the open-source `rlm` repo. We observe significant performance gains on OOLONG-Pairs and gains on all other benchmarks! We also include various OpenCode and Claude Code comparisons now, per popular request. We add a length generalization experiment on MRCRv2 to show more promising training results, add a small prompting case study on OOLONG, and update the error analysis section to discuss the effect of syntax errors, decomposition mistakes, and general observations from the RLM trajectories. The appendix is now also updated with several new experiments and plots!
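For intuition, depth-limited recursion can be sketched generically; the decomposition scheme and `toy_lm` below are invented and are not the `rlm` repo's actual interface:

```python
def toy_lm(prompt: str) -> str:
    """Stand-in for a real model call."""
    return f"answer({prompt[:30]}...)" if len(prompt) > 30 else f"answer({prompt})"

def rlm_call(prompt: str, depth: int = 0, max_depth: int = 2) -> str:
    """A call that may recursively delegate sub-parts, up to max_depth."""
    if depth >= max_depth or len(prompt) <= 40:
        return toy_lm(prompt)                     # leaf: plain model call
    mid = len(prompt) // 2                        # toy decomposition by halving
    left = rlm_call(prompt[:mid], depth + 1, max_depth)
    right = rlm_call(prompt[mid:], depth + 1, max_depth)
    return toy_lm(f"combine: {left} | {right}")   # aggregate sub-answers

print(rlm_call("x" * 200, max_depth=2))   # a depth=2 run
```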
Epsilon Guanlin Lee retweeted
Frank Hutter @FrankRHutter
The data science revolution is here now. TabPFN-3 is live, taking tabular foundation models to enterprise scale 🤩 1M training rows on a single H100. No training. No tuning. Load and predict. 🧵 1/5 #tabpfn #tabularfoundationmodels #priorlabs
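A minimal usage sketch, assuming TabPFN-3 keeps the scikit-learn-style interface of earlier `tabpfn` releases (the small dataset here is just a stand-in for the 1M-row claim):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier  # pip install tabpfn

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()   # pretrained tabular prior; no task-specific training
clf.fit(X_tr, y_tr)        # "fit" stores the data as in-context examples
preds = clf.predict(X_te)  # load and predict: no tuning
print(f"accuracy: {accuracy_score(y_te, preds):.3f}")
```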
Epsilon Guanlin Lee retweeted
Ziqian Zhong @fjzzq2002
I think automatic interp has enormous potential, but it is still immature in its current forms (and, for those who haven't played with such tools, it can give the false illusion that we have magically solved interp). Relevant criticism on AO: lesswrong.com/posts/LXQBcztr… arxiv.org/abs/2509.13316
Cas (Stephen Casper) @StephenLCasper

@AnthropicAI is publicizing its Natural Language Autoencoders work and reporting that they will incorporate it into their alignment evals. But after reading into the details of the technical report, this seems terrifying. Based on their own methodology and results, it seems like using NLAs on consequential tasks is a good way to get burned.

First, optimizing a natural language encoder and decoder jointly for reconstruction doesn't do anything to ensure that the intermediate text between them has the same meaning to the decoder as its meaning in English. The fact that they used a KL divergence penalty (see first screenshot) to make the English intermediates readable is strong evidence that NLAs are NOT good at faithfully representing the model's thoughts. They are literally putting optimization pressure on latent text for the sole purpose of oversight. Wasn't that a faux pas that Anthropic's own researchers have been warning us about in the past year? Won't using this type of method *actively select* for simplistic and confabulatory explanations?

Second, Anthropic gave a positive spin to a pretty damning result (see second screenshot). After finding that NLAs produced, in some cases, false but contextually-related information >=50% of the time, they still spun it as a positive, saying "However, most claims are at least somewhat related to the input context." But shouldn't this terrify them? Doesn't this suggest that NLAs, by default, should be expected to produce plausible yet misleading explanations? Bear in mind that this experiment was a toy one in which the ground truth could reasonably be inferred. Given this, shouldn't we expect NLAs to be even more unreliable in cases that matter when the ground truth isn't so simple?

It seems dishonest and safety-washy to me for Anthropic to structure its media strategy around cherry-picked demos seeming to illustrate successes while burying this result deep in the paper.
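The objective the critique describes plausibly has this shape (notation mine, not the report's): an encoder turns a hidden state into text, a decoder reconstructs the state from the text, and the KL term pressures the text to look like English regardless of whether it faithfully describes the state.

```latex
% E_\theta encodes hidden state h into text t; D_\phi decodes t back to h.
% The \lambda term is the readability pressure the critique objects to.
\mathcal{L}(\theta, \phi)
  = \underbrace{\mathbb{E}_{h,\; t \sim E_\theta(\cdot \mid h)}
      \big[ \lVert D_\phi(t) - h \rVert^2 \big]}_{\text{reconstruction}}
  \; + \; \lambda \,
    \underbrace{\mathrm{KL}\big( E_\theta(t \mid h) \,\Vert\, p_{\mathrm{LM}}(t) \big)}_{\text{keep intermediate text English-like}}
```

Under this reading, nothing in either term rewards t for *meaning* the same thing to a human reader as it does to the decoder, which is exactly the faithfulness gap the thread is pointing at.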

Epsilon Guanlin Lee retweeted
Sewon Min @sewon__min
As MoEs grow larger and sparser, they become memory-bottlenecked. What if experts were actually composable - so you only keep the subset relevant to your task? We show that this doesn't emerge in standard MoEs (their training makes this hard), but you can pre-train MoEs to support this kind of modularity! I hope everyone sees the figure on the right in @RyanYixiang's original post - I was so excited when I saw this result!!
Ryan Yixiang Wang @RyanYixiang

MoEs are everywhere in frontier models, and they are deployed as a monolith system. But many applications only need a narrow slice of capabilities, e.g., math, code, biomedical, etc. So what if "modularity" is actually the missing opportunity for MoEs? Today, we're releasing EMO: an end-to-end pretrained MoE where modularity emerges naturally, enabling selective use of experts!
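A toy sketch of the modularity idea (the random router, names, and thresholds are illustrative, not EMO's actual procedure): profile which experts a domain actually routes to, keep that slice, and mask the router to the kept subset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d = 16, 32
router = rng.normal(size=(d, n_experts))            # toy router weights

def route(x, active=None):
    """Top-2 routing, optionally restricted to a kept expert subset."""
    logits = x @ router
    if active is not None:                          # mask pruned experts
        mask = np.full(n_experts, -np.inf)
        mask[list(active)] = 0.0
        logits = logits + mask
    return np.argsort(logits)[-2:]

math_inputs = rng.normal(loc=1.0, size=(1000, d))   # stand-in "math" domain
counts = np.zeros(n_experts)
for x in math_inputs:                               # profile routing frequency
    counts[route(x)] += 1

active = set(np.where(counts / counts.sum() > 0.02)[0])  # keep frequent experts
print(f"kept {len(active)}/{n_experts} experts:", sorted(active))
print("routed within subset:", route(math_inputs[0], active))
```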

Epsilon Guanlin Lee retweeted
Anthropic @AnthropicAI
New Anthropic Fellows research: Model Spec Midtraining (MSM). Standard alignment methods train AIs on examples of desired behavior. But this can fail to generalize to new situations. MSM addresses this by first teaching AIs how we would like them to generalize and why.
Epsilon Guanlin Lee retweeted
Percy Liang @percyliang
I find myself repeatedly explaining the difference between open-weight (DeepSeek), open-source (Olmo), and open-development (Marin). Let's see if this restaurant analogy helps:
- Open-weight: food is made behind closed doors, server brings you the dish
- Open-source: food is made behind closed doors, server brings you the dish and the recipe
- Open-development: you see the chef make the dish in the kitchen (and can shout suggestions while it's cooking)!
Epsilon Guanlin Lee retweeted
Ricardo Olmedo @rdolmedo_
Claude 3 Opus scored 4% on SWE-bench at release. Shockingly, a Pythia-scale model trained **only on pre-1931 data**, with a bit of fine-tuning, outperforms the April 2024 SOTA. Clearly, Opus is the better model. Why should we care about benchmarks, then? 👇🧵
Ricardo Olmedo @rdolmedo_

We fine-tuned Alec Radford’s 1930 vintage LLM to solve SWE-bench issues. After just ‼️250‼️ training examples, the model solves its first issue, a simple patch to the xarray library. 🧵👇

Epsilon Guanlin Lee retweeted
Graham Neubig @gneubig
I tell GPT 5.5, you are a manager, not a coder. Find the issues to solve and delegate to other agents. Do not write any code yourself. It does so for a while. I think "good GPT" and log off, letting it run its long-running tasks with its team of subordinates. I log on an hour later and check in. GPT 5.5 is coding alone, its sub-agents diligently waiting for orders. No STOP, I say, you are a manager. You MUST NOT code. My bad, says GPT 5.5, got it, I must manage, not code. One hour later, GPT 5.5 is coding. But it's OK GPT, I get you. For I am also guilty. No matter how many times a coder is told they are a manager, in their heart of hearts, they are still a coder. So I tell Claude Opus 4.7...
Epsilon Guanlin Lee retweeted
Nick Jiang @nickhjiang
New work! What if we used sparse autoencoders to analyze data, not models—where SAE latents act as a large set of data labels 🏷️? We find that SAEs beat baselines on 4 data analysis tasks and uncover surprising, qualitative insights about models (e.g. Grok-4, OpenAI) from data.
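A toy sketch of the labeling idea (the random encoder and `latent_names` are stand-ins; in practice you would load a trained SAE and its auto-interp labels): encode each example's embedding and treat its strongest active latents as tags.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, n_docs = 64, 512, 5
W_enc = rng.normal(size=(d_model, d_sae))            # stand-in SAE encoder
b_enc = rng.normal(size=d_sae)
latent_names = {i: f"latent_{i}" for i in range(d_sae)}  # stand-in labels

def sae_labels(embedding, top_k=5):
    """Return the example's strongest active latents as (label, score) tags."""
    acts = np.maximum(embedding @ W_enc + b_enc, 0.0)    # ReLU encoder
    top = np.argsort(acts)[-top_k:][::-1]
    return [(latent_names[i], float(acts[i])) for i in top if acts[i] > 0]

doc_embeddings = rng.normal(size=(n_docs, d_model))      # stand-in embeddings
for i, emb in enumerate(doc_embeddings):
    print(f"doc {i}:", [name for name, _ in sae_labels(emb)])
```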