Yoonsang Lee

109 posts

@yoonsang_

CS PhD @princeton_nlp @princetonPLI; prev @SeoulNatlUni

Joined January 2023
721 Following · 393 Followers
Pinned Tweet
Yoonsang Lee (@yoonsang_)
How should we effectively aggregate long-horizon agent trajectories? 🧐 Unlike CoT reasoning, agentic tasks pose unique challenges: they are long, multi-turn, and tool-augmented. Introducing 👉🏻 AggAgent 👈🏻 — which treats parallel trajectories as an environment to interact with.
[image]
3 replies · 39 reposts · 242 likes · 19.9K views
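The post doesn't spell out the tool interface, but the core idea can be sketched as an aggregator agent that queries N finished rollouts on demand instead of reading them all in-context. A minimal sketch; names like TrajectoryEnv and view_step are illustrative assumptions, not the paper's actual API:

```python
# Hypothetical sketch of "parallel trajectories as an environment":
# the aggregator LLM queries rollouts via tools rather than ingesting
# every long, multi-turn trajectory at once.
from collections import Counter

class TrajectoryEnv:
    """Exposes N finished rollouts as an environment the aggregator can query."""

    def __init__(self, trajectories):
        # Each trajectory: list of (action, observation) pairs; last obs = final answer.
        self.trajectories = trajectories

    def final_answers(self):
        return [traj[-1][1] for traj in self.trajectories]

    def view_step(self, traj_id, step_id):
        """Inspect a single step of a single trajectory on demand."""
        return self.trajectories[traj_id][step_id]

def aggregate(agent_llm, env):
    answers = env.final_answers()
    counts = Counter(answers)
    if len(counts) == 1:  # unanimous: nothing to reconcile
        return answers[0]
    # Otherwise the aggregator agent drills into disagreeing rollouts
    # via env.view_step(...) before committing to a final answer.
    prompt = f"Parallel runs returned {dict(counts)}. Verify and pick one."
    return agent_llm(prompt)
```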
Yoonsang Lee reposted
Yinghui He
Yinghui He@yinghui_he_·
RLVR gives sparse supervision; On-Policy Self-Distillation often requires high-quality demonstrations. Our new method, ✨SD-Zero✨, gets the best of both worlds: we use the model's self-revision to turn binary rewards into dense token-level supervision. No external teacher. No curated demonstrations.

🚨 Introducing Self-Distillation Zero (SD-Zero), which trains one model to play two roles: (1) a "Generator" that makes attempts, and (2) a "Reviser" that conditions on the generator's failed/successful attempt + binary reward to produce a better answer. ‼️Even WRONG attempts can become the training signal.‼️

🔗 Paper: arxiv.org/abs/2604.12002

🏆 SD-Zero brings 10%+ improvement over base models (Qwen3-4B, Olmo3-7B) on math & code reasoning, beating GRPO and vanilla On-Policy Self-Distillation under the same training budget. SD-Zero also enables iterative self-evolution.
[images]
7 replies · 32 reposts · 176 likes · 25.8K views
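A minimal sketch of one SD-Zero step as described in the post; the verifier, the generate/token_cross_entropy methods, and the revision prompt are assumptions, not the paper's exact objective:

```python
# Illustrative SD-Zero loop: one model, two roles, binary reward
# converted into dense token-level targets via self-revision.

def sd_zero_step(model, problem, verifier):
    # Role 1: the Generator makes an attempt.
    attempt = model.generate(problem)
    reward = verifier(problem, attempt)  # binary reward: 1 correct, 0 wrong

    # Role 2: the same model, as Reviser, conditions on the attempt and
    # its binary reward to produce a better answer. A WRONG attempt is
    # still useful context here.
    revision = model.generate(
        f"{problem}\nPrevious attempt (reward={reward}):\n{attempt}\nRevised answer:"
    )

    # Dense supervision: train the Generator on the revision token-by-token,
    # turning the sparse binary reward into token-level targets.
    return model.token_cross_entropy(prompt=problem, target=revision)
```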
Yoonsang Lee (@yoonsang_)
Hi Keivan, thanks for sharing your work!
1. While we haven't experimented with this in the paper, we believe the method could be naturally extended to other agentic tasks such as SWE and web navigation. Long-context or long-horizon reasoning tasks are trickier, since we cannot exploit the structure of an agentic trajectory when designing the tools. One could explore a more careful design of how AggAgent should traverse the context, or use a scaffold like RLM for parallel rollouts.
2. One potential reason could be that agentic search and deep research differ from long-context QA. This also depends heavily on how consistent and well-calibrated the base models are. Our findings align with prior work (Figure 4 in arxiv.org/abs/2504.12516, Table 3 in arxiv.org/abs/2602.02486).
0 replies · 0 reposts · 1 like · 40 views
Keivan Alizadeh (@KeivanAlizadeh2)
@yoonsang_ This is very interesting and related to our work: arxiv.org/abs/2603.15653. Some questions:
- Is it possible to extend your aggregation method to regular tasks?
- In the heuristic aggregations, how come Best-of-N is better than Majority Voting? Our observation was the opposite.
1 reply · 0 reposts · 0 likes · 73 views
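For context on the question above, the two heuristic baselines differ only in the selection rule, and which one wins can hinge on how well the scoring signal is calibrated. A minimal sketch, with the scorer left abstract (whatever verifier or judge the setup provides):

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most common final answer across N rollouts."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers, scores):
    """Pick the answer from the highest-scoring rollout."""
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return answers[best_idx]

# With skewed scores the two heuristics can disagree:
print(majority_vote(["A", "A", "B"]))               # -> "A"
print(best_of_n(["A", "A", "B"], [0.2, 0.3, 0.9]))  # -> "B"
```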
Yoonsang Lee (@yoonsang_)
@lihanc02 @18jeffreyma Hi Hanchen, thanks for sharing your work. And yes: prompt optimization, parallel aggregation, sequential refinement, and harness engineering could all be applied together at test time!
0 replies · 0 reposts · 2 likes · 55 views
Hanchen Li @ ICLR (@lihanc02)
@yoonsang_ @18jeffreyma Very interesting work! We did something similar for Combee, so I am sharing it here: arxiv.org/abs/2604.04247. Combee trains a system prompt based on mass agent trajectories, but it does not handle online aggregation. Maybe these two can be used together?
2 replies · 0 reposts · 13 likes · 469 views
Yoonsang Lee reposted
Junlin Wang (@JunlinWang3)
Nicely done. The Mixture-of-Agents type of approach does work. And most importantly, it works better than majority vote, which is one of the final bosses of test-time scaling. We found something similar in arxiv.org/abs/2406.04692!
Quoting Yoonsang Lee (@yoonsang_): "How should we effectively aggregate long-horizon agent trajectories? …" (the pinned AggAgent post above)
0 replies · 3 reposts · 20 likes · 3.9K views
Victor Wang (@victorwang37)
Excited to share that I'm starting my PhD this fall at @UTAustin, supported by an NSF GRFP and working with @EliasEskin on LLM confidence calibration and developing trustworthy and robust models. I'll be at #ICLR2026 presenting DINCO, an inference-time calibration method!
Quoting Victor Wang (@victorwang37):
🚨 Announcing a new LLM calibration method, DINCO, which enforces confidence coherence (that probs must sum to 1) by having the LLM verbalize its confidence independently on self-generated distractors and normalizing by the total confidence. Major gains on long- and short-form QA!
5 replies · 2 reposts · 14 likes · 1.9K views
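A minimal sketch of the DINCO idea from the quoted post; the prompts and the llm callable here are assumptions, not the paper's implementation:

```python
# Sketch: verbalize confidence independently per candidate, then
# normalize so candidate confidences sum to 1 (confidence coherence).

def dinco_confidence(llm, question, answer, n_distractors=4):
    # 1. Self-generate plausible alternative answers (distractors).
    distractors = [
        llm(f"Q: {question}\nPropose a plausible answer different from '{answer}'.")
        for _ in range(n_distractors)
    ]

    # 2. Verbalize confidence independently for each candidate.
    def confidence(candidate):
        reply = llm(f"Q: {question}\nA: {candidate}\nConfidence 0-100?")
        return float(reply) / 100.0  # assumes the model replies with a number

    c_answer = confidence(answer)
    total = c_answer + sum(confidence(d) for d in distractors)

    # 3. Coherence: normalize by the total confidence mass.
    return c_answer / total if total > 0 else 0.0
```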
Yoonsang Lee (@yoonsang_)
AggAgent also achieves a Pareto-optimal cost-performance trade-off ✨ Together, our findings establish agentic aggregation as an effective and cost-efficient approach to parallel test-time scaling.
[image]
1 reply · 0 reposts · 3 likes · 692 views
Yoonsang Lee reposted
Eunsol Choi (@eunsolc)
Do LLMs suffer from human-like cognitive biases? 🤔 Check out @arhjhaveri's new paper on how models navigate hypothesis spaces. We found that confirmation bias degrades LLM performance, and we explore strategies to mitigate it.
Quoting Ayush Jhaveri (@arhjhaveri):
Your AI Agent just formed a hypothesis. 💭 How does it validate it? Not by trying to prove itself wrong. Rather, it selectively seeks evidence that confirms what it already believes, often ending up with the wrong answer! Confirmation bias isn’t just human. We measure it in LLMs, and we show how to fix it! 🧵
0 replies · 7 reposts · 19 likes · 3.1K views
Yoonsang Lee reposted
Seungju Han (@SeungjuHan3)
can synthetic training beat RAG in data-constrained domains? we suggest a simple recipe for better synthetic training:
- Synth Mixed Training: train on both synth QAs and synth docs
- Focal Rewriting: rewrite docs with targeted topic prompts
results:
- beats RAG by +2.6% on QuALITY
- improves to +4.4% with Focal Rewriting
- reaches +6.7% when combined with RAG
Paper: arxiv.org/abs/2603.23562
[image]
2 replies · 16 reposts · 67 likes · 11.7K views
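A rough sketch of the recipe as described in the post; the prompt wording and llm callable are assumptions:

```python
# Focal Rewriting + Synth Mixed Training, as outlined above.

def focal_rewrite(llm, doc, topics):
    """Focal Rewriting: one rewrite of the source doc per targeted topic."""
    return [llm(f"Rewrite this document, focusing on '{t}':\n\n{doc}") for t in topics]

def build_synth_mix(llm, docs, topics):
    """Synth Mixed Training data: synthetic docs AND synthetic QAs."""
    synth_docs, synth_qas = [], []
    for doc in docs:
        for rewrite in focal_rewrite(llm, doc, topics):
            synth_docs.append(rewrite)
            synth_qas.append(llm(f"Write one QA pair grounded in:\n\n{rewrite}"))
    return synth_docs + synth_qas  # train on both, per the recipe
```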
Yoonsang Lee reposted
Shankar Padmanabhan (@shankarpad8)
1/5 How do we update a model trained in 2025 with new world knowledge from 2026? ⚠️Continued training will undo skills learned by LLMs during post-training, e.g. instruction-following/math/code. 🤝Our method DiSC updates LLMs with new knowledge while preserving existing skills!
1 reply · 16 reposts · 61 likes · 10K views
Yoonsang Lee reposted
Manya Wadhwa (@ManyaWadhwa1)
⚛️ Introducing CREATE, a benchmark for creative associative reasoning in LLMs. Making novel, meaningful connections is key for scientific & creative works. We objectively measure how well LLMs can do this. 🧵👇
[image]
4 replies · 43 reposts · 143 likes · 21.2K views
Yoonsang Lee reposted
Omar Shaikh (@oshaikh13)
What’s the point of a “helpful assistant” if you have to always tell it what to do next? In a new paper, we introduce a reasoning model that predicts what you’ll do next over long contexts (LongNAP 💤). We trained it on 1,800 hours of computer use from 20 users. 🧵
16 replies · 80 reposts · 293 likes · 102.2K views
Yoonsang Lee reposted
Xi Ye (@xiye_nlp)
We propose a new decoding algorithm, DySCO🪩 (Dynamic Attention Scaling), directly improving long-context reasoning without training. At each decoding step, we dynamically identify and upweight attention to important context for the next token. 📈20% gains on multiple tasks.
[GIF]
3 replies · 22 reposts · 82 likes · 7.4K views
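A minimal sketch in the spirit of decode-time attention upweighting; the additive boost and the importance signal are assumptions, not DySCO's actual algorithm:

```python
# Upweight attention to context positions judged important for the
# next token, applied per decoding step with no training.
import torch

def upweighted_attention(q, k, v, importance, alpha=1.0):
    """q: (heads, 1, d) current-step query; k, v: (heads, ctx, d);
    importance: (ctx,) scores in [0, 1] marking relevant context."""
    d = q.shape[-1]
    logits = q @ k.transpose(-1, -2) / d ** 0.5  # (heads, 1, ctx)
    logits = logits + alpha * importance         # boost important positions pre-softmax
    weights = torch.softmax(logits, dim=-1)
    return weights @ v                           # (heads, 1, d)
```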
Yoonsang Lee reposted
Fangyuan Xu (@brunchavecmoi)
A lot of useful training data can't be shared due to privacy. How do we create synthetic training data without data owners ever sharing their content? 🚀 Introducing 𝐃𝐏-𝐑𝐅𝐓: using RL to train LLMs to generate high-fidelity domain data without seeing a single private sample.
[image]
5 replies · 30 reposts · 130 likes · 10.9K views
Yong Lin (@Yong18850571)
[Life update] I’ve officially left @PrincetonPLI and joined Thinking Machines Lab @thinkymachines. It feels like the right time to look back on my journey at Princeton: one and a half years that were truly transformative.

During this period, I made many friends, learned tremendously, and co-founded and co-led the Goedel Project. It was one of the most rewarding experiences of my life: a small, close-knit team of about ten people working with a clear purpose, moving fast, and ultimately building something impactful.

In mid-July, we released Goedel-Prover-V2 (32B), a model that significantly outperformed the previous state-of-the-art DeepSeek-Prover-V2-671B on formal mathematical reasoning, using nearly 20× fewer parameters and dramatically less compute. Even now, four months after release, it still sits at the top of the open-source leaderboard. What makes this achievement especially meaningful is that we accomplished it entirely with academic resources. Competing against large industrial labs and still coming out ahead felt almost unreal. Seeing so many research teams now building on top of Goedel-Prover-V2 is deeply gratifying: it’s proof that open, academic AI can still make a real impact.

Equally fulfilling was the journey itself. Unlike industrial teams with access to large-scale, off-the-shelf RL infrastructure, we, a group of students and researchers from academia with zero prior experience in massive model training, had to build almost everything from scratch. We learned quickly, identified problems as they emerged, and fixed them with remarkable speed. Designing, scaling, and successfully training a 32B-parameter model within just three months remains one of the things I’m most proud of.

The Goedel Project began in October 2024. At that time, we had no serious experience training models that could compete with the best labs. DeepSeek-Prover-V1 and V1.5 looked unbeatable: they had started a year earlier and already set an incredibly high bar. We experimented with many ideas, including agentic pipelines and divide-and-conquer methods, most of which turned out to be too costly or impractical given our limited resources. Eventually, we discovered a simple yet powerful iterative-training approach that allowed us to scale efficiently within our compute limits. Bit by bit, we caught up with DeepSeek-Prover-V1.5, and then surpassed it.

Princeton winters are brutally cold. It was the first time I’d ever seen snow last for weeks. I spent the entire winter break at home, running experiments, analyzing results, and adjusting training methods and data again and again. That persistence paid off: in February, we released Goedel-Prover-V1-7B, which captured the top spot on the leaderboard. It was our first major milestone, proof that an academic team could compete with frontier models.

Our celebration was short-lived. In April, Kimi-Prover-72B and DeepSeek-Prover-V2-671B both arrived and completely outperformed us. It was a tough moment. We couldn’t even host DeepSeek-Prover-V2-671B for inference locally; communication errors kept crashing our limited infrastructure. None of us had experience deploying or training models of that scale. Still, we decided to aim higher: to beat them in the next version.

We began by identifying the bottlenecks in DeepSeek’s and Kimi’s provers, exploring every possible angle for improvement. We experimented with compiler-based feedback loops, curriculum data synthesis, self-improvement strategies, model distillation, and model merging to improve diversity during RL.

But the most critical insight was about efficiency: optimizing how we allocated limited resources across training design, data generation, and scaling. Every GPU hour had to count. After two months of exploration and countless small-scale tests, we finally established a systematic framework for the next release: Goedel-Prover-V2.

I led "The Big Run", a nearly month-long sequence of two self-improvement fine-tuning cycles followed by one large-scale RL round. We completed training just a few days before our scheduled release, leaving barely enough time for evaluation. Those last nights were intense, running tests, fixing scripts, collecting metrics, but everything came together perfectly. When we saw the final results, we could hardly believe them: Goedel-Prover-V2 solved twice as many problems on PutnamBench as DeepSeek-Prover-V2-671B.

Many people have since asked what the “key” was: how an academic team managed to outperform frontier labs using a fraction of their resources. There isn’t a single magic trick, but rather a combination of principles that guided us:
* build solid infrastructure early
* focus on real bottlenecks instead of chasing novelty
* investigate broadly with small-scale experiments
* fix problems in real time
* optimize resource allocation carefully
* execute the final big run with precision

Each of these steps sounds simple, but together they made all the difference.

Now, at Thinking Machines Lab, I’m shifting focus beyond formal reasoning toward building general-purpose models. I’m deeply inspired by TML’s mission: developing interactive AI systems and advancing open science. I’m thrilled to begin this new chapter and look forward to sharing more in the future.
18 replies · 18 reposts · 646 likes · 78.2K views