Tong Chen

179 posts

@tomchen0

PhD student @uwcse @uwnlp

Joined February 2023
583 Following · 835 Followers
Pinned Tweet
Tong Chen @tomchen0
OpenAI's blog (openai.com/index/why-lang…) points out that today’s language models hallucinate because training and evaluation reward guessing instead of admitting uncertainty. This raises a natural question: can we reduce hallucination without hurting utility?🤔 On-policy RL with our Binary Retrieval-Augmented Reward (RAR) can improve factuality (40% reduction in hallucination) while preserving model utility (win rate and accuracy) of fully trained, capable LMs like Qwen3-8B. [1/n]
27
122
672
113.2K
Tong Chen retweeted
Hongxun Wu @HongxunWu
🧵(1/8) An @OpenAI internal reasoning LLM achieved an AI Math milestone: solving an open problem central to its mathematical subfield— in this case, the unit distance problem of discrete geometry. We came across it in a side quest to truly push our model on the hardest problems.
26
132
953
136.4K
Tong Chen retweeted
Stella Li @StellaLisy
LMs can learn from human labels, training data, and stronger teachers. But what happens when all of these run out🫪 when the model is already at the frontier and there is no stronger external source to learn from❓ In EvoLM, we extract the model's own evaluative knowledge into rubrics, and use them to improve its own generation🔁 This enables self-improvement with no external signals‼️
6
45
230
34K
Tong Chen retweeted
Akari Asai @AkariAsai
2 papers accepted to ICML as Spotlights (top 2.2%)🥳
- DR Tulu: RL w/ evolving rubrics for SOTA long-form deep research arxiv.org/abs/2511.19399
- Binary RAR: RL w/ binary rewards for the hallucination–capability trade-off arxiv.org/abs/2510.17733
Congrats to all collaborators!
7
18
233
11.7K
Tong Chen retweeted
Joongwon Kim @danieljwkim
New work @AIatMeta: We enable test-time scaling for long-horizon coding agents by using better representations, selection and reuse of agentic trajectories, with Claude 4.5 Opus improving by +6.7% on SWE-Bench Verified and +12.1% on Terminal-Bench 2.0. 📄: arxiv.org/abs/2604.16529
9
42
358
278.6K
Tong Chen retweeted
Teng Xiao @TengX6
🚀 New work: Meta-Reinforcement Learning with Self-Reflection
LLM agents shouldn't just solve problems. They should learn from their own attempts.
Most current RL methods optimize single independent trajectories. Each attempt starts from scratch, with no mechanism to improve across attempts. But intelligent systems should get better after trying once.
This raises a fundamental question: How do we train models to learn from their own attempts?
We believe Meta-Reinforcement Learning may be a key paradigm for training future LLM agents, enabling models to adapt and improve across attempts and environments.
In this work we introduce MR-Search, a training paradigm built around:
🧠 In-Context Meta-Reinforcement Learning
🪞 Self-Reflection
🔁 Learning to learn at test time
📄 Paper: arxiv.org/abs/2603.11327
💻 Code: github.com/tengxiao1/MR-S…
11
49
297
50.9K
Tong Chen retweeted
Yike Wang @yikewang_
Small language models are not very helpful as judges. How about 🔄 backward inference: inferring the instruction given only the response, and using the similarity between the inferred and the original instructions as the reward signal? Introducing ⚙️FLIP, a reference-free and rubric-free reward modeling approach that boosts the RewardBench2 performance of 13 small language models by an average of 79.6%, and substantially outperforms LLM-as-a-Judge under test-time scaling via parallel sampling and GRPO training. 📄paper: arxiv.org/abs/2602.13551 🔗code: github.com/yikee/FLIP
12
53
251
28.2K
Tong Chen retweeted
Taiwei Shi @taiwei_shi
For decades, we’ve trained AI to chase rewards. But humans don’t just optimize outcomes. We experience, reflect, then learn. Can AI do the same? Introducing 𝐄𝐱𝐩𝐞𝐫𝐢𝐞𝐧𝐭𝐢𝐚𝐥 𝐑𝐞𝐢𝐧𝐟𝐨𝐫𝐜𝐞𝐦𝐞𝐧𝐭 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠, a step toward AI that truly learns from experience.
41
219
1.3K
223.4K
Tong Chen retweeted
Akari Asai @AkariAsai
Thrilled to share: OpenScholar, our work on scientific deep research agents for reliable literature synthesis, has been accepted to Nature! 🎉 Huge thanks to collaborators across institutions who made this possible!
33
226
1.3K
126.7K
Tong Chen retweeted
Jiacheng Liu @liujc1998
Calling on behalf of infini-gram: does anyone know where I can get / apply for AWS credits? 💸💸 Keeping infini-gram alive costs quite some money, mostly SSD rental. If you're a fan of keeping open LLM training data readily inspectable, please reply / DM me some pointers! 🧵1/4
3
16
28
3.6K
Tong Chen retweeted
CLS @ChengleiSi
Can LLMs automate frontier LLM research, like pre-training and post-training? In our new paper, LLMs found post-training methods that beat GRPO (69.4% vs 48.0%), and pre-training recipes faster than nanoGPT (19.7 minutes vs 35.9 minutes). 1/
11
141
588
109.6K
Tong Chen retweeted
Augmented Mind Podcast @augmind_fm
AI used to be a distant promise; now it permeates our lives. AI is getting better, but is it making us better? We are promised that AI will augment our minds, but how?
We--@EchoShao8899, @shannonzshen, and @michaelryan207--are excited to launch the Augmented Mind Podcast (The AM Podcast), a podcast about technical human-centered AI work. We'll share compelling research, infrastructure, and systems through monthly episodes, featuring interviews with the pioneering minds behind them.
We release EP0 today to share who we are, why we started this podcast, and what we're looking forward to.
0:00 - Prelude: the problems we care about
1:48 - Host introduction
2:03 - Why we started the AM Podcast
2:31 - Hot takes on human-centered AI
10:45 - Format of our podcast
11:28 - Unique technical challenges in human-centered AI
16:45 - Let the journey begin!
10
35
82
66.3K
Jiacheng Liu @liujc1998
Belated update: I defended my PhD last month! I am tremendously grateful to my advisors, @HannaHajishirzi and @YejinChoinka. Without their incredible support, I wouldn’t have had so much fun exploring bold ideas, like taking a journey into the ocean of LLM pretraining data. 🥰🥰
39
10
308
20.8K
Tong Chen retweeted
Liwei Jiang @liweijianglw
Super happy to receive the Best Paper Award at #NeurIPS2025 for our Artificial Hivemind paper!! (Really enjoyed giving an oral talk at NeurIPS as well!)
Liwei Jiang @liweijianglw

⚠️Different models. Same thoughts.⚠️
Today’s AI models converge into an 𝐀𝐫𝐭𝐢𝐟𝐢𝐜𝐢𝐚𝐥 𝐇𝐢𝐯𝐞𝐦𝐢𝐧𝐝 🐝, a striking case of mode collapse that persists even across heterogeneous ensembles.
Our #neurips2025 𝐃&𝐁 𝐎𝐫𝐚𝐥 𝐩𝐚𝐩𝐞𝐫 (✨𝐭𝐨𝐩 𝟎.𝟑𝟓%✨) dives deep into this phenomenon, introducing 𝐈𝐧𝐟𝐢𝐧𝐢𝐭𝐲-𝐂𝐡𝐚𝐭, a dataset of 26K real-world open-ended user queries spanning 17 open-ended categories + 31K dense human annotations (𝟐𝟓 𝐢𝐧𝐝𝐞𝐩𝐞𝐧𝐝𝐞𝐧𝐭 𝐚𝐧𝐧𝐨𝐭𝐚𝐭𝐨𝐫𝐬 𝐩𝐞𝐫 𝐞𝐱𝐚𝐦𝐩𝐥𝐞) to push AI’s creative and discovery potential forward. Now you can build your favorite models to be truly original, diverse, and impactful in the open-ended real world.
📍Paper: arxiv.org/abs/2510.22954
📍Data: huggingface.co/collections/li…
We also systematically reveal Artificial Hivemind across:
💥 𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐯𝐞 𝐚𝐛𝐢𝐥𝐢𝐭𝐢𝐞𝐬: not only do individual LLMs repeat themselves, but different models produce strikingly similar content, even when asked fully open-ended questions.
💥 𝐃𝐢𝐬𝐜𝐫𝐢𝐦𝐢𝐧𝐚𝐭𝐢𝐯𝐞 𝐚𝐛𝐢𝐥𝐢𝐭𝐢𝐞𝐬: LLMs, LM judges, and reward models are systematically miscalibrated when rating alternative responses to open-ended queries.
(1/N)

37
68
779
80.5K
Tong Chen retweeted
Rui Xin @rui_xin31
I'll be at #NeurIPS2025 until 12/7! I work on post-training and reward signals (Spurious Rewards), currently curious about bridging the gap between how humans and LLMs learn. Looking forward to connecting with new and old friends—also exploring summer 2025 internships. DMs open!
2
7
56
15.7K
Tong Chen @tomchen0
I will be at #NeurIPS2025 12.3–12.7. Looking forward to meeting old and new friends! ☕️🌮 Recently working on hallucination (Binary RAR) and verbatim memorization (ParaPO), issues that scaling up pretraining cannot simply fix. Also interested in making models learn more like humans: strong generalization, non-scalar rewards, episodic memory, and long-horizon abilities.
1
5
37
4.1K
Tong Chen retweeted
Yiping Wang @ypwang61
An 8B model can outperform AlphaEvolve on open optimization problems by scaling compute for inference or test-time RL🚀!
⭕Circle packing:
AlphaEvolve (Gemini-2.0-Flash/Pro): 2.63586276
Ours (DeepSeek-R1-0528-Qwen3-8B): 2.63598308
🔗in🧵 [1/n]
8
52
201
45.3K
Tong Chen @tomchen0
PhD applicants — Join Akari’s first cohort of students! Akari's research ranges from careful benchmarking to solid methodology. She always gives sharp feedback while being thoughtful and supportive. She stayed driven throughout her PhD and now brings that same energy to her new lab. I am grateful to learn from her and to work with her — please apply!
Akari Asai @AkariAsai

1/ Hiring PhD students at CMU SCS (LTI/MLD) for Fall 2026 (Deadline 12/10) 🎓 I work on open, reliable LMs: augmented LMs & agents (RAG, tool use, deep research), safety (hallucinations, copyright), and AI for science, code & multilinguality & open to bold new ideas! FAQ in 🧵

2
3
84
17.5K
Tong Chen retweeted
Akari Asai @AkariAsai
Exciting DR Tulu updates! 📈 DR Tulu-8B (new RL ckpt) sits on the performance–cost frontier, beating Tongyi DR-30B and matching OpenAI DR/Gemini 3 Pro+Search at a fraction of the cost. Now on arXiv. 🖥️ You can run an interactive CLI demo with open code, almost for free. 1/🧵
Ai2 @allen_ai

Today we’re releasing Deep Research Tulu (DR Tulu)—the first fully open, end-to-end recipe for long-form deep research, plus an 8B agent you can use right away. Train agents that plan, search, synthesize, & cite across sources, making expert research more accessible. 🧭📚

4
29
153
50.5K