Che-Ping Tsai

79 posts

@chepingt
PhD @mldcmu, interpretability and representation learning, machine learning theories.

Joined November 2016
650 Following · 155 Followers
Che-Ping Tsai retweeted
Christina Baek @_christinabaek
Models are typically specialized to new domains by finetuning on small, high-quality datasets. We find that repeating the same dataset 10–50× starting from pretraining leads to substantially better downstream performance, in some cases outperforming larger models. 🧵
19 replies · 80 reposts · 617 likes · 93.4K views
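For concreteness, a minimal sketch of what the repetition recipe can look like mechanically (my illustration in PyTorch, not the paper's actual pipeline): repeat the small finetuning set many times inside the training data instead of passing over it once.

import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Hypothetical toy finetuning set (the real work uses text corpora).
finetune = TensorDataset(torch.randn(128, 16), torch.randint(0, 2, (128,)))
# Repeat it 30x, in the 10-50x range the thread reports, rather than
# training on a single copy.
repeated = ConcatDataset([finetune] * 30)
loader = DataLoader(repeated, batch_size=32, shuffle=True)
print(len(finetune), len(repeated))  # 128 -> 3840 examples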
Che-Ping Tsai retweeted
Dylan Sam @dylanjsam
I defended my PhD thesis! Also, a very late (~4-month) life update: I've joined @OpenAI to work on safety research and pretraining safer language models! 📈 Thank you to my advisor @zicokolter and my committee: Matt Fredrikson, @andrew_ilyas, and @furongh! 🙏
24 replies · 9 reposts · 221 likes · 21.5K views
Che-Ping Tsai retweeted
Amrith Setlur @setlur_amrith
I'll admit, going in I was not 100% sure this was possible: we trained a tiny 4B model (QED-Nano) to prove math theorems at the Olympiad level! Today, we release the full recipe, from the data curation done for SFT to our RL algorithm that explicitly optimizes for test-time scaling over millions of tokens (i.e., we train QED-Nano to continually improve as we apply modern-day test-time scaffolds like DeepSeekMath-agent on top of it). 🧵⬇️
7 replies · 26 reposts · 149 likes · 23.6K views
Che-Ping Tsai retweeted
Yuda Song @yus167
RL on LLMs compresses all feedback into one scalar per rollout, which is inefficient. But users regularly give much richer feedback: "make it formal," "step 3 is wrong." Can we train LLMs on this human-AI interaction? We introduce RL from Text Feedback, with 1) Self-Distillation and 2) Feedback Modeling (1/n) 🧵
14 replies · 101 reposts · 601 likes · 106.6K views
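One plausible reading of the "Self-Distillation" component (an assumption on my part, not the paper's stated recipe): let the model revise its own draft under the text feedback, then train on the prompt-to-revision pair so the fix is internalized without the feedback in context. A Python sketch with a hypothetical llm_generate stand-in:

def llm_generate(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM call here")

def self_distill_pair(prompt: str, draft: str, feedback: str) -> dict:
    # Condition on the draft plus the user's text feedback to get a revision.
    revision = llm_generate(
        f"Prompt: {prompt}\nDraft: {draft}\nFeedback: {feedback}\n"
        "Rewrite the draft so it addresses the feedback:"
    )
    # Keep (prompt -> revision) as a supervised pair for finetuning.
    return {"input": prompt, "target": revision}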
Che-Ping Tsai retweeted
Amrith Setlur @setlur_amrith
We run online RL on a mixture of problems: some are easy to explore (high pass rate), and some are very hard (you need to sample a lot before seeing any positive). RL on such a mixture can produce a "rich-gets-richer" effect: it over-sharpens on the easy problems at the cost of plateauing on the harder ones, making a correct trace on those even less likely to be sampled. The RL literature calls this "ray interference". In our recent work POPE, we show that using privileged information to guide exploration on hard problems can tackle ray interference! 🧵⬇️
10 replies · 38 reposts · 306 likes · 16.3K views
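The plateau has simple arithmetic behind it: under outcome-reward RL, a problem contributes essentially no gradient until at least one rollout is correct, and that probability decays fast with the pass rate. A quick check in Python:

# P(at least one positive among k rollouts) = 1 - (1 - p)^k.
k = 8
for p in [0.5, 1e-2, 1e-4]:  # per-rollout pass rate
    print(f"p={p:g}: P(>=1 positive in {k} rollouts) = {1 - (1 - p) ** k:.4f}")
    # 0.9961, 0.0773, 0.0008: the hardest problems almost never emit signal.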
Che-Ping Tsai retweeted
Amrith Setlur @setlur_amrith
RL training of LLMs spends tons of compute on sampling rollouts 🤖💸 But most runs are YOLO 🤟, telling us little about how to scale sampling compute optimally. Given a fixed sampling compute budget, how should we allocate it across:
• sequential iterations ⏩
• parallel rollouts 🎲
Answers to this with scaling laws 📈 and more in our new blog post ⬇️
7 replies · 30 reposts · 201 likes · 12.9K views
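A toy model of the trade-off (my assumption for illustration, not the blog's fitted scaling law): suppose each sequential iteration multiplies the per-rollout success rate while parallel rollouts take independent shots, under a fixed budget B = iterations × rollouts.

def p_solve(iters: int, rollouts: int, p0: float = 0.01, gain: float = 2.0) -> float:
    # Hypothetical sequential self-improvement of the per-rollout rate.
    p = min(1.0, p0 * gain ** (iters - 1))
    # Best-of-n over the parallel shots.
    return 1 - (1 - p) ** rollouts

B = 64  # fixed sampling budget, in rollouts
for s in (1, 2, 4, 8, 16):
    print(f"{s:2d} iters x {B // s:2d} rollouts -> P(solve) = {p_solve(s, B // s):.3f}")

Which split wins depends entirely on the assumed per-iteration gain; pinning that down empirically is exactly what the blog post's scaling laws are for.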
Che-Ping Tsai retweeted
Chen Wu @ChenHenryWu
1/⚠️ Parallel test-time scaling (e.g., pass@k) usually wastes compute: models often repeat the same dominant failure ❌ How can we generate genuinely diverse solutions? Typical fixes such as raising the temperature 🌡️ usually fail. We propose Mode‑Conditioning (ModC), a simple yet powerful training- and test-time framework that allocates compute across diverse reasoning modes 🎨 ModC substantially improves pass@k across SFT, distillation, and RL settings, yielding 4-8x efficiency gains in math reasoning with the same training data!
13 replies · 30 reposts · 133 likes · 24.5K views
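For reference, the pass@k metric the thread targets is normally computed with the unbiased estimator from the HumanEval/Codex paper (Chen et al., 2021); this is standard machinery, not ModC itself:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k samples drawn (without
    # replacement) from n generations, c of which are correct, passes:
    # 1 - C(n-c, k) / C(n, k), in numerically stable product form.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=100, c=5, k=10))  # ~0.27 with 5 correct out of 100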
Che-Ping Tsai retweeted
I-Hung Hsu @IHung_Hsu
🧠🚀 Excited to introduce Supervised Reinforcement Learning—a framework that leverages expert trajectories to teach small LMs how to reason through hard problems without losing their minds. 🤯 Better than SFT && RLVR. Read more: huggingface.co/papers/2510.25… #llms #RL #reasoning
12 replies · 64 reposts · 336 likes · 20.5K views
Che-Ping Tsai retweeted
Yuda Song @yus167
🤖 Robots rarely see the true world's state—they operate on partial, noisy visual observations. How should we design algorithms under this partial observability? Should we decide (end-to-end RL) or distill (from a privileged expert)? We study this trade-off in locomotion. 🧵(1/n)
2 replies · 40 reposts · 140 likes · 30.1K views
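A hedged sketch of the "distill" arm, i.e. the standard teacher-student recipe the thread contrasts with end-to-end RL (shapes and layer sizes here are illustrative, not the paper's setup):

import torch
import torch.nn as nn

# Teacher sees the privileged full state; student sees only partial observations.
teacher = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 12))
student = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 12))

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
state = torch.randn(256, 64)  # privileged state (hypothetical batch)
obs = torch.randn(256, 32)    # noisy partial observation of the same steps
# The student imitates the (frozen) expert's actions from observations alone.
loss = ((student(obs) - teacher(state).detach()) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()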
Che-Ping Tsai retweeted
Emily Byun @yewonbyun_
💡Can we trust synthetic data for statistical inference? We show that synthetic data (e.g., LLM simulations) can significantly improve the performance of inference tasks. The key intuition lies in the interaction between the moments of the synthetic data and those of the real data.
2 replies · 36 reposts · 143 likes · 31K views
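A toy version of that moment interaction (my illustration, not the paper's estimator): shrink the scarce real-data mean toward the plentiful synthetic mean, with a weight driven by plug-in estimates of the synthetic bias and both sampling variances.

import numpy as np

def combine_means(real: np.ndarray, synth: np.ndarray) -> float:
    n, N = len(real), len(synth)
    mu_r, mu_s = real.mean(), synth.mean()
    v_r, v_s = real.var(ddof=1) / n, synth.var(ddof=1) / N
    bias2 = (mu_s - mu_r) ** 2        # crude bias-squared estimate
    lam = v_r / (v_r + v_s + bias2)   # minimizes the plug-in MSE
    return lam * mu_s + (1 - lam) * mu_r

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=50)      # scarce real data
synth = rng.normal(0.1, 1.0, size=5000)   # plentiful but slightly biased
print(combine_means(real, synth))

When the synthetic moments agree with the real ones, lam grows and the estimator leans on the cheap synthetic sample; when they disagree, it falls back toward the real data.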
Che-Ping Tsai retweeted
Nicholas Boffi @nmboffi
Consistency models, CTMs, shortcut models, align your flow, mean flow... What's the connection, and how should you learn them in practice? We show they're all different sides of the same coin connected by one central object: the flow map. arxiv.org/abs/2505.18825 🧵(1/n)
5 replies · 75 reposts · 386 likes · 65.3K views
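As I read the abstract, the central object is the two-time flow map of the probability-flow ODE; a sketch of the definition (notation mine):

% For a velocity field v_t, the flow map X_{s,t} transports a point
% from time s to time t along the ODE:
\[
  \partial_t X_{s,t}(x) = v_t\bigl(X_{s,t}(x)\bigr), \qquad X_{s,s}(x) = x,
\]
% and it satisfies the semigroup (consistency) property
\[
  X_{u,t} \circ X_{s,u} = X_{s,t}.
\]

Consistency models, shortcut models, and mean flows can then be read as different parameterizations of, and training objectives for, this one map.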
Che-Ping Tsai retweeted
Jeremy Cohen @deepcohen
Even with full-batch gradients, DL optimizers defy classical optimization theory, as they operate at the *edge of stability.* With @alex_damian_, we introduce "central flows": a theoretical tool to analyze these dynamics that makes accurate quantitative predictions on real NNs.
18 replies · 212 reposts · 1.3K likes · 234.2K views
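For context (standard background, not specific to this paper): classical analysis says gradient descent with step size \eta is stable on a quadratic only while the sharpness stays below 2/\eta,

\[
  w_{t+1} = w_t - \eta\, \nabla L(w_t), \qquad
  \lambda_{\max}\!\bigl(\nabla^2 L(w)\bigr) < \tfrac{2}{\eta},
\]

while the edge-of-stability observation is that full-batch training drives the sharpness up to roughly 2/\eta and then hovers there rather than diverging; central flows are a continuous-time model of those hovering dynamics.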
Che-Ping Tsai retweeted
Junhong Shen @JunhongShen1
Excited to share two papers accepted to #NeurIPS2025!
1️⃣ Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
We introduce TTI, an RL algorithm that scales the number of interaction steps beyond thinking tokens per step. Our agents learn to act longer ➡️ richer exploration ➡️ better success.
Paper: arxiv.org/abs/2506.07976
2️⃣ Content-Adaptive Tokenizer (CAT)
We develop an image tokenizer that adapts token count to image complexity, offering flexible 8x, 16x, or 32x compression! Importantly, we use just captions (no pixels!) to guide tokenization, enabling adaptive representation for text-to-image generation.
Paper: arxiv.org/abs/2501.03120
Look forward to seeing everyone in SD!
10 replies · 19 reposts · 301 likes · 24.2K views
Che-Ping Tsai retweeted
Dylan Sam @dylanjsam
🚨Excited to introduce a major development in building safer language models: Safety Pretraining! Instead of post-hoc alignment, we take a step back and embed safety directly into pretraining. 🧵(1/n)
8 replies · 90 reposts · 357 likes · 62.5K views
Che-Ping Tsai retweeted
Yuda Song @yus167
LLMs lose diversity after RL post-training, and this hurts test-time scaling & creativity. Why does this collapse happen, and how can we fix it? Our new work introduces: 🔍 RL as Sampling (analysis) 🗺️ Outcome-based Exploration (intervention) [1/n]
9 replies · 88 reposts · 467 likes · 39.9K views
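A toy version of outcome-level exploration (my illustration of the general idea, not the paper's algorithm): give a count-based bonus on final answers within a batch, so rollouts landing on rare outcomes get rewarded and the policy is pushed away from collapsing onto one dominant answer.

from collections import Counter

def outcome_bonus(outcomes: list[str], beta: float = 0.1) -> list[float]:
    # Rarer final answers within the batch receive a larger bonus.
    counts = Counter(outcomes)
    return [beta / counts[o] ** 0.5 for o in outcomes]

print(outcome_bonus(["42", "42", "42", "7"]))  # the rare answer "7" gets the largest bonus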
Che-Ping Tsai retweeted
Wen-Tse Chen @WenzeChen2
[0/3] 🚀 Introducing Verlog, an open-source RL framework built specifically for training long-horizon, multi-turn LLM agents.
📊 Max episode length comparison:
• VeRL / RAGEN → ~10 turns
• verl-agent → ~50 turns
• Verlog (ours) → 400+ turns 🔥
⚙️ Technical foundation:
• Built on top of VeRL
• Tested on the BALROG benchmark (BabyAI, BabaIsAI, Crafter)
• Follows design principles from pytorch-a2c-ppo-acktr-gail
💡 Why Verlog?
• For researchers: skip the heavy engineering. We give you a strong, validated baseline for long-horizon, multi-turn LLM agents across diverse tasks.
• For developers: train on your own long-horizon environments with minimal setup.
• Algorithmic edge: with a well-trained value function as an intermediate supervision signal, rollouts can be truncated at any point and still be used for learning (see the sketch after this thread's stats). This reduces GPU idle time and boosts training efficiency. It is a genuine advantage of PPO over the GRPO family, widely recognized and leveraged in classic RL, yet often overlooked in LLM agent frameworks.
Key features 🧵👇
2 replies · 70 reposts · 398 likes · 36.2K views
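The truncation point is worth unpacking: with a critic, a rollout cut mid-episode can be bootstrapped with V(s_T) and still yield valid advantage targets. A generic GAE sketch of that property (illustrative, not Verlog's actual code):

import torch

def gae_truncated(rewards, values, last_value, gamma=0.99, lam=0.95):
    # rewards, values: tensors of shape [T]; last_value: critic estimate
    # V(s_T) at the truncation point. Bootstrapping from last_value is
    # what lets a truncated rollout still provide learning signal.
    T = rewards.shape[0]
    adv = torch.zeros(T)
    gae, next_v = 0.0, last_value
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_v - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
        next_v = values[t]
    return adv, adv + values  # advantages and value targets

GRPO-style group baselines have no critic to bootstrap from, which is why they need complete rollouts; that is the PPO advantage the thread points to.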
Che-Ping Tsai retweeted
Lili @lchen915
Self-Questioning Language Models: LLMs that learn to generate their own questions and answers via asymmetric self-play RL. There is no external training data – the only input is a single prompt specifying the topic.
26 replies · 181 reposts · 1.1K likes · 145.8K views
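A skeleton of the asymmetric self-play loop as I read the tweet (all names hypothetical; the actual reward design is the paper's and is not reproduced here): the same LLM alternates between proposing questions on the topic and solving them, with both roles trained by RL and no external data.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model call")

def self_play_round(topic: str) -> dict:
    question = llm(f"Pose a challenging problem about {topic}.")
    answers = [llm(f"Solve: {question}") for _ in range(4)]
    # Proposer and solver rewards (e.g., self-consistency of the answers,
    # question difficulty) would be computed here; those details are the
    # paper's and are omitted from this sketch.
    return {"question": question, "answers": answers}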