Steffi Chern
@steffichern
CS PhD @Penn | @NSF Graduate Fellow | B.S. @CarnegieMellon 🤠
Philadelphia, Pennsylvania · Joined August 2020
413 Following · 218 Followers
93 posts
Steffi Chern retweeted
Yi wei Qin @QinYi88814
Should data "evolve"? 🧬 Scaling is not enough. Model performance is bounded by data, but data's value is defined by processing depth. We introduce Data Darwinism, a 10-level hierarchy (L0-L9) redefining data as an eternal co-evolutionary process. (1/n) huggingface.co/papers/2602.07…
2 replies · 10 reposts · 19 likes · 8.3K views
Ayoung Lee @o_cube01
Accepted to ICLR 2026!🎉 So grateful to my amazing collaborators 🫶 We introduce CLASH to evaluate value reasoning, revealing new failure modes in reasoning models and intriguing steerability results! 📰 Paper: arxiv.org/pdf/2504.10823
6 replies · 2 reposts · 62 likes · 4.7K views
Steffi Chern retweeted
Yao Tang @tyao923
Think wider. Think shorter. 🚀 Introducing Multiplex Thinking: token-wise branch-and-merge reasoning for LLMs. 💸 Discrete CoT is costly. 🎛️ Existing continuous tokens often clash with on-policy RL exploration. 🎥 Multiplex Thinking, a sampling-based continuous reasoning paradigm:
25 replies · 111 reposts · 812 likes · 150.3K views
Steffi Chern retweeted
AK @_akhaliq
LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation huggingface.co/papers/2512.23…
0 replies · 9 reposts · 44 likes · 7.7K views
Steffi Chern retweeted
Xiang Yue @xiangyue96
There are competing views on whether RL can genuinely improve a base model's performance (e.g., pass@128). The answer is both yes and no, depending largely on the interplay between pre-training, mid-training, and RL. We trained a few hundred GPT-2-scale LMs from scratch on synthetic GSM-like reasoning data. Here is what we found: 🧵
28 replies · 239 reposts · 1.4K likes · 325.2K views
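An aside on the pass@128 metric mentioned in the thread above: pass@k is commonly estimated with the unbiased estimator popularized by the Codex paper (Chen et al., 2021), 1 - C(n-c, k)/C(n, k), where n samples are drawn and c of them are correct. A minimal sketch (the function name is mine):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples is correct, given n total samples of which c are correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = 2 samples, c = 1 correct, pass@1 is 0.5, matching the intuition that a random single draw succeeds half the time.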
Steffi Chern retweeted
Shijie Xia @ShijieX60925
🔥 Announcing our new paper: "SR-Scientist: Scientific Equation Discovery With Agentic AI"

Most current work using LLMs for scientific discovery, like AlphaEvolve, follows a rigid "generate → evaluate → refine" loop. We challenge this paradigm for equation discovery. Our work, SR-Scientist, empowers an LLM to act as an autonomous agent, discovering scientific equations through long-horizon, tool-driven data analysis and equation evaluation—much like a human scientist. We further enhance its capabilities with multi-turn RL.

📈 Key Results:
1️⃣ Consistently outperforms SOTA methods by a 6% to 35% absolute margin.
2️⃣ Achieves significant performance gains after RL training.
3️⃣ Demonstrates robustness to noise and generalization to out-of-domain data.

💡 Key Insights:
1️⃣ Long-horizon exploration is vital for performance.
2️⃣ Enabling agents to conduct their own data analysis is crucial.
3️⃣ An experience buffer is key for continuous optimization.

📄 Paper: arxiv.org/abs/2510.11661
💻 Code: github.com/GAIR-NLP/SR-Sc…
5 replies · 29 reposts · 99 likes · 17.3K views
Steffi Chern retweeted
Yoonho Lee @yoonholeee
The standard way to improve reasoning in LLMs is to train on long chains of thought. But these traces are often brute-force and shallow. Introducing RLAD, where models instead learn _reasoning abstractions_: concise textual strategies that guide structured exploration. 1/N🧵
10 replies · 38 reposts · 388 likes · 86.4K views
Steffi Chern retweeted
Yang Xiao @Yang_Xiao_nlp
1/9 🔥 NEW PAPER: "LIMI: Less is More for Agency" The Age of AI Agency demands systems that don't just think, but work: vibe coding and automated research. We used just 78 samples to beat GPT-5 by 14.1% and discovered the Agency Efficiency Principle. See details below! 📊
2 replies · 21 reposts · 27 likes · 5.3K views
Steffi Chern retweeted
Jiatao Gu @thoma_gu
Excited that STARFlow has been accepted at #NeurIPS2025 as a **Spotlight** paper! Looking forward to seeing more research directions on scalable normalizing flows as an alternative to the existing diffusion world! 🧐 Huge congrats to my amazing collaborators!!
Quoting Jiatao Gu @thoma_gu:
I will be attending #CVPR2025 and presenting our latest research at Apple MLR! Specifically, I will present our highlight poster, world-consistent video diffusion (cvpr.thecvf.com/virtual/2025/p…), and give three invited workshop talks, which include our recent preprint ★STARFlow★! (0/n)
4 replies · 11 reposts · 114 likes · 14.4K views
Steffi Chern retweeted
Ethan Chern @ethanchern
FacTool has been accepted to COLM 2025 - two years after its arXiv debut! While the landscape of LLMs has changed a lot since then, tool-augmented LLMs and RAG are still among the most effective and practical approaches for detecting / mitigating hallucinations (ref: x.com/karpathy/statu…, x.com/_jasonwei/stat…)

Reinforcement learning will push this even further. Imagine agents that almost never hallucinate - agents optimized to faithfully admit uncertainty, use external tools to cross-verify sources, recognize the limits of their knowledge, and resist sycophancy and lying. Crafting appropriate rewards & environments to train such agents still requires much effort (e.g., for scenarios with ambiguous facts or moral dilemmas), but effective progress can be anticipated!
Quoting Ethan Chern @ethanchern:
In the era of 🤖#GenerativeAI, text of all forms can be generated by LLMs. How can we identify and rectify *factual errors* in the generated output? We introduce FacTool, a framework for factuality detection in Generative AI. Website: ethanc111.github.io/factool_websit… (1/n)
2 replies · 5 reposts · 13 likes · 2.3K views
Steffi Chern retweeted
Zhaochen Su @SuZhaochen0110
Excited to share our new survey on the reasoning paradigm shift from "Think with Text" to "Think with Image"! 🧠🖼️ Our work offers a roadmap for more powerful & aligned AI. 🚀 📜 Paper: arxiv.org/pdf/2506.23918 ⭐ GitHub (400+🌟): github.com/zhaochen0110/A…
7 replies · 59 reposts · 159 likes · 16.1K views
Steffi Chern retweeted
Zengzhi Wang @SinclairWang1
What Makes a Base Language Model Suitable for RL?

Rumors in the community say RL (i.e., RLVR) on LLMs is full of "mysteries":
(1) Is the magic only happening on Qwen + Math?
(2) Does the "aha moment" only spark during math reasoning?
(3) Is evaluation hiding some tricky traps?
(4) Is RL's calm surface all thanks to pre/mid-training carrying the weight?

Why does RL on LLaMA consistently underperform compared to Qwen? What makes a base model truly ready for RL scaling? What are the secrets under the hood?

Due to the cost of training from scratch, we conduct extensive controlled experiments with 20B-token mid-training, systematically investigating what really matters for RL success.

💡 Key insights:
- High-quality math data is key to RL scaling.
- QA data helps, but it depends on task similarity.
- Instruction data boosts QA's effectiveness.
- More mid-training improves RL performance.

Armed with these insights, we apply a two-stage (stable + decay) mid-training strategy on LLaMA, scaling up to 200B tokens—and RL performance on LLaMA now matches Qwen!

To support this, we introduce MegaMath-Web-Pro-Max, a high-quality math-centric pretraining corpus. The dataset will be released soon on Hugging Face—stay tuned! 📦 huggingface.co/datasets/OctoT…

Full construction details are in the paper; we hope it's useful! arxiv.org/abs/2506.20512…

Getting SOTA with a strong foundation is great 🤩, but understanding the foundation—the know-how—matters just as much. Hope this analysis inspires the community—and feel free to cite us if it helps! This work would be impossible without all the brilliant co-authors @FaZhou_998 @xuefengli0301 @stefan_fee !!!
10 replies · 83 reposts · 513 likes · 93.1K views
Steffi Chern retweeted
Jiaxin Wen @jiaxinwen22
New Anthropic research: We elicit capabilities from pretrained models using no external supervision, often competitive with or better than using human supervision. Using this approach, we are able to train a Claude 3.5-based assistant that beats its human-supervised counterpart.
36 replies · 154 reposts · 1.4K likes · 240.5K views
Steffi Chern retweeted
Jyo Pari @jyo_pari
What if an LLM could update its own weights? Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward.
134 replies · 505 reposts · 3.2K likes · 663.9K views
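The SEAL loop described above (the model proposes a self-edit, applies it, and is rewarded by the updated model's downstream performance) can be sketched with a toy numeric stand-in for the LLM. Everything below is illustrative, not the paper's implementation: a dict of parameters stands in for model weights, a random perturbation stands in for a generated self-edit, and a simple accept-if-downstream-improves rule stands in for the actual RL training.

```python
import random

def downstream_score(weights, eval_targets):
    # Toy "downstream performance": higher when weights are
    # closer to the evaluation targets (negative squared error).
    return -sum((weights[k] - v) ** 2 for k, v in eval_targets.items())

def generate_self_edit(weights, rng):
    # Stand-in for the model generating its own training data:
    # propose an update to one of its own parameters.
    key = rng.choice(sorted(weights))
    return {key: weights[key] + rng.uniform(-1.0, 1.0)}

def seal_style_loop(weights, eval_targets, steps=200, seed=0):
    """Iterate: propose a self-edit, apply it to a candidate model,
    and keep the edit only if downstream performance improves."""
    rng = random.Random(seed)
    weights = dict(weights)
    for _ in range(steps):
        candidate = {**weights, **generate_self_edit(weights, rng)}
        # Reward signal comes from the *updated* model's performance.
        if downstream_score(candidate, eval_targets) > downstream_score(weights, eval_targets):
            weights = candidate
    return weights
```

In the real method the proposal step is a language model emitting text-form training data and the keep/discard signal trains that proposal policy with RL; the toy only shows the reward structure.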
Steffi Chern retweeted
Zhaochen Su @SuZhaochen0110
To further boost the "think with images" community, we've systematically summarized the latest research in our new repository: github.com/zhaochen0110/A… 🧠🖼️Let's make LVLMs see & think! A comprehensive survey paper will be released soon! Stay tuned.
2 replies · 18 reposts · 66 likes · 3.8K views
Steffi Chern retweeted
Yuqing Yang @yyqcode
🧐When do LLMs admit their mistakes when they should know better? In our new paper, we define this behavior as retraction: the model indicates that its generated answer was wrong. LLMs can retract—but they rarely do.🤯 arxiv.org/abs/2505.16170 👇🧵
5 replies · 23 reposts · 115 likes · 14.3K views