Steffi Chern

93 posts

@steffichern

CS PhD @Penn | @NSF Graduate Fellow | B.S. @CarnegieMellon 🤠

Philadelphia, Pennsylvania · Joined August 2020
413 Following · 218 Followers
Steffi Chern reposted
Yi wei Qin@QinYi88814·
Should data "evolve"? 🧬 Scaling is not enough. Model performance is bounded by data, but data's value is defined by processing depth. We introduce Data Darwinism, a 10-level hierarchy (L0-L9) redefining data as an eternal co-evolutionary process. (1/n) huggingface.co/papers/2602.07…
2 replies · 10 reposts · 19 likes · 8.3K views
Ayoung Lee@o_cube01·
Accepted to ICLR 2026!🎉 So grateful to my amazing collaborators 🫶 We introduce CLASH to evaluate value reasoning, revealing new failure modes in reasoning models and intriguing steerability results! 📰 Paper: arxiv.org/pdf/2504.10823
6 replies · 2 reposts · 62 likes · 4.7K views
Steffi Chern reposted
Yao Tang@tyao923·
Think wider. Think shorter. 🚀 Introducing Multiplex Thinking: token-wise branch-and-merge reasoning for LLMs. 💸 Discrete CoT is costly. 🎛️ Existing continuous tokens often clash with on-policy RL exploration. 🎥 Multiplex Thinking, a sampling-based continuous reasoning paradigm:
25 replies · 111 reposts · 812 likes · 150.3K views
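The branch-and-merge idea above can be caricatured in a few lines: instead of committing to one sampled token, keep the top-k candidates ("branch") and feed their probability-weighted mixture embedding forward as a single continuous token ("merge"). This is only a guess at the mechanism from the tweet's description; `multiplex_token`, the tiny vocabulary, and the 2-d embeddings are all illustrative, not the paper's implementation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def multiplex_token(logits, embeddings, k=3):
    """Toy branch-and-merge step: take the top-k candidate tokens and
    return their probability-weighted mixture embedding (renormalized
    over the branch) as one continuous token."""
    probs = softmax(logits)
    topk = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    z = sum(probs[i] for i in topk)  # renormalize over the kept branch
    dim = len(embeddings[0])
    merged = [0.0] * dim
    for i in topk:
        w = probs[i] / z
        merged = [m + w * e for m, e in zip(merged, embeddings[i])]
    return merged

# Tiny 4-token vocabulary with 2-d embeddings.
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
mixed = multiplex_token([2.0, 1.9, 0.1, -3.0], emb, k=2)
print(mixed)
```

With two near-tied candidates, the merged token sits between their embeddings instead of discarding one branch, which is the cost argument the tweet makes against discrete CoT.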
Steffi Chern reposted
AK@_akhaliq·
LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation huggingface.co/papers/2512.23…
0 replies · 9 reposts · 44 likes · 7.7K views
Steffi Chern reposted
Xiang Yue@xiangyue96·
There are competing views on whether RL can genuinely improve a base model's performance (e.g., pass@128). The answer is both yes and no, largely depending on the interplay between pre-training, mid-training, and RL. We trained a few hundred GPT-2-scale LMs on synthetic GSM-like reasoning data from scratch. Here's what we found: 🧵
28 replies · 239 reposts · 1.4K likes · 325.2K views
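For context on the pass@128 metric mentioned above: pass@k is conventionally computed with an unbiased estimator over n sampled attempts, asking how likely at least one of k draws solves the task. A minimal version of that standard formula (not code from this thread):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n attempts of which c are
    correct, solves the task."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct attempt
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# A model with 4/128 correct attempts looks weak at pass@1
# but perfect at pass@128:
print(round(pass_at_k(128, 4, 1), 4))   # 0.0312
print(pass_at_k(128, 4, 128))           # 1.0
```

This is why the yes-and-no answer matters: RL can sharpen pass@1 without moving pass@128 if it only reweights solutions the base model could already sample.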
Steffi Chern reposted
Shijie Xia@ShijieX60925·
🔥 Announcing our new paper: "SR-Scientist: Scientific Equation Discovery With Agentic AI"

Most current work using LLMs for scientific discovery, like AlphaEvolve, follows a rigid "generate → evaluate → refine" loop. We challenge this paradigm for equation discovery. Our work, SR-Scientist, empowers an LLM to act as an autonomous agent, discovering scientific equations through long-horizon, tool-driven data analysis and equation evaluation, much like a human scientist. We further enhance its capabilities with multi-turn RL.

📈 Key Results:
1️⃣ Consistently outperforms SOTA methods by a 6% to 35% absolute margin.
2️⃣ Achieves significant performance gains after RL training.
3️⃣ Demonstrates robustness to noise and generalization to out-of-domain data.

💡 Key Insights:
1️⃣ Long-horizon exploration is vital for performance.
2️⃣ Enabling agents to conduct their own data analysis is crucial.
3️⃣ An experience buffer is key for continuous optimization.

📄 Paper: arxiv.org/abs/2510.11661
💻 Code: github.com/GAIR-NLP/SR-Sc…
5 replies · 29 reposts · 99 likes · 17.3K views
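The loop the tweet describes, long-horizon exploration of candidate equations plus an experience buffer of the best hypotheses, can be sketched in miniature. Everything here (the hidden law, `agent_loop`, the buffer size) is an invented stand-in for illustration, not the SR-Scientist system:

```python
import random

# Toy "observations" from a hidden law y = 3x + 2, standing in for real data.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3 * x + 2 for x in xs]

def mse(params):
    """Equation evaluation: mean squared error of hypothesis a*x + b."""
    a, b = params
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def agent_loop(steps=300, seed=0):
    """Miniature of the agent's long-horizon loop: propose an equation
    hypothesis (here just the parameters of a*x + b), evaluate it on
    the data, and keep an experience buffer of the best hypotheses
    to refine further."""
    rng = random.Random(seed)
    buffer = [((0.0, 0.0), mse((0.0, 0.0)))]  # experience buffer: (params, error)
    for _ in range(steps):
        base, _ = min(buffer, key=lambda e: e[1])         # refine best hypothesis so far
        cand = (base[0] + rng.gauss(0, 0.3), base[1] + rng.gauss(0, 0.3))
        buffer.append((cand, mse(cand)))
        buffer = sorted(buffer, key=lambda e: e[1])[:10]  # retain top-10 experiences
    return min(buffer, key=lambda e: e[1])

params, err = agent_loop()
print(f"recovered y = {params[0]:.2f}x + {params[1]:.2f}, mse = {err:.4f}")
```

Even this caricature shows the two insights from the thread: the buffer lets later proposals build on earlier evidence, and more exploration steps monotonically tighten the fit.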
Steffi Chern reposted
Yoonho Lee@yoonholeee·
The standard way to improve reasoning in LLMs is to train on long chains of thought. But these traces are often brute-force and shallow. Introducing RLAD, where models instead learn _reasoning abstractions_: concise textual strategies that guide structured exploration. 1/N🧵
10 replies · 38 reposts · 388 likes · 86.4K views
Steffi Chern reposted
Yang Xiao@Yang_Xiao_nlp·
1/9 🔥 NEW PAPER: "LIMI: Less is More for Agency" The Age of AI Agency demands systems that don't just think, but work: vibe coding and automated research. We used just 78 samples to beat GPT-5 by 14.1% and discovered the Agency Efficiency Principle. See details below! 📊
2 replies · 21 reposts · 27 likes · 5.3K views
Steffi Chern reposted
Jiatao Gu@thoma_gu·
Excited that STARFlow has been accepted at #NeurIPS2025 as a **Spotlight** paper! Looking forward to seeing more research directions on scalable normalizing flows as an alternative to the prevailing diffusion paradigm! 🧐 Huge congrats to my amazing collaborators!!
Jiatao Gu@thoma_gu

I will be attending #CVPR2025 and presenting our latest research at Apple MLR! Specifically, I will present our highlight poster, world-consistent video diffusion (cvpr.thecvf.com/virtual/2025/p…), and give three invited workshop talks, which include our recent preprint ★STARFlow★! (0/n)

4 replies · 11 reposts · 114 likes · 14.4K views
Steffi Chern reposted
Ethan Chern@ethanchern·
FacTool has been accepted to COLM 2025, two years after its arXiv debut! While the landscape of LLMs has changed a lot since then, tool-augmented LLMs and RAG are still among the most effective and practical approaches for detecting / mitigating hallucinations (ref: x.com/karpathy/statu…, x.com/_jasonwei/stat…). Reinforcement learning will push this even further. Imagine agents that almost never hallucinate: agents optimized to faithfully admit uncertainty, use external tools to cross-verify sources, recognize the limits of their knowledge, and resist sycophancy and lying. Crafting appropriate rewards & environments to train such agents still requires much effort (e.g., for scenarios with ambiguous facts or moral dilemmas), but effective progress can be anticipated!
Ethan Chern@ethanchern

In the era of 🤖#GenerativeAI, text of all forms can be generated by LLMs. How can we identify and rectify *factual errors* in the generated output? We introduce FacTool, a framework for factuality detection in Generative AI. Website: ethanc111.github.io/factool_websit… (1/n)

2 replies · 5 reposts · 13 likes · 2.3K views
Steffi Chern reposted
Zhaochen Su@SuZhaochen0110·
Excited to share our new survey on the reasoning paradigm shift from "Think with Text" to "Think with Image"! 🧠🖼️ Our work offers a roadmap for more powerful & aligned AI. 🚀 📜 Paper: arxiv.org/pdf/2506.23918 ⭐ GitHub (400+🌟): github.com/zhaochen0110/A…
7 replies · 59 reposts · 159 likes · 16.1K views
Steffi Chern reposted
Zengzhi Wang@SinclairWang1·
What Makes a Base Language Model Suitable for RL?

Rumors in the community say RL (i.e., RLVR) on LLMs is full of "mysteries":
(1) Is the magic only happening on Qwen + Math?
(2) Does the "aha moment" only spark during math reasoning?
(3) Is evaluation hiding some tricky traps?
(4) Is RL's calm surface all thanks to pre/mid-training carrying the weight?

Why does RL on LLaMA consistently underperform compared to Qwen? What makes a base model truly ready for RL scaling? What are the secrets under the hood?

Due to the cost of training from scratch, we conduct extensive controlled experiments with 20B-token mid-training, systematically investigating what really matters for RL success.

💡 Key insights:
- High-quality math data is key to RL scaling.
- QA data helps, but it depends on task similarity.
- Instruction data boosts QA's effectiveness.
- More mid-training improves RL performance.

Armed with these insights, we apply a two-stage (stable + decay) mid-training strategy on LLaMA, scaling up to 200B tokens, and RL performance on LLaMA now matches Qwen!

To support this, we introduce MegaMath-Web-Pro-Max, a high-quality math-centric pretraining corpus. The dataset will be released soon on Hugging Face; stay tuned!
📦 huggingface.co/datasets/OctoT…

Full construction details are in the paper; we hope it's useful! arxiv.org/abs/2506.20512…

Getting SOTA with a strong foundation is great 🤩, but understanding the foundation, the know-how, matters just as much. Hope this analysis inspires the community, and feel free to cite us if it helps!

This work would be impossible without all the brilliant co-authors @FaZhou_998 @xuefengli0301 @stefan_fee !!!
10 replies · 83 reposts · 513 likes · 93.1K views
Steffi Chern reposted
Jiaxin Wen@jiaxinwen22·
New Anthropic research: We elicit capabilities from pretrained models using no external supervision, often competitive with or better than human supervision. Using this approach, we train a Claude 3.5-based assistant that beats its human-supervised counterpart.
36 replies · 154 reposts · 1.4K likes · 240.5K views
Steffi Chern reposted
Jyo Pari@jyo_pari·
What if an LLM could update its own weights? Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward.
134 replies · 505 reposts · 3.2K likes · 663.9K views
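A toy version of the SEAL recipe above: the "model" is a single weight, a self-edit is a proposed weight update, and the signal for choosing edits is the updated model's downstream score, the same quantity SEAL uses as the RL reward for its self-editing policy. All names and the downstream task here are invented for illustration; the real system generates textual training data and fine-tunes on it.

```python
import random

def downstream_score(weights):
    # Toy downstream task: performance peaks as w approaches 1.0.
    return -abs(weights["w"] - 1.0)

def propose_self_edit(weights, rng):
    # Stand-in for the model generating its own training data:
    # here a "self-edit" is just a perturbed copy of the weights.
    return {"w": weights["w"] + rng.uniform(-0.5, 0.5)}

def seal_step(weights, rng, n_candidates=8):
    """One SEAL-style step: sample candidate self-edits, apply each,
    and keep whichever update makes the *updated* model score best
    downstream. Keeping the current weights as a fallback candidate
    means a step never makes things worse."""
    candidates = [weights] + [propose_self_edit(weights, rng)
                              for _ in range(n_candidates)]
    return max(candidates, key=downstream_score)

rng = random.Random(0)
w = {"w": 0.0}
for _ in range(20):
    w = seal_step(w, rng)
print(f"w after 20 self-edits: {w['w']:.3f}")
```

The key structural point survives the caricature: the reward is computed on the model *after* the edit is applied, so the edit generator is trained on the consequences of its own updates.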
Steffi Chern reposted
Zhaochen Su@SuZhaochen0110·
To further boost the "think with images" community, we've systematically summarized the latest research in our new repository: github.com/zhaochen0110/A… 🧠🖼️Let's make LVLMs see & think! A comprehensive survey paper will be released soon! Stay tuned.
2 replies · 18 reposts · 66 likes · 3.8K views
Steffi Chern reposted
Yuqing Yang@yyqcode·
🧐When do LLMs admit their mistakes when they should know better? In our new paper, we define this behavior as retraction: the model indicates that its generated answer was wrong. LLMs can retract—but they rarely do.🤯 arxiv.org/abs/2505.16170 👇🧵
5 replies · 23 reposts · 115 likes · 14.3K views
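The definition in the tweet, a model indicating that its own generated answer was wrong, is concrete enough to sketch a detector for. The cue phrases and `is_retraction` below are purely illustrative heuristics, not the paper's annotation pipeline:

```python
def is_retraction(response: str) -> bool:
    """Toy detector for 'retraction': after stating an answer, the
    model itself signals that the answer was wrong. Real annotation
    would need a judge model or human labels; these phrase cues are
    illustrative only."""
    cues = (
        "wait, that's wrong",
        "i made a mistake",
        "my previous answer was incorrect",
        "actually, that is not right",
    )
    text = response.lower()
    return any(cue in text for cue in cues)

print(is_retraction("The answer is 42. Wait, that's wrong: it should be 41."))  # True
print(is_retraction("The answer is 42."))                                       # False
```

The paper's finding, that models *can* retract but rarely do, is exactly the gap between how often a detector like this fires and how often the stated answer is actually wrong.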