Steffi Chern
@steffichern
CS PhD @Penn | @NSF Graduate Fellow | B.S. @CarnegieMellon 🤠
Philadelphia, Pennsylvania · Joined August 2020
413 Following · 218 Followers
93 posts
Steffi Chern retweeted
Yi wei Qin @QinYi88814
Should data "evolve"? 🧬 Scaling is not enough. Model performance is bounded by data, but data's value is defined by processing depth. We introduce Data Darwinism, a 10-level hierarchy (L0-L9) redefining data as an eternal co-evolutionary process. (1/n) huggingface.co/papers/2602.07…
2 replies · 10 reposts · 19 likes · 8.3K views
Ayoung Lee @o_cube01
Accepted to ICLR 2026!🎉 So grateful to my amazing collaborators 🫶 We introduce CLASH to evaluate value reasoning, revealing new failure modes in reasoning models and intriguing steerability results! 📰 Paper: arxiv.org/pdf/2504.10823
6 replies · 2 reposts · 62 likes · 4.7K views
Steffi Chern retweeted
Yao Tang @tyao923
Think wider. Think shorter. 🚀 Introducing Multiplex Thinking: token-wise branch-and-merge reasoning for LLMs. 💸 Discrete CoT is costly. 🎛️ Existing continuous tokens often clash with on-policy RL exploration. 🎥 Multiplex Thinking, a sampling-based continuous reasoning paradigm:
25 replies · 111 reposts · 812 likes · 150.3K views
Steffi Chern retweeted
AK @_akhaliq
LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation huggingface.co/papers/2512.23…
0 replies · 9 reposts · 44 likes · 7.7K views
Steffi Chern retweeted
Xiang Yue @xiangyue96
There are competing views on whether RL can genuinely improve a base model's performance (e.g., pass@128). The answer is both yes and no, depending largely on the interplay between pre-training, mid-training, and RL. We trained a few hundred GPT-2-scale LMs from scratch on synthetic GSM-like reasoning data. Here is what we found: 🧵
28 replies · 239 reposts · 1.4K likes · 325.2K views
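An aside on the pass@128 metric mentioned in the thread above: pass@k is commonly estimated with the unbiased estimator popularized by the Codex paper (Chen et al., 2021), 1 - C(n-c, k)/C(n, k), where n samples are drawn and c of them are correct. A minimal sketch (the function name is mine):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples is correct, given n total samples of which c are correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = 2 samples, c = 1 correct, pass@1 is 0.5, matching the intuition that a random single draw succeeds half the time.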
Steffi Chern retweeted
Shijie Xia @ShijieX60925
🔥 Announcing our new paper: "SR-Scientist: Scientific Equation Discovery With Agentic AI"

Most current work using LLMs for scientific discovery, like AlphaEvolve, follows a rigid "generate → evaluate → refine" loop. We challenge this paradigm for equation discovery. Our work, SR-Scientist, empowers an LLM to act as an autonomous agent, discovering scientific equations through long-horizon, tool-driven data analysis and equation evaluation—much like a human scientist. We further enhance its capabilities with multi-turn RL.

📈 Key Results:
1️⃣ Consistently outperforms SOTA methods by a 6% to 35% absolute margin.
2️⃣ Achieves significant performance gains after RL training.
3️⃣ Demonstrates robustness to noise and generalization to out-of-domain data.

💡 Key Insights:
1️⃣ Long-horizon exploration is vital for performance.
2️⃣ Enabling agents to conduct their own data analysis is crucial.
3️⃣ An experience buffer is key for continuous optimization.

📄 Paper: arxiv.org/abs/2510.11661
💻 Code: github.com/GAIR-NLP/SR-Sc…
5 replies · 29 reposts · 99 likes · 17.3K views
Steffi Chern retweeted
Yoonho Lee @yoonholeee
The standard way to improve reasoning in LLMs is to train on long chains of thought. But these traces are often brute-force and shallow. Introducing RLAD, where models instead learn _reasoning abstractions_: concise textual strategies that guide structured exploration. 1/N🧵
10 replies · 38 reposts · 388 likes · 86.4K views
Steffi Chern retweeted
Yang Xiao @Yang_Xiao_nlp
1/9 🔥 NEW PAPER: "LIMI: Less is More for Agency" The Age of AI Agency demands systems that don't just think, but work: vibe coding and automated research. We used just 78 samples to beat GPT-5 by 14.1% and discovered the Agency Efficiency Principle. See details below! 📊
2 replies · 21 reposts · 27 likes · 5.3K views
Steffi Chern retweeted
Jiatao Gu @thoma_gu
Excited that STARFlow has been accepted at #NeurIPS2025 as a **Spotlight** paper! Looking forward to seeing more research directions on scalable normalizing flows as an alternative to the existing diffusion world! 🧐 Huge congrats to my amazing collaborators!!
Quoting Jiatao Gu @thoma_gu:
I will be attending #CVPR2025 and presenting our latest research at Apple MLR! Specifically, I will present our highlight poster, world-consistent video diffusion (cvpr.thecvf.com/virtual/2025/p…), and give three invited workshop talks, which include our recent preprint ★STARFlow★! (0/n)
4 replies · 11 reposts · 114 likes · 14.4K views
Steffi Chern retweeted
Ethan Chern @ethanchern
FacTool has been accepted to COLM 2025 - two years after its arXiv debut! While the landscape of LLMs has changed a lot since then, tool-augmented LLMs and RAG are still among the most effective and practical approaches for detecting / mitigating hallucinations (ref: x.com/karpathy/statu…, x.com/_jasonwei/stat…)

Reinforcement learning will push this even further. Imagine agents that almost never hallucinate - agents optimized to faithfully admit uncertainty, use external tools to cross-verify sources, recognize the limits of their knowledge, and resist sycophancy and lying. Crafting appropriate rewards & environments to train such agents still requires much effort (e.g., for scenarios with ambiguous facts or moral dilemmas), but effective progress can be anticipated!
Quoting Ethan Chern @ethanchern:
In the era of 🤖#GenerativeAI, text of all forms can be generated by LLMs. How can we identify and rectify *factual errors* in the generated output? We introduce FacTool, a framework for factuality detection in Generative AI. Website: ethanc111.github.io/factool_websit… (1/n)
2 replies · 5 reposts · 13 likes · 2.3K views
Steffi Chern retweeted
Zhaochen Su @SuZhaochen0110
Excited to share our new survey on the reasoning paradigm shift from "Think with Text" to "Think with Image"! 🧠🖼️ Our work offers a roadmap for more powerful & aligned AI. 🚀 📜 Paper: arxiv.org/pdf/2506.23918 ⭐ GitHub (400+🌟): github.com/zhaochen0110/A…
7 replies · 59 reposts · 159 likes · 16.1K views
Steffi Chern retweeted
Zengzhi Wang @SinclairWang1
What Makes a Base Language Model Suitable for RL?

Rumors in the community say RL (i.e., RLVR) on LLMs is full of "mysteries":
(1) Is the magic only happening on Qwen + Math?
(2) Does the "aha moment" only spark during math reasoning?
(3) Is evaluation hiding some tricky traps?
(4) Is RL's calm surface all thanks to pre/mid-training carrying the weight?

Why does RL on LLaMA consistently underperform compared to Qwen? What makes a base model truly ready for RL scaling? What are the secrets under the hood?

Due to the cost of training from scratch, we conduct extensive controlled experiments with 20B-token mid-training, systematically investigating what really matters for RL success.

💡 Key insights:
- High-quality math data is key to RL scaling.
- QA data helps, but it depends on task similarity.
- Instruction data boosts QA's effectiveness.
- More mid-training improves RL performance.

Armed with these insights, we apply a two-stage (stable + decay) mid-training strategy on LLaMA, scaling up to 200B tokens—and RL performance on LLaMA now matches Qwen!

To support this, we introduce MegaMath-Web-Pro-Max, a high-quality math-centric pretraining corpus. The dataset will be released soon on Hugging Face—stay tuned! 📦 huggingface.co/datasets/OctoT…

Full construction details are in the paper; we hope it's useful! arxiv.org/abs/2506.20512…

Getting SOTA with a strong foundation is great 🤩, but understanding the foundation—the know-how—matters just as much. Hope this analysis inspires the community—and feel free to cite us if it helps! This work would be impossible without all the brilliant co-authors @FaZhou_998 @xuefengli0301 @stefan_fee !!!
10 replies · 83 reposts · 513 likes · 93.1K views
Steffi Chern retweeted
Jiaxin Wen @jiaxinwen22
New Anthropic research: We elicit capabilities from pretrained models using no external supervision, often competitive with or better than using human supervision. Using this approach, we are able to train a Claude 3.5-based assistant that beats its human-supervised counterpart.
36 replies · 154 reposts · 1.4K likes · 240.5K views
Steffi Chern retweeted
Jyo Pari @jyo_pari
What if an LLM could update its own weights? Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward.
134 replies · 505 reposts · 3.2K likes · 663.9K views
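The SEAL loop described above (the model proposes a self-edit, applies it, and is rewarded by the updated model's downstream performance) can be sketched with a toy numeric stand-in for the LLM. Everything below is illustrative, not the paper's implementation: a dict of parameters stands in for model weights, a random perturbation stands in for a generated self-edit, and a simple accept-if-downstream-improves rule stands in for the actual RL training.

```python
import random

def downstream_score(weights, eval_targets):
    # Toy "downstream performance": higher when weights are
    # closer to the evaluation targets (negative squared error).
    return -sum((weights[k] - v) ** 2 for k, v in eval_targets.items())

def generate_self_edit(weights, rng):
    # Stand-in for the model generating its own training data:
    # propose an update to one of its own parameters.
    key = rng.choice(sorted(weights))
    return {key: weights[key] + rng.uniform(-1.0, 1.0)}

def seal_style_loop(weights, eval_targets, steps=200, seed=0):
    """Iterate: propose a self-edit, apply it to a candidate model,
    and keep the edit only if downstream performance improves."""
    rng = random.Random(seed)
    weights = dict(weights)
    for _ in range(steps):
        candidate = {**weights, **generate_self_edit(weights, rng)}
        # Reward signal comes from the *updated* model's performance.
        if downstream_score(candidate, eval_targets) > downstream_score(weights, eval_targets):
            weights = candidate
    return weights
```

In the real method the proposal step is a language model emitting text-form training data and the keep/discard signal trains that proposal policy with RL; the toy only shows the reward structure.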
Steffi Chern retweeted
Zhaochen Su @SuZhaochen0110
To further boost the "think with images" community, we've systematically summarized the latest research in our new repository: github.com/zhaochen0110/A… 🧠🖼️Let's make LVLMs see & think! A comprehensive survey paper will be released soon! Stay tuned.
2 replies · 18 reposts · 66 likes · 3.8K views
Steffi Chern retweeted
Yuqing Yang @yyqcode
🧐When do LLMs admit their mistakes when they should know better? In our new paper, we define this behavior as retraction: the model indicates that its generated answer was wrong. LLMs can retract—but they rarely do.🤯 arxiv.org/abs/2505.16170 👇🧵
5 replies · 23 reposts · 115 likes · 14.3K views