Stephan Xie

30 posts

@stephofx

phd student @mldcmu & NSF grfp fellow | prev: @penn

Pittsburgh, PA · Joined September 2018

199 Following · 128 Followers

Pinned Tweet

Stephan Xie @stephofx · Apr 27

How well do AI systems (LLMs, VLMs, time series FMs) answer questions about time series data📈? On ARFBench, the best models achieve ~63% accuracy on real incident data. But models and human experts fail in different areas: combining them achieves 87% accuracy. 🧵1/

Stephan Xie @stephofx · Apr 27

Check out our blog post on ARFBench here!

ML@CMU @mlcmublog

blog.ml.cmu.edu/2026/04/27/arf… How good are AI systems at time-series Q&A? On ARFBench, top models hit ~63% on real incident data. But they miss different things than humans; combine both and accuracy jumps to 87%. Read more in our latest blog post!

Stephan Xie @stephofx · Apr 27

10/ Blog post: datadoghq.com/blog/ai/introd… x-listed: blog.ml.cmu.edu/2026/04/27/arf… Paper: arxiv.org/pdf/2604.21199 Dataset+Model+Leaderboard: huggingface.co/datasets/Datad…

Stephan Xie @stephofx · Apr 27

9/ This was a very fun and insightful collaboration between Datadog AI Research and collaborators at CMU, including Ben, @MononitoGoswami, @JunhongShen1, Emaad, Chenghao, David, @ThisIsOthmane, and my advisor @atalwalkar.

Stephan Xie retweeted

maxwell jones @maxwell54650346 · Feb 24

Video Editing is great - but what if you want to apply an effect to your input video described by another video?? Introducing RefVFX, the first method that takes in both an input video and a reference effect video for generative video editing!

Stephan Xie retweeted

Fahim Tajwar @FahimTajwar10 · Feb 5

Are we done with new RL algorithms? Turns out we might have been optimizing the wrong objective. Introducing MaxRL, a framework to bring maximum likelihood optimization to RL settings. Paper + code + project website: zanette-labs.github.io/MaxRL/ 🧵 1/n

Stephan Xie retweeted

Sang Michael Xie @sangmichaelxie · Feb 4

Excited to release PrefixRL, where we achieved what I thought to be a contradiction - learning from off-policy data with purely on-policy updates. This avoids all the instabilities of off-policy RL. I think this will let us reuse previous RL and sampling FLOPs much more efficiently in the future - just check out PrefixRL’s 2x compute efficiency gain and huge plateau increase over SFT then RL. arxiv.org/abs/2601.18795

Stephan Xie retweeted

Yuda Song @yus167 · Feb 3

RL on LLMs inefficiently uses one scalar per rollout. But users regularly give much richer feedback: "make it formal," "step 3 is wrong." Can we train LLMs on this human-AI interaction? We introduce RL from Text Feedback, with 1) Self-Distillation; 2) Feedback Modeling (1/n) 🧵

Stephan Xie retweeted

Valerie Chen @valeriechen_ · Nov 3

Understanding how humans fit into agent workflows is essential, but we still lack concrete ways to measure collaboration. Our Collaborative Effort Scaling framework introduces metrics grounded in real-world studies and simulations. More details below👇

Shannon Shen @shannonzshen

Today's AI agents are optimized to complete tasks in one shot. But real-world tasks are iterative, with evolving goals that need collaboration with users. We introduce collaborative effort scaling to evaluate how well agents work with people—not just complete tasks 🧵

Stephan Xie retweeted

Yuda Song @yus167 · Oct 15

🤖 Robots rarely see the true world's state—they operate on partial, noisy visual observations. How should we design algorithms under this partial observability? Should we decide (end-to-end RL) or distill (from a privileged expert)? We study this trade-off in locomotion. 🧵(1/n)

Stephan Xie retweeted

Emily Byun @yewonbyun_ · Oct 9

💡Can we trust synthetic data for statistical inference? We show that synthetic data (e.g. LLM simulations) can significantly improve the performance of inference tasks. The key intuition lies in the interactions between the moments of synthetic data and those of real data

Stephan Xie @stephofx · May 22

@steph_milani @JHUCompSci @NYU_Courant Huge congrats Steph!! Super exciting!!

Stephanie Milani @steph_milani · May 21

Another life update!! 🎉 I’m joining @JHUCompSci as an Assistant Professor starting Fall 2026! Apply to work with me on reinforcement learning, foundation models, & human-centered AI. Let’s build better AI agents 🤖🙆‍♀️🦀 Before that, I’ll join @NYU_Courant as an Assistant Professor/Faculty Fellow. Excited to spend a year in NYC!

Stephan Xie @stephofx · May 22

Hard to overstate how important observability data is in forecasting! The complex nature of the data led to huge challenges in even evaluating time series models but also helped us make Toto super capable. Excited to share this work led by Ben and Emaad at Datadog AI Research!

Ameet Talwalkar @atalwalkar

I’m excited to share new work from Datadog AI Research! We just released Toto, a new SOTA (by a wide margin!) time series foundation model, and BOOM, the largest benchmark of observability metrics. Both are available under the Apache 2.0 license. 🧵
