Stephan Xie

30 posts

@stephofx

phd student @mldcmu & NSF grfp fellow | prev: @penn

Pittsburgh, PA · Joined September 2018
199 Following · 128 Followers
Pinned Tweet
Stephan Xie @stephofx
How well do AI systems (LLMs, VLMs, time series FMs) answer questions about time series data📈? On ARFBench, the best models achieve ~63% accuracy on real incident data. But models and human experts fail in different areas: combining them achieves 87% accuracy. 🧵1/
1
16
41
3.6K
Stephan Xie @stephofx
Check out our blog post on ARFBench here!
ML@CMU @mlcmublog

blog.ml.cmu.edu/2026/04/27/arf… How good are AI systems at time-series Q&A? On ARFBench, top models hit ~63% on real incident data. But they miss different things than humans; combine both and accuracy jumps to 87%. Read more in our latest blog post!

0
0
9
969
Stephan Xie retweeted
maxwell jones @maxwell54650346
Video Editing is great - but what if you want to apply an effect to your input video described by another video?? Introducing RefVFX, the first method that takes in both an input video and a reference effect video for generative video editing!
6
23
116
21.3K
Stephan Xie retweeted
Fahim Tajwar @FahimTajwar10
Are we done with new RL algorithms? Turns out we might have been optimizing the wrong objective. Introducing MaxRL, a framework to bring maximum likelihood optimization to RL settings. Paper + code + project website: zanette-labs.github.io/MaxRL/ 🧵 1/n
14
161
808
207.4K
Stephan Xie retweeted
Sang Michael Xie @sangmichaelxie
Excited to release PrefixRL, where we achieved what I thought to be a contradiction - learning from off-policy data with purely on-policy updates. This avoids all the instabilities of off-policy RL. I think this will let us reuse previous RL and sampling FLOPs much more efficiently in the future - just check out PrefixRL’s 2x compute efficiency gain and huge plateau increase over SFT then RL. arxiv.org/abs/2601.18795
2
26
192
23.2K
Stephan Xie retweeted
Yuda Song @yus167
RL on LLMs inefficiently uses one scalar per rollout. But users regularly give much richer feedback: "make it formal," "step 3 is wrong." Can we train LLMs on this human-AI interaction? We introduce RL from Text Feedback, with 1) Self-Distillation; 2) Feedback Modeling (1/n) 🧵
14
102
600
107.1K
Stephan Xie retweeted
Valerie Chen @valeriechen_
Understanding how humans fit into agent workflows is essential, but we still lack concrete ways to measure collaboration. Our Collaborative Effort Scaling framework introduces metrics grounded in real-world studies and simulations. More details below👇
Shannon Shen @shannonzshen

Today's AI agents are optimized to complete tasks in one shot. But real-world tasks are iterative, with evolving goals that need collaboration with users. We introduce collaborative effort scaling to evaluate how well agents work with people—not just complete tasks 🧵

1
4
24
3.5K
Stephan Xie retweeted
Yuda Song @yus167
🤖 Robots rarely see the true world's state—they operate on partial, noisy visual observations. How should we design algorithms under this partial observability? Should we decide (end-to-end RL) or distill (from a privileged expert)? We study this trade-off in locomotion. 🧵(1/n)
2
40
142
30.5K
Stephan Xie retweeted
Emily Byun @yewonbyun_
💡Can we trust synthetic data for statistical inference? We show that synthetic data (e.g. LLM simulations) can significantly improve the performance of inference tasks. The key intuition lies in the interactions between the moments of synthetic data and those of real data
2
36
144
31.2K
Stephanie Milani @steph_milani
Another life update!! 🎉 I’m joining @JHUCompSci as an Assistant Professor starting Fall 2026! Apply to work with me on reinforcement learning, foundation models, & human-centered AI. Let’s build better AI agents 🤖🙆‍♀️🦀 Before that, I’ll join @NYU_Courant as an Assistant Professor/Faculty Fellow. Excited to spend a year in NYC!
71
21
645
63.3K
Stephan Xie @stephofx
Hard to overstate how important observability data is in forecasting! The complex nature of the data led to huge challenges in even evaluating time series models but also helped us make Toto super capable. Excited to share this work led by Ben and Emaad at Datadog AI Research!
Ameet Talwalkar @atalwalkar

I’m excited to share new work from Datadog AI Research! We just released Toto, a new SOTA (by a wide margin!) time series foundation model, and BOOM, the largest benchmark of observability metrics. Both are available under the Apache 2.0 license. 🧵

0
4
14
1.1K