Yuling Gu @gu_yuling
130 posts
First year of PhD-ing at NYU in NYC 🚕🍎 | Previously @nyuniversity ➡️ @UW ➡️ @allen_ai

Joined September 2019
801 Following · 771 Followers
Yuling Gu @gu_yuling
SimpleToM exposes this gap 🔎 and provides a benchmark to diagnose, improve, and push LLMs toward robust social reasoning 🚀 Try SimpleToM on any model: huggingface.co/datasets/allen… 5/
Yuling Gu @gu_yuling
🎉 SimpleToM has been accepted to #ICLR2026! LLMs can tell you what someone knows (explicit ToM). But when asked to apply it to predict behavior or judge actions (applied ToM), even frontier LLMs still fail. 🤯 The gap between knowing and applying is real… and huge. 👀 1/
Yuling Gu retweeted
Kyunghyun Cho @kchonyc
i gave a keynote talk at NeurIPS'25 just last week. here's the slide deck (link below) i've used to share my thoughts on who we are and what we do.
Luca Soldaini 🎀 @soldni
yo has anyone heard of this Olmo model, loss looks good
Yuling Gu retweeted
Danica Dillion @danicajdillion
🌍 Introducing WorldValuesBench! A benchmark to evaluate how well LLMs reflect cultural differences in human values. Built from 94k+ participants in the World Values Survey → 20M examples of (demographics, value question → answer). 🧵
Yuling Gu retweeted
David Heineman @heinemandavidj
Evaluating language models is tricky: how do we know if our results are real, or due to random chance? We find an answer with two simple metrics: signal, a benchmark’s ability to separate models, and noise, a benchmark’s random variability between training steps 🧵
Ai2 @allen_ai

📢 New paper from Ai2: Signal & Noise asks a simple question—can language model benchmarks detect a true difference in model performance? 🧵

Yuling Gu @gu_yuling
@code_star Super excited to have more people like you joining in, looking into the details behind evals, and asking these interesting + important questions! 👍
Yuling Gu @gu_yuling
Come to our poster session on Friday, May 2, 9:00–10:30 am (Hall 3) to chat more!
Yuling Gu @gu_yuling
This effort toward an open language model evaluation standard doesn’t just end here. Since the submission of our NAACL paper, we have added more tasks to OLMES, including generative and reasoning tasks, all openly available in our repository (github.com/allenai/olmes).
Yuling Gu retweeted
Ai2 @allen_ai
Imagine AI doing science: reading papers, generating ideas, designing and running experiments, analyzing results… How many more discoveries can we reveal? 🧐 Meet CodeScientist, a promising next step toward autonomous scientific discovery. 🧵
Yuling Gu retweeted
Kyle Lo @kylelostat
kicking off 2025 with our OLMo 2 tech report while payin homage to the sequelest of sequels 🫡
🚗 2 OLMo 2 Furious 🔥 is everythin we learned since OLMo 1, with deep dives into:
🚖 stable pretrain
🚔 lr anneal
🤝 data curricula 🤝 soups
🚘 tulu post-train
🚜 compute infra
👇🧵
Yuling Gu retweeted
Ai2 @allen_ai
Meet Tülu 3 -- a set of state-of-the-art instruct models with fully open data, eval code, and training algorithms. We invented new methods for fine-tuning language models with RL and built upon best practices in the community to scale synthetic instruction and preference data. Demo, GitHub, technical report, and models below 👇
Yuling Gu @gu_yuling
@Itay_itzhak_ With a more specific CoT* and also reminding models of their own answers to relevant questions, performance improves. But this involves task and question-specific guidance. Ideally, we want LLMs that implicitly make & apply such inferences without the fragile human hand-holding!
Itay Itzhak @Itay_itzhak_
@gu_yuling Interesting! So given a context that will "remind" the models to consider ToM aspects, they will do better in applied ToM? Or generally, does there exist a context for samples that can restore benchmark-level performance?
Yuling Gu @gu_yuling
⚠️ Introducing SimpleToM, exposing a jarring gap in the Theory-of-Mind capabilities of current frontier LLMs: 😲 They fail to implicitly apply mental state inferences, even when they can easily infer these states for two-sentence stories. 😲 📜 arxiv.org/abs/2410.13648 1/