Yuling Gu @gu_yuling
130 posts
First year of PhD-ing at NYU in NYC 🚕🍎 | Previously @nyuniversity ➡️ @UW ➡️ @allen_ai

Joined September 2019
801 Following · 771 Followers
Yuling Gu @gu_yuling
SimpleToM exposes this gap 🔎 and provides a benchmark to diagnose, improve, and push LLMs toward robust social reasoning 🚀 Try SimpleToM on any model: huggingface.co/datasets/allen… 5/
Yuling Gu @gu_yuling
🎉 SimpleToM has been accepted to #ICLR2026! LLMs can tell you what someone knows (explicit ToM). But when asked to apply it to predict behavior or judge actions (applied ToM), even frontier LLMs still fail. 🤯 The gap between knowing and applying is real… and huge. 👀 1/
Yuling Gu retweeted
Kyunghyun Cho @kchonyc
i gave a keynote talk at NeurIPS'25 just last week. here's the slide deck (link below) i've used to share my thoughts on who we are and what we do.
Luca Soldaini 🎀 @soldni
yo has anyone heard of this Olmo model, loss looks good
Yuling Gu retweeted
Danica Dillion @danicajdillion
🌍 Introducing WorldValuesBench! A benchmark to evaluate how well LLMs reflect cultural differences in human values. Built from 94k+ participants in the World Values Survey → 20M examples of (demographics, value question → answer). 🧵
Yuling Gu retweeted
David Heineman @heinemandavidj
Evaluating language models is tricky: how do we know if our results are real, or due to random chance? We find an answer with two simple metrics: signal, a benchmark’s ability to separate models, and noise, a benchmark’s random variability between training steps 🧵
Ai2 @allen_ai

📢 New paper from Ai2: Signal & Noise asks a simple question—can language model benchmarks detect a true difference in model performance? 🧵

Yuling Gu @gu_yuling
@code_star Super excited to have more people like you joining in, looking into the details behind evals, and asking these interesting + important questions! 👍
Yuling Gu @gu_yuling
Come to our poster session on Friday, May 2, 9:00–10:30 am (Hall 3) to chat more!
Yuling Gu @gu_yuling
This effort toward an open language model evaluation standard doesn’t just end here. Since the submission of our NAACL paper, we have added more tasks to OLMES, including generative and reasoning tasks, all openly available in our repository (github.com/allenai/olmes).
Yuling Gu retweeted
Ai2 @allen_ai
Imagine AI doing science: reading papers, generating ideas, designing and running experiments, analyzing results… How many more discoveries can we reveal? 🧐 Meet CodeScientist, a promising next step toward autonomous scientific discovery. 🧵
Yuling Gu retweeted
Kyle Lo @kylelostat
kicking off 2025 with our OLMo 2 tech report while payin homage to the sequelest of sequels 🫡
🚗 2 OLMo 2 Furious 🔥 is everythin we learned since OLMo 1, with deep dives into:
🚖 stable pretrain
🚔 lr anneal
🤝 data curricula 🤝 soups
🚘 tulu post-train
🚜 compute infra
👇🧵
Yuling Gu retweeted
Ai2 @allen_ai
Meet Tülu 3 -- a set of state-of-the-art instruct models with fully open data, eval code, and training algorithms. We invented new methods for fine-tuning language models with RL and built upon best practices in the community to scale synthetic instruction and preference data. Demo, GitHub, technical report, and models below 👇
Yuling Gu @gu_yuling
@Itay_itzhak_ With a more specific CoT* and also reminding models of their own answers to relevant questions, performance improves. But this involves task and question-specific guidance. Ideally, we want LLMs that implicitly make & apply such inferences without the fragile human hand-holding!
Itay Itzhak @Itay_itzhak_
@gu_yuling Interesting! So given a context that will "remind" the models to consider ToM aspects, they will do better in applied ToM? Or generally, does there exist a context for samples that can restore benchmark-level performance?
Yuling Gu @gu_yuling
⚠️ Introducing SimpleToM, exposing a jarring gap in the Theory-of-Mind capabilities of current frontier LLMs: 😲 They fail to implicitly apply mental state inferences, even when they can easily infer these states for two-sentence stories. 😲 📜 arxiv.org/abs/2410.13648 1/