Liyan Tang

177 posts

@LiyanTang4

Research Scientist @ Google Research || NLP || MiniCheck || Prev PhD @UTAustin || Intern @GoogleDeepMind, @bespokelabsai, @AmazonScience

Austin, TX, US · Joined February 2022
143 Following · 242 Followers
Pinned Tweet
Liyan Tang (@LiyanTang4)
🔎📄New model & benchmark to check LLMs’ output against docs (e.g., fact-check RAG)
🕵️ MiniCheck: a model w/GPT-4 accuracy @ 400x cheaper
📚LLM-AggreFact: collects 10 human-labeled datasets of errors in model outputs
arxiv.org/abs/2404.10774
w/ @PhilippeLaban, @gregd_nlp 🧵
2 replies · 27 reposts · 90 likes · 16.6K views
Liyan Tang retweeted
Greg Durrett (@gregd_nlp)
Check out Manya's benchmark for LLM creativity! Inspired by work on creativity in graphs (@AdtRaghunathan's "roll the dice" paper), CREATE isolates testing of creative insights for discovery. Future: understand how LLMs derive insights & how they can be better creative partners!
Quoting Manya Wadhwa (@ManyaWadhwa1):

⚛️ Introducing CREATE, a benchmark for creative associative reasoning in LLMs. Making novel, meaningful connections is key for scientific & creative works. We objectively measure how well LLMs can do this. 🧵👇

0 replies · 13 reposts · 57 likes · 7.9K views
Liyan Tang retweeted
Manya Wadhwa (@ManyaWadhwa1)
⚛️ Introducing CREATE, a benchmark for creative associative reasoning in LLMs. Making novel, meaningful connections is key for scientific & creative works. We objectively measure how well LLMs can do this. 🧵👇
4 replies · 43 reposts · 144 likes · 21.8K views
Liyan Tang retweeted
Wenxuan Ding (@Wenxuan_Ding_)
Agents interact with environments to gather information. But exploration can be expensive. Tool use, retrieval, and user interaction carry latency or monetary cost.
Calibrate-Then-Act allows LLM agents to balance exploration with cost:
📐 Estimate uncertainty about the environment
💭 Reason about cost-uncertainty tradeoffs
⚙️ Act accordingly
7 replies · 32 reposts · 119 likes · 12.3K views
Liyan Tang retweeted
Greg Durrett (@gregd_nlp)
I'm at NeurIPS until Friday! This morning, catch:
@LiyanTang4 presenting ChartMuseum, testing if VLMs can do visual reasoning over charts
@sebajoed presenting AstroVisBench, testing if coding LLMs can work with real astro data workflows
& link in thread if you want to meet!
4 replies · 12 reposts · 60 likes · 3.7K views
Liyan Tang retweeted
Greg Durrett (@gregd_nlp)
📢 Postdoc position 📢 I’m recruiting a postdoc for my lab at NYU! Topics include LM reasoning, creativity, limitations of scaling, AI for science, & more! Apply by Feb 1. (Different from NYU Faculty Fellows, which are also great but less connected to my lab.) Link in 🧵
4 replies · 58 reposts · 146 likes · 21.8K views
Liyan Tang (@LiyanTang4)
Our paper "ChartMuseum 🖼️" is now accepted to #NeurIPS2025 Datasets and Benchmarks Track! Even the latest models, such as GPT-5 and Gemini-2.5-Pro, still cannot do well on challenging 📉chart understanding questions, especially on those that involve visual reasoning 👀!
Quoting Liyan Tang (@LiyanTang4):

Introducing ChartMuseum🖼️, testing visual reasoning with diverse real-world charts!
✍🏻Entirely human-written questions by 13 CS researchers
👀Emphasis on visual reasoning – hard to be verbalized via text CoTs
📉Humans reach 93% but 63% from Gemini-2.5-Pro & 38% from Qwen2.5-72B

1 reply · 22 reposts · 37 likes · 3.7K views
Liyan Tang retweeted
Greg Durrett (@gregd_nlp)
📢I'm joining NYU (Courant CS + Center for Data Science) starting this fall! I’m excited to connect with new NYU colleagues and keep working on LLM reasoning, reliability, coding, creativity, and more! I’m also looking to build connections in the NYC area more broadly. Please reach out if you're interested in chatting! This move comes after 8 years working with incredible students and collaborators at UT Austin. Thank you to everyone who supported me in my first academic appointment; I look forward to continuing our collaborations but I will miss you! (and the breakfast tacos!)
93 replies · 48 reposts · 762 likes · 65.1K views
Liyan Tang retweeted
Leo Liu (@ZEYULIU10)
LLMs trained to memorize new facts can’t use those facts well.🤔 We apply a hypernetwork to ✏️edit✏️ the gradients for fact propagation, improving accuracy by 2x on a challenging subset of RippleEdit!💡 Our approach, PropMEND, extends MEND with a new objective for propagation.
5 replies · 71 reposts · 197 likes · 31.4K views
Liyan Tang retweeted
Xi Ye (@xiye_nlp)
🤔 Recent mech interp work showed that retrieval heads can explain some long-context behavior. But can we use this insight for retrieval?
📣 Introducing QRHeads (query-focused retrieval heads) that enhance retrieval
Main contributions:
🔍 Better head detection: we find a different and more useful set of heads vs original retrieval head
📊Practical utility: a general-purpose retriever for long-context reasoning and re-ranking
2 replies · 19 reposts · 70 likes · 17.1K views
Liyan Tang retweeted
Fangcong Yin (@fangcong_y10593)
Solving complex problems with CoT requires combining different skills. We can do this by:
🧩Modify the CoT data format to be “composable” with other skills
🔥Train models on each skill
📌Combine those models
Lead to better 0-shot reasoning on tasks involving skill composition!
5 replies · 39 reposts · 87 likes · 12.4K views
Liyan Tang retweeted
Greg Durrett (@gregd_nlp)
Check out ChartMuseum from @LiyanTang4 @_grace_kim and many other collaborators from UT! Charts questions take us beyond current benchmarks for math/multi-hop QA/etc., which CoT is very good at, to *visual reasoning*, which is hard to express with text CoT!
Quoting Liyan Tang (@LiyanTang4):

Introducing ChartMuseum🖼️, testing visual reasoning with diverse real-world charts!
✍🏻Entirely human-written questions by 13 CS researchers
👀Emphasis on visual reasoning – hard to be verbalized via text CoTs
📉Humans reach 93% but 63% from Gemini-2.5-Pro & 38% from Qwen2.5-72B

1 reply · 10 reposts · 34 likes · 2.8K views
Liyan Tang (@LiyanTang4)
Introducing ChartMuseum🖼️, testing visual reasoning with diverse real-world charts!
✍🏻Entirely human-written questions by 13 CS researchers
👀Emphasis on visual reasoning – hard to be verbalized via text CoTs
📉Humans reach 93% but 63% from Gemini-2.5-Pro & 38% from Qwen2.5-72B
2 replies · 34 reposts · 77 likes · 18.4K views
Liyan Tang retweeted
Philippe Laban (@PhilippeLaban)
🆕paper: LLMs Get Lost in Multi-Turn Conversation In real life, people don’t speak in perfect prompts. So we simulate multi-turn conversations — less lab-like, more like real use. We find that LLMs get lost in conversation. 👀What does that mean? 🧵1/N 📄arxiv.org/abs/2505.06120
4 replies · 36 reposts · 132 likes · 10.3K views
Liyan Tang retweeted
Anirudh Khatry (@AnirudhKhatry)
🚀Introducing CRUST-Bench, a dataset for C-to-Rust transpilation for full codebases
🛠️ A dataset of 100 real-world C repositories across various domains, each paired with:
🦀 Handwritten safe Rust interfaces.
🧪 Rust test cases to validate correctness.
🧵[1/6]
3 replies · 20 reposts · 69 likes · 15.3K views