Dan Roth

46 posts

@DanRothNLP

Chief AI Scientist, Oracle, and the Eduardo D. Glandt Distinguished Professor, CIS, University of Pennsylvania. Former VP/Distinguished Scientist, AWS AI Labs.

Philadelphia, PA · Joined May 2010
55 Following · 1.9K Followers
Dan Roth retweeted
EMNLP 2026 @emnlpmeeting
Social Impact Award: "AccessEval: Benchmarking Disability Bias in Large Language Models" by Srikant Panda, Amit Agarwal, and Hitesh Laxmichand Patel aclanthology.org/2025.emnlp-mai… 10/n
1 reply · 4 reposts · 13 likes · 3.7K views
Dan Roth retweeted
Siyi Liu @liusiyi64198
📷 New #EMNLP2025 Findings survey paper! “Conflicts in Texts: Data, Implications, and Challenges” Paper: aclanthology.org/2025.findings-…
Conflicts are everywhere in NLP: news articles reflecting different perspectives or opposing views, annotators who disagree, LLMs that hallucinate or contradict themselves, and personal/enterprise document collections that drift apart and contradict each other. Most research tackles these in isolation, and our survey provides the first unified view of conflicting information in NLP. We chart the path toward conflict-aware, reliable NLP systems.
Builds on our earlier work on:
- Multi-perspective dataset aclanthology.org/2021.naacl-mai… and search aclanthology.org/2022.findings-…
- Hallucination detection aclanthology.org/2025.findings-…
- Open-domain QA with conflicting contexts aclanthology.org/2025.findings-…
0 replies · 3 reposts · 12 likes · 773 views
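The survey above covers many kinds of conflict; as a minimal illustration of the simplest case (two statements that directly contradict each other), one common building block is a pairwise NLI check like the sketch below. The model choice and the scoring rule are assumptions for illustration, not the paper's method.

```python
# Minimal sketch: score how strongly one statement contradicts another with an
# off-the-shelf NLI model. Model name and scoring are illustrative assumptions;
# the survey itself covers far richer notions of conflict than this.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"  # assumption: any NLI model exposing a CONTRADICTION label
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def contradiction_score(statement_a: str, statement_b: str) -> float:
    """Probability that statement_b contradicts statement_a."""
    inputs = tokenizer(statement_a, statement_b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    idx = {label.upper(): i for i, label in model.config.id2label.items()}
    return probs[idx["CONTRADICTION"]].item()

# Two sources that disagree about the same fact should score high.
print(contradiction_score(
    "The company reported record profits in 2023.",
    "The company posted a net loss in 2023.",
))
```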
Dan Roth retweeted
Tomer Wolfson @TomerWolfson
✨Yesterday we released MoNaCo, an @allen_ai benchmark of 1,315 hard human-written questions that, on average, require 43.3 documents per question!✨ The three aforementioned questions were actually some of the easier ones in MoNaCo 😉 (8/) x.com/allen_ai/statu…
Ai2 @allen_ai

LLMs power research, decision‑making, and exploration—but most benchmarks don’t test how well they stitch together evidence across dozens (or hundreds) of sources. Meet MoNaCo, our new eval for question-answering cross‑source reasoning. 👇

1 reply · 1 repost · 3 likes · 614 views
Dan Roth retweeted
Ai2 @allen_ai
MoNaCo evaluates complex question-answering with:
📚 1,315 multi‑step queries
🔎 Retrieval, filtering & aggregation across text and tables
🌟 Avg 43.3 distinct documents per query
1 reply · 1 repost · 15 likes · 1.3K views
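To make the "retrieval, filtering & aggregation" framing above concrete, here is a rough sketch of the kind of cross-document pipeline a MoNaCo-style question exercises. The helper names (`retrieve`, `extract_records`) and the counting-style aggregation are hypothetical illustrations, not the benchmark's evaluation code.

```python
# Rough sketch of answering a question whose evidence is spread across many
# documents. `retrieve` and `extract_records` are hypothetical helpers; the
# aggregation step (counting entities) is just one example of many.
from collections import Counter
from typing import Callable, Iterable

def answer_multi_source(
    question: str,
    retrieve: Callable[[str, int], Iterable[str]],       # candidate documents for the question
    extract_records: Callable[[str, str], list[dict]],   # structured facts from one document
    top_k: int = 50,                                      # MoNaCo questions average ~43 documents
) -> dict:
    records = []
    for doc in retrieve(question, top_k):                 # 1. retrieval across sources
        records.extend(extract_records(question, doc))    # 2. per-document extraction
    filtered = [r for r in records if r.get("relevant")]  # 3. filtering
    counts = Counter(r["entity"] for r in filtered)       # 4. aggregation
    return {"answer": counts.most_common(), "supporting_facts": len(filtered)}
```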
Dan Roth retweeted
Ai2 @allen_ai
LLMs power research, decision‑making, and exploration—but most benchmarks don’t test how well they stitch together evidence across dozens (or hundreds) of sources. Meet MoNaCo, our new eval for question-answering cross‑source reasoning. 👇
10 replies · 38 reposts · 228 likes · 21.6K views
Dan Roth retweeted
Weijia Shi @WeijiaShi2
Augmenting GPT-4o with Visual Sketchpad ✏️ We introduce the Sketchpad agent, a framework that equips multimodal LLMs with a visual canvas and drawing tools 🎨, improving GPT-4o's performance on vision and math tasks 📈 🔗: visualsketchpad.github.io
Yushi Hu @huyushi98

Humans draw to facilitate reasoning and communication. Why not let LLMs do so? 🚀We introduce ✏️Sketchpad, which gives multimodal LLMs a sketchpad to draw on and facilitate reasoning! arxiv.org/abs/2406.09403
Sketchpad gives GPT-4o great boosts on many vision and math tasks 📈 The video shows how GPT-4o with Sketchpad reasons with interleaved visual and textual steps. For more, visit our project page: visualsketchpad.github.io
📌 For math tasks, ✏️Sketchpad lets LLMs draw auxiliary lines on geometry diagrams, plot functions and graphs, and even sketch games. GPT-4o does math better when it can sketch! (+12.7% acc on average)
📌 For computer vision tasks, ✏️Sketchpad lets LLMs sketch with vision specialists (e.g., GroundingDINO draws bounding boxes, SegmentAnything draws masks). Sketchpad substantially improves GPT-4o's vision abilities. GPT-4o + Sketchpad compared with prior SOTAs:
1️⃣ V*Bench: 75.4% -> 80.3%
2️⃣ BLINK correspondence: 42.4% -> 80.8%
3️⃣ BLINK relative depth: 67.7% -> 83.9%
4️⃣ BLINK spatial relation: 76.2% -> 81.1%
... See more interesting examples in the thread!

10 replies · 51 reposts · 284 likes · 50K views
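One way to picture the Sketchpad loop described above: the model alternates between textual reasoning and drawing actions, and each rendered sketch is fed back into the context as a new image turn. The sketch below is a schematic reading of that loop; `call_multimodal_llm` is a hypothetical wrapper around any vision-language API, and the matplotlib canvas stands in for the paper's richer toolset (auxiliary lines, plots, specialist detectors).

```python
# Schematic of a Sketchpad-style agent loop: the model proposes a drawing
# action, we render it, and the rendered image goes back into its context.
# `call_multimodal_llm` is a hypothetical wrapper, not an actual API.
import io
import matplotlib.pyplot as plt

def render_action(action: dict) -> bytes:
    """Render one drawing action (here only line segments) to PNG bytes."""
    fig, ax = plt.subplots()
    for (x0, y0), (x1, y1) in action.get("segments", []):
        ax.plot([x0, x1], [y0, y1])
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    return buf.getvalue()

def sketchpad_loop(question: str, image: bytes, call_multimodal_llm, max_steps: int = 5):
    context = [{"role": "user", "text": question, "image": image}]
    for _ in range(max_steps):
        step = call_multimodal_llm(context)      # expected: {"thought", "action"?, "answer"?}
        if "answer" in step:                     # model decides it has enough visual evidence
            return step["answer"]
        sketch = render_action(step["action"])   # draw what the model asked for
        context.append({"role": "tool", "text": step["thought"], "image": sketch})
    return None
```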
Dan Roth retweeted
Xingyu Fu @XingyuFu2
🔥Highlights of the Commonsense-T2I benchmark:
📚 Pairwise text prompts with minimal token changes
⚙️ Rigorous automatic evaluation with descriptions of expected outputs
❗️ Even DALL-E 3 achieves below 50% accuracy (2/n)
1 reply · 2 reposts · 10 likes · 1.6K views
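Reading the highlights above, the pairwise protocol can be pictured like this: each item carries two near-identical prompts plus a description of what a commonsense-respecting image should show, and an automatic judge checks both generations. The sketch below is an illustrative reading only; `generate_image` and `vlm_judge` are hypothetical stand-ins, and the example item is made up, not drawn from the benchmark.

```python
# Illustrative sketch of a pairwise, automatically judged T2I check.
# `generate_image` (text-to-image model) and `vlm_judge` (vision-language
# judge) are hypothetical stand-ins; the pass/fail rule is an assumption.
def evaluate_pair(item: dict, generate_image, vlm_judge) -> bool:
    """An item passes only if BOTH prompts yield images matching their expected output."""
    for prompt_key, expected_key in (("prompt_a", "expected_a"), ("prompt_b", "expected_b")):
        image = generate_image(item[prompt_key])
        verdict = vlm_judge(
            image=image,
            question=f"Does this image show: {item[expected_key]}? Answer yes or no.",
        )
        if not verdict.strip().lower().startswith("yes"):
            return False
    return True

# A made-up item purely for illustration; not from the Commonsense-T2I data.
example = {
    "prompt_a": "a lit candle inside a sealed glass jar after ten minutes",
    "expected_a": "the flame has gone out",
    "prompt_b": "a lit candle in an open glass jar after ten minutes",
    "expected_b": "the flame is still burning",
}
```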
Dan Roth retweeted
Xingyu Fu @XingyuFu2
Can Text-to-Image models understand common sense? 🤔 Can they generate images that fit everyday common sense? 🤔 tldr; NO, they are far less intelligent than us 💁🏻‍♀️ Introducing Commonsense-T2I 💡 zeyofu.github.io/CommonsenseT2I/, a novel evaluation and benchmark designed to measure commonsense reasoning in T2I models 🔥🔥 Paper: arxiv.org/abs/2406.07546 (1/n)
7 replies · 37 reposts · 130 likes · 48.8K views
Dan Roth retweeted
Zijian Wang @zijianwang30
Best-fit Packing completely eliminates unnecessary truncations while retaining the same training efficiency as concatenation, with <0.01% overhead, tested on popular pre-training datasets like @TIIuae's RefinedWeb and @BigCodeProject's Stack. 🧵5/n
1 reply · 1 repost · 3 likes · 631 views
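For intuition about the claim above: instead of concatenating everything and slicing at fixed boundaries, best-fit packing splits a document only when it is longer than the context window and then places every piece into the chunk with the least remaining room that can still hold it. The sketch below is a simplified reading of that idea (classic best-fit-decreasing over token lengths), not the paper's released implementation.

```python
# Simplified sketch of best-fit packing over document token lengths.
# Documents are split only when longer than the context window, then placed
# greedily into the tightest chunk that still fits (best-fit decreasing).
def best_fit_pack(doc_lengths: list[int], context_len: int = 2048) -> list[list[int]]:
    pieces = []
    for n in doc_lengths:
        while n > context_len:            # split only documents longer than the context
            pieces.append(context_len)
            n -= context_len
        if n:
            pieces.append(n)
    pieces.sort(reverse=True)             # best-fit decreasing: largest pieces first
    chunks, free = [], []                 # free[i] = remaining room in chunks[i]
    for p in pieces:
        fits = [i for i, f in enumerate(free) if f >= p]
        if fits:                          # tightest chunk that can still hold the piece
            i = min(fits, key=lambda j: free[j])
            chunks[i].append(p)
            free[i] -= p
        else:                             # otherwise open a new chunk
            chunks.append([p])
            free.append(context_len - p)
    return chunks

# e.g. a 3000-token doc is split once; every other document stays intact.
print(best_fit_pack([3000, 1500, 900, 600, 200], context_len=2048))
```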
Dan Roth retweeted
Zijian Wang @zijianwang30
The common practice in LLM pre-training is to concatenate all docs and then split them into equal-length chunks. This is efficient but hurts data integrity: document fragmentation leads to loss of information and causes next-token prediction to be ungrounded, making the model prone to hallucination. 🧵2/n
1 reply · 2 reposts · 4 likes · 1.2K views
Dan Roth retweeted
Zijian Wang @zijianwang30
🚀Introducing "Fewer Truncations Improve Language Modeling" at #ICML2024. We tackle a fundamental issue in LLM pre-training: docs are often broken into pieces. Such truncation hinders the model from learning to compose logically coherent and factually grounded content. 👇🧵1/n
4 replies · 10 reposts · 46 likes · 7.3K views
Dan Roth retweeted
Xingyu Fu @XingyuFu2
Can GPT-4V and Gemini-Pro perceive the world the way humans do? 🤔 Can they solve the vision tasks that humans can in the blink of an eye? 😉 tldr; NO, they are far worse than us 💁🏻‍♀️ Introducing BLINK👁 zeyofu.github.io/blink/, a novel benchmark that studies visual perception abilities that have NOT yet “emerged” in multimodal LLMs 🔥🔥 Paper: arxiv.org/abs/2404.12390 (1/n)
AK @_akhaliq

BLINK: Multimodal Large Language Models Can See but Not Perceive. We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans

9 replies · 124 reposts · 407 likes · 105.7K views
Dan Roth retweeted
vinayshekhar @vinayshekhar000
We are thrilled to announce our second workshop on natural language interfaces, held in conjunction with the prestigious IJCNLP-AACL conference! In collaboration with researchers from AWS AI Labs, Google Research, Meta AI Research, and Microsoft Research, this workshop aims to
1 reply · 3 reposts · 6 likes · 1.5K views
Dan Roth retweeted
Randall Hunt @ranman
I’ve been working with @awscloud’s #Bedrock service for a couple of months now at @caylentinc, and I’d like to share some of what I’ve learned. 🧵
8 replies · 90 reposts · 314 likes · 89.9K views
Dan Roth retweeted
Adam Seligman @adamse
aws.amazon.com/codewhisperer/ is really neat. Helps you code faster, checks for security vulns, discloses licenses of code it drew from, and works great for AWS APIs. Boom! @awscloud putting ML to work for developers
1 reply · 3 reposts · 5 likes · 0 views