Dan Roth

46 posts

@DanRothNLP

Chief AI Scientist, Oracle, and the Eduardo D. Glandt Distinguished Professor, CIS, University of Pennsylvania. Former VP/Distinguished Scientist, AWS AI Labs.

Philadelphia, PA · Joined May 2010
55 Following · 1.9K Followers
Dan Roth retweeted
EMNLP 2026 @emnlpmeeting
Social Impact Award: "AccessEval: Benchmarking Disability Bias in Large Language Models" by Srikant Panda, Amit Agarwal, and Hitesh Laxmichand Patel aclanthology.org/2025.emnlp-mai… 10/n
1 reply · 4 reposts · 13 likes · 3.7K views
Dan Roth retweeted
Siyi Liu @liusiyi64198
📷 New #EMNLP2025 Findings survey paper! “Conflicts in Texts: Data, Implications, and Challenges” Paper: aclanthology.org/2025.findings-…
Conflicts are everywhere in NLP: news articles reflecting different perspectives or opposing views, annotators who disagree, LLMs that hallucinate or contradict themselves, and personal/enterprise document collections that drift apart and contradict each other. Most research tackles these in isolation, and our survey provides the first unified view of conflicting information in NLP. We chart the path toward conflict-aware, reliable NLP systems.
Builds on our earlier work on:
- Multi-perspective dataset aclanthology.org/2021.naacl-mai… and search aclanthology.org/2022.findings-…
- Hallucination detection aclanthology.org/2025.findings-…
- Open-domain QA with conflicting contexts aclanthology.org/2025.findings-…
0 replies · 3 reposts · 12 likes · 773 views
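The survey above covers many kinds of conflict; as a minimal illustration of the simplest case (two statements that directly contradict each other), one common building block is a pairwise NLI check like the sketch below. The model choice and the scoring rule are assumptions for illustration, not the paper's method.

```python
# Minimal sketch: score how strongly one statement contradicts another with an
# off-the-shelf NLI model. Model name and scoring are illustrative assumptions;
# the survey itself covers far richer notions of conflict than this.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"  # assumption: any NLI model exposing a CONTRADICTION label
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def contradiction_score(statement_a: str, statement_b: str) -> float:
    """Probability that statement_b contradicts statement_a."""
    inputs = tokenizer(statement_a, statement_b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    idx = {label.upper(): i for i, label in model.config.id2label.items()}
    return probs[idx["CONTRADICTION"]].item()

# Two sources that disagree about the same fact should score high.
print(contradiction_score(
    "The company reported record profits in 2023.",
    "The company posted a net loss in 2023.",
))
```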
Dan Roth retweeted
Tomer Wolfson @TomerWolfson
✨Yesterday we released MoNaCo, an @allen_ai benchmark of 1,315 hard human-written questions that, on average, require 43.3 documents per question!✨ The three aforementioned questions were actually some of the easier ones in MoNaCo 😉 (8/) x.com/allen_ai/statu…
Ai2 @allen_ai

LLMs power research, decision‑making, and exploration—but most benchmarks don’t test how well they stitch together evidence across dozens (or hundreds) of sources. Meet MoNaCo, our new eval for question-answering cross‑source reasoning. 👇

1 reply · 1 repost · 3 likes · 614 views
Dan Roth retweeted
Ai2 @allen_ai
MoNaCo evaluates complex question-answering with:
📚 1,315 multi‑step queries
🔎 Retrieval, filtering & aggregation across text and tables
🌟 Avg 43.3 distinct documents per query
1 reply · 1 repost · 15 likes · 1.3K views
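To make the "retrieval, filtering & aggregation" framing above concrete, here is a rough sketch of the kind of cross-document pipeline a MoNaCo-style question exercises. The helper names (`retrieve`, `extract_records`) and the counting-style aggregation are hypothetical illustrations, not the benchmark's evaluation code.

```python
# Rough sketch of answering a question whose evidence is spread across many
# documents. `retrieve` and `extract_records` are hypothetical helpers; the
# aggregation step (counting entities) is just one example of many.
from collections import Counter
from typing import Callable, Iterable

def answer_multi_source(
    question: str,
    retrieve: Callable[[str, int], Iterable[str]],       # candidate documents for the question
    extract_records: Callable[[str, str], list[dict]],   # structured facts from one document
    top_k: int = 50,                                      # MoNaCo questions average ~43 documents
) -> dict:
    records = []
    for doc in retrieve(question, top_k):                 # 1. retrieval across sources
        records.extend(extract_records(question, doc))    # 2. per-document extraction
    filtered = [r for r in records if r.get("relevant")]  # 3. filtering
    counts = Counter(r["entity"] for r in filtered)       # 4. aggregation
    return {"answer": counts.most_common(), "supporting_facts": len(filtered)}
```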
Dan Roth retweeted
Ai2 @allen_ai
LLMs power research, decision‑making, and exploration—but most benchmarks don’t test how well they stitch together evidence across dozens (or hundreds) of sources. Meet MoNaCo, our new eval for question-answering cross‑source reasoning. 👇
10 replies · 38 reposts · 228 likes · 21.6K views
Dan Roth retweeted
Weijia Shi @WeijiaShi2
Augmenting GPT-4o with Visual Sketchpad ✏️ We introduce the Sketchpad agent, a framework that equips multimodal LLMs with a visual canvas and drawing tools 🎨, improving GPT-4o's performance on vision and math tasks 📈 🔗: visualsketchpad.github.io
Yushi Hu @huyushi98

Humans draw to facilitate reasoning and communication. Why not let LLMs do so? 🚀We introduce ✏️Sketchpad, which gives multimodal LLMs a sketchpad to draw on and facilitate reasoning! arxiv.org/abs/2406.09403
Sketchpad gives GPT-4o great boosts on many vision and math tasks 📈 The video shows how GPT-4o with Sketchpad reasons with interleaved visual and textual steps. For more, visit our project page: visualsketchpad.github.io
📌 For math tasks, ✏️Sketchpad lets LLMs draw auxiliary lines on geometry diagrams, plot functions and graphs, and even sketch games. GPT-4o does math better when it can sketch! (+12.7% acc on average)
📌 For computer vision tasks, ✏️Sketchpad lets LLMs sketch with vision specialists (e.g., GroundingDINO draws bounding boxes, SegmentAnything draws masks). Sketchpad substantially improves GPT-4o's vision abilities. GPT-4o + Sketchpad compared with prior SOTAs:
1️⃣ V*Bench: 75.4% -> 80.3%
2️⃣ BLINK correspondence: 42.4% -> 80.8%
3️⃣ BLINK relative depth: 67.7% -> 83.9%
4️⃣ BLINK spatial relation: 76.2% -> 81.1%
... See more interesting examples in the thread!

10 replies · 51 reposts · 284 likes · 50K views
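One way to picture the Sketchpad loop described above: the model alternates between textual reasoning and drawing actions, and each rendered sketch is fed back into the context as a new image turn. The sketch below is a schematic reading of that loop; `call_multimodal_llm` is a hypothetical wrapper around any vision-language API, and the matplotlib canvas stands in for the paper's richer toolset (auxiliary lines, plots, specialist detectors).

```python
# Schematic of a Sketchpad-style agent loop: the model proposes a drawing
# action, we render it, and the rendered image goes back into its context.
# `call_multimodal_llm` is a hypothetical wrapper, not an actual API.
import io
import matplotlib.pyplot as plt

def render_action(action: dict) -> bytes:
    """Render one drawing action (here only line segments) to PNG bytes."""
    fig, ax = plt.subplots()
    for (x0, y0), (x1, y1) in action.get("segments", []):
        ax.plot([x0, x1], [y0, y1])
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    return buf.getvalue()

def sketchpad_loop(question: str, image: bytes, call_multimodal_llm, max_steps: int = 5):
    context = [{"role": "user", "text": question, "image": image}]
    for _ in range(max_steps):
        step = call_multimodal_llm(context)      # expected: {"thought", "action"?, "answer"?}
        if "answer" in step:                     # model decides it has enough visual evidence
            return step["answer"]
        sketch = render_action(step["action"])   # draw what the model asked for
        context.append({"role": "tool", "text": step["thought"], "image": sketch})
    return None
```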
Dan Roth retweeted
Xingyu Fu @XingyuFu2
🔥Highlights of the Commonsense-T2I benchmark:
📚 Pairwise text prompts with minimal token changes
⚙️ Rigorous automatic evaluation with descriptions of expected outputs
❗️ Even DALL-E 3 achieves below 50% accuracy (2/n)
1 reply · 2 reposts · 10 likes · 1.6K views
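Reading the highlights above, the pairwise protocol can be pictured like this: each item carries two near-identical prompts plus a description of what a commonsense-respecting image should show, and an automatic judge checks both generations. The sketch below is an illustrative reading only; `generate_image` and `vlm_judge` are hypothetical stand-ins, and the example item is made up, not drawn from the benchmark.

```python
# Illustrative sketch of a pairwise, automatically judged T2I check.
# `generate_image` (text-to-image model) and `vlm_judge` (vision-language
# judge) are hypothetical stand-ins; the pass/fail rule is an assumption.
def evaluate_pair(item: dict, generate_image, vlm_judge) -> bool:
    """An item passes only if BOTH prompts yield images matching their expected output."""
    for prompt_key, expected_key in (("prompt_a", "expected_a"), ("prompt_b", "expected_b")):
        image = generate_image(item[prompt_key])
        verdict = vlm_judge(
            image=image,
            question=f"Does this image show: {item[expected_key]}? Answer yes or no.",
        )
        if not verdict.strip().lower().startswith("yes"):
            return False
    return True

# A made-up item purely for illustration; not from the Commonsense-T2I data.
example = {
    "prompt_a": "a lit candle inside a sealed glass jar after ten minutes",
    "expected_a": "the flame has gone out",
    "prompt_b": "a lit candle in an open glass jar after ten minutes",
    "expected_b": "the flame is still burning",
}
```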
Dan Roth retweeted
Xingyu Fu @XingyuFu2
Can Text-to-Image models understand common sense? 🤔 Can they generate images that fit everyday common sense? 🤔 tldr; NO, they are far less intelligent than us 💁🏻‍♀️ Introducing Commonsense-T2I 💡 zeyofu.github.io/CommonsenseT2I/, a novel evaluation and benchmark designed to measure commonsense reasoning in T2I models 🔥🔥 Paper: arxiv.org/abs/2406.07546 (1/n)
7 replies · 37 reposts · 130 likes · 48.8K views
Dan Roth retweeted
Zijian Wang @zijianwang30
Best-fit Packing completely eliminates unnecessary truncations while retaining the same training efficiency as concatenation, with <0.01% overhead, tested on popular pre-training datasets like @TIIuae's RefinedWeb and @BigCodeProject's Stack. 🧵5/n
1 reply · 1 repost · 3 likes · 631 views
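For intuition about the claim above: instead of concatenating everything and slicing at fixed boundaries, best-fit packing splits a document only when it is longer than the context window and then places every piece into the chunk with the least remaining room that can still hold it. The sketch below is a simplified reading of that idea (classic best-fit-decreasing over token lengths), not the paper's released implementation.

```python
# Simplified sketch of best-fit packing over document token lengths.
# Documents are split only when longer than the context window, then placed
# greedily into the tightest chunk that still fits (best-fit decreasing).
def best_fit_pack(doc_lengths: list[int], context_len: int = 2048) -> list[list[int]]:
    pieces = []
    for n in doc_lengths:
        while n > context_len:            # split only documents longer than the context
            pieces.append(context_len)
            n -= context_len
        if n:
            pieces.append(n)
    pieces.sort(reverse=True)             # best-fit decreasing: largest pieces first
    chunks, free = [], []                 # free[i] = remaining room in chunks[i]
    for p in pieces:
        fits = [i for i, f in enumerate(free) if f >= p]
        if fits:                          # tightest chunk that can still hold the piece
            i = min(fits, key=lambda j: free[j])
            chunks[i].append(p)
            free[i] -= p
        else:                             # otherwise open a new chunk
            chunks.append([p])
            free.append(context_len - p)
    return chunks

# e.g. a 3000-token doc is split once; every other document stays intact.
print(best_fit_pack([3000, 1500, 900, 600, 200], context_len=2048))
```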
Dan Roth retweeted
Zijian Wang @zijianwang30
The common practice in LLM pre-training is to concatenate all docs and then split them into equal-length chunks. This is efficient but hurts data integrity: document fragmentation leads to loss of information and causes next-token prediction to be ungrounded, making the model prone to hallucination. 🧵2/n
1 reply · 2 reposts · 4 likes · 1.2K views
Dan Roth retweeted
Zijian Wang @zijianwang30
🚀Introducing "Fewer Truncations Improve Language Modeling" at #ICML2024. We tackle a fundamental issue in LLM pre-training: docs are often broken into pieces. Such truncation hinders the model from learning to compose logically coherent and factually grounded content. 👇🧵1/n
4 replies · 10 reposts · 46 likes · 7.3K views
Dan Roth retweeted
Xingyu Fu @XingyuFu2
Can GPT-4V and Gemini-Pro perceive the world the way humans do? 🤔 Can they solve the vision tasks that humans can in the blink of an eye? 😉 tldr; NO, they are far worse than us 💁🏻‍♀️ Introducing BLINK👁 zeyofu.github.io/blink/, a novel benchmark that studies visual perception abilities that have NOT yet “emerged” in multimodal LLMs 🔥🔥 Paper: arxiv.org/abs/2404.12390 (1/n)
AK @_akhaliq

BLINK: Multimodal Large Language Models Can See but Not Perceive. We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans

9 replies · 124 reposts · 407 likes · 105.7K views
Dan Roth retweeted
vinayshekhar @vinayshekhar000
We are thrilled to announce our second workshop on natural language interfaces, held in conjunction with the prestigious IJCNLP-AACL conference! In collaboration with researchers from AWS AI Labs, Google Research, Meta AI Research, and Microsoft Research, this workshop aims to
1 reply · 3 reposts · 6 likes · 1.5K views
Dan Roth retweeted
Randall Hunt @ranman
I’ve been working with @awscloud’s #Bedrock service for a couple of months now at @caylentinc, and I’d like to share some of what I’ve learned. 🧵
8 replies · 90 reposts · 314 likes · 89.9K views
Dan Roth retweeted
Adam Seligman @adamse
aws.amazon.com/codewhisperer/ is really neat. Helps you code faster, checks for security vulns, discloses licenses of code it drew from, and works great for AWS APIs. Boom! @awscloud putting ML to work for developers
1 reply · 3 reposts · 5 likes · 0 views