Xingang Guo

12 posts

@Xingang20

Ph.D. student @ ECE UIUC

Champaign, IL · Joined May 2020

57 Following · 35 Followers

Xingang Guo retweeted

Utkarsh Tyagi @utkarsh4430 · May 20

1/ New from @ScaleAILabs: Rubrics (a.k.a. checklists) have become the default reward interface for RL on open-ended tasks without final verifiable answers. But most rubric RL still relies on static aggregation: fixed human weights over criteria, summed into one scalar reward. We show that this conflates what should matter in the final answer with what can actually teach the current policy. arxiv.org/abs/2605.20164
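As a rough illustration of the static aggregation this post describes (fixed weights over criteria, summed into one scalar), here is a minimal sketch. The criterion names, weights, and the function `static_rubric_reward` are hypothetical and are not taken from the paper.

```python
from typing import Dict


def static_rubric_reward(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Collapse per-criterion rubric scores into one scalar reward.

    The weights are fixed ahead of time and do not depend on the
    current policy, which is what makes the aggregation "static".
    """
    # Criteria missing from `scores` are treated as unmet (score 0.0).
    return sum(w * scores.get(name, 0.0) for name, w in weights.items())


# Hypothetical rubric: names and weights are illustrative only.
weights = {"correctness": 3.0, "clarity": 1.0}
scores = {"correctness": 1.0, "clarity": 0.0}
reward = static_rubric_reward(scores, weights)
```

Because the weights never change during training, two responses with very different criterion profiles can receive the same scalar, which is one way to read the conflation the post points to.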

8.2K

Xingang Guo @Xingang20 · Dec 3

📉 Frontier models score <35% overall. ⚡ Analog IC design → 0% across all models. Yet with iterative, feedback-driven refinement, strong reasoning models reach ~60% on solvable tasks—showing real promise for AI-assisted engineering workflows.

Xingang Guo @Xingang20 · Dec 3

Modern LLMs shine at Q&A, but real engineering design demands synthesis, constraints, simulation, and trade-off reasoning. So we built EngDesign: the first benchmark that tests whether LLMs can actually solve engineering design problems, not just answer exam-style questions.

🔧 EngDesign includes:
• 101 real-world engineering design tasks
• 9 engineering domains
• 473 rubric-based evaluation items
• MATLAB / SPICE / MQSim simulation pipelines

177

Xingang Guo @Xingang20 · Dec 3

Excited to share that our paper "Toward Engineering AGI: Benchmarking the Engineering Design Capabilities of LLMs" has been accepted to NeurIPS 2025 (Datasets & Benchmarks Track)! 🚀🚀🚀

📍 Poster: Thu, Dec 4, 2025 • 11 AM–2 PM PST • Exhibit Hall C,D,E • #1912

Would love to chat and connect. Please drop by our poster tomorrow!

2.9K

Xingang Guo retweeted

Bing Liu @vbingliu · Oct 16

🧠 Can your model think with images? Today we’re releasing VisualToolBench, a new benchmark for multimodal reasoning with tool use, that tests whether multimodal LLMs can think-with-images, not just think about them.

1.2K

Xingang Guo retweeted

Scale AI @scale_AI · Oct 15

📣 Releasing our newest benchmark, VisualToolBench (VTB), the first benchmark designed to evaluate how well multimodal large language models (MLLMs) can dynamically interact with and reason about visual information. VTB goes beyond thinking about images; it’s about thinking with them. The benchmark features leaderboard results across 16 diverse MLLMs, including reasoning, non-reasoning, open-source, and closed-source models.

4.6K

Xingang Guo @Xingang20 · Oct 30

🧠 Unlike humans, we found VLMs like GPT-4o often struggle to adapt solution steps to similar problems, revealing critical gaps in their reasoning abilities. 📉 Most existing visual math benchmarks are static, making it tough to truly assess robustness. To bridge this gap, we introduce DynaMATH—a dynamic visual math benchmark! 🔥🔍 ✨ How does DynaMATH work? It starts with 501 carefully crafted seed questions, each encoded as a Python program to generate diverse visual and textual question variations. 🔄💻 With DynaMATH, we created 10 unique variants for each seed question, totaling 5,010 questions! 🧩📊 We evaluate both average-case and worst-case accuracy, defining reasoning robustness as the ratio of worst-case to average-case accuracy. 🚨 Many models show reasoning robustness at or below 50%—a substantial drop! 📉⚠️
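The metrics named in this post can be sketched in a few lines. This is an illustrative reading, not the benchmark's own code: it assumes that worst-case accuracy counts a seed question as solved only when every one of its variants is answered correctly, and the function names are hypothetical.

```python
from typing import Sequence


def average_case_accuracy(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of all variants, across all seed questions, answered correctly."""
    total = sum(len(variants) for variants in results)
    correct = sum(sum(variants) for variants in results)
    return correct / total


def worst_case_accuracy(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of seed questions whose variants are ALL answered correctly.

    Assumption: a single wrong variant makes the whole seed count as failed.
    """
    return sum(all(variants) for variants in results) / len(results)


def reasoning_robustness(results: Sequence[Sequence[bool]]) -> float:
    """Ratio of worst-case to average-case accuracy, as defined in the post."""
    avg = average_case_accuracy(results)
    return worst_case_accuracy(results) / avg if avg > 0 else 0.0


# Toy data: 4 seed questions with 3 variants each (True = correct answer).
toy = [
    [True, True, True],
    [True, False, True],
    [False, False, False],
    [True, True, True],
]
avg = average_case_accuracy(toy)
worst = worst_case_accuracy(toy)
robustness = reasoning_robustness(toy)
```

On the toy data, 8 of 12 variants are correct but only 2 of 4 seeds are fully solved, so the ratio is 0.75. Applied to the 64.8% average and 35.3% worst-case figures quoted elsewhere in this thread, the same ratio is 35.3 / 64.8, or about 0.545.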

212

Xingang Guo @Xingang20 · Oct 30

🔍 Can Vision-Language Models (VLMs) truly reason about math, or are they just reflecting patterns they’ve seen before? 🤔 We tested GPT-4o and Claude-3.5 with our new benchmark, DynaMATH, and the results were eye-opening!

🧩 GPT-4o struggles to recognize when a shifted absolute value function is differentiable at x = 0, only getting it right when the non-differentiable point coincides with zero.
🧩 Claude-3.5, on the other hand, often insists the period of a sinusoidal function is always 6.28, regardless of context.

📉 These findings spark big questions: Can VLMs generalize and reason about math, or are they limited to familiar patterns?

🔗 Dive into our research to learn more!
Project: dynamath.github.io
Paper: github.com/DynaMath/DynaM…
Dataset: huggingface.co/datasets/DynaM…

#DynaMATH #AI #VisionLanguageModels #Mathematics #AIFailures #MachineLearning #Innovation #Benchmarking
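To make the shifted absolute value example concrete, here is a hedged, text-only sketch of how a seed question might be written as a program that emits variants. It is not the DynaMATH generator: the real benchmark is visual, and the names `make_variant` and `generate_variants` and the shift range are assumptions made for illustration.

```python
import random
from typing import Dict, List


def make_variant(shift: int) -> Dict[str, object]:
    """Build one variant of the question for f(x) = |x - shift|."""
    # |x - shift| has its kink at x = shift, so it fails to be
    # differentiable at x = 0 only when shift == 0.
    sign = "-" if shift >= 0 else "+"
    question = f"Is f(x) = |x {sign} {abs(shift)}| differentiable at x = 0?"
    answer = "no" if shift == 0 else "yes"
    return {"shift": shift, "question": question, "answer": answer}


def generate_variants(n: int, seed: int = 0) -> List[Dict[str, object]]:
    """Draw n variants with random integer shifts (hypothetical range -3..3)."""
    rng = random.Random(seed)
    return [make_variant(rng.randint(-3, 3)) for _ in range(n)]


variants = generate_variants(10)
```

A model that has only memorized "the absolute value function is not differentiable at zero" will answer "no" for every variant, which is correct only when the shift happens to be 0.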

3.6K

Xingang Guo @Xingang20 · Oct 30

🚀 DynaMATH is a dynamic benchmark that transforms 501 seed questions, each crafted as a Python program, into infinitely many concrete problems to test robustness and generalization. 🧠🔢 After analyzing 14 state-of-the-art VLMs, we uncovered striking accuracy gaps under varied input conditions. While Claude 3.5 Sonnet leads with a 64.8% average accuracy, it plunges to 35.3% in the worst-case scenario. 📉🔍 Our findings highlight the urgent need to strengthen VLMs' reasoning abilities, pushing them beyond pattern recognition to true mathematical understanding. 🌐🤖

235

Xingang Guo retweeted

Lianhui Qin @Lianhuiq · Mar 6

📢Introducing ❄️COLD-Attack⚔️, a unified framework for controllable jailbreaking of LLMs. Thanks to the controllability, COLD-Attack enables new jailbreak scenarios that are hard to detect🧐:
1⃣revising a user query adversarially with minimal paraphrasing
2⃣inserting stealthy attacks in a given context
3⃣generating fluent suffix attacks

🔥The energy-based text modeling (introduced in our earlier COLD decoding arxiv.org/abs/2202.11705) ensures diverse user-desired constraints simultaneously in attack, like success rate, stealthiness, semantic similarity, fluency, lexical constraints, …

📢COLD-Attack enables:
1⃣Compositional attack
2⃣10X faster than GCG
3⃣Great stealthiness due to high fluency and contextual coherence

xi1ngang.github.io/cold-attack/
arxiv.org/abs/2402.08679