Xingang Guo

12 posts

@Xingang20

Ph.D. student @ ECE UIUC

Champaign, IL · Joined May 2020
57 Following · 35 Followers
Xingang Guo retweeted
Utkarsh Tyagi @utkarsh4430 ·
1/ New from @ScaleAILabs: Rubrics (a.k.a. checklists) have become the default reward interface for RL on open-ended tasks without final verifiable answers. But most rubric RL still relies on static aggregation: fixed human weights over criteria, summed into one scalar reward. We show that this conflates what should matter in the final answer with what can actually teach the current policy. arxiv.org/abs/2605.20164
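To make "static aggregation" concrete, here is a minimal sketch of the baseline the tweet describes: fixed weights over rubric criteria, summed into one scalar reward. The criterion names, weights, and the function name `static_rubric_reward` are hypothetical and chosen only for illustration. This is not the method proposed in the linked paper.

```python
from typing import Dict


def static_rubric_reward(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-criterion scores using fixed (static) weights."""
    # Every weighted criterion must have a score; missing ones count as 0.
    return sum(weight * scores.get(name, 0.0) for name, weight in weights.items())


# Hypothetical rubric: weights are fixed up front and never adapt to the policy.
weights = {"correctness": 3.0, "clarity": 1.0, "citations": 1.0}
# Hypothetical judge output: 1.0 if the criterion is met, 0.0 otherwise.
scores = {"correctness": 1.0, "clarity": 0.0, "citations": 1.0}

reward = static_rubric_reward(scores, weights)
print(reward)
```

Because the weights are constant, a criterion that every sample already satisfies (or that none does) still contributes the same amount to the reward, which is the conflation the tweet points at.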
Xingang Guo @Xingang20 ·
📉 Frontier models score <35% overall. ⚡ Analog IC design → 0% across all models. Yet with iterative, feedback-driven refinement, strong reasoning models reach ~60% on solvable tasks—showing real promise for AI-assisted engineering workflows.
Xingang Guo @Xingang20 ·
Modern LLMs shine at Q&A, but real engineering design demands synthesis, constraints, simulation, and trade-off reasoning. So we built EngDesign: the first benchmark that tests whether LLMs can actually solve engineering design problems, not just answer exam-style questions.
🔧 EngDesign includes:
• 101 real-world engineering design tasks
• 9 engineering domains
• 473 rubric-based evaluation items
• MATLAB / SPICE / MQSim simulation pipelines
Xingang Guo @Xingang20 ·
Excited to share that our paper "Toward Engineering AGI: Benchmarking the Engineering Design Capabilities of LLMs" has been accepted to NeurIPS 2025 (Datasets & Benchmarks Track)! 🚀🚀🚀
📍 Poster: Thu, Dec 4, 2025 • 11 AM–2 PM PST • Exhibit Hall C,D,E • #1912
Would love to chat and connect. Please drop by our poster tomorrow!
Xingang Guo retweeted
Bing Liu @vbingliu ·
🧠 Can your model think with images? Today we’re releasing VisualToolBench, a new benchmark for multimodal reasoning with tool use, that tests whether multimodal LLMs can think-with-images, not just think about them.
Xingang Guo retweeted
Scale AI @scale_AI ·
📣 Releasing our newest benchmark, VisualToolBench (VTB), the first benchmark designed to evaluate how well multimodal large language models (MLLMs) can dynamically interact with and reason about visual information. VTB goes beyond thinking about images; it's about thinking with them. The benchmark features leaderboard results across 16 diverse MLLMs, including reasoning, non-reasoning, open-source, and closed-source models.
Xingang Guo @Xingang20 ·
🧠 We found that VLMs like GPT-4o, unlike humans, often struggle to adapt solution steps to similar problems, revealing critical gaps in their reasoning abilities.
📉 Most existing visual math benchmarks are static, which makes it hard to assess robustness. To bridge this gap, we introduce DynaMATH, a dynamic visual math benchmark! 🔥🔍
✨ How does DynaMATH work? It starts with 501 carefully crafted seed questions, each encoded as a Python program that generates diverse visual and textual question variations. 🔄💻
With DynaMATH, we created 10 unique variants for each seed question, totaling 5,010 questions! 🧩📊
We evaluate both average-case and worst-case accuracy, defining reasoning robustness as the ratio of worst-case to average-case accuracy.
🚨 Many models show reasoning robustness at or below 50%, a substantial drop! 📉⚠️
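The robustness metric above can be sketched in a few lines. The tweet only states that reasoning robustness is the ratio of worst-case to average-case accuracy; this sketch additionally assumes that a seed question counts as correct in the worst case only if every one of its variants is answered correctly, and that average-case accuracy is taken over all variants. The function names and the toy `results` data are hypothetical.

```python
from typing import Dict, List


def average_case_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of all variants (across all seeds) answered correctly."""
    total = sum(len(variants) for variants in results.values())
    correct = sum(sum(variants) for variants in results.values())
    return correct / total


def worst_case_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of seeds for which every variant was answered correctly (assumed definition)."""
    return sum(all(variants) for variants in results.values()) / len(results)


def reasoning_robustness(results: Dict[str, List[bool]]) -> float:
    """Ratio of worst-case to average-case accuracy."""
    avg = average_case_accuracy(results)
    return worst_case_accuracy(results) / avg if avg > 0 else 0.0


# Toy data: 4 seed questions with 3 variants each (the benchmark uses 501 seeds x 10 variants).
results = {
    "seed_1": [True, True, True],
    "seed_2": [True, False, True],
    "seed_3": [False, False, False],
    "seed_4": [True, True, True],
}

print(average_case_accuracy(results))
print(worst_case_accuracy(results))
print(reasoning_robustness(results))
```

A model that is right on most variants of a seed but wrong on one still loses that seed entirely in the worst case, which is why the ratio can fall well below 1 even when average accuracy looks healthy.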
Xingang Guo @Xingang20 ·
🔍 Can Vision-Language Models (VLMs) truly reason about math, or are they just reflecting patterns they've seen before? 🤔 We tested GPT-4o and Claude-3.5 with our new benchmark, DynaMATH, and the results were eye-opening!
🧩 GPT-4o struggles to recognize when a shifted absolute value function is differentiable at x = 0, only getting it right when the non-differentiable point coincides with zero.
🧩 Claude-3.5, on the other hand, often insists the period of a sinusoidal function is always 6.28, regardless of context.
📉 These findings spark big questions: Can VLMs generalize and reason about math, or are they limited to familiar patterns?
🔗 Dive into our research to learn more!
Project: dynamath.github.io
Paper: github.com/DynaMath/DynaM…
Dataset: huggingface.co/datasets/DynaM…
#DynaMATH #AI #VisionLanguageModels #Mathematics #AIFailures #MachineLearning #Innovation #Benchmarking
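As a rough illustration of the two failure cases above, here is a text-only sketch of what a program-based seed question could look like. It is an assumption-laden toy, not the benchmark's actual code: the real seed programs also render figures, and the names `Variant`, `make_abs_variant`, `generate_variants`, and `sine_period` are hypothetical.

```python
import math
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Variant:
    shift: int
    question: str
    answer: bool  # ground truth: is f differentiable at x = 0?


def make_abs_variant(rng: random.Random) -> Variant:
    """One variant of the shifted absolute value question, f(x) = |x - shift|."""
    shift = rng.choice([-2, -1, 0, 1, 2])
    question = f"Is f(x) = |x - ({shift})| differentiable at x = 0?"
    # |x - shift| has its only kink at x = shift, so it is differentiable at 0 unless shift == 0.
    return Variant(shift=shift, question=question, answer=(shift != 0))


def generate_variants(n: int, seed: int = 0) -> List[Variant]:
    """Generate n variants reproducibly from a fixed random seed."""
    rng = random.Random(seed)
    return [make_abs_variant(rng) for _ in range(n)]


def sine_period(omega: float) -> float:
    """Period of sin(omega * x); it equals 2*pi (about 6.28) only when |omega| == 1."""
    return 2 * math.pi / abs(omega)


for variant in generate_variants(3):
    print(variant.question, "->", variant.answer)
print(sine_period(2.0))
```

Because the ground truth is computed from the sampled parameter, every generated variant comes with a correct answer for free, which is what allows many variants per seed.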
Xingang Guo @Xingang20 ·
🚀 DynaMATH is a dynamic benchmark that transforms 501 seed questions, each crafted as a Python program, into infinitely many concrete problems to test robustness and generalization. 🧠🔢 After analyzing 14 state-of-the-art VLMs, we uncovered striking accuracy gaps under varied input conditions. While Claude 3.5 Sonnet leads with a 64.8% average accuracy, it plunges to 35.3% in the worst-case scenario. 📉🔍 Our findings highlight the urgent need to strengthen VLMs' reasoning abilities, pushing them beyond pattern recognition to true mathematical understanding. 🌐🤖
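For a sense of scale, the two Claude 3.5 Sonnet figures quoted above can be combined into a single ratio. This is a quick back-of-the-envelope calculation, assuming the ratio of worst-case to average-case accuracy is the quantity of interest; the variable names are illustrative.

```python
# Accuracies quoted in the tweet for Claude 3.5 Sonnet, in percent.
average_acc = 64.8
worst_acc = 35.3

# Ratio of worst-case to average-case accuracy.
robustness = worst_acc / average_acc
print(f"{robustness:.1%}")  # roughly 54.5%
```

So even the leading model keeps only a little over half of its average accuracy when judged on its worst variants.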
Xingang Guo retweeted
Lianhui Qin @Lianhuiq ·
📢 Introducing ❄️COLD-Attack⚔️, a unified framework for controllable jailbreaking of LLMs. Thanks to its controllability, COLD-Attack enables new jailbreak scenarios that are hard to detect 🧐:
1⃣ revising a user query adversarially with minimal paraphrasing
2⃣ inserting stealthy attacks in a given context
3⃣ generating fluent suffix attacks
🔥 The energy-based text modeling (introduced in our earlier COLD decoding, arxiv.org/abs/2202.11705) satisfies diverse user-desired constraints simultaneously in the attack, such as success rate, stealthiness, semantic similarity, fluency, lexical constraints, …
📢 COLD-Attack offers:
1⃣ Compositional attacks
2⃣ Attacks 10X faster than GCG
3⃣ Great stealthiness due to high fluency and contextual coherence
xi1ngang.github.io/cold-attack/ arxiv.org/abs/2402.08679
Xingang Guo retweeted
US Open Tennis @usopen ·
No words.
Xingang Guo retweeted
IFAC_Control @IFAC_Control ·
EUROPEAN CONTROL CONFERENCE 2021
June 29 - July 2, 2021
Rotterdam - De Doelen
ecc21.euca-ecc.org