Martin Vechev

226 posts

@mvechev

Professor of Computer Science, ETH Zurich. Founder of INSAIT (https://t.co/bqKTA6e8X0). Works on Safe/Secure AI, LLMs, Quantum. Co-founder of 6 Deep-Tech start-ups.

Joined June 2012

26 Following · 1.9K Followers

Martin Vechev retweeted

Jasper Dekoninck@j_dekoninck·13 Mar

How often do LLMs claim to prove false mathematical statements? In our latest benchmark, BrokenArXiv, we find they do so very often. The best model, GPT-5.4, only rejects 40% of incorrect statements obtained by perturbing recent ArXiv papers, and other models do much worse.

Martin Vechev@mvechev·24 Feb

some of our work on (not) using CLAUDE.md/AGENTS.md has been widely profiled recently.

Theo - t3.gg@theo

You should delete your CLAUDE.md/AGENTS.md file. I have a study to prove it.

Martin Vechev retweeted

Niels Mündler@nielstron·17 Feb

Are AGENTS.md files actually useful for coding agents? In our latest preprint, we analyze the effect of context files on coding model performance. Short TLDR: It depends. Manually-written files help, while LLM-generated files (e.g. by the agent) don't. More details in the thread 🧵

Martin Vechev retweeted

Jasper Dekoninck@j_dekoninck·28 Nov

How far can we push LLM performance on Project Euler (PE), a set of challenging mathematical programming problems? In a new blog, we explore this by designing specialized agents. With the help of the PE community, we also perform a qualitative analysis of model performance. 🧵

Martin Vechev@mvechev·18 Nov

Grateful that Google gave us access to Gemini 3 to evaluate its performance on MathArena (math performance) before the release and these results are now included in their release and model card! Gemini 3 is topping all categories on MathArena, great work!

Nikola Jovanović@ni_jovanovic

Full @GoogleAI Gemini 3 results are up on MathArena: ➡️ #1 on 2025 Final-Answer Competitions ➡️ #1 on Apex: 5.2% -> 23.4% new SOTA ➡️ #1 on Visual Math: 79% -> 84% new SOTA ➡️ #2 on Project Euler: 62%, huge jump compared to 2.5 Pro (15%)

Martin Vechev retweeted

INSAIT Institute@INSAITinstitute·21 Oct

🎥 Can AI video models truly understand physics? The newly released Physics-IQ benchmark, developed by INSAIT and Google DeepMind and led by Saman Motamed, PhD student at INSAIT, has sparked major discussion across the AI community following its presentation at #ICCV2025.

🔬 The work marks a significant step forward in understanding the physical reasoning limits of today’s generative video models - paving the way for future AI systems that not only generate realistic videos but also reason about the physical world with accuracy and depth.

📊 Physics-IQ provides a comprehensive benchmark of 396 real-world videos covering diverse physical scenarios, from fluid dynamics to solid mechanics, and challenges AI models to predict future frames and interactions beyond surface-level visual cues.

🤔 The findings were eye-opening: even state-of-the-art models like #Sora, #Runway, and #VideoPoet create visually stunning clips but fail to capture true physical dynamics, revealing the gap between perception and understanding.

🚀 The project has been met with great interest from the research community, highlighting the importance of integrating experiential and interactive learning into next-generation video models.

📂 Explore the open-source dataset, evaluation code, and results - links in comments

#GenerativeAI #VideoAI #AIResearch #PhysicsInAI #PhysicalReasoning #AIUnderstanding #AIBenchmark #OpenSourceAI #FutureOfAI #AIInnovation #INSAIT

Martin Vechev retweeted

INSAIT Institute@INSAITinstitute·23 Oct

🔥 We’re releasing SPEAR-1 (spear.insait.ai) - a new robotic AI foundation model that achieves state-of-the-art performance with 20× less robotic data

🧠 Why it matters: SPEAR-1 is like the ChatGPT for robots - a single model that can perform many tasks, on any robot, in any environment.

💡 What’s new: unlike others, SPEAR-1 learns from both robotic and non-robotic 3D data, breaking the data bottleneck that slows robotic AI.

🤖 Open-weight, general-purpose, and multilingual for robots - a major step toward scalable robot learning.

#Robotics #FoundationModels #3DPerception #Manipulation #INSAIT #Europe #DataEfficiency

Martin Vechev retweeted

INSAIT Institute@INSAITinstitute·25 Oct

🚀 Big news for INSAIT, Bulgaria & Europe! @WIRED (read by 30M+ people/month) just profiled SPEAR-1 - INSAIT’s new foundation robotic model, the first released by Europe!

🤖 As WIRED notes, SPEAR-1 matches leading global models trained on many times more data - a huge leap toward ChatGPT-like AI for robotics!

👏 Congrats to all involved: Nikolay Nikolov, @giualbanese1, @DSombit, Jan-Nico Zaech, Danda Pani Paudel, Luc Van Gool & Alex Yanev.

#AI #Robotics #Europe #INSAIT

Martin Vechev retweeted

INSAIT Institute@INSAITinstitute·28 Oct

🚀 Major news! @Google expands its support for INSAIT with a new $1,000,000 contribution - targeting groundbreaking AI research and expanding INSAIT’s local ecosystem initiatives.

💰 Google’s total support for INSAIT now well exceeds $6,000,000, further strengthening INSAIT’s capacity to conduct world-class research and cultivate the next generation of AI talent in Bulgaria.

🌍 This milestone builds on years of collaboration - from @GoogleDeepMind PhD fellowships to funding for training AI models - helping position INSAIT as a world-class AI research organization. Google has also twice profiled BgGPT, the first Bulgarian LLM, built by INSAIT, in a series of articles read by millions around the world.

✨ A huge thank you to Google for the continued trust and meaningful support to INSAIT!

#INSAIT #Google #AI #BgGPT #Innovation #Research #DeepMind #Gemma #Bulgaria

Martin Vechev retweeted

INSAIT Institute@INSAITinstitute·22 Oct

🚀 INSAIT makes an impact at @ICCVConference! With more than 10,000 participants, ICCV is the world’s leading conference in AI and computer vision - and INSAIT’s booth has become one of its most vibrant spots. Hundreds of researchers, industry leaders, and students stopped by to discover what INSAIT is all about and to connect with our team.

🤖 From cutting-edge robotics and embodied AI to next-generation foundation models, visitors experienced firsthand how INSAIT is pushing the boundaries of global AI research.

🇧🇬 We’re proud to represent Bulgaria on the world stage - proving that world-class deep tech innovation thrives right here in Sofia.

#ICCV2025 #INSAIT #AI #ComputerVision #Research #DeepTech #Bulgaria #Robotics

Martin Vechev retweeted

Nikola Jovanović@ni_jovanovic·22 Oct

MathArena most viewed on alphaXiv😲 Cool work @askalphaxiv (although Apex is a different dataset, you should title this USAMO and link to our USAMO eval instead)

alphaXiv@askalphaxiv

We used DeepSeek OCR to extract every dataset from tables/charts across 500k+ AI arXiv papers for $1000 🚀 See which benchmarks are trending and discover datasets you didn't know existed Doing the same task with Mistral OCR would've cost $7500 👀

Martin Vechev retweeted

Jasper Dekoninck@j_dekoninck·20 Oct

New competition on MathArena 🥳 This is a nice one, and I can highly recommend checking out some of the traces. Seeing GPT-5 write dozens of pages and still fail on a problem you can solve in <10sec is very satisfying.

Nikola Jovanović@ni_jovanovic

MathArena goes visual: We evaluated models such as GPT-5 on Math Kangaroo 2025, a recent contest for ages 6-19 where most tasks require visual reasoning. Models struggle the most with tasks for younger kids. For example, they solve this task for 1st graders only 3% of the time 🧵

Martin Vechev retweeted

Nikola Jovanović@ni_jovanovic·20 Oct

Martin Vechev retweeted

Kazuki Egashira@kazukiega·13 Oct

🚨 Be careful when pruning an LLM! 🚨 Even when the model appears benign, it might start behaving maliciously (e.g., jailbroken) once you download and prune it. Here’s how our attack works 🧵 arxiv.org/abs/2510.07985

Martin Vechev retweeted

INSAIT Institute@INSAITinstitute·13 Oct

🚀⚛️ Major result: we are announcing qblaze - a state-of-the-art quantum simulator, built by researchers at INSAIT and @ETH_en!

🥇 qblaze sets a record for the largest number factored to date with Shor’s algorithm by a quantum circuit simulator - a 39-bit number (549 755 813 701). In comparison, despite recent advances, the largest number factored on an actual quantum computer to date with Shor's algorithm is 21.

📜 qblaze matches the previous record set with the specialized (for Shor’s algorithm) emulator shorgpu - except shorgpu used 2048 GPUs, while qblaze only uses 2 CPUs!

⚡ qblaze outperforms publicly available industry quantum simulators, including @IBM’s Qiskit Aer and @Microsoft’s Q#, and on Shor and Grover achieves a speed-up of over 2000x!

🧠 qblaze scales thanks to a novel sparse data structure and highly-optimized parallel algorithms - the research paper describing qblaze’s operation was accepted at ACM OOPSLA’25, and will be presented this week in Singapore. OOPSLA is a top research conference in programming languages and systems.

💻 qblaze is fully open source, documented, has an easy-to-use Python API and can be used as a drop-in replacement for IBM’s Qiskit simulators as well as other quantum frameworks. Everything about qblaze can be found at qblaze.org.

👏 Congratulations to all qblaze authors: Hristo Venev (INSAIT), Dimitar Dimitrov (INSAIT), Timon Gehr (ETH Zurich), Martin Vechev (INSAIT, ETH Zurich) and Thien Udomsrirungruang (former INSAIT summer research fellow).

Martin Vechev retweeted

INSAIT Institute@INSAITinstitute·10 Oct

🚀 We are excited to announce BrokenMath, the first benchmark designed to systematically evaluate sycophancy in theorem proving with large language models (LLMs), now live at sycophanticmath.ai!

🧩 We show that even the best LLMs can produce convincing but wrong proofs when given false statements by users - a behavior known as sycophancy. This poses a major challenge for deploying AI systems in math and science, where truthfulness and rigor are essential.

📘 BrokenMath introduces 504 expertly verified false theorems derived from 2025 national and international math competition problems, creating a realistic and challenging environment for studying model reliability and reasoning integrity.

📊 The results show that sycophancy is widespread, with even GPT-5 producing proofs for false statements 29% of the time. The issue worsens as problems become more difficult and when tasks involve proof-based reasoning. While mitigation strategies such as prompting, agentic reasoning, and fine-tuning provide partial relief, none fully resolve the issue.

🌐 Explore the benchmark, datasets, and paper - links in comments.

👩‍🔬 Congratulations to all authors: Ivo Petrov (INSAIT), Jasper Dekoninck (ETH Zürich), Martin Vechev (INSAIT, ETH Zürich)

Martin Vechev retweeted

INSAIT Institute@INSAITinstitute·6 Oct

🤖 INSAIT had a strong presence at #CoRL2025 - the leading conference in AI for robotics - held this year in Seoul, which gathered more than 2500 participants!

🚀 In exciting news, INSAIT’s robotics team was one of two able to qualify for RoboArena - a challenge for evaluating robotics foundation AI models. INSAIT’s model was able to outperform state-of-the-art models such as Physical Intelligence’s pi0, while being trained on 10x less data. Keep an eye on the release, coming soon!

🧠 We also presented MotoVLA, a new method for training robotics AI systems that drastically reduces the need for large labeled datasets, moving us a step closer to generalist robotic systems.

🏙️ Congratulations to Alexander Marc Spiridonov, Nikolay Nikolov, and Giuliano Albanese, who represented INSAIT at CoRL 2025!

🔮 It's exciting to see that Bulgaria, with INSAIT, is now at the forefront of the emerging direction of physical intelligence!

Martin Vechev retweeted

Nikola Jovanović@ni_jovanovic·23 Sep

MathArena Update: Claims about Grok 4 Fast seem to check out, it matches the performance of Grok 4 but is much faster and 20-50x cheaper. Good release! This holds across final-answer competitions, Apex problems, and Project Euler. 🧵

Martin Vechev retweeted

Jasper Dekoninck@j_dekoninck·12 Sep

A new open reasoning model, K2-Think, was recently released boasting scores comparable to GPT-OSS 120B and getting a lot of media attention. However, their performance relies on flawed evaluation marked by contamination, unfair comparisons, and misrepresentation of results. 🧵

Martin Vechev retweeted

Nikola Jovanović@ni_jovanovic·18 Aug

Introducing MathArena Apex: A set of curated final-answer problems from recent competitions that even the best LLMs still can't solve. Top models are correct at most 5% of the time 🧵 (1/8)
