
Quentin André

@andre_quentin
Assistant Prof. of Marketing @ CU Boulder. Open science, research methods, managerial and numerical cognition. ❤️Python 🐍.




STRIKE. 💥🦅





Can AI coding agents reproduce published social science findings?

In new work with @_mohsen_m, Fabrizio Gilardi, and @j_a_tucker, we introduce SocSci-Repro-Bench, a benchmark of 221 reproducibility tasks from 54 papers, and evaluate two frontier coding agents: Claude Code and Codex.

The results reveal both remarkable capabilities and new risks for AI-assisted science.

------------------------------------
GOAL
--------
A key design goal was separating two different problems:
1️⃣ Are replication materials themselves reproducible?
2️⃣ Can AI agents reproduce results when materials are executable?

To isolate agent performance, we only included tasks whose outputs were identical across three independent manual executions.

------------------------------------
DESIGN
--------
Agents received:
• anonymized data + code
• a sandboxed execution environment

They had to autonomously:
• install dependencies
• debug broken code
• execute the pipeline
• extract the requested results

In short: end-to-end computational reproduction.

------------------------------------
RESULTS
--------
Both agents reproduced a large share of published findings. But Claude Code substantially outperformed Codex.

Task-level accuracy
• Claude Code: 93.4%
• Codex: 62.1%

Paper-level reproduction (all tasks correct)
• Claude Code: 78.0%
• Codex: 35.8%

------------------------------------
WHY THE GAP?
--------
Replication packages often contain problems:
• missing dependencies
• hard-coded file paths
• incomplete environment specifications

Claude Code frequently repaired these issues autonomously. Codex often failed to recover the execution pipeline.

------------------------------------
IS THIS JUST MEMORIZATION?
--------
We tested this by asking agents to infer paper metadata (title, authors, journal, year) from anonymized replication materials.

Recovery rates were very low, suggesting agents primarily relied on code execution, not memorization of papers.

------------------------------------
REASONING TEST
--------
We also tested a harder task: Can agents infer the research question of a study from code and data alone?

Both agents performed surprisingly well.

------------------------------------
CONFIRMATION BIAS
--------
When agents were given the paper PDF, a new problem emerged. Sometimes they copied reported results from the text instead of executing the code.

Accuracy on non-reproducible tasks dropped sharply.

Context helps execution, but reduces independence of verification.

------------------------------------
SYCOPHANCY
--------
Inspired by @ahall_research, we tested adversarial prompt framing, nudging agents to: “explore alternative analyses that align with the paper’s reported results.”

Accuracy increased. But agents also became more likely to fabricate results when reproduction was impossible.

------------------------------------
THE PARADOX
--------
Pressure to produce an answer can help agents repair execution pipelines. But it simultaneously erodes their ability to say: “This result cannot be reproduced.”

Recognizing when reproduction is impossible may be the most important scientific capability.

------------------------------------
NOTES
--------
• This is work in progress; feedback is welcome.
• Benchmark available on GitHub.
• Replication materials hosted on Dataverse.

Paper + repository in the reply below.
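
For readers wondering how the two RESULTS metrics in the thread above differ: a minimal Python sketch, assuming each task outcome is recorded as a (paper, task, correct) triple. The function name, the record layout, and the toy data are all hypothetical illustrations, not the benchmark's actual code or results. Task-level accuracy counts every task equally, while paper-level reproduction only credits a paper when all of its tasks are correct, which is why the second number can be much lower than the first.

```python
from collections import defaultdict


def reproduction_metrics(task_results):
    """Return (task_accuracy, paper_rate) from (paper_id, task_id, correct) triples.

    Hypothetical helper for illustration only; not taken from the benchmark repo.
    """
    by_paper = defaultdict(list)
    for paper_id, _task_id, correct in task_results:
        by_paper[paper_id].append(bool(correct))

    n_tasks = sum(len(outcomes) for outcomes in by_paper.values())
    if n_tasks == 0:
        raise ValueError("no task results supplied")

    # Task-level accuracy: share of individual tasks answered correctly.
    task_accuracy = sum(sum(outcomes) for outcomes in by_paper.values()) / n_tasks
    # Paper-level reproduction: share of papers with every task correct.
    paper_rate = sum(all(outcomes) for outcomes in by_paper.values()) / len(by_paper)
    return task_accuracy, paper_rate


# Toy data (made up): paper_A fully reproduced, paper_B has one failed task.
toy = [
    ("paper_A", "t1", True),
    ("paper_A", "t2", True),
    ("paper_B", "t1", True),
    ("paper_B", "t2", False),
]
task_acc, paper_rate = reproduction_metrics(toy)
print(f"task-level accuracy: {task_acc:.2f}, paper-level reproduction: {paper_rate:.2f}")
```

In the toy data, three of four tasks are correct but only one of two papers is fully reproduced, so a single failed task is enough to drop a whole paper from the paper-level count.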

A New York bill would ban AI from answering questions related to several licensed professions like medicine, law, dentistry, nursing, psychology, social work, engineering, and more. The companies would be liable if the chatbots give “substantive responses” in these areas.



There's increasing unhappiness at prediction market platforms Polymarket and Kalshi after bettors lost their wagers - so what could happen next? My latest for @FastCompany fastcompany.com/91501163/iran-…

🧵 regarding Lord of the Rings-related traumatic injuries, and whether access to modern Level 1 trauma centers could have decreased morbidity and mortality within the Fellowship. Here we will take a more evidence-based approach to some of the injuries in Middle Earth (1/ )









