
ETH Zurich stress-tested AGENTS.md files on Claude Code, Codex, and Qwen Code across 438 real tasks.
LLM-generated context files made agents worse. Success rates dropped, costs jumped more than 20%, and reasoning-token usage rose 22%. The agents followed every instruction faithfully - that was the problem. More instructions meant more aimless exploration.
Turns out agents are already good at navigating code. Most of what we're putting in these files is stuff they'd find on their own with grep.
Human-written files helped slightly, but only when kept brutally short. The paper's advice: skip LLM-generated context files entirely, and keep human-written ones to the minimal, essential requirements.
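For illustration, a "minimal, essential requirements" file in that spirit might look like the hypothetical sketch below - a few lines of project-specific constraints an agent couldn't discover with grep, and nothing else:

```markdown
# AGENTS.md (hypothetical example)
- Run tests with `make test` before committing.
- Never edit files under `generated/`.
- All user-facing strings go through the i18n helper.
```

Anything the agent could learn by reading the code itself - directory layout, framework choices, function lists - stays out.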