
Shady
@ShadyAlii0
Learning, and trying to make the Machine Learn | Research Assistant @MinnesotaNLP











🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵







Johan's submission uses a multi-model ensemble. It runs the same task through GPT-5.2, Gemini-3, and Claude Opus 4.5 in parallel, trying multiple times with different prompting strategies (standard, deep thinking, with images). Instead of predicting the grid directly, the LLMs write Python functions that describe the transformation rule, and that code is executed in a sandbox to produce the answer. After many candidate answers are collected, separate AI "judge" models evaluate and vote on which solution is most likely correct. See the repo here: github.com/beetree/ARC-AGI
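
A rough sketch of the generate-code, execute, then vote loop described above. This is not the code from the repo. Everything here is an assumption for illustration: the names `run_candidate` and `solve` are hypothetical, the candidate programs are hard-coded strings standing in for LLM output, plain `exec` stands in for a real sandbox, and a simple majority vote (after checking each candidate against the training pairs) is swapped in for the LLM judge models.

```python
import copy
from collections import Counter
from typing import List, Optional, Tuple

Grid = List[List[int]]


def run_candidate(source: str, grid: Grid) -> Optional[Grid]:
    """Run candidate code that defines transform(grid); return None on any error.

    Assumption: plain exec() is used here for brevity. It is NOT a sandbox.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        # Deep-copy the input so a candidate cannot mutate shared data.
        return namespace["transform"](copy.deepcopy(grid))
    except Exception:
        return None


def solve(
    candidates: List[str],
    train_pairs: List[Tuple[Grid, Grid]],
    test_input: Grid,
) -> Optional[Grid]:
    """Majority vote over outputs of candidates that reproduce every training pair."""
    votes: Counter = Counter()
    for source in candidates:
        # Assumed filter: drop candidates that fail any training example.
        if not all(run_candidate(source, x) == y for x, y in train_pairs):
            continue
        output = run_candidate(source, test_input)
        if output is not None:
            # Grids are converted to tuples so they can be counted.
            votes[tuple(tuple(row) for row in output)] += 1
    if not votes:
        return None
    best, _ = votes.most_common(1)[0]
    return [list(row) for row in best]


# Hypothetical candidates standing in for code written by different models.
CANDIDATES = [
    "def transform(grid):\n    return [row[::-1] for row in grid]",
    "def transform(grid):\n    return [list(reversed(row)) for row in grid]",
    "def transform(grid):\n    return grid[::-1]",      # wrong rule, filtered out
    "def transform(grid):\n    return grid[0][99]",     # raises, filtered out
]
TRAIN = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]

if __name__ == "__main__":
    print(solve(CANDIDATES, TRAIN, [[5, 6, 7]]))
```

Voting on executed outputs rather than on the code itself means two differently written programs that implement the same rule reinforce each other, which is one plausible reason to have models write functions instead of predicting the grid directly.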

Reading Torch’s codebase and feeling like a big fucking failure of a cs student rn














