

Steven-Shine Chen
84 posts

@stevenshinechen
CS Master's Student at @MIT, previously @imperialcollege. Researching multimodal reasoning at the MIT @medialab.








Very mature and humble of the president of ARC-AGI to ask for feedback on its benchmarks. It's a strong signal that ARC-AGI could improve and hopefully won't "derail the quest for AGI". x.com/redtachyon/sta…



As test-time compute scales, we need evals for long-horizon, open-ended reasoning.
Introducing PuzzleWorld 🧩, a multimodal puzzlehunt benchmark with human-annotated reasoning traces, testing diverse, creative reasoning.
Paper: arxiv.org/abs/2506.06211
Data: github.com/MIT-MI/PuzzleW…

prompt optimization + context distillation are underexplored primitives for post-training pipelines imo



Announcing the ARC Prize 2025 Top Score & Paper Award winners.
The Grand Prize remains unclaimed.
Our analysis on AGI progress marking 2025 the year of the refinement loop.




The @ilyasut episode
0:00:00 – Explaining model jaggedness
0:09:39 – Emotions and value functions
0:18:49 – What are we scaling?
0:25:13 – Why humans generalize better than models
0:35:45 – Straight-shotting superintelligence
0:46:47 – SSI’s model will learn from deployment
0:55:07 – Alignment
1:18:13 – “We are squarely an age of research company”
1:29:23 – Self-play and multi-agent
1:32:42 – Research taste
Look up Dwarkesh Podcast on YouTube, Apple Podcasts, or Spotify. Enjoy!



OpenAI released a new paper: "Why language models hallucinate"
Simple answer: LLMs hallucinate because training and evaluation reward guessing instead of admitting uncertainty.
The paper puts this on a statistical footing with simple, test-like incentives that reward confident wrong answers over honest “I don’t know” responses.
The fix is to grade differently: give credit for appropriate uncertainty and penalize confident errors more than abstentions, so models stop being optimized for blind guessing.
OpenAI shows that 52% abstention gives substantially fewer wrong answers than 1% abstention, which indicates that letting a model admit uncertainty reduces hallucinations even if accuracy looks lower.
Abstention means the model declines to answer when it is unsure and simply says something like “I don’t know” instead of making up a guess.
Hallucinations drop because most wrong answers come from bad guesses. If the model abstains instead of guessing, it produces fewer false answers.
🧵 Read on 👇
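A rough sketch of the incentive argument in the post above, in Python. The scoring rule here (+1 for a correct answer, 0 for abstaining, a configurable penalty for a wrong answer) and the function names are assumptions for illustration, not the paper's exact setup. It shows that with no penalty for wrong answers, guessing never scores worse than abstaining, while a penalty makes abstention the better choice below a confidence threshold.

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score for answering: +1 if right, -wrong_penalty if wrong.

    Abstaining is assumed to score exactly 0 (illustrative assumption).
    """
    return p_correct - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float) -> bool:
    """Answer only when the expected score beats the 0 earned by abstaining."""
    return expected_score(p_correct, wrong_penalty) > 0.0


def break_even_confidence(wrong_penalty: float) -> float:
    """Confidence at which answering and abstaining have equal expected score."""
    return wrong_penalty / (1.0 + wrong_penalty)


if __name__ == "__main__":
    # Binary grading (no penalty): even a 30%-confident guess is worth making.
    print(should_answer(0.3, wrong_penalty=0.0))
    # Penalized grading: the same guess now loses to "I don't know".
    print(should_answer(0.3, wrong_penalty=3.0))
    # With a penalty of 3, answering pays off only above this confidence.
    print(break_even_confidence(3.0))
```

Under this toy rule, a grader that wants models to answer only above some confidence can pick the penalty to match that threshold; the specific numbers are hypothetical.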




Since my undergraduate days at CMU, I've been participating in puzzlehunts: complex, multi-step puzzles that lack well-defined problem definitions, rely on creative and subtle hints and esoteric world knowledge, and require language, spatial, and sometimes even physical interaction. These are major challenges for humans, requiring expert teams hours or even days to solve, and even greater challenges for AI.
I'm excited to release our research endeavors towards benchmarking and building AI for solving puzzles! Our first step is PuzzleWorld: a new benchmark of puzzlehunt problems challenging models to think creatively with language, spatial, and physical reasoning. AI that can successfully solve puzzles has a direct impact on education, logic, scientific discovery, and more.
Paper - arxiv.org/abs/2506.06211
Dataset - github.com/MIT-MI/PuzzleW…
See the full thread by @mmtjandrasuwita for more details!


