George Mourginakis
286 posts




@yacineMTB Is the exposed propeller for cutting dandelions?

Jagged Intelligence The word I came up with to describe the (strange, unintuitive) fact that state of the art LLMs can both perform extremely impressive tasks (e.g. solve complex math problems) while simultaneously struggle with some very dumb problems. E.g. example from two days ago - which number is bigger, 9.11 or 9.9? Wrong. x.com/karpathy/statu… or failing to play tic-tac-toe: making non-sensical decisions: x.com/polynoamial/st… or another common example, failing to count, e.g. the number of times the letter "r" occurs in the word "barrier", ChatGPT-4o claims it's 2: x.com/karpathy/statu… The same is true in other modalities. State of the art LLMs can reasonably identify thousands of species of dogs or flowers, but e.g. can't tell if two circles overlap: x.com/fly51fly/statu… Jagged Intelligence. Some things work extremely well (by human standards) while some things fail catastrophically (again by human standards), and it's not always obvious which is which, though you can develop a bit of intuition over time. Different from humans, where a lot of knowledge and problem solving capabilities are all highly correlated and improve linearly all together, from birth to adulthood. Personally I think these are not fundamental issues. They demand more work across the stack, including not just scaling. The big one I think is the present lack of "cognitive self-knowledge", which requires more sophisticated approaches in model post-training instead of the naive "imitate human labelers and make it big" solutions that have mostly gotten us this far. For an example of what I'm talking about, see Llama 3.1 paper section on mitigating hallucinations: x.com/karpathy/statu… For now, this is something to be aware of, especially in production settings. Use LLMs for the tasks they are good at but be on a lookout for jagged edges, and keep a human in the loop.


Automating AI research is the next major step in AI We let Claude Code (Opus 4.7) and Codex (GPT 5.5) run autonomously on the nanoGPT speedrun optimizer track using our idle compute. ~10k runs, ~14k H200 hours Opus now holds the record at 2930 steps vs the 2990 human baseline







For any researchers in my network: I want to take research more seriously to produce useful info, I have no academic background. Beyond prompting what resources and practices would you recommend?

Meanwhile, Meese Engine can stream from disk at hypersonic speeds on an office laptop, at a pure 60 FPS with a render distance of 64 chunks.






chagpt images 2.0 can cook hard








