

Jonathan @SF

@lightetal
I’m a founding member @AsariAILabs and a PhD researcher @Caltech @RPI @NEC working on LLM agents, reasoning, RL, test-time scaling, and computer-use agents.



Post-training LLMs is like mixing a cocktail:
Too much easy data → no learning
Too much hard data → instability
Wrong balance → collapse
And today, we mix it by hand.
What if the data mixture could be learned instead of hand-tuned?
arxiv.org/abs/2602.20532 🧵👇
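The general idea of a learned mixture can be sketched as a bandit over difficulty buckets (this toy is my own illustration, not the paper's algorithm; the gradient-bandit update, bucket names, and reward function are all assumptions):

```python
import math
import random

def learn_mixture(buckets, reward_fn, steps=500, lr=0.1, seed=0):
    """Learn sampling weights over data buckets with a gradient-bandit
    update: buckets whose samples yield more learning progress (reward)
    get a larger share of the mixture."""
    rng = random.Random(seed)
    prefs = {b: 0.0 for b in buckets}   # one preference score per bucket
    baseline, n = 0.0, 0                # running mean of observed rewards
    for _ in range(steps):
        z = sum(math.exp(p) for p in prefs.values())
        probs = {b: math.exp(p) / z for b, p in prefs.items()}  # softmax
        b = rng.choices(list(probs), weights=list(probs.values()))[0]
        r = reward_fn(b)                # learning-progress signal
        n += 1
        baseline += (r - baseline) / n
        for a in prefs:                 # push toward rewarding buckets
            indicator = 1.0 if a == b else 0.0
            prefs[a] += lr * (r - baseline) * (indicator - probs[a])
    z = sum(math.exp(p) for p in prefs.values())
    return {b: math.exp(p) / z for b, p in prefs.items()}

# Toy reward: medium-difficulty data gives the most learning progress,
# matching the intuition that too-easy and too-hard data teach little.
mix = learn_mixture(["easy", "medium", "hard"],
                    lambda b: {"easy": 0.1, "medium": 1.0, "hard": 0.3}[b])
```

The mixture ends up concentrated on the "medium" bucket, i.e. the balance is discovered rather than hand-tuned.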




Full house at the Computer History Museum today. Great speakers, AI enthusiasts, and people who truly care about memories, all under one roof.
We shared our thoughts on:
• Memory and why it matters
• The future of memory infrastructure
• The future of the memory OS
This was also the grand finale of our Genesis 2026 competition. The quality of the projects, the presentations, and the sheer volume of code submitted truly blew us away. It is clear that this community is pushing the boundaries of what memory infrastructure can be.
More to come. Stay tuned.









(1/8) 🚀 New preprint: stop training reasoning models uniformly. Uniform prompt sampling + fixed rollouts waste compute on easy questions. We adapt, online, (1) *what* we train on and (2) *how much compute* we spend. #ContinualLearning #lifelonglearning #Reasoningmodels
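One simple way to picture "adapt how much compute we spend online" is a per-prompt rollout budget keyed to the prompt's pass rate (this heuristic is my own sketch, not the preprint's method; the function name and budget parameters are assumptions):

```python
def rollout_budget(pass_rate, base=8, max_rollouts=32):
    """Toy heuristic: spend extra rollouts where the prompt is neither
    trivially solved nor hopeless. The variance of a Bernoulli pass/fail
    outcome, p * (1 - p), peaks at p = 0.5, which is where additional
    rollouts buy the most training signal."""
    signal = pass_rate * (1.0 - pass_rate)  # in [0, 0.25]
    return base + round((max_rollouts - base) * signal / 0.25)

# Solved (p=1) and unsolvable (p=0) prompts get only the base budget;
# prompts near a 50% pass rate get the full budget.
budgets = [rollout_budget(p) for p in (0.0, 0.5, 1.0)]
```

A fixed-rollout scheme would assign `max_rollouts` everywhere, including to the easy questions where the extra samples carry almost no gradient signal.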


I actually think “Claude Code can solve it” is a prerequisite for a great research problem, because it lets you explore hypotheses much faster. In fact, if CC can’t solve it, I’d flag it as a bad problem: it will be able to solve it within six months, months that you’d otherwise waste on it.


We scored 36.08% on ARC-AGI-3 in one day using the Agentica SDK.






Simply adding Gaussian noise to LLMs (one step: no iterations, no learning rate, no gradients) and ensembling them can achieve performance comparable to, or even better than, standard GRPO/PPO on math reasoning, coding, writing, and chemistry tasks. We call this algorithm RandOpt.
To verify that this is not limited to specific models, we tested it on Qwen, Llama, OLMo3, and VLMs.
What's behind this? We find that in the Gaussian search neighborhood around pretrained LLMs, diverse task experts are densely distributed, a regime we term Neural Thickets.
Paper: arxiv.org/pdf/2603.12228
Code: github.com/sunrainyg/Rand…
Website: thickets.mit.edu
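The one-step perturb-and-ensemble recipe can be sketched on a toy weight vector (a minimal illustration of the idea as stated in the post; the top-k selection, weight averaging, and all parameter names are my assumptions, not necessarily RandOpt's actual details):

```python
import random

def perturb_and_ensemble(weights, score_fn, n_samples=64, sigma=0.1, k=8, seed=0):
    """Draw Gaussian perturbations of the base weights in ONE step
    (no gradients, no iterations), then keep the top-k scoring
    neighbours and ensemble them by averaging their weights."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_samples):
        w = [wi + rng.gauss(0.0, sigma) for wi in weights]
        candidates.append((score_fn(w), w))
    candidates.sort(key=lambda t: t[0], reverse=True)
    top = [w for _, w in candidates[:k]]
    # average the top-k neighbours into a single ensemble model
    return [sum(ws) / k for ws in zip(*top)]

# Toy task: the "expert" lies near [1, 1]; score = negative squared distance.
target = [1.0, 1.0]
score = lambda w: -sum((wi - ti) ** 2 for wi, ti in zip(w, target))
base = [0.8, 0.9]
tuned = perturb_and_ensemble(base, score)
```

On this toy landscape the ensembled neighbour scores at least as well as the base weights, mirroring the claim that good task experts sit densely in the Gaussian neighbourhood of the pretrained model.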



