

achal




2 weeks ago, @raw_works published an announcement about hitting state-of-the-art on LongCoT: a relatively small model, Qwen3.5-9B, beat GPT-5.2 on a long-horizon reasoning benchmark by over 60% using the right scaffold. That raises a question: is true intelligence just locked behind the right scaffolding?

First, what is LongCoT, and why does it matter? LongCoT is a benchmark of difficult reasoning problems, specifically designed to measure whether models can sustain coherent reasoning over extremely long horizons. The tasks span mathematics, chemistry, computer science, chess, and logic, and each individual reasoning step is usually within the capability of frontier models. The difficulty comes from maintaining correctness across a massive graph of interdependent steps that can stretch across tens to hundreds of thousands of reasoning tokens. These tasks break most models and act as a real test of complex task-solving ability.

Now let's talk about what @a1zhang (MIT CSAIL) published recently. Using a refined prompting setup within the RLM harness, they pushed performance on LongCoT-mini from 38.7% to 65.6%: a nearly 2x improvement on one of the hardest compositional reasoning benchmarks out there, just from better scaffold design. Earlier results with dspy.RLM on Claude Sonnet 4.5 showed a jump from roughly 13% to 45.4% overall. Specific categories like Dungeon, Packaging, Hanoi, Sudoku, and Wizards went from near-zero to perfect scores, and Chess hit 85 out of 100.

Then there's @raw_works's result: Qwen3.5-9B paired with dspy.RLM achieved 15.69% on LongCoT-Full compared to GPT-5.2's 9.83%. A 9-billion-parameter open model beating one of the most capable frontier models available, by a meaningful margin, on a hard benchmark. The 27B variant ranked highly on the mini split too, beating models many times its size.

It's not just LongCoT. The same pattern is showing up across benchmark categories.
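The "each step is easy, the horizon is brutal" property is easy to see in one of LongCoT's own categories, Tower of Hanoi. A minimal sketch in plain Python (not the benchmark's actual harness): every individual move is trivial to produce, but a correct solution for n disks is a chain of 2^n - 1 interdependent moves, and one wrong move early on invalidates everything downstream.

```python
def hanoi(n, src="A", dst="C", aux="B"):
    """Yield every move needed to shift n disks from src to dst."""
    if n == 0:
        return
    yield from hanoi(n - 1, src, aux, dst)  # clear the top n-1 disks out of the way
    yield (src, dst)                        # one trivial step: move the largest disk
    yield from hanoi(n - 1, aux, dst, src)  # re-stack the n-1 disks on top of it

# Each step is easy; the horizon is exponential.
moves = list(hanoi(10))
print(len(moves))  # 2**10 - 1 = 1023 interdependent steps
```

Scale n up and the move list grows exponentially while each move stays just as simple, which is exactly the regime where sustained coherence, not per-step intelligence, becomes the bottleneck.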
On LongMemEval, dspy.RLM variants are consistently hitting 87–89.8% accuracy. A model like Gemini 3 Flash paired with dspy.RLM and observational memory reached 89.8% at roughly $0.035 per query. That approaches dedicated memory systems like Mastra (~95%) and Vectorize Hindsight (~91%), without any specialized memory architecture. And on multi-hop reasoning tasks and large-context aggregation problems, where you're slicing through 10 million+ tokens and need to pull out specific signals, RLMs are outperforming both vanilla long-context models and traditional RAG setups.

The takeaway: @a1zhang's "Mismanaged Geniuses Hypothesis" is very apt here. These frontier models already have the raw capability for hard task decomposition. The bottleneck isn't intelligence, it's task management. Standard prompting essentially hands a genius a disorganized to-do list and wonders why they underperform.

RLMs fix this by giving the model a recursive execution environment: a shared REPL state, typed inputs and outputs via DSPy signatures, and structured delegation. The models we already have are more capable than our current interfaces allow them to be. RLMs, and DSPy's implementation in particular, are surfacing that latent capability at scale. It will be interesting to watch this space and see how far RLMs take us.

Sources, if you want to go deeper: @a1zhang, @raw_works, alexzhang13.github.io
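To make the "recursive execution environment" idea concrete, here is a toy sketch of the pattern, not dspy.RLM's real API: a root call decomposes a task, delegates subtasks to recursive calls, and every level reads and writes one shared scratchpad (standing in for the shared REPL state). The `solve` function, the task tuples, and the `state` dict are all hypothetical illustration.

```python
# Toy recursive-delegation sketch (illustrative only, NOT dspy.RLM's interface).
def solve(task, state, depth=0):
    """Decompose composite tasks; leaves do the actual work."""
    kind, payload = task
    if kind == "sum" and len(payload) > 2:
        # Composite task: split, delegate to two sub-solvers, combine.
        mid = len(payload) // 2
        left = solve(("sum", payload[:mid]), state, depth + 1)
        right = solve(("sum", payload[mid:]), state, depth + 1)
        state["trace"].append((depth, "combine"))
        return left + right
    # Leaf task: small enough to answer directly.
    state["trace"].append((depth, f"leaf:{payload}"))
    return sum(payload)

state = {"trace": []}  # shared scratchpad visible to every recursion level
answer = solve(("sum", [1, 2, 3, 4, 5, 6]), state)
print(answer)  # 21
```

The point of the sketch is the shape, not the arithmetic: the orchestrating call never holds the whole problem in one context; it only manages delegation and combination, which mirrors the task-management framing above.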


Meet Kimi K2.6: Advancing Open-Source Coding

🔹 Open-source SOTA on HLE w/ tools (54.0), SWE-Bench Pro (58.6), SWE-bench Multilingual (76.7), BrowseComp (83.2), Toolathlon (50.0), Charxiv w/ Python (86.7), Math Vision w/ Python (93.2)

What's new:
🔹 Long-horizon coding - 4,000+ tool calls, over 12 hours of continuous execution, with generalization across languages (Rust, Go, Python) and tasks (frontend, devops, perf optimization).
🔹 Motion-rich frontend - videos in hero sections, WebGL shaders, GSAP + Framer Motion, Three.js 3D.
🔹 Agent Swarms, elevated - 300 parallel sub-agents × 4,000 steps per run (up from K2.5's 100 / 1,500). One prompt, 100+ files.
🔹 Proactive Agents - the K2.6 model powers OpenClaw, Hermes Agent, etc. for 24/7 autonomous ops.
🔹 Claw Groups (research preview) - bring your own agents, command your friends', bots & humans in the loop.

K2.6 is now live on kimi.com in chat mode and agent mode. For production-grade coding, pair K2.6 with Kimi Code: kimi.com/code

🔗 API: platform.moonshot.ai
🔗 Tech blog: kimi.com/blog/kimi-k2-6
🔗 Weights & code: huggingface.co/moonshotai/Kim…




