Carlo
@CarloFCesar
On my own journey towards financial literacy and freedom | Travel planner & Bike guide | 🇮🇹 🇳🇱


Meet Kimi K2.6: Advancing Open-Source Coding

🔹 Open-source SOTA on HLE w/ tools (54.0), SWE-Bench Pro (58.6), SWE-bench Multilingual (76.7), BrowseComp (83.2), Toolathlon (50.0), CharXiv w/ Python (86.7), Math Vision w/ Python (93.2)

What's new:
🔹 Long-horizon coding - 4,000+ tool calls, over 12 hours of continuous execution, with generalization across languages (Rust, Go, Python) and tasks (frontend, devops, perf optimization).
🔹 Motion-rich frontend - videos in hero sections, WebGL shaders, GSAP + Framer Motion, Three.js 3D.
🔹 Agent Swarms, elevated - 300 parallel sub-agents × 4,000 steps per run (up from K2.5's 100 / 1,500). One prompt, 100+ files.
🔹 Proactive Agents - K2.6 powers OpenClaw, Hermes Agent, and more for 24/7 autonomous ops.
🔹 Claw Groups (research preview) - bring your own agents, command your friends'; bots & humans in the loop.

K2.6 is now live on kimi.com in chat mode and agent mode. For production-grade coding, pair K2.6 with Kimi Code: kimi.com/code

🔗 API: platform.moonshot.ai
🔗 Tech blog: kimi.com/blog/kimi-k2-6
🔗 Weights & code: huggingface.co/moonshotai/Kim…
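For anyone who wants to poke at the API linked above: a minimal Python sketch, assuming Moonshot's endpoint is OpenAI-compatible. The base URL and the model id "kimi-k2.6" are guesses, not from the post; check platform.moonshot.ai for the real values.

```python
# Minimal sketch: calling Kimi K2.6 through Moonshot's API, assuming an
# OpenAI-compatible endpoint. Base URL and model id are assumptions;
# see platform.moonshot.ai for the actual values.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MOONSHOT_API_KEY",        # issued at platform.moonshot.ai
    base_url="https://api.moonshot.ai/v1",  # assumed OpenAI-compatible endpoint
)

resp = client.chat.completions.create(
    model="kimi-k2.6",  # assumed model id
    messages=[
        {"role": "system", "content": "You are a coding agent."},
        {"role": "user", "content": "Refactor this Go handler to stream responses."},
    ],
    temperature=0.3,
)
print(resp.choices[0].message.content)
```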

Just cut my agent's cloud costs significantly without sacrificing quality. Thanks to @witcheer and @AlexFinn, I started building from your posts!

The journey:
→ Started on a Mac mini M4 (24GB…) running Claude Haiku/Sonnet for all background tasks, ~60 API calls/day.
→ Tried Qwen3.5-35B-A3B via Ollama first. 23GB model. OOM. Killed.
→ Tried Qwen2.5-14B. Fit fine, but reasoning quality was too weak for my workflow.
→ Then TurboQuant dropped from @GoogleResearch: KV cache compression, 4.6× smaller memory footprint!

My new stack: Qwen3.5-27B-IQ3_XXS (10.7GB) via the llama.cpp TurboQuant fork → 13.6 tok/s on M4 Pro → zero API cost.

Validation before shipping (see the sketch after this post):
→ 15 tool-use scenarios: 12/12 pass
→ Shadow test against Haiku on live data: output indistinguishable
→ 3 timeout failures on error recovery, but those crons can run overnight with a higher timeout, so it's not a quality issue

My full model stack today:
🟢 Local (free)
→ Qwen3.5-27B IQ3_XXS - health monitoring, training load, environmental logging (based on multimodal biomarkers; keen on getting that data vectorized!)
→ MedGemma 4B - offline domain-specific model
🔵 Anthropic
→ Claude Sonnet - interactive sessions, memory synthesis
→ Claude Haiku - research briefings, clinical alerts, complex reasoning chains
🟡 Google
→ Gemini 2.5 Pro - fallback when Anthropic is unavailable
⚫ DeepSeek
→ DeepSeek V3.2 - cost-efficient tasks when applicable

The "local = cheap but dumb" assumption is breaking down fast. It's probably already outdated, things move that quickly! Curious to see how much smarter and cheaper this gets every day 🥳
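A minimal sketch of the shadow-test step above, assuming the local model is served by llama.cpp's llama-server (which exposes an OpenAI-compatible endpoint, port 8080 by default) and Haiku is reached through the official anthropic SDK. The GGUF name, the prompts, and the "claude-3-5-haiku-latest" model id are illustrative assumptions, not from the post.

```python
# Shadow test sketch: run identical prompts through the local llama.cpp
# server and Claude Haiku, print the paired outputs for comparison.
# Assumes llama-server is already running, e.g.:
#   llama-server -m qwen3.5-27b-iq3_xxs.gguf --port 8080   # assumed filename
import anthropic
from openai import OpenAI

local = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")  # dummy key
claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPTS = [  # illustrative stand-ins for live background-task traffic
    "Summarize today's training load from these HRV readings: 42, 48, 51 ms.",
    "Draft a one-line alert if resting heart rate rises 10% week over week.",
]

def ask_local(prompt: str) -> str:
    r = local.chat.completions.create(
        model="local",  # llama-server serves whatever model it was launched with
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content

def ask_haiku(prompt: str) -> str:
    r = claude.messages.create(
        model="claude-3-5-haiku-latest",  # assumed model id
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return r.content[0].text

for p in PROMPTS:
    print("PROMPT:", p)
    print("LOCAL :", ask_local(p))
    print("HAIKU :", ask_haiku(p))
    print("-" * 60)
```

In practice you would score the paired outputs (exact match, embedding similarity, or a rubric) rather than eyeballing them; the point is just to run both models on identical live traffic before cutting over.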



Introducing TurboQuant: our new compression algorithm reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, with zero accuracy loss. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI
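TurboQuant's actual algorithm is behind the link above. Purely as a toy illustration of what KV-cache quantization means in general, here is a per-channel 4-bit round-trip in NumPy; nothing below is TurboQuant itself, and naive 4-bit storage only gets ~4x, short of the 6x the post claims.

```python
# Toy illustration of KV-cache quantization: store keys/values as low-bit
# integers with per-channel scales instead of fp16. Generic sketch only,
# not Google's TurboQuant algorithm.
import numpy as np

def quantize_4bit(x: np.ndarray):
    """Per-channel asymmetric 4-bit quantization of a (tokens, channels) tensor."""
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    scale = (hi - lo) / 15.0 + 1e-8  # 4 bits -> 16 levels (0..15)
    q = np.clip(np.round((x - lo) / scale), 0, 15).astype(np.uint8)
    # A real implementation would pack two 4-bit codes per byte;
    # kept as one code per uint8 here for simplicity.
    return q, scale, lo

def dequantize_4bit(q, scale, lo):
    return q.astype(np.float32) * scale + lo

kv = np.random.randn(4096, 128).astype(np.float16)  # fake KV slice: 4096 tokens x 128 channels
q, scale, lo = quantize_4bit(kv.astype(np.float32))
kv_hat = dequantize_4bit(q, scale, lo)

# fp16 = 16 bits/value; packed 4-bit codes = 4 bits/value -> ~4x smaller,
# minus a small overhead for the per-channel scale/offset.
err = np.abs(kv.astype(np.float32) - kv_hat).mean()
print(f"mean abs round-trip error: {err:.4f}")
```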
