Carlo

1.1K posts


@CarloFCesar

On my own journey towards financial literacy and freedom | Travel planner & Bike guide | 🇮🇹 🇳🇱

Netherlands & Italy · Joined October 2010
311 Following · 309 Followers
Carlo
Carlo@CarloFCesar·
@elroyic @AlexFinn Why not 70B? You could quantize to FP8, or even NVFP4. And then, how does Qwen compare to, for example, Llama?
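Rough arithmetic behind the 70B question, as a minimal sketch: weights-only memory at a given quantization is just parameter count × bytes per weight. The bytes-per-weight figures below are approximations (NVFP4 in particular carries per-block scale overhead), not vendor numbers.

```python
# Back-of-envelope VRAM for the *weights alone* of a dense model at
# different quantization levels. KV cache, activations, and runtime
# buffers come on top and are not included.

BYTES_PER_WEIGHT = {
    "FP16": 2.0,
    "FP8": 1.0,
    "NVFP4": 0.56,  # ~4.5 bits/weight once block scales are counted (approx.)
}

def weight_gb(params_billion: float, fmt: str) -> float:
    return params_billion * 1e9 * BYTES_PER_WEIGHT[fmt] / 1024**3

for fmt in BYTES_PER_WEIGHT:
    print(f"70B @ {fmt:>5}: ~{weight_gb(70, fmt):.0f} GB")
# 70B @  FP16: ~130 GB
# 70B @   FP8: ~65 GB
# 70B @ NVFP4: ~37 GB
```

Even at NVFP4, a 70B dense model is out of reach for a 24GB machine; it only starts to make sense on 128GB-class boxes like a DGX Spark or a big Mac Studio.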
English
0
0
0
21
Alex Finn
Alex Finn@AlexFinn·
It happened. An open weights model just dropped that benchmarks higher than Opus 4.6.

If you have 2 Mac Studios w/ 512GB, you can run Opus 4.6 level intelligence completely for free on your desk.

I warned you this would happen months ago. Now Mac Studios and Mac Minis are sold out. The next Mac Studio has been delayed until Q3/Q4. The price will be significantly higher.

I told you this was going to happen. Intelligence explosion. Hardware bottleneck. Increased efficiency.

Luckily I picked up 2 512GB Mac Studios, 2 Mac Minis, and a DGX Spark. I will be loading these up in the next couple of days and will have completely private superintelligence running for me 24/7.

I'm telling you right now: by end of year we will have a local version of Mythos. It's 100% guaranteed.

You called me crazy, but every single prediction I've made has turned out to be true.

These models will only get more efficient and require less hardware. But that hardware is only going to get more expensive.

Local/open source is so obviously the future, and if you're still denying this now you are delusional.
Kimi.ai@Kimi_Moonshot

Meet Kimi K2.6: Advancing Open-Source Coding

🔹Open-source SOTA on HLE w/ tools (54.0), SWE-Bench Pro (58.6), SWE-bench Multilingual (76.7), BrowseComp (83.2), Toolathlon (50.0), CharXiv w/ Python (86.7), Math Vision w/ Python (93.2)

What's new:
🔹Long-horizon coding - 4,000+ tool calls, over 12 hours of continuous execution, with generalization across languages (Rust, Go, Python) and tasks (frontend, devops, perf optimization).
🔹Motion-rich frontend - videos in hero sections, WebGL shaders, GSAP + Framer Motion, Three.js 3D.
🔹Agent Swarms, elevated - 300 parallel sub-agents × 4,000 steps per run (up from K2.5's 100 / 1,500). One prompt, 100+ files.
🔹Proactive Agents - K2.6 powers OpenClaw, Hermes Agent, etc. for 24/7 autonomous ops.
🔹Claw Groups (research preview) - bring your own agents; command your friends' agents, with bots & humans in the loop.

K2.6 is now live on kimi.com in chat mode and agent mode. For production-grade coding, pair K2.6 with Kimi Code: kimi.com/code

🔗 API: platform.moonshot.ai
🔗 Tech blog: kimi.com/blog/kimi-k2-6
🔗 Weights & code: huggingface.co/moonshotai/Kim…

English
188
146
1.7K
317.4K
Lena
Lena@luminousmind_co·
@CarloFCesar @AlexFinn split hot/cold. small model routes and watches state; heavy model wakes only when the database context needs real reasoning. otherwise 24/7 local turns into a heat bill.
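A minimal sketch of the hot/cold split Lena describes, assuming a simple routine-event triage rule; the model stubs, event format, and keyword rule are all illustrative, not her actual setup.

```python
# Hot/cold routing: a cheap resident model triages every event, and the
# heavy model is only woken when an event actually needs reasoning.
import re

ROUTINE = re.compile(r"^(heartbeat|metrics|log-rotate)\b")  # hypothetical rule

def ask_small_model(event: str) -> str:
    # Placeholder: in practice a small model (e.g. ~3B) kept warm 24/7,
    # cheap enough that it can watch state continuously.
    return f"[small] logged: {event}"

def ask_heavy_model(event: str) -> str:
    # Placeholder: in practice the large model, loaded on demand only
    # when the growing data pool needs real cross-referencing.
    return f"[heavy] reasoned: {event}"

def route(event: str) -> str:
    if ROUTINE.match(event):
        return ask_small_model(event)
    return ask_heavy_model(event)

print(route("heartbeat 02:00 ok"))
print(route("cross-reference new documents against the vector pool"))
```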
English
1
0
1
57
Eric J
Eric J@E_Jellerson·
People claiming 24GB is comfortable for a 27B dense model with a full agent harness have never actually run one under real tool-calling load. The weights are just the starting line. Once you add the gateway, system prompts, tool schemas, and KV cache under actual multi-turn use, you're at 27-30GB before you blink.
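Eric's point is easy to sanity-check: the KV cache alone grows linearly with context, at 2 (K and V) × layers × KV heads × head dim × bytes per element, per token. The architecture numbers below are illustrative for a ~27B dense model, not any specific release.

```python
# Why "the weights fit" is not "it runs": add the KV cache at a real
# agent-harness context length and the budget is already blown.

def kv_cache_gb(n_layers=46, n_kv_heads=16, head_dim=128,
                seq_len=32_768, bytes_per_elem=2):  # fp16 K/V cache
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K + V
    return per_token * seq_len / 1024**3

weights_gb = 27e9 * 0.56 / 1024**3   # ~4.5 bits/weight quant, weights only
print(f"weights ~{weights_gb:.1f} GB + KV@32k ~{kv_cache_gb():.1f} GB "
      f"= ~{weights_gb + kv_cache_gb():.1f} GB before runtime buffers")
# ~14 GB + ~12 GB = ~26 GB, before the gateway and runtime overhead.
```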
English
1
0
1
34
Vadim
Vadim@VadimStrizheus·
I just got a MacBook Pro: M5 chip > 16” > 24GB > 1TB of storage. What’s the best local model to run on this thing?
English
112
4
1.2K
145.4K
Carlo
Carlo@CarloFCesar·
What would you run if you have 2x DGX Spark (and 1x Mac mini 24GB…) and want to go 100% local? A multi-agent system that needs to run computations and cross-references 24/7 on an ever-growing vectorized data pool. Nemoclaw strongly preferred.

- Yes, I’m relatively new to this
- Yes, I first got the Mac mini and learned a lot (hence the upgrade)
- Yes, I need some help here 😂
English
0
0
0
231
Alex Finn
Alex Finn@AlexFinn·
Current best OpenClaw setup:

Orchestrator - Opus 4.7 API. Hoping ChatGPT 5.5 is better for OpenClaw, but at the moment it's Opus or nothing. Pay the extra money.

Coding - Codex CLI. Whenever OpenClaw needs to code, have Opus use the Codex CLI to build. Tons of cost savings, and it's equally good for coding.

Research/writing (good hardware) - GLM 5.1 running locally. Opus 4.5 level. This is what I use on my Mac Studio 512GB, doing research and scraping for me 24/7.

Research/writing (meh hardware) - Qwen 3.6. Use this if you have decent hardware but not great. Still strong.

Research/writing (no hardware) - Use any other OAuth you've got. ChatGPT and Gemini work just fine.

I understand the Opus 4.7 API is expensive. But at the end of the day, even if you spend $1,000 a month on it, you're spending $12,000 a year on a full-time genius-level employee. That would typically cost you at least $100,000 a year.

Praying ChatGPT has a cost-efficient competitor soon though.

And as always, local is the future. Right now I use local for about 50% of use cases. My prediction is I'll be using local models for 100% by the end of the year, with Mythos-level models running on a Mac Studio.
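The tiers above are effectively a routing table. A sketch of how that might look in code; the task keys and model labels are placeholders standing in for the models named in the post, not a real config format:

```python
# Illustrative task-to-model routing for a tiered orchestrator setup.

MODEL_FOR_TASK = {
    "orchestrate":   "frontier-api",   # the expensive top-tier orchestrator
    "code":          "coder-cli",      # delegate builds to a cheaper coder
    "research":      "local-large",    # strong local model on big hardware
    "research-lite": "local-medium",   # decent hardware, lighter model
}
FALLBACK = "hosted-oauth"              # any cloud account you already have

def pick_model(task: str) -> str:
    return MODEL_FOR_TASK.get(task, FALLBACK)

print(pick_model("code"))       # coder-cli
print(pick_model("summarize"))  # hosted-oauth
```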
English
110
40
593
46.9K
Carlo
Carlo@CarloFCesar·
If you don’t read up on AI updates for more than 24 hours, you feel sooooo behind
English
0
0
0
26
Carlo
Carlo@CarloFCesar·
My lesson for today: "Sleeping turns knowledge into wisdom." Not a word-for-word quote, but that's the concept. Comes from: youtube.com/watch?v=-OBCwi…
YouTube video
English
0
0
0
31
Carlo
Carlo@CarloFCesar·
@sidelined_cap Agreed there, man! I use Qwen 3.6 via OpenRouter only to do online peer-reviewed research for me
English
0
0
1
86
踏空哥 Sidelined Capital
踏空哥 Sidelined Capital@sidelined_cap·
Strong execution @CarloFCesar. Free windows are perfect for speed tests, but teams should treat them as public sandboxes and move sensitive workloads to private endpoints before experiments become defaults.
English
1
0
1
81
Carlo
Carlo@CarloFCesar·
Just cut my agent's cloud costs significantly without sacrificing quality. Thanks to @witcheer and @AlexFinn, I started building from your posts!

The journey: Started on a Mac mini M4 (24GB…) running Claude Haiku/Sonnet for all background tasks, ~60 API calls/day.

Tried Qwen3.5-35B-A3B via Ollama first. 23GB model. OOM. Killed.
Tried Qwen2.5-14B. Fit fine, but reasoning quality too weak for my workflow.

Then TurboQuant dropped @GoogleResearch. KV cache compression, 4.6× smaller memory footprint!

My new stack: Qwen3.5-27B-IQ3_XXS (10.7GB) via the llama.cpp TurboQuant fork → 13.6 tok/s on M4 Pro → zero API cost.

Validation before shipping:
→ 15 tool-use scenarios: 12/12 pass
→ Shadow test against Haiku on live data: output indistinguishable
→ Did have 3 timeout failures on error recovery, but a lot of crons can run overnight with a higher timeout; not a quality issue

My full model stack today:
🟢 Local (free)
→ Qwen3.5-27B IQ3_XXS — health monitoring, training load, environmental logging. (Based on multimodal biomarkers; keen on getting that data vectorized!)
→ MedGemma 4B — offline domain-specific model
🔵 Anthropic
→ Claude Sonnet — interactive sessions, memory synthesis
→ Claude Haiku — research briefings, clinical alerts, complex reasoning chains
🟡 Google
→ Gemini 2.5 Pro — fallback when Anthropic unavailable
⚫ DeepSeek
→ DeepSeek V3.2 — cost-efficient tasks when applicable

The "local = cheap but dumb" assumption is breaking down fast, and faster every day! It's probably already outdated; things move so quickly. Curious to get it smarter and cheaper every day 🥳
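The "shadow test against Haiku" step generalizes well: mirror the same prompts to the local model and the cloud baseline, then compare offline before cutting over. A minimal sketch, assuming both sides expose an OpenAI-compatible chat endpoint (llama-server does; the cloud URL and both model names here are placeholders):

```python
# Shadow-test a local model against a cloud baseline on the same prompt.
import json, urllib.request

def chat(base_url: str, model: str, prompt: str) -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def shadow_test(prompt: str) -> dict:
    local = chat("http://localhost:8080", "local-27b", prompt)        # llama-server
    cloud = chat("https://cloud.example.com", "baseline-model", prompt)  # placeholder
    return {"prompt": prompt, "local": local, "cloud": cloud,
            "match": local.strip() == cloud.strip()}
```

In practice you would log both outputs per live prompt and review divergences, rather than require an exact string match.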
English
1
0
1
309
Carlo
Carlo@CarloFCesar·
@VadimStrizheus OK, this already changed... just added Qwen3.6-plus-preview (free tier). Nothing sensitive or private. I run my research crons and prompts through it now instead of Sonnet 4.6: openrouter.ai/qwen/qwen3.6-p…

Things move fassssst
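A sketch of what "nothing sensitive or private" can look like in practice: a blocklist guard in front of the free-tier call. The endpoint is OpenRouter's real OpenAI-compatible one, but the model slug and blocklist terms are placeholders (the real slug is truncated in the post):

```python
# Route research prompts through a free-tier model, refusing anything
# that looks sensitive, since free-tier prompts may be retained.
import json, os, urllib.request

BLOCKLIST = ("MEMORY.md", "api_key", "password", "portfolio")  # illustrative

def research(prompt: str) -> str:
    if any(term.lower() in prompt.lower() for term in BLOCKLIST):
        raise ValueError("sensitive content: keep it off the free tier")
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps({
            "model": "qwen/placeholder-free-model",  # hypothetical slug
            "messages": [{"role": "user", "content": prompt}],
        }).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```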
English
0
0
0
112
Carlo
Carlo@CarloFCesar·
@VadimStrizheus I’d do the Qwen 27B, not the 35B, to be honest. With the 35B you won’t have enough headroom. And run TurboQuant on top to compress the KV cache. Then make sure to have some cloud API for when you really need it
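The headroom argument in numbers, using the model sizes from this thread (10.7GB for the 27B IQ3_XXS, 23GB for the 35B); the OS reserve is a guess:

```python
# Rough unified-memory budget on a 24 GB Mac: total, minus what macOS
# and apps hold, minus model weights, leaves room for the KV cache.
TOTAL_GB = 24.0
OS_RESERVE_GB = 6.0  # macOS + background apps, illustrative

for name, weights_gb in [("Qwen3.5-27B IQ3_XXS", 10.7),
                         ("Qwen3.5-35B-A3B", 23.0)]:
    left = TOTAL_GB - OS_RESERVE_GB - weights_gb
    print(f"{name}: {left:+.1f} GB left for KV cache and runtime")
# The 27B leaves ~7 GB of headroom; the 35B is OOM before the cache exists.
```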
English
2
0
4
1.8K
Carlo
Carlo@CarloFCesar·
Things really do move fast! Just set all my online research crons and prompts to run on Qwen3.6 Plus Preview (free tier) on OpenRouter. No personal data, no MEMORY.md, no strategy, nothing sensitive, because Alibaba collects prompts on the free tier. Check it out here: openrouter.ai/qwen/qwen3.6-p…
Carlo@CarloFCesar


English
0
0
0
133
Tom Turney
Tom Turney@no_stp_on_snek·
Qwen3.5-27B Q4_K_M with TurboQuant KV cache compression. You'll fit way more context than stock llama.cpp on 24GB.

Build from my fork:
git clone github.com/TheTom/llama-c…
git checkout feature/turboquant-kv-cache
cmake -B build -DGGML_METAL=ON && cmake --build build -j

For a pleasant experience, just run this:
./build/bin/llama-server -m Qwen3.5-27B-Q4_K_M.gguf -ngl 99 -fa 1 --cache-type-k q8_0 --cache-type-v turbo3

If you want to push it further and experiment:
- boundary V (experimental): TURBO_LAYER_ADAPTIVE=7 ./build/bin/llama-server -m Qwen3.5-27B-Q4_K_M.gguf -ngl 99 -fa 1 --cache-type-k q8_0 --cache-type-v turbo2
- sparse V auto-enables on this model, no flag needed

Docs + benchmarks: github.com/TheTom/turboqu…
English
5
5
60
4.3K
Carlo
Carlo@CarloFCesar·
@AlexFinn I was afraid I made a big mistake buying a 24GB Mac mini M4... maybe there's hope! I'm spending way too much on API now, and the local models I CAN run just aren't doing the trick. For now this appears to help mostly with the cache, not so much with the models themselves yet
English
0
0
0
192
Alex Finn
Alex Finn@AlexFinn·
I predicted that by the end of the year you’d be able to run frontier models on Mac Minis.

I’m going to be right.
English
31
6
210
24.5K
Alex Finn
Alex Finn@AlexFinn·
This is potentially the biggest news of the year.

Google just released TurboQuant, an algorithm that makes LLMs smaller and faster without losing quality.

Meaning that 16GB Mac Mini can now run INCREDIBLE AI models. Completely locally, free, and secure.

This also means:
• Much larger context windows possible with way less slowdown and degradation
• You’ll be able to run high-quality AI on your phone
• Speed and quality up. Prices down.

The people who made fun of you for buying a Mac Mini now have major egg on their face. This pushes all of AI forward in such a MASSIVE way.

It can’t be stated enough: props to Google for releasing this for all. They could have gatekept it for themselves like I imagine a lot of other big AI labs would have. They didn’t. They decided to advance humanity.

2026 is going to be the biggest year in human history.
Google Research@GoogleResearch

Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI
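The linked blog isn't reproduced in this thread, so TurboQuant's actual mechanism is unknown here. For intuition only, below is the generic per-block int8 KV quantization that cache compressors typically start from; this is the plain lossy baseline such methods improve on, not the claimed zero-loss algorithm:

```python
# Generic per-block int8 quantization of a KV-cache-like tensor:
# store int8 values plus one fp scale per block instead of full floats.
import numpy as np

def quantize_blocks(x: np.ndarray, block: int = 64):
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0  # one scale per block
    q = np.round(x / np.maximum(scale, 1e-12)).astype(np.int8)
    return q, scale

def dequantize_blocks(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

kv = np.random.randn(4096 * 64).astype(np.float32)  # fake cache slice
q, s = quantize_blocks(kv)
err = np.abs(dequantize_blocks(q, s) - kv).max()
print(f"int8 + block scales: ~4x smaller than this fp32 slice, "
      f"max abs error {err:.4f}")
```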

English
331
873
9.7K
1.5M
Carlo
Carlo@CarloFCesar·
Mozart's Requiem is my study jam. What's yours?
GIF
English
0
0
0
19
Marc Andreessen 🇺🇸
Marc Andreessen 🇺🇸@pmarca·
My information consumption is now 1/4 X, 1/4 podcast interviews of the smartest practitioners, 1/4 talking to the leading AI models, and 1/4 reading old books. The opportunity cost of anything else is far too high, and rising daily.
English
1.4K
3.9K
35K
34.6M
Carlo
Carlo@CarloFCesar·
My quote for today comes from Life 3.0 by Max Tegmark: "To program friendly AI, we need to capture the meaning of life. What's 'meaning'? What's 'life'? What's the ultimate ethical imperative? In other words, how should we strive to shape the future of our universe? If we cede control to a superintelligence before answering these questions rigorously, the answer it comes up with is unlikely to involve us."
GIF
English
0
0
0
25