Willy Drucker

279 posts

Willy Drucker

Willy Drucker

@WillyDrucker

How do I make this not sound like the inscription on my tombstone?

New Fairfield, CT Beigetreten Aralık 2010
116 Folgt155 Follower
Tristan Rhee
Tristan Rhee@Tristanrhee3·
Everyone has AI now. That advantage expired fast. What's the new advantage?
English
190
8
167
21.5K
Willy Drucker
Willy Drucker@WillyDrucker·
@TraffAlex I have a web scraper tool I use for real world data on an actual project. Scrape > Categorize Data > Update Fields > Import to DB. Gemma 4 12B fabricates data and fails every step. Qwopus 3.6 35B gets all steps right. Like you said, need this one to be correct even if slower.
English
0
0
0
21
AlexAImaginator
AlexAImaginator@TraffAlex·
that's a solid real-world comparison honestly. the "fast but wrong vs slower but right" framing nails it. I'd take the 5 minute correct answer every time for anything that actually matters. the 35B-A3B on 12GB is kinda the sweet spot nobody talks about enough, you get MoE speed benefits but still enough active params to not hallucinate constantly. good data point with the 3060 numbers
English
0
0
5
1.2K
AlexAImaginator
AlexAImaginator@TraffAlex·
🖥️ Best Local LLMs for Consumer GPUs — llama.cpp Guide (June 2026) What I actually run on consumer hardware right now. Every model below runs via llama.cpp with a simple one-liner — no Docker, no Python env, no cloud. ━━━ 8-16GB VRAM ━━━ 🔹 Gemma 4-12B (Google) • Smartest model in this size class — competes with stuff 2× bigger • Unsloth's MTP GGUFs: 162 tok/s vs 52 tok/s normal (3× speedup) • Minimum 8GB VRAM recommended for Q4_K_M quant • GGUF → huggingface.co/unsloth/gemma-… 🔹 LFM2.5-8B-A1B (LiquidAI) • Hybrid MoE, only 1B active params — absurdly fast for its size • Perfect for 8-12GB cards, MacBooks, or anyone on a tight budget • GGUF → huggingface.co/LiquidAI/LFM2.… ━━━ 16-32GB VRAM ━━━ 🔹 Qwen3.6-27B (Qwen) • Scored 1.00 on tool-efficiency benchmarks — best local agent available • 40 deterministic tasks, 32k/128k context needle tests — all passed • GGUF → huggingface.co/unsloth/Qwen3.… • MTP version (faster) → huggingface.co/unsloth/Qwen3.… 🔹 Qwopus3.6-27B-v2 (Jackrong) • Best quantization of Qwen3.6-27B — topped 5 agent & coding benchmarks (1200 samples) • If you're running Q4, this is the one to grab • GGUF → huggingface.co/Jackrong/Qwopu… • MTP version → huggingface.co/Jackrong/Qwopu… 🔹 Gemma 4-31B QAT (Google/Unsloth) • QAT variant with MTP draft head: 76-125 tok/s (1.67× speedup) • Excellent for multi-agent / subagent workflows • GGUF → huggingface.co/unsloth/gemma-… 🔹 Nex-N2-Mini (Nex AGI) • Post-train of Qwen3.5-35B-A3B — MoE with only 3B active params • Fits on 16GB+ VRAM, overflow loads from system RAM • Adaptive thinking saves ~20% tokens with no quality loss • For deep multi-step reasoning, nothing in this size comes close • GGUF → huggingface.co/sjakek/Nex-N2-… ━━━ Quick Picks ━━━ • 16GB all-rounder → Gemma 4-12B with MTP GGUFs • 32GB all-rounder → Qwen3.6-27B / Qwopus-v2 • Agents & tool use → Qwen3.6-27B or Qwopus Q4 • Deep reasoning → Nex-N2-Mini (MoE, fits 16GB+) • Tight budget → LFM2.5-8B-A1B • Cheapest full build: 1× used RTX 3090 (24GB) + rest of PC ≈ $1000-1500 ━━━ Setup on Windows ━━━ 1. Download llama.cpp → github.com/ggml-org/llama… (latest .zip) 2. Extract to any folder (e.g. C:\llama.cpp) 3. Download a .gguf from the links above (Q4_K_M or Q5_K_M for best quality/speed balance) 4. Run one of the commands below depending on your hardware ━━━ Launch Commands ━━━ SINGLE GPU — Standard model (no MTP): llama-server.exe ^ -m C:\models\Qwen3.6-27B-Q5_K_M.gguf ^ --ctx-size 180000 ^ --flash-attn on ^ --cache-type-k q4_0 ^ --cache-type-v q4_0 ^ --batch-size 1024 --ubatch-size 512 ^ -ngl 100 ^ -np 1 ^ --port 8080 ^ --jinja SINGLE GPU — MTP model (faster inference): llama-server.exe ^ -m C:\models\Qwen3.6-27B-MTP-Q5_K_M.gguf ^ --ctx-size 180000 ^ --flash-attn on ^ --cache-type-k q4_0 ^ --cache-type-v q4_0 ^ --batch-size 1024 --ubatch-size 512 ^ --spec-type draft-mtp ^ --spec-draft-n-max 3 ^ -ngl 100 ^ -np 1 ^ --port 8080 ^ --jinja DUAL GPU — Split across two cards: llama-server.exe ^ -m C:\models\Qwen3.6-27B-Q5_K_M.gguf ^ --ctx-size 180000 ^ --flash-attn on ^ --cache-type-k q4_0 ^ --cache-type-v q4_0 ^ --batch-size 1024 --ubatch-size 512 ^ -ngl 100 ^ --tensor-split 0.55,0.45 ^ --main-gpu 0 ^ -np 1 ^ --port 8080 ^ --jinja DUAL GPU + MTP + Vision (multimodal): llama-server.exe ^ -m C:\models\Qwen3.6-27B-MTP-Q5_K_M.gguf ^ --ctx-size 180000 ^ --flash-attn on ^ --cache-type-k q4_0 ^ --cache-type-v q4_0 ^ --batch-size 1024 --ubatch-size 512 ^ --spec-type draft-mtp ^ --spec-draft-n-max 3 ^ -ngl 100 ^ --tensor-split 0.60,0.40 ^ --main-gpu 0 ^ -np 1 ^ --port 8080 ^ --jinja ^ --mmproj C:\models\mmproj-F16.gguf ━━━ Parameter Breakdown ━━━ -m Path to your .gguf model file. Change this to wherever you downloaded it. --ctx-size 180000 Context window in tokens. 180k = huge context for long conversations or big codebases. Reduce to 32768 or 65536 if you don't need long context — uses less VRAM. --flash-attn on Flash Attention — dramatically speeds up inference and reduces VRAM usage. Works on RTX 30xx/40xx/50xx. Always enable this. --cache-type-k q4_0 / --cache-type-v q4_0 Quantizes the KV cache (key/value attention cache) to 4-bit. This is what makes 180k context fit in VRAM. Without it, huge contexts eat all your memory. Quality impact is minimal — this is a free performance win. --batch-size 1024 / --ubatch-size 512 batch-size = how many tokens are processed in one forward pass (throughput). ubatch-size = micro-batch actually sent to the GPU per step. Higher = faster prompt processing but needs more VRAM. If you run out of VRAM, lower these (e.g. 512/256). -ngl 100 Number of layers to offload to GPU. 100 = all layers on GPU (full offload). This is what you want if the model fits in your VRAM. If it doesn't fit, reduce this (e.g. -ngl 40) — remaining layers run on CPU/RAM. --tensor-split 0.55,0.45 How to split model layers across multiple GPUs. Values are ratios. 0.55,0.45 = GPU 0 gets 55% of layers, GPU 1 gets 45%. Adjust based on your VRAM — give more to the card with more memory. Example: 0.70,0.30 for a 24GB + 12GB setup. Not needed for single GPU setups. --main-gpu 0 Which GPU handles the batch computation (the "orchestrator"). Set to 0 (your primary GPU). The other GPU(s) handle their assigned layers. Minor performance impact — usually just leave it at 0. -np 1 Number of parallel slots (concurrent requests). 1 = one user at a time. Increase to 2-4 if you want multiple clients connected simultaneously. Each extra slot uses additional VRAM for its own KV cache. --port 8080 Which port the server listens on. Change if port 8080 is busy. --jinja Enables Jinja2 template processing — required for proper chat formatting. Most modern models expect this. Always include it. --spec-type draft-mtp Enables Multi-Token Prediction (MTP) speculative decoding. Only works with MTP GGUF models (downloaded separately). The model predicts multiple tokens at once and verifies them — big speed boost. --spec-draft-n-max 3 How many tokens the MTP draft head proposes per step. 3 is a good default. Higher = potentially faster but more VRAM and may reduce quality. --mmproj Path to the multimodal projector file (for vision models). Enables image understanding — paste screenshots into the web chat. Only needed if you want vision capabilities. Omit for text-only use. ━━━ Your Hardware → Your Command ━━━ Single GPU (8-24GB VRAM): Use the "Single GPU" command. Change -m to your model path. 8GB card → Gemma 4-12B Q4 or LFM2.5-8B 12GB card → Gemma 4-12B Q5/Q6 16GB card → Gemma 4-31B QAT Q4 or Nex-N2-Mini 24GB card → Qwen3.6-27B Q4/Q5, Qwopus-v2, Gemma 4-31B QAT Q5/Q6 Dual GPU: Use the "Dual GPU" command. Adjust --tensor-split based on your VRAM ratio. 24GB + 24GB → --tensor-split 0.50,0.50 24GB + 12GB → --tensor-split 0.70,0.30 24GB + 8GB → --tensor-split 0.75,0.25 Want speed? Use MTP versions of models with the "MTP" commands. Want vision? Add --mmproj with the projector file from the model's HuggingFace repo. 5. Once running, you get: • Web chat UI → http://localhost:8080 • OpenAI-compatible API → http://localhost:8080/v1 • Playground → http://localhost:8080/playground ━━━ Why /v1 API Is the Killer Feature ━━━ One local endpoint replaces your entire cloud API bill. The /v1 endpoint is drop-in OpenAI-spec compatible — every tool that speaks OpenAI just works. No custom code, no glue layer. Works out of the box with: • IDEs: Cursor, Continue, Windsurf, Cline, Roo Code • CLI tools: aider, Open Interpreter, OpenCode • Frameworks: LangChain, LlamaIndex, LiteLLM • Any OpenAI SDK (Python, Node, Go, Rust) Why this beats cloud APIs: • 100% private — code never leaves your machine • $0 per token — no rate limits, no quotas, no surprise bills • Works fully offline • Zero telemetry, no training on your data • Swap models by dropping in a different .gguf — no app changes needed • Run 32k–128k context windows without burning money Good combos: • Cursor + Qwopus-v2 → near-frontier quality, zero API cost • Continue + Qwen3.6-27B → best local coding agent • aider + Gemma 4-12B MTP → 162 tok/s, feels instant • OpenCode + Nex-N2-Mini → deep reasoning on 16GB Set any OpenAI-compatible client to your local endpoint: set OPENAI_API_KEY=sk-dummy (any non-empty string works) set OPENAI_BASE_URL=http://localhost:8080/v1 # every OpenAI-compatible tool now hits your local GPU Shoutouts: @0xSero @rS_alonewolf @witcheer @UnslothAI @LottoLabs
English
52
135
1.4K
181.8K
Willy Drucker
Willy Drucker@WillyDrucker·
AI Tip#19: Naming things with AI matters. Your file names, row/column names (tables), groupings, functions, features, so on. Best practices first always, but AI is not playing by the same book. Push back on the AI to make sure everything named owns its purpose and you'll avoid 90% of the problems out there. Bloat, slop, poor documentation, multiple projects, back-and-forth bugs, memory and quality issues the list goes on and can be steered with good simple naming. The AI will even do it for you. Spend 10% now to avoid 90% of issues later.
English
0
0
0
15
OpenRouter
OpenRouter@OpenRouter·
Introducing the Fusion API, the smartest compound model in the market. Fusion achieves Fable-level intelligence at half the price. How it works 👇
OpenRouter tweet media
English
647
1.7K
14.1K
5.4M
Bill Mitchell
Bill Mitchell@mitchellvii·
@emilykschrader We have never had an issue with Iran having nuclear energy. There is a vast difference between nuclear power level Uranium and weapons grade Uranium.
English
9
0
22
688
Emily Schrader - אמילי שריידר امیلی شریدر
Trump: Under the agreement Iran will be able to enrich uranium at a low level that can never serve military purposes …So the JCPOA. This is a catastrophic mistake by the Trump admin and shows they understand very little about nuclear proliferation. Talk about destroying your own legacy.
English
279
687
3.4K
62.4K
Gerald Morrison
Gerald Morrison@GeraldMorrison·
@TraffAlex I'm running a Mac Studio M4 Max with 36 GB unified. I can fit a 35B model. Do you have an opinion on whether Qwen 3.6 27B or Qwen 3.6 35B is the better model. I realize "better" is subjective. Just looking for an opinion.
English
2
0
8
1.8K
Willy Drucker retweetet
Willy Drucker
Willy Drucker@WillyDrucker·
@OpenRouter I believe the results! Built an extension for VS Code that allows Claude and Codex to iterate back and forth and the results are mind blowing.
English
16
4
379
56.6K
American AF 🇺🇸
American AF 🇺🇸@iAnonPatriot·
Old footage of a 1920s amusement park ride designed to throw people off.. People were built DIFFERENT back then. Lmao
English
263
1.4K
12.2K
1.7M
Willy Drucker
Willy Drucker@WillyDrucker·
It's extremely similar, I register an MCP server as the "bridge" between Claude and Codex (or other models). Everything communicates back and forth on the bridge. Since AI treats the MCP server as a registration in memory or a tool-call essentially, you can use natural language to communicate on the bridge. There a messaging system included so the models can talk back and forth and even iterate but it needs to be a little more matured. But you can ask pretty much any question for now and expect a response. I had an issue once where Codex couldn't establish an SSH connection to one of my PCs and Claude noticed it was doing something wrong. Without my intervention, Claude reinitiated the bridge multiple times until Codex got it. Was pretty wild to see them talking to each other.
Willy Drucker tweet media
English
0
0
3
202
BWS
BWS@BWS14TW·
@WillyDrucker @OpenRouter I do something similar in VS Code with Claude Code Extension I ssh connect to my VPS then I call Codex Adversarial Review of all plans and code implementation and have Opus review Codex’s suggestions/improvements to come up with final answers. How would you compare to yours?
English
1
0
1
237
Willy Drucker
Willy Drucker@WillyDrucker·
It actually already works with it! I use an OpenCode harness to help with tool calls and you can put your local IP or remote LLM IP into the settings and it will connect. You can even use OpenCode's free model's like Big Pickle. I didn't expect it to get this much attention, so it has limitations, but does work. Looks like I'll be expanding on this, though the Git is open source. github.com/WillyDrucker/W…
English
2
0
1
113
Frits Karl
Frits Karl@FritsKarl·
@WillyDrucker @OpenRouter Could that extension be modified to work with locally hosted models? Or even better refactor to a proxy server?
English
1
0
0
229
Willy Drucker
Willy Drucker@WillyDrucker·
@GetRichGiveBack @OpenRouter Hey there, yes you can. It uses the VS Code extensions (or CLI technically) you've already registered in your .claude and .codex settings. It never touches your keys or tokens those stay within the tools themselves. Basically, the way you already have it set up now.
Willy Drucker tweet media
English
1
0
1
23
Willy Drucker
Willy Drucker@WillyDrucker·
@elonmusk I mean, did anyone really buy that COVID came from a fish market one mile from the Wuhan bio lab?
English
0
1
4
1.6K
Willy Drucker
Willy Drucker@WillyDrucker·
@ChaseIrons I weigh more than Chase Irons! I weigh more than Chase Irons. 😭*Puts back donut*
English
0
0
1
35
Chase Irons
Chase Irons@ChaseIrons·
I've lost 70lbs in the last 3 years but not for the reason most people think
English
5
0
22
3.4K
Willy Drucker
Willy Drucker@WillyDrucker·
@henokcrypto Just rewrite your CLAUDE.md using the standard language. "You are a Fable 5 model and senior developer..."
English
0
0
0
70
Henok
Henok@henokcrypto·
I’m experiencing Fable 5 withdrawal
English
10
0
41
1.7K
Willy Drucker
Willy Drucker@WillyDrucker·
@mitchellvii The small wins are always fun. 10lbs more on bench, 10lbs more on military. Serves as that angel on your shoulder for days you can't see your toes. Having someone to go with helps too.
English
0
0
0
6
Bill Mitchell
Bill Mitchell@mitchellvii·
I don't enjoy working out not because it's hard but because it's boring. If I could find a way to make working out fun, I think I'd do it more.
English
22
0
22
1.9K
Willy Drucker
Willy Drucker@WillyDrucker·
@DaveBlundin So much to unpack here. If someone can influence AI results to steer an outcome, they will. If someone can restrict AI usage to force a result, they will. These are not, maybe's, and it's coming to a western civilization near you. TV had its run, AI's turn.
English
0
0
0
251
Dave Blundin
Dave Blundin@DaveBlundin·
This is one of the most pivotal events in history because of the precedent it sets. Even if it doesn't stick, the cat is out of the bag: the government can now unilaterally decide who gets to use AI. Over the last few days I had fully transitioned to orchestrating with Fable. It initially felt like the cookie jar was moved out of reach. Now after spending the day with Opus, I've accepted just how monumental this is. Being denied the best AI model is effectively the same as being denied future employment.
Anthropic@AnthropicAI

The US government, citing national security authorities, has issued an export control directive to suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States, including foreign national Anthropic employees. The net effect of this order is that we must abruptly disable Fable 5 and Mythos 5 for all our customers to ensure compliance. Access to all other Claude models is not affected. We apologize for this disruption to our customers. We believe this is a misunderstanding and are working to restore access as soon as possible. Read our full statement: anthropic.com/news/fable-myt…

English
59
44
360
22.1K