Willy Drucker

279 posts

Willy Drucker

@WillyDrucker

How do I make this not sound like the inscription on my tombstone?

New Fairfield, CT Beigetreten Aralık 2010

116 Folgt155 Follower

Willy Drucker@WillyDrucker·2h

@Tristanrhee3 Two AIs!

English

Tristan Rhee@Tristanrhee3·17h

Everyone has AI now. That advantage expired fast. What's the new advantage?

English

190

167

21.5K

Willy Drucker@WillyDrucker·2h

@TraffAlex I have a web scraper tool I use for real world data on an actual project. Scrape > Categorize Data > Update Fields > Import to DB. Gemma 4 12B fabricates data and fails every step. Qwopus 3.6 35B gets all steps right. Like you said, need this one to be correct even if slower.

English

AlexAImaginator@TraffAlex·8h

that's a solid real-world comparison honestly. the "fast but wrong vs slower but right" framing nails it. I'd take the 5 minute correct answer every time for anything that actually matters. the 35B-A3B on 12GB is kinda the sweet spot nobody talks about enough, you get MoE speed benefits but still enough active params to not hallucinate constantly. good data point with the 3060 numbers

English

1.2K

AlexAImaginator@TraffAlex·15h

🖥️ Best Local LLMs for Consumer GPUs — llama.cpp Guide (June 2026) What I actually run on consumer hardware right now. Every model below runs via llama.cpp with a simple one-liner — no Docker, no Python env, no cloud. ━━━ 8-16GB VRAM ━━━ 🔹 Gemma 4-12B (Google) • Smartest model in this size class — competes with stuff 2× bigger • Unsloth's MTP GGUFs: 162 tok/s vs 52 tok/s normal (3× speedup) • Minimum 8GB VRAM recommended for Q4_K_M quant • GGUF → huggingface.co/unsloth/gemma-… 🔹 LFM2.5-8B-A1B (LiquidAI) • Hybrid MoE, only 1B active params — absurdly fast for its size • Perfect for 8-12GB cards, MacBooks, or anyone on a tight budget • GGUF → huggingface.co/LiquidAI/LFM2.… ━━━ 16-32GB VRAM ━━━ 🔹 Qwen3.6-27B (Qwen) • Scored 1.00 on tool-efficiency benchmarks — best local agent available • 40 deterministic tasks, 32k/128k context needle tests — all passed • GGUF → huggingface.co/unsloth/Qwen3.… • MTP version (faster) → huggingface.co/unsloth/Qwen3.… 🔹 Qwopus3.6-27B-v2 (Jackrong) • Best quantization of Qwen3.6-27B — topped 5 agent & coding benchmarks (1200 samples) • If you're running Q4, this is the one to grab • GGUF → huggingface.co/Jackrong/Qwopu… • MTP version → huggingface.co/Jackrong/Qwopu… 🔹 Gemma 4-31B QAT (Google/Unsloth) • QAT variant with MTP draft head: 76-125 tok/s (1.67× speedup) • Excellent for multi-agent / subagent workflows • GGUF → huggingface.co/unsloth/gemma-… 🔹 Nex-N2-Mini (Nex AGI) • Post-train of Qwen3.5-35B-A3B — MoE with only 3B active params • Fits on 16GB+ VRAM, overflow loads from system RAM • Adaptive thinking saves ~20% tokens with no quality loss • For deep multi-step reasoning, nothing in this size comes close • GGUF → huggingface.co/sjakek/Nex-N2-… ━━━ Quick Picks ━━━ • 16GB all-rounder → Gemma 4-12B with MTP GGUFs • 32GB all-rounder → Qwen3.6-27B / Qwopus-v2 • Agents & tool use → Qwen3.6-27B or Qwopus Q4 • Deep reasoning → Nex-N2-Mini (MoE, fits 16GB+) • Tight budget → LFM2.5-8B-A1B • Cheapest full build: 1× used RTX 3090 (24GB) + rest of PC ≈ $1000-1500 ━━━ Setup on Windows ━━━ 1. Download llama.cpp → github.com/ggml-org/llama… (latest .zip) 2. Extract to any folder (e.g. C:\llama.cpp) 3. Download a .gguf from the links above (Q4_K_M or Q5_K_M for best quality/speed balance) 4. Run one of the commands below depending on your hardware ━━━ Launch Commands ━━━ SINGLE GPU — Standard model (no MTP): llama-server.exe ^ -m C:\models\Qwen3.6-27B-Q5_K_M.gguf ^ --ctx-size 180000 ^ --flash-attn on ^ --cache-type-k q4_0 ^ --cache-type-v q4_0 ^ --batch-size 1024 --ubatch-size 512 ^ -ngl 100 ^ -np 1 ^ --port 8080 ^ --jinja SINGLE GPU — MTP model (faster inference): llama-server.exe ^ -m C:\models\Qwen3.6-27B-MTP-Q5_K_M.gguf ^ --ctx-size 180000 ^ --flash-attn on ^ --cache-type-k q4_0 ^ --cache-type-v q4_0 ^ --batch-size 1024 --ubatch-size 512 ^ --spec-type draft-mtp ^ --spec-draft-n-max 3 ^ -ngl 100 ^ -np 1 ^ --port 8080 ^ --jinja DUAL GPU — Split across two cards: llama-server.exe ^ -m C:\models\Qwen3.6-27B-Q5_K_M.gguf ^ --ctx-size 180000 ^ --flash-attn on ^ --cache-type-k q4_0 ^ --cache-type-v q4_0 ^ --batch-size 1024 --ubatch-size 512 ^ -ngl 100 ^ --tensor-split 0.55,0.45 ^ --main-gpu 0 ^ -np 1 ^ --port 8080 ^ --jinja DUAL GPU + MTP + Vision (multimodal): llama-server.exe ^ -m C:\models\Qwen3.6-27B-MTP-Q5_K_M.gguf ^ --ctx-size 180000 ^ --flash-attn on ^ --cache-type-k q4_0 ^ --cache-type-v q4_0 ^ --batch-size 1024 --ubatch-size 512 ^ --spec-type draft-mtp ^ --spec-draft-n-max 3 ^ -ngl 100 ^ --tensor-split 0.60,0.40 ^ --main-gpu 0 ^ -np 1 ^ --port 8080 ^ --jinja ^ --mmproj C:\models\mmproj-F16.gguf ━━━ Parameter Breakdown ━━━ -m Path to your .gguf model file. Change this to wherever you downloaded it. --ctx-size 180000 Context window in tokens. 180k = huge context for long conversations or big codebases. Reduce to 32768 or 65536 if you don't need long context — uses less VRAM. --flash-attn on Flash Attention — dramatically speeds up inference and reduces VRAM usage. Works on RTX 30xx/40xx/50xx. Always enable this. --cache-type-k q4_0 / --cache-type-v q4_0 Quantizes the KV cache (key/value attention cache) to 4-bit. This is what makes 180k context fit in VRAM. Without it, huge contexts eat all your memory. Quality impact is minimal — this is a free performance win. --batch-size 1024 / --ubatch-size 512 batch-size = how many tokens are processed in one forward pass (throughput). ubatch-size = micro-batch actually sent to the GPU per step. Higher = faster prompt processing but needs more VRAM. If you run out of VRAM, lower these (e.g. 512/256). -ngl 100 Number of layers to offload to GPU. 100 = all layers on GPU (full offload). This is what you want if the model fits in your VRAM. If it doesn't fit, reduce this (e.g. -ngl 40) — remaining layers run on CPU/RAM. --tensor-split 0.55,0.45 How to split model layers across multiple GPUs. Values are ratios. 0.55,0.45 = GPU 0 gets 55% of layers, GPU 1 gets 45%. Adjust based on your VRAM — give more to the card with more memory. Example: 0.70,0.30 for a 24GB + 12GB setup. Not needed for single GPU setups. --main-gpu 0 Which GPU handles the batch computation (the "orchestrator"). Set to 0 (your primary GPU). The other GPU(s) handle their assigned layers. Minor performance impact — usually just leave it at 0. -np 1 Number of parallel slots (concurrent requests). 1 = one user at a time. Increase to 2-4 if you want multiple clients connected simultaneously. Each extra slot uses additional VRAM for its own KV cache. --port 8080 Which port the server listens on. Change if port 8080 is busy. --jinja Enables Jinja2 template processing — required for proper chat formatting. Most modern models expect this. Always include it. --spec-type draft-mtp Enables Multi-Token Prediction (MTP) speculative decoding. Only works with MTP GGUF models (downloaded separately). The model predicts multiple tokens at once and verifies them — big speed boost. --spec-draft-n-max 3 How many tokens the MTP draft head proposes per step. 3 is a good default. Higher = potentially faster but more VRAM and may reduce quality. --mmproj Path to the multimodal projector file (for vision models). Enables image understanding — paste screenshots into the web chat. Only needed if you want vision capabilities. Omit for text-only use. ━━━ Your Hardware → Your Command ━━━ Single GPU (8-24GB VRAM): Use the "Single GPU" command. Change -m to your model path. 8GB card → Gemma 4-12B Q4 or LFM2.5-8B 12GB card → Gemma 4-12B Q5/Q6 16GB card → Gemma 4-31B QAT Q4 or Nex-N2-Mini 24GB card → Qwen3.6-27B Q4/Q5, Qwopus-v2, Gemma 4-31B QAT Q5/Q6 Dual GPU: Use the "Dual GPU" command. Adjust --tensor-split based on your VRAM ratio. 24GB + 24GB → --tensor-split 0.50,0.50 24GB + 12GB → --tensor-split 0.70,0.30 24GB + 8GB → --tensor-split 0.75,0.25 Want speed? Use MTP versions of models with the "MTP" commands. Want vision? Add --mmproj with the projector file from the model's HuggingFace repo. 5. Once running, you get: • Web chat UI → http://localhost:8080 • OpenAI-compatible API → http://localhost:8080/v1 • Playground → http://localhost:8080/playground ━━━ Why /v1 API Is the Killer Feature ━━━ One local endpoint replaces your entire cloud API bill. The /v1 endpoint is drop-in OpenAI-spec compatible — every tool that speaks OpenAI just works. No custom code, no glue layer. Works out of the box with: • IDEs: Cursor, Continue, Windsurf, Cline, Roo Code • CLI tools: aider, Open Interpreter, OpenCode • Frameworks: LangChain, LlamaIndex, LiteLLM • Any OpenAI SDK (Python, Node, Go, Rust) Why this beats cloud APIs: • 100% private — code never leaves your machine • $0 per token — no rate limits, no quotas, no surprise bills • Works fully offline • Zero telemetry, no training on your data • Swap models by dropping in a different .gguf — no app changes needed • Run 32k–128k context windows without burning money Good combos: • Cursor + Qwopus-v2 → near-frontier quality, zero API cost • Continue + Qwen3.6-27B → best local coding agent • aider + Gemma 4-12B MTP → 162 tok/s, feels instant • OpenCode + Nex-N2-Mini → deep reasoning on 16GB Set any OpenAI-compatible client to your local endpoint: set OPENAI_API_KEY=sk-dummy (any non-empty string works) set OPENAI_BASE_URL=http://localhost:8080/v1 # every OpenAI-compatible tool now hits your local GPU Shoutouts: @0xSero @rS_alonewolf @witcheer @UnslothAI @LottoLabs

English

135

1.4K

181.8K

Willy Drucker@WillyDrucker·4h

AI Tip#19: Naming things with AI matters. Your file names, row/column names (tables), groupings, functions, features, so on. Best practices first always, but AI is not playing by the same book. Push back on the AI to make sure everything named owns its purpose and you'll avoid 90% of the problems out there. Bloat, slop, poor documentation, multiple projects, back-and-forth bugs, memory and quality issues the list goes on and can be steered with good simple naming. The AI will even do it for you. Spend 10% now to avoid 90% of issues later.

English

Willy Drucker@WillyDrucker·4h

@ChatsFi @OpenRouter It's free my friend. Cost is only your Claude/Codex usage, which it helps by spreading the load. It's searchable in the VS Code marketplace under "WAT321" or Google "WAT321" but here's the links VS Marketplace marketplace.visualstudio.com/items?itemName… Open VSX open-vsx.org/extension/Will…

English

Chats 🇨🇦@ChatsFi·5h

@WillyDrucker @OpenRouter I would like to use this for my VS code as I had to stop copilot pro subscription, what’s the cost like ?

English

OpenRouter@OpenRouter·1d

Introducing the Fusion API, the smartest compound model in the market. Fusion achieves Fable-level intelligence at half the price. How it works 👇

English

647

1.7K

14.1K

5.4M

Willy Drucker@WillyDrucker·4h

@siriusmolecules @OpenRouter It's searchable in the VS Code marketplace under "WAT321" or Google "WAT321" but here's the links VS Marketplace marketplace.visualstudio.com/items?itemName… Open VSX open-vsx.org/extension/Will…

English

Sirius Molecules@siriusmolecules·7h

@WillyDrucker @OpenRouter Care to share the extension?

English

Willy Drucker@WillyDrucker·4h

@mitchellvii @emilykschrader Might want to double check on that one brother.

English

Bill Mitchell@mitchellvii·4h

@emilykschrader We have never had an issue with Iran having nuclear energy. There is a vast difference between nuclear power level Uranium and weapons grade Uranium.

English

688

Emily Schrader - אמילי שריידר امیلی شریدر@emilykschrader·11h

Trump: Under the agreement Iran will be able to enrich uranium at a low level that can never serve military purposes …So the JCPOA. This is a catastrophic mistake by the Trump admin and shows they understand very little about nuclear proliferation. Talk about destroying your own legacy.

English

279

687

3.4K

62.4K

Willy Drucker@WillyDrucker·8h

@GeraldMorrison @TraffAlex Agree, if you can run 27B it's probably the best local model available right now.

English

Gerald Morrison@GeraldMorrison·14h

@TraffAlex I'm running a Mac Studio M4 Max with 36 GB unified. I can fit a 35B model. Do you have an opinion on whether Qwen 3.6 27B or Qwen 3.6 35B is the better model. I realize "better" is subjective. Just looking for an opinion.

English

1.8K

Willy Drucker@WillyDrucker·9h

@theo

QME

Theo - t3.gg@theo·9h

You guys will believe literally anything

Sauers@Sauers_

Big if true

English

149

1.8K

129.7K

Willy Drucker retweetet

Willy Drucker@WillyDrucker·1d

@OpenRouter I believe the results! Built an extension for VS Code that allows Claude and Codex to iterate back and forth and the results are mind blowing.

English

379

56.6K

Willy Drucker@WillyDrucker·15h

@iAnonPatriot This place looks like Action Park in the making.

English

1.8K

American AF 🇺🇸@iAnonPatriot·22h

Old footage of a 1920s amusement park ride designed to throw people off.. People were built DIFFERENT back then. Lmao

English

263

1.4K

12.2K

1.7M

Willy Drucker@WillyDrucker·16h

It's extremely similar, I register an MCP server as the "bridge" between Claude and Codex (or other models). Everything communicates back and forth on the bridge. Since AI treats the MCP server as a registration in memory or a tool-call essentially, you can use natural language to communicate on the bridge. There a messaging system included so the models can talk back and forth and even iterate but it needs to be a little more matured. But you can ask pretty much any question for now and expect a response. I had an issue once where Codex couldn't establish an SSH connection to one of my PCs and Claude noticed it was doing something wrong. Without my intervention, Claude reinitiated the bridge multiple times until Codex got it. Was pretty wild to see them talking to each other.

English

202

BWS@BWS14TW·19h

@WillyDrucker @OpenRouter I do something similar in VS Code with Claude Code Extension I ssh connect to my VPS then I call Codex Adversarial Review of all plans and code implementation and have Opus review Codex’s suggestions/improvements to come up with final answers. How would you compare to yours?

English

237

Willy Drucker@WillyDrucker·16h

It actually already works with it! I use an OpenCode harness to help with tool calls and you can put your local IP or remote LLM IP into the settings and it will connect. You can even use OpenCode's free model's like Big Pickle. I didn't expect it to get this much attention, so it has limitations, but does work. Looks like I'll be expanding on this, though the Git is open source. github.com/WillyDrucker/W…

English

113

Frits Karl@FritsKarl·1d

@WillyDrucker @OpenRouter Could that extension be modified to work with locally hosted models? Or even better refactor to a proxy server?

English

229

Willy Drucker@WillyDrucker·16h

@GetRichGiveBack @OpenRouter Hey there, yes you can. It uses the VS Code extensions (or CLI technically) you've already registered in your .claude and .codex settings. It never touches your keys or tokens those stay within the tools themselves. Basically, the way you already have it set up now.

English

MindGamer.tez | .eth@GetRichGiveBack·1d

@WillyDrucker @OpenRouter Hey Willy, thanks for sharing. Question: can I use my Claude and Codex subscriptions or is it API access only? Thanks

English

Willy Drucker@WillyDrucker·17h

@elonmusk I mean, did anyone really buy that COVID came from a fish market one mile from the Wuhan bio lab?

English

1.6K

Elon Musk@elonmusk·18h

Wow

DNI Tulsi Gabbard@DNIGabbard

Today, I’m releasing never before seen intelligence revealing new evidence of past US government funding for more than 120 biolabs in over 30 countries, including Ukraine. In support of President Trump‘s Executive Order to end federal funding of dangerous gain of function research around the world, and increase transparency and accountability, ODNI will continue working with partners across the Administration to identify where these labs are, what pathogens they contain, and what “research” is being conducted. odni.gov/index.php/news…

QST

6.7K

40.2K

222.5K

16.1M

Willy Drucker@WillyDrucker·1d

@ChaseIrons I weigh more than Chase Irons! I weigh more than Chase Irons. 😭*Puts back donut*

English

Chase Irons@ChaseIrons·1d

I've lost 70lbs in the last 3 years but not for the reason most people think

English

3.4K

Willy Drucker@WillyDrucker·1d

@henokcrypto Just rewrite your CLAUDE.md using the standard language. "You are a Fable 5 model and senior developer..."

English

Henok@henokcrypto·1d

I’m experiencing Fable 5 withdrawal

English

1.7K

Willy Drucker@WillyDrucker·1d

@mitchellvii The small wins are always fun. 10lbs more on bench, 10lbs more on military. Serves as that angel on your shoulder for days you can't see your toes. Having someone to go with helps too.

English

Bill Mitchell@mitchellvii·2d

I don't enjoy working out not because it's hard but because it's boring. If I could find a way to make working out fun, I think I'd do it more.

English

1.9K

Willy Drucker@WillyDrucker·1d

@DaveBlundin So much to unpack here. If someone can influence AI results to steer an outcome, they will. If someone can restrict AI usage to force a result, they will. These are not, maybe's, and it's coming to a western civilization near you. TV had its run, AI's turn.

English

251

Dave Blundin@DaveBlundin·1d

This is one of the most pivotal events in history because of the precedent it sets. Even if it doesn't stick, the cat is out of the bag: the government can now unilaterally decide who gets to use AI. Over the last few days I had fully transitioned to orchestrating with Fable. It initially felt like the cookie jar was moved out of reach. Now after spending the day with Opus, I've accepted just how monumental this is. Being denied the best AI model is effectively the same as being denied future employment.

Anthropic@AnthropicAI

The US government, citing national security authorities, has issued an export control directive to suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States, including foreign national Anthropic employees. The net effect of this order is that we must abruptly disable Fable 5 and Mythos 5 for all our customers to ensure compliance. Access to all other Claude models is not affected. We apologize for this disruption to our customers. We believe this is a misunderstanding and are working to restore access as soon as possible. Read our full statement: anthropic.com/news/fable-myt…

English

360

22.1K

Entdecken

@Tristanrhee3 @TraffAlex @0xSero @rS_alonewolf @witcheer @UnslothAI @LottoLabs @ChatsFi