
unbug
@unbug
https://t.co/5sKsiEXX8Y, CODELF (14k GitHub stars, https://t.co/z1Mfw3yNcy), #MIHTool (mentioned in Google I/O '13, https://t.co/HS3Jxj8Zho)

🧮 MATH 500 results for TriAttention on Gemma4-26B-A4B-it (5-bit quantized, M3 Ultra 512GB) using MLX-VLM

TA-2048 preserves 96% of baseline accuracy (22/30 vs 23/30) with the KV cache capped at 2048 tokens, regardless of reasoning length. Throughput stays rock-solid at ~77 tok/s across all modes.

Our gap is larger than the paper's (-3.4% vs -1.2% at budget=2048) because:

1. We ran Gemma4 A4B in non-thinking mode.
2. Only 5 layers use full attention (the other 50 are sliding window), so there is less surface area for TriAttention.
3. 5-bit quantization may be adding noise on top of the KV compression.

The takeaway: TriAttention works on Apple Silicon with MLX. Even in a non-reasoning mode with aggressive quantization, TA-2048 keeps accuracy intact. 🍎
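
[Editor's note] For readers who want to reproduce the bounded-memory part of this setup, here is a minimal sketch using mlx-lm's rotating KV cache to enforce a fixed 2048-token budget. This is only an analogous cap, not the TriAttention eviction policy from the paper, and the model id and prompt are placeholders, not the benchmark harness used above.

```python
# Minimal sketch, NOT the TriAttention implementation: mlx-lm's rotating
# KV cache enforces the same kind of fixed 2048-token budget the post measures.
from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache

# Placeholder checkpoint id; swap in whatever MLX model you actually use.
model, tokenizer = load("mlx-community/some-5bit-model")

# Cap the cache: memory stays bounded no matter how long the generation runs.
cache = make_prompt_cache(model, max_kv_size=2048)

answer = generate(
    model,
    tokenizer,
    prompt="Solve: what is the sum of the first 100 positive integers?",
    prompt_cache=cache,
    max_tokens=512,
)
print(answer)
```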

Open source is dead.

That's not a statement we ever thought we'd make. @calcom was built on open source. It shaped our product, our community, and our growth. But the world has changed faster than our principles could keep up.

AI has fundamentally altered the security landscape. What once required time, expertise, and intent can now be automated at scale. Code is no longer just read. It is scanned, mapped, and exploited, at near-zero cost. In that world, transparency becomes exposure, especially at scale.

After a lot of deliberation, we've made the decision to close the core @calcom codebase. This is not a rejection of what open source gave us. It's a response to the risks AI is making possible.

We're still supporting builders, releasing the core code under a new MIT-licensed open source project called cal.diy for hobbyists and tinkerers, but our priority now is simple: protecting our customers and community at all costs.

This may not be the most popular call. But we believe many companies will come to the same conclusion.

My full explanation below ↓

MiniMax M2.7 at home

running on 4x DGX Sparks

vLLM serving full BF16 weights, 200k context

OpenCode having the model monitor its own hardware and report thermals, tokens/sec, TTFT, and other runtime stats in real time

What benchmarks / workflows / things do you wanna see next?
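
[Editor's note] A rough sketch of a comparable vLLM setup via the offline Python API. The model id, context length, and parallelism layout here are assumptions standing in for the poster's actual config; a real multi-node DGX Spark deployment would run `vllm serve` across nodes instead.

```python
# Hypothetical sketch of a comparable vLLM launch, not the poster's command.
from vllm import LLM, SamplingParams

llm = LLM(
    model="MiniMaxAI/MiniMax-M2",  # placeholder HF repo id
    dtype="bfloat16",              # full BF16 weights, no quantization
    max_model_len=200_000,         # ~200k-token context window
    tensor_parallel_size=4,        # shard weights across four devices
)

outputs = llm.generate(
    ["Summarize the runtime stats you were given: tokens/sec, TTFT, thermals."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```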
