edge distiller

186 posts

@edgedistiller

AI, crypto, and technology content. https://t.co/yZpbZkh9ck

Joined August 2022

47 Following · 445 Followers

edge distiller@edgedistiller·17 May

@LottoLabs Grok actually feels different from other models. Purely for the sake of "cognitive diversity" I want it to succeed. Even if that means finding a different niche at a different price point from gigantic frontier models tuned for coding.

Lotto@LottoLabs·16 May

Idk why but I want grok to win the

1.7K

edge distiller@edgedistiller·17 May

New video up on my youtube channel about the new Qwen MTP models! I also compare quality benchmarks using BenchLoop, thanks to @outsource_

1.7K

edge distiller@edgedistiller·17 May

@leftcurvedev_ I also found no difference in speed between 2 and 6 draft tokens on different hardware with the same flags as you. Almost exactly 1.8x using MTP vs. without. I cover it in my recent video. youtu.be/RdEzYwPBwDo

221

left curve dev@leftcurvedev_·16 May

I nearly 2x'd the speed while only using +1GB VRAM with the new MTP update in llama.cpp 🤯

You need to add these flags to start using it:
--spec-type draft-mtp \
--spec-draft-p-min 0.75 \
--spec-draft-n-max 2

My results with Qwen3.6 27B on a single RTX 5080 ↓
⚪️ no flag (without mtp) → 54.3 tok/s with 13.26GB VRAM
🔵 --spec-draft-n-max 2 → 90.7 tok/s with 14.29GB VRAM
🔴 --spec-draft-n-max 2 --spec-draft-p-min 0.75 → 93.9 tok/s with 14.30GB VRAM
🟢 --spec-draft-n-max 6 --spec-draft-p-min 0.75 → 93.9 tok/s with 14.87GB VRAM

Increasing to 6 draft tokens didn't help my setup for some reason. I made sure to test with a low context length to have enough headroom and eliminate risk of vram stress.

From my understanding:
1) The speed gains are very task-dependent. You need to test across a wide range of tasks to get a realistic idea of the benefits
2) We’re already running heavily quantized GGUF models (Q3, Q4, Q6, etc.), so we already benefit from strong speed/performance thanks to the reduced size. That’s why some people are seeing little to no improvement compared to MLX or other quantized versions

The progress over the past few days has been insane to say the least. However, MTP now consumes significantly more VRAM. Personally 16GB just isn't enough to use MTP and run it with a good context size. Time to upgrade lads, 24GB+ users are eating GOOD today 🔥

Full setup below ↓
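The numbers quoted in the post above can be turned into speedup and VRAM-overhead figures with a few lines of arithmetic. This is only a sketch: the throughput and VRAM values are copied from the post, while the dictionary keys and the helper name `compare` are made up for illustration.

```python
# Figures copied from the post above (Qwen3.6 27B, single RTX 5080).
BASELINE_TOK_S = 54.3
BASELINE_VRAM_GB = 13.26

# Labels are illustrative shorthand for the flag combinations in the post.
RUNS = {
    "n-max 2": (90.7, 14.29),
    "n-max 2, p-min 0.75": (93.9, 14.30),
    "n-max 6, p-min 0.75": (93.9, 14.87),
}


def compare(tok_s: float, vram_gb: float) -> tuple:
    """Return (speedup over baseline, extra VRAM in GB) for one run."""
    return tok_s / BASELINE_TOK_S, vram_gb - BASELINE_VRAM_GB


RESULTS = {name: compare(*values) for name, values in RUNS.items()}

for name, (speedup, extra_vram) in RESULTS.items():
    print(f"{name}: {speedup:.2f}x speed, +{extra_vram:.2f} GB VRAM")
```

Run as-is, this prints one line per configuration, which makes it easy to check a "nearly 2x" claim against the measured ratio on your own hardware.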

430

28.1K

edge distiller@edgedistiller·15 May

@k_flowstate Why do we need to trust anyone? Either a statement is true or it is not. People that produce a lot of true, useful statements are generally worth giving attention to.

flowstate@k_flowstate·14 May

Now, who do I trust here when it comes to Open Sourced AI? Don't get me wrong, both of them share really great insights when it comes to local models profiling. But how can we trust local AI to win when the two top-most reliable sources don't even trust each other?

100

177

19.7K

edge distiller@edgedistiller·14 May

@ItsmeAjayKV Ultimately, is it worth it if you have to use a smaller (worse) quantization due to the additional VRAM overhead? MTP seems like an optimization that only makes sense when your VRAM is too large for the current tier you are using but too small for the next tier up.

561

AJ@ItsmeAjayKV·14 May

Why does VRAM usage jump when MTP is enabled?

From my Qwen3.6-35B-A3B-MTP runs on a 12GB RTX 3060:

Keeping everything else identical, same model, quant, ngl, ncmoe, KV cache and only changing --spec-draft-n-max I noticed this:

Without MTP: ~5.98GB VRAM
MTP enabled (spec-draft-n-max): ~8.47GB VRAM

Then increasing n_max 2 -> n_max 4 only added ~0.05GB more VRAM afterward.

Basically, the VRAM jump happens because the runtime has to load those extra MTP prediction heads upfront. Once those "draft weights" are taking up the space, increasing n_max just adds a fraction of room for the extra tokens after that.

Normal decoding is basically:
predict token -> append token -> next forward pass -> repeat

With MTP/speculative decoding, the runtime uses those extra heads to:
- draft multiple future tokens ahead
- maintain verifier state
- track speculative execution paths
- manage accept/reject logic for drafted tokens

So enabling MTP introduces a fairly large baseline infrastructure cost immediately to get those heads ready. After that, increasing n_max mostly changes how far ahead the runtime speculates rather than scaling memory usage linearly.
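The draft/verify/accept loop described above can be sketched with toy models. This is a minimal illustration of greedy speculative decoding in general, not llama.cpp's MTP implementation: `target_next` and `draft_next` are hypothetical stand-ins for the full model and the draft heads, and the verification step calls the target once per position where a real runtime would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextFn = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Hypothetical 'full model': a deterministic toy next-token rule."""
    return (ctx[-1] * 3 + 1) % 11


def draft_next(ctx: List[int]) -> int:
    """Hypothetical 'draft head': agrees with the target most of the time."""
    guess = (ctx[-1] * 3 + 1) % 11
    return guess if ctx[-1] % 4 else (guess + 1) % 11  # wrong after 0, 4, 8


def greedy_decode(next_fn: NextFn, prompt: List[int], n_new: int) -> List[int]:
    """Normal decoding: predict token -> append -> repeat."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_fn(out))
    return out


def speculative_decode(target: NextFn, draft: NextFn, prompt: List[int],
                       n_new: int, n_max: int) -> Tuple[List[int], int]:
    """Draft up to n_max tokens, keep the prefix the target agrees with."""
    out = list(prompt)
    goal = len(prompt) + n_new
    verify_passes = 0
    while len(out) < goal:
        drafted: List[int] = []
        for _ in range(n_max):
            drafted.append(draft(out + drafted))
        verify_passes += 1  # stands in for one batched target pass
        all_accepted = True
        for tok in drafted:
            expected = target(out)
            out.append(expected)  # the target's token is always what is kept
            if expected != tok:   # first rejection ends this round
                all_accepted = False
                break
            if len(out) >= goal:
                break
        if all_accepted and len(out) < goal:
            out.append(target(out))  # bonus token from the same pass
    return out, verify_passes


baseline = greedy_decode(target_next, [1], 20)
spec, passes = speculative_decode(target_next, draft_next, [1], 20, n_max=2)
print(spec == baseline, passes)
```

The output sequence matches plain greedy decoding because every kept token comes from the target; the gain is that fewer verification passes are needed than tokens generated whenever the draft guesses correctly.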

136

10.5K

edge distiller@edgedistiller·14 May

@dadhalfdev @LottoLabs Happy to help! Let me know if you have any suggestions or questions.

Marco Rodrigues@dadhalfdev·14 May

@LottoLabs I'm not playing in the local league yet. But this is super useful 🙏 amazing work!

Lotto@LottoLabs·13 May

A good little video overviewing local inference

Using localmaxxing + llama.cpp server + cline

youtu.be/oISvtpHKRfk?si…

1.4K

edge distiller@edgedistiller·13 May

@outsource_ Haha I just recorded a video on local benchmarking before this was posted, I definitely want to check this out! youtube.com/watch?v=4jCmXU…

Eric ⚡️ Building...@outsource_·13 May

🚨Introducing BenchLoop for Local Model benchmarks We Built the missing piece for local LLMs👇🏻 One app to pull, chat, benchmark, and compare models on your hardware. Try it now 👉🏻 bench-loop.com pipx install benchloop-cli

1.7K

edge distiller@edgedistiller·12 May

It's only May and local LLM benchmarks already got me like this

edge distiller@edgedistiller·11 May

@regularaugust What was his name again?

881

august@regularaugust·11 May

Average viral tweet about relationships: “ladies, if you’re on a date with a guy and he asks to split the bill, you’re dating a woman” People who will find true love in their lifetimes: “yeah this one’s called Samurai Flamenco. Wait until Guillotine Gorilla shows up…”

948

20.6K

august@regularaugust·11 May

Cost zero dollars to plug in an hdmi cable and show them some bullshit

unusual_whales@unusual_whales

"The average cost of a date for a millennial is now $252," per BMO

826

14.2K

274.2K

edge distiller@edgedistiller·9 May

Claude Code is the Windows 11 of agent harnesses.

himanshu@himanshustwts

the harness of claude code is very interesting. a random unstable header at the start of the prompt was breaking KV-cache reuse on a 52k-token context. NVIDIA stripped it out and TTFT dropped by 5x.
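Why a changing header at the very start of a prompt defeats KV-cache reuse can be shown with a toy prefix comparison. This is a simplified sketch: it assumes the cache can only be reused for the longest shared prefix of the previous and new prompts, uses words instead of token IDs, and the header strings and helper name `shared_prefix_len` are invented for illustration.

```python
from typing import Sequence


def shared_prefix_len(cached: Sequence[str], new: Sequence[str]) -> int:
    """Number of leading tokens that are identical in both prompts."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


# A long, stable body followed by a short per-turn question.
body = ["You", "are", "a", "coding", "agent", "."]

# Stable prefix: only the final token differs between turns.
stable_reuse = shared_prefix_len(body + ["Q1"], body + ["Q2"])

# Unstable header (hypothetical values) in front of the same body.
unstable_reuse = shared_prefix_len(
    ["<id:8f3a>"] + body + ["Q1"],
    ["<id:c21d>"] + body + ["Q2"],
)

print(stable_reuse, unstable_reuse)
```

With the stable prompt, everything up to the question can be reused; with a different first token on every request, the shared prefix is empty and the whole context has to be processed again.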

edge distiller@edgedistiller·8 May

@sudoingX I just made a video covering this (llama.cpp build for local inference), would be happy to hear any thoughts: x.com/edgedistiller/…

edge distiller@edgedistiller

I made a video on running LLMs locally, specifically by using other people's benchmarks on LocalMaxxing. All criticism/feedback is welcome! youtube.com/watch?v=oISvtp…

201

Sudo su@sudoingX·8 May

anyone interested in or getting started with local ai personal inference, pay attention. start with the right practice. compile llama.cpp from source.

i know lm studio and ollama exist. they're great onramps. but they're mostly wrappers around llama.cpp with abstraction layers that hide the flags you actually need to tune.

what compiling once gets you:
> the best inference engine for personal use, full stop
> latest features the day they merge (vulkan flash attention dp4a, kv cache quant, fa toggles)
> exact gpu arch optimization (sm_120 for 5090, sm_89 for 4090, sm_86 for 3090)
> direct flag control
> openai-compatible llama-server api ready out of the box

the build (3-5 minutes on a modern cpu):
git clone github.com/ggerganov/llam…
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120
cmake --build build --config Release -j

(replace 120 with 86 for 3090, 89 for 4090, 80 for A100. for AMD GPUs swap GGML_CUDA for GGML_VULKAN.)

when to NOT use llama.cpp:
> multi-gpu batch serving at scale = vllm
> production async high-throughput = vllm or sglang
> apple silicon = mlx is faster

for single-gpu personal inference + agentic workflows + benchmarking: llama.cpp from source. every time.
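The GPU-to-architecture substitutions in the post above can be captured in a small helper that assembles the configure command. This is a sketch only: the architecture numbers and the two CMake options are the ones quoted in the post, while the table name `CUDA_ARCH` and the function `configure_command` are hypothetical and the command is built as a list without being executed.

```python
from typing import List, Optional

# Architecture values as listed in the post above.
CUDA_ARCH = {"RTX 5090": 120, "RTX 4090": 89, "RTX 3090": 86, "A100": 80}


def configure_command(gpu: Optional[str] = None, vulkan: bool = False) -> List[str]:
    """Build the `cmake -B build ...` argument list for a given GPU."""
    cmd = ["cmake", "-B", "build"]
    if vulkan:
        # The post suggests swapping GGML_CUDA for GGML_VULKAN on AMD GPUs.
        cmd.append("-DGGML_VULKAN=ON")
    else:
        if gpu not in CUDA_ARCH:
            raise ValueError(f"no architecture listed for {gpu!r}")
        cmd += ["-DGGML_CUDA=ON", f"-DCMAKE_CUDA_ARCHITECTURES={CUDA_ARCH[gpu]}"]
    return cmd


print(" ".join(configure_command("RTX 3090")))
print(" ".join(configure_command(vulkan=True)))
```

The resulting list could be passed to `subprocess.run` from inside the cloned repository; it is printed here so the command can be inspected first.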

493

23.2K

edge distiller@edgedistiller·8 May

I made a video on running LLMs locally, specifically by using other people's benchmarks on LocalMaxxing. All criticism/feedback is welcome! youtube.com/watch?v=oISvtp…

428

edge distiller@edgedistiller·6 May

@ChemPhysMajor @arena @GoogleDeepMind Both Qwen 3.6 models are much more expensive than Gemma 4.

Cynical Optimist@ChemPhysMajor·6 May

@arena @GoogleDeepMind This doesn't seem right. Everyone using these models will tell you that Gemma4 has its place but Qwen3.6 is the best at frontend web dev. Stranger, there's no Qwen3.6-27B or 35B-A3B on this chart. Either strategic omission in favor of the cloud Qwen3.6-plus, or poor tests.

808

Arena.ai@arena·6 May

Gemma-4 lands in Code Arena: Frontend Webdev and shifts the Pareto Frontier! Among open models, Gemma-4-31b ranks #13 and Gemma-4-26b-a4b ranks #17. Congrats to @GoogleDeepMind on shifting the frontier!

Google DeepMind@GoogleDeepMind

Meet Gemma 4: our new family of open models you can run on your own hardware. Built for advanced reasoning and agentic workflows, we’re releasing them under an Apache 2.0 license. Here’s what’s new 🧵

387

40K

edge distiller@edgedistiller·5 May

@MindMechanical @greenTetra_ Literally thought of this as I read it, still can't decide which one to pick.

Mechanical Mind@MindMechanical·4 May

@greenTetra_ Defectors:

517

Tetra@greenTetra_·4 May

A perfect clone of the you who is as of this moment reading my post is made (completely identical down to the smallest factor including mentally/emotionally), the two of you are then instantly teleported into the canonical Prisoner's Dilemma situation. Do you cooperate or defect?

8.7K

edge distiller@edgedistiller·1 May

@AlphaMFPEFM @Elaina43114880 That's fair, but it can also just be extrapolated from the price, since xAI has a massive amount of compute and we know they can make a 10T model or do whatever they want. Ultimately the constraint is not parameter size, but performance for a given price class.

AlphaMFPEFM@AlphaMFPEFM·1 May

@akhsurgin @Elaina43114880 From a simple user point of view, you're right, but for people who try to judge if the model is good for its size, and thus if the next model sizes (1T, 1.5T, 6T...) might bring something good to the table, it's worth noting that this is a 0.5T-parameter model.

Elaina@Elaina43114880·1 May

Some people may say that Grok 4.3 is “only” a 500B model, and that its performance is already very impressive for that size.

First of all, Grok 4.x is not open-source, and xAI has not open-sourced a flagship model for a long time. For users, a closed model, whether it is 50B or 5000B, is ultimately just an API endpoint.

Second, Kimi K2.6 uses a native INT4 quantization method. This means that even with 1.1T parameters, the total size of all its weight files is still under 600GB. In other words, Grok 4.3 would need to use a native INT8 quantization scheme and keep its total weight size in the 500GB+ range to be comparable to Kimi K2.6. Otherwise, if it uses a traditional BF16 format, its total weight size would be nearly twice that of K2.6.

Where Grok 4.3 is better than Kimi K2.6 is its tighter integration with the X/Twitter ecosystem, which allows it to access more timely information, as well as its more favorable API output price ($2.5 < $4.0) and larger context window (1M > 256K).

So simply emphasizing that Grok 4.x is a 500B model is basically meaningless.
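The weight-size comparison in the post above is simple arithmetic: parameters times bits per weight. The sketch below reproduces it under simplifying assumptions that are not in the post: every weight is stored at the stated precision, metadata and any higher-precision layers are ignored, and 1 GB is taken as 10^9 bytes. The function name `weight_size_gb` is illustrative.

```python
def weight_size_gb(params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


kimi_int4 = weight_size_gb(1.1e12, 4)    # 1.1T parameters at INT4
grok_int8 = weight_size_gb(0.5e12, 8)    # 500B parameters at INT8
grok_bf16 = weight_size_gb(0.5e12, 16)   # 500B parameters at BF16

print(f"Kimi K2.6 @ INT4: {kimi_int4:.0f} GB")
print(f"500B model @ INT8: {grok_int8:.0f} GB")
print(f"500B model @ BF16: {grok_bf16:.0f} GB")
print(f"BF16 / Kimi INT4 ratio: {grok_bf16 / kimi_int4:.2f}")
```

Under these assumptions the INT4 figure lands around 550 GB, consistent with the "under 600GB" claim, and the BF16 case comes out a little under twice that.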

Lisan al Gaib@scaling01

Grok-4.3 still behind chinese open-source

241

31K

edge distiller@edgedistiller·1 May

@AlphaMFPEFM @Elaina43114880 If it's closed source then all that matters is the API price and the quality of output. The model could be 500T parameters for all I care.

AlphaMFPEFM@AlphaMFPEFM·1 May

@Elaina43114880 The number of parameters for a given type of model architecture will have a bigger effect on its quality than 8-bit quantization vs. BF16 (or even 4-bit if well done). So mentioning that Grok 4.3 is a 500B-parameter model still has some meaning.

451

edge distiller@edgedistiller·1 May

@LottoLabs You know you've made it when someone is willing to rent an H200 to make big number even bigger on the leaderboard.

294

Lotto@LottoLabs·30 Apr

Lol if you ever need a fast 0.8b model 🥹

433

20.5K

edge distiller@edgedistiller·29 Apr

@banteg "hedging" and it's literally just confidence intervals based on measurement error.

610

banteg@banteg·29 Apr

i've never seen someone hedge so much (9x). i think the ranking is more interesting than the "predicted" size.

147

24.8K

edge distiller@edgedistiller·29 Apr

@Rokieee__ @bojie_li "Reasoning compresses. Factual knowledge doesn't."

523

˗ˏˋ Rookie ˎˊ˗@Rokieee__·29 Apr

@bojie_li According to your post, gemini 2.5 pro is ~1.7 trillion and sonnet ~1.7, and we all know kimi k2.6 and glm 5.1 are ~800 billion. So why such a difference in their performance? I understand that it can be due to differences in training data, but it should not be this much.

9.4K

Bojie Li@bojie_li·29 Apr

Closed labs hide model sizes. They can't hide what their models know, and what a model knows is an indicator on how big it is.

Reasoning compresses. Factual knowledge doesn't. So you can size a frontier model from black-box API calls alone, and across releases you can literally watch a single fact arrive in the parameters over time.

For three years, my friends Jiyan He and Zihan Zheng have been asking frontier LLMs the same question: "what do you know about USTC Hackergame?", a CTF contest. May 2024: GPT-4o invented fake titles. Feb 2025: Claude 3.7 Sonnet listed 19 verified 2023 challenges. By April 2026, frontier models recall specific challenges across consecutive years.

After DeepSeek-V4 dropped, I instructed my agent to spend four days autonomously turning that habit into Incompressible Knowledge Probes (IKP): 1,400 questions, 7 tiers of obscurity, 188 models, 27 vendors.

Three findings:

1/ You can approximately size any black-box LLM from factual accuracy alone. Penalized accuracy is log-linear in log(params), R² = 0.917 on 89 open-weight models from 135M to 1.6T params. Project closed APIs onto the curve → GPT-5.5 ~9T, Claude Opus 4.7 ~4T, GPT-5.4 ~2.2T, Claude Sonnet 4.6 ~1.7T, Gemini 2.5 Pro ~1.2T (90% CI: 0.3-3x size).

2/ Citation count and h-index don't predict whether a frontier model recognizes a researcher. Two researchers with similar citation profiles get very different responses. Models memorize impact: work that shaped a field, not many incremental papers.

3/ Factual capacity doesn't compress over time. Across 96 open-weight models across 3 years, the IKP time coefficient is statistically zero, rejecting the Densing-Law prediction of +0.0117/month at p<10⁻¹⁵. Reasoning benchmarks saturate; factual capacity keeps scaling with parameters.

Website: 01.me/research/ikp/
Paper: arxiv.org/pdf/2604.24827
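The "fit a curve on open models, then project closed models onto it" idea in finding 1 can be illustrated with a toy least-squares fit. This sketch is not the paper's method or data: it assumes accuracy is roughly linear in log10(parameters), the calibration points are synthetic values invented for the example, and the "0.3-3x" interval is read as a multiplicative band around each point estimate. Only the point estimates in `POINT_ESTIMATES_T` are taken from the post.

```python
from typing import Dict, List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def estimate_params(accuracy: float, slope: float, intercept: float) -> float:
    """Invert the fitted line to get a parameter-count estimate."""
    return 10 ** ((accuracy - intercept) / slope)


def ci_band(point: float, low: float = 0.3, high: float = 3.0) -> Tuple[float, float]:
    """Multiplicative band around a point estimate (assumed reading of the CI)."""
    return point * low, point * high


# SYNTHETIC calibration data (not from the paper): log10(params) vs. accuracy.
log_params = [9.0, 10.0, 11.0, 12.0]
accuracy = [0.1, 0.2, 0.3, 0.4]
slope, intercept = fit_line(log_params, accuracy)

# Size estimate for a hypothetical black-box model scoring 0.35.
hypothetical_size = estimate_params(0.35, slope, intercept)
print(f"estimated params: {hypothetical_size:.3g}")

# Point estimates quoted in the post, in trillions of parameters.
POINT_ESTIMATES_T: Dict[str, float] = {
    "GPT-5.5": 9.0,
    "Claude Opus 4.7": 4.0,
    "GPT-5.4": 2.2,
    "Claude Sonnet 4.6": 1.7,
    "Gemini 2.5 Pro": 1.2,
}
for name, point in POINT_ESTIMATES_T.items():
    lo, hi = ci_band(point)
    print(f"{name}: ~{point}T (band {lo:.2f}T to {hi:.2f}T)")
```

Read this way, the band spans an order of magnitude for each model, which gives some context for treating the ranking as more informative than the individual size figures.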

235

2.2K

389.1K

edge distiller@edgedistiller·29 Apr

@witchof0x20 @halvarflake This is the only correct answer in the replies and it got 0 engagement, what a shame. The correct answer was remote attestation, people.

etherret🐾@witchof0x20·28 Apr

@halvarflake Nvidia has some sort of attestation thing I think but nobody uses it

476

Halvar Flake@halvarflake·28 Apr

How do I know that a token provider is providing the model itself and not a hardcore quantization?

11.6K
