edge distiller

186 posts

@edgedistiller

AI, crypto, and technology content. https://t.co/yZpbZkh9ck

Joined August 2022
47 Following · 445 Followers
edge distiller
edge distiller@edgedistiller·
@LottoLabs Grok actually feels different from other models. Purely for the sake of "cognitive diversity" I want it to succeed. Even if that means finding a different niche at a different price point from gigantic frontier models tuned for coding.
0
0
1
43
Lotto
Lotto@LottoLabs·
Idk why but I want grok to win the
13
0
38
1.7K
edge distiller
edge distiller@edgedistiller·
New video up on my youtube channel about the new Qwen MTP models! I also compare quality benchmarks using BenchLoop, thanks to @outsource_
edge distiller tweet media
1
0
7
1.7K
edge distiller
edge distiller@edgedistiller·
@leftcurvedev_ I also found no difference in speed between 2 and 6 draft tokens on different hardware with the same flags as you. Almost exactly 1.8x using MTP vs. without. I cover it in my recent video. youtu.be/RdEzYwPBwDo
edge distiller tweet media
0
0
0
221
left curve dev
left curve dev@leftcurvedev_·
I nearly 2x'd the speed while only using +1GB VRAM with the new MTP update in llama.cpp 🤯

You need to add these flags to start using it:
--spec-type draft-mtp \
--spec-draft-p-min 0.75 \
--spec-draft-n-max 2

My results with Qwen3.6 27B on a single RTX 5080 ↓
⚪️ no flag (without mtp) → 54.3 tok/s with 13.26GB VRAM
🔵 --spec-draft-n-max 2 → 90.7 tok/s with 14.29GB VRAM
🔴 --spec-draft-n-max 2 --spec-draft-p-min 0.75 → 93.9 tok/s with 14.30GB VRAM
🟢 --spec-draft-n-max 6 --spec-draft-p-min 0.75 → 93.9 tok/s with 14.87GB VRAM

Increasing to 6 draft tokens didn't help my setup for some reason. I made sure to test with a low context length to have enough headroom and eliminate risk of vram stress.

From my understanding:
1) The speed gains are very task-dependent. You need to test across a wide range of tasks to get a realistic idea of the benefits
2) We’re already running heavily quantized GGUF models (Q3, Q4, Q6, etc.), so we already benefit from strong speed/performance thanks to the reduced size. That’s why some people are seeing little to no improvement compared to MLX or other quantized versions

The progress over the past few days has been insane to say the least. However, MTP now consumes significantly more VRAM. Personally 16GB just isn't enough to use MTP and run it with a good context size. Time to upgrade lads, 24GB+ users are eating GOOD today 🔥

Full setup below ↓
30
37
430
28.1K
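The figures in the post above can be turned into speedup and VRAM-overhead numbers with a few lines of Python. This is only a sketch over the four reported runs; the labels and the helper name `compare_to_baseline` are made up for illustration, and nothing here measures anything on real hardware.

```python
# Figures copied from the post above (Qwen3.6 27B, single RTX 5080).
# Each entry: (label, tokens per second, VRAM in GB). Labels are illustrative.
RUNS = [
    ("no MTP", 54.3, 13.26),
    ("n-max 2", 90.7, 14.29),
    ("n-max 2, p-min 0.75", 93.9, 14.30),
    ("n-max 6, p-min 0.75", 93.9, 14.87),
]


def compare_to_baseline(runs):
    """Return (label, speedup, extra VRAM in GB) relative to the first run."""
    _, base_tps, base_vram = runs[0]
    return [
        (label, round(tps / base_tps, 2), round(vram - base_vram, 2))
        for label, tps, vram in runs[1:]
    ]


for label, speedup, extra_gb in compare_to_baseline(RUNS):
    print(f"{label}: {speedup}x speed, +{extra_gb} GB VRAM")
```

On these numbers the best run works out to roughly 1.7x the baseline throughput, and going from 2 to 6 draft tokens adds VRAM without adding speed.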
edge distiller
edge distiller@edgedistiller·
@k_flowstate Why do we need to trust anyone? Either a statement is true or it is not. People that produce a lot of true, useful statements are generally worth giving attention to.
0
0
1
30
flowstate
flowstate@k_flowstate·
Now, who do I trust here when it comes to Open Sourced AI?

Don't get me wrong, both of them share really great insights when it comes to local models profiling. But how can we trust local AI to win when the two top-most reliable sources don't even trust each other?
flowstate tweet media (2 images)
100
7
177
19.7K
edge distiller
edge distiller@edgedistiller·
@ItsmeAjayKV Ultimately, is it worth it if you have to use a smaller (worse) quantization due to the additional VRAM overhead? MTP seems like an optimization that only makes sense when your VRAM is too large for the current tier you are using but too small for the next tier up.
1
0
1
561
AJ
AJ@ItsmeAjayKV·
Why does VRAM usage jump when MTP is enabled?

From my Qwen3.6-35B-A3B-MTP runs on a 12GB RTX 3060:

Keeping everything else identical, same model, quant, ngl, ncmoe, KV cache and only changing --spec-draft-n-max I noticed this:

Without MTP: ~5.98GB VRAM
MTP enabled (spec-draft-n-max): ~8.47GB VRAM

Then increasing n_max 2 -> n_max 4 only added ~0.05GB more VRAM afterward.

Basically, the VRAM jump happens because the runtime has to load those extra MTP prediction heads upfront. Once those "draft weights" are taking up the space, increasing n_max just adds a fraction of room for the extra tokens after that.

Normal decoding is basically:
predict token -> append token -> next forward pass -> repeat

With MTP/speculative decoding, the runtime uses those extra heads to:
- draft multiple future tokens ahead
- maintain verifier state
- track speculative execution paths
- manage accept/reject logic for drafted tokens

So enabling MTP introduces a fairly large baseline infrastructure cost immediately to get those heads ready. After that, increasing n_max mostly changes how far ahead the runtime speculates rather than scaling memory usage linearly.
AJ tweet media
17
12
136
10.5K
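The draft, verify, accept/reject cycle described in the post above can be sketched as a toy greedy loop in Python. Everything here is a stand-in: `target_next` and `draft_next` are hypothetical arithmetic rules rather than real models, verification is done token by token instead of in one batched forward pass, and the probability threshold behind `--spec-draft-p-min` is not modelled. It is not how llama.cpp implements MTP, only an illustration of why accepted drafts save forward passes while the output stays identical to plain decoding.

```python
def target_next(context):
    """Hypothetical stand-in for the full model: a deterministic toy rule."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context):
    """Hypothetical cheap drafter: right most of the time, wrong on some contexts."""
    guess = target_next(context)
    return guess if len(context) % 3 else (guess + 1) % 50


def plain_decode(prompt, n_new):
    """Normal decoding: predict token, append token, repeat."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def speculative_decode(prompt, n_new, n_max=2):
    """Greedy speculative decoding with up to n_max drafted tokens per step."""
    out = list(prompt)
    goal = len(prompt) + n_new
    accepted = drafted = 0
    while len(out) < goal:
        # 1) Draft up to n_max tokens ahead with the cheap predictor.
        draft = []
        for _ in range(min(n_max, goal - len(out))):
            draft.append(draft_next(out + draft))
        drafted += len(draft)
        # 2) Verify: keep drafted tokens while they match the target's choice.
        for token in draft:
            expected = target_next(out)
            if token == expected:
                out.append(token)
                accepted += 1
            else:
                out.append(expected)  # the target's token replaces the first miss
                break
        else:
            # Every draft was accepted: the verifier also yields one extra token.
            if len(out) < goal:
                out.append(target_next(out))
    return out, accepted, drafted


tokens, accepted, drafted = speculative_decode([1, 2, 3], n_new=20, n_max=4)
print(f"accepted {accepted} of {drafted} drafted tokens")
```

Because every appended token is the one the target would have chosen, raising `n_max` only changes how many drafts are attempted per step, never the final text.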
Marco Rodrigues
Marco Rodrigues@dadhalfdev·
@LottoLabs I'm not playing in the local league yet. But this is super useful 🙏 amazing work!
1
0
1
31
Lotto
Lotto@LottoLabs·
A good little video overviewing local inference Using localmaxxing + llama.cpp server + cline youtu.be/oISvtpHKRfk?si…
2
1
13
1.4K
Eric ⚡️ Building...
Eric ⚡️ Building...@outsource_·
🚨Introducing BenchLoop for Local Model benchmarks

We Built the missing piece for local LLMs👇🏻

One app to pull, chat, benchmark, and compare models on your hardware.

Try it now 👉🏻 bench-loop.com

pipx install benchloop-cli
Eric ⚡️ Building... tweet media
6
3
18
1.7K
edge distiller
edge distiller@edgedistiller·
It's only May and local LLM benchmarks already got me like this
edge distiller tweet media
0
0
1
54
august
august@regularaugust·
Average viral tweet about relationships: “ladies, if you’re on a date with a guy and he asks to split the bill, you’re dating a woman” People who will find true love in their lifetimes: “yeah this one’s called Samurai Flamenco. Wait until Guillotine Gorilla shows up…”
7
86
948
20.6K
Sudo su
Sudo su@sudoingX·
anyone interested in or getting started with local ai personal inference, pay attention. start with the right practice. compile llama.cpp from source.

i know lm studio and ollama exist. they're great onramps. but they're mostly wrappers around llama.cpp with abstraction layers that hide the flags you actually need to tune.

what compiling once gets you:
> the best inference engine for personal use, full stop
> latest features the day they merge (vulkan flash attention dp4a, kv cache quant, fa toggles)
> exact gpu arch optimization (sm_120 for 5090, sm_89 for 4090, sm_86 for 3090)
> direct flag control
> openai-compatible llama-server api ready out of the box

the build (3-5 minutes on a modern cpu):
git clone github.com/ggerganov/llam…
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120
cmake --build build --config Release -j

(replace 120 with 86 for 3090, 89 for 4090, 80 for A100. for AMD GPUs swap GGML_CUDA for GGML_VULKAN.)

when to NOT use llama.cpp:
> multi-gpu batch serving at scale = vllm
> production async high-throughput = vllm or sglang
> apple silicon = mlx is faster

for single-gpu personal inference + agentic workflows + benchmarking: llama.cpp from source. every time.
43
46
493
23.2K
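The per-GPU substitutions in the build instructions above can be captured in a small Python helper that assembles the two cmake invocations. This is a sketch: the architecture table is copied from the post, the function name `llama_cpp_build_commands` is made up, and the commands are only printed, not executed.

```python
import shlex

# CUDA architecture values exactly as listed in the post above.
CUDA_ARCH = {"5090": "120", "4090": "89", "3090": "86", "A100": "80"}


def llama_cpp_build_commands(gpu=None, backend="cuda"):
    """Return the configure and build cmake commands as argument lists."""
    if backend == "cuda":
        if gpu not in CUDA_ARCH:
            raise ValueError(f"no architecture listed for GPU {gpu!r}")
        configure = [
            "cmake", "-B", "build", "-DGGML_CUDA=ON",
            f"-DCMAKE_CUDA_ARCHITECTURES={CUDA_ARCH[gpu]}",
        ]
    elif backend == "vulkan":
        # Per the post: for AMD GPUs, swap GGML_CUDA for GGML_VULKAN.
        configure = ["cmake", "-B", "build", "-DGGML_VULKAN=ON"]
    else:
        raise ValueError(f"unknown backend {backend!r}")
    build = ["cmake", "--build", "build", "--config", "Release", "-j"]
    return [configure, build]


for command in llama_cpp_build_commands("3090"):
    print(shlex.join(command))
```

Returning argument lists rather than one shell string keeps the commands safe to hand to `subprocess.run` later, if that is wanted.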
edge distiller
edge distiller@edgedistiller·
I made a video on running LLMs locally, specifically by using other people's benchmarks on LocalMaxxing. All criticism/feedback is welcome! youtube.com/watch?v=oISvtp…
0
0
2
428
Cynical Optimist
Cynical Optimist@ChemPhysMajor·
@arena @GoogleDeepMind This doesn't seem right. Everyone using these models will tell you that Gemma4 has its place but Qwen3.6 is the best at frontend web dev. Stranger, there's no Qwen3.6-27B or 35B-A3B on this chart. Either strategic omission in favor of the cloud Qwen3.6-plus, or poor tests.
1
0
2
808
Tetra
Tetra@greenTetra_·
A perfect clone of the you who is as of this moment reading my post is made (completely identical down to the smallest factor including mentally/emotionally), the two of you are then instantly teleported into the canonical Prisoner's Dilemma situation. Do you cooperate or defect?
26
1
68
8.7K
edge distiller
edge distiller@edgedistiller·
@AlphaMFPEFM @Elaina43114880 That's fair, but it can also just be extrapolated from the price, since xAI has a massive amount of compute and we know they can make a 10T model or do whatever they want. Ultimately the constraint is not parameter size, but performance for a given price class.
1
0
1
28
AlphaMFPEFM
AlphaMFPEFM@AlphaMFPEFM·
@akhsurgin @Elaina43114880 From a simple user point of view, you're right, but for people who try to judge if the model is good for its size and thus if the next model's size (1T, 1.5T, 6T...) might bring something good to the table, it's worth noting that this is a 0.5T parameter model
1
0
0
14
Elaina
Elaina@Elaina43114880·
Some people may say that Grok 4.3 is “only” a 500B model, and that its performance is already very impressive for that size.

First of all, Grok 4.x is not open-source, and xAI has not open-sourced a flagship model for a long time. For users, a closed model, whether it is 50B or 5000B, is ultimately just an API endpoint.

Second, Kimi K2.6 uses a native INT4 quantization method. This means that even with 1.1T parameters, the total size of all its weight files is still under 600GB. In other words, Grok 4.3 would need to use a native INT8 quantization scheme and keep its total weight size in the 500GB+ range to be comparable to Kimi K2.6. Otherwise, if it uses a traditional BF16 format, its total weight size would be nearly twice that of K2.6.

Where Grok 4.3 is better than Kimi K2.6 is its tighter integration with the X/Twitter ecosystem, which allows it to access more timely information, as well as its more favorable API output price ($2.5 < $4.0) and larger context window (1M > 256K).

So simply emphasizing that Grok 4.x is a 500B model is basically meaningless.
Lisan al Gaib@scaling01

Grok-4.3 still behind chinese open-source

28
5
241
31K
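The weight-size comparison in the post above is straightforward arithmetic: parameters times bits per weight, divided by 8 to get bytes. A minimal Python sketch, using the parameter counts as claimed in the thread (1.1T for Kimi K2.6, 500B for Grok 4.3) and ignoring quantization scales, metadata, and any layers kept at higher precision; the function name `weight_size_gb` is made up:

```python
def weight_size_gb(n_params, bits_per_weight):
    """Approximate size of the raw weights in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# Parameter counts are the ones claimed in the thread, not verified figures.
CASES = [
    ("1.1T params at INT4", 1.1e12, 4),
    ("500B params at INT8", 0.5e12, 8),
    ("500B params at BF16", 0.5e12, 16),
]

for label, n_params, bits in CASES:
    print(f"{label}: ~{weight_size_gb(n_params, bits):.0f} GB")
```

This gives about 550 GB, 500 GB, and 1000 GB respectively, which matches the post's "under 600GB", "500GB+ range", and "nearly twice" statements.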
edge distiller
edge distiller@edgedistiller·
@AlphaMFPEFM @Elaina43114880 If it's closed source then all that matters is the API price and the quality of output. The model could be 500T parameters for all I care.
1
0
0
20
AlphaMFPEFM
AlphaMFPEFM@AlphaMFPEFM·
@Elaina43114880 The number of parameters for a given type of model architecture will have a bigger effect on its quality than an 8-bit quantization vs BF16 (or even 4-bit if well done). So mentioning that Grok 4.3 is a 500B-parameter model still has some meaning.
1
0
4
451
edge distiller
edge distiller@edgedistiller·
@LottoLabs You know you've made it when someone is willing to rent an H200 to make big number even bigger on the leaderboard.
1
0
2
294
Lotto
Lotto@LottoLabs·
Lol if you ever need a fast 0.8b model 🥹
Lotto tweet media
17
3
433
20.5K
edge distiller
edge distiller@edgedistiller·
@banteg "hedging" and it's literally just confidence intervals based on measurement error.
0
0
0
610
banteg
banteg@banteg·
i've never seen someone hedge so much (9x). i think the ranking is more interesting than the "predicted" size.
banteg tweet media
12
9
147
24.8K
˗ˏˋ Rookie ˎˊ˗
˗ˏˋ Rookie ˎˊ˗@Rokieee__·
@bojie_li According to your post, Gemini 2.5 Pro is ~1.7 trillion and Sonnet ~1.7, and we all know Kimi K2.6 and GLM 5.1 are ~800 billion, so why is there such a difference in their performance? I understand that it can be due to differences in training data, but it should not be this much.
2
0
19
9.4K
Bojie Li
Bojie Li@bojie_li·
Closed labs hide model sizes. They can't hide what their models know, and what a model knows is an indicator of how big it is.

Reasoning compresses. Factual knowledge doesn't. So you can size a frontier model from black-box API calls alone, and across releases you can literally watch a single fact arrive in the parameters over time.

For three years, my friends Jiyan He and Zihan Zheng have been asking frontier LLMs the same question: "what do you know about USTC Hackergame?", a CTF contest. May 2024: GPT-4o invented fake titles. Feb 2025: Claude 3.7 Sonnet listed 19 verified 2023 challenges. By April 2026, frontier models recall specific challenges across consecutive years.

After DeepSeek-V4 dropped, I instructed my agent to spend four days autonomously turning that habit into Incompressible Knowledge Probes (IKP): 1,400 questions, 7 tiers of obscurity, 188 models, 27 vendors.

Three findings:

1/ You can approximately size any black-box LLM from factual accuracy alone. Penalized accuracy is log-linear in log(params), R² = 0.917 on 89 open-weight models from 135M to 1.6T params. Project closed APIs onto the curve → GPT-5.5 ~9T, Claude Opus 4.7 ~4T, GPT-5.4 ~2.2T, Claude Sonnet 4.6 ~1.7T, Gemini 2.5 Pro ~1.2T (90% CI: 0.3-3x size).

2/ Citation count and h-index don't predict whether a frontier model recognizes a researcher. Two researchers with similar citation profiles get very different responses. Models memorize impact: work that shaped a field, not many incremental papers.

3/ Factual capacity doesn't compress over time. Across 96 open-weight models across 3 years, the IKP time coefficient is statistically zero, rejecting the Densing-Law prediction of +0.0117/month at p<10⁻¹⁵. Reasoning benchmarks saturate; factual capacity keeps scaling with parameters.

Website: 01.me/research/ikp/
Paper: arxiv.org/pdf/2604.24827
Bojie Li tweet media (3 images)
71
235
2.2K
389.1K
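The sizing idea in finding 1/ above can be sketched in Python as a least-squares line through (log10 params, accuracy) points, inverted to estimate the size of a model whose accuracy is known. This assumes the post means accuracy is roughly linear in log10(params). The calibration points below are invented for illustration and are not the paper's data; only the 0.3x to 3x interval factor comes from the post.

```python
import math

# Made-up calibration points (parameter count, penalized accuracy).
# Purely illustrative; not taken from the IKP paper.
CALIBRATION = [(1e9, 0.10), (1e10, 0.25), (1e11, 0.40), (1e12, 0.55)]


def fit_size_curve(points):
    """Ordinary least squares for accuracy = slope * log10(params) + intercept."""
    xs = [math.log10(params) for params, _ in points]
    ys = [accuracy for _, accuracy in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def estimate_params(accuracy, slope, intercept):
    """Invert the fitted line to project a black-box model onto the curve."""
    return 10 ** ((accuracy - intercept) / slope)


def size_interval(estimate, low=0.3, high=3.0):
    """Apply the 0.3x to 3x interval quoted in the post to a point estimate."""
    return estimate * low, estimate * high


slope, intercept = fit_size_curve(CALIBRATION)
estimate = estimate_params(0.40, slope, intercept)
low, high = size_interval(estimate)
print(f"~{estimate:.2e} params (interval {low:.2e} to {high:.2e})")
```

The wide interval is the point of the "hedging" discussed in the thread: a factor-of-three band on each side reflects measurement error in the fit, not indecision about the ranking.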
edge distiller
edge distiller@edgedistiller·
@witchof0x20 @halvarflake This is the only correct answer in the replies and it got 0 engagement, what a shame. The correct answer was remote attestation, people.
0
0
0
32
etherret🐾
etherret🐾@witchof0x20·
@halvarflake Nvidia has some sort of attestation thing I think but nobody uses it
1
0
2
476
Halvar Flake
Halvar Flake@halvarflake·
How do I know that a token provider is providing the model itself and not a hardcore quantization?
13
4
38
11.6K