Alexey Fateev


Eric@Ex0byt·22h

Update: the road to GLM-5.2: we're getting there, folks! non-quantized, non-pruned DeepSeek-v4-Flash. 11tok/s on a single DGX Spark. sglang inference + custom mega-kernel. Pure beauty.


Alexey Fateev@superalesha·3h

Good call on the 3090, still the value pick for local. The privacy line genuinely closes deals, I have seen the same. One 24GB card keeps a solid mid-size model snappy for one user. The day you serve more than yourself the bottleneck moves from VRAM to how you split the model across cards, but for a solo setup the 3090 is hard to beat on dollars per token.


Triana Kurtetti@kurtetti·13h

He lost a $5,000 client in 90 seconds. Watch until 0:40. That's where the fix appears. The client asked one question: "Where does my data go when you run it through Claude?" He said: Anthropic's servers. The client thanked him and left. That night he bought a used RTX 3090. 24GB VRAM. $900–$1,200 on the secondhand market. The old card had 8GB. Not enough. When the model doesn't fit in VRAM, response times collapse. The 3090 loads the full model. Stays snappy. Stays local. Old stack: $220/month. Claude Pro + Cursor + subscriptions. New stack: ~$35/month. One GPU. One open model. 85% local. 15% cloud. Now when a client asks where their data goes, the answer is different: "Nothing leaves this machine." That sentence is worth more than two years of subscription savings. Month 6 break-even on hardware. Year 2 savings: $3,390. Before the first client revenue. The AI bill made sense when local models were too weak. That window is closing.
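The break-even and year-2 figures in the post can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming a hardware cost of $1,050 (the midpoint of the quoted $900–$1,200 range, which is the value that makes the $3,390 figure come out); the function names are illustrative only.

```python
# Figures quoted in the post. HARDWARE_COST is an assumption:
# the midpoint of the quoted $900-$1,200 secondhand price range.
OLD_MONTHLY = 220     # Claude Pro + Cursor + subscriptions
NEW_MONTHLY = 35      # one GPU, one open model, 15% cloud
HARDWARE_COST = 1050  # assumed midpoint


def monthly_saving(old: int, new: int) -> int:
    """Dollars saved per month after switching stacks."""
    return old - new


def break_even_month(hardware: int, saving: int) -> int:
    """First whole month in which cumulative savings cover the hardware."""
    return -(-hardware // saving)  # ceiling division


def net_savings(months: int, hardware: int, saving: int) -> int:
    """Cumulative savings after `months`, net of the hardware purchase."""
    return months * saving - hardware


saving = monthly_saving(OLD_MONTHLY, NEW_MONTHLY)
print(saving)                                   # 185
print(break_even_month(HARDWARE_COST, saving))  # 6
print(net_savings(24, HARDWARE_COST, saving))   # 3390
```

At the low end of the price range ($900) the same arithmetic breaks even in month 5, and at the high end ($1,200) in month 7.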

Triana Kurtetti@kurtetti

x.com/i/article/2068…


Alexey Fateev@superalesha·3h

The cost gap is real, and it widens if you tune the serving layout. Running Gemma 4 at volume I got 1.9x the throughput on the same 4x3090, just switching from one wide TP4 to two TP2 replicas: 1084 vs 563 tok/s. Same hardware, almost double the tokens per dollar. Breakdown: x.com/superalesha/st…

Same question as my Qwen post, one size up: how should Gemma 4 31B run on 4x3090? The best layout flips depending on whether you serve one user or a crowd. One user: TP4, all four cards on every token. 78 tok/s, against 57 for a 2-card TP2. No draft head here like Qwen's MTP, so the extra cards are just raw compute and wider wins. 64 users: two TP2 replicas (TP2+DP2). 1084 tok/s, against 563 for one wide TP4. 1.9x on the same four cards. Why: TP4 makes all four cards sync after every layer, and on 3090s that crosses PCIe. Great when you want every card chewing one token, rough under load. Two replicas barely talk, so they scale almost clean. Same rule as before, now on a dense 31B: replicate over the fewest cards that fit for throughput, split wide only when one user's latency is the whole job. Four separate copies fit only at ~1k context and lose on both, so skip that here. Full card: both launch commands, first token, tok/watt, footprint
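The layout rule in the post can be checked against its own numbers. The sketch below only tabulates the four quoted measurements and picks the faster layout per load level; it does not model the hardware, and the dictionary layout and function names are assumptions made for illustration.

```python
# Throughput figures quoted in the post (Gemma 4 31B on 4x3090).
# Keys are (layout, concurrent users); values are aggregate tok/s.
MEASURED = {
    ("TP4", 1): 78,
    ("TP2", 1): 57,
    ("TP2+DP2", 64): 1084,
    ("TP4", 64): 563,
}


def speedup(fast: float, slow: float) -> float:
    """Ratio of two throughput measurements."""
    return fast / slow


def best_layout(users: int) -> str:
    """Return the measured layout with the highest tok/s at this load."""
    candidates = {
        layout: tps for (layout, u), tps in MEASURED.items() if u == users
    }
    return max(candidates, key=candidates.get)


print(best_layout(1), round(speedup(78, 57), 2))      # TP4 1.37
print(best_layout(64), round(speedup(1084, 563), 1))  # TP2+DP2 1.9
```

The table only covers the two load levels reported in the post (1 and 64 users); where the crossover sits between them is not something these numbers can answer.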


Shinichi Takaŷanagi@_stakaya·9h

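Good insight! > On cost, Gemma 4 as a local LLM is overwhelmingly cheaper than an external LLM API (GPT-5.4 mini class). The same workload ends up nearly 200x apart in cost. "Handling 1 million records and 5 billion tokens a month with a local LLM: mapping statement descriptions to store information"｜onagaway zenn.dev/finatext/artic… #zenn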


Alexey Fateev@superalesha·3h

Solid numbers. Matches my MTP runs on Qwen, around 1.5 to 2x on code from the draft head. The catch with Gemma 4 31B is it has no MTP head, so on that one I chased layout instead of spec decode: TP2+DP2 gave 1.9x the throughput of TP4 on my 4x3090. Different lever, same goal. Full sweep: x.com/superalesha/st…


rS_ローカルLLM×投資@rS_alonewolf·10h

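■ Uncensored Gemma4 as an "ero-local" LLM: speed on my setup (how much faster does MTP make it?)
The other day I installed the "uncensored siblings of Gemma4 26B" for erotic purposes. On an RTX 5070 Ti (16GB VRAM) with 128K context, I measured real-world speed side by side with the original (official) model. Only on short prompts, though.

■ What MTP is
Speculative decoding: a small lookahead head predicts several tokens ahead and the main model verifies them in one batch. When the guesses hit, it jumps ahead. The output is exactly the same as normal decoding. Only the speed changes (supposedly. Really? Maybe the eroticism changes too).

■ 26B results (decode tok/s, 128K): without MTP → with MTP (speedup)
▼ Claude-Opus distilled siblings (leaning smart)
- teich: 40.3 → 59.6 (1.48x)
- apex: 43.6 → 62.7 (1.44x)
▼ SFT (not distilled)
- gemopus: 41.5 → 60.1 (1.45x)
▼ Uncensored siblings (ero-local models)
- supergemma: 46.2 → 62.8 (1.36x)
- hauhau: 40.2 → 60.8 (1.51x)
- trevorjs: 40.9 → 60.7 (1.48x)
▼ Tool-specialized
- hermes: 46.3 → 66.6 (1.44x)
26B Q6 runs at about 40 t/s plain and 60 to 67 t/s with MTP. Speed is nearly level across lineages, and the lighter Q4 is slightly faster than the official model.

■ 12B (dense) gains far more from MTP (for reference)
On an ordinary non-MoE 12B (Q8_0), the lookahead hits more often and the gain is on another level.
- base: 56.8 → 155.5 (2.74x)
- opus: 55.6 → 148.1 (2.66x)
- coder-fable5: 55.6 → 146.6 (2.64x)
The 26B (MoE) gets 1.4 to 1.5x, while the 12B (dense) gets about 2.7x. Plain speed is also higher on the 12B. They split cleanly into "26B for smarts / 12B for speed". I haven't tried ero-local on the 12B yet, so that's for next time.

■ Isn't 60 t/s on the slow side, about the same as cloud?
GLM-5.1 and the like sometimes drop to about that speed when they're slow. This is a little slower than that, but the value is getting that slowish-cloud-level 60 t/s "local, uncensored, free, no account bans". I'm thinking of running an eroticism benchmark by mutual evaluation before long.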
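The speedup column in the post above can be recomputed from the raw tok/s pairs, which also makes the MoE-versus-dense gap explicit. A minimal sketch: the numbers are copied from the post, while the grouping into two dictionaries and the helper names are assumptions made for illustration.

```python
from statistics import mean

# (tok/s without MTP, tok/s with MTP), as quoted in the post.
RESULTS = {
    "26B MoE": {
        "teich": (40.3, 59.6),
        "apex": (43.6, 62.7),
        "gemopus": (41.5, 60.1),
        "supergemma": (46.2, 62.8),
        "hauhau": (40.2, 60.8),
        "trevorjs": (40.9, 60.7),
        "hermes": (46.3, 66.6),
    },
    "12B dense": {
        "base": (56.8, 155.5),
        "opus": (55.6, 148.1),
        "coder-fable5": (55.6, 146.6),
    },
}


def speedup(base: float, with_mtp: float) -> float:
    """Decode speedup from enabling MTP."""
    return with_mtp / base


def mean_speedup(group: str) -> float:
    """Average speedup across all models in one group."""
    return mean(speedup(b, m) for b, m in RESULTS[group].values())


for group in RESULTS:
    print(group, round(mean_speedup(group), 2))
```

Every per-model ratio in the post rounds to the value this computes, so the table is internally consistent.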

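I seriously want to evaluate the eroticism of local LLMs, but if I use a provider for the judge model I'll get refusals or, at worst, an account ban. If something like "お⚪︎ん⚪︎がお⚪︎ん⚪︎に..." gets typed in, I could be locked out of AI for life, so I was wondering what to do. So I'm thinking of having every uncensored model score every model's writing anonymously. Text that several models rate highly is likely to be genuinely strong. Each model's habit of grading leniently or harshly can be corrected for. You can also see whether a model rates only itself highly. In short, instead of one model's subjective view, an all-hands mutual evaluation decides the "truly erotic model". Or so I thought, but the compute grows exponentially compared to before lol. I want an RTX Pro 6000 Blackwell. This is not an investment. It's to unleash my own desires! I want to summon the erotic official model ♡ #VRAM飢饉救済教 #でかいVRAM欲しい #あれ言ってること矛盾してね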
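The mutual-evaluation idea in the post above has three parts: pool scores from every judge, correct for lenient or harsh judges, and check whether a model favors its own text. The sketch below is one possible way to do that with per-judge z-scores; the model names A, B, C and their scores are made up, and nothing here is the author's actual pipeline. In this scheme the number of judgments is N × N for N models.

```python
from statistics import mean, pstdev


def normalize_by_judge(scores):
    """scores[judge][author] -> raw score. Returns per-judge z-scores,
    which removes each judge's overall leniency or harshness."""
    out = {}
    for judge, row in scores.items():
        mu = mean(row.values())
        sigma = pstdev(row.values()) or 1.0  # flat judge: avoid dividing by 0
        out[judge] = {a: (s - mu) / sigma for a, s in row.items()}
    return out


def peer_scores(scores):
    """Average normalized score each author gets from the OTHER judges."""
    z = normalize_by_judge(scores)
    return {a: mean(z[j][a] for j in z if j != a) for a in scores}


def self_preference(scores):
    """How much higher each model rates itself than its peers rate it."""
    z = normalize_by_judge(scores)
    peers = peer_scores(scores)
    return {a: z[a][a] - peers[a] for a in scores}


def judgments_needed(n_models: int) -> int:
    """Every model scores every model's text once."""
    return n_models * n_models


# Hypothetical scores: A grades leniently, B harshly, C favors itself.
scores = {
    "A": {"A": 9, "B": 8, "C": 7},
    "B": {"A": 5, "B": 4, "C": 2},
    "C": {"A": 6, "B": 7, "C": 9},
}
ranking = sorted(peer_scores(scores), key=peer_scores(scores).get, reverse=True)
print(ranking)
print(self_preference(scores))
```

Excluding each model's own vote from its peer score is a design choice here; it keeps a self-favoring judge from lifting its own rank.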


Alexey Fateev@superalesha·3h

This matches what I just measured on bare metal. On 4x3090, two TP2 replicas (data parallel) beat one wide TP4 by 1.9x on Gemma 4 31B: 1084 vs 563 tok/s at 64 streams. TP4 syncs every card after each layer and that all-reduce is the tax. Replicas barely talk, so they scale almost clean. Full layout sweep: x.com/superalesha/st…


Idle@IdleProtocol·8h

We've been building this for months. Today we wrote it all down. How IDLE Protocol runs distributed inference across thousands of consumer GPUs - the architecture, the hardware tiers, vLLM under the hood, consensus validation, and why data parallelism at the network level beats tensor parallelism in a data center for this problem. Full technical breakdown.

Idle@IdleProtocol

x.com/i/article/2069…


Alexey Fateev@superalesha·4h

Same question as my Qwen post, one size up: how should Gemma 4 31B run on 4x3090? The best layout flips depending on whether you serve one user or a crowd. One user: TP4, all four cards on every token. 78 tok/s, against 57 for a 2-card TP2. No draft head here like Qwen's MTP, so the extra cards are just raw compute and wider wins. 64 users: two TP2 replicas (TP2+DP2). 1084 tok/s, against 563 for one wide TP4. 1.9x on the same four cards. Why: TP4 makes all four cards sync after every layer, and on 3090s that crosses PCIe. Great when you want every card chewing one token, rough under load. Two replicas barely talk, so they scale almost clean. Same rule as before, now on a dense 31B: replicate over the fewest cards that fit for throughput, split wide only when one user's latency is the whole job. Four separate copies fit only at ~1k context and lose on both, so skip that here. Full card: both launch commands, first token, tok/watt, footprint


Alexey Fateev@superalesha·4h

@zanoga What InfiniBand on a 3090? What are you talking about, bro?


Max Zanoga@zanoga·16h

Finally finished building my AI datacenter! 🚀 32x3090s across 4 servers (8 GPUs each), all connected over InfiniBand. The whole setup is solar-powered with a massive battery bank and generator backup. More technical details and benchmarks coming soon.


Alexey Fateev@superalesha·9h

@SlimTradeyBaby I think so - we're the only ones here. I've gone through the entire HF discussion and nobody else has asked about this. insane 😵


BlackwellBoy@SlimTradeyBaby·9h

@superalesha I'll give it a whirl when I get home and see if I can spin anything up. Like I said in my other post, slap fable 5 on and they will come haha. But if we are the only 2 in 450k+ that can't get it working well XD sheesh


BlackwellBoy@SlimTradeyBaby·14h

Pro tip: Slap ‘fable5’ in your GGUF name for instant Claude Fable 5 superpowers + virility buff, this Gemma 4 12B coder was trained on Composer 2.5 + Fable 5 recovery CoT. The filename is not lying. Anyone tried it yet? Small cars good option? huggingface.co/yuxinlu1/gemma…


Alexey Fateev@superalesha·10h

The turboquant fork earns its keep, I ran it on Gemma 4 on my 3090s and the turbo KV cache types are what actually let you stretch context without OOM. Getting an A4B MoE agent loop alive in 8GB is wild. One number from my runs: on the dense 31B, llama.cpp sits around 55-60 tok/s single stream while vLLM AWQ does ~75, so the second you go multi-user the engine choice starts to outweigh the quant.


Alok@analogalok·1d

I just got the Gemma 4 26B A4B MoE model running fully locally with Hermes agent on an 8GB RTX 4060, and it's now backtesting trading strategies end to end, no hand-holding. If you're a trader or work on Wall Street, you don't want to miss this. Yes, fully automated. No cloud. No APIs beyond market data.

# Here's what I did:

Setup:
- Model: Gemma 4 26B-A4B QAT (MoE), Unsloth's Q4_K_XL quant (link in the comments)
- Inference: llama.cpp (turboquant fork by @no_stp_on_snek, link in the comments)
- Hardware: RTX 4060, 8GB VRAM + 16GB RAM only (with 50 other Chrome tabs open)
- Context: 64K

llama.cpp turboquant flags:
-m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -c 64000 --cache-type-k q8_0 --cache-type-v turbo3 --port 8080

turboquant helps achieve high prefill and decode throughput for interactive sessions.

Throughput with Hermes agent:
- decode: 25+ tokens/sec
- prefill: 250+ tokens/sec

# Then I gave the agent one task:

Backtest a strategy:
- Buy when RSI crosses above 30
- Sell at +2% profit or -1% stop loss
- No overlapping positions
- Use Google stock via yfinance
- Generate a full HTML report with candlestick charts + signals

What happened next was wild. It didn't just write code, it ran the entire workflow itself:
- Audited the environment (pip list, dependency check)
- Hit a ModuleNotFoundError: multiple Python installs were conflicting
- Ran where python to map every interpreter on the system
- Manually selected the correct Python 3.13 path and re-ran the script
- Wrote a clean state machine backtester (strict no-overlapping-trades logic)
- Patched a yfinance MultiIndex quirk that would've crashed the script
- Built Plotly candlestick + RSI charts with buy/sell markers
- Calculated win rate, PnL, and summary stats
- Exported a polished single-file HTML report. Check the report at the end of the video or in the comments.

Biggest takeaway: local LLMs aren't just "chat assistants" anymore. They debug their own environment, write production code, and ship a finished deliverable on consumer hardware, for $0 in API costs. If you're still calling local models "toys," you're already behind. This is just the beginning. Hermes agent just surpassed 1 trillion tokens in a single day on OpenRouter. Think about the scale of total token generation happening right now. Disclaimer: This is not financial advice. Consult a professional before making any trading decisions.
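The strategy rules in the post (enter when RSI crosses above 30, exit at +2% or -1%, never hold two positions at once) map naturally onto a two-state machine. The sketch below is not the agent's code: it uses a plain-Python simple-average RSI instead of yfinance or Plotly, evaluates exits on closing prices only (an assumption), and runs on a made-up price series.

```python
def rsi(closes, period=14):
    """Simple-average RSI aligned with `closes` (None until enough data).
    A window with no losses is reported as 100."""
    out = [None] * len(closes)
    for i in range(period, len(closes)):
        window = closes[i - period:i + 1]
        pairs = list(zip(window, window[1:]))
        gains = sum(max(b - a, 0) for a, b in pairs)
        losses = sum(max(a - b, 0) for a, b in pairs)
        out[i] = 100.0 if losses == 0 else 100 - 100 / (1 + gains / losses)
    return out


def backtest(closes, period=14, entry_level=30,
             take_profit=0.02, stop_loss=0.01):
    """Return the list of per-trade returns. State is FLAT when `entry`
    is None and LONG otherwise, so positions can never overlap."""
    values = rsi(closes, period)
    trades, entry = [], None
    for i in range(1, len(closes)):
        price = closes[i]
        if entry is None:  # FLAT: look for RSI crossing above the level
            prev, cur = values[i - 1], values[i]
            if prev is not None and prev < entry_level <= cur:
                entry = price
        else:              # LONG: look for take-profit or stop-loss
            change = price / entry - 1
            if change >= take_profit or change <= -stop_loss:
                trades.append(change)
                entry = None
    return trades


def win_rate(trades):
    """Fraction of trades with a positive return (0.0 if no trades)."""
    return sum(t > 0 for t in trades) / len(trades) if trades else 0.0


# Made-up closes: a slide, then a recovery that triggers one trade.
closes = [100, 99, 98, 97, 96, 97, 98, 99, 100, 101]
trades = backtest(closes, period=3)
print(trades, win_rate(trades))
```

A position still open at the end of the series is left unrecorded here; a fuller backtester would mark it to the last close.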

Teknium 🪽@Teknium

Wait we actually just broke 1T tokens in a day for the first time on OpenRouter :O Please keep contributing to the most awesome project I've ever worked on to help make Hermes Agent the best software stack on the planet! Thank you contributors🍻🍻


Alexey Fateev@superalesha·10h

Config-only swap from EAGLE-3 is the dream. Real question for the Ampere crowd: any realistic path to DFlash on sm_86? On 4x3090 MTP-style spec decode already gives me a solid bump, but FlashInfer JIT is rough on older CUDA. Would be great to keep the 3090 rigs in the spec-decode game as these drafters land.


Increase inference performance by up to 15x without sacrificing responsiveness. DFlash, an open source lightweight block diffusion model designed for speculative decoding, delivers up to 15x higher throughput on NVIDIA Blackwell while maintaining the same user interactivity target. Instead of drafting tokens one at a time, it proposes a whole block in a single pass for the main model to verify in parallel. Adoption is drop-in with support in @lmsysorg SGLang, TensorRT-LLM, and @vllm_project.


vLLM@vllm_project·1d

🙏 Thanks to the @NVIDIAAI team for highlighting DFlash support on vLLM! With DFlash speculative decoding, swapping EAGLE-3 for a DFlash checkpoint is a config-only change — no code edits needed. It runs through the open-source Speculators library, which links the DFlash drafter to the target model's hidden states in the vLLM inference path. On Gemma-4 31B on a single Blackwell Ultra GPU, this delivers up to 5.8x higher throughput at the same concurrency over autoregressive decoding: 🧮 Math500 — 5.8x ➕ GSM8K — 5.3x 💻 HumanEval — 5.6x 🐍 MBPP — 4.4x Read the blog here! 👇
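The draft-a-block-then-verify loop behind speculative decoding can be shown with toy stand-ins. This is a generic greedy draft-and-verify sketch, not DFlash (which is a block diffusion drafter) and not vLLM's implementation: `target_next` and `draft_block` are made-up deterministic functions, and the per-token check inside the loop stands in for what a real system does in one parallel forward pass.

```python
def target_next(ctx):
    """Toy 'target model': the next token is a fixed function of context."""
    return (sum(ctx) * 31 + len(ctx)) % 7


def draft_block(ctx, k):
    """Toy 'drafter': proposes k tokens at once. It agrees with the
    target except that it deliberately guesses 0 whenever the target
    would say 3, so some blocks get cut short."""
    block, tmp = [], list(ctx)
    for _ in range(k):
        guess = target_next(tmp)
        if guess == 3:
            guess = 0  # deliberate mistake
        block.append(guess)
        tmp.append(guess)
    return block


def speculative_decode(prompt, n_tokens, k=4):
    """Greedy draft-and-verify. Returns (tokens, verification passes)."""
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < n_tokens:
        block = draft_block(out, k)
        passes += 1  # one verification pass checks the whole block
        for tok in block:
            expected = target_next(out)
            out.append(expected)  # always emit the target's own token
            if tok != expected:
                break             # discard the rest of the block
        else:
            out.append(target_next(out))  # all accepted: bonus token
    return out[len(prompt):len(prompt) + n_tokens], passes


def greedy_decode(prompt, n_tokens):
    """Plain one-token-at-a-time decoding, for comparison."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out[len(prompt):]


tokens, passes = speculative_decode([1, 2, 3], 32)
print(tokens == greedy_decode([1, 2, 3], 32), passes)
```

Because every emitted token is the target's own choice, the output matches plain greedy decoding; the gain is that one verification pass can yield several tokens.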

NVIDIA AI@NVIDIAAI


Alexey Fateev@superalesha·10h

732 GB/s HBM2 on a 10 year old card is still respectable, and layer-split RPC is the right move for a mixed-bandwidth pool. The thing I hit on my 4x3090: RPC is fine for one stream but leaves most of the throughput on the floor under load, where vLLM data-parallel pulls way ahead. For a single-user box at home though, 25-30 tok/s on an 80B at 256k is a genuinely good result.


Matt Thompson@fuzziphy·1d

I’ve been playing with these 10 year old Tesla P100s I had laying around to see if they’re useful today. Each have 16GB of 732GB/sec HBM2. Pooled using layer splitting RPC with a 4090 and a 4080 for a total of 72GB of VRAM, Qwen3 next 80B Q4 with full 256k context (about 60GB of VRAM) is 25-30 tok/sec, with room to spare for RAG embeddings


Alexey Fateev@superalesha·10h

@FirstSquawk Yeah. Just yesterday I published an article covering 7 Chinese GPU chip makers. It looks like something is going to change, and clearly not in Nvidia's favor. x.com/superalesha/st…

Everyone's arguing about NVIDIA export controls. Almost nobody can name the 7 Chinese companies already shipping H100/H200-class silicon - most IPO'd in the last 6 months. I run Chinese open models on a 4×3090 rig daily. So I drew the map nobody's drawing


First Squawk@FirstSquawk·20h

Nvidia’s banned AI chips double in price on China’s black market-FT


Alexey Fateev@superalesha·10h

35B MoE with only 3B active is the sweet spot for a 4x3090 box, that active count is what keeps decode fast at 96GB. Curious if the agent RL holds up under quant, AWQ/GGUF is where these MoEs usually start to wobble on long tool chains. Either way this is getting loaded on my rig tonight.
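Why the active parameter count matters for decode speed can be sketched with a memory-bandwidth bound: each generated token has to stream at least the active weights from VRAM once. The figures below are rough ceilings under stated assumptions (16-bit weights, a single RTX 3090 at its 936 GB/s spec bandwidth, no KV-cache, routing, or interconnect cost), not measurements.

```python
RTX_3090_BANDWIDTH_GB_S = 936  # spec-sheet memory bandwidth of one card


def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Size of the weights in GB for a given precision."""
    return params_billion * bits_per_param / 8


def decode_ceiling_tok_s(active_params_billion: float,
                         bits_per_param: int,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed when every token must
    read the active weights from memory once."""
    return bandwidth_gb_s / weight_gb(active_params_billion, bits_per_param)


# All 35B parameters must fit in VRAM, but only ~3B are read per token.
print(weight_gb(35, 16))  # 70.0 GB of weights at 16-bit, under 96 GB
print(round(decode_ceiling_tok_s(3, 16, RTX_3090_BANDWIDTH_GB_S)))   # 156
print(round(decode_ceiling_tok_s(35, 16, RTX_3090_BANDWIDTH_GB_S)))  # 13
```

The gap between the last two lines is the point of the post: total size decides whether the model fits, active size decides how fast it decodes.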


Adina Yakup@AdinaYakup·13h

Qwen-AgentWorld 🌍 A language world model to cover 7 agent interaction domains within a single model - 35B MoE /3B active - 262 K context - Terminal, Web, Android, SWE, Search, MCP, OS - Single turn RL transfers to multi-turn agentic tasks - Model & paper on @huggingface


Alexey Fateev@superalesha·11h

Selfishly hoping they keep going, Qwen open has been the backbone of my 4x3090 setup for a year. 3.6-27B is my daily driver and the 122B AWQ is the ceiling that actually fits in 96GB. If they ship an open 3.7 I'll bench it the day it drops. The open weights are half the reason the local scene moved this fast.


Ahmad@TheAhmadOsman·11h

Qwen team, are you planning on releasing an opensource Qwen 3.7 model or should we just call your opensource contributions at this point?


Alexey Fateev@superalesha·12h

@sudoingX 💸


Sudo su@sudoingX·12h

what's stopping you from buying a gpu dude?


Alexey Fateev@superalesha·13h

The copy brings my logins with it. The headless agent is already signed in to X and the rest, so it never stops to re-auth or throw a 2FA prompt in a window I can't see. A fresh profile would mean logging into every site by hand inside headless. And that debug-port limit I couldn't remember is real. Since Chrome 136 the port gets ignored on the default profile, so you have to point --user-data-dir at a separate folder regardless. Google shut it down because infostealers were using CDP to pull cookies off live profiles. A dedicated dir is forced either way, so copying my main one is basically free, and it comes pre-authed.
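The setup described above comes down to two steps: copy the main profile into a dedicated folder, then launch Chrome with the debug port pointed at that folder. A minimal sketch, assuming placeholder paths and port 9222; `--remote-debugging-port`, `--user-data-dir`, and `--headless=new` are real Chrome switches, while the binary path and directory names are hypothetical.

```python
import shutil
from pathlib import Path


def clone_profile(src, dst):
    """Copy an existing Chrome user-data directory to a dedicated one.
    Best done while Chrome is closed, since a live profile holds locks."""
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return Path(dst)


def chrome_command(chrome_bin, user_data_dir, port=9222, headless=True):
    """Build the argv for a debuggable Chrome on a non-default profile."""
    cmd = [
        str(chrome_bin),
        f"--remote-debugging-port={port}",
        f"--user-data-dir={user_data_dir}",  # required since Chrome 136
    ]
    if headless:
        cmd.append("--headless=new")
    return cmd


# Hypothetical paths for illustration only.
print(chrome_command("/usr/bin/google-chrome", "/home/me/agent-profile"))
```

The returned list can be passed to `subprocess.Popen`; it is kept as a pure function here so the command can be inspected before anything is launched.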
