Alexey Fateev

479 posts

@superalesha

Shipping enterprise AI @ Bank by day, running a 4× RTX 3090 rig by night ⚡96GB VRAM Club | local LLMs · a loving father

Joined January 2026
135 Following · 242 Followers
Pinned Tweet
Alexey Fateev
Alexey Fateev@superalesha·
Everyone's arguing about NVIDIA export controls. Almost nobody can name the 7 Chinese companies already shipping H100/H200-class silicon - most IPO'd in the last 6 months. I run Chinese open models on a 4×3090 rig daily. So I drew the map nobody's drawing
English
8
25
188
26.6K
Alexey Fateev
Alexey Fateev@superalesha·
11 tok/s on a Spark for a non-quant Flash is impressive for what it is. The ceiling there is memory bandwidth: the Spark sits well under a 3090's ~936 GB/s, which is why decode crawls even when the model fits. For generation speed, the bandwidth number matters more than the VRAM number almost every time.
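The bandwidth argument in the reply above can be made concrete with a back-of-envelope bound: if every decoded token has to stream the active weights from memory once, single-stream speed is capped near bandwidth divided by bytes read per token. This is a rough sketch, not a benchmark. The 936 GB/s figure comes from the post; the 20 GB weight footprint is a hypothetical value chosen only for illustration, and the bound ignores KV-cache reads and kernel overhead.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_weight_gb: float) -> float:
    """Upper bound on single-stream decode speed for a memory-bound model.

    Assumes each generated token reads all active weights once.
    """
    if active_weight_gb <= 0:
        raise ValueError("active_weight_gb must be positive")
    return bandwidth_gb_s / active_weight_gb


RTX_3090_GB_S = 936.0            # figure quoted in the post
HYPOTHETICAL_WEIGHTS_GB = 20.0   # assumption: illustrative active-weight footprint

ceiling = decode_ceiling_tok_s(RTX_3090_GB_S, HYPOTHETICAL_WEIGHTS_GB)
print(f"ceiling: {ceiling:.1f} tok/s")  # ceiling: 46.8 tok/s
```

Halving the bandwidth halves the ceiling regardless of how much memory is free, which is the point the reply makes about VRAM size versus bandwidth.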
English
0
0
0
12
Eric
Eric@Ex0byt·
Update on the road to GLM-5.2: we're getting there, folks! Non-quantized, non-pruned DeepSeek-v4-Flash at 11 tok/s on a single DGX Spark. sglang inference + custom mega-kernel. Pure beauty.
English
17
12
215
13.1K
Alexey Fateev
Alexey Fateev@superalesha·
Good call on the 3090; it's still the value pick for local. The privacy line genuinely closes deals; I have seen the same. One 24GB card keeps a solid mid-size model snappy for one user. The day you serve more than yourself, the bottleneck moves from VRAM to how you split the model across cards, but for a solo setup the 3090 is hard to beat on dollars per token.
English
1
0
1
23
Triana Kurtetti
Triana Kurtetti@kurtetti·
He lost a $5,000 client in 90 seconds. Watch until 0:40. That's where the fix appears.
The client asked one question: "Where does my data go when you run it through Claude?" He said: Anthropic's servers. The client thanked him and left.
That night he bought a used RTX 3090. 24GB VRAM. $900–$1,200 on the secondhand market. The old card had 8GB. Not enough. When the model doesn't fit in VRAM, response times collapse. The 3090 loads the full model. Stays snappy. Stays local.
Old stack: $220/month. Claude Pro + Cursor + subscriptions.
New stack: ~$35/month. One GPU. One open model. 85% local. 15% cloud.
Now when a client asks where their data goes, the answer is different: "Nothing leaves this machine." That sentence is worth more than two years of subscription savings.
Month 6 break-even on hardware. Year 2 savings: $3,390. Before the first client revenue.
The AI bill made sense when local models were too weak. That window is closing.
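The break-even claim in the post above is simple arithmetic and can be checked directly. The monthly figures are from the post; the post does not say which hardware price it used, so the $1,050 below is an assumption (the midpoint of the quoted $900–$1,200 range) that happens to reproduce both the month-6 break-even and the $3,390 figure.

```python
import math

OLD_MONTHLY = 220.0     # from the post: Claude Pro + Cursor + subscriptions
NEW_MONTHLY = 35.0      # from the post: local-first stack
HARDWARE_COST = 1050.0  # assumption: midpoint of the quoted $900-$1,200 range


def break_even_month(hardware: float, old_monthly: float, new_monthly: float) -> int:
    """First month in which cumulative savings cover the hardware cost."""
    saving = old_monthly - new_monthly
    if saving <= 0:
        raise ValueError("new stack must be cheaper per month")
    return math.ceil(hardware / saving)


def net_savings(months: int, hardware: float, old_monthly: float, new_monthly: float) -> float:
    """Cumulative savings after `months`, net of the one-off hardware cost."""
    return months * (old_monthly - new_monthly) - hardware


print(break_even_month(HARDWARE_COST, OLD_MONTHLY, NEW_MONTHLY))  # 6
print(net_savings(24, HARDWARE_COST, OLD_MONTHLY, NEW_MONTHLY))   # 3390.0
```

Electricity and resale value are left out here, as they are in the post.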
Triana Kurtetti@kurtetti

x.com/i/article/2068…

English
6
5
34
3K
Shinichi Takaŷanagi
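Useful insight! > "On cost, (Gemma 4 as a local LLM) is overwhelmingly cheaper than an external LLM API (GPT-5.4 mini class). The same processing produces a cost gap of nearly 200x." Handling 1 million records and 5 billion tokens a month with a local LLM - mapping statement line-item names to store information | onagaway zenn.dev/finatext/artic… #zenn
Japanese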
1
4
20
2.1K
rS_ローカルLLM×投資
rS_ローカルLLM×投資@rS_alonewolf·
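■ Uncensored Gemma 4 for "ero-local" LLM use: speeds on my setup (how much faster does MTP make it?)
For adult purposes, I recently installed a set of "uncensored siblings of Gemma 4 26B." On an RTX 5070 Ti (16GB VRAM) at 128K context, I measured their real speeds side by side with the official model for now. Only on short prompts, though.
■ What MTP is
Speculative decoding: a small look-ahead head predicts several tokens ahead and the main model verifies them in one batch. When the guesses hit, it jumps ahead. The output is exactly the same as normal decoding; only the speed changes (supposedly. Really? Maybe the spiciness changes).
■ 26B results (decode tok/s, 128K): without MTP → with MTP (ratio)
▼ Claude-Opus distilled siblings (leaning smart)
・teich: 40.3 → 59.6 (1.48x)
・apex: 43.6 → 62.7 (1.44x)
▼ SFT line (not distilled)
・gemopus: 41.5 → 60.1 (1.45x)
▼ Uncensored siblings (ero-local models)
・supergemma: 46.2 → 62.8 (1.36x)
・hauhau: 40.2 → 60.8 (1.51x)
・trevorjs: 40.9 → 60.7 (1.48x)
▼ Tool-specialized
・hermes: 46.3 → 66.6 (1.44x)
26B Q6 goes from about 40 t/s stock to 60-67 t/s with MTP. Speeds are nearly level across lineages, and the lighter Q4 is slightly faster than the official model.
■ On 12B (dense), MTP is in another league (*for reference*)
With an ordinary non-MoE 12B (Q8_0), the look-ahead hits more often and the gain is on a different scale.
・base: 56.8 → 155.5 (2.74x)
・opus: 55.6 → 148.1 (2.66x)
・coder-fable5: 55.6 → 146.6 (2.64x)
Against 1.4-1.5x for the 26B (MoE), the 12B (dense) gets about 2.7x. The 12B is also faster stock. They split cleanly into "26B for smarts / 12B for speed." I haven't tried the 12B for ero-local use yet, so that's for next time.
■ Isn't 60 t/s on the slow side, about the same as cloud?
GLM-5.1 and the like sometimes drop to about that speed when they are slow. This is a bit slower than that, but the value is getting that slow-cloud-class 60 t/s "local, uncensored, free, with no account bans." I'm thinking of getting around to a peer-evaluated benchmark of erotic writing quality.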
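The speedup ratios in the post above can be recomputed from the raw tok/s pairs, which also makes the MoE-versus-dense gap easy to see. A small sketch: all numbers are copied from the post, while the dictionary and function names are my own.

```python
from statistics import mean

# (tok/s without MTP, tok/s with MTP), as reported in the post
MOE_26B = {
    "teich": (40.3, 59.6),
    "apex": (43.6, 62.7),
    "gemopus": (41.5, 60.1),
    "supergemma": (46.2, 62.8),
    "hauhau": (40.2, 60.8),
    "trevorjs": (40.9, 60.7),
    "hermes": (46.3, 66.6),
}
DENSE_12B = {
    "base": (56.8, 155.5),
    "opus": (55.6, 148.1),
    "coder-fable5": (55.6, 146.6),
}


def speedup(without_mtp: float, with_mtp: float) -> float:
    """Ratio of MTP decode speed to plain decode speed."""
    return with_mtp / without_mtp


def mean_speedup(results: dict) -> float:
    """Average speedup across all models in one group."""
    return mean(speedup(a, b) for a, b in results.values())


for name, (a, b) in {**MOE_26B, **DENSE_12B}.items():
    print(f"{name}: {speedup(a, b):.2f}x")

print(f"26B MoE mean:   {mean_speedup(MOE_26B):.2f}x")
print(f"12B dense mean: {mean_speedup(DENSE_12B):.2f}x")
```

The recomputed per-model ratios match the ones stated in the post to two decimals.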
rS_ローカルLLM×投資@rS_alonewolf

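I seriously want to evaluate how erotic local LLMs' writing is, but if I use a provider-hosted model as the judge, I get refusals or, worse, an account ban. If something like "[masked explicit phrase]..." ends up in the input, I'd never be able to use AI again, so I was wondering what to do. My plan: have every uncensored model score every model's text anonymously. Text rated highly by multiple models is likely to be genuinely strong.
It also lets me correct for models that habitually score leniently or harshly.
And it shows whether a model rates only itself highly. In short, the "truly erotic model" gets decided by all-hands peer evaluation rather than one model's subjective view. Or so I thought, but the compute grows exponentially compared with before, lol. I want an RTX Pro 6000 Blackwell. This is not an investment. It is to unleash my own desires! I want to summon an erotic official model ♡ #VRAM飢饉救済教 #でかいVRAM欲しい #あれ言ってること矛盾してね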
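The peer-evaluation plan above (every model scores every model, correct for lenient or harsh judges, detect self-favoritism) can be sketched in a few lines. This is one possible reading of the plan, not the author's implementation: it z-normalizes each judge's scores to remove leniency, ranks authors by the average normalized score from the other judges, and reports how far each judge's self-score sits above what its peers gave it. The model names and scores are invented. Note that N judges scoring N texts takes N² scoring calls per prompt, so the cost grows quadratically with the number of models.

```python
from statistics import mean, pstdev


def normalize(row: dict) -> dict:
    """Z-score one judge's scores so lenient and harsh judges become comparable."""
    mu, sd = mean(row.values()), pstdev(row.values())
    if sd == 0:
        return {k: 0.0 for k in row}
    return {k: (v - mu) / sd for k, v in row.items()}


def peer_ranking(scores: dict):
    """scores[judge][author] -> (peer score per author, self-bias per judge)."""
    z = {judge: normalize(row) for judge, row in scores.items()}
    models = list(scores)
    # An author's score only counts judgments made by the other models.
    peer = {m: mean(z[j][m] for j in models if j != m) for m in models}
    # Positive value: the model rates itself above what its peers give it.
    self_bias = {m: z[m][m] - peer[m] for m in models}
    return peer, self_bias


# Hypothetical raw scores on a 1-10 scale, purely for illustration.
SCORES = {
    "a": {"a": 9, "b": 6, "c": 3},  # rates itself far above the others
    "b": {"a": 5, "b": 6, "c": 4},
    "c": {"a": 8, "b": 9, "c": 7},  # same preferences as "b", but 3 points more lenient
}

peer, self_bias = peer_ranking(SCORES)
print(sorted(peer, key=peer.get, reverse=True))  # ['b', 'a', 'c']
print(max(self_bias, key=self_bias.get))         # a
```

Judges "b" and "c" differ only by a constant offset, so after normalization their votes are identical, which is the leniency correction the post asks for.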

Japanese
3
3
35
7.8K
Idle
Idle@IdleProtocol·
We've been building this for months. Today we wrote it all down. How IDLE Protocol runs distributed inference across thousands of consumer GPUs - the architecture, the hardware tiers, vLLM under the hood, consensus validation, and why data parallelism at the network level beats tensor parallelism in a data center for this problem. Full technical breakdown.
Idle@IdleProtocol

x.com/i/article/2069…

English
5
20
53
5K
Alexey Fateev
Alexey Fateev@superalesha·
Same question as my Qwen post, one size up: how should Gemma 4 31B run on 4x3090? The best layout flips depending on whether you serve one user or a crowd.
One user: TP4, all four cards on every token. 78 tok/s, against 57 for a 2-card TP2. No draft head here like Qwen's MTP, so the extra cards are just raw compute and wider wins.
64 users: two TP2 replicas (TP2+DP2). 1084 tok/s, against 563 for one wide TP4. 1.9x on the same four cards.
Why: TP4 makes all four cards sync after every layer, and on 3090s that crosses PCIe. Great when you want every card chewing one token, rough under load. Two replicas barely talk, so they scale almost clean.
Same rule as before, now on a dense 31B: replicate over the fewest cards that fit for throughput, split wide only when one user's latency is the whole job. Four separate copies fit only at ~1k context and lose on both, so skip that here.
Full card: both launch commands, first token, tok/watt, footprint
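The layout rule in the post above reduces to a lookup over the four measured numbers. A minimal sketch, assuming only the figures quoted in the post; it picks the faster layout at a measured concurrency and refuses to guess at concurrency levels the post did not test. The table and function names are mine.

```python
# Aggregate tok/s for Gemma 4 31B on 4x3090, keyed by (layout, concurrent users).
# All four values are quoted from the post.
MEASURED = {
    ("TP4", 1): 78.0,
    ("TP2", 1): 57.0,
    ("TP2+DP2", 64): 1084.0,
    ("TP4", 64): 563.0,
}


def best_layout(users: int) -> str:
    """Fastest measured layout at exactly this concurrency."""
    candidates = {layout: tps for (layout, u), tps in MEASURED.items() if u == users}
    if not candidates:
        raise KeyError(f"no measurements at {users} users")
    return max(candidates, key=candidates.get)


def ratio(layout_a: str, layout_b: str, users: int) -> float:
    """Throughput of layout_a relative to layout_b at the same concurrency."""
    return MEASURED[(layout_a, users)] / MEASURED[(layout_b, users)]


print(best_layout(1))                              # TP4
print(best_layout(64))                             # TP2+DP2
print(f"{ratio('TP2+DP2', 'TP4', 64):.1f}x")       # 1.9x
```

In vLLM these layouts would typically be selected with `--tensor-parallel-size` and `--data-parallel-size`; the author's exact launch commands are in the attached image and are not reproduced here.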
English
0
0
2
173
Alexey Fateev
Alexey Fateev@superalesha·
@zanoga What InfiniBand is there on a 3090? What are you talking about, bro?
English
2
0
3
677
Max Zanoga
Max Zanoga@zanoga·
Finally finished building my AI datacenter! 🚀 32x3090s across 4 servers (8 GPUs each), all connected over InfiniBand. The whole setup is solar-powered with a massive battery bank and generator backup. More technical details and benchmarks coming soon.
English
306
160
2.9K
266.4K
Alexey Fateev
Alexey Fateev@superalesha·
@SlimTradeyBaby I think so - we're the only ones here. I've gone through the entire HF discussion and nobody else has asked about this. insane 😵
English
0
0
1
26
BlackwellBoy
BlackwellBoy@SlimTradeyBaby·
@superalesha ill give it a whirl when I get home, see if i can spin anything up, like i said in my other post slap fable 5 on and they will come haha, but if we are the only 2 in 450k + that cant get it working well XD sheesh
English
1
0
1
178
BlackwellBoy
BlackwellBoy@SlimTradeyBaby·
Pro tip: Slap ‘fable5’ in your GGUF name for instant Claude Fable 5 superpowers + virility buff, this Gemma 4 12B coder was trained on Composer 2.5 + Fable 5 recovery CoT. The filename is not lying. Anyone tried it yet? Small cars good option? huggingface.co/yuxinlu1/gemma…
English
9
0
28
3.7K
Alexey Fateev
Alexey Fateev@superalesha·
The turboquant fork earns its keep. I ran it on Gemma 4 on my 3090s, and the turbo KV cache types are what actually let you stretch context without OOM. Getting an A4B MoE agent loop alive in 8GB is wild. One number from my runs: on the dense 31B, llama.cpp sits around 55-60 tok/s single stream while vLLM AWQ does ~75, so the second you go multi-user the engine choice starts to outweigh the quant.
English
1
0
1
85
Alok
Alok@analogalok·
I just got the Gemma 4 26B A4B MoE model running fully locally with Hermes agent on an 8GB RTX 4060, and it's now backtesting trading strategies end to end, no hand-holding. If you're a trader or work on Wall Street, you don't want to miss this. Yes, fully automated. No cloud. No APIs beyond market data.
# Here's what I did:
Setup:
- Model: Gemma 4 26B-A4B QAT (MoE), Q4_K_XL Unsloth's quant (link in the comments)
- Inference: llama.cpp (turboquant fork by @no_stp_on_snek, link in the comments)
- Hardware: RTX 4060, 8GB VRAM + 16GB RAM only (with 50 other Chrome tabs open)
- Context: 64K
llama.cpp turboquant flags: -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -c 64000 --cache-type-k q8_0 --cache-type-v turbo3 --port 8080
turboquant helps achieve high prefill and decode throughput for interactive sessions.
Throughput with Hermes agent: decode: 25+ tokens/sec, prefill: 250+ tokens/sec
# Then I gave the agent one task:
Backtest a strategy:
- Buy when RSI crosses above 30
- Sell at +2% profit or -1% stop loss
- No overlapping positions
- Use Google stock via yfinance
- Generate a full HTML report with candlestick charts + signals
What happened next was wild. It didn't just write code, it ran the entire workflow itself:
- Audited the environment (pip list, dependency check)
- Hit a ModuleNotFoundError: multiple Python installs were conflicting
- Ran "where python" to map every interpreter on the system
- Manually selected the correct Python 3.13 path and re-ran the script
- Wrote a clean state machine backtester (strict no-overlapping-trades logic)
- Patched a yfinance MultiIndex quirk that would've crashed the script
- Built Plotly candlestick + RSI charts with buy/sell markers
- Calculated win rate, PnL, and summary stats
- Exported a polished single-file HTML report. Check the report at the end of the video or in the comments.
Biggest takeaway: local LLMs aren't just "chat assistants" anymore. They debug their own environment, write production code, and ship a finished deliverable on consumer hardware, for $0 in API costs. If you're still calling local models "toys," you're already behind. This is just the beginning. Hermes agent just surpassed 1 trillion tokens in a single day on OpenRouter. Think about the scale of total token generation happening right now.
Disclaimer: This is not financial advice. Consult a professional before making any trading decisions.
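The strategy rules in the post above (enter when RSI crosses above 30, exit at +2% or -1%, never hold two positions at once) map naturally onto a two-state machine. The sketch below is my own illustration, not the code the agent produced. It assumes a 14-bar RSI built from simple averages, evaluates entries and exits on closing prices only, and uses invented price and RSI values instead of yfinance data so that it stays self-contained.

```python
from typing import List, Optional, Tuple


def simple_rsi(closes: List[float], period: int = 14) -> List[Optional[float]]:
    """RSI from simple averages of gains and losses (None until enough bars)."""
    out: List[Optional[float]] = [None] * len(closes)
    for i in range(period, len(closes)):
        changes = [closes[j] - closes[j - 1] for j in range(i - period + 1, i + 1)]
        gain = sum(c for c in changes if c > 0) / period
        loss = sum(-c for c in changes if c < 0) / period
        if gain == 0 and loss == 0:
            out[i] = 50.0  # flat window: treat as neutral
        elif loss == 0:
            out[i] = 100.0
        else:
            out[i] = 100.0 - 100.0 / (1.0 + gain / loss)
    return out


def backtest(
    closes: List[float],
    rsi: List[Optional[float]],
    take_profit: float = 0.02,
    stop_loss: float = 0.01,
    level: float = 30.0,
) -> List[Tuple[int, int, float]]:
    """Return (entry_index, exit_index, return) per trade; one position at a time."""
    trades: List[Tuple[int, int, float]] = []
    entry_price: Optional[float] = None  # None means the machine is FLAT
    entry_idx = -1
    for i in range(1, len(closes)):
        if entry_price is None:
            prev, cur = rsi[i - 1], rsi[i]
            # FLAT -> LONG only on an upward cross of the RSI level.
            if prev is not None and cur is not None and prev <= level < cur:
                entry_price, entry_idx = closes[i], i
        else:
            ret = closes[i] / entry_price - 1.0
            # LONG -> FLAT on take-profit or stop-loss; no re-entry on the same bar.
            if ret >= take_profit or ret <= -stop_loss:
                trades.append((entry_idx, i, ret))
                entry_price = None
    return trades


# Hypothetical closes and RSI values, chosen so both exit rules fire once.
demo_closes = [100.0, 100.0, 100.0, 101.0, 103.0, 100.0, 100.0, 98.0]
demo_rsi = [None, 25.0, 35.0, 40.0, 45.0, 28.0, 33.0, 50.0]
demo_trades = backtest(demo_closes, demo_rsi)
wins = sum(1 for _, _, r in demo_trades if r > 0)
print(len(demo_trades), wins / len(demo_trades))  # 2 0.5
```

Because entries are only considered while the machine is flat, the no-overlap rule holds by construction and needs no separate check.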
Teknium 🪽@Teknium

Wait we actually just broke 1T tokens in a day for the first time on OpenRouter :O Please keep contributing to the most awesome project I've ever worked on to help make Hermes Agent the best software stack on the planet! Thank you contributors🍻🍻

English
23
41
429
42.5K
Alexey Fateev
Alexey Fateev@superalesha·
Config-only swap from EAGLE-3 is the dream. Real question for the Ampere crowd: any realistic path to DFlash on sm_86? On 4x3090 MTP-style spec decode already gives me a solid bump, but FlashInfer JIT is rough on older CUDA. Would be great to keep the 3090 rigs in the spec-decode game as these drafters land.
English
0
0
0
142
vLLM
vLLM@vllm_project·
🙏 Thanks to the @NVIDIAAI team for highlighting DFlash support on vLLM! With DFlash speculative decoding, swapping EAGLE-3 for a DFlash checkpoint is a config-only change — no code edits needed. It runs through the open-source Speculators library, which links the DFlash drafter to the target model's hidden states in the vLLM inference path. On Gemma-4 31B on a single Blackwell Ultra GPU, this delivers up to 5.8x higher throughput at the same concurrency over autoregressive decoding: 🧮 Math500 — 5.8x ➕ GSM8K — 5.3x 💻 HumanEval — 5.6x 🐍 MBPP — 4.4x Read the blog here! 👇
NVIDIA AI@NVIDIAAI

Increase inference performance by up to 15x without sacrificing responsiveness. DFlash, an open source lightweight block diffusion model designed for speculative decoding, delivers up to 15x higher throughput on NVIDIA Blackwell while maintaining the same user interactivity target. Instead of drafting tokens one at a time, it proposes a whole block in a single pass for the main model to verify in parallel. Adoption is drop-in with support in @lmsysorg SGLang, TensorRT-LLM, and @vllm_project.
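The draft-then-verify idea described above can be shown with a toy greedy example. This is a sketch of generic speculative decoding, not of DFlash itself: DFlash proposes the whole block in a single pass, whereas the stand-in drafter here is called token by token, and a real engine verifies all draft positions in one batched forward pass of the target model. Both "models" are invented deterministic functions.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def plain_decode(target: NextToken, prompt: List[int], n: int) -> List[int]:
    """Ordinary autoregressive greedy decoding: one target call per token."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(
    target: NextToken, drafter: NextToken, prompt: List[int], n: int, block: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of verify rounds)."""
    seq = list(prompt)
    produced = 0
    rounds = 0
    while produced < n:
        # 1) Draft a block of candidate tokens with the cheap model.
        draft: List[int] = []
        ctx = list(seq)
        for _ in range(block):
            tok = drafter(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2) Verify: keep the longest prefix the target agrees with.
        rounds += 1
        accepted: List[int] = []
        ctx = list(seq)
        for tok in draft:
            expected = target(ctx)
            if tok != expected:
                accepted.append(expected)  # the target's token replaces the first miss
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target(ctx))   # every draft accepted: one bonus token
        accepted = accepted[: n - produced]
        seq.extend(accepted)
        produced += len(accepted)
    return seq[len(prompt):], rounds


def toy_target(ctx: List[int]) -> int:
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_drafter(ctx: List[int]) -> int:
    guess = toy_target(ctx)
    # Deliberately wrong whenever the context length is a multiple of 5.
    return guess if len(ctx) % 5 else (guess + 1) % 50


tokens, rounds = speculative_decode(toy_target, toy_drafter, [1, 2, 3], 20)
print(tokens == plain_decode(toy_target, [1, 2, 3], 20), rounds)
```

The output is identical to plain greedy decoding by construction; only the number of verify rounds drops, which is where the throughput gain comes from when the drafter is accurate.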

English
9
31
296
44.8K
Alexey Fateev
Alexey Fateev@superalesha·
732 GB/s HBM2 on a 10-year-old card is still respectable, and layer-split RPC is the right move for a mixed-bandwidth pool. The thing I hit on my 4x3090: RPC is fine for one stream but leaves most of the throughput on the floor under load, where vLLM data-parallel pulls way ahead. For a single-user box at home, though, 25-30 tok/s on an 80B at 256k is a genuinely good result.
English
0
0
1
28
Matt Thompson
Matt Thompson@fuzziphy·
I've been playing with these 10-year-old Tesla P100s I had lying around to see if they're useful today. Each has 16GB of 732GB/sec HBM2. Pooled using layer-splitting RPC with a 4090 and a 4080 for a total of 72GB of VRAM, Qwen3 Next 80B Q4 with full 256k context (about 60GB of VRAM) runs at 25-30 tok/sec, with room to spare for RAG embeddings.
English
30
7
224
22.9K
Alexey Fateev
Alexey Fateev@superalesha·
@FirstSquawk Yeah. Just yesterday I published an article covering 7 Chinese GPU chip manufacturers. It looks like something is going to change, and clearly not in Nvidia's favor. x.com/superalesha/st…
Alexey Fateev@superalesha

Everyone's arguing about NVIDIA export controls. Almost nobody can name the 7 Chinese companies already shipping H100/H200-class silicon - most IPO'd in the last 6 months. I run Chinese open models on a 4×3090 rig daily. So I drew the map nobody's drawing

English
0
0
2
60
First Squawk
First Squawk@FirstSquawk·
Nvidia's banned AI chips double in price on China's black market - FT
English
7
2
81
29.3K
Alexey Fateev
Alexey Fateev@superalesha·
35B MoE with only 3B active is the sweet spot for a 4x3090 box; that active count is what keeps decode fast at 96GB. Curious if the agent RL holds up under quant, since AWQ/GGUF is where these MoEs usually start to wobble on long tool chains. Either way, this is getting loaded on my rig tonight.
English
0
0
0
53
Adina Yakup
Adina Yakup@AdinaYakup·
Qwen-AgentWorld 🌍 A language world model to cover 7 agent interaction domains within a single model - 35B MoE /3B active - 262 K context - Terminal, Web, Android, SWE, Search, MCP, OS - Single turn RL transfers to multi-turn agentic tasks - Model & paper on @huggingface
English
5
10
72
5.3K
Alexey Fateev
Alexey Fateev@superalesha·
Selfishly hoping they keep going; Qwen open has been the backbone of my 4x3090 setup for a year. 3.6-27B is my daily driver and the 122B AWQ is the ceiling that actually fits in 96GB. If they ship an open 3.7 I'll bench it the day it drops. The open weights are half the reason the local scene moved this fast.
English
0
0
1
754
Ahmad
Ahmad@TheAhmadOsman·
Qwen team, are you planning on releasing an opensource Qwen 3.7 model or should we just call your opensource contributions at this point?
English
42
10
419
43.6K
Sudo su
Sudo su@sudoingX·
what's stopping you from buying a gpu dude?
English
89
1
54
9.9K
Alexey Fateev
Alexey Fateev@superalesha·
The copy brings my logins with it. The headless agent is already signed in to X and the rest, so it never stops to re-auth or throw a 2FA prompt in a window I can't see. A fresh profile would mean logging into every site by hand inside headless. And that debug-port limit I couldn't remember is real. Since Chrome 136 the port gets ignored on the default profile, so you have to point --user-data-dir at a separate folder regardless. Google shut it down because infostealers were using CDP to pull cookies off live profiles. A dedicated dir is forced either way, so copying my main one is basically free, and it comes pre-authed.
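The setup described above (copy the main profile once, then point a headless Chrome with a debug port at the copy) could be scripted roughly as follows. This is a minimal sketch under assumptions: the paths and the `chrome` binary name are placeholders, skipping the `Singleton*` lock files is my own precaution, and Chrome should be closed while the profile is copied. The three command-line switches themselves are standard Chrome flags.

```python
import shutil
from pathlib import Path
from typing import List


def clone_profile(src: Path, dst: Path) -> Path:
    """Copy a Chrome user-data dir once so the agent profile starts signed in."""
    if dst.exists():
        return dst  # reuse the existing copy instead of overwriting it
    # Assumption: lock files from a previous session should not be carried over.
    shutil.copytree(src, dst, ignore=shutil.ignore_patterns("Singleton*"))
    return dst


def chrome_command(binary: str, user_data_dir: Path, port: int = 9222) -> List[str]:
    """Build the launch command; the debug port needs a non-default data dir."""
    return [
        binary,
        "--headless=new",
        f"--remote-debugging-port={port}",
        f"--user-data-dir={user_data_dir}",
    ]


# Placeholder paths for illustration only; nothing is copied or launched here.
example = chrome_command("chrome", Path("/tmp/agent-profile"))
print(" ".join(example))
```

The resulting list can be passed to `subprocess.Popen` once the copied directory exists.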
English
1
0
1
55
Mark Kretschmann
Mark Kretschmann@mark_k·
Do you prefer using the built-in browser in Codex, or do you let it control your Chrome?
English
23
1
46
8.4K