Alexey Fateev
479 posts

@superalesha
Shipping enterprise AI @ Bank by day, running a 4× RTX 3090 rig by night ⚡96GB VRAM Club | local LLMs · a loving father




Same question as my Qwen post, one size up: how should Gemma 4 31B run on 4x3090? The best layout flips depending on whether you serve one user or a crowd.

One user: TP4, all four cards on every token. 78 tok/s, against 57 for a 2-card TP2. No draft head here like Qwen's MTP, so the extra cards are just raw compute and wider wins.

64 users: two TP2 replicas (TP2+DP2). 1084 tok/s, against 563 for one wide TP4. 1.9x on the same four cards.

Why: TP4 makes all four cards sync after every layer, and on 3090s that crosses PCIe. Great when you want every card chewing one token, rough under load. Two replicas barely talk, so they scale almost clean.

Same rule as before, now on a dense 31B: replicate over the fewest cards that fit for throughput, split wide only when one user's latency is the whole job. Four separate copies fit only at ~1k context and lose on both, so skip that here.

Full card: both launch commands, first token, tok/watt, footprint
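The ratios in the post can be checked with a few lines of arithmetic. This is only a sketch: the tok/s figures are the ones quoted above, while the per-user numbers assume the 64 users share the aggregate throughput evenly, which the post does not state.

```python
# Throughput figures quoted in the post (tok/s on 4x RTX 3090).
SINGLE_USER = {"TP4": 78, "TP2": 57}
USERS_64 = {"TP2+DP2": 1084, "TP4": 563}


def speedup(fast: float, slow: float) -> float:
    """Return how many times faster `fast` is than `slow`."""
    return fast / slow


single_gain = speedup(SINGLE_USER["TP4"], SINGLE_USER["TP2"])
crowd_gain = speedup(USERS_64["TP2+DP2"], USERS_64["TP4"])

# Assumption (not in the post): 64 users split the aggregate evenly.
per_user = {layout: tps / 64 for layout, tps in USERS_64.items()}

print(f"1 user, TP4 vs TP2: {single_gain:.2f}x")
print(f"64 users, TP2+DP2 vs TP4: {crowd_gain:.2f}x")
for layout, tps in per_user.items():
    print(f"{layout}: ~{tps:.1f} tok/s per user")
```

Under that even-split assumption, each user in the 64-user run sees far less than the 78 tok/s a lone user gets on TP4, which is why the two layouts answer different questions.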






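I seriously, genuinely want to evaluate how erotic local LLMs are, but if I use a hosted provider's model as the judge I'll get refusals or, worse, an account ban. Enter one explicit prompt and you could be locked out of AI for life. So I was wondering what to do.

My idea: have every uncensored model score every model's text anonymously.

Text that several models rate highly is likely to be genuinely strong. Each model's habit of grading leniently or harshly can be corrected for. It also shows whether a model rates only itself highly.

In short, the "truly erotic model" gets decided by mutual evaluation with everyone taking part, not by one model's subjective view.

Or so I thought, but the compute grows exponentially compared to before lol

I want an RTX Pro 6000 Blackwell. This is not an investment. It is to unleash my own desires! I want to summon the erotic holy idol ♡

#VRAM飢饉救済教 #でかいVRAM欲しい #あれ言ってること矛盾してね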
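The mutual-scoring idea in this post can be sketched as a small score grid. This is one possible reading, not the author's actual pipeline: the model names and scores are made up, per-rater z-scoring stands in for "correcting lenient or harsh graders", and self-bias is measured as a model's normalized self-score minus what its peers gave it.

```python
from statistics import mean, pstdev

# Hypothetical raw scores on a 0-10 scale: scores[rater][author].
# Names and numbers are invented for illustration only.
scores = {
    "model_a": {"model_a": 9.0, "model_b": 6.0, "model_c": 3.0},  # wide range
    "model_b": {"model_a": 8.0, "model_b": 7.0, "model_c": 6.0},  # lenient
    "model_c": {"model_a": 5.0, "model_b": 3.0, "model_c": 7.0},  # favors itself
}


def normalize(row: dict) -> dict:
    """Z-score one rater's row so lenient and harsh raters become comparable."""
    mu = mean(row.values())
    sigma = pstdev(row.values()) or 1.0  # guard against a rater with no spread
    return {author: (s - mu) / sigma for author, s in row.items()}


normalized = {rater: normalize(row) for rater, row in scores.items()}


def peer_score(author: str) -> float:
    """Average normalized score from every rater except the author itself."""
    return mean(normalized[r][author] for r in normalized if r != author)


def self_bias(author: str) -> float:
    """How much higher the author rates itself than its peers rate it."""
    return normalized[author][author] - peer_score(author)


ranking = sorted(scores, key=peer_score, reverse=True)
print("ranking by peer score:", ranking)
for name in scores:
    print(f"{name}: peer={peer_score(name):+.2f} self_bias={self_bias(name):+.2f}")
```

The grid has one cell per (rater, author) pair, so N models need N × N scoring calls per prompt, which is where the extra compute comes from.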










Wait we actually just broke 1T tokens in a day for the first time on OpenRouter :O Please keep contributing to the most awesome project I've ever worked on to help make Hermes Agent the best software stack on the planet! Thank you contributors🍻🍻


Increase inference performance by up to 15x without sacrificing responsiveness. DFlash, an open source lightweight block diffusion model designed for speculative decoding, delivers up to 15x higher throughput on NVIDIA Blackwell while maintaining the same user interactivity target. Instead of drafting tokens one at a time, it proposes a whole block in a single pass for the main model to verify in parallel. Adoption is drop-in with support in @lmsysorg SGLang, TensorRT-LLM, and @vllm_project.
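The "propose a whole block, verify in parallel" loop can be illustrated with a toy greedy draft-and-verify sketch. It is not DFlash's implementation: `target_next` and `draft_block` are hypothetical stand-ins for the main model and the block drafter, and the verification here is a plain loop where a real system would use one batched forward pass.

```python
from typing import List, Tuple


def target_next(ctx: List[int]) -> int:
    """Toy main model: counts upward and wraps at 10."""
    return (ctx[-1] + 1) % 10


def draft_block(ctx: List[int], k: int) -> List[int]:
    """Toy drafter: proposes k tokens at once but forgets to wrap at 10."""
    return [ctx[-1] + i for i in range(1, k + 1)]


def verify_block(ctx: List[int], block: List[int]) -> List[int]:
    """Accept the longest draft prefix the target agrees with, then add one
    target token (a correction on mismatch, a bonus token otherwise)."""
    accepted: List[int] = []
    for tok in block:
        expected = target_next(ctx + accepted)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    accepted.append(target_next(ctx + accepted))
    return accepted


def generate(prompt: List[int], n_new: int, k: int) -> Tuple[List[int], int]:
    """Generate n_new tokens, returning them and the number of verify rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_new:
        out.extend(verify_block(out, draft_block(out, k)))
        rounds += 1
    return out[len(prompt):len(prompt) + n_new], rounds


def greedy(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per token, no drafting."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out[len(prompt):]


tokens, rounds = generate([7], n_new=6, k=4)
print(tokens, "in", rounds, "rounds; greedy:", greedy([7], 6))
```

The output matches plain greedy decoding token for token. The gain is that each round can emit several tokens per target verification instead of one.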




Everyone's arguing about NVIDIA export controls. Almost nobody can name the 7 Chinese companies already shipping H100/H200-class silicon - most IPO'd in the last 6 months. I run Chinese open models on a 4×3090 rig daily. So I drew the map nobody's drawing
















