fullstack

18.4K posts

@DavidFSWD

Latent Space Cowboy

Joined May 2022
3.1K Following · 1.4K Followers

Pinned Tweet
fullstack@DavidFSWD·
fullstack tweet media
ZXX
3 · 0 · 30 · 9K
fullstack retweeted
IroncladDev@IroncladDev·
It would be a shame if everyone emailed `dylan @ dylanmtaylor [.] com` with their concerns about him opening age verification PRs. Absolutely don't email dylan @ dylanmtaylor [.] com with your concerns about him opening age verification PRs.
IroncladDev tweet media
Vini B |「 thecoding 」@vinibarbosabr

@LundukeJournal dylanmtaylor tried the same thing on Arch Linux's archinstall repo; he is pushing these implementations everywhere.

12 · 37 · 230 · 6.5K
fullstack retweeted
Yeb Havinga@YebHavinga·
I just asked Claude to rerun the batched test for 27B in the <4K window size (where the INT8 KV cache is not really helpful, but speed is at its max) and got 1312 tok/s at batch size 256, where it plateaued. Maybe there the INT8 KV cache is only interesting for allowing higher batch sizes.
2 · 0 · 1 · 17
Yeb Havinga@YebHavinga·
I have a dual RTX 3090 system that has lately been mostly collecting dust. Inspired by Karpathy's autoresearch post, I wondered: can I get a frontier-class open model running well on this setup with a coding agent? Turns out yes. I'm particularly interested in the Gemma 3 series, since they handle Dutch well and the dense Gemma 3 27B follows complex instructions reliably. Gemma 3 27B now runs at 67 tok/s for short prompts and 45 tok/s for long context (7K+ tokens), with a 128K context window.

The starting point was bad: out of the box I got 11 tok/s with vLLM. The model loaded fine (W4A16 quantized fits in 48 GB with tensor parallelism), but something was killing performance.

Discovery 1: CUDA graphs need `--disable-custom-all-reduce` on RTX 3090. vLLM's custom all-reduce kernel crashes during CUDA graph capture at 94% completion. Disabling it and falling back to NCCL took me from 11 tok/s to 67 tok/s.

Discovery 2: Long-context performance cliffs at ~4K tokens. Gemma 3 uses hybrid attention: 52 layers with a 4K sliding window and 10 layers with full attention. Beyond 4K tokens, vLLM falls back from CUDA graphs to eager mode, causing speed to drop from 67 to 24 tok/s. The initial hypothesis was wrong: Claude thought cascade attention was disabling CUDA graphs and spent a day chasing that. In reality, cascade attention is already disabled for sliding-window models. The bottleneck is memory bandwidth: the 10 global attention layers must read the entire KV cache every decode step.

Discovery 3 (the main idea): use an INT8 KV cache to fix memory-bound long-context inference. The Ampere RTX 3090 lacks FP8 hardware (Ada/Hopper only) but has INT8 tensor cores. Claude wrote Triton kernels to quantize K and V to INT8 on write and dequantize on read. Result: 24 tok/s → 45 tok/s at 7K context (+87%). KV memory halved, so 128K context now fits, where before, with a 16-bit KV cache, 32K was the max.

Discovery 4: Per-layer scales matter. Gemma 3 has 62 attention layers. With uniform global scaling, layer 42 (v_absmax=884) and layer 59 (v_absmax=2.6) share the same quantization budget, a 340x ratio. Layer 59 then uses only 47 of 127 INT8 levels. Per-layer calibration gives each layer its full dynamic range.

Discovery 5: V values need FP8, not INT8. This was the final piece. K values flow through Q·Kᵀ into softmax; linear quantization error maps linearly to attention logits, so INT8's uniform spacing works fine. But V values have heavy-tailed distributions in deeper layers. I borrowed the FP8-E4M3 encoding from Qwen's quantization work (thanks to @QuixiAI for reverse engineering it, see link below). This datatype preserves relative precision across three orders of magnitude. The key idea for the RTX 3090: it has no FP8 hardware, but FP8 storage can be emulated. Store FP8 bit patterns in INT8 bytes, then decode them in the Triton attention kernel. Same memory footprint, better precision where it matters.

What didn't work:
- Native Triton FP8 (tl.float8e4nv): requires Ada/Hopper
- Disabling cascade attention: no effect, memory bandwidth is the bottleneck
- Piecewise CUDA graphs: OOM on 24 GB GPUs with TP=2
- Speculative decoding: draft-model overhead exceeded gains

This wasn't autoresearch in Karpathy's sense: there was no training involved, and I did not set up an autoresearch prompt + loop. But the spirit was similar: iterative experimentation, letting Opus write Triton kernels, and systematic measurement. Claude helped me stay organized, generate hypotheses, and, most importantly, write the Triton kernels. The whole project took perhaps an hour or two of my time over the course of a week.

The images show single-session inference speed with the 27B model, and batched inference with the 4B model.
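The FP8-in-INT8 trick from Discovery 5 can be sketched without Triton. Below is a minimal pure-Python illustration (my own sketch, not the author's kernels), assuming the standard E4M3FN layout (1 sign, 4 exponent, 3 mantissa bits, bias 7, single NaN code). It compares emulated FP8-E4M3 storage against linearly scaled INT8 on values spanning three orders of magnitude:

```python
import math

def decode_e4m3(b: int) -> float:
    """Decode one FP8-E4M3FN bit pattern (stored in an INT8 byte) to float."""
    sign = -1.0 if b & 0x80 else 1.0
    exp = (b >> 3) & 0x0F
    mant = b & 0x07
    if exp == 15 and mant == 7:      # E4M3FN reserves only this code for NaN
        return math.nan
    if exp == 0:                     # subnormal: (mant/8) * 2^(1 - bias)
        return sign * (mant / 8.0) * 2.0 ** -6
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)

# Every representable value; encoding = round to nearest representable.
_CODES = [(decode_e4m3(b), b) for b in range(256)
          if not math.isnan(decode_e4m3(b))]

def encode_e4m3(x: float) -> int:
    return min(_CODES, key=lambda cv: abs(cv[0] - x))[1]

def int8_roundtrip(x: float, scale: float) -> float:
    """Linear INT8 quantize + dequantize with a per-layer scale (absmax/127)."""
    q = max(-127, min(127, round(x / scale)))
    return q * scale

# Heavy-tailed V values across three orders of magnitude; layer absmax 448
# (448 is also E4M3FN's max normal value), so scale = absmax / 127.
scale = 448.0 / 127
for v in [448.0, 10.0, 0.43]:
    e_int8 = abs(int8_roundtrip(v, scale) - v) / v
    e_fp8 = abs(decode_e4m3(encode_e4m3(v)) - v) / v
    print(f"v={v:7.2f}  INT8 rel.err={e_int8:.3f}  FP8 rel.err={e_fp8:.3f}")
```

Small values fall below INT8's step size (~3.5 with this scale) and round to zero, while E4M3's exponent keeps the relative error bounded regardless of magnitude; that is the property the heavy-tailed V cache needs.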
Yeb Havinga tweet media (two images)
2 · 0 · 2 · 84
fullstack@DavidFSWD·
@YebHavinga I got 1500 tok/s on 9B on one 3090. But your first 27B test seems right; you should be able to split the cards and run bs=8. I think that would be the ultimate setup. 600-800 watts 24/7, though.
1 · 0 · 1 · 14
Yeb Havinga@YebHavinga·
I (Claude :-) tested AWQ INT4, but it failed (torchao compatibility). The max batched tok/s for 27B I measured was 244. For smaller models (1B/4B) I got to 7-12K tok/s with DP=2 on two 3090s. Would be interested to learn more about the 500-1000 tok/s config/hardware/model, especially if that was with the dense 27B!
1 · 0 · 1 · 18
Dmitriy Kovalenko@neogoose_btw·
I just want to remind the rest of the world: every single AI company here in Silicon Valley gets tokens for absolutely free right now. It's only you paying that $200 for the Max plan.
TFTC@TFTC21

Jensen Huang: "If that $500,000 engineer did not consume at least $250,000 worth of tokens, I am going to be deeply alarmed. This is no different than a chip designer who says 'I'm just going to use paper and pencil. I don't think I'm going to need any CAD tools.'"

11 · 3 · 99 · 10K
fullstack@DavidFSWD·
@neogoose_btw if you are a $250k engineer and you don't get it for free, I'd be like, what the fuck is wrong with you.
0 · 0 · 0 · 243
fullstack retweeted
Wei Ping@_weiping·
🚀 Introducing Nemotron-Cascade 2 🚀

Just 3 months after Nemotron-Cascade 1, we're releasing Nemotron-Cascade 2: an open 30B MoE with 3B active parameters, delivering best-in-class reasoning and strong agentic capabilities.

🥇 Gold Medal-level performance on IMO 2025, IOI 2025, and ICPC World Finals 2025:
• Capabilities once thought achievable only by frontier proprietary models (e.g. Gemini Deep Think) or frontier-scale open models (i.e. DeepSeek-V3.2-Speciale-671B-A37B).
• Remarkably high intelligence density with 20× fewer parameters.

🏆 Best-in-class across math, code reasoning, alignment, and instruction following:
• Outperforms the latest Qwen3.5-35B-A3B (2026-02-24) and the even larger Qwen3.5-122B-A10B (2026-03-11).

🧠 Powered by Cascade RL + multi-domain on-policy distillation:
• Significantly expands Cascade RL across a much broader range of reasoning and agentic domains than Nemotron-Cascade 1, while distilling from the strongest intermediate teacher models throughout training to recover regressions and sustain gains.

🤗 Model + SFT + RL data: 👉 huggingface.co/collections/nv…
📄 Technical report: 👉 research.nvidia.com/labs/nemotron/…
Wei Ping tweet media
19 · 77 · 444 · 33.4K
fullstack retweeted
Adolf Elmer@adolfelmer·
I ain’t dying for no chicken swingers [FULL SONG]
134 · 1.2K · 8.3K · 445.7K
fullstack@DavidFSWD·
@jakeshieldsajj it'll be $250/b and they'll print beast-bucks; a loaf of bread will be $1200.
0 · 0 · 1 · 29
fullstack@DavidFSWD·
@pareen he's dual-saying: he can justify 1/2 X because he wants to replace X, but he's being a jerk and not saying it out loud.
0 · 0 · 1 · 44
Catalin@catalinmpit·
Lately, Claude makes some shocking mistakes.
⟶ Implements overly complex code
⟶ Ignores the codebase's code style
⟶ Removes working code for no reason
⟶ Replaces code that's out of scope from the task at hand
It feels like it needs 100% supervision. At this point, you're better off writing everything yourself.
Catalin tweet media
197 · 22 · 448 · 36.8K
fullstack@DavidFSWD·
@KentonVarda you didn't spend enough tokens. Welcome to the permanent underclass.
fullstack tweet media
0 · 0 · 0 · 69
fullstack retweeted
CatFu@catfusolana·
the duel with the wuxia princess revealed a technique within a technique 🐱🌸
10 · 57 · 352 · 7.3K
fullstack retweeted
awesome_visuals@awesome_visuals·
do you mind if i take your photo? 😁
90 · 344 · 3.4K · 282.2K
The_Real_Fly@The_Real_Fly·
Oil tanker operator paid Iran $2,000,000 for safe passage through Strait of Hormuz, Financial Times reports.
9 · 9 · 88 · 9.5K