Divyansh Singh
@L2_cache_miss
89 posts
L2 cache in gpus | prev research @adobe | cs grad @IITKanpur
Joined January 2026
87 Following · 2 Followers
Divyansh Singh @L2_cache_miss ·
@0xSero how does one get an API which doesn't use the data for training?
0 replies · 0 reposts · 1 like · 947 views
0xSero @0xSero ·
117.4M tokens for $2.24 for a genius.
[media]
123 replies · 80 reposts · 2.4K likes · 206.5K views
Divyansh Singh @L2_cache_miss ·
@MoonHeead @nrehiew_ yeah 😂 I have created a design reference for this sort of aesthetic and am creating walkthroughs for multiple papers… x.com/L2_cache_miss/…
[media]
Quoting Divyansh Singh @L2_cache_miss:
@nrehiew_ this is very addictive btw... my weekly limit was resetting tonight so I generated HTML walkthroughs of multiple papers from my reading list... it is so much fun to read this way first and then trace back in the paper... thanks a lot @nrehiew_
0 replies · 0 reposts · 0 likes · 28 views
wh @nrehiew_ ·
How I read papers now. This is an explainer by Claude about the new Compressed Sparse Attention that v4 uses to compress the KV cache.
[media]
Quoting wh @nrehiew_:
Now reading:
6 replies · 69 reposts · 699 likes · 55.5K views
Divyansh Singh @L2_cache_miss ·
@nrehiew_ this is very addictive btw... my weekly limit was resetting tonight so I generated HTML walkthroughs of multiple papers from my reading list... it is so much fun to read this way first and then trace back in the paper... thanks a lot @nrehiew_
0 replies · 0 reposts · 1 like · 68 views
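The tweet above describes the workflow only at a high level (asking Claude to turn papers into HTML walkthroughs, most likely in the Claude app itself). Below is a minimal sketch of doing something similar through the Anthropic Python SDK; the model alias, the prompt, and the paper.txt / walkthrough.html filenames are assumptions for illustration, not the author's actual setup.

```python
# Hedged sketch only: model alias, prompt, and file names are assumptions, not the
# author's actual workflow (the tweet suggests this was done in the Claude app itself).
import pathlib
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

paper_text = pathlib.Path("paper.txt").read_text()  # text extracted from the paper's PDF

msg = client.messages.create(
    model="claude-sonnet-4-5",   # assumed model alias; use whichever model you have access to
    max_tokens=8000,
    messages=[{
        "role": "user",
        "content": "Turn this paper into a single self-contained HTML walkthrough, "
                   "explaining it section by section with inline SVG figures:\n\n" + paper_text,
    }],
)

# The response is a list of content blocks; the first one holds the generated HTML.
pathlib.Path("walkthrough.html").write_text(msg.content[0].text)
```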
Divyansh Singh @L2_cache_miss ·
@nrehiew_ they all know how to extract full information from less
0 replies · 0 reposts · 3 likes · 262 views
wh @nrehiew_ ·
DSA, NSA, CSA, CIA, HCA, KDA, FBI. What do these have in common?
5 replies · 1 repost · 33 likes · 5.1K views
Divyansh Singh @L2_cache_miss ·
@GuggaLeunnam just keep posting more of these edits on X and then wait for the next Grok Imagine model
0 replies · 0 reposts · 1 like · 2.5K views
Gugga Leunnam @GuggaLeunnam ·
AI couldn't edit this
241 replies · 5.6K reposts · 41.5K likes · 1.5M views
Divyansh Singh @L2_cache_miss ·
what kind of visual wizardry library are these @claudeai folks using? this visual UI is literally so smooth
[media]
0 replies · 0 reposts · 0 likes · 6 views
Divyansh Singh @L2_cache_miss ·
@LLMJunky I see, thanks. I was looking for some theoretical backing instead of opinions; for some reason this feels like using an fp32 KV cache for bf16 model weights 😂
0 replies · 0 reposts · 1 like · 6 views
am.will @LLMJunky ·
@L2_cache_miss more accurate; this is the recommendation of the creator of the quant. you can run it at fp8. some say it doesn't matter, others say it does. 🤷‍♂️🤷‍♂️
1 reply · 0 reposts · 1 like · 63 views
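For context on the trade-off being debated here (KV-cache precision chosen separately from weight precision), vLLM does expose the cache dtype as its own knob. A minimal sketch follows, assuming a recent vLLM; the model name is a placeholder, not the checkpoint discussed in this thread.

```python
# Minimal sketch: the KV-cache dtype is configured independently of the weight format,
# so a low-precision checkpoint can be paired with either a 16-bit or an fp8 cache.
# "your-org/your-quantized-model" is a placeholder, not the model from the thread.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-quantized-model",
    kv_cache_dtype="fp8",      # default is "auto", which keeps the cache at the model dtype
    max_model_len=32768,
)

out = llm.generate(["Explain KV-cache quantization in one sentence."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```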
am.will @LLMJunky ·
Minimax M2.7 running locally on just two baby GPUs. Here's a side-by-side of two leading local LLM serving engines for MiniMax M2.7 NVFP4: vLLM and SGLang. What do you think? These results are surprising to me; I expected them to scale more or less linearly.
[media]
Quoting am.will @LLMJunky:
Finally putting these RTX 6000s to good use. Minimax M2.7 running locally. There have been learning curves indeed. Running NVFP4 with a full 16-bit KV cache at a 140K context window. I can get a full 200K context window but only with vLLM, which is slower. I think I'll opt for the speed, I don't need 200K anyway. Dedicating my free time to leveling up my game. Thanks to everyone who's helping me. YKWYA 🫶
27 replies · 4 reposts · 117 likes · 18.3K views
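The quoted tweet names the key settings but not a launch command; here is a hedged sketch of a comparable vLLM setup under those stated assumptions (two GPUs via tensor parallelism, 16-bit KV cache, roughly 140K context). The checkpoint path is a placeholder, and the memory setting will depend on the actual hardware.

```python
# Sketch of the described configuration, not the poster's actual launch script.
from vllm import LLM

llm = LLM(
    model="path/to/minimax-m2.7-nvfp4",   # placeholder for the NVFP4 checkpoint
    tensor_parallel_size=2,               # split weights across the two GPUs
    max_model_len=140_000,                # ~140K context, as in the quoted tweet
    kv_cache_dtype="auto",                # keep the KV cache at the model's 16-bit dtype
    gpu_memory_utilization=0.92,          # leave a little headroom; tune per machine
)
```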
Divyansh Singh @L2_cache_miss ·
@danveloper @p_nawrot one reason might be that the per-layer compute time will be high enough to give room for prefetching the next layer's weights; otherwise you would just be waiting for the next layer's weights to come from the host
0 replies · 0 reposts · 1 like · 205 views
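The point in the reply above (overlap the next layer's host-to-device copy with the current layer's compute) can be shown with a plain PyTorch sketch. This is a generic illustration under assumed names (run_with_prefetch is made up), not FlexTensor's implementation.

```python
# Generic prefetch-overlap sketch, not FlexTensor's code. Weights live in pinned host
# memory; the next layer's copy is queued on a side CUDA stream while the current
# layer computes, so the GPU is not stalled waiting on PCIe transfers.
import torch
import torch.nn as nn

@torch.no_grad()
def run_with_prefetch(layers, x, device="cuda"):
    copy_stream = torch.cuda.Stream()

    def prefetch(layer):
        # Async host->device copy; non_blocking only overlaps if host tensors are pinned.
        with torch.cuda.stream(copy_stream):
            layer.to(device, non_blocking=True)

    prefetch(layers[0])
    for i, layer in enumerate(layers):
        # Wait until layer i's weights have actually arrived before using them.
        torch.cuda.current_stream().wait_stream(copy_stream)
        if i + 1 < len(layers):
            prefetch(layers[i + 1])   # overlap the next transfer with this layer's compute
        x = layer(x)
        layer.to("cpu")               # hand the VRAM back before the next layer lands
    return x

if torch.cuda.is_available():
    layers = [nn.Linear(4096, 4096) for _ in range(8)]
    for l in layers:
        for p in l.parameters():
            p.data = p.data.pin_memory()   # pinned memory enables truly async copies
    out = run_with_prefetch(layers, torch.randn(16, 4096, device="cuda"))
```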
Dan Woods @danveloper ·
@p_nawrot why use Llama-3.1-405B instead of something more modern like Qwen3.5-397B?
2 replies · 0 reposts · 11 likes · 1.6K views
Piotr Nawrot @p_nawrot ·
💾🚀 Run Llama-3.1-405B FP8 (410GB) on a single 180GB GPU #NVIDIA

Introducing FlexTensor — NVIDIA's new library that makes host RAM a transparent extension of your GPU memory. One call: flextensor.offload(model). No model rewrites, no framework changes. Works with vLLM, HuggingFace, and any PyTorch model.

Traditional offloading is reactive — move data when you run out of memory, stall the GPU while you wait. FlexTensor instead profiles your model's layer access patterns, then solves a knapsack optimization to schedule prefetches that overlap with compute. By the time a layer needs its weights, they're already there.

The freed VRAM gives vLLM more room for KV cache — enabling 4x longer contexts (8K→32K) or 4x larger batches. For video generation (Wan2.2-T2V-A14B on GB200): +0.1% overhead.

Handles FP8, custom Triton kernels, and multi-GPU. Profiles saved to disk — no warmup on repeated runs.

Check it out: github.com/ai-dynamo/flex…
[media]
14 replies · 33 reposts · 220 likes · 56K views
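The tweet mentions a knapsack optimization without showing one. Below is a toy 0/1-knapsack sketch of one simplified reading of that idea: given a VRAM budget, choose which layers to keep resident so the host-to-device transfer time avoided each forward pass is maximal. The function name and the per-layer sizes and times are made up for illustration and are not FlexTensor's actual scheduler.

```python
# Toy illustration only, not FlexTensor's scheduler. Value = host->device transfer time
# saved by keeping a layer resident; weight = the VRAM it occupies; capacity = spare VRAM.
def pick_resident_layers(sizes_gb, transfer_ms, budget_gb):
    scale = 10                                  # work in 0.1 GB units to keep the DP small
    cap = int(budget_gb * scale)
    weights = [int(round(s * scale)) for s in sizes_gb]

    # best[c] = (total ms saved, set of layer indices) achievable within capacity c
    best = [(0.0, frozenset()) for _ in range(cap + 1)]
    for i, w in enumerate(weights):
        for c in range(cap, w - 1, -1):         # classic 0/1 knapsack: capacity descending
            cand = best[c - w][0] + transfer_ms[i]
            if cand > best[c][0]:
                best[c] = (cand, best[c - w][1] | {i})
    return best[cap]

# Made-up per-layer sizes and streaming times, with 12 GB of VRAM left after the KV cache.
sizes = [5.0, 5.0, 3.0, 3.0, 2.0, 2.0]          # GB
times = [40.0, 40.0, 25.0, 25.0, 18.0, 18.0]    # ms to stream each layer from host RAM
saved_ms, resident = pick_resident_layers(sizes, times, budget_gb=12.0)
print(f"keep layers {sorted(resident)} resident, saving ~{saved_ms:.0f} ms per forward pass")
```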