Akshit Pareek
219 posts

Akshit Pareek
@apareek05
deep learning kernels and ML inference @TXinstruments
Bengaluru · Joined June 2012
457 Following · 95 Followers
Akshit Pareek retweeted

takes me back to my microprocessors courses at uni
a productive weekend break :”)

Dwarkesh Patel@dwarkesh_sp
New blackboard lecture w @reinerpope
How do chips actually work – starting with basic logic gates, and working up to why GPUs, TPUs, FPGAs, and the human brain each look the way they do.
0:00:00 – Building a multiply-accumulate from logic gates
0:16:20 – Muxes and the cost of data movement
0:25:59 – How systolic arrays work
0:39:00 – Clock cycles and pipeline registers
0:51:40 – FPGAs vs ASICs
1:03:14 – Cache vs scratchpad
1:07:16 – Why CPU cores are much bigger than GPU cores
1:11:49 – Brains vs chips
1:15:22 – A GPU is just a bunch of tiny TPUs
Look up Dwarkesh Podcast on YouTube/Spotify/etc to watch. Enjoy!
Akshit Pareek retweeted

@richnanophd didn't need to quantize SmolVLA: it fit comfortably with bf16 weights during inference on 16 GB VRAM, though during training I needed to make some sacrifices. But yeah, for running molmoAct2 as just local inference on the 5070 Ti, I'll be trying both 4-bit and 8-bit quantizations.
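For a rough sense of why bf16 can fit and when 4-bit or 8-bit starts to look attractive, here is a back-of-the-envelope sketch in Python. It counts weight memory only (no activations, KV cache, optimizer state, or framework overhead), and the 7B parameter count is a made-up illustration, not the size of any model named in this thread.

```python
# Approximate bytes per parameter for common weight formats.
# Real quantized formats also store scale/zero-point metadata, so treat
# the int8 and int4 figures as lower bounds.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params, dtype):
    """Return the memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    hypothetical_params = 7e9  # assumption, for illustration only
    for dtype in ("bf16", "int8", "int4"):
        gib = weight_memory_gib(hypothetical_params, dtype)
        print(f"{dtype}: {gib:.1f} GiB")
```

Whatever is left after the weights has to cover everything else, which is usually what decides whether quantization is needed on a 16 GB card.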

@apareek05 Love this. I fight the same VRAM limits with local sims. Did you quantize SmolVLA to squeeze it onto the 50 Ti? Solid work 👍

@sakurayukiai I was able to train SmolVLA in bf16, as only 100M params were trainable, but I did have to settle for a batch size of 16. The card was almost at its limit though.

@apareek05 5070 Ti gang 🤝
People sleep on what a single consumer card can actually do. Did you have to drop to a 4-bit quant to save VRAM for the KV cache, or did it squeeze into bf16?

tried to consolidate everything into this blog:
akshitpareek.com/posts/two-week…
Akshit Pareek retweeted

wrote about my recent behavior cloning experiment, digging a bit into the internals of how action chunking transformers work and the distributed inference over a Raspberry Pi and my Mac
aryanmadhavverma.com/tech/2026/04/0…
imo, writing is one of the best methods of spaced repetition. you're forced to look back at everything you did, identify knowledge gaps, think hard and find more things to dig deep into
I was writing a section on cross-attention, which got me curious about what the queries are actually learning and what information the decoder space holds. I wrote a visualisation script and realised the cross-attention queries had developed temporal coherence on their own: early queries learned to attend to the arm, later queries to the target object, and this attention pattern shifted dynamically with each frame, always keeping the robot's relevant parts in focus.
no one told them which timestep matters or where to look, they just figured it out from 50 demos!
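The kind of inspection described above can be sketched in a few lines of plain Python. This is not the author's visualisation script, only a toy version of scaled dot-product cross-attention weights: each query (standing in for an action timestep) gets a probability distribution over keys (standing in for image patches), and the peak shows where that query "looks". All vectors are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_weights(queries, keys):
    """Return one attention distribution over the keys for each query."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    weights = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights.append(softmax(scores))
    return weights


# Toy data: 2 queries (early and late timestep), 3 keys (image patches).
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]

attn = cross_attention_weights(queries, keys)
# Index of the patch each query attends to most strongly.
peak = [row.index(max(row)) for row in attn]
print(peak)
```

In a real model the rows of `attn` would be reshaped to the image grid and drawn as a heatmap per query, which is where a pattern like "early queries on the arm, later ones on the target" would become visible.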



Akshit Pareek retweeted

@sudoingX RTX 5070 Ti, running Qwen 3.5 9B Q4 with 264k context

drop your GPU below. i'll tell you exactly what model and config to run on it.
here's what i've tested and verified on real hardware:
RTX 3060 12GB - Qwen 3.5 9B Q4 - 50 tok/s - 128K context
RTX 3090 24GB - Qwen 3.5 27B Q4 - 35 tok/s - 300K context
RTX 3090 24GB - Qwen 3.5 35B MoE Q4 - 112 tok/s - 262K context
2x RTX 3090 - Qwen3-Coder 80B Q4 - 46 tok/s - full VRAM
all running llama.cpp with flash attention. every number is real. every config is tested. if your card isn't on this list drop it below and i'll tell you what fits.
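Besides the quantized weights, context length is the other big VRAM consumer in configs like these. As a hedged sketch, the KV cache of a standard transformer with grouped-query attention can be estimated as below. The layer count, KV head count, and head dimension are hypothetical placeholders, not the real settings of any model in the list; models that use sliding-window or linear-attention layers, or a quantized KV cache, need less.

```python
def kv_cache_gib(num_layers, num_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Size of the key and value cache for one sequence, in GiB."""
    # Factor of 2: each layer stores one tensor for keys and one for values.
    total_bytes = (
        2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_value
    )
    return total_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical architecture, 128K tokens of context, 16-bit cache entries.
    print(kv_cache_gib(num_layers=32, num_kv_heads=8, head_dim=128,
                       context_len=131072))
```

Adding this figure to the weight footprint gives a first-order check on whether a given context length is plausible on a given card.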








