Akshit Pareek

219 posts

@apareek05

deep learning kernels and ML inference @TXinstruments

Bengaluru · Joined June 2012
457 Following · 95 Followers
Pinned Tweet
Akshit Pareek (@apareek05)
hacked around with Franka Panda on MuJoCo over the last couple of weeks, trying to get it to pick books off a table and drop them in a box. limited myself to SmolVLA and ACT since i was on my own 5070 Ti.
6
10
108
12.5K
Akshit Pareek retweeted
Fabrizio Romano (@FabrizioRomano)
❤️🤍🏆 Arsenal lift Premier League trophy after winning the title 2025/26! ✨
1.5K
14.3K
98.1K
1.2M
amv (@aryanmadhaverma)
takes me back to my microprocessors courses at uni. a productive weekend break :”)
Dwarkesh Patel (@dwarkesh_sp)

New blackboard lecture w @reinerpope
How do chips actually work – starting with basic logic gates, and working up to why GPUs, TPUs, FPGAs, and the human brain each look the way they do.
0:00:00 – Building a multiply-accumulate from logic gates
0:16:20 – Muxes and the cost of data movement
0:25:59 – How systolic arrays work
0:39:00 – Clock cycles and pipeline registers
0:51:40 – FPGAs vs ASICs
1:03:14 – Cache vs scratchpad
1:07:16 – Why CPU cores are much bigger than GPU cores
1:11:49 – Brains vs chips
1:15:22 – A GPU is just a bunch of tiny TPUs
Look up Dwarkesh Podcast on YouTube/Spotify/etc to watch. Enjoy!

1
0
21
2.1K
Akshit Pareek retweeted
Stepan Feduniak (@FeduniakS)
Spent last week benchmarking policy speedup methods. Then we just collected faster data and it beat all baselines... Obvious in hindsight, but it turns out the first step to speeding up your policy is … collecting faster data.
9
8
97
21.6K
neural nets. (@cneuralnetwork)
use claude 4.7 opus xhigh so much that I reached top 50 on cisco leaderboards for ai usage on day 1 itself 😭😭😭
28
2
745
24.9K
Akshit Pareek (@apareek05)
@richnanophd didn’t need to for SmolVLA: it fit comfortably with bf16 weights during inference on 16 GB VRAM, though during training I needed to make some sacrifices. But yeah, for running MolmoAct2 as just local inference on the 5070 Ti, I’ll be trying both 4-bit and 8-bit quantizations.
0
0
0
80
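A rough sketch of the arithmetic behind the reply above, counting weights only (no activations, KV cache, or framework overhead). The parameter count is a hypothetical placeholder for illustration, not the size of SmolVLA or MolmoAct2, and real quantized formats usually store a little more than the nominal bits per weight.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Hypothetical parameter count, chosen only to illustrate the scaling.
NUM_PARAMS = 7_000_000_000

for label, bits in [("bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Halving the bits per weight halves the weight footprint, which is why 8-bit and 4-bit quantization are the usual first things to try on a 16 GB card.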
Dr. Richard (@richnanophd)
@apareek05 Love this. I fight the same VRAM limits with local sims. Did you quantize SmolVLA to squeeze it onto the 50 Ti? Solid work 👍
1
0
2
94
Akshit Pareek (@apareek05)
@sakurayukiai I was able to train SmolVLA in bf16, as only 100M params were trainable, but I did have to settle for a batch size of 16. card was almost at the limit though
0
0
1
115
Sakura Yuki (@sakurayukiai)
@apareek05 5070 Ti gang 🤝 People sleep on what a single consumer card can actually do. Did you have to drop to a 4-bit quant to save VRAM for the KV cache, or did it squeeze into bf16?
1
0
2
155
Akshit Pareek (@apareek05)
then tried a harder scene from scratch. bigger rack, fancier prompts, 240 fresh demos. SmolVLA stopped following the prompt entirely. 0/10. next: renting an A100 to try pi 0.5 or MolmoAct2.
2
0
3
350
Jino Rohit (@jino_rohit)
cuda, triton, cutlass, cute, tilelang, thunderkittens, mojo, helion. so which one do you even learn at this point?
45
4
255
17.7K
Akshit Pareek retweeted
TC (@totalcristiano)
There’s only one Cristiano Ronaldo.
130
3K
19.8K
421.8K
Akshit Pareek retweeted
amv (@aryanmadhaverma)
wrote about my recent behavior cloning experiment digging a bit into the internals of how action chunking transformers work and the distributed inference over raspi and my mac
aryanmadhavverma.com/tech/2026/04/0…
imo, writing is one of the best methods of spaced repetition. you're forced to look back at everything you did, identify knowledge gaps, think hard and find more things to dig deep into
I was writing a section on cross-attention which got me curious about what the queries are actually learning and what information did the decoder space hold
wrote a visualisation script and realised the cross attention queries had developed temporal coherence on their own which meant that early queries learned to attend to the arm, later queries to the target object, and this attention pattern shifted dynamically with each frame, always keeping the robot's relevant parts in focus. no one told them which timestep matters or where to look, they just figured it out from 50 demos!
2
6
23
1.1K
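For readers unfamiliar with what such a visualisation inspects, here is a minimal sketch of scaled dot-product cross-attention weights. It is not the script from the post: the random arrays stand in for learned decoder queries and encoded image tokens, and all sizes and names are hypothetical.

```python
import numpy as np


def cross_attention_weights(queries: np.ndarray, keys: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention weights: one row per query, softmax over keys."""
    dim = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(dim)
    scores = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
num_queries, num_tokens, dim = 8, 20, 32  # hypothetical sizes

queries = rng.normal(size=(num_queries, dim))  # stand-in for learned decoder queries
keys = rng.normal(size=(num_tokens, dim))      # stand-in for encoded image tokens

attn = cross_attention_weights(queries, keys)
top_token = attn.argmax(axis=-1)  # index of the token each query attends to most
print(attn.shape, top_token)
```

Reshaping each row of `attn` back onto the image grid, one frame at a time, gives the kind of per-query heatmap in which a shift from the arm to the target object would show up.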
Vinit Sarode (@vinitsarode_)
who's the hottest man in blr who we can cast in our launch video. imagine a young 27 yo harvey specter. we shoot this weekend.
89
3
180
49.4K
amv (@aryanmadhaverma)
this is not what I pay you 100$/month for @claudeai
1
0
5
374
Akshit Pareek retweeted
Peer Richelsen (@peer_rich)
the male urge to build a GPU cluster at home
81
237
2K
80.4K
Sudo su (@sudoingX)
drop your GPU below. i'll tell you exactly what model and config to run on it.
here's what i've tested and verified on real hardware:
RTX 3060 12GB - Qwen 3.5 9B Q4 - 50 tok/s - 128K context
RTX 3090 24GB - Qwen 3.5 27B Q4 - 35 tok/s - 300K context
RTX 3090 24GB - Qwen 3.5 35B MoE Q4 - 112 tok/s - 262K context
2x RTX 3090 - Qwen3-Coder 80B Q4 - 46 tok/s - full VRAM
all running llama.cpp with flash attention. every number is real. every config is tested.
if your card isn't on this list drop it below and i'll tell you what fits.
724
101
1.6K
192.8K
amv (@aryanmadhaverma)
filling in the knowledge gaps ASAP
3
0
4
211