ayush

272 posts

ayush

@ayushrgarg

co-founder @hyperictech | prev @uwaterloo se

California, USA · Joined December 2017
358 Following · 735 Followers
ayush retweeted
Reese Chong@_reesechong·
I added KV caching and INT8 KV quantization to our transformer inference, improving throughput by 35x. All of this was done from scratch in Rust + CUDA, on top of a homemade ML framework.

On a 4-token prompt with 252 generated tokens:
- Original: 0.76 tok/s
- KV cache fp32: 27.21 tok/s
- KV cache int8 (quantized): 27.29 tok/s

Try it out yourself here: mni-ml.github.io/demos/kv-cache/

In practice:
- KV caching gave us about a 35x end-to-end speedup
- INT8 KV cache kept roughly the same speed as fp32 but cut KV cache memory by 3.78x

FP32 cache used 4.5 MB in this run while the INT8 cache used only 1.19 MB

This simple change to inference created a huge impact on performance. To learn more about the KV cache and other optimizations like this, check out the blog at mni.ml!
20
22
489
47.7K
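The ratios quoted in the post above can be checked with a few lines of arithmetic, and the idea behind an INT8 KV cache can be sketched in plain Python. This is only an illustration, not the mni-ml implementation: the post does not say which quantization scheme was used, so the symmetric per-vector scheme with one fp32 scale per cached vector, and the names `quantize_int8`, `dequantize_int8`, and `cache_bytes`, are all assumptions made here.

```python
# Minimal sketch (assumed scheme, not the mni-ml code): symmetric per-vector
# INT8 quantization of a cached key/value vector, plus the ratios from the post.

def quantize_int8(vec):
    """Return (int8 values, scale) such that vec[i] is close to q[i] * scale."""
    max_abs = max((abs(x) for x in vec), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in vec]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate fp32 values from the quantized vector."""
    return [v * scale for v in q]

def cache_bytes(num_vectors, dim, quantized):
    """Bytes needed to store num_vectors cached vectors of length dim."""
    if quantized:
        return num_vectors * (dim + 4)  # 1 byte per value + one fp32 scale
    return num_vectors * dim * 4  # 4 bytes per fp32 value

# Hypothetical key vector, used only to show the round-trip error.
key = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(key)
restored = dequantize_int8(q, scale)
max_err = max(abs(a - b) for a, b in zip(key, restored))

# Ratios computed from the figures quoted in the post.
speedup = 27.21 / 0.76      # about 35.8
memory_ratio = 4.5 / 1.19   # about 3.78

print(f"speedup: {speedup:.1f}x, memory ratio: {memory_ratio:.2f}x")
print(f"max round-trip error: {max_err:.5f} (at most half a quantization step)")
```

Under this assumed scheme the saving stays a little under 4x, because each cached vector carries its own scale in addition to the one-byte values.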
Reese Chong@_reesechong·
Behind the scenes of mni-ml:

January 4th 2026 - my roommate @MankyDankyBanky and I wanted to do a big project together. “maybe we should try to build pytorch from scratch” We found @srush_nlp's minitorch curriculum and committed to grinding through it Jan to April.

February - autodiff and tensor internals done. lots of late night PR reviews, stacked diffs, Kinton ramen runs to Toronto when I'd visit Aadi at Shopify. We started posting on X to keep ourselves accountable.

March - the month of parallelization: Aadi shipped tiled matmul using the same algo @nvidia teaches in their CUDA guide, wrapped by end of month - pooling, conv1d/2d forward+backward, softmax, dropout.

March 22-23 - @socraticainfo symposium & we see the tinytpu team on the stage which filled us with determination 🫡 cc: @evanliin @XanderChin @suryasure05 @kennykgguo

March 24 - chose the mni-ml brand and started the educational blog

March 30 - minitorch is DONE ahead of schedule. now we build on top of the framework.

April 5-6 - cuBLAS matmul via koffi FFI. buffer pooling, strided batched GEMM, kernel optimizations. CUDA backend takes shape.

April 7 - huge day. cross-platform CI pipeline, prebuilt npm binaries, v0.3.0 - CUDA live on @npmjs. flatten the monorepo, add @WebGPU + Windows CUDA build targets by eod.

April 12 - flash attention CUDA kernel ships. we caught a bug where head dim > 32 was truncating.

April 14 (during exam season), we recorded the demo in @Shopify recording studio during Aadi’s lunch break. Everything over the last 4mo finally came together. Cc: @fnthawar @tobi @alspee

April 17 - launch post and bought the domain mni.ml and we’re just getting started. We have so much in store for this summer, stay tuned 🫡 cc: @sundeep @GavinSherry
[image]
Aadi Kulshrestha@MankyDankyBanky

I trained a 12M parameter LLM on my own ML framework using a Rust backend and CUDA kernels for flash attention, AdamW, and more. Wrote the full transformer architecture, and BPE tokenizer from scratch.

The framework features:
- Custom CUDA kernels (Flash Attention, fused LayerNorm, fused GELU) for 3x increased throughput
- Automatic WebGPU fallback for non-NVIDIA devices
- TypeScript API with Rust compute backend
- One npm install to get started, prebuilt binaries for every platform

Try out the model for yourself: mni-ml.github.io/demos/transfor…

Built with @_reesechong. Check out the repos and blog if you want to learn more.

Shoutout to @modal for the compute credits allowing me to train on 2 A100 GPUs without going broke

cc @sundeep @GavinSherry

15
12
239
40K
ayush retweeted
Vishnu Satish@VishnuSatish_·
I built and trained a ~6M parameter GPT-2 entirely from scratch in C++, and it actually generates English text with mostly correct grammar! No PyTorch and no external dependencies. Just pure C++ 20. More info, GitHub link, and screenshots below!
70
63
1.2K
64.7K
ayush@ayushrgarg·
@adiprasadd unironically pu 9 pm tn lets run it 😭
1
0
0
164
ayush@ayushrgarg·
99% of gamblers quit before they win I won
[image]
1
0
31
1.1K
ayush retweeted
Aadi Kulshrestha@MankyDankyBanky·
I trained a 12M parameter LLM on my own ML framework using a Rust backend and CUDA kernels for flash attention, AdamW, and more. Wrote the full transformer architecture, and BPE tokenizer from scratch.

The framework features:
- Custom CUDA kernels (Flash Attention, fused LayerNorm, fused GELU) for 3x increased throughput
- Automatic WebGPU fallback for non-NVIDIA devices
- TypeScript API with Rust compute backend
- One npm install to get started, prebuilt binaries for every platform

Try out the model for yourself: mni-ml.github.io/demos/transfor…

Built with @_reesechong. Check out the repos and blog if you want to learn more.

Shoutout to @modal for the compute credits allowing me to train on 2 A100 GPUs without going broke

cc @sundeep @GavinSherry
130
256
3.5K
771.6K
ayush@ayushrgarg·
is ts tuff
[image]
12
0
70
4.9K
ayush@ayushrgarg·
we do things a lil diff around here
[image]
1
0
16
892
Umesh Khanna 🇨🇦🇺🇸@forwarddeploy·
Thinking of hosting more chai and samosas in SF at ours ☕️ Want to come hang out with good people and have fun snacks, lmk below! 🙌
[image]
224
7
544
57.6K
ayush retweeted
Modal@modal·
The future of artificial intelligence is physical. @physical_int runs robotic control inference on Modal with >2x lower latency than the lag between your brain and your finger.
3
30
300
94.1K
ayush@ayushrgarg·
anybody really good @ flying drones located in the bay?
will pay you $$$ to fly drones all day
free lunch + unlimited snacks & drinks
6
1
28
2.4K
ayush@ayushrgarg·
@PoG_Shmerb DM me proof of your drone skills; as for the job itself you'll fly a drone in an environment we choose (we'll provide drones + transmitters and anything else you may need) & we'll collect that data
0
0
0
151
Shmerb@PoG_Shmerb·
@ayushrgarg What kind of footage do you want to capture?
1
0
1
154
ayush@ayushrgarg·
@krupaad lets chat, we're building autonomous drones
1
0
7
847
krupa@krupaad·
bit late to the recruiting cycle, but looking for a summer internship in ML/hardware/inference!!
i've been working on CUDA kernel writing, FPGA acceleration and RTL. would love to find a team doing similar work this summer
dual US/Canada citizen, can relocate anywhere
DMs open :)
36
13
264
30.4K
Neha Kasoju@NehaKasoju·
relentlessly build the world you want to live in
1
0
4
127
ayush@ayushrgarg·
recently rediscovered an impulse purchase from 2 years ago
[image]
0
0
7
557