Tushar Deshpande

1.1K posts

@tsd_10

26 | AI/DL/RL/Optimization🤓 Right Politics! Virginia Tech, Masters

Blacksburg, VA · Joined July 2015
1.8K Following · 137 Followers
Tushar Deshpande retweeted
VG🌪️ @HelloVyom
bookmark this!!! The AI interview meta has changed: companies like Anthropic & OpenAI now ask you to implement attention mechanisms from scratch in live rounds. Free repos that actually cover this 👇
11 replies · 120 retweets · 1.3K likes · 54.8K views
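For reference, a minimal NumPy sketch of the scaled dot-product attention such live rounds ask for; the function names and the causal-mask demo are illustrative, not taken from any of the linked repos:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating so exp() cannot overflow.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K, V: (seq_len, d_k) arrays; mask: optional boolean (seq_len, seq_len)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # pairwise similarity logits
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked positions get ~zero weight
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V                         # weighted sum of value vectors

# Tiny smoke test with a causal (lower-triangular) mask.
T, d = 4, 8
Q, K, V = np.random.default_rng(0).normal(size=(3, T, d))
causal = np.tril(np.ones((T, T), dtype=bool))
print(scaled_dot_product_attention(Q, K, V, mask=causal).shape)  # (4, 8)
```

Multi-head attention is this same routine run over h smaller projections of Q, K, V in parallel, then concatenated back.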
Tushar Deshpande retweeted
Aadi Kulshrestha @MankyDankyBanky
I trained a 12M parameter LLM on my own ML framework using a Rust backend and CUDA kernels for flash attention, AdamW, and more. Wrote the full transformer architecture and BPE tokenizer from scratch. The framework features:
- Custom CUDA kernels (Flash Attention, fused LayerNorm, fused GELU) for 3x increased throughput
- Automatic WebGPU fallback for non-NVIDIA devices
- TypeScript API with Rust compute backend
- One npm install to get started, prebuilt binaries for every platform
Try out the model for yourself: mni-ml.github.io/demos/transfor…
Built with @_reesechong. Check out the repos and blog if you want to learn more. Shoutout to @modal for the compute credits allowing me to train on 2 A100 GPUs without going broke cc @sundeep @GavinSherry
114 replies · 199 retweets · 2.9K likes · 554.5K views
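Of the pieces listed above, the BPE tokenizer is the easiest to reproduce in isolation. A minimal Python sketch of the core training loop (word-level for readability; byte-level handling, special tokens, and the Rust/CUDA parts are out of scope, and none of this code is from the linked repos):

```python
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    # Replace every occurrence of the pair with one merged symbol.
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    # Start from characters; greedily merge the most frequent adjacent pair.
    vocab = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(vocab, best)
        merges.append(best)
    return merges

print(train_bpe("low low low lower lowest", 3))  # [('l','o'), ('lo','w'), ('low','e')]
```

Each returned pair is a learned merge rule; encoding a new string replays these merges in order.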
Tushar Deshpande retweeted
Vincent Weisser @vincentweisser
Scaling to ultra-long horizon agents requires novel benchmarks and RL environments. FrontierSWE by @ProximalHQ is exactly that: 11h average runtime, open-ended tasks like end-to-end model optimization, and frontier agents fail almost all of them. We co-designed granite_inf, where the agent optimizes a model's forward pass inside an inference engine against a human-tuned baseline, and host it on our environments hub. primeintellect.ai/blog/frontier-…
Prime Intellect @PrimeIntellect

We are excited to host @ProximalHQ's FrontierSWE on the Environments Hub as a launch partner. It's an ultra-long horizon coding evaluation: even today's frontier models struggle to solve its tasks after running for hours.

4 replies · 10 retweets · 137 likes · 13.5K views
Tushar Deshpande retweeted
Justus Mattern @MatternJustus
Introducing FrontierSWE, an ultra-long horizon coding benchmark. We test agents on some of the hardest technical tasks like optimizing a video rendering library or training a model to predict the quantum properties of molecules. Despite having 20 hours, they rarely succeed
77 replies · 131 retweets · 1.3K likes · 191.7K views
Tushar Deshpande retweeted
SANATAN @Eternaldharma_
Lal Bahadur Shastri never told his own mother that he was the Railway Minister. To her, he was just a man doing a simple railway job.

One day, his mother showed up at Rail Bhavan, asking around, saying, "My son works here." When she revealed his name, people were stunned; some even thought she was lying. She was taken to Shastri Ji. She instantly recognized him: "Yes, he's my son." Officials turned to him: "Is she really your mother?" He calmly brought her in, sat her beside him for a while, and then sent her home.

Later, journalists asked him why he didn't acknowledge her publicly. His reply cut through everything: "If she finds out I'm a minister, she'll start recommending people. I won't be able to refuse her. And it might bring pride into her heart."

No drama. No display. Just discipline. That's the difference between power worn as a badge… and responsibility carried like a burden.
61 replies · 735 retweets · 4.7K likes · 252.9K views
Tushar Deshpande retweeted
Pralhad Joshi @JoshiPralhad
Yesterday, an Indian company demonstrated an imported stove that uses electricity to generate flame-like burners, similar to LPG, for cooking. I was truly impressed by this innovative technology and would like to see Indian manufacturers adopt and scale it domestically. When combined with @PMSuryaGhar, which enables electricity generation through solar power, this innovation could be a game changer in reducing dependence on LPG.
371 replies · 664 retweets · 4.2K likes · 617.8K views
Tushar Deshpande @tsd_10
@zeeshanp_ Why 10T though? And will you guys serve this model in fp16? That would be around 25-30K GB200s for 2M concurrent users, if I'm not wrong, with some MoE routing optimization?
0 replies · 0 retweets · 0 likes · 631 views
Elon Musk @elonmusk
SpaceXAI Colossus 2 now has 7 models in training:
- Imagine V2
- 2 variants of 1T
- 2 variants of 1.5T
- 6T
- 10T
Some catching up to do.
6.7K replies · 7.6K retweets · 68K likes · 28.1M views
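A back-of-envelope Python version of the arithmetic in the reply above; every constant here (active-parameter ratio, tokens per second per user, per-GPU throughput, usable HBM) is a loose assumption for illustration, not a quoted spec:

```python
# Rough sketch: GPUs needed to serve a 10T-parameter MoE at scale.
# All constants below are assumptions for illustration, not vendor specs.
total_params = 10e12          # 10T parameters
bytes_per_param = 2           # fp16
active_ratio = 0.05           # assume ~5% of params active per token (MoE routing)
users = 2e6                   # concurrent users
tok_per_user = 20             # assumed tokens/s per user
hbm_per_gpu = 186e9           # assumed usable HBM per GB200 GPU, in bytes
flops_per_gpu = 2.5e15        # assumed sustained fp16 FLOP/s at serving efficiency

# Memory floor: just holding one copy of the weights.
weight_bytes = total_params * bytes_per_param
gpus_for_weights = weight_bytes / hbm_per_gpu

# Compute floor: ~2 FLOPs per active parameter per generated token.
flops_needed = 2 * total_params * active_ratio * users * tok_per_user
gpus_for_compute = flops_needed / flops_per_gpu

print(f"weights: {weight_bytes/1e12:.0f} TB -> {gpus_for_weights:,.0f} GPUs minimum")
print(f"compute: {flops_needed/1e18:.1f} EFLOP/s -> {gpus_for_compute:,.0f} GPUs")
```

Under these assumptions the compute floor lands around 16K GPUs, so 25-30K after KV-cache memory, routing imbalance, and batching overhead is a plausible ballpark.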
Tushar Deshpande retweeted
Antonio Lupetti @antoniolupetti
“An overview of gradient descent optimization algorithms.” Gradient descent is the standard tool for training neural networks, yet it’s often treated as a black box, and practical explanations of its behavior are rare. This paper by Sebastian Ruder offers one of the clearest insights into what’s really happening under the hood. A fundamental read for anyone working in machine learning. arxiv.org/abs/1609.04747
2 replies · 76 retweets · 501 likes · 30.6K views
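As a taste of what the survey covers, a minimal Python sketch of three of the update rules it walks through (vanilla SGD, momentum, Adam) on a toy quadratic; hyperparameters are illustrative defaults:

```python
import numpy as np

def grad(theta):
    # Gradient of the toy loss f(theta) = 0.5 * ||theta||^2 (minimum at 0).
    return theta

def sgd_step(theta, g, lr=0.1):
    return theta - lr * g

def momentum_step(theta, g, v, lr=0.1, beta=0.9):
    v = beta * v + lr * g              # decaying accumulation of past gradients
    return theta - v, v

def adam_step(theta, g, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # first moment (mean) estimate
    v = b2 * v + (1 - b2) * g ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)          # correct the zero-initialization bias
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

theta = np.array([3.0, -2.0])
m = v = np.zeros_like(theta)
for t in range(1, 201):
    theta, m, v = adam_step(theta, grad(theta), m, v, t)
print(theta)  # approaches [0, 0]
```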
Tushar Deshpande retweeted
steve @gpusteve
you're interviewing for an ml performance role at mistral ai and they ask: "how have moe models changed the design of distributed communication in inference systems?" you say: "they haven't - it's still nccl collectives end to end." wrong! here's how you answer:
2 replies · 2 retweets · 80 likes · 8.2K views
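The gist of the expected answer: MoE layers add an expert-parallel dispatch step, where each token's hidden state is sent to the ranks hosting its top-k experts (an all-to-all), computed, and combined back, on top of the usual tensor-parallel allreduces. A toy single-process Python sketch of just the routing step; the shapes, the gating, and `topk_route` itself are simplified assumptions, not any engine's actual API:

```python
import numpy as np

def topk_route(tokens, gate_logits, num_experts, k=2):
    """Assign each token to its top-k experts and build per-expert batches.

    In a real expert-parallel system, each expert batch would be sent to the
    rank hosting that expert via an all-to-all, computed, and sent back.
    """
    topk = np.argsort(gate_logits, axis=-1)[:, -k:]    # (n_tokens, k) expert ids
    # Softmax over only the selected experts' logits to get combine weights.
    sel = np.take_along_axis(gate_logits, topk, axis=-1)
    w = np.exp(sel - sel.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)

    batches = {e: [] for e in range(num_experts)}      # dispatch lists per expert
    for tok_id in range(tokens.shape[0]):
        for slot in range(k):
            batches[topk[tok_id, slot]].append((tok_id, w[tok_id, slot]))
    return batches

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))   # 8 tokens, hidden dim 16
logits = rng.normal(size=(8, 4))    # gate scores over 4 experts
for expert, items in topk_route(tokens, logits, num_experts=4).items():
    print(expert, [(t, round(float(wt), 2)) for t, wt in items])
```

The communication shape follows directly: the per-expert batch sizes are data-dependent and unbalanced, which is why all-to-all dispatch (and load-balancing around it) shows up alongside, not instead of, NCCL allreduce collectives.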
Tushar Deshpande retweeted
Jack Zhang @jcz42
We made Muon run up to 2x faster for free! Introducing Gram Newton-Schulz: a mathematically equivalent but computationally faster Newton-Schulz algorithm for polar decomposition.

Gram Newton-Schulz rewrites Newton-Schulz such that instead of iterating on the expensive rectangular X matrix, we iterate on the small, square, symmetric XX^T Gram matrix to reduce FLOPs. This lets us make more use of fast symmetric GEMM kernels on Hopper and Blackwell, halving the FLOPs of each of those GEMMs.

Gram Newton-Schulz is a drop-in replacement for Newton-Schulz in your Muon use case: we see validation perplexity preserved within 0.01, and share our (long!) journey stabilizing this algorithm and ensuring that training quality is preserved above all else.

This was a super fun project with @noahamsel, @berlinchen, and @tri_dao that spanned theory, numerical analysis, and ML systems! Blog and codebase linked below 🧵
17 replies · 165 retweets · 1K likes · 209.6K views
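For context, the baseline being accelerated: Muon orthogonalizes each gradient matrix with a quintic Newton-Schulz iteration. A minimal NumPy sketch using the widely circulated public Muon coefficients; the Gram reformulation itself is not reproduced here, only the rectangular iteration it speeds up:

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximate the polar factor (orthogonalization) of G, as used in Muon.

    Coefficients are the quintic ones from the public Muon implementation.
    Each step costs three GEMMs touching the rectangular X, which is where
    the Gram reformulation above saves FLOPs.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)         # spectral norm <= Frobenius norm <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                               # keep the Gram-side dimension small
    for _ in range(steps):
        A = X @ X.T                           # small symmetric Gram matrix
        X = a * X + (b * A + c * A @ A) @ X   # quintic polynomial update
    return X.T if transposed else X

G = np.random.default_rng(0).normal(size=(64, 256))
U = newton_schulz(G)
# Largest and smallest singular values of U should both sit near 1
# (Muon's coefficients trade exactness for speed, so "near" is loose).
print(np.round(np.linalg.svd(U, compute_uv=False)[[0, -1]], 2))
```

The post's trick, as described above, is to iterate on the small symmetric A = XX^T instead of the rectangular X, where symmetric GEMM kernels can skip roughly half the work.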
Tushar Deshpande retweeted
Mishig Davaadorj @mishig25
autoresearch skips the middlemen (researchers) and directly turns capital into novel insight. At HF, we've built infra where an agent can launch GPU experiments, access all of HF's datasets as a local filesystem, and semantically search arxiv for ideas github.com/mishig25/hf-au…
5 replies · 22 retweets · 223 likes · 37.9K views
Tushar Deshpande retweeted
difficultyang @difficultyang
Let's say we wanted to rewrite PyTorch from scratch, because such a thing is the topic du jour in the age of LLMs. What would the goals of such a rewrite be? What problems could a rewrite solve that incremental evolution from today's codebase could not? 🧵
25 replies · 44 retweets · 623 likes · 71.9K views
Tushar Deshpande retweeted
AI at Meta @AIatMeta
Today we're introducing TRIBE v2 (Trimodal Brain Encoder), a foundation model trained to predict how the human brain responds to almost any sight or sound. Building on our Algonauts 2025 award-winning architecture, TRIBE v2 draws on 500+ hours of fMRI recordings from 700+ people to create a digital twin of neural activity and enable zero-shot predictions for new subjects, languages, and tasks. Try the demo and learn more here: go.meta.me/tribe2
737 replies · 2.5K retweets · 16K likes · 6.8M views
Tushar Deshpande retweeted
Nav Toor @heynavtoor
🚨 Meta, Google DeepMind, and OpenAI all ask the same thing in ML interviews: "Implement softmax from scratch." Most candidates fail.

Someone just open sourced the training ground for it. It's called TorchCode. LeetCode, but for PyTorch. 39 problems that test the exact skills top AI labs hire for. No tutorials. No hand-holding. Implement it or fail. Instant auto-grading.

Here's what's inside this thing:
→ Implement ReLU, softmax, LayerNorm, dropout from scratch
→ Build multi-head attention, full Transformer blocks, GPT-2
→ Automated judge checks correctness, gradients, and timing
→ Colored pass/fail per test case, like competitive programming
→ Hints when you're stuck. Full reference solutions after you try.
→ Progress tracking: what you solved, best times, attempt counts.
→ Runs in your browser. No GPU needed. No signup. No cloud.

Here's the wildest part: every problem is a real interview question from top AI companies. You're not learning theory. You're practicing the exact exercises that get people $400K+ offers at Meta AI, DeepMind, and OpenAI.

Try it right now on Hugging Face. Zero install. Opens in your browser. ML bootcamps charge $10,000 to $30,000 to teach this. Interview prep courses charge $2,000+. This is free. 776 GitHub stars. MIT License. 100% Open Source.
34 replies · 157 retweets · 1.2K likes · 188.2K views
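Since "implement softmax from scratch" is the headline question, the classic pitfalls are overflow in exp() and the Jacobian in the backward pass. A minimal NumPy sketch with a finite-difference check; the function names are illustrative, not from the TorchCode repo:

```python
import numpy as np

def softmax(x, axis=-1):
    # Shift by the max so exp() never overflows; the output is unchanged
    # because softmax is invariant to adding a constant to every logit.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_backward(y, dy):
    # Given y = softmax(x) and upstream grad dy, the Jacobian-vector product
    # is dx = y * (dy - sum(dy * y)), from dy_i/dx_j = y_i (delta_ij - y_j).
    return y * (dy - (dy * y).sum(axis=-1, keepdims=True))

# Check against a finite-difference gradient of loss = sum(w * softmax(x)).
rng = np.random.default_rng(0)
x, w = rng.normal(size=5), rng.normal(size=5)
analytic = softmax_backward(softmax(x), w)
eps = 1e-6
numeric = np.array([
    ((w * softmax(x + eps * np.eye(5)[i])).sum()
     - (w * softmax(x - eps * np.eye(5)[i])).sum()) / (2 * eps)
    for i in range(5)
])
print(np.allclose(analytic, numeric, atol=1e-6))  # True
```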
Tushar Deshpande retweeted
chuyi shang @chuyishang
Wrote a deep dive on implementing a language model from scratch in JAX and scaling it with distributed training! If you’re coming from PyTorch and want to see how the same ideas look in JAX, or just want a hands-on intro to distributed training, check out this blog post: chuyishang.com/blog/2026/jax-… Comes with code + an assignment and test cases so you can follow along!
9 replies · 66 retweets · 603 likes · 32.3K views