Tushar Deshpande

1.1K posts

@tsd_10

26 | AI/DL/RL/Optimization🤓 Right Politics! Virginia Tech, Masters

Blacksburg, VA · Joined July 2015
1.8K Following · 137 Followers
Tushar Deshpande retweeted
VG🌪️ @HelloVyom
bookmark this!!! The AI interview meta has changed: companies like Anthropic & OpenAI now ask you to implement attention mechanisms from scratch in live rounds. Free repos that actually cover this 👇
11 replies · 120 retweets · 1.3K likes · 54.8K views
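For reference, a minimal NumPy sketch of the scaled dot-product attention such live rounds ask for; the function names and the causal-mask demo are illustrative, not taken from any of the linked repos:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating so exp() cannot overflow.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K, V: (seq_len, d_k) arrays; mask: optional boolean (seq_len, seq_len)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # pairwise similarity logits
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked positions get ~zero weight
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V                         # weighted sum of value vectors

# Tiny smoke test with a causal (lower-triangular) mask.
T, d = 4, 8
Q, K, V = np.random.default_rng(0).normal(size=(3, T, d))
causal = np.tril(np.ones((T, T), dtype=bool))
print(scaled_dot_product_attention(Q, K, V, mask=causal).shape)  # (4, 8)
```

Multi-head attention is this same routine run over h smaller projections of Q, K, V in parallel, then concatenated back.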
Tushar Deshpande retweeted
Aadi Kulshrestha @MankyDankyBanky
I trained a 12M parameter LLM on my own ML framework using a Rust backend and CUDA kernels for flash attention, AdamW, and more. Wrote the full transformer architecture and BPE tokenizer from scratch. The framework features:
- Custom CUDA kernels (Flash Attention, fused LayerNorm, fused GELU) for 3x increased throughput
- Automatic WebGPU fallback for non-NVIDIA devices
- TypeScript API with Rust compute backend
- One npm install to get started, prebuilt binaries for every platform
Try out the model for yourself: mni-ml.github.io/demos/transfor…
Built with @_reesechong. Check out the repos and blog if you want to learn more. Shoutout to @modal for the compute credits allowing me to train on 2 A100 GPUs without going broke cc @sundeep @GavinSherry
114 replies · 199 retweets · 2.9K likes · 554.5K views
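Of the pieces listed above, the BPE tokenizer is the easiest to reproduce in isolation. A minimal Python sketch of the core training loop (word-level for readability; byte-level handling, special tokens, and the Rust/CUDA parts are out of scope, and none of this code is from the linked repos):

```python
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    # Replace every occurrence of the pair with one merged symbol.
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    # Start from characters; greedily merge the most frequent adjacent pair.
    vocab = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(vocab, best)
        merges.append(best)
    return merges

print(train_bpe("low low low lower lowest", 3))  # [('l','o'), ('lo','w'), ('low','e')]
```

Each returned pair is a learned merge rule; encoding a new string replays these merges in order.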
Tushar Deshpande retweeted
Vincent Weisser @vincentweisser
Scaling to ultra-long horizon agents requires novel benchmarks and RL environments. FrontierSWE by @ProximalHQ is exactly that: 11h average runtime, open-ended tasks like end-to-end model optimization, and frontier agents fail almost all of them. We co-designed granite_inf, where the agent optimizes a model's forward pass inside an inference engine against a human-tuned baseline, and host it on our environments hub. primeintellect.ai/blog/frontier-…
Prime Intellect @PrimeIntellect

We are excited to host @ProximalHQ's FrontierSWE on the Environments Hub as a launch partner. It's an ultra-long horizon coding evaluation: even today's frontier models struggle to solve its tasks after running for hours.

4 replies · 10 retweets · 137 likes · 13.5K views
Tushar Deshpande retweeted
Justus Mattern @MatternJustus
Introducing FrontierSWE, an ultra-long horizon coding benchmark. We test agents on some of the hardest technical tasks like optimizing a video rendering library or training a model to predict the quantum properties of molecules. Despite having 20 hours, they rarely succeed
77 replies · 131 retweets · 1.3K likes · 191.7K views
Tushar Deshpande retweeted
SANATAN @Eternaldharma_
Lal Bahadur Shastri never told his own mother that he was the Railway Minister. To her, he was just a man doing a simple railway job.

One day, his mother showed up at Rail Bhavan, asking around, saying, "My son works here." When she revealed his name, people were stunned; some even thought she was lying. She was taken to Shastri Ji. She instantly recognized him: "Yes, he's my son." Officials turned to him: "Is she really your mother?" He calmly brought her in, sat her beside him for a while, and then sent her home.

Later, journalists asked him why he didn't acknowledge her publicly. His reply cut through everything: "If she finds out I'm a minister, she'll start recommending people. I won't be able to refuse her. And it might bring pride into her heart."

No drama. No display. Just discipline. That's the difference between power worn as a badge… and responsibility carried like a burden.
61 replies · 735 retweets · 4.7K likes · 252.9K views
Tushar Deshpande retweeted
Pralhad Joshi @JoshiPralhad
Yesterday, an Indian company demonstrated an imported stove that uses electricity to generate flame-like burners, similar to LPG, for cooking. I was truly impressed by this innovative technology and would like to see Indian manufacturers adopt and scale it domestically. When combined with @PMSuryaGhar, which enables electricity generation through solar power, this innovation could be a game changer in reducing dependence on LPG.
371 replies · 664 retweets · 4.2K likes · 617.8K views
Tushar Deshpande @tsd_10
@zeeshanp_ Why 10T though? And will you guys serve this model in fp16? That would be around 25-30K GB200s for 2M concurrent users, if I'm not wrong, with some MoE routing optimization?
0 replies · 0 retweets · 0 likes · 631 views
Elon Musk @elonmusk
SpaceXAI Colossus 2 now has 7 models in training:
- Imagine V2
- 2 variants of 1T
- 2 variants of 1.5T
- 6T
- 10T
Some catching up to do.
6.7K replies · 7.6K retweets · 68K likes · 28.1M views
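A back-of-envelope Python version of the arithmetic in the reply above; every constant here (active-parameter ratio, tokens per second per user, per-GPU throughput, usable HBM) is a loose assumption for illustration, not a quoted spec:

```python
# Rough sketch: GPUs needed to serve a 10T-parameter MoE at scale.
# All constants below are assumptions for illustration, not vendor specs.
total_params = 10e12          # 10T parameters
bytes_per_param = 2           # fp16
active_ratio = 0.05           # assume ~5% of params active per token (MoE routing)
users = 2e6                   # concurrent users
tok_per_user = 20             # assumed tokens/s per user
hbm_per_gpu = 186e9           # assumed usable HBM per GB200 GPU, in bytes
flops_per_gpu = 2.5e15        # assumed sustained fp16 FLOP/s at serving efficiency

# Memory floor: just holding one copy of the weights.
weight_bytes = total_params * bytes_per_param
gpus_for_weights = weight_bytes / hbm_per_gpu

# Compute floor: ~2 FLOPs per active parameter per generated token.
flops_needed = 2 * total_params * active_ratio * users * tok_per_user
gpus_for_compute = flops_needed / flops_per_gpu

print(f"weights: {weight_bytes/1e12:.0f} TB -> {gpus_for_weights:,.0f} GPUs minimum")
print(f"compute: {flops_needed/1e18:.1f} EFLOP/s -> {gpus_for_compute:,.0f} GPUs")
```

Under these assumptions the compute floor lands around 16K GPUs, so 25-30K after KV-cache memory, routing imbalance, and batching overhead is a plausible ballpark.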
Tushar Deshpande retweeted
Antonio Lupetti @antoniolupetti
“An overview of gradient descent optimization algorithms.” Gradient descent is the standard tool for training neural networks, yet it’s often treated as a black box, and practical explanations of its behavior are rare. This paper by Sebastian Ruder offers one of the clearest insights into what’s really happening under the hood. A fundamental read for anyone working in machine learning. arxiv.org/abs/1609.04747
2 replies · 76 retweets · 501 likes · 30.6K views
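As a taste of what the survey covers, a minimal Python sketch of three of the update rules it walks through (vanilla SGD, momentum, Adam) on a toy quadratic; hyperparameters are illustrative defaults:

```python
import numpy as np

def grad(theta):
    # Gradient of the toy loss f(theta) = 0.5 * ||theta||^2 (minimum at 0).
    return theta

def sgd_step(theta, g, lr=0.1):
    return theta - lr * g

def momentum_step(theta, g, v, lr=0.1, beta=0.9):
    v = beta * v + lr * g              # decaying accumulation of past gradients
    return theta - v, v

def adam_step(theta, g, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # first moment (mean) estimate
    v = b2 * v + (1 - b2) * g ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)          # correct the zero-initialization bias
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

theta = np.array([3.0, -2.0])
m = v = np.zeros_like(theta)
for t in range(1, 201):
    theta, m, v = adam_step(theta, grad(theta), m, v, t)
print(theta)  # approaches [0, 0]
```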
Tushar Deshpande retweeted
steve @gpusteve
you're interviewing for an ml performance role at mistral ai and they ask: "how have moe models changed the design of distributed communication in inference systems?" you say: "they haven't - it's still nccl collectives end to end." wrong! here's how you answer:
2 replies · 2 retweets · 80 likes · 8.2K views
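The gist of the expected answer: MoE layers add an expert-parallel dispatch step, where each token's hidden state is sent to the ranks hosting its top-k experts (an all-to-all), computed, and combined back, on top of the usual tensor-parallel allreduces. A toy single-process Python sketch of just the routing step; the shapes, the gating, and `topk_route` itself are simplified assumptions, not any engine's actual API:

```python
import numpy as np

def topk_route(tokens, gate_logits, num_experts, k=2):
    """Assign each token to its top-k experts and build per-expert batches.

    In a real expert-parallel system, each expert batch would be sent to the
    rank hosting that expert via an all-to-all, computed, and sent back.
    """
    topk = np.argsort(gate_logits, axis=-1)[:, -k:]    # (n_tokens, k) expert ids
    # Softmax over only the selected experts' logits to get combine weights.
    sel = np.take_along_axis(gate_logits, topk, axis=-1)
    w = np.exp(sel - sel.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)

    batches = {e: [] for e in range(num_experts)}      # dispatch lists per expert
    for tok_id in range(tokens.shape[0]):
        for slot in range(k):
            batches[topk[tok_id, slot]].append((tok_id, w[tok_id, slot]))
    return batches

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))   # 8 tokens, hidden dim 16
logits = rng.normal(size=(8, 4))    # gate scores over 4 experts
for expert, items in topk_route(tokens, logits, num_experts=4).items():
    print(expert, [(t, round(float(wt), 2)) for t, wt in items])
```

The communication shape follows directly: the per-expert batch sizes are data-dependent and unbalanced, which is why all-to-all dispatch (and load-balancing around it) shows up alongside, not instead of, NCCL allreduce collectives.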
Tushar Deshpande retweeted
Jack Zhang @jcz42
We made Muon run up to 2x faster for free! Introducing Gram Newton-Schulz: a mathematically equivalent but computationally faster Newton-Schulz algorithm for polar decomposition.

Gram Newton-Schulz rewrites Newton-Schulz such that instead of iterating on the expensive rectangular X matrix, we iterate on the small, square, symmetric XX^T Gram matrix to reduce FLOPs. This lets us make more use of fast symmetric GEMM kernels on Hopper and Blackwell, halving the FLOPs of each of those GEMMs.

Gram Newton-Schulz is a drop-in replacement for Newton-Schulz in your Muon use case: we see validation perplexity preserved within 0.01, and share our (long!) journey stabilizing this algorithm and ensuring that training quality is preserved above all else.

This was a super fun project with @noahamsel, @berlinchen, and @tri_dao that spanned theory, numerical analysis, and ML systems! Blog and codebase linked below 🧵
17 replies · 165 retweets · 1K likes · 209.6K views
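For context, the baseline being accelerated: Muon orthogonalizes each gradient matrix with a quintic Newton-Schulz iteration. A minimal NumPy sketch using the widely circulated public Muon coefficients; the Gram reformulation itself is not reproduced here, only the rectangular iteration it speeds up:

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximate the polar factor (orthogonalization) of G, as used in Muon.

    Coefficients are the quintic ones from the public Muon implementation.
    Each step costs three GEMMs touching the rectangular X, which is where
    the Gram reformulation above saves FLOPs.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)         # spectral norm <= Frobenius norm <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                               # keep the Gram-side dimension small
    for _ in range(steps):
        A = X @ X.T                           # small symmetric Gram matrix
        X = a * X + (b * A + c * A @ A) @ X   # quintic polynomial update
    return X.T if transposed else X

G = np.random.default_rng(0).normal(size=(64, 256))
U = newton_schulz(G)
# Largest and smallest singular values of U should both sit near 1
# (Muon's coefficients trade exactness for speed, so "near" is loose).
print(np.round(np.linalg.svd(U, compute_uv=False)[[0, -1]], 2))
```

The post's trick, as described above, is to iterate on the small symmetric A = XX^T instead of the rectangular X, where symmetric GEMM kernels can skip roughly half the work.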
Tushar Deshpande retweeted
Mishig Davaadorj @mishig25
autoresearch skips the middlemen (researchers) and directly turns capital into novel insight. At HF, we've built infra where an agent can launch GPU experiments, access all of HF's datasets as a local filesystem, and semantically search arxiv for ideas github.com/mishig25/hf-au…
5 replies · 22 retweets · 223 likes · 37.9K views
Tushar Deshpande retweeted
difficultyang @difficultyang
Let's say we wanted to rewrite PyTorch from scratch, because such a thing is the topic du jour in the age of LLMs. What would the goals of such a rewrite be? What problems could a rewrite solve that incremental evolution from today's codebase could not? 🧵
25 replies · 44 retweets · 623 likes · 71.9K views
Tushar Deshpande retweeted
AI at Meta @AIatMeta
Today we're introducing TRIBE v2 (Trimodal Brain Encoder), a foundation model trained to predict how the human brain responds to almost any sight or sound. Building on our Algonauts 2025 award-winning architecture, TRIBE v2 draws on 500+ hours of fMRI recordings from 700+ people to create a digital twin of neural activity and enable zero-shot predictions for new subjects, languages, and tasks. Try the demo and learn more here: go.meta.me/tribe2
737 replies · 2.5K retweets · 16K likes · 6.8M views
Tushar Deshpande retweeted
Nav Toor @heynavtoor
🚨 Meta, Google DeepMind, and OpenAI all ask the same thing in ML interviews: "Implement softmax from scratch." Most candidates fail.

Someone just open sourced the training ground for it. It's called TorchCode. LeetCode, but for PyTorch. 39 problems that test the exact skills top AI labs hire for. No tutorials. No hand-holding. Implement it or fail. Instant auto-grading.

Here's what's inside this thing:
→ Implement ReLU, softmax, LayerNorm, dropout from scratch
→ Build multi-head attention, full Transformer blocks, GPT-2
→ Automated judge checks correctness, gradients, and timing
→ Colored pass/fail per test case, like competitive programming
→ Hints when you're stuck. Full reference solutions after you try.
→ Progress tracking: what you solved, best times, attempt counts.
→ Runs in your browser. No GPU needed. No signup. No cloud.

Here's the wildest part: every problem is a real interview question from top AI companies. You're not learning theory. You're practicing the exact exercises that get people $400K+ offers at Meta AI, DeepMind, and OpenAI.

Try it right now on Hugging Face. Zero install. Opens in your browser. ML bootcamps charge $10,000 to $30,000 to teach this. Interview prep courses charge $2,000+. This is free. 776 GitHub stars. MIT License. 100% Open Source.
34 replies · 157 retweets · 1.2K likes · 188.2K views
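Since "implement softmax from scratch" is the headline question, the classic pitfalls are overflow in exp() and the Jacobian in the backward pass. A minimal NumPy sketch with a finite-difference check; the function names are illustrative, not from the TorchCode repo:

```python
import numpy as np

def softmax(x, axis=-1):
    # Shift by the max so exp() never overflows; the output is unchanged
    # because softmax is invariant to adding a constant to every logit.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_backward(y, dy):
    # Given y = softmax(x) and upstream grad dy, the Jacobian-vector product
    # is dx = y * (dy - sum(dy * y)), from dy_i/dx_j = y_i (delta_ij - y_j).
    return y * (dy - (dy * y).sum(axis=-1, keepdims=True))

# Check against a finite-difference gradient of loss = sum(w * softmax(x)).
rng = np.random.default_rng(0)
x, w = rng.normal(size=5), rng.normal(size=5)
analytic = softmax_backward(softmax(x), w)
eps = 1e-6
numeric = np.array([
    ((w * softmax(x + eps * np.eye(5)[i])).sum()
     - (w * softmax(x - eps * np.eye(5)[i])).sum()) / (2 * eps)
    for i in range(5)
])
print(np.allclose(analytic, numeric, atol=1e-6))  # True
```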
Tushar Deshpande retweeted
chuyi shang @chuyishang
Wrote a deep dive on implementing a language model from scratch in JAX and scaling it with distributed training! If you’re coming from PyTorch and want to see how the same ideas look in JAX, or just want a hands-on intro to distributed training, check out this blog post: chuyishang.com/blog/2026/jax-… Comes with code + an assignment and test cases so you can follow along!
9 replies · 66 retweets · 603 likes · 32.3K views