Pruthviraj P

817 posts

@spidernvdev

Deep Learning SDE @nvidia · ex-nobody cuda guy

California, USA · Joined January 2017

188 Following · 301 Followers

Pinned Tweet

Pruthviraj P@spidernvdev·2d

what’s missing from this setup?

Pruthviraj P@spidernvdev·1h

learn more: docs.nvidia.com/megatron-core/…

Pruthviraj P@spidernvdev·1h

simple mental model: large model training is like moving a huge machine across multiple rooms. each gpu holds part of the work. but the hard part is not only splitting it. the hard part is making all parts communicate fast enough.

Pruthviraj P@spidernvdev·1h

morning from the green matrix. green matrix note 01: why big models need parallelism. a 70b parameter model needs ~140gb just for fp16 weights. that is before activations, gradients, optimizer states, and training data. one gpu is not enough.
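The ~140gb figure follows from 2 bytes per fp16 parameter. A minimal sketch of that arithmetic, assuming decimal gigabytes (1 GB = 1e9 bytes); the function name `weight_memory_gb` and the precision table are illustrative, not from the post:

```python
# Bytes needed to store one parameter at a given precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}

def weight_memory_gb(num_params: float, dtype: str = "fp16") -> float:
    """Memory for the weights alone, in decimal GB (1 GB = 1e9 bytes).

    Activations, gradients, and optimizer states are NOT included.
    """
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

if __name__ == "__main__":
    params = 70e9  # 70b parameters, as in the post
    for dtype in ("fp32", "fp16"):
        print(f"{dtype}: {weight_memory_gb(params, dtype):.0f} GB")
```

This counts weights only, so it is a lower bound on what training actually needs.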

Pruthviraj P@spidernvdev·7h

this changes everything. must try!!

Pruthviraj P@spidernvdev

what’s missing from this setup?

Pruthviraj P@spidernvdev·10h

@Xbotter Exactly!

Xbotter@Xbotter·10h

@spidernvdev Indeed, products only need imagination, but engineers have a ton of other stuff to handle.

Pruthviraj P@spidernvdev·10h

most ai advice online is about prompts. but if you want to build real ai products, you also need to understand: latency, memory, cost, evaluation, deployment. prompting helps. systems knowledge compounds.

Pruthviraj P@spidernvdev·10h

@sama

GIF

Sam Altman@sama·10h

we're starting rollout of GPT-5.5-Cyber, a frontier cybersecurity model, to critical cyber defenders in the next few days. we will work with the entire ecosystem and the government to figure out trusted access for cyber; we want to rapidly help secure companies/infrastructure.

Pruthviraj P@spidernvdev·14h

@Siddurp2 For attention yes.

Thefitdoc@Siddurp2·14h

I don't know who needs to hear this but, Your physique makes the 1st impression wherever you go. Build your body 🙂

Pruthviraj P@spidernvdev·14h

@sama Coffee?

Sam Altman@sama·14h

GPT-5.5 is going to have a party for itself. it chose 5/5 at 5:55 pm for the date and time. if you'd like to come, let us know here: luma.com/5.5 codex will help the team pick people from the replies. 5.5 had some good ideas/requests for the party, which we'll do.

Pruthviraj P@spidernvdev·15h

one thing i underestimated early in ai: the model is only one layer. behind every good ai product there is data loading, preprocessing, gpu memory, serving, monitoring, and debugging. that full stack is where engineering gets interesting.

Pruthviraj P@spidernvdev·17h

@seunosewa @hiarun02 It can, but total cost isn’t just per-token, it’s also how many passes it takes. Fewer retries can offset higher pricing.
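A rough sketch of the point about passes: total task cost is price per token times tokens per pass times number of passes. All prices, token counts, and pass counts below are hypothetical numbers chosen for illustration, not real pricing for any tool:

```python
def task_cost(price_per_million_tokens: float,
              tokens_per_pass: int,
              passes: int) -> float:
    """Total cost of finishing one task, summed over every pass (retry)."""
    return price_per_million_tokens * tokens_per_pass / 1e6 * passes

if __name__ == "__main__":
    # Hypothetical: the cheaper model needs 4 passes, the pricier one needs 1.
    cheaper_model = task_cost(price_per_million_tokens=5.0,
                              tokens_per_pass=20_000, passes=4)
    pricier_model = task_cost(price_per_million_tokens=10.0,
                              tokens_per_pass=20_000, passes=1)
    print(f"cheaper per token: ${cheaper_model:.2f} per task")
    print(f"pricier per token: ${pricier_model:.2f} per task")
```

Under these assumed numbers the model with double the per-token price finishes the task for half the total cost, because it avoids the retries.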

Seun Osewa 🇳🇬@seunosewa·17h

@spidernvdev @hiarun02 But the task costs more?

Arun@hiarun02·1d

Anyone cancelled Claude Code for Codex yet? Feels like devs are switching not because Codex is better, but because it’s cheaper to actually get work done. What’s your experience?

Pruthviraj P@spidernvdev·18h

@devnp2007 Even past cs grads won’t know all of those :)

Dev Patel@devnp2007·19h

Most cs grads studying rn would not even know all these terms including me.... Still learning.....long way to go.....

maharshi@maharshii

triton, gluon, cutedsl, hopper, blackwell, tensorcores, layouts, composition, local_tile, partitionS, partitionD, wgmma, tcgen05, TMA, block scaling, coalesced access, ampere, ada lovelace, cutlass, cublas, cudnn, flash attention, gemm, sgemm, fp16, bf16, mxfp8, nvfp4, int4, quantization, mixed precision, occupancy, reductions, warp divergence, bank conflicts, memory coalescing, shared memory, global memory, texture memory, constant memory, unified memory, epilogues, kernel fusion, graph optimization, tensorrt, torch compile, dynamo, inductor, graph capture, thread blocks, warps, SIMT, streaming multiprocessors, L1 cache, L2 cache, register spilling, thread divergence, memory bandwidth, compute capability, CUDA cores, ldg, stg, ncu, nsys, atomic operations, syncthreads, cooperative groups, dynamic parallelism, persistent kernels, vectorized loads, static quantization, tensors, swizzling, predication, instruction throughput, memory latency hiding...

Pruthviraj P@spidernvdev·20h

anthropic adding claude connectors for tools like blender, adobe, autodesk, and splice is the right direction. the next ai jump may not be only smarter models. it may be models that understand the tools people already use.

Pruthviraj P@spidernvdev·20h

sometimes i wonder if i'm learning fast enough or just keeping up with noise

Pruthviraj P@spidernvdev·20h

@unkonfined And I need 1k. Huge difference.

Unkonfined@unkonfined·20h

I want 1M followers.

Pruthviraj P@spidernvdev·21h

Kath Korevec@simpsoka

I’m studying how agents communicate with developers. Looking for screenshots of isolated chat outputs where an agent response made you feel impressed, frustrated, confused, or where the communication style changed how useful the answer felt.

Pruthviraj P@spidernvdev·22h

@PengmingWang this took me a while to learn: you can have a great model and terrible evals, and you won't know until something breaks in production.

Pengming Wang@PengmingWang·23h

My updated take on "The 'it' in AI models is the dataset." is that the 'it' is the evals you're using. I'd rather have bad scores but great benchmarks than good scores but bad benchmarks.

Pruthviraj P@spidernvdev·22h

@nvidia blackwell costs 2x more than hopper on paper; delivers 35x lower cost per million tokens in practice. that gap is what needs to be understood.

NVIDIA@nvidia·22h

x.com/i/article/2049…

Pruthviraj P@spidernvdev·22h

groot-x has 24,000 simulated teleoperation runs across humanoids and manipulators. mixing this synthetic data with real data improved model performance by 40%. n1.7 now runs with a 3.3B param backbone on ~6GB vram. open source physical ai is moving faster than anyone expected.

NVIDIA Robotics@NVIDIARobotics

The Physical AI Robotics GR00T‑X Embodiment Sim dataset has surpassed 10 million downloads on @HuggingFace. 🥳 A huge shoutout to the global research and developer community exploring the future of embodied AI and robotics with this open dataset — you made this milestone possible. 📥 Try it on Hugging Face 👉 nvda.ws/3Qv64Ul

Pruthviraj P@spidernvdev·1d

@ivanfioravanti @kernelpool curious to see how it holds up in the 8-bit tests.

Ivan Fioravanti ᯅ@ivanfioravanti·1d

Look at the amazing performance boost after applying a Metal Kernel as suggested by @kernelpool 🤩 Night & Day in Prefill that is now much faster! Gonna publish an 8bit comparison test soon and then testing this super model with coding soon!

Ivan Fioravanti ᯅ@ivanfioravanti

MLX Ling-2.6-flash support added! 💪 Here my (preliminary, because I bet @angeloskath will improve performance) context benchmark for the 4bit version running on M3 Ultra (cooking a new version) I created the PR with the amazing transformer_to_mlx skill by @pcuenq and Opus 4.7. Few iterations and it seems 😂 working 100%! Ultra fast model created by @TheInclusionAI. Can't wait to test it with a code harness!
Raw results: Ling-2.6-flash-mlx-4bit MLX Benchmark Results
Hardware: Apple M3 Ultra, 512.0GB RAM, 32 CPU cores, 80 GPU cores
0.5k pp 632 tg 79 t/s mem 59.4GB kv 0.03GB
1k pp 676 tg 79 t/s mem 59.9GB kv 0.04GB
2k pp 693 tg 79 t/s mem 61.1GB kv 0.04GB
4k pp 704 tg 79 t/s mem 61.1GB kv 0.05GB
8k pp 708 tg 78 t/s mem 61.2GB kv 0.07GB
16k pp 700 tg 77 t/s mem 61.9GB kv 0.11GB
32k pp 678 tg 74 t/s mem 64.5GB kv 0.18GB
64k pp 637 tg 70 t/s mem 69.5GB kv 0.33GB
128k pp 564 tg 63 t/s mem 79.6GB kv 0.64GB
Total generated tokens: 1135
Batch TPS: b1 78 b2 123 b4 164 b8 218 b16 307 b32 418
Batch KV : b1 0.04GB b2 0.08GB b4 0.16GB b8 0.32GB b16 0.63GB b32 1.26GB
