Raja Koduri

4.5K posts

@RajaXg

Create, Clean, Consume is my aspirational routine. My interests: math, computer graphics, silicon, software and music.

San Francisco · Joined December 2009
2.1K Following · 51.1K Followers
Raja Koduri
Raja Koduri@RajaXg·
30 years of working with Dell, from 2D BitBlt engines, 3D accelerators, GPUs, CPUs and now token factories; finally got a picture with the great man himself, @MichaelDell, at Dell Technologies World in Las Vegas.
Raja Koduri tweet media
1
2
71
3.7K
Raja Koduri
Raja Koduri@RajaXg·
Whenever I meet someone visiting Silicon Valley from Taiwan, I jokingly ask if they brought me a wafer 😀 Friends from whalechip in Taiwan did one better: they sent me wafer-on-wafer, hybrid-bonded logic and DRAM wafers. Thank you, Wenchung. I know we have been in "wafer scale" buzz since last week. Congratulations Cerebras! Andrew, Dhiraj, Gary, Julie and other friends and acquaintances. 👏👏 Easily one of the most courageous, innovative and hardworking silicon engineering teams in the world right now.
Raja Koduri tweet media
6
6
142
11.8K
Raja Koduri retweeted
David Bennett
David Bennett@DavidBennett__·
This is very cool! RRR x Takarazuka (the famous Japanese all-female theater group). You must have seen this @RajaXg, but what a brilliant collaboration. 👏🏻 @ssrajamouli
David Bennett tweet media
0
2
7
1.9K
Raja Koduri
Raja Koduri@RajaXg·
Good to see good old GPU shaders in action at Google Cloud Next '26 in Las Vegas!
0
2
20
3.5K
Matt
Matt@m13v_·
@RajaXg which center did you sit at? the "attention and equanimity" framing is perfect honestly. six courses in and that's still the core of what i'd tell anyone. the hard part nobody warns you about is maintaining that equanimity when you're back at your desk on monday
1
0
1
30
Raja Koduri
Raja Koduri@RajaXg·
Finished my first 10-day Vipassana meditation retreat. Silent and fully disconnected. It was hard. I will link some resources in the thread to learn more about this. For the AI nerds, I would summarize my experience as "Attention and equanimity is all you need"
5
3
61
5.3K
Raja Koduri retweeted
Omer Cheema
Omer Cheema@OmerCheeema·
In 1969, the Apollo Guidance Computer (AGC) took humans to the Moon with a 2.048 MHz crystal clock (effective ~1.024 MHz internal), executing roughly 40,000–85,000 instructions per second. It had just 2,048 words of magnetic-core RAM (~4 KB) and 36,864 words of hand-woven core-rope ROM (~72 KB), about 76 KB of memory in total. No fancy frameworks. No bloated buffers. Every word and cycle was sacred. Software in tight assembly handled real-time guidance, descent, and rendezvous, and even recovered gracefully from 1202 alarms during Apollo 11's landing.

My old professor (ex-NASA researcher) once yelled at me for initializing an unnecessary array. His point hit hard: "If a ~2 MHz machine with <100 KB of memory could land on the Moon, why do we waste resources so recklessly on far less demanding apps today?"

He's right. I've seen it dozens of times in software teams. Give engineers abundant RAM/CPU/GPU, and usage magically expands to fill it. Unneeded allocations, oversized buffers "just in case," heavy dependency trees, and telemetry bloat everywhere. Product managers who set hard memory optimization goals? Teams suddenly deliver wonders. Low-hanging fruit abounds. Yet it's rarely prioritized until costs or OOM errors bite.

This mirrors Raja's wisdom on innovation thriving in constrained environments. Scarcity forces elegance, reuse, and true mastery of the problem.

Fast-forward to today's AI stack, especially LLM inference: a 70B-parameter model in FP16 weighs ~140 GB just for weights. Add a 32K-token context with modest batching? The KV cache alone can explode to 40–80+ GB (or 300+ GB at 128K context with multiple users), often dwarfing the model itself. For Llama 3.1 70B at FP16, one long-context request can eat tens of GB in cache before you even generate tokens. No wonder serving costs skyrocket and concurrency suffers.

Yet the fixes echo the AGC mindset.

Quantization (INT4/FP8/NF4): shrinks a 70B model from ~140 GB (FP16) to ~35 GB at 4-bit precision with minimal quality loss. Up to 4x memory reduction + big speedups.

PagedAttention (vLLM): treats the KV cache like virtual memory with small blocks. It slashes fragmentation/waste from 60–80% down to <4%, unlocking 2–4x higher throughput and far more concurrent users on the same GPUs.

FlashAttention: fuses ops and tiles computation for up to 10–20x memory savings on attention (especially long sequences) plus 2–4x speed.

Combine them and you serve dramatically more with the same hardware, or run bigger models/contexts without burning money.

The lesson from the Moon: constraints aren't enemies; they're forcing functions for better engineering. In AI (and software at large), deliberate discipline (profiling ruthlessly, enforcing budgets, questioning every allocation) still wins.

Key optimization areas delivering the biggest savings today: KV cache management (paged + quantized), model weight quantization, efficient attention kernels, mixed precision + checkpointing in training, and system-level profiling to kill unnecessary bloat.

Abundance made us sloppy. Scarcity (or enforced budgets) will make us brilliant again.
Raja Koduri@RajaXg

I warned my memory friends a few months ago..there are tons of optimizations available across the whole stack to reduce memory capacity and bandwidth needs...as long as memory was relatively "cheap", we stay lazy...constraints unleash creativity..I hear the memory supply chain constraints won't be solved till 2030..prepare for a deluge of creativity..it hasn't been a week since Turbo quant... not only in software, but you will see some insanely cool hardware improvisations, and new suppliers emerge to the top as well

2
8
55
9.5K
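The memory figures in Omer Cheema's post above can be roughly checked with a few lines of arithmetic. This is a minimal sketch, not a description of any serving framework: the layer count (80), KV-head count (8) and head dimension (128) are assumptions taken from the published Llama 3.1 70B configuration rather than from the post, and the 16-token block size and the 500-token/2,048-token lengths in the waste example are purely illustrative.

```python
import math


def weight_bytes(n_params: int, bits_per_param: int) -> int:
    """Memory needed for the model weights alone, in bytes."""
    return n_params * bits_per_param // 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    """KV cache size: one key and one value vector per layer, KV head and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


def prealloc_waste(actual_len: int, max_len: int) -> float:
    """Fraction of a max_len-sized reservation left unused by a shorter sequence."""
    return 1 - actual_len / max_len


def paged_waste(actual_len: int, block_size: int = 16) -> float:
    """Fraction of slots unused when memory is handed out in fixed-size blocks."""
    slots = math.ceil(actual_len / block_size) * block_size
    return 1 - actual_len / slots


GB = 10**9
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128  # assumed Llama 3.1 70B shape

print("FP16 weights (GB):", weight_bytes(70 * 10**9, 16) / GB)
print("4-bit weights (GB):", weight_bytes(70 * 10**9, 4) / GB)
print("KV cache, 32K tokens, batch 1 (GB):",
      kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, 32 * 1024) / GB)
print("KV cache, 32K tokens, batch 8 (GB):",
      kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, 32 * 1024, batch=8) / GB)
print("KV cache, 128K tokens, batch 1 (GB):",
      kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, 128 * 1024) / GB)
print("Waste, 500 tokens in a 2,048-token reservation:", prealloc_waste(500, 2048))
print("Waste, 500 tokens in 16-token blocks:", paged_waste(500))
```

Under these assumptions the weights come to 140 GB at FP16 and 35 GB at 4 bits, a single 32K-token sequence needs about 10.7 GB of KV cache, and a single 128K-token sequence about 43 GB, which is consistent with the post's "tens of GB" and with 40–80+ GB once a handful of 32K requests are batched. The waste comparison (roughly 76% versus 2%) only illustrates why block-based allocation helps; it is not a model of vLLM's actual allocator.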
Raja Koduri
Raja Koduri@RajaXg·
@Midnight_Captl I’m saying watch for disruptions in the hierarchy..I’m bullish that there will be new winners in the memory hierarchy..bearish that the current leaderboard will be stable
0
0
3
148
Midnight Capital LLC
Midnight Capital LLC@Midnight_Captl·
@RajaXg You’re flip flopping. Are you saying you’re bearish memory pricing power or not 😵‍💫
1
0
0
175
Raja Koduri
Raja Koduri@RajaXg·
@Midnight_Captl Memory capacity and bandwidth will remain important, but who serves them and how could change...producing new winners..
2
0
3
193
Raja Koduri
Raja Koduri@RajaXg·
@Midnight_Captl About the system architecture changes that could have more lasting impact in 3-4 years..
0
0
1
34
Raja Koduri
Raja Koduri@RajaXg·
@benitoz There will be no change in demand...but will be fascinating to see how much more efficient software and hardware will get in memory utilization as the demand far exceeds supply...and don't discount left field ideas that change the system architecture..
1
1
17
2.2K
Ben Pouladian
Ben Pouladian@benitoz·
@RajaXg Yes. Compression doesn’t kill demand. It makes bigger workloads viable
1
0
32
3.1K
Raja Koduri
Raja Koduri@RajaXg·
I warned my memory friends a few months ago..there are tons of optimizations available across the whole stack to reduce memory capacity and bandwidth needs...as long as memory was relatively "cheap", we stay lazy...constraints unleash creativity..I hear the memory supply chain constraints won't be solved till 2030..prepare for a deluge of creativity..it hasn't been a week since Turbo quant... not only in software, but you will see some insanely cool hardware improvisations, and new suppliers emerge to the top as well
ComfyUI@ComfyUI

Upgrading your RAM is now unnecessary. Introducing our new ComfyUI Dynamic VRAM optimization. Running local models is now possible on even the most memory constrained hardware. Read more here: blog.comfy.org/p/dynamic-vram…

26
53
445
162.5K
Raja Koduri
Raja Koduri@RajaXg·
Plenty of DRAM supply at BiRite in SF
Raja Koduri tweet media
8
3
67
4.5K
Raja Koduri
Raja Koduri@RajaXg·
No supply constraints!
Raja Koduri tweet media
9
1
24
2.5K
Raja Koduri
Raja Koduri@RajaXg·
We are working with all the usual suspects in the GPU-TPU-XPU-DRAM ecosystem to power these gigawatts. Been at it for several months; good to be able to share this ahead of GTC today. crnasia.com/india/news/202…
16
98
1.5K
1.7M
Raja Koduri retweeted
rajamouli ss
rajamouli ss@ssrajamouli·
For years our stories pushed the limits of our canvas. Today we stretch it further. Excited to launch A&M MoCap Lab, India's largest Motion Capture facility, at @AnnapurnaStudios, set up in collaboration with @mihiravisualabs and @Animatrik. Looking forward to storytellers exploring its limitless potential in animation, live action, gaming and more.
225
3.2K
23.2K
687.4K
Raja Koduri
Raja Koduri@RajaXg·
Zettascale India is no longer just a vision. It's becoming reality.

Back in 2020, I gave a talk laying out why India must build sovereign AI compute infrastructure at zettascale. 🎬 Watch my FICCI talk here: youtu.be/nd-EFhFlAr8?si…

Fast forward to today: AM Green Group has announced a $25 billion, 1 GW AI compute hub in Greater Noida, one of the largest AI infrastructure investments in India's history. The facility will house nearly 500,000 high-performance chipsets and run entirely on 24/7 carbon-free energy from solar, wind, and pumped storage. I'm proud to serve as an advisor to this effort.

This is a bold and necessary ambition. But announcing it is only the beginning. The real work lies ahead: building the supply chains, attracting and developing talent, executing on the engineering, and delivering on the sustainability promise. There is a lot to do to turn this vision into reality.

Phase 1 targets 2028. Full capacity by 2030. The commitment is there. Now it's about relentless execution. Let's build it right. livemint.com/ai/am-group-25…

I will be at the India AI Impact Summit in Delhi this week.
3
13
68
8.1K
Raja Koduri retweeted
OXMIQ
OXMIQ@realoxmiqlabs·
@RajaXg explains why the GPU industry lacks the kind of standardized IP ecosystem that ARM created for CPUs, and why that matters for the future of AI acceleration. This is exactly the gap OXMIQ is addressing with our licensable chiplet-based GPU IP architecture. #OXMIQ
0
2
7
1.7K