Ujjwal Upadhyay

500 posts

@theujjwal9

Vision Language Models | Medical Imaging | Neuroscience

Riemann Space · Joined February 2017

475 Following · 78 Followers

Ujjwal Upadhyay reposted

Andrej Karpathy@karpathy·10 Mar

Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement), this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference. I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project.

This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc etc. This is the bread and butter of what I do daily for 2 decades. Seeing the agent do this entire workflow end-to-end and all by itself as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real", I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things e.g.:
- It noticed an oversight that my parameterless QKnorm didn't have a scalar multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (i forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.

This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc…

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course - you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
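The workflow described in this post (propose a change, measure validation loss, keep it only if it helps, build on the result) can be sketched as a greedy accept-if-better loop. This is a toy illustration under stated assumptions, not the actual autoresearch setup: `greedy_autoresearch`, `toy_validation_loss`, and the config keys are hypothetical names, and the "loss" is a made-up quadratic standing in for a real training run. `relative_reduction` only checks the arithmetic behind the quoted 2.02 h to 1.80 h figure.

```python
def relative_reduction(before, after):
    """Fractional reduction of a lower-is-better metric."""
    return (before - after) / before


def greedy_autoresearch(baseline, candidates, evaluate):
    """Try each candidate on top of the current best config.

    A change is kept only if it lowers the metric, so accepted
    changes stack on top of each other.
    """
    best = dict(baseline)
    best_loss = evaluate(best)
    accepted = []
    for name, value in candidates:
        trial = dict(best)
        trial[name] = value
        loss = evaluate(trial)
        if loss < best_loss:
            best, best_loss = trial, loss
            accepted.append((name, value))
    return best, best_loss, accepted


def toy_validation_loss(cfg):
    # Hypothetical stand-in for "train a small model, report val loss".
    return 1.0 + (cfg["lr"] - 0.02) ** 2 + (cfg["weight_decay"] - 0.1) ** 2


if __name__ == "__main__":
    baseline = {"lr": 0.05, "weight_decay": 0.0}
    candidates = [
        ("lr", 0.03),
        ("lr", 0.08),
        ("weight_decay", 0.05),
        ("lr", 0.02),
        ("weight_decay", 0.2),
    ]
    best, loss, accepted = greedy_autoresearch(
        baseline, candidates, toy_validation_loss
    )
    print(best, accepted)
    # (2.02 - 1.80) / 2.02 is about 0.109, matching the quoted ~11%.
    print(round(relative_reduction(2.02, 1.80), 3))
```

A real agent would also use the history of results to decide what to try next; the fixed candidate list here is the simplest possible proposal policy.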

Ujjwal Upadhyay reposted

Aniket@0xaniketsharma·27 Oct

Interestingly every state of the art model out there fails to understand videos like these x.com/theujjwal9/sta…

tldraw@tldraw

if you pause this at any moment the tldraw disappears

Ujjwal Upadhyay reposted

Aleksa Gordić (水平问题)@gordic_aleksa·29 Sep

New in-depth blog post time: "Inside NVIDIA GPUs: Anatomy of high performance matmul kernels". If you want to deeply understand how one writes state of the art matmul kernels in CUDA read along. (Remember matmul is the single most important operation that transformers execute both during training and inference. Most of NVIDIA compute is spent on it. Gaining 1% in efficiency translates to massive savings in the order of many nuclear reactors :P) I, yet again, realized i underestimated the effort. 😅 Here is one more booklet (lol). 47 figures! I covered:
* The fundamentals of the GPU architecture with an emphasis on the memory hierarchy, building mental models for GMEM, SMEM, and L1/L2, and then connecting them to the CUDA programming model. Along the way we also looked at the "speed of light," how it's bounded by power, with hardware reality leaking into our model.
* PTX/SASS, and how to steer the compiler into generating what we actually want (is that loop being unrolled, are we using vectorized loads like LDG.128, etc.). I've annotated one PTX/SASS example for a simple matmul kernel in excruciating detail. Even if you're new to compilers you should find this useful. (i actually found various inefficiencies in both compilers - fun!)
* Many core concepts such as tile/wave quantization, occupancy, ILP (instruction-level parallelism), roofline model, etc. Also building intuition around fundamental equivalences: dot product as a sum of partial outer products, why square tiles are the right shape for high arithmetic intensity, etc.
* The warp tiling method - which is near SOTA assuming you can't use tensor cores, TMA, async mem instructions, and bf16. Just maximizing GPU's performance using nothing but CUDA cores, registers and shared memory.
* Finally, we step into Hopper (H100): TMA, swizzling, tensor cores and the wgmma instruction, async load/store pipelines, scheduling policies like Hilbert curves, clusters with TMA multicast, faster PTX barriers, and more.

As always lots of examples, lots of visuals. This is the first time i could see warp tiling kernel and be like "oh i get it completely". I just needed my mental image transformed into an actual image. A few years ago I was really inspired by @Si_Boehm's excellent blog post on how matmul works, but I also found it had several errors, some unclear explanations, and it was quite outdated. Building on @pranjalssh's amazing work (who did a great job building sota kernels for H100) and my own research, this is the final result.

---

Again a huge thank you to @Hyperstackcloud (GPU cloud) for giving me an H100 (PCIe) node to run some of the experiments and analysis that i needed to write this up. Also a big thank you to my friends Aroun (who did a very thorough review of the post; Aroun's doing cool GPU/AI stuff at Magic and was previously GPU architect at Apple and Imagine, he's one of the best GPU people i know and we worked together on llm.c w/ @karpathy) and the amazing @marksaroufim! (PyTorch) for taking the time during the weekend when they didn't have to. :)
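Two of the concepts listed in the post lend themselves to a small worked example: a matrix product written as a sum of outer products, and wave quantization. The sketch below is plain Python for clarity, not a GPU kernel and not code from the blog post. `num_waves` assumes one output tile per SM at a time, which is a simplification; the tile size and SM count used in the demo are illustrative values, not measurements.

```python
import math


def matmul_inner(a, b):
    """C[i][j] = dot(row i of A, column j of B)."""
    n, k, m = len(a), len(b), len(b[0])
    return [
        [sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
        for i in range(n)
    ]


def matmul_outer(a, b):
    """C = sum over p of outer(column p of A, row p of B)."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for p in range(k):  # one rank-1 update per step of the K dimension
        for i in range(n):
            a_ip = a[i][p]
            for j in range(m):
                c[i][j] += a_ip * b[p][j]
    return c


def num_waves(m, n, tile_m, tile_n, num_sms):
    """Output tiles and waves, assuming one tile per SM at a time."""
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    return tiles, math.ceil(tiles / num_sms)


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(matmul_inner(a, b) == matmul_outer(a, b))
    # Assumed example: 4096x4096 output, 128x128 tiles, 132 SMs.
    tiles, waves = num_waves(4096, 4096, 128, 128, 132)
    print(tiles, waves)
```

With these assumed numbers the last wave is only partly full (1024 tiles over 132 SMs needs 8 waves, but 7 full waves cover 924 tiles), which is the effect wave quantization refers to.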

Ujjwal Upadhyay reposted

Jason Weston@jaseweston·3 Sep

🌀Diversity Aware RL (DARLING)🌀
📝: arxiv.org/abs/2509.02534
- Jointly optimizes for quality & diversity using a learned partition function
- Outperforms standard RL in quality AND diversity metrics, e.g. higher pass@1/p@k
- Works for both non-verifiable & verifiable tasks
🧵1/5
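For readers unfamiliar with the pass@1 and pass@k metrics mentioned above, here is a minimal sketch of the unbiased pass@k estimator commonly used in the code-generation literature. It is included only to make the metric concrete; whether the DARLING paper computes it exactly this way is an assumption, and `pass_at_k` is an illustrative name.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate P(at least one of k samples is correct).

    n: samples drawn per problem, c: how many of them were correct,
    k: budget being evaluated (1 <= k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(pass_at_k(10, 3, 1))
    print(pass_at_k(10, 3, 5))
```

Because pass@k rewards any correct sample among k, a model that produces more varied attempts can score higher at larger k even when pass@1 is unchanged, which is why the metric is relevant to diversity.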

Ujjwal Upadhyay reposted

PyTorch@PyTorch·4 Sep

Large Language Models (#LLMs) are optimized for Intel GPUs labeled as xpu in #PyTorch. Learn how to speed up local inference on Intel Arc discrete, built-in, and Arc Pro GPUs, bringing advanced AI to laptops and desktops. 🔗 hubs.la/Q03GYFrV0 #PyTorch #LLM #OpenSourceAI
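A minimal sketch of how the `xpu` device label might be picked up in user code, assuming a recent PyTorch build with Intel GPU support. `pick_device` is a hypothetical helper, not part of PyTorch; it falls back to `cuda` and then `cpu`, and returns `cpu` if PyTorch is not installed at all.

```python
def pick_device():
    """Return "xpu", "cuda", or "cpu" depending on what is available."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older builds may not ship the torch.xpu module, so probe defensively.
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


if __name__ == "__main__":
    print(pick_device())
```

The returned string can be passed to the usual `.to(device)` calls on a model or tensor.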

Ujjwal Upadhyay reposted

Linus ✦ Ekenstam@LinusEkenstam·2 Aug

This is next level. MeshBlend for Unreal Engine. Just wow.

Ujjwal Upadhyay reposted

Sukjun (June) Hwang@sukjun_hwang·11 Jul

Tokenization has been the final barrier to truly end-to-end language models. We developed the H-Net: a hierarchical network that replaces tokenization with a dynamic chunking process directly inside the model, automatically discovering and operating over meaningful units of data

Ujjwal Upadhyay reposted

Anne Ouyang@anneouyang·29 May

✨ New blog post 👀: We have some very fast AI-generated kernels generated with a simple test-time only search. They are performing close to or in some cases even beating the standard expert-optimized production kernels shipped in PyTorch. (1/6) [🔗 link in final post]

Ujjwal Upadhyay reposted

Jyo Pari@jyo_pari·13 Jun

What if an LLM could update its own weights? Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs. Self-editing is learned via RL, using the updated model’s downstream performance as reward.

Ujjwal Upadhyay reposted

Turing Post@TheTuringPost·7 Jun

Log-linear attention is a new type of attention proposed by @MIT that is:
- as fast and efficient as linear attention
- as expressive as softmax
It uses a small but growing number of memory slots that increases logarithmically with the sequence length. Here's how it works:
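To make "memory slots that grow logarithmically with sequence length" concrete, here is one way such a scheme can be laid out: split the prefix of length `t` into power-of-two sized segments following the binary digits of `t`, in the style of a Fenwick tree. This is an illustrative assumption about the bucketing, not necessarily the paper's exact construction, and `prefix_buckets` is a hypothetical name.

```python
def prefix_buckets(t):
    """Split positions [0, t) into power-of-two sized segments.

    Older tokens land in larger segments, recent tokens in smaller
    ones. The number of segments equals the number of set bits in t,
    so it is at most t.bit_length(), i.e. O(log t).
    """
    if t < 0:
        raise ValueError("t must be non-negative")
    buckets = []
    start = 0
    for bit in range(t.bit_length() - 1, -1, -1):
        size = 1 << bit
        if t & size:
            buckets.append((start, start + size))
            start += size
    return buckets


if __name__ == "__main__":
    for t in (1, 13, 1024, 1025):
        print(t, prefix_buckets(t))
```

If each segment is summarized by one fixed-size state, a query at position `t` reads at most about log2(t) states instead of `t` keys, which is where the log-linear cost would come from under this assumption.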

Ujjwal Upadhyay reposted

Rohan Paul@rohanpaul_ai·7 Jun

Time Blindness: Why Video-Language Models Can’t See What Humans Can?

LLMs struggle to capture purely temporal patterns when spatial information is obscured. This paper introduces SpookyBench to evaluate this limitation, showing a significant gap compared to human perception.

Methods 🔧:
→ SpookyBench videos encode text, shapes, object images, or dynamic scenes using binary noise patterns.
→ Content is visible only when viewed as a temporal sequence, not in individual frames.
→ This relies on opposing motion patterns between foreground and background noise based on content masks or depth map thresholds.
→ Temporal coherence and motion contrast Signal-to-Noise Ratios reveal signals humans use but current models miss.

📌 Current models' architectural bias prevents temporal pattern recognition.
📌 0% model accuracy versus 98% human accuracy shows profound temporal blindness.
📌 Temporal coherence metrics highlight crucial information missed by spatial focus.

Paper: arxiv.org/abs/2505.24867
Paper Title: "Time Blindness: Why Video-Language Models Can't See What Humans Can?"
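The encoding idea in the Methods list (binary noise, with foreground and background drifting in opposite directions according to a content mask) can be illustrated with a toy generator. This is a sketch under assumptions, not the benchmark's actual pipeline: the one-pixel-per-frame horizontal shift, the rectangular mask, and the names `make_frames` and `rightward_agreement` are all invented for illustration.

```python
import random


def make_frames(mask, num_frames, seed=0):
    """Binary-noise frames: foreground noise drifts right, background left."""
    rng = random.Random(seed)
    h, w = len(mask), len(mask[0])
    fg = [[rng.getrandbits(1) for _ in range(w)] for _ in range(h)]
    bg = [[rng.getrandbits(1) for _ in range(w)] for _ in range(h)]
    frames = []
    for t in range(num_frames):
        frames.append([
            [fg[y][(x - t) % w] if mask[y][x] else bg[y][(x + t) % w]
             for x in range(w)]
            for y in range(h)
        ])
    return frames


def rightward_agreement(frames):
    """Per pixel: how often its value reappears one pixel to the right
    in the next frame. Close to 1.0 means consistent rightward motion."""
    h, w = len(frames[0]), len(frames[0][0])
    pairs = len(frames) - 1
    return [
        [sum(frames[t][y][x] == frames[t + 1][y][(x + 1) % w]
             for t in range(pairs)) / pairs
         for x in range(w)]
        for y in range(h)
    ]


if __name__ == "__main__":
    # Hypothetical content mask: a filled rectangle in a 16x32 frame.
    mask = [[4 <= y < 12 and 8 <= x < 24 for x in range(32)]
            for y in range(16)]
    score = rightward_agreement(make_frames(mask, 17))
    for row in score:
        print("".join("#" if s == 1.0 else "." for s in row))
```

Each frame on its own is noise, since every pixel is an independent random bit. The rectangle only becomes recoverable by comparing consecutive frames, where foreground pixels agree with their right-shifted successors every time and background pixels agree only about half the time.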

Ujjwal Upadhyay reposted

Toby Ford-Monroe@tobyfordmonroe·3 Jun

Very interesting paper introducing SpookyBench, which is one of the only benchmarks where the VLM-human gap remains near 100 percentage points. Due to architectural limitations, no VLM can perceive meaning dispersed across individually meaningless frames ("Temporal Encoding"). In other words, current models lack temporal understanding, since that ability doesn't emerge from frame-by-frame perception. I look forward to seeing how this gets solved -- maybe through some kind of motion-aware tokenization that makes movement a native input? arxiv.org/pdf/2505.24867

Ujjwal Upadhyay@theujjwal9·4 Jun

Joint work with @mranjan54, @szq0214, @moElhoseiny 🔗 Full paper: arxiv.org/abs/2505.24867

Ujjwal Upadhyay@theujjwal9·4 Jun

10/ To conclude: We’ve built “vision” models… …that don’t really watch. They stare at stills. Guess between gaps. And miss the magic of time. Time to rethink the architecture. Time to teach models how to see through time.

Ujjwal Upadhyay@theujjwal9·4 Jun

8/ Why this matters. Imagine:
- In medical imaging, key signals may only emerge over time, not in any single frame.
- Security systems miss suspicious behavior if they only see stills, not patterns over time.

Ujjwal Upadhyay@theujjwal9·4 Jun

7/ What’s the fix? Today’s models treat temporal reasoning as an afterthought, just stitching frames together. But the brain doesn’t do that. Neuroscience shows that time perception is distributed, dynamic, and doesn’t depend on clear snapshots. We need a similar architecture.

Ujjwal Upadhyay@theujjwal9·4 Jun

5/ SpookyBench exposes this perfectly. In these videos:
• Every single frame looks like random noise
• But if you let it play, you see words emerge
• It’s like a magic trick powered by motion
Humans get it instantly. Machines fall flat.

Ujjwal Upadhyay@theujjwal9·4 Jun

6/ It’s not a fluke. We tested:
- GPT-4o
- Gemini 1.5 & 2.0
- Qwen2.5-VL
- InternVL2.5
- Video-LLaVA, TimeChat, LLaVA-Next
Doesn’t matter if the model has 2B or 600B parameters. Every. Single. One. Scored. 0%.

Ujjwal Upadhyay@theujjwal9·4 Jun

4/ But humans do. The brain naturally groups motion. It's how we read flashing signs, interpret Morse code, or even watch fireflies communicate. Even when spatial information is gone, we can decode meaning from purely temporal patterns. LLMs? Can’t.

Ujjwal Upadhyay@theujjwal9·4 Jun

Time Blindness: Why Video-Language Models Can’t See What Humans Can? We just dropped a paper exposing a major flaw in top Video-Language Models like GPT-4o & Gemini: They're completely blind to temporal patterns. Humans score 98%. These models? 0%. Here’s what we found 🧵

Ujjwal Upadhyay@theujjwal9·4 Jun

3/ Why does this happen? Because today’s Video-Language Models (VLMs) aren’t really watching videos. They just look at frames. Extract spatial features. Then try to guess what's happening between them. They don’t truly see through time. They see time through a spatial lens.
