Laura Wang
@LaurawlyLaura

17 posts
Joined October 2012
97 Following · 109 Followers
Laura Wang reposted
Bing Xu @bingxu_
This may be one of the first real signs of superhuman intelligence in software. On some of the most optimized attention workloads, agents can now outperform almost all human GPU experts by searching continuously for 7 days with no human intervention inside the optimization loop.

Terry and I started agentic coding efforts at NVIDIA 1.5 years ago. Neither of us knew GPU programming, so from day one we pushed toward fully automated, human-out-of-the-loop systems. We call it blind coding. Over those 1.5 years, the two of us generated 4 generations across 2 agent systems. Since the 2nd generation, the stacks have been self-evolving. Each agent is now around 100k non-empty LOC.

When we released the blind-coding framework VibeTensor in January, the implication was easy to miss. AVO makes the signal clearer. My bet is: blind coding is the future of software engineering. Human cognition is the bottleneck.
47 replies · 152 reposts · 1K likes · 207.9K views
Laura Wang reposted
PyTorch @PyTorch
Building on the previous correctness-focused pipeline, KernelAgent can now integrate GPU hardware-performance signals into a closed-loop multi-agent workflow to guide the optimization of Triton kernels. Learn more: hubs.la/Q045Wsqq0 @KaimingCheng @marksaroufim
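The closed-loop pattern the post describes can be sketched in a few lines. This is illustrative only, not the actual KernelAgent API: `closed_loop_optimize` and the toy latency model are invented names, and the real system uses LLM agents and hardware profilers where the lambdas sit.

```python
from itertools import cycle

# Illustrative sketch only (not KernelAgent's real code): generate a kernel
# candidate, gate it on correctness, measure a hardware-performance signal,
# and feed that signal back into the next generation round.
def closed_loop_optimize(generate, is_correct, measure_latency, rounds=4):
    best, best_lat = None, float("inf")
    feedback = None
    for _ in range(rounds):
        candidate = generate(feedback)        # agent proposes a candidate
        if not is_correct(candidate):         # correctness gate first
            feedback = ("incorrect", candidate)
            continue
        lat = measure_latency(candidate)      # hardware signal (e.g. profiler)
        if lat < best_lat:
            best, best_lat = candidate, lat
        feedback = ("latency", lat)           # the signal guides the loop
    return best, best_lat

# Toy stand-ins: "candidates" are block sizes; the latency model favors 128.
candidates = cycle([32, 64, 128, 256])
best, lat = closed_loop_optimize(
    generate=lambda fb: next(candidates),
    is_correct=lambda s: True,
    measure_latency=lambda s: abs(s - 128) + 1.0,
)
# best == 128, lat == 1.0
```

The point of the structure is that correctness checking and performance measurement live inside the loop, so no human has to read profiler output between iterations.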
3 replies · 20 reposts · 89 likes · 23.4K views
Laura Wang reposted
Elon Musk @elonmusk
@cb_doge I started off knowing nothing about rockets, satellites, cars, etc., but I learn fast
1.5K replies · 1.2K reposts · 30.3K likes · 462K views
Laura Wang reposted
AK @_akhaliq
Nvidia presents VibeTensor: System Software for Deep Learning, Fully Generated by AI Agents
2 replies · 16 reposts · 89 likes · 9.8K views
Laura Wang reposted
Mark Saroufim @marksaroufim
Pretty interesting kernel LLM result: the beginnings of a new fast CuteDSL kernel zoo, Quack-inspired, called Oink! An AI-generated fused RMS norm kernel was integrated into VLLM and is showing 40% speedups relative to the existing RMS norm kernel in VLLM, and a 1.6% end-to-end gain over the entire system: github.com/vllm-project/v…

Looking at the kernel specifically and comparing it against the Quack one is also interesting: github.com/meta-pytorch/K… First off, the code is much longer than Quack's, and that's because the AI effectively writes a mini heuristic autotuner splatted across the file. For instance, DeepSeek has a hot shape of 7168; at bf16, if we choose to copy 256-bit vectors we get 16 vector elements. 7168 / 16 = 448 vectors across the row, so we can choose 224 threads per row to get 448 / 224 = 2 vectors per thread. This would be quite tedious for humans to work out per shape, especially without a long-running autotuner. This doesn't always work perfectly, and the AI admits it can segfault when used in conjunction with cluster launches and direct GMEM loads, which brings me to the second cool trick: the AI figured out it could do direct_gmem, which skips smem for data staging but keeps it for reduction.

The kernel is still quite long at 3K LOC, but I suspect this can be brought down significantly; for instance, the AI built its own tensor-marshalling abstraction when it could have just leveraged tvm-ffi. Overall, though, it seems good at taking existing code written by experts, such as Quack, and modifying it to make it a bit faster using more tricks.

Using VLLM as an eval suite is quite nice. I'm not sure where we'll converge as a community on excessive fallbacks in hard-to-fully-test kernels; I suspect those will make stability and/or determinism work much more challenging. But between this work and the work the FlashInfer team is doing on using full systems as an eval suite, I'm more optimistic we'll end up with SOTA AI kernels this year.
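The per-shape arithmetic in the post can be sketched as a small heuristic. Names here are illustrative, not taken from the actual Oink kernel, and the real autotuner is far more involved; this just reproduces the 7168 / bf16 worked example.

```python
# Hypothetical sketch of the per-shape heuristic described above: pick a
# thread count per row so each thread handles a whole number of 256-bit
# vector loads. Function and variable names are invented for illustration.
def vector_layout(hidden_size: int, dtype_bits: int = 16, vec_bits: int = 256):
    elems_per_vec = vec_bits // dtype_bits       # 256-bit / bf16 = 16 elements
    vecs_per_row = hidden_size // elems_per_vec  # 7168 / 16 = 448 vectors
    # choose the largest candidate thread count that divides the vector count
    for threads in (256, 224, 192, 128, 96, 64, 32):
        if vecs_per_row % threads == 0:
            return threads, vecs_per_row // threads
    return 32, -(-vecs_per_row // 32)            # fallback: ceiling division

# DeepSeek's hot shape, 7168 at bf16:
threads, vecs_per_thread = vector_layout(7168)
# threads == 224, vecs_per_thread == 2
```

Run once per shape, this is trivial; hand-deriving it across every hot shape in a model zoo is the tedium the post says the AI's splatted heuristics replace.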
4 replies · 17 reposts · 120 likes · 12.9K views
Laura Wang reposted
PyTorch @PyTorch
KernelFalcon achieves 100% correctness across all 250 KernelBench L1–L3 tasks through a deep agent architecture that structures the problem instead of prompting harder. The system combines hierarchical task decomposition, deterministic orchestration, grounded execution, and parallel verification to generate GPU kernels that compile to PTX, execute on real hardware, and preserve PyTorch semantics. 💡Read our latest blog from @LaurawlyLaura and collaborators at Team PyTorch: hubs.la/Q03RYkZq0 #PyTorch #KernelFalcon #AIInfrastructure #OpenSourceAI
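The architecture named in the post (hierarchical decomposition, deterministic orchestration, grounded execution, parallel verification) can be sketched as a control-flow pattern. This is a toy illustration, not KernelFalcon's actual code; all function names are invented stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative pattern only: structure the problem instead of prompting
# harder. Each stage below stands in for a much larger agent component.
def decompose(task):
    # hypothetical splitter: break a kernel task into ordered subtasks
    return [f"{task}/step{i}" for i in range(3)]

def execute(subtask):
    # stand-in for grounded execution (compile to PTX, run on hardware)
    return f"result({subtask})"

def verify(result):
    # stand-in for a semantic check against the PyTorch reference
    return result.startswith("result(")

def run_task(task):
    subtasks = decompose(task)                  # hierarchical decomposition
    results = [execute(s) for s in subtasks]    # deterministic orchestration
    with ThreadPoolExecutor() as pool:          # parallel verification
        checks = list(pool.map(verify, results))
    return results if all(checks) else None     # all-or-nothing acceptance
```

The design choice the post highlights is that correctness comes from the structure (every subtask is executed and independently verified) rather than from a stronger prompt.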
8 replies · 19 reposts · 138 likes · 19.2K views
Laura Wang reposted
Future 42 @future42org
As WA State Education Spending goes UP, Reading and Math scores go DOWN👎 Without question, the Return On Investment for education in Washington State is getting worse. Link edunomicslab.org/washington-roi… Via @EdunomicsLab
13 replies · 43 reposts · 112 likes · 3.9K views
Laura Wang reposted
Mark Harris @harrism
What a cool award! I've always been proud of this paper. @jowens #CUDA
3 replies · 14 reposts · 66 likes
Laura Wang @LaurawlyLaura
The VR village is pretty cool at #GTC16! Live Finding Nemo creation by a Pixar artist.
0 replies · 0 reposts · 2 likes