Laura Wang

17 posts

Laura Wang

@LaurawlyLaura

Bergabung Ekim 2012

97 Mengikuti109 Pengikut

Laura Wang me-retweet

Bing Xu@bingxu_·26 Mar

This may be one of the first real signs of superhuman intelligence in software. On some of the most optimized attention workloads, agents can now outperform almost all human GPU experts by searching continuously for 7 days with no human intervention inside the optimization loop. Terry and I started agentic coding efforts at NVIDIA 1.5 years ago. Neither of us knew GPU programming, so from day one we pushed toward fully automated, human-out-of-the-loop systems. We call it blind coding. Over those 1.5 years, the two of us generated 4 generations across 2 agent systems. Since the 2nd generation, the stacks have been self-evolving. Each agent is now around 100k non-empty LOC. When we released the blind-coding framework VibeTensor in January, the implication was easy to miss. AVO makes the signal clearer. My bet is: blind coding is the future of software engineering. Human cognition is the bottleneck.

English

152

207.9K

Laura Wang me-retweet

PyTorch@PyTorch·6 Mar

Building on the previous correctness-focused pipeline, KernelAgent can now integrate GPU hardware-performance signals into a closed-loop multi-agent workflow to guide the optimization for Triton Kernels. Learn more: hubs.la/Q045Wsqq0 @KaimingCheng @marksaroufim

English

23.4K

Laura Wang me-retweet

Elon Musk@elonmusk·23 Oca

@cb_doge I’m started off knowing nothing about rockets, satellites, cars, etc, but I learn fast

English

1.5K

1.2K

30.3K

462K

Laura Wang me-retweet

AK@_akhaliq·22 Oca

Nvidia presents VibeTensor System Software for Deep Learning, Fully Generated by AI Agents

English

9.8K

Laura Wang me-retweet

Mark Saroufim@marksaroufim·8 Oca

Pretty interesting Kernel LLM result, the beginnings of a new fast CuteDSL kernel zoo that's Quack inspired called Oink! An AI generated fused RMS norm kernel was integrated into VLLM and is showing 40% speedups relative to the existing RMS norm kernel in VLLM and an e2e 1.6% over the entire system github.com/vllm-project/v… Looking at the kernel specifically and comparing it vs the Quack one is also interesting github.com/meta-pytorch/K… First off, the code is much longer than Quack's and that's because the AI tries to effectively write a mini heuristic autotuner that's splatted over the file - so for instance Deepseek has a hot shape of 7168, at bf16 and if we choose to copy 256 bit vectors we get 16 vector elements. 7168 / 16 = 448 vectors across the row so we can just choose 224 threads per row to get 448 / 224 = 2 vectors per thread. This would be quite tedious for humans to work out per shape especially without a long running autotuner. This doesn't always work perfectly and the AI admits that this can segfault if used in conjunction with cluster launches and direct GMEM loads which brings me to the second cool trick. The AI figured out it could do direct_gmem which skips smem for data staging but keeps it for reduction The kernel is still quite long at 3K LOC but I suspect this can be brought down significantly, for instance the AI built its own tensor marshalling abstraction when it could have just leveraged tvm-ffi but overall it seems good at taking existing code written by experts such as Quack and modifying it to make a bit faster by using more tricks Using VLLM as an eval suite is quite nice, I'm not sure on where we'll converge as a community with excessive fallbacks on hard to fully test kernels, I suspect those will make stability and/or determinism work much more challenging but between this work and the work the FlashInfer team is doing on using full systems as an eval suite, I'm more optimistic we'll up end up with SOTA AI kernels this year.

English

120

12.9K

Laura Wang@LaurawlyLaura·9 Oca

KernelFalcon: Agentic kernel authoring is getting very real. vLLM: github.com/vllm-project/v… PyTorch: github.com/meta-pytorch/K…

Elliot Arledge@elliotarledge

This is my favorite clip of the new Elon pod. He opens up saying xAI struggles with memory usage/bandwidth and CUDA kernel optimization (matmul, attention, MoE, etc). If you are good kernel or performance engineering in general, you should apply. Steer the world in a better direction.

English

505

Laura Wang@LaurawlyLaura·6 Kas

@elliotarledge @PyTorch Perf is next—stay tuned~

English

Elliot Arledge@elliotarledge·6 Kas

@PyTorch What about speedups?

English

541

Laura Wang me-retweet

PyTorch@PyTorch·6 Kas

KernelFalcon achieves 100% correctness across all 250 KernelBench L1–L3 tasks through a deep agent architecture that structures the problem instead of prompting harder. The system combines hierarchical task decomposition, deterministic orchestration, grounded execution, and parallel verification to generate GPU kernels that compile to PTX, execute on real hardware, and preserve PyTorch semantics. 💡Read our latest blog from @LaurawlyLaura and collaborators at Team PyTorch: hubs.la/Q03RYkZq0 #PyTorch #KernelFalcon #AIInfrastructure #OpenSourceAI

English

138

19.2K

Laura Wang me-retweet

Bing Xu@bingxu_·13 Şub

NVIDIA AI Developer@NVIDIAAIDev

Our engineers developed an AI coding agent using DeepSeek-R1 and inference-time scaling to automatically generate GPU attention kernels. Read how ➡️ nvda.ws/42V0mze

ZXX

13.4K

Laura Wang me-retweet

Future 42@future42org·4 Şub

As WA State Education Spending goes UP, Reading and Math scores go DOWN👎 Without question, the Return On Investment for education in Washington State is getting worse. Link edunomicslab.org/washington-roi… Via @EdunomicsLab

English

112

3.9K

Laura Wang@LaurawlyLaura·7 Eki

Introducing NNVM Compiler: A New Open End-to-End Compiler for AI Frameworks aws.amazon.com/blogs/ai/intro… via @awscloud

English

Laura Wang me-retweet

Tianqi Chen@tqchenml·6 Eki

Introducing NNVM compiler tvmlang.org/2017/10/06/nnv… brings MXNet, PyTorch, Caffe2, CoreML to bare-metal hardware backends with #TVM stack twitter.com/uwcse/status/9…

Allen School@uwcse

#UWAllen researchers team up with @awscloud to release new NNVM compiler for deep learning frameworks: news.cs.washington.edu/2017/10/06/all…

English

134

259

Laura Wang me-retweet

Tianqi Chen@tqchenml·17 Ağu

I am proud to announce TVM, our new deep learning compiler toolchain news.cs.washington.edu/2017/08/17/all… tvmlang.org/2017/08/17/tvm… @guestrin @uwcse

English

163

Laura Wang me-retweet