Lorenzo Garcia

51 posts


@_fmla_

Manchester, England · Joined March 2021

328 Following · 3 Followers

Lorenzo Garcia@_fmla_·6h

@Amarillo_Slim1 you need to get more educated on what the CPU narrative is about


Amarillo Slim@Amarillo_Slim1·9h

This absolutely nukes the bullshit CPU narrative. uccl-project.github.io/posts/mkernel/


Lorenzo Garcia@_fmla_·18 May

@jiayiy how is this different from skip softmax from nvidia?


Jiayi Yuan@jiayiy·18 May

🚀 BLASST just won Best Paper at #MLSys26! In this paper, we introduce a simple, training-free dynamic sparse attention mechanism that uses a single scalar threshold on online softmax statistics to skip negligible attention blocks. Unfortunately I won’t be there in person, but please say hi to my awesome coauthors! 🙌 Paper: arxiv.org/abs/2512.12087

SemiAnalysis@SemiAnalysis_

Sparse attention mechanisms are finally moving beyond academic benchmarks into production systems, including DeepSeek Sparse Attention and, recently, @NousResearch's Lighthouse Attention. BLASST by NVIDIA, from the paper "Dynamic Blocked Attention Sparsity via Softmax Thresholding", attempts to sparsify attention in a different way, leveraging a rescale-factor threshold idea similar to the one in Flash Attention 4. We expect to see more interesting sparse attention techniques in the future. arxiv.org/abs/2512.12087 (2/4)
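
For readers following the thread, the thresholding idea can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's kernel: the function name `attend`, the single-query setup, and the exact skip rule (drop a block when its largest score sits more than `-ln(threshold)` below the running maximum) are all hypothetical choices made for the example.

```python
import math

def attend(q, keys, values, block=2, threshold=0.0):
    """Single-query attention with online softmax and optional block skipping.

    Assumed skip rule (illustrative only): a block is skipped when
    block_max - running_max < ln(threshold), i.e. when even its largest
    score would receive a relative weight below `threshold`.
    threshold=0.0 disables skipping.
    Returns (output_vector, number_of_skipped_blocks).
    """
    dim = len(values[0])
    m = -math.inf          # running maximum of the scores seen so far
    denom = 0.0            # running softmax denominator
    acc = [0.0] * dim      # running weighted sum of value vectors
    skipped = 0
    for start in range(0, len(keys), block):
        ks = keys[start:start + block]
        vs = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in ks]
        block_max = max(scores)
        # Compare in the log domain to avoid overflow in exp().
        if threshold > 0.0 and block_max - m < math.log(threshold):
            skipped += 1   # no exponentials, no value accumulation
            continue
        new_m = max(m, block_max)
        scale = math.exp(m - new_m)   # rescale earlier partial sums
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, vs):
            w = math.exp(s - new_m)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        m = new_m
    return [a / denom for a in acc], skipped
```

Calling `attend(q, keys, values, threshold=0.0)` gives the dense result. A small positive threshold trades a bounded amount of softmax mass for skipped exponentials and value reads. The score computation itself still runs for every block in this sketch.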


Lorenzo Garcia@_fmla_·2 May

@always_ff_rohan "FPGAs can hit 400 GFLOPs, basically replacing a 16-core server." 400 GFLOPs is nothing in the grand scheme of things. A single Graviton 4 core can do around 360 bf16 GFLOPs, so I don’t know which 16 cores you are planning to replace with this FPGA.


Rohan makes ASICs 🛠️@always_ff_rohan·25 May

Baidu and a bunch of other companies have already been using FPGAs for AI. Meanwhile, Indian AI/DL startups are just burning money at GPUs. For instance on matmul, FPGAs can hit 400 GFLOPs, basically replacing a 16-core server. For DL workloads with small batch sizes, FPGAs can actually beat both GPUs and CPUs in terms of throughput.
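
The disagreement in this exchange comes down to simple arithmetic, sketched below. The two throughput figures are taken at face value from the posts above and are not independently verified. The FPGA figure's numeric precision is not stated in the thread, so the comparison with a bf16 number is only indicative.

```python
# Figures quoted in the thread (assumed, not measured here).
FPGA_GFLOPS = 400.0        # FPGA matmul throughput claimed in the original post
CORE_BF16_GFLOPS = 360.0   # per-core Graviton 4 bf16 estimate from the reply
CORES = 16                 # size of the server the FPGA is said to replace

server_gflops = CORES * CORE_BF16_GFLOPS        # aggregate for 16 such cores
ratio = server_gflops / FPGA_GFLOPS             # server vs. FPGA
cores_matched = FPGA_GFLOPS / CORE_BF16_GFLOPS  # cores needed to match the FPGA

print(f"16-core server: {server_gflops:.0f} GFLOPs ({ratio:.1f}x the FPGA)")
print(f"Cores needed to match the FPGA: {cores_matched:.2f}")
```

On these numbers the FPGA matches a little over one core rather than sixteen, which is the point the reply is making.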


Lorenzo Garcia@_fmla_·25 Apr

@ezyang linear attention ops


Edward Z. Yang@ezyang·25 Apr

Give me your O(num operators) PyTorch improvement ideas that you are interested in. Historical examples: making every kernel deterministic / support zero size. Not done: every kernel in your favorite DSL / batch invariant / masked / padded / device side size


Lorenzo Garcia@_fmla_·29 Mar

@yacineMTB @difficultyang you’re delusional bro


kache@yacineMTB·29 Mar

@difficultyang i would simply just not use torch at all and write cuda directly


difficultyang@difficultyang·29 Mar

Let's say we wanted to rewrite PyTorch from scratch, because such a thing is topic du jour in the age of LLMs. What would the goals of such a rewrite be? What problems could a rewrite solve that incremental evolution from where the code is today could not? 🧵


Lorenzo Garcia@_fmla_·7 Mar

@bubbleboi taylor series approximation of exp is not an innovation. people have been doing this for ages


bubble boi@bubbleboi·6 Mar

So Nvidia added the following innovations: - Taylor-series-like approximation of e^x - Horner's method for softmax. And we think GPUs aren’t just becoming LLM accelerators at this point?

Tri Dao@tri_dao

The FA4 paper is finally out after a year of work. On Blackwell GPUs, attention now goes about as fast as matmul even though the bottlenecks are so different! Tensor cores are now so crazy fast that attn fwd is bottlenecked by exponential, and attn bwd is bottlenecked by shared memory bandwidth. Some fun stuff in the redesigned algorithm to overcome these bottlenecks: exponential emulation with polynomials, new online softmax to avoid 90% of softmax rescaling, 2CTA MMA instructions that allow two thread blocks to share operands to reduce smem traffic.
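
The technique being argued about here, a polynomial approximation of exp evaluated with Horner's rule, is long-standing and can be sketched as follows. This is a generic textbook version and not the FA4 implementation: the degree-5 Taylor coefficients, the range reduction x = k·ln 2 + r, and the names `horner` and `exp_poly` are assumptions for illustration.

```python
import math

# Taylor coefficients of e^r around 0, highest degree first: 1/5!, ..., 1/0!.
DEGREE = 5
COEFFS = [1.0 / math.factorial(k) for k in range(DEGREE, -1, -1)]

def horner(coeffs, x):
    """Evaluate a polynomial (highest-degree coefficient first) by Horner's rule."""
    acc = 0.0
    for c in coeffs:
        acc = acc * x + c   # one multiply-add per coefficient
    return acc

def exp_poly(x):
    """Approximate e^x as 2^k * P(r), with x = k*ln(2) + r and |r| <= ln(2)/2."""
    ln2 = math.log(2.0)
    k = round(x / ln2)      # integer power of two
    r = x - k * ln2         # small remainder where the polynomial is accurate
    return math.ldexp(horner(COEFFS, r), k)   # scale by 2^k exactly

if __name__ == "__main__":
    for x in (-5.0, -1.0, 0.0, 0.5, 3.0):
        print(x, exp_poly(x), math.exp(x))
```

Range reduction keeps the polynomial argument small, so a low degree suffices. Horner's rule then needs only one multiply-add per coefficient, which is why the combination maps well onto fused multiply-add hardware.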


Lorenzo Garcia@_fmla_·25 Feb

@benjihyam aged like milk


Benji Hyam@benjihyam·12 Jan

ChatGPT is the next clubhouse. In 12 months, people will forget it existed.


Lorenzo Garcia@_fmla_·23 Feb

@JennyTheDev @AnthropicAI you must be super dumb


Jenny@JennyTheDev·23 Feb

@AnthropicAI 24,000 fake accounts and 16 million exchanges just to avoid training their own models. The "build vs steal" decision tree is getting wild.


Anthropic@AnthropicAI·23 Feb

We’ve identified industrial-scale distillation attacks on our models by DeepSeek, Moonshot AI, and MiniMax. These labs created over 24,000 fraudulent accounts and generated over 16 million exchanges with Claude, extracting its capabilities to train and improve their own models.


Lorenzo Garcia@_fmla_·19 Feb

@scaling01 @zephyr_z9 100x speedups are BS


Lisan al Gaib@scaling01·17 Feb

This is what I'm most bullish on. Kernel optimization is just going absolutely vertical.

Lisan al Gaib@scaling01

Sonnet 4.6 beating Opus 4.6 on AIRD Kernels Hard (kernel optimization)


Lorenzo Garcia@_fmla_·31 Dec

@gopinath9629 @mkurman88 vLLM does support CPU. If it didn’t work for you, please raise an issue and we’ll fix it.


GopiNath@gopinath9629·30 Dec

@mkurman88 llama.cpp is doing well, I guess, but it still lags behind when it comes to multiple GPUs, and vLLM doesn't support CPU. I am currently exploring exo and then distributed llama for my setup.


Mariusz Kurman@mkurman88·30 Dec

Is vllm broken? How can two identical requests produce different outputs? Temperature 0.0, seed 42 - how the hell can you ensure determinism?


Lorenzo Garcia@_fmla_·14 Dec

@ivanfioravanti @awnihannun Everybody does that btw


Ivan Fioravanti ᯅ@ivanfioravanti·13 Dec

Apple MLX team never stops. Here @awnihannun merging a PR on Saturday morning 💪🏻


Lorenzo Garcia@_fmla_·22 Nov

@istoica05 17x speedup means that the baseline was either not written by experts or the op was not important enough to be optimised


Ion Stoica@istoica05·21 Nov

Optimizing kernels for emerging hardware architectures remains a critical bottleneck today. This new blog post describes Autocomp, a new ADRS (AI-Driven Research for Systems) framework that leverages LLM-driven search to automate this process and achieves up to a 17x speedup over expert-written code on AWS Trainium.

AI-Driven Research for Systems@ai4research_ucb

🚀 AI optimizes tensor kernels to run 17x faster than human expert designs! [ADRS Blog] Programming hardware accelerators is notoriously hard. We describe Autocomp, the first LLM-driven optimizer for tensor accelerators, which outperforms hand-tuned expert kernels on AWS Trainium by up to 17x! ✍️ Read the blog: adrs-ucb.notion.site/autocomp 📖 ADRS Blog Series: ucbskyadrs.github.io 📃 Autocomp Paper: arxiv.org/pdf/2505.18574 👩‍💻 Code: github.com/ucb-bar/autoco…


Lorenzo Garcia@_fmla_·26 Aug

@suchenzang he doesn’t need that hairline yet he has it


Susan Zhang@suchenzang·26 Aug

@_fmla_ ilya would never need linkedin recruiters


Susan Zhang@suchenzang·26 Aug

be bold take risks aim to build super-intelligence before even reaching baseline intelligence


Lorenzo Garcia@_fmla_·17 Aug

@noor_supernova7 .


SO?@noor_supernova7·16 Aug

For those who do not know, the occupation of Gaza has begun. If you’re scrolling, PLEASE leave a dot . it's just a dot.


Lorenzo Garcia@_fmla_·28 Jul

@HasanEssam29636 .


Hasan alrabay@HasanEssam29636·26 Jul

I’m very hungry. My body is slowly falling apart from malnutrition, dizziness, and weight loss. If you’re scrolling, PLEASE leave a dot . it's just a dot


Lorenzo Garcia@_fmla_·22 Jul

@chrisgpt Nobody wants to be garbage collected


Chris@Chrisgpt·22 Jul

can’t sleep, who wants to be added to a gc?


Lorenzo Garcia@_fmla_·17 Jun

@GavinWax I lost brain cells reading this


Gavin M. Wax@GavinWax·15 Jun

Silicon Valley is a strategic national asset whose jobs should be legally reserved for U.S. citizens. It’s a matter of national security. This shouldn’t be controversial.


Lorenzo Garcia@_fmla_·15 Jun

@SupaDupaChop That’s not what they’re saying in the video. They’re encouraging julani to strike tel aviv


Lorenzo Garcia@_fmla_·2 Jun

@_jasonwei The transformer was invented for machine translation (MT), btw


Jason Wei@_jasonwei·2 Jun

There are traditionally two types of research: problem-driven research and method-driven research. As we’ve seen with large language models and now AlphaEvolve, it should be very clear now that total method-driven research is a huge opportunity.

Problem-driven research is nice because you have a consistent and specific goal. The goal is usually virtuous, so it feels good to have a mission and identity. However, it just doesn’t work due to The Bitter Lesson. Basically everything in classical NLP (machine translation, summarization, chatbots) lost to simple scaling. ChatGPT is a prime example: it used nothing from chatbot research and certainly wasn’t the intended end goal of OpenAI’s 2022 research program, but was a huge hit because someone (John Schulman et al) figured out the right way to package large language models as a product.

Method-driven research feels less stable because you’re constantly searching for problems and you have to be opportunistic. But I believe AI will allow method-driven research to dominate progress in most fields of science, one-by-one.

The latest method (or “hammer”), as we’ve seen in AlphaEvolve, is ruthless search and optimization against a reward function (whether this requires RL or not is a separate discussion). Things that problem-driven researchers have been trying to solve for a long time like the kissing number problem will become nails hit by the hammer. Eventually the hammer will become bigger, stronger, and more general and will hit more and more nails.

So a very important meta-skill for the next decade will be knowing how to create the right environments to use The Hammer. Ironically, the problem-driven researchers, who by definition are experts in a specific problem, are well-positioned to create these environments. If, that is, they can put down their egos and pick up the hammer.
