Vladimir Vlejd Macko

100 posts

@vlejd

I like taking things from 0 to 1. From nothing to something. Sprinkled with ML when necessary.

Joined March 2011

106 Following · 67 Followers

Pinned Tweet

Vladimir Vlejd Macko @vlejd · Nov 24

Unstructured weight #sparsity made practical. 50% unstructured weight sparsity was considered too low for real GPU speed up without specific hardware support (like @cerebras). With @bozavlado we built MACKO-SpMV - a new matrix format + SpMV kernel to change that. 🧵

Vladimir Vlejd Macko @vlejd · Dec 29

@mmaaz_98 Nice work! If you want to add support for low (20-90%) sparsity, we have an implementation at github.com/vlejd/macko_sp…

Maaz @mmaaz_98 · Dec 28

I built a GPU-accelerated linear programming solver in PyTorch that scales to 100k+ variables and constraints -- and is competitive with state-of-the-art solvers. The entire implementation is only ~350 lines (excl. docs / logging) and is meant to be as simple as possible.

Vladimir Vlejd Macko @vlejd · Dec 4

@mariyaivasileva I worked at a company that sometimes had spare GPUs.

Mariya I. Vasileva @mariyaivasileva · Dec 1

“Tell me you’re an ML veteran without telling me you’re an ML veteran.” “My first paper was published at NIPS, not NeurIPS.”

Vladimir Vlejd Macko retweeted

James Bradbury @jekbradbury · Nov 24

opus 4.5 is really good at GPU programming, but somehow it’s even better at GPU programming jokes (h/t @Si_Boehm)

Vladimir Vlejd Macko @vlejd · Nov 24

🛠️ Next step: server GPUs. If you know how to implement a minimal CUDA matvec on H100 that hits ≥95% of cuBLAS 👉 My DMs are open.

Vladimir Vlejd Macko @vlejd · Nov 24

And yes, it translates to real LLM inference speed ups.

Vladimir Vlejd Macko @vlejd · Nov 22

It is funny how little correct information there is about how to properly benchmark a CUDA kernel. Most papers are wrong, eval libraries are hard to inspect, and even this could have a problem, because it may include the kernel launch depending on the clear_cache implementation.

tender @tenderizzation

btw, I think BackendBench just uses triton's do_bench function, which uses a very similar timing mechanism to the one exploited here and wouldn't be robust to the same side-stream shenanigans

Vladimir Vlejd Macko @vlejd · Nov 22

@miru_why @niklassheth @ronusedh @IntologyAI My second personal favorite is to not clean the cache between invocations, and testing only on matrices that fit the cache. You can get some truly unbelievable flops :D

miru @miru_why · Nov 21

@niklassheth @ronusedh @IntologyAI their 'superhuman' ai cleverly assigned all the work to non-default streams, which means the correctness test (which waits on all streams) passes, while the profiling timer (which only waits on the default stream) is tricked into reporting a huge speedup
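The timing trick described above can be sketched without a GPU. This is only a CPU analogy built on assumptions: each single-worker thread pool stands in for one CUDA stream, `fake_kernel` is a hypothetical stand-in that sleeps instead of computing, and `timed_run` is an illustrative helper, not code from any of the projects mentioned. In PyTorch, the comparable fix is to call `torch.cuda.synchronize()`, which waits for all streams on the device, before reading the timer.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def fake_kernel(duration_s=0.05):
    # Hypothetical stand-in for GPU work: sleeps instead of computing.
    time.sleep(duration_s)
    return "done"


def timed_run(sync_all_streams):
    # Each single-worker executor plays the role of one CUDA stream.
    with ThreadPoolExecutor(max_workers=1) as default_stream, \
         ThreadPoolExecutor(max_workers=1) as side_stream:
        start = time.perf_counter()
        # All the work is queued on the side stream.
        side_work = [side_stream.submit(fake_kernel)]
        # Nothing is queued on the default stream.
        default_work = []
        # The naive timer waits only on the default stream before stopping.
        wait(side_work + default_work if sync_all_streams else default_work)
        elapsed = time.perf_counter() - start
        # The correctness check waits on everything, so it passes either way.
        result = side_work[0].result()
    return elapsed, result


naive_s, naive_result = timed_run(sync_all_streams=False)
honest_s, honest_result = timed_run(sync_all_streams=True)
print(f"naive timer:  {naive_s * 1e3:.2f} ms, result={naive_result}")
print(f"honest timer: {honest_s * 1e3:.2f} ms, result={honest_result}")
```

Both runs return the same result, but only the run that waits on every stream measures the time the work actually took.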

Intology @IntologyAI · Nov 19

Introducing Locus: the first AI system to outperform human experts at AI R&D Locus conducts research autonomously over multiple days and achieves superhuman results on RE-Bench given the same resources as humans, as well as SOTA performance on GPU kernel & ML engineering tasks. RE-Bench is a collection of several frontier AI research tasks that typically take human experts (e.g., top ML PhDs and frontier lab researchers) several days. By scaling experimentation to far longer time horizons than previous systems, Locus represents a step change in AI scientist capabilities. 🧵

Vladimir Vlejd Macko @vlejd · Nov 21

@miru_why @niklassheth @ronusedh @IntologyAI github.com/pytorch/pytorc…

Vladimir Vlejd Macko @vlejd · Nov 21

@miru_why @niklassheth @ronusedh @IntologyAI Hahaha. I spent months debugging this. Had to fix the official torch documentation, which contained the same problem in its examples. Unfortunately, it is a pretty common pattern.

Vladimir Vlejd Macko retweeted

Julian @julianboolean_ · Aug 20

holy shit they found a power series solution to ALL polynomial equations!! (bypassing Galois which says you can’t solve them in radicals)

Vladimir Vlejd Macko @vlejd · Jul 18

@ollama local app is coming up! Awesome. Local models are the future.

Vladimir Vlejd Macko @vlejd · Jul 18

Happy birthday @ollama !

Vladimir Vlejd Macko @vlejd · Jul 18

I do model compression and optimization. It is essential to have access to different GPUs and that would be impossible without @vast_ai . Happy to finally meet you guys at #ICML2025 . And thanks a lot for the Nintendo Switch!

Vladimir Vlejd Macko @vlejd · Jul 16

Just visited with Vast.ai at ICML 2025. Without them my research would be almost impossible. They make GPU kernel development much more accessible. @vast_ai

Vladimir Vlejd Macko @vlejd · Jul 2

I found a "high-priority" "oncall" bug in torch. Achievement unlocked: github.com/pytorch/pytorc… Two weeks of insanity were worth it.
