Filip Morawiec
@gournge
111 posts
Joined February 2023
991 Following · 18 Followers

levi
levi@levidiamode·
Day 123/365 of GPU Programming

Another day, another attempt at understanding SSMs from first principles. Gaining real intuition for them (and their hardware implications) has been harder than expected, so today I'm taking a closer look at the foundational state space model papers (HiPPO, H3, the Mamba family, S4/S5, et cetera) to see if I can better understand their genealogy and the rationale behind their evolution. If anyone has specific blog posts or code that helped them get a better sense of the problem space, I'd love to know!
levi@levidiamode

Day 122/365 of GPU Programming

Continuing to learn about state space models (SSMs), especially the Mamba model family. I find them a bit more difficult to understand than Transformers, so I'm trying to build up a clearer picture progressively, starting from earlier related models like S4 and the motivations (in particular, hardware-related ones) behind their existence. Also insane how one person (Tri Dao) can be behind so many of the interesting systems papers of the last few years. Really inspirational.
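A minimal sketch of the backbone those papers share (toy dimensions and naming are mine, not from any of them): the discretized linear state-space recurrence x_k = A x_{k-1} + B u_k, y_k = C x_k. S4, H3, and Mamba differ mainly in how A is structured (e.g., HiPPO initialization) and in how this scan is computed efficiently on hardware.

```python
import numpy as np

# Toy discretized linear SSM: x_k = A x_{k-1} + B u_k, y_k = C x_k.
# Real models derive A, B by discretizing continuous parameters with a
# (possibly input-dependent, as in Mamba) step size.
d_state, seq_len = 4, 8
rng = np.random.default_rng(0)
A = np.eye(d_state) * 0.9              # state transition (stand-in for a structured/HiPPO matrix)
B = rng.standard_normal((d_state, 1))  # input projection
C = rng.standard_normal((1, d_state))  # output projection

u = rng.standard_normal(seq_len)       # scalar input sequence
x = np.zeros(d_state)                  # hidden state carried across time steps
ys = []
for k in range(seq_len):
    x = A @ x + (B * u[k]).ravel()     # sequential scan; S4 computes this as a convolution instead
    ys.append((C @ x).item())
print(ys)
```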

2 replies · 7 reposts · 88 likes · 10.2K views
Piotr Mazurek
Piotr Mazurek@tugot17·
Relearning some probabilistic ML fundamentals over the weekend; very different from standard LLM work, but quite important for robotics and world models.
1 reply · 0 reposts · 8 likes · 818 views
levi
levi@levidiamode·
Day 119/365 of GPU Programming

Lecture 3 of Stanford's CS336 (Language Modeling from Scratch) is such a gem. Instead of yet another lecture on transformers, it provides a deep dive into what has empirically been used in LLM land over the last two years in terms of architecture and hyperparameter choices. Surprising how many of the original Transformer decisions are still valid today, but also interesting to see that almost all modern language models use pre-norm, and how universally useful throwing in layernorms seems to be. It also goes into the activation zoo and why gated activations are popular these days, where rotary position embeddings came from, why almost always d_ff = 4*d_model, how the head_dim * num_heads / d_model ratio is often 1, what the most common vocabulary sizes are, why we apply weight decay to LLMs, and what stability tricks exist to make your loss curves less spiky. One of the most practical and up-to-date lectures on LLMs I've seen so far. Really recommend it to anyone who hasn't watched it yet.
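A toy illustration of several of those conventions (names and sizes are mine, not the lecture's): a pre-norm residual block around a SwiGLU-style gated MLP, with d_ff = 4*d_model and head_dim * num_heads = d_model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, num_heads = 512, 8
head_dim = d_model // num_heads   # head_dim * num_heads / d_model == 1
d_ff = 4 * d_model                # classic MLP width; gated variants often shrink this to ~8/3 * d_model

class GatedMLP(nn.Module):
    """SwiGLU-style gated activation: W_down(SiLU(W_gate x) * W_up x)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Pre-norm: normalize *before* the sublayer, then add the residual.
norm, mlp = nn.LayerNorm(d_model), GatedMLP(d_model, d_ff)
x = torch.randn(2, 16, d_model)
x = x + mlp(norm(x))              # x + sublayer(norm(x)), not norm(x + sublayer(x))
```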
levi@levidiamode

Day 118/365 of GPU Programming

Spent some time looking into GPU fabric and learned a little about how modern GPU clusters move data across NVLink, NVSwitch, PCIe, RDMA, and similar interconnects. My current understanding is that once a model no longer fits on one GPU, performance might be less about whether a single GPU can do some computation and more about how much data has to move between GPUs and how to avoid moving too much? In decode, a new query can be really small (e.g., just one token or a small number of tokens), but the KV cache can be huge because it stores the KVs for the entire previous context. And since online softmax can be computed in pieces, where each shard keeps only small local summary stats (the local max score, the local normalizer, and the locally weighted value sum), those summaries can be merged later into the same result as full attention. So it seems possible to compute exact long-context attention by leaving the KV cache distributed across the GPU fabric, sending small query blocks to the KV shards, and merging only tiny softmax stats instead of moving raw KV back to one GPU? Not sure, but maybe someone more knowledgeable can enlighten me on how this works exactly one day.
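A small self-contained check of the merging trick described above (my own snippet, single query, two shards): each shard returns only its local max, normalizer, and weighted value sum, and merging those reproduces exact full attention.

```python
import numpy as np

def shard_stats(q, K, V):
    """Online-softmax summary for one shard: (local max, local normalizer, weighted value sum)."""
    s = K @ q
    m = s.max()
    p = np.exp(s - m)
    return m, p.sum(), p @ V

def merge(a, b):
    """Combine two shard summaries by rescaling both to a shared max."""
    (ma, la, oa), (mb, lb, ob) = a, b
    m = max(ma, mb)
    return m, la * np.exp(ma - m) + lb * np.exp(mb - m), oa * np.exp(ma - m) + ob * np.exp(mb - m)

d, n = 16, 64
rng = np.random.default_rng(0)
q = rng.standard_normal(d)
K, V = rng.standard_normal((n, d)), rng.standard_normal((n, d))

# Reference: full attention computed in one place.
s = K @ q
p = np.exp(s - s.max())
full = (p @ V) / p.sum()

# Distributed: each half of the KV cache stays put; only tiny stats are merged.
m, l, o = merge(shard_stats(q, K[:32], V[:32]), shard_stats(q, K[32:], V[32:]))
assert np.allclose(o / l, full)
```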

3 replies · 20 reposts · 262 likes · 16.5K views
Filip Morawiec
Filip Morawiec@gournge·
@Jianfei_AI How does it compare to SOTA 1-step generation methods? The "Generative Modelling via Drifting" paper had strong performance using only one step, and it also had some experiments in the robotics domain.
0 replies · 0 reposts · 0 likes · 252 views
Jianfei Yang
Jianfei Yang@Jianfei_AI·
Excited to share a piece of work that I'm personally very proud of 👇

Our paper "Action-to-Action Flow Matching (A2A)" has been accepted to RSS-2026.

What's the idea? Instead of generating robot actions from random noise (slow), we start from past actions and directly map to the next one via flow matching.

Result:
⚡ single-step inference
⚡ great success rate
⚡ closer to real-world control speed

From diffusion-style "slow thinking" → to instant action. Very excited about this step toward execution-speed embodied intelligence.

🔗 Project page: lorenzo-0-0.github.io/A2A_Flow_Match…
🔗 Paper link: arxiv.org/abs/2602.07322
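A hypothetical sketch of the idea as stated in the tweet (architecture, names, and training details are my guesses, not taken from the A2A paper): learn a velocity field from the previous action to the next one, so inference becomes a single Euler step instead of a many-step denoising chain from noise.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Predicts the flow velocity given the current action point, observation, and time."""
    def __init__(self, action_dim: int, obs_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + obs_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, a, obs, t):
        return self.net(torch.cat([a, obs, t], dim=-1))

def one_step_inference(v_net, prev_action, obs):
    """Start from the past action (not random noise) and take one Euler step (dt = 1)."""
    t = torch.zeros(prev_action.shape[:-1] + (1,))
    return prev_action + v_net(prev_action, obs, t)

# Training would regress v on interpolants a_t = (1 - t) * a_prev + t * a_next
# with target velocity (a_next - a_prev), i.e., standard flow matching between
# the past-action distribution and the next-action distribution.
```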
4 replies · 48 reposts · 366 likes · 29.2K views
Filip Morawiec reposted
Afshine Amidi
Afshine Amidi@afshinea·
Our new Stanford class "CME 296: Diffusion & Large Vision Models" is now available on YouTube!
11 replies · 180 reposts · 1.6K likes · 59.7K views
Filip Morawiec reposted
Vincent Abbott
Vincent Abbott@vtabbott_·
ICLR poster for FlashAttention on a Napkin with @GioeleZardini! See y'all in Brazil 🇧🇷🇧🇷🇧🇷
2 replies · 4 reposts · 34 likes · 2.4K views
Filip Morawiec reposted
Steve Jurvetson
Steve Jurvetson@FutureJurvetson·
Subtext: how Zuck's obsession with VR lost him AI leadership and "the greatest deal Google ever made."

"if Facebook didn't buy DeepMind, they would end up in the arms of Google. Hassabis came out to the West Coast to have lunch with Larry Page, still the strongest suitor. Zuckerberg got wind of his visit and invited him to dinner. Arriving at Zuckerberg's Palo Alto home, Hassabis administered a subtle test on him. The two men discussed the potential of AI, and Zuckerberg expressed appropriate excitement. But then, as the dinner continued, Hassabis brought up other hot technologies: virtual reality, augmented reality, 3-D printing. Zuckerberg sounded equally excited about all of them. 'That told me what I needed to know,' Hassabis said. 'Facebook offered more money, but I wanted somebody who really understood why AI would be bigger than all these other things.' After the dinner, Hassabis got back to Larry Page. 'Let's go further,' he told him." — book excerpt from today's WSJ: wsj.com/tech/ai/deepmi…

Zuck's misplaced devotion to VR and the metaverse hurt the company much more than the $80 billion of wasted spend. It's the reputational hit. @DemisHassabis divined it in his final test, and Zuck didn't even know that he blew the opportunity. Eight years later, he renamed the company Meta, doubling down on what anyone with tech savvy knew was DOA. Then, in a 2025 attempt to play catchup, Zuck spent $14 billion on a data labelling company with a salesy leader and upended his AI team. Once again, anyone with tech savvy rolled their eyes at the acquisition and management changes, further evidence that the tech leadership at Meta was seriously lacking.

TL;DR: beware the metaverse. It is a dystopian vision at best, and luckily for humanity, headsets are still nowhere near readiness for mass adoption.
86 replies · 171 reposts · 2.1K likes · 756K views
Filip Morawiec reposted
Alex Clemmer 🔥🔥🔥😅🔥🔥🔥
The first time you hear about the JL lemma, it will seem too good to be true. And it is, kind of; I'll explain.

The idea is: if you have points in a large d-dimensional space, a RANDOM projection to a much smaller k-dimensional subspace will be "nearly optimal" "in the general case." Or, more specifically: with high probability, the pairwise distances between points are preserved, given a couple of other requirements on d and k.

So why don't we just use random projections instead of carefully constructed ones all the time? This is the most common misunderstanding of the JL lemma, and the one thing to really understand about it: on many (most?) datasets that are meaningful to humans, you actually CAN do better with something like PCA. If your dataset is pathological, e.g., the points all lie on a plane even though the space is technically 3-dimensional, then clearly some planes you project onto will be better than others. The JL lemma does not apply to 2 and 3 dimensions, but you can imagine this would be true in large numbers of dimensions too. (See screenshot 1; I hope you like it because I made it myself lol.)

If you know just those facts, you will be pretty well prepared to answer most questions about its use. Most of the papers Delip mentions do presuppose that you know this. At least when I was a student, I found this to be non-obvious.
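A quick empirical check of that claim (my own snippet; the sizes are illustrative): project Gaussian points from d = 10,000 down to k = 1,000 with a random matrix and look at how much the pairwise distances distort.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 10_000, 1_000
X = rng.standard_normal((n, d))
P = rng.standard_normal((d, k)) / np.sqrt(k)  # scaling preserves squared norms in expectation
Y = X @ P

def pairwise_dists(Z):
    """All pairwise Euclidean distances via the Gram-matrix identity."""
    sq = (Z ** 2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * Z @ Z.T
    return np.sqrt(np.maximum(D2[np.triu_indices(len(Z), k=1)], 0))

ratios = pairwise_dists(Y) / pairwise_dists(X)
print(ratios.min(), ratios.max())  # typically within a few percent of 1.0
```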
Delip Rao e/σ@deliprao

The Google turboquant paper is making ML folks in this decade discover the JL lemma and interact with math folks (which is cool). It appears more cool and mysterious if you do not read ML papers from the 90s and early 2000s :)

8 replies · 54 reposts · 630 likes · 82.2K views
Filip Morawiec reposted
Peter Holderrieth
Peter Holderrieth@peholderrieth·
🚀 MIT Flow Matching and Diffusion Lecture 2026 Released (diffusion.csail.mit.edu)!

We just released our new MIT 2026 course on flow matching and diffusion models! We teach the full stack of modern AI image, video, and protein generators, theory and practice. We include:

📺 Videos: step-by-step derivations
📝 Notes: mathematically self-contained lecture notes
💻 Coding: hands-on exercises for every component

We thoroughly improved last year's iteration and added new topics: latent spaces, diffusion transformers, and building language models with discrete diffusion models. Everything is available here: diffusion.csail.mit.edu

A huge thanks to Tommi Jaakkola for his support in making this class possible and to Ashay Athalye (MIT SOUL) for the incredible production! Was fun to do this with @RShprints!

#MachineLearning #GenerativeAI #MIT #DiffusionModels #AI
15 replies · 397 reposts · 2.2K likes · 527.5K views
Filip Morawiec reposted
Arnas Uselis
Arnas Uselis@a_uselis·
How do the embedding spaces of models that generalize from limited data look? We study what structure such models should exhibit. Turns out: linear and orthogonal. And modern embedding models like CLIP and SigLIP already show signs of it! 🧵 (1/n)
4 replies · 100 reposts · 714 likes · 77.2K views
Filip Morawiec reposted
Leonardo de Moura
Leonardo de Moura@Leonard41111588·
AI is writing a growing share of the world's software. No one is formally verifying any of it. New essay: "When AI Writes the World's Software, Who Verifies It?" leodemoura.github.io/blog/2026/02/2…
41 replies · 247 reposts · 1.6K likes · 422.5K views
Filip Morawiec
Filip Morawiec@gournge·
@TimDarcet I recently experimented with changing the residual stream to be transformed by orthogonal matrices, but it didn't improve performance. I also re-proved some theorem, but it didn't help. Could you please take a quick look? I am still learning. github.com/gournge/orthog…
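A rough sketch of the kind of construction this seems to describe (my guess, not the code in the linked repo): constrain a linear map applied to the residual stream to be orthogonal, so it preserves norms, using PyTorch's built-in orthogonal parametrization.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

d_model = 64
# The parametrization keeps the weight orthogonal throughout training.
residual_map = orthogonal(nn.Linear(d_model, d_model, bias=False))

x = torch.randn(8, d_model)
y = residual_map(x)                          # y = Q x with Q^T Q = I
print(x.norm(dim=-1)[0], y.norm(dim=-1)[0])  # norms match up to numerics
```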
0 replies · 0 reposts · 0 likes · 6 views
Filip Morawiec
Filip Morawiec@gournge·
@vr4300 How does your work relate to the work of @vtabbott_? I'm not an expert, but I saw him, for example, automatically derive FlashAttention through an algebraic/category-theoretic description.
1 reply · 0 reposts · 0 likes · 445 views
George Morgan
George Morgan@vr4300·
Applying category theory is the way we will push forward state-of-the-art artificial intelligence research.
61 replies · 190 reposts · 2.1K likes · 126.7K views
Filip Morawiec reposted
Alex Kontorovich
Alex Kontorovich@AlexKontorovich·
Real Analysis, The Game (v0.1) is DONE!!

44 Worlds, 138 Levels. All your old favorites like Bolzano-Weierstrass and Heine-Borel, Uniform Convergence and Riemann Sums, and the biggest Boss of all, the Intermediate Value Theorem! :)

Play the game here: adam.math.hhu.de/#/g/AlexKontor…
Follow along with all the lectures/videos/notes here: alexkontorovich.github.io/2025F311H/

Very relieved to have survived this semester; it was a tough one! Here's a sneak preview of where this project is headed next (math as a Scratch game... let's get 12-year-olds doing epsilon-delta proofs!)
46 replies · 227 reposts · 1.4K likes · 93.9K views
Filip Morawiec reposted
Arif Ahmad
Arif Ahmad@arif_ahmad_py·
We need more senior researchers camping out at their posters like this. Managed to catch 10 minutes of Alyosha turning @anand_bhattad's poster into a pop-up mini-lecture. Extra spark after he spotted @jathushan. Other folks in the audience: @HaoLi81 @konpatp @GurushaJuneja.
25 replies · 146 reposts · 1.4K likes · 202.3K views
Filip Morawiec reposted
Stijn Spanhove
Stijn Spanhove@stspanho·
🏀 I built this PoC using a single 2D broadcast video: applied pitch detection, player positioning, and movement tracking to reconstruct the play in 3D and map it onto a real basketball court in XR.
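A hypothetical sketch of one core step such a PoC likely needs (my guess; nothing here is from the actual project): estimate a homography from detected court landmarks, then map tracked player positions from broadcast pixels onto real court coordinates.

```python
import numpy as np
import cv2

# Court corners in meters (28 x 15 m FIBA court) and where a detector found them in the frame.
court_pts = np.array([[0, 0], [28, 0], [28, 15], [0, 15]], dtype=np.float32)
pixel_pts = np.array([[112, 540], [1180, 520], [990, 160], [300, 170]], dtype=np.float32)

H, _ = cv2.findHomography(pixel_pts, court_pts)

# A tracked player's feet in pixel coordinates -> (x, y) on the court plane.
feet_px = np.array([[[640.0, 400.0]]], dtype=np.float32)
feet_court = cv2.perspectiveTransform(feet_px, H)
print(feet_court)
```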
26 replies · 97 reposts · 731 likes · 36.8K views
Filip Morawiec reposted
Brent 📍SF
Brent 📍SF@BingBongBrent·
Today I'm launching Swipe, a new way to steer image models. The idea is simple: start with a prompt, then swipe left/right to steer the model toward what you're thinking of.
223 replies · 119 reposts · 4K likes · 288.1K views
Filip Morawiec reposted
Seohong Park
Seohong Park@seohong_park·
Introducing *dual representations*! tl;dr: We represent a state by the "set of similarities" to all other states. This dual perspective has lots of nice properties and practical benefits in RL. Blog post: seohong.me/blog/dual-repr… Paper: arxiv.org/abs/2510.06714
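A toy reading of the tl;dr (notation is mine, not the paper's): represent a state not by its feature vector but by its vector of similarities to a set of reference states.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, d = 100, 8
phi = rng.standard_normal((n_states, d))  # primal features, one row per state

def dual_repr(i, phi):
    """Dual representation of state i: its similarity to every reference state."""
    return phi @ phi[i]                   # dot-product similarity as a simple choice

z = dual_repr(3, phi)                     # an n_states-dim vector standing in for phi[3]
print(z.shape)                            # (100,)
```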
14 replies · 117 reposts · 936 likes · 174.3K views