Filip Morawiec
@gournge
111 posts
Joined February 2023
991 Following · 18 Followers

levi
levi@levidiamode·
Day 123/365 of GPU Programming

Another day, another attempt at understanding SSMs from first principles. Gaining real intuition for them (and their hardware implications) has been harder than expected, so today I'm taking a closer look at the foundational state space model papers (HiPPO, H3, the Mamba family, S4/S5, et cetera) to see if I can better understand their genealogy and the rationale behind their evolution. If anyone has specific blog posts or code that helped them get a better sense of the problem space, I'd love to know!
levi@levidiamode

Day 122/365 of GPU Programming

Continuing to learn about state space models (SSMs), especially the Mamba model family. I find them a bit more difficult to understand than Transformers, so I'm trying to build up a clearer picture progressively, starting from earlier related models like S4 and the motivations (in particular, hardware-related ones) behind their existence. Also insane how one person (Tri Dao) can be behind so many of the interesting systems papers of the last few years. Really inspirational.
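A minimal sketch of the backbone those papers share (toy dimensions and naming are mine, not from any of them): the discretized linear state-space recurrence x_k = A x_{k-1} + B u_k, y_k = C x_k. S4, H3, and Mamba differ mainly in how A is structured (e.g., HiPPO initialization) and in how this scan is computed efficiently on hardware.

```python
import numpy as np

# Toy discretized linear SSM: x_k = A x_{k-1} + B u_k, y_k = C x_k.
# Real models derive A, B by discretizing continuous parameters with a
# (possibly input-dependent, as in Mamba) step size.
d_state, seq_len = 4, 8
rng = np.random.default_rng(0)
A = np.eye(d_state) * 0.9              # state transition (stand-in for a structured/HiPPO matrix)
B = rng.standard_normal((d_state, 1))  # input projection
C = rng.standard_normal((1, d_state))  # output projection

u = rng.standard_normal(seq_len)       # scalar input sequence
x = np.zeros(d_state)                  # hidden state carried across time steps
ys = []
for k in range(seq_len):
    x = A @ x + (B * u[k]).ravel()     # sequential scan; S4 computes this as a convolution instead
    ys.append((C @ x).item())
print(ys)
```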

2 replies · 7 reposts · 88 likes · 10.2K views
Piotr Mazurek
Piotr Mazurek@tugot17·
Relearning some probabilistic ML fundamentals over the weekend; very different from standard LLM work, but quite important for robotics and world models.
1 reply · 0 reposts · 8 likes · 818 views
levi
levi@levidiamode·
Day 119/365 of GPU Programming

Lecture 3 of Stanford's CS336 (Language Modeling from Scratch) is such a gem. Instead of yet another lecture on transformers, it provides a deep dive into what has empirically been used in LLM land over the last two years in terms of architecture and hyperparameter choices. Surprising how many of the original Transformer decisions are still valid today, but also interesting to see that almost all modern language models use pre-norm, and how universally useful throwing in layernorms seems to be. It also goes into the activation zoo and why gated activations are popular these days, where rotary position embeddings came from, why almost always d_ff = 4*d_model, how the head_dim * num_heads / d_model ratio is often 1, what the most common vocabulary sizes are, why we apply weight decay to LLMs, and what stability tricks exist to make your loss curves less spiky. One of the most practical and up-to-date lectures on LLMs I've seen so far. Really recommend it to anyone who hasn't watched it yet.
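A toy illustration of several of those conventions (names and sizes are mine, not the lecture's): a pre-norm residual block around a SwiGLU-style gated MLP, with d_ff = 4*d_model and head_dim * num_heads = d_model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, num_heads = 512, 8
head_dim = d_model // num_heads   # head_dim * num_heads / d_model == 1
d_ff = 4 * d_model                # classic MLP width; gated variants often shrink this to ~8/3 * d_model

class GatedMLP(nn.Module):
    """SwiGLU-style gated activation: W_down(SiLU(W_gate x) * W_up x)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Pre-norm: normalize *before* the sublayer, then add the residual.
norm, mlp = nn.LayerNorm(d_model), GatedMLP(d_model, d_ff)
x = torch.randn(2, 16, d_model)
x = x + mlp(norm(x))              # x + sublayer(norm(x)), not norm(x + sublayer(x))
```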
levi@levidiamode

Day 118/365 of GPU Programming

Spent some time looking into GPU fabric and learned a little about how modern GPU clusters move data across NVLink, NVSwitch, PCIe, RDMA, and similar interconnects. My current understanding is that once a model no longer fits on one GPU, performance might be less about whether a single GPU can do some computation and more about how much data has to move between GPUs and how to avoid moving too much? In decode, a new query can be really small (e.g., just one token or a small number of tokens), but the KV cache can be huge because it stores the KVs for the entire previous context. And since online softmax can be computed in pieces, where each shard keeps only small local summary stats (the local max score, the local normalizer, and the locally weighted value sum), those summaries can be merged later into the same result as full attention. So it seems possible to compute exact long-context attention by leaving the KV cache distributed across the GPU fabric, sending small query blocks to the KV shards, and merging only tiny softmax stats instead of moving raw KV back to one GPU? Not sure, but maybe someone more knowledgeable can enlighten me on how this works exactly one day.
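A small self-contained check of the merging trick described above (my own snippet, single query, two shards): each shard returns only its local max, normalizer, and weighted value sum, and merging those reproduces exact full attention.

```python
import numpy as np

def shard_stats(q, K, V):
    """Online-softmax summary for one shard: (local max, local normalizer, weighted value sum)."""
    s = K @ q
    m = s.max()
    p = np.exp(s - m)
    return m, p.sum(), p @ V

def merge(a, b):
    """Combine two shard summaries by rescaling both to a shared max."""
    (ma, la, oa), (mb, lb, ob) = a, b
    m = max(ma, mb)
    return m, la * np.exp(ma - m) + lb * np.exp(mb - m), oa * np.exp(ma - m) + ob * np.exp(mb - m)

d, n = 16, 64
rng = np.random.default_rng(0)
q = rng.standard_normal(d)
K, V = rng.standard_normal((n, d)), rng.standard_normal((n, d))

# Reference: full attention computed in one place.
s = K @ q
p = np.exp(s - s.max())
full = (p @ V) / p.sum()

# Distributed: each half of the KV cache stays put; only tiny stats are merged.
m, l, o = merge(shard_stats(q, K[:32], V[:32]), shard_stats(q, K[32:], V[32:]))
assert np.allclose(o / l, full)
```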

3 replies · 20 reposts · 262 likes · 16.5K views
Filip Morawiec
Filip Morawiec@gournge·
@Jianfei_AI How does it compare to SOTA 1-step generation methods? The "Generative Modelling via Drifting" paper had strong performance using only one step, and it also had some experiments in the robotics domain.
0 replies · 0 reposts · 0 likes · 252 views
Jianfei Yang
Jianfei Yang@Jianfei_AI·
Excited to share a piece of work that I'm personally very proud of 👇

Our paper "Action-to-Action Flow Matching (A2A)" has been accepted to RSS-2026.

What's the idea? Instead of generating robot actions from random noise (slow), we start from past actions and directly map to the next one via flow matching.

Result:
⚡ single-step inference
⚡ great success rate
⚡ closer to real-world control speed

From diffusion-style "slow thinking" → to instant action. Very excited about this step toward execution-speed embodied intelligence.

🔗 Project page: lorenzo-0-0.github.io/A2A_Flow_Match…
🔗 Paper link: arxiv.org/abs/2602.07322
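A hypothetical sketch of the idea as stated in the tweet (architecture, names, and training details are my guesses, not taken from the A2A paper): learn a velocity field from the previous action to the next one, so inference becomes a single Euler step instead of a many-step denoising chain from noise.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Predicts the flow velocity given the current action point, observation, and time."""
    def __init__(self, action_dim: int, obs_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + obs_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, a, obs, t):
        return self.net(torch.cat([a, obs, t], dim=-1))

def one_step_inference(v_net, prev_action, obs):
    """Start from the past action (not random noise) and take one Euler step (dt = 1)."""
    t = torch.zeros(prev_action.shape[:-1] + (1,))
    return prev_action + v_net(prev_action, obs, t)

# Training would regress v on interpolants a_t = (1 - t) * a_prev + t * a_next
# with target velocity (a_next - a_prev), i.e., standard flow matching between
# the past-action distribution and the next-action distribution.
```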
4 replies · 48 reposts · 366 likes · 29.2K views
Filip Morawiec reposted
Afshine Amidi
Afshine Amidi@afshinea·
Our new Stanford class "CME 296: Diffusion & Large Vision Models" is now available on YouTube!
11 replies · 180 reposts · 1.6K likes · 59.7K views
Filip Morawiec reposted
Vincent Abbott
Vincent Abbott@vtabbott_·
ICLR poster for FlashAttention on a Napkin with @GioeleZardini! See y'all in Brazil 🇧🇷🇧🇷🇧🇷
2 replies · 4 reposts · 34 likes · 2.4K views
Filip Morawiec reposted
Steve Jurvetson
Steve Jurvetson@FutureJurvetson·
Subtext: how Zuck's obsession with VR lost him AI leadership and "the greatest deal Google ever made."

"if Facebook didn't buy DeepMind, they would end up in the arms of Google. Hassabis came out to the West Coast to have lunch with Larry Page, still the strongest suitor. Zuckerberg got wind of his visit and invited him to dinner. Arriving at Zuckerberg's Palo Alto home, Hassabis administered a subtle test on him. The two men discussed the potential of AI, and Zuckerberg expressed appropriate excitement. But then, as the dinner continued, Hassabis brought up other hot technologies: virtual reality, augmented reality, 3-D printing. Zuckerberg sounded equally excited about all of them. 'That told me what I needed to know,' Hassabis said. 'Facebook offered more money, but I wanted somebody who really understood why AI would be bigger than all these other things.' After the dinner, Hassabis got back to Larry Page. 'Let's go further,' he told him." — book excerpt from today's WSJ: wsj.com/tech/ai/deepmi…

Zuck's misplaced devotion to VR and the metaverse hurt the company much more than the $80 billion of wasted spend. It's the reputational hit. @DemisHassabis divined it in his final test, and Zuck didn't even know that he blew the opportunity. Eight years later, he renamed the company Meta, doubling down on what anyone with tech savvy knew was DOA. Then, in a 2025 attempt to play catchup, Zuck spent $14 billion on a data labelling company with a salesy leader and upended his AI team. Once again, anyone with tech savvy rolled their eyes at the acquisition and management changes, further evidence that the tech leadership at Meta was seriously lacking.

TL;DR: beware the metaverse. It is a dystopian vision at best, and luckily for humanity, headsets are still nowhere near readiness for mass adoption.
86 replies · 171 reposts · 2.1K likes · 756K views
Filip Morawiec reposted
Alex Clemmer 🔥🔥🔥😅🔥🔥🔥
The first time you hear about the JL lemma, it will seem too good to be true. And it is, kind of; I'll explain.

The idea is: if you have points in a large d-dimensional space, a RANDOM projection to a much smaller k-dimensional subspace will be "nearly optimal" "in the general case." Or, more specifically: with high probability, the pairwise distances between points are preserved, given a couple of other requirements on d and k.

So why don't we just use random projections instead of carefully constructed ones all the time? This is the most common misunderstanding of the JL lemma, and the one thing to really understand about it: on many (most?) datasets that are meaningful to humans, you actually CAN do better with something like PCA. If your dataset is pathological, e.g., the points all lie on a plane even though the space is technically 3-dimensional, then clearly some planes you project onto will be better than others. The JL lemma does not apply to 2 and 3 dimensions, but you can imagine this would be true in large numbers of dimensions too. (See screenshot 1; I hope you like it because I made it myself lol.)

If you know just those facts, you will be pretty well prepared to answer most questions about its use. Most of the papers Delip mentions do presuppose that you know this. At least when I was a student, I found this to be non-obvious.
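A quick empirical check of that claim (my own snippet; the sizes are illustrative): project Gaussian points from d = 10,000 down to k = 1,000 with a random matrix and look at how much the pairwise distances distort.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 10_000, 1_000
X = rng.standard_normal((n, d))
P = rng.standard_normal((d, k)) / np.sqrt(k)  # scaling preserves squared norms in expectation
Y = X @ P

def pairwise_dists(Z):
    """All pairwise Euclidean distances via the Gram-matrix identity."""
    sq = (Z ** 2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * Z @ Z.T
    return np.sqrt(np.maximum(D2[np.triu_indices(len(Z), k=1)], 0))

ratios = pairwise_dists(Y) / pairwise_dists(X)
print(ratios.min(), ratios.max())  # typically within a few percent of 1.0
```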
Delip Rao e/σ@deliprao

The Google turboquant paper is making ML folks in this decade discover the JL lemma and interact with math folks (which is cool). It appears more cool and mysterious if you do not read ML papers from the 90s and early 2000s :)

8 replies · 54 reposts · 630 likes · 82.2K views
Filip Morawiec reposted
Peter Holderrieth
Peter Holderrieth@peholderrieth·
🚀 MIT Flow Matching and Diffusion Lecture 2026 Released (diffusion.csail.mit.edu)!

We just released our new MIT 2026 course on flow matching and diffusion models! We teach the full stack of modern AI image, video, and protein generators, theory and practice. We include:

📺 Videos: step-by-step derivations
📝 Notes: mathematically self-contained lecture notes
💻 Coding: hands-on exercises for every component

We thoroughly improved last year's iteration and added new topics: latent spaces, diffusion transformers, and building language models with discrete diffusion models. Everything is available here: diffusion.csail.mit.edu

A huge thanks to Tommi Jaakkola for his support in making this class possible and to Ashay Athalye (MIT SOUL) for the incredible production! Was fun to do this with @RShprints!

#MachineLearning #GenerativeAI #MIT #DiffusionModels #AI
15 replies · 397 reposts · 2.2K likes · 527.5K views
Filip Morawiec reposted
Arnas Uselis
Arnas Uselis@a_uselis·
How do the embedding spaces of models that generalize from limited data look? We study what structure such models should exhibit. Turns out: linear and orthogonal. And modern embedding models like CLIP and SigLIP already show signs of it! 🧵 (1/n)
4 replies · 100 reposts · 714 likes · 77.2K views
Filip Morawiec reposted
Leonardo de Moura
Leonardo de Moura@Leonard41111588·
AI is writing a growing share of the world's software. No one is formally verifying any of it. New essay: "When AI Writes the World's Software, Who Verifies It?" leodemoura.github.io/blog/2026/02/2…
41 replies · 247 reposts · 1.6K likes · 422.5K views
Filip Morawiec
Filip Morawiec@gournge·
@TimDarcet I recently experimented with changing the residual stream to be transformed by orthogonal matrices, but it didn't improve performance. I also re-proved some theorem, but it didn't help. Could you please take a quick look? I am still learning. github.com/gournge/orthog…
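A rough sketch of the kind of construction this seems to describe (my guess, not the code in the linked repo): constrain a linear map applied to the residual stream to be orthogonal, so it preserves norms, using PyTorch's built-in orthogonal parametrization.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

d_model = 64
# The parametrization keeps the weight orthogonal throughout training.
residual_map = orthogonal(nn.Linear(d_model, d_model, bias=False))

x = torch.randn(8, d_model)
y = residual_map(x)                          # y = Q x with Q^T Q = I
print(x.norm(dim=-1)[0], y.norm(dim=-1)[0])  # norms match up to numerics
```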
0 replies · 0 reposts · 0 likes · 6 views
Filip Morawiec
Filip Morawiec@gournge·
@vr4300 How does your work relate to the work of @vtabbott_? I'm not an expert, but I saw him, for example, automatically derive FlashAttention through an algebraic/category-theoretic description.
1 reply · 0 reposts · 0 likes · 445 views
George Morgan
George Morgan@vr4300·
Applying category theory is the way we will push forward state-of-the-art artificial intelligence research.
61 replies · 190 reposts · 2.1K likes · 126.7K views
Filip Morawiec reposted
Alex Kontorovich
Alex Kontorovich@AlexKontorovich·
Real Analysis, The Game (v0.1) is DONE!!

44 Worlds, 138 Levels. All your old favorites like Bolzano-Weierstrass and Heine-Borel, Uniform Convergence and Riemann Sums, and the biggest Boss of all, the Intermediate Value Theorem! :)

Play the game here: adam.math.hhu.de/#/g/AlexKontor…
Follow along with all the lectures/videos/notes here: alexkontorovich.github.io/2025F311H/

Very relieved to have survived this semester; it was a tough one! Here's a sneak preview of where this project is headed next (math as a Scratch game... let's get 12-year-olds doing epsilon-delta proofs!)
46 replies · 227 reposts · 1.4K likes · 93.9K views
Filip Morawiec reposted
Arif Ahmad
Arif Ahmad@arif_ahmad_py·
We need more senior researchers camping out at their posters like this. Managed to catch 10 minutes of Alyosha turning @anand_bhattad's poster into a pop-up mini-lecture. Extra spark after he spotted @jathushan. Other folks in the audience: @HaoLi81 @konpatp @GurushaJuneja.
25 replies · 146 reposts · 1.4K likes · 202.3K views
Filip Morawiec reposted
Stijn Spanhove
Stijn Spanhove@stspanho·
🏀 I built this PoC using a single 2D broadcast video: applied pitch detection, player positioning, and movement tracking to reconstruct the play in 3D and map it onto a real basketball court in XR.
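A hypothetical sketch of one core step such a PoC likely needs (my guess; nothing here is from the actual project): estimate a homography from detected court landmarks, then map tracked player positions from broadcast pixels onto real court coordinates.

```python
import numpy as np
import cv2

# Court corners in meters (28 x 15 m FIBA court) and where a detector found them in the frame.
court_pts = np.array([[0, 0], [28, 0], [28, 15], [0, 15]], dtype=np.float32)
pixel_pts = np.array([[112, 540], [1180, 520], [990, 160], [300, 170]], dtype=np.float32)

H, _ = cv2.findHomography(pixel_pts, court_pts)

# A tracked player's feet in pixel coordinates -> (x, y) on the court plane.
feet_px = np.array([[[640.0, 400.0]]], dtype=np.float32)
feet_court = cv2.perspectiveTransform(feet_px, H)
print(feet_court)
```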
26 replies · 97 reposts · 731 likes · 36.8K views
Filip Morawiec reposted
Brent 📍SF
Brent 📍SF@BingBongBrent·
Today I'm launching Swipe, a new way to steer image models. The idea is simple: start with a prompt, then swipe left/right to steer the model toward what you're thinking of.
223 replies · 119 reposts · 4K likes · 288.1K views
Filip Morawiec reposted
Seohong Park
Seohong Park@seohong_park·
Introducing *dual representations*! tl;dr: We represent a state by the "set of similarities" to all other states. This dual perspective has lots of nice properties and practical benefits in RL. Blog post: seohong.me/blog/dual-repr… Paper: arxiv.org/abs/2510.06714
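A toy reading of the tl;dr (notation is mine, not the paper's): represent a state not by its feature vector but by its vector of similarities to a set of reference states.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, d = 100, 8
phi = rng.standard_normal((n_states, d))  # primal features, one row per state

def dual_repr(i, phi):
    """Dual representation of state i: its similarity to every reference state."""
    return phi @ phi[i]                   # dot-product similarity as a simple choice

z = dual_repr(3, phi)                     # an n_states-dim vector standing in for phi[3]
print(z.shape)                            # (100,)
```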
14 replies · 117 reposts · 936 likes · 174.3K views