Oscar Davis
@osclsd
Research Intern @Apple MLR; PhD ML @UniofOxford; generative modelling; previously at @MSFTResearch, @EPFL, @imperialcollege


1/ Non-autoregressive language models promised massive parallel speedups, but aggressive decoding always led to catastrophic quality collapse. Until now. By replacing rigid discrete token choices with soft continuous trajectories, we can now decode >5x faster. 🧵



🏎️Drift in the right direction🏎️ Introducing kernel-gradient drifting models: a reformulation of drifting models where the kernel itself defines the direction of motion through its gradient. 📜Paper: arxiv.org/pdf/2605.10727 💾Notebook: tinyurl.com/mv2jhuky

We were all wondering whether Categorical Flow Maps (CFMs) could scale... 🤔 I couldn't help trying it out... So we scaled CFMs to 1.7B parameters over 2.1T tokens 🚀🔥 Short summary 🧵⬇️


🚨 Before concluding: As noted by @Sam_Acqua and many others, we all ought to be very skeptical of Gen PPL as a metric, especially in isolation. ❌ It is actually a bit crazy that we have been using it for so long. Hence the additional metrics and the several qualitative samples in the appendix. Please have a look yourselves to get a better understanding of the sample quality! 🔍 There are comparisons across SD/non-SD, number of NFEs, and others.




