Pritam
@Pritamstudyai
1.6K posts

kernel
Joined October 2024
1.2K Following · 254 Followers
Pritam @Pritamstudyai
@eigenron Cool, what's your top 3 book recs of any genre?
1 reply · 0 reposts · 0 likes · 19 views
eigenron @eigenron
@Pritamstudyai all my books are still in ohio which i'm getting shipped soon, so i'm just trying to buy whatever i can in the meantime lol.
1 reply · 0 reposts · 1 like · 33 views
eigenron @eigenron
good times lie ahead.
eigenron tweet media
2 replies · 0 reposts · 46 likes · 1.4K views
Yuchen Jin @Yuchenj_UW
The highest-paying jobs today may be first in line for AI. Take GPU kernel engineers: a golden ticket to OpenAI or Anthropic, million-dollar offers. Now people are rushing to learn it. It takes ~1 year to get decent. But will Claude Opus 4.9 or GPT-6 beat you, or even the best kernel writers in the world? It’s already happening.
88 replies · 35 reposts · 983 likes · 107.3K views
Pritam retweeted
Dan Alistarh @DAlistarh
Speedrunning GPT-2 is now routine thanks to @karpathy. But can we speedrun GPT3-175B? We attempted to match accuracy on a <$10K budget; while we didn't quite reach it, our first results show that quality data, engineering, and native FP4 can get close. Details in 🧵
Dan Alistarh tweet media
4 replies · 21 reposts · 167 likes · 11.9K views
Bo Wang @BoWang87
We're launching an AI in Residence program at @Xaira_Thera! 6-12 months. Own real projects. Work alongside the team building X-Cell — our virtual cell model trained on the largest, most context-diverse genome-wide perturbation dataset ever reported. If you're a recent MS or PhD grad who wants to work at the actual frontier of ML + drug discovery, this is it. Applications open now: job-boards.greenhouse.io/xairatherapeut…
Bo Wang tweet media
13 replies · 30 reposts · 206 likes · 16.6K views
Pritam retweeted
chuyi shang @chuyishang
Wrote a deep dive on implementing a language model from scratch in JAX and scaling it with distributed training! If you’re coming from PyTorch and want to see how the same ideas look in JAX, or just want a hands-on intro to distributed training, check out this blog post: chuyishang.com/blog/2026/jax-… Comes with code + an assignment and test cases so you can follow along!
chuyi shang tweet media
9 replies · 65 reposts · 602 likes · 31K views
Pritam @Pritamstudyai
@kmeanskaran And how much do they pay for it? Small businesses?
1 reply · 0 reposts · 0 likes · 284 views
Karan🧋 @kmeanskaran
@Pritamstudyai Yes, especially. Using AI I am making websites and SEO strategies for local Indian businesses. Justdial in India is not that relevant nowadays; everyone wants to get on Google Maps, which requires a Google Business Profile and a website. Check this: visionpolymers.com
1 reply · 0 reposts · 11 likes · 10.5K views
Karan🧋 @kmeanskaran
Using Claude you can become a money-printing machine.
> get a Claude subscription
> go on Maps, LinkedIn, Google Business
> search logistics, e-commerce, ed-tech companies
> see what they are doing
> tell them you can do it better and faster
> automate their minor tasks like filling spreadsheets, optimising expenses, creating scripts for courses
> charge them much less than their engineering cost
> deliver projects on strict deadlines
> find 3-4 good-paying recurring clients
This is not too late; actually, most local businesses don't even know about Claude yet. Use this early leverage.
Claude @claudeai

You can now enable Claude to use your computer to complete tasks. It opens your apps, navigates your browser, fills in spreadsheets—anything you'd do sitting at your desk. Research preview in Claude Cowork and Claude Code, macOS only.

51 replies · 160 reposts · 2.5K likes · 234.2K views
Pritam retweeted
Sayak Paul @RisingSayak
Last year, I got to collaborate on a number of serious projects at the intersection of Diffusers x optimization ⚡️

First, NONE of them were bootstrapped with any AI agents but pure domain knowledge and expertise. So, besides just feeling good, it's also very reassuring to me to know how important those two traits are.

Now, coming to the projects that I think are worth mentioning:

* `flux-fast`: Showing a combination of `torch.compile` + unscaled FP8 FA3 + no CPU-GPU sync + dynamic FP8 is great for accelerating Flux.1-*. github.com/huggingface/fl…
* `torch.compile` x Diffusers: What does it take to get the most out of `torch.compile` in Diffusers across different user workloads? pytorch.org/blog/torch-com…
* `lora-fast`: How to hotswap LoRAs into compiled models without incurring (slow) recompilation issues? How to set it up for success? github.com/huggingface/lo…
* `zerogpu-brrr`: How to optimize a ZeroGPU HF Space with AOT + FA3 and other goodies? This helps save 💰 and improve the user experience of your ZeroGPU applications. huggingface.co/blog/zerogpu-a…

Hopefully, this will make you realize there's still a LOT that you can do (preferably pairing with AI) if you're curious and deeply invested in stuff you care about.
3 replies · 11 reposts · 76 likes · 13.7K views
Pritam @Pritamstudyai
TransformerLens mech interp resources
Pritam tweet media
1 reply · 0 reposts · 0 likes · 56 views
Pritam @Pritamstudyai
@0xlelouch_ Dude, it's just copy-pasta from AI. Ugh.
0 replies · 0 reposts · 1 like · 331 views
Abhishek Singh @0xlelouch_
Reviewed the resume of a 2023 CS graduate currently working as a Software Engineer II at a real estate marketplace, with experience across backend, infra, blockchain, AI agents, and even a short technical head stint. Here is my hiring manager + recruiter view of it:

The good:
- Strong early-career trajectory. Going from intern to Software Engineer II this quickly gets attention.
- The resume shows real ownership, not just task execution.
- Good use of impact in some bullets, like reducing API response time by almost 50%.
- Redis caching, RabbitMQ, AWS infra, aggregation pipelines, CDN, Nginx, workers, microservices: these are real engineering things, not tutorial fluff.
- Mentoring 3 engineers and 2 interns is a very strong signal for someone this early in their career.
- The RealX experience feels valuable because it mixes product work, backend optimization, infra ownership, and business-facing systems.
- The Lapicart section also helps because it shows speed, messy execution environments, and business problem solving.
Overall, this person looks employable and useful. Not just "knows MERN".

The bad:
- The skills section is too crowded. It tries to say everything and ends up saying very little. GenAI, AI Agents, LangGraph, blockchain, microservices, smart contracts, HLD, design patterns, CDN, Vim, load balancer all in one skills block feels overloaded. A recruiter will think: what is this person actually strongest at?
- Some bullets are too long and hard to scan. Resume bullets should land fast. A few here feel like explanation, not impact.
- A few claims need tighter wording. For example, "pioneered blockchain adoption" sounds grand, but what was the actual engineering/business outcome?
- There is inconsistency in the quality of bullets. Some are strong and measurable, others sound generic.
- Internship bullets are weak compared to the rest. They take space but do not add much proof.
- The awards and certificates section is not adding much weight at this stage. Work experience is much stronger than that section.

The ugly:
- The resume has an identity problem. Is this person a backend engineer? A platform engineer? A full stack engineer? An AI engineer? A blockchain engineer? Right now it says all of them a little bit, and that reduces clarity. Good resumes do not just show range. They show positioning.
- There are too many buzzword-adjacent terms that may trigger skepticism in senior reviewers. If you say AI agent, RAG, memory, checkpointing, blockchain, tokenization, RabbitMQ, CDN, Nginx, AWS, microservices all together, people start looking for what is real depth and what is surface familiarity.
- The strongest parts of the resume are infra, backend, optimization, platform ownership, and execution under business constraints. That story should dominate harder. Right now the resume sometimes reads like "here is everything I have touched" instead of "here is why you should hire me for this role".

If I were hiring: I would shortlist this person for backend or full stack product engineering roles, especially in startups or product teams where ownership matters. I would be interested because there is evidence of shipping, scaling, mentoring, and working close to business. But I would also test deeply in interviews, because the resume covers many domains and I would want to know where the real depth is.

If I were recruiting: I would advise this person to position the resume around one core story: a backend/product engineer who improves systems, reduces cost, and ships business-critical features. That is much sharper than trying to look like 5 different engineers at once.

Final verdict: This is a strong resume. Better than most 2-3 year experience resumes I usually see. But it can become much stronger with sharper positioning, tighter bullets, and less skill-section inflation.

Big resume lesson for engineers: Broad exposure gets attention. Clear positioning gets interviews. Measurable impact gets offers.
Vishal Dorge @VishalDorge

@0xlelouch_ I am a founding engineer at RealX, so I have worked on their entire investment product from scratch. Not sure how to say that without looking braggy.

3 replies · 18 reposts · 182 likes · 31.7K views
Pritam retweeted
0xSero @0xSero
In 72 hours I got over $100k of value:
1. Lambda gave me $5,000 in compute credits
2. Nvidia offered me 8x H100s on the cloud ($20/h); idk for how long, but assuming 2 weeks that'd be ~$5,000
3. TNG Technology offered me 2 weeks of B200s, which is something like $12,000 in compute
4. A kind person offered me $100k in GCP credits (enough to train a 27B if you do it right)
5. Framework offered to mail me a desktop computer
6. We got $14,000 in donations, which will go to buying 2x RTX Pro 6000s (bringing me up to 384GB VRAM)
7. I got over 6M impressions, which based on my RPM would be $1,500 over my usual ~$500 per pay period
8. I have gained ~17,000 followers, more than doubling my follower count
9. 17 subscribers on X + 700 on YouTube
The total value of all this approaches at minimum ~$50,000, and closer to $150,000 if I leverage it all.

What I'll be doing with all this:

Eric is an incredibly driven researcher I have been bouncing ideas off of over the last month. He and I have been tackling the idea of getting massive models to fit on relatively cheap memory. The idea is taking advantage of different forms of memory, in combination with expert saliency scoring, to offload specific expert groupings to different memory tiers. For the MoEs I've tested over my entire AI session history, about 37.5% of the model is responsible for 95% of token routing. So we can offload 62.5% of an LLM onto SSD/NVMe/CPU/cheap VRAM; this should theoretically add minimal latency if we can select the right experts. We can combine this with paged swapping to further accelerate prompt processing; if done right, we are looking at very decent performance for massive unquantised and unpruned LLMs. You can get DeepSeek-v3.2-speciale at full intelligence with decent tokens/s as long as you have enough VRAM to host the core 20-40% of the model and enough RAM or SSD to host the rest. Add quantisation to the mix and you can basically have decent speeds and intelligence with just 5-10% of the model's size in VRAM (plus some for context). The funds will be used to push this to its limits.

There's also tons of research showing you can quantise a model drastically, then distill from the original BF16 or make a LoRA to align it mostly back to the original. This will be added to the pipeline too.

All this will be built out here: github.com/0xSero/moe-com… You will be able to take any MoE and shove it in here, and with only 24GB of VRAM and enough RAM/NVMe, compress it down. It'll be slow as hell, but it will work with little tinkering.

Lastly, I will be looking into either a full training run from scratch, or just post-training on an open AMERICAN base model:
- a research model
- an openclaw/nanoclaw/hermes model
- a browser-use model
to prove that this can be done.

I will be bad at all of it, and doubt I will get beyond the best small models from 6 months ago, but I want to prove it's no boogeyman impossible task to everyone who says otherwise.

By the end of the year:
1. I will have 1 model I trained in some capacity be in the top 5 at either pinchbench, browseruse, or research.
2. My GitHub will have a master repo which combines all my work into reusable, generalised scripts to help you do the same.
3. The largest public comparative dataset for all MoE quantisations, prunes, benchmarks, costs, hardware requirements.

A lot of this will be led by Eric, who I will tag in the next post.
I want to say thank you to everyone who has supported me. I have gotten a lot of comments stating:
1. I'm crazy, stupid, or both
2. I'm wasting my time, no one cares about this
3. This is not a real issue
I believe the amount of interest and support I've received says it all. donate.sybilsolutions.ai
0xSero tweet media
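As a rough illustration of the expert-offloading idea described in the post above, here is a toy sketch (my own construction, not the linked moe-com repo): score experts by how often the router picked them in logged traffic, keep the hottest slice in VRAM, and push the long tail to CPU RAM or NVMe. The function names and tier split are assumptions.

```python
from collections import Counter

def assign_memory_tiers(routing_log, vram_frac=0.375):
    """routing_log: list of expert ids chosen by the router across logged requests.
    Keep the most frequently routed experts (the 'salient' slice that handles most
    tokens) in VRAM; offload the rest to cheaper tiers. Thresholds are assumptions."""
    counts = Counter(routing_log)
    ranked = [expert for expert, _ in counts.most_common()]  # hottest experts first
    n_vram = max(1, int(len(ranked) * vram_frac))
    tiers = {}
    for rank, expert in enumerate(ranked):
        if rank < n_vram:
            tiers[expert] = "vram"        # core experts that carry most routing
        elif rank < 2 * n_vram:
            tiers[expert] = "cpu_ram"     # warm tier, paged in on demand
        else:
            tiers[expert] = "nvme"        # cold tail
    return tiers
```

In a real system the per-expert counts would come from router statistics gathered over representative traffic, and the tier boundaries would depend on the actual VRAM/RAM/NVMe budget rather than fixed fractions.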
223 replies · 273 reposts · 4.1K likes · 166.3K views
Pritam retweeted
Zhuokai Zhao @zhuokaiz
I wish someone had told me this when I started digging into diffusion language models (dLLMs) from an LLM post-training background. I've spent the last few weeks reading across both the dLLM RL literature (d1, EGSPO, MDPO, LLaDA 1.5) and the older robotics literature on diffusion policies + RL (DPPO, Diffusion-QL, and follow-up work). What surprised me most wasn't the algorithms themselves — it was realizing that the robotics community had already worked through several of the same problems the dLLM community is hitting now.

The robotics insight — structured exploration — doesn't transfer to discrete dLLMs as directly as I initially thought, but the broader lesson does. The multi-step denoising process isn't just an expensive way to generate tokens. It gives RL tools that autoregressive models don't have — intermediate evaluations, entropy signals, a natural coarse-to-fine hierarchy — and understanding how to use (and not break) these tools is probably one of the key challenges.

This post is me organizing what I've learned — how RL post-training works (or doesn't) with diffusion language models, what carries over from the autoregressive world, what's genuinely new, and where I'm still confused.

A Quick Intro to How dLLMs Generate

Autoregressive LLMs generate left-to-right, one token at a time, and each token choice is irreversible during generation. The probability of a sequence factorizes as a product of conditional distributions: p(x₁)·p(x₂|x₁)·p(x₃|x₁,x₂)·…

Diffusion language models generate through iterative denoising. The mainstream approach right now — masked diffusion (LLaDA, Dream, MDLM) — starts with the entire response masked, then over T denoising steps, progressively unmasks tokens. At each step, the model predicts all masked positions simultaneously using bidirectional attention, and selectively reveals the most confident predictions. The process repeats until all tokens are unmasked.

Properties of this process matter a lot for RL:

(a) No fixed generation order. Tokens can be revealed in any order — high-confidence tokens first, uncertain ones later. This means the model can lay down the skeleton of a response early and refine details later. Think of it as coarse-to-fine generation rather than left-to-right.

(b) Complete generations at every intermediate step. Unlike autoregressive models where you have a partial sequence mid-generation, a dLLM produces a full (noisy) output at every denoising step. This turns out to be very useful for RL — you can evaluate intermediate states cheaply.

(c) No cheap exact autoregressive-style sequence log-probability. Autoregressive models give you log p(sequence) for free via the chain rule. dLLMs don't have an equally convenient sequence-level factorization for standard RL objectives, so exact likelihood-style updates become awkward and expensive. Practical methods usually rely on approximations, surrogates, or stepwise reformulations. This is one of the core obstacles for applying standard RL algorithms directly.

The field has moved fast over the last year or so. Notable models include LLaDA 8B (trained from scratch, reported by its authors as competitive with LLaMA 3 8B), Dream 7B (adapted from Qwen2.5, notably strong on planning tasks), Mercury 2 (Inception, focused on inference speed), and LLaDA 2.0 (scaled to 100B).
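To make the unmasking loop just described concrete, here is a minimal sketch of confidence-based iterative unmasking. It assumes a toy interface where `model(tokens)` returns per-position logits and `MASK_ID` marks masked slots; none of this is LLaDA's or Dream's actual API, and the reveal schedule is a simplification.

```python
import torch

# Minimal sketch of confidence-based unmasking for a masked-diffusion LM.
# Assumed interface (not any specific library's API): model(tokens) returns
# per-position logits of shape [seq_len, vocab]; MASK_ID marks masked slots.
MASK_ID = 0

@torch.no_grad()
def denoise(model, seq_len, num_steps):
    tokens = torch.full((seq_len,), MASK_ID, dtype=torch.long)   # start fully masked
    for step in range(num_steps):
        still_masked = tokens == MASK_ID
        if not still_masked.any():
            break
        logits = model(tokens)                        # predict all positions at once
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)                # per-position confidence + argmax token
        conf = conf.masked_fill(~still_masked, -1.0)  # only consider masked positions
        # Reveal the k most confident masked positions this step (simple linear schedule).
        remaining = int(still_masked.sum())
        k = max(1, remaining // (num_steps - step))
        idx = conf.topk(k).indices
        tokens[idx] = pred[idx]
    return tokens
```

The key property for the RL discussion below is visible in the loop: at every step `tokens` is a complete (if partially masked) sequence, so intermediate states can be scored cheaply.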
Where the Standard RL Pipeline Breaks

The standard RL post-training pipeline for autoregressive models is straightforward:
1. Sample a response.
2. Get a reward.
3. Compute the log-probability of the response under the current policy.
4. Estimate the advantage.
5. Update with a policy gradient.
The log-probability computation is trivial since you just sum per-token log-probs from the forward pass.

With dLLMs, this pipeline breaks at step 3. You can sample responses and get rewards just fine. But you can't recover an exact autoregressive-style response log-probability with the same convenience, because there's no left-to-right chain-rule factorization. So RL methods that rely on likelihood ratios or preference-style likelihood comparisons (PPO, GRPO, DPO-style objectives) need some workaround.

So far, a few approaches have emerged.

(a) Mean-field approximation (d1 / diffu-GRPO). Since exact autoregressive-style sequence likelihood is unavailable in a convenient form, approximate it by treating token positions more independently and summing per-token terms — similar in spirit to autoregressive likelihood computation, but ignoring some within-step dependencies. This is cheap and works surprisingly well in practice, but it is still an approximation, especially in early denoising steps where token predictions can be strongly correlated.

(b) ELBO-based estimates with variance reduction (LLaDA 1.5 / VRPO). Instead of computing the exact likelihood, these approaches use a tractable surrogate based on the ELBO, which is already central to diffusion-model training. The problem is that these estimates can be noisy — high variance makes preference-style updates unstable. LLaDA 1.5's key contribution is VRPO, which analyzes this variance explicitly and introduces variance-reduction techniques that make this route much more practical.

(c) Treat denoising as an MDP (EGSPO, MDPO, DiFFPO). This is the approach most analogous to DPPO in robotics. Formulate the T-step denoising process as a finite-horizon MDP where state = the current partially denoised sequence, action = the denoising decision at that step, and reward = often sparse at the end, though some methods also use intermediate rewards. Each denoising step has tractable local transition probabilities. Then apply policy gradient across the denoising chain.
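Here is a rough sketch of the mean-field idea in (a), under my own simplifying assumptions rather than d1's exact recipe: re-mask the sampled response, run one bidirectional forward pass, and sum per-token log-probabilities of the response tokens as a surrogate sequence log-prob that can feed a GRPO-style ratio. `model(tokens)` and `mask_id` are assumed placeholders.

```python
import torch
import torch.nn.functional as F

def mean_field_logprob(model, prompt, response, mask_id):
    """Surrogate log p(response | prompt): mask the whole response, one forward
    pass, sum per-token log-probs at the response positions. This ignores
    within-step dependencies, which is exactly the approximation discussed above."""
    masked = torch.cat([prompt, torch.full_like(response, mask_id)])
    logits = model(masked)                     # [len(prompt) + len(response), vocab]
    resp_logits = logits[len(prompt):]         # logits at the response positions
    logp = F.log_softmax(resp_logits, dim=-1)
    return logp.gather(-1, response.unsqueeze(-1)).sum()  # sum of per-token log-probs

def importance_ratio(policy, old_policy, prompt, response, mask_id):
    # Ratio between the current policy and the snapshot that sampled the response,
    # built from the surrogate log-probs above (a stand-in for the exact likelihood).
    return torch.exp(
        mean_field_logprob(policy, prompt, response, mask_id)
        - mean_field_logprob(old_policy, prompt, response, mask_id)
    )
```

The appeal is that one forward pass per response is roughly the cost profile of the autoregressive case; the cost is that the surrogate is least faithful exactly where token predictions are most correlated.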
A Parallel Story from Robotics

In robotics, from-scratch online RL for diffusion policies has proven challenging, and often unstable or sample-inefficient enough to motivate alternatives and architectural workarounds. But in the fine-tuning regime — pretrain a diffusion policy from demonstrations, then improve with RL — the results are much better. DPPO reports strong gains over alternative fine-tuning baselines, including standard Gaussian PPO-style policies, especially in sim-to-real transfer. On the Furniture-Bench assembly task, DPPO achieves 80% real-robot success zero-shot from simulation, while a Gaussian PPO baseline achieves 88% in simulation and 0% on hardware.

The explanation offered by this line of work is structured, on-manifold exploration. In continuous action spaces, a pretrained diffusion policy denoises noisy actions back toward the data manifold. Each denoising step adds stochasticity (exploration) while also restoring structure, so the exploration stays in the neighborhood of plausible behavior rather than scattering across the full action space. This is why RL fine-tuning works despite the long denoising horizon — most sampled trajectories are still "reasonable," so even coarse credit assignment can produce useful gradients.

Now, this specific geometric mechanism doesn't transfer cleanly to dLLMs. In masked diffusion, the "actions" are discrete token predictions, not continuous vectors. There's no continuous score field pulling tokens back toward a manifold in the same way. But the broader principle does transfer — the denoising process is sequential structure that RL can exploit.

What the Denoising Structure Gives dLLM RL

The denoising chain gives dLLM RL methods specific tools that don't exist in the autoregressive setting.

(a) Iterative self-correction. dLLMs can revise tokens across denoising steps. d1 observed "aha moments" — the model initially commits to a wrong reasoning path, then during later denoising steps, corrects itself. Autoregressive models can do chain-of-thought, but they can't go back and change earlier tokens. For RL, this means the policy has a built-in error-correction mechanism that RL doesn't need to learn from scratch.

(b) Free intermediate evaluations. Because dLLMs produce complete outputs at every denoising step, you can evaluate quality at intermediate steps without extra rollouts. MDPO exploits this directly — it checks whether the answer is correct at each denoising step and uses these intermediate rewards for credit assignment. They also discovered something interesting — over-denoising, where models sometimes get the right answer at an intermediate step, then "refine" it into a wrong answer. This is probably the dLLM version of RL over-optimization destroying a good pretrained policy.

(c) Entropy-guided compute allocation. EGSPO uses the model's entropy at each denoising step to decide where to spend training compute. High-entropy steps (where the model is most uncertain) get more gradient signal; low-entropy steps (where the model is confident) get less. The intuition is that you're directing optimization pressure where decisions are most consequential. My interpretation of this, in the structured-exploration framing, is that high entropy often marks denoising steps where the model has not yet committed to a stable solution, so optimization matters more there. Low-entropy steps are more settled and may offer less room for improvement.

(d) Denoising discount as an implicit regularizer. DPPO in robotics uses a denoising discount that downweights earlier (noisier) denoising steps in the policy gradient. My read is that this plays a role similar to regularization — it discourages RL from aggressively modifying the early, structure-establishing denoising steps, while allowing more freedom in later refinement steps. The same principle may apply to dLLMs — you want to preserve the coarse structure and optimize the fine-grained details more aggressively.
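To illustrate (b) and (c) together, here is a toy sketch (my own construction, not the EGSPO or MDPO implementation): record per-step entropy and a cheap intermediate correctness signal during a rollout, then concentrate gradient signal on the high-entropy steps. `denoise_step` and `check_answer` are assumed helpers, not a library API.

```python
import torch

def rollout_with_signals(model, denoise_step, tokens, num_steps, check_answer):
    """Run a denoising rollout, recording per-step mean token entropy (where the
    model is uncertain) and a cheap intermediate reward (is the current full
    decode already correct?). Both come for free from the denoising chain."""
    step_entropy, step_reward = [], []
    for t in range(num_steps):
        logits, tokens = denoise_step(model, tokens, t)   # full logits + partially unmasked seq
        probs = logits.softmax(dim=-1)
        ent = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
        step_entropy.append(ent)
        step_reward.append(float(check_answer(tokens)))   # free intermediate evaluation
    return torch.stack(step_entropy), torch.tensor(step_reward)

def step_weights(step_entropy, top_frac=0.25):
    # Entropy-guided allocation: spend gradient signal on the most uncertain steps.
    k = max(1, int(top_frac * len(step_entropy)))
    idx = step_entropy.topk(k).indices
    weights = torch.zeros_like(step_entropy)
    weights[idx] = 1.0
    return weights
```

A denoising discount in the spirit of (d) would simply replace the hard 0/1 weights with something like `gamma ** (num_steps - 1 - t)`, downweighting the early structure-establishing steps.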
The Failure Modes We're Seeing

The robotics literature warns about specific failure modes, and we're already seeing some of the analogues in dLLMs.

(a) Mode collapse. This is a recurring concern in RL fine-tuning of diffusion models more broadly, including image-generation work and policy fine-tuning. RL optimization can collapse multimodal distributions toward a smaller set of reward-favored modes. dLLMs' ability to represent multiple valid responses (different reasoning paths, different coding styles) is a key advantage — but RL will try to compress this diversity. The DPPO paper argues that its specific setup is relatively robust to catastrophic collapse, but the broader diffusion-RL literature suggests this risk is real.

(b) Data/manifold bias. The pretrained distribution is bounded by pretraining + SFT data. If your SFT data only demonstrates one reasoning style, RL can optimize that style but can't easily discover fundamentally different approaches. The denoising process may make this harder to escape, since it actively pulls generations back toward the pretrained distribution.

(c) Over-denoising / over-optimization. MDPO's finding that models get correct answers at intermediate steps and then "refine" them into wrong final answers is the dLLM-specific version of RLHF over-optimization. The iterative structure that provides self-correction can also provide self-destruction if RL pushes too hard.

What This Suggests

If this framing is roughly right, then maybe we should:

(a) Invest heavily in pretraining and SFT quality, not just fancier RL. My current read is that the quality of the pretrained dLLM and SFT data may matter more than the choice between diffu-GRPO, EGSPO, or MDPO. The pretrained distribution appears to be doing a lot of the heavy lifting. If your pretrained model doesn't cover the relevant solution space, no amount of RL sophistication will find what isn't there.

(b) Exploit denoising structure for credit assignment. The intermediate evaluations that dLLMs offer for free might be under-appreciated. MDPO and EGSPO are pointing the way. Use entropy-guided step selection. Use intermediate rewards. The denoising chain gives you structure that autoregressive models don't have, so why not use it.

(c) Be careful with early denoising steps. The early steps establish coarse structure — the overall shape of the response. Aggressively optimizing these risks destroying the pretrained distribution. Consider denoising discounting, or only fine-tuning later denoising steps, or using larger clipping ratios for early steps. DPPO in robotics found that fine-tuning only the last K' of K denoising steps can work well — the same principle likely applies.

(d) Monitor for over-denoising. Track performance at intermediate denoising steps, not just the final output. If intermediate steps consistently outperform the final output after RL, you're over-optimizing. This is a dLLM-specific early warning system for reward hacking (a minimal monitoring sketch follows after this list).

(e) Take mode collapse seriously. If the task has multiple valid solution strategies, check that RL preserves them. Measure output diversity, not just reward. KL from the reference model is necessary but probably not sufficient.
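A small sketch of the monitoring idea in (d), assuming you already log per-step decodes and have a task-specific `is_correct` check; nothing here is a named library API.

```python
def over_denoising_rate(per_step_decodes, final_decodes, is_correct):
    """Fraction of prompts where some intermediate denoising step was already
    correct but the final output is wrong -- a warning sign that RL is pushing
    the later refinement steps past a good answer."""
    flips = 0
    for steps, final in zip(per_step_decodes, final_decodes):
        if any(is_correct(s) for s in steps) and not is_correct(final):
            flips += 1
    return flips / max(1, len(final_decodes))
```

Tracked over training, a rising value of this rate would be the early-warning signal described above, well before final-answer accuracy itself starts to drop.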
What I Still Don't Know

1. Does the denoising structure actually help RL quantitatively? The robotics evidence is strong — DPPO clearly outperforms Gaussian PPO in the fine-tuning regime. For dLLMs, the comparison would be whether diffu-GRPO on a dLLM produces more stable or efficient RL fine-tuning than standard GRPO on an equivalently pretrained autoregressive model. I haven't seen this head-to-head comparison done cleanly. d1 shows diffu-GRPO works, but doesn't compare against autoregressive GRPO with matched pretraining quality.

2. Is the planning advantage real? Dream 7B reports substantially stronger results than Qwen2.5 7B on several planning-style tasks (for example, Countdown 16.0 vs 6.2 and Sudoku 81.0 vs 21.0 in the paper's evaluation). Is this because the non-autoregressive generation structure is genuinely better for constraint satisfaction, or is it an artifact of evaluation methodology? If it's real, it suggests dLLMs + RL could be particularly powerful for agentic tasks that require planning.

3. How far does this scale? DPPO in robotics works for 7-DOF manipulation but hasn't been tested on truly high-dimensional action spaces. dLLMs operate in vocabulary-size action spaces (32K+). Do the denoising structure advantages hold at this scale?

4. Can you escape the pretrained distribution when you need to? The denoising process constrains RL to stay near the pretrained distribution, which helps stability but limits what RL can discover. For genuinely novel reasoning, not just refinement of existing patterns, you may need to break free. What's the dLLM equivalent of off-distribution exploration?

What I keep coming back to is that when you move from autoregressive to diffusion generation, the denoising chain provides exploitable structure for RL, but it also constrains what RL can do. The methods that seem to work best are the ones that take both sides of this seriously — exploiting the structure where it helps, and being careful not to destroy it where it matters.
15 replies · 63 reposts · 531 likes · 40.7K views
Pritam retweeted
Chao Ma @ickma2311
MIT 18.065 Lecture 27 made backprop feel much clearer. A neural network is a chain of functions, so backpropagation is just the chain rule applied efficiently; each parameter’s gradient is not isolated, it is shaped by every layer that comes after it. My note: ickma2311.github.io/Math/MIT18.065…
Chao Ma tweet media
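A tiny worked example of that point (my own illustration, not taken from the linked note): in a two-layer network, the gradient of the first-layer weight carries a factor from the second layer, so every later layer shapes it.

```python
# y = w2 * relu(w1 * x); loss = 0.5 * (y - t)^2
# dL/dw1 = (y - t) * w2 * relu'(w1 * x) * x  <- the w2 factor comes from the later layer
def grads(w1, w2, x, t):
    h = max(0.0, w1 * x)                                   # first layer (ReLU)
    y = w2 * h                                             # second layer
    dy = y - t                                             # dL/dy
    dw2 = dy * h                                           # gradient for the second layer
    dw1 = dy * w2 * (1.0 if w1 * x > 0 else 0.0) * x       # chain rule through layer 2
    return dw1, dw2

print(grads(w1=0.5, w2=2.0, x=3.0, t=1.0))  # dw1 = 12.0 includes the factor w2 = 2.0
```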
5 replies · 129 reposts · 1K likes · 37.6K views
Pritam retweeted
Peter Holderrieth @peholderrieth
We are also releasing self-contained lecture notes that explain flow matching and diffusion models from scratch. This goes from "zero" to the state-of-the-art in modern Generative AI. 📖 Read the notes here: arxiv.org/abs/2506.02070 Joint work with @EErives40101.
Peter Holderrieth @peholderrieth

🚀MIT Flow Matching and Diffusion Lecture 2026 Released (diffusion.csail.mit.edu)! We just released our new MIT 2026 course on flow matching and diffusion models! We teach the full stack of modern AI image, video, and protein generators - theory and practice. We include:
📺 Videos: Step-by-step derivations.
📝 Notes: Mathematically self-contained lecture notes.
💻 Coding: Hands-on exercises for every component.
We fully reworked last year's iteration and added new topics: latent spaces, diffusion transformers, and building language models with discrete diffusion models. Everything is available here: diffusion.csail.mit.edu
A huge thanks to Tommi Jaakkola for his support in making this class possible and Ashay Athalye (MIT SOUL) for the incredible production! Was fun to do this with @RShprints! #MachineLearning #GenerativeAI #MIT #DiffusionModels #AI

38 replies · 650 reposts · 5.6K likes · 462.9K views