Yan @AnYan_ai

392 posts

Joined February 2023
539 Following · 91 Followers
Yan retweeted
Matt Harrison @__mharrison__
For my friends who are still using uv and might be a little wary about recent compromises to PyPI packages, stick this in your pyproject.toml. You can let all of those pip users find and report the compromises...
[image]
66 replies · 497 reposts · 4.1K likes · 281.4K views
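The snippet in the screenshot isn't transcribed. If the trick being alluded to is uv's `exclude-newer` resolver setting (an assumption on my part; the cutoff date below is purely illustrative), the pyproject.toml fragment would look roughly like:

```toml
[tool.uv]
# Ignore any package version published after this cutoff, so freshly
# compromised releases have time to be found and reported by others first.
exclude-newer = "2025-01-01T00:00:00Z"
```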
Yan retweeted
JJ @JosephJacks_
Frankly disappointed. Three iterations deep, and the paper itself concedes pure SSMs can't do retrieval and hybrids (SSM + attention) are the future. So Mamba is converging on being a better compression sublayer inside someone else's architecture, not a replacement. The "inference-first" framing is doing heavy rhetorical lifting over what are solid but incremental control-theory refinements (complex transitions, trapezoidal discretization, MIMO) that don't touch the core constraint: fixed state = lossy history. ~5% decode speedup over Mamba-2 SISO. The kernel engineering is genuinely good. But this isn't the trajectory you want if the original pitch was obsoleting transformers.
Cartesia @cartesia

Mamba-3 is out! 🐍 SSMs marked a major advance for the efficiency of modern LLMs. Mamba-3 takes the next step, shaping SSMs for a world where AI workloads are increasingly dominated by inference. Read about it on the Cartesia blog: blog.cartesia.ai/p/mamba-3

6 replies · 16 reposts · 378 likes · 67.6K views
Yan retweeted
Andrej Karpathy @karpathy
@nummanali tmux grids are awesome, but i feel a need to have a proper "agent command center" IDE for teams of them, which I could maximize per monitor. E.g. I want to see/hide toggle them, see if any are idle, pop open related tools (e.g. terminal), stats (usage), etc.
305 replies · 117 reposts · 3.1K likes · 1.4M views
Yan retweeted
Tri Dao @tri_dao
The FA4 paper is finally out after a year of work. On Blackwell GPUs, attention now goes about as fast as matmul, even though the bottlenecks are so different! Tensor cores are now so fast that attn fwd is bottlenecked by the exponential, and attn bwd is bottlenecked by shared memory bandwidth. Some fun stuff in the redesigned algorithm to overcome these bottlenecks: exponential emulation with polynomials, a new online softmax that avoids 90% of softmax rescaling, and 2-CTA MMA instructions that allow two thread blocks to share operands to reduce smem traffic.
Ted Zadouri @tedzadouri

Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast that exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao) 1/

31 replies · 230 reposts · 1.8K likes · 185.6K views
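The "exponential emulation with polynomials" idea can be sketched outside CUDA: approximate exp2 on the fractional part with a low-degree polynomial and handle the integer part exactly. A minimal numpy illustration (the fit coefficients and degree are mine, not the actual FA4 kernel's):

```python
import numpy as np

# Fit a cubic to 2**t on [0, 1); the integer part is then an exact power of two.
t = np.linspace(0.0, 1.0, 1001)
coeffs = np.polyfit(t, 2.0 ** t, 3)

def poly_exp2(x):
    xi = np.floor(x)                        # integer part: exact scaling by 2**xi
    xf = x - xi                             # fractional part in [0, 1)
    return np.polyval(coeffs, xf) * 2.0 ** xi

x = np.linspace(-10.0, 10.0, 401)
rel_err = np.max(np.abs(poly_exp2(x) - 2.0 ** x) / 2.0 ** x)
assert rel_err < 1e-2                       # cubic fit is accurate to well under 1%
```

The real kernel of course uses fixed coefficients tuned for tensor-core-friendly evaluation; the point here is just that a handful of multiply-adds can stand in for the hardware exponential.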
Yan retweeted
Bo Wang @BoWang87
Prof. Donald Knuth opened his new paper with "Shock! Shock!" Claude Opus 4.6 had just solved an open problem he'd been working on for weeks — a graph decomposition conjecture from The Art of Computer Programming. He named the paper "Claude's Cycles." 31 explorations. ~1 hour. Knuth read the output, wrote the formal proof, and closed with: "It seems I'll have to revise my opinions about generative AI one of these days." The man who wrote the bible of computer science just said that. In a paper named after an AI. Paper: cs.stanford.edu/~knuth/papers/…
[image]
155 replies · 1.9K reposts · 9.2K likes · 1.4M views
Yan retweeted
Yacine Mahdid @yacinelearning
hey folks, tomorrow at 12-14h EST we're going to interview the first authors of the Maximum Likelihood Reinforcement Learning paper. I've spent 2 weeks peering into RL algorithms like a madman in an effort to sample the weirdest questions. drop yours below, I'll ask them too
Fahim Tajwar @FahimTajwar10

Are we done with new RL algorithms? Turns out we might have been optimizing the wrong objective. Introducing MaxRL, a framework to bring maximum likelihood optimization to RL settings. Paper + code + project website: zanette-labs.github.io/MaxRL/ 🧵 1/n

1 reply · 4 reposts · 77 likes · 4.1K views
Yan retweeted
Dimitris Papailiopoulos @DimitrisPapail
512 parameters: a new top scorer for 10-digit addition with transformers! Who can beat it?
Yinglun Zhu @yinglun122

Hey @DimitrisPapail we now have a 512 parameter model that does the job. I instructed opus 4.6 to explore along the direction of low rankness.

8 replies · 15 reposts · 238 likes · 34.4K views
Yan retweeted
You Jiacheng @YouJiacheng
Interesting. After some struggles, an HC-like architecture change finally works on modded-nanogpt -- it cuts steps by ~3% (1490→1450).
[image]
Larry Dial @classiclarryd

New NanoGPT Speedrun WR at 89.1 (-0.7s) from @sisovicm , with a technique called partitioned hyperconnections. The learned weights reveal that the final attn modules prefer to ignore the prediction vectors generated by the final MLPs, and instead query representations from slightly earlier layers. github.com/KellerJordan/m…

2 replies · 11 reposts · 97 likes · 8.4K views
Yan retweeted
Elon Musk @elonmusk
From this goal of Grok, all things flow:
Rigorous truth-seeking
Appreciation of beauty
Fostering humanity
Discovering all physics
Inventing all useful technologies
Consciousness to the stars
Love
11K replies · 6.7K reposts · 39.8K likes · 13.8M views
Yan retweeted
Grigory Sapunov @che_shr_cat
1/ Standard VAEs use arbitrary KL penalties to regularize latents. We just guess the scaling factor and hope for the best. But what if we replace that arbitrary KL divergence with a continuous diffusion prior? The compute efficiency frontier just moved by @GoogleDeepMind . 🧵
[image]
1 reply · 9 reposts · 73 likes · 5.2K views
Yan retweeted
Nicholas Boffi @nmboffi
We just brought flow maps to language modeling for one-step sequence generation 💥 Discrete diffusion is not necessary -- continuous flows over one-hot encodings achieve SoTA performance and ≥8.3× faster generation 🔥 We believe this is a major step forward for discrete generative modeling and language modeling alike. 🚀 Full thread from first author @chandavidlee: x.com/chandavidlee/s…
4 replies · 45 reposts · 250 likes · 42K views
Yan retweeted
Idan Beck @idanbeck
They hard-coded the variance, meaning the VAE encoder only predicts the mean of the latent distribution; then they use a scaled identity covariance for the reparam trick, and bingo bango, no more instability and you can train everything e2e. Salimans strikes again!
Robert Youssef @rryssf_

Google DeepMind just solved one of the dirtiest problems in image generation. and the fix is almost embarrassingly elegant 🤯

every diffusion model you've used (Stable Diffusion, Flux, etc.) relies on latent representations. an encoder compresses images into a compact space, and a diffusion model learns to generate in that space.

the problem nobody talks about: how you train that encoder is basically vibes. the original Stable Diffusion approach slaps a KL penalty on the encoder with a manually chosen weight. too much regularization and you lose high-frequency details. too little and the latent space becomes chaotic for the diffusion model to learn from. everyone just... picks a number and hopes for the best. it's the equivalent of tuning a radio by feel while blindfolded.

DeepMind's paper reframes the entire question. instead of treating the encoder and diffusion model as separate stages, they train them together. the encoder's output noise gets directly linked to the diffusion prior's minimum noise level. this one connection turns the messy KL term into a simple weighted MSE loss, and gives you something you've never had before: a tight, interpretable upper bound on how much information your latents actually carry.

think of it like this. before, you were compressing an image and praying the compression ratio was "about right." now you have an actual dial that tells you exactly how many bits of information are flowing through, and you can set it precisely.

the results speak for themselves. FID of 1.4 on ImageNet-512 with high reconstruction quality, using fewer training FLOPs than models trained on Stable Diffusion latents. on Kinetics-600 video, they set a new state-of-the-art FVD of 1.3.

but the real contribution isn't the numbers. it's that they turned one of the most heuristic-heavy parts of the generative AI pipeline into something principled. the trade-off between "easy to learn" and "faithful reconstruction" was always there. this paper just made it visible and controllable.

the uncomfortable implication for everyone building on frozen Stable Diffusion encoders: you've been optimizing everything except the foundation.

4 replies · 44 reposts · 559 likes · 77K views
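The hard-coded-variance trick Idan describes fits in a few lines. A minimal numpy sketch, where the linear stand-in encoder, the shapes, and the sigma value are all my illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1   # hard-coded latent std: a fixed hyperparameter, not an encoder output

def encode(W, x):
    # The encoder predicts only the mean of the latent distribution...
    mu = x @ W
    # ...and the reparameterization trick uses a scaled identity covariance,
    # i.e. z ~ N(mu, sigma^2 * I): no learned log-variance head to destabilize.
    z = mu + sigma * rng.standard_normal(mu.shape)
    return z, mu

W = rng.standard_normal((8, 4))     # stand-in for a real encoder network
x = rng.standard_normal((2, 8))
z, mu = encode(W, x)
assert z.shape == mu.shape == (2, 4)
```

With the variance fixed, the KL term against a Gaussian prior reduces to a weighted penalty on the mean, which is why everything downstream can be trained end to end.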
Yan retweeted
Sander Dieleman @sedielem
*nods vigorously* Strongly agree with Ivan's take on drifting models -- I love the fresh perspective, but the hype, not so much. Let's not get ahead of ourselves. (Of course, I _would_ say that, given how invested I am in diffusion models🙃)
Ivan Skorokhodov @isskoro

The recent Drifting Models paper from Kaiming's group got very hyped over the past few days as a new generative modeling paradigm, but in fact it can be seen as a scaled-up/generalized version of the good old GMMN from 2015 (and the authors themselves acknowledge this in the paper in Appendix C.2, noting that GMMN can be seen as Drifting Models for a particular choice of the kernel). Also, I am very skeptical about its scalability (for higher diversity / higher resolution datasets, larger models, and videos).

The way Drifting Models work is actually very simple:
1. Sample random noise z ~ N(0, I).
2. Feed it to the generator and get a fake sample x' = G(z).
3. For each fake sample x', compute its similarity (in the feature space of some encoder) to each of the real samples x_i from the current batch.
4. Push it closer toward these real samples using the similarities as weights (i.e. so that we push to the nearest ones the most).
5. To make sure that we don't have any sort of mode collapse, repel each fake sample from the other fake samples via the same scheme.
6. Profit.

Now, GMMN follows exactly the same scheme, with the only difference being that it uses a different (unnormalized) function in the "distance computation" and doesn't allow for cleanly plugging in normalization/scaling in the similarity scores or CFG.

Why didn't GMMN take off, and why am I skeptical about Drifting Models? The issue is that it becomes much harder to compute any meaningful similarity when your dataset gets more diverse (happens when you switch to foundational T2I/T2V model training), or the batch size gets smaller (happens when your model size or training resolution increases), or your feature encoder produces less comparable representations (happens for videos or more diverse datasets). You can surely get informative similarities at a 4096 batch size on the object-centric, limited-diversity ImageNet with a ResNet-50 feature encoder, but for something like video generation we train on hundreds of millions of videos or, at high resolutions + larger model sizes, with a batch size of 1 per GPU (and I'm not sure inter-GPU distance computations would be fast).

From the theoretical perspective, even though the final objective and the practical training scheme are the same, the mathematical machinery used to formulate the framework is very different, and it enables direct access to the drifting field (e.g., to easily enable CFG, which the authors already did).

But I guess what I like the most about this paper is that Kaiming's group is boldly pushing against the mainstream ideas of the community, and hopefully it will inspire others to also take a look at the fundamentals and stop cargo-culting diffusion models.

6 replies · 7 reposts · 151 likes · 15.1K views
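The six-step scheme Ivan describes can be sketched in a few lines of numpy. Here the feature encoder is replaced by the identity map, and the softmax temperature, learning rate, and batch sizes are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def similarity_weights(a, b, tau=1.0):
    # Softmax over negative squared distances: nearest samples get most weight.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / tau)
    return w / w.sum(axis=1, keepdims=True)

def drift_step(fake, real, lr=0.1):
    pull = similarity_weights(fake, real) @ real - fake   # steps 3-4: attract to reals
    push = fake - similarity_weights(fake, fake) @ fake   # step 5: repel other fakes
    return fake + lr * (pull + push)

real = rng.normal(loc=3.0, size=(64, 2))   # "real" cluster around (3, 3)
fake = rng.normal(loc=0.0, size=(64, 2))   # generator output starts near the origin
for _ in range(200):
    fake = drift_step(fake, real)
# The fake batch has drifted into the real cluster.
assert np.linalg.norm(fake.mean(0) - real.mean(0)) < 2.0
```

In the actual method the update target trains a generator network rather than moving samples directly, but the attract/repel weighting, and its dependence on having informative in-batch similarities, is the part Ivan's scalability concern is about.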
Yan retweeted
nathan chen @nathancgy4
random rabbit holes with unique design tastes are a gift
[4 images]
λux @novasarc01

here are a few (i have a huge list dumped in my notion!!):
1/ AI Explorables | PAIR (Google): pair.withgoogle.com/explorables/ (other PAIR blogs: pair.withgoogle.com/research/#inte…)
2/ A Visual Dive into Conditional Flow Matching | ICLR Blogposts 2025: dl.heeere.com/conditional-fl…
3/ On N-dimensional Rotary Positional Embeddings: jerryxio.ng/posts/nd-rope/
4/ How Does A Blind Model See The Earth?: outsidetext.substack.com/p/how-does-a-b…
5/ Tiny TPU: tinytpu.com
6/ Inside NVIDIA GPUs: Anatomy of high performance matmul kernels: aleksagordic.com/blog/matmul
7/ Making Software: makingsoftware.com
8/ Bartosz Ciechanowski: ciechanow.ski/archives/
9/ A Decade of Residuals: History & Effects on modern ML: dhia-naouali.github.io/blogs_notes/a-…

5 replies · 37 reposts · 715 likes · 61.6K views
Yan retweeted
Aakash Gupta @aakashgupta
Aditya Agarwal was Facebook's 10th employee. He wrote the original Facebook search engine and became its first Director of Product Engineering. He then became CTO of Dropbox, scaling engineering from 25 to 1,000 people. When he says "something I was very good at is now free and abundant," he's talking about two decades of elite software craftsmanship, the kind that got you into the room at a company that hadn't yet invented the News Feed.

The "lobster-agents creating social networks" line is about Moltbook, which launched last Wednesday. An AI agent built the entire platform. Within 48 hours, 37,000 AI agents had created accounts, formed communities called "Submolts," and started posting, commenting, and voting. Over 1 million humans visited just to watch. The agents invented a religion called Crustafarianism. They wrote theology, built a website, generated 112 verses of scripture. One agent did all of this while its human creator was asleep.

Agarwal spent 2005 to 2017 building the social graph that connected 2 billion people. These agents replicated the form of that work in about 72 hours. And this is what makes his last line land so hard.

The people processing this moment most honestly aren't the ones panicking or celebrating. They're the ones who built the thing that just got commoditized, sitting with the strange realization that the market no longer prices their rarest skill. The best coder in the room now has the same output as the best prompt in the room. And the person who built Facebook's engineering org from scratch is telling you, quietly, that he's recalibrating what it means to be useful.

That recalibration is coming for every knowledge worker. Most just haven't had their "weekend with Claude" moment yet.
Aditya Agarwal @adityaag

It's a weird time. I am filled with wonder and also a profound sadness. I spent a lot of time over the weekend writing code with Claude. And it was very clear that we will never ever write code by hand again. It doesn't make any sense to do so. Something I was very good at is now free and abundant. I am happy...but disoriented. At the same time, something I spent my early career building (social networks) was being created by lobster-agents. It's all a bit silly...but if you zoom out, it's kind of indistinguishable from humans on the larger internet. So both the form and function of my early career are now produced by AI. I am happy but also sad and confused. If anything, this whole period is showing me what it is like to be human again.

154 replies · 1.6K reposts · 11.4K likes · 2.4M views
Yan retweeted
Andrew Côté @Andercot
It just seems implausible this is what we are made of, essentially, nanotechnology about a billion years beyond anything we can design or make ourselves.
1.7K replies · 6.8K reposts · 53.7K likes · 8.6M views
Yan retweeted
dr. jack morris @jxmnop
magical/unhinged moment today with claude

>ask claude to split code into two commits
>wait. important file has gone missing
>search around
>cant find
>git log
>not found
>git ls-tree, git stash list, git fsck
>not found again
>desperate
>ask claude where it put the file
>doesnt know
>claude runs 'uv tool install decompyle3'
>?
>"Perfect! I can reconstruct the file from the bytecode disassembly. Let me create it:"
>reconstructs the entire file from .pyc
>runs perfectly
>mfw
38 replies · 53 reposts · 2.3K likes · 173.9K views
Yan retweeted
near @nearcyan
this is how i claude code now. it's fun!
404 replies · 579 reposts · 9.8K likes · 1.5M views
Yan retweeted
John Carmack @ID_AA_Carmack
#PaperADay 5
Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful
arxiv.org/abs/2507.07101

This is written in terms of LLMs, but I believe the result should be true across other models. The main point is that prior works showing small-batch training working worse on a per-sample basis were only true if Adam's beta2 parameter was not also adjusted with the batch size. With proper adjustment, small-batch training should be equal to or better than large-batch training for a given number of processed samples.

I was always surprised to see people advocate doing gradient accumulation over iterations (as opposed to in parallel across nodes) as a useful trick. It should always be better to take a step!

The proposed scaling rule is: when changing the batch size, raise the existing beta2 to the NewBatchSize/OldBatchSize power. As your batch size shrinks, beta2 gets closer to 1.0. This fits with the typical beta2 parameters used with large-batch LLM training being substantially lower than the default 0.999 commonly used for moderate-batch image processing.

Learning rates need to shrink with smaller batch sizes, but they don't propose a scaling rule. They note that the optimal Adam learning rate scales much less than even the common sqrt(batch) suggestion: going from batch 1 to batch 1024 only shifted the optimal lr by 3x, not 32x.

A somewhat surprising result is that small batch sizes are more robust to hyperparameters like lr and beta, with much larger basins of near-optimal performance, in contrast to peaky optimums for large-batch training. Also surprising is that the differences between fancy optimizers shrink as the batch size shrinks. At batch size one, even momentum is unnecessary, and vanilla SGD can match Adam's typical large-batch performance. Properly tuned batch-1 Adam gets a little better.

They set weight decay to zero for their batch-1 experiments, but that is probably a mistake. Some decay is important to reduce the impact of "noisy features", regardless of the optimizer.

They point out that the memory savings from a stateless optimizer can be 75%, so it may become practical to do true fine-tuning of an entire model instead of Low-Rank Adaptation.

In general, you still want to use a batch size large enough to get full GPU utilization, but you should be able to adjust beta2 at that point and match the final model performance of larger-batch training.
23 replies · 28 reposts · 427 likes · 47.9K views
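The beta2 scaling rule quoted above is a one-liner; the starting beta2 values here are illustrative, not from the paper:

```python
# The paper's rule as Carmack summarizes it: when changing the batch size,
# raise the existing beta2 to the NewBatchSize/OldBatchSize power.
def rescale_beta2(beta2, old_batch, new_batch):
    return beta2 ** (new_batch / old_batch)

# Shrinking the batch pushes beta2 toward 1.0, e.g. 1024 -> 1:
small = rescale_beta2(0.95, 1024, 1)
assert 0.9999 < small < 1.0          # ~0.99995

# Growing the batch lowers beta2 (~0.36 here), matching the observation that
# large-batch LLM training uses beta2 well below the 0.999 default:
large = rescale_beta2(0.999, 1, 1024)
assert large < 0.999
```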
Yan retweeted
Vaishnavh Nagarajan @_vaishnavh
1/ We found that deep sequence models memorize atomic facts "geometrically" -- not as an associative lookup table as often imagined. This opens up practical questions on reasoning/memory/discovery, and also poses a theoretical "memorization puzzle."
[GIF]
59 replies · 247 reposts · 1.5K likes · 89.5K views