Tijmen Blankevoort

508 posts

@TiRune

Amsterdam, The Netherlands · Joined May 2009

217 Following · 675 Followers

Tijmen Blankevoort@TiRune·12 May

@CoreAutoAI It turns your deep learning network into a boosting ensemble. Don’t think it’s just an optimizer question.

Core Automation@CoreAutoAI·10 May

Are residual connections a hack, or a provably optimal way to shape your loss landscape?

Tijmen Blankevoort@TiRune·16 Apr

@hayden_prairie I see we’re back to doing Neural ODEs again with a forward Euler rule.

Hayden Prairie@hayden_prairie·15 Apr

We’ve been thinking a lot about scaling laws, wondering if there is a more effective way to scale FLOPs without increasing parameters. Turns out the answer is YES – by looping blocks of layers during training. We find that predictable scaling laws exist for layer looping, allowing us to use looping to achieve the quality of a Transformer twice the size. Our scaling laws suggest that for a fixed parameter budget, data and looping should be increased in tandem! 🧵👇
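
The Neural ODE remark in the reply can be made concrete with a small sketch. It assumes a hypothetical stand-in `block` (the vector field f(x) = -x) in place of a real block of Transformer layers, and it is not the method from the thread: it only shows that applying one weight-shared residual block n times is the same computation as n forward Euler steps of dx/dt = block(x).

```python
import math

def block(x):
    """Hypothetical stand-in for a block of layers: the vector field f(x) = -x."""
    return [-v for v in x]

def looped_residual(x, n_loops, step=1.0):
    """Apply the same residual block n_loops times: x <- x + step * block(x).

    Because the block is shared across loops, each pass is one forward
    Euler step of the ODE dx/dt = block(x) with step size `step`.
    """
    for _ in range(n_loops):
        fx = block(x)
        x = [v + step * f for v, f in zip(x, fx)]
    return x

x0 = [1.0, -2.0, 0.5]
n = 1000
# Integrate dx/dt = -x from t = 0 to t = 1 with n Euler steps of size 1/n.
x1 = looped_residual(x0, n_loops=n, step=1.0 / n)
# The exact solution of dx/dt = -x at t = 1 is x0 * exp(-1).
exact = [v * math.exp(-1.0) for v in x0]
max_err = max(abs(a - b) for a, b in zip(x1, exact))
```

With `step=1.0` this reduces to the plain residual update x + block(x), which is the looping case; the smaller step is used here only so the result can be compared against the closed-form solution.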

Tijmen Blankevoort@TiRune·14 Apr

@didier_lopes @Tim_Dettmers You’ll find that the ‘super weights’ are just very significant weights in the channels causing this large outlier behavior :)

Tijmen Blankevoort@TiRune·14 Apr

@didier_lopes @Tim_Dettmers Basically, just clipping the large activations is very harmful. If you remove some of the larger weights in the corresponding channels, you similarly reduce the activations, breaking the model. This happens on any transformer with softmax attention, and it gets worse the longer you train.

Didier Lopes@didier_lopes·12 Apr

Something cool about this is that I was reading this post timdettmers.com/2022/08/17/llm… (based on this paper proceedings.neurips.cc/paper_files/pa…) by @Tim_Dettmers from 2022, and this super weight paper basically confirmed Tim's intuition and dug deeper.
Dettmers (2022) -> Super Weight paper (2024)
"they only occur in 6 feature dimensions" -> "we find relationships between two individual scalars - up to six weights and one activation"
"they occur in all layers" -> "The super activation persists throughout the model at exactly the same magnitude and position regardless of the prompt" (super weight paper explains that this happens because of the skip connections)
"active in up to 75% of all sequence dimensions" -> "super activations often appear after the super weight, persist throughout subsequent layers with constant magnitude and position, irrespective of the input prompt"
"consumed in the attention function and the second feedforward network" -> "The super weight is consistently found in the down projection of the feed-forward network following the attention block, typically in an early layer"
"Transformers become more stable" -> "When preserved with high precision, super activations can improve simple round-to-nearest quantization to become competitive with state-of-the-art methods"

Didier Lopes@didier_lopes

This was a really good read. h/t @guohao_li

Tijmen Blankevoort@TiRune·11 Apr

@ID_AA_Carmack Yup, it’s basically int8 with a lot of dynamic range. Fp16 will also look a lot better! That’s just fp32 without that much range.

John Carmack@ID_AA_Carmack·10 Apr

Making a scatter plot of 400_000 data points, some of the plots had odd gaps in coverage. It took me a little while to realize that it was only when the data was farther from the origin -- it was the raw bfloat16 precision. Everything looks great from -1 to 1, but as you go past 2 and 4, the coverage gaps get larger. My intuition didn't have it being quite so "discretely countable" at those modest numeric values. Float32 for comparison.
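
The growing gaps follow from the format: bfloat16 keeps 7 explicit mantissa bits (8 significant bits), so the distance between neighbouring values doubles with every power of two. The sketch below emulates bfloat16 in pure Python by keeping the top 16 bits of a float32 with round-to-nearest-even; the helper names `to_bfloat16` and `spacing` are illustrative, and NaN/infinity handling is left out.

```python
import math
import struct

def to_bfloat16(x):
    """Round a Python float to the nearest bfloat16 value, returned as a float.

    bfloat16 is emulated as the top 16 bits of an IEEE float32, using
    round-to-nearest-even on the 16 discarded bits. NaN/inf are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

def spacing(x, mantissa_bits=7):
    """Gap between adjacent representable values in the binade containing |x|.

    mantissa_bits=7 corresponds to bfloat16; float16 has 10 and float32 has 23.
    Only valid for normal (non-subnormal), non-zero inputs.
    """
    exponent = math.floor(math.log2(abs(x)))
    return 2.0 ** (exponent - mantissa_bits)

# Gaps near 1, 2 and 4 for bfloat16, and near 2 for float16.
gaps_bf16 = {v: spacing(v) for v in (1.0, 2.0, 4.0)}
gap_fp16_at_2 = spacing(2.0, mantissa_bits=10)

# Values between 2 and 4 snap to a grid with step 2**-6.
snapped = to_bfloat16(2.01)
```

Between 2 and 4 only 128 distinct bfloat16 values exist, which is consistent with the visibly discrete coverage in the plot; float16 at the same magnitude has a grid eight times finer.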

Tijmen Blankevoort@TiRune·3 Apr

I wonder if the negative sink weight rejection is more of an optimization issue. In our original paper describing both sinks and gated attention: arxiv.org/pdf/2306.12929, we also showed how clipping the softmax gets rid of sink behavior. It must be at least part of the explanation?
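
For readers unfamiliar with the idea, here is a minimal sketch of what clipping the softmax can look like. It assumes a stretch-and-clip form, clip((zeta - gamma) * softmax(x) + gamma, 0, 1), with illustrative parameter values; the names `gamma` and `zeta` and their defaults are assumptions of this sketch, so the linked paper should be checked for the exact formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def clipped_softmax(logits, gamma=-0.03, zeta=1.0):
    """Stretch softmax outputs to [gamma, zeta], then clip back to [0, 1].

    With gamma < 0, small probabilities become exactly zero at finite
    logits, which a plain softmax can only approach asymptotically.
    """
    probs = softmax(logits)
    return [min(1.0, max(0.0, (zeta - gamma) * p + gamma)) for p in probs]

logits = [4.0, 0.0, 0.0, 0.0]
plain = softmax(logits)
clipped = clipped_softmax(logits)
```

In this example the plain softmax still assigns a small positive weight to every position, while the clipped version returns exact zeros for the three low-scoring positions without needing more extreme logits.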

λux@novasarc01·2 Apr

cool experiment! the negative sink weights are the strongest signal here. the model is actively rejecting a synthetic sink which pretty cleanly rules out “softmax forces sink” in this setting.

Muyu He@HeMuyu0327

We are interested in whether Kimi's Attention Residual has the same "attention sink" problem of attention layers. To figure out the answer, we design two novel architectures on top of Attn Res on nanochat.
- The problem: since attention is computed with softmax, attention scores always sum to 1. So the model is forced to assign on average non-trivial weight to each individual component. In attention, the model solves this by absorbing most attention into the key of the first token, i.e. the "attention sink". Since Attn Res also uses softmax, we want to know if it suffers from the same issue, and needs a "sink" to absorb extra attention. In Attn Res, the token embedding output also receives substantial attention from every subsequent layer. Is this behaving like a sink, or is it genuinely useful?
- The architectural change: For the first model, we add a learnable scalar to each layer of the Attn Res model, following the learnable sink design of GPT-OSS. During attention residual computation, this sink scalar is concatenated with other logits before softmax, essentially absorbing some attention. For the second model, we add a gate at the output of the attention residual, which scales each dimension by (0, 1), following the gate design of Qwen's gated attention. This essentially undoes any overmixing softmax attention might have for each hidden dim.
- The effect: both models seem to show that there is no attention sink problem brought about by softmax attention for Attn Res. For the learnable sink model, we plot the attention of each layer to previous outputs, with the first column being attention to the sink (p1). We find that most attention is still on the embedding output, even though there is a sink for extra attention. This shows that the model does focus on the embedding output for specific gains. Looking at the values of the learnable sink, which is zero-initialized, we find that most layers drive the value to negative, essentially reducing its effect on attention even more (p2). This is a clear signal that the model wants to allocate the existing attention budget as much as possible on real layer outputs. For the gated model, we notice that the model does learn to scale the outputs in a pretty specific way. As the gate matrices are random init near 0, the init gate value should be centered around 0.5, but we see that for each layer, the gate values are evenly divided between the two extremes near 0 and 1 (p3). This shows that the model is actively trying to scale each dimension.
- The performance: interestingly for the gate model and as expected for the sink model, the FLOP-controlled validation loss for both is almost identical to the Attn Res baseline. Although the gate model learns to scale the outputs, this scaling seems to have little impact on the actual effectiveness of model computation (p4). Compared to the baseline, which is a GPT-2 style 12-layer 124M model, all three Attn Res variants outperform the baseline with minimal parameter overheads (gate matrices are rank-4 up and down proj matrices, sink is just a bunch of scalars). They also outperform Andrej's own version of "attention residual", which is a weighted combination of the current residual stream and the embedding.
- What's next: Attn Res is a very cool model, and we have found a bunch of interesting things about it lately. Will share more interp insights and arch variants in the coming days (e.g. it seems to change the 'curse of depth' dynamic quite a bit, which is interesting).
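
The two architectural changes described in the thread can be sketched on plain Python lists. This is an illustration under assumptions, not the authors' code: `softmax_with_sink` appends one scalar sink logit before the softmax, and `sigmoid_gate` assumes the (0, 1) gate is a sigmoid, which matches the stated 0.5 value at near-zero initialization. Both function names and the example numbers are hypothetical.

```python
import math

def softmax_with_sink(logits, sink_logit=0.0):
    """Softmax over the logits plus one extra sink logit.

    Returns (weights over the real entries, weight absorbed by the sink).
    The real weights sum to less than 1 because the sink takes a share.
    """
    extended = list(logits) + [sink_logit]
    peak = max(extended)
    exps = [math.exp(v - peak) for v in extended]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs[:-1], probs[-1]

def sigmoid_gate(values, gate_logits):
    """Scale each dimension by sigmoid(gate_logit), a factor in (0, 1)."""
    return [v / (1.0 + math.exp(-g)) for v, g in zip(values, gate_logits)]

layer_logits = [1.0, 0.5, -0.2]
# Zero-initialized sink versus a sink driven to a negative value.
weights_zero, sink_zero = softmax_with_sink(layer_logits, sink_logit=0.0)
weights_neg, sink_neg = softmax_with_sink(layer_logits, sink_logit=-4.0)
# Gate logits at 0 give a 0.5 scale; large |logit| pushes the scale to 0 or 1.
gated = sigmoid_gate([2.0, 2.0, 2.0], [0.0, 6.0, -6.0])
```

A more negative sink logit returns attention mass to the real entries, which is the behaviour the thread reads as the model rejecting the sink; the gate example shows the two extremes near 0 and 1 that the gated model is reported to learn.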

Tijmen Blankevoort@TiRune·31 Mar

@MrCatid @mgostIH NVFP4 is scalar, not vector quant! :o

catid@MrCatid·31 Mar

@mgostIH That’s how NVFP4 works too btw and every other quant scheme afaik

mgostIH@mgostIH·30 Mar

Nobody will tell you, but there's a free lunch you can get in TurboQuant by vector quantization. If you quantize 8 dimensions at a time rather than a single scalar, you can get higher accuracy because you cover the joint distribution better, see QuIP#

Tijmen Blankevoort@TiRune·31 Mar

@tsengalb99 Also pretty impressive they miss citing spinquant and quarot that both apply rotations specifically for the KV-cache 😂

Albert Tseng@tsengalb99·30 Mar

I don’t understand what the buzz about random projections and TurboQuant is. The AI literature has known about random rotations for LLM quantization since at least 2023, when we introduced QuIP/#. I’m pretty sure people have been doing this for decades outside of LLMs too.

Paata Ivanisvili@PI010101

The Johnson--Lindenstrauss lemma says something quite remarkable: if you have an astronomical number N of vectors of large size (say, in a very high-dimensional Euclidean space), then you can linearly map them into a much lower-dimensional space, of dimension about log(N), in such a way that the distances between the vectors are almost preserved. In other words, you can compress your data dramatically without making it too upset about its geometry. A random matrix with i.i.d. standard Gaussian entries will most likely do the job.
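
A small numerical check of the statement above, using only the standard library. The sizes (10 points, 512 dimensions projected to 128) and the seeds are arbitrary choices for illustration, and the 1/sqrt(k) scaling is the usual normalization that keeps expected squared lengths unchanged.

```python
import math
import random

def random_projection(points, k, seed=0):
    """Map d-dimensional points to k dimensions with an i.i.d. Gaussian matrix.

    Entries are N(0, 1) scaled by 1/sqrt(k) so that squared distances are
    preserved in expectation.
    """
    rng = random.Random(seed)
    d = len(points[0])
    scale = 1.0 / math.sqrt(k)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]
    return [[sum(r * v for r, v in zip(row, p)) for row in matrix] for p in points]

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

rng = random.Random(1)
d, k, n = 512, 128, 10
points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
projected = random_projection(points, k)

# Ratio of projected distance to original distance for every pair of points.
ratios = [
    distance(projected[i], projected[j]) / distance(points[i], points[j])
    for i in range(n)
    for j in range(i + 1, n)
]
```

The ratios cluster around 1, and the spread shrinks as k grows; the lemma's guarantee is that a target dimension on the order of log(N) divided by the squared tolerance suffices for all pairs at once.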

Tijmen Blankevoort retweeted

Dawid Kopiczko@dawkopi·16 Feb

Why repetition works so well is still an open question. There's a lot to uncover about training dynamics of SFT, and we hope this is a useful data point. Joint work with co-authors @Sagar_Vaze @TiRune @y_m_asano Paper: arxiv.org/abs/2602.11149 Code: github.com/dkopi/data-rep…

Tijmen Blankevoort retweeted

Bryan Catanzaro@ctnzr·15 Dec

Today, @NVIDIA is launching the open Nemotron 3 model family, starting with Nano (30B-3A), which pushes the frontier of accuracy and inference efficiency with a novel hybrid SSM Mixture of Experts architecture. Super and Ultra are coming in the next few months.

Tijmen Blankevoort@TiRune·2 Dec

@chrisbarber @rronak_ @QuantumArjun @MichaelElabd @jonsidd @schwarzjn_ I’m hiring at Nvidia for efficiency, quantization, sparsity and working on the Nemotron models broadly.

Chris Barber@chrisbarber·2 Dec

I made an unofficial NeurIPS 2025 hiring list: @rronak_, @QuantumArjun, @michaelelabd, stealth, I’m a small investor: RL post-training from live product usage. Research Engineers. @jonsidd, Turing: data for frontier models. Research Engineers, SWEs. @schwarzjn_, ICL & Thomson Reuters: LLMs for law. Research Engineers, SWEs, PhD students. @panda_liyin, AdaL: copilot for ML engineering. MLEs, SWEs. @sarwal_varuni, TriFetch: data and post-training for medical AI. @bidhan, Bagel Labs: decentralized training for diffusion models. MLEs, ML Scientists. @meggmcnulty, Cosmic Labs: AI-native OS for embedded engineering. MLEs, SWEs, systems engineers. @samuelekpe, GrupaAI: operating system for AI agents. SWEs. @jaradcannon, Humanoid: industrial humanoids. SWEs and applied researchers. @saurabh_here1, Cantina: AI native social media. Research interns for video gen. @RicardoMonti9, DatologyAI: frontier data curation (filtering, mixing, synthetic) for LLMs. Research Scientists, MLEs, SWEs. @NimaGard, Path Robotics: physical AI to automate manufacturing tasks (e.g. welding). MLEs for robot learning. @DrJimFan, Nvidia robotics team. Research Engineers, SWEs. @katherine1ee, OpenAI pretraining safety team. Research Engineers. @BorisMPower, OpenAI applied AI research team. Research Engineers. @j_asminewang, OpenAI alignment team. Research Engineers, Research Scientists. @zijianwang30, MSL data research team. Research Engineers, Research Scientists. @RuiqiGao, Google DeepMind video gen team. Research Engineers, Research Scientists. @joshim5, Chai Discovery: molecule prediction for drug discovery. Research Engineers, SWEs. @crisbodnar, Project Prometheus: AI for manufacturing and logistics. Research Engineers. @vdbergrianne, Microsoft Research Amsterdam materials science team. Research Engineers. @kamath_sutra, Smallest: AI for call centers. SWEs. @idavidrein, METR: frontier model evaluation. Research Engineer. @jimmysmith1919, Liquid AI: on-device models. MLEs, Research Engineers. 
@alxndrdavies, AI Security Institute: red-teaming. Research Scientists/Engineers. @stuhlmueller, Elicit: AI for scientific research and good reasoning. MLEs, SWEs. @gavincrooks, @FarisSbahi, Normal Computing: physics-based ASICs. Research Engineers, SWEs. @myra_deng, Goodfire AI: interpretability research. Research Engineers, Research Scientists, MLEs. @_lychrel, @SergeiIakhnin, @ja_kirkpatrick, @sbos, Isomorphic Labs: AI-first drug discovery. Research Engineers, Research Scientists, MLEs. @kdqg1, @bneyshabur, Anthropic AI Scientist team. Research Engineers with infra experience. @sarahookr, Adaption: continuous learning. Research Engineers. @francedot, Cua, I’m a small investor: infra for computer-use agents. SWEs, Research Engineers. @iScienceLuvr, Sophont: multimodal models for healthcare. Research Engineers/Research Scientists. @aditshah00, Until Labs: organ preservation. MLEs. @RitvikKapila & @gauri__gupta, NeoSigma: evals and post-training for real world agents. SWEs. @abeirami, stealth: reliability & statistical evaluation. Research Engineers & SWEs. @adityachinchure, Ideogram: image generation. Research Engineers. @AndrewLBeam, @kenneth0stanley, Lila Sciences: autonomous labs, verifiability for science. Research Engineers, MLEs. @brianwilt, Waymo: ML infra for motion planning team. Senior SWEs. @thisismadani, Profluent Bio: protein generation for drug development. MLEs.

Tijmen Blankevoort@TiRune·1 Dec

@RomiLifshitz Thanks! Fixed!

Romi Lifshitz@RomiLifshitz·1 Dec

@TiRune Would love to chat! (but your DMs are closed!)

Tijmen Blankevoort@TiRune·30 Nov

Looking for cracked full-time Deep Learning researchers on Efficiency, Quantization and Sparsity. Join our world-class applied deep learning research team at Nvidia. Our team creates the Nemotron models, and we influence the hardware with our research. Shoot me a message! I'm at NeurIPS!

Tijmen Blankevoort@TiRune·28 Nov

@MinChonChiSF @gu_xiangming @Alibaba_Qwen Known since 2023 btw - arxiv.org/pdf/2306.12929 <- our outlier paper already used gated attention to get rid of attention-sink behavior.

Min Chon Chi@MinChonChiSF·28 Nov

@gu_xiangming @Alibaba_Qwen Interesting that a no-op gate helps eliminate the attention sink.

Xiangming Gu@gu_xiangming·27 Nov

Congratulations to @Alibaba_Qwen for winning the NeurIPS 2025 Best Paper Award. Great to hear that attention sink attracts a lot of attention. My take on why gated attention eliminates the attention sink: the gate mechanism implements a "no-op" (do not update token representations), removing the need to develop an attention sink to achieve it. Please also check our two papers about when attention sink emerges in LLMs (openreview.net/forum?id=78Nn4…) and why LLMs need attention sink (arxiv.org/abs/2504.02732). In my first paper, I showed some attention variants that are attention-sink-free, like sigmoid attention and some linear attention.

Qwen@Alibaba_Qwen

🏆 We are incredibly honored to announce that our paper, "Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free" has received the NeurIPS 2025 Best Paper Award! A huge congratulations to our dedicated research team for pushing the boundaries of AI. Read more: blog.neurips.cc/2025/11/26/ann…

Tijmen Blankevoort@TiRune·27 Nov

@Alibaba_Qwen Congrats!

Qwen@Alibaba_Qwen·27 Nov

Tijmen Blankevoort@TiRune·19 Nov

@scaling01 @david_sepulvado Kimi-K2 thinking was released in INT4, not FP4?

Lisan al Gaib@scaling01·19 Nov

@david_sepulvado very high. OpenAI released their open source models 3 months ago in FP4, and other open source models come natively with QAT in FP4, like Kimi-K2 Thinking. Google pioneered a lot of these techniques.

Lisan al Gaib@scaling01·19 Nov

Gemini 3 Pro has around ~7.5T params (vibe-mathing with explanation) > the naive fit with with an R^2 of 0.8816 yields a mean estimation of 2.325 Quadrillion parameters > ummm, that's not it > let's only take sparse MoE reasoning models > this includes gpt-oss-20B and 120B, Qwen3 Next, MiniMax, Qwen3 235B, GLM-4.6, DeepSeek-V3.1 Terminus, DeepSeek V3.2, DeepSeek R1 0528 and Kimi K2 Thinking > R^2 of 0.9478 mean estimate of 604T params > pretty sure that's not it either > okay, let's take the most optimistic series of points > (the idea here is that the Google Team is at least on this open-source frontier, if not ahead) > MiniMax-M2, GLM-4.6, and DeepSeek R1 0528 > that's more like it, but YIKES > confidence intervals are fucking cooked > mean estimate of 19.6T with the lower 95% bound at 1.7T > I will take 1.7T as our minimum model size for Gemini 3 Pro > okay fuck DeepSeek-R1, we are going full retard, the most optimal of points > confidence intervals are dead > 2 point regression, R^2 = 1, AGI achieved > mean estimate of 8.2T params > TPUv7 rack has 64 TPUs @ 192GB/TPU = 12288 > I assume they wouldn't want multi-rack inference because of latency, complexity or whatever > they are likely serving in FP4 which limits the maximum model to 24.576T params > inference max shows that a GB200 NVL72 which is very similar to TPUv7 rack setup can serve 512 or even 1024 users at above 50 tokens/s > KV size only scales with layers and latent dim and data format, for DeepSeek V3 with MLA this would be 4.48TB for 256 concurrent users at 1 million context and FP4 (they probably have something better than this. 
since I overestimate memory usage I go with the lower batch size of 256 instead of 512) > so 4.48TB for context and 1TB of overhead > ~5.5TB of our precious memory gone > ~6.788TB memory left > max model size at FP4 -> ~12.576T params My prior vibe-estimate before doing all of this: 5-10T Mean estimate based on open-source MoE reasoning models: 8.2T Lower Bound: 1.7T Upper Bound: 12.576T Midpoint between upper and lower bound: 7.138T New estimate: Gemini 3 Pro has around ~7.5T params (big uncertainty here due to data format, batch-size and memory requirements)
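
The memory-budget part of the estimate can be reproduced as plain arithmetic. The sketch assumes exactly 0.5 bytes per parameter for FP4 (ignoring any scale-factor overhead) and decimal units (1 TB = 1000 GB); all inputs are taken from the post. Under those assumptions the 6.788 TB remainder works out to about 13.576T parameters, so the ~12.576T upper bound quoted in the post appears to be 1T lower than its own inputs imply.

```python
# Assumption: FP4 stores 4 bits (0.5 bytes) per parameter, no scale overhead.
BYTES_PER_PARAM_FP4 = 0.5

tpus_per_rack = 64
gb_per_tpu = 192
rack_gb = tpus_per_rack * gb_per_tpu          # total rack memory in GB
rack_tb = rack_gb / 1000                      # decimal terabytes

# Trillions of parameters that fit if weights used all rack memory.
max_params_t = rack_tb / BYTES_PER_PARAM_FP4

# From the post: 4.48 TB of KV cache plus 1 TB overhead, rounded to ~5.5 TB.
kv_and_overhead_tb = 5.5
remaining_tb = rack_tb - kv_and_overhead_tb

# Trillions of parameters that fit in the remaining memory at FP4.
params_after_kv_t = remaining_tb / BYTES_PER_PARAM_FP4
```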

Tijmen Blankevoort@TiRune·23 Oct

@cuijiaxun Lmk if you’re interested in Nvidia!

Jiaxun Cui 🐿️@cuijiaxun·23 Oct

Meta has gone crazy on the squid game! Many new PhD NGs are deactivated today (I am also impacted🥲 happy to chat)

Yuandong Tian@tydsh

Several of my team members + myself are impacted by this layoff today. Welcome to connect :)

Tijmen Blankevoort@TiRune·5 Oct

@karpathy @eigenrobot Make sure to do it in hardcore

Andrej Karpathy@karpathy·5 Oct

@eigenrobot World of Warcraft Classic grinding mobs, simple questing is mine. Repetitive skill rotation with just enough variety to keep it fun/engaging but easy. A lot of *wrong* answers in the replies here, games that are nowhere near mindless enough, e.g. Factorio.

eigenrobot@eigenrobot·4 Oct

any good video games for zoning out and listening to podcasts

Tijmen Blankevoort@TiRune·18 Sep

@thepushkarp arxiv.org/abs/2306.12929 Our work from 3 years ago showing why attention sinks exist. I believe we were the first ones? 😅

pushkar /ˈpʊʃkər/@thepushkarp·17 Sep

this was a good read, esp the comparison of attention to graphs i didn’t understand all of it though. looking for more reads around attention sinks. what should i be looking at?
