Tijmen Blankevoort

508 posts

@TiRune

Amsterdam, The Netherlands · Joined May 2009
217 Following · 675 Followers
Tijmen Blankevoort@TiRune·
@CoreAutoAI It turns your deep learning network into a boosting ensemble. Don’t think it’s just an optimizer question.
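One way to read the "boosting ensemble" remark, offered as a sketch and not as the author's derivation: assume each residual block computes its input plus a learned correction. The symbols $x_l$, $f_l$ and $L$ are notation introduced here for illustration.

```latex
x_{l+1} = x_l + f_l(x_l), \qquad l = 0, \dots, L-1
\quad\Longrightarrow\quad
x_L = x_0 + \sum_{l=0}^{L-1} f_l(x_l)
```

Unrolled this way, the output is a sum of per-block corrections, each fitted on top of the running sum of its predecessors, which resembles the stage-wise additive model that boosting builds.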
Core Automation@CoreAutoAI·
Are residual connections a hack, or a provably optimal way to shape your loss landscape?
Hayden Prairie@hayden_prairie·
We’ve been thinking a lot about scaling laws, wondering if there is a more effective way to scale FLOPs without increasing parameters. Turns out the answer is YES – by looping blocks of layers during training. We find that predictable scaling laws exist for layer looping, allowing us to use looping to achieve the quality of a Transformer twice the size. Our scaling laws suggest that for a fixed parameter budget, data and looping should be increased in tandem! 🧵👇
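A toy sketch of what "looping blocks of layers" means for the parameter and FLOP budgets, assuming a block is simply applied several times with shared weights. The classes `Layer` and `LoopedStack` and the parameter `n_loops` are hypothetical names for illustration; this is not the thread's training setup and says nothing about quality.

```python
# Toy illustration: looping a block reuses the same weights on every pass,
# so FLOPs grow with the loop count while the parameter count stays fixed.
# Layer, LoopedStack and n_loops are hypothetical names for this sketch.

class Layer:
    """A dense layer y = relu(W x) on plain Python lists."""

    def __init__(self, width, scale):
        # Deterministic diagonal weights so the example is reproducible.
        self.width = width
        self.w = [[scale if i == j else 0.0 for j in range(width)]
                  for i in range(width)]

    def n_params(self):
        return self.width * self.width

    def flops(self):
        # One multiply and one add per weight.
        return 2 * self.width * self.width

    def __call__(self, x):
        out = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in self.w]
        return [max(0.0, v) for v in out]


class LoopedStack:
    """Applies the same list of layers n_loops times in sequence."""

    def __init__(self, layers, n_loops):
        self.layers = layers
        self.n_loops = n_loops

    def n_params(self):
        # Weights are shared across loops, so they are counted once.
        return sum(layer.n_params() for layer in self.layers)

    def flops(self):
        # Every loop pays the full compute cost of the block again.
        return self.n_loops * sum(layer.flops() for layer in self.layers)

    def __call__(self, x):
        for _ in range(self.n_loops):
            for layer in self.layers:
                x = layer(x)
        return x


block = [Layer(4, 0.5), Layer(4, 2.0)]
plain = LoopedStack(block, n_loops=1)
looped = LoopedStack(block, n_loops=3)
print(plain.n_params(), looped.n_params())  # same parameter count
print(plain.flops(), looped.flops())        # compute scales with n_loops
```

The point of the sketch is only the bookkeeping: `n_loops` is a second scaling knob that raises compute per token without adding weights.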
Tijmen Blankevoort@TiRune·
@didier_lopes @Tim_Dettmers Basically, just clipping the large activations is very harmful. If you remove some of the larger weights in the corresponding channels, you similarly reduce the activations, which breaks the model. This happens on any transformer with softmax attention, and it gets worse the longer you train.
Didier Lopes@didier_lopes·
Something cool about this is that I was reading this post timdettmers.com/2022/08/17/llm… (based on this paper proceedings.neurips.cc/paper_files/pa…) by @Tim_Dettmers from 2022, and this Super Weight paper basically confirmed Tim's intuition and dug deeper.
Dettmers (2022) -> Super Weight paper (2024)
"they only occur in 6 feature dimensions" -> "we find relationships between two individual scalars - up to six weights and one activation"
"they occur in all layers" -> "The super activation persists throughout the model at exactly the same magnitude and position regardless of the prompt" (the Super Weight paper explains that this happens because of the skip connections)
"active in up to 75% of all sequence dimensions" -> "super activations often appear after the super weight, persist throughout subsequent layers with constant magnitude and position, irrespective of the input prompt"
"consumed in the attention function and the second feedforward network" -> "The super weight is consistently found in the down projection of the feed-forward network following the attention block, typically in an early layer"
"Transformers become more stable" -> "When preserved with high precision, super activations can improve simple round-to-nearest quantization to become competitive with state-of-the-art methods"
Didier Lopes@didier_lopes

This was a really good read. h/t @guohao_li

Tijmen Blankevoort@TiRune·
@ID_AA_Carmack Yup, it’s basically int8 with a lot of dynamic range. Fp16 will also look a lot better! That’s just fp32 without that much range.
John Carmack@ID_AA_Carmack·
Making a scatter plot of 400_000 data points, some of the plots had odd gaps in coverage. It took me a little while to realize that it was only when the data was farther from the origin -- it was the raw bfloat16 precision. Everything looks great from -1 to 1, but as you go past 2 and 4, the coverage gaps get larger. My intuition didn't have it being quite so "discretely countable" at those modest numeric values. Float32 for comparison.
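The gaps described here follow from the format itself: bfloat16 stores 7 mantissa bits, so each power-of-two interval holds only 128 values and the spacing doubles at every power of two. A minimal standard-library sketch; the helper names `to_bfloat16`, `spacing` and `distinct_bf16` are made up for this example, and NaN and infinity handling is deliberately ignored.

```python
import math
import struct


def to_bfloat16(x):
    """Round x to the nearest bfloat16 value (ties to even), as a Python float.

    bfloat16 is the top 16 bits of a float32, so we round away the low 16 bits.
    NaN and infinity are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def spacing(x, mantissa_bits):
    """Gap between neighbouring representable values around x (normal range)."""
    _, exp = math.frexp(abs(x))  # abs(x) = m * 2**exp with 0.5 <= m < 1
    return 2.0 ** (exp - 1 - mantissa_bits)


def distinct_bf16(lo, hi, n):
    """Number of distinct bfloat16 values hit by n evenly spaced samples in [lo, hi)."""
    return len({to_bfloat16(lo + (hi - lo) * i / n) for i in range(n)})


for x in (0.75, 1.5, 3.0, 5.0):
    print(x,
          "bf16 gap:", spacing(x, 7),    # bfloat16: 7 stored mantissa bits
          "fp16 gap:", spacing(x, 10),   # float16: 10 stored mantissa bits
          "fp32 gap:", spacing(x, 23))   # float32: 23 stored mantissa bits

print("distinct bf16 values seen between 4 and 8:", distinct_bf16(4.0, 8.0, 10000))
```

With hundreds of thousands of points, a scatter plot between 4 and 8 can land on only about 128 distinct coordinates per axis, which is what shows up as visible striping.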
Tijmen Blankevoort@TiRune·
I wonder if the negative sink weight rejection is more of an optimization issue. In our original paper describing both sinks and gated attention: arxiv.org/pdf/2306.12929, we also showed how clipping the softmax gets rid of sink behavior. It must be at least part of the explanation?
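A minimal sketch of what clipping a softmax can look like, assuming a stretch-and-clip form: the probabilities are stretched slightly beyond [0, 1] and then clipped back. The function name `clipped_softmax` and the default values of `gamma` and `zeta` are illustrative assumptions here, not settings taken from the linked paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def clipped_softmax(logits, gamma=-0.03, zeta=1.0):
    """Stretch softmax outputs from (0, 1) to (gamma, zeta), then clip to [0, 1].

    With gamma < 0, small probabilities become exactly 0, so a head can
    ignore tokens without pushing any logit to an extreme value.
    gamma and zeta are hypothetical defaults chosen for this sketch.
    """
    return [min(1.0, max(0.0, (zeta - gamma) * p + gamma))
            for p in softmax(logits)]


logits = [10.0, 0.0, 0.0, 0.0]
print(softmax(logits))          # every entry is strictly positive
print(clipped_softmax(logits))  # the small entries are clipped to exactly 0.0
```

Under this form the weights no longer have to sum to 1, which removes the pressure to park leftover probability mass on a single token.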
λux@novasarc01·
cool experiment! the negative sink weights are the strongest signal here. the model is actively rejecting a synthetic sink which pretty cleanly rules out “softmax forces sink” in this setting.
Muyu He@HeMuyu0327

We are interested in whether Kimi's Attention Residual has the same "attention sink" problem as attention layers. To figure out the answer, we design two novel architectures on top of Attn Res on nanochat.

- The problem: since attention is computed with softmax, attention scores always sum to 1. So the model is forced to assign, on average, non-trivial weight to each individual component. In attention, the model solves this by absorbing most attention into the key of the first token, i.e. the "attention sink". Since Attn Res also uses softmax, we want to know if it suffers from the same issue and needs a "sink" to absorb extra attention. In Attn Res, the token embedding output also receives substantial attention from every subsequent layer. Is this behaving like a sink, or is it genuinely useful?

- The architectural change: For the first model, we add a learnable scalar to each layer of the Attn Res model, following the learnable sink design of GPT-OSS. During attention residual computation, this sink scalar is concatenated with the other logits before softmax, essentially absorbing some attention. For the second model, we add a gate at the output of the attention residual, which scales each dimension by a factor in (0, 1), following the gate design of Qwen's gated attention. This essentially undoes any overmixing softmax attention might cause for each hidden dim.

- The effect: both models seem to show that there is no attention sink problem brought about by softmax attention for Attn Res. For the learnable sink model, we plot the attention of each layer to previous outputs, with the first column being attention to the sink (p1). We find that most attention is still on the embedding output, even though there is a sink for extra attention. This shows that the model does focus on the embedding output for specific gains. Looking at the values of the learnable sink, which is zero-initialized, we find that most layers drive the value negative, essentially reducing its effect on attention even more (p2). This is a clear signal that the model wants to allocate the existing attention budget as much as possible to real layer outputs. For the gated model, we notice that the model does learn to scale the outputs in a pretty specific way. As the gate matrices are randomly initialized near 0, the initial gate values should be centered around 0.5, but we see that for each layer, the gate values are evenly divided between the two extremes near 0 and 1 (p3). This shows that the model is actively trying to scale each dimension.

- The performance: interestingly for the gate model and as expected for the sink model, the FLOP-controlled validation loss for both is almost identical to the Attn Res baseline. Although the gate model learns to scale the outputs, this scaling seems to have little impact on the actual effectiveness of model computation (p4). Compared to the baseline, which is a GPT-2 style 12-layer 124M model, all three Attn Res variants outperform the baseline with minimal parameter overhead (gate matrices are rank-4 up and down proj matrices, sink is just a bunch of scalars). They also outperform Andrej's own version of "attention residual", which is a weighted combination of the current residual stream and the embedding.

- What's next: Attn Res is a very cool model, and we have found a bunch of interesting things about it lately. Will share more interp insights and arch variants in the coming days (e.g. it seems to change the 'curse of depth' dynamic quite a bit, which is interesting).
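The learnable-sink variant described in the thread can be sketched as a softmax with one extra logit that soaks up probability mass. This is a minimal illustration under stated assumptions: `softmax_with_sink` is a hypothetical helper, the sink is appended last here (the thread plots it as the first column), and in a real model the sink value would be a trained per-layer parameter.

```python
import math


def softmax_with_sink(logits, sink_logit):
    """Softmax over the real logits plus one extra sink logit.

    Returns (weights_on_real_inputs, weight_absorbed_by_sink).
    The sink contributes no value vector, so its weight is simply discarded.
    """
    all_logits = list(logits) + [sink_logit]
    m = max(all_logits)
    exps = [math.exp(v - m) for v in all_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs[:-1], probs[-1]


layer_logits = [0.0, 0.0, 0.0]

# Zero-initialized sink: it takes an equal share of the attention budget.
weights_zero, sink_zero = softmax_with_sink(layer_logits, 0.0)

# A sink driven negative during training absorbs much less attention,
# leaving nearly the whole budget for the real layer outputs.
weights_neg, sink_neg = softmax_with_sink(layer_logits, -5.0)

print(sink_zero, sink_neg)
```

Read this way, a negative learned sink means the model is shrinking the escape hatch, not using it.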

catid@MrCatid·
@mgostIH That’s how NVFP4 works too btw and every other quant scheme afaik
mgostIH@mgostIH·
Nobody will tell you, but there's a free lunch you can get in TurboQuant by vector quantization. If you quantize 8 dimensions at a time rather than a single scalar, you can get higher accuracy because you cover the joint distribution better, see QuIP#
Tijmen Blankevoort@TiRune·
@tsengalb99 Also pretty impressive they miss citing SpinQuant and QuaRot, which both apply rotations specifically for the KV-cache 😂
Tijmen Blankevoort retweeted
Bryan Catanzaro@ctnzr·
Today, @NVIDIA is launching the open Nemotron 3 model family, starting with Nano (30B-3A), which pushes the frontier of accuracy and inference efficiency with a novel hybrid SSM Mixture of Experts architecture. Super and Ultra are coming in the next few months.
Chris Barber@chrisbarber·
I made an unofficial NeurIPS 2025 hiring list:
@rronak_, @QuantumArjun, @michaelelabd, stealth, I’m a small investor: RL post-training from live product usage. Research Engineers.
@jonsidd, Turing: data for frontier models. Research Engineers, SWEs.
@schwarzjn_, ICL & Thomson Reuters: LLMs for law. Research Engineers, SWEs, PhD students.
@panda_liyin, AdaL: copilot for ML engineering. MLEs, SWEs.
@sarwal_varuni, TriFetch: data and post-training for medical AI.
@bidhan, Bagel Labs: decentralized training for diffusion models. MLEs, ML Scientists.
@meggmcnulty, Cosmic Labs: AI-native OS for embedded engineering. MLEs, SWEs, systems engineers.
@samuelekpe, GrupaAI: operating system for AI agents. SWEs.
@jaradcannon, Humanoid: industrial humanoids. SWEs and applied researchers.
@saurabh_here1, Cantina: AI native social media. Research interns for video gen.
@RicardoMonti9, DatologyAI: frontier data curation (filtering, mixing, synthetic) for LLMs. Research Scientists, MLEs, SWEs.
@NimaGard, Path Robotics: physical AI to automate manufacturing tasks (e.g. welding). MLEs for robot learning.
@DrJimFan, Nvidia robotics team. Research Engineers, SWEs.
@katherine1ee, OpenAI pretraining safety team. Research Engineers.
@BorisMPower, OpenAI applied AI research team. Research Engineers.
@j_asminewang, OpenAI alignment team. Research Engineers, Research Scientists.
@zijianwang30, MSL data research team. Research Engineers, Research Scientists.
@RuiqiGao, Google DeepMind video gen team. Research Engineers, Research Scientists.
@joshim5, Chai Discovery: molecule prediction for drug discovery. Research Engineers, SWEs.
@crisbodnar, Project Prometheus: AI for manufacturing and logistics. Research Engineers.
@vdbergrianne, Microsoft Research Amsterdam materials science team. Research Engineers.
@kamath_sutra, Smallest: AI for call centers. SWEs.
@idavidrein, METR: frontier model evaluation. Research Engineer.
@jimmysmith1919, Liquid AI: on-device models. MLEs, Research Engineers.
@alxndrdavies, AI Security Institute: red-teaming. Research Scientists/Engineers.
@stuhlmueller, Elicit: AI for scientific research and good reasoning. MLEs, SWEs.
@gavincrooks, @FarisSbahi, Normal Computing: physics-based ASICs. Research Engineers, SWEs.
@myra_deng, Goodfire AI: interpretability research. Research Engineers, Research Scientists, MLEs.
@_lychrel, @SergeiIakhnin, @ja_kirkpatrick, @sbos, Isomorphic Labs: AI-first drug discovery. Research Engineers, Research Scientists, MLEs.
@kdqg1, @bneyshabur, Anthropic AI Scientist team. Research Engineers with infra experience.
@sarahookr, Adaption: continuous learning. Research Engineers.
@francedot, Cua, I’m a small investor: infra for computer-use agents. SWEs, Research Engineers.
@iScienceLuvr, Sophont: multimodal models for healthcare. Research Engineers/Research Scientists.
@aditshah00, Until Labs: organ preservation. MLEs.
@RitvikKapila & @gauri__gupta, NeoSigma: evals and post-training for real world agents. SWEs.
@abeirami, stealth: reliability & statistical evaluation. Research Engineers & SWEs.
@adityachinchure, Ideogram: image generation. Research Engineers.
@AndrewLBeam, @kenneth0stanley, Lila Sciences: autonomous labs, verifiability for science. Research Engineers, MLEs.
@brianwilt, Waymo: ML infra for motion planning team. Senior SWEs.
@thisismadani, Profluent Bio: protein generation for drug development. MLEs.
Romi Lifshitz@RomiLifshitz·
@TiRune Would love to chat! (but your DMs are closed!)
Tijmen Blankevoort@TiRune·
Looking for cracked full-time Deep Learning researchers on Efficiency, Quantization and Sparsity. Join our world-class applied deep learning research team at Nvidia. The team creates the Nemotron models, and we influence the hardware with our research. Shoot me a message! I'm at NeurIPS!
Xiangming Gu@gu_xiangming·
Congratulations to @Alibaba_Qwen for winning the NeurIPS 2025 Best Paper Award. Great to hear that attention sink attracts a lot of attention. Here is why I think gated attention eliminates the attention sink: the gate mechanism implements a "no-op" (do not update token representations), removing the need to develop an attention sink to achieve it. Please also check our two papers about when attention sink emerges in LLMs (openreview.net/forum?id=78Nn4…) and why LLMs need attention sink (arxiv.org/abs/2504.02732). In my first paper, I showed some attention variants that are attention-sink-free, like sigmoid attention and some linear attention.
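The "no-op" argument can be illustrated with a small sketch: softmax weights form a convex combination of the value vectors, so the output cannot be zero unless the values allow it, while an output gate can switch the update off directly. The helpers `attend` and `gated_attend` and the gate logits are hypothetical, chosen only to show the mechanism.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def attend(weights, values):
    """Weighted sum of value vectors (weights are assumed to sum to 1)."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def gated_attend(weights, values, gate_logits):
    """Same weighted sum, scaled per dimension by a sigmoid gate in (0, 1)."""
    out = attend(weights, values)
    return [sigmoid(g) * o for g, o in zip(gate_logits, out)]


values = [[1.0, 2.0], [3.0, 4.0]]

# Whatever the softmax weights are, the output stays between the value vectors.
plain_out = attend([0.5, 0.5], values)

# A strongly negative gate logit turns the update into an approximate no-op.
gated_out = gated_attend([0.5, 0.5], values, gate_logits=[-20.0, -20.0])

print(plain_out, gated_out)
```

Without a gate, the only way to approximate this no-op is to route attention to a token whose value vector is near zero, which is the role usually attributed to the sink token.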
Qwen@Alibaba_Qwen

🏆 We are incredibly honored to announce that our paper, "Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free" has received the NeurIPS 2025 Best Paper Award! A huge congratulations to our dedicated research team for pushing the boundaries of AI. Read more: blog.neurips.cc/2025/11/26/ann…

Qwen@Alibaba_Qwen·
🏆 We are incredibly honored to announce that our paper, "Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free" has received the NeurIPS 2025 Best Paper Award! A huge congratulations to our dedicated research team for pushing the boundaries of AI. Read more: blog.neurips.cc/2025/11/26/ann…
Lisan al Gaib@scaling01·
@david_sepulvado very high. OpenAI released their open-source models 3 months ago in FP4, and other open-source models come natively with QAT in FP4, like Kimi-K2 Thinking. Google pioneered a lot of these techniques.
Lisan al Gaib@scaling01·
Gemini 3 Pro has around ~7.5T params (vibe-mathing with explanation)
> the naive fit with an R^2 of 0.8816 yields a mean estimation of 2.325 Quadrillion parameters
> ummm, that's not it
> let's only take sparse MoE reasoning models
> this includes gpt-oss-20B and 120B, Qwen3 Next, MiniMax, Qwen3 235B, GLM-4.6, DeepSeek-V3.1 Terminus, DeepSeek V3.2, DeepSeek R1 0528 and Kimi K2 Thinking
> R^2 of 0.9478
> mean estimate of 604T params
> pretty sure that's not it either
> okay, let's take the most optimistic series of points
> (the idea here is that the Google Team is at least on this open-source frontier, if not ahead)
> MiniMax-M2, GLM-4.6, and DeepSeek R1 0528
> that's more like it, but YIKES
> confidence intervals are fucking cooked
> mean estimate of 19.6T with the lower 95% bound at 1.7T
> I will take 1.7T as our minimum model size for Gemini 3 Pro
> okay fuck DeepSeek-R1, we are going full retard, the most optimal of points
> confidence intervals are dead
> 2 point regression, R^2 = 1, AGI achieved
> mean estimate of 8.2T params
> TPUv7 rack has 64 TPUs @ 192GB/TPU = 12288 GB
> I assume they wouldn't want multi-rack inference because of latency, complexity or whatever
> they are likely serving in FP4 which limits the maximum model to 24.576T params
> inference max shows that a GB200 NVL72 which is very similar to TPUv7 rack setup can serve 512 or even 1024 users at above 50 tokens/s
> KV size only scales with layers and latent dim and data format, for DeepSeek V3 with MLA this would be 4.48TB for 256 concurrent users at 1 million context and FP4 (they probably have something better than this. since I overestimate memory usage I go with the lower batch size of 256 instead of 512)
> so 4.48TB for context and 1TB of overhead
> ~5.5TB of our precious memory gone
> ~6.788TB memory left
> max model size at FP4 -> ~12.576T params

My prior vibe-estimate before doing all of this: 5-10T
Mean estimate based on open-source MoE reasoning models: 8.2T
Lower Bound: 1.7T
Upper Bound: 12.576T
Midpoint between upper and lower bound: 7.138T

New estimate: Gemini 3 Pro has around ~7.5T params (big uncertainty here due to data format, batch-size and memory requirements)
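Re-running the rack arithmetic from the thread as a check. All inputs are the thread's own estimates, not measured values, and the variable names are introduced here. With those inputs, 6.788 TB of free memory at 0.5 bytes per parameter works out to about 13.576T parameters, about 1T above the 12.576T quoted, which moves the midpoint from 7.138T to roughly 7.638T and leaves the ~7.5T headline estimate in the same range.

```python
# Inputs taken from the thread (estimates, not measurements).
TPUS_PER_RACK = 64
GB_PER_TPU = 192
BYTES_PER_PARAM_FP4 = 0.5   # 4 bits per parameter
RESERVED_TB = 5.5           # thread's rounding of 4.48 TB KV cache + 1 TB overhead
LOWER_BOUND_T = 1.7         # lower 95% bound from the thread's regression

# Total rack memory in TB (decimal units, as used in the thread).
rack_tb = TPUS_PER_RACK * GB_PER_TPU / 1000

# Largest model that would fit if the whole rack held only weights.
# 1T parameters at 0.5 bytes each occupy 0.5 TB.
max_params_t = rack_tb / BYTES_PER_PARAM_FP4

# Memory left for weights after KV cache and overhead, and the resulting cap.
free_tb = rack_tb - RESERVED_TB
upper_bound_t = free_tb / BYTES_PER_PARAM_FP4

midpoint_t = (LOWER_BOUND_T + upper_bound_t) / 2

print(f"rack memory:      {rack_tb:.3f} TB")
print(f"weights-only cap: {max_params_t:.3f}T params")
print(f"free for weights: {free_tb:.3f} TB")
print(f"upper bound:      {upper_bound_t:.3f}T params")
print(f"midpoint:         {midpoint_t:.3f}T params")
```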
Andrej Karpathy@karpathy·
@eigenrobot World of Warcraft Classic: grinding mobs and simple questing is mine. Repetitive skill rotation with just enough variety to keep it fun/engaging but easy. A lot of *wrong* answers in the replies here, games that are nowhere near mindless enough, e.g. Factorio.
eigenrobot@eigenrobot·
any good video games for zoning out and listening to podcasts
pushkar /ˈpʊʃkər/@thepushkarp·
this was a good read, esp the comparison of attention to graphs. i didn’t understand all of it though. looking for more reads around attention sinks. what should i be looking at?