Neil Tenenholtz

1.5K posts

@ntenenz

Multimodal model training for biology / healthcare at MSR

Boston, MA · Joined February 2016
1.1K Following · 958 Followers
Neil Tenenholtz@ntenenz·
@m_sirovatka Yes, but you don't have to purge in-flight generations. Provided you have the bandwidth to push the weight updates, pushing allows you to maximize your "on-policyness" w/o introducing undue delay and/or load.
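A minimal sketch of the push approach described above, assuming a torch.distributed process group that spans the trainer and the inference engines; a real deployment would push over RDMA/NCCL into a serving engine (e.g. vLLM) whose update hooks differ, so the function below is illustrative only.

```python
import torch
import torch.distributed as dist

def push_weights(model: torch.nn.Module, src_rank: int = 0) -> None:
    """Broadcast the trainer's current parameters to every inference rank.

    In-flight generations are not purged: each engine swaps the new weights in
    at a decode-step boundary, so existing KV caches keep being consumed and
    only subsequent tokens are sampled under the fresher (more on-policy)
    weights.
    """
    with torch.no_grad():
        for param in model.parameters():
            # Every rank calls this; rank `src_rank` sends, the others receive
            # the update in place.
            dist.broadcast(param.data, src=src_rank)

# Trainer (rank 0): optimizer.step(); push_weights(policy_model)
# Inference ranks: call push_weights(serving_model) between decode steps, then
# resume all in-flight sequences.
```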
Matej Sirovatka@m_sirovatka·
@ntenenz I mean you still have to "pause" the engine, but yeah, decided on this (for a different reason though)
Matej Sirovatka@m_sirovatka·
Trainer push or inference pull for weight transfer with RDMA, and why?
Neil Tenenholtz retweeted
Ava Amini@avapamini·
The Hallmarks of Cancer brought clarity through abstraction, but cancer is more complex than we can reduce. Honored to build on their legacy in our @CellCellPress perspective on generative models as unified, multimodal systems for cancer discovery 📄 msft.it/6013QhkUS!
Neil Tenenholtz retweeted
Ava Amini@avapamini·
Protein language models capture rich structural signals, but where that knowledge lives in the network is still unclear. We show that small subnetworks inside PLMs encode structural concepts, from residues to folds: journals.plos.org/ploscompbiol/a… @PLOSCompBiol work led by @riavinod_!
Neil Tenenholtz@ntenenz·
@CUDAHandbook One interesting question is how we ship libraries as the kernel count grows. If I can hyperoptimize in a narrow setting, the count could quickly blow up -- should I ship all of them? If so, the question becomes how we do so in a safe, scalable fashion.
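One hedged sketch of what "safe, scalable" shipping could look like: keep a general fallback kernel and register each hyperoptimized variant only for the narrow setting it was tuned on. The names and dispatch key below are hypothetical, not any existing library's API.

```python
from typing import Callable, Dict, Tuple

KernelKey = Tuple[str, int, str]  # hypothetical key: (op, head_dim, dtype)
_REGISTRY: Dict[KernelKey, Callable] = {}

def register(op: str, head_dim: int, dtype: str):
    """Decorator adding a specialized kernel for one narrow setting."""
    def wrap(fn: Callable) -> Callable:
        _REGISTRY[(op, head_dim, dtype)] = fn
        return fn
    return wrap

def dispatch(op: str, head_dim: int, dtype: str, fallback: Callable) -> Callable:
    """Use the hyperoptimized variant only on an exact match; otherwise fall
    back to the general kernel, so correctness never depends on having shipped
    every specialized variant."""
    return _REGISTRY.get((op, head_dim, dtype), fallback)
```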
Nicholas Wilt@CUDAHandbook·
Did DeepBlue or AlphaGo have “superhuman intelligence”? No. That said, tools such as this kernel optimizer are extremely valuable, especially if they can produce simpler, human-readable (or even human-modifiable) code.
Neil Tenenholtz retweeted
Carles Domingo-Enrich@cdomingoenrich·
(1/9) Most LM fine-tuning optimizes next-token loss or scalar rewards. What if we fine-tune language models so that feature statistics of partial rollouts match those of ground-truth completions? That leads to Energy-Based Fine-Tuning (EBFT). arXiv: arxiv.org/abs/2603.12248
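A rough sketch of the general idea only (matching feature statistics of partial rollouts to those of ground-truth completions); the actual EBFT objective in the paper may be formulated quite differently, and the choice of features here is purely illustrative.

```python
import torch

def feature_matching_loss(rollout_hidden: torch.Tensor,
                          reference_hidden: torch.Tensor) -> torch.Tensor:
    """MSE between mean hidden-state features of partial rollouts and of
    ground-truth completions.

    rollout_hidden:   (batch, rollout_len, d_model) states from partial rollouts
    reference_hidden: (batch, ref_len, d_model) states from completions
    """
    rollout_stats = rollout_hidden.mean(dim=1)      # (batch, d_model)
    reference_stats = reference_hidden.mean(dim=1)  # (batch, d_model)
    return torch.mean((rollout_stats - reference_stats) ** 2)
```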
Neil Tenenholtz@ntenenz·
I wonder if it's correlation in the evals (causal) or instead everyone jointly prioritizing the same evals (correlation). If the same evals are prioritized, one could easily imagine marginal network capacity being devoted to those that underperform, inducing correlation. Thoughts @DimitrisPapail?
Edward Z. Yang@ezyang·
One of the things I find exciting about the Carlini work is that we can come up with new high-level ideas for how compilers should be put together and then spend $20k to find out if they actually make developing a compiler faster.
Neil Tenenholtz@ntenenz·
@giffmana @maxjaderberg It's the difference between maximizing publication count vs maximizing impact. Beating a baseline gets you a paper. Beating a well-tuned baseline gets you (a better chance at) lasting impact.
Lucas Beyer (bl16)@giffmana·
@maxjaderberg Yep! I have so many cool ideas, but I'm still busy squeezing even more juice out of very simple things to date.
Max Jaderberg@maxjaderberg·
If you’re not pushing your baselines (existing methods) as hard as possible before developing something new, there’s no way to trust any positive result. That holds just as much for your own results and your own group’s as for an external paper.
Lucas Beyer (bl16)@giffmana

PSA: never, ever write "we use the same learning rate across all methods for fair comparison." I read this as "do not trust any of our conclusions" and then I move on. If learning rate tuning is not mentioned, it takes me a little more time to notice that, but I also move on.

Neil Tenenholtz@ntenenz·
@StasBekman That's the right intuition. By quadratic function, I meant to include the lower-order terms as well. The linear term is largely the FFNs and QKVO projections. The quadratic term is the attn.
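As a toy illustration of that decomposition (the coefficients are placeholders to be estimated empirically, as discussed later in the thread):

```python
def sequence_cost(seq_len: int,
                  linear_coeff: float = 1.0,
                  quad_coeff: float = 0.01) -> float:
    """Relative compute cost of one (packed) sequence: a linear term for the
    FFNs and QKVO projections plus a quadratic term for attention."""
    return linear_coeff * seq_len + quad_coeff * seq_len ** 2
```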
Stas Bekman@StasBekman·
Thank you for the additional details, Neil. That's very much along the lines I have been thinking. Good call on figuring out the term. I was thinking of doing some weighted average of the linear and quadratic token lengths, where the gamma can be a function of seqlen, since the longer the seqlen of a single sequence, the more impact the quadratic term would have. Oh, and the paper looks excellent as well. Much appreciate your input, Neil.
Stas Bekman@StasBekman·
Has anyone solved the load-balancing issue in a DataLoader across multiple ranks, to ensure each batch has a similar cost and to avoid outliers that would slow all ranks down? E.g., with SFT many instruction samples are packed into a larger sequence, but due to quadratic attention a packed 10x 100-token sample will finish much faster than a 1x 1000-token sample. So the cost function can't be the number of tokens, but some sort of sum of quadratic token lengths, since attention for a long sequence would dominate all linear ops.

I was thinking to simply sort samples by cost function after they are packed, and not randomize the DataLoader, which should already do pretty good load balancing in practice. The problem is that it could impact learning, since the model would see long sequences first and then shorter ones, or vice versa. The other idea I have is a DataLoader wrapper which rebalances batches across ranks on the fly while keeping buffers of discarded items and re-using them later.

Asking if perhaps someone has already experimented with some ideas and found an optimal one. Thank you!
Neil Tenenholtz@ntenenz·
@StasBekman Persist tokenized batches, not samples. Turning it into an offline, or pseudo-offline, problem gives you much more flexibility.
Neil Tenenholtz@ntenenz·
Ok, now that I have a bit more time to respond... The "straightforward" way is to greedily pack within buckets. The size of the buckets will have a nontrivial impact on the uniformity of the compute cost, and you'll have to make some decisions around bucket size, re-sorting batches, etc. This will be a balancing act between maintaining your desired curriculum and maximizing utilization.

You'll likely want to estimate your cost function empirically, as it's a function of your parallelism strategy, kernels, etc. You can start with a pretty simple quadratic model.

Finally, depending on sequence length and model size, the relative importance of this can vary. Perform the basic FLOP math (including the cross-token attention term) and get a sense of the sequence length at which that term starts to matter -- the frequently cited 6N drops it. There is some public work that goes beyond this, for example arxiv.org/pdf/2509.21841…, if you're especially interested.
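A minimal sketch of the greedy, cost-balanced packing described above, assuming a per-sequence cost function along the lines of the linear-plus-quadratic toy model earlier in the thread. Bucket sizing, periodic re-sorting, and the curriculum-vs-utilization trade-off are deliberately left out.

```python
import heapq
from typing import Callable, List

def pack_balanced(seq_lens: List[int],
                  num_ranks: int,
                  cost_fn: Callable[[int], float]) -> List[List[int]]:
    """Assign sequences (costliest first) to the currently cheapest rank so
    every rank's estimated per-step cost ends up roughly equal."""
    heap = [(0.0, rank) for rank in range(num_ranks)]  # (total cost, rank)
    heapq.heapify(heap)
    buckets: List[List[int]] = [[] for _ in range(num_ranks)]
    for length in sorted(seq_lens, key=cost_fn, reverse=True):
        total, rank = heapq.heappop(heap)
        buckets[rank].append(length)
        heapq.heappush(heap, (total + cost_fn(length), rank))
    return buckets

# e.g. pack_balanced([1000] + [100] * 10, num_ranks=2, cost_fn=sequence_cost)
# puts the single 1000-token sequence alone on one rank and the ten 100-token
# sequences on the other, rather than splitting by sequence count.
```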
Stas Bekman@StasBekman·
@ntenenz No problem at all, Neil. Either works if the online one is efficient enough ;)
Neil Tenenholtz@ntenenz·
@StasBekman Sorry! I thought you were describing an effort to perform packing/batching in a streaming fashion online rather than offline.
Stas Bekman@StasBekman·
@ntenenz Thank you for the follow-up, Neil. We have total flexibility to pre-process ahead of time, so offline is indeed ideal. But I'm not sure what you mean by "persist tokenized batches not samples" - we are working with tokenized batches already.
Neil Tenenholtz@ntenenz·
@cn8011 In the 1st image, the annotations are largely DICOM metadata. Easily parseable via a variety of OSS libraries. The image annotation in the 2nd image contains 2 vertebrae, L3-L5 is 3. 🙃
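For example, with pydicom (one of the OSS libraries alluded to), reading that metadata is a few lines; the file path and the assumption that these standard tags are present are illustrative only.

```python
import pydicom

# Placeholder path; any DICOM file carrying these standard tags would do.
ds = pydicom.dcmread("example_ct_slice.dcm")
print(ds.PatientID)          # patient identifier
print(ds.StudyDescription)   # study-level description
print(ds.SliceThickness)     # acquisition geometry
```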
cn80@cn8011·
@ntenenz The annotations aren't part of Cornerstone, right? There are also plenty of proprietary integrations for storing & serving the images; there should be far more open source & AI to reduce dev effort.