Neil Tenenholtz

1.5K posts

@ntenenz

Multimodal model training for biology / healthcare at MSR

Boston, MA · Joined February 2016
1.1K Following · 958 Followers
Neil Tenenholtz@ntenenz·
@m_sirovatka Yes, but you don't have to purge in-flight generations. Provided you have the bandwidth to push the weight updates, pushing allows you to maximize your "on-policyness" w/o introducing undue delay and/or load.
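A minimal sketch of the push approach described above, assuming a torch.distributed process group that spans the trainer and the inference engines; a real deployment would push over RDMA/NCCL into a serving engine (e.g. vLLM) whose update hooks differ, so the function below is illustrative only.

```python
import torch
import torch.distributed as dist

def push_weights(model: torch.nn.Module, src_rank: int = 0) -> None:
    """Broadcast the trainer's current parameters to every inference rank.

    In-flight generations are not purged: each engine swaps the new weights in
    at a decode-step boundary, so existing KV caches keep being consumed and
    only subsequent tokens are sampled under the fresher (more on-policy)
    weights.
    """
    with torch.no_grad():
        for param in model.parameters():
            # Every rank calls this; rank `src_rank` sends, the others receive
            # the update in place.
            dist.broadcast(param.data, src=src_rank)

# Trainer (rank 0): optimizer.step(); push_weights(policy_model)
# Inference ranks: call push_weights(serving_model) between decode steps, then
# resume all in-flight sequences.
```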
Matej Sirovatka@m_sirovatka·
@ntenenz I mean you still have to "pause" the engine, but yeah, decided on this (for a different reason though)
Matej Sirovatka@m_sirovatka·
Trainer push or inference pull for weight transfer with RDMA, and why?
Neil Tenenholtz retweeted
Ava Amini@avapamini·
The Hallmarks of Cancer brought clarity through abstraction, but cancer is more complex than we can reduce. Honored to build on their legacy in our @CellCellPress perspective on generative models as unified, multimodal systems for cancer discovery 📄 msft.it/6013QhkUS!
Neil Tenenholtz retweeted
Ava Amini@avapamini·
Protein language models capture rich structural signals, but where that knowledge lives in the network is still unclear. We show that small subnetworks inside PLMs encode structural concepts, from residues to folds: journals.plos.org/ploscompbiol/a… @PLOSCompBiol work led by @riavinod_!
Neil Tenenholtz@ntenenz·
@CUDAHandbook One interesting question is how we ship libraries as the kernel count grows. If I can hyperoptimize in a narrow setting, the count could quickly blow up -- should I ship all of them? If so, the question becomes how we do so in a safe, scalable fashion.
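One hedged sketch of what "safe, scalable" shipping could look like: keep a general fallback kernel and register each hyperoptimized variant only for the narrow setting it was tuned on. The names and dispatch key below are hypothetical, not any existing library's API.

```python
from typing import Callable, Dict, Tuple

KernelKey = Tuple[str, int, str]  # hypothetical key: (op, head_dim, dtype)
_REGISTRY: Dict[KernelKey, Callable] = {}

def register(op: str, head_dim: int, dtype: str):
    """Decorator adding a specialized kernel for one narrow setting."""
    def wrap(fn: Callable) -> Callable:
        _REGISTRY[(op, head_dim, dtype)] = fn
        return fn
    return wrap

def dispatch(op: str, head_dim: int, dtype: str, fallback: Callable) -> Callable:
    """Use the hyperoptimized variant only on an exact match; otherwise fall
    back to the general kernel, so correctness never depends on having shipped
    every specialized variant."""
    return _REGISTRY.get((op, head_dim, dtype), fallback)
```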
Nicholas Wilt@CUDAHandbook·
Did DeepBlue or AlphaGo have “superhuman intelligence”? No. That said, tools such as this kernel optimizer are extremely valuable, especially if they can produce simpler, human-readable (or even human-modifiable) code.
Neil Tenenholtz retweeted
Carles Domingo-Enrich@cdomingoenrich·
(1/9) Most LM fine-tuning optimizes next-token loss or scalar rewards. What if we fine-tune language models so that feature statistics of partial rollouts match those of ground-truth completions? That leads to Energy-Based Fine-Tuning (EBFT). arXiv: arxiv.org/abs/2603.12248
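A rough sketch of the general idea only (matching feature statistics of partial rollouts to those of ground-truth completions); the actual EBFT objective in the paper may be formulated quite differently, and the choice of features here is purely illustrative.

```python
import torch

def feature_matching_loss(rollout_hidden: torch.Tensor,
                          reference_hidden: torch.Tensor) -> torch.Tensor:
    """MSE between mean hidden-state features of partial rollouts and of
    ground-truth completions.

    rollout_hidden:   (batch, rollout_len, d_model) states from partial rollouts
    reference_hidden: (batch, ref_len, d_model) states from completions
    """
    rollout_stats = rollout_hidden.mean(dim=1)      # (batch, d_model)
    reference_stats = reference_hidden.mean(dim=1)  # (batch, d_model)
    return torch.mean((rollout_stats - reference_stats) ** 2)
```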
Neil Tenenholtz@ntenenz·
I wonder if it's correlation in the evals (causal) or instead everyone jointly prioritizing the same evals (correlation). If the same evals are prioritized, one could easily imagine marginal network capacity being devoted to those that underperform, inducing correlation. Thoughts @DimitrisPapail?
Edward Z. Yang@ezyang·
One of the things I find exciting about the Carlini work is that we can come up with new high-level ideas for how compilers should be put together and then spend $20k to find out if they actually make developing a compiler faster.
Neil Tenenholtz@ntenenz·
@giffmana @maxjaderberg It's the difference between maximizing publication count vs maximizing impact. Beating a baseline gets you a paper. Beating a well-tuned baseline gets you (a better chance at) lasting impact.
Lucas Beyer (bl16)@giffmana·
@maxjaderberg Yep! I have so many cool ideas, but I'm still busy squeezing even more juice out of very simple things to date.
Max Jaderberg@maxjaderberg·
If you’re not pushing your baselines (existing methods) as hard as possible before developing something new, there’s no way to trust any positive result. That holds just as much for your own results and your own group’s as for an external paper.
Lucas Beyer (bl16)@giffmana

PSA: never, ever write "we use the same learning rate across all methods for fair comparison." I read this as "do not trust any of our conclusions" and then I move on. If learning rate tuning is not mentioned, it takes me a little more time to notice that, but I also move on.

Neil Tenenholtz@ntenenz·
@StasBekman That's the right intuition. By quadratic function, I meant to include the lower-order terms as well. The linear term is largely the FFNs and QKVO projections. The quadratic term is the attn.
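As a toy illustration of that decomposition (the coefficients are placeholders to be estimated empirically, as discussed later in the thread):

```python
def sequence_cost(seq_len: int,
                  linear_coeff: float = 1.0,
                  quad_coeff: float = 0.01) -> float:
    """Relative compute cost of one (packed) sequence: a linear term for the
    FFNs and QKVO projections plus a quadratic term for attention."""
    return linear_coeff * seq_len + quad_coeff * seq_len ** 2
```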
Stas Bekman@StasBekman·
Thank you for the additional details, Neil. That's very much along the lines I have been thinking. Good call on figuring out the term. I was thinking of doing some weighted average of the linear and quadratic token lengths, where the gamma can be a function of seqlen, since the longer the seqlen of a single sequence, the more impact the quadratic term would have. Oh, and the paper looks excellent as well. Much appreciate your input, Neil.
Stas Bekman@StasBekman·
Has anyone solved the load-balancing issue in a DataLoader across multiple ranks, to ensure each batch has a similar cost and to avoid outliers that would slow all ranks down? E.g., with SFT many instruction samples are packed into a larger sequence, but due to quadratic attention a packed 10x 100-token sample will finish much faster than a 1x 1000-token sample. So the cost function can't be the number of tokens, but some sort of sum of quadratic token lengths, since attention for a long sequence would dominate all linear ops.

I was thinking to simply sort samples by cost function after they are packed, and not randomize the DataLoader, which should already do pretty good load balancing in practice. The problem is that it could impact learning, since the model would see long sequences first and then shorter ones, or vice versa. The other idea I have is a DataLoader wrapper which rebalances batches across ranks on the fly while keeping buffers of discarded items and re-using them later.

Asking if perhaps someone has already experimented with some ideas and found an optimal one. Thank you!
Neil Tenenholtz@ntenenz·
@StasBekman Persist tokenized batches, not samples. Turning it into an offline, or pseudo-offline, problem gives you much more flexibility.
Neil Tenenholtz@ntenenz·
Ok, now that I have a bit more time to respond... The "straightforward" way is to greedily pack within buckets. The size of the buckets will have a nontrivial impact on the uniformity of the compute cost, and you'll have to make some decisions around bucket size, re-sorting batches, etc. This will be a balancing act between maintaining your desired curriculum and maximizing utilization.

You'll likely want to estimate your cost function empirically, as it's a function of your parallelism strategy, kernels, etc. You can start with a pretty simple quadratic model.

Finally, depending on sequence length and model size, the relative importance of this can vary. Perform the basic FLOP math (including the cross-token attention term) and get a sense of the sequence length at which that term starts to matter -- the frequently cited 6N drops it. There is some public work that goes beyond this, for example arxiv.org/pdf/2509.21841…, if you're especially interested.
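A minimal sketch of the greedy, cost-balanced packing described above, assuming a per-sequence cost function along the lines of the linear-plus-quadratic toy model earlier in the thread. Bucket sizing, periodic re-sorting, and the curriculum-vs-utilization trade-off are deliberately left out.

```python
import heapq
from typing import Callable, List

def pack_balanced(seq_lens: List[int],
                  num_ranks: int,
                  cost_fn: Callable[[int], float]) -> List[List[int]]:
    """Assign sequences (costliest first) to the currently cheapest rank so
    every rank's estimated per-step cost ends up roughly equal."""
    heap = [(0.0, rank) for rank in range(num_ranks)]  # (total cost, rank)
    heapq.heapify(heap)
    buckets: List[List[int]] = [[] for _ in range(num_ranks)]
    for length in sorted(seq_lens, key=cost_fn, reverse=True):
        total, rank = heapq.heappop(heap)
        buckets[rank].append(length)
        heapq.heappush(heap, (total + cost_fn(length), rank))
    return buckets

# e.g. pack_balanced([1000] + [100] * 10, num_ranks=2, cost_fn=sequence_cost)
# puts the single 1000-token sequence alone on one rank and the ten 100-token
# sequences on the other, rather than splitting by sequence count.
```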
Stas Bekman@StasBekman·
@ntenenz No problem at all, Neil. Either works if the online one is efficient enough ;)
Neil Tenenholtz@ntenenz·
@StasBekman Sorry! I thought you were describing an effort to perform packing/batching in a streaming fashion online rather than offline.
Stas Bekman@StasBekman·
@ntenenz Thank you for the follow-up, Neil. We have total flexibility to pre-process ahead of time, so offline is indeed ideal. But I'm not sure what you mean by "persist tokenized batches not samples" - we are working with tokenized batches already.
Neil Tenenholtz@ntenenz·
@cn8011 In the 1st image, the annotations are largely DICOM metadata. Easily parseable via a variety of OSS libraries. The image annotation in the 2nd image contains 2 vertebrae, L3-L5 is 3. 🙃
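For example, with pydicom (one of the OSS libraries alluded to), reading that metadata is a few lines; the file path and the assumption that these standard tags are present are illustrative only.

```python
import pydicom

# Placeholder path; any DICOM file carrying these standard tags would do.
ds = pydicom.dcmread("example_ct_slice.dcm")
print(ds.PatientID)          # patient identifier
print(ds.StudyDescription)   # study-level description
print(ds.SliceThickness)     # acquisition geometry
```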
cn80@cn8011·
@ntenenz The annotations aren't part of Cornerstone, right? There are also plenty of proprietary integrations for storing & serving the images; there should be far more open source & AI to reduce dev effort.