Eltayeb Ahmed

676 posts

@clockwk7

PhD student at Oxford Uni. Ex: Google Research, FAIR. University of Khartoum and AIMS Rwanda Alumni.

London, England · Joined September 2018
754 Following · 324 Followers
Pinned Tweet
Eltayeb Ahmed @clockwk7
Unlock the Hidden Diversity in Your Language Model. In our new paper, Intent Factored Generation (IFG), we propose an inference time method to increase the diversity of generations from LLMs. IFG leads to improvements in searching for solutions to maths and code problems. (1/6)
Uljad @uljadb99

Unlock real diversity in your LLM! 🚀 LLM outputs can be boring and repetitive. Today, we release Intent Factored Generation (IFG) to:
- Sample conceptually diverse outputs 💡
- Improve performance on math and code reasoning tasks 🤔
- Get more engaging conversational agents 🤖

1
11
43
8.4K
Jeffrey Emanuel @doodlestein
Just read through the new LeJEPA paper by Yann LeCun and Randall Balestriero. I’ve been curious to know what Yann’s been working on lately, especially considering all his criticisms of LLMs (which I disagree with, as I think LLMs will keep improving and will take us to ASI fairly soon).

Anyway, there are several threads already on X about the paper and what it introduces. The short version is that it’s a principled, theoretically justified, and parsimonious approach to self-supervised learning that replaces a complex hodgepodge of ad-hoc, hacky heuristics for preventing mode collapse, which is the bane of self-supervised learning. That’s where the model screws up and starts mapping all inputs to nearly identical embeddings or to a narrow subspace of embeddings, collapsing down all the richness of the problem into a pathologically simple and wrong correspondence.

The first pillar of the new approach is their proof that isotropic Gaussian distributions uniquely minimize worst-case downstream prediction risk. As soon as I read that, I immediately thought of CMA-ES, the best available black-box optimization algorithm for when you don’t have access to the gradient of the function you’re trying to minimize, but can only do (expensive/slow) function evaluations.

Nikolaus Hansen has been working on CMA-ES since he introduced it way back in 1996. I’ve always been fascinated by this approach and used it with a lot of success to efficiently explore hyper-parameters of deep neural nets back in 2011 instead of doing inefficient grid searches.

Anyway, the reason why I bring it up is because there’s a striking parallel and deep connection between that approach and the core of LeJEPA.

CMA-ES says: Start with an isotropic Gaussian because it's the maximum entropy (least biased) distribution given only variance constraints. Then adapt the covariance to learn the problem's geometry.

LeJEPA says: Maintain an isotropic Gaussian because it's the maximum entropy (least biased) distribution for unknown future tasks.

Both recognize that isotropy is optimal under uncertainty for three reasons:
- The maximum entropy principle: among all distributions with fixed variance, the isotropic Gaussian has maximum entropy, i.e., it makes the fewest assumptions.
- There’s no directional bias: equal variance in all directions means you're not pre-committing to any particular problem structure.
- You get worst-case optimality: minimize maximum regret across all possible problem geometries.

So then what’s the difference? It comes down to adaptation timing. CMA-ES can adapt during optimization; it starts isotropic but then becomes anisotropic as it learns the specific optimization landscape. In contrast, LeJEPA has to stay isotropic because it's preparing for unknown downstream tasks that haven't been seen yet.

This parallel suggests LeJEPA is applying a fundamental principle from optimization theory to representation learning. It's essentially saying: “The optimal search distribution for black-box optimization is also the optimal embedding distribution for transfer learning.”

This makes sense because both problems involve navigating unknown landscapes; for CMA-ES, this is the unknown optimization landscape; for LeJEPA, this is the unknown space of downstream tasks.

This difference then makes me wonder: could we have "adaptive LeJEPA" that starts isotropic but adapts its embedding distribution once we know the downstream task, similar to how CMA-ES adapts during optimization? That would be like meta-learning the right anisotropy for specific task families.

Anyway, I thought I’d share my thoughts on this. It’s fascinating to see the connections between these different areas. The black-box optimization community has always been pretty separate and distinct from the deep learning community, and there’s not much cross-pollination there. This makes sense, because if you have a gradient, you’d be crazy not to use it. But there are strong connections.
40
94
924
89.1K
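The maximum-entropy claim in the post above can be checked numerically. The sketch below is only an illustration of that one claim, not of LeJEPA or CMA-ES themselves: it assumes zero-mean Gaussians with diagonal covariance, and the function name and example variances are hypothetical. It compares two Gaussians with the same total variance (trace) and shows the isotropic one has the larger differential entropy.

```python
import math

def gaussian_entropy(variances):
    """Differential entropy (nats) of a zero-mean Gaussian with diagonal covariance.

    Uses h = 0.5 * (d * ln(2*pi*e) + ln det(Sigma)); for a diagonal Sigma,
    ln det(Sigma) is the sum of the log variances.
    """
    d = len(variances)
    log_det = sum(math.log(v) for v in variances)
    return 0.5 * (d * math.log(2 * math.pi * math.e) + log_det)

# Hypothetical example: both covariances have total variance (trace) 3.0.
isotropic = [1.0, 1.0, 1.0]
anisotropic = [2.0, 0.75, 0.25]

h_iso = gaussian_entropy(isotropic)
h_aniso = gaussian_entropy(anisotropic)
print(f"isotropic:   {h_iso:.4f} nats")
print(f"anisotropic: {h_aniso:.4f} nats")
```

With the trace held fixed, the product of the variances (and hence the log-determinant) is largest when all variances are equal, which is why the isotropic case comes out ahead.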
Eltayeb Ahmed @clockwk7
Ray? Hadoop? The infamous weBsCaLe MongoDB? Sometimes all you need is the good old command line tool `xargs` with parallelism `-P > 1` to make your map-reduce go brrr.
0
0
0
151
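A minimal sketch of the `xargs -P` pattern mentioned in the post above, driven from Python so the reduce step is explicit. It assumes a POSIX-like system with `xargs` and `sh` on the PATH; the function name, the squaring "map" step, and the worker count are hypothetical choices for illustration, not what the original screenshot showed.

```python
import subprocess

def parallel_square_sum(numbers, workers=4):
    """Map: square each number in its own `sh` process, fanned out by `xargs -P`.
    Reduce: sum the printed results in Python.
    """
    stdin_text = "".join(f"{n}\n" for n in numbers)
    result = subprocess.run(
        [
            "xargs",
            "-P", str(workers),  # run up to `workers` processes at once
            "-n", "1",           # pass one input item per invocation
            "sh", "-c", "echo $(( $1 * $1 ))", "_",  # "_" fills $0, the item becomes $1
        ],
        input=stdin_text,
        capture_output=True,
        text=True,
        check=True,
    )
    # Output order is not guaranteed under -P, so the reduce must not depend on it.
    return sum(int(line) for line in result.stdout.split())

total = parallel_square_sum([1, 2, 3, 4])
print(total)
```

Because parallel workers may finish in any order, this only suits reductions that are order-independent, such as sums or counts.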
Eltayeb Ahmed retweeted
Andrej Karpathy @karpathy
I don't know what labs are doing to these poor LLMs during RL but they are mortally terrified of exceptions, in any infinitesimally likely case. Exceptions are a normal part of life and healthy dev process. Sign my LLM welfare petition for improved rewards in cases of exceptions.
292
342
7.1K
714.1K
Eltayeb Ahmed retweeted
finbarr @finbarrtimbers
my goal in life is to join Anthropic, delete all try/except clauses from Claude’s training data, and then quit.
55
47
1.8K
157.9K
Eltayeb Ahmed @clockwk7
@nrehiew_ Reject the arXiv, put your technical report pdf on github.
0
0
0
124
wh @nrehiew_
New open-weights Chinese model with a really detailed tech report just dropped. It has tons of details on architecture and infra. Here are some of my notes and the parts I found interesting :)
14
91
977
142.5K
Eltayeb Ahmed @clockwk7
@owainkenway This tweet is basically the only result for googling the 'ptxas' error I am getting 18 months later on a Grace Hopper Machine.
1
0
2
20
Dr Owain Kenway @owainkenway
I think I may be about done with Grace Hopper as a platform.
4
0
0
514
Dr Owain Kenway @owainkenway
nvcc error : 'ptxas' died due to signal 11 (Invalid memory reference) FFFFFFFFFFFF
1
0
2
322
Jack Morris @jxmnop
OpenAI hasn’t open-sourced a base model since GPT-2 in 2019. they recently released GPT-OSS, which is reasoning-only... or is it? turns out that underneath the surface, there is still a strong base model. so we extracted it. introducing gpt-oss-20b-base 🧵
161
438
6.1K
929.3K
Eltayeb Ahmed retweeted
Alex Goldie @AlexDGoldie
🥳 It’s an honour to have been awarded the Outstanding Paper for Scientific Understanding in RL at RLC for our work, ‘How Should We Meta-Learn RL Algorithms?’ Thank you to the organisers @RL_Conference for putting on a great conference, and congratulations to the other winners!
3
24
225
22.3K
Eltayeb Ahmed retweeted
Vishishta @vishishtagoyal
My LinkedIn account was permanently restricted without warning or explanation. 10+ years of professional connections, gone. As an Oxford MBA student actively job hunting, this is not just inconvenient- it’s crippling. @LinkedInHelp -I’m begging for support. Case: #250726-017259
2
11
4
407
Eltayeb Ahmed @clockwk7
@nikhilchandak29 The decrease is within the bounds of the error bars (95% CI) and hence it is not significant. Each point is computed from an independent seed, so some variance is to be expected.
English
0
0
1
31
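To make the error-bar argument in the reply above concrete, here is a rough sketch of an overlap check between two 95% confidence intervals. Everything in it is an assumption for illustration: the counts are made-up numbers, not results from the paper, and the normal-approximation interval is one common choice, not necessarily how the plot's error bars were computed. Overlapping intervals are an informal signal rather than a formal significance test.

```python
import math

def proportion_ci(successes, trials, z=1.96):
    """Normal-approximation 95% CI for a success rate (z=1.96)."""
    p = successes / trials
    half_width = z * math.sqrt(p * (1 - p) / trials)
    return p - half_width, p + half_width

def intervals_overlap(a, b):
    """True if closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical counts: problems solved out of 500 at two values of k,
# each measured with its own independent seed.
ci_k16 = proportion_ci(412, 500)
ci_k64 = proportion_ci(405, 500)

print(ci_k16, ci_k64, intervals_overlap(ci_k16, ci_k64))
```

In this made-up case the point estimate dips from k=16 to k=64, yet the two intervals overlap heavily, which is the situation the reply describes.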
Nikhil Chandak @nikhilchandak29
@clockwk7 If pass@k is whether any of k solutions is correct, can you explain how it can decrease in the left plot? (from k=16 to 64 on MATH)
English
1
0
0
28
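For context on the question above, pass@k is often estimated from n sampled solutions of which c are correct using the unbiased estimator 1 - C(n-c, k) / C(n, k) (popularised by the Codex evaluation paper). The sketch below implements that estimator; whether the plot under discussion used this estimator or independent runs per k is not stated in the thread, so treat this as background rather than a description of the paper's method. The function name is hypothetical.

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct ones.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct solution.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# With a single fixed pool of samples, the estimate never decreases in k.
estimates = [pass_at_k(64, 5, k) for k in (1, 4, 16, 64)]
print(estimates)
```

Computed from one pool of samples, the estimate is monotone in k; a dip between two values of k can still appear when each point comes from a separate run, as the reply in the thread explains.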
Eltayeb Ahmed @clockwk7
We also evaluate IFG for exploration on the MATH dataset using a similar methodology, and find that IFG increases pass@k. We then combine IFG with RLVF and see that the improved diversity leads to better exploration and better final performance. (6/6)
Eltayeb Ahmed tweet media
0
1
94
Eltayeb Ahmed retweeted
Jonny Cook @JonnyCoook
Can an LLM be programmed? In our new preprint, we show that LLMs can learn to evaluate programs for a range of inputs by being trained on the program source code alone – a phenomenon we call Programming by Backprop (PBB). 🧵⬇️
6
32
130
21.5K