Eltayeb Ahmed

676 posts

@clockwk7

PhD student at Oxford Uni. Ex: Google Research, FAIR. University of Khartoum and AIMS Rwanda Alumni.

London, England · Joined September 2018

754 Following · 324 Followers

Pinned Tweet

Eltayeb Ahmed @clockwk7 · 18 Jul

Unlock the Hidden Diversity in Your Language Model. In our new paper, Intent Factored Generation (IFG), we propose an inference time method to increase the diversity of generations from LLMs. IFG leads to improvements in searching for solutions to maths and code problems. (1/6)

Uljad@uljadb99

Unlock real diversity in your LLM! 🚀 LLM outputs can be boring and repetitive. Today, we release Intent Factored Generation (IFG) to:
- Sample conceptually diverse outputs 💡
- Improve performance on math and code reasoning tasks 🤔
- Get more engaging conversational agents 🤖

Eltayeb Ahmed @clockwk7 · 13 Nov

@doodlestein @bidiptas13

QAM

Jeffrey Emanuel @doodlestein · 13 Nov

Just read through the new LeJEPA paper by Yann LeCun and Randall Balestriero. I’ve been curious to know what Yann’s been working on lately, especially considering all his criticisms of LLMs (which I disagree with, as I think LLMs will keep improving and will take us to ASI fairly soon).

Anyway, there are several threads already on X about the paper and what it introduces. The short version is that it’s a principled, theoretically justified, and parsimonious approach to self-supervised learning that replaces a complex hodgepodge of ad-hoc, hacky heuristics for preventing mode collapse, which is the bane of self-supervised learning. That’s where the model screws up and starts mapping all inputs to nearly identical embeddings or to a narrow subspace of embeddings, collapsing down all the richness of the problem into a pathologically simple and wrong correspondence.

The first pillar of the new approach is their proof that isotropic Gaussian distributions uniquely minimize worst-case downstream prediction risk. As soon as I read that, I immediately thought of CMA-ES, the best available black-box optimization algorithm for when you don’t have access to the gradient of the function you’re trying to minimize, but can only do (expensive/slow) function evaluations.

Nikolaus Hansen has been working on CMA-ES since he introduced it way back in 1996. I’ve always been fascinated by this approach and used it with a lot of success to efficiently explore hyper-parameters of deep neural nets back in 2011 instead of doing inefficient grid searches.

Anyway, the reason why I bring it up is because there’s a striking parallel and deep connection between that approach and the core of LeJEPA.

CMA-ES says: Start with an isotropic Gaussian because it's the maximum entropy (least biased) distribution given only variance constraints. Then adapt the covariance to learn the problem's geometry.

LeJEPA says: Maintain an isotropic Gaussian because it's the maximum entropy (least biased) distribution for unknown future tasks.

Both recognize that isotropy is optimal under uncertainty for three reasons:
- The maximum entropy principle: among all distributions with fixed variance, the isotropic Gaussian has maximum entropy, i.e., it makes the fewest assumptions.
- There’s no directional bias: equal variance in all directions means you're not pre-committing to any particular problem structure.
- You get worst-case optimality: minimize maximum regret across all possible problem geometries.

So then what’s the difference? It comes down to adaptation timing. CMA-ES can adapt during optimization; it starts isotropic but then becomes anisotropic as it learns the specific optimization landscape. In contrast, LeJEPA has to stay isotropic because it's preparing for unknown downstream tasks that haven't been seen yet.

This parallel suggests LeJEPA is applying a fundamental principle from optimization theory to representation learning. It's essentially saying: “The optimal search distribution for black-box optimization is also the optimal embedding distribution for transfer learning.” This makes sense because both problems involve navigating unknown landscapes; for CMA-ES, this is the unknown optimization landscape; for LeJEPA, this is the unknown space of downstream tasks.

This difference then makes me wonder: could we have "adaptive LeJEPA" that starts isotropic but adapts its embedding distribution once we know the downstream task, similar to how CMA-ES adapts during optimization? That would be like meta-learning the right anisotropy for specific task families.

Anyway, I thought I’d share my thoughts on this. It’s fascinating to see the connections between these different areas. The black-box optimization community has always been pretty separate and distinct from the deep learning community, and there’s not much cross-pollination there. This makes sense, because if you have a gradient, you’d be crazy not to use it. But there are strong connections.
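The maximum-entropy claim in the post above can be made concrete with a short derivation. This is a sketch under one stated assumption: "fixed variance" is read as a fixed total variance, tr(Σ) = dσ². It is a standard textbook argument and is not taken from the LeJEPA paper itself. It relies on the known fact that, for a given covariance Σ, the Gaussian has the largest differential entropy of any distribution.

```latex
% Assumption: "fixed variance" means a fixed total variance, tr(\Sigma) = d\sigma^2.
% Entropy of a d-dimensional Gaussian with covariance \Sigma:
h\big(\mathcal{N}(\mu,\Sigma)\big) = \tfrac{1}{2}\log\!\big((2\pi e)^{d}\det\Sigma\big)

% Let \lambda_1,\dots,\lambda_d > 0 be the eigenvalues of \Sigma, with \sum_i \lambda_i = d\sigma^2.
% By the AM-GM inequality:
\det\Sigma = \prod_{i=1}^{d}\lambda_i
  \le \Big(\tfrac{1}{d}\sum_{i=1}^{d}\lambda_i\Big)^{d} = \sigma^{2d},
  \quad\text{with equality iff } \lambda_1=\dots=\lambda_d=\sigma^{2}.

% Hence the entropy is bounded, and the bound is attained only in the isotropic case:
h \le \tfrac{d}{2}\log\!\big(2\pi e\,\sigma^{2}\big),
  \quad\text{attained only at } \Sigma=\sigma^{2} I .
```

Under that reading, any anisotropy (unequal eigenvalues) strictly lowers the entropy, which matches the "no directional bias" point in the list.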

Eltayeb Ahmed @clockwk7 · 30 Oct

Ray? Hadoop? The infamous weBsCaLe MongoDB? Sometimes all you need is the good old command line tool `xargs` with parallelism `-P > 1` to make your map-reduce go brrr.
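A minimal sketch of the idea in the tweet above, driven from Python: the "map" step fans out over files with `xargs -P`, and the "reduce" step sums the results. The word-count task and the function name `parallel_word_count` are hypothetical choices for illustration. The `-0`, `-n` and `-P` flags are assumed to be available, as they are in GNU and BSD `xargs`.

```python
import subprocess
import tempfile
from pathlib import Path


def parallel_word_count(paths, workers=4):
    """Map: run `wc -w` once per file via `xargs -P`. Reduce: sum the counts."""
    paths = [str(p) for p in paths]
    if not paths:
        return 0
    # NUL-separated input (`-0`) keeps file names with spaces intact.
    stdin = "\0".join(paths) + "\0"
    result = subprocess.run(
        ["xargs", "-0", "-n", "1", "-P", str(workers), "wc", "-w"],
        input=stdin,
        capture_output=True,
        text=True,
        check=True,
    )
    # Each output line looks like "<count> <path>"; output order is not guaranteed.
    return sum(int(line.split()[0]) for line in result.stdout.splitlines() if line.strip())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        files = []
        for i in range(8):
            f = Path(tmp) / f"part-{i}.txt"
            f.write_text("lorem ipsum dolor\n" * (i + 1))
            files.append(f)
        print(parallel_word_count(files, workers=4))
```

Because the reduce step is a plain sum, it does not matter in which order the parallel workers finish.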

Eltayeb Ahmed retweeted

Andrej Karpathy @karpathy · 9 Oct

I don't know what labs are doing to these poor LLMs during RL but they are mortally terrified of exceptions, in any infinitesimally likely case. Exceptions are a normal part of life and healthy dev process. Sign my LLM welfare petition for improved rewards in cases of exceptions.

Eltayeb Ahmed retweeted

finbarr @finbarrtimbers · 4 Oct

my goal in life is to join Anthropic, delete all try/except clauses from Claude’s training data, and then quit.

Eltayeb Ahmed @clockwk7 · 1 Sep

@nrehiew_ Reject the arXiv, put your technical report pdf on github.

wh @nrehiew_ · 31 Aug

New open-weights Chinese model with a really detailed tech report just dropped. It has tons of details on architecture and infra. Here are some of my notes and the parts I found interesting :)

Eltayeb Ahmed @clockwk7 · 1 Sep

@owainkenway This tweet is basically the only result for googling the 'ptxas' error I am getting 18 months later on a Grace Hopper Machine.

Dr Owain Kenway @owainkenway · 28 Feb

I think I may be about done with Grace Hopper as a platform.

Dr Owain Kenway @owainkenway · 28 Feb

nvcc error : 'ptxas' died due to signal 11 (Invalid memory reference) FFFFFFFFFFFF

Eltayeb Ahmed @clockwk7 · 13 Aug

@jxmnop Legendary!

Jack Morris @jxmnop · 13 Aug

OpenAI hasn’t open-sourced a base model since GPT-2 in 2019. they recently released GPT-OSS, which is reasoning-only... or is it? turns out that underneath the surface, there is still a strong base model. so we extracted it. introducing gpt-oss-20b-base 🧵

Eltayeb Ahmed retweeted

Alex Goldie @AlexDGoldie · 7 Aug

🥳 It’s an honour to have been awarded the Outstanding Paper for Scientific Understanding in RL at RLC for our work, ‘How Should We Meta-Learn RL Algorithms?’ Thank you to the organisers @RL_Conference for putting on a great conference, and congratulations to the other winners!

Eltayeb Ahmed retweeted

Vishishta @vishishtagoyal · 1 Aug

My LinkedIn account was permanently restricted without warning or explanation. 10+ years of professional connections, gone. As an Oxford MBA student actively job hunting, this is not just inconvenient- it’s crippling. @LinkedInHelp -I’m begging for support. Case: #250726-017259

Eltayeb Ahmed @clockwk7 · 21 Jul

This work was done with my amazing colleague Uljad @uljadb99, collaborators at the BBC @BBCRD, and my supervisor @j_foerst

Eltayeb Ahmed @clockwk7 · 18 Jul

For further results in conversational and language modelling tasks take a look at the thread linked in the first tweet. For further details here are some links. 🌐 Website: ifg-llm.github.io 💻 Code: github.com/FLAIROx/IFG 📝 Paper: arxiv.org/pdf/2506.09659

Eltayeb Ahmed @clockwk7 · 19 Jul

@nikhilchandak29 The decrease is within the error bars (95% CI), so it is not significant. Each point is computed from an independent seed, so some variance is to be expected.

Nikhil Chandak @nikhilchandak29 · 18 Jul

@clockwk7 If pass@k is whether any of k solutions is correct, can you explain how it can decrease in the left plot (from k=16 to 64 on MATH)?
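The exchange above turns on the difference between the true pass@k, which cannot decrease as k grows, and a finite-sample estimate of it, which can. A minimal sketch follows, assuming the widely used unbiased estimator 1 - C(n-c, k)/C(n, k) for n samples with c correct. The thread does not say which estimator the plot used, and the function names and simulated success probability here are hypothetical.

```python
import math
import random


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples has no correct one.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def empirical_pass_at_k(p_correct: float, k: int, num_problems: int, rng: random.Random) -> float:
    """Fraction of simulated problems solved by at least one of k fresh samples."""
    solved = 0
    for _ in range(num_problems):
        if any(rng.random() < p_correct for _ in range(k)):
            solved += 1
    return solved / num_problems


if __name__ == "__main__":
    # On one fixed pool of samples the estimator never decreases in k.
    print([round(pass_at_k(64, 3, k), 3) for k in (1, 4, 16, 64)])
    # With an independent seed per value of k, each point carries its own noise,
    # so a larger k can land slightly below a smaller one.
    for k in (16, 64):
        rng = random.Random(k)  # hypothetical per-point seed
        print(k, empirical_pass_at_k(0.02, k, num_problems=200, rng=rng))
```

The first print is monotone by construction. The second shows why points estimated from separate seeds need error bars before a dip is read as a real effect.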

Eltayeb Ahmed @clockwk7 · 18 Jul

We also evaluate IFG for exploration on the MATH dataset using a similar methodology, and find that it increases pass@k. We then combine IFG with RLVF and see that the improved diversity leads to better exploration and better final performance. (6/6)

Eltayeb Ahmed retweeted

Jonny Cook @JonnyCoook · 24 Jun

Can an LLM be programmed? In our new preprint, we show that LLMs can learn to evaluate programs for a range of inputs by being trained on the program source code alone – a phenomenon we call Programming by Backprop (PBB). 🧵⬇️
