Mohamed Elfeki

Mohamed Elfeki retweeted

Scale AI@scale_AI·3d

The future runs on proof. 😤

Mohamed Elfeki retweeted

Anas Mahmoud@nas_mahmoud_·14 May

1/ Using rubrics (a.k.a. checklists) in RL training is now standard for open-ended tasks without a final verifiable result. However, rubric rewards are still proxy rewards that can get hacked during RL training. We study when rubric-based RL genuinely improves models vs. teaches them to hack the verifier/rubric. We quantify this through exploitation, analyze the failure modes, and introduce a verifier-free metric. arxiv.org/abs/2605.12474

Mohamed Elfeki@m_elfeki11·5 May

@ScaleAILabs & deepseek-v4 by popular demand😅

Mohamed Elfeki@m_elfeki11·4 May

@ScaleAILabs working on adding MuseSpark. Stay Tuned.🧑‍🏭

Mohamed Elfeki retweeted

Scale Labs@ScaleAILabs·4 May

We recently built HiL-Bench, the first benchmark to test a critical question: do AI agents know what they’re missing and when to ask? Frontier models perform well with perfect specs. But remove a few key details, and they confidently guess and ship plausible wrong answers. We just added GPT-5.5, Opus 4.7, and Kimi K2.6 to the leaderboard. Here’s what we’re seeing ⬇️🧵

Mohamed Elfeki@m_elfeki11·5 May

@shd96556 @ScaleAILabs @shd96556 absolutely does, among many other reliability issues. we're working on the other problems... more work coming your way soon. stay tuned😉🔜

Shahid@shd96556·5 May

@ScaleAILabs AI doesn’t fail on hard problems. It fails on missing context it doesn’t admit it needs.

Mohamed Elfeki@m_elfeki11·5 May

Bookmarking this as a case study in research timing. Same finding, two weeks apart, 50x the engagement. The cycle matters more than the work sometimes.

[Quoted post: Scale Labs' HiL-Bench announcement of 4 May, quoted in full above]

Mohamed Elfeki@m_elfeki11·5 May

@auTechArena @ScaleAILabs @auTechArena we 🫶 open-source, and we've been open from the start. Paper: arxiv.org/pdf/2604.09408 Code/harness/data: github.com/hilbenchauthor…

Derya Unutmaz, MD@DeryaTR_

techarena.au@auTechArena·5 May

@ScaleAILabs Love this, great work. Spot on that models tend to bluff when you strip out key details. Keen to try HiL-Bench - will you be opening it up for the community?

Mohamed Elfeki@m_elfeki11·5 May

GPT's jump is exceptional. They did something different with their model in the last 90 days. Finally, GPTs can recognize context gaps and sometimes (not always yet) ask humans.🧠

GPT-5.5 is now SOTA in this agentic AI benchmark. The jump from GPT-5.4 is insane, more than 3-fold! GPT models are now becoming fully agentic! You can already feel this from computer use in Codex.

Mohamed Elfeki@m_elfeki11·5 May

@sama @kenn @sama, would it be better at working with people? i mean it's better than most, but could still be a lot better. x.com/ScaleAILabs/st…

Sam Altman@sama·4 May

@kenn much more to come!

Kenn Ejima@kenn·4 May

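A total victory for Codex. - Number one in intelligence - Generous limits - The native Mac app is excellent - The harness is OSS - App Server is usable with a subscription. It now has the substance to live up to the OpenAI name. Still, it is not good for anyone to get complacent as the single dominant player, so I hope Anthropic and Cursor keep pushing.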

Mohamed Elfeki@m_elfeki11·5 May

@bronzeagepapi @ScaleAILabs hallucinations is one way. but sometimes agents make an assumption and it turns out to be correct. so, it achieves the right goal, but it's unreliable!

Kirito (e/acc) 🏴‍☠️@bronzeagepapi·5 May

@ScaleAILabs Hallucinations is a good proxy. Will DeepSeek v4 be included in the leaderboard?

Mohamed Elfeki@m_elfeki11·5 May

@l0wer_a @ScaleAILabs popular demand for deepseek. yessir!

dump lamp@l0wer_a·5 May

@ScaleAILabs Deepseek??

Mohamed Elfeki@m_elfeki11·5 May

@verso0x @ScaleAILabs @verso0x all seem spot-on! keep testing, we'll keep benchmarking 🤝

Verso@verso0x·5 May

Recently moved from a heavily tuned Opus 4.7 setup to Codex, also tuned pretty hard.
Early notes after heavy usage:
- GPT finishes more tasks end-to-end. Opus often needed extra steering, even with good prompts/plugins
- GPT gives me much more usable limit headroom than Claude
- Opus has 1M context vs GPT-5.5’s 400K, but in practice Opus still needs compaction around 400-500K
Not calling a winner yet. Still testing what’s actually better for agentic work

Mohamed Elfeki@m_elfeki11·4 May

@sama @LinghuaJ @OpenAI Truly an impressive model. It knows what it doesn't know. x.com/ScaleAILabs/st…

Sam Altman@sama·4 May

@LinghuaJ @OpenAI team made an amazing product and model, i think it's really just about that

Linghua Jin 🥥 🌴@LinghuaJ·3 May

Amazed by how fast the trends change overnight - Codex @OpenAI all over the feed. You gotta win the community. @sama

Mohamed Elfeki@m_elfeki11·4 May

@WonderingDavid @ScaleAILabs 🫡

David@WonderingDavid·4 May

@ScaleAILabs Add Deepseek

Mohamed Elfeki@m_elfeki11·4 May

@ileppane @ScaleAILabs @artificialguybr haha.. fair enough. come benchmark with us :D it meant that the last release was a few months ago.

Ilpo Leppänen@ileppane·4 May

@ScaleAILabs @artificialguybr "GPT-5.5, released just months ago" - what? For who? You guys should seemingly benchmark your own posts too, sorry to say

Mohamed Elfeki@m_elfeki11·4 May

@jjoshua2 ... good question. we were skeptical, but older GPTs poorly detect missing context, often assuming wrongly yet confidently. @OpenAI's GPT-5.5 impressively fixes this and beats leading Claude/Gemini. Also, GLM/@Kimi_Moonshot Kimi's new models are really strong. you should try them if you haven't already.

Jjosh@jjoshua2·4 May

This seems like one of the most useful skills to make using them with a harness pleasant. Does anyone know if this matches vibes? It seems somewhat reasonable with Opus above GPT 5.4 and Gemini, but I think GPT 5.4 probably should be above GLM and Gemini?

[Quoted post: Scale Labs' HiL-Bench announcement of 4 May, quoted in full above]

Scale Labs@ScaleAILabs·4 May

Check out the full leaderboard: labs.scale.com/leaderboard/hil

Mohamed Elfeki@m_elfeki11·30 Apr

5/ Key insight: capability and “knowing when to stop” are decoupling. Targeted post-training beats raw scaling. Real agents don’t need more autonomy... they need to ask at the right time. Paper + leaderboard: scale.com/blog/hil

Mohamed Elfeki@m_elfeki11·30 Apr

4/ Claude 4.7: better answers, worse precise asking. Kimi K2.6: best precision (62%) but low recall (29%). Only GPT-5.5 >50% on both axes. Still just 32% vs 85% ceiling.
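
The precision and recall figures in 4/ can be read as overlaps between the questions an agent asks and the details that are actually missing from the spec. A minimal Python sketch under that reading; the thread does not define the metrics, so the definitions, the function name `ask_precision_recall`, and the example detail names are all hypothetical:

```python
def ask_precision_recall(asked, missing):
    """Return (precision, recall) for a set of clarifying questions.

    Assumed definitions (not taken from the HiL-Bench paper):
      precision = share of asked questions that target a truly missing detail
      recall    = share of truly missing details that the agent asked about
    """
    asked, missing = set(asked), set(missing)
    relevant = asked & missing  # questions that hit a real gap
    precision = len(relevant) / len(asked) if asked else 0.0
    recall = len(relevant) / len(missing) if missing else 0.0
    return precision, recall


# Hypothetical task: four details were removed from the spec,
# and the agent asked three questions, two of them on target.
missing = {"auth_method", "db_schema", "timeout", "retry_policy"}
asked = {"auth_method", "timeout", "log_format"}

precision, recall = ask_precision_recall(asked, missing)
print(round(precision, 3), recall)  # 0.667 0.5
```

Under this reading, a model that asks rarely but accurately scores high precision and low recall, which is the pattern the post reports for Kimi K2.6.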

Mohamed Elfeki@m_elfeki11·30 Apr

1/ GPT-5.5 learned to ask. Every frontier model on HiL-Bench fails hard: charges ahead on incomplete specs, guesses instead of clarifying. Pass@3 drops from ~85% (full info) to ~4%. GPT-5.5 fixes it.🧵
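
Pass@3 in 1/ treats a task as solved if at least one of three attempts passes. One common way to estimate pass@k from n sampled attempts is the unbiased combinatorial estimator used with code-generation benchmarks. Whether HiL-Bench computes its numbers this way is an assumption, and the function name `pass_at_k` and the sample counts below are illustrative only:

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k attempts passes.

    n: total attempts sampled for a task
    c: number of those attempts that passed
    k: attempt budget being scored (k <= n)
    """
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a pass.
        return 1.0
    # 1 minus the chance that all k drawn attempts are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers, not benchmark data:
print(round(pass_at_k(10, 1, 3), 3))  # 0.3
print(pass_at_k(5, 0, 3))             # 0.0
```

A benchmark-level score would then be the mean of this value over all tasks.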
