Mohamed Elfeki

77 posts

@m_elfeki11

Applied Research @Scale PhD@ML; ex-MSFT, Meta, Amazon

Seattle, WA · Joined January 2025
143 Following · 45 Followers
Mohamed Elfeki reposted
Scale AI @scale_AI
The future runs on proof. 😤
Mohamed Elfeki reposted
Anas Mahmoud @nas_mahmoud_
1/ Using rubrics (a.k.a. checklists) in RL training is now standard for open-ended tasks without a final verifiable result. However, rubric rewards are still proxy rewards that can get hacked during RL training. We study when rubric-based RL genuinely improves models vs. when it teaches them to hack the verifier/rubric. We quantify this through exploitation, analyze the failure modes, and introduce a verifier-free metric. arxiv.org/abs/2605.12474
Mohamed Elfeki reposted
Scale Labs @ScaleAILabs
We recently built HiL-Bench, the first benchmark to test a critical question: do AI agents know what they’re missing and when to ask? Frontier models perform well with perfect specs. But remove a few key details, and they confidently guess and ship plausible wrong answers. We just added GPT-5.5, Opus 4.7, and Kimi K2.6 to the leaderboard. Here’s what we’re seeing ⬇️🧵
Mohamed Elfeki @m_elfeki11
@shd96556 @ScaleAILabs it absolutely does, among many other reliability issues. we're working on the other problems... more work coming your way soon. stay tuned 😉🔜
Shahid @shd96556
@ScaleAILabs AI doesn’t fail on hard problems. It fails on missing context it doesn’t admit it needs.
techarena.au @auTechArena
@ScaleAILabs Love this, great work. Spot on that models tend to bluff when you strip out key details. Keen to try HiL-Bench - will you be opening it up for the community?
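Kenn Ejima @kenn
Codex is the clear winner.
- Number one in intelligence
- Generous limits
- The native Mac app is excellent
- The harness is OSS
- App Server is usable with a subscription
It now has the substance to live up to the OpenAI name. Still, it's not good for anyone to get complacent as the sole dominant player, so I hope Anthropic and Cursor keep pushing.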
Mohamed Elfeki @m_elfeki11
@bronzeagepapi @ScaleAILabs hallucination is one way. but sometimes an agent makes an assumption and it turns out to be correct. so it reaches the right goal, but it's unreliable!
Verso @verso0x
Recently moved from a heavily tuned Opus 4.7 setup to Codex, also tuned pretty hard.
Early notes after heavy usage:
- GPT finishes more tasks end-to-end. Opus often needed extra steering, even with good prompts/plugins
- GPT gives me much more usable limit headroom than Claude
- Opus has 1M context vs GPT-5.5’s 400K, but in practice Opus still needs compaction around 400-500K
Not calling a winner yet. Still testing what’s actually better for agentic work.
Sam Altman @sama
@LinghuaJ @OpenAI team made an amazing product and model, i think it's really just about that
Linghua Jin 🥥 🌴 @LinghuaJ
Amazed by how fast the trends change overnight - Codex @OpenAI all over the feed. You gotta win the community. @sama
Mohamed Elfeki @m_elfeki11
@jjoshua2 ... good question. we were skeptical, but older GPTs detect missing context poorly and often make wrong assumptions with full confidence. @OpenAI's GPT-5.5 impressively fixes this and beats the leading Claude/Gemini models. also, the new GLM and @Kimi_Moonshot Kimi models are really strong. you should try them if you haven't already.
Jjosh @jjoshua2
This seems like one of the most useful skills to make using them with a harness pleasant. Does anyone know if this matches vibes? It seems somewhat reasonable, with Opus above GPT 5.4 and Gemini, but I think GPT 5.4 should probably be above GLM and Gemini?
Quoting Scale Labs @ScaleAILabs:
We recently built HiL-Bench, the first benchmark to test a critical question: do AI agents know what they’re missing and when to ask? Frontier models perform well with perfect specs. But remove a few key details, and they confidently guess and ship plausible wrong answers. We just added GPT-5.5, Opus 4.7, and Kimi K2.6 to the leaderboard. Here’s what we’re seeing ⬇️🧵
Mohamed Elfeki @m_elfeki11
5/ Key insight: capability and “knowing when to stop” are decoupling. Targeted post-training beats raw scaling. Real agents don’t need more autonomy... they need to ask at the right time. Paper + leaderboard: scale.com/blog/hil
Mohamed Elfeki @m_elfeki11
4/ Claude Opus 4.7: better answers, but less precise asking. Kimi K2.6: best precision (62%) but low recall (29%). Only GPT-5.5 is above 50% on both axes. Still just 32% vs. the 85% ceiling.
Mohamed Elfeki @m_elfeki11
1/ GPT-5.5 learned to ask. Every frontier model on HiL-Bench fails hard: charges ahead on incomplete specs, guesses instead of clarifying. Pass@3 drops from ~85% (full info) to ~4%. GPT-5.5 fixes it.🧵