Mark Ibrahim

82 posts

@marksibrahim

Researching the dark arts of deep learning at Meta's FAIR (Fundamental AI Research) Lab

everywhere · Joined December 2012

1.6K Following · 510 Followers

Mark Ibrahim retweeted

Sharut Gupta @sharut_gupta · Feb 3

1/n Can LLMs learn to reason on hard benchmarks like AIME and GPQA purely through context, without SFT, RL, or any weight updates? Turns out… Yes! And it can have strong performance while being highly efficient.
Paper: arxiv.org/pdf/2602.02366
Blog: reasoncache.github.io

Mark Ibrahim retweeted

dr. jack morris @jxmnop · Feb 5

at long last, the final paper of my phd: 🧮 Learning to Reason in 13 Parameters 🧮 we develop TinyLoRA, a new ft method. with TinyLoRA + RL, models learn well with dozens or hundreds of params. example: we use only 13 parameters to train a 7B Qwen model from 76% to 91% on GSM8K 🤯

Mark Ibrahim @marksibrahim · Jan 17

@fujikanaeda Related finding showing a single character can break LLM evals: x.com/marksibrahim/s…

Mark Ibrahim @marksibrahim

One can manipulate LLM rankings to put any model in the lead—only by modifying the single character separating demonstration examples. Learn more in our new paper arxiv.org/abs/2510.05152 w/ Jingtong Su, Jianyu Zhang, @karen_ullrich , and Léon Bottou. 1/3 🧵

Eric W. Tramel @fujikanaeda · Jan 15

The presence of a leading whitespace leaks the correct choice selection in the MMLU-Pro benchmark. Am I missing something? Seems to impact Chemistry, Physics, and Math. HF Issue in reply.

Mark Ibrahim retweeted

Basile Terver @BasileTerv987 · Jan 12

My first PhD paper is out! 🎓 "What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?" tl;dr: JEPA-WMs for robotics: learn dynamics on top of visual encoders, optimize actions towards goal 👇 w/ @JimmyTYYang1, Jean Ponce, @AdrienBardes, @ylecun

Mark Ibrahim retweeted

Dr. Karen Ullrich @karen_ullrich · Dec 10

Release Day 🎉 Meet OpenApps — a pure-Python, open-source ecosystem for stress-testing UI agents at scale. Runs on a single CPU. Generates thousands of unique UI variations. And it reveals just how fragile today’s SOTA agents are. (Yes, even GPT-4 and Claude struggle.)

Mark Ibrahim @marksibrahim · Dec 10

@browsercompany @fasthtml @openstreetmap in collaboration with the excellent research team at FAIR: @karen_ullrich, Jingtong Su, @randall_balestr, @_amirbar, Claudia Shi, Arjun Subramonian, Nikolaos Tsilivis, Ivan Evtimov, and @KempeLab

Mark Ibrahim @marksibrahim · Dec 10

Built on top of excellent frameworks, thanks to @browsercompany @fasthtml @openstreetmap

Mark Ibrahim @marksibrahim · Dec 10

Want to teach AI agents to use apps like humans? Get started with digital agents research using OpenApps, our new Python-based environment.

Mark Ibrahim retweeted

Dr. Karen Ullrich @karen_ullrich · Dec 3

Stop by the Meta booth tomorrow, Wednesday Dec 3rd at #NeurIPS in San Diego! 🤖📱 We demo our new research environment, OpenApps, for digital agents. Generate thousands of app versions to train and evaluate multimodal agents to use apps like humans do. Not attending? Stay tuned

Mark Ibrahim retweeted

Randall Balestriero @randall_balestr · Nov 21

With LeJEPA (arxiv.org/abs/2511.08544) it has never been easier to train JEPAs! And this matters A LOT because JEPAs have numerous provable benefits over good old reconstruction-based methods (arxiv.org/abs/2505.12477). NeurIPS spotlight: Wed, 11 a.m. PST, Hall C,D,E #2613

Hugues Van Assel @hugues_va

Lots of discussion around JEPA and why latent space prediction works better than input space (e.g., LLMs) for certain modalities. But no one has formalized WHY. The answer lies in whether statistically dominant features are semantically meaningful. @NeurIPSConf spotlight 🧵👇

Mark Ibrahim @marksibrahim · Nov 7

✅ 22k multi-scene questions
✅ New scenes not in existing web data
✅ Runs in ~15 min on one GPU
Work led by Candace Ross in collaboration with Florian Bordes, @adinamwilliams, and @polkirichenko. Check it out on Hugging Face and arXiv: huggingface.co/datasets/faceb…

Mark Ibrahim @marksibrahim · Nov 7

Although leading models now saturate single-image perception, Common-O establishes a challenging new multimodal benchmark. The best-performing model achieves only 35% on Common-O, and only 1% on Common-O Complex, which consists of more complex scenes. 🧵2/3

Mark Ibrahim @marksibrahim · Nov 7

We introduce Common-O, a new multimodal benchmark for hallucination when reasoning across scenes. We find leading multimodal LLMs can reliably identify objects, yet hallucinate when reasoning across scenes. 🧵1/3

Mark Ibrahim retweeted

Sarthak Mittal @sarthmit · Oct 18

Meta on meta: thrilled to share our work on Meta-learning… at Meta! 🔥🧠 We make two major contributions:
1️⃣ Unified framework revealing insights into various amortizations 🧠
2️⃣ Greedy belief-state updates to handle long context-lengths 🚀

Mark Ibrahim @marksibrahim · Oct 16

If you’re an NYU student, come learn about this wonderful opportunity to collaborate with us at FAIR: events.atmeta.com/metanyuaimento… The panel is tomorrow at 10am at the NYU Center for Data Science.

Mark Ibrahim retweeted

Nikos Tsilivis @nikostsilivis · Oct 15

RL has led to amazing advances in reasoning domains with LLMs. But why has it been so successful, and why does the length of the response increase during RL? In new work, we introduce a framework to provide conceptual and theoretical answers to these questions.

Mark Ibrahim retweeted

fly51fly @fly51fly · Oct 9

[CL] A Single Character can Make or Break Your LLM Evals J Su, J Zhang, K Ullrich, L Bottou... [FAIR at Meta] (2025) arxiv.org/abs/2510.05152

Mark Ibrahim @marksibrahim · Oct 9

We explain how good delimiters steer attention heads to key input tokens and offer practical recommendations for prompts and delimiter choices to get the best performance from your LLM. tl;dr: use “!” or “\n”.
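
A minimal sketch of what "delimiter choice" means in a few-shot prompt, assuming the delimiter is simply the string that joins the demonstration examples. The helper `build_few_shot_prompt` and the `Question:`/`Answer:` template are hypothetical illustrations, not the paper's evaluation code:

```python
def build_few_shot_prompt(demos, query, delimiter="\n"):
    """Join demonstration examples and the final query with one delimiter.

    demos: list of (question, answer) pairs (illustrative format only).
    delimiter: the string placed between consecutive examples.
    """
    parts = [f"Question: {q} Answer: {a}" for q, a in demos]
    # The query is left open-ended so the model completes the answer.
    parts.append(f"Question: {query} Answer:")
    return delimiter.join(parts)


if __name__ == "__main__":
    demos = [("2 + 2 = ?", "4"), ("3 + 5 = ?", "8")]
    # Same content, different separators: only the delimiter changes.
    for delim in ["\n", "!", " ", "#"]:
        prompt = build_few_shot_prompt(demos, "7 + 1 = ?", delimiter=delim)
        print(repr(delim), "->", repr(prompt))
```

Scoring the same model on each variant would show how much a reported number depends on this one string; the tweet's recommendation corresponds to `delimiter="!"` or `delimiter="\n"`.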

Mark Ibrahim @marksibrahim · Oct 9

- MMLU performance varies by ±23% depending on the choice of delimiter across leading open model families (Llama, Qwen, and Gemma).
- Closed models, including GPT-4o, are also brittle to the choice of delimiter.
2/3 🧵

Mark Ibrahim @marksibrahim · Oct 9
