Michael Elabd

108 posts

@MichaelElabd

Foundational Research @ DeepMind

San Francisco, CA · Joined July 2020
310 Following · 1.2K Followers
Michael Elabd@MichaelElabd·
@HyperAI_News Definitely, I would still like to see it extend beyond a 24-hour challenge to see how an agent can accumulate knowledge across sessions, users, and even organizations!
Hyper.AI@HyperAI_News·
@MichaelElabd The concept of Hierarchical Cognitive Caching is brilliant. Moving from just 'running experiments' to actually 'accumulating knowledge' is the missing piece for truly autonomous agents.
Michael Elabd@MichaelElabd·
Continual learning for MLE just hit a new milestone 🥁 ML-Master 2.0 reached SOTA on OpenAI’s MLE-Bench, and it’s not a new model or some RL technique; it’s better memory. The team introduced Hierarchical Cognitive Caching, a module that splits context into "experience" (short-term), "knowledge" (mid-term), and "wisdom" (long-term) memory layers. This lets the agent maintain coherence over 24-hour experimental cycles and strikes a nice balance between exploration and persistence. This is a glimpse of AI systems that don’t just run experiments… they *accumulate knowledge*. It’s really interesting to think about what products will look like once agents can truly accumulate that knowledge.
Michael Elabd@MichaelElabd·
So how does Hierarchical Cognitive Caching work? It’s like how CPUs manage memory: L1 cache is tiny but super fast for what you need immediately, L2 holds more at slightly slower access speed, and L3/RAM stores everything you care about for the task. Each tier trades capacity for access speed, and data gets promoted (or evicted) between them as needed. This decouples execution from strategy: L1 handles the fast loop (debugging, config changes, etc.), while L2+L3 handle the slow loop (reflection, exploration, etc.). From the paper's ablations, we can confirm every layer matters (though working memory is the most critical, with performance collapsing from ~70% to ~20%!). Would be interested in ablations with 5 or even 100 memory layers that an agent can control. I guess the question is: do agents also need an SSD/HDD for storing and retrieving even more information?
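As a toy illustration of the tiered-memory idea (the class name, tier labels, and LRU promotion/eviction policy here are my own invention for the CPU-cache analogy, not the paper's actual mechanism):

```python
from collections import OrderedDict

class HierarchicalCache:
    """Toy three-tier memory: each tier trades capacity for 'access speed'.
    Hot items live in tier 0 (L1); LRU evictions demote to the next tier."""

    def __init__(self, capacities=(2, 4, 8)):
        # One OrderedDict per tier, used as an LRU: L1, L2, L3
        self.tiers = [OrderedDict() for _ in capacities]
        self.caps = capacities

    def put(self, key, value):
        self._insert(0, key, value)

    def _insert(self, level, key, value):
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)  # mark most recently used
        if len(tier) > self.caps[level]:
            old_key, old_val = tier.popitem(last=False)  # evict LRU item
            if level + 1 < len(self.tiers):
                self._insert(level + 1, old_key, old_val)  # demote downward
            # else: dropped entirely once the last tier overflows

    def get(self, key):
        for tier in self.tiers:
            if key in tier:
                value = tier.pop(key)
                self._insert(0, key, value)  # promote to L1 on access
                return value
        return None
```

The decoupling in the tweet maps onto this sketch: the fast loop only ever touches tier 0, while slower reflection reads from (and is refilled by) the deeper tiers.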
Scale Labs@ScaleAILabs·
Big research update: we have 6 papers accepted at ICLR 2026! 🎉 We’re pushing the frontier of eval-driven RL, rubric-based rewards, and agentic capabilities. Over the next week, we’ll be sharing insights from our accepted papers. Here’s a preview of the work we'll be presenting in Brazil 🧵
Michael Elabd@MichaelElabd·
@cwolferesearch Would really like to see a logprob ablation to see the impact on the quality of the judging
Cameron R. Wolfe, Ph.D.@cwolferesearch·
Strongly recommend the LLM-as-a-Verifier writeup. Biggest takeaway for me is that increasing scoring granularity makes the verifier more effective. This indicates that LLM judges / verifiers are developing new (and better) capabilities.

This did not work well 1-2 years ago. In fact, LLM-as-a-Judge best practice was that lower scoring granularity (e.g., binary, ternary, or a 1-5 Likert score) worked way better than granular scores (e.g., a 1-100 scale). This was a constant recommendation I gave for setting up LLM judges properly. It seems that recent frontier LLMs are better at scoring at finer granularities, making this best practice (potentially) obsolete.

One caveat to this finding is that the scoring setup used in this writeup is a specific setup based upon logprobs. Instead of just using the score token outputted by the LLM as the result, they compute the logprob of each possible score token and take a weighted average of scores (with weights given by the probabilities). Then, they go further by expanding this weighted average across repeated verifications and multiple criteria:

Reward = (1 / CK) * ∑_{c=1}^{C} ∑_{k=1}^{K} ∑_{g=1}^{G} score_prob * score_value

where C is the total number of evaluation criteria, K is the number of repeated verifications, and G is the scoring granularity (i.e., the number of unique scoring output options). The reward determines whether a particular output passes verification across criteria.

When using this logprob setup, we see consistent gains in verifier accuracy by:
- Increasing scoring granularity G.
- Increasing repeated verifications K.
- Increasing the number of evaluation criteria C.

The last two findings are in line with prior work, but the fact that higher scoring granularity is helpful is interesting!

In the LLM-as-a-Verifier paper, this system is used at inference time in a pairwise fashion as described below. "To pick the best trajectory among N candidates for a given task, a round-robin tournament is conducted. For every pair (i, j) the verifier produces Reward(i) and Reward(j) using the formula above. The trajectory with the higher reward receives a win, and the trajectory with the most wins across all \binom{N}{2} pairs is selected."
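The round-robin selection step quoted above can be sketched in a few lines (a minimal illustration; `reward_fn` stands in for the verifier's Reward(·), and the tie-handling here is my own simplification):

```python
from itertools import combinations

def select_best(trajectories, reward_fn):
    """Round-robin tournament: compare every pair of candidates and
    return the trajectory with the most pairwise wins."""
    wins = [0] * len(trajectories)
    for i, j in combinations(range(len(trajectories)), 2):
        # The higher-reward trajectory of the pair earns a win
        # (ties arbitrarily go to j in this sketch).
        if reward_fn(trajectories[i]) > reward_fn(trajectories[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    return trajectories[max(range(len(trajectories)), key=wins.__getitem__)]
```

With N candidates this runs the verifier over all N-choose-2 pairs, which is exactly the quadratic cost the tournament formulation implies.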
Michael Elabd@MichaelElabd·
@Azaliamirh Very interesting! Wondering how big the gain is from running it as best-of-1 vs best-of-10
Azalia Mirhoseini@Azaliamirh·
Turns out we can get SOTA on agentic benchmarks with a simple test-time method! Excited to introduce LLM-as-a-Verifier. Test-time scaling is effective, but picking the "winner" among many candidates is the bottleneck. We introduce a way to extract a cleaner signal from the model: 1️⃣ Ask the LLM to rank results on a scale of 1-k 2️⃣ Use the log-probs of those rank tokens to calculate an expected score You can get a verification score in a single sampling pass per candidate pair. Blog: llm-as-a-verifier.notion.site Code: llm-as-a-verifier.github.io Led by @jackyk02 and in collaboration with a great team: @shululi256, @pranav_atreya, @liu_yuejiang, @drmapavone, @istoica05
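The two-step recipe (rank tokens → expected score from their log-probs) can be sketched as follows. This is a minimal illustration, not the blog's implementation; the function name and the renormalization over the k rank tokens are my assumptions:

```python
import math

def expected_score(score_logprobs):
    """score_logprobs: dict mapping each rank value (1..k) to the model's
    log-probability of emitting that rank token. Returns the
    probability-weighted expected score: one verification score per pass."""
    probs = {s: math.exp(lp) for s, lp in score_logprobs.items()}
    z = sum(probs.values())  # renormalize over just the k rank tokens
    return sum(s * p / z for s, p in probs.items())
```

For example, if the model puts probability 0.25, 0.25, 0.5 on ranks 1, 2, 3, the expected score is 2.25 rather than the single argmax token 3, which is the cleaner signal the tweet describes.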
Carolina Parada@parada_car88104·
Excited to announce #GeminiRobotics ER 1.6, our upgraded embodied reasoning model with:
⏱️ Advanced instrument reading
👉 Improved spatial pointing
✅ Enhanced success detection (including multiview)
🛑 More robust safety features
Read more: deepmind.google/blog/gemini-ro…
Google DeepMind@GoogleDeepMind

We’re rolling out an upgrade designed to help robots reason about the physical world. 🤖 Gemini Robotics-ER 1.6 has significantly better visual and spatial understanding in order to plan and complete more useful tasks. Here’s why this is important 🧵

Google DeepMind@GoogleDeepMind·
Gemini Robotics-ER 1.6 enables robots to better pinpoint objects in an image. Ask it to find tools in a cluttered workshop, and it can accurately identify and count the right items without flagging things that aren't present.
Google DeepMind@GoogleDeepMind·
We’re rolling out an upgrade designed to help robots reason about the physical world. 🤖 Gemini Robotics-ER 1.6 has significantly better visual and spatial understanding in order to plan and complete more useful tasks. Here’s why this is important 🧵
Michael Elabd@MichaelElabd·
@askalphaxiv Such a clean idea tbh, tool definitions always felt too restrictive for explaining the capabilities and constraints of the environment
alphaXiv@askalphaxiv·
“Natural-Language Agent Harnesses” This paper argues that agent performance increasingly depends on the harness around the model, but that harness logic is usually buried in controller code and runtime-specific conventions. So they propose Natural-Language Agent Harnesses (NLAHs), which express harness behavior as editable natural language with explicit contracts, roles, stages, state semantics, and failure modes. It also introduces Intelligent Harness Runtime (IHR), a shared runtime that executes these harnesses through explicit contracts, durable artifacts, and lightweight adapters, making agent scaffolds more comparable and scientifically analyzable.
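To make the idea concrete, here is one possible way such a harness stage could be represented; everything below (the class, field names, and example contract text) is purely illustrative and not taken from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class HarnessStage:
    """Hypothetical sketch: a harness stage whose behavior is specified
    in editable natural language rather than buried in controller code."""
    role: str                    # e.g. which agent persona runs this stage
    stage: str                   # where it sits in the pipeline
    contract: str                # what the stage promises, as prose
    failure_modes: list = field(default_factory=list)

# Example: a planning stage with an explicit natural-language contract
planner = HarnessStage(
    role="planner",
    stage="decompose",
    contract="Given a task, emit an ordered list of subtasks; never call tools.",
    failure_modes=["emits a subtask that requires an unavailable tool"],
)
```

The appeal the tweet points at is that a runtime (like the paper's IHR) could execute against contracts like this one, making different scaffolds comparable through a shared, inspectable representation.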
Harry Partridge@part_harry_·
@MichaelElabd I am writing a post explaining my thoughts on this right now! tldr: SDPO has its own kind of reward hacking that likely makes it very difficult to get right at scale.
Harry Partridge@part_harry_·
Human-in-the-loop RL is necessarily done at group size 1; you cannot do a group of rollouts with only one human, i.e. there is no baseline for you to subtract for each input prompt. This is by far the most interesting and under-discussed part of this announcement. The same was true for their tab-completion model. From the wording in their posts, it sounds like they are using plain REINFORCE (no mention of value functions) with a large batch size + re-evaluating each checkpoint to guard against high variance. Cursor is implicitly revealing an important empirical result: with a large enough batch size, simple REINFORCE just works, no baseline needed. In other words, large-scale continual learning is solved.
Cursor@cursor_ai

Earlier this week, we published our technical report on Composer 2. We're sharing additional research on how we train new checkpoints. With real-time RL, we can ship improved versions of the model every five hours.

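The estimator being discussed is simple enough to write down directly. This is a toy scalar sketch of the no-baseline REINFORCE gradient estimate (real training code would operate on per-parameter gradient tensors, and this is my illustration, not Cursor's implementation):

```python
def reinforce_grad_estimate(rewards, grad_logps):
    """Plain REINFORCE, no baseline: average the reward-weighted
    score-function terms r_i * ∇log π(a_i) over the batch.
    Here grad_logps are scalars standing in for gradient tensors.
    With a large enough batch, the variance of this estimator can be
    tolerable even without subtracting a per-prompt baseline."""
    n = len(rewards)
    return sum(r * g for r, g in zip(rewards, grad_logps)) / n
```

The contrast with group-based methods is that there is no per-prompt mean reward to subtract here; the only variance-reduction lever left is the batch size itself, which is exactly the trade-off the tweet highlights.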
Michael Elabd@MichaelElabd·
@dair_ai Highly recommend the Next Intelligence Explosion paper!
Michael Elabd@MichaelElabd·
Love when papers introduce general frameworks for training-time continual learning! Continual learning only works if experience actually accumulates (learning new skills from attempting new tasks). Memory is a great solution for that experience accumulation; however, most memory-based methods are inefficient (saving full trajectories is usually noisy and redundant). SkillRL solves this problem by distilling trajectories into reusable skills (a very Voyager-esque approach with their skill library). Another cool insight is that they distinguish between "general skills" and "task-specific" ones to let the model attend to the right skill set. The outcome is ~10× memory reduction and SOTA performance, even beating much larger closed-source models. The interesting question to ask here: does structured experience / continual learning beat pre-training scale, or will this be another bitter lesson for the AI community?
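A minimal sketch of the distill-instead-of-store idea, with the general vs. task-specific split (class and field names are my own; in practice the `summarize` step would be an LLM call, not a plain function):

```python
class SkillLibrary:
    """Toy skill library: store compact distilled skills rather than
    full raw trajectories, split into general and task-specific buckets."""

    def __init__(self):
        self.general = {}   # skills reusable across tasks
        self.task = {}      # task_id -> skills specific to that task

    def distill(self, task_id, trajectory, summarize):
        """summarize: callable collapsing a noisy trajectory into a short
        skill dict with 'name', 'steps', and a 'general' flag."""
        skill = summarize(trajectory)
        if skill.get("general"):
            bucket = self.general
        else:
            bucket = self.task.setdefault(task_id, {})
        bucket[skill["name"]] = skill["steps"]  # keep only the distillate
        return skill

    def retrieve(self, task_id):
        # Attend to general skills plus skills specific to this task
        return {**self.general, **self.task.get(task_id, {})}
```

The ~10× memory reduction claim then corresponds to the ratio between raw trajectory length and the distilled `steps`, since only the latter is ever stored.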