Michael Elabd

108 posts

@MichaelElabd

Foundational Research @ DeepMind

San Francisco, CA · Joined July 2020
310 Following · 1.2K Followers
Michael Elabd@MichaelElabd·
@HyperAI_News Definitely, I would still like to see it extend beyond a 24-hour challenge to see how an agent can accumulate knowledge across sessions, users, and even organizations!
Hyper.AI@HyperAI_News·
@MichaelElabd The concept of Hierarchical Cognitive Caching is brilliant. Moving from just 'running experiments' to actually 'accumulating knowledge' is the missing piece for truly autonomous agents.
Michael Elabd@MichaelElabd·
Continual learning for MLE just hit a new milestone 🥁 ML-Master 2.0 reached SOTA on OpenAI’s MLE-Bench, and it’s not a new model or some RL technique; it’s better memory. The team introduced Hierarchical Cognitive Caching, a module that splits context into "experience" (short-term), "knowledge" (mid-term), and "wisdom" (long-term) memory layers. This lets the agent maintain coherence over 24-hour experimental cycles and strikes a nice balance between exploration and persistence. This is a glimpse of AI systems that don’t just run experiments… they *accumulate knowledge*. It’s really interesting to think about what products will look like once agents can truly accumulate that knowledge.
Michael Elabd@MichaelElabd·
So how does Hierarchical Cognitive Caching work? It’s like how CPUs manage memory: L1 cache is tiny but super fast for what you need immediately, L2 holds more at slightly slower access speed, and L3/RAM stores everything you care about for the task. Each tier trades capacity for access speed, and data gets promoted (or evicted) between them as needed. This decouples execution from strategy: L1 handles the fast loop (debugging, config changes, etc.), while L2+L3 handle the slow loop (reflection, exploration, etc.). From the paper's ablations, we can confirm every layer matters (though working memory is the most critical, with performance collapsing from ~70% to ~20%!). Would be interested in ablations with 5 or even 100 memory layers that an agent can control. I guess the question is: do agents also need an SSD/HDD for storing and retrieving even more information?
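As a toy illustration of the tiered-memory idea (the class name, tier labels, and LRU promotion/eviction policy here are my own invention for the CPU-cache analogy, not the paper's actual mechanism):

```python
from collections import OrderedDict

class HierarchicalCache:
    """Toy three-tier memory: each tier trades capacity for 'access speed'.
    Hot items live in tier 0 (L1); LRU evictions demote to the next tier."""

    def __init__(self, capacities=(2, 4, 8)):
        # One OrderedDict per tier, used as an LRU: L1, L2, L3
        self.tiers = [OrderedDict() for _ in capacities]
        self.caps = capacities

    def put(self, key, value):
        self._insert(0, key, value)

    def _insert(self, level, key, value):
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)  # mark most recently used
        if len(tier) > self.caps[level]:
            old_key, old_val = tier.popitem(last=False)  # evict LRU item
            if level + 1 < len(self.tiers):
                self._insert(level + 1, old_key, old_val)  # demote downward
            # else: dropped entirely once the last tier overflows

    def get(self, key):
        for tier in self.tiers:
            if key in tier:
                value = tier.pop(key)
                self._insert(0, key, value)  # promote to L1 on access
                return value
        return None
```

The decoupling in the tweet maps onto this sketch: the fast loop only ever touches tier 0, while slower reflection reads from (and is refilled by) the deeper tiers.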
Scale Labs@ScaleAILabs·
Big research update: we have 6 papers accepted at ICLR 2026! 🎉 We’re pushing the frontier of eval-driven RL, rubric-based rewards, and agentic capabilities. Over the next week, we’ll be sharing insights from our accepted papers. Here’s a preview of the work we'll be presenting in Brazil 🧵
Michael Elabd@MichaelElabd·
@cwolferesearch Would really like to see a logprob ablation to see the impact on the quality of the judging
Cameron R. Wolfe, Ph.D.@cwolferesearch·
Strongly recommend the LLM-as-a-Verifier writeup. Biggest takeaway for me is that increasing scoring granularity makes the verifier more effective. This indicates that LLM judges / verifiers are developing new (and better) capabilities.

This did not work well 1-2 years ago. In fact, LLM-as-a-Judge best practice was that lower scoring granularity (e.g., binary, ternary, or a 1-5 Likert score) worked way better than granular scores (e.g., a 1-100 scale). This was a constant recommendation I gave for setting up LLM judges properly. It seems that recent frontier LLMs are better at scoring at finer granularities, making this best practice (potentially) obsolete.

One caveat to this finding is that the scoring setup used in this writeup is a specific setup based upon logprobs. Instead of just using the score token outputted by the LLM as the result, they compute the logprob of each possible score token and take a weighted average of scores (with weights given by the probabilities). Then, they go further by expanding this weighted average across repeated verifications and multiple criteria:

Reward = (1 / CK) * ∑_{c=1}^{C} ∑_{k=1}^{K} ∑_{g=1}^{G} score_prob * score_value

where C is the total number of evaluation criteria, K is the number of repeated verifications, and G is the scoring granularity (i.e., the number of unique scoring output options). The reward determines whether a particular output passes verification across criteria.

When using this logprob setup, we see consistent gains in verifier accuracy by:
- Increasing scoring granularity G.
- Increasing repeated verifications K.
- Increasing the number of evaluation criteria C.

The last two findings are in line with prior work, but the fact that higher scoring granularity is helpful is interesting!

In the LLM-as-a-Verifier paper, this system is used at inference time in a pairwise fashion as described below. "To pick the best trajectory among N candidates for a given task, a round-robin tournament is conducted. For every pair (i, j) the verifier produces Reward(i) and Reward(j) using the formula above. The trajectory with the higher reward receives a win, and the trajectory with the most wins across all \binom{N}{2} pairs is selected."
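The round-robin selection step quoted above can be sketched in a few lines (a minimal illustration; `reward_fn` stands in for the verifier's Reward(·), and the tie-handling here is my own simplification):

```python
from itertools import combinations

def select_best(trajectories, reward_fn):
    """Round-robin tournament: compare every pair of candidates and
    return the trajectory with the most pairwise wins."""
    wins = [0] * len(trajectories)
    for i, j in combinations(range(len(trajectories)), 2):
        # The higher-reward trajectory of the pair earns a win
        # (ties arbitrarily go to j in this sketch).
        if reward_fn(trajectories[i]) > reward_fn(trajectories[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    return trajectories[max(range(len(trajectories)), key=wins.__getitem__)]
```

With N candidates this runs the verifier over all N-choose-2 pairs, which is exactly the quadratic cost the tournament formulation implies.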
Michael Elabd@MichaelElabd·
@Azaliamirh Very interesting! Wondering how big the gain is from running it as best-of-1 vs best-of-10
Azalia Mirhoseini@Azaliamirh·
Turns out we can get SOTA on agentic benchmarks with a simple test-time method! Excited to introduce LLM-as-a-Verifier. Test-time scaling is effective, but picking the "winner" among many candidates is the bottleneck. We introduce a way to extract a cleaner signal from the model: 1️⃣ Ask the LLM to rank results on a scale of 1-k 2️⃣ Use the log-probs of those rank tokens to calculate an expected score You can get a verification score in a single sampling pass per candidate pair. Blog: llm-as-a-verifier.notion.site Code: llm-as-a-verifier.github.io Led by @jackyk02 and in collaboration with a great team: @shululi256, @pranav_atreya, @liu_yuejiang, @drmapavone, @istoica05
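The two-step recipe (rank tokens → expected score from their log-probs) can be sketched as follows. This is a minimal illustration, not the blog's implementation; the function name and the renormalization over the k rank tokens are my assumptions:

```python
import math

def expected_score(score_logprobs):
    """score_logprobs: dict mapping each rank value (1..k) to the model's
    log-probability of emitting that rank token. Returns the
    probability-weighted expected score: one verification score per pass."""
    probs = {s: math.exp(lp) for s, lp in score_logprobs.items()}
    z = sum(probs.values())  # renormalize over just the k rank tokens
    return sum(s * p / z for s, p in probs.items())
```

For example, if the model puts probability 0.25, 0.25, 0.5 on ranks 1, 2, 3, the expected score is 2.25 rather than the single argmax token 3, which is the cleaner signal the tweet describes.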
Carolina Parada@parada_car88104·
Excited to announce #GeminiRobotics ER 1.6, our upgraded embodied reasoning model with:
⏱️ Advanced instrument reading
👉 Improved spatial pointing
✅ Enhanced success detection (including multiview)
🛑 More robust safety features
Read more: deepmind.google/blog/gemini-ro…
Google DeepMind@GoogleDeepMind

We’re rolling out an upgrade designed to help robots reason about the physical world. 🤖 Gemini Robotics-ER 1.6 has significantly better visual and spatial understanding in order to plan and complete more useful tasks. Here’s why this is important 🧵

Google DeepMind@GoogleDeepMind·
Gemini Robotics-ER 1.6 enables robots to better pinpoint objects in an image. Ask it to find tools in a cluttered workshop, and it can accurately identify and count the right items without flagging things that aren't present.
Google DeepMind@GoogleDeepMind·
We’re rolling out an upgrade designed to help robots reason about the physical world. 🤖 Gemini Robotics-ER 1.6 has significantly better visual and spatial understanding in order to plan and complete more useful tasks. Here’s why this is important 🧵
Michael Elabd@MichaelElabd·
@askalphaxiv Such a clean idea tbh, tool definitions always felt too restrictive for explaining the capabilities and constraints of the environment
alphaXiv@askalphaxiv·
“Natural-Language Agent Harnesses” This paper argues that agent performance increasingly depends on the harness around the model, but that harness logic is usually buried in controller code and runtime-specific conventions. So they propose Natural-Language Agent Harnesses (NLAHs), which express harness behavior as editable natural language with explicit contracts, roles, stages, state semantics, and failure modes. It also introduces Intelligent Harness Runtime (IHR), a shared runtime that executes these harnesses through explicit contracts, durable artifacts, and lightweight adapters, making agent scaffolds more comparable and scientifically analyzable.
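To make the idea concrete, here is one possible way such a harness stage could be represented; everything below (the class, field names, and example contract text) is purely illustrative and not taken from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class HarnessStage:
    """Hypothetical sketch: a harness stage whose behavior is specified
    in editable natural language rather than buried in controller code."""
    role: str                    # e.g. which agent persona runs this stage
    stage: str                   # where it sits in the pipeline
    contract: str                # what the stage promises, as prose
    failure_modes: list = field(default_factory=list)

# Example: a planning stage with an explicit natural-language contract
planner = HarnessStage(
    role="planner",
    stage="decompose",
    contract="Given a task, emit an ordered list of subtasks; never call tools.",
    failure_modes=["emits a subtask that requires an unavailable tool"],
)
```

The appeal the tweet points at is that a runtime (like the paper's IHR) could execute against contracts like this one, making different scaffolds comparable through a shared, inspectable representation.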
Harry Partridge@part_harry_·
@MichaelElabd I am writing a post explaining my thoughts on this right now! tldr: SDPO has its own kind of reward hacking that likely makes it very difficult to get right at scale.
Harry Partridge@part_harry_·
Human-in-the-loop RL is necessarily done at group size 1; you cannot do a group of rollouts with only one human, i.e. there is no baseline for you to subtract for each input prompt. This is by far the most interesting and under-discussed part of this announcement. The same was true for their tab-completion model. From the wording in their posts, it sounds like they are using plain REINFORCE (no mention of value functions) with a large batch size + re-evaluating each checkpoint to guard against high variance. Cursor is implicitly revealing an important empirical result: with a large enough batch size, simple REINFORCE just works, no baseline needed. In other words, large-scale continual learning is solved.
Cursor@cursor_ai

Earlier this week, we published our technical report on Composer 2. We're sharing additional research on how we train new checkpoints. With real-time RL, we can ship improved versions of the model every five hours.

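The estimator being discussed is simple enough to write down directly. This is a toy scalar sketch of the no-baseline REINFORCE gradient estimate (real training code would operate on per-parameter gradient tensors, and this is my illustration, not Cursor's implementation):

```python
def reinforce_grad_estimate(rewards, grad_logps):
    """Plain REINFORCE, no baseline: average the reward-weighted
    score-function terms r_i * ∇log π(a_i) over the batch.
    Here grad_logps are scalars standing in for gradient tensors.
    With a large enough batch, the variance of this estimator can be
    tolerable even without subtracting a per-prompt baseline."""
    n = len(rewards)
    return sum(r * g for r, g in zip(rewards, grad_logps)) / n
```

The contrast with group-based methods is that there is no per-prompt mean reward to subtract here; the only variance-reduction lever left is the batch size itself, which is exactly the trade-off the tweet highlights.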
Michael Elabd@MichaelElabd·
@dair_ai Highly recommend the Next Intelligence Explosion paper!
Michael Elabd@MichaelElabd·
Love when papers introduce general frameworks for training-time continual learning! Continual learning only works if experience actually accumulates (learning new skills from attempting new tasks). Memory is a great solution for that experience accumulation; however, most memory-based methods are inefficient (saving full trajectories is usually noisy and redundant). SkillRL solves this problem by distilling trajectories into reusable skills (a very Voyager-esque approach with their skill library). Another cool insight is that they distinguish between "general skills" and "task-specific" ones to let the model attend to the right skill set. The outcome is ~10× memory reduction and SOTA performance, even beating much larger closed-source models. The interesting question to ask here: does structured experience / continual learning beat pre-training scale, or will this be another bitter lesson for the AI community?
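A minimal sketch of the distill-instead-of-store idea, with the general vs. task-specific split (class and field names are my own; in practice the `summarize` step would be an LLM call, not a plain function):

```python
class SkillLibrary:
    """Toy skill library: store compact distilled skills rather than
    full raw trajectories, split into general and task-specific buckets."""

    def __init__(self):
        self.general = {}   # skills reusable across tasks
        self.task = {}      # task_id -> skills specific to that task

    def distill(self, task_id, trajectory, summarize):
        """summarize: callable collapsing a noisy trajectory into a short
        skill dict with 'name', 'steps', and a 'general' flag."""
        skill = summarize(trajectory)
        if skill.get("general"):
            bucket = self.general
        else:
            bucket = self.task.setdefault(task_id, {})
        bucket[skill["name"]] = skill["steps"]  # keep only the distillate
        return skill

    def retrieve(self, task_id):
        # Attend to general skills plus skills specific to this task
        return {**self.general, **self.task.get(task_id, {})}
```

The ~10× memory reduction claim then corresponds to the ratio between raw trajectory length and the distilled `steps`, since only the latter is ever stored.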