Alex Dimakis

4.5K posts

Alex Dimakis

@AlexGDimakis

Professor, UC berkeley | Founder @bespokelabsai |

Berkeley, CA เข้าร่วม Nisan 2009

2.5K กำลังติดตาม22.8K ผู้ติดตาม

Alex Dimakis@AlexGDimakis·14h

@pgasawa @Thom_Wolf yup: training an advisor model to personalize GPT also makes it a good personalization advisor to Claude. The second aspect is robustness: Training an advisor model to improve performance on some specialized task does not reduce performance on unrelated tasks.

English

199

Parth Asawa@pgasawa·16h

One way to not tie adapters to the base model is by making the communicating mechanism b/w adapter and base model natural language (i.e. model agnostic and transferable) That was one of the big points behind the Advisor Models work (arxiv.org/pdf/2510.02453). You can train on one model and transfer your advisor to any other model with minimal degradation because learning to communicate in natural language makes what it learns largely transferable.

English

1.1K

Thomas Wolf@Thom_Wolf·17h

This is really cool. It got me thinking more deeply about personalized RL: what’s the real point of personalizing a model in a world where base models can become obsolete so quickly? The reality in AI is that new models ship every few weeks, each better than the last. And the pace is only accelerating, as we see on the Hugging Face Hub. We are not far away from better base models dropping daily. There’s a research gap in RL here that almost no one is working on. Most LLM personalization research assumes a fixed base model, but very few ask what happens to that personalization when you swap the base model. Think about going from Llama 3 to Llama 4. All the tuned preferences, reward signals, and LoRAs are suddenly tied to yesterday’s model. As a user or a team, you don’t want to reteach every new model your preferences. But you also don’t want to be stuck on an older one just because it knows you. We could call this "RL model transferability": how can an RL trace, a reward signal, or a preference representation trained on model N be distilled, stored, and automatically reapplied to model N+1 without too much user involvement? We solved that in SFT where a training dataset can be stored and reused to train a future model. We also tackled a version of that in RLHF phases somehow but it remain unclear more generally when using RL deployed in the real world. There are some related threads (RLTR for transferable reasoning traces, P-RLHF and PREMIUM for model-agnostic user representations, HCP for portable preference protocols) but the full loop seems under-studied to me. Some of these questions are about off-policy but other are about capabilities versus personalization: which of the old customizations/fixes does the new model already handle out of the box, and which ones are actually user/team-specific to ever be solved by default? That you would store in a skill for now but that RL allow to extend beyond the written guidance level. I have surely missed some work so please post any good work you’ve seen on this topic in the comments.

Ronak Malde@rronak_

This paper is almost too good that I didn't want to share it Ignore the OpenClaw clickbait, OPD + RL on real agentic tasks with significant results is very exciting, and moves us away from needing verifiable rewards Authors: @YinjieW2024 Xuyang Chen, Xialong Jin, @MengdiWang10 @LingYang_PU

English

543

84.9K

Alex Dimakis@AlexGDimakis·2d

Let an AI agent design its own evolution process.

Shu Lynn Liu@shulynnliu

Researchers spend hours and hours hand-crafting the strategies behind LLM-driven optimization systems like AlphaEvolve: deciding which ideas to reuse, when to explore vs exploit, and what mutations to try. 🤖But what if AI could evolve its own evolution process? We introduce EvoX, a meta-evolution pipeline that lets AI evolve the strategy guiding the optimization. It achieves high-quality solutions for <$5, while existing open systems and even Claude Code often cost 3-5× more on some tasks. Across ~200 optimization problems, EvoX delivers the strongest overall results: often outperforming AlphaEvolve, OpenEvolve, GEPA, and ShinkaEvolve on math and systems tasks, exceeding human SOTA, and improving median performance by up to 61% on 172 competitive programming problems. 👇

English

9.6K

Alex Dimakis@AlexGDimakis·4d

@guohao_li Yup that’s what we are doing in Bespoke Labs

English

467

Guohao Li 🐫@guohao_li·4d

Enjoyed reading this article. It resonates with me a lot. Almost a year since we started the Scaling Environments for Agents initiative: camel-ai.org/blogs/scaling-… Building scalable pipelines for agent environments has been a lot of fun. But one realization keeps coming back: high quality difficult RL environments don’t emerge from surface-level understanding of a domain. There’s still a huge amount of research to be done to figure out how to scale them systematically. Maybe it’s time to start a dedicated research lab focused on scaling RL environments.

Elliot Arledge@elliotarledge

x.com/i/article/2032…

English

129

16.8K

Alex Dimakis รีทวีตแล้ว

Shangyin Tan@ShangyinT·20 Şub

GEPA for skills is here! Introducing gskill, an automated pipeline to learn agent skills with @gepa_ai. With learned skills, we boost Claude Code’s repository task resolution rate to near-perfect levels, while making it 47% faster. Here's how we did it:

English

382

78.6K

Alex Dimakis@AlexGDimakis·5d

@VladBarash Gepa gskill was made for that. But you need tasks that develop the skills, eg with swe smith.

English

547

Vlad Barash@VladBarash·5d

GEPA or Auto Research for optimizing a skill.md?

English

1.5K

Alex Dimakis@AlexGDimakis·13 Mar

Evolutionary algorithms that optimize prompts and programs with LLMs and reflection are becoming excellent at optimizing anything that can be measured. We introduce AdaEvolve, an adaptive evolutionary algorithm that dynamically adjusts its own search strategy based on observed progress.

Mert Cemri@mertcemri

AlphaEvolve proved LLMs can discover novel algorithms, but it remains closed-source, and open-source alternatives (OpenEvolve, GEPA) rely on rigid, static search policies. Introducing AdaEvolve: a fully adaptive evolutionary algorithm that dynamically adjusts its own search strategy based on observed progress. It matches or beats AlphaEvolve and best known Human SOTA on math and systems benchmarks, and boosts Frontier-CS median scores by 33% over the best open-source baseline across 185 tasks. 🧵👇 (1/n)

English

117

22.9K

Alex Dimakis@AlexGDimakis·10 Mar

@ylecun Congratulations Yann.

English

413

Yann LeCun@ylecun·10 Mar

Unveiling our new startup Advanced Machine Intelligence (AMI Labs). We just completed our seed round: $1.03B / 890M€, one the largest seeds ever, probably the largest for a European company. We're hiring! [the background image is the Veil Nebula - a picture I took from my backyard, most appropriate for an unveiling] More details here: techcrunch.com/2026/03/09/yan…

AMI Labs@amilabs

Advanced Machine Intelligence (AMI) is building a new breed of AI systems that understand the world, have persistent memory, can reason and plan, and are controllable and safe. We’ve raised a $1.03B (~€890M) round from global investors who believe in our vision of universally intelligent systems centered on world models. This round is co-led by Cathay Innovation, Greycroft, Hiro Capital, HV Capital, and Bezos Expeditions, along with other investors and angels across the world. We are a growing team of researchers and builders, operating in Paris, New York, Montreal and Singapore from day one. Read more: amilabs.xyz AMI - Real world. Real intelligence.

English

865

1.9K

19.3K

2.6M

Alex Dimakis@AlexGDimakis·8 Mar

@marwaabdulhai @rm_rf_ryan @timalthoff @svlevine Very interesting work!

English

Marwa Abdulhai@marwaabdulhai·3 Ara

I am at #NeurIPS2025 in San Diego! I will be presenting a poster for my paper: Consistently Simulating Human Personas with Multi-turn RL on Friday Exhibit Hall C,D,E #1805 at 4:30 pm: arxiv.org/abs/2511.00222 Co-authors: @rm_rf_ryan donovanclay @timalthoff @svlevine @natashajaques

English

6.2K

Alex Dimakis รีทวีตแล้ว

Addy Osmani@addyosmani·5 Mar

Introducing the Google Workspace CLI: github.com/googleworkspac… - built for humans and agents. Google Drive, Gmail, Calendar, and every Workspace API. 40+ agent skills included.

English

654

1.6K

15K

5.4M

Alex Dimakis@AlexGDimakis·4 Mar

This is a wild fun post. The error correction coding and distribution shaping got the information theorist in me excited.

Davis Blalock@davisblalock

🚀 Today we’re releasing FlashOptim: better implementations of Adam, SGD, etc, that compute the same updates but save tons of memory. You can use it right now via `pip install flashoptim`. 🚀 arxiv.org/abs/2602.23349 A bunch of cool ideas make this possible: [1/n]

English

8.3K

Alex Dimakis@AlexGDimakis·3 Mar

I am still waiting for that day when I will start Discord and it will not need 8 updates.

English

1.2K

Alex Dimakis@AlexGDimakis·3 Mar

Excited to announce SkyDiscover: Think of it as Open-source AlphaEvolve with modular components and the option to make it human-in-the-loop. Also with leading performance on evolution and optimization benchmarks.

Shu Lynn Liu@shulynnliu

AlphaEvolve is closed-source. We release 🌟SkyDiscover🌟, a flexible, modular open-source framework with two new adaptive algorithms that match or exceed AlphaEvolve on many benchmarks and outperform OpenEvolve, GEPA, and ShinkaEvolve across 200+ optimization tasks. Our new algorithms dynamically adapt their search strategy, and can even let the AI optimize its own optimization process on the fly! Results: 📊 +34% median score improvement on 172 Frontier-CS problems. 🧮 Matches/exceeds AlphaEvolve on many math benchmarks ⚙️ Discovers system optimizations beyond human-designed SOTA 🧵👇

English

11.2K

Alex Dimakis@AlexGDimakis·3 Mar

Gradient descent can train a computer 😎

Dimitris Papailiopoulos@DimitrisPapail

x.com/i/article/2028…

English

17.7K

Alex Dimakis@AlexGDimakis·28 Şub

@alexatallah Tb2. If you want a faster and easier eval, try openthoughts-tblite. You can run of these with harbor. If you want to eval a harness, fix the llm and try different harnesses.

English

607

Alex Atallah@alexatallah·27 Şub

What is the best benchmark for agent harnesses?

English

15.6K

Alex Dimakis@AlexGDimakis·27 Şub

It’s fine to update 400 gradient steps old models. Great work !

Kianté Brantley@xkianteb

Does LLM RL post-training need to be on-policy?

English

7.7K

Alex Dimakis@AlexGDimakis·27 Şub

(2/2) Thank you @LaudeInstitute and the team: @etash_guha @ryanmart3n @FeuerBenjamin @NeginRaoof_ @tyler_griggs_ @alexgshaw . @Mike_A_Merril and of course the Datacomp OG: @lschmidt3 Our project is here: openthoughts.ai/blog/agent

English

715

Alex Dimakis@AlexGDimakis·27 Şub

I would like to thank Laude Institute for supporting OpenThoughts-Agent as one of the Slingshots projects. We have been focused on data curation research with the Datacomp (2023), DCLM (2024) and OpenThoughts (2025) projects. This year, OpenThoughts-Agent focuses on data curation for Teminal-Bench agents. We release end-to-end data, environments and RL loops so that researchers can build compare and improve agents.

English

1.8K

Alex Dimakis@AlexGDimakis·26 Şub

66% of my AI Twitter news today is Dimitris vibe coding his ideas.

English

4.1K

Alex Dimakis@AlexGDimakis·25 Şub

The BenchPress idea is delightfully simple application of compressed sensing on AI evals: Instead of running all the benchmarks, run a few (ideally the cheaper ones) and use these numbers as features, given to a model to predict the other benchmark numbers from these observations. Turns out the matrix of benchmarks is very low-rank and the matrix completion model works very well. My thought is that at the end of the day you still need to run all the benchmarks, but while iterating, this is a valuable trick to get more signal of what works and in which direction. You can also look at the low-rank directions you discover and understand how your model performs in these data-driven performance directions. They may be easy to name: ('Persistence, 'Coding comfort','Terminal use ability?' etc?)

Dimitris Papailiopoulos@DimitrisPapail

Running Terminal-Bench 2.0 on expensive frontier models costs $1K–$50K or more. BenchPress Predicts Gemini 3.1 Pro and Claude Opus 4.6's scores within ±2 points after 15 randomly selected benchmarks. .... using zero agentic benchmark data!! Cost: $0.

English

6.7K

Alex Dimakis@AlexGDimakis·25 Şub

The Mac Mini has found its purpose.

Andrej Karpathy@karpathy

Bought a new Mac mini to properly tinker with claws over the weekend. The apple store person told me they are selling like hotcakes and everyone is confused :) I'm definitely a bit sus'd to run OpenClaw specifically - giving my private data/keys to 400K lines of vibe coded monster that is being actively attacked at scale is not very appealing at all. Already seeing reports of exposed instances, RCE vulnerabilities, supply chain poisoning, malicious or compromised skills in the registry, it feels like a complete wild west and a security nightmare. But I do love the concept and I think that just like LLM agents were a new layer on top of LLMs, Claws are now a new layer on top of LLM agents, taking the orchestration, scheduling, context, tool calls and a kind of persistence to a next level. Looking around, and given that the high level idea is clear, there are a lot of smaller Claws starting to pop out. For example, on a quick skim NanoClaw looks really interesting in that the core engine is ~4000 lines of code (fits into both my head and that of AI agents, so it feels manageable, auditable, flexible, etc.) and runs everything in containers by default. I also love their approach to configurability - it's not done via config files it's done via skills! For example, /add-telegram instructs your AI agent how to modify the actual code to integrate Telegram. I haven't come across this yet and it slightly blew my mind earlier today as a new, AI-enabled approach to preventing config mess and if-then-else monsters. Basically - the implied new meta is to write the most maximally forkable repo and then have skills that fork it into any desired more exotic configuration. Very cool. Anyway there are many others - e.g. nanobot, zeroclaw, ironclaw, picoclaw (lol @ prefixes). There are also cloud-hosted alternatives but tbh I don't love these because it feels much harder to tinker with. In particular, local setup allows easy connection to home automation gadgets on the local network. And I don't know, there is something aesthetically pleasing about there being a physical device 'possessed' by a little ghost of a personal digital house elf. Not 100% sure what my setup ends up looking like just yet but Claws are an awesome, exciting new layer of the AI stack.

English

2.9K

ค้นพบ

@pgasawa @Thom_Wolf @guohao_li @gepa_ai @VladBarash @ylecun @marwaabdulhai @rm_rf_ryan