Alex Dimakis

4.5K posts

@AlexGDimakis

Professor, UC Berkeley | Founder @bespokelabsai

Berkeley, CA · Joined April 2009
2.5K Following · 22.8K Followers
Alex Dimakis
Alex Dimakis@AlexGDimakis·
@pgasawa @Thom_Wolf yup: training an advisor model to personalize GPT also makes it a good personalization advisor to Claude. The second aspect is robustness: Training an advisor model to improve performance on some specialized task does not reduce performance on unrelated tasks.
0
1
4
199
Parth Asawa
Parth Asawa@pgasawa·
One way to not tie adapters to the base model is by making the communicating mechanism b/w adapter and base model natural language (i.e. model agnostic and transferable) That was one of the big points behind the Advisor Models work (arxiv.org/pdf/2510.02453). You can train on one model and transfer your advisor to any other model with minimal degradation because learning to communicate in natural language makes what it learns largely transferable.
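The natural-language channel described above can be sketched roughly like this. This is a hypothetical illustration, not code from the Advisor Models paper; the function names and the fixed advice string are my own stand-ins.

```python
# Hypothetical sketch of the natural-language advisor pattern described
# above. Names (`advisor`, `run_with_advisor`) and the fixed advice string
# are illustrative, not from the Advisor Models paper.

def advisor(task: str) -> str:
    """Stand-in for a small trained advisor model; returns plain-text advice."""
    return "The user prefers concise answers with concrete examples."

def run_with_advisor(base_model_call, task: str) -> str:
    # The only channel between advisor and base model is natural language,
    # so the base model can be swapped without retraining the advisor.
    advice = advisor(task)
    prompt = f"Advice: {advice}\n\nTask: {task}"
    return base_model_call(prompt)

# Any callable mapping prompt -> text works as the "base model":
echo_model = lambda p: p.upper()  # toy stand-in for GPT, Claude, etc.
out = run_with_advisor(echo_model, "explain LoRA")
```

Because the advisor's output is plain text rather than weights or logits, nothing in this loop is tied to a particular base model's architecture.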
1
2
13
1.1K
Thomas Wolf
Thomas Wolf@Thom_Wolf·
This is really cool. It got me thinking more deeply about personalized RL: what’s the real point of personalizing a model in a world where base models can become obsolete so quickly? The reality in AI is that new models ship every few weeks, each better than the last. And the pace is only accelerating, as we see on the Hugging Face Hub. We are not far from better base models dropping daily.

There’s a research gap in RL here that almost no one is working on. Most LLM personalization research assumes a fixed base model, but very few ask what happens to that personalization when you swap the base model. Think about going from Llama 3 to Llama 4. All the tuned preferences, reward signals, and LoRAs are suddenly tied to yesterday’s model. As a user or a team, you don’t want to reteach every new model your preferences. But you also don’t want to be stuck on an older one just because it knows you.

We could call this "RL model transferability": how can an RL trace, a reward signal, or a preference representation trained on model N be distilled, stored, and automatically reapplied to model N+1 without too much user involvement? We solved this in SFT, where a training dataset can be stored and reused to train a future model. We also tackled a version of it in RLHF phases, but it remains unclear how to do it more generally with RL deployed in the real world.

There are some related threads (RLTR for transferable reasoning traces, P-RLHF and PREMIUM for model-agnostic user representations, HCP for portable preference protocols), but the full loop seems under-studied to me. Some of these questions are about off-policy learning, but others are about capabilities versus personalization: which of the old customizations/fixes does the new model already handle out of the box, and which ones are actually user/team-specific and will never be solved by default? Those you would store in a skill for now, but RL allows extending beyond the written-guidance level.

I have surely missed some work, so please post any good work you’ve seen on this topic in the comments.
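One hypothetical shape for such a transferable store (my own sketch, not from any of the works mentioned above): keep preferences as model-agnostic plain text, and on a base-model swap, probe which ones the new model already satisfies out of the box before reapplying the rest.

```python
# Hypothetical sketch (not from any cited paper) of a model-agnostic
# preference store: preferences live as plain text, and on a base-model
# swap we keep only those the new model doesn't already handle by default.

preference_store = [
    "Always include unit tests with code suggestions",
    "Prefer British English spelling",
]

def already_satisfied(model_call, pref: str) -> bool:
    """Probe whether the new base model handles this preference by default.
    A real check would run an eval; here it's a single stubbed query."""
    return model_call(f"Do you already do this by default? {pref}") == "yes"

def migrate(model_call, prefs):
    # Carry over only the preferences the new model doesn't cover natively.
    return [p for p in prefs if not already_satisfied(model_call, p)]

new_model = lambda prompt: "no"  # toy stand-in for model N+1
remaining = migrate(new_model, preference_store)
```

The point of the sketch is the split: the "capabilities" half of old customizations gets dropped automatically, while the genuinely user-specific half survives the swap.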
Ronak Malde@rronak_

This paper is almost too good; I didn't want to share it. Ignore the OpenClaw clickbait: OPD + RL on real agentic tasks with significant results is very exciting, and moves us away from needing verifiable rewards. Authors: @YinjieW2024 Xuyang Chen, Xialong Jin, @MengdiWang10 @LingYang_PU

28
42
543
84.9K
Guohao Li 🐫
Guohao Li 🐫@guohao_li·
Enjoyed reading this article. It resonates with me a lot. It's been almost a year since we started the Scaling Environments for Agents initiative: camel-ai.org/blogs/scaling-… Building scalable pipelines for agent environments has been a lot of fun. But one realization keeps coming back: high-quality, difficult RL environments don’t emerge from a surface-level understanding of a domain. There’s still a huge amount of research to be done to figure out how to scale them systematically. Maybe it’s time to start a dedicated research lab focused on scaling RL environments.
Elliot Arledge@elliotarledge

x.com/i/article/2032…

10
15
129
16.8K
Alex Dimakis retweeted
Shangyin Tan
Shangyin Tan@ShangyinT·
GEPA for skills is here! Introducing gskill, an automated pipeline to learn agent skills with @gepa_ai. With learned skills, we boost Claude Code’s repository task resolution rate to near-perfect levels, while making it 47% faster. Here's how we did it:
Shangyin Tan tweet media
17
49
382
78.6K
Alex Dimakis
Alex Dimakis@AlexGDimakis·
@VladBarash GEPA gskill was made for that. But you need tasks that develop the skills, e.g. with SWE-smith.
0
1
1
547
Vlad Barash
Vlad Barash@VladBarash·
GEPA or Auto Research for optimizing a skill.md?
6
0
8
1.5K
Alex Dimakis retweeted
Addy Osmani
Addy Osmani@addyosmani·
Introducing the Google Workspace CLI: github.com/googleworkspac… - built for humans and agents. Google Drive, Gmail, Calendar, and every Workspace API. 40+ agent skills included.
654
1.6K
15K
5.4M
Alex Dimakis
Alex Dimakis@AlexGDimakis·
I am still waiting for that day when I will start Discord and it will not need 8 updates.
0
0
8
1.2K
Alex Dimakis
Alex Dimakis@AlexGDimakis·
@alexatallah Terminal-Bench 2. If you want a faster and easier eval, try openthoughts-tblite. You can run either of these with Harbor. If you want to eval a harness, fix the LLM and try different harnesses.
0
0
2
607
Alex Atallah
Alex Atallah@alexatallah·
What is the best benchmark for agent harnesses?
46
0
95
15.6K
Alex Dimakis
Alex Dimakis@AlexGDimakis·
I would like to thank the Laude Institute for supporting OpenThoughts-Agent as one of the Slingshots projects. We have been focused on data curation research with the DataComp (2023), DCLM (2024), and OpenThoughts (2025) projects. This year, OpenThoughts-Agent focuses on data curation for Terminal-Bench agents. We release end-to-end data, environments, and RL loops so that researchers can build, compare, and improve agents.
Alex Dimakis tweet media
2
1
25
1.8K
Alex Dimakis
Alex Dimakis@AlexGDimakis·
66% of my AI Twitter news today is Dimitris vibe coding his ideas.
Alex Dimakis tweet media
3
2
50
4.1K
Alex Dimakis
Alex Dimakis@AlexGDimakis·
The BenchPress idea is a delightfully simple application of compressed sensing to AI evals: instead of running all the benchmarks, run a few (ideally the cheaper ones) and use those numbers as features given to a model that predicts the remaining benchmark scores. It turns out the matrix of benchmark results is very low-rank, and the matrix-completion model works very well. My thought is that at the end of the day you still need to run all the benchmarks, but while iterating, this is a valuable trick to get more signal on what works and in which direction. You can also look at the low-rank directions you discover and understand how your model performs along these data-driven performance directions. They may be easy to name ('Persistence', 'Coding comfort', 'Terminal-use ability', etc.).
Dimitris Papailiopoulos@DimitrisPapail

Running Terminal-Bench 2.0 on expensive frontier models costs $1K–$50K or more. BenchPress predicts Gemini 3.1 Pro's and Claude Opus 4.6's scores within ±2 points after 15 randomly selected benchmarks, using zero agentic benchmark data. Cost: $0.

1
4
38
6.7K
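The low-rank completion idea behind the post above can be illustrated with a toy sketch. This is my own illustration, not the BenchPress authors' code: it builds an exactly rank-2 model-by-benchmark score matrix, hides most of one "new" model's scores, and recovers them by iterative truncated-SVD imputation.

```python
# Toy illustration of low-rank benchmark matrix completion (my own sketch,
# not the BenchPress implementation): observe a few scores for a new model
# and fill in the rest from the low-rank structure of the full matrix.
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(20, 2))          # 20 models, 2 latent "ability" axes
V = rng.normal(size=(2, 30))          # 30 benchmarks
scores = U @ V                        # exactly rank-2 ground-truth matrix

mask = np.ones_like(scores, dtype=bool)
mask[-1, 5:] = False                  # new model: only 5 benchmarks observed

X = np.where(mask, scores, 0.0)       # initialize missing entries at 0
for _ in range(200):                  # impute-and-truncate iterations
    u, s, vt = np.linalg.svd(X, full_matrices=False)
    low_rank = (u[:, :2] * s[:2]) @ vt[:2]   # best rank-2 approximation
    X = np.where(mask, scores, low_rank)     # re-impose observed entries

# Worst-case absolute error on the hidden scores of the new model:
err = np.abs(X[-1, 5:] - scores[-1, 5:]).max()
```

Hard imputation with a rank truncation is just one simple completion method; the paper's actual predictor may differ, but the key assumption is the same: benchmark results across models lie near a low-dimensional subspace.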
Alex Dimakis
Alex Dimakis@AlexGDimakis·
The Mac Mini has found its purpose.
Andrej Karpathy@karpathy

Bought a new Mac mini to properly tinker with claws over the weekend. The apple store person told me they are selling like hotcakes and everyone is confused :) I'm definitely a bit sus'd to run OpenClaw specifically: giving my private data/keys to a 400K-line vibe-coded monster that is being actively attacked at scale is not very appealing at all. Already seeing reports of exposed instances, RCE vulnerabilities, supply chain poisoning, malicious or compromised skills in the registry; it feels like a complete wild west and a security nightmare.

But I do love the concept, and I think that just like LLM agents were a new layer on top of LLMs, Claws are now a new layer on top of LLM agents, taking the orchestration, scheduling, context, tool calls, and a kind of persistence to the next level.

Looking around, and given that the high-level idea is clear, there are a lot of smaller Claws starting to pop up. For example, on a quick skim NanoClaw looks really interesting in that the core engine is ~4,000 lines of code (fits into both my head and that of AI agents, so it feels manageable, auditable, flexible, etc.) and runs everything in containers by default. I also love their approach to configurability: it's not done via config files, it's done via skills! For example, /add-telegram instructs your AI agent how to modify the actual code to integrate Telegram. I haven't come across this before, and it slightly blew my mind earlier today as a new, AI-enabled approach to preventing config mess and if-then-else monsters. Basically, the implied new meta is to write the most maximally forkable repo and then have skills that fork it into any desired, more exotic configuration. Very cool.

Anyway, there are many others, e.g. nanobot, zeroclaw, ironclaw, picoclaw (lol @ prefixes). There are also cloud-hosted alternatives, but tbh I don't love these because they feel much harder to tinker with. In particular, a local setup allows easy connection to home automation gadgets on the local network. And I don't know, there is something aesthetically pleasing about there being a physical device 'possessed' by a little ghost of a personal digital house elf. Not 100% sure what my setup ends up looking like just yet, but Claws are an awesome, exciting new layer of the AI stack.

0
1
3
2.9K