Romain Beaumont

1.1K posts

@rom1504

Gemini Video at Google DeepMind. Better dataset, better infra, and automate it all

Palo Alto, CA · Joined July 2008
1K Following · 2.1K Followers
Romain Beaumont@rom1504·
@VictorTaelin If you find some of these meta "how to approach the problem" solutions recurring, it might be worth encoding them as skills, so Opus will more easily reach for these kinds of ideas in the future
Taelin@VictorTaelin·
highlights: opus identified the O(n^2) issue on its own after I asked it to break down the load time of each file per phase. so, why couldn't it have done that on its own? it is such an obvious thing to do! also a bit proud of myself for the extra quote-bind that made the checker itself 10x faster. as soon as I noticed the match desugaring was being done post-bind, I suspected it might be causing a lot of wasted work on the evaluator, and that was indeed the case. quoting and rebinding materializes these computations and hands a clean term to the checker / interpreter
Taelin@VictorTaelin·
more of the same - after begging opus in every way possible to optimize Bend's checker ("make it fast, fix quadratic blowups, think hard pls"), there was zero improvement, so I decided to babysit it. I was giving the instructions; it was doing the boring work. I asked it to measure stuff, break timings down, and dissect the code exactly how I would. 2 hours later: the checker is now ~10x faster. so, as of march 2026 - and I don't like that - automated research with AI *still* sucks, but a human domain expert using it to empower himself can achieve great things. below is the summary of this chat! good night
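The workflow described above (break runtime down per phase, then check how each phase scales as input grows) can be sketched in Python. The phases below are stand-in workloads, not Bend's actual checker; the point is the measurement loop an agent can be told to run:

```python
import time

def phase_parse(n):
    # stand-in phase with linear cost
    total = 0
    for i in range(n):
        total += i
    return total

def phase_bind(n):
    # stand-in phase with quadratic cost - the kind of blowup being hunted
    total = 0
    for _ in range(n):
        for _ in range(n):
            total += 1
    return total

def growth(phase, small, large):
    """Time a phase at two input sizes and return the timing ratio.
    When the input size doubles, ~2x means linear, ~4x means quadratic."""
    t0 = time.perf_counter(); phase(small); t_small = time.perf_counter() - t0
    t0 = time.perf_counter(); phase(large); t_large = time.perf_counter() - t0
    return t_large / t_small

for name, phase in [("parse", phase_parse), ("bind", phase_bind)]:
    print(f"{name}: x{growth(phase, 1000, 2000):.1f} when input doubles")
```

Run against real phases, a ratio near 4 on input doubling is the signal to go hunting for the nested loop or repeated rebinding.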
Romain Beaumont@rom1504·
Yes, totally agree. I've been reverse engineering many things to integrate into CLIs and skills. I've also been doing that for a lot longer for PrismarineJS, a JS network lib for Minecraft, most of it not relying on any UI.

However... as a human I would often gain information on how the system works via the UI and then use that information to build the non-UI integration. I think that's the value the UI should bring: have the agent look at things and do things a few times in an inefficient way, then optimize by calling the network or integrating in some other low-level way.

Maybe more generally, I think it makes sense for the agent to try to follow the path most humans use to interact with the system, learn from that, and then optimize down to lower levels of abstraction. The human level often had people spending a lot of time on UX, and it likely contains good nuggets of information for agents to learn how to design good interfaces: for example, what to expose in CLIs and in network interfaces, and what is OK to be higher friction vs not.

Same thought for humanoid vs specialized robots. It would be much more efficient to have specialized robots for every little thing, but it would be much faster to have slow and clunky yet versatile humanoid robots learn to do all the things, and then optimize only the hot paths with specialized robots.

Said another way: bootstrap skills and integrations starting from humans, then go towards low-level / physical rules.
Jeffrey Emanuel@doodlestein·
@rom1504 Usually there’s a way to completely bypass the UI. It might require some reverse engineering though! Which is why it’s better for this to be an open-source, community driven initiative.
Jeffrey Emanuel@doodlestein·
The new Anthropic computer use system is cool and looks useful, but I think it’s ultimately the wrong approach. It’s one thing to use humanoid robots because they can seamlessly slot into a human-centric physical world that’s difficult and expensive to reconfigure, and use the same buildings, stairs, tools, appliances, etc. That makes sense to me (plus seeing weird spider-bots and crab-bots would scare children) for a lot of reasons: practically, economically, and technically.

But your computer isn’t like that. It’s all just software. Your clicks and keyboard shortcuts just get turned into instructions. The cost to reconfigure most software, if you do it in clever ways, is pretty low. And then you can make it super efficient and intuitive for the agents without wasting so much of their cognitive energy trying to pretend to be a human user and jumping through hoops. More importantly, it also lets you have much more control from a security, audit, and telemetry standpoint, with fine-grained permissions and execution controls.

This is the approach I’ve been following in my Flywheel Connectors project, which is almost ready. I’ve been quietly working on this basically every day for 2 months and 2k commits. It’s already ~1.5M lines of Rust and offers many dozens of connectors of all kinds, with all the controls and security features you’d ever want. And adding new connectors is easy and fast. Plus, adding and using a new connector doesn’t give it carte blanche over your machine: you have very tight controls over everything, and your agents can configure and control it all via the fcp cli tool. It’s all designed to be agent-first in every way. For agents, by agents, taken to the extreme.

Anyway, I’m excited to finish it soon and start integrating it into my various projects and systems. If you want to take a sneak peek, you can see it here: github.com/Dicklesworthst…
Felix Rieseberg@felixrieseberg

Today, we’re releasing a feature that allows Claude to control your computer: Mouse, keyboard, and screen, giving it the ability to use any app. I believe this is especially useful if used with Dispatch, which allows you to remotely control Claude on your computer while you’re away.

Jeffrey Emanuel@doodlestein·
OK this might be getting excessive, even for me (from my FrankenEngine project): ● Session complete. 37,542 unit tests + 83,971 integration tests = 121,513 total tests across the codebase, all passing (37,947 lib tests verified at 0 failures).
Joseph Garvin@joseph_h_garvin·
Claude Code rarely runs for longer than 15 minutes without stopping and asking me for input. How do all these stories of people letting agents run overnight work? Custom harnesses? Yelling at Claude in all caps to keep going no matter what?
Jeffrey Emanuel@doodlestein·
I’m surprised that you don’t see OpenAI announcing major partnerships with FANUC and KUKA. These companies make extremely accurate and powerful robot arms that can do small-scale work all the way up to lifting and spinning entire car frames.

The problem with them has always been that the programming is extremely tedious, finicky, and time-consuming. So the cost of the equipment is just the beginning, since you need integrator consultants to program them for you. And then they’re not flexible. Even after you finally dial everything in, you’re now stuck with that configuration. Good luck trying to optimize the overall layout of your plant or to change other work streams without needing to reprogram everything.

But this is precisely where advanced AI, vision, RL, world models, etc. are so valuable. It’s going to be hard for the big labs to match 30+ years of mechanical engineering excellence and learnings like the robot arm companies have. Not to mention 30+ year supplier relationships with every large Western auto OEM. And the arm companies certainly aren’t going to be training any frontier transformers soon! Anyway, that’s what I would be advising them to do now.
Romain Beaumont@rom1504·
@VictorTaelin Libs are about social collaboration. If you don't split the code with stable interfaces, you can't scale the number of people and agents working on it
Taelin@VictorTaelin·
Bend2's standard library is now racing from having ~10 functions to being larger than Haskell's Prelude, except better organized and formally verified. All these functions are trivial for modern AIs. But then, this raises the question: why do we even need libs anymore?
Romain Beaumont@rom1504·
Playing with Codex today, and it decided on its own to call that "repo memory", which I think makes a lot of sense.
Romain Beaumont@rom1504·
I find it works really well to have the coding agent keep track of what we're working on in docs/ as md files. E.g. functionalities we implemented / want to implement, reliability of each feature, the short/medium/long-term plan, ... all in different md files. It really helps it keep on track even with context rot, and it also helps with human visibility into what is going on.
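A minimal sketch of that docs/-as-memory setup. The file names and seed contents here are hypothetical, just one way to split the notes by concern as the tweet suggests:

```python
from pathlib import Path

DOCS = Path("docs")

# one md file per concern; names and contents are illustrative
SEEDS = {
    "features.md": "# Features\n\n- implemented: ...\n- planned: ...\n",
    "reliability.md": "# Reliability\n\n- feature X: stable\n- feature Y: flaky\n",
    "plan.md": "# Plan\n\n## Short term\n\n## Medium term\n\n## Long term\n",
}

def init_docs():
    """Create the docs/ memory files if missing, never clobbering
    notes the agent has already written into them."""
    DOCS.mkdir(exist_ok=True)
    for name, seed in SEEDS.items():
        path = DOCS / name
        if not path.exists():
            path.write_text(seed)

init_docs()
print(sorted(SEEDS))
```

The agent then reads and appends to these files as it works; because they live in the repo, they survive context rot and are diffable by humans.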
Romain Beaumont@rom1504·
@levie Yes, agent-native systems are a thing. Lots to figure out here. That's true for human processes, but also for products and infra.
Aaron Levie@levie·
There’s a fundamental difference between taking an existing process and applying AI agents to it vs. designing a process from scratch, from the ground up, for AI agents. The gap will widen between the teams and companies that are able to do the latter instead of just the former.

In theory it would have been ideal for all the gains of AI to have come “for free”, but there are both clear constraints of AI (like getting the context right) and clear upsides (like being able to execute code and run in parallel) that the workflows themselves must be redesigned around to take full advantage of this technology.

One of the biggest implications that will come into focus is that agents that can write and run code, and interact with any API, will effectively be expert engineers applied to your business process. So to some extent one of the biggest ways of reengineering a workflow is to ask yourself: what would you do if you had an infinite number of capable engineers writing software for this process? What if those engineers wrote code to connect your disparate data sources, comb through any amount of unstructured data, automate your repeated tasks, connect your various systems together specific to your process, and so on?

Not every process has that upside, but there are tons of tasks that we do every day across marketing, finance, operations, and even sales where a programmer with infinite code writing and API access would be able to make something go far faster or produce way more output. The teams that start to think this way will start to operate entirely differently.
Romain Beaumont@rom1504·
@snowmaker Still happens. You're not trying projects that are difficult enough.
Jared Friedman@snowmaker·
I realized something else AI has changed about coding: you don't get stuck anymore. Programming used to be punctuated by episodes of extreme frustration, when a tricky bug ground things to a halt. That doesn't happen anymore.
Romain Beaumont@rom1504·
@jeremyphoward It is for sure weird that a non-treatment that is proven to be neither safe nor effective is an option (see en.wikipedia.org/wiki/Steve_Job… ) while some cures with ongoing evidence are not an option.
Jeremy Howard@jeremyphoward·
This is a really interesting thread. If we literally already have a cure for (some kinds of) cancer, but can't *prove* it's "safe and effective", should terminally ill patients have an option to use it anyway?
Patrick Heizer@PatrickHeizer

I literally have an ongoing cancer experiment where 100% of the untreated and control animals have had to be euthanized while 100% of the treatment animals are seemingly unaffected. But we're still extremely far away from "proving that it works." Science is hard.

Romain Beaumont@rom1504·
@adam_rosler In some cases directly using a lower-level abstraction such as network calls is better than a higher-level abstraction such as a CLI. But that's not what I meant. I meant: can we port the concept of an API over to English?
Romain Beaumont@rom1504·
Skills are reusable functions. A Program.md in the style of autoresearcher is a loop. What other computing concept gets ported to LLM-executable English next? Types? Frameworks? Services? Network RPCs?
Romain Beaumont@rom1504·
@fchollet How do you reconcile that idea with the fact that humans need lots of prompts to do things, and need harnesses in the shape of books, social networks, companies, ...? Do you prefer to build an asocial AI?
François Chollet@fchollet·
The persisting importance of prompt engineering -- and now harness engineering -- is one of the best indicators of how far we are from AGI. A general system doesn't need a task-specific harness. And when provided with instructions, it is robust to phrasing variations.
Romain Beaumont@rom1504·
@jianxliao Yes, exactly! CLI+skill is friendly to agents. So then the next question is: what other infra is agent-friendly or can be made agent-friendly?
jian@jianxliao·
the reason cli >>>> mcp is that agents have the full environment to install, debug, fix, and use skills/clis autonomously without human intervention. all of this is done via bash, not through some mcp server stdio transport. looking back now, mcp feels like it was vibe-coded by claude-sonnet-3.7
jian@jianxliao

I found something better than MCP. It’s called CLI.

Kingsley Uyi Idehen@kidehen·
Exemplifying why @AnthropicAI is doing so well on the back of Skills! Skills are the new unit of economic value delivered as software. They build on the filesystem’s well-established universal interface, with Markdown as the common content type. You’ve seen this movie before—you might just not immediately recognize it in the form of HTML and the Web explosion. Skills all the way down…
Garry Tan@garrytan

I've been having such an amazing time with Claude Code I wanted you to be able to have my *exact* skill setup: Introducing gstack, which you can install just by pasting a short piece of text into your Claude code

Romain Beaumont@rom1504·
@karpathy This is pretty amazing. Wonder how much that says about the kind of RL that Anthropic does on the model to make it able to do this kind of stuff ;)
Andrej Karpathy@karpathy·
Three days ago I left autoresearch tuning nanochat for ~2 days on depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement), this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference. I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project. This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc etc. This is the bread and butter of what I do daily for 2 decades. Seeing the agent do this entire workflow end-to-end and all by itself as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real", I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things e.g.: - It noticed an oversight that my parameterless QKnorm didn't have a scaler multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work. - It found that the Value Embeddings really like regularization and I wasn't applying any (oops). - It found that my banded attention was too conservative (i forgot to tune it). - It found that AdamW betas were all messed up. - It tuned the weight decay schedule. - It tuned the network initialization. 
This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc… All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course - you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
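The core loop described above - propose a change, measure validation loss, keep it only if it helps - reduces to a greedy search. A toy sketch where the "training run" is a synthetic loss surface and the "changes" are hyperparameter nudges; every name and number here is illustrative, not from nanochat:

```python
import random

random.seed(0)

def evaluate(config):
    # stand-in for a full training run returning validation loss;
    # this toy surface has its optimum at lr=0.5, wd=0.1
    return (config["lr"] - 0.5) ** 2 + (config["wd"] - 0.1) ** 2

def propose(config):
    # nudge one hyperparameter, like an agent proposing a single change
    candidate = dict(config)
    key = random.choice(sorted(candidate))
    candidate[key] += random.uniform(-0.1, 0.1)
    return candidate

config = {"lr": 0.9, "wd": 0.5}
best_loss = evaluate(config)
for _ in range(700):              # ~700 autonomous changes, as in the thread
    candidate = propose(config)
    loss = evaluate(candidate)
    if loss < best_loss:          # keep only changes that improve val loss
        config, best_loss = candidate, loss

print(f"val loss {evaluate({'lr': 0.9, 'wd': 0.5}):.3f} -> {best_loss:.4f}")
```

The real version replaces `evaluate` with training a small model and, as the tweet describes, promotes the most promising changes to increasingly larger scales rather than trusting the small-model proxy blindly.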