diego

216 posts


@diblacksmith

software → AI engineer @ amazon

Natal - Brazil · Joined April 2016

1.9K Following · 85 Followers

diego retweeted

Goodfire@GoodfireAI·7 May

A simple example: days of the week, which lie on a circular path in models’ activations. Steering linearly from Monday to Friday gets you incoherent outputs in between. Steering along the circular manifold means you cleanly shift from Mon → Tues → Wed → Thurs → Fri. (5/8)


diego@diblacksmith·5 May

@lateinteraction @CAISconf see you!!


Omar Khattab@lateinteraction·5 May

Wow, it's already May 5th. Don't miss the early-bird registration TODAY for the first ACM conference on AI systems @CAISconf. CAIS will have a packed program of really exciting keynotes, paper presentations, workshops, and demos. See you in San Jose in late May!


diego@diblacksmith·5 May

@VictorTaelin I know I sound jerky. Sorry for that. I truly root for you. I hope I am wrong and you are doing it optimally.


diego@diblacksmith·5 May

@VictorTaelin As far as the eye can see you are most definitely overdoing it, and holding it in until it is perfect. Which will never happen. (if all goes well, I hope you keep improving Bend for years on end... right?)


Taelin@VictorTaelin·3 May

exactly the point of Bend2! if you
1. specify your program with strong types (aka theorems),
2. let the AI prove it,
3. let the compiler check it,
then you *literally* don't need to see it. there is no reason to. it IS correct!

btw, Bend now has nearly 350k lines of code, 264k of which are in the stdlib. agents are autonomously porting entire libraries from Rust, Haskell, etc. to Bend, and everything is going smoothly. I barely check it since I'm working on other things. just codex /goal for now. every day it grows larger. soon, every algorithm ever conceived by humans will be in Bend

solst/ICE of Astarte@IceSolst

Interesting article on treating agent output like compiler output (and why) skiplabs.io/blog/codegen_a…


diego@diblacksmith·27 Apr

@nearcyan Hi near, hope everything's alright! Would you be open to chat over DM about feedback for Seren?

near@nearcyan·21 Jan

2026 twitter: "i am leaving this place, as it has become quite sickly. follow me for daily updates as my departure progresses"


diego@diblacksmith·18 Apr

@IterIntellectus Vittorio, how can someone research like you did? Anything you can share? I knew some stuff already and I am forever grateful for everything else that this article taught.


vittorio@IterIntellectus·17 Apr

x.com/i/article/2042…


diego retweeted

Sumeet Motwani@sumeetrm·16 Apr

We also evaluate Recursive Language Models with GPT 5.2 on LongCoT. Without code execution, it doesn't beat the base model. Code helps on implicit domains where structure can be externalized (Logic: 68%, Chess: 31%), but explicit compositional domains stay near zero. We’re excited to see more progress along this direction! We believe that RLMs and subagents more generally are likely to be very beneficial for long-horizon reasoning, and that there are many exciting open problems in training models to decompose problems better.


diego retweeted

Michael@michael_chomsky·12 Apr

Garry is kinda correct here, but is oversimplifying memory. Harrison (the author of the original article) makes a very good point but also makes memory sound easier than it is.

(before reading this article, note that I wrote down my thoughts and then passed it through Claude Code. I read every word. read it like a coworker's Claude Code output)

Let me start with where Garry is right, because he IS right about something important. Git-backed markdown is a memory format that is simultaneously human-readable, version-controlled, diffable, and greppable. No database gives you all four at once by default.

If your agent's memory is an opaque blob in someone else's database, you have no idea what it "knows" about you. You can't correct it, can't diff it, can't even look at it. That matters. A lot.

I agree with this completely, and it's the right starting posture. But it's a storage format. It's not a memory system. And the difference matters more than most people in this debate seem to realize.

Harrison's argument is different. He says memory is tied to the harness, the harness must be open, therefore you should use their open harness. The first two points are correct, the third is iffy because it assumes you want to be responsible for memory, which is hard (probably right but not a trivial decision). But that core insight--that the harness and memory can't be separated--is real and more important than people give it credit for.

Let me explain why, and then why everyone in this debate is still underselling the difficulty.

The harness owns the critical moment

The most important time for memory to be created or updated is during compaction. Compaction is when the context window fills up and the agent compresses everything into a summary. Information that doesn't survive the summary is gone--not archived, gone. This is memory triage, and the harness controls it. Always.

OpenCode, OpenClaw, and Hermes all handle this. OpenClaw does it by default.
OpenCode's SDK exposes compaction hooks--you can listen for session.compacting events and handle memory yourself. This is a great place for memory logic to live.

Now look at what Codex does: it produces an opaque, encrypted compaction summary that isn't usable outside the OpenAI ecosystem. Harrison himself flagged this in his article. This isn't just vendor lock-in, it's architectural lock-in by design. Harrison is right to be alarmed by this.

Garry is right that being "above the API line" matters. But neither one grapples with what actually makes memory hard once you've decided to own it.

Where files break down: forgetting

Garry's model inherits all the strengths of git: version history, diffs, blame, rollback. But git's greatest strength is the core problem: nothing is ever truly forgotten.

When do you choose to forget a memory? How do you know it's outdated? You changed jobs six months ago--is the memory about your old team's coding standards still valid? Your codebase migrated from REST to GraphQL--are the API pattern memories stale, or still useful for legacy endpoints that still exist?

With files, you can delete them. But you need to know they exist AND that they're stale. And you need to check this proactively, because nobody is going to tell you.

This is actually a structured problem with real solutions starting to emerge. Zep's Graphiti engine uses what they call bi-temporal knowledge graphs--every fact gets timestamps for when the system recorded it AND when it was true in the real world. Facts are invalidated, not deleted. You can query "what did I know about X on March 15th" separately from "what is currently true about X."

Most memory providers are converging on some version of this. Supermemory has a graph-based system. Hydra is moving toward mixed graph/vector approaches. Mem0 added graph memory. This convergence is telling--it means the industry is collectively figuring out that flat files and pure vector search aren't enough for temporal reasoning.
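
A minimal sketch of the bi-temporal idea described above, assuming nothing about how Graphiti actually implements it. The `Fact` and `BiTemporalStore` names, their fields, and the example dates are all hypothetical; the point is only that each fact carries two time axes (when it was true, when the system learned it) and that superseded facts are invalidated rather than deleted.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Fact:
    subject: str
    value: str
    valid_from: date                       # when it became true in the world
    recorded_on: date                      # when the system learned it
    valid_to: Optional[date] = None        # when it stopped being true
    invalidated_on: Optional[date] = None  # when the system learned that


class BiTemporalStore:
    """Hypothetical toy store: facts are invalidated, never deleted."""

    def __init__(self):
        self.facts = []

    def assert_fact(self, subject, value, valid_from, recorded_on):
        # Close any still-open fact about the same subject instead of removing it.
        for f in self.facts:
            if f.subject == subject and f.valid_to is None:
                f.valid_to = valid_from
                f.invalidated_on = recorded_on
        self.facts.append(Fact(subject, value, valid_from, recorded_on))

    def true_on(self, subject, day):
        """What was true in the world on `day`, using everything known now."""
        for f in self.facts:
            if (f.subject == subject and f.valid_from <= day
                    and (f.valid_to is None or day < f.valid_to)):
                return f.value
        return None

    def known_on(self, subject, day):
        """What the system believed on `day`, ignoring later corrections."""
        live = [f for f in self.facts
                if f.subject == subject and f.recorded_on <= day
                and (f.invalidated_on is None or day < f.invalidated_on)]
        return max(live, key=lambda f: f.recorded_on).value if live else None


store = BiTemporalStore()
store.assert_fact("api_style", "REST", date(2024, 1, 1), date(2024, 1, 5))
# The migration happened on March 1st, but the system only learned of it on April 10th.
store.assert_fact("api_style", "GraphQL", date(2024, 3, 1), date(2024, 4, 10))

march_15 = date(2024, 3, 15)
store.known_on("api_style", march_15)  # "REST": what the system believed then
store.true_on("api_style", march_15)   # "GraphQL": what was actually true then
```

The two queries give different answers for the same day, which is the distinction between history and validity that a plain git log does not capture.
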
Files don't have temporal validity windows. Git has history, but history and validity are different things. Knowing a file changed on March 15th doesn't tell you whether its contents are still true today.

Then there's the injection problem. OpenClaw's memory.md is a trivial file with memories, injected into context every time, updated at compaction. It's also fully observable because it's just.. a file. This was a genuine innovation and a really good idea.

But my OpenClaw installation clients keep running into the same wall: not all memory needs to be in context every time, and there's a ceiling on how much fits. Claude Code caps MEMORY.md at something like 200 lines. After that, the content just doesn't get loaded at session start. You lose it.

Most memory systems solve this with a reactive search_memories tool. The agent needs something, searches for it, finds it. Fine. But what happens when the agent doesn't know it should be searching? A coding agent drifts off-track and violates a pattern your team agreed on three months ago. The memory exists. The agent didn't search for it because it didn't know it was relevant. There was no trigger. It just.. didn't know what it didn't know.

This is the proactive injection problem, and it's the hardest open question in memory right now.

There IS real research on this. MemGuide ranks candidate memories by something they call "marginal slot-completion gain"--basically asking "would injecting this memory fill a gap the agent actually needs right now?" PRIME takes a different angle, building proactive reasoning through iterative memory evolution. These are promising but none of them are production-ready for synchronous agents where you can't afford an extra inference round on every turn.

Mesa's Saguaro is interesting here. After every agent turn, it spawns a separate LLM that reviews what the agent just did against the full codebase. If the agent is drifting, it corrects course.
They kinda built a memory system without calling it one. It's just really slow because you're doing LLM inference after every single turn.

Supermemory proved the logical extreme of this in their April Fools experiment: throw enough inference at the problem (eight parallel prompt variants, a dozen model calls per query) and you beat basically every memory benchmark. 98.6% accuracy. But the per-query cost is absurd. Their actual production system--the graph-based one--scores lower on benchmarks but is, you know, usable. For async agents where latency doesn't matter, brute force actually makes sense. Not for Jarvis, my OpenClaw agent.

Where files break down: relationships and search

If you store everything as files, there's no way to search "all people I know" or "bugs I often make in this codebase" unless the agent happens to organize memories that way. And it won't, because agents are inconsistent organizers.

Concepts and relationships aren't flat. They're graphs. A person connects to a company, a project, a set of conversations. A coding pattern connects to a language, a framework, a set of past mistakes. Files can represent individual nodes but they can't represent the edges without becoming something else entirely.

So you solve it by adding structured search over your markdown files. Oops, you've built a database! dx.tips/oops-database

@swyx wrote about this years ago: developers who avoid using a real database inevitably build one, badly, through incremental decisions. You start with files, add search, add indexing, add schemas, add conflict resolution, and suddenly you have Postgres except worse.

This is actually what happened with GBrain--Garry's own implementation of "memory is markdown, brain is a git repo." The files go in as markdown. But underneath? Postgres and pgvector for hybrid search. The markdown is the interface, the database is the engine. Even the strongest advocate for file-based memory needed a database to make it actually work.
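
To make the "files are nodes, queries need edges" point concrete, here is a small sketch that indexes hypothetical memory notes into an in-memory SQLite database. The note structure (`name`, `kind`, `links`) is an assumption standing in for whatever metadata an agent might write at the top of each markdown file; this is not how GBrain or any named product works.

```python
import sqlite3


def build_index(notes):
    """Index hypothetical memory notes as nodes and edges in SQLite."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE nodes (name TEXT PRIMARY KEY, kind TEXT)")
    db.execute("CREATE TABLE edges (src TEXT, rel TEXT, dst TEXT)")
    for note in notes:
        db.execute("INSERT INTO nodes VALUES (?, ?)", (note["name"], note["kind"]))
        for rel, dst in note.get("links", []):
            db.execute("INSERT INTO edges VALUES (?, ?, ?)", (note["name"], rel, dst))
    return db


def names_of_kind(db, kind):
    """Answer 'all people I know' style questions by node kind."""
    rows = db.execute(
        "SELECT name FROM nodes WHERE kind = ? ORDER BY name", (kind,)
    )
    return [row[0] for row in rows]


def neighbors(db, name):
    """Follow outgoing edges from one note."""
    rows = db.execute(
        "SELECT rel, dst FROM edges WHERE src = ? ORDER BY rel, dst", (name,)
    )
    return rows.fetchall()


# Illustrative notes, as if parsed from the header of each markdown file.
notes = [
    {"name": "alice", "kind": "person", "links": [("works_at", "acme")]},
    {"name": "bob", "kind": "person", "links": [("works_on", "billing-api")]},
    {"name": "acme", "kind": "company"},
    {"name": "billing-api", "kind": "project", "links": [("owned_by", "acme")]},
]
db = build_index(notes)
people = names_of_kind(db, "person")   # ["alice", "bob"]
alice_edges = neighbors(db, "alice")   # [("works_at", "acme")]
```

The markdown stays the readable source of truth, but the moment the query is relational, the answer comes from the index, which is the "oops, you've built a database" step.
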
The Composio model

Here's something I think is underexplored: portable memory across agents. The same way Composio lets you move integrations across agents, some memory providers are moving toward letting you own and share memories across Claude, ChatGPT, OpenClaw, whatever. Your memories live in a vault you control, and each agent reads from and writes to it. I'd call this the Composio model of memory. It's a good idea and more providers should pursue it.

But then you're potentially running two memory systems--one inside the harness (memory.md, CLAUDE.md, whatever the harness does at compaction) and one external. What a mess. Hermes and OpenClaw both let the user choose their memory backend. Flexibility sounds great until you realize it means the system has to handle the possibility that memory is in two places at once, managed by two different things, with two different update cadences. I still think giving users this choice is the right call. But it is genuinely complicated.

The cost that nobody talks about

Every sophisticated memory system costs inference tokens. Letta's self-editing model--where the agent actively decides what to remember during reasoning via tool calls--is the most architecturally interesting approach I've seen. The agent curates its own memory as a first-class part of thinking. But every core_memory_replace call is tokens. Mesa's per-turn review is a whole extra LLM call. Supermemory's brute force approach is a dozen.

File-based memory is effectively free. Read a file, inject it, done. The bar for beating memory.md isn't just "is it smarter?" It's "is it enough smarter to justify the cost?" And for most use cases today, the honest answer is no.

But here's something that should make people pay attention: recent benchmarks on agentic memory (AMA-Bench among others) are finding that the design of your memory system matters way more than which model you're running.
We're talking maybe an order of magnitude more variance from architecture choices than from model scaling. The architecture matters enormously. It just also costs real money, and that tension is why most production systems still use the simple thing.

The unsolved problems

Recent research has started to identify what a memory system actually needs to do well:

Accurate retrieval--find the right memory when asked.
Learning in real time--update what you know from new information as it comes in.
Long-range understanding--connect things across sessions that happened weeks apart.
Selective forgetting--know when a memory is stale and stop using it.

No current system is good at all four. Graph-based systems handle forgetting and long-range connections better than anything else, which is probably why everyone is converging on them. Letta does well on retrieval and real-time learning. File-based systems do ok on retrieval and struggle with the rest.

Now add multi-agent coordination. Multiple agents on the same filesystem. Multiple people cooperating with agents on different projects. Who organizes the memory? Who resolves conflicts? Do we deploy an async agent to consolidate memories at compaction time? At session end? On a cron job overnight? How do we prioritize recent memories over old ones? How much control should the agent have over its own memory? How do we handle that some people want aggressive memory and some want minimal? And they might want to export it and bring it to another agent!

These aren't rhetorical questions I'm asking to sound smart. I deal with these every week deploying agents for clients. Nobody has good answers.

Benchmarks exist now. They're just not reliable.

A year ago there were no memory benchmarks worth talking about. That's changed. LOCOMO, LongMemEval, AMA-Bench, MemoryAgentBench all exist. There's even an ICLR workshop this year dedicated to agent memory.
But here's the problem: evaluation choices that look like implementation details--the prompt you use for the judge model, the scoring methodology, the answer generation setup--can swing accuracy by double digits. Supermemory showed this directly when they demonstrated you could score 98.6% by letting any of eight prompt variants count as correct. That's not a benchmark result. That's a configuration choice dressed up as one.

So we have benchmarks. They're just not trustworthy enough to settle any debates. If you overcomplicate your memory system, you still can't be sure it's actually outperforming a memory.md other than vibes. Just vibes with numbers attached.

Nobody has memory right

Not Garry, not Harrison, not OpenClaw, not Letta, not Zep, not Supermemory, not Mem0. Nobody.

Garry's instinct--keep it simple, keep it readable, keep it yours--is the right starting posture. Harrison's instinct--the harness and memory are inseparable, own both of them--is architecturally correct. Sarah Wooders' framing--memory is context management, not a retrieval problem--is the most precise explanation of why this is so hard.

But memory.md is not the end state. It's the beginning. It's the simplest thing that works, and for most use cases today it's the right choice. Not because it's good. Because everything else is either too expensive, too complex, too slow, or too unproven to justify the leap.

The gap will close. The research is real, the providers are converging on graphs, and the benchmarks are slowly forming. But if anyone tells you they've solved memory, they haven't. They've solved one of the four problems and they're hoping you don't ask about the other three.
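
The scoring point in the article (counting a question as correct if any prompt variant gets it right) can be shown with a few lines of arithmetic. This is a sketch on invented toy data, not Supermemory's setup or numbers: `results` is a hypothetical table with one row per question and one boolean per prompt variant.

```python
def single_variant_accuracy(results, variant):
    """Accuracy when one fixed prompt variant answers every question."""
    return sum(row[variant] for row in results) / len(results)


def any_of_k_accuracy(results):
    """Accuracy when a question counts as correct if ANY variant got it right."""
    return sum(any(row) for row in results) / len(results)


# Invented toy data: 4 questions x 3 prompt variants.
results = [
    [True, False, False],
    [False, True, False],
    [False, False, True],
    [False, False, False],
]

strict = single_variant_accuracy(results, 0)  # 0.25
lenient = any_of_k_accuracy(results)          # 0.75
```

Same system, same answers, and the reported number triples purely because of the scoring rule, which is why a headline accuracy means little without the evaluation configuration next to it.
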

Garry Tan@garrytan

If your memory dies when your harness dies, you built the harness too thick. Memory is markdown. Skills are markdown. Brain is a git repo. The harness is a thin conductor — it reads the files, it doesn't own them.


diego retweeted

Chrys Bader@chrysb·11 Apr

x.com/i/article/2043…


diego@diblacksmith·12 Apr

@bensig I'm super curious! but still waiting for someone to rerun benchmarks and publish the *actual* results


Ben Sigman@bensig·12 Apr

MemPalace just crossed 42K stars and 5.4K forks. v3.1.0 already shipped. Milla and I have barely slept this week. The response has been overwhelming in the best way. We’re running on parallel tracks right now - fixing bugs and reviewing PRs from the community on one side, building the next generation of storage and retrieval on the other. Both are getting better fast. To everyone who has starred, forked, opened issues, submitted PRs, or just sent kind words - thank you. This thing belongs to all of us now. More soon. ✨ github.org/mempalace/memp…


diego retweeted

alex zhang@a1zhang·10 Apr

x.com/i/article/2041…


diego@diblacksmith·31 Mar

@hopeandlonging @himanshustwts @lateinteraction so you're basically asking how much premium OpenAI is adding on top? that I wouldn't know. We can only guess how much it costs to run OpenAI's models, and even the model size itself is not disclosed.


hopeandlonging@hopeandlonging·31 Mar

OK, let me phrase differently, imagine that we had an open source locally hostable model with equivalent capabilities as the closed API model they were using. So we replaced API consumption with local inference. How much of the price reduction would’ve been achieved there? I understand that DSPY allows a less capable model to achieve something far greater which is very impressive. I’m not trying to take anything away there. Just honestly curious how much of the original price was provider inflation


himanshu@himanshustwts·30 Mar

dude what that's like ~99% reduction, dspy is on generational run.

Drew Breunig@dbreunig

At our last DSPy meetup, @kshetrajna shared this amazing case study about how he's using DSPy at @Shopify scale. I think this was my favorite slide.


diego@diblacksmith·31 Mar

@hopeandlonging @himanshustwts @lateinteraction ...all of it? DSPy itself isn't making the LLM cheaper. It's making a cheaper LLM solve the same problem just as well as an expensive one without sacrificing performance (in this case with better performance even)


hopeandlonging@hopeandlonging·31 Mar

@himanshustwts @lateinteraction This is obviously very impressive engineering overall, but I’m genuinely curious how much of the price reduction was just moving from third-party model via API to local hosted open source model


diego retweeted

Mario Zechner@badlogicgames·28 Mar

we as software engineers are becoming beholden to a handful of well funded corporations. while they are our "friends" now, that may change due to incentives. i'm very uncomfortable with that. i believe we need to band together as a community and create a public, free to use repository of real-world (coding) agent sessions/traces. I want small labs, startups, and tinkerers to have access to the same data the big folks currently gobble up from all of us. So we, as a community, can do what e.g. Cursor does below, and take back a little bit of control again. Who's with me? cursor.com/blog/real-time…


diego@diblacksmith·28 Mar

@fcoury representing us!

Felipe Coury 🦀@fcoury·26 Mar

Proud to announce I'll be the first Brazilian on the OpenAI Codex team!


diego@diblacksmith·14 Mar

@gkpacker Gabriel, could you reply to my DM about a small accidental data leak on the feedback platform?

Gabriel Packer → 👁️ visorfinance.app@gkpacker·13 Mar

I added shortcuts for the months to make the transactions tab easier to use. I had put it off, but then a bunch of people asked for it on the board and it was low effort, medium impact. even with other priorities, it's worth shipping small features that improve usability while you work on bigger ones

diego@diblacksmith·6 Mar

@VictorTaelin are these things you've been putting off, or were they always part of the plan? If the latter, then that sounds like about +2mo of work. It's worth it at this point. But if you spend another 2mo and it's still not perfect, just launch it man!!


Taelin@VictorTaelin·5 Mar

just thinking out loud about all the hard things I still need to do before launching Bend2, and consequences if we launch without each

1. HVM4's AOT compiler
that means compiling HVM4 to efficient C. that gives a 10x-100x speedup in practice, so, it is essential for the HVM4 back-end to be viable. without it, you'd be using 10's of threads to run an interpreter. that was the main criticism of Bend1. given that parallelism was a key feature of v1, I think we *need* a really good version of it, even though it isn't the central point anymore. we have an initial compiler, but would probably take 5-10 days of focused work to make it good

2. HVM4's GPU runtime
it still doesn't have one, at all. doing so is way more complex than HVM2, because of lazy (rather than super strict) evaluation. launching without a GPU runtime would feel like a regression and this is basically locking me. probably 15 days of work

3. Bend2-SupGen integration
this is really hard to do, due to Bend's dependent types. currently, SupGen is still reliable for simple types. without it, the time/cost it takes for the AI to fill all sub-proofs is still too high for it to be a viable product. we're talking about whole nights to make a small refactor. probably 5-10 days of focused work

... and that's basically it actually. if these things were ready, all the rest is trivial, and we could launch it the next week...

I wonder if I should just launch Bend2 without the HVM4 back-end. keep in mind that Bend2 compiles to very fast JS, so we don't *need* HVM4 or parallelism in any way. but it would be extremely hard to communicate that, given the initial appeal of the language was parallelism and interaction nets. we have small windows where people take the time to try our stuff, so I think if it is not great, polished and covers all corners on launch, it will just be dismissed as a "not ready yet" thing for another year

not sure the "move fast and break things" advice applies to when you're building a programming language, Bend1 did launch prematurely and I shouldn't let that happen again. perhaps "a delayed game is eventually good, but a rushed game is forever bad" is the advice I should be following here...


diego@diblacksmith·22 Feb

@VictorTaelin @gooby_esq The “magic” is that Claude doesn’t have to verbalize every single call. Claude also doesn’t have to process 1000 subLLM results in its context, because it’s stored in a Python variable in memory. Sub-agents can do that too, but RLMs are biased towards that almost naturally.

diego@diblacksmith·22 Feb

@VictorTaelin @gooby_esq DSPy is just a harness (at least for our purposes here), like, say, langchain. I think a better comparison is to “code mode”. You’ll see the difference in an example: say I need to count happy sentences in 1k files. Claude can’t call 1000 subagents reliably, but code can.
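
A rough sketch of the "code mode" example from the tweet above: the fan-out over 1k files happens in a loop, the per-file results stay in a Python variable, and only the final total would go back to the model. Everything here is an assumption for illustration: `count_happy_sentences` and `keyword_stub` are hypothetical names, and the keyword check is a stand-in for a real sub-LLM call, not any DSPy or RLM API.

```python
from concurrent.futures import ThreadPoolExecutor


def keyword_stub(sentence):
    """Stand-in for a sub-LLM sentiment call (hypothetical; swap in a real model)."""
    return "happy" in sentence.lower()


def count_happy_sentences(texts, is_happy, max_workers=8):
    """Fan out one classification job per file and return only the total."""

    def count_one(text):
        # Naive sentence split on periods; good enough for the sketch.
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        return sum(1 for s in sentences if is_happy(s))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # The 1000 intermediate results live here, not in the model's context.
        per_file = list(pool.map(count_one, texts))
    return sum(per_file)


# 1000 fake "files": two alternating texts with 1 and 2 happy sentences.
files = ["I am happy. It rains.", "Happy days. So happy. Meh."] * 500
total = count_happy_sentences(files, keyword_stub)  # 1500
```

The orchestrating model only needs to write this loop once and read back a single integer, which is the reliability difference the tweet points at compared with issuing 1000 separate sub-agent calls by hand.
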


Taelin@VictorTaelin·22 Feb

ok so RLM is the next big thing (even though it is just a fancy name for agents that call sub-agents, which codex and claude already do???) but ok, I buy it. so... what do I use? how I'm supposed to "try" it?

