Elias Lumer
@EliasLumer
94 posts
ai research & engineering
Joined February 2026
44 Following · 71 Followers
Elias Lumer @EliasLumer
@nexxeln Can you provide examples or best practices you’ve learned while doing this? Very interesting.
nexxel @nexxeln
i'm starting to think agent-friendly codebases are more about constraints. migrating parts of opencode to effect has made agent-written code noticeably less cursed. good architecture boxes agents into writing specific, constrained code: fewer ways to go wrong
Elias Lumer @EliasLumer
@sydneyrunkle Is it on by default for both create_agent and create_deep_agent? And if we add additional state via the custom state option (in addition to messages), does every new state item need to have the delta channel? Or is it on by default?
Sydney Runkle @sydneyrunkle
we just shipped delta channels in langgraph 1.2. as agents run longer and use more context, full-state checkpointing doesn't scale, but delta channel snapshots do. this new algorithm is now powering message histories and file storage in deepagents v0.6!
Sydney Runkle @sydneyrunkle: x.com/i/article/2054…
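For readers unfamiliar with the idea, a minimal sketch of delta-style checkpointing in the abstract; the DeltaChannel class below is an illustration, not LangGraph's actual API. A full-state checkpointer re-serializes the entire history every step, while a delta channel persists only what was appended since the last snapshot and rebuilds state by replaying deltas:

```python
# Hypothetical sketch of delta-channel checkpointing (not LangGraph's real API):
# a full-state checkpointer re-serializes the entire message list every step,
# while a delta channel persists only what was appended since the last snapshot.

class DeltaChannel:
    def __init__(self):
        self.items = []          # full in-memory state
        self._last_saved = 0     # index of the last checkpointed item

    def append(self, item):
        self.items.append(item)

    def checkpoint(self):
        """Persist only the new items (the delta), not the whole list."""
        delta = self.items[self._last_saved:]
        self._last_saved = len(self.items)
        return delta             # O(new items), not O(total history)

    @staticmethod
    def restore(deltas):
        """Rebuild full state by replaying the saved deltas in order."""
        channel = DeltaChannel()
        for delta in deltas:
            channel.items.extend(delta)
        channel._last_saved = len(channel.items)
        return channel


# Usage: checkpoint cost stays flat as the message history grows.
log = []
chan = DeltaChannel()
for turn in ["hi", "tool_call", "tool_result", "answer"]:
    chan.append(turn)
    log.append(chan.checkpoint())   # each snapshot holds 1 item, not len(items)

restored = DeltaChannel.restore(log)
assert restored.items == ["hi", "tool_call", "tool_result", "answer"]
```

This is why the approach scales with long-running agents: per-step checkpoint cost tracks the size of the change, not the size of the accumulated history.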
Elias Lumer @EliasLumer
@mem0ai Congrats on this! 94.8% is still slightly below our SOTA (95.6%), and our approach, Chronos, is centered around temporal relevance. Glad to see temporality being recognized as a core primitive of memory offerings. Paper: arxiv.org/abs/2603.16862
Elias Lumer reposted
DAIR.AI @dair_ai
Cool paper from PwC.

"Earlier is always better" is the default intuition for agent clarification. New paper claims that's mostly wrong: goal clarification loses nearly all of its value after just 10% of execution.

The team built a forced-injection framework that drops ground-truth clarifications at controlled points along a long-horizon agent's trajectory, across 4 information dimensions (goal, input, constraint, context), 3 benchmarks, and 4 frontier models. 84 task variants, 6,000+ runs.

Pass@3 falls from 0.78 back to baseline. Input clarification keeps value through roughly 50%. Past mid-trajectory, asking any clarification at all performs worse than never asking.

A complementary study of 300 unscripted sessions shows no current frontier model asks within the empirically optimal window. 52% of sessions over-ask. Others never ask at all.

Why it matters: clarification has been treated as a binary capability (does the agent ask or not). This is the first quantitative demand curve for *when* the question is worth asking.

Paper: arxiv.org/abs/2605.07937
Learn to build effective AI agents in our academy: academy.dair.ai
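A rough sketch of what a forced-injection protocol like the one described could look like; the function names and the toy agent below are illustrations, not the paper's code. A ground-truth clarification is inserted into the transcript once a chosen fraction of the step budget has elapsed, and pass@k is compared across injection points:

```python
# Illustrative sketch of forced clarification injection (names are
# hypothetical, not from the paper): drop a ground-truth clarification into
# the transcript after a chosen fraction of the step budget, then compare
# pass@k across injection points to trace the "demand curve".

def run_with_injection(agent_step, prompt, clarification, inject_at, max_steps=20):
    """One rollout; inject_at in [0, 1] picks the controlled injection point."""
    transcript = [("user", prompt)]
    solved = False
    for step in range(max_steps):
        if step == int(inject_at * max_steps):
            transcript.append(("user", clarification))
        action, solved = agent_step(transcript)
        transcript.append(("assistant", action))
        if solved:
            break
    return solved

def pass_at_k(agent_step, prompt, clarification, inject_at, k=3):
    """Pass@k over k independent rollouts at one injection point."""
    return any(
        run_with_injection(agent_step, prompt, clarification, inject_at)
        for _ in range(k)
    )

# Toy agent standing in for a frontier model: it only succeeds if the goal
# clarification arrived within roughly the first 10% of its trajectory.
def toy_agent(transcript):
    clarified_early = any(
        msg == "goal: refactor module X" for _, msg in transcript[:5]
    )
    return "work", clarified_early

for frac in (0.0, 0.1, 0.5, 0.9):
    print(frac, pass_at_k(toy_agent, "fix the repo", "goal: refactor module X", frac))
# -> True at 0.0 and 0.1, False afterwards: value concentrated early.
```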
Elias Lumer @EliasLumer
@sarahwooders @Kushalpatil77 It's not just toolsets but a variety of things the model labs are doing (system-reminder, tool outputs, human messages). OSS will always lag behind unless they have access to the RL agent loop.
Sarah Wooders @sarahwooders
@Kushalpatil77 No, you just need to match their toolsets (and maybe some prompting structure at most). Other than that you can innovate on things like memory (Letta Code), extensibility (pi), and UI/product experience more generally
Sarah Wooders @sarahwooders
There have been some claims recently that the harnesses offered by the model labs (which increasingly lock in memory/state) are somehow magically superior to model-agnostic harnesses. This take really irritates me because it's so easy to disprove. Letta Code gets the same scores (or slightly better) as Claude Code / Codex on TerminalBench. So do many other harnesses.

Yes, model labs *are* reducing the generality of their own models in favor of optimizing for their first-party products, but this mostly just means their models are overfit to the toolsets of their first-party harnesses. Fortunately, it's very easy to reverse engineer what those toolsets are and implement them in other harnesses. Codex is open-source, and Claude Code's source code has been leaked, so there's no great mystery here.

Some popular harnesses DO fail to adapt their toolsets properly (e.g. OpenCode), which degrades performance. But if you are using a well-implemented harness, this is a non-issue. You are not getting special capabilities from first-party harnesses, just memory lock-in.
Dan Shipper 📧 @danshipper

In the future, you'll be able to accomplish a goal by just giving Claude an outcome and a budget. That's the direction Anthropic is building in with its new Managed Agents features, announced at this week's Code with Claude developer event.

The basic idea: Claude, wrapped in a computer in the cloud, that you can spin up, scale, and manage as needed. Anthropic is taking on the infrastructure that kills most agent products, and making sure that it scales to meet the needs of agents running 24/7.

On this week's AI & I from @every, I talk with Angela Jiang (@angjiang), head of product for the Claude platform, and Katelyn Lesse (@katelyn_lesse), head of engineering for the Claude platform, about what Anthropic is building and what it takes to make agents reliable in production.

We get into:
- Why the "build a generic harness, hot-swap any model behind it" playbook is already outdated. Angela points to eval data on Memory where the same task across different harnesses performed drastically differently.
- The infrastructure wall every team hits in production, and why Katelyn thinks "my sandbox died and took the agent with it" is the real reason internal agents don't ship.
- Why Anthropic is so bullish on using file systems and skills within Claude, including Angela's argument that those early design choices can compound for years.

This is a must-watch for anyone trying to take an agent past the demo and into production. Watch below!

Timestamps:
- How the Claude platform evolved from API to agents: 00:01:48
- The primitives that make up Claude Managed Agents: 00:04:09
- Why the harness and the model are becoming a single unit: 00:10:37
- The infrastructure wall that kills most agent projects in production: 00:18:49
- Why team agents need a different shape than individual productivity tools: 00:24:49
- How Anthropic's legal team uses an agent to review marketing copy: 00:26:36
- Using multi-agent orchestration for advisor strategies, adversarial pairs, and swarms: 00:34:24
- How to measure agent success with outcome and budget as the end state: 00:35:50
- What the platform looks like a year from now, when Claude writes its own harness: 00:39:11
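As a concrete illustration of "matching their toolsets": a minimal sketch of a model-agnostic harness exposing tools under the same names and shapes a model was trained against. The tool names here follow Claude Code's widely documented built-ins (Bash, Read, Glob); the registry structure itself is a hypothetical example, not any harness's real code:

```python
# Minimal sketch of mirroring a first-party toolset in a model-agnostic
# harness. Tool names follow Claude Code's widely documented built-ins;
# the registry/dispatch structure is a hypothetical illustration.
import subprocess
from pathlib import Path

TOOLS = {
    "Bash": {
        "description": "Run a shell command and return its output.",
        "parameters": {"command": "string"},
        "run": lambda command: subprocess.run(
            command, shell=True, capture_output=True, text=True
        ).stdout,
    },
    "Read": {
        "description": "Read a file from the filesystem.",
        "parameters": {"file_path": "string"},
        "run": lambda file_path: Path(file_path).read_text(),
    },
    "Glob": {
        "description": "Find files matching a glob pattern.",
        "parameters": {"pattern": "string"},
        "run": lambda pattern: [str(p) for p in Path(".").glob(pattern)],
    },
}

def dispatch(tool_name, **kwargs):
    """Route a model-emitted tool call to the matching implementation."""
    return TOOLS[tool_name]["run"](**kwargs)

# The point of the thread: if the model was RL-trained against these names
# and schemas, any harness reproducing them closely should score similarly.
print(dispatch("Glob", pattern="*.py"))
```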
lauren @poteto
i am on team markdown
Garry Tan @garrytan
GBrain beats MemPalace on LongMemEval. And I published the benchmarks and open-source eval repo to prove it.
Elias Lumer @EliasLumer
@austinnickpiel @cursor_ai I thought MCP was not part of the context window? You guys have a single CallMCP tool with server name, tool, and arguments, and persist MCP files to .json? How is it 3.7k tokens?
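A sketch of the pattern Elias is describing; the CallMCP name comes from his tweet, but the implementation below is a hypothetical illustration, not Cursor's code. Instead of exposing every MCP server's full tool schema in the prompt, the harness exposes one generic dispatch tool and keeps the full definitions on disk:

```python
# Hypothetical sketch of a single CallMCP dispatch tool (not Cursor's actual
# implementation): full MCP tool schemas live in a JSON file on disk, and the
# model sees only one generic tool instead of every server's full schema.
import json
from pathlib import Path

MCP_REGISTRY = Path("mcp_servers.json")  # persisted server/tool definitions

CALL_MCP_SCHEMA = {
    "name": "CallMCP",
    "description": "Call a tool on a configured MCP server.",
    "parameters": {
        "server": "string",     # e.g. "github"
        "tool": "string",       # e.g. "create_issue"
        "arguments": "object",  # tool-specific arguments
    },
}

def call_mcp(server: str, tool: str, arguments: dict):
    """Look up the persisted definition and forward the call."""
    registry = json.loads(MCP_REGISTRY.read_text())
    tool_def = registry[server]["tools"][tool]
    # ... forward `arguments` to the server over the MCP transport ...
    return {"server": server, "tool": tool, "schema": tool_def}
```

Under this design, only CALL_MCP_SCHEMA (a few hundred tokens) sits in context, which is what makes a multi-thousand-token MCP footprint surprising.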
Austin Nick Piel @austinnickpiel
@cursor_ai One of the devs who built this here! Lots more coming soon to give you even more visibility + tools to optimize context :)
Cursor @cursor_ai
You can now see a breakdown of your agent's context usage in Cursor 3.3. Use these stats to diagnose context issues and improve your setup across rules, skills, MCPs, and subagents.
Elias Lumer @EliasLumer
@zechengzh Cool, how about executing code that lives somewhere in the virtual filesystem?
Zecheng Zhang @zechengzh
Introducing Mirage, a unified virtual filesystem for AI agents!

6 weeks. 1.1M+ lines of code. We rewrote bash from the ground up so cat, grep, head, and pipes work across heterogeneous services. S3, Google Drive, Slack, Gmail, GitHub, Linear, Notion, Postgres, MongoDB, SSH, and more, all mounted side-by-side as one filesystem.

Bash that AI agents already know works on every format! cat, grep, head, and wc parse .parquet, .csv, .json, .h5, even .wav! One pipe can stitch S3, Drive, GitHub, Slack, and Linear together, same Unix semantics throughout.

Workspaces are versioned too. Snapshot, clone, and roll back the whole thing with one API call. A two-layer cache turns repeated reads into local lookups, so agent loops stay fast and cheap.

Drop a Workspace into FastAPI, Express, or a browser app. Wire it into OpenAI Agents SDK, Vercel AI SDK, LangChain, Mastra, or Pi. Run it alongside Claude Code and Codex.

Site: strukto.ai/mirage
GitHub: github.com/strukto-ai/mir…

#AIAgents #OpenSource #AgenticAI #Strukto #Filesystem #VFS
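To make the core idea concrete, here is a toy sketch of what "one filesystem over heterogeneous services" implies; every name below (Workspace, mount, DictBackend) is a guess for illustration, not Mirage's documented interface (see strukto.ai/mirage for the real one). Each remote service is mounted under a path prefix, and reads are routed to whichever backend owns the longest matching prefix:

```python
# Illustrative toy only: the Workspace/mount API below is a guess at what a
# unified VFS could look like, not Mirage's documented interface.

class Workspace:
    def __init__(self):
        self.mounts = {}

    def mount(self, path, backend):
        """Expose a remote service under a local-looking path prefix."""
        self.mounts[path] = backend

    def read(self, path):
        """Route a read to the backend owning the longest matching prefix."""
        prefix = max((m for m in self.mounts if path.startswith(m)), key=len)
        return self.mounts[prefix].read(path[len(prefix):])


class DictBackend:
    """Stand-in for an S3/Slack/Drive adapter: an in-memory key-value store."""
    def __init__(self, files):
        self.files = files

    def read(self, path):
        return self.files[path.lstrip("/")]


ws = Workspace()
ws.mount("/s3", DictBackend({"report.csv": "a,b\n1,2"}))
ws.mount("/slack", DictBackend({"general.log": "hello"}))
print(ws.read("/s3/report.csv"))  # one read API, many backends
```

The payoff the tweet describes: one agent loop and one set of file tools, with /s3, /gdrive, /slack, and friends all behaving like ordinary directories.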
Elias Lumer @EliasLumer
@Vtrivedy10 Yeah exactly, and this trend is common in AI. For example, there's a reason we have general-purpose LLMs rather than many specialized OpenAI/Anthropic LLMs for every task (besides finetuning use cases, which are valid). So it conceptually makes sense to extend the argument to harnesses.
Viv @Vtrivedy10
general purpose seems to just mean: decently good at many tasks out of the box, which is basically an agent that can use a computer well, which is basically a coding agent.

but really people use agents to do things, so another definition of general purpose is "easily editable to do my task well". kinda maps to the point of "can I just tell the agent to do something or give it a skill and it just works?" -> that feels pretty general purpose in practice
Viv @Vtrivedy10
Strong Opinions, Loosely Held on Agent + Harness Engineering:

1. You can outperform any default harness+model (including codex & claude code) on pretty much any Task by engineering the harness around it. Using the exact same model, curate prompts, tools, skills, hooks for that Task. This harness optimization process is becoming much more agent driven, with humans reviewing and curating evals/rewards to hill climb on. "Just say what you want".
2. A "general purpose" agent/harness doesn't really exist; it's a tradeoff between time spent on customizing the agent and performance (cost, latency, accuracy) on a Task. I don't exactly follow what general purpose means tbh. Who decides what's general and what's not?
3. But if the "general purpose" agent/harness existed, it would look like a good coding agent.
4. Building a Task-specific harness will most likely converge to good prompt & tool design (probably packaged up as a Skill) as models become smarter and better at in-context learning.
5. Evals are a moat, and thus data to produce evals is a moat. Especially true for vertical agent companies. This is because agents can fit to most eval sets today. If evals measurably encode all the good behavior your agent needs to do, then this signal can be hill climbed to improve your agent.
6. Frontier closed models are far too expensive for the large majority of tasks the world needs to do. As teams start mapping costs to ROI, Open Model Harness Engineering will take off even more. It is almost always worth the investment to at least try to get a potential 20x+ cost reduction.
7. A large chunk of design decisions around Task decomposition and context engineering exist solely because our usable context window is 50-100k. Agents that become excellent at breaking down tasks, applying compaction appropriately, and orchestrating subagents as sub-task workers will be the most delightful products to do real work.
8. We're entering an Age of Unbundled (& Rebundled) Agents where Subagents exposed as Tools do a ton of domain-specific work on behalf of an orchestrator agent. The Harness becomes a box that gets populated with the exact set of tools, skills, and subagents needed to solve that task or sub-task. Examples include WarpGrep (search), Chroma Context-1 (search), Nemotron 3 Omni (small multimodal), etc. Bespoke agents that rock at narrow tasks, orchestrated as tools. This also applies to software as tools used by agents via Skills, like Remotion or Blender. Different harnesses bundle together the tooling needed to complete that narrow task.

End of opinions. These may change by the time this tweet goes out, or I may double down and expand on them in an article.
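Point 8's "subagents exposed as tools" pattern, as a minimal sketch; the Subagent class and the stand-in specialists below are illustrative, not any particular framework's API. The orchestrator sees each whole agent as one ordinary tool entry:

```python
# Minimal sketch of the "subagents as tools" pattern from point 8 (the
# Subagent class and tool wiring are illustrative, not a specific framework).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Subagent:
    name: str
    description: str            # what the orchestrator's model reads
    run: Callable[[str], str]   # a full agent loop hidden behind one call

def search_agent(query: str) -> str:
    # Stand-in for a bespoke narrow agent (e.g. a code-search specialist).
    return f"top results for {query!r}"

def media_agent(task: str) -> str:
    # Stand-in for a small multimodal specialist.
    return f"rendered asset for {task!r}"

# The orchestrator's toolbox: each entry is a whole agent, not a function.
TOOLBOX = {
    agent.name: agent
    for agent in [
        Subagent("code_search", "Search large codebases.", search_agent),
        Subagent("media", "Generate or edit media assets.", media_agent),
    ]
}

def orchestrate(plan: list[tuple[str, str]]) -> list[str]:
    """Dispatch sub-tasks to specialist subagents as if they were tools."""
    return [TOOLBOX[name].run(arg) for name, arg in plan]

print(orchestrate([("code_search", "retry logic"), ("media", "logo v2")]))
```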
Elias Lumer @EliasLumer
@mattpocockuk Record a video about it instead and post it for free on YouTube
Matt Pocock @mattpocockuk
Sounds mad, but maybe I should just make a course about writing great skills? I.e. for actual life/work productivity, not just dev. Breaking down daily tasks into skills. Turning HITL tasks into AFK ones. Creating a working language with the agent. Feels pretty deep
Elias Lumer @EliasLumer
@1weiho @vercel Is this shareable with multiple collaborators? And is a database hooked up, like Google Slides?
Yiwei Ho @1weiho
Here is a full guide on how to scaffold, build, and deploy your next presentation using open-slide and @vercel. From CLI init to a live URL!
Elias Lumer @EliasLumer
@nicbstme And tool names/descriptions/params: we need to standardize them so we can optimize the harness <> model pairing
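What "standardize tool names/descriptions/params" could look like in practice, sketched as a JSON-Schema-style definition. No such cross-lab standard exists yet; the specific schema below is an illustration in the style most tool-calling APIs already use:

```python
# Illustrative sketch of a standardized tool definition (no cross-lab
# standard exists yet; this mirrors the JSON-Schema style of most APIs).
READ_FILE_TOOL = {
    "name": "read_file",                       # standardized name
    "description": "Read a file and return its contents as text.",
    "parameters": {
        "type": "object",
        "properties": {
            "path":   {"type": "string", "description": "Absolute file path."},
            "offset": {"type": "integer", "description": "First line to read."},
            "limit":  {"type": "integer", "description": "Max lines to return."},
        },
        "required": ["path"],
    },
}
# If every lab RL-trained against the same definitions, any harness could
# reuse them without reverse engineering names and schemas per provider.
```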
Nicolas Bustamante @nicbstme
It would be amazing to see more collaboration between the labs on file paths, memory file formats, etc. Realistically, I understand each research team already has its own RL pipelines, etc., and it might even be a moat to increase the cost of switching between model providers.
Nicolas Bustamante @nicbstme: x.com/i/article/2050…
Elias Lumer @EliasLumer
@dqnamo @Vercantez Why? I think people like decoupling the sandbox from the agent deployment, and the option of virtual filesystems (Postgres, S3)
JP @dqnamo
@Vercantez curious to know why the strong opinion on outside vs inside. recently have been liking the agent inside sandbox approach
Elias Lumer @EliasLumer
@willccbb @neural_avb Does this hold true for both multi-turn and single-turn RL? If you have a rollout with 50 tool calls/tool responses, and half are garbage, how does sampling solve credit assignment if the best rollout is still very inefficient? Do we need to add self-distillation to fix it?
will brown @willccbb
@neural_avb a question to ask about any learning method is "where are your bits coming from?"

in SFT / OPD, they're coming from the teacher
in RL, they're coming from the reward function

if you want more bits, they gotta come from somewhere

credit assignment is already solved by sampling
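A compact sketch of the "credit assignment via sampling" claim in GRPO-style terms (an illustration, not will's code): sample several rollouts per prompt, score each whole trajectory with the reward function, and give every token of a rollout the same group-relative advantage. No per-token credit is hand-assigned; the signal comes from which samples score above the group mean:

```python
# Sketch of sampling-based credit assignment, GRPO-style (illustrative):
# sample a group of rollouts per prompt, score whole trajectories, and give
# every token in a rollout the same group-relative advantage. The "bits"
# come from the reward function, via which samples beat the group mean.
import statistics

def group_relative_advantages(rewards):
    """One scalar advantage per rollout, shared by all of its tokens."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

# Four sampled rollouts for one prompt, scored by the reward function.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# The policy-gradient weight for every token of rollout i is advantages[i]:
# tokens in successful rollouts are pushed up, others pushed down. Elias's
# question upthread targets exactly this: if even the best rollout is
# inefficient, a trajectory-level signal can't separate the useful tool
# calls from the wasted ones within it.
print(advantages)  # [1.0, -1.0, -1.0, 1.0]
```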
AVB @neural_avb
Normally I'd read a paper every morning, but today read this article on X instead. Great survey on RL vs OPD vs SFT. Don't bookmark, just spend 30 mins and read it through. Lots of cool things here, but there's one new curiosity this opened for me. Some thought dump: I always thought token-level losses are the holy grail coz "hey that solves one of the main issues with classical RL - credit assignment". This article made me realize that not all token-level losses are equally useful... if the token-level KL is obtained through teachers completely detached from student model's world view, that's probably gonna get awkward. Personally, I am a big sucker for self-optimization methods (model receives/generates hints based off of env rewards, and then distills that into a training signal)... Self-contained methods just *feel* good, challenge is to build something that gives it maximum expressivity, minimizing inductive bias, while remaining below cost/time/resource constraints. Good luck with that. Article actually mentions many of those ideas in the end as well! Lots of references to dig into as a follow up.
will brown @willccbb: x.com/i/article/2050…
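The token-level KL AVB mentions, as a short PyTorch sketch (generic distillation code, not from the article): the student is penalized per token for diverging from the teacher's next-token distribution, which is exactly where a teacher detached from the student's own trajectories "gets awkward":

```python
# Generic sketch of a token-level distillation (KL) loss, as discussed above:
# per-position KL(teacher || student) over the vocabulary. If the teacher is
# detached from the student's own trajectory distribution, this dense signal
# can point somewhere the student cannot usefully follow.
import torch
import torch.nn.functional as F

def token_level_kl_loss(student_logits, teacher_logits):
    """student_logits, teacher_logits: [batch, seq_len, vocab]."""
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_p = F.softmax(teacher_logits, dim=-1)
    # KL(teacher || student) per token position, averaged over the batch.
    kl = (teacher_p * (teacher_p.clamp_min(1e-9).log() - student_logp)).sum(-1)
    return kl.mean()

student = torch.randn(2, 8, 100, requires_grad=True)
teacher = torch.randn(2, 8, 100)
loss = token_level_kl_loss(student, teacher)
loss.backward()  # dense per-token gradient: no sampling needed for credit
```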
Elias Lumer @EliasLumer
@moofeez @willccbb Yeah exactly, probably more juice to squeeze by bitter-lesson-ing the debugger in the harness. Looking forward to seeing your open-source blog post 👍
mufeez @moofeez
@EliasLumer @willccbb I explored the standard harness + tool call approach, though there’s definitely room for experimentation here
mufeez @moofeez
I post-trained Qwen3-Coder to fix bugs using an actual debugger.

The result:
- Solve rate: 70% → 89%
- Median turns to fix: 46 → 19 (-59%)

Instead of just reading code or print-debugging, it:
- reasons from execution
- inspects live variables and call stacks
- sets breakpoints, steps, and evaluates expressions
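Elias asks elsewhere in the thread how a debugger is actually exposed to a model. One plausible shape, sketched with Python's standard pdb module; the tool wrapper below is my illustration, not mufeez's harness. The harness exposes a single tool that runs code under pdb, drives it with model-chosen debugger commands, and returns the captured output as the tool result:

```python
# One plausible way to hand an LLM a real debugger (illustrative, not
# mufeez's actual harness): run the target under pdb, feed in model-chosen
# debugger commands, and return the captured output as the tool result
# that the model reasons over on its next turn.
import io
import pdb

def debugger_tool(source: str, commands: list[str]) -> str:
    """Execute `source` under pdb, driving it with `commands`
    (e.g. 's' to step, 'p x' to inspect, 'where' for the call stack)."""
    out = io.StringIO()
    # Feeding commands via stdin is a standard way to drive pdb headlessly;
    # a trailing 'c' lets the program run to completion if commands run out.
    session = pdb.Pdb(stdin=io.StringIO("\n".join(commands) + "\nc\n"), stdout=out)
    session.run(source)
    return out.getvalue()

BUGGY = """
def mean(xs):
    total = 0
    for x in xs:
        total += x
    return total / (len(xs) - 1)   # off-by-one bug
mean([2, 4, 6])
"""

# A model turn might look like: step into the call, inspect live state.
print(debugger_tool(BUGGY, ["s", "s", "p xs", "c"]))
```

This is the "reasons from execution" loop in miniature: instead of guessing from source, the model conditions on live variables and control flow returned by the tool.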
Elias Lumer @EliasLumer
@moofeez @willccbb Interesting. And by "diff variations" I'm asking how you actually gave an LLM a debugger, i.e. how you exposed it to the LM
mufeez @moofeez
great questions, I did run evals on Claude models towards the beginning of the project. the failure mode I observed was that the models would start a debug session but fail to use it effectively (shallow/incomplete debugger use), even on harder bugs. not sure what you mean by "diff variations of giving the LLM a debugger"
Nathan Baschez @nbaschez
Do you spend a lot of time reviewing markdown docs written by AI? Wish it were a better experience? Say hi if you wanna try a new (free, open source) thing