Elias Lumer
@EliasLumer
94 posts
ai research & engineering
Joined February 2026
44 Following · 71 Followers
Elias Lumer @EliasLumer
@nexxeln Can you provide examples or best practices you’ve learned while doing this? Very interesting.
nexxel @nexxeln
i'm starting to think agent-friendly codebases are more about constraints. migrating parts of opencode to effect has made agent-written code noticeably less cursed. good architecture boxes agents into writing specific, constrained code: fewer ways to go wrong
Elias Lumer @EliasLumer
@sydneyrunkle Is it on by default for both create_agent and create_deep_agent? And if we add additional state via the custom state option (in addition to messages), does every new state item need to have the delta channel? Or is it on by default?
Sydney Runkle @sydneyrunkle
we just shipped delta channels in langgraph 1.2. as agents run longer and use more context, full-state checkpointing doesn't scale, but delta channel snapshots do. this new algorithm is now powering message histories and file storage in deepagents v0.6!
Sydney Runkle @sydneyrunkle: x.com/i/article/2054…
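For readers unfamiliar with the idea, a minimal sketch of delta-style checkpointing in the abstract; the DeltaChannel class below is an illustration, not LangGraph's actual API. A full-state checkpointer re-serializes the entire history every step, while a delta channel persists only what was appended since the last snapshot and rebuilds state by replaying deltas:

```python
# Hypothetical sketch of delta-channel checkpointing (not LangGraph's real API):
# a full-state checkpointer re-serializes the entire message list every step,
# while a delta channel persists only what was appended since the last snapshot.

class DeltaChannel:
    def __init__(self):
        self.items = []          # full in-memory state
        self._last_saved = 0     # index of the last checkpointed item

    def append(self, item):
        self.items.append(item)

    def checkpoint(self):
        """Persist only the new items (the delta), not the whole list."""
        delta = self.items[self._last_saved:]
        self._last_saved = len(self.items)
        return delta             # O(new items), not O(total history)

    @staticmethod
    def restore(deltas):
        """Rebuild full state by replaying the saved deltas in order."""
        channel = DeltaChannel()
        for delta in deltas:
            channel.items.extend(delta)
        channel._last_saved = len(channel.items)
        return channel


# Usage: checkpoint cost stays flat as the message history grows.
log = []
chan = DeltaChannel()
for turn in ["hi", "tool_call", "tool_result", "answer"]:
    chan.append(turn)
    log.append(chan.checkpoint())   # each snapshot holds 1 item, not len(items)

restored = DeltaChannel.restore(log)
assert restored.items == ["hi", "tool_call", "tool_result", "answer"]
```

This is why the approach scales with long-running agents: per-step checkpoint cost tracks the size of the change, not the size of the accumulated history.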
Elias Lumer @EliasLumer
@mem0ai Congrats on this! 94.8% is still slightly below our SOTA (95.6%), and our approach, Chronos, is centered around temporal relevance. Glad to see temporality being recognized as a core primitive of memory offerings. Paper: arxiv.org/abs/2603.16862
Elias Lumer reposted
DAIR.AI @dair_ai
Cool paper from PwC.

"Earlier is always better" is the default intuition for agent clarification. New paper claims that's mostly wrong: goal clarification loses nearly all of its value after just 10% of execution.

The team built a forced-injection framework that drops ground-truth clarifications at controlled points along a long-horizon agent's trajectory, across 4 information dimensions (goal, input, constraint, context), 3 benchmarks, and 4 frontier models. 84 task variants, 6,000+ runs.

Pass@3 falls from 0.78 back to baseline. Input clarification keeps value through roughly 50%. Past mid-trajectory, asking any clarification at all performs worse than never asking.

A complementary study of 300 unscripted sessions shows no current frontier model asks within the empirically optimal window. 52% of sessions over-ask. Others never ask at all.

Why it matters: clarification has been treated as a binary capability (does the agent ask or not). This is the first quantitative demand curve for *when* the question is worth asking.

Paper: arxiv.org/abs/2605.07937
Learn to build effective AI agents in our academy: academy.dair.ai
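A rough sketch of what a forced-injection protocol like the one described could look like; the function names and the toy agent below are illustrations, not the paper's code. A ground-truth clarification is inserted into the transcript once a chosen fraction of the step budget has elapsed, and pass@k is compared across injection points:

```python
# Illustrative sketch of forced clarification injection (names are
# hypothetical, not from the paper): drop a ground-truth clarification into
# the transcript after a chosen fraction of the step budget, then compare
# pass@k across injection points to trace the "demand curve".

def run_with_injection(agent_step, prompt, clarification, inject_at, max_steps=20):
    """One rollout; inject_at in [0, 1] picks the controlled injection point."""
    transcript = [("user", prompt)]
    solved = False
    for step in range(max_steps):
        if step == int(inject_at * max_steps):
            transcript.append(("user", clarification))
        action, solved = agent_step(transcript)
        transcript.append(("assistant", action))
        if solved:
            break
    return solved

def pass_at_k(agent_step, prompt, clarification, inject_at, k=3):
    """Pass@k over k independent rollouts at one injection point."""
    return any(
        run_with_injection(agent_step, prompt, clarification, inject_at)
        for _ in range(k)
    )

# Toy agent standing in for a frontier model: it only succeeds if the goal
# clarification arrived within roughly the first 10% of its trajectory.
def toy_agent(transcript):
    clarified_early = any(
        msg == "goal: refactor module X" for _, msg in transcript[:5]
    )
    return "work", clarified_early

for frac in (0.0, 0.1, 0.5, 0.9):
    print(frac, pass_at_k(toy_agent, "fix the repo", "goal: refactor module X", frac))
# -> True at 0.0 and 0.1, False afterwards: value concentrated early.
```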
Elias Lumer @EliasLumer
@sarahwooders @Kushalpatil77 It's not just toolsets but a variety of things the model labs are doing (system-reminder, tool outputs, human messages). OSS will always lag behind unless they have access to the RL agent loop.
Sarah Wooders @sarahwooders
@Kushalpatil77 No, you just need to match their toolsets (and maybe some prompting structure at most). Other than that you can innovate on things like memory (Letta Code), extensibility (pi), and UI/product experience more generally
Sarah Wooders @sarahwooders
There have been some claims recently that the harnesses offered by the model labs (which increasingly lock in memory/state) are somehow magically superior to model-agnostic harnesses. This take really irritates me because it's so easy to disprove. Letta Code gets the same scores (or slightly better) as Claude Code / Codex on TerminalBench. So do many other harnesses.

Yes, model labs *are* reducing the generality of their own models in favor of optimizing for their first-party products, but this mostly just means their models are overfit to the toolsets of their first-party harnesses. Fortunately, it's very easy to reverse engineer what those toolsets are and implement them in other harnesses. Codex is open-source, and Claude Code's source code has been leaked, so there's no great mystery here.

Some popular harnesses DO fail to adapt their toolsets properly (e.g. OpenCode), which degrades performance. But if you are using a well-implemented harness, this is a non-issue. You are not getting special capabilities from first-party harnesses, just memory lock-in.
Dan Shipper 📧 @danshipper

In the future, you'll be able to accomplish a goal by just giving Claude an outcome and a budget. That's the direction Anthropic is building in with its new Managed Agents features, announced at this week's Code with Claude developer event.

The basic idea: Claude, wrapped in a computer in the cloud, that you can spin up, scale, and manage as needed. Anthropic is taking on the infrastructure that kills most agent products, and making sure that it scales to meet the needs of agents running 24/7.

On this week's AI & I from @every, I talk with Angela Jiang (@angjiang), head of product for the Claude platform, and Katelyn Lesse (@katelyn_lesse), head of engineering for the Claude platform, about what Anthropic is building and what it takes to make agents reliable in production.

We get into:
- Why the "build a generic harness, hot-swap any model behind it" playbook is already outdated. Angela points to eval data on Memory where the same task across different harnesses performed drastically differently.
- The infrastructure wall every team hits in production, and why Katelyn thinks "my sandbox died and took the agent with it" is the real reason internal agents don't ship.
- Why Anthropic is so bullish on using file systems and skills within Claude, including Angela's argument that those early design choices can compound for years.

This is a must-watch for anyone trying to take an agent past the demo and into production. Watch below!

Timestamps:
- How the Claude platform evolved from API to agents: 00:01:48
- The primitives that make up Claude Managed Agents: 00:04:09
- Why the harness and the model are becoming a single unit: 00:10:37
- The infrastructure wall that kills most agent projects in production: 00:18:49
- Why team agents need a different shape than individual productivity tools: 00:24:49
- How Anthropic's legal team uses an agent to review marketing copy: 00:26:36
- Using multi-agent orchestration for advisor strategies, adversarial pairs, and swarms: 00:34:24
- How to measure agent success with outcome and budget as the end state: 00:35:50
- What the platform looks like a year from now, when Claude writes its own harness: 00:39:11
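As a concrete illustration of "matching their toolsets": a minimal sketch of a model-agnostic harness exposing tools under the same names and shapes a model was trained against. The tool names here follow Claude Code's widely documented built-ins (Bash, Read, Glob); the registry structure itself is a hypothetical example, not any harness's real code:

```python
# Minimal sketch of mirroring a first-party toolset in a model-agnostic
# harness. Tool names follow Claude Code's widely documented built-ins;
# the registry/dispatch structure is a hypothetical illustration.
import subprocess
from pathlib import Path

TOOLS = {
    "Bash": {
        "description": "Run a shell command and return its output.",
        "parameters": {"command": "string"},
        "run": lambda command: subprocess.run(
            command, shell=True, capture_output=True, text=True
        ).stdout,
    },
    "Read": {
        "description": "Read a file from the filesystem.",
        "parameters": {"file_path": "string"},
        "run": lambda file_path: Path(file_path).read_text(),
    },
    "Glob": {
        "description": "Find files matching a glob pattern.",
        "parameters": {"pattern": "string"},
        "run": lambda pattern: [str(p) for p in Path(".").glob(pattern)],
    },
}

def dispatch(tool_name, **kwargs):
    """Route a model-emitted tool call to the matching implementation."""
    return TOOLS[tool_name]["run"](**kwargs)

# The point of the thread: if the model was RL-trained against these names
# and schemas, any harness reproducing them closely should score similarly.
print(dispatch("Glob", pattern="*.py"))
```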
lauren @poteto
i am on team markdown
Garry Tan @garrytan
GBrain beats MemPalace on LongMemEval. And I published the benchmarks and open-source eval repo to prove it.
Elias Lumer @EliasLumer
@austinnickpiel @cursor_ai I thought MCP was not part of the context window? You guys have a single CallMCP tool with server name, tool, and arguments, and persist MCP files to .json? How is it 3.7k tokens?
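A sketch of the pattern Elias is describing; the CallMCP name comes from his tweet, but the implementation below is a hypothetical illustration, not Cursor's code. Instead of exposing every MCP server's full tool schema in the prompt, the harness exposes one generic dispatch tool and keeps the full definitions on disk:

```python
# Hypothetical sketch of a single CallMCP dispatch tool (not Cursor's actual
# implementation): full MCP tool schemas live in a JSON file on disk, and the
# model sees only one generic tool instead of every server's full schema.
import json
from pathlib import Path

MCP_REGISTRY = Path("mcp_servers.json")  # persisted server/tool definitions

CALL_MCP_SCHEMA = {
    "name": "CallMCP",
    "description": "Call a tool on a configured MCP server.",
    "parameters": {
        "server": "string",     # e.g. "github"
        "tool": "string",       # e.g. "create_issue"
        "arguments": "object",  # tool-specific arguments
    },
}

def call_mcp(server: str, tool: str, arguments: dict):
    """Look up the persisted definition and forward the call."""
    registry = json.loads(MCP_REGISTRY.read_text())
    tool_def = registry[server]["tools"][tool]
    # ... forward `arguments` to the server over the MCP transport ...
    return {"server": server, "tool": tool, "schema": tool_def}
```

Under this design, only CALL_MCP_SCHEMA (a few hundred tokens) sits in context, which is what makes a multi-thousand-token MCP footprint surprising.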
Austin Nick Piel @austinnickpiel
@cursor_ai One of the devs who built this here! Lots more coming soon to give you even more visibility + tools to optimize context :)
Cursor @cursor_ai
You can now see a breakdown of your agent's context usage in Cursor 3.3. Use these stats to diagnose context issues and improve your setup across rules, skills, MCPs, and subagents.
Elias Lumer @EliasLumer
@zechengzh Cool, how about executing code that lives somewhere in the virtual filesystem?
Zecheng Zhang @zechengzh
Introducing Mirage, a unified virtual filesystem for AI agents!

6 weeks. 1.1M+ lines of code. We rewrote bash from the ground up so cat, grep, head, and pipes work across heterogeneous services. S3, Google Drive, Slack, Gmail, GitHub, Linear, Notion, Postgres, MongoDB, SSH, and more, all mounted side-by-side as one filesystem.

Bash that AI agents already know works on every format! cat, grep, head, and wc parse .parquet, .csv, .json, .h5, even .wav! One pipe can stitch S3, Drive, GitHub, Slack, and Linear together, same Unix semantics throughout.

Workspaces are versioned too. Snapshot, clone, and roll back the whole thing with one API call. A two-layer cache turns repeated reads into local lookups, so agent loops stay fast and cheap.

Drop a Workspace into FastAPI, Express, or a browser app. Wire it into OpenAI Agents SDK, Vercel AI SDK, LangChain, Mastra, or Pi. Run it alongside Claude Code and Codex.

Site: strukto.ai/mirage
GitHub: github.com/strukto-ai/mir…

#AIAgents #OpenSource #AgenticAI #Strukto #Filesystem #VFS
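To make the core idea concrete, here is a toy sketch of what "one filesystem over heterogeneous services" implies; every name below (Workspace, mount, DictBackend) is a guess for illustration, not Mirage's documented interface (see strukto.ai/mirage for the real one). Each remote service is mounted under a path prefix, and reads are routed to whichever backend owns the longest matching prefix:

```python
# Illustrative toy only: the Workspace/mount API below is a guess at what a
# unified VFS could look like, not Mirage's documented interface.

class Workspace:
    def __init__(self):
        self.mounts = {}

    def mount(self, path, backend):
        """Expose a remote service under a local-looking path prefix."""
        self.mounts[path] = backend

    def read(self, path):
        """Route a read to the backend owning the longest matching prefix."""
        prefix = max((m for m in self.mounts if path.startswith(m)), key=len)
        return self.mounts[prefix].read(path[len(prefix):])


class DictBackend:
    """Stand-in for an S3/Slack/Drive adapter: an in-memory key-value store."""
    def __init__(self, files):
        self.files = files

    def read(self, path):
        return self.files[path.lstrip("/")]


ws = Workspace()
ws.mount("/s3", DictBackend({"report.csv": "a,b\n1,2"}))
ws.mount("/slack", DictBackend({"general.log": "hello"}))
print(ws.read("/s3/report.csv"))  # one read API, many backends
```

The payoff the tweet describes: one agent loop and one set of file tools, with /s3, /gdrive, /slack, and friends all behaving like ordinary directories.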
Elias Lumer @EliasLumer
@Vtrivedy10 Yeah exactly, and this trend is common in AI. For example, there's a reason we have general-purpose LLMs rather than many specialized OpenAI/Anthropic LLMs for every task (besides finetuning use cases, which are valid). So it conceptually makes sense to extend the argument to harnesses.
Viv @Vtrivedy10
general purpose seems to just mean: decently good at many tasks out of the box, which is basically an agent that can use a computer well, which is basically a coding agent.

but really people use agents to do things, so another definition of general purpose is "easily editable to do my task well". kinda maps to the point of "can I just tell the agent to do something or give it a skill and it just works?" -> that feels pretty general purpose in practice
Viv @Vtrivedy10
Strong Opinions, Loosely Held on Agent + Harness Engineering:

1. You can outperform any default harness+model (including codex & claude code) on pretty much any Task by engineering the harness around it. Using the exact same model, curate prompts, tools, skills, hooks for that Task. This harness optimization process is becoming much more agent driven, with humans reviewing and curating evals/rewards to hill climb on. "Just say what you want".
2. A "general purpose" agent/harness doesn't really exist; it's a tradeoff between time spent on customizing the agent and performance (cost, latency, accuracy) on a Task. I don't exactly follow what general purpose means tbh. Who decides what's general and what's not?
3. But if the "general purpose" agent/harness existed, it would look like a good coding agent.
4. Building a Task-specific harness will most likely converge to good prompt & tool design (probably packaged up as a Skill) as models become smarter and better at in-context learning.
5. Evals are a moat, and thus data to produce evals is a moat. Especially true for vertical agent companies. This is because agents can fit to most eval sets today. If evals measurably encode all the good behavior your agent needs to do, then this signal can be hill climbed to improve your agent.
6. Frontier closed models are far too expensive for the large majority of tasks the world needs to do. As teams start mapping costs to ROI, Open Model Harness Engineering will take off even more. It is almost always worth the investment to at least try to get a potential 20x+ cost reduction.
7. A large chunk of design decisions around Task decomposition and context engineering exist solely because our usable context window is 50-100k. Agents that become excellent at breaking down tasks, applying compaction appropriately, and orchestrating subagents as sub-task workers will be the most delightful products to do real work.
8. We're entering an Age of Unbundled (& Rebundled) Agents where Subagents exposed as Tools do a ton of domain-specific work on behalf of an orchestrator agent. The Harness becomes a box that gets populated with the exact set of tools, skills, and subagents needed to solve that task or sub-task. Examples include WarpGrep (search), Chroma Context-1 (search), Nemotron 3 Omni (small multimodal), etc. Bespoke agents that rock at narrow tasks, orchestrated as tools. This also applies to software as tools used by agents via Skills, like Remotion or Blender. Different harnesses bundle together the tooling needed to complete that narrow task.

End of opinions. These may change by the time this tweet goes out, or I may double down and expand on them in an article.
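Point 8's "subagents exposed as tools" pattern, as a minimal sketch; the Subagent class and the stand-in specialists below are illustrative, not any particular framework's API. The orchestrator sees each whole agent as one ordinary tool entry:

```python
# Minimal sketch of the "subagents as tools" pattern from point 8 (the
# Subagent class and tool wiring are illustrative, not a specific framework).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Subagent:
    name: str
    description: str            # what the orchestrator's model reads
    run: Callable[[str], str]   # a full agent loop hidden behind one call

def search_agent(query: str) -> str:
    # Stand-in for a bespoke narrow agent (e.g. a code-search specialist).
    return f"top results for {query!r}"

def media_agent(task: str) -> str:
    # Stand-in for a small multimodal specialist.
    return f"rendered asset for {task!r}"

# The orchestrator's toolbox: each entry is a whole agent, not a function.
TOOLBOX = {
    agent.name: agent
    for agent in [
        Subagent("code_search", "Search large codebases.", search_agent),
        Subagent("media", "Generate or edit media assets.", media_agent),
    ]
}

def orchestrate(plan: list[tuple[str, str]]) -> list[str]:
    """Dispatch sub-tasks to specialist subagents as if they were tools."""
    return [TOOLBOX[name].run(arg) for name, arg in plan]

print(orchestrate([("code_search", "retry logic"), ("media", "logo v2")]))
```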
Elias Lumer @EliasLumer
@mattpocockuk Record a video about it instead and post it for free on YouTube
Matt Pocock @mattpocockuk
Sounds mad, but maybe I should just make a course about writing great skills? I.e. for actual life/work productivity, not just dev. Breaking down daily tasks into skills. Turning HITL tasks into AFK ones. Creating a working language with the agent. Feels pretty deep
Elias Lumer @EliasLumer
@1weiho @vercel Is this shareable with multiple collaborators? And is a database hooked up, like Google Slides?
Yiwei Ho @1weiho
Here is a full guide on how to scaffold, build, and deploy your next presentation using open-slide and @vercel. From CLI init to a live URL!
Elias Lumer @EliasLumer
@nicbstme And tool names/descriptions/params: we need to standardize them so we can optimize the harness <> model pairing
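What "standardize tool names/descriptions/params" could look like in practice, sketched as a JSON-Schema-style definition. No such cross-lab standard exists yet; the specific schema below is an illustration in the style most tool-calling APIs already use:

```python
# Illustrative sketch of a standardized tool definition (no cross-lab
# standard exists yet; this mirrors the JSON-Schema style of most APIs).
READ_FILE_TOOL = {
    "name": "read_file",                       # standardized name
    "description": "Read a file and return its contents as text.",
    "parameters": {
        "type": "object",
        "properties": {
            "path":   {"type": "string", "description": "Absolute file path."},
            "offset": {"type": "integer", "description": "First line to read."},
            "limit":  {"type": "integer", "description": "Max lines to return."},
        },
        "required": ["path"],
    },
}
# If every lab RL-trained against the same definitions, any harness could
# reuse them without reverse engineering names and schemas per provider.
```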
Nicolas Bustamante @nicbstme
It would be amazing to see more collaboration between the labs on file paths, memory file formats, etc. Realistically, I understand each research team already has its own RL pipelines, etc., and it might even be a moat to increase the cost of switching between model providers.
Nicolas Bustamante @nicbstme: x.com/i/article/2050…
Elias Lumer @EliasLumer
@dqnamo @Vercantez Why? I think people like decoupling the sandbox from the agent deployment, and the option of virtual filesystems (Postgres, S3)
JP @dqnamo
@Vercantez curious to know why the strong opinion on outside vs inside. recently have been liking the agent inside sandbox approach
Elias Lumer @EliasLumer
@willccbb @neural_avb Does this hold true for both multi-turn and single-turn RL? If you have a rollout with 50 tool calls/tool responses, and half are garbage, how does sampling solve credit assignment if the best rollout is still very inefficient? Do we need to add self-distillation to fix it?
will brown @willccbb
@neural_avb a question to ask about any learning method is "where are your bits coming from?"

in SFT / OPD, they're coming from the teacher
in RL, they're coming from the reward function

if you want more bits, they gotta come from somewhere

credit assignment is already solved by sampling
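A compact sketch of the "credit assignment via sampling" claim in GRPO-style terms (an illustration, not will's code): sample several rollouts per prompt, score each whole trajectory with the reward function, and give every token of a rollout the same group-relative advantage. No per-token credit is hand-assigned; the signal comes from which samples score above the group mean:

```python
# Sketch of sampling-based credit assignment, GRPO-style (illustrative):
# sample a group of rollouts per prompt, score whole trajectories, and give
# every token in a rollout the same group-relative advantage. The "bits"
# come from the reward function, via which samples beat the group mean.
import statistics

def group_relative_advantages(rewards):
    """One scalar advantage per rollout, shared by all of its tokens."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

# Four sampled rollouts for one prompt, scored by the reward function.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# The policy-gradient weight for every token of rollout i is advantages[i]:
# tokens in successful rollouts are pushed up, others pushed down. Elias's
# question upthread targets exactly this: if even the best rollout is
# inefficient, a trajectory-level signal can't separate the useful tool
# calls from the wasted ones within it.
print(advantages)  # [1.0, -1.0, -1.0, 1.0]
```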
AVB @neural_avb
Normally I'd read a paper every morning, but today read this article on X instead. Great survey on RL vs OPD vs SFT. Don't bookmark, just spend 30 mins and read it through. Lots of cool things here, but there's one new curiosity this opened for me. Some thought dump: I always thought token-level losses are the holy grail coz "hey that solves one of the main issues with classical RL - credit assignment". This article made me realize that not all token-level losses are equally useful... if the token-level KL is obtained through teachers completely detached from student model's world view, that's probably gonna get awkward. Personally, I am a big sucker for self-optimization methods (model receives/generates hints based off of env rewards, and then distills that into a training signal)... Self-contained methods just *feel* good, challenge is to build something that gives it maximum expressivity, minimizing inductive bias, while remaining below cost/time/resource constraints. Good luck with that. Article actually mentions many of those ideas in the end as well! Lots of references to dig into as a follow up.
will brown @willccbb: x.com/i/article/2050…
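The token-level KL AVB mentions, as a short PyTorch sketch (generic distillation code, not from the article): the student is penalized per token for diverging from the teacher's next-token distribution, which is exactly where a teacher detached from the student's own trajectories "gets awkward":

```python
# Generic sketch of a token-level distillation (KL) loss, as discussed above:
# per-position KL(teacher || student) over the vocabulary. If the teacher is
# detached from the student's own trajectory distribution, this dense signal
# can point somewhere the student cannot usefully follow.
import torch
import torch.nn.functional as F

def token_level_kl_loss(student_logits, teacher_logits):
    """student_logits, teacher_logits: [batch, seq_len, vocab]."""
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_p = F.softmax(teacher_logits, dim=-1)
    # KL(teacher || student) per token position, averaged over the batch.
    kl = (teacher_p * (teacher_p.clamp_min(1e-9).log() - student_logp)).sum(-1)
    return kl.mean()

student = torch.randn(2, 8, 100, requires_grad=True)
teacher = torch.randn(2, 8, 100)
loss = token_level_kl_loss(student, teacher)
loss.backward()  # dense per-token gradient: no sampling needed for credit
```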
Elias Lumer @EliasLumer
@moofeez @willccbb Yeah exactly, probably more juice to squeeze by bitter-lesson-ing the debugger in the harness. Looking forward to seeing your open-source blog post 👍
mufeez @moofeez
@EliasLumer @willccbb I explored the standard harness + tool call approach, though there’s definitely room for experimentation here
mufeez @moofeez
I post-trained Qwen3-Coder to fix bugs using an actual debugger.

The result:
- Solve rate: 70% → 89%
- Median turns to fix: 46 → 19 (-59%)

Instead of just reading code or print-debugging, it:
- reasons from execution
- inspects live variables and call stacks
- sets breakpoints, steps, and evaluates expressions
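Elias asks elsewhere in the thread how a debugger is actually exposed to a model. One plausible shape, sketched with Python's standard pdb module; the tool wrapper below is my illustration, not mufeez's harness. The harness exposes a single tool that runs code under pdb, drives it with model-chosen debugger commands, and returns the captured output as the tool result:

```python
# One plausible way to hand an LLM a real debugger (illustrative, not
# mufeez's actual harness): run the target under pdb, feed in model-chosen
# debugger commands, and return the captured output as the tool result
# that the model reasons over on its next turn.
import io
import pdb

def debugger_tool(source: str, commands: list[str]) -> str:
    """Execute `source` under pdb, driving it with `commands`
    (e.g. 's' to step, 'p x' to inspect, 'where' for the call stack)."""
    out = io.StringIO()
    # Feeding commands via stdin is a standard way to drive pdb headlessly;
    # a trailing 'c' lets the program run to completion if commands run out.
    session = pdb.Pdb(stdin=io.StringIO("\n".join(commands) + "\nc\n"), stdout=out)
    session.run(source)
    return out.getvalue()

BUGGY = """
def mean(xs):
    total = 0
    for x in xs:
        total += x
    return total / (len(xs) - 1)   # off-by-one bug
mean([2, 4, 6])
"""

# A model turn might look like: step into the call, inspect live state.
print(debugger_tool(BUGGY, ["s", "s", "p xs", "c"]))
```

This is the "reasons from execution" loop in miniature: instead of guessing from source, the model conditions on live variables and control flow returned by the tool.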
Elias Lumer @EliasLumer
@moofeez @willccbb Interesting. And by "diff variations" I'm asking how you actually gave an LLM a debugger, i.e. how you exposed it to the LM
mufeez @moofeez
great questions, I did run evals on Claude models towards the beginning of the project. the failure mode I observed was that the models would start a debug session but fail to use it effectively (shallow/incomplete debugger use), even on harder bugs. not sure what you mean by "diff variations of giving the LLM a debugger"
Nathan Baschez @nbaschez
Do you spend a lot of time reviewing markdown docs written by AI? Wish it were a better experience? Say hi if you wanna try a new (free, open source) thing