Dave Lee

50 posts

@mega__d

agent systems, optimized

Joined November 2023
68 Following 5 Followers
Pinned Tweet
Dave Lee@mega__d·
AI coding tools are getting packed with skills. But the real problem isn’t adding more skills. It’s making sure the agent doesn’t load everything when it only needs one or two. That’s why we built MEGA Tron. It routes the right skills into Codex, Claude Code, and Gemini CLI based on the prompt. Less wasted context. More useful skills. GitHub: github.com/mega-edo/mega-…
0
0
0
104
Dave Lee@mega__d·
@steipete Exactly. Shorter skill descriptions help. But isn’t loading the right skills just as important?
2
0
15
8.7K
Peter Steinberger 🦞
Folks: when you write skills, ask your agent to be token efficient, relax grammar. I see too many skills that write books in the skill description, and all that crap is loaded into every context. I wrote a skill that finds the worst offenders. github.com/steipete/agent…
136
266
3.6K
194.6K
Dave Lee@mega__d·
Compared with MEGA Tron, this is far more tokens for lower coverage. Hard to call that token-efficient.
Dave Lee tweet media
1
0
0
10
Dave Lee@mega__d·
Is this acceptable for agent skills? Maybe.
Dave Lee tweet media
1
0
0
17
Dave Lee@mega__d·
@mattpocockuk All of this is important. But the right skill still needs to show up at the right time. Otherwise, even a well-designed skill doesn’t get to create value.
0
0
0
35
Matt Pocock@mattpocockuk·
Skills should be:
- Concise
- Responsible for one thing, not multi-step
- Composable
- Progressively disclosed
- Harness-agnostic

What else? Or - what did I get wrong?
186
82
1.7K
120.4K
Dave Lee@mega__d·
@NousResearch @browserbase Once agents have access to hundreds of skills, routing becomes increasingly important. The right skill needs to show up at the right moment.
1
0
0
14
Nous Research@NousResearch·
Hermes Agent now has access to hundreds of browser skills through @browserbase’s new Browse.sh hub, so agents can more reliably perform any task on the internet. You can try a skill from their catalog or contribute your own.
107
193
2.4K
532.8K
Dave Lee@mega__d·
@exploraX_ 70k+ skills is a lot. At that point, the real problem becomes routing. Which skill should the agent load, and when?
1
0
0
4
m0h@exploraX_·
someone created an open-source marketplace for agentic skills. it’s completely free, with 70k+ skills. skill categories include:
• developing
• business
• sales and marketing
• content creation
etc.
it’s the site where you’d find any agent skills you need.
m0h@exploraX_

x.com/i/article/2039…

17
18
155
21.2K
Dave Lee@mega__d·
@milesdeutscher More skills ≠ better agents. The real challenge is knowing what to load, when. Otherwise, a marketplace just becomes more noise.
1
0
0
9
Miles Deutscher@milesdeutscher·
This is a complete game-changer. This agent skills marketplace has over a MILLION ready-to-use agent skills and plug-ins. Just search the skill type you want to deploy, and watch hundreds of skill files appear. If you use AI regularly, this is a must. skillsmp.com
34
58
397
38K
Dave Lee@mega__d·
@pmitu Agent status: very busy, very autonomous, not very done
0
0
0
8
Paul Mit@pmitu·
Everyone's building AI agents. Nobody's building AI agents that actually work.
346
66
1K
36.7K
Dave Lee@mega__d·
@fchollet A lot of current agent systems still treat history as “more context to stuff into the window,” but long-horizon tasks seem to require a much more structured way to represent trajectory, decisions, intent, and operational context over time.
0
0
0
15
François Chollet@fchollet·
Most human tasks are not Markovian: the optimal next action cannot be determined solely by looking at the current state. It depends heavily on the past trajectory, the original intent, and context constraints. An agent that cannot compress and track its past trajectory with absolute fidelity is maybe 20% as useful as one that can.
90
70
963
54.4K
Dave Lee@mega__d·
Strongly agree with this framing. A lot of people still think the main problem is prompting, but once agents run for long periods in production, the harder part becomes the environment around them. Constraints, feedback loops, evals, routing, memory, tool behavior. All of these start to matter much more.
François Chollet@fchollet

A mental model for working with coding agents is that they're blind squirrels running into a maze and bumping into walls. You must place the walls (verifiable constraints) strategically so that they end up in the general region you want them in.

0
0
0
27
Dave Lee@mega__d·
@garrytan The memory layer part feels important. A lot of leverage probably comes from systems carrying forward what they learn through repeated execution.
0
0
1
502
Garry Tan@garrytan·
The reason why I release my X articles about AI agents (fat skill fat code thin harness) and GStack and GBrain is that we, yes you and I, can have *PROCESS POWER*, which is the one super powerful specific moat that anyone can create for themselves. The agent helps you do it.
Garry Tan tweet media
Taylor Pearson@TaylorPearsonMe

I spent some time going through Garry Tan's GBrain. I want to pull out what I see as the general form factors and what's interesting there as someone who is non-technical and doesn't work in VC. I think a lot of people are converging on the same set of 5 core form factors and they represent something of the natural next progression of how to use agentic AI tools like Codex/Claude Code/Hermes/OpenClaw/etc. x.com/garrytan/statu…

1. Skills. This is the most natural starting point for pretty much everyone. People build these without being told to because they're a familiar shape. I thought of them like an SOP, a documented procedure for doing something. The user supplies what, the skill supplies the how.

Tan's framing is that a skill works like a method call. In programming, a method call is the syntax for invoking a procedure with arguments. The same code runs every time. The arguments are what vary: what data, what question, what target. The same process_invoice function handles every invoice in the system, not just the one it was first written for. A skill is the same shape. The seven steps of a skill called "/investigate" don't change. The parameters do: a TARGET (who or what to investigate), a QUESTION (what you're trying to figure out), a DATASET (where to look). Point it at a medical whistleblower case and you get a research analyst. Point it at SEC filings and you get a forensic investigator. Same file, same seven steps, the world supplies the difference.

This is a different form factor from a traditional SOP. Most SOPs are written for a specific job: "Process Accounts Payable." One procedure per use case. A skill is written abstractly enough that the same procedure handles a family of cases. One well-built skill can do the work of dozens of SOPs because the case-specific detail moves out of the document and into the parameters. Depending on how you are using them, some skills are closer to SOPs, others to method calls.

2. Thin harness. The model (Opus, GPT-5.5, etc.) is the raw intelligence. The harness (Claude Code, Codex CLI, Hermes, OpenClaw) is what gives the model hands. They loop, read and write files, manage context, enforce safety. About 200 lines of code at the core.

Garry notes the mistake most people make (he and I included) is to keep loading more stuff into the harness itself. I ended up with 100 tool definitions and a bunch of MCP servers. The result is that context window fills up with descriptions of tools the model doesn't need for the current task. The model gets confused about which to use. Latency goes up, accuracy goes down. Context rot.

3. Resolvers. The solution to context rot is a routing table. A resolver maps "task type X just came in" to "fire skill Y." When you have five skills, you don't need one. When you have a hundred, the descriptions blur together and the model fails to invoke the skill at the right time. The resolver replaces ambient pattern-matching with explicit rules. Tan also runs something like a resolver for files: a separate routing table that decides where the output of a skill should land in the filesystem. Same audit-and-route shape applied to a different problem. The output ends up in the right folder reliably rather than wherever the model guesses.

Skillify is his companion idea: a quality loop that turns one-off skills into permanent infrastructure. The 10-step version Tan describes includes a contract, deterministic code where code can do the job, unit tests, integration tests, LLM-as-judge evals, resolver entry, an audit script that flags skills with no path to invocation, and an end-to-end smoke test. The test is simple. If you have to ask the model the same thing twice, you failed.

4. Latent vs. deterministic. Be thoughtful about which work lives where. The LLM is excellent at judgment, synthesis, pattern recognition, reading between the lines. It is bad at arithmetic, combinatorial optimization, anything that needs the same answer every time. LLMs are fundamentally probabilistic and shouldn't be used when a deterministic solution will do.

Most non-technical people under-use the deterministic side. The default instinct is to throw everything at the model. If you can do something deterministically, you almost certainly should. And you don't need to be a programmer to do it. The model can write the code for you. The discipline is to ask, every time, whether code could handle this reliably for free, and to actually have the model write that code when the answer is yes.

5. Memory. The system needs some form of memory to be useful. I'm not sure what the right form is, and a lot of people are building it different ways: vector embeddings with semantic similarity, knowledge graphs, hybrid stores. Tan's approach is the same as mine: just a folder of markdown files. He has one page per person, one page per company, one page per concept. Each page has compiled truth on top (the current best understanding, rewritten as new evidence arrives) and an append-only timeline below.

A few things follow from the markdown choice. The file is the system of record, not an export. You can open it in VS Code, edit it by hand, and the agent picks up the changes. Typed relationships (works_at, invested_in, founded, attended, advises) get extracted via regex on every write, so the knowledge graph wires itself without spending tokens. This particular schema makes sense for his job, but should probably be customized depending on what you do.

A signal detector runs in the background. Mention someone once and they get a stub page; three mentions across sources and web enrichment fires; after a meeting, the full pipeline runs. An overnight dream cycle scans conversations, enriches stale entities, and fixes broken citations. The base is text. Everything on top is cheap and composable.

There is more under the hood, but I think those are the broad strokes which I feel are more or less universally useful approaches.
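
A rough sketch of the resolver idea from item 3: an explicit routing table from task type to skill, plus an audit that flags skills with no path to invocation. This is illustrative only. The task types, the skill catalog, and the function names are hypothetical and are not taken from GBrain or any tool named in the thread.

```python
from typing import Optional

# Hypothetical routing table: task type -> skill name (illustrative names only).
ROUTES: dict[str, str] = {
    "investigate": "/investigate",
    "year_end": "/year-end-review",
    "audit": "/audit-prep",
}

# Hypothetical catalog of installed skills; one deliberately has no route.
INSTALLED_SKILLS: set[str] = {
    "/investigate",
    "/year-end-review",
    "/audit-prep",
    "/quarterly-estimate",
}


def resolve(task_type: str) -> Optional[str]:
    """Return the skill for a task type via an explicit rule, or None."""
    return ROUTES.get(task_type)


def unreachable_skills() -> list[str]:
    """Audit: list installed skills that no route can ever invoke."""
    return sorted(INSTALLED_SKILLS - set(ROUTES.values()))
```

With these assumed tables, `resolve("audit")` picks `/audit-prep`, an unknown task type returns `None` instead of a guess, and the audit reports `/quarterly-estimate` as a skill that no rule reaches.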
I had maybe half of this architecture already. I hadn't hit the scale where a real resolver was necessary, but I'm there now and just did a little refactor to make my setup model agnostic and with a built-in resolver. The signal detector and overnight dream cycle running automatic enrichment in the background is the main piece I haven't built yet and want to try and add.

I suspect that the convergence across people building these is a signal that the form is generally (though probably not universally) useful. Even though implementation details vary in ways that matter, the general form seems to be coming up for many people.

The question I have been asking is: how do you use AI to build sustainable competitive advantage? Everyone is excited about vibe-coded apps and one-shot prompts (which is 100% super cool). This is how I started playing with things and it got me hooked, but the equilibrium price of anything you can build with a one-shot prompt is the token cost to build it (which is a few cents). Like the person who copied My Fitness Pal and made a million dollars selling it for half the cost is awesome. But, someone else is just going to copy that and sell it for half again and the cycle keeps going until there's no margin there.

What's actually durable is some form of process power implicit in the architecture above in Hamilton Helmer's 7 Powers sense. 7 Powers names the seven structural conditions that let a business sustain above-market margins over time. Anything not rooted in one of those powers gets competed away.

Five of Helmer's seven powers are essentially closed doors for SMBs and early-stage companies. Scale economies require scale. Network economies and Switching costs can be developed but require building a big base. Cornered resources usually mean patents or similar that are not typical to companies. Branding usually takes a decade and you can't shortcut it. The two remaining ones are counter-positioning and process power. Counter-positioning (a model an incumbent can't mimic without cannibalizing their existing business) is sometimes available but not always.

That leaves process power. And a well-built AI system is exactly the kind of artifact that generates it. It's the same kind of work as building really good SOPs or proprietary software. The procedures are codified, the cases are parameterized, the deterministic layer underneath is fast and reliable, and the memory layer carries forward what you've learned. It enables something like productized services on steroids: You can perform a service or supply a product at lower cost or higher quality because the work is structured.

Imagine an accountant who builds this out. Memory layer: one folder with markdown files per client with compiled truth (entity structure, year-over-year tax positions, ongoing audits) and a timeline (meetings, decisions, what changed). There are some skills like /year-end-review, /quarterly-estimate, /audit-prep, same procedure parameterized for each client. There is a deterministic layer: tax tables, depreciation schedules, IRS publications, client tax return histories, etc. Then some form of diarization or dream cycle. E.g. overnight, the system flags a partner whose K-1 distribution dropped 40% without a strategy change, or notices that one client's home-office deduction structure is portable to another client (the structure travels, identities stay where they belong). She charges a small premium, handles more clients per year, and her competitors can't replicate it because the structure didn't exist when she started building it. The artifact itself is a folder of markdown files, but the lines in each file are downstream of lots of thoughtful testing and building to make process power.
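
To make the memory layer described above more concrete, here is a minimal sketch of one page with a rewritable "compiled truth" section on top and an append-only timeline below, plus regex extraction of typed relationships. The `key: value` line format, the `Page` class, and its method names are assumptions made for illustration; the thread does not specify how GBrain stores or parses these.

```python
import re
from dataclasses import dataclass, field

# Assumed line format "works_at: Acme Corp"; the relationship names come from
# the post, the format itself is hypothetical.
RELATION_RE = re.compile(
    r"^(works_at|invested_in|founded|attended|advises):\s*(.+)$",
    re.MULTILINE,
)


@dataclass
class Page:
    """One markdown page: rewritable summary on top, append-only log below."""

    title: str
    compiled_truth: str = ""
    timeline: list[str] = field(default_factory=list)

    def rewrite_truth(self, text: str) -> None:
        # The current best understanding is replaced wholesale.
        self.compiled_truth = text

    def append_event(self, date: str, note: str) -> None:
        # Timeline entries are only ever appended, never edited.
        self.timeline.append(f"- {date}: {note}")

    def relations(self) -> list[tuple[str, str]]:
        # Typed relationships are pulled out with a regex, no model call.
        return [(k, v.strip()) for k, v in RELATION_RE.findall(self.compiled_truth)]

    def to_markdown(self) -> str:
        # Render the page as plain markdown so it stays hand-editable.
        lines = [f"# {self.title}", "", self.compiled_truth, "", "## Timeline", *self.timeline]
        return "\n".join(lines)
```

Because the rendered page is plain text, it can be edited by hand and re-parsed with the same regex, which is the property the post attributes to the markdown choice.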

37
55
644
104.8K
Dave Lee@mega__d·
AI made it much easier to ship things quickly. What still feels rare is teams that can take everything happening after launch. Usage patterns, failures, weird edge cases, repeated behaviors. Then turn that back into a better system. A lot of companies are shipping faster now. Not many seem to be evolving the same way yet.
0
0
0
41
Dave Lee@mega__d·
Really resonate with this. Optimization during development is important, but what feels increasingly critical is building systems that continuously learn from real-world production data, trajectories, and behavior patterns over time. The sim2real gap becomes much more visible once agents operate in open-ended production environments.
0
0
0
33
Viv@Vtrivedy10·
Environment Engineering, the sim2real gap, & Continual Learning from Production Traces

As we’re building more complicated agents that work across our products, we’re becoming more aware of how our Eval Environment reflects the real world.

Evals are measurements of how our agents should behave in Production. But at their core, evals are fundamentally a simulation of what we expect to happen.

The gap in what we set up in Evals vs the distribution of what actually happens is the sim2real gap. This is often brought up in robotics where the simulation training data doesn’t exactly map to the real world so robots struggle with moving and taking real world options.

The goal of a good environment is to, as accurately as possible, model things that the agent will encounter. These are things like installed tools, the distribution of inputs that users will provide, the results that come from executing tools.

The more we reduce this gap in testing, the more we can trust that our evals will reflect user/agent experience in prod.

A priori this is an incredibly hard problem to solve - we don’t know how users will use our product until they use it. This is why there’s such large value in using Production Traces to generate high fidelity evals and environments to accurately model how real users interact with your agents.

This data helps you both:
1. close the sim2real gap
2. find agent error modes that new evals can help close as you hill-climb them

Part of the Continual Learning loop is research to efficiently use data that’s generated by agents to “update” them over time. Part of the update step is adjusting the agent harness, model weights, external memory stores, creating evals/environments.

It’s a hard problem but one of the most exciting research questions of our time.
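
One rough way to put a number on the gap between what evals cover and what production traces contain is to compare the two category distributions. The sketch below uses total variation distance, which is a choice made here for illustration and not something the post prescribes. The task categories and sample data are hypothetical.

```python
from collections import Counter


def distribution(labels: list[str]) -> dict[str, float]:
    """Turn a non-empty list of category labels into relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance: 0.0 means identical, 1.0 means disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical task categories seen in evals versus production traces.
eval_tasks = ["search", "search", "edit", "edit"]
prod_tasks = ["search", "edit", "edit", "deploy"]

gap = total_variation(distribution(eval_tasks), distribution(prod_tasks))
```

In this made-up data the production traces include a "deploy" category that the eval set never exercises, which is the kind of mismatch that traces are meant to surface.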
Viv tweet media
5
5
64
3.7K