
mary
@howdymary
data @marketmotionxyz | prev galaxy, brookings, schwarzman, columbia | sidequests: @jobanxiety

Where are the consumer app founders in SF?

where are the consumer app founders in sf? i lowkey have three iOS apps live in the app store right now but haven't built up the gumption to start marketing them yet. looking for buddies tbh

TLDR on Meta Harnesses and a practical implementation that I built for Hermes

Hermes is an agent runtime (operating system) around a model (the brain). Meta harnesses are a way to improve the operating system, not the brain itself.

Rather than retraining the model, the meta harness continuously learns better ways to run the model by searching over runtime policy (prompt additions, tool ordering, stop heuristics, bootstrap steps, context management, etc.) to discover what makes the agent perform better on verifiable tasks.

A lot of coding-agent failure centers on the runtime wasting time and tokens rediscovering basics, or using the wrong tools / wrong context. At the moment, Hermes does not have a research loop that treats the benchmark harness itself as something to optimize, which is the gap this implementation addresses.

This setup uses the meta harness as a research layer around benchmark harnesses, not the full product runtime. It splits Hermes into two layers:
- hermes-agent owns the inner runtime (candidate protocol, benchmark integration, loop hooks, and archive writing)
- hermes-agent-metaharness owns the outer loop (candidate evaluation, archive analysis, baseline reuse, frontier tracking, and search)

This searches over code and policies that impact agent performance, such as:
- what bootstrap context to gather
- which tools to expose, and in what order
- how many turns to allow
- which baseline to compare against
- how to rank candidate harnesses

Side note: you may have seen @Teknium previously release self-evolution; the distinction here is that self-evolution is intended to write better instructions for the agent, while metaharness is intended to run the agent more efficiently on benchmarks.

Please try it out!
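The outer loop described here (mutate a runtime policy, evaluate candidates, archive everything, track the frontier) can be sketched roughly as follows. All names and the mutation/evaluation details are illustrative assumptions, not the actual hermes-agent-metaharness API:

```python
import random
from dataclasses import dataclass, replace

# Illustrative sketch only: a runtime policy covering a few of the searchable
# dimensions named above (prompt additions, tool ordering, turn budget,
# bootstrap steps). A real implementation searches over code as well.
@dataclass(frozen=True)
class RuntimePolicy:
    prompt_additions: tuple = ()
    tool_order: tuple = ("read", "grep", "edit", "run")
    max_turns: int = 30
    bootstrap_steps: tuple = ("list_files",)

def mutate(policy: RuntimePolicy, rng: random.Random) -> RuntimePolicy:
    """Propose a nearby candidate by tweaking one dimension of the policy."""
    choice = rng.choice(["turns", "tools", "prompt"])
    if choice == "turns":
        return replace(policy, max_turns=max(5, policy.max_turns + rng.choice([-5, 5])))
    if choice == "tools":
        tools = list(policy.tool_order)
        rng.shuffle(tools)
        return replace(policy, tool_order=tuple(tools))
    return replace(policy, prompt_additions=policy.prompt_additions + ("avoid re-reading files",))

def search(evaluate, steps: int = 20, seed: int = 0):
    """Hill-climb over runtime policy; `evaluate` scores a policy on a verifiable benchmark."""
    rng = random.Random(seed)
    frontier = RuntimePolicy()
    best = evaluate(frontier)
    archive = [(frontier, best)]        # every candidate is archived for later analysis
    for _ in range(steps):
        cand = mutate(frontier, rng)
        score = evaluate(cand)
        archive.append((cand, score))
        if score > best:                # frontier tracking: keep only improving candidates
            frontier, best = cand, score
    return frontier, best, archive
```

In practice `evaluate` would run the agent on a benchmark suite under the candidate policy; here it can be any scoring function, which is what makes the loop testable in isolation.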



Meta Harnesses is Autoresearch on steroids.

Something I've been exploring recently is getting long-running agents to hill-climb on a verifiable task and continuously improve without my intervention. Karpathy's Autoresearch did this pretty well on specific tasks, but this weekend I tried Meta Harnesses, which moves one level of abstraction up.

What does Meta Harness do? Autoresearch can be used in a harness like Claude Code / Codex to generate experiments to try, evaluate results, and continue looping. Meta Harness generates the harness itself, optimizing it on a task or set of tasks. Here, we define a harness as "a single-file Python program that modifies task-specific prompting, retrieval, memory, and orchestration logic". The idea is that LLMs are very powerful today, but to harness [pun intended] their power, you need to give them the right prompts and context. Meta Harnesses automates coming up with the right prompts and the right way to retrieve context to solve a problem.

Where did this idea come from? This comes from a paper out of Stanford, by the author of DSPy, published last week. The paper shows fantastic performance on 3 tasks: text classification, math reasoning (IMO-level problems), and coding (Terminal Bench 2.0), far outperforming traditional harnesses. The discovered harnesses are interesting: the math one, for example, splits the logic into different categories (Combinatorics, Geometry, Number Theory, Algebra) and prompts and handles context differently for each. The coding harness, amongst other things, pre-processes the tools available in the environment to save exploratory turns.

When should you use and not use it? Meta Harnesses seem pretty useful for tackling a specific but wide set of problems where the result is verifiable. In contrast, when I tried it on a specific task like chess, it arbitrarily divided the problem into separate tasks (opening, midgame, endgame) and created a different approach for each. This "works" but isn't really clean, because we believe there should be one approach that handles all three. It does far better on things like examinations (JEE, Gaokao), where splitting problems into categories and tackling each with a different strategy makes sense.

This paper covers a pretty light version of what a harness means. In the future, we can split tasks into harnesses that have access to specific kinds of data, specific toolchains, and various models to get even better results. Overall, a pretty cool applied-AI approach to hill-climb a verifiable task in a specific domain with variety within the problem space.
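To make "a single-file Python program that modifies task-specific prompting" concrete, here is a toy sketch in the shape of the discovered math harness: route a problem into a category, then apply a category-specific prompt. The keywords and prompt text are invented for illustration; a real discovered harness is found by search, not hand-written:

```python
# Hypothetical sketch of a category-routing harness. The categories match
# those described for the math harness (Combinatorics, Geometry, Number
# Theory, Algebra); keywords and prompts are illustrative assumptions.

CATEGORY_KEYWORDS = {
    "Combinatorics": ["count", "ways", "arrange", "choose"],
    "Geometry": ["triangle", "circle", "angle", "area"],
    "Number Theory": ["divisible", "prime", "modulo", "integer"],
    "Algebra": ["polynomial", "equation", "inequality", "function"],
}

CATEGORY_PROMPTS = {
    "Combinatorics": "Set up a bijection or recurrence before computing.",
    "Geometry": "Assign coordinates and restate the problem algebraically.",
    "Number Theory": "Work modulo small primes to constrain the answer first.",
    "Algebra": "Name all unknowns and write the governing equations explicitly.",
}

def classify(problem: str) -> str:
    """Pick the category whose keywords appear most often in the problem text."""
    scores = {
        cat: sum(problem.lower().count(kw) for kw in kws)
        for cat, kws in CATEGORY_KEYWORDS.items()
    }
    return max(scores, key=scores.get)

def build_prompt(problem: str) -> str:
    """Assemble the category-specific prompt the harness would send to the model."""
    cat = classify(problem)
    return f"[{cat}] {CATEGORY_PROMPTS[cat]}\n\nProblem: {problem}"
```

The point of the meta-harness search is that tables like these, and the routing logic itself, are discovered and refined against a verifiable benchmark rather than authored by hand.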

💡 What if the most important learning space in LLMs isn't in the weights but in the context itself?

Most people hear this and say: "oh, so just prompt optimisation." That framing misses what's actually going on. The context window is not just input. It is a parameter space where small token changes can completely shift the model's behaviour. Unlike weights, context can be built up incrementally, reversed trivially, inspected and interpreted; small token changes can dramatically shift the output distribution, allowing rapid adaptation.

❌ This isn't context engineering (humans designing prompts, instructions, data or tool access).
❌ And it's not in-context learning, where the model learns from examples in a single forward pass.
✳️ This is autonomous learning that happens directly in the context itself. The model can rewrite it, improve it over time, and optimise it using feedback, error and search. 🔄
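Treating the context as a parameter space optimised by feedback and search can be sketched as a greedy loop: propose a small, reversible token change, keep it only if a feedback signal improves. The function names and the edit/feedback mechanics here are illustrative assumptions, with the feedback function standing in for real task verification:

```python
import random

# Toy sketch of learning in the context rather than the weights: the model is
# frozen, and only the context string is searched over using a feedback
# signal. Every accepted edit is an incremental, inspectable, reversible
# change to the context, mirroring the properties claimed above.

def optimise_context(initial_context: str, edits: list, feedback, steps: int = 50, seed: int = 0) -> str:
    """Greedy search over context variants; keep any edit that raises feedback."""
    rng = random.Random(seed)
    context, best = initial_context, feedback(initial_context)
    for _ in range(steps):
        candidate = context + "\n" + rng.choice(edits)  # small token change
        score = feedback(candidate)
        if score > best:                                # accept only on improvement
            context, best = candidate, score
    return context
```

Because rejected candidates are simply discarded, the search never degrades the context: the feedback score of the result is always at least that of the starting context.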
