
mary
@howdymary
data @marketmotionxyz | prev galaxy, brookings, schwarzman, columbia | sidequests: @jobanxiety

Where are the consumer app founders in SF?

where are the consumer app founders in sf? i lowkey have three iOS apps live in the app store right now but haven't built up the gumption to start marketing them yet. looking for buddies tbh

TLDR on Meta Harnesses and a practical implementation that I built for Hermes

Hermes is an agent runtime (operating system) around a model (the brain). Meta harnesses are a way to improve the operating system, not the brain itself.

Rather than retraining the model, the meta harness continuously learns better ways to run the model by searching over runtime policy (prompt additions, tool ordering, stop heuristics, bootstrap steps, context management, etc.) to discover what makes the agent perform better on verifiable tasks.

A lot of coding-agent failure centers on the runtime wasting time and tokens rediscovering basics, or using the wrong tools / wrong context. At the moment, Hermes does not have a research loop that treats the benchmark harness itself as something to optimize, which is the gap this implementation addresses.

This setup uses the meta harness as a research layer around benchmark harnesses, not the full product runtime. It splits Hermes into two layers:
- hermes-agent owns the inner runtime (candidate protocol, benchmark integration, loop hooks, and archive writing)
- hermes-agent-metaharness owns the outer loop (candidate evaluation, archive analysis, baseline reuse, frontier tracking, and search)

This searches over code and policies that impact agent performance, such as:
- what bootstrap context to gather
- which tools to expose, and in what order
- how many turns to allow
- which baseline to compare against
- how to rank candidate harnesses

Side note: you may have seen @Teknium previously release self-evolution; the distinction here is that self-evolution is intended to write better instructions for the agent, while metaharness is intended to run the agent more efficiently on benchmarks.

Please try it out!
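The outer loop described here (mutate a runtime policy, evaluate candidates, archive everything, track the frontier) can be sketched roughly as follows. All names and the mutation/evaluation details are illustrative assumptions, not the actual hermes-agent-metaharness API:

```python
import random
from dataclasses import dataclass, replace

# Illustrative sketch only: a runtime policy covering a few of the searchable
# dimensions named above (prompt additions, tool ordering, turn budget,
# bootstrap steps). A real implementation searches over code as well.
@dataclass(frozen=True)
class RuntimePolicy:
    prompt_additions: tuple = ()
    tool_order: tuple = ("read", "grep", "edit", "run")
    max_turns: int = 30
    bootstrap_steps: tuple = ("list_files",)

def mutate(policy: RuntimePolicy, rng: random.Random) -> RuntimePolicy:
    """Propose a nearby candidate by tweaking one dimension of the policy."""
    choice = rng.choice(["turns", "tools", "prompt"])
    if choice == "turns":
        return replace(policy, max_turns=max(5, policy.max_turns + rng.choice([-5, 5])))
    if choice == "tools":
        tools = list(policy.tool_order)
        rng.shuffle(tools)
        return replace(policy, tool_order=tuple(tools))
    return replace(policy, prompt_additions=policy.prompt_additions + ("avoid re-reading files",))

def search(evaluate, steps: int = 20, seed: int = 0):
    """Hill-climb over runtime policy; `evaluate` scores a policy on a verifiable benchmark."""
    rng = random.Random(seed)
    frontier = RuntimePolicy()
    best = evaluate(frontier)
    archive = [(frontier, best)]        # every candidate is archived for later analysis
    for _ in range(steps):
        cand = mutate(frontier, rng)
        score = evaluate(cand)
        archive.append((cand, score))
        if score > best:                # frontier tracking: keep only improving candidates
            frontier, best = cand, score
    return frontier, best, archive
```

In practice `evaluate` would run the agent on a benchmark suite under the candidate policy; here it can be any scoring function, which is what makes the loop testable in isolation.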



Meta Harnesses is Autoresearch on steroids.

Something I've been exploring recently is getting long-running agents to hill-climb on a verifiable task and continuously improve without my intervention. Karpathy's Autoresearch did this pretty well on specific tasks, but this weekend I tried Meta Harnesses, which moves one level of abstraction up.

What does Meta Harness do? Autoresearch can be used in a harness like Claude Code / Codex to generate experiments to try, evaluate results, and continue looping. Meta Harness generates the harness itself, optimizing it on a task or set of tasks. Here, we define a harness as "a single-file Python program that modifies task-specific prompting, retrieval, memory, and orchestration logic". The idea is that LLMs are very powerful today, but to harness [pun intended] their power, you need to give them the right prompts and context. Meta Harnesses automates coming up with the right prompts and the right way to retrieve context to solve a problem.

Where did this idea come from? This comes from a paper out of Stanford, by the author of DSPy, published last week. The paper shows fantastic performance on 3 tasks: text classification, math reasoning (IMO-level problems), and coding (Terminal Bench 2.0), far outperforming traditional harnesses. The discovered harnesses are interesting: the math one, for example, splits the logic into different categories (Combinatorics, Geometry, Number Theory, Algebra) and prompts and handles context differently for each. The coding harness, amongst other things, pre-processes the tools available in the environment to save exploratory turns.

When should you use and not use it? Meta Harnesses seem pretty useful for tackling a specific but wide set of problems where the result is verifiable. In contrast, when I tried it on a specific task like chess, it arbitrarily divided the problem into separate tasks (opening, midgame, endgame) and created a different approach for each. This "works" but isn't really clean, because we believe there should be one approach that handles all three. It does far better on things like examinations (JEE, Gaokao), where splitting problems into categories and tackling each with a different strategy makes sense.

This paper covers a pretty light version of what a harness means. In the future, we can split tasks into harnesses that have access to specific kinds of data, specific toolchains, and various models to get even better results. Overall, a pretty cool applied-AI approach to hill-climb a verifiable task in a specific domain with variety within the problem space.
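To make "a single-file Python program that modifies task-specific prompting" concrete, here is a toy sketch in the shape of the discovered math harness: route a problem into a category, then apply a category-specific prompt. The keywords and prompt text are invented for illustration; a real discovered harness is found by search, not hand-written:

```python
# Hypothetical sketch of a category-routing harness. The categories match
# those described for the math harness (Combinatorics, Geometry, Number
# Theory, Algebra); keywords and prompts are illustrative assumptions.

CATEGORY_KEYWORDS = {
    "Combinatorics": ["count", "ways", "arrange", "choose"],
    "Geometry": ["triangle", "circle", "angle", "area"],
    "Number Theory": ["divisible", "prime", "modulo", "integer"],
    "Algebra": ["polynomial", "equation", "inequality", "function"],
}

CATEGORY_PROMPTS = {
    "Combinatorics": "Set up a bijection or recurrence before computing.",
    "Geometry": "Assign coordinates and restate the problem algebraically.",
    "Number Theory": "Work modulo small primes to constrain the answer first.",
    "Algebra": "Name all unknowns and write the governing equations explicitly.",
}

def classify(problem: str) -> str:
    """Pick the category whose keywords appear most often in the problem text."""
    scores = {
        cat: sum(problem.lower().count(kw) for kw in kws)
        for cat, kws in CATEGORY_KEYWORDS.items()
    }
    return max(scores, key=scores.get)

def build_prompt(problem: str) -> str:
    """Assemble the category-specific prompt the harness would send to the model."""
    cat = classify(problem)
    return f"[{cat}] {CATEGORY_PROMPTS[cat]}\n\nProblem: {problem}"
```

The point of the meta-harness search is that tables like these, and the routing logic itself, are discovered and refined against a verifiable benchmark rather than authored by hand.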

💡 What if the most important learning space in LLMs isn't in the weights but in the context itself?

Most people hear this and say: "oh, so just prompt optimisation." That framing misses what's actually going on. The context window is not just input. It is a parameter space where small token changes can completely shift the model's behaviour. Unlike weights, context can be built up incrementally, reversed trivially, inspected and interpreted; small token changes can dramatically shift the output distribution, allowing rapid adaptation.

❌ This isn't context engineering (humans designing prompts, instructions, data or tool access).
❌ And it's not in-context learning, where the model learns from examples in a single forward pass.
✳️ This is autonomous learning that happens directly in the context itself. The model can rewrite it, improve it over time, and optimise it using feedback, error and search. 🔄
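Treating the context as a parameter space optimised by feedback and search can be sketched as a greedy loop: propose a small, reversible token change, keep it only if a feedback signal improves. The function names and the edit/feedback mechanics here are illustrative assumptions, with the feedback function standing in for real task verification:

```python
import random

# Toy sketch of learning in the context rather than the weights: the model is
# frozen, and only the context string is searched over using a feedback
# signal. Every accepted edit is an incremental, inspectable, reversible
# change to the context, mirroring the properties claimed above.

def optimise_context(initial_context: str, edits: list, feedback, steps: int = 50, seed: int = 0) -> str:
    """Greedy search over context variants; keep any edit that raises feedback."""
    rng = random.Random(seed)
    context, best = initial_context, feedback(initial_context)
    for _ in range(steps):
        candidate = context + "\n" + rng.choice(edits)  # small token change
        score = feedback(candidate)
        if score > best:                                # accept only on improvement
            context, best = candidate, score
    return context
```

Because rejected candidates are simply discarded, the search never degrades the context: the feedback score of the result is always at least that of the starting context.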
