Hrishi

4.1K posts


@hrishioa

Trying to build systems of lasting value at https://t.co/JoR2nVEIRH. Previously CTO, Greywing (YC W21). Chop wood carry water.

Long form thoughts 🫱 · Joined June 2013
2.7K Following · 11.5K Followers

Pinned Tweet

Hrishi @hrishioa
I wasn't sure if we were going to share this, because knowing what doesn't work is often more valuable than seeing what worked. That - and being nervous about sharing your failures. Here's a technical retrospective on our 2025: southbridge.ai/blog/25-ways-n…
0 replies · 2 reposts · 30 likes · 3.1K views

Hrishi @hrishioa
Skills are definitely a move in the right direction, don't get me wrong - but it's when we try to distribute them that we run into the problem.

A custom skill you made for yourself for a custom use-case likely has all the things that make for good tooling: context-dependent, very little bloat, and likely field-tested and improved. A general-purpose skill you get from a registry is often worse than no skill at all - since there's no way (same as with MCPs) to express what specifically you need, or which parts of the skill are relevant to you.

Better registries and search would be a good start, I think, but more opinionated structures would be better. The less that is possible, the easier both sides (builder and user) can reason about what is in a skill.
2 replies · 0 reposts · 1 like · 39 views

Tim Johnson @timmyj1023
Hmm, isn't that what skills are though? Custom instructions and scripts, without the bloat associated w/ MCP.

Is the challenge against skill packages that are being put out by vendors / SaaS platforms for public consumption as a complement to their API? Who owns the 'skill' in that instance? Seems fairly clear to me it's the vendor. But what if their skill becomes out of date compared to their API? Well then... just point your agent at their documentation any time you need to code using their API. If their documentation is out of date then... *shrug* reverse engineer it from their app using an agent in your browser client?

I'm not sure what the next primitive would even be after skills. What's the alternative?
1 reply · 0 reposts · 3 likes · 71 views

Hrishi @hrishioa
Skills will likely run into the same critical problem that MCPs did.

Why did MCPs fail? They were a wonderful idea, but the protocol was too open. Too many ways to do things means no one's in charge. Who's responsible for an MCP? Is it the service? The author? You? Who's running the MCP?

What this meant is that we quickly got to a state where you're more likely to hit a bloated (possibly dangerous) MCP that hasn't had commits in months than a well-managed one. The well-managed MCPs - as it turns out - belong to companies that have well-managed API surfaces anyway, replete with nice llms.txts. MCPs eventually came around to 'just write custom scripts'. Because there was no way for an MCP client/searcher/connoisseur to say 'this is who I am, this is what I want', an MCP creator has to service everybody for everything.

Skills have the same problem - the surface is too big. This makes them easy to vibe (which means slop that does nothing x.com/hrishioa/statu…) and extremely easy to make malicious. There are also no ways to unload skills and clean context - almost every failed execution I've seen on a coding agent these days is because a skill or memory activated at the wrong time.

This means that we are now quickly - once again - at a point where you're more likely to hit a bad skill than a useful one. Unless we fix this, skills will come around to 'just write custom instructions and scripts'.
2 replies · 1 repost · 11 likes · 714 views

Hrishi @hrishioa
@tryingET Most are interleaved calls - and this was just a choice made in the braintrust uploader to reduce data load - if these are the ones I think you mean :)
0 replies · 0 reposts · 0 likes · 20 views

tryingEveryThing @tryingET
@hrishioa looks good. One question I have: why are the LLM inputs only shown for some instances?
1 reply · 0 reposts · 1 like · 30 views

Hrishi @hrishioa
New release! Harness engineering made easy: you can now switch harnesses in hankweave with a few characters.

"sonnet" will run your prompts and code inside the Agents SDK. "gpt-5.3-codex" will run it in Codex. "pi/google/gemini-3.1-flash" or "opencode/cerebras/glm4.7" - exactly what you'd expect. Unified input interface for defining prompts, behavior, etc. Unified logs, tracing and control interface.

The underlying logic that led us here has become inescapable:

♟️ Harness engineering is hilariously non-trivial. Getting shell calls, sandboxing, even file editing working over hundreds of loops isn't easy. ↠ Don't build your own, use the SDKs.

♟️ Models function best in different harnesses. Claude is best in the Agents SDK. Codex is best in Codex. Gemini is best in @opencode. @badlogicgames' pi is the best lightweight embeddable harness for cloud work. ↠ You need to support more than one.

♟️ Harnesses have no unified input/output. ACP - while an awesome protocol - is rarely fully supported. ↠ You need a translation layer. Declarative inputs (hanks) -> Harnesses -> NDJSON log-based output.

For fun, here's a braintrust run (limited time, before it gets deleted) of a hank that uses all four harnesses - even loops them for fun, and adds budgets: braintrust.dev/app/sb/p/hw-tr…
Hrishi tweet media
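The model-string-to-harness routing described above can be sketched roughly like this. To be clear, this is an illustrative guess at the mechanism, not hankweave's actual API - `resolveHarness`, `logLine`, and the `HarnessId` names are all made up for the example:

```typescript
// Hypothetical sketch only: these names are not from hankweave itself.
type HarnessId = "agents-sdk" | "codex" | "pi" | "opencode";

// Route a model string to a harness, mirroring the "switch with a few
// characters" idea: prefixed names pick a harness explicitly, and bare
// model names fall through to a sensible default.
function resolveHarness(model: string): HarnessId {
  if (model.startsWith("pi/")) return "pi";
  if (model.startsWith("opencode/")) return "opencode";
  if (model.includes("codex")) return "codex";
  return "agents-sdk"; // e.g. "sonnet"
}

// A unified NDJSON-style log record: one JSON object per line, so logs
// from every harness can share one tracing/control interface.
function logLine(harness: HarnessId, event: string): string {
  return JSON.stringify({ ts: Date.now(), harness, event });
}

console.log(resolveHarness("sonnet"));                     // agents-sdk
console.log(resolveHarness("gpt-5.3-codex"));              // codex
console.log(resolveHarness("pi/google/gemini-3.1-flash")); // pi
console.log(logLine("codex", "run_started"));
```

The appeal of this shape is that the caller only ever supplies a model string and declarative inputs; which SDK actually executes the loop is an implementation detail behind the translation layer.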
6 replies · 4 reposts · 33 likes · 3.5K views

Hrishi @hrishioa
Fun little case in point about real-time connectors: @Calclavia tells me about Cursor Agent over lunch, I go home, run Clausetta and point it at Cursor Agent, and now we can use it in hankweave! Here's the whole run (limited time, hosted on Braintrust) for the terminally curious: braintrust.dev/app/sb/p/hw-tr…
Hrishi tweet media
Hrishi@hrishioa

Clausetta (Clawd + Rosetta) is still one of the coolest things I've seen hanks do: just-in-time connectors between complex code. It's also open-source.

It's a very simple hank, and it makes **any** agent harness accessible and compliant with the same interface spec - our interface spec (which is actually the original Claude interface spec that we never upgraded from). So it's an AI program that automatically shims any agent to speak the same language as Claude Code.

Try it: clone github.com/SouthBridgeAI/… or substitute it into the command below:

bunx hankweave -i ""

It'll try and build a connector from that agent to hankweave. If it completes, you can use the resulting shim to write your own codons to pi, opencode, whatever you'd like. Crazy times we live in.

How it works is simple. There's a codon that sets everything up and a codon that writes documentation at the end, but the main hank is a simple loop between two codons - build and verify. We load in a test suite and a markdown spec as rigs. Within a few loops (even with haiku) you can build some pretty reliable connectors. Our existing shims for gemini cli and codex are generated this way - shims we use almost every day.

Sometimes using and making these shims feels like the future of glue code. Coding agents are some of the hardest things to connect to because of the amount of parallel work, state and side effects involved. How do they deal with sessions? Parallel tool calls? Subagents?

The process we've had with the Clausetta hank has made it actually possible to deploy shims in production - and to regenerate them when an underlying harness updates. It goes like this: we see a problem / harness X gets a new update -> we update the hank and rerun it -> we deploy the shim. Truly spec-driven development - where the hank is the spec.

If you look at Clausetta, you'll notice it's not specific to *any* one agent.
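The build/verify loop described here can be sketched in a few lines. Everything below is illustrative - `buildVerifyLoop`, `Shim`, and `TestResult` are invented names for the example, not Clausetta's or hankweave's actual code:

```typescript
// Illustrative sketch of a build/verify loop like the one described:
// a "build" step proposes a shim, a "verify" step runs the test suite,
// and failures feed back into the next build until tests pass or the
// loop budget runs out. Not hankweave's actual implementation.
type Shim = { code: string };
type TestResult = { passed: boolean; failures: string[] };

function buildVerifyLoop(
  build: (feedback: string[]) => Shim,   // "build" codon: propose a connector
  verify: (shim: Shim) => TestResult,    // "verify" codon: run the test suite
  maxLoops: number                       // budget: bounded, so it always halts
): Shim | null {
  let feedback: string[] = [];
  for (let i = 0; i < maxLoops; i++) {
    const shim = build(feedback);
    const result = verify(shim);
    if (result.passed) return shim;      // converged: this shim is deployable
    feedback = result.failures;          // feed failures into the next attempt
  }
  return null;                           // budget exhausted without passing
}
```

The key property is that the test suite and spec, not the builder, define "done" - which is what makes regenerating a shim after a harness update a rerun rather than a rewrite.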

1 reply · 0 reposts · 10 likes · 1.8K views

Hrishi @hrishioa
@tryingET You're better off importing only what you need, as rigs or prompts, into specific codons :) Actually we have something that can help port things - let me pass it to you later
1 reply · 0 reposts · 0 likes · 11 views

tryingEveryThing @tryingET
@hrishioa The ones I build myself ;) There are multiple extremely useful things. Some swear by codemap, others rely on repoprompt. Me specifically: I have a prompt-vault where the AI picks and chooses the right prompts for the job. This is the one I would like the most, probably.
1 reply · 0 reposts · 1 like · 13 views

John Peng @theRealJohnPeng
@hrishioa like an *ideal* thing for my usecase would be a hankweave SDK to program against lol
1 reply · 0 reposts · 1 like · 11 views

Hrishi @hrishioa
This is THE question - we struggled with it for a while. In the end we made an opinionated call on a few tradeoffs:

The biggest one is reusability. A JSON DSL is significantly more restrictive, but to us it preserved the code/data boundary well enough to have codons (units of agentic work) be reusable across people, companies or tasks. It also means that we can (as we do now) have LLMs edit these DSLs without crossing that boundary.

The second is surface area - integrating into typescript or python creates the exact same problem of diversity that we were fighting at the time. Because you can do almost anything, almost everything will be done - which means that:
- deterministically reasoning about executions and rollbacks before starting (like hankweave's preflight does),
- keeping up test surface area across models, backbones and known behavior, and
- the debugging path (which was the most important thing for us with hankweave)
all become needlessly complex. In hankweave today, if something breaks, there is a known, well-trodden way to fix and test the fix.

The third one - which I think is a little less important now, but we were super concerned about it in June - is auto-recovery. Hankweave is not Turing complete, which makes unrolling executions and automatic budgeting a lot easier.

Remains to be seen if we're wrong - as we accumulate more usage it'll become apparent. The prime philosophy in hankweave has been 'don't build anything that you don't NEED', and so far we haven't needed to move past JSON. Longer-term I think we might build typescript integrations that compile down to the DSL so we get the fun of working with the linter and typechecker back - we've had a few experiments in progress!
John Lam@john_lam

i love the continuous vs. step function analogy. i wonder why you chose to use a JSON DSL in hankweave vs. just having the agent implement it in typescript or python?

in the past we as an industry made this mistake with build systems (see ant and msbuild for good examples of DSLs implemented in XML). jim weirich got this right with rake, where the DSL was embedded in ruby. eventually you need all the affordances of a real programming language - flow control, exception handling, expressions etc.

the anthropic article was interesting but i also wonder why they chickened out and didn't ship the code on top of their agent sdk. in an agentic world, i wonder whether DSLs are still valuable or whether just having them gen the harness dynamically for the specific task using a well defined api surface is the right answer here?
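The "deterministically reasoning about executions before starting" property of a restricted, non-Turing-complete JSON DSL can be shown with a toy example. This is a made-up schema purely for illustration (`Step`, `Pipeline`, and `preflight` are not hankweave's actual names or format):

```typescript
// Toy declarative pipeline DSL: because there are no loops or arbitrary
// code, the entire execution graph can be validated before anything runs -
// the "preflight" property discussed above. Invented schema, not hankweave's.
type Step = { id: string; needs?: string[] };
type Pipeline = { steps: Step[] };

// Returns a list of problems; an empty list means the whole run is
// statically known to be well-formed before we start executing.
function preflight(p: Pipeline): string[] {
  const errors: string[] = [];
  const ids = new Set(p.steps.map((s) => s.id));
  if (ids.size !== p.steps.length) errors.push("duplicate step ids");
  for (const step of p.steps) {
    for (const dep of step.needs ?? []) {
      if (!ids.has(dep)) {
        errors.push(`step ${step.id} needs unknown step ${dep}`);
      }
    }
  }
  return errors;
}
```

In an embedded TypeScript or Python DSL, the equivalent check is undecidable in general - a step's dependencies might be computed at runtime - which is the surface-area tradeoff the reply describes.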

1 reply · 1 repost · 9 likes · 1.3K views

Hrishi @hrishioa
@theRealJohnPeng No worries! I think that's very much a philosophy hankweave was trying to guard against (global context, I mean). The one path to reliable, repeatable runs we found was to reduce blast radius into smaller boxes with repeatedly flushed, local contexts
0 replies · 0 reposts · 0 likes · 12 views

John Peng @theRealJohnPeng
Actually I think my main gripe with hankweave is that I'm treating it as a library, whereas you've built it as a standalone agent runtime. Like, the hank A -> hank B -> hank C model doesn't feel as nice when I want to run some code in a *global context* rather than the more local context that is supported by rigs (sorry, don't mean to complain abt this on a weekend)
2 replies · 0 reposts · 1 like · 18 views

Hrishi @hrishioa
@tryingET Not at the moment - what extensions would you want to use if it did?
1 reply · 0 reposts · 0 likes · 32 views

Hrishi @hrishioa
Try running it yourself at github.com/SouthBridgeAI/… and it'll make you this. Kept it simple (and a little generic) as a demo, but the possibilities are endless now
Hrishi tweet media
0 replies · 0 reposts · 3 likes · 356 views
Hrishi
Hrishi@hrishioa·
Validation will tell you exactly what's running where. @mitchellh thank you for the silent update notifs haha they're super helpful
Hrishi tweet media
1 reply · 0 reposts · 4 likes · 447 views

John Peng @theRealJohnPeng
@hrishioa something like this, added this feature on my local hankweave
John Peng tweet media
1 reply · 0 reposts · 1 like · 18 views

Hrishi @hrishioa
@theRealJohnPeng You could import codons, prompts or rigs :) Hank.json represents a full combo of what to run, where to run it and how to join it - importing a Hank might be easier to do by just running the Hank inside another Hank :) But this definitely needs more exploration
1 reply · 0 reposts · 0 likes · 37 views

John Peng @theRealJohnPeng
@hrishioa One other annoying thing is I cant import hank.jsons and compose them
1 reply · 0 reposts · 1 like · 32 views

Hrishi @hrishioa
@john_lam Thank you! Genuinely appreciated this question. Started responding, ended up getting too long (might actually write about this in detail later) :) x.com/hrishioa/statu…
Hrishi@hrishioa

[quoted tweet: Hrishi's reply on the JSON DSL tradeoffs, shown in full above]

0 replies · 0 reposts · 0 likes · 71 views

John Lam @john_lam
[John Lam's question on JSON DSLs vs. embedded languages, quoted in full above]
1 reply · 0 reposts · 2 likes · 1.3K views

Hrishi retweeted

Rhys @RhysSullivan
i bet it has to feel good asf to be a service and get restarted after a memory leak
Rhys tweet media
87 replies · 189 reposts · 10.2K likes · 261.5K views

Hrishi @hrishioa
@rez0__ Most APIs now, I presume, will transcode in flight anyway, so it might not matter as much as long as it's an image
0 replies · 0 reposts · 1 like · 35 views

Hrishi @hrishioa
@rez0__ Haha, gem3flash best of 10 might beat almost everything. But harness matters quite a bit. Claude Code (and the Agents SDK) compress the PNGs without telling you, so what the model sees is *tiny*. Try switching harnesses around to see what happens to performance!
1 reply · 0 reposts · 1 like · 64 views

Joseph Thacker @rez0__
@hrishioa I used pngs iirc. Gemini dominated. I'm using gem3 flash lite. I do like best of 10 haha. It's so cheap you can run it a bajillion times.
1 reply · 0 reposts · 1 like · 47 views

Hrishi @hrishioa
Hrishi
Hrishi@hrishioa·
GPT-5.4 is a very capable visual model. It's now part of a TINY club (with opus 4.6) as one of the only two models I've tested that can actually see. It's also way cheaper than opus!

Until opus 4.6, we didn't have models that could really see the problems with a design - comics are a perfect test. The comic hank tests character generation, story writing, and adversarial improvement - all in a visual context. gpt-5.4 and opus are the only models that can see when a thread is wrong, when a character is pointed the wrong way, or when dialogue is misattributed.

This is really good news - visual understanding is key to computer use (beyond pointing and clicking), and it's already in use at southbridge through @minu_who's design hanks that make interfaces for us.

Case in point: one of these comics is Opus and the other is GPT - if you can tell me which is which, you win an imaginary muffin. (@gabrielchua tests with mini incoming - I have high hopes haha)
Hrishi tweet media
Hrishi tweet media
2 replies · 1 repost · 12 likes · 1.9K views