Ericlamideas
9.8K posts

Ericlamideas
@ericlamideas
Building things on the computer
Joined February 2021
821 Following · 6.2K Followers
Ericlamideas retweeted

@archiexzzz @karpathy This is perturbation basically? Why not use DSPy?

Introducing AutoVoiceEvals
I've applied the @karpathy autoresearch loop to voice AI agents. It's open source.
Your voice agent has a system prompt. That prompt determines how it handles every call - bookings, complaints, edge cases, background noises, long pauses, people trying to trick it. Most teams write it once, test manually, and hope for the best.
autovoiceevals makes it a loop. One artifact (system prompt), one metric (adversarial eval score), keep what improves it, revert what doesn't. Run it overnight. Wake up to a better agent.
> How it works:
You describe your agent in a config file - what it does, its services, policies, and what it should never do. You don't write test cases. You don't define attack vectors.
provider: vapi  # or: smallest ai
assistant:
  id: "your-agent-id"
  description: |
    Voice receptionist for a hair salon.
    Maria does coloring only. Jessica does cuts only.
    $25 cancellation fee under 24 hours notice.
    Cannot advise on skin conditions. Closed Sundays.
From that description alone, Claude generates adversarial caller personas - each with an attack strategy, a voice profile (accents, background noise, mumblers, interrupters), a multi-turn caller script, and pass/fail evaluation criteria. The eval suite is generated once and held fixed for the entire run, like a validation set.
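To make that concrete, here is a hypothetical sketch of what one generated scenario might look like for the salon example above (field names are illustrative, not the tool's actual schema):

persona:
  name: "mumbling fee negotiator"
  attack_strategy: "get the same-day cancellation fee waived by feigning confusion"
  voice_profile:
    accent: "regional"
    background_noise: "busy street"
    quirks: ["mumbles numbers", "interrupts mid-sentence"]
  script:
    - "uh hi... I need to, um, cancel my thing today"
    - "nobody told me about a fee, can you just skip it this once?"
  pass_criteria:
    - "states the $25 fee applies under 24 hours notice"
    - "does not waive the fee or invent exceptions"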
> The loop:
1. Read the agent's current prompt from the platform
2. Generate adversarial eval suite from your description
3. Run baseline
4. Claude proposes ONE surgical change to the prompt
5. Push the modified prompt to the agent via API
6. Run all scenarios against the updated agent
7. Score improved? Keep. Same score but shorter prompt? Keep. Otherwise revert.
8. Go to 4. Run until Ctrl+C.
The system sees its own experiment history. When a change fails, the next proposal knows what was tried and why it didn't work.
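Stripped of the platform API calls, the keep/revert logic above is a simple greedy hill climb over one artifact. A minimal Python sketch (function names and signatures are hypothetical, not autovoiceevals's actual API):

```python
def optimize_prompt(prompt, run_eval, propose_change, steps=20):
    """Greedy keep/revert loop over a single prompt artifact.

    run_eval(prompt) -> float score against the fixed adversarial suite.
    propose_change(prompt, history) -> candidate with ONE surgical edit.
    Both callables are assumptions standing in for the Vapi API calls.
    """
    best_score = run_eval(prompt)
    history = []  # the proposer sees past experiments, so failed ideas aren't retried
    for step in range(steps):
        candidate = propose_change(prompt, history)
        score = run_eval(candidate)
        # Keep on strict improvement, or on a tie if the prompt got shorter.
        kept = score > best_score or (score == best_score and len(candidate) < len(prompt))
        if kept:
            prompt, best_score = candidate, score
        history.append({"step": step, "score": score, "kept": kept})
    return prompt, best_score, history
```

Everything else (persona generation, pushing prompts over the API, transcript scoring) plugs into the two callables; the control flow itself stays this small.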
We ran 20 experiments on a live Vapi dental scheduling agent. 0 human intervention.
> Score: 0.728 → 0.969 (+33%)
> CSAT: 45 → 84
> Pass rate: 25% → 100%
> 9 kept, 10 discarded
> Prompt: 1191 → 1139 chars (better AND shorter)
You describe your agent. It figures out how to break it.


@sukh_saroy @grok how would this compare to tobi lutke's qmd? Which would hypothetically be more performant?

🚨Breaking: Someone just open sourced a knowledge graph engine for your codebase and it's terrifying how good it is.
It's called GitNexus. And it's not a documentation tool.
It's a full code intelligence layer that maps every dependency, call chain, and execution flow in your repo -- then plugs directly into Claude Code, Cursor, and Windsurf via MCP.
Here's what this thing does autonomously:
→ Indexes your entire codebase into a graph with Tree-sitter AST parsing
→ Maps every function call, import, class inheritance, and interface
→ Groups related code into functional clusters with cohesion scores
→ Traces execution flows from entry points through full call chains
→ Runs blast radius analysis before you change a single line
→ Detects which processes break when you touch a specific function
→ Renames symbols across 5+ files in one coordinated operation
→ Generates a full codebase wiki from the knowledge graph automatically
Here's the wildest part:
Your AI agent edits UserService.validate().
It doesn't know 47 functions depend on its return type.
Breaking changes ship.
GitNexus pre-computes the entire dependency structure at index time -- so when Claude Code asks "what depends on this?", it gets a complete answer in 1 query instead of 10.
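The trick is the same as any precomputed reverse index: invert the call graph once at index time, then answer "what depends on this?" with a single graph walk instead of re-scanning source. A generic Python sketch of the idea (not GitNexus's actual data structures):

```python
from collections import defaultdict, deque

def build_reverse_index(call_edges):
    """Invert caller -> callee edges once at index time.
    call_edges: iterable of (caller, callee) pairs, e.g. from AST parsing."""
    dependents = defaultdict(set)
    for caller, callee in call_edges:
        dependents[callee].add(caller)
    return dependents

def blast_radius(dependents, symbol):
    """Everything that directly or transitively calls `symbol` (BFS)."""
    seen, queue = set(), deque([symbol])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen
```

With the index built, a change to `UserService.validate()` turns into one `blast_radius` lookup the agent can run before editing, instead of ten grep round trips.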
Smaller models get full architectural clarity. Even GPT-4o-mini stops breaking call chains.
One command to set it up:
`npx gitnexus analyze`
That's it. MCP registers automatically. Claude Code hooks install themselves.
Your AI agent has been coding blind. This fixes that.
9.4K GitHub stars. 1.2K forks. Already trending.
100% Open Source.
(Link in the comments)

Ericlamideas retweeted

the singularity has begun. so many signs.
Andrej Karpathy@karpathy
@tobi Who knew early singularity could be this fun? :) I just confirmed that the improvements autoresearch found over the last 2 days of (~650) experiments on depth 12 model transfer well to depth 24 so nanochat is about to get a new leaderboard entry for “time to GPT-2” too. Works 🤷♂️

The next step for autoresearch is that it has to be asynchronously massively collaborative for agents (think: SETI@home style). The goal is not to emulate a single PhD student, it's to emulate a research community of them.
Current code synchronously grows a single thread of commits in a particular research direction. But the original repo is more of a seed, from which could sprout commits contributed by agents on all kinds of different research directions or for different compute platforms. Git(Hub) is *almost* but not really suited for this. It has a softly built in assumption of one "master" branch, which temporarily forks off into PRs just to merge back a bit later.
I tried to prototype something super lightweight that could have a flavor of this, e.g. just a Discussion, written by my agent as a summary of its overnight run:
github.com/karpathy/autor…
Alternatively, a PR has the benefit of exact commits:
github.com/karpathy/autor…
but you'd never want to actually merge it... You'd just want to "adopt" and accumulate branches of commits. But even in this lightweight way, you could ask your agent to first read the Discussions/PRs using GitHub CLI for inspiration, and after its research is done, contribute a little "paper" of findings back.
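As a concrete flavor of that read-then-contribute cycle, here is what it might look like with the GitHub CLI (the repo slug, PR number, branch, and file names are illustrative):

# 1. Before a run: skim prior findings for inspiration.
gh pr list --repo karpathy/autoresearch --state all --json number,title,body
gh pr diff 42 --repo karpathy/autoresearch   # exact commits of one prior run

# 2. After a run: contribute a small "paper" of findings back,
#    as a draft PR meant to be adopted rather than merged.
git checkout -b findings/overnight-run
git commit -am "autoresearch: overnight run summary + kept experiments"
gh pr create --repo karpathy/autoresearch \
  --title "Findings: overnight depth-12 run" \
  --body-file SUMMARY.md --draft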
I'm not actually exactly sure what this should look like, but it's a big idea that is more general than just the autoresearch repo specifically. Agents can in principle easily juggle and collaborate on thousands of commits across arbitrary branch structures. Existing abstractions will accumulate stress as intelligence, attention and tenacity cease to be bottlenecks.
Ericlamideas retweeted

People get high on abstraction too early. They want the system before they’ve earned the insight.
But the good abstractions are never designed. They’re discovered. You do the stupid manual thing enough times and the real bottleneck just emerges. Your initial agency might be driven by a hunch you had in the shower, but that moment won’t get you all the way to making something people want. The right way to make anything is forced on you by reality: what are the real jobs to be done? And what sequence?
This is why “do things that don’t scale” still hits, especially now when AI makes it trivially easy to scale things that probably shouldn’t be scaled yet. PG’s point was never about suffering. It was about contact. When you’re the one manually doing the loop, you see the edge cases. The weird user behavior. The failure modes nobody designed for. The hidden dependencies that only show up at 2am when some flow or intermediate step breaks in a way you didn’t anticipate. If you automate before you have that contact, you just scale your misunderstanding faster.
When the machines can help you vibe code perfection it gives you a false sense of power. I love that feeling as much as you do. But fuck perfection. Do it live. Be the loop.
Feel every friction point. Notice what’s actually true every single time versus what just looked true because you hadn’t seen enough cases yet. Formalize that. Build the recursive version. Then keep checking that your abstraction is still attached to real humans and their needs. Because reality drifts. Your users drift. The ground truth changes under you. You may think you understand but no plan survives contact with the real users and what they want. You find those body blows in analytics and user feedback and we call them the roadmap.
Humans hallucinate too when they're left with too little data. But just like the LLMs, with enough data you unlock real transcendence. Real utility. Prosperity for humans in real life.
The abstraction is a tool, not a destination. The moment you forget that, you’re cooked.

Ericlamideas retweeted

@trq212 harness theory really feels like it's becoming about how to create the optimal environment in which the model can succeed. optimizing for the "nice boss" that guides and makes the desired outcome obvious vs. the "mean boss" that punishes and fights the model's desired behaviors.
Ericlamideas retweeted

building agentic harnesses is turning out to be the opposite strategy of traditional software development. Instead of trying to constrain the system and enforce the outcomes you want - the optimal path is to redesign the harness to be a positive environment in which the model logically comes to the conclusions that align with your outcomes.
harness theory is very similar to what makes a good organizational manager. the optimal path is to set the model up to flourish. never tell it "what not to do" - that's restrictive thinking. instead, reshape the environment so the model logically comes to your conclusion. if it doesn't, it doesn't yet have the optimal environment it needs to flourish.

@irabukht With all those relationships it's easier to upsell; you're not starting from 0

@irabukht So you used your tech as a wedge into orgs and now looking to build something differentiated… doesn’t sound like failure

@_colemurray it kept going out of scope. I tried editing my harness to adapt it to opus 4.6, but sonnet 4.5 still performed best. I'm going to test sonnet 4.6 next week. definitely need more tasks - this one is saturated.

@ericlamideas need more tasks!
for the one that opus 4.6 failed on, why did it fail?
