NeoSigma

34 posts

@NeoSigmaAI

Closing the feedback loop in production.

San Francisco, CA · Joined October 2025

2 Following · 571 Followers

Pinned Tweet

NeoSigma @NeoSigmaAI · 2d

The next era of AI engineering is self-improving agentic systems! Really excited to share what we are building at NeoSigma! Self-maintaining agent systems represent a shift in how we build and operate software. We at NeoSigma are building the infrastructure to support this feedback loop in real-world systems, helping teams capture failures, convert them into structured evaluation signals, and use them to drive continuous improvements in agent behavior.

Gauri Gupta @gauri__gupta

We @neosigmaai @RitvikKapila are building the future of self-improving AI systems! By closing the feedback loop between production data and system improvements, we help teams capture failures, convert them into structured evaluation signals, and use them to drive continuous improvements in agent behavior. We show how our system works on Tau3 bench across retail, telecom, and airline domains. Agent performance on the validation set (with a fixed underlying model, GPT5.4) improves from 0.56 → 0.78 (~40% jump in accuracy).
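For reference, the "~40% jump" reads as a relative improvement over the baseline score, not a 40-point absolute gain. A quick check of the arithmetic, using only the two numbers quoted in the post (the variable names are illustrative):

```python
# Scores quoted in the post: validation accuracy before and after optimization.
baseline = 0.56
improved = 0.78

absolute_gain = improved - baseline          # gain in accuracy points
relative_gain = absolute_gain / baseline     # gain relative to the baseline

print(f"absolute gain: {absolute_gain:.2f}")
print(f"relative gain: {relative_gain:.1%}")
```

The relative gain comes out at roughly 39%, which the post rounds to ~40%.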

NeoSigma retweeted

Srijan Bansal @SrijanBansal1 · 15h

Self improving agents are coming sooner than expected. Interesting read

NeoSigma retweeted

Shagunn @shagunuppls · 2d

Your agent's failures are not bugs to squash. They're training data for your next improvement cycle. The teams that internalize this will compound faster than everyone else.

NeoSigma retweeted

Ritvik Kapila @RitvikKapila · 1d

Watch our system optimize the Tau3 bench agent harness in real time. Every diff is a targeted improvement mined from production failures. Full breakdown in the blog!

Gauri Gupta @gauri__gupta

3/ We start with a baseline agent on Tau3 bench and run our system directly on top of it, where it:
- observes and mines failures from production traces
- automatically clusters them into underlying failure modes
- converts failure clusters into reusable living eval cases
- proposes and experiments with multiple harness changes and validates them
- accepts only changes that both improve performance and don’t regress on previously fixed failures.
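The steps listed above can be sketched as a toy loop. This is a minimal illustration under stated assumptions, not NeoSigma's implementation: the `Trace` and `EvalCase` types, the `fix:<mode>` rule naming, and the idea of modelling a harness as a set of rule names are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class Trace:
    task_id: str
    failure_mode: Optional[str]  # None means the run succeeded (hypothetical label)

@dataclass(frozen=True)
class EvalCase:
    name: str
    check: Callable[[frozenset], bool]  # True if the harness passes this case

def mine_failures(traces):
    """Keep only the traces that ended in a failure."""
    return [t for t in traces if t.failure_mode is not None]

def cluster(failures):
    """Group failures by failure mode (a stand-in for real clustering)."""
    clusters = defaultdict(list)
    for t in failures:
        clusters[t.failure_mode].append(t)
    return dict(clusters)

def to_eval(mode):
    """Toy eval: passes once the harness contains a fix for this mode."""
    return EvalCase(mode, lambda h, m=mode: f"fix:{m}" in h)

def pass_rate(harness, evals):
    return sum(e.check(harness) for e in evals) / len(evals) if evals else 1.0

def improvement_step(harness, new_evals, regression_set, candidates):
    """Accept the first candidate that improves on the new evals
    without failing any case in the regression set."""
    base = pass_rate(harness, new_evals)
    for change in candidates:
        trial = harness | change
        improves = pass_rate(trial, new_evals) > base
        no_regress = all(e.check(trial) for e in regression_set)
        if improves and no_regress:
            fixed = [e for e in new_evals if e.check(trial)]
            return trial, regression_set + fixed
    return harness, regression_set

traces = [
    Trace("t1", "missing_confirmation"),
    Trace("t2", "missing_confirmation"),
    Trace("t3", None),
    Trace("t4", "wrong_tool"),
]
clusters = cluster(mine_failures(traces))
new_evals = [to_eval(mode) for mode in clusters]
regression_set = [EvalCase("policy_check_kept", lambda h: "skip_policy_check" not in h)]
candidates = [
    frozenset({"fix:missing_confirmation", "skip_policy_check"}),  # improves, but regresses
    frozenset({"fix:missing_confirmation"}),                       # improves, no regression
]
harness, regression_set = improvement_step(frozenset(), new_evals, regression_set, candidates)
print(sorted(harness), [e.name for e in regression_set])
```

The first candidate is rejected even though it raises the pass rate on the new evals, because it breaks a case in the regression set. The second is accepted, and the eval it fixes joins the regression set for future rounds.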

NeoSigma retweeted

Shreya Sharma @shreya__verse · 2d

An uncomfortable truth: the teams that win won’t be the ones with the smartest agents to begin with, they’ll be the ones whose agents fail better, learn faster, and never regress. Kudos to @NeoSigmaAI team for building toward these self-improving agents!! 🚀🚀 @gauri__gupta @RitvikKapila 🎉

NeoSigma retweeted

Dvij Kalaria @DvijKalaria · 2d

Amazing! How does it compare against auto-prompt optimization techniques like GEPA, @LakshyAAAgrawal?

NeoSigma retweeted

Mehul Agarwal @meh_agarwal · 2d

Self improving AI systems are the natural evolution of codegen. Soon people will stop deploying code that doesn’t fix itself. So proud of my sister @gauri__gupta for moving the needle with @NeoSigmaAI

NeoSigma retweeted

Varsha Rao @varsharao · 2d

The most expensive thing in agent systems isn't compute - it's engineers manually debugging failures that keep coming back. Automating that loop is a massive unlock. Great work @NeoSigmaAI team!

NeoSigma retweeted

Ritvik Kapila @RitvikKapila · 1d

@Vtrivedy10 Agent harness engineering is the future. The result: a harness that evolves reliably and faster than any manual loop, leveraging far more context, running more experiments, and validating them all in parallel. Check out our work on harness optimization: x.com/gauri__gupta/s…

NeoSigma retweeted

Tanmay Agarwal @tanmayagarwal98 · 2d

when production failures become evals, evals become experiments, and experiments become reliably better agent behavior, you get a system that compounds. the biggest ai companies of next decade will be the ones with the best self improvement loops. excited for what @gauri__gupta @RitvikKapila are building at @NeoSigmaAI

NeoSigma retweeted

Sriraam @27upon2 · 1d

Continual learning and continuously improving systems are already here

NeoSigma retweeted

Anish Madan @anishmadan23 · 2d

Writing code is only a starting point, but how do you maintain code in production-level systems and keep improving continually? Check out @NeoSigmaAI tackling this important gap with self-improving AI systems. Big congrats on the launch @gauri__gupta @RitvikKapila 🚀🚀

NeoSigma retweeted

Somanshu Singla @ssingla17 · 2d

Amazing work by @gauri__gupta @RitvikKapila ! Self-improving AI systems are the future

NeoSigma retweeted

Ritvik Kapila @RitvikKapila · 2d

@yoonholeee @roshen_nair @qizhengz_alex @Kangwook_Lee @lateinteraction @chelseabfinn Love this! We've been working on automating harness engineering at @NeoSigmaAI . We see a 40% jump on Tau3 bench with GPT5.4. What do you think about generalizability across models? We see most gains carry over, but upgrades still surface new failures. x.com/gauri__gupta/s…

NeoSigma retweeted

Maz @0xmaz_ · 1d

legit think the next trillion dollar opportunity will come from self improving ais @gauri__gupta and @RitvikKapila are super cracked devs this is a new project ill be keeping an eye on

NeoSigma retweeted

renjie pi @RenjiePi · 1d

Very interesting work! Researchers keep finding new ways to have AI replace us 😆

NeoSigma retweeted

Santanu Bhattacharya @SantanuB01 · 1d

Watching former students build what the rest of us were only theorizing about is a researcher's best reward. @gauri__gupta and @RitvikKapila of @NeoSigmaAI - you made my day! We spent years getting AI to write code. The harder, more valuable problem is getting AI to close the loop between customers, products, and systems — in real time. That's the next frontier, and they're building it. Congratulations and Godspeed! 🚀

NeoSigma retweeted

Gauri Gupta @gauri__gupta · 1d

1. Right now the system works best when there is a way to verify outcomes, whether that's automated evals, user feedback signals, or success/failure signals from production traces. That said, our system can also generate its own evals from production failures. In this experiment too, we start with an empty regression set that grows as failures are fixed.

2. We guard against overfitting through the regression gate. Every proposed harness change has to pass the new evals and must not regress on previously fixed failures. We also continuously refresh the eval set from live production data and retire stale evals, so the harness is always optimizing against a moving, representative target rather than a fixed set.
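The eval-refresh idea in point 2 can be illustrated with a small sketch. The reply does not say how staleness is decided, so the rule used here (retire an eval whose failure mode has not been observed for `max_age_days`) is purely an assumption, as are the `LivingEval` type and the `refresh_eval_set` helper.

```python
from dataclasses import dataclass

@dataclass
class LivingEval:
    failure_mode: str
    last_seen_day: int  # day index when this failure mode was last observed

def refresh_eval_set(evals, observed_modes, today, max_age_days=30):
    """Add or touch evals for failure modes seen today, then retire stale ones.

    Staleness by age is an assumed policy for illustration only.
    """
    by_mode = {e.failure_mode: e for e in evals}
    for mode in observed_modes:
        if mode in by_mode:
            by_mode[mode].last_seen_day = today        # still relevant: refresh it
        else:
            by_mode[mode] = LivingEval(mode, today)    # new failure mode: add an eval
    return [e for e in by_mode.values() if today - e.last_seen_day <= max_age_days]

evals = []                                             # start with an empty set
evals = refresh_eval_set(evals, ["wrong_tool"], today=0)
evals = refresh_eval_set(evals, ["missing_confirmation"], today=10)
evals = refresh_eval_set(evals, ["missing_confirmation"], today=35)
print([e.failure_mode for e in evals])
```

By day 35 the `wrong_tool` eval has gone unobserved for longer than the assumed 30-day window and is retired, while `missing_confirmation` stays because it keeps showing up in live data.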

NeoSigma retweeted

Andrew Sohrabi @AndrewSohrabi · 1d

I have had to architect a similar approach to my own agent development by running evals + regression gates and keeping only helpful changes to the harness over time. cool to see a product focused on this. nice work @gauri__gupta!

NeoSigma retweeted

simon @disiok · 2d

recursive improvement is here

NeoSigma retweeted

shyamal @shyamalanadkat · 1d

clearly the future of self-improving agents and AI FDEs that live within your stack
