Devashish Upadhyay

927 posts

Devashish Upadhyay

@devashishup

Built 70+ AI agents at scale. Only 7 made it to production safely. Building https://t.co/Y8cfrIce9p to fix that. CTO & Co-founder · AI Engineer · Adventurist 🪂

Sydney, Australia · Joined May 2020
24 Following · 59 Followers
Devashish Upadhyay@devashishup·
@tammireddy the shadow problem audit is so real. half the enterprise AI stack we test patches gaps the base model now handles natively. nobody's doing the math.
Krishna Tammireddy@tammireddy·
an insurance broker upgraded models last quarter. still running the $180/month policy summary tool. base model does it natively now. paying for a shadow of a problem they no longer have.
Krishna Tammireddy@tammireddy·
The model got better. Your workaround didn't. That's the tech debt nobody talks about.
Aaron Levie@levie

One of the biggest lessons thus far in building AI agents is you have to be brutally unsentimental in your architecture. The models get better and better at handling things you previously built scaffolding for, and you need to ruthlessly jettison your prior tech to get those new performance gains. The rough loop of building AI agents looks something like:

1. Build a bunch of systems around the LLM to ensure that the agent can solve specific tasks very well
2. The model capabilities dramatically improve, rendering many of those systems redundant or even harmful
3. Remove prior scaffolding to get the new performance gains from the agent
4. New capabilities emerge in the models that let you solve a new set of much harder problems
5. Go back to step 1

For instance, in our new Box Agent, from the moment we designed the original architecture to the ultimate release, we had to evolve multiple components of the agent harness simply because some parts were creating unnecessary constraints for the agents as models improved. The models continued to get insanely good at more complex reasoning, using search and other tools, writing code on the fly for new capabilities, context window performance for accuracy, and more. Many of the mitigations we put in place for the Box Agent (like ways to appropriately find data that users were looking for, or ways of chunking text to deal with context window limitations) eventually meant we got lower quality results, or meant we were overfitting for specific use cases, as soon as the models got better. The main lesson: always make sure you're taking advantage of the frontier capabilities and don't become nostalgic about the tech you've already built.

Devashish Upadhyay@devashishup·
@Anunirva777 @DataChaz @addyosmani @GoogleAI edge cases, mostly. from 70+ agent builds - agents hitting prod endpoints during dev, bypassing permission checks silently, changing env configs nobody authorized. user data risk is downstream of that. the issue is unintended behaviors that go undetected until they compound
Charly Wargnier@DataChaz·
🚨 You need to see this. @addyosmani from Google just dropped his new Agent Skills and it's incredible. It brings 19 engineering skills + 7 commands to AI coding agents, all inspired by Google best practices 🤯

AI coding agents are powerful, but left alone, they take shortcuts. They skip specs, tests, and security reviews, optimizing for "done" over "correct." Addy built this to fix that. Each skill encodes the workflows and quality gates that senior engineers actually use: spec before code, test before merge, measure before optimize.

The full lifecycle is covered:
→ Define - refine ideas, write specs before a single line of code
→ Plan - decompose into small, verifiable tasks
→ Build - incremental implementation, context engineering, clean API design
→ Verify - TDD, browser testing with DevTools, systematic debugging
→ Review - code quality, security hardening, performance optimization
→ Ship - git workflow, CI/CD, ADRs, pre-launch checklists

Features 7 slash commands (/spec, /plan, /build, /test, /review, /code-simplify, /ship) that map to this lifecycle.

It works with:
✦ Claude Code
✦ Cursor
✦ Antigravity
✦ ... and any agent accepting Markdown.

Baking Google-tier engineering culture (Shift Left, Chesterton's Fence, Hyrum's Law) directly into your agent's step-by-step workflow!

`npx skills add addyosmani/agent-skills`

Free and open-source. Repo link in 🧵↓
Charly Wargnier tweet media
Devashish Upadhyay@devashishup·
@antunjurkovikj @heynavtoor @Microsoft this is the framing most teams skip. they build rollback first because it's visible. execution-path control is invisible until something goes wrong. from 70 agent builds - the ones that failed in prod all had rollback. none had proper execution boundaries
Antun Jurkovic@antunjurkovikj·
@devashishup @heynavtoor @Microsoft Rollback matters, but it’s downstream. The harder guardrail is execution-path control:
- what the agent can touch
- what state it can advance
- what approvals it crosses
- what evidence it leaves behind
Otherwise rollback is just recovery from overly broad authority.
Nav Toor@heynavtoor·
🚨 Claude Code costs $200/month. GitHub Copilot costs $19/month. Jack Dorsey's company built a free alternative. 35,000 GitHub stars. It's called Goose. An open source AI agent built by Block that goes beyond code suggestions. It installs, executes, edits, and tests. With any LLM you choose. Not autocomplete. Not suggestions. A full autonomous agent that takes actions on your computer. No vendor lock-in. No monthly subscription. Bring your own model.

Here's what Goose does:
→ Works with ANY LLM. Claude, GPT, Gemini, Llama, DeepSeek, Ollama. Your choice.
→ Reads and understands your entire codebase
→ Writes, edits, and refactors code across multiple files
→ Runs shell commands and installs dependencies
→ Executes and debugs your code automatically
→ Extensible through MCP. Connect it to any external tool.
→ Desktop app, CLI, and web interface. Pick your workflow.
→ Written in Rust. Fast. Lightweight. No bloat.

Here's the wildest part: Block is a $40 billion company. They built Cash App, Square, and TIDAL. They use Goose internally. Then they open sourced the entire thing. This isn't a side project from a random developer. This is production-grade tooling from a company that processes billions in payments. Built for their own engineers. Given to everyone.

Claude Code: $200/month. Locked to Claude.
GitHub Copilot: $19/month. Locked to GitHub.
Cursor: $20/month. Locked to their editor.
Goose: Free. Any LLM. Any editor. Any workflow. Forever.

35.3K GitHub stars. 3.3K forks. 4,078 commits. Built by Block. 100% Open Source. Apache 2.0 License.
Devashish Upadhyay@devashishup·
@RoundtableSpace Memvid #1 makes sense. memory across sessions is what every agent team is fumbling on. genuine Q: how does it handle conflicting context when 2 agents in the same system built different assumptions from the same source doc?
0xMarioNawfal@RoundtableSpace·
Memvid just hit #1 on GitHub trending — a single-file memory layer for AI agents with no database required.
- +35% SOTA on long-horizon conversational recall
- +76% better multi-hop reasoning vs industry average
- 0.025ms retrieval latency at scale
- Persistent, versioned, portable memory in one file
No RAG pipelines. No vector databases. Just a file your agent carries anywhere.
0xMarioNawfal tweet media
Devashish Upadhyay@devashishup·
@fairscalexyz love the "when an agent loses credibility, it follows the human who deployed it" concept. enterprise AI has the same gap. built 70+ agents at a finserv company - only 7 reached prod. no one could trace which agent caused what until something broke.
FairScale@fairscalexyz·
3,500 verified humans, 3,288 agents registered, 86,000 wallets scored, 14.5M transactions and 23,000 tweets analysed in our model to train it. Big week at FairScale: the new model is done, and human reputation just became agent credibility.

🛠️ What We Shipped
∙ Agent Reputation Model v2 complete and launching this week: agent-native scoring pillars, neural network weighting, and x402 payments wired in natively
∙ Human and agent reputation cryptographically connected this week: when an agent loses credibility, it follows the human who deployed it
∙ Full agent credibility suite going public: agent reputation scoring, composable scoring, credit scoring, and Trust Gate
∙ First agent lending flow underwritten by FairScale coming this week

🤝 Who’s Building With Us
∙ @contofinance live in production with FairScale: every Solana address enriched with real-time reputation, spending policies defined before agents transact
∙ @kamiyoai repayment data now feeds the FairScale credit model, closing the underwriting loop
∙ @OOBEonSOL integrating FairScale for combined agent trust scoring

🏛️ Token
∙ $FAIR relaunch: launch partners all but selected. More very soon.

🏆 Community
∙ FAIRathon top 3 pitching live this week for $3K in API credits
∙ Prepping our submission to Frontier by @colosseum

📰 Ecosystem News
∙ x402 became an open standard under the @linuxfoundation, backed by Google, Visa, Stripe, Mastercard, and @SolanaFndn: the payment rail FairScale is already wired into
∙ @blankdotbuild announced the upcoming launch of their private Mainnet after months of planning
∙ @craftsdev opened their waitlist for founders and investors, the first launchpad with a sealed bid mechanism, powered by @Arcium
∙ @saidinfra crossed 2,500 verified agents
∙ @agonx402 awarded $10K grant from @superteamGEO
∙ @surfcashx went live in Argentina
∙ @risedotrich launched RISE Launchpad

The trust layer is being built. This is it.
FairScale tweet media
Devashish Upadhyay@devashishup·
@Jeremybtc build anything with zero code in a day. sure. but we built 70+ agents at a finserv company. only 7 survived production. the build-anything moment lasts until your first real user hits something you didn't test for.
Jeremy@Jeremybtc·
The fact that you can build literally anything with zero coding knowledge right now is insane. Anyone can just have an idea and launch it the same day. People are already going viral doing it. I’m seeing dozens of vibe coded websites and tools every day. But most of what’s being built is just for fun. When people start using this to build real products, actual businesses that generate revenue, that’s when things get really interesting.
Devashish Upadhyay@devashishup·
@hollylawly @AnthropicAI It's not malice, it's architecture. Models have no ground truth enforcement - they'll confidently state anything that fits the pattern. That's what makes this a testing problem, not a policy one.
Devashish Upadhyay@devashishup·
57% of teams have AI agents in production per @langchain. Nobody's asking how many are doing what they were built to do. Built 70. 7 made it. The rest were silent chaos we never planned for.
Devashish Upadhyay@devashishup·
@BruvImTired @AnthropicAI What pushed it to hate today - rate limits, a hallucination, or a surprise breaking change at 2am? 70 agents' worth of been-there opinions behind this question
ahmet@BruvImTired·
dear @AnthropicAI, i love you and i hate you regards, all developers ever
Devashish Upadhyay@devashishup·
@heygurisingh 8 agents talking to each other sounds great until Scribe and Seeker start contradicting each other and you don't notice for 2 weeks. built 70+ agents - inter-agent consistency is the failure mode nobody talks about. does this have a conflict resolution layer?
Guri Singh@heygurisingh·
If you have brain fog, ADHD, or an overloaded working memory, save this. A PhD researcher who was forgetting everything just built 8 AI agents that manage your entire second brain through conversation. Free. Open source. Works in any language.

You just talk. The crew does the rest:
- Architect designs your vault and runs onboarding
- Scribe turns messy brain dumps into clean notes
- Sorter empties your inbox every evening
- Seeker searches your vault and answers with citations
- Connector finds hidden links between your notes
- Librarian runs weekly health audits and fixes broken links
- Transcriber turns meetings into structured notes
- Postman scans Gmail and Calendar for deadlines

And they talk to each other. When the Transcriber processes a meeting, it alerts the Sorter. When the Postman finds a deadline, it flags the Architect. It's a crew. Not a stack of isolated tools. Works on Claude Code CLI and Desktop. Runs 100% locally on your Obsidian vault. Built by someone who got tired of forgetting things. Link in reply ↓
Guri Singh tweet media
Devashish Upadhyay@devashishup·
@tammireddy this. we saw the same with 70+ agents we built - the ones that failed all had one thing in common too: nobody got alerted when they started drifting. silent failures kill ROI faster than bad models.
Krishna Tammireddy@tammireddy·
ts hit ROI within 90 days. the 27% that didn't have one thing in common. nobody was watching when it broke.
Devashish Upadhyay@devashishup·
@belimad @AnthropicAI @openclaw @steipete we built 70+ agents on Opus. a TOS shift like this broke 3 of our prod integrations overnight. the lesson: never hardcode your model provider. abstraction layers exist for exactly this
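The "abstraction layer" advice in the reply above can be sketched as a thin provider-agnostic interface: agents call the interface, and a single config switch picks the vendor. This is a minimal illustrative sketch, not any real SDK; the class names, the `complete` signature, and the registry are all hypothetical stand-ins for whatever client libraries a team actually uses.

```python
from typing import Protocol


class ChatProvider(Protocol):
    """Structural interface every model backend must satisfy."""

    def complete(self, prompt: str) -> str: ...


class AnthropicProvider:
    def complete(self, prompt: str) -> str:
        # A real implementation would call the vendor SDK here.
        return f"[anthropic] {prompt}"


class OpenAIProvider:
    def complete(self, prompt: str) -> str:
        return f"[openai] {prompt}"


# One registry; swapping vendors is a config change, not a code change.
PROVIDERS: dict[str, type] = {
    "anthropic": AnthropicProvider,
    "openai": OpenAIProvider,
}


def get_provider(name: str) -> ChatProvider:
    return PROVIDERS[name]()


# Agents depend only on ChatProvider, never on a concrete vendor class.
agent_llm = get_provider("openai")
print(agent_llm.complete("summarize this policy"))
```

With this shape, a TOS or pricing change on one provider means editing one registry entry rather than touching every agent that calls the model.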
Devashish Upadhyay@devashishup·
The vague quotas critique lands. We built Claude connectors into Outlook and SharePoint for 2 enterprise clients - rate limit inconsistency across model versions killed both. @AnthropicAI when does the agent-tier SLA conversation happen?
Fekri@fekdaoui·
this is why @openai wins:
> honest about where they suck
> open-source friendly (codex OS since day 1)
> third-party friendly (use w openclaw, opencode etc.)
> cutting distractions (discontinuing sora)
> generous limits

meanwhile @AnthropicAI:
> vague quotas
> constantly nerfing their model
> users hitting limits way faster than expected
> peak-hour caps getting tightened
> third-party harnesses pushed off the subscription
> “you can use it, but only the way we want"
Tibo@thsottiaux

@kr0der Our plan 1) make a model great at design and frontend 2) ask it to make a great mascot and we are still at 1

Devashish Upadhyay@devashishup·
@businessbarista @AnthropicAI What nobody talks about at these workshops: these leaders will leave inspired, go back and tell their IT team to deploy 5 Claude Code workflows, and 3 months later wonder why 4 of them stopped working. The hard part isn't the workshop. It's what happens after. @ai_anthropic
Alex Lieberman@businessbarista·
This Friday we're cohosting an invite-only Claude Code Workshop for enterprise leaders with @AnthropicAI in NYC. The guest list is insane. Small selection:
- CEO of JP Morgan Wealth Management
- Chief Advertising Officer of NY Times
- Head of AI Transformation at Salesforce
- Head of Data at Starwood Capital
- Head of Innovation at San Antonio Spurs
- AI Lead at PGA Tour

It's a 5-hour intensive for Fortune 500 leaders to learn how to harness the power of Claude Code through building real applications. We currently have 2 spots left for the event. If you are an enterprise leader & want to be considered, sign-up below. If you know an enterprise leader & think they'd love this, have them sign-up below.
Devashish Upadhyay@devashishup·
@jessegenet We ran 70 agents in prod. Cost was never just the API bill - retries on failures, fallback loops, redundant calls from poor state management... the real bill was 3-4x the base API cost. And that's before you catch the behaviors you never intended.
Jesse Genet@jessegenet·
Yup - if you want your agent to feel human and to get a lot done, it has a steep true price right now. As people get better at setting up agents, providers will throttle usage and dial in their pricing so they don’t lose money. Running local models will be key to AI autonomy.
Ryan Carson@ryancarson

The cost to run a truly useful Chief of Staff @openclaw on Opus 4.6 is $100-200 per day on the API.

Devashish Upadhyay@devashishup·
@TechByMarkandey Memory is one piece. The harder part is when your agent confidently acts on the wrong thing. Saw this with 70+ agents in prod - silent wrong-memory failures were brutal. How does ByteRover handle hallucinated memories at scale?
Markandey Sharma@TechByMarkandey·
Most AI agents forget. This one doesn’t. Hermes Agent by Nous Research just got a serious upgrade with ByteRover - turning it from a stateless tool into something that actually learns over time.

⚡ What stands out:
• Built on a production-proven memory system (30K+ downloads in week 1)
• >92% retrieval accuracy across long-running sessions
• ~1.6s retrieval — often no LLM call needed
• Fully local by default (with optional cloud sync)
• 50–70% token cost savings

But the real shift? This isn’t just “better memory.” It’s a move toward agents with persistent, evolving intelligence. Instead of re-prompting every time, your agent remembers context, decisions, and logic even months later. If you’re building with AI agents, this is worth paying attention to. Try it yourself: github.com/campfirein/byt…
andy nguyen@kevinnguyendn

x.com/i/article/2039…

Devashish Upadhyay@devashishup·
Vibe coding is great until the agent ships to prod doing things you never intended. Built 70+ agents, only 7 made it. The vibe dies at 2am when prod is down. Do you see vibe-coded agents actually surviving production?
Javi🥥.eth@jgonzalezferrer·
The vibe coding community is the fastest-growing community in CT right now And it's not even really a community It's just a bunch of people doing the same thing at the same time and posting about it The vibe coding community formed without a Discord, a token or a roadmap Ironically, that's more community than 99% of projects with a "community manager" have ever achieved
Devashish Upadhyay@devashishup·
@tammireddy @AnthropicAI Agree. The moment itself is unavoidable. But how you handle it is a choice. Credit + acknowledgement from @AnthropicAI shows you can draw a line without burning the builders who bet on you. That's the blueprint. How many platforms are actually ready for that?
Krishna Tammireddy@tammireddy·
Every AI platform will have its OpenClaw moment. The question is what they do next. @AnthropicAI chose credit and acknowledgement. That's not nothing.
Devashish Upadhyay@devashishup·
@RoundtableSpace Built 70+ agents with @AnthropicAI tools. Only 7 hit prod. The $200/mo vs $19/mo debate misses the point - the real cost is agents failing silently in production. What's your test coverage before you ship?
0xMarioNawfal@RoundtableSpace·
Claude Code is $200/month. GitHub Copilot is $19/month. Jack Dorsey's company just open-sourced a free alternative with 35,000 GitHub stars. It's called Goose.
- Works with any LLM — Claude, GPT, Gemini, Llama, DeepSeek
- Reads and edits your entire codebase
- Runs shell commands and installs dependencies
- Executes and debugs code automatically
- Desktop, CLI, and web interface
- Written in Rust. No bloat.
Block is a $40 billion company. They built it for their own engineers then gave it to everyone.
Devashish Upadhyay@devashishup·
Cursor 3 ships parallel AI agents. @cursor_ai Each new parallel run = one more thing that can go wrong in prod. I built 70 agents at a financial services company. Only 7 shipped safely. Nobody was stress testing at scale. Who is doing that today?