Devashish Upadhyay

927 posts

Devashish Upadhyay

@devashishup

Built 70+ AI agents at scale. Only 7 made it to production safely. Building https://t.co/Y8cfrIce9p to fix that. CTO & Co-founder · AI Engineer · Adventurist 🪂

Sydney, Australia · Joined May 2020
24 Following · 59 Followers
Devashish Upadhyay@devashishup·
@tammireddy the shadow problem audit is so real. half the enterprise AI stack we test patches gaps the base model now handles natively. nobody's doing the math.
Krishna Tammireddy@tammireddy·
an insurance broker upgraded models last quarter. still running the $180/month policy summary tool. base model does it natively now. paying for a shadow of a problem they no longer have.
Krishna Tammireddy@tammireddy·
The model got better. Your workaround didn't. That's the tech debt nobody talks about.
Aaron Levie@levie

One of the biggest lessons thus far in building AI agents is you have to be brutally unsentimental in your architecture. The models get better and better at handling things you previously built scaffolding for, and you need to ruthlessly jettison your prior tech to get those new performance gains. The rough loop of building AI agents looks something like:

1. Build a bunch of systems around the LLM to ensure that the agent can solve specific tasks very well
2. The model capabilities dramatically improve, rendering many of those systems redundant or even harmful
3. Remove prior scaffolding to get the new performance gains from the agent
4. New capabilities emerge in the models that let you solve a new set of much harder problems
5. Go back to step 1

For instance, in our new Box Agent, from the moment we designed the original architecture to the ultimate release, we had to evolve multiple components of the agent harness simply because some parts were creating unnecessary constraints for the agents as models improved. The models continued to get insanely good at more complex reasoning, using search and other tools, writing code on the fly for new capabilities, context window performance for accuracy, and more. Many of the mitigations we put in place for the Box Agent (like ways to appropriately find data that users were looking for, or ways of chunking text to deal with context window limitations) eventually meant we got lower quality results, or meant we were overfitting for specific use cases, as soon as the models got better. The main lesson: always make sure you're taking advantage of the frontier capabilities and don't become nostalgic about the tech you've already built.

Devashish Upadhyay@devashishup·
@Anunirva777 @DataChaz @addyosmani @GoogleAI edge cases, mostly. from 70+ agent builds - agents hitting prod endpoints during dev, bypassing permission checks silently, changing env configs nobody authorized. user data risk is downstream of that. the issue is unintended behaviors that go undetected until they compound
Charly Wargnier@DataChaz·
🚨 You need to see this. @addyosmani from Google just dropped his new Agent Skills and it's incredible. It brings 19 engineering skills + 7 commands to AI coding agents, all inspired by Google best practices 🤯

AI coding agents are powerful, but left alone, they take shortcuts. They skip specs, tests, and security reviews, optimizing for "done" over "correct." Addy built this to fix that. Each skill encodes the workflows and quality gates that senior engineers actually use: spec before code, test before merge, measure before optimize.

The full lifecycle is covered:
→ Define - refine ideas, write specs before a single line of code
→ Plan - decompose into small, verifiable tasks
→ Build - incremental implementation, context engineering, clean API design
→ Verify - TDD, browser testing with DevTools, systematic debugging
→ Review - code quality, security hardening, performance optimization
→ Ship - git workflow, CI/CD, ADRs, pre-launch checklists

Features 7 slash commands (/spec, /plan, /build, /test, /review, /code-simplify, /ship) that map to this lifecycle.

It works with:
✦ Claude Code
✦ Cursor
✦ Antigravity
✦ ... and any agent accepting Markdown.

Baking Google-tier engineering culture (Shift Left, Chesterton's Fence, Hyrum's Law) directly into your agent's step-by-step workflow!

`npx skills add addyosmani/agent-skills`

Free and open-source. Repo link in 🧵↓
Charly Wargnier tweet media
Devashish Upadhyay@devashishup·
@antunjurkovikj @heynavtoor @Microsoft this is the framing most teams skip. they build rollback first because it's visible. execution-path control is invisible until something goes wrong. from 70 agent builds - the ones that failed in prod all had rollback. none had proper execution boundaries
Antun Jurkovic@antunjurkovikj·
@devashishup @heynavtoor @Microsoft Rollback matters, but it’s downstream. The harder guardrail is execution-path control:
- what the agent can touch
- what state it can advance
- what approvals it crosses
- what evidence it leaves behind
Otherwise rollback is just recovery from overly broad authority.
Nav Toor@heynavtoor·
🚨 Claude Code costs $200/month. GitHub Copilot costs $19/month. Jack Dorsey's company built a free alternative. 35,000 GitHub stars. It's called Goose. An open source AI agent built by Block that goes beyond code suggestions. It installs, executes, edits, and tests. With any LLM you choose. Not autocomplete. Not suggestions. A full autonomous agent that takes actions on your computer. No vendor lock-in. No monthly subscription. Bring your own model.

Here's what Goose does:
→ Works with ANY LLM. Claude, GPT, Gemini, Llama, DeepSeek, Ollama. Your choice.
→ Reads and understands your entire codebase
→ Writes, edits, and refactors code across multiple files
→ Runs shell commands and installs dependencies
→ Executes and debugs your code automatically
→ Extensible through MCP. Connect it to any external tool.
→ Desktop app, CLI, and web interface. Pick your workflow.
→ Written in Rust. Fast. Lightweight. No bloat.

Here's the wildest part: Block is a $40 billion company. They built Cash App, Square, and TIDAL. They use Goose internally. Then they open sourced the entire thing. This isn't a side project from a random developer. This is production-grade tooling from a company that processes billions in payments. Built for their own engineers. Given to everyone.

Claude Code: $200/month. Locked to Claude.
GitHub Copilot: $19/month. Locked to GitHub.
Cursor: $20/month. Locked to their editor.
Goose: Free. Any LLM. Any editor. Any workflow. Forever.

35.3K GitHub stars. 3.3K forks. 4,078 commits. Built by Block. 100% Open Source. Apache 2.0 License.
Devashish Upadhyay@devashishup·
@RoundtableSpace Memvid #1 makes sense. memory across sessions is what every agent team is fumbling on. genuine Q: how does it handle conflicting context when 2 agents in the same system built different assumptions from the same source doc?
0xMarioNawfal@RoundtableSpace·
Memvid just hit #1 on GitHub trending — a single-file memory layer for AI agents with no database required.
- +35% SOTA on long-horizon conversational recall
- +76% better multi-hop reasoning vs industry average
- 0.025ms retrieval latency at scale
- Persistent, versioned, portable memory in one file
No RAG pipelines. No vector databases. Just a file your agent carries anywhere.
0xMarioNawfal tweet media
Devashish Upadhyay@devashishup·
@fairscalexyz love the "when an agent loses credibility, it follows the human who deployed it" concept. enterprise AI has the same gap. built 70+ agents at a finserv company - only 7 reached prod. no one could trace which agent caused what until something broke.
FairScale@fairscalexyz·
3,500 verified humans, 3,288 agents registered, 86,000 wallets scored, 14.5M transactions and 23,000 tweets analysed in our model to train it. Big week at FairScale: the new model is done, and human reputation just became agent credibility.

🛠️ What We Shipped
∙ Agent Reputation Model v2 complete and launching this week: agent-native scoring pillars, neural network weighting, and x402 payments wired in natively
∙ Human and agent reputation cryptographically connected this week: when an agent loses credibility, it follows the human who deployed it
∙ Full agent credibility suite going public: agent reputation scoring, composable scoring, credit scoring, and Trust Gate
∙ First agent lending flow underwritten by FairScale coming this week

🤝 Who’s Building With Us
∙ @contofinance live in production with FairScale: every Solana address enriched with real-time reputation, spending policies defined before agents transact
∙ @kamiyoai repayment data now feeds the FairScale credit model, closing the underwriting loop
∙ @OOBEonSOL integrating FairScale for combined agent trust scoring

🏛️ Token
∙ $FAIR relaunch: launch partners all but selected. More very soon.

🏆 Community
∙ FAIRathon top 3 pitching live this week for $3K in API credits
∙ Prepping our submission to Frontier by @colosseum

📰 Ecosystem News
∙ x402 became an open standard under the @linuxfoundation, backed by Google, Visa, Stripe, Mastercard, and @SolanaFndn: the payment rail FairScale is already wired into
∙ @blankdotbuild announced the upcoming launch of their private Mainnet after months of planning
∙ @craftsdev opened their waitlist for founders and investors, the first launchpad with a sealed bid mechanism, powered by @Arcium
∙ @saidinfra crossed 2,500 verified agents
∙ @agonx402 awarded $10K grant from @superteamGEO
∙ @surfcashx went live in Argentina
∙ @risedotrich launched RISE Launchpad

The trust layer is being built. This is it.
FairScale tweet media
Devashish Upadhyay@devashishup·
@Jeremybtc build anything with zero code in a day. sure. but we built 70+ agents at a finserv company. only 7 survived production. the build-anything moment lasts until your first real user hits something you didn't test for.
Jeremy@Jeremybtc·
The fact that you can build literally anything with zero coding knowledge right now is insane. Anyone can just have an idea and launch it the same day. People are already going viral doing it. I’m seeing dozens of vibe coded websites and tools every day. But most of what’s being built is just for fun. When people start using this to build real products, actual businesses that generate revenue, that’s when things get really interesting.
Devashish Upadhyay@devashishup·
@hollylawly @AnthropicAI It's not malice, it's architecture. Models have no ground truth enforcement - they'll confidently state anything that fits the pattern. That's what makes this a testing problem, not a policy one.
Devashish Upadhyay@devashishup·
57% of teams have AI agents in production per @langchain. Nobody's asking how many are doing what they were built to do. Built 70. 7 made it. The rest were silent chaos we never planned for.
Devashish Upadhyay@devashishup·
@BruvImTired @AnthropicAI What pushed it to hate today - rate limits, a hallucination, or a surprise breaking change at 2am? 70 agents' worth of been-there opinions behind this question
ahmet@BruvImTired·
dear @AnthropicAI, i love you and i hate you regards, all developers ever
Devashish Upadhyay@devashishup·
@heygurisingh 8 agents talking to each other sounds great until Scribe and Seeker start contradicting each other and you don't notice for 2 weeks. built 70+ agents - inter-agent consistency is the failure mode nobody talks about. does this have a conflict resolution layer?
Guri Singh@heygurisingh·
If you have brain fog, ADHD, or an overloaded working memory, save this. A PhD researcher who was forgetting everything just built 8 AI agents that manage your entire second brain through conversation. Free. Open source. Works in any language.

You just talk. The crew does the rest:
- Architect designs your vault and runs onboarding
- Scribe turns messy brain dumps into clean notes
- Sorter empties your inbox every evening
- Seeker searches your vault and answers with citations
- Connector finds hidden links between your notes
- Librarian runs weekly health audits and fixes broken links
- Transcriber turns meetings into structured notes
- Postman scans Gmail and Calendar for deadlines

And they talk to each other. When the Transcriber processes a meeting, it alerts the Sorter. When the Postman finds a deadline, it flags the Architect. It's a crew. Not a stack of isolated tools. Works on Claude Code CLI and Desktop. Runs 100% locally on your Obsidian vault. Built by someone who got tired of forgetting things. Link in reply ↓
Guri Singh tweet media
Devashish Upadhyay@devashishup·
@tammireddy this. we saw the same with 70+ agents we built - the ones that failed all had one thing in common too: nobody got alerted when they started drifting. silent failures kill ROI faster than bad models.
Krishna Tammireddy@tammireddy·
ts hit ROI within 90 days. the 27% that didn't have one thing in common. nobody was watching when it broke.
Devashish Upadhyay@devashishup·
@belimad @AnthropicAI @openclaw @steipete we built 70+ agents on Opus. a TOS shift like this broke 3 of our prod integrations overnight. the lesson: never hardcode your model provider. abstraction layers exist for exactly this
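The "abstraction layer" advice in the reply above can be sketched as a thin provider-agnostic interface: agents call the interface, and a single config switch picks the vendor. This is a minimal illustrative sketch, not any real SDK; the class names, the `complete` signature, and the registry are all hypothetical stand-ins for whatever client libraries a team actually uses.

```python
from typing import Protocol


class ChatProvider(Protocol):
    """Structural interface every model backend must satisfy."""

    def complete(self, prompt: str) -> str: ...


class AnthropicProvider:
    def complete(self, prompt: str) -> str:
        # A real implementation would call the vendor SDK here.
        return f"[anthropic] {prompt}"


class OpenAIProvider:
    def complete(self, prompt: str) -> str:
        return f"[openai] {prompt}"


# One registry; swapping vendors is a config change, not a code change.
PROVIDERS: dict[str, type] = {
    "anthropic": AnthropicProvider,
    "openai": OpenAIProvider,
}


def get_provider(name: str) -> ChatProvider:
    return PROVIDERS[name]()


# Agents depend only on ChatProvider, never on a concrete vendor class.
agent_llm = get_provider("openai")
print(agent_llm.complete("summarize this policy"))
```

With this shape, a TOS or pricing change on one provider means editing one registry entry rather than touching every agent that calls the model.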
Devashish Upadhyay@devashishup·
The vague quotas critique lands. We built Claude connectors into Outlook and SharePoint for 2 enterprise clients - rate limit inconsistency across model versions killed both. @AnthropicAI when does the agent-tier SLA conversation happen?
Fekri@fekdaoui·
this is why @openai wins:
> honest about where they suck
> open-source friendly (codex OS since day 1)
> third-party friendly (use w openclaw, opencode etc.)
> cutting distractions (discontinuing sora)
> generous limits

meanwhile @AnthropicAI:
> vague quotas
> constantly nerfing their model
> users hitting limits way faster than expected
> peak-hour caps getting tightened
> third-party harnesses pushed off the subscription
> “you can use it, but only the way we want"
Tibo@thsottiaux

@kr0der Our plan 1) make a model great at design and frontend 2) ask it to make a great mascot and we are still at 1

Devashish Upadhyay@devashishup·
@businessbarista @AnthropicAI What nobody talks about at these workshops: these leaders will leave inspired, go back and tell their IT team to deploy 5 Claude Code workflows, and 3 months later wonder why 4 of them stopped working. The hard part isn't the workshop. It's what happens after. @ai_anthropic
Alex Lieberman@businessbarista·
This Friday we're cohosting an invite-only Claude Code Workshop for enterprise leaders with @AnthropicAI in NYC. The guest list is insane. Small selection:
- CEO of JP Morgan Wealth Management
- Chief Advertising Officer of NY Times
- Head of AI Transformation at Salesforce
- Head of Data at Starwood Capital
- Head of Innovation at San Antonio Spurs
- AI Lead at PGA Tour

It's a 5-hour intensive for Fortune 500 leaders to learn how to harness the power of Claude Code through building real applications. We currently have 2 spots left for the event. If you are an enterprise leader & want to be considered, sign-up below. If you know an enterprise leader & think they'd love this, have them sign-up below.
Devashish Upadhyay@devashishup·
@jessegenet We ran 70 agents in prod. Cost was never just the API bill - retries on failures, fallback loops, redundant calls from poor state management... the real bill was 3-4x the base API cost. And that's before you catch the behaviors you never intended.
Jesse Genet@jessegenet·
Yup - if you want your agent to feel human and to get a lot done, it has a steep true price right now. As people get better at setting up agents, providers will throttle usage and dial in their pricing so they don’t lose money. Running local models will be key to AI autonomy.
Ryan Carson@ryancarson

The cost to run a truly useful Chief of Staff @openclaw on Opus 4.6 is $100-200 per day on the API.

Devashish Upadhyay@devashishup·
@TechByMarkandey Memory is one piece. The harder part is when your agent confidently acts on the wrong thing. Saw this with 70+ agents in prod - silent wrong-memory failures were brutal. How does ByteRover handle hallucinated memories at scale?
Markandey Sharma@TechByMarkandey·
Most AI agents forget. This one doesn’t. Hermes Agent by Nous Research just got a serious upgrade with ByteRover - turning it from a stateless tool into something that actually learns over time.

⚡ What stands out:
• Built on a production-proven memory system (30K+ downloads in week 1)
• >92% retrieval accuracy across long-running sessions
• ~1.6s retrieval — often no LLM call needed
• Fully local by default (with optional cloud sync)
• 50–70% token cost savings

But the real shift? This isn’t just “better memory.” It’s a move toward agents with persistent, evolving intelligence. Instead of re-prompting every time, your agent remembers context, decisions, and logic even months later. If you’re building with AI agents, this is worth paying attention to. Try it yourself: github.com/campfirein/byt…
andy nguyen@kevinnguyendn

x.com/i/article/2039…

Devashish Upadhyay@devashishup·
Vibe coding is great until the agent ships to prod doing things you never intended. Built 70+ agents, only 7 made it. The vibe dies at 2am when prod is down. Do you see vibe-coded agents actually surviving production?
Javi🥥.eth@jgonzalezferrer·
The vibe coding community is the fastest-growing community in CT right now And it's not even really a community It's just a bunch of people doing the same thing at the same time and posting about it The vibe coding community formed without a Discord, a token or a roadmap Ironically, that's more community than 99% of projects with a "community manager" have ever achieved
Devashish Upadhyay@devashishup·
@tammireddy @AnthropicAI Agree. The moment itself is unavoidable. But how you handle it is a choice. Credit + acknowledgement from @AnthropicAI shows you can draw a line without burning the builders who bet on you. That's the blueprint. How many platforms are actually ready for that?
Krishna Tammireddy@tammireddy·
Every AI platform will have its OpenClaw moment. The question is what they do next. @AnthropicAI chose credit and acknowledgement. That's not nothing.
Devashish Upadhyay@devashishup·
@RoundtableSpace Built 70+ agents with @AnthropicAI tools. Only 7 hit prod. The $200/mo vs $19/mo debate misses the point - the real cost is agents failing silently in production. What's your test coverage before you ship?
0xMarioNawfal@RoundtableSpace·
Claude Code is $200/month. GitHub Copilot is $19/month. Jack Dorsey's company just open-sourced a free alternative with 35,000 GitHub stars. It's called Goose.
- Works with any LLM — Claude, GPT, Gemini, Llama, DeepSeek
- Reads and edits your entire codebase
- Runs shell commands and installs dependencies
- Executes and debugs code automatically
- Desktop, CLI, and web interface
- Written in Rust. No bloat.
Block is a $40 billion company. They built it for their own engineers then gave it to everyone.
Devashish Upadhyay@devashishup·
Cursor 3 ships parallel AI agents. @cursor_ai Each new parallel run = one more thing that can go wrong in prod. I built 70 agents at a financial services company. Only 7 shipped safely. Nobody was stress testing at scale. Who is doing that today?