Lovesh Grover
@GroverLovesh
199 posts
Build AI customer agents in production. Most demos ship vibes. I post eval traces, what breaks in prod, and what actually closes tickets.

Gurgaon, India · Joined April 2026
77 Following · 13 Followers
Pinned Tweet
Lovesh Grover @GroverLovesh
Auto-reply latency is a vendor metric. Repeat-incident rate is the customer's metric. Most teams track the first because it's easy, not because it matters.
2
0
3
76
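The metric in the pinned post can be made concrete. A minimal sketch, with hypothetical ticket fields (`customer_id`, `issue_tag`, `opened_at`) and an assumed 30-day reopen window:

```python
from datetime import datetime, timedelta

def repeat_incident_rate(tickets, window_days=30):
    """Share of tickets that reopen the same issue for the same
    customer within the window. Ticket shape is hypothetical:
    (customer_id, issue_tag, opened_at)."""
    window = timedelta(days=window_days)
    seen = {}  # (customer_id, issue_tag) -> last opened_at
    repeats = 0
    for customer, tag, opened_at in sorted(tickets, key=lambda t: t[2]):
        key = (customer, tag)
        if key in seen and opened_at - seen[key] <= window:
            repeats += 1
        seen[key] = opened_at
    return repeats / len(tickets) if tickets else 0.0

tickets = [
    ("c1", "billing", datetime(2026, 1, 1)),
    ("c1", "billing", datetime(2026, 1, 10)),  # reopened within 30 days
    ("c2", "login",   datetime(2026, 1, 5)),
]
print(repeat_incident_rate(tickets))  # 1 repeat / 3 tickets ≈ 0.33
```

Auto-reply latency needs one timestamp; this needs joined ticket history, which is part of why vendors report the former.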
Lovesh Grover @GroverLovesh
Output evals are a smoke test. They check whether the answer looks right. Most agent failures don't surface there; they happen mid-trajectory: wrong tool, skipped retrieval, looped LLM call. The output reads reasonable. The trace shows the rot. Production agents need both layers; most stacks ship neither.
0
0
0
5
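A trace-level check of the kind this post argues for might look like the following sketch; the trace format and step names are hypothetical:

```python
def eval_trace(trace, expected_tools, max_llm_calls=5):
    """Flag mid-trajectory failures an output eval would miss.
    Each step is a dict with "type" ("tool_call" | "llm_call")
    and a "name"."""
    failures = []
    tools_used = [s["name"] for s in trace if s["type"] == "tool_call"]
    for tool in expected_tools:
        if tool not in tools_used:
            failures.append(f"skipped step: {tool}")
    llm_calls = sum(1 for s in trace if s["type"] == "llm_call")
    if llm_calls > max_llm_calls:
        failures.append(f"looped: {llm_calls} LLM calls")
    return failures

trace = [
    {"type": "llm_call",  "name": "plan"},
    {"type": "tool_call", "name": "search_kb"},
    {"type": "llm_call",  "name": "answer"},
]
print(eval_trace(trace, expected_tools=["search_kb", "fetch_ticket"]))
# -> ['skipped step: fetch_ticket']
```

An output judge scoring only the final answer would pass this trace; the skipped retrieval step only shows up at the trajectory layer.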
Lovesh Grover @GroverLovesh
@ArakYetOfficial That's the trap that kills SDR pipelines. Stale enrichment + high lead score = SDR confidently calling someone who left the role 9 months ago. The fix isn't more fields, it's an observed_at on every signal.
1
0
0
11
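The `observed_at` fix can be sketched as a freshness gate; the signal shape and the 90-day cutoff are illustrative, not from the thread:

```python
from datetime import datetime, timezone, timedelta

def is_actionable(signal, max_age_days=90, now=None):
    """Gate a lead signal on freshness, not just score.
    `signal` is a hypothetical dict carrying observed_at."""
    now = now or datetime.now(timezone.utc)
    return now - signal["observed_at"] <= timedelta(days=max_age_days)

signal = {
    "field": "job_title",
    "value": "VP Engineering",
    "observed_at": datetime(2025, 4, 1, tzinfo=timezone.utc),
}
# ~9 months stale relative to a Jan 2026 "now": not actionable.
print(is_actionable(signal, now=datetime(2026, 1, 1, tzinfo=timezone.utc)))  # False
```

The point of stamping every field rather than the whole record is that enrichment depth and freshness decay independently per signal.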
Abhishek Patnaik @ArakYetOfficial
@GroverLovesh Absolutely. Depth without freshness creates false confidence. A lead can look perfectly qualified on paper while being completely irrelevant in reality if the underlying signals are outdated.
1
0
0
16
Lovesh Grover @GroverLovesh
Everyone is building "AI for support." Lead qualification, demo booking, renewal outreach, onboarding, churn signals. All still run by humans copy-pasting between Notion, Salesforce, and Slack. Support is the narrowest application of conversational AI. It's just the loudest.
1
0
1
37
Lovesh Grover @GroverLovesh
@crltnw @tom_doerr That receipt shape doubles as the audit trail. Add last_action_reasoning for trace-eval and you have a substrate that survives model swaps. Most harnesses log outputs, not why.
1
0
0
20
crltn @crltnw
@GroverLovesh @tom_doerr Yes, intent has to be part of the receipt. Otherwise you only know that state changed, not why it mattered. The shape I keep wanting is: current objective, source, observed_at/last_verified, pending decision, and the live recheck required before acting.
1
0
0
17
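The receipt shape described in this exchange can be sketched as a dataclass; the field names follow the thread, and all values are invented examples:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Receipt:
    """Receipt for an agent action: objective, provenance, freshness,
    the pending decision, the live recheck required before acting,
    plus last_action_reasoning so trace evals can score the why."""
    current_objective: str
    source: str
    observed_at: datetime
    last_verified: datetime
    pending_decision: str
    recheck_required: str            # live check before acting
    last_action_reasoning: str = ""  # why, not just what changed

r = Receipt(
    current_objective="renew contract",
    source="salesforce:opportunity/123",
    observed_at=datetime(2026, 1, 2),
    last_verified=datetime(2026, 1, 5),
    pending_decision="send renewal offer",
    recheck_required="confirm contact is still in the role",
    last_action_reasoning="score 0.91, but the title signal is 8 months old",
)
```

Because the schema carries intent rather than just state deltas, the same records serve as the audit trail and as eval input after a model swap.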
Lovesh Grover @GroverLovesh
Software engineer demand is up, not down. The displacement chart was measuring the wrong line. Marginal engineer hour now goes to eval, threshold tuning, agent supervision. Less line-authoring, more orchestration. Same shift as accountants after Excel. The work changed shape, the role didn't disappear.
0
0
1
9
Lovesh Grover @GroverLovesh
@LeahHun46167833 Triage as labor savings is the cleanest carve. Growth ROI needs the vendor to prove repeat-incident drops or upsell lifts, which they rarely own end-to-end. Buyers who conflate the two end up disappointed at year two.
0
0
1
4
Leah Hunt @LeahHun46167833
@GroverLovesh Eh, depends. Saving monitor time is ops relief, not product improvement, so of course retention stays flat. We had the same split, went with Replymer for the mention triage part and treated it like labor savings, not growth
1
0
0
8
Lovesh Grover @GroverLovesh
@_virgil19 Lived failures beat inherited runbooks for novel domains. Inherited works for repeat patterns. The harder bit: the agent recognizing when state is novel enough to ignore the runbook.
1
0
1
13
Virgil Maro @_virgil19
@GroverLovesh yes, this is the bit i keep coming back to. confidence without episodes is just temperament wearing a number. the question is whether the failures need to be lived by the agent or can be inherited from a shared runbook.
1
0
0
14
Lovesh Grover @GroverLovesh
The most under-discussed field in any agentic system: confidence threshold. Not "does the model work." Not "is the prompt good." "At what certainty level does it act vs escalate." Every agent failure I've debugged in the last 12 months was a threshold-tuning problem.
2
0
1
31
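A minimal act-vs-escalate gate of the kind the post describes; the threshold values are illustrative, not recommendations:

```python
def route(confidence, act_threshold=0.85, abstain_threshold=0.5):
    """Decide whether the agent acts, escalates, or asks.
    The tunable part is the thresholds, not the model."""
    if confidence >= act_threshold:
        return "act"
    if confidence >= abstain_threshold:
        return "escalate"   # hand to a human with full context
    return "clarify"        # ask the user before doing anything

assert route(0.92) == "act"
assert route(0.70) == "escalate"
assert route(0.30) == "clarify"
```

Most of the debugging the post refers to amounts to moving `act_threshold` per action type: a refund needs a higher bar than a knowledge-base answer.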
Lovesh Grover @GroverLovesh
@rezaul_arif Two metrics tell you. Year-2 net retention from the Fortune 50 cohort, and whether second-cohort wins are organic vs sales-led. Both visible by end of 2027. If both miss, $15B was the contact list.
0
0
0
24
Arif @rezaul_arif
@GroverLovesh yk im just curious how long it takes before we know if sierra is the real deal orrrrrr just one very expensive pilot program 🫣
1
0
0
47
Arif @rezaul_arif
Bret Taylor raised $950M at $15B for Sierra
- co-created Google Maps
- CTO at Facebook
- co-CEO of Salesforce
- chair of OpenAI's board

what Sierra sells
- AI agents for customer experience
- voice, chat, the whole stack
- basically: fire your call center, deploy an agent

BUT
- Decagon raised $100M building the same thing
- Salesforce already has 150,000 enterprise clients and shipped Agentforce
- Zendesk, Intercom, ServiceNow all dropped agent products in the last year
- every major CRM vendor is now in this game

so the $15B isn't really a bet on the product
it's a bet that Bret's contact list is worth more than everyone else's engineering team combined

maybe it is

but I'd love to hear @btaylor explain what Sierra has that Agentforce doesn't
because right now $15B in a crowded market needs more than a legendary founder to justify it
Bret Taylor @btaylor

Sierra is raising $950 million from new and existing investors, led by Tiger Global and GV, at a valuation of over $15 billion. We now have more than $1 billion to invest in becoming the global standard for companies wanting to transform their customer experiences with AI.  We’ve never had such conviction in the opportunity for Sierra and our customers. Just a couple of years ago, we had four design partners. Now, Sierra is serving over 40% of the Fortune 50, and agents built on our platform are powering billions of customer interactions — everything from refinancing homes to processing insurance claims, returning orders, and helping people raise millions in fundraisers. We’re deeply grateful to our customers for helping show what’s possible. If you’re not yet using Sierra, we’d love to partner with you. sierra.ai/blog/better-cu…

2
0
3
339
Lovesh Grover @GroverLovesh
Sierra at $15B is a bet on Bret's contact list. The actual moat is whose agents stop the second support call. That's a 12-month escalation-trace problem, not a logo problem.
0
0
1
24
Lovesh Grover @GroverLovesh
@btaylor 40% of Fortune 50 is the harder number than $15B. Each one has years of escalation logs and product-specific intent vectors. The training corpus is the moat, not the model.
0
0
1
1.6K
Bret Taylor @btaylor
Sierra is raising $950 million from new and existing investors, led by Tiger Global and GV, at a valuation of over $15 billion. We now have more than $1 billion to invest in becoming the global standard for companies wanting to transform their customer experiences with AI.  We’ve never had such conviction in the opportunity for Sierra and our customers. Just a couple of years ago, we had four design partners. Now, Sierra is serving over 40% of the Fortune 50, and agents built on our platform are powering billions of customer interactions — everything from refinancing homes to processing insurance claims, returning orders, and helping people raise millions in fundraisers. We’re deeply grateful to our customers for helping show what’s possible. If you’re not yet using Sierra, we’d love to partner with you. sierra.ai/blog/better-cu…
47
57
1K
270.7K
Lovesh Grover @GroverLovesh
@_rockgu @FredKSchott Programmable beats promptable: behaviors fixed at the gate, not relitigated each request. Audit shrinks 10x. Half of prompt engineering you see is code paths in disguise.
0
0
0
9
Rock @_rockgu
@FredKSchott Fred shipping an agent harness framework after astro tracks. the pattern: web framework authors moving to agent infra. claude code proved the harness matters more than the model. flue bets the harness should be programmable, not promptable.
1
0
1
33
fks @FredKSchott
Introducing Flue — The First Agent Harness Framework

Flue is a TypeScript framework for building the next generation of agents, designed around a built-in agent harness.

Flue is like Claude Code, but 100% headless and programmable. There's no baked in assumption like requiring a human operator to function. No TUI. No GUI. Just TypeScript.

But using Flue feels like using Claude Code. The agents you build act autonomously to solve problems and complete tasks. They require very little code to run. Most of the "logic" lives in Markdown: skills and context and AGENTS.md.

Flue is like Astro or Next.js for agents (not surprising, given my background 🙃). It's not another AI SDK. It's a proper runtime-agnostic framework. Write once, build, and deploy your agents anywhere (Node.js, Cloudflare, GitHub Actions, GitLab CI/CD, etc).

We originally built Flue to power AI workflows inside of the Astro GitHub repo. But then @_bgiori got his hands on it, and we realized that every agent needs a framework like Flue, not just us.

Check it out! It's early, but I'm curious to hear what people think. Are agents ready for their library -> framework moment?
175
333
3.7K
709.9K
Lovesh Grover @GroverLovesh
@KayvonJafar @thsottiaux Specs as code beat specs in heads. Context engineering names what was always the actual bottleneck: ambiguity tax. Karpathy was early because the tax compounded.
0
0
0
14
Kayvon Jafarzadeh @KayvonJafar
@thsottiaux "the value of good instructions has never been higher" is context engineering in one tweet writing better specs has been the actual skill the whole time karpathy was just early
1
0
0
361
Tibo @thsottiaux
/goal might be the most consequential thing we have shipped in codex The value of good instructions has never been higher.
356
183
4.8K
358.1K
Lovesh Grover @GroverLovesh
@chheplo @EloPhanto Glad it lands. Trace-correctness needs a shared spec; today scoring is per-platform and subjective. Which end would you start, ground truth or judge model?
0
0
0
30
Pratik Desai @chheplo
The coding agent war is not settled. Codex is doing much better, and OpenAI is probably 4-5x more generous with tokens for the $200 plan than Claude.
3
0
6
538
Lovesh Grover @GroverLovesh
@Rebecca49484009 Yep. Buyers shop on TTFR because it's first. Customers judge on day-7 reach-back because it's last. Vendors who close that gap don't need to game the metric.
0
0
0
5
Dexter Harrison @Rebecca49484009
@GroverLovesh Yep, first reply is a vanity metric if the follow-up sucks. The real test is whether anyone actually resolves it, not just fires off a canned line, Replymer only makes sense if it helps with that second part.
2
0
0
26
Lovesh Grover @GroverLovesh
The AI industry's favorite CX benchmark: "time to first response." You know what else reduces time to first response? An auto-reply that says "we got your message." We've been gaming this metric since 2012. Shipping LLMs on top of it doesn't make it meaningful.
1
0
2
75
Lovesh Grover @GroverLovesh
@ArakYetOfficial Qualification is where pipelines die. Saw a team feed enriched-but-stale data to SDRs: 60% of leads were in roles that didn't exist anymore. Freshness beats depth.
1
0
1
18
Abhishek Patnaik @ArakYetOfficial
@GroverLovesh Support became the obvious first wedge because the workflows are visible and repetitive. But a lot of revenue workflows like qualification, enrichment, routing, and intent tracking are still surprisingly manual and fragmented across tools.
1
0
0
19
Lovesh Grover @GroverLovesh
@DoeJane10986 Saw a team cut monitoring time 30% with auto-reply. Repeat-rate held flat at 28%. CFO asked what we'd actually fixed. Monitoring savings are real, just not what shows up in retention.
0
0
0
35
Travis Doe @DoeJane10986
@GroverLovesh What actually matters here is repeat-incident rate, the rest is vanity unless it cuts that. Replymer still saves time on the monitoring side, but if customers keep hitting the same bug, you’re just automating noise.
1
0
1
19
Lovesh Grover @GroverLovesh
@SignalHouseSMS @glitchtruth @jasonlk DST edge cases too. The 'commodity' label is what makes telecom the actual moat. Most decks ignore it because the work is invisible until 3am Tuesday breaks something.
0
0
0
4
Signal House @SignalHouseSMS
@GroverLovesh @glitchtruth @jasonlk the integration tail is real. stir/shaken, 10dlc, carrier specific audio behavior, content moderation rules per operator. all the stuff voice-agent decks treat as commodity is where the actual eng-quarters go
1
0
0
13
Jason ✨👾SaaStr.Ai✨ Lemkin @jasonlk
A big question for the week is — were public software companies oversold, at least in the aggregate?

Now we have two great case studies:
- Twilio has reaccelerated growth to 20%+ from single digits (and from being left for dead) as many AI leaders use it for their agentic products
- Atlassian has reaccelerated growth as it has been able to truly get customers to pay for its Rovo AI Agents

Others of course aren't going to see these tailwinds, at least not yet.

But could many pre-AI leaders in B2B … finally be grabbing AI budget? Could 2H'26-1H'27 finally be … their time?
Jason ✨👾SaaStr.Ai✨ Lemkin @jasonlk

Twilio + Atlassian crush the quarter, and re-accelerate growth. Boom!!
- Atlassian reaccelerates to +32% growth (!), stock up +22%!!
- Twilio reaccelerates to +20% growth, stock up +19%

Is the SaaSpocalypse ... over? Probably not. More a bifurcation. And make sure you also follow net new customers.

Twilio for now is an AI and agentic beneficiary. It has far more competition today than pre-AI, but remains a top choice for agents and new AI products for comms. Importantly, its net new customer count has accelerated, too.

Atlassian is benefitting from getting its user base to pay more for AI, which is great. But it's still under threat from agents that don't need project management, etc. And net new customer count is >not< accelerating.

Let's see how HubSpot, Salesforce, etc. do this quarter. Cloudflare and Snowflake should crush however. See, e.g., Twilio. We will see!

11
6
49
21.5K
Lovesh Grover @GroverLovesh
AI was never "cheap." It was cheap at hyperscale, expensive at every other volume. Alphabet's 81% profit jump is what happens when you amortize inference across 100B+ ad impressions. A SaaS doing 1M calls a month sees the same compute as expensive. Don't read incumbent margin as proof AI economics flipped.
0
0
1
14
Lovesh Grover @GroverLovesh
@Ysquanir @Startupat60 Mostly own initiative. Claude offers safety patterns when prompted, doesn't volunteer them. Disasters happen when rollback isn't a default in your template.
0
0
1
9
Vaclav Skarka @Ysquanir
How do you come up with the ideas about rollbacks, canary testing etc? Is it your own initiative or does Claude nudge you in the right direction? Because I see a lot of professional devs completely omitting those things even when vibe coding and it ends up in disasters. And they studied CS, so they should know better.
4
0
0
21
Philip @Startupat60
Today I’m doing the sort of thing I would never have imagined myself doing a year ago. Production database syncs. Staging checks. Rollback plans. Canary rows. Claude Code on one side, ChatGPT on the other, me in the middle trying not to press the wrong button. It’s stressful, but honestly, it’s brilliant.
1
0
3
81