The hard part about LLM failures is that their outputs rarely look like failures.
The demo “works.”
The output sounds coherent.
The user actively uses the product.
And your dashboard looks normal.
Meanwhile, the system can be wrong, unsafe, or quietly driving up token spend. And you won’t notice until the damage adds up.
Prompts often serve as business logic (policies, safety, and product context). But many teams ship them without the basics, such as versioning, reviewable changes, end-to-end traces, and eval gates.
In production, it doesn’t crash. It degrades via wrong answers, policy misses, and surprise spending.
No crash. No error. No alert.
I cover this exact issue in my @Stanford CS 224G guest lecture on AI Observability and Evaluations.
Here are the core ideas:
• If you only log the final output, you’re guessing. Full traces show where it broke.
• Evals are feedback loops. Use clear pass/fail criteria tied to outcomes.
• Run evals continuously on production traces and don’t wait for support tickets.
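The pass/fail idea above can be sketched in a few lines. This is a minimal illustration, not any specific tool's API: the trace fields (`input`, `output`, `total_tokens`) and both criteria are hypothetical stand-ins for whatever your product's outcomes actually are.

```python
# Sketch: binary pass/fail evals scored over production traces.
# Trace shape and criteria are hypothetical examples.

def contains_refund_policy(trace: dict) -> bool:
    """Pass if a refund question got an answer citing the policy."""
    if "refund" not in trace["input"].lower():
        return True  # criterion doesn't apply to this trace
    return "refund policy" in trace["output"].lower()

def within_token_budget(trace: dict, budget: int = 2000) -> bool:
    """Pass if the call stayed under a token budget (cost guardrail)."""
    return trace["total_tokens"] <= budget

EVALS = [contains_refund_policy, within_token_budget]

def run_evals(traces: list[dict]) -> dict[str, float]:
    """Score every trace against every criterion; return pass rates."""
    passed = {e.__name__: 0 for e in EVALS}
    for trace in traces:
        for e in EVALS:
            passed[e.__name__] += e(trace)
    return {name: count / len(traces) for name, count in passed.items()}

traces = [
    {"input": "How do refunds work?", "output": "Per our refund policy...", "total_tokens": 812},
    {"input": "How do refunds work?", "output": "Sure, happy to help!", "total_tokens": 3120},
]
print(run_evals(traces))  # → {'contains_refund_policy': 0.5, 'within_token_budget': 0.5}
```

Run this on a rolling sample of real traffic, not just a fixed test set, and a dip in any pass rate becomes the alert that "no crash, no error" never gives you.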
The moat isn’t prompt cleverness. It’s measured, continuous improvement.
Full lecture + blog below 👇