Arsh Shah Dilbagi

617 posts


@arshdilbagi

https://t.co/C3UaMh1UYM

San Francisco, CA · Joined December 2010
1 Following · 2.2K Followers
Pinned Tweet
Arsh Shah Dilbagi
Arsh Shah Dilbagi@arshdilbagi·
The hard part about LLM failures is that their outputs rarely look like failures. The demo “works.” The output sounds coherent. The user actively uses the product. And your dashboard looks normal. Meanwhile, the system can be wrong, unsafe, or quietly driving up token spend, and you won’t notice until the damage adds up.

Prompts often serve as business logic (policies, safety, and product context), but many teams ship them without the basics: versioning, reviewable changes, end-to-end traces, and eval gates. In production, the system doesn’t crash. It degrades via wrong answers, policy misses, and surprise spending. No crash. No error. No alert.

I cover this exact issue in my @Stanford CS 224G guest lecture on AI Observability and Evaluations. Here are the core ideas:
• If you only log the final output, you’re guessing. Full traces show where it broke.
• Evals are feedback loops. Use clear pass/fail criteria tied to outcomes.
• Run evals continuously on production traces; don’t wait for support tickets.
The moat isn’t prompt cleverness. It’s measured improvement.
Full lecture + blog below 👇
29 replies · 19 reposts · 127 likes · 19.8K views
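The tracing and eval-gate ideas in the pinned tweet above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture or any real library: the `Trace` class, `eval_gate` function, and the specific checks (cost cap, banned-phrase policy, non-empty output) are all hypothetical names chosen for the example.

```python
import json
import time

class Trace:
    """Records every step of an LLM call chain, not just the final output."""
    def __init__(self, request_id):
        self.request_id = request_id
        self.steps = []

    def log(self, name, **data):
        # Each step keeps its own timestamp and metadata (cost, output, etc.).
        self.steps.append({"step": name, "ts": time.time(), **data})

    def final_output(self):
        return self.steps[-1].get("output") if self.steps else None

def eval_gate(trace, max_cost_usd=0.05, banned_phrases=("as an AI",)):
    """Clear pass/fail criteria tied to outcomes: budget, policy, usefulness."""
    cost = sum(s.get("cost_usd", 0.0) for s in trace.steps)
    output = trace.final_output() or ""
    checks = {
        "within_budget": cost <= max_cost_usd,
        "policy_clean": not any(p.lower() in output.lower() for p in banned_phrases),
        "non_empty": bool(output.strip()),
    }
    return all(checks.values()), checks

# Simulated production trace: prompt rendering, retrieval, model call.
trace = Trace("req-001")
trace.log("render_prompt", prompt_version="v12", cost_usd=0.0)
trace.log("retrieve_context", docs=3, cost_usd=0.001)
trace.log("model_call", output="Refunds are processed in 5-7 days.", cost_usd=0.02)

passed, checks = eval_gate(trace)
print(json.dumps({"passed": passed, **checks}))
```

Because each step is logged with its own cost and metadata, a failing gate points at the step that broke (e.g. a retrieval that returned nothing, or a model call that blew the budget) instead of just flagging the final answer. Run the same gate over sampled production traces on a schedule and you have the continuous eval loop the tweet describes.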
Kevin Yang
Kevin Yang@kevinyang·
We raised $6.5M to build the agent for professionals. When your reputation is on the line, you need an agent that's reliable, secure, and one step ahead. Try it now at serif.ai
70 replies · 59 reposts · 433 likes · 65.9K views
Benjamin Stern
Benjamin Stern@itsbenjyyy·
Ten years ago I was building factories. Today I'm building the tools I wish I had inside them. @TenkaraAI raised $7M led by @trueventures.
Benjamin Stern tweet media
51 replies · 20 reposts · 134 likes · 29.9K views
Camilla Guo
Camilla Guo@brainstub·
In celebration of our Series B announcement today, here's what I observed as the 7th employee @sundayrobotics: 1/ Full-stack is everything. Our Memory Glove took 50+ iterations from hardware to ML. Left: first prototype 2024. Right: latest gen.
Camilla Guo tweet media
28 replies · 36 reposts · 581 likes · 71.2K views
Arsh Shah Dilbagi
Arsh Shah Dilbagi@arshdilbagi·
We started a newsletter, and 100,000+ product leaders started reading it.

A year ago, we launched Adaline Labs to document what we were actually learning as we built with LLMs. No grand strategy. Just a gap we kept running into: product teams were being asked to ship AI products without a practical resource to lean on. The research was too dense. The hype was too thin. So we wrote the thing we wished existed.

The lesson that surprised us the most: the more specific and technical we got, the faster the audience grew. Builders do not want surface-level takes. They want depth they can act on.

Our readers told us exactly what they needed:
⚬ How do LLMs actually work?
⚬ How do I build with them reliably?
⚬ How do I evaluate what goes to production?
⚬ How do I keep up as the models keep changing?
⚬ How do I build a modern workflow with Claude Code, Cursor, etc.?
We did not pick those topics. We listened, researched, studied, and wrote.

Year two is already underway. Here is what we are watching:
⚬ AI agents are entering real production infrastructure quickly.
⚬ Evals and observability are becoming non-negotiable.
⚬ AI coding tools are changing how teams ship.
⚬ The definition of product work is being rewritten in real time.
We are here for all of it.

Read the full story below: what we believed, what turned out to be true, and what completely surprised us.
Arsh Shah Dilbagi tweet media
2 replies · 0 reposts · 6 likes · 311 views
Arsh Shah Dilbagi
Arsh Shah Dilbagi@arshdilbagi·
14/14 AI changes what’s possible. It doesn’t change what’s required to build a real business. Trust, usage, distribution, and judgment are still the only things that compound. Full breakdown from all 4 investors: go.adaline.ai/BTflOOL
0 replies · 0 reposts · 1 like · 224 views
Arsh Shah Dilbagi
Arsh Shah Dilbagi@arshdilbagi·
13/14 Governance won’t arrive through regulation first. It'll be built into products: auditability, access controls, rollback mechanisms. Trust is earned through control, not promises. The companies that operationalize governance first will own the most regulated markets.
Arsh Shah Dilbagi tweet media
1 reply · 0 reposts · 2 likes · 289 views
Arsh Shah Dilbagi
Arsh Shah Dilbagi@arshdilbagi·
AI has never attracted more capital. Yet betting on the right companies has never been harder. 4 investors share exactly where value compounds and where it doesn’t.
Arsh Shah Dilbagi tweet media
3 replies · 6 reposts · 30 likes · 15.3K views