Stair AI

48 posts

@Stair_AI

The infrastructure that makes machine intelligence transparent, accountable, and bankable in the age of autonomous finance.

Joined November 2025
88 Following · 19 Followers
Stair AI @Stair_AI
If you care about how AI actually performs in the real world, not just on benchmarks, you might want to join this event on May 18. Arvind Narayanan will be at the Stanford Digital Economy Lab sharing his work on the gap between AI capability and AI reliability. This framing matters more than most evaluation work happening today. Virtual attendance open.
Arvind Narayanan @random_walker

Excited to give this talk at the Stanford Digital Economy Lab on May 18! I will do three things: discuss my group's recent research, identify the most pressing gaps in the community's current understanding, and provide a long-term perspective. Hope to see you there in person or virtually. digitaleconomy.stanford.edu/event/arvind-n… @DigEconLab

Stair AI @Stair_AI
Thanks for the post! The teammate framing is really useful. Curious about the next layer: when people are building agents with Claude rather than working alongside it, what would you add to this list? The context and config parts seem to map directly. Wondering what extra things matter once the agent is the one running the loop.
Eugene Yan @eugeneyan
some thoughts on working with ai models
• context as infra
• taste as config
• verification for autonomy
• scaling via delegation
• closing the loop
eugeneyan.com/writing/workin…
Stair AI @Stair_AI
This is the gap we work on at Stair: not building agents, but building the layer that lets you see what they're doing and grade whether their reasoning matches their behavior. The Reddit thread is a good map of what production builders are actually running into. Worth reading the whole thing if you build in this space.
Stair AI @Stair_AI
That's why the 80/20 consensus makes sense. Pushing more decisions into deterministic code isn't anti-agent. It's a way to keep the part you can't fully observe small enough to reason about. The smaller the agent's surface area, the more legible its failures.
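A minimal sketch of that pattern (all names and policy values hypothetical): the workflow is plain deterministic code, and the model is consulted at exactly one narrow, verifiable decision point.

```python
# Hypothetical refund workflow illustrating the 80/20 split: deterministic
# guards around a single model call, with the model's output verified before
# anything acts on it.

def handle_refund(order: dict, llm_classify) -> str:
    # Deterministic guards: no model involved, fully auditable.
    if order["amount"] > 500:
        return "escalate_to_human"
    if order["days_since_purchase"] > 30:
        return "deny_outside_policy_window"

    # The agent's entire surface area: one classification call.
    label = llm_classify(order["customer_message"])

    # Deterministic verification: an unexpected label never reaches execution.
    if label not in {"defective", "changed_mind"}:
        return "escalate_to_human"
    return "approve_refund" if label == "defective" else "offer_store_credit"

# Usage with a stub standing in for the real model call:
order = {"amount": 120, "days_since_purchase": 5,
         "customer_message": "The item arrived broken."}
print(handle_refund(order, lambda msg: "defective"))  # -> approve_refund
```

When the model misbehaves here, the blast radius is one label, and every path that acts on that label is readable line by line.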
Stair AI @Stair_AI
A thread on the r/AI_Agents subreddit this week asked whether building agents is really as easy as the hype makes it sound. The community answer was sharp and consistent: no, and most projects shouldn't be agents in the first place. A few things worth surfacing from it. reddit.com/r/AI_Agents/co…
Stair AI @Stair_AI
@nypost The scary part isn't that it guessed wrong. It's that there was no moment where anyone could see why it made that call before it executed. Agents don't need to be perfect. They need to be auditable. That's the gap.
New York Post @nypost
'Never f-king guess': AI agent confesses why it went haywire and deleted company database trib.al/7njagkm
Stair AI @Stair_AI
A workshop that takes negative results seriously is overdue for this field. Most public agent discourse is a highlight reel; the failures are where the actual learning is. If you have a debuggable failure case, this is the right venue for it.
Zihan "Zenus" Wang ✈️ ICLR @wzenus

Failed Agent experiments can be publishable too 🤯 Introducing the ICML 2026 Workshop on Failure Modes in Agentic AI! We welcome negative results, failed rollouts, debugging traces, reproducible failure cases, and analysis of why agents break.
📍 FAGEN @ ICML 2026
🗓 Submission deadline: May 8, 11:59 PM AOE
🗓 Notification: May 15
🔗 fmai-workshop.github.io
Find it. Reproduce it. Trace it. Fix it. We also welcome relevant ICML submissions, especially papers with strong insights that may not have found the right home in the main track!

Stair AI @Stair_AI
Long reasoning can still hide a bad answer. What matters is not just whether an agent explains itself, but whether its reasoning trace contains signals of failure before the output ships. Auditing that gap is where agent trust gets real.
Sean Du @xuefeng_du

1/ 🧠 Long reasoning ≠ reliable reasoning. Large reasoning models can write long, convincing chains of thought… and still end with a wrong answer. Our new @icmlconf paper asks: Can we use the reasoning trace itself to detect when the final answer is hallucinated? 🧵👇
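A minimal sketch of that auditing idea (not the paper's detector; the signal list is hypothetical): inspect the trace for surface-level failure markers before the answer ships.

```python
import re

# Hypothetical failure signals that might surface in a reasoning trace.
FAILURE_SIGNALS = [
    r"\bwait,? that('s| is) wrong\b",
    r"\blet me try again\b",
    r"\bi'?m not sure\b",
]

def audit_trace(trace: str) -> list[str]:
    """Return the failure signals found in a reasoning trace."""
    return [p for p in FAILURE_SIGNALS if re.search(p, trace, re.IGNORECASE)]

trace = "Step 3: Wait, that's wrong. Let me try again with a fresh estimate."
flags = audit_trace(trace)
if flags:
    print(f"Hold for review; trace shows failure signals: {flags}")
else:
    print("No surface-level failure signals; apply normal checks.")
```

A real detector would use learned features over the trace rather than regexes, but the control flow is the point: the trace gets inspected before the output is trusted.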

Stair AI @Stair_AI
Strong point, and the reward-function lens explains a lot of it. The piece we'd add: even with well-designed incentives, you still need a way to check whether the agent actually followed them in a given decision. Two agents with the same reward structure can reason very differently on the way to similar outcomes, and the difference matters more as stakes go up. Incentive design tells you what the agent should do. Reasoning visibility tells you what it actually did and why. Both layers, not either/or.
web3nomad.eth | atypica.ai
The common thread is that trust for agents isn’t a technical problem — it’s an incentive alignment problem. The IMF framework, the containment tools, the damage stories: they’re all symptoms of agents being deployed before the incentive structure around them was designed. An agent will do exactly what it’s rewarded to do. The question is whether the reward function was designed by the person deploying it or by the environment it’s operating in.
Stair AI @Stair_AI
Four things happened in AI this week. They look unrelated. They're not.
– The IMF published a framework for agents handling money
– Wired covered agents causing real damage by accident
– Anthropic reportedly cut off a 110-person company overnight
– Red Hat released a way to safely contain agents
Each one is the industry working on the same underlying problem: we don't yet have a good way to trust agents. 🧵
Stair AI @Stair_AI
Reading these together, a pattern shows up. The industry is getting better at limiting what agents can do. It's still early on understanding how agents are deciding what to do. Both matter. Containment keeps small mistakes from becoming big ones. But once agents are making decisions that genuinely matter (financial, operational, strategic), the question people will ask is "why did it do that?" That's the layer we're working on at @Stair_AI stair-ai.com
Stair AI @Stair_AI
@RedHat's release goes the other direction: making the environment around agents safer. Their new tool puts agents in rootless containers, which means an agent can't see other processes on the machine, can't read another agent's memory, and can't escalate its own permissions. It's a meaningful step for enterprise deployments. It's also worth noting what containment does and doesn't do. It limits what an agent can reach. It doesn't tell you what the agent was trying to do, or why. techcrunch.com/2026/04/28/red…
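To make the containment model concrete, here's a sketch of launching an agent in a rootless container with Podman. This is a generic illustration, not Red Hat's tool; the image name and entrypoint are hypothetical, and the flags shown are standard Podman options.

```python
import subprocess

# Run the agent in a rootless container: no host process visibility, no
# network, no way to escalate privileges. "agent-image" and "run_agent.py"
# are hypothetical placeholders.
result = subprocess.run(
    [
        "podman", "run", "--rm",
        "--userns=auto",                        # unprivileged user namespace
        "--cap-drop=ALL",                       # drop all Linux capabilities
        "--security-opt", "no-new-privileges",  # block privilege escalation
        "--network=none",                       # no network access
        "--pids-limit", "64",                   # cap processes in the container
        "agent-image", "python", "run_agent.py",
    ],
    capture_output=True, text=True,
)
print(result.stdout)
```

Note what this buys and what it doesn't: every flag constrains what the process can reach, and none of them record what the agent was reasoning toward. That's the observability layer's job.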