Ryan Harty

534 posts


@HartyRyan

AI Engineer building agentic systems | Ex-quant, ex-VC ($1B AUM) | Now deploying AI at scale in energy & life science

Joined November 2011
385 Following · 251 Followers
Thariq
Thariq@trq212·
We just released Claude Code channels, which allows you to control your Claude Code session through select MCPs, starting with Telegram and Discord. Use this to message Claude Code directly from your phone.
1.7K
2.4K
25.7K
7.3M
Ryan Harty
Ryan Harty@HartyRyan·
@trq212 Skills have revolutionized how my team works. Databases get turned into MCPs, SOPs get turned into skills. What took hours of manual work gets done in minutes with Cowork/Claude Code.
0
0
0
152
Ryan Harty
Ryan Harty@HartyRyan·
@cgtwts Fire cheat sheet. Building custom MCPs for data that's hard to connect to, then running custom skills for the automated workflows that need that data, is the move
0
0
11
6.3K
Ryan Harty
Ryan Harty@HartyRyan·
@alexalbert__ The timing is wild. Getting blacklisted by the Pentagon turned into the best marketing week in AI history. Claude hits #1 on the App Store while the company that said yes to the DoW scrambles to rewrite their contract
0
1
1
215
Alex Albert
Alex Albert@alexalbert__·
It was easy to miss all the stuff we shipped at Anthropic this past week given... everything else that happened. A few of my favorites across Claude Code, claude dot ai, and Cowork:
91
49
1.7K
251.7K
Ryan Harty
Ryan Harty@HartyRyan·
@GaryMarcus I see this in production constantly. Ask a model to review a strategy doc and it finds reasons to agree with whatever you wrote. The most dangerous AI failure isn't hallucination. It's confident confirmation of your worst ideas
0
0
0
37
Gary Marcus
Gary Marcus@GaryMarcus·
New study that everyone who uses LLMs should read. “When AI systems are trained to be helpful, they may inadvertently prioritize data that validates the user’s narrative over data that gets them closer to the truth.” open.substack.com/pub/garymarcus…
60
108
483
25.5K
Ryan Harty
Ryan Harty@HartyRyan·
@emollick Coding benchmarks are easy because code has clear pass/fail. The hard enterprise work, summarizing messy contracts, extracting data from scanned PDFs, triaging ambiguous emails, has no clean eval. That's where most AI value actually lives.
0
0
1
87
Ethan Mollick
Ethan Mollick@emollick·
What a great illustration of the central problem of AI benchmarking for real work. All of the effort is going into benchmarking for coding, but that is a small part of the actual jobs people do, which leaves the true trajectory of AI progress less clear. arxiv.org/pdf/2603.01203
46
72
494
43K
Ryan Harty
Ryan Harty@HartyRyan·
@bindureddy Leaderboard position means less every month. The real question is whether it holds up on the messy inputs enterprises actually send it. Benchmarks are clean rooms. Production is a construction site.
0
0
0
94
Bindu Reddy
Bindu Reddy@bindureddy·
Gemini Pro 3.1 IS ON TOP OF LIVEBENCH - BEATS EVERY OTHER MODEL BY A LOT. While it's on top of almost all leaderboards, we do find that it scores lower on hidden or "unseen" questions. This implies that the model is benchmark optimized. However, even on our internal rankings it scores just below Opus and Sonnet but above GPT 5.2 and 5.3. This makes Gemini 3.1 Pro very competitive
44
22
336
31K
Ryan Harty
Ryan Harty@HartyRyan·
@simonw @mempirate @AmpCode This is the whole game. An agent can write code all day, but when it breaks at 2am, a human answers the page. Until agents can own consequences, they're tools, not teammates
0
0
0
64
Simon Willison
Simon Willison@simonw·
@mempirate @AmpCode Coding agents can't take accountability for their mistakes. Eventually you want someone whose job is on the line to be making decisions about things as important as securing the system
11
7
84
5.8K
mempirate
mempirate@mempirate·
OTOH, if we're starting to trust agents to ship code, why shouldn't we trust them (other instances probably) to review it too? Just invoke the @AmpCode oracle ;) I don't do this yet btw because I'd feel like I'm flying blind, but I guess eventually it'll change?
Simon Willison@simonw

The people I want to hear from right now are the security teams at large companies who have to try and keep systems secure when dozens of teams of engineers of varying levels of experience are constantly shipping new features

6
0
11
8.8K
Ryan Harty
Ryan Harty@HartyRyan·
@emollick Naming matters more than people admit. 'o3' sounds iterative. 'GPT-5' signals a generational leap. The capability jump was real but the branding buried it.
0
0
0
57
Ethan Mollick
Ethan Mollick@emollick·
This would have been more obvious if o3 had been called GPT-5 instead.
10
3
144
24.9K
Ethan Mollick
Ethan Mollick@emollick·
From an AI user perspective, the four big leaps so far in ability: 1. GPT-3.5 (ChatGPT, November 2022) 2. GPT-4 (Spring 2023) 3. Reasoners (starts with o1-preview, but the real deal was o3, Spring 2025) 4. Workable agentic systems (Harness + good reasoner models, December 2025)
117
139
2.5K
241.8K
Ryan Harty
Ryan Harty@HartyRyan·
@GaryMarcus This is the underrated risk. Everyone talks about hallucination. Sycophancy is worse because the user never realizes they're wrong. At least hallucinations are obviously false sometimes
0
0
0
34
Ryan Harty
Ryan Harty@HartyRyan·
@svpino The skills/plugins/MCP naming shuffle is the real problem. Every update changes the abstraction layer. I stopped using framework features and just write plain .md context files. Boring but they don't break
1
0
1
400
Santiago
Santiago@svpino·
Skills in Claude Code right now are a cat-and-mouse game. Today, they work. Tomorrow, they fail. This is extremely frustrating. When you run them manually, you can catch any new mistakes and update the skill so they don't happen again (hopefully!). But this means you can't automate much because the model keeps "inventing" new ways to perform the task, even when those methods violate the skill rules. Last week alone, I had to make 4 updates to a skill that ran 6 times. There's still a lot of work to do until this is reliable.
79
14
273
35.7K
Ryan Harty
Ryan Harty@HartyRyan·
@simonw This is the real tension. Security teams want to gate access. Engineering teams already have 6 agents running in production. The gap between policy and reality grows every week
0
0
0
223
Simon Willison
Simon Willison@simonw·
The people I want to hear from right now are the security teams at large companies who have to try and keep systems secure when dozens of teams of engineers of varying levels of experience are constantly shipping new features
swyx@swyx

this is the Final Boss of Agentic Engineering: killing the Code Review. at this point multiple people are already weighing how to remove the human code review bottleneck from agents becoming fully productive. @ankitxg was brave enough to map out how he sees SDLC being turned on its head. i'm not personally there yet, but I tend to be 3-6 months behind these people and yeah it's definitely coming.

65
55
698
100.1K
Ryan Harty
Ryan Harty@HartyRyan·
@swyx @ankitxg The bottleneck was never the code review itself. It was the context transfer. Reviewer doesn't know why you made that choice. Agents have the same problem, they just skip the part where someone asks.
0
0
0
174
swyx
swyx@swyx·
this is the Final Boss of Agentic Engineering: killing the Code Review. at this point multiple people are already weighing how to remove the human code review bottleneck from agents becoming fully productive. @ankitxg was brave enough to map out how he sees SDLC being turned on its head. i'm not personally there yet, but I tend to be 3-6 months behind these people and yeah it's definitely coming.
Latent.Space@latentspacepod

🆕 How to Kill The Code Review latent.space/p/reviews-dead the volume and size of PRs is skyrocketing. @simonw called out StrongDM’s “Dark Factory” last month: no human code, but *also* no human review (!?) in this week’s guest post, @ankitxg makes a 5 step layered playbook for how this can come true.

167
71
1.2K
403.7K
Ryan Harty
Ryan Harty@HartyRyan·
@emollick The gap between leap 3 and leap 4 is where most enterprise value got created. Reasoners made AI useful for real work. Everything before that was demos
0
0
0
154
Ryan Harty
Ryan Harty@HartyRyan·
@emollick This matches what I see in production. Models that ace benchmarks choke on messy real-world inputs. Corrupted PDFs, mixed-language docs, weird formatting. Robustness is the real moat, not leaderboard scores
0
0
1
530
Ethan Mollick
Ethan Mollick@emollick·
This is good empirical evidence backing up the intuition that the major Chinese open weights models are quite fragile, good at some narrow areas but much less capable in general tasks or out-of-distribution work than the frontier closed models.
ARC Prize@arcprize

International models on ARC-AGI-2 Semi Private - Kimi K2.5 (@Kimi_Moonshot): 12%, $0.28 - Minimax M2.5 (@MiniMax_AI): 5%, $0.17 - GLM-5 (@Zai_org): 5%, $0.27 - Deepseek V3.2 (@deepseek_ai): 4%, $0.12 These models score below July 2025 frontier labs

29
54
701
73.2K
Ryan Harty
Ryan Harty@HartyRyan·
@GaryMarcus The moat was never the model. It's distribution and switching costs. When everyone copies the same play, the winner is whoever locked in enterprise contracts first. This is basic market structure
0
0
0
52
Gary Marcus
Gary Marcus@GaryMarcus·
“Right now, every major player is running the same play” That should tell you something there. So called “innovators” ain’t innovating. They’re copying.
Dustin@r0ck3t23

Satya Nadella just said what the entire industry is too invested to admit. Every CEO signing $100 billion data center contracts right now is making a bet that history may not honor. Nadella: “We are one sort of innovation away from the entire regime changing.” Right now, every major player is running the same play. More data. More GPUs. Bigger clusters. Same architecture. They’ve convinced themselves scale is destiny. They’ve convinced themselves the biggest balance sheet wins. They’ve convinced themselves this is a resource war. It’s not. Nadella: “If you look at where we’ve gone, it was all about pre-training scale, then it was about post-training, then we came up with reasoning, then we said, ‘oh, there’s RL.’” The architecture isn’t stable. It never was. It’s been mutating the entire time. Each shift rewriting the rules. Each breakthrough making the previous moat irrelevant. And the companies that didn’t see it coming didn’t get a warning. They just woke up behind. Nadella: “A new model architecture that could even be more efficient in its performance.” When that lands, the $100 billion clusters don’t matter. The hoarded GPUs don’t matter. The multi-decade infrastructure advantage doesn’t matter. Every castle built for the current paradigm becomes a monument to the wrong bet. This is what makes the AI race unlike anything in history. In nuclear competition, more warheads meant more power. The advantage was permanent. Cumulative. Compounding. In this race, one person with the right insight at 2am in an apartment somewhere erases a trillion dollars of infrastructure before the market opens. No warning. No negotiation. No second place. The most dangerous competitor in this race doesn’t have a data center. They just have the equation.

19
35
209
27.1K
Ryan Harty
Ryan Harty@HartyRyan·
@emollick The wild part is watching non-technical people debug their own prompts now. My content system has 16 scheduled agents and the hardest part wasn't
0
0
0
152
Ethan Mollick
Ethan Mollick@emollick·
Everyone just speaks Claude, now.
20
7
642
24.7K
Ethan Mollick
Ethan Mollick@emollick·
[[Topic of discussion]] is not [[analogy]]. [[Dramatic fact given own line]]. [[Dramatic fact given own line]]. [[Dramatic fact given own line]]. [[Dramatic summary sentence.]] [[Topic of discussion]] is [[different analogy]]. [[Implications delivered with certainty]].
113
609
7.2K
200K
Ryan Harty
Ryan Harty@HartyRyan·
@simonw Annotated prompts are underrated. Half the value of my agent system came from iterating on the context docs, not the code. The prompt IS the product at this point
0
0
0
172
Simon Willison
Simon Willison@simonw·
I started a new section of my Agentic Engineering guide for annotated versions of prompts I've used for projects - the first is a prompt I used to have Claude Code for web build me a web UI for compressing GIFs using a WebAssembly build of Gifsicle simonwillison.net/guides/agentic…
26
27
381
28.5K
Ryan Harty
Ryan Harty@HartyRyan·
@svpino Missing the big one: give it your actual project context, not just rules. I have agents reading brain files, past analytics, and brand voice docs before writing a single line. The .md file is the new system prompt
0
0
1
633
Santiago
Santiago@svpino·
Claude Code tips (add these to your CLAUDE.md file):
1. Before writing any code, describe your approach and wait for approval.
2. If the requirements I give you are ambiguous, ask clarifying questions before writing any code.
3. After you finish writing any code, list the edge cases and suggest test cases to cover them.
4. If a task requires changes to more than 3 files, stop and break it into smaller tasks first.
5. When there's a bug, start by writing a test that reproduces it, then fix it until the test passes.
6. Every time I correct you, reflect on what you did wrong and come up with a plan to never make the same mistake again.
59
151
1.8K
185.3K
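Santiago's checklist above is meant to live in a project's CLAUDE.md so the model reads it before acting. As a minimal sketch (the headings and exact wording here are illustrative, not his actual file), such a file might look like:

```markdown
<!-- CLAUDE.md — illustrative sketch of the rules above, not Santiago's exact file -->
# Project working agreement

## Before coding
- Describe your approach and wait for approval.
- If requirements are ambiguous, ask clarifying questions first.

## While coding
- If a task touches more than 3 files, stop and break it into smaller tasks.
- For bugs, first write a test that reproduces the issue, then fix until it passes.

## After coding
- List the edge cases and suggest test cases to cover them.
- When corrected, note what went wrong and plan how to avoid repeating it.
```

Claude Code picks up a CLAUDE.md in the project root automatically, which is why plain markdown rules like these survive tool updates better than framework-specific features.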
Ryan Harty
Ryan Harty@HartyRyan·
@emollick I built a content system that flags exactly these patterns. The irony is most "AI thought leaders" couldn't pass the same detector they're supposedly building with
0
0
2
569