Sean Ward

1.1K posts

Sean Ward

@DNAEngineer

CEO & Co-founder @iGent_AI. Previously Founder @synthace, Relatable, Scale DX

London Katılım Mart 2011

895 Takip Edilen2.1K Takipçiler

Sean Ward@DNAEngineer·21 Şub

@ctjlewis It is more complicated than that however: most providers brainfried their models with RL to go prompt -> outcome. Anthropic models can maintain competing hypothesis and tension for far longer, which allows thesis, investigation, correction, and autonomy.

English

Lewis 🇺🇸@ctjlewis·21 Şub

@DNAEngineer I do absolutely believe those to be synonymous statements yeah. I believe reasoning is just intermediate computation and that’s what I wanted to present there. But this is controversial technically, though not really for any good reason.

English

Lewis 🇺🇸@ctjlewis·21 Şub

I’m not on either team, I care about the best output, so I’ve always been using Opus for codegen, because it is better at mentally simulating programs than GPT or Codex. That has not changed.

aphrodiziac@affrodiziac_

@ctjlewis i am just interested, are you completly neutral ?, i mean are you on team Anthropic or team Open AI? or on the team "whoever makes it"?

English

2.4K

Sean Ward@DNAEngineer·21 Şub

@ctjlewis In its simplest terms: Anthropic models can reason. Hello from the days of Spellcraft btw.

English

Lewis 🇺🇸@ctjlewis·21 Şub

Basically I think Anthropic is generally less of an asshole on a personal level, and also independently just so happens to be the best at simulating programs in terms of output since 2024. But that was not always the case on the second point, mostly was on the first.

English

736

Sean Ward@DNAEngineer·6 Şub

@jonas @dbreunig @karpathy @grok We did Redis last July: igent.ai/insights/produ…. The truth is, specification is the bottleneck (and a well known system shortcuts that).

English

Jonas Templestein@jonas·6 Şub

@dbreunig @karpathy @grok what other software engineering challenges exist that are extremely well specified in terms of conformance tests? what other impressive thing could agents just build now?

English

170

Drew Breunig@dbreunig·6 Şub

Another example of “Spec Driven Development” along the lines of whenwords and just-bash, when conformance tests are the key driver. @karpathy

English

6.2K

Sean Ward retweetledi

Claude@claudeai·24 Kas

Introducing Claude Opus 4.5: the best model in the world for coding, agents, and computer use. Opus 4.5 is a step forward in what AI systems can do, and a preview of larger changes to how work gets done.

English

1.1K

2.4K

19.2K

7.8M

Sean Ward@DNAEngineer·21 Eki

@sqs You are assuming those are at odds with each other

English

Quinn Slack@sqs·21 Eki

I sense a coming schism in coding agents. “bicycle for the mind” human-in-the-loop vs. “hours of autonomy” human-out-of-the-loop

English

5.9K

Sean Ward@DNAEngineer·29 Eyl

@unit_accord @Yuchenj_UW x.com/dnaengineer/st…

Sean Ward@DNAEngineer

@Yuchenj_UW Verifiable engineering tasks, mostly at the scale of entire features, primarily in compiled languages like go or rust, where it can validate the outputs and hill climb. One such public example: igent.ai/insights/produ… with repo of results and all commit/session history.

QME

Unit Accord@unit_accord·29 Eyl

@Yuchenj_UW Does it matter how long it runs or what it achieves. Let's see one product it made.

English

Yuchen Jin@Yuchenj_UW·29 Eyl

Claude Sonnet 4.5 runs autonomously for 30+ hours of coding?! The record for GPT-5-Codex was just 7 hours. What’s Anthropic’s secret sauce?

English

200

176

2.8K

360.2K

Sean Ward@DNAEngineer·29 Eyl

@_xjdr It natively wants you to be absolutely right (it is Sonnet after all). However, unlike 4.0, it will listen to style and tone instructions on a long horizon basis.

English

960

xjdr@_xjdr·29 Eyl

sonnet 4.5 still thinks i am absolutely right . disappointment beyond measure . (no opinions on the model itself yet, i just started testing it)

English

563

36.2K

Sean Ward@DNAEngineer·29 Eyl

English

917

Yuchen Jin@Yuchenj_UW·29 Eyl

@DNAEngineer Tell us more about the details! It’d be great if you could share: - what was the task or prompt you gave it for coding - result: the code it generated, and whether it actually runs, any bugs?

English

3.6K

Sean Ward@DNAEngineer·18 Eyl

The world is rapidly bifurcating between those who have experienced the superhuman capabilities of even current generation models, used properly, and those stuck in the capabilities of the past.

iGent AI@iGent_AI

We're excited to share that our agent, Maestro, drafted solutions to all 12 problems from ICPC 2025 World Finals in ~2 hours - using current models, no human involvement, no internet access. We deeply respect the human teams' extraordinary dedication. Note: no official validation

English

577

Sean Ward@DNAEngineer·11 Eyl

@SimonInmania @matthewclifford Yes, task length is in Replit’s case measured in agent runtime, which can slightly proxy for the metric that really matters: how many man months (or years) of work the agent performs. igent.ai/insights/produ… is several man years in the course of ~70 hours or agent time however.

English

Simon Inman@SimonInmania·11 Eyl

@matthewclifford It's not clear that Amjad is describing human-measured task length, right? He's just saying that's how long the agent runs for, which doesn't seem like a particularly interesting measure?

English

1.3K

Matt Clifford@matthewclifford·11 Eyl

Task length is the key AI metric to watch - and it’s accelerating faster than almost anyone outside the industry realises

Amjad Masad@amasad

The METR paper that says that “the length of tasks AI can do is doubling every 7 months” radically undersells the scaling that we’re seeing at Replit. It might be true if you’re measuring one long trajectory for a single model class. But this is where an agent research lab’s alpha is at. We build multi-agent architecture and use different models from various providers to tap into their latent abilities across various tasks.

English

120

37.8K

Sean Ward@DNAEngineer·12 Ağu

@simonw Strong recommendation: it takes a very different promoting strategy to maintain attendance across that context: basic system prompt + context + request alone won’t cut it.

English

857

Simon Willison@simonw·12 Ağu

Notes on the new 1m context window for Claude Sonnet 4: simonwillison.net/2025/Aug/12/cl… You need to send a beta header of context-1m-2025-08-07 and be on tier 4, which means you have purchased at least $400 in API credits

Claude@claudeai

Claude Sonnet 4 now supports 1 million tokens of context on the Anthropic API—a 5x increase. Process over 75,000 lines of code or hundreds of documents in a single request.

English

375

45.1K

Sean Ward@DNAEngineer·12 Ağu

@platonovadim @_xjdr Yes, tests are central including leveraging the existing redis client libraries and benchmarks. The entire commit history also shows every round of the AI workflow during building.

English

210

Vadim@platonovadim·12 Ağu

@DNAEngineer @_xjdr Maestro looks interesting. Do you have a write up with more details on how the rewrite was done? Did you look at Redis source code or design? Was there a conformance test suite?

English

266

xjdr@_xjdr·12 Ağu

Tested the NSA code over night and after a few tweaks it trains. Wow, GPT5 and Opus4.1 wrote a 100% AI generated (human art directed) NSA implementation. I would not have guessed that was possible

English

433

58.8K

Sean Ward@DNAEngineer·8 Ağu

It has become clear there is a massive performance and productivity delta growing between engineers who understand and embrace AI, with appropriate tooling and critical analysis, and those who have have remained in the co-pilot era. Never before has it been so possible for those who know what they are doing, to build so much.

iGent AI@iGent_AI

Tired of toy AI demos that fizzle in production? iGentAI built Ferrous: A Rust Redis-compatible server outperforming Valkey. 35KLOC, 100% test passing, beats benchmarks. Zero human code. Built in 70 hours of part-time direction. Toys vs. tools—here's the proof.

English

701

Sean Ward retweetledi

iGent AI@iGent_AI·22 May

Our VibeCodeBench evaluations affirm what @Anthropic just announced: Claude Sonnet 4 excels at autonomous multi-feature development. We've seen codebase navigation errors drop from 20% to near zero and strategic refactoring that saves ~500k tokens on multi stage, complex tasks. Proud to power Maestro with this breakthrough.

Anthropic@AnthropicAI

Introducing the next generation: Claude Opus 4 and Claude Sonnet 4. Claude Opus 4 is our most powerful model yet, and the world’s best coding model. Claude Sonnet 4 is a significant upgrade from its predecessor, delivering superior coding and reasoning.

English

553

Sean Ward retweetledi

Louis Knight-Webb@tokengobbler·26 Şub

It's always great hosting @AITinkerers London meetups right after a new model drops... Huge thanks to @rebecca_harbeck from @AnthropicAI, as well as the @iGent_AI team @MSzummer and @samshapley for giving impromptu talks with tons of learnings from early access Claude Sonnet 3.7. We also got to see @HarryCoppock from the @AISecurityInst live demoing 3.7 hacking into a docker container 🫢 And as always, we had some fantastic product, behind-the-scenes and benchmark beating agent talks from Emma Burrows, @moeadham and Sergei Petrov. Huge thanks to team @localglobevc and @ferdisigona for making it happen!

English

885

Sean Ward@DNAEngineer·26 Şub

At @iGent_AI, we’ve found @AnthropicAI new Sonnet 3.7 to be quite the powerhouse. Everything from debugging multi language distributed systems, to comprehending and updating legacy codebases, to rapid prototypes or POCS of new technologies. Agentic SWE is now here.

iGent AI@iGent_AI

"Agency > Intelligence" @karpathy nailed it, and after 18 months building Maestro, we agree. The real AI leap isn’t just smarts—it’s agency: the ability to act independently, turning assistants into partners.

English

250

Sean Ward retweetledi

Louis Knight-Webb@tokengobbler·31 Eki

Hottest week for London AI so far 🔥 Dev Day yesterday and AI Tinkerers tonight! london.aitinkerers.org/p/ai-tinkerers… @monzo @tortus_AI @_lucas_godfrey @QuotientAI @samshapley @lukeharries @stephenbtl

English

Sean Ward@DNAEngineer·30 Oca

@simonw One of the top use cases is chaining differentiated capabilities. For example, GPT4 vision to drive a scene understanding, then shifting to using that text in a different model. Another is cross LLM critiques such as using one model to supervise the outputs of another

English

1.7K

Simon Willison@simonw·30 Oca

What are the simplest useful examples you've seen of prompt chaining, where the output of one LLM call is used as input to another? I have seen plenty of theoretical examples, but I'm looking for some concrete results

English

261

88.6K

Keşfet

@ctjlewis @jonas @dbreunig @karpathy @grok @sqs @unit_accord @Yuchenj_UW