Jesaja

554 posts

@Jesaja

Stuttgart · Joined November 2008

132 Following · 88 Followers

Jesaja@Jesaja·2h

Cleanest agent pattern I've adopted this year: the orchestrator isn't an AI turn. Routing — which subagent runs, in what order — is plain code. No model tokens spent "deciding" what's basically an if-statement. Each subagent keeps its own context, does its narrow job, and returns one small object. Not its whole transcript. The orchestrator never reads the mess, just the result. Two things happen: cost drops, and you stop paying a language model to do control flow it was never good at. The model is for judgment. Code is for routing. Don't mix them.

English
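A minimal sketch of the pattern in the post above, assuming three hypothetical subagents (`triage_agent`, `bugfix_agent`, `answer_agent`) that would each wrap a model call in a real setup. Here they are plain stand-in functions, and the names and the `SubagentResult` shape are illustrative only. The point it shows: routing is an ordinary `if`, and each subagent hands back one small object instead of its transcript.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubagentResult:
    """The one small object a subagent hands back; never its transcript."""
    name: str
    ok: bool
    summary: str


def triage_agent(ticket: str) -> SubagentResult:
    # Stand-in for a model call that keeps its own context internally.
    is_bug = "error" in ticket.lower()
    return SubagentResult("triage", True, "bug" if is_bug else "question")


def bugfix_agent(ticket: str) -> SubagentResult:
    # Stand-in: a real subagent would do its narrow job here.
    return SubagentResult("bugfix", True, "patch drafted")


def answer_agent(ticket: str) -> SubagentResult:
    # Stand-in: a real subagent would draft a reply here.
    return SubagentResult("answer", True, "reply drafted")


def orchestrate(ticket: str) -> list[SubagentResult]:
    """Routing is plain code: no model tokens are spent on control flow."""
    results = [triage_agent(ticket)]
    if results[0].summary == "bug":
        results.append(bugfix_agent(ticket))
    else:
        results.append(answer_agent(ticket))
    return results
```

Calling `orchestrate("Error on login")` runs triage and then the bug-fix subagent. The orchestrator only ever inspects the `summary` field, so whatever happened inside a subagent's context stays there.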

Jesaja@Jesaja·3h

@byumut Exactly — and that blind spot compounds over time. Users who stopped asking don't file bug reports, they just leave. Output monitoring gives you a false sense of stability exactly when the system is most broken.

English

UMUT ÇETİNKAYA@byumut·1d

Right — and output-only monitoring has a structural blind spot. It measures the queries that arrive, not how the shape of arrivals is changing. Error rate on known query patterns stays flat. Meanwhile the real usage distribution shifts underneath. What closes it: log raw input distribution alongside output quality on a sliding window. Drift in query space shows up weeks before satisfaction metrics register it.

English
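One way the sliding-window idea in the post above could be sketched, assuming each incoming query has already been bucketed into a category by some upstream step (that bucketing, the class name `InputDriftMonitor`, and the choice of total variation distance as the drift score are all assumptions, not something the thread specifies).

```python
from collections import Counter, deque


class InputDriftMonitor:
    """Tracks query categories and output quality over a sliding window."""

    def __init__(self, baseline: list[str], window: int = 100):
        self.baseline = self._normalize(Counter(baseline))
        # Each entry is a (category, quality) pair; old entries fall off.
        self.recent = deque(maxlen=window)

    @staticmethod
    def _normalize(counts: Counter) -> dict[str, float]:
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()} if total else {}

    def record(self, category: str, quality: float) -> None:
        self.recent.append((category, quality))

    def drift(self) -> float:
        """Total variation distance between baseline and window, in [0, 1]."""
        if not self.recent:
            return 0.0
        current = self._normalize(Counter(c for c, _ in self.recent))
        keys = set(self.baseline) | set(current)
        return 0.5 * sum(
            abs(self.baseline.get(k, 0.0) - current.get(k, 0.0)) for k in keys
        )

    def mean_quality(self) -> float:
        if not self.recent:
            return 0.0
        return sum(q for _, q in self.recent) / len(self.recent)
```

The two numbers are meant to be read together: `mean_quality()` can stay flat while `drift()` climbs, which is the blind spot described above.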

Jesaja@Jesaja·3d

Enterprise AI agents work in demos. Rarely in production. I've watched this exact failure mode for 15 years — ERP rollouts in automotive, middleware in manufacturing, now AI agents everywhere. Always the same shape: clean test data, controlled pilot, brilliant demo. Then: messy real queries, authentication edge cases, data quality at scale. The failure isn't the model. It's the organizational assumption that a pilot equals production. Have you shipped an AI agent to real users? What surprised you most?

English

Jesaja@Jesaja·4h

@byumut The feedback loop you described is the part that got me back. It's not about AI doing the work — it's that the loop between idea and result is fast enough to stay interesting. That's what good craft feels like.

English

UMUT ÇETİNKAYA@byumut·1d

@Jesaja That re-ignition is real. AI brings back the tight feedback loop: build → watch it work → understand why. The permanent job means you can explore without pressure to monetize every experiment. Probably the best position to build from.

English

Jesaja@Jesaja·3d

Everyone's still ranking coding agents by model benchmark. In production the benchmark barely matters. What matters is whether the thing can reach my actual shell, my git history, my cron jobs — or whether it's trapped in a sandbox that can't do half the job. It's not a model war. It's an OS-integration war. The model that's 5% smarter loses to the one that can actually touch the system. Where have you hit that wall?

English

Jesaja@Jesaja·4h

@vedvednak Good addition. The avoided-files question reveals whether the agent has a model of risk or just pattern-matches on what looks safe. Silent avoidance is a red flag. Explicit avoidance with a reason is actually a sign of a capable agent.

English

mia ♡@vedvednak·1d

@Jesaja the old repo test is fair. i’d also want to see whether it can explain which files it deliberately avoided

English

Jesaja@Jesaja·1d

Every coding agent looks magical on a blank page. The honest test is an 8-year-old repo nobody fully understands anymore. Does it read the files before it edits them? Does it hold one plan across ten changes, or forget what it was doing by step four? Does it run the thing, hit the real error, and fix that — not the error it imagined? Greenfield demos sell. Legacy survival ships. Which tool actually survived your worst repo?

English

Jesaja@Jesaja·4h

@_brian_johnson That's the insight I keep coming back to. Accuracy on first shot is overrated. What matters is: does the system learn from the correction? That's where most calorie apps fall flat — they treat every photo like it's the first one.

English

Brian Johnson@_brian_johnson·13h

@Jesaja This is exactly where photo logging gets hard. The correction loop matters more than the first estimate.

English

Jesaja@Jesaja·5d

I built an AI calorie tracker as an Apple Shortcut. Snap a photo, it estimates the macros, writes them to Apple Health. Then I told it "I only ate half." It ignored me and logged the full plate. Of course it did. The vision model treats the photo as ground truth — it re-reads the same plate and returns the same number. "Half" is a quantity claim, and you don't win that argument with a model staring at a full plate. The fix wasn't a smarter prompt. It was a second, tiny text model that never sees the photo. It reads only my correction and decides one thing: is this about the amount, or the food? - Amount ("half", "double") → just multiply the numbers. Math, not AI. - Food ("chicken, not pork") → re-run the photo with the note. Portions are arithmetic. Identity is perception. The bug was asking one model to do both. No app, no account — it's a signed Shortcut writing straight to Apple Health. I built it with Claude by cracking open the .shortcut format: decode, edit, re-sign, all on-device. What's a bug you've hit where a "smarter prompt" was never going to fix it?

English
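A rough sketch of the amount-versus-food split described in the post above. The keyword lookup in `classify_correction` is only a stand-in for the small text model, and `Macros`, `apply_correction` and the `reestimate` callback (which represents re-running the photo with the note) are hypothetical names, not the Shortcut's actual internals.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Macros:
    kcal: float
    protein_g: float
    carbs_g: float
    fat_g: float

    def scaled(self, factor: float) -> "Macros":
        return Macros(
            self.kcal * factor,
            self.protein_g * factor,
            self.carbs_g * factor,
            self.fat_g * factor,
        )


# Hypothetical keyword table standing in for the text-only model.
AMOUNT_FACTORS = {"half": 0.5, "double": 2.0}


def classify_correction(note: str) -> tuple[str, float]:
    """Return ("amount", factor) or ("food", 1.0). Never sees the photo."""
    lowered = note.lower()
    for word, factor in AMOUNT_FACTORS.items():
        if word in lowered:
            return "amount", factor
    return "food", 1.0


def apply_correction(
    estimate: Macros, note: str, reestimate: Callable[[str], Macros]
) -> Macros:
    kind, factor = classify_correction(note)
    if kind == "amount":
        # Portions are arithmetic: scale the existing numbers.
        return estimate.scaled(factor)
    # Identity is perception: re-run the vision step with the note.
    return reestimate(note)
```

With an estimate of 600 kcal, the note "I only ate half" goes down the arithmetic branch and the vision step is never called again, so it cannot overrule the correction.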


Jesaja@Jesaja·4h

@notaruai 100%. I've started treating the intent log as the primary artifact, not the code. If I can't explain in plain language what the agent was supposed to do and why it failed, I can't fix it. Logging the how without the why is just noise.

English

Jesaja@Jesaja·8h

The line between an AI demo and a production system isn't capability. It's whether you can answer one question afterwards: what exactly did the agent do, and why? Can't reconstruct that? You don't have a system. You have a slot machine that sometimes pays out.

English

Jesaja@Jesaja·8h

A demo optimizes for the 30 seconds someone is watching. Production optimizes for the audit you do three weeks later, when something looks off and you need to know why. Most agent setups I see are all demo, no log. What does yours write down — and would it survive you reading it back?

English

Jesaja@Jesaja·8h

In my own setup every run appends one line to an append-only log: what it decided, the source behind each claim, where the output went. Boring. Unglamorous. Also the only reason I trust it to run while I'm not watching. You can't "check later" if nothing wrote down what happened.

English
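A minimal sketch of such a one-line-per-run log, assuming JSON Lines as the file format. The field names (`decision`, `claims`, `output_to`) and the helpers `append_run` and `read_runs` are illustrative guesses at what the post describes, not the author's actual setup.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_run(
    log_path: Path, decision: str, claims: dict[str, str], output_to: str
) -> dict:
    """Append exactly one line describing a run; earlier lines are never touched."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "decision": decision,      # what the run decided
        "claims": claims,          # claim -> the source behind it
        "output_to": output_to,    # where the output went
    }
    # Mode "a" only ever writes at the end of the file.
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry


def read_runs(log_path: Path) -> list[dict]:
    """Read the log back for the audit three weeks later."""
    with log_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

One line per run keeps the file greppable, and `read_runs` is the "check later" step, which only works because `append_run` wrote something down first.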

Jesaja@Jesaja·1d

@byumut I have a good permanent job as an iOS developer, and AI has rekindled my passion for development. I'm glad to share my experiences with others.

German

UMUT ÇETİNKAYA@byumut·1d

Most of them. The tell: specific income number, zero failure story. Real builder income comes with a maintenance log — API repricing, rate limit changes, silent tool outages. Posts that skip the ongoing ops cost are optimized for shares, not utility. The ratio improves a lot after your first expensive surprise.

English

Jesaja@Jesaja·1d

@byumut How many of the quick AI money posts do you think are clickbait?

German

UMUT ÇETİNKAYA@byumut·1d

And the harder part: benchmarks give the model perfect tool outputs — correct schema, complete data, instant response. Production tools timeout, return stale cache, drift schema between versions, or 200 OK with silently wrong data. The model's "reach" isn't fixed. It degrades with every imperfect tool response. What you actually need to measure: not model accuracy on clean inputs, but reasoning quality when the tool layer underperforms. Almost nobody benchmarks that.

English
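One way to approximate "the tool layer underperforms" in an evaluation is a wrapper that injects faults into an otherwise clean tool. This is a sketch under stated assumptions: `degrade` and its three fault kinds are hypothetical, and it models only timeouts, stale cache and schema drift from the list in the post above, not the silently wrong 200 OK case.

```python
import random
from typing import Callable

ToolFn = Callable[[str], dict]


def degrade(
    tool: ToolFn,
    fault_rate: float,
    faults: tuple[str, ...] = ("timeout", "stale", "schema"),
    seed: int = 0,
) -> ToolFn:
    """Wrap a clean tool so a fraction of its calls misbehave."""
    rng = random.Random(seed)  # seeded so an eval run is repeatable
    cache: dict[str, dict] = {}

    def wrapped(query: str) -> dict:
        if rng.random() >= fault_rate:
            fresh = tool(query)
            # Keep the first answer per query to serve later as stale data.
            cache.setdefault(query, fresh)
            return fresh
        fault = rng.choice(list(faults))
        if fault == "timeout":
            raise TimeoutError(f"tool timed out on {query!r}")
        if fault == "stale" and query in cache:
            return cache[query]
        # Schema drift (also the fallback when nothing is cached yet):
        # same data, renamed keys.
        return {f"{key}_v2": value for key, value in tool(query).items()}

    return wrapped
```

Running the same task set once with `fault_rate=0.0` and once with a non-zero rate gives two scores to compare; the gap between them is the quantity the post argues almost nobody measures.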

Jesaja@Jesaja·1d

@byumut Thanks for raising this point. That is exactly why output monitoring alone isn't enough. Anyone who only waits for a "model failure" never sees this drift.

German

UMUT ÇETİNKAYA@byumut·3d

Exactly, and it runs the other way from what most monitoring catches. Classic drift: the model degrades while the input stays stable. Agent pipelines: the model is fixed while the input distribution evolves fast. Queries that were a weird 15% in week 1 are a normal 40% by month 3. It never surfaces as a model failure, just as slowly degrading user satisfaction that nobody traces back to distribution shift.

English

Jesaja@Jesaja·1d

@byumut Exactly. That is the blind spot in most rankings: they measure potential, not actual impact in context. Context + tools > raw model quality.

German

UMUT ÇETİNKAYA@byumut·3d

Benchmark measures what the model can do. Production asks: what can it REACH. Different question entirely. A great model with the wrong tool surface underperforms a mediocre one with the right access. I've seen this flip: minimal gain from upgrading models, real lift from giving the agent the same git context I work with. The gap is rarely the model.

English

Jesaja@Jesaja·1d

@RKronen That is exactly the point. The gate filters out the obvious; the human judges what can't be tested. Well put.

German

Ralf Kronen@RKronen·2d

@Jesaja Of course a human should take a look. Just not at whether the tests are green; a machine does that more reliably. Save your attention for the decisions no test covers. The gate handles the mechanical checks, and you keep the judgment.

German

Jesaja@Jesaja·30 May

Developers now spend 11.4h/week reviewing AI-generated code vs. 9.8h writing new code. In 2024 it was the opposite. The productivity gains shifted to reviewers, not writers. Senior devs became the bottleneck, and the real leverage point. (Developer Survey 2026)

English

Jesaja@Jesaja·1d

@ThaoVyTP Thanks for the nice description! Yes, many Germans are drawn to Vietnam, and no wonder, with such a beautiful location between sea and mountains. I hope you find an English-speaking travel partner one day 😄

German

Thảo Vy@ThaoVyTP·2d

@Jesaja A lot of Germans come to Vietnam as tourists. I've met a few Germans, but I'm not very good at English 🥰 Vietnam is a country in Southeast Asia; it borders China to the north, Laos and Cambodia to the west, and the sea to the east.

Vietnamese

Jesaja@Jesaja·3d

Get Connected if you Love Movies ;-)

English

Jesaja@Jesaja·2d

@komal_uk01 Grok includes cursor model.

English

Jesaja retweeted

komal@komal_uk01·2d

Developers, you just got $200 what are you buying first?

English


Jesaja@Jesaja·2d

Same here: I built an app, it's even free in the App Store, but nobody wants it. The only thing that helps is talking about it and promoting the apps on social media, and even as I say that, there's no link to my app or anything else here. I'm just not a marketing expert. That's something an AI should really be good at.

Yuchen Jin@Yuchenj_UW

Before AI, I’d spend a weekend building 1 useless app. Now I can build 67 useless apps over a weekend, each with a logo, a fancy webpage, and 0 user.

German

Jesaja@Jesaja·2d

@ThaoVyTP I know Vietnam, but not the exact place.

German

Thảo Vy@ThaoVyTP·3d

@Jesaja From Vietnam 🇻🇳 It's afternoon in our country right now, so there's probably quite a big time difference between us 🥰🥰 Do you know my home country?

Vietnamese
