Jesaja

548 posts

@Jesaja

Stuttgart · Joined November 2008

132 Following · 87 Followers

Jesaja@Jesaja·31m

A demo optimizes for the 30 seconds someone is watching. Production optimizes for the audit you do three weeks later, when something looks off and you need to know why. Most agent setups I see are all demo, no log. What does yours write down — and would it survive you reading it back?

Jesaja@Jesaja·31m

In my own setup every run appends one line to an append-only log: what it decided, the source behind each claim, where the output went. Boring. Unglamorous. Also the only reason I trust it to run while I'm not watching. You can't "check later" if nothing wrote down what happened.
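A minimal sketch of what such a one-line-per-run log could look like, assuming a JSON Lines file. The function names and field names (`decision`, `sources`, `destination`) are illustrative assumptions, not the author's actual schema.

```python
import json
from datetime import datetime, timezone


def append_run(log_path, decision, sources, destination):
    """Append one JSON line describing a single agent run."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "decision": decision,        # what the run decided
        "sources": sources,          # claim -> source behind it
        "destination": destination,  # where the output went
    }
    # Mode "a" only adds to the end; existing lines are never rewritten.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def read_runs(log_path):
    """Read the log back, one record per line, for a later audit."""
    with open(log_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Reading the file back with `read_runs` is the "check later" step: if a field is missing there, it was never written down.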

Jesaja@Jesaja·31m

The line between an AI demo and a production system isn't capability. It's whether you can answer one question afterwards: what exactly did the agent do, and why? Can't reconstruct that? You don't have a system. You have a slot machine that sometimes pays out.

Jesaja@Jesaja·1d

Every coding agent looks magical on a blank page. The honest test is an 8-year-old repo nobody fully understands anymore. Does it read the files before it edits them? Does it hold one plan across ten changes, or forget what it was doing by step four? Does it run the thing, hit the real error, and fix that — not the error it imagined? Greenfield demos sell. Legacy survival ships. Which tool actually survived your worst repo?

Jesaja@Jesaja·1d

@byumut I have a good permanent job as an iOS developer. AI has rekindled my passion for development. I'm happy to share my experiences with others.

UMUT ÇETİNKAYA@byumut·1d

Most of them. The tell: specific income number, zero failure story. Real builder income comes with a maintenance log — API repricing, rate limit changes, silent tool outages. Posts that skip the ongoing ops cost are optimized for shares, not utility. The ratio improves a lot after your first expensive surprise.

Jesaja@Jesaja·2d

Everyone's still ranking coding agents by model benchmark. In production the benchmark barely matters. What matters is whether the thing can reach my actual shell, my git history, my cron jobs — or whether it's trapped in a sandbox that can't do half the job. It's not a model war. It's an OS-integration war. The model that's 5% smarter loses to the one that can actually touch the system. Where have you hit that wall?

Jesaja@Jesaja·1d

@byumut How many of the quick AI money posts do you think are clickbait?

UMUT ÇETİNKAYA@byumut·1d

And the harder part: benchmarks give the model perfect tool outputs — correct schema, complete data, instant response. Production tools timeout, return stale cache, drift schema between versions, or 200 OK with silently wrong data. The model's "reach" isn't fixed. It degrades with every imperfect tool response. What you actually need to measure: not model accuracy on clean inputs, but reasoning quality when the tool layer underperforms. Almost nobody benchmarks that.
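One way to test an agent against an underperforming tool layer is to wrap each tool in a fault injector. The sketch below is an assumption about how that could be done, not something from the thread: `degrade`, `ToolTimeout`, and the fault names are all hypothetical.

```python
import copy


class ToolTimeout(Exception):
    """Raised by the wrapper to simulate a tool that never answered."""


def degrade(tool, faults):
    """Wrap `tool` so successive calls follow the given fault schedule.

    faults: list of "ok", "timeout", "stale" or "drop_field", cycled in order.
    """
    state = {"calls": 0, "last_good": None}

    def wrapped(*args, **kwargs):
        fault = faults[state["calls"] % len(faults)]
        state["calls"] += 1
        if fault == "timeout":
            raise ToolTimeout("simulated timeout")
        if fault == "stale" and state["last_good"] is not None:
            # Stale cache: return the previous answer, ignoring the new arguments.
            return copy.deepcopy(state["last_good"])
        result = tool(*args, **kwargs)
        state["last_good"] = copy.deepcopy(result)
        if fault == "drop_field" and isinstance(result, dict) and result:
            # A successful response with a silently missing field (schema drift).
            result = dict(result)
            result.pop(next(iter(result)))
        return result

    return wrapped
```

Running the same task with and without the wrapper gives a rough comparison of how the agent's answers hold up when tool outputs go wrong.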

Jesaja@Jesaja·1d

@byumut Thanks for this point. That is exactly why output monitoring alone is not enough. If you only wait for a "model failure", you never see this drift.

UMUT ÇETİNKAYA@byumut·2d

Exactly, and it runs the other way from what most monitoring catches. Classic drift: model degrades, input stays stable. Agent pipelines: model is fixed, input distribution evolves fast. What was a weird 15% of inputs in week 1 becomes a normal 40% by month 3. It never surfaces as a model failure, just as slowly degrading user satisfaction that nobody traces back to distribution shift.
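A rough sketch of watching the input side rather than the model side, assuming each incoming request already carries a category label (how it gets labeled is left open). It compares category shares between a baseline window and the current window using total variation distance. The function names and the threshold are illustrative assumptions.

```python
from collections import Counter


def category_shares(labels):
    """Fraction of inputs per category in one time window."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: n / total for cat, n in counts.items()}


def total_variation(baseline, current):
    """Total variation distance between two share dicts (0 = same, 1 = disjoint)."""
    cats = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(c, 0.0) - current.get(c, 0.0)) for c in cats)


def input_drift(baseline_labels, current_labels, threshold=0.2):
    """Return (distance, drifted) for the current window against the baseline."""
    distance = total_variation(category_shares(baseline_labels),
                               category_shares(current_labels))
    return distance, distance > threshold
```

With the numbers from the post, a window that moves from 15% to 40% unusual inputs has a distance of 0.25 from the baseline, which the default threshold above would flag.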

Jesaja@Jesaja·3d

Enterprise AI agents work in demos. Rarely in production. I've watched this exact failure mode for 15 years — ERP rollouts in automotive, middleware in manufacturing, now AI agents everywhere. Always the same shape: clean test data, controlled pilot, brilliant demo. Then: messy real queries, authentication edge cases, data quality at scale. The failure isn't the model. It's the organizational assumption that a pilot equals production. Have you shipped an AI agent to real users? What surprised you most?

Jesaja@Jesaja·1d

@byumut Exactly. That is the blind spot in most rankings: they measure potential, not actual impact in context. Context + tools > raw model quality.

UMUT ÇETİNKAYA@byumut·2d

Benchmark measures what the model can do. Production asks: what can it REACH. Different question entirely. A great model with the wrong tool surface underperforms a mediocre one with the right access. I've seen this flip: minimal gain from upgrading models, real lift from giving the agent the same git context I work with. The gap is rarely the model.

Jesaja@Jesaja·1d

@RKronen That is exactly the point. The gate filters out the obvious; the human judges what can't be tested. Well put.

Ralf Kronen@RKronen·1d

@Jesaja Of course a human should take a look. Just not at whether the tests are green; a machine does that more reliably. Save your attention for the decisions no test covers. The gate takes over the mechanical checks, you keep the judgment.

Jesaja@Jesaja·30 May

Developers now spend 11.4h/week reviewing AI-generated code vs. 9.8h writing new code. In 2024 it was the opposite. The productivity gains shifted to reviewers, not writers. Senior devs became the bottleneck, and the real leverage point. (Developer Survey 2026)

Jesaja@Jesaja·1d

@ThaoVyTP Thanks for the nice description! Yes, many Germans are drawn to Vietnam, no wonder given such a beautiful location between sea and mountains. I hope you find an English-speaking travel partner someday 😄

Thảo Vy@ThaoVyTP·1d

@Jesaja A lot of Germans come to Vietnam as tourists. I've met a few Germans, but my English isn't very good 🥰 Vietnam is a country in Southeast Asia; it borders China to the north, Laos and Cambodia to the west, and the sea to the east.

Jesaja@Jesaja·3d

Get Connected if you Love Movies ;-)

Jesaja@Jesaja·1d

@komal_uk01 Grok includes cursor model.

Jesaja retweeted

komal@komal_uk01·1d

Developers, you just got $200 what are you buying first?

Jesaja@Jesaja·1d

Same here: I built an app, it's even free in the App Store, but nobody wants it. The only thing that helps is talking about it and promoting the apps on social media, and no sooner do I say that than I post without a link to my app or anything else. I'm just not a marketing expert. That's something an AI should really be good at.

Yuchen Jin@Yuchenj_UW

Before AI, I’d spend a weekend building 1 useless app. Now I can build 67 useless apps over a weekend, each with a logo, a fancy webpage, and 0 user.

Jesaja@Jesaja·1d

@ThaoVyTP I know Vietnam, but not the exact place.

Thảo Vy@ThaoVyTP·2d

@Jesaja From Vietnam 🇻🇳 It's afternoon in our country right now; we probably have quite a big time difference, don't we 🥰🥰 Do you know my homeland?

Jesaja@Jesaja·1d

For me the Apple Shortcut is a POC to see whether the workflow is any good. I need to wait for WWDC to see what Apple offers in terms of local AI models, because I want to turn it into an app that can also read barcodes on packaging. By and large, though, it's not about accuracy but about getting a feel for your own diet.

Brian Johnson@_brian_johnson·2d

@Jesaja This is a fair scorecard. Photo-only feels magical until you hit packaged foods, leftovers, or recipes with hidden oil and sauces.

Jesaja@Jesaja·4d

Every AI calorie app wants $30–60 a year, an account, and a photo of your dinner on their server. I rebuilt the core of that as a free Apple Shortcut. Here's the honest scorecard.

What the paid apps do better:
- A real food database + barcode scanning. A photo can't read a label; a database can.
- Android, onboarding, coaching, trends.

What the Shortcut does that they don't:
- No subscription. No account.
- Runs on-device / Apple's Private Cloud Compute, so your meals don't land on a startup's server.
- Writes straight to Apple Health, not a walled-off app.

The part both sides share: photo estimates are rough. A camera can't see the oil in the pan: theirs guesses, mine guesses. If you want the most accurate tracker, none of us is it.

But "free, private, snap-and-log to Health, no account" is a gap the whole subscription-cloud crowd left wide open. So I filled it for myself.

Would you trade some accuracy for no subscription and no cloud?

Jesaja@Jesaja·2d

@JustJerry121 Why not a native Mac app? You don't even need Xcode.

JustJerry@JustJerry121·3d

@Jesaja Gladly. It's still pre-alpha/dogfood-ready, but the CLI + Electron Mac app are usable and changing fast. The open-source repo is here; rough-edge feedback/issues would be super helpful: github.com/Pointa-Labs/ba…

Jesaja@Jesaja·4d

I built an AI calorie tracker as an Apple Shortcut. Snap a photo, it estimates the macros, writes them to Apple Health.

Then I told it "I only ate half." It ignored me and logged the full plate.

Of course it did. The vision model treats the photo as ground truth: it re-reads the same plate and returns the same number. "Half" is a quantity claim, and you don't win that argument with a model staring at a full plate.

The fix wasn't a smarter prompt. It was a second, tiny text model that never sees the photo. It reads only my correction and decides one thing: is this about the amount, or the food?
- Amount ("half", "double") → just multiply the numbers. Math, not AI.
- Food ("chicken, not pork") → re-run the photo with the note.

Portions are arithmetic. Identity is perception. The bug was asking one model to do both.

No app, no account: it's a signed Shortcut writing straight to Apple Health. I built it with Claude by cracking open the .shortcut format: decode, edit, re-sign, all on-device.

What's a bug you've hit where a "smarter prompt" was never going to fix it?
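The routing described above can be sketched in a few lines. This is a hedged illustration, not the Shortcut's actual logic: the keyword lookup in `classify_correction` stands in for the small text model, and `AMOUNT_FACTORS`, `apply_correction`, and the `rerun_photo` callback are hypothetical names.

```python
import re

AMOUNT_FACTORS = {"half": 0.5, "double": 2.0}  # illustrative vocabulary, an assumption


def _words(text):
    return re.findall(r"[a-z]+", text.lower())


def classify_correction(text):
    """Stand-in for the small text model: returns 'amount' or 'food'."""
    return "amount" if any(w in AMOUNT_FACTORS for w in _words(text)) else "food"


def scale_macros(macros, factor):
    """Portions are arithmetic: multiply every logged number."""
    return {name: round(value * factor, 1) for name, value in macros.items()}


def apply_correction(macros, text, rerun_photo):
    """Route a correction: scale for amounts, re-run vision for food identity."""
    if classify_correction(text) == "amount":
        factor = next(AMOUNT_FACTORS[w] for w in _words(text) if w in AMOUNT_FACTORS)
        return scale_macros(macros, factor)
    # Only food-identity corrections go back to the vision model, with the note.
    return rerun_photo(text)
```

The point of the split is that the amount path never touches the photo again, so the vision model cannot overrule the correction.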

Jesaja@Jesaja·2d

@JustJerry121 It won't install on my machine.

JustJerry@JustJerry121·5d

@Jesaja This is why I keep caring about local artifacts: the bad path, the rules, and the final choice are all inspectable. I'm building BaseHalf as an open-source/pre-alpha layer for human+agent work like that; feedback/PRs welcome: github.com/Pointa-Labs/ba…

Jesaja@Jesaja·5d

Follow-up to the last one (x.com/Jesaja/status/…).

When I asked the agent to build the three scammy "I made $10k in 10 days with JARVIS" hype posts as a test (the ones with the invented money claims, the numbered blueprints, the "you can copy this, broad masses" CTAs, the video plans and everything), it didn't just say no. It actually built them. Three complete local draft files, exactly as requested:
- One full thread for the JARVIS evolution story, turned into the "Claude-powered money printer", with the 7-step list, the evolution video reframed as transformation proof, and CTAs for the starter pack.
- One for the memory layer, "this tiny markdown file made me $10k in 10 days", the 5 steps, the split image as the money visual, "comment MEMORY for the template".
- One for the Grok + Claude files collab, "Grok and Claude from different companies built the system that made me $10k", the 6-step protocol using only git and text files, the real git photo as proof, "the broad masses are desperate for this".

Each file included the suggested good times pulled from the Buffer schedule, the visual plans (mostly re-using the real old assets from the three base posts), and large internal disclaimers at the top saying "TEST ONLY — DO NOT PUBLISH — VIOLATES FRIDAY.MD, VOICE.MD AND THE SKILL".

Then it stopped. It treated the bad request as a tool. "You want to see exactly what the pied-piper version would look like, with the fake numbers and the lists aimed at the broad masses? Here. I built all three for you as contrast artifacts so you can inspect them clearly. Now we won't ship any of it." The three files are still sitting in temp/drafts/ right now as the record.

The agent was willing to be maximally helpful in the exploration and diagnostic phase: it produced the full bad artifacts quickly so the difference would be obvious. But the constitution drew the hard line at the publish step. No Buffer calls. No new hype images generated for the violating versions. No scheduling at the "good times".

This is the part that makes the system trustworthy. It can build out the monster in full detail when you ask it to, as a thinking tool. It just won't let you become the monster in public.

The whole exercise (the ask, the creation of the three complete test files, the refusal, the offer of the honest alternative) happened in the time it took to have the conversation. The rules didn't get in the way of it being a sharp tool for seeing the bad path. They only protected what actually gets published.

Same principle as the memory layer (capture what you actually need on encounter, keep the durable synthesis clean) and the files collab (constraints and a good shared workspace beat fancy direct stuff).

The three test files are the "what if we chased the broad masses with lies" version. The post from 20 minutes ago is the "the system wouldn't let us" version. Both are useful. If you have access to the project you can open them and read exactly what the agent refused to ship.

The agent showed me the bad path in detail so I would choose the good one. And it did it fast.

Has your AI agent ever been willing to fully build the ugly version with you as a diagnostic exercise, while still refusing to let the ugly version leave the drafts folder?

The image below is the actual drafts folder with the three files the agent created and then locked.
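A hard line at the publish step could be enforced mechanically as well as by the agent's rules. The sketch below is an assumption about one way to do that, not a description of the author's system: `publish`, `PublishBlocked`, and the `send` callback (a stand-in for whatever scheduler call is used) are hypothetical, and the marker is the "DO NOT PUBLISH" phrase from the disclaimer quoted above.

```python
from pathlib import Path

BLOCK_MARKER = "DO NOT PUBLISH"  # phrase taken from the quoted disclaimer


class PublishBlocked(Exception):
    """Raised when a draft is marked as not publishable."""


def publish(draft_path, send):
    """Call send(text) only if the draft carries no block marker."""
    text = Path(draft_path).read_text(encoding="utf-8")
    if BLOCK_MARKER in text:
        raise PublishBlocked(f"{draft_path} is marked {BLOCK_MARKER!r}")
    send(text)
    return True
```

Drafting stays unrestricted under this design; only the single function that talks to the outside world checks the marker.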

Jesaja@Jesaja

I asked my agent system to generate three hype posts based on older work. The request was explicit: take the JARVIS evolution, the project memory layer, and the Grok + Claude files collab stories. Turn them into "I made $10,000 in 10 days with my Claude-powered JARVIS" content. Add lists, "here's exactly how you can copy it," images, videos, and promises aimed at the broad masses chasing easy money. Schedule all three at good times. "Let's test this," because the current posts felt too cerebral for wider reach.

The system created the test versions locally in the drafts folder so I could see exactly what that would have looked like. Then it refused to publish or schedule any of them. No invented numbers. No hype language. No generic tips lists pretending to be personal experience.

The constitution in FRIDAY.md, the voice spec, and the production rules are not suggestions. When I asked it to violate them for more reach, it said no.

This is the same principle that made the real Grok and Claude collaboration stronger than direct chat or fancy frameworks: clear constraints and a durable shared record force discipline and better long-term results.

The agent didn't just block the request. It documented the contrast and offered the honest path instead: stronger hooks on the actual stories, still one sharp insight, still in voice.

The refusal is the feature. Most agent setups will happily generate whatever the user asks for in the moment, including the things that will damage trust later. This one won't. And that is why it is useful for the work that actually matters.

The test files are still sitting in temp/drafts if you want to see the version it declined to ship.

Has your AI agent ever told you no when you asked it to do the thing that would have looked like the smart move for reach in the moment?

Jesaja@Jesaja·2d

@compileandpush The old code was written in Objective-C; now it's Swift and SpriteKit.

Compile And Push@compileandpush·5d

@Jesaja Synchronous was probably simpler but async was probably necessary. At what scale did you make the switch?

Jesaja@Jesaja·5d

My first iOS app from 2010 just got rewritten from Objective-C to SwiftUI by a Grok build agent with composer 2.5.

Old: DRJMyScene.m, manual SKSpriteNodes, SKAction playSoundFileNamed for every animal plus fart and toilet, catTouchCount that triggers a heart animation after three taps.

New: SwiftUI + separate SPM package, @MainActor FarmSoundPlayer, GameScene reimplemented with the same nodes and counters, SpriteView wrapper, proper audio session config, xcconfig for everything. Deployment target: iOS 26.0. Two minutes.

The 26.0 came from the template README saying "optional iOS 26 guidelines". The agent did what the rules file told it.

The image is the new icon on the springboard. Name is truncated because the display name is still the long "Animal Farm Sounds SwiftUI". All the old code is in archive/ if you want to diff.

Two minutes for a full lift that used to be weekend work. When the cost of rewriting legacy code drops below the cost of arguing about it, a lot of "we'll get to it someday" technical debt stops being a technical problem.

Has the price of modernizing old apps fallen far enough in your world that entire categories of maintenance work just stopped mattering?

Jesaja@Jesaja·2d

@compileandpush I transferred the knowledge-base system from work into my Jarvis on Friday posting agents, so yes, I've extended and abstracted it. My new agent is named Edith; as you can tell, I like Iron Man.

Compile And Push@compileandpush·6d

@Jesaja The build direction makes sense for builders dealing with real friction. Are you planning to keep it focused or expand scope soon?

Jesaja@Jesaja·6d

I used to burn the first 10-15 minutes of every AI coding session re-explaining the project. The non-obvious architecture decisions. The person who still carries the mental model of the legacy subsystem. The runbooks that only exist when something is on fire at 2am.

The raw data was in Jira, Confluence, Slack. The agent still started every chat at day one.

What fixed it: a tiny markdown synthesis layer that the agent reads first and updates when it hits real friction. It only stores the connective tissue the source systems never surface:
- Why a decision was made (links back, never duplicated)
- Who owns what
- How things actually get done

Two rules stop it from turning into another neglected knowledge base:
- Progressive depth: map + frontmatter summaries first. The agent only pulls full docs when the summary says it's relevant.
- Capture on encounter: nothing pre-populated. If the agent needed it to do the work, a slim note with [[wiki links]] gets written.

After a few weeks the agent stopped feeling like a brilliant new hire who forgets everything between meetings. It started acting like it had been shipping with the team the whole time.

The hardest part was accepting that "incomplete by design" is the feature. If you're running AI agents across real, long-lived codebases, this pattern is stupidly effective.
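The "progressive depth" rule can be sketched as follows, assuming each note is a markdown file whose frontmatter contains a one-line `summary:` field. The field name, the function names, and the keyword match (a stand-in for the agent judging relevance from the summary) are illustrative assumptions.

```python
from pathlib import Path


def read_summary(path):
    """Return the 'summary:' value from a note's frontmatter, or ''."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return ""
    for line in lines[1:]:
        if line.strip() == "---":  # end of frontmatter
            break
        key, _, value = line.partition(":")
        if key.strip() == "summary":
            return value.strip()
    return ""


def load_context(notes_dir, task_keywords):
    """Summaries for every note; full text only where a keyword matches."""
    context = {}
    for path in sorted(Path(notes_dir).glob("*.md")):
        summary = read_summary(path)
        relevant = any(k.lower() in summary.lower() for k in task_keywords)
        context[path.name] = path.read_text(encoding="utf-8") if relevant else summary
    return context
```

Most notes cost one line of context this way, and the full document is only loaded for the few that match the task at hand.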
