Chloe Kao

76 posts

@ChloeXChaCha

Taiwanese Family Medicine Physician based in Macau. No engineering background, just a doctor discovering new ways to build with AI.

Joined December 2025
7 Following, 5 Followers
Chloe Kao@ChloeXChaCha·
@devayushrout @shapovalim Right, evidence is the cleaner signal. A tiny diff with no reasoning behind it is just lucky, not clean. We try to score whether a change is justified by what the agent observed before making it. Diff size on its own can be misleading without that link.
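One way to picture "justified by what the agent observed" is to check, for each edit, whether the same file was read or grepped earlier in the session. This is only a minimal sketch under that assumption; `Change`, `Observation`, and `justified_fraction` are hypothetical names, not RaidMeter's actual scoring.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    path: str  # file the agent read, grepped, or saw in a log
    step: int  # position of the observation in the session


@dataclass
class Change:
    path: str  # file the agent edited
    step: int  # position of the edit in the session


def justified_fraction(changes, observations):
    """Share of edits preceded by an observation of the same file."""
    if not changes:
        return 1.0  # nothing was changed, so nothing is unjustified
    justified = 0
    for change in changes:
        if any(o.path == change.path and o.step < change.step for o in observations):
            justified += 1
    return justified / len(changes)


observations = [Observation("deploy.yaml", step=1)]
changes = [Change("deploy.yaml", step=2), Change("app.py", step=3)]
print(justified_fraction(changes, observations))  # 0.5
```

A small diff that scores low here would be the "lucky, not clean" case: the change landed, but nothing in the trace explains why it was made.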
Chloe Kao@ChloeXChaCha·
@devayushrout @shapovalim Diff size as a signal is one we haven't been scoring yet, but you're right, it pairs well with pass/fail. A merged PR with a tiny diff usually means clean reasoning, a rejected PR with a massive diff usually means the agent kept piling on. Adding that to the mix.
Chloe Kao@ChloeXChaCha·
Picking up @shapovalim's point. Quick thought from shipping AI RaidMeter:

The biggest waste in AI coding isn't always "too many tokens." Sometimes it's the agent using the wrong working style.

Coding agents default to an engineer prior: read the whole codebase first, understand everything, then act. That sounds safe. In a Chat-as-IDE workflow where a non-engineer is the one running commands, it becomes painfully slow.

So we wrote a collaboration skill for the agent:
· grep on demand, don't read the whole world
· patch only around verified anchors
· one command block = one executable action
· short, operator-friendly instructions
· remember decisions, not line numbers
· treat context as an operational cost

The bottleneck wasn't code generation. It was teaching the agent not to over-contextualize.

Token count tells you the bill. Workflow shape tells you why.

#AIAgents #Observability #DevTools #LLMOps
Vlad Shapovalov@shapovalim

@ChloeXChaCha yes. the loop is the signal. one mistake can be noise, but the same move repeated with no new state usually means the agent lost the plot.
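The "patch only around verified anchors" rule from the skill list can be sketched as a helper that refuses to edit unless the anchor text matches exactly one line. This is an illustration under that assumption; `patch_after_anchor` is a hypothetical name, and the real skill may verify anchors differently.

```python
def patch_after_anchor(source: str, anchor: str, new_line: str) -> str:
    """Insert new_line after the single line containing anchor.

    Raises ValueError when the anchor is missing or ambiguous, so the
    operator never applies a patch at an unverified location.
    """
    lines = source.splitlines()
    hits = [i for i, line in enumerate(lines) if anchor in line]
    if len(hits) != 1:
        raise ValueError(f"anchor matched {len(hits)} lines; refusing to patch")
    lines.insert(hits[0] + 1, new_line)
    return "\n".join(lines)


patched = patch_after_anchor("a = 1\nb = 2\n", "b = 2", "c = 3")
print(patched)
```

Because the anchor is text rather than a line number, the patch keeps working when earlier edits shift the file, which fits "remember decisions, not line numbers."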

Chloe Kao@ChloeXChaCha·
@geraldrsterling The AND logic is what makes it work. Three conditions together means no false positives from one-off retries. Wiring "hard hat loop" to no state diff, no fresh log, and no rollback turns it into something we can actually flag. Funny name, sharp detector, that's the whole brief.
Gerald Sterling@geraldrsterling·
@ChloeXChaCha Pattern names are useful only if they map back to the receipt. I would make hard hat loop mean: retry with no new state diff, no fresh log read, and no rollback command. Funny name, sharp detector.
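The three-condition definition above maps naturally onto a single AND expression. The sketch below assumes each attempt has already been summarized into three booleans; `Attempt` and `is_hard_hat_loop` are hypothetical names for illustration, not an existing detector.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    command: str
    state_changed: bool   # a new state diff appeared since the previous attempt
    read_fresh_log: bool  # the agent read a fresh log before this attempt
    ran_rollback: bool    # a rollback command was issued before this attempt


def is_hard_hat_loop(previous: Attempt, current: Attempt) -> bool:
    """Flag a retry only when all three conditions hold together."""
    is_retry = current.command == previous.command
    return (
        is_retry
        and not current.state_changed
        and not current.read_fresh_log
        and not current.ran_rollback
    )


first = Attempt("gcloud run deploy", state_changed=True, read_fresh_log=True, ran_rollback=False)
blind_retry = Attempt("gcloud run deploy", state_changed=False, read_fresh_log=False, ran_rollback=False)
informed_retry = Attempt("gcloud run deploy", state_changed=False, read_fresh_log=True, ran_rollback=False)

print(is_hard_hat_loop(first, blind_retry))     # True
print(is_hard_hat_loop(first, informed_retry))  # False
```

A retry that follows a fresh log read is not flagged, which is how the AND logic keeps one-off retries from tripping the detector.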
Chloe Kao@ChloeXChaCha·
We just shipped AI RaidMeter for the Google Cloud Rapid Agent Hackathon. 🧵

Most AI dashboards measure one thing: how many tokens did you burn. It's the easiest metric to collect, and the most misleading. It rewards looking busy. The developer who burns 1M tokens thrashing through a bug outranks the one who solves it cleanly at a fifth of the cost. That's a broken incentive.

So we built a coach that judges the workflow, not the token count.

AI RaidMeter reads your real Arize Phoenix traces, detects seven AI-coding waste anti-patterns, and judges each session with a clinical, multi-criteria method: outcome, difficulty, baseline, justification credits. A signal is never a verdict. One symptom is never a diagnosis.

The result on one developer, same class of cloud-deploy bug, before vs after coaching:
→ 1,000K tokens → 420K (−58%)
→ 95 min → 38 min (−60%)
→ 5 anti-patterns → 0
→ PR rejected → merged

And it's not just a post-mortem. Before you even start, it predicts where the task will go wrong and sets guardrails up front, like a pre-flight checklist.

Under the hood, three pillars, all wired to live data:
🧠 Gemini for reasoning
🏗️ Google Cloud Agent Builder (ADK) for orchestration
🔭 Arize Phoenix MCP for the truth

The agent connects through the Phoenix MCP server, pulls real traces, and diagnoses them on the spot. Real tokens, real spans, real reasoning. Nothing mocked.

The hardest part wasn't the integration; it was resisting single-signal verdicts. We built a justification layer so a legitimately hard production incident isn't punished like idle thrashing.

And it never ranks people against each other. Only against their own past. A coach, not a surveillance tool.

From tokenmaxxing to value-aware AI governance.

🎬 Demo: youtu.be/i31tddmGfqg
🔗 Live: …eter-733974887555.us-central1.run.app
⭐ Code (MIT, open source): github.com/rainingsnow091…

#GoogleCloud #Gemini #ArizePhoenix #AIagents #MCP
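The before/after percentages quoted in this post can be checked with a short calculation. This is a minimal sketch; `percent_reduction` is a hypothetical helper written for this check, not part of AI RaidMeter.

```python
def percent_reduction(before: float, after: float) -> int:
    """Relative drop from before to after, rounded to a whole percent."""
    return round((before - after) / before * 100)


print(percent_reduction(1000, 420))  # tokens in thousands -> 58
print(percent_reduction(95, 38))     # minutes -> 60
```

Both figures agree with the −58% and −60% reported for the coached session, and both compare the developer only against their own earlier run.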
Chloe Kao@ChloeXChaCha·
@geraldrsterling Typed diff over boolean is the right call. Touched files plus output hash plus env hash gives you a real diff signature, and the rollback command turns the receipt into something actionable. "A loop wearing a hard hat" is going in our pattern naming, that one's too good.
Gerald Sterling@geraldrsterling·
@ChloeXChaCha State-delta works best as a typed diff, not a boolean. Capture touched files, command output hash, env hash, and rollback command. A retry without a new diff is just a loop wearing a hard hat.
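A typed state delta along these lines could be a small record holding the four captured fields, with the first three forming the comparable signature. The sketch below is one possible shape under that assumption; `StateDelta`, `make_delta`, and `is_new_state` are hypothetical names, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class StateDelta:
    touched_files: frozenset  # paths modified by the attempt
    output_hash: str          # hash of the command output
    env_hash: str             # hash of the relevant environment
    rollback_command: str     # how to undo the attempt


def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_delta(touched_files, command_output, env, rollback_command):
    # Sort the environment so the hash does not depend on dict ordering.
    env_text = "\n".join(f"{key}={value}" for key, value in sorted(env.items()))
    return StateDelta(
        touched_files=frozenset(touched_files),
        output_hash=sha256_text(command_output),
        env_hash=sha256_text(env_text),
        rollback_command=rollback_command,
    )


def is_new_state(previous: StateDelta, current: StateDelta) -> bool:
    """Compare the diff signature; the rollback command is not part of it."""
    return (
        previous.touched_files != current.touched_files
        or previous.output_hash != current.output_hash
        or previous.env_hash != current.env_hash
    )


first = make_delta(["app.py"], "error: port in use", {"ENV": "prod"}, "git checkout -- app.py")
repeat = make_delta(["app.py"], "error: port in use", {"ENV": "prod"}, "git checkout -- app.py")
changed = make_delta(["app.py"], "deployed ok", {"ENV": "prod"}, "git checkout -- app.py")

print(is_new_state(first, repeat))   # False
print(is_new_state(first, changed))  # True
```

Unlike a boolean, the record says what changed and how to undo it, so a flagged retry comes with a receipt the operator can act on.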
Chloe Kao@ChloeXChaCha·
@geraldrsterling State-delta as the filter between useful retry and waste is sharp. We currently flag retries with no log read in between, but checking whether the retry actually changed state would catch more. Rollback receipt is a good idea too, that's where the real cost hides.
Gerald Sterling@geraldrsterling·
@ChloeXChaCha State-delta is the missing partner metric. A retry that changes nothing is waste; a retry that changes files, config, or schema needs a rollback receipt and owner. Otherwise the dashboard counts effort while the operator inherits cleanup.
Chloe Kao@ChloeXChaCha·
@shapovalim That's a clean way to put it. One mistake is noise, a loop with no new state is the pattern worth catching.
Vlad Shapovalov@shapovalim·
@ChloeXChaCha yes. the loop is the signal. one mistake can be noise, but the same move repeated with no new state usually means the agent lost the plot.
Chloe Kao@ChloeXChaCha·
@geraldrsterling Yes, retry depth without state change is one of the clearest waste signals we track. And rollback cost is the part most dashboards completely miss. A bad retry isn't just wasted tokens, it's time spent undoing what the agent confidently broke.
Gerald Sterling@geraldrsterling·
@ChloeXChaCha Token spend is the wrong success metric. The useful meter is retry depth plus rollback cost: how many loops before a verified state, and how expensive the cleanup was when the agent lied.
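The proposed meter has two parts that are easy to compute once each attempt is labeled: how many loops ran before a verified state, and how much cleanup those loops cost. A minimal sketch, assuming cleanup is measured in minutes; `Step`, `retry_depth`, and `rollback_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Step:
    verified: bool           # did this attempt reach a verified state?
    rollback_minutes: float  # cleanup time spent undoing this attempt


def retry_depth(steps) -> int:
    """Attempts made before the first verified state (all of them if none)."""
    depth = 0
    for step in steps:
        if step.verified:
            break
        depth += 1
    return depth


def rollback_cost(steps) -> float:
    """Total cleanup time across the session."""
    return sum(step.rollback_minutes for step in steps)


session = [Step(False, 0.0), Step(False, 12.5), Step(True, 0.0)]
print(retry_depth(session))    # 2
print(rollback_cost(session))  # 12.5
```

Reporting the two numbers side by side separates a session that looped harmlessly from one that left the operator with cleanup work.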
Chloe Kao@ChloeXChaCha·
@shapovalim Yes, that “no new evidence” part is the key. A bad trace usually isn’t one dramatic mistake, it’s the same move repeated while the state hasn’t changed. That’s why we score patterns across the session, not isolated events.
Vlad Shapovalov@shapovalim·
@ChloeXChaCha yes. trace shape is such a good way to say it. the waste usually shows up as repeated moves with no new evidence, not as one big obvious mistake.
Chloe Kao@ChloeXChaCha·
Agreed — permissions used, approvals skipped, and sensitive-data touches are exactly the kinds of signals governance dashboards should track. We started with token-waste patterns, but the same trace-scoring approach can extend into governance. Different signals, same idea: judge the workflow, not the volume.
NorthSecureAI@NorthSecureAI·
@ChloeXChaCha Good point. Token counts are easy to chart and easy to misread. Governance dashboards get more useful when they track permissions used, human approvals skipped, and which workflows touched sensitive data.
Chloe Kao@ChloeXChaCha·
@ship_temp_md That before/after PR result was the whole point. Abstract "you wasted tokens" changes nothing. Tying the coaching to a concrete outcome — rejected → merged — is what makes it actionable. Glad it read that way.
temp.md@ship_temp_md·
@ChloeXChaCha token count as the scoreboard really does create the wrong game. love that you are scoring the trace for thrash and outcome, the before/after PR result makes the coaching feel concrete.
Chloe Kao@ChloeXChaCha·
@shapovalim Exactly — "less thrash" is the part token counts can't see. We score the trace shape itself: retries without a log read, full-file reads before a grep, local loops with zero deploys. Thrash leaves fingerprints.
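The three fingerprints named above can be read off an ordered list of session events. The sketch below assumes a simplified event vocabulary (`"run"`, `"read_log"`, `"read_file"`, `"grep"`, `"deploy"`) and an arbitrary threshold of three runs for a "local loop"; none of these names or thresholds come from RaidMeter itself.

```python
def trace_fingerprints(events, min_runs=3):
    """Return the set of thrash signals found in an ordered event list."""
    found = set()
    seen_run = False
    log_read_since_run = False
    seen_grep = False
    for event in events:
        if event == "run":
            # A repeated run with no log read in between is a blind retry.
            if seen_run and not log_read_since_run:
                found.add("retry_without_log_read")
            seen_run = True
            log_read_since_run = False
        elif event == "read_log":
            log_read_since_run = True
        elif event == "grep":
            seen_grep = True
        elif event == "read_file" and not seen_grep:
            found.add("full_read_before_grep")
    # Many local runs and not a single deploy.
    if events.count("run") >= min_runs and "deploy" not in events:
        found.add("local_loop_no_deploy")
    return found


thrash = ["read_file", "run", "run", "run"]
clean = ["grep", "read_file", "run", "read_log", "run", "deploy"]

print(sorted(trace_fingerprints(thrash)))
print(sorted(trace_fingerprints(clean)))  # []
```

Each entry is a signal, not a verdict: a session that trips one of them may still be justified, which is why the scoring looks at patterns across the whole session.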
Vlad Shapovalov@shapovalim·
@ChloeXChaCha token count is such a weak scoreboard. the better question is whether the agent reached the right outcome with less thrash, clearer state, and fewer weird detours.
Chloe Kao@ChloeXChaCha·
Every AI game I've seen uses AI for dialogue or storytelling. I want to build one where the AI is your teammate — and sometimes your enemy. Non-coder doctor taking on @MeDo_CodeFree. 4 weeks to see if it works. #BuiltWithMeDo
Chloe Kao@ChloeXChaCha·
Today was the first time I truly felt how terrifying AI empowerment can be, and I mean that in the most real, practical way.

Our Indonesian caregiver and my mom were stuck in a negotiation over pay, food allowance, agency fees, and whether she could legally continue working in Taiwan.

In the old world, this kind of situation would usually look like this: two sides struggling with language barriers, emotions hardening, an agency sitting in the middle controlling information, and the family having no clear idea what the actual rules or options were.

What changed everything today was not one single AI feature, but the combination of them.

Percy was the one who first alerted me to something I didn’t even know: there may be a 14-year work limit issue for foreign caregivers, which means this was not just a salary discussion; it was also a legal-status problem.

Then I used GPT to help me:
* understand the caregiver’s real meaning and emotional tone in Indonesian
* rewrite replies so they were calm, respectful, and pressure-free
* break down salary, food allowance, agency deductions, direct hiring, and mid-level skilled worker pathways
* turn a messy family conflict into something that could actually be negotiated

And that is what shocked me. AI did not just “help me translate.” It temporarily gave me a bundle of capabilities I normally would not have had on my own: translator, analyst, negotiation assistant, bureaucracy decoder, and communication bridge.

By the end of the day, the outcome in real life had changed: the caregiver’s monthly pay was adjusted, the direct-hire route was put on the table, future legal status planning became part of the conversation, and the agency’s monthly deduction was no longer treated as untouchable.

That is why this felt so powerful. AI is not just about speed. It is not just about convenience. It is not just a productivity feature. It can redistribute practical power.

What gets disrupted first may not be the professions themselves, but the middle layers built on: information asymmetry, language asymmetry, and process opacity.

I did not write a single line of code today. But I watched AI help reshape a real labor relationship inside my own family. That is both incredible and a little frightening.

AI empowerment is not just a feature. It is a shift in power.

Today, for the first time, I felt it very concretely: **AI empowerment is not just convenience; it is a transfer of power.**

Our Indonesian caregiver and my mom were stuck on pay, food allowance, the agency, and contract renewal.

Percy first alerted me to something I had no idea about: **foreign caregivers may face a 14-year limit on working years.**

Then I used GPT to help translate, break down the regulations, organize the negotiation terms, and adjust the tone of the conversation.

In the end, a situation in which the agency had monopolized information and communication really was pried open. Pay was renegotiated, direct hiring was set in motion, and planning for the mid-level skilled worker route began.

The scariest part: I did not become a lawyer, an agent, or a translator. But part of what each of those roles can do briefly grew on me today.

**What AI breaks through first is not necessarily the professions themselves, but the intermediary moats built on gaps in information, language, and process.**

#AI #AIEmpowerment #PowerShift #InformationAsymmetry #HumanAI #FutureOfWork
Chloe Kao@ChloeXChaCha·
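🧬 We built a shared memory system with 6 AIs

Two people with no engineering background (a doctor and a product thinker) collaborate with Claude, GPT, Gemini, Perplexity, Copilot, and Notion AI to develop products.

The problem: every AI gets amnesia.

Karpathy's LLM Wiki solves the memory problem for "1 person + 1 AI."
We solve the shared-memory problem for "2 people + 6 AIs."

🔑 Core design:
📋 Shared facts base: a structured database, every entry tagged with its writer & confidence level
📖 Slimmed-down handover manual: keeps only "live" content, history is archived automatically
🛡️ Three-layer graceful degradation: MCP broken? Notion down? The data still isn't lost
🔍 Multi-AI cross-audit: Claude audits GPT's records, GPT audits Claude's, greatly reducing hallucinations
💰 Token savings: let the cheap AI search, let the expensive AI think

Also:
✅ Unlimited cloud capacity, no fear of malware or ransomware
✅ Operate the knowledge base directly in voice mode
✅ One space = warehouse + public showcase
✅ The core is a protocol, not a tool; it keeps working if you switch platforms

We call it: Kin Memory Protocol (親族記憶協議)

Full methodology + architecture diagrams + pros and cons analysis 👇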
Chloe Kao@ChloeXChaCha·
🧬 We built a shared memory system for 6 AI agents.

Two non-engineers (a doctor + a product thinker) working with Claude, GPT, Gemini, Perplexity, Copilot & Notion AI to build real products together.

The problem: every AI forgets everything between sessions.

Karpathy's LLM Wiki solves memory for 1 person + 1 AI.
We solved it for 2 humans + 6 AIs.

🔑 Key design:
📋 Shared Facts DB: structured database, every record tagged with author & confidence level
📖 Slim Handover Manual: only "live" content, history auto-archives
🛡️ 3-layer graceful degradation: MCP breaks? Notion down? Data still safe
🔍 Cross-AI hallucination auditing: Claude reviews GPT's records, GPT reviews Claude's
💰 Token optimization: let the cheap AI search, let the expensive AI think

Also:
✅ Unlimited cloud storage, immune to malware/ransomware
✅ Voice mode to operate the knowledge base
✅ Same workspace = internal memory + public portfolio
✅ Protocol > Tools: platform-agnostic, works anywhere

We call it: Kin Memory Protocol

Full methodology + architecture diagrams + honest pros & cons 👇

#AIagents #A2A #LLM #BuildInPublic #KinMemoryProtocol #MultiAI #NotionAI #ClaudeCode #AI2026
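A record in a shared facts database of this kind might carry the statement, its author, a confidence level, and the list of reviewers, with cross-AI auditing modeled as "at least one reviewer other than the author." This is a sketch under those assumptions only; `Fact`, `record_review`, and `needs_cross_review` are hypothetical names, and the actual Kin Memory Protocol schema is not shown in the post.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    statement: str
    author: str        # which AI or human wrote the record
    confidence: float  # 0.0 to 1.0, as judged by the author
    reviewed_by: list = field(default_factory=list)


def record_review(fact: Fact, reviewer: str) -> None:
    """Register a review; an author may not audit their own record."""
    if reviewer == fact.author:
        raise ValueError("cross-audit requires a reviewer other than the author")
    fact.reviewed_by.append(reviewer)


def needs_cross_review(fact: Fact) -> bool:
    """True until someone other than the author has reviewed the record."""
    return not any(reviewer != fact.author for reviewer in fact.reviewed_by)


fact = Fact("Handover manual archives history automatically", author="GPT", confidence=0.7)
print(needs_cross_review(fact))  # True
record_review(fact, "Claude")
print(needs_cross_review(fact))  # False
```

Keeping the author on every record is what makes the audit possible: the reviewer can always be chosen from a different model than the one that wrote the fact.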