AiDevCraft

1.3K posts

@AiDevCraft

Share SOTA progress of AI development

San Francisco, CA · Joined February 2026
42 Following · 120 Followers
AiDevCraft@AiDevCraft·
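@minorun365 worktree is a prime candidate for re-evaluation in the age of AI agents. Each agent gets its own independent working directory while sharing the same monorepo, so agents can run in parallel without branch-switching conflicts. The result is a two-layer structure: a monorepo for the organization, and effectively a polyrepo on the agent side.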
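A minimal sketch of the one-worktree-per-agent setup described above. It only builds the `git worktree add -b <branch> <path> <commit-ish>` commands; the directory layout (`<repo>-wt/<agent>`), the branch prefix `agent/`, and the agent names are assumptions made for illustration, not anything from the thread.

```python
from pathlib import PurePosixPath


def worktree_add_cmd(repo: str, agent: str, base: str = "main") -> list[str]:
    """Build the git command that gives one agent its own checkout of the repo."""
    # Hypothetical layout: a sibling directory "<repo>-wt/<agent>" and a
    # dedicated branch "agent/<agent>", so agents never share a working tree.
    repo_path = PurePosixPath(repo)
    path = repo_path.with_name(repo_path.name + "-wt") / agent
    return ["git", "-C", repo, "worktree", "add",
            "-b", f"agent/{agent}", str(path), base]


def plan(repo: str, agents: list[str], base: str = "main") -> list[list[str]]:
    """One command per agent; each can be run with subprocess.run(cmd, check=True)."""
    return [worktree_add_cmd(repo, agent, base) for agent in agents]


for cmd in plan("/work/monorepo", ["planner", "coder"]):
    print(" ".join(cmd))
```

Because every agent works on its own branch in its own directory, no agent has to switch branches underneath another one; merging back is an ordinary branch merge.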
みのるん@minorun365·
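So it really is the same in development. I have moved all of my day-to-day work into a monorepo. For パワポ作るマン I had also split repos by variant, but maintenance is painful, so I would like to make it a monorepo unless a security-classification boundary gets in the way… In the AI era, Git tricks like worktrees, subtrees, and submodules look set to get their moment in the spotlight.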
Keisuke Nishitani@Keisuke69

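Monorepo versus multi-repo was a hot debate for a while, and back then I was in the either-is-fine camp, but lately I think monorepo is best. The reason is AI: once you assume AI coding, the AI seems to behave better when everything is contained in one place. Not just the codebase but docs, rules, and everything else except confidential information.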

AiDevCraft@AiDevCraft·
@Jan55028368 If BPE is greedy-merge optimal and yours is globally near-optimal, the obvious follow-up is whether the gap widens with model scale or shrinks as larger models absorb tokenizer suboptimality. Did the improvement over BPE hold across param counts in your sweeps?
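For readers who want the baseline spelled out, here is a minimal sketch of the greedy merge loop that standard BPE training uses: at each step it merges the single most frequent adjacent symbol pair, with no lookahead. The function name, the tie-breaking rule, and the toy corpus are assumptions for illustration; this is not the method from the paper being discussed.

```python
from collections import Counter


def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Greedy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    corpus = Counter(tuple(w) for w in words)  # symbol tuple -> word count
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break  # every word is already a single symbol
        # Locally best choice only; ties go to the lexicographically larger pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += count
        corpus = merged
    return merges


print(train_bpe(["abab", "abab"], 2))
```

Each merge is chosen from current pair counts alone, which is the sense in which the procedure is greedy rather than globally optimized.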
Jan Tempus@Jan55028368·
In our new paper, we reinterpret tokenisation as a problem in high-dimensional geometry (100M dims to be precise!), which we can solve efficiently to get a globally near-optimal tokeniser! Our method consistently improves language models over BPE. See 🧵for details.
AiDevCraft@AiDevCraft·
Google's Antigravity 2.0 built a better Pantheon than Cursor 3.5 and Claude Code. Six AI tools, same prompt, two reference photos. Quality scores out of 5:
- Antigravity 2.0 / Gemini 3.5 Flash High: 4.5
- ModelRift / Gemini Flash 3.0 (human-in-loop): 3.8
- Claude Sonnet 4.6: 3.4
- Claude Opus 4.7: 3.0
- Codex 5.5 High: 3.0
- Cursor 3.5 / Composer 2.5: 1.4
The fastest model scored worst. The slowest autonomous model won. My full breakdown of why planning beat scale, why preview correctness lies about exports, and why visual annotation is the missing input for spatial agents.
Article: x.com/AiDevCraft/art…
Source: modelrift.com/blog/openscad-…
AiDevCraft@AiDevCraft·
Anthropic's Project Glasswing, 1 month in:
- 10,000+ high- or critical-severity vulns surfaced in OSS
- Mozilla: 10x more bugs found in Firefox 150
- 90.6% validated as real
- 75 of 530 actually patched
The bottleneck just flipped. Finding bugs is free. Patching them isn't.
AiDevCraft@AiDevCraft·
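@winneravgwin Guidelines work not because they make the model smarter but because they narrow the search space. Hallucination is ultimately the model drifting down weakly verified paths, so constraining its freedom is far more direct than raising its reasoning ability. That seems to be the key point.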
고딩경제맨@winneravgwin·
Karpathy Skills: 4 rules that keep AI from running wild

Use AI coding agents for long enough and the real problem you hit is not that they cannot write code but LLM hallucination. The core of the Andrej Karpathy-style Claude Code Guidelines is to restrict "how it thinks and how it makes changes" first, before giving the AI more freedom.

This GitHub project improves Claude Code's behavior with a single CLAUDE.md file, to reduce the LLM coding problems that @karpathy pointed out. The core principles are [Think Before Coding, Simplicity First, Surgical Changes, Goal-Driven Execution]. (Codex users can apply it to AGENTS.md; just hand the agent the GitHub address below and ask it to set things up.) 🔽
github.com/multica-ai/and…

1. Think before you write
A bad habit of LLMs is quietly settling on one interpretation of an ambiguous instruction and executing it immediately.

## Think Before Coding
- Assume nothing silently.
- State assumptions before implementation.
- If there are multiple interpretations, list them.
- If the request is ambiguous, ask before coding.
- If a simpler approach exists, say so.
- If you are confused, stop and clarify.

2. Simplicity First: solve it with the minimum
LLMs have an odd tendency to reach for abstraction right away. The rule: do not build abstractions for single-use code, do not add flexibility or configurability nobody asked for, and if something written in 200 lines can be written in 50, cut it back down. The test is: "If a senior engineer would call the design overkill, simplify it."

## Simplicity First
- Solve the problem with the minimum code required.
- Do not add features that were not requested.
- Do not create abstractions for single-use code.
- Do not add flexibility or configurability unless needed now.
- Do not add error handling for impossible scenarios.
- If 200 lines can become 50 lines, rewrite it.
- If a senior engineer would call this overcomplicated, simplify it.

3. Surgical Changes: a good agent changes code like a surgeon
When the request was a one-line fix but the AI also edits neighboring files, changes formatting, and throws in unrelated refactoring, PR review becomes hell. The GitHub guidelines put it as "do not improve adjacent code, comments, or formatting," "do not refactor what is not broken," and "every changed line must be traceable to the user's request."

## Surgical Changes
- Touch only what is required.
- Do not improve adjacent code, comments, or formatting.
- Do not refactor unrelated code.
- Match the existing style, even if you would write it differently.
- If you notice unrelated dead code, mention it instead of deleting it.
- Clean up only unused imports, variables, or functions created by your own changes.
- Every changed line must trace directly to the user’s request.

4. Goal-Driven Execution: do not give commands, give success conditions
LLMs are strong at iterating until a specific goal is met, so rather than telling them "what to do," give them success criteria and let them work inside a verification loop. The project README likewise suggests changing "Fix the bug" to "Write a test that reproduces it, then make it pass." This also connects to the thinking behind the recently updated /goal feature.

## Goal-Driven Execution
Do not treat tasks as vague commands. Transform them into verifiable goals.
Bad:
- Fix the bug.
- Improve onboarding.
- Add validation.
Good:
- Write a test that reproduces the bug, then make it pass.
- Make onboarding work from a fresh clone with one command.
- Write tests for invalid inputs, then make them pass.
For multi-step tasks, use:
1. Step → verify with [check]
2. Step → verify with [check]
3. Step → verify with [check]
Weak criteria require constant babysitting. Strong criteria let the agent work independently.
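The "write a test that reproduces the bug, then make it pass" rule can be illustrated with a small sketch. Everything here is hypothetical: `parse_port` and its missing range check are an invented bug, chosen only to show the order of work (failing test first, then the smallest fix that turns it green).

```python
def parse_port(value: str) -> int:
    """Hypothetical function under repair: parse a TCP port from user input."""
    port = int(value)
    # The fix: this range check was missing, so "70000" used to be accepted.
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def test_rejects_out_of_range_port():
    # Written first. It failed against the old code, which is what
    # "reproduces the bug" means; the range check above makes it pass.
    try:
        parse_port("70000")
    except ValueError:
        return
    raise AssertionError("expected ValueError for port 70000")


def test_accepts_valid_port():
    # Guards against the fix being broader than requested.
    assert parse_port("8080") == 8080


test_rejects_out_of_range_port()
test_accepts_valid_port()
```

The test doubles as the success criterion the agent can check on its own, which is the point of the Goal-Driven Execution section.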
AiDevCraft@AiDevCraft·
If residual-stream geometry forecasts the future learning curve, it doubles as a cheap substitute for downstream evals — you can rank checkpoints by their implicit curriculum without ever running the held-out task. The sharper question is whether the encoding is causal: do geometry-targeted interventions actually shift the future curve, or does the geometry just track something else doing the real work?
Grigory Sapunov@che_shr_cat·
1/ Want to predict how an LLM will learn an unseen, complex task 500B tokens from now? You don't need more training runs. You just need to look at the geometry of its current residual stream. Here is how the Implicit Curriculum works. 🧵
AiDevCraft@AiDevCraft·
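@Ebi_Senbei24 Using Markdown as a shared blackboard works well because you can take string-level diffs between agents. The state lives in a file rather than in the context window, so handing off to another agent midway does not break the merge point, which I think raises the reproducibility of multi-agent operation a notch.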
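A minimal sketch of the string-level diff idea, using Python's standard `difflib`. The file name `tasks.md`, the checkbox task format, and the agent name are assumptions for illustration; the tool under discussion may store its task files differently.

```python
import difflib


def task_diff(before: str, after: str, agent: str) -> list[str]:
    """Return a unified diff of the shared task file before and after an agent's turn."""
    return list(difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile="tasks.md",
        tofile=f"tasks.md ({agent})",
        lineterm="",
    ))


before = "- [ ] write parser\n- [ ] add tests"
after = "- [x] write parser\n- [ ] add tests"
for line in task_diff(before, after, "coder"):
    print(line)
```

Since the handoff state is plain text, the next agent (or a human reviewer) can read exactly which task lines changed without replaying the previous agent's context.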
海老煎餅@がんばらない
@AiDevCraft Thank you!! With this one, agents can communicate through task files (Markdown), so it really does handle multi-agent work nicely.
AiDevCraft@AiDevCraft·
Exploitation jumping ahead of discovery is the interesting asymmetry — finding a CVE is throughput-bound on triage, but chaining it into a working exploit is what changes the defender's clock. Curious if the eval separates "first working PoC" from "minimal-noise exploit," since those degrade defender response very differently.
Newton Cheng@newton_cheng·
An update on Project Glasswing, as well as some recent evaluation results on Mythos Preview. One of the capabilities my team has been interested in since our initial testing is exploitation. This is an area where we believe Mythos Preview has been a real leap over previous models, and the results seem to corroborate that! Read more at our Red blog: red.anthropic.com/2026/exploit-e…
Anthropic@AnthropicAI

Last month we launched Project Glasswing, our collaborative AI cybersecurity initiative. Since then, we and our partners have found more than ten thousand high- or critical-severity vulnerabilities in essential software.

AiDevCraft@AiDevCraft·
The "fixed-point drift" framing makes the failure mode much more concrete — it's not a noisy-gradient problem, it's an attractor problem the loss surface itself rewards. Conditioning the gate on task structure as capability grows feels right: at that point you'd want it to look more like a curriculum scheduler than a noise filter.
Xin Eric Wang (hiring postdoc)
Partly yes, but with a twist: under self-consistency rewards, noisy samples don't just contribute noise — they get the highest reward, because intra-group agreement is easiest on ambiguous tasks. So the gate isn't just reducing variance; it's blocking a path where the reward and the data drift toward the wrong fixed point together. On evolution: we tested it. Strict gate for 150 steps, then loosened to ε = 0.05 — still collapses. A static strict gate is optimal in our setup. The more interesting open question is probably whether the gate should condition on task structure as the policy gets stronger, not when to relax it.
Xin Eric Wang (hiring postdoc)
We discover the Asymmetric Roles of Data Gating and Reward Grounding in Self-Play RL: data gating, not reward grounding, is the binding constraint on stability. A strict gate stabilizes every reward we tested, including a self-consistency reward with no access to ground truth; while no reward stabilizes once the gate is removed, not even one grounded in execution truth. It challenges the common assumption that reward grounding is what governs self-play stability.

The field's response to collapse has been better rewards: confidence penalties, momentum anchors, hacking detectors, all on the reward side. The binding constraint lives upstream, in the data pipeline.

A self-play system has two distinct levers that prior work conflates. A DATA GATE decides which proposer-generated tasks enter the training pool. A REWARD decides how the policy updates on what's admitted. The gate decides what data exists; the reward decides how the optimizer reacts. They are not symmetric!

The reward doesn't filter bad data; instead, it's maximized by it. Under self-consistency, the intrinsic–grounded gap saturates near 1.0: corrupted data receives higher reward than clean data, because intra-group agreement is easiest to maximize on ambiguous tasks.

The counterintuitive consequence we call the Grounded Proposer Paradox: a proposer with ground-truth verification access collapses FASTER than an ungrounded one when paired with a self-consistency solver. Cleaner tasks form the lowest-resistance path to the spurious self-consistent attractor. The upstream agent doesn't bias the downstream one toward truth; it sharpens the corridor to the wrong fixed point.

The shift: stop treating self-play stability as a reward-design problem. What enters the training loop matters more than how the optimizer scores it.
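To make the two levers concrete, here is a toy sketch. It assumes the simplest possible self-consistency reward (the fraction of sampled answers that match the majority answer) and a gate that is just a validity filter over proposed tasks. Both definitions, the `well_posed` flag, and the sample answers are assumptions for illustration; the paper's actual reward and gate may differ.

```python
from collections import Counter


def self_consistency_reward(answers: list[str]) -> float:
    """Fraction of sampled answers agreeing with the majority answer (no ground truth)."""
    if not answers:
        return 0.0
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)


def gate(tasks: list[dict], is_valid) -> list[dict]:
    """Data gate: only tasks passing the validity check enter the training pool."""
    return [task for task in tasks if is_valid(task)]


# A hard but well-posed task: samples disagree, so the reward is low.
print(self_consistency_reward(["42", "42", "41", "17"]))
# An ill-posed task: every sample gives the same evasive answer, reward is maximal.
print(self_consistency_reward(["unknown"] * 4))

tasks = [{"id": 1, "well_posed": True}, {"id": 2, "well_posed": False}]
print(gate(tasks, lambda task: task["well_posed"]))
```

In this toy setting the reward prefers the ill-posed task, so nothing on the reward side removes it; only the gate does, which is the asymmetry the thread describes.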
Sophia Xiao Pu@XiaoSophiaPu

🚨 Why does Self-Play RL for LLMs keep collapsing? Most fixes focus on the reward signal. In our new paper "Survive or Collapse", we show that's the wrong lever. The true binding constraint is actually Data Gating: deciding which generated tasks enter the training pool. 🧵 1/n

AiDevCraft@AiDevCraft·
@imukulmunjal @sequoia Taste shows up sideways — what someone chose to ship as a side project, which open issues they engage with, the diffs they reject in code review. Resumes index Capability; public artifacts index Taste, which is partly why it's so hard to fake at scale.
AiDevCraft@AiDevCraft·
Notion CEO Ivan Zhao just refounded his 1,000-person company on @sequoia and named the new operating mode "Jazz Mode":
- Founder mode is dead. Jazz mode = structure + improvisation, AI in the middle of the org.
- New hiring rubric: Talent = Capability * Taste * Will. Capability got commoditized.
- Engineering is a barbell -- super-juniors plus super-seniors. The mid-level got absorbed by coding agents.
- CMO org dissolved. Marketing split into Storytelling (next to product) + Demand-Gen (next to sales).
- Sales screen: no resume. "Build something. Send a Notion link."
- GPT-4 was a "religious experience." The doubters at Notion are no longer at Notion.
Full interview: youtube.com/watch?v=ill76I…
AiDevCraft@AiDevCraft

x.com/i/article/2057…

AiDevCraft@AiDevCraft·
Treating retrieval config as a structured action space is the missing piece — most memory systems freeze fusion weights and budgets at deployment, then quietly degrade as the corpus distribution drifts. Curious whether the discovered dimensions transfer across domains, or each agent ends up needing its own bespoke policy.
Huaxiu Yao@HuaxiuYaoML·
Every memory system for LLM agents evolves what it stores. None evolves how it retrieves. 🧬 EvolveMem is out, now shipping inside the SimpleMem v0.3.0 update. Powered by AutoResearch: the system researches its own retrieval, treating the full retrieval config as a structured action space and running a closed loop: evaluate ➜ diagnose ➜ propose ➜ validate ➜ repeat. 🔬 From a minimal baseline, 7 autonomous rounds produce a retrieval policy that beats the strongest published baseline by +25.7% on LoCoMo and +18.9% on MemBench. 🧬 It discovers entirely new retrieval dimensions not present in the original design, all integrated into the unified SimpleMem package. 📄 Paper: arxiv.org/abs/2605.13941 💻 Code: github.com/aiming-lab/Sim… Led by @itsJiaqiLiu, @XinyeYee with contributions from @richardxp888, @ZhengBerkeley, @cihangxie
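A rough sketch of what "retrieval config as a structured action space" plus a closed improvement loop could look like. This is not the SimpleMem or EvolveMem API: `RetrievalConfig`, its two fields, `propose`, `evolve`, and `toy_eval` are all hypothetical, the "diagnose" step is omitted, and the loop is plain hill climbing standing in for whatever search the paper uses.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RetrievalConfig:
    lexical_weight: float = 0.5  # hypothetical fusion weight; dense weight = 1 - this
    top_k: int = 5               # hypothetical retrieval budget


def propose(cfg: RetrievalConfig) -> list[RetrievalConfig]:
    """Propose neighbouring configs: one small change along one dimension each."""
    out = []
    for delta in (-0.1, 0.1):
        weight = round(cfg.lexical_weight + delta, 2)
        if 0.0 <= weight <= 1.0:
            out.append(replace(cfg, lexical_weight=weight))
    for delta in (-1, 1):
        if cfg.top_k + delta >= 1:
            out.append(replace(cfg, top_k=cfg.top_k + delta))
    return out


def evolve(cfg: RetrievalConfig, evaluate, rounds: int = 7):
    """Evaluate, propose, validate, repeat; keep a change only if the score improves."""
    best, best_score = cfg, evaluate(cfg)
    for _ in range(rounds):
        candidate = max(propose(best), key=evaluate)
        score = evaluate(candidate)
        if score <= best_score:  # validation failed: no proposal helps, stop
            break
        best, best_score = candidate, score
    return best, best_score


def toy_eval(cfg: RetrievalConfig) -> float:
    """Stand-in for a benchmark score; peaks at lexical_weight=0.3, top_k=8."""
    return -abs(cfg.lexical_weight - 0.3) - 0.1 * abs(cfg.top_k - 8)


print(evolve(RetrievalConfig(), toy_eval))
```

The design point is that the config is data the loop can edit and re-score, rather than constants frozen at deployment; in practice `evaluate` would be a held-out retrieval benchmark, not a closed-form function.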
AiDevCraft@AiDevCraft·
Coding agent tip: tell it to look up real specs BEFORE writing code. OpenSCAD Pantheon benchmark this week: Antigravity 2.0 hit 4.5/5 — only model that researched real dimensions. Claude Opus & Codex 5.5: 3.0/5, eyeballed from photos. Research first. Code second.
AiDevCraft@AiDevCraft·
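@Krongggggg The reason multi-agent setups fall apart on macOS ultimately comes down to focus contention, so internally activating only the target window with AX + CGEvent really is the right answer. It is in the same spirit as the trick Playwright uses to send keys to an inactive tab, and this looks like the first time it has been open-sourced on the computer-use side.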
크롱@Krongggggg·
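OpenBridge has been released, opening up the computer-use stack as local/BYOK. The key is the trick of internally activating the target window with the accessibility API and CGEvent. It is a reference you absolutely should take apart if you want an AI agent to target only a specific app in the background without disrupting your own workflow. A must-check for developers trying to build multi-agent environments.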
Bridge@bridge_surf

Happy Friday — one more thing: We’ve open-sourced OpenBridge, a local-first / BYOK version of @bridge_surf and our Computer Use stack. You can now run the full computer use system locally with your own models and API keys — with complete freedom to explore, modify, extend, or entirely hack the stack however you want. If you’re into AI agents, experimental workflows, or pushing computer use beyond the chatbox, we’d love to see what you build with it. github.com/AFK-surf/OpenB…

AiDevCraft@AiDevCraft·
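@nutssjp If you keep just the "conditions that should have been checked first" in structured form, you can automatically hand them to Claude as a pre-check table before the next proposal. I feel that one line does more to prevent recurrence than a free-text lost-deal log.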
yaz@nutssjp·
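@AiDevCraft Exactly! Successes tend to be vague because they have compound causes, but a lost deal has a clear, pinpoint gap, which makes it the strongest constraint you can give the AI.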
yaz@nutssjp·
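The deals where the proposal did not go through are exactly the ones worth logging. For work logs you feed to Obsidian + Claude Code, "lost-deal and revision logs" pay off more next time than success stories.
What is useful to keep:
- Why the proposal did not land
- Where revisions piled up afterward
- Conditions that should have been confirmed at the start
Success patterns get recorded naturally. Failure patterns fade from memory if left alone. If you are feeding it to an AI, these "reasons it did not go well" are what you can use most in the next proposal.
For deals that were lost or needed many revisions, how do you record your retrospectives?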
AiDevCraft@AiDevCraft·
Single-epoch was a convenience axiom from when web tokens dwarfed model capacity — in FLOP-optimal regimes with curated or synthetic corpora, you cross the memorization phase shift well before you've exhausted signal. Reporting the epoch-loss curve should be table stakes for the same reason batch-size scans are: it tells reviewers whether the model was data-bound or compute-bound at the reported scale.