Ilpo Leppänen

5.3K posts

@ileppane

Joined January 2024
4.6K Following · 280 Followers
Ilpo Leppänen @ileppane
@mattshumer_ @nickbaumann_ Even better, call claude-code through tmux (or equivalent) so you get the full experience without being limited by the upcoming changes to claude -p.
0
0
1
53
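A minimal sketch of the tmux approach from the reply above, assuming an interactive `claude` session and that pressing Enter submits the prompt. The session name is hypothetical. The functions only build argument lists for the standard tmux subcommands `new-session`, `send-keys`, and `capture-pane`, so nothing runs until you pass them to `subprocess.run`.

```python
SESSION = "claude-frontend"  # hypothetical session name, pick your own


def start_session(session=SESSION):
    # Start a detached tmux session whose initial command is the interactive `claude` CLI.
    return ["tmux", "new-session", "-d", "-s", session, "claude"]


def send_prompt(prompt, session=SESSION):
    # Type the prompt literally (-l), then press Enter as a separate key.
    # Assumption: Enter submits the prompt in the interactive UI.
    return [
        ["tmux", "send-keys", "-t", session, "-l", prompt],
        ["tmux", "send-keys", "-t", session, "Enter"],
    ]


def capture_output(session=SESSION):
    # Print the visible pane contents to stdout so the caller can read the reply.
    return ["tmux", "capture-pane", "-p", "-t", session]
```

Each returned list can be handed to `subprocess.run(argv, check=True)`; for `capture_output`, add `capture_output=True, text=True` to read the pane text.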
Matt Shumer @mattshumer_
Massively useful Codex trick for 10x better frontend: You can ask Codex to use Claude as a sub-agent to have Claude handle frontend/design work. Just say “Use claude -p with an excellent, well-scoped, but un-opinionated (UI/UX-wise) prompt anytime you need a design change.”
101
37
1.2K
105.5K
Yacine Mahdid @yacinelearning
if you are interested in learning about the infra behind auto-research this 1h30min interview with the paradigma folks is for you
in it we look at:
- why dag are great research substrate
- how to let agents run that dag
- ways to make big public dag
- how to avoid bad bad dag
21
61
748
59.9K
Ilpo Leppänen reposted
Mario Zechner @badlogicgames
the one thing @mitsuhiko taught me: merged client & server logs. very useful.
27
22
1.1K
57.6K
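One way to picture the merged client and server logs mentioned above is the sketch below. It assumes each log line starts with an ISO-8601 timestamp followed by a space, and that each log is already sorted by time. The line format and the function names are illustrative, not taken from the post.

```python
import heapq
from datetime import datetime


def parse(line, source):
    # Assumed line format: "<ISO-8601 timestamp> <message>".
    stamp, _, message = line.partition(" ")
    return datetime.fromisoformat(stamp), source, message


def merge_logs(client_lines, server_lines):
    client = (parse(line, "client") for line in client_lines)
    server = (parse(line, "server") for line in server_lines)
    # heapq.merge interleaves inputs that are each already sorted,
    # here ordered by timestamp first, then by source name on ties.
    for stamp, source, message in heapq.merge(client, server):
        yield f"{stamp.isoformat()} [{source}] {message}"
```

Reading the merged stream top to bottom shows which client action preceded which server response, without lining up two files by hand.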
Matt Pocock @mattpocockuk
Monster day on Sandcastle today:
- Agents can now return structured output via Output.object
- Added support for @cursor_ai CLI
- Added support for @github Copilot CLI
- Fixed a metric ton of bugs
Check out 0.6.1
17
6
215
16.1K
Garry Tan @garrytan
This sounds complicated, but the agents can implement this in OpenClaw/Hermes Agent trivially (use skillify from GBrain with a link to this tweet). Sounds ridiculous, but you should try it.
Muratcan Koylan @koylanai

Gradient descent for SKILL.md files sounds interesting, maybe a bit complex, but it's becoming a real part of the agent harness. SkillOpt is one of the first papers to treat markdown skill files as trainable parameters and provides a proper optimization framework for them. A few things I learned that you should consider too.

1. The validation gate is the only thing that matters in a self-editing loop. Held-out set, strict improvement, ties rejected. End-to-end, their best skills land with 1 to 4 accepted edits total. If your "self-improving agent" is accepting most of what it proposes, you're shipping slop.

2. Bounded edits are better than full rewrites. 4 to 8 edits per step is the sweet spot. Remove the budget and performance collapses. This is the textual analog of learning rate, and it transfers to any LLM-as-author loop. If you're using an agent to refactor your docs, your prompts, or your skills, cap the diff size.

3. Compactness wins. Median final skill: ~920 tokens. Skills do not need to be long. They need to be high-signal. Most skill files I see are bloated because length feels like effort. It isn't.

4. The harness is becoming less important; the skill is becoming more important. A Codex-trained skill ported into Claude Code hit +59.7 points on SpreadsheetBench. Procedural knowledge is more general than the runtime that produced it.

5. Frozen model + trained context is the practical adaptation. GPT-5.4-nano with a SkillOpt'd skill ≈ frontier behavior on procedural benchmarks. Cheaper, portable, inspectable, zero inference-time cost. This is the answer to "how do we adapt a frontier model for our domain" for almost everyone who isn't training their own models.

6. Verification is the bottleneck. Every gate in this paper depends on an auto-grader. That works for benchmarks. It fails for writing, design, and strategy, exactly the open-ended work we want to automate. Whoever builds the verifier for open-ended tasks owns the next stage.

There are also two lessons I learned while shipping v2.3.0 of my Context Engineering Agent Skills repo, measured across composer-2, claude-opus-4-7, gpt-5.5, and gemini-3.1-pro via the @cursor_ai SDK:
- Description and body are two different surfaces. The router only sees the description. The agent sees the body once activated. They can quietly disagree, and only end-to-end task tests catch it.
- Aggregate accuracy is the wrong unit. When I rewrote three descriptions, the corpus average moved ~1pp. Individual skills moved 23–25pp. Per-skill effect size is where the action is.

Also, in Feb 2026 I shared a piece called Personal Brain OS arguing that the markdown file is a first-class substrate for agent state. SkillOpt is the optimizer-shaped version of that same argument: not "store memory in files" but "treat files as trainable parameters with proper optimization machinery around them." That's the move from static to measured.

The fast/slow split they describe already lives implicitly in the digital-brain-skill repo:
- voice-guide and tone-of-voice.md are slow-state (rarely touched)
- posts.jsonl and bookmarks.jsonl are fast-state

What SkillOpt adds that I didn't have is a protected section invariant, a structural guarantee that fast edits cannot overwrite slow lessons. Removing that mechanism cost them 22 points on SpreadsheetBench. Worth borrowing.

If you're building agents, SkillOpt: Executive Strategy for Self-Evolving Agent Skills is a good paper to read: arxiv.org/pdf/2605.23904
40
140
1.6K
241.3K
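The validation gate and the edit budget described in the quoted thread can be sketched roughly as follows. This is an illustration of the idea, not SkillOpt's actual code: `score` stands in for a hypothetical held-out grader, edits are modeled as plain functions from skill text to skill text, and the default budget of 8 is simply the upper end of the "4 to 8 edits per step" range quoted above.

```python
from typing import Callable, Sequence, Tuple

Edit = Callable[[str], str]  # hypothetical: maps skill text to edited skill text


def gated_step(
    skill: str,
    proposed_edits: Sequence[Edit],
    score: Callable[[str], float],  # hypothetical held-out grader
    max_edits: int = 8,
) -> Tuple[str, bool]:
    """Apply a bounded batch of edits, keep it only on strict improvement."""
    baseline = score(skill)
    candidate = skill
    # Bounded edits: anything beyond the budget is dropped for this step.
    for edit in list(proposed_edits)[:max_edits]:
        candidate = edit(candidate)
    # Validation gate: strictly better on the held-out score, ties rejected.
    if score(candidate) > baseline:
        return candidate, True
    return skill, False
```

Calling `gated_step` in a loop and counting how often it returns `True` gives the acceptance rate the thread warns about: a loop that accepts most proposals is probably not gating on anything meaningful.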
Ilpo Leppänen reposted
Myrhe𝕩 @myrhex
A new tab dedicated to Grok Build is being worked on in Grok Web. It is called “Build” and links to grok.com/build. This page is set to become the dedicated entry point for Grok Build directly on grok.com instead of only x.ai/cli. It will let SuperGrok, Premium+ and SuperGrok Heavy users install Grok Build with a simple command so they can run it in their terminal.
5
10
69
16.7K
Rach @rachpradhan
Introducing codedb v0.2.5818. ~1μs per lookup. 50,000x faster than grep. 12x fewer tool calls. 20-30x faster wall-time. 49% fewer tokens. 2.4B tokens saved across 200k+ ops last 30 days.
34
56
1K
71.1K
Ilpo Leppänen @ileppane
@LukeParkerDev This problem (or solution) is just about to get augmented through agentic engineering 😅
0
0
0
2.2K
Yifan Yang @Yif_Yang
Great question — I see them as highly complementary. Waza is a great eval/CI layer for agent skills: defining reproducible tasks, graders, baselines, and cross-model comparisons. SkillOpt focuses on the optimization side: using rollout feedback to train the skill document itself through bounded edits and validation-gated updates. So a natural pairing is: use Waza to measure and regression-test skills, and use SkillOpt to iteratively improve them. The optimized skill can then go back into Waza for continuous evaluation across models/harnesses. Would be very exciting to explore tighter integration here.
1
0
4
168
Yifan Yang @Yif_Yang
🚀 Introducing SkillOpt — an optimizer for agent skills. Instead of finetuning model weights, we treat a natural-language skill as a trainable external parameter. Think of it as deep learning for the frontier-model + agent era: learning rate, LR schedule, mini-batch, batch size, epoch, momentum — all in text-space optimization. SkillOpt enables stable, controllable skill updates through bounded edits, allowing the optimizer to summarize “gradient directions” from agent experience and continuously improve procedural capability. We evaluate SkillOpt across 6 benchmarks and 7 models, under both direct model calls and real agent execution loops with Codex + Claude Code. SkillOpt achieves best or tied-best results in 52/52 settings. Train the skill, not the model. 🛠️🤖 🌐 aka.ms/skillopt 📄 huggingface.co/papers/2605.23…
48
97
794
77.8K
Ilpo Leppänen @ileppane
@skcd42 @JasonBud "- Auto-background long running user-triggered bash-mode commands when invoked via `!`" => That's nice! But what if you could also automatically attach an agent to monitor the command, proactively resolve issues, and suspend it if the invocation goes stale?
3
1
3
431
skcd @skcd42
Bug fixes shipping to Grok Build 0.1.220 (release notes will be available in the TUI)
- Support gt and git in /execute-plan
- Always-approve is now an option during permission selection
- Fix routing for hook commands starting with tilde
- Make group collapse header an independent selectable entry
- Fix copy/paste on Linux Wayland (Omarchy, CachyOS, Hyprland)
- Skip KKP for unknown terminals with no multiplexer (fixes broken Shift)
- Paste file path text instead of [Image #1] for non-image files
- Improve legibility on legacy Windows Console Host
- Delete misleading post-compaction todo reseed reminder
- Auto-background long running user-triggered bash-mode commands when invoked via `!`
54
21
449
43K
Ilpo Leppänen reposted
DHH @dhh
I've had more "I can't believe it's this good" moments with GPT5.5 than any other model since Opus 4.5. It's shockingly, scarily capable. Days and days of amazing progress. All steering, no handwriting. Yet utterly delightful to conduct its coding. So, so good.
244
275
5.8K
422K
Ilpo Leppänen reposted
Ethan Mollick @emollick
GPT-5.5 Pro is a very solid fact checker. I can throw entire chapters at it and it will hunt down every key reference accurately. The only real annoyance is that it loves nuance, so returns a lot of “the general idea is right, but you are not taking into account tiny detail X”
126
77
1.8K
335.4K
Ilpo Leppänen @ileppane
Hey Matt! After listening to your last video I just wanted to ask about your workflow for non-grillable questions. I'm asking because one of the friction points I'm seeing across the agentic tools is that the UX is tailored for you to answer a line of questions. It doesn't handle the non-grillable case with the required sophistication, where you actually need to jump onto a side quest to explore and prototype before you can answer the question. I've faced this problem many times and, to be honest, it should be as smooth as possible to keep the workflow fluent. Something like forking off from the question into a clean or context-preserved session with a custom or generated handoff?
0
0
0
448
Ilpo Leppänen @ileppane
Nice - you had that covered! 🤩 Yep, checkpoints in the actions on the trail would be a nice feature for multitaskers: those who can't keep their eyes on a single thing but want to come back and quickly get at least some sort of understanding of what happened on the screen w/o rewinding.
1
0
1
25
Aurora Scharff @aurorascharff
@ileppane @OpenAIDevs It has a feature to showcase its trail once it’s dragging! The keypress would be nice as an option, might add that!
1
0
1
48
Aurora Scharff @aurorascharff
Someone told me I should have a click highlighter for my live demos. So I vibe coded one in 10 minutes with @OpenAIDevs Codex. Meet ClickLight! Wild that we can just build this stuff now. Grab it below ↓
37
24
651
99.4K
Ilpo Leppänen @ileppane
@Daniel_Farinax You can also stash it locally and add frontmatter with metadata such as the retrieval date, so that the agent is encouraged to look up a newer version periodically. Or you can even set up an automation to keep your local doc retrievals up to date.
0
0
1
9
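A minimal sketch of the local-stash idea from the reply above, assuming a simple `key: value` frontmatter block delimited by `---` lines. The field names `source` and `retrieved`, the helper names, and the 30-day threshold are all illustrative choices, not part of any tool mentioned in the thread.

```python
from datetime import date, timedelta


def with_frontmatter(body, source, retrieved):
    # Prepend a small metadata header recording where and when the doc was fetched.
    return (
        "---\n"
        f"source: {source}\n"
        f"retrieved: {retrieved.isoformat()}\n"
        "---\n"
        f"{body}"
    )


def retrieved_on(document):
    # Return the recorded retrieval date, or None if there is no frontmatter.
    lines = document.splitlines()
    if not lines or lines[0] != "---":
        return None
    for line in lines[1:]:
        if line == "---":
            break
        key, _, value = line.partition(":")
        if key.strip() == "retrieved":
            return date.fromisoformat(value.strip())
    return None


def is_stale(document, today, max_age_days=30):
    # Treat docs without a retrieval date as stale so they get refreshed.
    retrieved = retrieved_on(document)
    return retrieved is None or (today - retrieved) > timedelta(days=max_age_days)
```

An agent or a scheduled job could call `is_stale` on each stashed file and re-fetch the ones that return `True`.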
Dan @Daniel_Farinax
Want better 3D results with Three.js in Grok Build or other models? Always include threejs.org/docs/llms.txt in your prompt. One of the most common mistakes AI models make is using outdated versions of Three.js or obsolete functions. Also try prompts like “Create a hyper-realistic sky or textures”; it will automatically use more advanced options to achieve the goal. ✨
2
1
18
839