Ilpo Leppänen

5.3K posts

@ileppane

Joined January 2024

4.6K Following · 280 Followers

@mattshumer_ @nickbaumann_ Even better, call claude-code through tmux (or equivalent) so you'll get the full experience without being limited by the upcoming changes to claude -p

Matt Shumer@mattshumer_·8h

Massively useful Codex trick for 10x better frontend: You can ask Codex to use Claude as a sub-agent to have Claude handle frontend/design work. Just say “Use claude -p with an excellent, well-scoped, but un-opinionated (UI/UX-wise) prompt anytime you need a design change.”

101

1.2K

105.5K
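
The tmux suggestion in the reply above could look roughly like the sketch below. It assumes the interactive agent starts with a `claude` command and uses a hypothetical session name; only standard tmux subcommands (`new-session`, `send-keys`, `capture-pane`) are used, and the functions just build the command lines so they can be inspected before anything is run.

```python
import shlex
import subprocess


def tmux_commands(session: str, prompt: str, agent_cmd: str = "claude") -> list:
    """Build the tmux invocations that start an interactive agent and type a prompt.

    `agent_cmd` is an assumption: swap in whatever starts your agent's interactive UI.
    """
    return [
        # Start a detached session that runs the agent CLI interactively.
        ["tmux", "new-session", "-d", "-s", session, agent_cmd],
        # Type the prompt literally (-l), then press Enter as a separate key.
        ["tmux", "send-keys", "-t", session, "-l", prompt],
        ["tmux", "send-keys", "-t", session, "Enter"],
    ]


def capture_command(session: str) -> list:
    """Build the command that prints the current pane contents to stdout."""
    return ["tmux", "capture-pane", "-p", "-t", session]


def run(session: str, prompt: str) -> None:
    """Execute the commands. A real driver would wait for the UI to be ready
    before sending keys; this sketch sends them immediately."""
    for cmd in tmux_commands(session, prompt):
        subprocess.run(cmd, check=True)
    print("Inspect with:", shlex.join(capture_command(session)))
```

Calling `run("design", "Restyle the settings page")` would need tmux and the agent CLI installed; `tmux attach -t design` then shows the full interactive session.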

Ilpo Leppänen@ileppane·7h

@nurijanian George, speaking of engineering, I'd also add github.com/instructa/agen… by @kevinkern to the list. And @doodlestein: can't list any particular single resource, he's got one hell of an ecosystem of tools out there

271

George from 🕹prodmgmt.world@nurijanian·9h

my favorite engineering skills for AI:
- Compound Engineering: github.com/EveryInc/compo…
- Ryan Singer's shaping skills: github.com/rjs/shaping-sk…
- Matt Pocock's skills: github.com/mattpocock/ski…
I switched from Superpowers to Compound Engineering as they perfected the plugin over time, and I'm pretty sure I still only use like 10% of it

498

29.9K

Ilpo Leppänen reposted

MiniMax (official)@MiniMax_AI·12h

#MSA #OpenSource #M3 🫣😎

Skyler Miao@SkylerMiao7

Something BIG is coming

114

1.6K

184.6K

Ilpo Leppänen@ileppane·8h

@yacinelearning youtu.be/zBlu6j5ryo0?si…

YouTube

Yacine Mahdid@yacinelearning·1d

if you are interested in learning about the infra behind auto-research this 1h30min interview with the paradigma folks is for you
in it we look at:
- why dag are great research substrate
- how to let agents run that dag
- ways to make big public dag
- how to avoid bad bad dag

748

59.9K

Ilpo Leppänen reposted

Mario Zechner@badlogicgames·1d

the one thing @mitsuhiko taught me: merged client & server logs. very useful.

1.1K

57.6K
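
One way to read "merged client & server logs" is a single timeline interleaved by timestamp. The sketch below is only an illustration of that idea, not the setup the post refers to: it assumes each side already yields `(timestamp, message)` pairs in time order, and the `[client]` / `[server]` tags are hypothetical.

```python
import heapq
from typing import Iterable, Iterator, Tuple

LogEntry = Tuple[float, str]  # (timestamp in seconds, message)


def merge_logs(client: Iterable[LogEntry], server: Iterable[LogEntry]) -> Iterator[str]:
    """Interleave two time-ordered log streams into one tagged timeline."""
    tagged_client = ((ts, "client", msg) for ts, msg in client)
    tagged_server = ((ts, "server", msg) for ts, msg in server)
    # heapq.merge lazily merges already-sorted inputs; key= compares timestamps only.
    for ts, origin, msg in heapq.merge(tagged_client, tagged_server, key=lambda e: e[0]):
        yield f"{ts:.3f} [{origin}] {msg}"
```

Because the merge is lazy, the same function works on long-running streams as well as on finished log files.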

Ilpo Leppänen@ileppane·8h

@mattpocockuk @cursor_ai @github github.com/mattpocock/san…

Matt Pocock@mattpocockuk·10h

Monster day on Sandcastle today:
- Agents can now return structured output via Output.object
- Added support for @cursor_ai CLI
- Added support for @github Copilot CLI
- Fixed a metric ton of bugs
Check out 0.6.1

215

16.1K

Ilpo Leppänen@ileppane·9h

@garrytan github.com/garrytan/gbrai… x.com/garrytan/statu…

Garry Tan@garrytan

"What's Skillify?" you ask? Here's the answer. It just so happens to be the thing to make it so you don't have to repeat yourself anymore to your OpenClaw, and if you use Hermes Agent it'll test your auto-created skills for you x.com/garrytan/statu…

Garry Tan@garrytan·21h

This sounds complicated but the agents can implement this in OpenClaw/Hermes Agent trivially (use skillify from GBrain with a link to this tweet) Sounds ridiculous but you should try it

Muratcan Koylan@koylanai

Gradient descent for SKILL.md files sounds interesting, maybe a bit complex, but it's becoming a real part of the agent harness. SkillOpt is one of the first papers to treat markdown skill files as trainable parameters and provides a proper optimization framework for them. A few things I learned that you should consider too.

1. The validation gate is the only thing that matters in a self-editing loop. Held-out set, strict improvement, ties rejected. End-to-end, their best skills land with 1 to 4 accepted edits total. If your "self-improving agent" is accepting most of what it proposes, you're shipping slop.

2. Bounded edits are better than full rewrites. 4 to 8 edits per step is the sweet spot. Remove the budget and performance collapses. This is the textual analog of learning rate, and it transfers to any LLM-as-author loop. If you're using an agent to refactor your docs, your prompts, or your skills, cap the diff size.

3. Compactness wins. Median final skill: ~920 tokens. Skills do not need to be long. They need to be high-signal. Most skill files I see are bloated because length feels like effort. It isn't.

4. The harness is becoming less important; the skill is becoming more important. A Codex-trained skill ported into Claude Code hit +59.7 points on SpreadsheetBench. Procedural knowledge is more general than the runtime that produced it.

5. Frozen model + trained context is the practical adaptation. GPT-5.4-nano with a SkillOpt'd skill ≈ frontier behavior on procedural benchmarks. Cheaper, portable, inspectable, zero inference-time cost. This is the answer to "how do we adapt a frontier model for our domain" for almost everyone who isn't training their own models.

6. Verification is the bottleneck. Every gate in this paper depends on an auto-grader. That works for benchmarks. It fails for writing, design, and strategy, exactly the open-ended work we want to automate. Whoever builds the verifier for open-ended tasks owns the next stage.

There are also two lessons I learned while shipping v2.3.0 of my Context Engineering Agent Skills repo, measured across composer-2, claude-opus-4-7, gpt-5.5, and gemini-3.1-pro via the @cursor_ai SDK:
- Description and body are two different surfaces. The router only sees the description. The agent sees the body once activated. They can quietly disagree, and only end-to-end task tests catch it.
- Aggregate accuracy is the wrong unit. When I rewrote three descriptions, the corpus average moved ~1pp. Individual skills moved 23–25pp. Per-skill effect size is where the action is.

Also, in Feb 2026 I shared a piece called Personal Brain OS arguing that the markdown file is a first-class substrate for agent state. SkillOpt is the optimizer-shaped version of that same argument: not "store memory in files" but "treat files as trainable parameters with proper optimization machinery around them." That's the move from static to measured.

The fast/slow split they describe already lives implicitly in the digital-brain-skill repo:
- voice-guide and tone-of-voice.md are slow-state (rarely touched)
- posts.jsonl and bookmarks.jsonl are fast-state

What SkillOpt adds that I didn't have is a protected section invariant, a structural guarantee that fast edits cannot overwrite slow lessons. Removing that mechanism cost them 22 points on SpreadsheetBench. Worth borrowing.

If you're building agents, SkillOpt: Executive Strategy for Self-Evolving Agent Skills is a good paper to read: arxiv.org/pdf/2605.23904

140

1.6K

241.3K
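
The validation gate and bounded-edit budget described in the post above can be sketched in a few lines. This is a toy illustration of those two rules only, not SkillOpt's actual implementation: `apply_edits` and `score` are hypothetical callables supplied by the caller, with `score` standing in for evaluation on a held-out set.

```python
from typing import Callable, List, Sequence, Tuple


def bounded(edits: Sequence[str], budget: int = 8) -> List[str]:
    """Keep at most `budget` proposed edits per step (the cap on diff size)."""
    return list(edits[:budget])


def gate(baseline: float, candidate: float) -> bool:
    """Accept only strict improvement on the held-out score; ties are rejected."""
    return candidate > baseline


def optimize(
    skill: str,
    proposals: Sequence[Sequence[str]],
    apply_edits: Callable[[str, List[str]], str],
    score: Callable[[str], float],
    budget: int = 8,
) -> Tuple[str, int]:
    """Run one pass over proposed edit batches; return the best skill and accepted count."""
    best, best_score, accepted = skill, score(skill), 0
    for edits in proposals:
        candidate = apply_edits(best, bounded(edits, budget))
        candidate_score = score(candidate)
        if gate(best_score, candidate_score):
            best, best_score, accepted = candidate, candidate_score, accepted + 1
    return best, accepted
```

The accepted count is the number to watch: per the post, a loop that accepts most of what it proposes is a warning sign.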

Ilpo Leppänen reposted

Myrhe𝕩@myrhex·20h

A new tab dedicated to Grok Build is being worked on in Grok Web. It is called “Build” and links to grok.com/build. This page is set to become the dedicated entry point for Grok Build directly on grok.com instead of only x.ai/cli. It will let SuperGrok, Premium+ and SuperGrok Heavy users install Grok Build with a simple command so they can run it in their terminal.

16.7K

Ilpo Leppänen@ileppane·16h

@rachpradhan Very speedy if true ✨

565

Rach@rachpradhan·18h

Introducing codedb v0.2.5818. ~1μs per lookup. 50,000x faster than grep. 12x fewer tool calls. 20-30x faster wall-time. 49% fewer tokens. 2.4B tokens saved across 200k+ ops last 30 days.

71.1K

Ilpo Leppänen@ileppane·18h

@LukeParkerDev This problem (or solution) is just about to get augmented through agentic engineering 😅

2.2K

Luke Parker@LukeParkerDev·1d

you've got to be kidding me

TANSTACK@tan_stack

TanStack Virtual now has first-class chat support: end anchoring, append-follow, stable prepends, and streaming messages that stay pinned when they should. The modern web is now a lot of streaming UI on top of lists, so this needed to feel boring 😉 tanstack.com/blog/tanstack-…

655

196K

Ilpo Leppänen@ileppane·20h

@Yif_Yang @spboyer

Yifan Yang@Yif_Yang·20h

Great question — I see them as highly complementary. Waza is a great eval/CI layer for agent skills: defining reproducible tasks, graders, baselines, and cross-model comparisons. SkillOpt focuses on the optimization side: using rollout feedback to train the skill document itself through bounded edits and validation-gated updates. So a natural pairing is: use Waza to measure and regression-test skills, and use SkillOpt to iteratively improve them. The optimized skill can then go back into Waza for continuous evaluation across models/harnesses. Would be very exciting to explore tighter integration here.

168

Yifan Yang@Yif_Yang·1d

🚀 Introducing SkillOpt — an optimizer for agent skills. Instead of finetuning model weights, we treat a natural-language skill as a trainable external parameter. Think of it as deep learning for the frontier-model + agent era: learning rate, LR schedule, mini-batch, batch size, epoch, momentum — all in text-space optimization. SkillOpt enables stable, controllable skill updates through bounded edits, allowing the optimizer to summarize “gradient directions” from agent experience and continuously improve procedural capability. We evaluate SkillOpt across 6 benchmarks and 7 models, under both direct model calls and real agent execution loops with Codex + Claude Code. SkillOpt achieves best or tied-best results in 52/52 settings. Train the skill, not the model. 🛠️🤖 🌐 aka.ms/skillopt 📄 huggingface.co/papers/2605.23…

794

77.8K

Ilpo Leppänen@ileppane·21h

@skcd42 @JasonBud "- Auto-background long running user-triggered bash-mode commands when invoked via `!`" => That's nice! But what if you could also automatically attach an agent that monitors the command, proactively resolves issues, and suspends it if the invocation goes stale?

431

skcd@skcd42·22h

Bug fixes shipping to Grok Build 0.1.220 (release notes will be available in the TUI)
- Support gt and git in /execute-plan
- Always-approve is now an option during permission selection
- Fix routing for hook commands starting with tilde
- Make group collapse header an independent selectable entry
- Fix copy/paste on Linux Wayland (Omarchy, CachyOS, Hyprland)
- Skip KKP for unknown terminals with no multiplexer (fixes broken Shift)
- Paste file path text instead of [Image #1] for non-image files
- Improve legibility on legacy Windows Console Host
- Delete misleading post-compaction todo reseed reminder
- Auto-background long running user-triggered bash-mode commands when invoked via `!`

449

43K

Ilpo Leppänen reposted

DHH@dhh·1d

I've had more "I can't believe it's this good" moments with GPT5.5 than any other model since Opus 4.5. It's shockingly, scarily capable. Days and days of amazing progress. All steering, no handwriting. Yet utterly delightful to conduct its coding. So, so good.

244

275

5.8K

422K

Ilpo Leppänen@ileppane·1d

Gotta be in love with these inventory threads! High-reach individuals harvesting context and then you can just @grok it afterwards. ✌️

jason@jxnlco

Spending today learning more about ai design tools. Outside of paper what else should I try

Ilpo Leppänen@ileppane·1d

Writing thoughtfully is hard; jumping to code is easy, much easier than starting to tackle my overflowing inbox

Theo - t3.gg@theo

Rewriting Bun in Rust: 6 days Writing a blog post about it: 2+ weeks

Ilpo Leppänen reposted

Ethan Mollick@emollick·3d

GPT-5.5 Pro is a very solid fact checker. I can throw entire chapters at it and it will hunt down every key reference accurately. The only real annoyance is that it loves nuance, so returns a lot of “the general idea is right, but you are not taking into account tiny detail X”

126

1.8K

335.4K

Ilpo Leppänen@ileppane·1d

Hey Matt! After listening to your last video I just wanted to ask about your workflow for non-grillable questions. I'm asking because one of the friction points I'm seeing across the agentic tools is that the UX is tailored for answering a line of questions. It doesn't handle the non-grillable case with the required sophistication: you actually need to jump onto a side quest to explore and prototype before you can answer the question. I've faced this problem many times, and to be honest this workflow should be as smooth as possible. Would you fork off from the question into a clean or context-preserving session with a custom or generated handoff?

448

Matt Pocock@mattpocockuk·1d

For sharing outside X - here are the most common things folks get wrong with /grill-me and /grill-with-docs aihero.dev/things-people-…

252

32.1K

Ilpo Leppänen@ileppane·1d

Nice - you had that covered! 🤩 Yep, checkpoints in actions on the trail would be a nice feature for multitaskers: those who can't keep their eyes on a single thing but want to come back and quickly get at least some sort of understanding of what happened on the screen without rewinding

Aurora Scharff@aurorascharff·1d

@ileppane @OpenAIDevs It has a feature to showcase its trail once it’s dragging! The keypress would be nice as an option, might add that!

Aurora Scharff@aurorascharff·2d

Someone told me I should have a click highlighter for my live demos. So I vibe coded one in 10 minutes with @OpenAIDevs Codex. Meet ClickLight! Wild that we can just build this stuff now. Grab it below ↓

651

99.4K

Ilpo Leppänen@ileppane·1d

@Daniel_Farinax You can also stash it locally and add frontmatter with metadata such as the retrieval date, so that the agent is encouraged to look up a newer version periodically. Or you can even set up an automation to keep your local doc retrievals up to date.

Dan@Daniel_Farinax·1d

Want better 3D results with Three.js in Grok Build or other models? Always include threejs.org/docs/llms.txt in your prompt. One of the most common mistakes AI models make is using outdated versions of Three.js or obsolete functions. Also try prompts like “Create a hyper-realistic sky or textures”; it will automatically use more advanced options to achieve the goal. ✨

839
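
The "stash it locally with a retrieval date" idea from the reply above might be sketched like this. The frontmatter keys (`source`, `retrieved`) and the 30-day threshold are assumptions chosen for illustration, not an established convention.

```python
from datetime import date, timedelta
from typing import Optional


def with_frontmatter(body: str, source: str, retrieved: date) -> str:
    """Prefix a locally stashed doc with frontmatter recording where and when it was fetched."""
    header = f"---\nsource: {source}\nretrieved: {retrieved.isoformat()}\n---\n"
    return header + body


def retrieved_on(doc: str) -> Optional[date]:
    """Read the `retrieved` date back out of the frontmatter, if there is one."""
    lines = doc.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for line in lines[1:]:
        if line.strip() == "---":  # end of frontmatter
            break
        key, _, value = line.partition(":")
        if key.strip() == "retrieved":
            return date.fromisoformat(value.strip())
    return None


def is_stale(doc: str, today: date, max_age_days: int = 30) -> bool:
    """Treat a doc with no date, or one older than `max_age_days`, as needing a refresh."""
    stamp = retrieved_on(doc)
    return stamp is None or today - stamp > timedelta(days=max_age_days)
```

A scheduled job could call `is_stale` on each stashed file and re-fetch the source when it returns `True`.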
