Tom Jansen (@thomasjansn) - Twitter profile

Tom Jansen@thomasjansn·8h

First results are in:
Score for openai-codex/gpt-5.5 with xhigh thinking: 61.77%

Run details:
Generated: 100/100 tasks
Scored: 100/100 tasks
Judge: openai/gpt-4.1-mini
Scoring mode: criterion-level
Judge runs: 3
Median task score: 65.66%
Zero-score tasks: 7
Generation timeouts: 2 tasks hit the 1-hour cap, but still had outputs and were scored

This lands below the OpenRouter blog’s reported 69%, so it no longer looks suspiciously inflated.

Cost/tokens:
Pi generation: $612.60 (Pi-reported token-equivalent cost, not necessarily your OpenAI bill, since this used the openai-codex subscription route)
Judge cost: $12.07 (estimated OpenAI API cost)
Generation tokens: 42,801,690 input, 1,759,182 output, 691,636,736 cache-read
Judge tokens: 22,758,417 input, 828,911 output

You'll notice that the judge model here is GPT 4.1 mini, which is a choice GPT 5.5 made "to save costs". It's not ideal, and the DRACO paper itself suggests using Gemini 3.1 Pro, GPT 5.2 or Sonnet 5.4, so before switching to the local model run, I'll re-run the judging with GPT 5.4. As judging via API pricing already cost $12 with GPT 4.1 mini, I'll use a little workaround for the GPT 5.4 run: pi, oh beloved pi with all your flexibility, can be used to run this via the Codex subscription, and you can just tell it to send requests without its own system prompt, to mimic just calling the API. NICE!
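To make the run summary above easier to read, here is a minimal sketch of how criterion-level scores from several judge runs could be rolled up into a mean score, a median task score, and a zero-score count. It is an assumption about the aggregation, not the DRACO harness itself: the equal weighting of criteria, the averaging over judge runs, and the names `task_score` and `summarize` are all hypothetical.

```python
from statistics import mean, median


def task_score(judge_runs):
    """Score one task as the mean, over judge runs, of the fraction of criteria met.

    judge_runs: list of runs, each a list of booleans (one verdict per criterion).
    Assumption: criteria are equally weighted; a real rubric may weight them.
    """
    per_run = [sum(run) / len(run) for run in judge_runs if run]
    return mean(per_run) if per_run else 0.0


def summarize(tasks):
    """Aggregate per-task scores into the headline numbers of a run report."""
    scores = [task_score(runs) for runs in tasks.values()]
    return {
        "mean": mean(scores),
        "median": median(scores),
        "zero_score_tasks": sum(1 for s in scores if s == 0.0),
    }


# Toy data: three tasks, three judge runs each (hypothetical verdicts).
tasks = {
    "task-1": [[True, True, False, True]] * 3,
    "task-2": [[False, False]] * 3,
    "task-3": [[True, False], [True, True], [True, False]],
}

summary = summarize(tasks)
print(summary)
```

Averaging over repeated judge runs is one way to dampen judge noise. With 3 runs per task, a single disagreeing verdict moves a task score by at most a third of that criterion's share.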

Tom Jansen@thomasjansn

Quick update on that: I am running the first DRACO benchmark as a baseline: GPT 5.5 xhigh with pi.dev as harness and GPT 5.5 xhigh as judge and synthesizer via the OpenAI API directly (same benchmark as the one used by OpenRouter). After that I'll test different combinations of local models, with paid and/or local models for judging and synthesizing. Benchmarking takes way longer than I expected... DRACO has 100 tasks and this first run has already been running for more than 24 hours and will take at least another 3 hours. Doing this with API pricing alone would land at about $600, but via the $200 sub it's completely fine. Will keep you updated once it's finished. research.perplexity.ai/articles/evalu…

Tom Jansen@thomasjansn·1d

@Klotzkette Technologically they have a point somewhere, though it also has to be said that models like GLM 5.2 are definitely good enough and could be run by companies of this size fairly easily, either "locally" or as a "managed service" in their own data center.

Klotzkette@Klotzkette·1d

I am currently seeing first-hand how many very modern, technologically leading German companies (not the very largest, but close to champion status) are rolling out US AI on a broad scale (not for legal work, but for many other things), because they don't want to fall behind now or lose their employees, and because they can't find an alternative from Germany or the EU that is capable enough in terms of the volume of data to be processed and the AI's capabilities. An AI like Mistral is simply perceived as too weak, and so are European cloud solutions. And these companies aren't stupid; they do look around. I'm only reporting. Don’t kill the messenger. I can't change it either.

Tom Jansen@thomasjansn·1d

Quick update on that: I am running the first DRACO benchmark as a baseline: GPT 5.5 xhigh with pi.dev as harness and GPT 5.5 xhigh as judge and synthesizer via the OpenAI API directly (same benchmark as the one used by OpenRouter). After that I'll test different combinations of local models, with paid and/or local models for judging and synthesizing. Benchmarking takes way longer than I expected... DRACO has 100 tasks and this first run has already been running for more than 24 hours and will take at least another 3 hours. Doing this with API pricing alone would land at about $600, but via the $200 sub it's completely fine. Will keep you updated once it's finished. research.perplexity.ai/articles/evalu…

Tom Jansen@thomasjansn

Based on the @OpenRouter Fusion approach, I made an extension for pi.dev by @earendilworks @badlogicgames that lets you define your own set of fusion models, thinking effort per model, etc., so you can easily play around with it too. First tests give good results, but I haven't had time for benchmarking yet, which I still intend to do -> github.com/tjansn/pi-fusi…

Tom Jansen@thomasjansn·2d

@ThePrimeagen It's really interesting to watch people like you and @unclebobmartin go through this whole process of "it's shit" - "NO, it's good!" - "Oh, no it's actually shit" - "I found a way" ... Hope this one brings you where you want to be 🤞

ThePrimeagen@ThePrimeagen·2d

I have come up with a way for agents to produce code I like with "some hand holding" Feels pretty fun, see where I can take this

Tom Jansen@thomasjansn·2d

There is a workaround via github.com/fitchmultz/pi-… It will actually use Cursor as a harness then, so pi becomes some kind of meta harness. Maybe that's enough to play around with, but some of the real magic actually lies in the super small system prompt, which you wouldn't really get a taste of then.

Philip Miglinci@pmigat·2d

I would have liked to give pi.dev a try, with models provided via Cursor, but this doesn't seem to be an option atm. Or am I getting something wrong, @mitsuhiko?

Tom Jansen@thomasjansn·2d

I know that the @mattpocockuk /grill-me and /grill-with-docs Skills sometimes produce ridiculous amounts of questions; my record is 110-ish. But has anyone ever reported an agent saying "Done. I used the Grill With Docs workflow" when there were ZERO questions?

Tom Jansen@thomasjansn·2d

@thorstenball 🤣🤣🤣

Thorsten Ball@thorstenball·2d

@thomasjansn That's in there, point #3: "Read things properly before you tweet comments." (Just kidding)

Thorsten Ball@thorstenball·3d

Some thoughts on ownership I shared with the team this morning

Tom Jansen@thomasjansn·2d

A great baseline blueprint for how work should be handled in general.

Thorsten Ball@thorstenball

Some thoughts on ownership I shared with the team this morning

Tom Jansen@thomasjansn·2d

@thorstenball Fair. Reading properly should also be the baseline 😅

Thorsten Ball@thorstenball·2d

@thomasjansn It is, but see first sentence: I haven't explained it in a while and I was in the mood to write something.

Tom Jansen@thomasjansn·2d

@badlogicgames yes gramps, will do gramps

Mario Zechner@badlogicgames·2d

listen to your (dashing and vital looking) elders.

Mitchell Hashimoto@mitchellh

The problem with the "if it works who cares what the code looks like" mindset for agentic work is that it assumes the agent has a perfect understanding of "works." Realistically, things are underspecified, agents make bad assumptions, etc. To be fair, agents are pretty good at unit test coverage. They're pretty bad at designing human experiences (API, CLI flags, etc.), especially cohesive ones for future roadmap plans they may not have visibility into (unless your backlog is perfect and vision fully laid out, which I doubt). They're bad at knowing where performance matters and what type (CPU vs memory tradeoffs). They're bad at where compatibility matters and where it doesn't (and tend to err on the side of preserving it without further guidance). Etc. Unless you have this ALL specified, you can't possibly claim "it works" without taking a look and thinking about it.

Tom Jansen@thomasjansn·2d

Nice! I used the same approach for my pi.dev extension called pi-fusion github.com/tjansn/pi-fusi…

OrcaRouter 🐳@OrcaRouter

Fable 5 is dead. We just resurrected it — cheaper, open and you hold the keys. OpenRouter dropped Fusion 48h ago and broke the internet. We tested it hard. The synthesizer is insane for deep research… but absolute dogshit for coding. So we fixed it. Meet OrcaRouter.ai DSL — the version you actually own. One prompt → fans out to any panel you want → judge + synthesizer → one god-tier answer. But unlike black-box slugs, you control the entire graph in YAML. Fable 5 level intelligence… without waiting for Anthropic to turn it back on 🧵👇
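The flow described in the post above (one prompt fans out to a panel, then a judge and a synthesizer produce one answer) can be sketched as follows. This is only an illustration of the general pattern under stated assumptions, not the OpenRouter Fusion API, the OrcaRouter DSL, or the pi-fusion extension: the `fuse` function, its parameters, and the toy models are hypothetical stand-ins, and plain Python callables take the place of real model calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical stand-in type: a "model" is any function from prompt to answer.
Model = Callable[[str], str]


def fuse(
    prompt: str,
    panel: Dict[str, Model],
    judge: Callable[[str, Dict[str, str]], List[str]],
    synthesizer: Callable[[str, List[str]], str],
    top_k: int = 2,
) -> str:
    """Fan the prompt out to every panel model, rank the answers, merge the best."""
    # Fan-out: query all panel models concurrently.
    with ThreadPoolExecutor(max_workers=len(panel)) as pool:
        futures = {name: pool.submit(model, prompt) for name, model in panel.items()}
        answers = {name: future.result() for name, future in futures.items()}

    # Judge: returns panel names ordered best-first (assumed contract).
    ranked = judge(prompt, answers)

    # Synthesizer: merges the top-k answers into one final response.
    best = [answers[name] for name in ranked[:top_k]]
    return synthesizer(prompt, best)


# Toy stand-ins so the sketch runs without any API access.
panel = {
    "a": lambda p: p.upper(),
    "b": lambda p: p[::-1],
    "c": lambda p: p,
}
judge = lambda prompt, answers: sorted(answers)       # rank by name, for the demo only
synthesizer = lambda prompt, best: " | ".join(best)   # naive merge, for the demo only

result = fuse("hi", panel, judge, synthesizer)
print(result)  # prints "HI | ih"
```

In a real setup each callable would wrap a model request, and the judge and synthesizer would themselves be model calls; keeping them as injected functions makes it easy to swap paid and local models per role.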

Tom Jansen@thomasjansn·2d

@kskrygan ahm. How about @opencode or pi.dev ?

Kirill Skrygan@kskrygan·3d

Now it’s official. JetBrains remains the last big independent player in developer tooling. Our job is to deliver the most cost-efficient, deeply integrated, and genuinely enjoyable AI experience across our IDEs and beyond. We’re on it. Stay tuned.

Cursor@cursor_ai

We're excited to join forces with @SpaceX to advance the frontier of useful AI. Expect significant improvements to Cursor soon.

Tom Jansen@thomasjansn·2d

YES. My pi-fusion extension was built with exactly that in mind: using several local models in fusion mode to achieve SOTA deep research results. Benchmarking is still tbd, but it's there and it works github.com/tjansn/pi-fusi…

Armin Ronacher ⇌@mitsuhiko

Do more experimentations with local models people! vickiboykis.com/2026/06/15/run…

Tom Jansen@thomasjansn·3d

Based on the @OpenRouter Fusion approach, I made an extension for pi.dev by @earendilworks @badlogicgames that lets you define your own set of fusion models, thinking effort per model, etc., so you can easily play around with it too. First tests give good results, but I haven't had time for benchmarking yet, which I still intend to do -> github.com/tjansn/pi-fusi…

OpenRouter@OpenRouter

Introducing the Fusion API, the smartest compound model in the market. Fusion achieves Fable-level intelligence at half the price. How it works 👇

Tom Jansen@thomasjansn·3d

"Define what your application needs in secretspec.toml, then plug in keyring, 1Password, Vault, AWS, or any of 11 backends — same code, every environment." Very eager to look into this 👀

Tom Jansen@thomasjansn

😻

Tom Jansen@thomasjansn·3d

😻

geoff@GeoffreyHuntley

🆒 secretspec.dev

Tom Jansen@thomasjansn·3d

@Simon__Grimm Somehow I want to point in the direction of Occam's razor and "It is vain to do with more what can be done with less," but then I remember that Europe is not exactly in the "do with less" game 😅

Simon Grimm@Simon__Grimm·3d

An important fact! Full post: siliconcontinent.com/cp/202043931

Tom Jansen@thomasjansn·3d

@GergelyOrosz Some things (most, actually) happening in the US atm are getting weirder by the day ... combined with that weird need to talk about "how bad Europe has it", it gives me a really strange feeling of a system close to total collapse

Gergely Orosz@GergelyOrosz·3d

So apparently after Meta leadership:
- Force reassigned some of the best devs on teams to AI data labelling fulltime
- Laid off another 10%
- Started to record every dev’s screen in the US 24/7
They now realized that it has, indeed, started to destroy their eng culture. And are now trying to walk back.
All of the above was unprompted, not forced by anything external or even business reasons (Meta recorded record revenue, record profits)
The biggest self-inflicted eng culture destruction I’ve seen in a matter of weeks

Tom Jansen@thomasjansn·3d

@shokhkarim1212 Hi! Right now I am building a pi.dev extension for the model fusion pattern that @OpenRouter described in their post. Still very much in the making but I am really excited to test it. Will post when it's done! openrouter.ai/blog/announcem…

Shokhrukh Karimov@shokhkarim1212·4d

Hey builders! 👋 Looking to connect with people building in: 🍽️ SaaS 🚀 Tech 📲 Automation 🧠 AI tools 📱 Product Development 🔥 Web APP 💻 Devs Drop what you're working on 👇 (and your link — I'll check it out)
