Tom Jansen

392 posts

@thomasjansn

forward deployed human. agentic ai. product. strategy and general deep thinking.

these days mostly terminal
Joined January 2012
734 Following · 114 Followers
Tom Jansen@thomasjansn·
First results are in:

Score for openai-codex/gpt-5.5 with xhigh thinking: 61.77%

Run details:
Generated: 100/100 tasks
Scored: 100/100 tasks
Judge: openai/gpt-4.1-mini
Scoring mode: criterion-level
Judge runs: 3
Median task score: 65.66%
Zero-score tasks: 7
Generation timeouts: 2 tasks hit the 1-hour cap, but still had outputs and were scored

This lands below the OpenRouter blog’s reported 69%, so it no longer looks suspiciously inflated.

Cost/tokens:
Pi generation: $612.60 (Pi-reported token-equivalent cost, not necessarily your OpenAI bill, since this used the openai-codex subscription route)
Judge cost: $12.07 (estimated OpenAI API cost)
Generation tokens: 42,801,690 input, 1,759,182 output, 691,636,736 cache-read
Judge tokens: 22,758,417 input, 828,911 output

You'll notice that the judge model here is GPT 4.1 mini, which is a choice GPT 5.5 made "to save costs". It's not ideal, and the DRACO paper itself suggests using Gemini 3.1 Pro, GPT 5.2 or Sonnet 5.4, so before switching to the local model run, I'll re-run the judging with GPT 5.4. As judging at API pricing already cost $12 with GPT 4.1 mini, I'll use a little workaround for the GPT 5.4 run: pi, oh beloved pi with all your flexibility, can be used to run this via the Codex subscription, and you can just tell it to send requests without its own system prompt, to mimic just calling the API. NICE!
Tom Jansen@thomasjansn

Quick update on that: I am running the first DRACO benchmark, GPT 5.5 xhigh with pi.dev as harness and GPT 5.5 xhigh as judge and synthesizer via the OpenAI API directly (same benchmark as the one used by OpenRouter) as a baseline, then moving on to testing different combinations of local models, with paid and/or local models for judging and synthesizing. Benchmarking takes way longer than I expected ... DRACO has 100 tasks and this first run has already been running for more than 24 hours and will take another 3 hours for sure. Also, doing this only with API pricing would land at about $600, but via the $200 sub it's completely fine. Will keep you updated once it's finished. research.perplexity.ai/articles/evalu…
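As a rough tally of the figures reported in the first-results post, the following sketch adds up the generation and judge costs and estimates how much of the generation input was served from cache. The variable names are illustrative, and treating cache-read tokens as separate from fresh input tokens is an assumption about how pi reports them, not something the posts confirm.

```python
from decimal import Decimal

# Figures copied from the first-results post; names are illustrative only.
generation_cost = Decimal("612.60")  # Pi-reported token-equivalent cost (USD)
judge_cost = Decimal("12.07")        # estimated OpenAI API cost (USD)
tasks = 100                          # DRACO tasks generated and scored

# Decimal keeps the currency arithmetic exact.
total_cost = generation_cost + judge_cost
cost_per_task = total_cost / tasks

gen_input = 42_801_690
gen_cache_read = 691_636_736
# Assumption: cache-read tokens are counted separately from fresh input tokens.
cache_share = gen_cache_read / (gen_input + gen_cache_read)

print(f"total cost: ${total_cost}, per task: ${cost_per_task}")
print(f"cache-read share of generation input: {cache_share:.1%}")
```

Under that assumption, the vast majority of generation input comes from cache reads, which would help explain why the subscription route is so much cheaper than the token-equivalent price suggests.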

Tom Jansen@thomasjansn·
@Klotzkette Technologically they have a point, though it also has to be said that models like GLM 5.2 are definitely good enough and could be run by companies of this size fairly easily "locally" or as a "managed service" in their own data center.
Klotzkette@Klotzkette·
I am currently seeing first-hand how many very modern and technologically leading German companies (not the very largest, but close to champion status) are broadly rolling out US AI (not for legal work, but for many other things), because they don't want to miss the boat now or lose their employees, and because they cannot find an alternative from Germany or the EU that is capable enough in terms of the volume of data to be processed and the AI's level of capability. An AI like Mistral is simply perceived as too weak, and so are European cloud solutions. And these companies are not stupid; they do look around. I am only reporting. Don’t kill the messenger. I can't change it either.
Tom Jansen@thomasjansn·
Quick update on that: I am running the first DRACO benchmark, GPT 5.5 xhigh with pi.dev as harness and GPT 5.5 xhigh as judge and synthesizer via the OpenAI API directly (same benchmark as the one used by OpenRouter) as a baseline, then moving on to testing different combinations of local models, with paid and/or local models for judging and synthesizing. Benchmarking takes way longer than I expected ... DRACO has 100 tasks and this first run has already been running for more than 24 hours and will take another 3 hours for sure. Also, doing this only with API pricing would land at about $600, but via the $200 sub it's completely fine. Will keep you updated once it's finished. research.perplexity.ai/articles/evalu…
Tom Jansen@thomasjansn

Based on the @OpenRouter Fusion approach, I made an extension for pi.dev by @earendilworks @badlogicgames that lets you define your own set of fusion models, thinking effort per model, etc., so you can easily play around with it too. First tests give good results, but I haven't had time for benchmarking yet, which I still intend to do -> github.com/tjansn/pi-fusi…

Tom Jansen@thomasjansn·
@ThePrimeagen It's really interesting to watch people like you and @unclebobmartin go through this whole process of "it's shit" - "NO, it's good!" - "Oh, no it's actually shit" - "I found a way" ... Hope this one brings you where you want to be 🤞
ThePrimeagen@ThePrimeagen·
I have come up with a way for agents to produce code I like with "some hand holding" Feels pretty fun, see where I can take this
Tom Jansen@thomasjansn·
There is a workaround via github.com/fitchmultz/pi-… It will actually use Cursor as a harness then, so pi becomes some kind of meta harness. Maybe that's enough to play around, but some of the real magic actually lies in the super small system prompt, which you wouldn't really get a taste of then.
Philip Miglinci@pmigat·
I would have liked to give pi.dev a try, with models provided via Cursor, but this doesn't seem to be an option atm. Or am I getting something wrong, @mitsuhiko?
Tom Jansen@thomasjansn·
I know that the @mattpocockuk /grill-me and /grill-with-docs Skills sometimes produce ridiculous amounts of questions; my record is 110-ish. But has anyone ever reported an agent saying "Done. I used the Grill With Docs workflow" when there were ZERO questions?
Thorsten Ball@thorstenball·
@thomasjansn That's in there, point #3: "Read things properly before you tweet comments." (Just kidding)
Thorsten Ball@thorstenball·
Some thoughts on ownership I shared with the team this morning
Thorsten Ball tweet media
Thorsten Ball@thorstenball·
@thomasjansn It is, but see first sentence: I haven't explained it in a while and I was in the mood to write something.
Tom Jansen@thomasjansn·
Nice! I used the same approach for my pi.dev extension called pi-fusion github.com/tjansn/pi-fusi…
OrcaRouter 🐳@OrcaRouter

Fable 5 is dead. We just resurrected it — cheaper, open and you hold the keys. OpenRouter dropped Fusion 48h ago and broke the internet. We tested it hard. The synthesizer is insane for deep research… but absolute dogshit for coding. So we fixed it. Meet OrcaRouter.ai DSL — the version you actually own. One prompt → fans out to any panel you want → judge + synthesizer → one god-tier answer. But unlike black-box slugs, you control the entire graph in YAML. Fable 5 level intelligence… without waiting for Anthropic to turn it back on 🧵👇
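The flow described here (one prompt fans out to a panel, then a judge and a synthesizer produce one answer) can be sketched roughly as follows. This is a minimal illustration with stub models, not the OpenRouter Fusion API, the OrcaRouter DSL, or the pi-fusion extension; the `fuse` function, the `Model` type, and the prompt wording are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical model interface: takes a prompt string, returns generated text.
Model = Callable[[str], str]


def fuse(prompt: str, panel: Dict[str, Model], judge: Model, synthesizer: Model) -> str:
    # Fan out: every panel model answers the same prompt independently.
    drafts = {name: model(prompt) for name, model in panel.items()}
    listing = "\n\n".join(f"[{name}]\n{text}" for name, text in drafts.items())

    # Judge: compare the drafts and note strengths and weaknesses.
    critique = judge(f"Question:\n{prompt}\n\nDrafts:\n{listing}\n\nCompare the drafts.")

    # Synthesize: merge the drafts and the critique into one final answer.
    return synthesizer(
        f"Question:\n{prompt}\n\nDrafts:\n{listing}\n\n"
        f"Critique:\n{critique}\n\nWrite one final answer."
    )


# Stub models so the sketch runs without any API calls.
panel: Dict[str, Model] = {
    "model_a": lambda p: "draft A",
    "model_b": lambda p: "draft B",
}
judge: Model = lambda p: "A is more complete; B is more concise."
synthesizer: Model = lambda p: f"final answer built from {p.count('[model_')} drafts"

result = fuse("Summarize the tradeoffs of local models.", panel, judge, synthesizer)
print(result)
```

In a real setup each stub would be replaced by a call to an actual model, and the panel, judge, and synthesizer could each use a different model or thinking effort.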

Kirill Skrygan@kskrygan·
Now it’s official. JetBrains remains the last big independent player in developer tooling. Our job is to deliver the most cost-efficient, deeply integrated, and genuinely enjoyable AI experience across our IDEs and beyond. We’re on it. Stay tuned.
Cursor@cursor_ai

We're excited to join forces with @SpaceX to advance the frontier of useful AI. Expect significant improvements to Cursor soon.

Tom Jansen@thomasjansn·
Based on the @OpenRouter Fusion approach, I made an extension for pi.dev by @earendilworks @badlogicgames that lets you define your own set of fusion models, thinking effort per model, etc., so you can easily play around with it too. First tests give good results, but I haven't had time for benchmarking yet, which I still intend to do -> github.com/tjansn/pi-fusi…
OpenRouter@OpenRouter

Introducing the Fusion API, the smartest compound model in the market. Fusion achieves Fable-level intelligence at half the price. How it works 👇

Tom Jansen@thomasjansn·
"Define what your application needs in secretspec.toml, then plug in keyring, 1Password, Vault, AWS, or any of 11 backends — same code, every environment." Very eager to look into this 👀
Tom Jansen@thomasjansn

😻

Tom Jansen@thomasjansn·
@Simon__Grimm Somehow I want to point in the direction of Occam's razor and "It is vain to do with more what can be done with less," but then I remember that Europe is not exactly in the "do with less" game 😅
Tom Jansen@thomasjansn·
@GergelyOrosz Some things (most, actually) happening in the US atm are getting weirder by the day ... Combined with that weird need to talk about "how bad Europe has it", it gives me a really strange feeling of a system close to total collapse.
Gergely Orosz@GergelyOrosz·
So apparently after Meta leadership:
- Force reassigned some of the best devs on teams to AI data labelling fulltime
- Laid off another 10%
- Started to record every dev’s screen in the US 24/7

They now realized that it has, indeed, started to destroy their eng culture. And are now trying to walk back.

All of the above was unprompted, not forced by anything external or even business reasons (Meta recorded record revenue, record profits)

The biggest self-inflicted eng culture destruction I’ve seen in a matter of weeks
Shokhrukh Karimov@shokhkarim1212·
Hey builders! 👋 Looking to connect with people building in: 🍽️ SaaS 🚀 Tech 📲 Automation 🧠 AI tools 📱 Product Development 🔥 Web APP 💻 Devs Drop what you're working on 👇 (and your link — I'll check it out)