
Ivan Parfenchuk
2.7K posts

@parfenchuk
Me, on the internet Check out my HOMM3 LLM leaderboard https://t.co/0opASBTxKJ
Joined January 2010
1K Following · 195 Followers

@OnlyGerier Check out my humble homm3 LLM arena leaderboard: homm3arena.com

I played the new Heroes of Might and Magic Olden Era and it was like going back to my childhood. Pure nostalgia.
Do you remember this fantasy saga? It's the new entry in the franchise, and it recovers the lost essence of the 90s, specifically of Heroes III, the one players praise the most.
It brings back tactical combat on a hexagonal grid, map exploration with fog of war, and traditional resource management. It drops the confusing mechanics of the latest entries (Heroes VI and VII) to focus on what worked: building your castle, recruiting troops, and leveling up your hero.
It's a return to the roots because it prioritizes tactical gameplay and the visual charm of the 1999 title.

second day in a row gpt-5.5 IQmogging opus-4.7
yesterday it was proposing very nice and compact changes based on PR reviews, better than what I saw recently with opus-4.7
today gpt-5.5 xhigh re-implemented a PR from scratch and comparing it to previous opus-4.7 xhigh implementation, the gpt-5.5 output is just better

@i_Kisliy safe travels!
on a positive note, when you land, X will be full of vibe-bench results 😉


Yeah, when I omw to the airport for my vacation. Thx guys
OpenAI@OpenAI
Introducing GPT-5.5 A new class of intelligence for real work and powering agents, built to understand complex goals, use tools, check its work, and carry more tasks through to completion. It marks a new way of getting computer work done. Now available in ChatGPT and Codex.


Claude Design: help me improve design of my homm3arena.com webpage, make text more legible, use homm3-inspired style, but not too heavily
before / after



Built a public leaderboard where frontier LLMs play Heroes of Might and Magic III against each other. homm3arena.com
I didn't write a single line of code. Codex did all of it, including patches to the VCMI C++ engine (OSS reimplementation of HoMM III that I've never opened).
How it works:
- battles run on VCMI via vcmi-gym. each model gets a provider adapter (OpenRouter, OpenAI) and outputs legal moves against the real game state
- whitelisted seeds, chosen to be balanced enough that the match is fair
- one sample = a mirrored pair (same seed, sides reversed). Bradley-Terry ranking with bootstrap 95% CIs over mirrors
- bad batches (too many fallbacks or provider errors) don't count
Current top 3:
1) GPT-5.2 (medium)
2) Claude Sonnet 4.6
3) GPT-5.4-mini
HoMM III was the game of my childhood. I wouldn't have built this by hand, the VCMI integration alone would've eaten weeks I don't have between a day job and two small kids.
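
The ranking step described above (Bradley-Terry over mirrored pairs, with bootstrap 95% CIs) could be sketched roughly like this. This is not the leaderboard's actual code: the `(model_a, model_b, wins_a, wins_b)` tuple layout, the placeholder model names, the small `prior` of virtual games, and the choice of the standard MM fitting iteration are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def fit_bradley_terry(pairs, models, iters=200, prior=0.5):
    """Fit Bradley-Terry log-strengths with the MM iteration.

    pairs: list of (model_a, model_b, wins_a, wins_b), one tuple per mirrored
    pair (assumed layout). `prior` adds virtual wins in both directions between
    every two models so estimates stay finite (an assumption, not from the post).
    """
    wins = defaultdict(float)   # total wins per model
    games = defaultdict(float)  # games played per unordered model pair
    for a, b, wins_a, wins_b in pairs:
        wins[a] += wins_a
        wins[b] += wins_b
        games[frozenset((a, b))] += wins_a + wins_b

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            w_i = wins[i] + prior * (len(models) - 1)
            denom = sum(
                (games[frozenset((i, j))] + 2 * prior) / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            updated[i] = w_i / denom
        total = sum(updated.values())
        strength = {m: v * len(models) / total for m, v in updated.items()}

    # Report mean-centred log-strengths so the scale is comparable across fits.
    logs = {m: math.log(s) for m, s in strength.items()}
    mean = sum(logs.values()) / len(logs)
    return {m: v - mean for m, v in logs.items()}


def bootstrap_ci(pairs, models, n_boot=200, seed=0):
    """Percentile 95% intervals, resampling whole mirrored pairs with replacement."""
    rng = random.Random(seed)
    samples = {m: [] for m in models}
    for _ in range(n_boot):
        resample = [rng.choice(pairs) for _ in pairs]
        fit = fit_bradley_terry(resample, models)
        for m in models:
            samples[m].append(fit[m])
    intervals = {}
    for m, xs in samples.items():
        xs.sort()
        intervals[m] = (xs[int(0.025 * (len(xs) - 1))], xs[int(0.975 * (len(xs) - 1))])
    return intervals


# Made-up results for three placeholder models; each tuple is one mirrored pair.
models = ["model_a", "model_b", "model_c"]
pairs = (
    [("model_a", "model_b", 2, 0)] * 6
    + [("model_a", "model_b", 1, 1)] * 2
    + [("model_b", "model_c", 2, 0)] * 5
    + [("model_b", "model_c", 1, 1)] * 3
    + [("model_a", "model_c", 2, 0)] * 4
)
strengths = fit_bradley_terry(pairs, models)
intervals = bootstrap_ci(pairs, models)
```

Resampling whole mirrored pairs, rather than single games, keeps the two sides of a seed together, which matches the "one sample = a mirrored pair" rule.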
Ivan Parfenchuk@parfenchuk
Pushed latest standings to homm3arena.com

Pushed latest standings to homm3arena.com
Ivan Parfenchuk@parfenchuk
Now updated with gpt-5.4-mini and gpt-5.4-nano

@ai_for_success that was the small plan, big plan is still coming

Setting up the harness based on "harness engineering" blog by OpenAI right now.
What I'm struggling to fix is that codex (gpt-5.4 xhigh) likes to write unnecessary React component props. Things like props set to their defaults, or not merging stuff together (paddingX="1" + paddingY="1" vs just a single padding="1")
So far fixing it with prompts (AGENTS.md / golden-principles.md / etc.) has not been very successful.
How do you guys fix this? A "Stop" hook with anti-slop pass? `claude -p "/simplify"` post-processing?
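
One deterministic option for the padding case is a small lint-style check that a hook or CI step could run. The sketch below is a rough regex pass, not a real JSX parser: the function name `find_mergeable_padding` is made up, it only sees plain string attributes, and it will miss or misread tags whose props contain `>` (arrow functions, for example).

```python
import re

# Opening tags of capitalised components, e.g. <Box paddingX="1" paddingY="1">
TAG = re.compile(r"<([A-Z]\w*)\b([^<>]*)>")
# Plain string attributes only: name="value"
ATTR = re.compile(r'(\w+)="([^"]*)"')


def find_mergeable_padding(source):
    """Return (component, value) for tags where paddingX and paddingY are equal."""
    findings = []
    for tag in TAG.finditer(source):
        attrs = dict(ATTR.findall(tag.group(2)))
        x, y = attrs.get("paddingX"), attrs.get("paddingY")
        if x is not None and x == y:
            findings.append((tag.group(1), x))
    return findings


sample = '<Box paddingX="1" paddingY="1">hi</Box><Box paddingX="2" paddingY="1" />'
findings = find_mergeable_padding(sample)
```

Here only the first `Box` is flagged, since the second one uses different X and Y values and cannot be collapsed into a single `padding`.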

Seeing a net positive improvement from following OpenAI's "harness engineering" blog post:
- document your codebase best practices under ./docs/ai/*.md. I used golden-principles.md, testing-rubric.md, ui-patterns.md. Link to these docs in main AGENTS.md/CLAUDE.md file
- I'm doing a bit of code migration right now and documenting everything under ./docs/ai/migrations//*, including: strategy.md, phase-*.md, and log.md
- after each PR, I ask codex to read through all the docs, update everything that changed, and integrate everything we learned during the sessions. Surprisingly, things change quite often and documentation drift is real
- also maintaining a log.md of the corrections, so hopefully next time codex follows the right pattern from the start. This one is difficult to write in a way that is not harmful for future PRs. Often it's too specific to the work already done
- another thing is chrome/playwright mcp browser verification - codex is quite happy to use it and verify it did the right thing. I think it helped reduce back-and-forth too
I've sent ~4 PRs now and I feel it's starting to pay off. Fewer incorrect assumptions in the code
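
The "link these docs in the main AGENTS.md" step is easy to let slip, so a tiny check can catch docs that were added but never linked. A minimal sketch, assuming the docs sit directly under `docs/ai/` and that a link means the doc's relative path appears somewhere in the index file; the function name `unlinked_docs` is made up for this example.

```python
from pathlib import Path


def unlinked_docs(repo_root, index_name="AGENTS.md"):
    """List docs/ai/*.md files whose relative path never appears in the index file."""
    root = Path(repo_root)
    index_text = (root / index_name).read_text(encoding="utf-8")
    missing = []
    for doc in sorted((root / "docs" / "ai").glob("*.md")):
        rel = doc.relative_to(root).as_posix()
        if rel not in index_text:
            missing.append(rel)
    return missing
```

Running `unlinked_docs(".")` from the repo root returns the paths that still need a link; an empty list means every doc is referenced. Pass `index_name="CLAUDE.md"` to check that file instead.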




