
Quick update on that: I'm running the first DRACO benchmark now. The baseline is GPT 5.5 xhigh with pi.dev as the harness and GPT 5.5 xhigh as judge and synthesizer, called directly via the OpenAI API (same benchmark as the one used by OpenRouter). After that I'll test different combinations of local models, with paid and/or local models as judge and synthesizer. Benchmarking takes way longer than I expected... DRACO has 100 tasks, and this first run has already been going for more than 24 hours and will need at least another 3. At API pricing alone this would land at about $600, but via the $200 sub it's completely fine. Will keep you updated once it's finished. research.perplexity.ai/articles/evalu…










