
Chris Clark
@cclark
Co-founder & COO @OpenRouterAI




re: what “OpenAI compatible” actually means in 2025 through the lens of gpt-oss reasoning_effort
the term gets used a lot across the industry, and it carries way more implied guarantees than it should
historically, “OpenAI compatible” has meant support for an OpenAI-style chat completions API. same high-level schema, same messages array, same basic parameters. this shape became the default largely due to OpenAI's first-mover advantage
that API worked well enough to start. it was built quickly, for the moment, before tool calling, hybrid reasoning, structured outputs, or multimodal inputs were common. it's hard to fault the original design given the capabilities at the time
but once you try to apply that same API shape across 60+ providers, inference engines, and model families, the cracks show up very fast. the practical result is that the same request can succeed, fail, or subtly change model behavior depending on where it runs.
let's start with the messages array:
on paper it's simple: ordered turns, each with a role and some content. in practice, this array is handled wildly differently. some providers support arrays of content types per turn - typically text+image in a single turn, but it could just as well be multiple text strings. some throw errors when you try this; others silently concatenate. some models were trained for it, others weren't, which means you may get degraded performance even when the request “works.”
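to make that concrete, here's a rough sketch (hypothetical helper in python, not OpenRouter's actual code) of the kind of normalization a router has to do for providers that only accept a plain string as content:

```python
def flatten_content(content):
    """Collapse an OpenAI-style content array into a single string,
    for providers that only accept plain-text content per turn.
    Hypothetical helper: a real adapter also has to handle images,
    audio, etc., or reject the request up front."""
    if isinstance(content, str):
        return content  # already the simple shape
    parts = []
    for part in content:
        if part.get("type") == "text":
            parts.append(part["text"])
        else:
            # this provider can't take non-text parts; failing loudly
            # beats silently dropping the image
            raise ValueError(f"unsupported content part: {part.get('type')}")
    return "\n".join(parts)
```

note the design choice: joining multiple text parts is lossy-but-reasonable, while dropping an image part would silently change model behavior, so the sketch errors instead.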
role ordering is another source of variance - for example, some providers accept a messages array with a single system turn, while others require at least one user turn. some allow assistant prefill and correctly continue generation; others support prefill only behind a specific parameter, and that parameter differs by provider; still others ignore the prefill or throw errors. all of this can happen on the same model depending on where it's hosted
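the role-ordering variance means a router has to patch the array per provider. a minimal sketch, assuming a hypothetical per-provider capability flag:

```python
def normalize_roles(messages, requires_user_turn=True):
    """Some providers reject a messages array that contains no user turn
    (e.g. a lone system message). If this provider requires one, append
    an empty user turn rather than letting the upstream request fail.
    requires_user_turn is a hypothetical capability flag."""
    if requires_user_turn and not any(m["role"] == "user" for m in messages):
        return messages + [{"role": "user", "content": ""}]
    return messages
```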
that's before you touch sampling parameters
even temperature ranges differ - some cap at 1, some allow higher. logprobs can come back in different shapes. newer OpenAI models don't allow modifying temperature or top-p at all, while open-source models still rely on them heavily. compatibility here often means 'best effort'
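"best effort" in practice looks something like this sketch - clamp what you can, drop what the provider rejects outright (capability record is hypothetical):

```python
def clamp_sampling(params, provider_caps):
    """Best-effort sampling normalization: clamp temperature into the
    provider's supported range, and drop it entirely for models that
    reject the parameter (e.g. some newer closed models).
    provider_caps is a hypothetical per-provider capability record."""
    out = dict(params)
    if "temperature" in out:
        if not provider_caps.get("temperature_supported", True):
            out.pop("temperature")  # provider errors if the key is present
        else:
            lo, hi = provider_caps.get("temperature_range", (0.0, 2.0))
            out["temperature"] = min(max(out["temperature"], lo), hi)
    return out
```

the trade-off: clamping silently changes sampling behavior, but it keeps the request from erroring - which is exactly the kind of semantic drift "compatible" papers over.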
structured outputs add another layer
json object vs json schema, partial json schema support, streaming sometimes supported, sometimes not. some providers support reasoning plus structured outputs, others don't
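a sketch of one common downgrade path, assuming a hypothetical capability flag - providers that only speak json_object get the schema stripped, which means the schema then has to be enforced on the client side:

```python
def adapt_response_format(request, supports_json_schema):
    """Hypothetical downgrade: if this provider only supports
    {"type": "json_object"}, strip the schema from a json_schema
    request. The caller then has to validate the output against the
    schema itself, since the provider no longer enforces it."""
    fmt = request.get("response_format")
    if fmt and fmt.get("type") == "json_schema" and not supports_json_schema:
        out = dict(request)
        out["response_format"] = {"type": "json_object"}
        return out
    return request
```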
tool calling is where layers of variance really add up:
tool calling is structured output plus special tokens plus a parser plus chat templating plus finish reasons. tool parsers are frequently incorrect, and that's not always the provider's fault - even when a model lab works with the popular engines like vllm and sglang, we see tool call parser issues well after launch. the kimi k2 vendor verifier project uncovered various problems in the inference engine implementations weeks after model launch
tool_choice support varies by model and provider. auto usually works. none often breaks in subtle ways. forced tool use is rare. function-by-name works on some stacks and not others. finish_reason=tool_calls is not guaranteed
even tool call IDs are inconsistent - regex expectations differ, length limits differ. reuse the same ID across providers and you will eventually hit a hard error somewhere.
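the ID problem has a mechanical fix, sketched here (hypothetical pattern and prefix - real providers each have their own constraints, and a real router also has to remember the mapping so responses can be translated back):

```python
import hashlib
import re

def normalize_tool_call_id(tool_call_id, pattern=r"^[a-zA-Z0-9_-]{1,40}$"):
    """If an upstream tool call ID doesn't satisfy this provider's
    regex/length constraints, deterministically rewrite it. Deterministic
    hashing means the same original ID always maps to the same new ID,
    so repeated turns stay consistent."""
    if re.match(pattern, tool_call_id):
        return tool_call_id  # already acceptable
    digest = hashlib.sha256(tool_call_id.encode()).hexdigest()[:24]
    return f"call_{digest}"
```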
at this point, "compatible" describes the shape of the request, not the semantics
reasoning_effort is a newer example of the same pattern, brought into the limelight by @xeophon's work benchmarking gpt-oss. he surfaced measurable variance caused by provider-level incompatibilities
OpenAI introduced a new enum parameter, reasoning_effort. initially the enum was low, medium, high; later came minimal, none, and extra high. gpt-oss only supports a subset (low, medium, high). meanwhile, most open-source reasoning models only support reasoning enabled or disabled (think GLM family, DeepSeek after v3.1). when gpt-oss released, none of the inference providers supported the reasoning_effort parameter, since it had mostly been used on OpenAI's proprietary models. everyone rushed to launch gpt-oss, and the parameter and its impact on the amount of reasoning was swept under the rug for months.
eventually, @xeophon published a benchmark he ran through OpenRouter showing that many providers weren't changing the amount of reasoning based on the effort value sent to the OpenRouter API. Xeophon originally blamed the providers, but as soon as I saw it I realized the issue was largely our fault - we hadn't implemented support for it for each provider.
this was a miss by both the OpenRouter team and the providers, given how new and underspecified the parameter was - to this day, most providers don't document the parameter or its supported values anywhere, and most didn't tell us when support was added. we often have to chase teams down for details about their APIs, because we need a deep understanding of each implementation to properly transform user intent into upstream-acceptable values.
the fix for this could have been that we added piecemeal support for each provider - but i wanted to avoid this problem entirely in the future, so instead i spent a few weeks refactoring a ton of code to implement model-level reasoning configs. this means that in our database i can now specify a few things:
- whether a model supports reasoning effort
- what values of the enum it supports (out of none/minimal/low/medium/high/xhigh)
- and what the default value should be
this solves multiple issues for us in a scalable way:
- it prevents upstream APIs from throwing errors when we'd otherwise pass the reasoning_effort param to models that don't support it
- it prevents upstream APIs from throwing errors when we'd otherwise pass a value the model doesn't support
- and it normalizes the default effort value across providers, so if the user doesn't specify, the behavior is consistent.
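a rough sketch of what a model-level config plus normalization could look like (hypothetical field names and snapping behavior, python for illustration - not our actual schema):

```python
# ordering of effort values, weakest to strongest
EFFORT_ORDER = ["none", "minimal", "low", "medium", "high", "xhigh"]

# hypothetical shape of a model-level reasoning config record
GPT_OSS_CONFIG = {
    "supports_effort": True,
    "supported_values": {"low", "medium", "high"},
    "default": "medium",
}

def resolve_effort(requested, config):
    """Normalize a user-requested reasoning_effort against a model's
    config: drop it entirely if the model has no effort support, fall
    back to the default if unspecified, and snap an unsupported value
    to the nearest supported one instead of erroring upstream."""
    if not config["supports_effort"]:
        return None  # never send the param to this model
    if requested is None:
        return config["default"]
    if requested in config["supported_values"]:
        return requested
    # snap to the closest supported value by position in the ordering
    idx = EFFORT_ORDER.index(requested)
    return min(config["supported_values"],
               key=lambda v: abs(EFFORT_ORDER.index(v) - idx))
```

e.g. under this sketch, a user asking gpt-oss for "xhigh" would get "high", and an unspecified effort would consistently resolve to "medium" regardless of provider.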
once this model level config implementation was done, we still needed to ensure that we were plumbing the values into the right fields for different providers that expected it in different places. some expect it in chat template kwargs, others in a top-level reasoning object.
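the plumbing step might look like this sketch (placement names are hypothetical; they stand in for wherever each provider actually expects the value):

```python
def apply_effort(request, effort, placement):
    """Route a normalized effort value into the field this provider
    expects. The placement names here are hypothetical labels for the
    real per-provider request shapes."""
    out = dict(request)
    if effort is None:
        return out  # model doesn't support effort; send nothing
    if placement == "top_level":
        out["reasoning_effort"] = effort
    elif placement == "reasoning_object":
        out["reasoning"] = {"effort": effort}
    elif placement == "chat_template_kwargs":
        out["chat_template_kwargs"] = {"reasoning_effort": effort}
    return out
```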
once all this was done, @xeophon was able to get consistent reasoning effort behavior across all providers (minor caveat for bedrock API, which we fixed after communicating with their team about their unique param implementation.)
so now, for reasoning_effort specifically, the OpenRouter experience should be much better. and in the future, any new models with effort support will be much easier for us to support, and we will be able to move more quickly with them.
and all of that's just the request side. here’s a quick, non-exhaustive sample:
token accounting differs - counts for cached tokens, reasoning tokens, image tokens, etc. vary by provider. sometimes reasoning is returned separately. sometimes wrapped in





