Amar Singh (@AmarSVS)
Introducing our new Schematron benchmark. We took some time to compare all of the latest open source models to see which one takes the crown.
The benchmark essentially measures the ability of LLMs to take raw HTML along with a JSON schema, and then fill out that schema. We measure things like recall/precision, hallucinations, and ability to handle ambiguity.
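As a rough illustration of what field-level precision, recall, and hallucination checks can look like for this kind of task, here is a minimal sketch. It is not the benchmark's actual scoring code: the `flatten` and `field_scores` helpers, the exact-match comparison, and the example record are all assumptions made for illustration.

```python
def flatten(obj, prefix=""):
    """Flatten nested JSON into {path: leaf_value}, skipping null leaves."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}.{key}" if prefix else key))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}[{i}]"))
    elif obj is not None:
        items[prefix] = obj
    return items


def field_scores(predicted, gold):
    """Hypothetical scorer: exact-match comparison of leaf fields."""
    pred, ref = flatten(predicted), flatten(gold)
    correct = sum(1 for path, value in pred.items() if ref.get(path) == value)
    return {
        # Share of extracted fields that match the reference.
        "precision": correct / len(pred) if pred else 0.0,
        # Share of reference fields that were extracted correctly.
        "recall": correct / len(ref) if ref else 0.0,
        # Fields the model filled in that the reference does not contain.
        "hallucinated": sorted(path for path in pred if path not in ref),
    }


# Made-up example data, not taken from the benchmark.
gold = {"title": "Widget", "price": 9.99, "tags": ["a", "b"]}
predicted = {"title": "Widget", "price": 10.0, "tags": ["a"], "sku": "X1"}
scores = field_scores(predicted, gold)
```

Here `scores["hallucinated"]` flags `sku`, a field the model filled in that the reference never contained. Exact matching is the simplest choice; a real grader would likely need fuzzier comparison for ambiguous fields.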
The benchmarks are graded with an ensemble of frontier models on a 5-point rubric.
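One simple way to combine grades from several judge models is to average them and track how far the judges disagree. The sketch below assumes a 1 to 5 scale and a plain mean; the post does not say how the ensemble is actually aggregated, and the judge names are placeholders.

```python
from statistics import mean


def ensemble_score(grades):
    """Combine per-judge rubric grades (assumed to be integers from 1 to 5).

    `grades` maps a hypothetical judge name to that judge's score.
    Returns the mean score and the spread between the harshest and
    most lenient judge.
    """
    if not grades:
        raise ValueError("at least one judge grade is required")
    for judge, score in grades.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{judge} gave an out-of-range score: {score}")
    values = list(grades.values())
    return {"mean": mean(values), "spread": max(values) - min(values)}


# Placeholder judge names and scores, for illustration only.
result = ensemble_score({"judge_a": 4, "judge_b": 5, "judge_c": 3})
```

A large spread is a cheap signal that an item is ambiguous and may deserve a manual look.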
GLM 5 is currently the best open source model for schema extraction. Surprisingly, GPT-OSS 120B also does very well at this type of extraction task.
Another interesting result: we noticed a degradation in quality with Qwen3.5 Plus on this task compared with the original Qwen3.5 397B MoE.
The inputs can be up to 120K tokens, so this is akin to a long-context benchmark with an additional reasoning layer.
We will be open sourcing this benchmark if it gains sufficient traction. Also, more benchmarks coming from our side!