
Our paper on diagnosing legal reasoning capabilities in language models has been accepted at ACL 2026 Findings 🎉. So excited to share more about this work in San Diego. Our benchmark contains some of the most complex legal reasoning tasks available to the public, and we take on three fundamental challenges in legal evaluation:
Scaling. Traditional benchmarks rely on direct expert annotation (1 annotation → 1 solution), which limits size and diversity. OpenExempt instead encodes legal rules into a machine-computable form, allowing us to generate a large space of legal reasoning tasks and dynamically compute their solutions.
Data Leakage. Static datasets quickly lose value once models train on them. Because OpenExempt generates novel tasks on demand, it enables evaluation on entirely unseen problems, even after release.
Diagnostic Evaluation. A model's failure on a static task provides only a single, opaque signal of error. By allowing users to precisely control task complexity, structure and scope, we can isolate specific reasoning skills and diagnose exactly where models succeed or fail.
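To make the dynamic-generation idea concrete, here is a minimal toy sketch of the general approach: a legal rule encoded as an executable predicate, from which task instances are sampled and their ground-truth answers computed on the fly. All names, the rule, and the thresholds below are invented for illustration and are not OpenExempt's actual encoding.

```python
import random

def is_exempt(facts):
    """Toy exemption rule (invented for illustration): a filer is exempt
    if income is below a threshold and they are not claimed as someone
    else's dependent."""
    return facts["income"] < 10_000 and not facts["is_dependent"]

def generate_task(rng):
    """Sample a random fact pattern and compute its solution dynamically,
    rather than relying on a fixed expert annotation."""
    facts = {
        "income": rng.randrange(0, 20_000),
        "is_dependent": rng.random() < 0.5,
    }
    question = (
        f"A filer earns ${facts['income']} and is "
        f"{'claimed' if facts['is_dependent'] else 'not claimed'} "
        "as a dependent. Are they exempt?"
    )
    return {"question": question, "answer": is_exempt(facts)}

# Each call yields a fresh, previously unseen task with a computed answer.
rng = random.Random(0)
tasks = [generate_task(rng) for _ in range(3)]
for t in tasks:
    print(t["question"], "->", t["answer"])
```

Because the rule itself is the source of truth, task complexity and scope can be varied programmatically (e.g., composing multiple rules) while solutions stay exactly computable.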
Read the paper: arxiv.org/abs/2601.13183

























