
Today we’re introducing OfficeQA, a new benchmark grounded in ~89,000 pages of U.S. Treasury Bulletins that reflects the complex, document-heavy tasks enterprises actually face. Unlike existing benchmarks, OfficeQA measures economically valuable, real-world reasoning: parsing dense tables, navigating scanned PDFs, and retrieving facts across decades of documents. Even strong agents reach only ~45% accuracy, showing how far the field has to go. The benchmark is now open to the community, and the Databricks Grounded Reasoning Cup in Spring 2026 will challenge teams to push these capabilities forward. databricks.com/blog/introduci…