We're open-sourcing PulseBench-Tab, a frontier benchmark for table extraction.
Table parsing remains one of the hardest and most poorly measured problems in document intelligence. TEDS operates on DOM trees and conflates HTML formatting conventions with structural errors. Needleman-Wunsch linearizes a two-dimensional structure into a one-dimensional sequence, so column transpositions can still score well because values align with nearby cells. GriTS uses greedy grid matching rather than optimal assignment and does not distinguish edge directions. The upshot: existing metrics cannot reliably separate content errors from structural errors, which makes provider comparisons noisy and downstream reliability unknowable.
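To make the linearization problem concrete, here is a toy example (ours, not from the benchmark; the numbers are invented) where a similarity-weighted alignment over row-major flattened cells still scores about 0.76 on a table whose columns have been transposed:

```python
# Toy example: row-major flattening hides a column transposition.
from difflib import SequenceMatcher

def cell_sim(a: str, b: str) -> float:
    # Stand-in for any Levenshtein-style text similarity.
    return SequenceMatcher(None, a, b).ratio()

gold = [["4,105", "4,108"], ["2,330", "2,336"], ["918", "912"]]
pred = [[row[1], row[0]] for row in gold]  # columns swapped: structurally wrong

flat_gold = [c for row in gold for c in row]
flat_pred = [c for row in pred for c in row]

# The gap-free alignment (which Needleman-Wunsch also settles on for these
# sequences) pairs each cell with its similar-looking neighbor, so the
# score stays ~0.76 even though every cell sits in the wrong column.
score = sum(cell_sim(a, b) for a, b in zip(flat_gold, flat_pred)) / len(flat_gold)
print(f"{score:.2f}")  # 0.76
```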
Alongside the dataset, our research team developed T-LAG. It parses each table into a cell-position grid, emits directed RIGHT and BELOW adjacency edges (suppressed within spanning cells, deduplicated by source, target, and direction), weights each candidate edge pair by the product of Levenshtein-derived similarities on source and target text, and uses the Hungarian algorithm for globally optimal one-to-one assignment. The F1 over matched edge weight is the T-LAG score. Structure and content are evaluated in one unified pass. HTML formatting choices do not affect the result. Rankings are invariant to the similarity exponent across k ∈ {7, 8, 9, 11}.
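For readers who want the mechanics, here is a minimal sketch of a T-LAG-style scorer. It is not the official implementation: spanning-cell suppression and edge deduplication are omitted, and the helper names, grid representation, and default exponent k = 9 are our illustrative choices.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def lev(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein distance over characters."""
    d = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(b) + 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (a[i - 1] != b[j - 1]))
    return d[len(b)]

def sim(a: str, b: str, k: int = 9) -> float:
    """Levenshtein-derived similarity, sharpened by exponent k."""
    if not a and not b:
        return 1.0
    return (1.0 - lev(a, b) / max(len(a), len(b))) ** k

def edges(grid):
    """Directed RIGHT and BELOW adjacency edges over a simple cell grid
    (spanning-cell suppression and dedup are omitted in this sketch)."""
    out = []
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if c + 1 < len(row):
                out.append((cell, row[c + 1], "RIGHT"))
            if r + 1 < len(grid):
                out.append((cell, grid[r + 1][c], "BELOW"))
    return out

def tlag(pred, gold, k: int = 9) -> float:
    """F1 over matched edge weight under a globally optimal assignment."""
    ep, eg = edges(pred), edges(gold)
    if not ep or not eg:
        return 0.0
    # Candidate pair weight = product of endpoint text similarities,
    # restricted to edges with the same direction.
    w = np.zeros((len(ep), len(eg)))
    for i, (s1, t1, d1) in enumerate(ep):
        for j, (s2, t2, d2) in enumerate(eg):
            if d1 == d2:
                w[i, j] = sim(s1, s2, k) * sim(t1, t2, k)
    # Hungarian algorithm: one-to-one matching maximizing total weight.
    rows, cols = linear_sum_assignment(w, maximize=True)
    matched = float(w[rows, cols].sum())
    precision, recall = matched / len(ep), matched / len(eg)
    return 2 * precision * recall / (precision + recall) if matched else 0.0
```

Because the edges are directed, transposing two columns reverses every RIGHT relation between them, so the structural error that sequence alignment shrugged off above is penalized directly.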
The dataset contains 1,820 human-annotated tables across 9 languages and 4 scripts (Latin, CJK, Arabic, Cyrillic), drawn from 380 real-world financial filings, government reports, and regulatory disclosures. Tables range from 2 to 1,183 cells; 48.1% contain merged or spanning cells. Ground truth was produced through 8 annotation rounds with native speakers per language, independent cross-lingual review, and adversarial cell-by-cell audits against source images.
We independently evaluated 9 commercial and open-source systems across the full dataset under exclude-missing scoring, i.e., samples a system returns no output for are dropped from its average rather than counted as zero. Selected findings:
@Pulse__AI Ultra 2 scores 0.9347 T-LAG; the next closest system scores 0.8155. Pulse Ultra 2 is also the only provider with a median T-LAG of 1.0, corresponding to perfect extraction on 57.9% of samples.
Non-Latin scripts produce the widest cross-provider variance. On Arabic, the spread between top and bottom systems exceeds 75 percentage points.
Structural hallucinations are pervasive. The second-ranked system achieves a perfect-extraction rate of 28.6%, meaning structural or content errors on 71.4% of tables (fabricated rows, invented content, incorrect span attributes, shifted data).
Coverage failure is underreported. Multiple evaluated systems return no output on 19% to 21% of samples. Accuracy numbers reported without coverage disclosure reward this selection bias: silently dropping hard samples inflates the headline score, as the sketch below shows.
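A minimal sketch with invented numbers (nothing here is from the benchmark) of how exclude-missing scoring interacts with coverage:

```python
# Invented numbers, purely illustrative: under exclude-missing scoring,
# a system that returns nothing on its hardest 20% of tables posts a
# higher headline score than one that answers everything.
def report(name, returned_scores, total):
    coverage = len(returned_scores) / total
    headline = sum(returned_scores) / len(returned_scores)
    print(f"{name}: {headline:.3f} T-LAG at {coverage:.0%} coverage")

# A answers all 100 tables; B silently skips the 20 hardest.
report("System A", [0.95] * 80 + [0.60] * 20, total=100)  # 0.880 at 100%
report("System B", [0.95] * 80, total=100)                # 0.950 at 80%
```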
Thank you to Dushyanth Sekhar and Mohammed Hadi of S&P Global's Enterprise Data Organization for their academic contributions to the benchmark methodology.
Dataset: huggingface.co/datasets/pulse…
Evaluation: github.com/Pulse-Software…
Blog: runpulse.com/blog/pulsebenc…
Research methodology: benchmark.runpulse.com/research-report
Viewer: benchmark.runpulse.com
