Tarun Menta

37 posts

Tarun Menta
@_tarunmenta

Founding Research Engineer @datalabto | Ex @Adobe MDSR | @IITHyderabad `23

Brooklyn, NY · Joined August 2023
225 Following · 113 Followers
Tarun Menta retweeted
Vik Paruchuri@VikParuchuri·
We shipped Chandra OCR 1.5 to the Datalab API this week! This is a big improvement over 1.1, especially on tables, lists, charts, chemistry, and math.
Tarun Menta retweeted
Datalab@datalabto·
People evaluating OCR ask the same two questions: 1. Which models are actually good 2. How do they behave on my documents Benchmarks help with the first, but they rarely show you the qualitative side of model behavior across real layouts, scripts, bad scans, and messy photos. So we built two tools to solve this 🧵
Tarun Menta retweeted
Datalab@datalabto·
Launch Week - Day 4: Spreadsheet Parsing 🚀 We’ve added native spreadsheet support to the Datalab API. Spreadsheets look structured until you hit real-world files: - staggered / overlapping tables - sparse regions with fake separators - hidden columns, merged cells - stray cells that break heuristics - images of tables (why?) Getting reliable structure out of grids is genuinely hard.
Asfi@AsfiShaheen·
@VikParuchuri @datalabto @_tarunmenta Is this available by default when I call your API, or do I need a setting for it? I have a long job running as we speak, so I'm wondering if I should pause it. It's a super useful feature.
Vik Paruchuri@VikParuchuri·
Section headers can make or break the accuracy of your RAG system. Today, we're shipping a feature that outputs perfect header levels, even across hundreds of pages.
Tarun Menta retweeted
Datalab@datalabto·
We're kicking off December with Launch week 🚀 Day 1: Chandra 1.1, our latest upgrade to Datalab’s SoTA OCR model. Massive improvements across layout, math, tables, and multilingual performance 🧵
Tarun Menta retweeted
Vik Paruchuri@VikParuchuri·
Chandra OCR scores 93.9 on the OlmOCR benchmark, if we correct for minor formatting differences. Let's discuss what this says about OCR benchmarking 🧵
Tarun Menta retweeted
Datalab@datalabto·
We ran an experiment to test a simple idea: better OCR leads to better structured extraction. Modern LLMs can follow schemas and parse complex documents, but only if they can actually read them. On real-world invoices with rotated scans, dense tables, and skewed text, that’s not always the case. 🧵 to see how Gemini 2.5 Flash and GPT 5-mini performed in our research:
Akshay 🚀@akshay_pachaar·
Everyone is sleeping on this new OCR model! Datalab's Chandra topped independent benchmarks and beat the previous best, dots.ocr. - Support for 40+ languages - Handles text, tables, formulas seamlessly I tested it on Ramanujan's handwritten letter from 1913. 100% open-source.
Tarun Menta retweeted
Modal@modal·
We're excited to collaborate with @datalabto to make high-throughput document intelligence accessible to all developers. Instantly deploy and scale Datalab's best-in-class Marker pipeline on Modal GPUs 📑
Tarun Menta retweeted
Datalab@datalabto·
We just released Workflows (beta) — a way to chain document-processing steps like parsing, extraction, segmentation, and conditional logic into a single, reusable pipeline. Read more here → datalab.to/blog/build-doc…
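Datalab's actual Workflows API isn't shown here, but as a rough mental model, chaining document-processing steps with conditional logic into one reusable pipeline can be sketched like this (all names and step bodies are hypothetical stand-ins):

```python
from typing import Callable

# Hypothetical pipeline: each step takes and returns a dict,
# so steps compose into a single reusable chain.
Step = Callable[[dict], dict]

def parse(doc: dict) -> dict:
    # Stand-in for OCR/parsing: split raw text into line blocks.
    doc["blocks"] = doc["raw"].splitlines()
    return doc

def extract(doc: dict) -> dict:
    # Stand-in for structured extraction: pull lines that look like totals.
    doc["totals"] = [b for b in doc["blocks"] if b.lower().startswith("total")]
    return doc

def run_pipeline(doc: dict, steps: list[Step]) -> dict:
    for step in steps:
        doc = step(doc)
        if not doc["blocks"]:  # conditional logic: stop early on an empty parse
            break
    return doc

result = run_pipeline({"raw": "Invoice #42\nTotal: $19.99"}, [parse, extract])
print(result["totals"])  # -> ['Total: $19.99']
```

The payoff of this shape is that parsing, extraction, and any segmentation steps share one interface, so pipelines can be rearranged or reused without glue code.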
Tarun Menta@_tarunmenta·
@martian_2090 @VikParuchuri We convert Chandra's output to plain HTML, so this won't be represented. However, the bounding boxes are accurately extracted too (seen in the visualizations), so it's possible with some heuristics plus basic CSS :)
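The heuristic hinted at here — comparing a block's bounding box against the page width to recover alignment — could be sketched like this (a hypothetical helper, not part of the Datalab API):

```python
def infer_alignment(bbox, page_width, tol=0.05):
    """Guess text alignment from a bounding box (x0, y0, x1, y1).

    Compares the left and right margins relative to the page width;
    roughly equal margins suggest centered text.
    """
    x0, _, x1, _ = bbox
    left, right = x0, page_width - x1
    if abs(left - right) <= tol * page_width:
        return "center"
    return "left" if left < right else "right"

# Map the guess to basic CSS, as the reply suggests.
css = {
    "left": "text-align: left",
    "center": "text-align: center",
    "right": "text-align: right",
}

print(infer_alignment((400, 50, 600, 80), 1000))  # equal 400px margins -> center
```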
Pavan Kumar Singh@martian_2090·
@VikParuchuri Thanks @VikParuchuri for open-sourcing this wonderful model. I can see that text extraction is very accurate, but the placement of text, especially in headers, is always left-aligned after OCR even though it is center- or right-aligned in the original image.
Vik Paruchuri@VikParuchuri·
Many famous authors had messy handwriting. Let's see how well Chandra OCR does on some samples. First up, Edgar Allan Poe. I really hope he got paid for these articles.
Tarun Menta@_tarunmenta·
We've been training a lot of models @datalabto lately, and evaluating OCR quality is hard. Most benchmarks rely on brittle string-matching metrics. olmOCR-bench's unit-test approach has been the most trustworthy and aligns well with our manual judging - we love this benchmark!
Jake Poznanski@jakepoznanski

It's officially the week of OCR! I thought to share some of the lessons learned since olmOCR v1: 1. You need a reliable way to measure your performance. v1 was done on vibes-only, but it was hard to be confident in any changes we wanted to make. olmOCR-bench was our answer.
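The unit-test idea praised above — asserting targeted properties of the OCR output rather than exact string matches — can be illustrated with a toy example (hypothetical checks, not olmOCR-bench's actual test cases):

```python
def run_ocr_checks(ocr_text: str) -> dict:
    # Each check asserts one property of the output; a model passes or
    # fails each independently, which is far less brittle than comparing
    # full strings by edit distance.
    return {
        "has_title": "Annual Report" in ocr_text,
        "no_hallucinated_text": "Lorem ipsum" not in ocr_text,
        # Reading order: the title must come before the total line.
        "order_ok": ocr_text.find("Annual Report") < ocr_text.find("Total revenue"),
    }

sample = "Annual Report 2024\n...\nTotal revenue: $1M"
print(run_ocr_checks(sample))
```

Because each check is independent, a formatting quirk (extra whitespace, a reflowed line) doesn't zero out the score the way exact string matching would.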

Vik Paruchuri@VikParuchuri·
Introducing Chandra OCR - now available on the Datalab API: - Top scores in table and math benchmarks - Handles messy handwriting - Form support (incl. checkboxes) - 30+ language coverage - Full layout information - Open source (with HF + vLLM support) coming soon
Tarun Menta@_tarunmenta·
@protobluf @VikParuchuri @zach_nussbaum The model in our blog post is a smaller model available through Marker/Surya. Chandra is a larger model that decodes the full page in a single shot. We route between both models for the best results on our API!