Tarun Menta

37 posts

Tarun Menta
@_tarunmenta

Founding Research Engineer @datalabto | Ex @Adobe MDSR | @IITHyderabad `23

Brooklyn, NY · Joined August 2023
225 Following · 113 Followers
Tarun Menta retweeted
Vik Paruchuri@VikParuchuri·
We shipped Chandra OCR 1.5 to the Datalab API this week! This is a big improvement over 1.1, especially on tables, lists, charts, chemistry, and math.
Tarun Menta retweeted
Datalab@datalabto·
People evaluating OCR ask the same two questions: 1. Which models are actually good 2. How do they behave on my documents Benchmarks help with the first, but they rarely show you the qualitative side of model behavior across real layouts, scripts, bad scans, and messy photos. So we built two tools to solve this 🧵
Tarun Menta retweeted
Datalab@datalabto·
Launch Week - Day 4: Spreadsheet Parsing 🚀 We’ve added native spreadsheet support to the Datalab API. Spreadsheets look structured until you hit real-world files: - staggered / overlapping tables - sparse regions with fake separators - hidden columns, merged cells - stray cells that break heuristics - images of tables (why?) Getting reliable structure out of grids is genuinely hard.
Asfi@AsfiShaheen·
@VikParuchuri @datalabto @_tarunmenta Is this available by default when I call your API, or do I need a setting for it? I have a long job running as we speak, so I'm wondering if I should pause it. It's a super useful feature.
Vik Paruchuri@VikParuchuri·
Section headers can make or break the accuracy of your RAG system. Today, we're shipping a feature that outputs perfect header levels, even across hundreds of pages.
Tarun Menta retweeted
Datalab@datalabto·
We're kicking off December with Launch week 🚀 Day 1: Chandra 1.1, our latest upgrade to Datalab’s SoTA OCR model. Massive improvements across layout, math, tables, and multilingual performance 🧵
Tarun Menta retweeted
Vik Paruchuri@VikParuchuri·
Chandra OCR scores 93.9 on the OlmOCR benchmark, if we correct for minor formatting differences. Let's discuss what this says about OCR benchmarking 🧵
Tarun Menta retweeted
Datalab@datalabto·
We ran an experiment to test a simple idea: better OCR leads to better structured extraction. Modern LLMs can follow schemas and parse complex documents, but only if they can actually read them. On real-world invoices with rotated scans, dense tables, and skewed text, that’s not always the case. 🧵 to see how Gemini 2.5 Flash and GPT 5-mini performed in our research:
Akshay 🚀@akshay_pachaar·
Everyone is sleeping on this new OCR model! Datalab's Chandra topped independent benchmarks and beat the previous best, dots.ocr. - Support for 40+ languages - Handles text, tables, formulas seamlessly I tested it on Ramanujan's handwritten letter from 1913. 100% open-source.
Tarun Menta retweeted
Modal@modal·
We're excited to collaborate with @datalabto to make high-throughput document intelligence accessible to all developers. Instantly deploy and scale Datalab's best-in-class Marker pipeline on Modal GPUs 📑
Tarun Menta retweeted
Datalab@datalabto·
We just released Workflows (beta) — a way to chain document-processing steps like parsing, extraction, segmentation, and conditional logic into a single, reusable pipeline. Read more here → datalab.to/blog/build-doc…
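Datalab's actual Workflows API isn't shown here, but as a rough mental model, chaining document-processing steps with conditional logic into one reusable pipeline can be sketched like this (all names and step bodies are hypothetical stand-ins):

```python
from typing import Callable

# Hypothetical pipeline: each step takes and returns a dict,
# so steps compose into a single reusable chain.
Step = Callable[[dict], dict]

def parse(doc: dict) -> dict:
    # Stand-in for OCR/parsing: split raw text into line blocks.
    doc["blocks"] = doc["raw"].splitlines()
    return doc

def extract(doc: dict) -> dict:
    # Stand-in for structured extraction: pull lines that look like totals.
    doc["totals"] = [b for b in doc["blocks"] if b.lower().startswith("total")]
    return doc

def run_pipeline(doc: dict, steps: list[Step]) -> dict:
    for step in steps:
        doc = step(doc)
        if not doc["blocks"]:  # conditional logic: stop early on an empty parse
            break
    return doc

result = run_pipeline({"raw": "Invoice #42\nTotal: $19.99"}, [parse, extract])
print(result["totals"])  # -> ['Total: $19.99']
```

The payoff of this shape is that parsing, extraction, and any segmentation steps share one interface, so pipelines can be rearranged or reused without glue code.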
Tarun Menta@_tarunmenta·
@martian_2090 @VikParuchuri We convert Chandra's output to plain HTML, so this won't be represented. However, the bounding boxes are accurately extracted too (seen in the visualizations), so it's possible with some heuristics plus basic CSS :)
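The heuristic hinted at here — comparing a block's bounding box against the page width to recover alignment — could be sketched like this (a hypothetical helper, not part of the Datalab API):

```python
def infer_alignment(bbox, page_width, tol=0.05):
    """Guess text alignment from a bounding box (x0, y0, x1, y1).

    Compares the left and right margins relative to the page width;
    roughly equal margins suggest centered text.
    """
    x0, _, x1, _ = bbox
    left, right = x0, page_width - x1
    if abs(left - right) <= tol * page_width:
        return "center"
    return "left" if left < right else "right"

# Map the guess to basic CSS, as the reply suggests.
css = {
    "left": "text-align: left",
    "center": "text-align: center",
    "right": "text-align: right",
}

print(infer_alignment((400, 50, 600, 80), 1000))  # equal 400px margins -> center
```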
Pavan Kumar Singh@martian_2090·
@VikParuchuri Thanks @VikParuchuri for open-sourcing this wonderful model. I can see that text extraction is very accurate, but the placement of text, especially in headers, is always left-aligned after OCR even though it is center- or right-aligned in the original image.
Vik Paruchuri@VikParuchuri·
Many famous authors had messy handwriting. Let's see how well Chandra OCR does on some samples. First up, Edgar Allan Poe. I really hope he got paid for these articles.
Tarun Menta@_tarunmenta·
We've been training a lot of models @datalabto lately, and evaluating OCR quality is hard. Most benchmarks rely on brittle string-matching metrics. olmOCR-bench's unit-test approach has been the most trustworthy and aligns well with our manual judging - we love this benchmark!
Jake Poznanski@jakepoznanski

It's officially the week of OCR! I thought to share some of the lessons learned since olmOCR v1: 1. You need a reliable way to measure your performance. v1 was done on vibes-only, but it was hard to be confident in any changes we wanted to make. olmOCR-bench was our answer.
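The unit-test idea praised above — asserting targeted properties of the OCR output rather than exact string matches — can be illustrated with a toy example (hypothetical checks, not olmOCR-bench's actual test cases):

```python
def run_ocr_checks(ocr_text: str) -> dict:
    # Each check asserts one property of the output; a model passes or
    # fails each independently, which is far less brittle than comparing
    # full strings by edit distance.
    return {
        "has_title": "Annual Report" in ocr_text,
        "no_hallucinated_text": "Lorem ipsum" not in ocr_text,
        # Reading order: the title must come before the total line.
        "order_ok": ocr_text.find("Annual Report") < ocr_text.find("Total revenue"),
    }

sample = "Annual Report 2024\n...\nTotal revenue: $1M"
print(run_ocr_checks(sample))
```

Because each check is independent, a formatting quirk (extra whitespace, a reflowed line) doesn't zero out the score the way exact string matching would.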

Vik Paruchuri@VikParuchuri·
Introducing Chandra OCR - now available on the Datalab API: - Top scores in table and math benchmarks - Handles messy handwriting - Form support (incl. checkboxes) - 30+ language coverage - Full layout information - Open source (with HF + vLLM support) coming soon
Tarun Menta@_tarunmenta·
@protobluf @VikParuchuri @zach_nussbaum The model in our blog post is a smaller model available through Marker/Surya. Chandra is a larger model that decodes the full page in a single shot. We route between both models for the best results on our API!