Daniel van Strien

4.7K posts

@vanstriendaniel

Machine Learning Librarian @huggingface 🤗 I like datasets.

Scotland, joined September 2014
1.5K Following, 5.5K Followers
Daniel van Strien @vanstriendaniel
Bunch of new open OCR models recently, all available as uv scripts on @huggingface. 19 models from 0.9B–8B. Some standouts:
- Qianfan-OCR: 192 languages
- dots.mocr: charts/figures → editable SVG
- GLM-OCR: 94.6% accuracy, only 0.9B params
Daniel van Strien @vanstriendaniel
Is olmOCR-bench getting close to saturation? Top score is now 85.9%. Yesterday @datalabto took #1 with chandra-ocr-2. A year ago, the best was 79. Visualised the race to get there using @huggingface leaderboard data
Daniel van Strien retweeted
Nathan @nathanhabib1011
NEW SOTA OCR MODEL DROPPED
Congrats to @VikParuchuri and team for releasing Chandra OCR 2!
- 85.9% on olmocr bench, making it first place 🏆
- 90+ language support
- 4B model
- Full layout information
- Extracts + captions images and diagrams
- Strong handwriting, math, form, table support
Compare every OCR model on the hub and choose the one adapted to your needs 👇
Daniel van Strien retweeted
Vik Paruchuri @VikParuchuri
I'm excited to open source Chandra OCR 2!
- 85.9% (sota) on olmocr bench
- 90+ language support w/benchmarks
- 4B model (down from 9B)
- Full layout information
- Extracts + captions images and diagrams
- Strong handwriting, math, form, table support
Daniel van Strien @vanstriendaniel
@fujikanaeda @nvidia @huggingface @GroqInc Seemed to do a good job with the pipelines without much prompting! Suspect for a more standard task, it would do even better (last time I tried a few months ago, it definitely struggled more).
Daniel van Strien retweeted
alphaXiv @askalphaxiv
Introducing MCP for arXiv
Let your research agents stand on the shoulders of giants
Fast multi-turn retrieval, keyword search, and embedding search tools across millions of arXiv papers 🚀
Daniel van Strien @vanstriendaniel
One of the nicest things about @nvidia model releases is that they ship the training data. What does it look like? I sampled 250k examples from 24 datasets in the Nemotron post-training v3 collection and built an interactive Embedding Atlas to explore it.
Daniel van Strien @vanstriendaniel
@mvansegb @nvidia This is for sure very valuable too! Would be quite excited to see the MCP pipelines being used to generate training data for domain-specific MCP/tool use!
Maarten Van Segbroeck
Exactly. Transparency is a big priority for our Nemotron releases. Not only did we open the training data, but we’re also open-sourcing the SDG recipes behind it as much as possible. We just pushed the recipes for Text-to-Code and Agentic Search, which you can check out here: github.com/NVIDIA-NeMo/Da…