

Daniel van Strien

@vanstriendaniel
Machine Learning Librarian @huggingface 🤗 I like datasets.









I'm excited to open source Chandra OCR 2!
- 85.9% (sota) on olmocr bench
- 90+ language support w/benchmarks
- 4B model (down from 9B)
- Full layout information
- Extracts + captions images and diagrams
- Strong handwriting, math, form, table support






Who's going to create an open dataset and model for this task and share it on @huggingface?

First attempt at replicating an open dataset to help train an open context compaction model. Claude Code did this one using @nvidia NeMo DataDesigner + @huggingface Inference Providers (Kimi-K2 via @GroqInc). Hopefully someone else (or their agent) can do a better job!

Introducing FlashCompact - the first specialized model for context compaction
33k tokens/sec
200k → 50k in ~1.5s
Fast, high quality compaction

Due to popular demand, I've updated this figure to include DeepSeek-V2 and Mistral Large 2. It's also more zoomed for readability.



One of the nicest things about @nvidia model releases is that they ship the training data. What does it look like? I sampled 250k examples from 24 datasets in the Nemotron post-training v3 collection and built an interactive Embedding Atlas to explore it.
