Daniel van Strien (@vanstriendaniel) - Twitter Profile

Daniel van Strien@vanstriendaniel·18h

@jbenton @staghado @allen_ai Depends on the licence you need, but the Chandra and Surya models from @datalabto are worth trying

Joshua Benton (@joshuabenton.com on Bluesky)@jbenton·5d

@vanstriendaniel @staghado @allen_ai If you were working with a lot of multicolumn scans of 19th-c newspapers, what model would you use?

Daniel van Strien@vanstriendaniel·5d

I ran 10 newer OCR models on @allen_ai's olmOCR-bench "old scans" subset. The ranking flips depending on what you actually want. On the headline score, PaddleOCR-VL beats NuExtract3 (38.6 vs 37.8). But rank by how much of the page each model actually reads, and NuExtract3 is well ahead (41.6 vs 31.2). Same two models, opposite order. The score rewards dropping boilerplate, i.e. letterheads, stamps, page numbers, so a model that reads the page more faithfully can rank lower. IMO this is because a lot of VLM-based OCR models were made to provide tokens for training. It's less useful if you want faithful OCR of the whole page, like an archive where the letterhead is part of the record. Two other things: a 1B model (LightOnOCR-2) has the best raw transcription in the field, and PaddleOCR-VL 1.6 sometimes hallucinates Chinese characters on English scans.
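A minimal sketch of that flip, using only the four numbers quoted above. The metric keys `headline` and `page_read` are labels made up for this example, not olmOCR-bench field names.

```python
# Scores quoted in the post; the metric names are illustrative labels only.
scores = {
    "PaddleOCR-VL": {"headline": 38.6, "page_read": 31.2},
    "NuExtract3": {"headline": 37.8, "page_read": 41.6},
}


def rank_by(metric):
    """Return model names sorted best-first on the given metric."""
    return sorted(scores, key=lambda name: scores[name][metric], reverse=True)


headline_order = rank_by("headline")
page_read_order = rank_by("page_read")

print("By headline score:", headline_order)
print("By share of page read:", page_read_order)
```

Sorting the same two models on each metric returns them in opposite order, which is why a single top-level score can hide the behaviour you care about.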

Daniel van Strien@vanstriendaniel·18h

@dannyhoek @allen_ai Yes, both are also very nice models. The Chandra licence is a bit tricky for some use cases, but if it's not a concern it's definitely one to try first imo

Danny Hoek@dannyhoek·5d

@vanstriendaniel @allen_ai What about Chandra 2 and dots.mocr? I'm really impressed by the quality of those two models.

Daniel van Strien@vanstriendaniel·18h

@hu_yifei Would be excited to see you ship some open models again. Your OCR VLMs were very nice in the past!

Yifei Hu@hu_yifei·2d

From a reliable source: Reducto is developing new models.

Daniel van Strien@vanstriendaniel·18h

@fujikanaeda Congratulations on the great work and curious to see what's next!

Eric W. Tramel@fujikanaeda·2d

Today marks the end of my last week at Nvidia. I joined with the rest of the excellent Gretel team when we were acquired back in April of 2025, which now seems like 10 years ago thanks to the time-dilation of AI progress. Over that year, I got to do quite a lot: help build the best synthetic data generation tool in the industry (NeMo Data Designer), scale it out for pre- & post-training datasets for Nemotron by building some slick cluster tooling, contribute to 4 (!!) Nemotron LLM builds (Nvidia doesn’t mess around with open models), take a small merging experiment reproduction from small scale (30B 😅) up to 550B and pass along some pretty significant eval compute savings as a consequence for our pretraining heroes, pull my hair out over the state of public evals & benchmarks, and most importantly, collaborate with some of the best researchers and engineers in the field, all working to advance the frontier of open-source AI and build the best computing platforms in the world to support it. Thank you to everyone @ Nvidia & Nemotron for welcoming me into Team Green 💚. Thank you to the Gretelers for your support over this journey 💜.

Daniel van Strien@vanstriendaniel·1d

@YichiZ03 Thanks! Will keep playing around with this very nice model!

Yichi Zhang@YichiZ03·1d

Thank you Daniel for trying MOSS-Transcribe-Diarize — and for such a detailed, generous writeup. The MTD model's core strength is long audio input, and the test case is exactly what it is designed for. Your engineering notes are really valuable to us. Really appreciate you pushing the model. Feel free to open issues or reach out — we'd love to keep in touch.

Daniel van Strien@vanstriendaniel

This week @Open_MOSS released MOSS-Transcribe-Diarize, a 0.9B open model (Apache 2.0) that transcribes, diarizes, and timestamps in a single pass. I used it to make 174 hours of Apollo 11 mission audio searchable by who said what, when. Total cost: $9.46. These are the real NASA tapes from July 1969, hosted by the @internetarchive. All 103 run through one @huggingface Job: a100-large, 3.8h, 47x realtime, @sgl_project serving the model inside the job. 45,355 timestamped speaker segments. Search the mission and hear any moment from the original tape! Space: huggingface.co/spaces/davanst…

Daniel van Strien retweeted

Chayenne Zhao@GenAI_is_real·1d

Thank you Daniel — seeing sglang-omni power something this creative made our week. At the risk of self-promotion: we work closely with the OpenMOSS team, and every sglang-omni commit runs short- and long-sequence multi-speaker tests on MOSS-Transcribe-Diarize (benchmark docs are in the repo), so performance and correctness only move one way. CI this aggressive benefits every user — the only people it tortures are the joint MOSS × SGLang dev team 😂 The missing offline engine is deliberate: an in-process engine pushes batch assembly onto the user, while the server gives you continuous batching for free — your 47x with ~6 tapes in flight vs 3.2x sequential is exactly that effect. I've argued the same in RL infra, and the field has largely converged on server-based rollouts. A server is less ergonomic than a Python API, but the learning curve is small — your subprocess-in-a-Job recipe is exactly the pattern we recommend. The ~62-min serving cap and the mid-tape stops are both worth chasing down — mind opening issues with your configs? Small teaser: DP + MPS on a single GPU should land early next week, bringing several-fold TTS/ASR throughput gains on H100-class cards. Dozens of collaborators across the US, China, Singapore and the UAE built this — keep the feedback coming!

Daniel van Strien@vanstriendaniel·1d

Script + full recipe (plus Cohere Transcribe and word-level alignment variants): huggingface.co/datasets/uv-sc…

Daniel van Strien@vanstriendaniel·1d

The underused @huggingface Jobs trick behind this: don't just run a script in the job. Serve the model inside it. sglang server + async driver in one uv script, run from its Hub URL. Continuous batching turned 3.2x realtime into 47x on the same GPU. Works for any 'big pile of files through a model' problem, not just moon landings!
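A rough sketch of the "async driver" half of that pattern, not the actual script from the recipe. The `run_all` helper and `fake_transcribe` are hypothetical names for this example; in a real job `fake_transcribe` would be replaced by a request to the server started inside the job. The default of 6 concurrent requests is an assumption echoing the "~6 tapes in flight" figure mentioned in the replies.

```python
import asyncio


async def run_all(files, transcribe, max_in_flight=6):
    """Send every file to `transcribe`, keeping at most `max_in_flight` requests open."""
    gate = asyncio.Semaphore(max_in_flight)

    async def one(path):
        # The semaphore bounds concurrency so the server always has several
        # requests to batch together without being flooded.
        async with gate:
            return path, await transcribe(path)

    pairs = await asyncio.gather(*(one(p) for p in files))
    return dict(pairs)


async def fake_transcribe(path):
    # Hypothetical stand-in for a request to the model server inside the job.
    await asyncio.sleep(0.01)
    return f"transcript of {path}"


files = [f"tape_{i}.mp3" for i in range(20)]
results = asyncio.run(run_all(files, fake_transcribe))
print(len(results), "files processed")
```

Keeping several requests open at once is what gives the server something to batch. With the figures quoted above, 47x against 3.2x is roughly a 14.7x throughput gain on the same GPU.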

Daniel van Strien@vanstriendaniel·3d

Run it on your own audio in one command. The recipe (batch + serve-inside-the-job variants): huggingface.co/datasets/uv-sc…

Daniel van Strien@vanstriendaniel·3d

This week @Open_MOSS released MOSS-Transcribe-Diarize, a 0.9B open model (Apache 2.0) that transcribes, diarizes, and timestamps in a single pass. I used it to make 174 hours of Apollo 11 mission audio searchable by who said what, when. Total cost: $9.46. These are the real NASA tapes from July 1969, hosted by the @internetarchive. All 103 run through one @huggingface Job: a100-large, 3.8h, 47x realtime, @sgl_project serving the model inside the job. 45,355 timestamped speaker segments. Search the mission and hear any moment from the original tape! Space: huggingface.co/spaces/davanst…
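A back-of-the-envelope check on those figures, assuming the quoted 3.8h is the wall-clock duration of the job and $9.46 covers that job alone. Both readings are assumptions, and the variable names are made up for this example.

```python
audio_hours = 174   # hours of mission audio (from the post)
job_hours = 3.8     # assumed to be the job's wall-clock duration
total_cost = 9.46   # USD (from the post)

implied_speed = audio_hours / job_hours         # audio hours processed per job hour
cost_per_audio_hour = total_cost / audio_hours  # USD per hour of audio
implied_hourly_rate = total_cost / job_hours    # USD per hour of job time

print(f"{implied_speed:.1f}x realtime")
print(f"${cost_per_audio_hour:.3f} per audio hour")
print(f"${implied_hourly_rate:.2f} per job hour")
```

Under these assumptions the implied speed comes out at about 45.8x, slightly below the quoted 47x. The gap may simply be rounding or a difference in what each figure measures; that is a guess, not something the post states.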

Daniel van Strien retweeted

Matei Zaharia@matei_zaharia·4d

3) Harnesses make a huge difference in cost-performance. The very simple Pi harness (@badlogicgames) got the same success rate as harnesses from the LLM vendors with Opus and GPT 5.5, but at 2x less cost! Seems to be mainly due to smaller inputs to the LLM.

Daniel van Strien retweeted

Georgi Gerganov@ggerganov·4d

llama.cpp recently added DFlash support to its speculative decoding arsenal. Along with MTP, Eagle3 and various ngram-based techniques, local model performance takes another step up. Special thanks to the NVIDIA team and Ruixiang Wang specifically for leading this effort! github.com/ggml-org/llama…

Daniel van Strien retweeted

Harry Mellor@hmellor_·4d

I have HUGE news about the Transformers modelling backend for @vllm_project v0.25.0 🚀 It has reached performance parity with native vLLM model implementations 🤯 The Transformers modelling backend has just become a zero-effort, zero-compromise way to deploy to vLLM!

Daniel van Strien retweeted

Quentin Lhoest 🤗@lhoestq·4d

Breaking: @huggingface and @CommonCrawl partnered to democratize access to the largest dataset for AI. You can now load Common Crawl in one LoC from ANYWHERE and for FREE thx to pre-warmed CDN in multi-region/multi-cloud, no data movement fees, and to @daftengine @everettkleven

Daniel van Strien retweeted

Teknium 🪽@Teknium·5d

Hermes Agent can now export your agent sessions, or sets of sessions, into a variety of formats and places. Get full conversations out in HTML, Markdown, JSON and more, or upload entire datasets of your sessions to private @huggingface repos with ease. You also get all the filters we added for session pruning to help with selection, so you can collect sessions by model, date range, conversation source (e.g. cronjob or telegram), and much more! `hermes update` and you'll have full control over your data to export, inspect, share, and store however you like!

Daniel van Strien@vanstriendaniel·5d

@staghado @slimcat0101 @allen_ai Yeah I like the idea of lots of unit tests. Also makes it easier to see where a model is failing and if you care about that failure mode or not. IMO it would be nicer if these didn't then always end up mostly reported on the top level score though.

Said Taghadouini@staghado·5d

@slimcat0101 @vanstriendaniel @allen_ai To be clear, olmOCR-bench has its flaws and I'm not saying it's perfect, but OmniDocBench hardcodes a single ground truth for the full page, so it's easily hackable by SFT'ing on similar text. Unit tests are much harder to game, although it's possible (see the mistral/chandra processing tricks).

Daniel van Strien@vanstriendaniel·5d

@slimcat0101 @allen_ai "a massive 100-year OCR survey": this sounds great! Have this long-term idea that you could write a pretty good history of ML/AI by writing a history of OCR!

Cheng Cui@slimcat0101·5d

@vanstriendaniel @allen_ai Yeah, OCR evaluation is still broken—no benchmark out there gives a true picture of model capacity. OmniDocBench is a great effort but far from perfect. Long road ahead. We’ll actually be touching on this in a massive 100-year OCR survey we're dropping soon.

Daniel van Strien@vanstriendaniel·5d

@slimcat0101 @allen_ai Yeah, this is also my impression, but a lot of people still seem to be reporting on it. IMO benchmarks often don't translate, so I'm often using github.com/davanstrien/oc… to do a VLM ranking + human review. I do think there is space for some new public OCR benchmarks though.

Cheng Cui@slimcat0101·5d

@vanstriendaniel @allen_ai We stopped tracking that eval set btw. Our analysis showed it’s not representative of actual model capabilities at all.

Daniel van Strien@vanstriendaniel·5d

Write-up, plus how I ran the whole thing on HF Jobs, no local GPU: danielvanstrien.xyz/posts/2026/ocr…
