
If you regularly transcribe audio, @cohere Transcribe was just released - it's a free, open-source model that runs locally and is definitely worth checking out. I ran some tests against OpenAI's Whisper (which powers ChatGPT and many other apps).
I used Steve Jobs' 2005 Stanford Commencement Address (15 min) on YouTube as the test video. Both models ran locally on a MacBook M4.
Some highlights of what each model heard:
Cohere: "I learned about serif and sans serif typefaces"
Whisper: "I learned about Sarah and Sans Sarah of typefaces"
Cohere: "Bob Noyce"
Whisper: "Bob Nois"
Cohere: "tried to apologize for screwing up so badly"
Whisper: "tried to apologize for sparing up so badly"
I also tested Whisper's largest model (1.55B parameters) to get a closer comparison to Cohere's 2B parameters. It fixed some of the name errors but started repeating phrases and took much longer.
How they compared:
- Cohere (2B params): 119 seconds, ~98% accuracy
- Whisper base (74M params): 69 seconds, ~90% accuracy
- Whisper large (1.55B params): 915 seconds, ~93% accuracy
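For anyone who wants to put a number on a comparison like this, a common approach is word error rate (WER), with accuracy taken as roughly 1 - WER. This is only a minimal sketch and not necessarily how the figures above were produced; the function name `word_error_rate` and the lowercase, whitespace-split normalization are assumptions for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count.

    Assumption: lowercase + whitespace split is the only normalization.
    Punctuation is left attached to words.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # Levenshtein distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)


# Treat the Cohere line above as the reference for this one sentence.
reference = "I learned about serif and sans serif typefaces"
hypothesis = "I learned about Sarah and Sans Sarah of typefaces"
print(word_error_rate(reference, hypothesis))  # 0.375 (2 substitutions + 1 extra word, over 8 words)
```

On a full transcript you would want a trusted reference text and stricter normalization (punctuation, numbers, contractions) before reading much into the result.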
Full side-by-side transcript comparison: github.com/jaxson/tests-p…
(Note: I believe some of the differences in word count stem from hallucination loops that Whisper ran into.)
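If you want to check a transcript for loops like that, one rough heuristic is to look for the same short phrase repeated back to back. This is a sketch under assumptions, not something used in the test above: the helper name `find_repeated_phrases`, the 4-word window, and the 3-repeat threshold are all hypothetical choices, and the sample text is made up.

```python
def find_repeated_phrases(text: str, n: int = 4, min_repeats: int = 3):
    """Return (phrase, count) for n-word phrases repeated consecutively.

    Assumption: a "loop" means the exact same n words occurring at least
    min_repeats times in a row, after lowercasing.
    """
    words = text.lower().split()
    hits = []
    i = 0
    while i + n <= len(words):
        phrase = words[i:i + n]
        repeats = 1
        # Count how many times the phrase immediately follows itself.
        while words[i + repeats * n:i + (repeats + 1) * n] == phrase:
            repeats += 1
        if repeats >= min_repeats:
            hits.append((" ".join(phrase), repeats))
            i += repeats * n  # skip past the whole loop
        else:
            i += 1
    return hits


# Made-up sample, not actual Whisper output.
sample = "and that was the " * 3 + "end of the story"
print(find_repeated_phrases(sample))  # [('and that was the', 3)]
```

Genuine repetition in a speech can trigger this too, so treat any hit as a place to look rather than proof of a hallucination.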
Cohere Transcribe Model on Hugging Face: huggingface.co/CohereLabs/coh…
Test video on YouTube: youtube.com/watch?v=UF8uR6…
* Results may vary based on hardware, audio quality, and content. This is a very unscientific test!
** Audio clips used under fair use for commentary/analysis. All rights belong to their respective owners.

