
Introducing Eleven v3 (alpha) - the most expressive Text to Speech model ever. Supporting 70+ languages, multi-speaker dialogue, and audio tags such as [excited], [sighs], [laughing], and [whispers]. Now in public alpha and 80% off in June.
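As a quick illustration of the audio-tag format, here is a short Python sketch building a tagged, multi-speaker script. The tags come from the announcement itself; the speaker labeling and where the string would be sent are assumptions about usage, not a documented API format.

```python
# Illustrative only: inline audio tags from the announcement, arranged as a
# two-speaker script. The layout is an assumption, not a documented format.
script = (
    "Speaker 1: [excited] The v3 alpha is live!\n"
    "Speaker 2: [laughing] Already? [whispers] In all 70+ languages?\n"
    "Speaker 1: [sighs] That's what a public alpha is for."
)
print(script)  # this string would be sent as the text of a TTS request
```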

I built an app that lets you talk to statues. Naturally, I took it for a spin at the British Museum. Full conversations in the thread.

Announcing AA-WER v2.0, our Speech to Text accuracy benchmark, and AA-AgentTalk, a new proprietary dataset focused on speech directed at voice agents.

AA-AgentTalk focuses on the speech that matters most to voice agents, and as a held-out, proprietary dataset it mitigates the risk of models training to perform well on public test sets. Leading public Speech to Text datasets contain errors in their reference transcripts, where the ground truth doesn't match what was actually said. We've manually corrected these and are open-sourcing cleaned versions of VoxPopuli and Earnings22 on Hugging Face.

What's changed in v2.0:

➤ New held-out, proprietary dataset - AA-AgentTalk (50% weighting): 469 samples (~250 minutes) of speech directed at voice agents, kept private so models can't train on it. Spans voice agent & call center interaction, AI agent interaction, industry jargon, meetings, consumer & personal, and media content across 17 accent groups, 8 speaking styles, and a mix of devices and environments.

➤ Cleaned transcripts for existing public datasets: We identified errors in the original ground truth transcriptions for the public datasets VoxPopuli and Earnings22 - instances where reference transcripts didn't accurately capture what was actually said. Inaccurate ground truth unfairly penalizes models that correctly transcribe the audio, so we manually reviewed and created cleaned versions, VoxPopuli-Cleaned-AA and Earnings22-Cleaned-AA.

➤ Removal of AMI-SDM: We removed the AMI-SDM dataset because its transcript errors were too extensive to correct without making a large number of judgment calls we weren't comfortable with (e.g., heavily overlapping speech).

➤ Improved text normalization: We developed a custom text normalizer building on OpenAI’s whisper normalizer package to reduce WER that is inflated by formatting differences rather than genuine transcription errors. Key fixes include digit splitting to prevent number grouping mismatches (e.g., 1405 553 272 vs. 1405553272), preservation of leading zeros, normalization of spoken symbols (e.g., “+”, “_”), stripping redundant :00 in times (e.g., 7:00pm vs. 7pm), additional US/UK English spelling equivalences (e.g., totalled vs. totaled), and accepted equivalent spellings for ambiguous proper nouns in our dataset (e.g., Mateo vs. Matteo). This ensures models are evaluated on actual transcription accuracy rather than surface-level formatting choices. A sketch of this kind of normalizer follows below.

The new weighting is 50% AA-AgentTalk, 25% VoxPopuli-Cleaned-AA, 25% Earnings22-Cleaned-AA.

Key results: @elevenlabs's Scribe v2 leads at 2.3% AA-WER v2.0, followed by @GoogleDeepMind's Gemini 3 Pro at 2.9%, @MistralAI's Voxtral Small at 3.0%, Google's Gemini 3 Flash at 3.1%, and ElevenLabs Scribe v1 at 3.2%. ElevenLabs Scribe v2 leads on two of the three component datasets, AA-AgentTalk and Earnings22-Cleaned-AA, while Google's Gemini 3 Pro leads on VoxPopuli-Cleaned-AA. See below for further detail.
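To make the normalization concrete, here is a minimal Python sketch layered on the EnglishTextNormalizer from OpenAI's openai-whisper package, covering three of the fixes above (time stripping, US/UK spellings, digit splitting). The class name and rule set are illustrative assumptions, not the benchmark's actual normalizer.

```python
# Illustrative sketch only: a WER text normalizer layered on Whisper's
# EnglishTextNormalizer, showing a few of the fixes described above.
import re
from whisper.normalizers import EnglishTextNormalizer  # pip install openai-whisper

# Hypothetical US/UK spelling equivalences; the real list would be larger.
US_UK_SPELLINGS = {"totalled": "totaled", "travelled": "traveled"}

class SketchNormalizer:
    def __init__(self) -> None:
        self.base = EnglishTextNormalizer()

    def __call__(self, text: str) -> str:
        # Strip redundant ":00" in times, e.g. "7:00pm" -> "7pm".
        text = re.sub(r"\b(\d{1,2}):00(?=[^\d]|$)", r"\1", text)
        # Map UK spellings onto US equivalents so they compare equal.
        text = " ".join(US_UK_SPELLINGS.get(w, w) for w in text.split())
        # Apply Whisper's standard English normalization rules.
        text = self.base(text)
        # Digit splitting: every digit becomes its own token, so number
        # grouping differences no longer count as word errors.
        return " ".join(re.sub(r"(\d)", r"\1 ", text).split())

norm = SketchNormalizer()
print(norm("1405 553 272") == norm("1405553272"))  # True under these rules
```

Under these rules, "1405 553 272" and "1405553272" normalize to the same token sequence, so a grouping difference no longer penalizes a model that transcribed the number correctly.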

We raised $500M at an $11B valuation to transform how people interact with technology.

Today we’re introducing Scribe v2: the most accurate transcription model ever released. While Scribe v2 Realtime is optimized for ultra-low latency and agent use cases, Scribe v2 is built for batch transcription, subtitling, and captioning at scale.



LeJEPA: a novel pretraining paradigm free of the (many) heuristics we relied on (stop-grad, teacher, ...) - 60+ arch., up to 2B params - 10+ datasets - in-domain training (>DINOv3) - corr(train loss, test perf)=95% Paper: arxiv.org/pdf/2511.08544 Code: github.com/rbalestr-lab/l…





three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)
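To ground the claim, here is a hedged PyTorch-style sketch of the RAE idea as the tweet frames it: the latent space comes from a frozen pretrained representation encoder (the role a VAE encoder used to play), and only a decoder is trained to map those representations back to pixels. Class and variable names are illustrative, and the training note is an assumption rather than the paper's exact recipe.

```python
# Hedged sketch of a Representation Autoencoder (RAE): a frozen pretrained
# representation encoder replaces the VAE, and only the decoder is trained.
# Names here are illustrative, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RAE(nn.Module):
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder.eval()  # pretrained (e.g., a ViT); kept frozen
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        self.decoder = decoder         # trained from scratch to invert it

    @torch.no_grad()
    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Deterministic latents: no sampling and no KL term, unlike a VAE.
        return self.encoder(x)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encode(x))

def decoder_loss(rae: RAE, x: torch.Tensor) -> torch.Tensor:
    # Only the decoder receives gradients; a simple pixel reconstruction
    # objective stands in for whatever loss the paper actually uses.
    return F.mse_loss(rae(x), x)
```

The diffusion transformer would then operate in this frozen representation space rather than in VAE latents.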


