최형석 (Hyeong-Seok Choi)

1.3K posts


@92HsChoi

Love almost everything related to music. Research @elevenlabsio. Previously Co-founder and Research Lead @ Supertone, PhD @ Seoul National University, MARG

Seoul, South Korea · Joined December 2018
489 Following · 984 Followers
Pinned Tweet
최형석 (Hyeong-Seok Choi)
V3 is out! After a long wait, we’re finally releasing what our research team has been building. Try it out, and it will give you the most fascinating experience you’ve ever had with a TTS model.
ElevenLabs@ElevenLabs

Introducing Eleven v3 (alpha) - the most expressive Text to Speech model ever. Supporting 70+ languages, multi-speaker dialogue, and audio tags such as [excited], [sighs], [laughing], and [whispers]. Now in public alpha and 80% off in June.
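For the curious, here is roughly what driving those inline audio tags through the public ElevenLabs text-to-speech HTTP endpoint looks like in Python. This is a minimal sketch, not official sample code: the voice ID and API key are placeholders, and the "eleven_v3" model identifier is an assumption based on the announcement, so verify both against the current API docs.

```python
# Minimal sketch: ElevenLabs text-to-speech over HTTP with inline audio tags.
# YOUR_VOICE_ID / YOUR_API_KEY are placeholders; "eleven_v3" is an assumed
# model identifier for the v3 alpha. Check the API docs for exact values.
import requests

resp = requests.post(
    "https://api.elevenlabs.io/v1/text-to-speech/YOUR_VOICE_ID",
    headers={"xi-api-key": "YOUR_API_KEY"},
    json={
        "model_id": "eleven_v3",
        "text": "[whispers] Did you hear that? [excited] It finally shipped! [laughs]",
    },
)
resp.raise_for_status()
with open("v3_demo.mp3", "wb") as f:
    f.write(resp.content)  # endpoint returns audio bytes (MP3 by default)
```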

최형석 (Hyeong-Seok Choi) retweeted
ElevenLabs Developers @ElevenLabsDevs
Joe from our Growth team built an app that lets you talk to any statue. It uses ElevenAgents, Voice Designer, and OpenAI to identify and create a clone of the characters in a photographed statue. Here’s how to build your own: elevenlabs.io/blog/talk-to-a…
Joe Reeve - 🇬🇧/acc@isnit0

I built an app that lets you talk to statues. Naturally, I took it for a spin at the British Museum. Full conversations in the thread.
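The post links the full walkthrough, but the pipeline it describes can be sketched. Below is a rough, hypothetical outline: the OpenAI vision call follows the real chat-completions API shape, while design_voice_for is a stand-in for the ElevenLabs Voice Designer and Agents steps whose exact endpoints aren't shown here.

```python
# Rough sketch of the statue-app pipeline: a vision model identifies the
# statue and describes a persona, then a voice is designed for it.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("statue.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Who does this statue depict? Describe how they might sound."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
persona = resp.choices[0].message.content

def design_voice_for(description: str) -> str:
    # Hypothetical stand-in for the ElevenLabs Voice Designer + Agents setup;
    # the linked blog post walks through the real calls.
    return "voice_id_placeholder"

voice_id = design_voice_for(persona)
```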

최형석 (Hyeong-Seok Choi)
This is why rigorous evaluation matters, especially when you’re competing with state-of-the-art models. Great to see Scribe v2 out in front now, and by a solid margin 😌 (as our internal eval already suggested).
Artificial Analysis@ArtificialAnlys

Announcing AA-WER v2.0, our Speech to Text accuracy benchmark, and AA-AgentTalk, a new proprietary dataset focused on speech directed at voice agents.

AA-AgentTalk focuses on the speech that matters most to voice agents. As a held-out, proprietary dataset, AA-AgentTalk also mitigates the risk of models training to perform well on public test sets. Leading public Speech to Text datasets contain errors in their reference transcripts, where the ground truth doesn't match what was actually said. We've manually corrected these and are open-sourcing cleaned versions of VoxPopuli and Earnings22 on Hugging Face.

What's changed in v2.0:

➤ New held-out, proprietary dataset, AA-AgentTalk (50% weighting): 469 samples (~250 minutes) of speech directed at voice agents, and it's private so models can't train on it. Spans voice agent & call center interaction, AI agent interaction, industry jargon, meetings, consumer & personal, and media content across 17 accent groups, 8 speaking styles, and a mix of devices and environments.

➤ Cleaned transcripts for existing public datasets: We identified errors in the original ground truth transcriptions for the public datasets VoxPopuli and Earnings22, instances where reference transcripts didn't accurately capture what was actually said. Inaccurate ground truth unfairly penalizes models that correctly transcribe the audio, so we manually reviewed and created cleaned versions, VoxPopuli-Cleaned-AA and Earnings22-Cleaned-AA.

➤ Removal of AMI-SDM: We removed the AMI-SDM dataset as the transcript errors were too extensive to correct without making a large number of judgment calls we weren't comfortable with (e.g., heavily overlapping speech).

➤ Improved text normalization: We developed a custom text normalizer building on OpenAI’s whisper normalizer package to reduce artificially inflated WER from formatting differences rather than genuine transcription errors. Key fixes include digit splitting to prevent number grouping mismatches (e.g., 1405 553 272 vs. 1405553272), preservation of leading zeros, normalization of spoken symbols (e.g., “+”, “_”), stripping redundant :00 in times (e.g., 7:00pm vs. 7pm), adding additional US / UK English spelling equivalences (e.g., totalled vs. totaled), and accepting equivalent spellings for ambiguous proper nouns in our dataset (e.g., Mateo vs. Matteo). This ensures models are evaluated on actual transcription accuracy rather than surface-level formatting choices.

The new weighting is 50% AA-AgentTalk, 25% VoxPopuli-Cleaned-AA, 25% Earnings22-Cleaned-AA.

Key results: @elevenlabs's Scribe v2 leads at 2.3% AA-WER v2.0, followed by @GoogleDeepMind's Gemini 3 Pro at 2.9%, @MistralAI's Voxtral Small at 3.0%, Google's Gemini 3 Flash at 3.1%, and ElevenLabs Scribe v1 at 3.2%. ElevenLabs Scribe v2 leads on two of the three component datasets, AA-AgentTalk and Earnings22-Cleaned-AA, while Google's Gemini 3 Pro leads on VoxPopuli-Cleaned-AA. See below for further detail.
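To make the normalization point concrete: WER is word-level edit distance divided by reference length, so formatting mismatches count as real errors unless normalization removes them. The toy sketch below (my own illustration, not Artificial Analysis' actual normalizer or harness) shows how the digit-grouping and ":00" fixes mentioned above move a WER computation from nonzero to zero.

```python
# Toy WER with a light text normalizer, illustrating why normalization
# choices (digit grouping, ":00" in times) change the score.
import re

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"(\d):00(?=\D|$)", r"\1", text)  # "7:00pm" -> "7pm"
    text = re.sub(r"(?<=\d) (?=\d)", "", text)      # rejoin split digit groups
    text = re.sub(r"[^\w\s]", " ", text)            # drop punctuation
    return " ".join(text.split())

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = normalize(reference).split(), normalize(hypothesis).split()
    # Levenshtein distance over words: substitutions + insertions + deletions.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / max(len(ref), 1)

print(wer("The total was 1405 553 272 at 7:00pm",
          "the total was 1405553272 at 7pm"))  # 0.0 after normalization
```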

최형석 (Hyeong-Seok Choi) retweeted
ElevenLabs @ElevenLabs
At @sxsw, the ElevenLabs Impact Program will be premiering Eleven Voices, a new documentary series about losing your voice, and finding it again.
Randall Balestriero @randall_balestr
@92HsChoi @fdellaert That's a great question! Will look into those for our 2B model and put some figures for it on our GitHub, hopefully by next week
Frank Dellaert @fdellaert
LeJEPA is making waves. Having played around with DINO (which is great), I do appreciate the LeJEPA effort to take a heuristics-free, first-principles approach to SSL. Very promising are the small-dataset pre-training vs. transfer learning results in the paper. Modulo the Alg. 1 issue.
Randall Balestriero@randall_balestr

LeJEPA: a novel pretraining paradigm free of the (many) heuristics we relied on (stop-grad, teacher, ...) - 60+ arch., up to 2B params - 10+ datasets - in-domain training (>DINOv3) - corr(train loss, test perf)=95% Paper: arxiv.org/pdf/2511.08544 Code: github.com/rbalestr-lab/l…

최형석 (Hyeong-Seok Choi)
@randall_balestr @fdellaert Did it still have a nice dense feature map when visualized with PCA, even when you scaled to 2B? Curious because DINOv3 showed that simply scaling the model collapsed on this aspect and had to resort to some heuristics (the Gram loss thing)
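For readers following along: the check being asked about is usually done by projecting each patch's features onto their top three PCA components and rendering those as RGB, as in the DINO papers. A minimal sketch, assuming you already have an (h*w, dim) array of patch features from one image:

```python
# Sketch: visualize a dense ViT feature map by mapping per-patch features
# to RGB via their top-3 PCA components. `feats` is assumed to be an
# (h*w, dim) array of patch features extracted from a single image.
import numpy as np
from sklearn.decomposition import PCA

def pca_rgb(feats: np.ndarray, h: int, w: int) -> np.ndarray:
    rgb = PCA(n_components=3).fit_transform(feats)  # (h*w, 3)
    rgb -= rgb.min(axis=0)
    rgb /= rgb.max(axis=0) + 1e-8                   # per-channel scale to [0, 1]
    return rgb.reshape(h, w, 3)                     # view with plt.imshow

# A collapsed feature map shows up as a near-uniform color blob; a healthy
# one segments object parts and textures.
```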
Randall Balestriero @randall_balestr
Thank you! We're limited by arXiv's refresh rate, but the replacement with the correction will be up! Heuristic-free is our only hope to scale to new domains and to give academia back the opportunity to pretrain on their particular data! As we saw on the galaxy dataset, you can scale to billion-size pretraining data (DINO), but if you just use ImageNet-1k as the anchor to curate your dataset, you remain biased toward natural images.
최형석 (Hyeong-Seok Choi) retweeted
ElevenLabs @ElevenLabs
Introducing Scribe v2 Realtime – the most accurate real-time Speech to Text model. Built for voice agents, meeting notetakers, and live applications, it transcribes in 150ms across 90+ languages, including English, French, German, Italian, Spanish, Portuguese, Hindi, and Japanese. Available today via the API and through ElevenLabs Agents.
Sam Buchanan @_sdbuchanan
Very interesting new paper from @TongPetersb @sainingxie et al.! The RAE framework actually aligns quite well with the theoretical framework we lay out in our new book. RAE suggests performing "latent" diffusion in the (high dimensional) representation space of a pretrained image encoder, with diffusion transformers. In our book, we discuss how transformer-like architectures promote learning compressed representations of the input data distribution, via its patches, and how performing controllable generation with the data distribution (via diffusion) is far easier in this better-organized space!
Saining Xie@sainingxie

three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)
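As a deliberately toy schematic of the recipe described above (frozen pretrained encoder, trainable pixel decoder, diffusion directly in the feature space), here is my illustration of the idea, not the paper's code: the real RAE uses an actual pretrained ViT encoder and a DiT, and the flow-matching-style objective below is a stand-in.

```python
# Toy RAE schematic: freeze a stand-in "pretrained" encoder, train a decoder
# to reconstruct pixels, and denoise in the encoder's high-dim token space.
import torch
import torch.nn as nn
import torch.nn.functional as F

P, D = 16, 768                                   # patch size, feature dim

patchify = nn.Conv2d(3, D, kernel_size=P, stride=P)  # stand-in frozen encoder
for p in patchify.parameters():
    p.requires_grad_(False)

decoder = nn.Linear(D, 3 * P * P)                # trainable per-patch pixel decoder
dit = nn.Sequential(nn.Linear(D + 1, 4 * D), nn.GELU(), nn.Linear(4 * D, D))  # toy "DiT"

def encode(x):                                   # (B, 3, H, W) -> (B, N, D) tokens
    return patchify(x).flatten(2).transpose(1, 2)

imgs = torch.randn(2, 3, 224, 224)
z = encode(imgs)                                 # (2, 196, 768): high-dim, no compression

# 1) Reconstruction: only the decoder learns; the representation stays fixed.
targets = F.unfold(imgs, P, stride=P).transpose(1, 2)   # (2, 196, 3*P*P) pixel patches
rec_loss = F.mse_loss(decoder(z), targets)

# 2) Generation: denoise directly in the encoder's token space
#    (linear noise path, velocity prediction).
t = torch.rand(z.size(0), 1, 1)
noise = torch.randn_like(z)
z_t = (1 - t) * z + t * noise
pred = dit(torch.cat([z_t, t.expand(-1, z.size(1), 1)], dim=-1))
gen_loss = F.mse_loss(pred, noise - z)
```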

최형석 (Hyeong-Seok Choi)
@sedielem I see. Are you aware of methods that can measure intrinsic dimensionality regardless of effective dimensionality? Also, can we define the term intrinsic dimensionality somehow?
Sander Dieleman @sedielem
I believe this is still a compressed representation -- it is important to distinguish the effective and intrinsic dimensionality of these representations. I would expect the latter to be a lot lower, and more similar to the # channels in traditional AE-based representations (where the effective and intrinsic dimensionalities are usually much closer).
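To ground the distinction: a common proxy for effective dimensionality is the participation ratio of the covariance spectrum, and a common intrinsic-dimension estimator is TwoNN (Facco et al., 2017), which uses only ratios of nearest-neighbor distances. A sketch of both (my illustration, not a definition either party endorsed):

```python
# Effective dim via PCA participation ratio vs. intrinsic dim via TwoNN.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def effective_dim(x: np.ndarray) -> float:
    """Participation ratio of the covariance spectrum: (sum l)^2 / sum l^2."""
    lam = np.linalg.eigvalsh(np.cov(x.T))
    return lam.sum() ** 2 / (lam ** 2).sum()

def intrinsic_dim_twonn(x: np.ndarray) -> float:
    """TwoNN MLE: d = N / sum(log(r2 / r1)) over each point's two NN distances."""
    dist, _ = NearestNeighbors(n_neighbors=3).fit(x).kneighbors(x)
    mu = dist[:, 2] / dist[:, 1]                 # column 0 is the point itself
    return len(x) / np.log(mu).sum()

# A 5-D manifold embedded nonlinearly in 768-D via random sine features:
rng = np.random.default_rng(0)
z = rng.normal(size=(4000, 5))                          # 5 true degrees of freedom
w, b = rng.normal(size=(5, 768)), rng.uniform(0, 2 * np.pi, 768)
x = np.sin(z @ w + b)
print(effective_dim(x))         # spreads variance over many axes: far above 5
print(intrinsic_dim_twonn(x))   # recovers roughly the manifold dimension: ~5
```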
Sander Dieleman @sedielem
In my blog post on latents for generative modelling, I pointed out that representation learning and reconstruction are two separate tasks (§6.3), which autoencoders try to solve simultaneously. Separating them makes sense. It opens up a lot of possibilities, as this work shows!
Saining Xie@sainingxie

three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)

rotem @irotem98
@92HsChoi @sainingxie They say in the paper that, despite the low FID, the MAE produced blurry images
Saining Xie @sainingxie
three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)
rotem @irotem98
@92HsChoi @sainingxie You can see in the encoder-choice section that they already tried SigLIP and it was slightly worse
rotem @irotem98
@sainingxie DINO works well for ImageNet-like images, but to support logos, screenshots, text, and other digital images you should use SigLIP
John Nguyen @__JohnNguyen__
@92HsChoi Yes, hence I said “with a different approach”. The problem here is not the use of a VAE. Sorry if that wasn’t clear in the tweet
John Nguyen @__JohnNguyen__
We solve the same problem with VUGEN, although with a different approach. Let’s leave VAEs in the early 2020s; 2025 is the time for a new class of generative models.
Saining Xie@sainingxie

three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)

최형석 (Hyeong-Seok Choi)
@TongPetersb Why do you think the DINO latent still doesn’t capture some high-frequency info such as small letters (while MAE does)? Do you think such info is no longer in the latent?
Peter Tong @TongPetersb
The work opened my eyes. Since my PhD, I've been studying visual representations for understanding and generation. I long thought pretrained vision encoders (CLIP, DINO, etc.) produced features too semantic for generation/reconstruction, but that's not true! These features outperform VAE features for generation, without any compression or finetuning. So, the representation that can suit both understanding and generation may have been there all along, just not used properly.
Saining Xie@sainingxie

three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)

Lucas Beyer (bl16) @giffmana
Hey chat, I need your opinions! Later this week, I'll teach my usual Transformers class. However, I just found out that someone is giving "foundations of attention and transformers" lecture before me already. So I'm thinking of still doing a "recap Lucas style" but then spending more time on some topics my lecture usually doesn't cover, or just scratches the surface. What more advanced/recent topics would you like to see included? Keep in mind this is a teaching/class style talk. Some ideas: more in-depth on decoding, kv-cache. Flex/flash/paged attention. Spend more time on multimodal versions? Tokenizers? I think bad ideas: geglu, global/local, rmsnorm, ... I feel like these are all trivially understood and not worth "teaching", though you may convince me otherwise.
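On the kv-cache suggestion: a minimal single-head sketch of what such a segment could walk through. This is my illustration of the general technique, caching past keys and values so each decode step only projects the newest token.

```python
# Minimal sketch of kv-cached decoding for one attention head: per step we
# project only the newest token, append its key/value to the cache, and
# attend over everything cached so far; the prefix is never recomputed.
import numpy as np

d = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
K_cache, V_cache = [], []                       # grow by one row per decoded token

def decode_step(x_t: np.ndarray) -> np.ndarray:
    """Attend the new token embedding x_t (shape (d,)) over the cached prefix."""
    q, k, v = x_t @ Wq, x_t @ Wk, x_t @ Wv
    K_cache.append(k)
    V_cache.append(v)
    K, V = np.stack(K_cache), np.stack(V_cache)  # (t, d) each
    scores = K @ q / np.sqrt(d)                  # causal: only past+current exist
    att = np.exp(scores - scores.max())
    att /= att.sum()
    return att @ V                               # context vector for this position

for _ in range(5):                               # stand-in autoregressive loop
    ctx = decode_step(rng.standard_normal(d))
```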