Dan Lyth

123 posts

@danlyth

Research engineer at @sesame. Previously leading speech research at @StabilityAI and @RockstarGames.

Joined December 2022

310 Following · 680 Followers
Nuno Job (@dscape)
@danlyth wanna come speak at an AI conference about acoustic tokens?
Dan Lyth reposted
Brendan Iribe (@brendaniribe)
And we’re building hardware.
[image]
Robbie Martin (@FluorescentGrey)
@stableaudio I've been trying to use Stable Audio, and in the last 24 hours only about half of my attempts to generate audio have completed without timing out. Is this a system-wide issue or something with my account?
Alexander Doria (@Dorialexander)
Big announcement: @pleiasfr releases a massive open corpus of 2 million YouTube videos in Creative Commons (CC-BY) on @huggingface. YouTube-Commons features 30 billion words of audio transcriptions in multiple languages, and soon other modalities. huggingface.co/datasets/PleIA…
[image]
Hubert Siuzdak (@HubertSiuzdak)
SNAC encodes audio into hierarchical tokens, similar to SoundStream, EnCodec, and DAC. It introduces a simple change: coarse tokens are sampled less frequently, covering a broader time span. It is designed mainly for language models, to accurately capture long-form audio with a consistent structure.
[image]
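The "coarse tokens sampled less frequently" idea can be sketched as a toy multi-scale residual quantizer. This is a minimal illustration of the general technique, not SNAC's actual implementation; all names, shapes, and codebook sizes here are made up for the example.

```python
import numpy as np

def multiscale_tokens(frames, codebooks, strides):
    """Quantize frame embeddings at multiple temporal resolutions:
    coarse levels use larger strides (fewer tokens over broader spans),
    finer levels quantize the remaining residual at higher rates.
    frames: (T, D) array; codebooks: list of (K, D) arrays;
    strides: list of ints, one per level, coarse -> fine."""
    residual = frames.copy()
    tokens = []
    for codebook, stride in zip(codebooks, strides):
        T, D = residual.shape
        # downsample the residual for this level (mean over each window)
        pooled = residual[: T - T % stride].reshape(-1, stride, D).mean(axis=1)
        # nearest-codeword assignment
        dists = ((pooled[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        ids = dists.argmin(axis=1)
        tokens.append(ids)
        # upsample the quantized signal and subtract it (residual VQ)
        quantized = np.repeat(codebook[ids], stride, axis=0)
        residual[: quantized.shape[0]] -= quantized
    return tokens

rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 4))               # 16 frames, 4-dim embeddings
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]
tokens = multiscale_tokens(frames, codebooks, strides=[4, 2, 1])
# coarse level yields 4 tokens, middle 8, fine 16 for the same clip
```

The payoff for a language model is the consistent structure: every audio chunk maps to a fixed, hierarchical token layout, with the coarse stream covering long-range content cheaply.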
Hubert Siuzdak (@HubertSiuzdak)
Recently I've been experimenting with audio compression & vector quantization, and I'm happy to present the Multi-Scale Neural Audio Codec (SNAC), which can compress audio below 2 kbit/s with decent quality 🎧 Listen to the samples and see how it compares to the state of the art:
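Where a sub-2 kbit/s figure comes from is simple arithmetic: sum the per-level token rates times the bits per token. The rates and codebook size below are illustrative placeholders, not SNAC's published configuration.

```python
import math

# Hypothetical multi-scale codec: three token streams at different
# frame rates, each drawn from a 4096-entry codebook (12 bits/token).
rates_hz = [12, 24, 48]           # coarse -> fine, tokens per second
bits_per_token = math.log2(4096)  # 12.0 bits
bitrate = sum(r * bits_per_token for r in rates_hz)
print(f"{bitrate / 1000:.3f} kbit/s")  # (12+24+48) * 12 = 1008 bits/s
```

Because coarse levels run at a fraction of the fine rate, the hierarchy adds long-span context for only a small bitrate overhead.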
Dan Lyth (@danlyth)
@erogol I was wondering the same thing; there's not a lot of detail on that.
erogol (@erogol)
@danlyth One thing that's unclear to me is how they expand BPE before decoding, because otherwise they'd need to learn to align it with the output audio frames. Do you see how?
Dan Lyth (@danlyth)
Moving beyond naturalness and WER, they propose a set of sentences that test the model's ability to deal with compound nouns, emotions, foreign words, paralinguistics (e.g. whispering if the text requires it), etc. The full test set is included in the appendix. 👏 2/7
[image]
Dan Lyth (@danlyth)
There are a bunch of other interesting elements to this work, and it’s worth a read. Plenty of examples on the demo site too. Nice work Mateusz Łajszczak, @guillecambara, Yang Li, and all the other contributors. arxiv.org/abs/2402.08093
Dan Lyth (@danlyth)
The speech “de-tokenizer” (or decoder) is a convolutional model that’s streamable and 3x faster than their diffusion-based baseline (and also sounds better). It’s built around BigVGAN. 6/7
[image]
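The "streamable" property of a convolutional de-tokenizer comes from causality: if each output depends only on past inputs, audio can be decoded chunk by chunk while carrying a small left-context cache between chunks. The sketch below illustrates that idea on a single 1-D convolution; it is a toy, not the actual BigVGAN-based decoder.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Full-sequence causal conv: each output sees only past samples."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([padded[i:i + k] @ kernel for i in range(len(x))])

def streaming_conv1d(chunks, kernel):
    """Same computation, but processed chunk by chunk: the last k-1
    samples are cached and prepended to the next chunk, so the output
    matches the full-sequence result exactly."""
    k = len(kernel)
    cache = np.zeros(k - 1)  # left context carried between chunks
    out = []
    for chunk in chunks:
        padded = np.concatenate([cache, chunk])
        out.append(np.array([padded[i:i + k] @ kernel
                             for i in range(len(chunk))]))
        cache = padded[len(chunk):]  # keep last k-1 samples for next call
    return np.concatenate(out)

x = np.arange(8.0)
kernel = np.array([1.0, 2.0, 3.0])
full = causal_conv1d(x, kernel)
chunked = streaming_conv1d([x[:3], x[3:5], x[5:]], kernel)
# both paths produce identical output regardless of chunking
```

A stack of such layers keeps per-chunk state small and constant, which is what lets a convolutional decoder emit audio with low latency instead of waiting for the whole token sequence.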