
Linnn | internship szn
@_lindrew
CS Student | Hiraya Manawari!



Would loveee to try the last three. Wanna get caught by a friend

That was such an unfortunate way to die





life outside twt been eating so much after i said i was going to do caldef 🤡🤡🤡

What makes Audio LMs interesting? (as opposed to garden-variety text LMs)

Audio carries more info than text. If we could properly train on audio, the models should be smarter.

Look at info transmitted per second (bits per second):

- English language is about 1 bit per character: ~1 bpc × 15 char/second => ~15 bps.
- Audio represents text's linguistic content, but also:
  + pitch: ~8 levels (3 bits) at 5 Hz => ~15 bps
  + stress: ~2 bits/syllable × 3 syl/sec => ~6 bps
  + emotion / voice quality: slow-varying => ~3 bps
  + speaker identity: ~10 bits (you can recognise tons of speakers), amortized over 5 seconds => ~2 bps
  + acoustic environment: mostly constant => ~1 bps

So audio is worth: ~15 bps linguistic + ~27 bps non-linguistic = ~42 bps => ~2.5-3x more information than text alone.

The motivation for a true audio LM is being able to capture the intelligence that goes beyond text... and make LMs more "socially intelligent" - at least from an audio standpoint. Perhaps some of that intelligence could transfer to text too, although maybe that's more of a stretch...

Today's LMs (incl. multi-modal) probably don't capture all of that 27 bps of non-linguistic content (at least not efficiently), for reasons I've hinted at, and will get to with a deeper analysis of audio encoding tomorrow.
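The back-of-envelope arithmetic above can be sanity-checked in a few lines. A minimal sketch — all rates are the thread's own rough estimates, not measurements:

```python
# Rough information rates from the thread's estimates (bits per second).

# Linguistic content: ~1 bit/character at ~15 characters/second.
linguistic_bps = 1 * 15

# Non-linguistic channels carried by audio on top of the text:
non_linguistic = {
    "pitch":       3 * 5,   # ~8 levels (3 bits) sampled at ~5 Hz
    "stress":      2 * 3,   # ~2 bits/syllable at ~3 syllables/second
    "emotion":     3,       # slow-varying voice quality
    "speaker_id":  10 / 5,  # ~10 bits amortized over ~5 seconds
    "environment": 1,       # acoustic environment, mostly constant
}

non_linguistic_bps = sum(non_linguistic.values())
total_bps = linguistic_bps + non_linguistic_bps

print(f"linguistic:     {linguistic_bps} bps")
print(f"non-linguistic: {non_linguistic_bps:.0f} bps")
print(f"total:          {total_bps:.0f} bps "
      f"({total_bps / linguistic_bps:.1f}x text alone)")
```

Running it reproduces the thread's totals: 15 bps linguistic, 27 bps non-linguistic, 42 bps overall, i.e. ~2.8x the text channel alone.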













