
Hongwu Peng
@Hongwu_Peng
Foundation model pretraining @Adobe Research. Ph.D. @ UConn CSE. https://t.co/FPAiorbsK3


Introducing Nested Learning: A new ML paradigm for continual learning that views models as nested optimization problems to enhance long context processing. Our proof-of-concept model, Hope, shows improved performance in language modeling. Learn more: goo.gle/47LJrzI @GoogleAI
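For intuition, here is a minimal two-level (inner/outer) optimization loop in PyTorch. This is my own illustrative sketch of the "models as nested optimization problems" framing, not the actual Hope architecture; all dimensions and names are placeholders.

```python
# Sketch of nested optimization (illustrative only, NOT the Hope model):
# an inner loop adapts "fast" weights to recent context, while an outer
# loop trains the "slow" weights the inner loop starts from.
import torch

torch.manual_seed(0)
d = 16
slow = torch.randn(d, d, requires_grad=True)   # outer-level ("slow") parameters
outer_opt = torch.optim.Adam([slow], lr=1e-3)

def inner_adapt(fast, x, y, steps=3, lr=0.1):
    """Inner optimization: a few differentiable gradient steps on context."""
    for _ in range(steps):
        loss = ((x @ fast - y) ** 2).mean()
        (grad,) = torch.autograd.grad(loss, fast, create_graph=True)
        fast = fast - lr * grad                # keeps the graph to the slow weights
    return fast

for step in range(100):
    ctx_x, ctx_y = torch.randn(32, d), torch.randn(32, d)   # toy "context" batch
    fast = inner_adapt(slow, ctx_x, ctx_y)                   # fast weights start at slow
    qry_x, qry_y = torch.randn(32, d), torch.randn(32, d)    # toy "query" batch
    outer_loss = ((qry_x @ fast - qry_y) ** 2).mean()
    outer_opt.zero_grad()
    outer_loss.backward()                      # backprop through the inner loop
    outer_opt.step()
```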

Motif 2.6B tech report is pretty insane; first time I've seen a model with differential attention and PolyNorm trained at scale!
> It's trained on 2.5T tokens, with a "data mixture schedule" to continuously adjust the mixture over training.
> They use WSD with a "simple moving average", averaging the last 6 checkpoints every 8B tokens (rough sketch below).
> They trained on FineMath, FineWeb2, DCLM, and TxT360.
> Lots of detail on the finetuning data they used; for instance, they used EvolKit and did some "dataset fusion" to pack more compressed knowledge into the data.
> They mention they also tried Normalized GPT, QK-Norm, and Cross-Layer Attention.
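A minimal sketch of that checkpoint-averaging step, assuming it is a plain element-wise mean over the last 6 saved state dicts; the report's exact procedure may differ, and all names here are mine.

```python
# Simple moving average over the last WINDOW checkpoints, refreshed on a
# fixed token cadence (every ~8B tokens per the report). Illustrative only.
from collections import deque
import torch

WINDOW = 6  # number of recent checkpoints to average

def update_sma(ckpt_queue: deque, new_state: dict) -> dict:
    """Push the latest checkpoint and return the element-wise mean
    of the parameters across the last WINDOW checkpoints."""
    ckpt_queue.append({k: v.detach().clone() for k, v in new_state.items()})
    if len(ckpt_queue) > WINDOW:
        ckpt_queue.popleft()
    n = len(ckpt_queue)
    return {k: sum(ck[k] for ck in ckpt_queue) / n for k in ckpt_queue[0]}

# usage sketch: every ~8B training tokens
# queue = deque()
# averaged_state = update_sma(queue, model.state_dict())
# eval_model.load_state_dict(averaged_state)
```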



Source: papercopilot.com/paper-list/neu…

New ultra-fast "multi-head" speech recognition model drop from @_aiOla, and it beats OpenAI Whisper on speed.

Officially dubbed Whisper-Medusa, the model builds on Whisper but uses a novel "multi-head attention" architecture that predicts far more tokens at a time; they appear to have added extra prediction heads on top of Whisper (see the sketch below). They claim the same accuracy but 50% faster: their demo transcribes one text in 1.9s while "baseline" Whisper takes 4s. Code and weights are open source under MIT.

They started with a 10-head model but will soon expand to a larger 20-head version capable of predicting 20 tokens at a time, for faster recognition and transcription without any loss of accuracy.
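A minimal sketch of how Medusa-style multi-token prediction heads work in general; this is my own reading of the idea, not aiOla's actual code, and the d_model / vocab_size values are placeholders.

```python
# Medusa-style decoding sketch (illustrative): K extra linear heads each
# predict one of the next K tokens from the decoder's last hidden state,
# so a single forward pass proposes K tokens instead of 1.
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, num_heads: int = 10):
        super().__init__()
        # head i predicts the token at position t + 1 + i
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(num_heads)
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, d_model) last decoder hidden state
        # returns: (batch, num_heads, vocab_size) logits for the next K tokens
        return torch.stack([head(hidden) for head in self.heads], dim=1)

# usage sketch: greedily propose 10 tokens from one decoder pass; in practice
# the proposals would be verified/accepted against the base model's outputs.
heads = MedusaHeads(d_model=768, vocab_size=51865, num_heads=10)  # placeholder sizes
hidden = torch.randn(1, 768)               # stand-in for a Whisper decoder output
proposed = heads(hidden).argmax(dim=-1)    # (1, 10) candidate token ids
```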
