Adam Ibrahim
@ai_phd
74 posts
Paris · Joined June 2019
427 Following · 539 Followers

Pinned Tweet
Adam Ibrahim @ai_phd
Our tech report for Zamba-7B-v1 is out. We manage to come close to the performance of Llama 3 8B, Mistral 7B, and others with only 1T tokens, while offering faster inference and lower memory usage at a fixed context length. Read up to learn about our not-so-secret sauce!
Quoting Quentin Anthony @QuentinAnthon15:
Zyphra is dropping the tech report for Zamba-7B, along with:
- Model weights (phase 1 and final annealed) at huggingface.co/Zyphra
- Inference/generation code (both pure PyTorch and HuggingFace) at github.com/Zyphra/Zamba-t… and github.com/huggingface/tr… (a minimal loading sketch follows this tweet)
Tech report: arxiv.org/abs/2405.16712
Intermediate checkpoints and improved phase 1 dataset (+dataset paper!) coming soon. Intermediate optimizer states available on request. Some highlights:
0 replies · 4 retweets · 19 likes · 2.2K views
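Since the thread points to both pure-PyTorch and Hugging Face inference code, here is a minimal loading sketch along those lines. It assumes the checkpoint is hosted at the repo id Zyphra/Zamba-7B-v1 and that your transformers install includes Zamba support; the repo id, dtype, prompt, and generation settings are illustrative assumptions, not details confirmed above.

```python
# Hedged sketch: loading Zamba for generation via Hugging Face transformers.
# Assumes a transformers version with the Zamba architecture and that the
# checkpoint lives at the (assumed) repo id "Zyphra/Zamba-7B-v1".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Zyphra/Zamba-7B-v1"  # assumption: the exact repo id may differ

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # half precision to reduce memory usage
    device_map="auto",           # requires the `accelerate` package
)

prompt = "The not-so-secret sauce of hybrid models is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```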
Adam Ibrahim retweeted
Rylan Schaeffer @RylanSchaeffer
Another #ICML2025 paper! Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive? TLDR: Predicting language model performance with scale on multiple-choice question-answering (MCQA) benchmarks is made difficult because ... 1/3
2 replies · 14 retweets · 88 likes · 13K views
Adam Ibrahim retweeted
Xin Eric Wang @xwang_lk
Transformer, Mamba, or RWKV?
6 replies · 1 retweet · 9 likes · 7.1K views
Adam Ibrahim @ai_phd
Worth noting that we're working with @huggingface to release the model over the next week. Stay tuned!
0 replies · 0 retweets · 2 likes · 228 views
Quentin Anthony @QuentinAnthon15
Zyphra is pleased to announce Zamba-7B:
- 7B Mamba/Attention hybrid (a toy sketch of this block pattern follows after this tweet)
- Competitive with Mistral-7B and Gemma-7B on only 1T fully open training tokens
- Outperforms Llama-2 7B and OLMo-7B
- All checkpoints across training to be released (Apache 2.0)
- Achieved by 7 people, on 128 H100 GPUs, in 30 days
- zyphra.com/zamba
- venturebeat.com/ai/zyphra-rele…
Want more details? A 🧵
23 replies · 81 retweets · 423 likes · 185.3K views
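The announcement describes Zamba as a Mamba/attention hybrid. Below is a toy PyTorch sketch of that general block pattern: several SSM-style sublayers interleaved with an attention sublayer. The stand-in gated-convolution layer, the every-6 ratio, and all sizes are illustrative assumptions, not Zamba's actual architecture (the real selective-SSM kernel lives in the mamba_ssm package).

```python
# Hedged sketch of a Mamba/attention hybrid decoder block (illustrative only;
# NOT Zamba's actual architecture). "MambaStandIn" replaces the real selective
# SSM with a causal gated convolution so the example runs with plain PyTorch.
import torch
import torch.nn as nn

class MambaStandIn(nn.Module):
    """Placeholder for a Mamba (selective SSM) layer: causal conv + gating."""
    def __init__(self, d_model: int, d_conv: int = 4):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.conv = nn.Conv1d(d_model, d_model, d_conv,
                              padding=d_conv - 1, groups=d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        # Depthwise causal convolution; trim the right-side padding overhang.
        u = self.conv(u.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return self.out_proj(torch.nn.functional.silu(gate) * u)

class HybridBlock(nn.Module):
    """`every` SSM sublayers followed by one attention sublayer (assumed ratio)."""
    def __init__(self, d_model: int, n_heads: int, every: int = 6):
        super().__init__()
        self.ssm_layers = nn.ModuleList(MambaStandIn(d_model) for _ in range(every))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(every + 1))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        for norm, ssm in zip(self.norms[:-1], self.ssm_layers):
            x = x + ssm(norm(x))  # residual SSM sublayers
        h = self.norms[-1](x)
        mask = torch.triu(        # causal attention mask
            torch.full((x.size(1), x.size(1)), float("-inf"), device=x.device), 1
        )
        a, _ = self.attn(h, h, h, attn_mask=mask)
        return x + a              # residual attention sublayer

x = torch.randn(2, 16, 512)
print(HybridBlock(d_model=512, n_heads=8)(x).shape)  # torch.Size([2, 16, 512])
```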
Arthur Douillard @Ar_Douillard
I'm super excited to release DiPaCo, a new kind of mixture of experts, that can scale engineering-wise to data centers across the entire world! A few words about it in this thread 🧵
Quoting AK @_akhaliq:
Google presents DiPaCo: Distributed Path Composition. Progress in machine learning (ML) has been fueled by scaling neural network models. This scaling has been enabled by ever more heroic feats of engineering, necessary for accommodating ML approaches that require high…
12 replies · 49 retweets · 290 likes · 108.7K views
Adam Ibrahim @ai_phd
@BrandoHablando @benjamintherien @AiEleuther @laion_ai @ThomasScialom Note that from the EN->DE experiments, and from feedback I had from people working on adapting to time series, stronger shifts seem to require more replay than EN->EN (>5%). But again, the relative performance of different replay percentages appears quickly enough that searching this hyperparameter is viable.
0 replies · 0 retweets · 2 likes · 71 views
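A minimal sketch of the replay knob under discussion: each training example is drawn from the old distribution with probability replay_frac, and from the new one otherwise. The function name, dataset placeholders, and the 5% default are illustrative assumptions based on the thread's EN->EN guidance, not an official implementation.

```python
# Hedged sketch: sampling a replay fraction of old-distribution examples while
# continually pretraining on a new distribution. All names are illustrative.
import random
from typing import Iterator, Sequence

def replay_mixture(
    new_data: Sequence[str],
    old_data: Sequence[str],
    replay_frac: float = 0.05,  # ~5%; the thread suggests >5% for stronger shifts (e.g. EN->DE)
    seed: int = 0,
) -> Iterator[str]:
    """Yield examples, drawing each from old_data with probability replay_frac."""
    rng = random.Random(seed)
    while True:
        source = old_data if rng.random() < replay_frac else new_data
        yield rng.choice(source)

# Toy usage: a stronger distribution shift would warrant a larger replay_frac.
stream = replay_mixture(["new_ex_1", "new_ex_2"], ["old_ex_1"], replay_frac=0.05)
print([next(stream) for _ in range(8)])
```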
Benjamin Thérien @benjamintherien
Interested in seamlessly updating your #LLM on new datasets to avoid wasting previous efforts & compute, all while maintaining performance on past data? Excited to present Simple and Scalable Strategies to Continually Pre-train Large Language Models! 🧵 arxiv.org/abs/2403.08763 1/N
4 replies · 48 retweets · 159 likes · 49.4K views
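One of the paper's headline strategies is re-warming and then re-decaying the learning rate when continual pretraining begins on a new dataset. Here is a minimal sketch of such a schedule; the peak/minimum rates and step counts are illustrative assumptions, not the paper's exact hyperparameters.

```python
# Hedged sketch of a "re-warm then re-decay" learning-rate schedule for
# continual pretraining. All hyperparameter values below are illustrative.
import math

def rewarmed_cosine_lr(step: int, warmup_steps: int = 1000,
                       total_steps: int = 100_000,
                       peak_lr: float = 3e-4, min_lr: float = 3e-5) -> float:
    """Linear re-warmup from min_lr to peak_lr, then cosine re-decay to min_lr."""
    if step < warmup_steps:
        return min_lr + (peak_lr - min_lr) * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# At the start of the new-dataset phase the LR climbs back up instead of
# staying at the previous run's final (annealed) value, then decays again.
for s in (0, 500, 1000, 50_000, 100_000):
    print(s, f"{rewarmed_cosine_lr(s):.2e}")
```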
Brando Miranda @BrandoHablando
@ai_phd @benjamintherien @AiEleuther @laion_ai Grateful for the detailed responses! Yes, I know that the arch changed :) A part of me is skeptical (now it feels unfairly skeptical), but perhaps I will save it for a second read later. Curious, do you know how to compute these reweightings automatically for continual pre-training?
1 reply · 0 retweets · 0 likes · 36 views
Adam Ibrahim @ai_phd
@BrandoHablando @benjamintherien @AiEleuther @laion_ai Note that Llama i+1 is a separate setting, however, since there are architectural changes involved (context length, GQA, ...) besides dataset changes. If it were only dataset changes, I believe our Pile -> SP experiments suggest that continuing to pretrain could have worked.
1 reply · 0 retweets · 2 likes · 55 views
Adam Ibrahim @ai_phd
@BrandoHablando @benjamintherien @AiEleuther @laion_ai We pretrained both 405M- and 10B-sized models from scratch on the 300-600B tokens of each of these datasets. Our results are consistent at both scales with those 2 actual EN pretraining datasets, but also in the entirely separate case of the stronger EN->DE (200B tokens!) shift.
1 reply · 0 retweets · 1 like · 65 views
Adam Ibrahim @ai_phd
@BrandoHablando @benjamintherien @AiEleuther @laion_ai If your dataset is much smaller than the ones we consider (e.g., a few billion tokens), I'd treat it more like a fine-tuning dataset than a pretraining dataset (the ones in our work/experiments are hundreds of billions of tokens). I'd bet replay still works, though.
1 reply · 0 retweets · 2 likes · 86 views