Adam Ibrahim
@ai_phd
74 posts
Paris · Joined June 2019
427 Following · 539 Followers

Pinned Tweet
Adam Ibrahim @ai_phd
Our tech report for Zamba-7B-v1 is out. We manage to come close to the performance of Llama 3 8B, Mistral 7B, and others with only 1T tokens, while offering faster inference and lower memory usage at a fixed context length. Read up to learn about our not-so-secret sauce!
Quoting Quentin Anthony @QuentinAnthon15:
Zyphra is dropping the tech report for Zamba-7B, along with:
- Model weights (phase 1 and final annealed) at huggingface.co/Zyphra
- Inference/generation code (both pure PyTorch and HuggingFace) at github.com/Zyphra/Zamba-t… and github.com/huggingface/tr… (a minimal loading sketch follows this tweet)
Tech report: arxiv.org/abs/2405.16712
Intermediate checkpoints and improved phase 1 dataset (+dataset paper!) coming soon. Intermediate optimizer states available on request. Some highlights:
0 replies · 4 retweets · 19 likes · 2.2K views
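Since the thread points to both pure-PyTorch and Hugging Face inference code, here is a minimal loading sketch along those lines. It assumes the checkpoint is hosted at the repo id Zyphra/Zamba-7B-v1 and that your transformers install includes Zamba support; the repo id, dtype, prompt, and generation settings are illustrative assumptions, not details confirmed above.

```python
# Hedged sketch: loading Zamba for generation via Hugging Face transformers.
# Assumes a transformers version with the Zamba architecture and that the
# checkpoint lives at the (assumed) repo id "Zyphra/Zamba-7B-v1".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Zyphra/Zamba-7B-v1"  # assumption: the exact repo id may differ

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # half precision to reduce memory usage
    device_map="auto",           # requires the `accelerate` package
)

prompt = "The not-so-secret sauce of hybrid models is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```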
Adam Ibrahim retweeted
Rylan Schaeffer @RylanSchaeffer
Another #ICML2025 paper! Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive? TLDR: Predicting language model performance with scale on multiple-choice question-answering (MCQA) benchmarks is made difficult because ... 1/3
2 replies · 14 retweets · 88 likes · 13K views
Adam Ibrahim retweeted
Xin Eric Wang @xwang_lk
Transformer, Mamba, or RWKV?
6 replies · 1 retweet · 9 likes · 7.1K views
Adam Ibrahim @ai_phd
Worth noting that we're working with @huggingface to release the model over the next week. Stay tuned!
0 replies · 0 retweets · 2 likes · 228 views
Quentin Anthony @QuentinAnthon15
Zyphra is pleased to announce Zamba-7B:
- 7B Mamba/Attention hybrid (a toy sketch of this block pattern follows after this tweet)
- Competitive with Mistral-7B and Gemma-7B on only 1T fully open training tokens
- Outperforms Llama-2 7B and OLMo-7B
- All checkpoints across training to be released (Apache 2.0)
- Achieved by 7 people, on 128 H100 GPUs, in 30 days
- zyphra.com/zamba
- venturebeat.com/ai/zyphra-rele…
Want more details? A 🧵
23 replies · 81 retweets · 423 likes · 185.3K views
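The announcement describes Zamba as a Mamba/attention hybrid. Below is a toy PyTorch sketch of that general block pattern: several SSM-style sublayers interleaved with an attention sublayer. The stand-in gated-convolution layer, the every-6 ratio, and all sizes are illustrative assumptions, not Zamba's actual architecture (the real selective-SSM kernel lives in the mamba_ssm package).

```python
# Hedged sketch of a Mamba/attention hybrid decoder block (illustrative only;
# NOT Zamba's actual architecture). "MambaStandIn" replaces the real selective
# SSM with a causal gated convolution so the example runs with plain PyTorch.
import torch
import torch.nn as nn

class MambaStandIn(nn.Module):
    """Placeholder for a Mamba (selective SSM) layer: causal conv + gating."""
    def __init__(self, d_model: int, d_conv: int = 4):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.conv = nn.Conv1d(d_model, d_model, d_conv,
                              padding=d_conv - 1, groups=d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        # Depthwise causal convolution; trim the right-side padding overhang.
        u = self.conv(u.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return self.out_proj(torch.nn.functional.silu(gate) * u)

class HybridBlock(nn.Module):
    """`every` SSM sublayers followed by one attention sublayer (assumed ratio)."""
    def __init__(self, d_model: int, n_heads: int, every: int = 6):
        super().__init__()
        self.ssm_layers = nn.ModuleList(MambaStandIn(d_model) for _ in range(every))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(every + 1))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        for norm, ssm in zip(self.norms[:-1], self.ssm_layers):
            x = x + ssm(norm(x))  # residual SSM sublayers
        h = self.norms[-1](x)
        mask = torch.triu(        # causal attention mask
            torch.full((x.size(1), x.size(1)), float("-inf"), device=x.device), 1
        )
        a, _ = self.attn(h, h, h, attn_mask=mask)
        return x + a              # residual attention sublayer

x = torch.randn(2, 16, 512)
print(HybridBlock(d_model=512, n_heads=8)(x).shape)  # torch.Size([2, 16, 512])
```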
Arthur Douillard @Ar_Douillard
I'm super excited to release DiPaCo, a new kind of mixture of experts, that can scale engineering-wise to data centers across the entire world! A few words about it in this thread 🧵
Quoting AK @_akhaliq:
Google presents DiPaCo: Distributed Path Composition. Progress in machine learning (ML) has been fueled by scaling neural network models. This scaling has been enabled by ever more heroic feats of engineering, necessary for accommodating ML approaches that require high…
12 replies · 49 retweets · 290 likes · 108.7K views
Adam Ibrahim @ai_phd
@BrandoHablando @benjamintherien @AiEleuther @laion_ai @ThomasScialom Note that from the EN->DE experiments, and from feedback I had from people working on adapting to time series, stronger shifts seem to require more replay than EN->EN (>5%). But again, the relative performance of different replay percentages appears quickly enough that searching this hyperparameter is viable.
0 replies · 0 retweets · 2 likes · 71 views
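A minimal sketch of the replay knob under discussion: each training example is drawn from the old distribution with probability replay_frac, and from the new one otherwise. The function name, dataset placeholders, and the 5% default are illustrative assumptions based on the thread's EN->EN guidance, not an official implementation.

```python
# Hedged sketch: sampling a replay fraction of old-distribution examples while
# continually pretraining on a new distribution. All names are illustrative.
import random
from typing import Iterator, Sequence

def replay_mixture(
    new_data: Sequence[str],
    old_data: Sequence[str],
    replay_frac: float = 0.05,  # ~5%; the thread suggests >5% for stronger shifts (e.g. EN->DE)
    seed: int = 0,
) -> Iterator[str]:
    """Yield examples, drawing each from old_data with probability replay_frac."""
    rng = random.Random(seed)
    while True:
        source = old_data if rng.random() < replay_frac else new_data
        yield rng.choice(source)

# Toy usage: a stronger distribution shift would warrant a larger replay_frac.
stream = replay_mixture(["new_ex_1", "new_ex_2"], ["old_ex_1"], replay_frac=0.05)
print([next(stream) for _ in range(8)])
```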
Benjamin Thérien @benjamintherien
Interested in seamlessly updating your #LLM on new datasets to avoid wasting previous efforts & compute, all while maintaining performance on past data? Excited to present Simple and Scalable Strategies to Continually Pre-train Large Language Models! 🧵 arxiv.org/abs/2403.08763 1/N
4 replies · 48 retweets · 159 likes · 49.4K views
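One of the paper's headline strategies is re-warming and then re-decaying the learning rate when continual pretraining begins on a new dataset. Here is a minimal sketch of such a schedule; the peak/minimum rates and step counts are illustrative assumptions, not the paper's exact hyperparameters.

```python
# Hedged sketch of a "re-warm then re-decay" learning-rate schedule for
# continual pretraining. All hyperparameter values below are illustrative.
import math

def rewarmed_cosine_lr(step: int, warmup_steps: int = 1000,
                       total_steps: int = 100_000,
                       peak_lr: float = 3e-4, min_lr: float = 3e-5) -> float:
    """Linear re-warmup from min_lr to peak_lr, then cosine re-decay to min_lr."""
    if step < warmup_steps:
        return min_lr + (peak_lr - min_lr) * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# At the start of the new-dataset phase the LR climbs back up instead of
# staying at the previous run's final (annealed) value, then decays again.
for s in (0, 500, 1000, 50_000, 100_000):
    print(s, f"{rewarmed_cosine_lr(s):.2e}")
```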
Brando Miranda @BrandoHablando
@ai_phd @benjamintherien @AiEleuther @laion_ai Grateful for the detailed responses! Yes, I know that the arch changed :) A part of me is skeptical (now it feels unfairly skeptical), but perhaps I will save it for a second read later. Curious, do you know how to compute these reweightings automatically for continual pre-training?
1 reply · 0 retweets · 0 likes · 36 views
Adam Ibrahim @ai_phd
@BrandoHablando @benjamintherien @AiEleuther @laion_ai Note that Llama i+1 is a separate setting, however, since there are architectural changes involved (context length, GQA, ...) besides dataset changes. If it were only dataset changes, I believe our Pile -> SP experiments suggest that continuing to pretrain could have worked.
1 reply · 0 retweets · 2 likes · 55 views
Adam Ibrahim @ai_phd
@BrandoHablando @benjamintherien @AiEleuther @laion_ai We pretrained both 405M- and 10B-sized models from scratch on the 300-600B tokens of each of these datasets. Our results are consistent at both scales with those 2 actual EN pretraining datasets, but also in the entirely separate case of the stronger EN->DE (200B tokens!) shift.
1 reply · 0 retweets · 1 like · 65 views
Adam Ibrahim @ai_phd
@BrandoHablando @benjamintherien @AiEleuther @laion_ai If your dataset is much smaller than the ones we consider (e.g., a few billion tokens), I'd treat it more like a fine-tuning dataset than a pretraining dataset (the ones in our work/experiments are hundreds of billions of tokens). I'd bet replay still works, though.
1 reply · 0 retweets · 2 likes · 86 views