Mohammad Rastegari

107 posts

@morastegari

Distinguished AI Scientist at Meta. Affiliate Assistant Professor at University of Washington.

Seattle, WA · Joined February 2017
114 Following · 1.1K Followers
Mohammad Rastegari@morastegari·
This work was one of the last projects my team completed while I was at Apple. A lot of credit to @sacmehtauw, whose dedication was key to this project. The main point here is to show that, as contributors to the AI community, we play our part in being fully open.
AK@_akhaliq

Apple presents OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework. The reproducibility and transparency of large language models are crucial for advancing open research, ensuring the trustworthiness of results, and …

Mohammad Rastegari retweeted
AK@_akhaliq·
Apple presents Speculative Streaming Fast LLM Inference without Auxiliary Models Speculative decoding is a prominent technique to speed up the inference of a large target language model based on predictions of an auxiliary draft model. While effective, in application-specific settings, it often involves fine-tuning both draft and target models to achieve high acceptance rates. As the number of downstream tasks grows, these draft models add significant complexity to inference systems. We propose Speculative Streaming, a single-model speculative decoding method that fuses drafting into the target model by changing the fine-tuning objective from next token prediction to future n-gram prediction. Speculative Streaming speeds up decoding by 1.8 - 3.1X in a diverse set of tasks, such as Summarization, Structured Queries, and Meaning Representation, without sacrificing generation quality. Additionally, Speculative Streaming is parameter-efficient. It achieves on-par/higher speed-ups than Medusa-style architectures while using ~10000X fewer extra parameters, making it well-suited for resource-constrained devices.
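The core idea in the tweet above (the target model itself drafts future tokens via n-gram prediction heads, then verifies the drafts, so no auxiliary model is needed) can be sketched with a toy deterministic "model". Everything here is illustrative, assumed for the sketch, and not the paper's implementation:

```python
# Toy sketch of single-model speculative decoding (illustrative only).
# The "model" is deterministic: its main head predicts next = (last * 2) % 101,
# and its speculative n-gram heads guess the following k tokens by applying
# the same rule (optionally corrupted to mimic a wrong speculation).

def main_head(token):
    """Ground-truth next-token rule of the toy target model."""
    return (token * 2) % 101

def ngram_heads(token, k=3, noise_at=None):
    """Speculative heads: guess the next k tokens after `token`.
    `noise_at` corrupts one position to simulate a rejected draft."""
    guesses, t = [], token
    for i in range(k):
        t = main_head(t)
        if noise_at == i:
            t = (t + 1) % 101  # a deliberately wrong guess
        guesses.append(t)
    return guesses

def generate(start, n_tokens, k=3, noise_at=None):
    """Draft-then-verify loop. One counted forward pass yields a verified
    token plus k speculative guesses; accepted guesses cost no extra pass
    (verification is amortized into the next parallel pass in real systems,
    so it is not counted here)."""
    out = [start]
    forward_passes = 0
    while len(out) < n_tokens + 1:
        nxt = main_head(out[-1])
        forward_passes += 1
        out.append(nxt)
        for g in ngram_heads(nxt, k, noise_at):
            if len(out) >= n_tokens + 1:
                break
            if g == main_head(out[-1]):   # guess matches the model: accept
                out.append(g)             # free token, no counted pass
            else:
                break                     # mismatch: fall back to main head
    return out[1:], forward_passes
```

With perfect heads, 8 tokens cost 2 counted passes instead of 8; when the first guess is always wrong, the loop degrades gracefully to one token per pass while still producing identical output.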
Mohammad Rastegari@morastegari·
This has been one of my favorite directions on enabling #llms to run effectively on device. Thanks to the great team pushing the state of the art in this direction. In the Apple MIND team, we try to attack research problems that move us to the next level of experiencing AI.
AK@_akhaliq

Apple announces LLM in a flash: Efficient Large Language Model Inference with Limited Memory paper page: huggingface.co/papers/2312.11… Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their intensive computational and memory requirements present challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters on flash memory but bringing them on demand to DRAM. Our method involves constructing an inference cost model that harmonizes with the flash memory behavior, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this flash memory-informed framework, we introduce two principal techniques. First, "windowing" strategically reduces data transfer by reusing previously activated neurons, and second, "row-column bundling", tailored to the sequential data access strengths of flash memory, increases the size of data chunks read from flash memory. These methods collectively enable running models up to twice the size of the available DRAM, with a 4-5x and 20-25x increase in inference speed compared to naive loading approaches in CPU and GPU, respectively. Our integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for effective inference of LLMs on devices with limited memory.

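The "windowing" technique described above can be sketched as a small cache: keep the FFN neurons activated for the last few tokens resident in DRAM, and fetch from flash only neurons that are newly activated. The class and numbers below are assumptions for illustration, not Apple's implementation:

```python
from collections import deque

# Hedged sketch of "windowing" from LLM in a flash: neurons active in the
# last `window` tokens stay resident in DRAM; only cache misses touch flash.

class NeuronWindowCache:
    def __init__(self, window=4):
        self.window = window
        self.history = deque()   # per-token sets of active neuron ids
        self.resident = set()    # union of the sets currently in `history`
        self.flash_loads = 0     # total neurons fetched from flash

    def step(self, active_neurons):
        """Process one token's active-neuron set; return # neurons loaded."""
        new = set(active_neurons) - self.resident
        self.flash_loads += len(new)      # only misses are read from flash
        self.history.append(set(active_neurons))
        if len(self.history) > self.window:
            self.history.popleft()        # neurons may fall out of window...
        # ...so recompute the resident union (simple, not optimized)
        self.resident = set().union(*self.history)
        return len(new)
```

Because consecutive tokens activate heavily overlapping neuron sets in sparse FFNs, most steps load only the small difference: for three tokens with sets {1,2,3}, {2,3,4}, {3,4,5}, the cache loads 3 + 1 + 1 = 5 neurons instead of the naive 9.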
Mohammad Rastegari@morastegari·
Accurate training-aware weight quantization was computationally intractable for LLMs. But now at Apple MIND we have developed a method that solves the problem very efficiently and pushes the boundary to 3-bit quantization. eDKM: arxiv.org/abs/2309.00964 #LLM #LLMoptimization
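eDKM itself is a memory-efficient *differentiable* k-means applied during training; the toy below shows only the underlying representation that makes 3-bit quantization possible: each weight becomes a 3-bit index into a palette of 2^3 = 8 shared centroids found by plain (non-differentiable) k-means. All names here are assumptions for the sketch, not the paper's method:

```python
# Hedged sketch: palette quantization behind 3-bit weight clustering.
# eDKM (arxiv.org/abs/2309.00964) learns the clustering differentiably and
# memory-efficiently during training; this toy uses ordinary Lloyd's k-means.

def kmeans_1d(values, k=8, iters=20):
    """Plain 1-D Lloyd's k-means on scalars; returns k centroids."""
    lo, hi = min(values), max(values)
    cents = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            j = min(range(k), key=lambda i: abs(v - cents[i]))
            buckets[j].append(v)
        # empty clusters keep their previous centroid
        cents = [sum(b) / len(b) if b else cents[i]
                 for i, b in enumerate(buckets)]
    return cents

def quantize_3bit(weights):
    """Return (palette, indices): each weight stored as a 3-bit index."""
    palette = kmeans_1d(weights, k=8)
    idx = [min(range(8), key=lambda i: abs(w - palette[i])) for w in weights]
    return palette, idx

def dequantize(palette, idx):
    """Reconstruct approximate weights from the shared palette."""
    return [palette[i] for i in idx]
```

Storage drops from 16 or 32 bits per weight to 3 bits plus a tiny shared palette; the open problem eDKM addresses is keeping this clustering accurate and tractable when it must be differentiated through during LLM training.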
Mohammad Rastegari retweeted
Oncel Tuzel@OncelTuzel·
“Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement” is an #iccv2023 paper from #Apple. By just swapping the ImageNet dataset with the “reinforced” ImageNet+ dataset, a model can be trained up to 7x faster to reach the same accuracy.
Fartash Faghri@FartashFg

Excited that our "Dataset Reinforcement" paper introducing "ImageNet+ dataset" is accepted to #ICCV2023! Up to ~7x FASTER training! Paper: arxiv.org/abs/2303.08983 ImageNet+/Code: Coming Soon w/ @HPouransari @sacmehtauw @MFarajtabar @morastegari @OncelTuzel Ali Farhadi 1/7

Behnam Neyshabur@bneyshabur·
In the last two years, I have initiated 4 projects with junior researchers who were seeking supervision and had no prior work relationship with me or anyone I knew. One led to a NeurIPS publication, one is ongoing and the other two didn't lead to a publication. 2/6
Behnam Neyshabur@bneyshabur·
We often prefer collaborating with people we know or those of high status. That makes it very difficult for hardworking and motivated junior researchers to get enough support to flourish. Is it possible to reduce this barrier? I've been running some experiments to find out! 1/6
Behnam Neyshabur@bneyshabur·
After several years of reviewing & AC work for @NeurIPSConf, @iclr_conf & @icmlconf, I have strong opinions about the reviewing system and some suggestions that many may not like or agree with. Summarizing my points in this thread (hastily written & NOT carefully considered): 1/
Mohammad Rastegari@morastegari·
These CVPR policies are frustrating. Given all the randomness in the review process, I feel there is no point submitting papers to conferences anymore. By the law of large numbers (a large number of papers and readers), just submitting to arXiv will be enough for a good paper to shine.
Lucas Beyer (bl16)@giffmana

1/2 @CVPR so if somebody posts wrong facts about a paper I co-authored, I may not correct them in an answer in any way? i.e. you prefer the spread of falsehoods? What if someone hates me and creates anon account, acting like me and answering threads about my submission?
