Nikos Karampatziakis
@eigenikos

238 posts

Principal machine learner at Microsoft.

'; drop table location; -- · Joined June 2010
530 Following · 291 Followers
Nikos Karampatziakis retweeted
João Gante@joao_gante·
A new version of 🤗 transformers has landed 🛬 v4.44 ended up being a performance-oriented upgrade for LLM users: faster compiled models, lower GPU memory requirements, and even support for mobile devices! Let's dive 🤿 (resources in a comment)

1️⃣ `torch.compile()` updates
We've been working to expand and improve our compilation support! Recent upgrades include:
- `model.generate()` can be compiled! This is very experimental, and slightly faster than compiling the forward pass alone. More importantly, it opens the door to exporting the whole generate function as a single graph, to be used on other devices ⚡️
- 3-5x faster compilation time of `model.forward()` when used in `model.generate()`. tl;dr we now prepare a static-shape attention mask in generate. In essence, the same speedup, but less wait time the first time you run it ⌛️
- support for Whisper! Well, this was actually available since v4.43, released two weeks ago, but we haven't communicated much about it 🤫

2️⃣ CPU-offloaded KV cache
When you're using an LLM with a large context window, you'll notice that your GPU RAM quickly gets devoured by the KV cache. You had to spin up more/larger GPUs just to use the same model, or rely on CPU computations. Not anymore! v4.44 includes a CPU-offloaded KV cache that only keeps in GPU memory what it needs. In essence, while computing layer N, it moves the KV cache for layer N-1 to the CPU while prefetching layer N+1 back to the GPU. All computations happen on the GPU. This is useful for large context windows or large batch sizes -- it allows you to run larger settings without upgrading your GPU, at a minimal speed penalty. Perfect for the GPU poor team 💛 This was kindly added to our library by @eigenikos 🤗

3️⃣ `torch.export()` support for generation
The PyTorch team is working on `torch.export()` -- in a nutshell, ahead-of-time compilation that enables multiple downstream uses. You can then use the exported graph with ExecuTorch on e.g. a mobile device! We can now export LLMs with `torch.export()`. Support was kindly enabled by Guang Yang from the PyTorch team 🤗

And that's it for now! See you next release 🛫
João Gante tweet media
4 replies · 11 retweets · 70 likes · 10.8K views
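For anyone wanting to try the two headline features above, here is a minimal sketch (not an official example), assuming transformers >= 4.44 and a CUDA GPU. The checkpoint id is only a placeholder, and the exact flag names are best checked against the release notes.

```python
# Minimal sketch: CPU-offloaded KV cache and the static-cache + torch.compile path.
# Assumes transformers >= 4.44 and a CUDA GPU; the checkpoint below is just a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-0.5B-Instruct"  # placeholder; any causal LM on the Hub should work
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

inputs = tok("The KV cache grows with context length because", return_tensors="pt").to("cuda")

# CPU-offloaded KV cache: layer N-1's cache moves to the CPU while layer N computes,
# and layer N+1's cache is prefetched back to the GPU.
out = model.generate(**inputs, max_new_tokens=128, cache_implementation="offloaded")

# Static cache + compiled forward: the more established compile path
# (compiling generate() itself is still experimental per the release notes).
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
out = model.generate(**inputs, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```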
Nikos Karampatziakis retweeted
Weizhu Chen@WeizhuChen·
I would like to invite you to try phi-3-mini: aka.ms/try-phi3-hf-ch…. You can also download the weights from HF, with more model weights on the way. Besides what is described in the technical report, one specific thing I want to mention is the 128K context support. It took us a lot of effort to fit 128K context into such a mini model with little sacrifice in short-context quality. Go play with it and share your feedback with us: aka.ms/try-phi3-hf-ch…
4 replies · 18 retweets · 155 likes · 21.1K views
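For anyone who would rather download the weights than use the hosted demo, a small sketch of loading the 128K-context variant from the Hugging Face Hub follows. The repo id and the trust_remote_code requirement reflect the state at release and may have changed since.

```python
# Sketch only: load the 128K-context phi-3-mini weights from the Hub and run one chat turn.
# Assumes transformers + accelerate; the repo id / trust_remote_code flag may change over time.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-128k-instruct"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

chat = [{"role": "user", "content": "Why is 128K context hard to support in a 3.8B model?"}]
prompt = tok.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```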
Nikos Karampatziakis retweeted
Sebastien Bubeck@SebastienBubeck·
phi-3 is here, and it's ... good :-). I made a quick short demo to give you a feel of what phi-3-mini (3.8B) can do. Stay tuned for the open weights release and more announcements tomorrow morning! (And ofc this wouldn't be complete without the usual table of benchmarks!)
Sebastien Bubeck tweet media
38 replies · 175 retweets · 919 likes · 485.7K views
Nikos Karampatziakis retweeted
AK@_akhaliq·
Microsoft announces Phi-3: A Highly Capable Language Model Locally on Your Phone. We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, …
AK tweet media
11 replies · 201 retweets · 876 likes · 136.7K views
Nikos Karampatziakis retweeted
Weizhu Chen@WeizhuChen·
Meet In the Middle (MIM): A New Pretraining Paradigm. MIM (2.7B) outperforms CodeGen 16B, InCoder 6.7B, PaLM 540B, LLaMA 65B, and FIM 2.7B on code generation tasks. Read arxiv.org/abs/2303.07295 to learn why MIM could be a new pre-training paradigm for left-to-right and infilling LMs.
Weizhu Chen tweet media
18 replies · 79 retweets · 420 likes · 105.5K views
Leshem (Legend) Choshen 🤖🤗@LChoshen·
@eigenikos You wouldn't know; you might overtrain to the point that you see benefits only from the computation invested. Rephrasing: is adding the r2l better than just the l2r? Also, you made so many other improvements (the double inference and the sharing and all) that the paper would have been fine anyway.
1 reply · 0 retweets · 1 like · 22 views
Nikos Karampatziakis@eigenikos·
@LChoshen We did not try training on half the data or for an equal amount of time (the increase in time is modest, not 2x). FWIW, I expect we would still come out ahead with equal training time. Not sure I understand the second part of your question. If it were obstructing, we wouldn't be writing a paper.
1 reply · 0 retweets · 0 likes · 11 views
Leshem (Legend) Choshen 🤖🤗@LChoshen·
@eigenikos Btw, I suppose that in terms of training time it is more costly than one-sided l2r, but better performing than an l2r model trained for half the time? In other words, does learning the regularization and the other direction improve or obstruct?
1 reply · 0 retweets · 1 like · 18 views
Nikos Karampatziakis@eigenikos·
@LChoshen Glad you liked the paper! Note that when the context is just the prefix, you can do inference using the left-to-right model only. The results reporting held-out perplexity are computed with only the left-to-right model.
1 reply · 0 retweets · 0 likes · 8 views
Leshem (Legend) Choshen 🤖🤗@LChoshen·
But how can we do inference with that? Do you just run it twice? Sounds slow. On the contrary, it can be quite fast: run in parallel from both sides and, when they meet, see if they agree (on n tokens).
2 replies · 0 retweets · 2 likes · 240 views
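To make the parallel decoding idea in the reply above concrete, here is a toy sketch of generating from both ends and splicing the two streams once they agree where they meet. It is purely illustrative, not the algorithm from the MIM paper; `l2r_next` and `r2l_next` are hypothetical one-step decoders standing in for real left-to-right and right-to-left models.

```python
# Toy sketch of bidirectional decoding with an agreement check (illustrative only).
def meet_in_the_middle(prefix, suffix, l2r_next, r2l_next, total_len, n_agree=4):
    left, right = list(prefix), list(suffix)
    # Grow both sides one token per step until their spans overlap by at least n_agree tokens.
    while len(left) + len(right) - total_len < n_agree:
        left.append(l2r_next(left))          # left-to-right model extends the prefix
        right.insert(0, r2l_next(right))     # right-to-left model extends the suffix backwards
    overlap = len(left) + len(right) - total_len
    if left[-overlap:] == right[:overlap]:
        # The two sides agree where they meet: splice them and stop generating early.
        return left + right[overlap:]
    return None  # disagreement: a real system would fall back to one-sided decoding

# Tiny stand-in "models" so the sketch runs: each side just reproduces a fixed 10-token string.
target = list("abcdefghij")
l2r = lambda ctx: target[len(ctx)]                     # next token after the current prefix
r2l = lambda ctx: target[len(target) - len(ctx) - 1]   # token just before the current suffix

print("".join(meet_in_the_middle([], [], l2r, r2l, total_len=len(target))))  # -> abcdefghij
```

The early stop is where the speedup would come from: once the two halves agree, the remaining tokens never have to be generated sequentially by a single model.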
Sebastian Raschka@rasbt·
@crystalwizard Have yet to read the paper, but based on the abstract I am somewhat skeptical. If you want bidirectional training, you are back to BERT. I think that unidirectional pretraining is more aligned with the target objective of generating text one token at a time.
2 replies · 0 retweets · 0 likes · 36 views
Eric@ericmitchellai·
@_akhaliq Is this a special case of XLNet? It's only briefly mentioned in related work
1 reply · 0 retweets · 1 like · 500 views
Nikos Karampatziakis retweeted
Francesco Orabona@bremen79·
New paper with @kwangsungjun: Tight Concentrations and Confidence Sequences from the Regret of Universal Portfolio arxiv.org/abs/2110.14099. I particularly like it; let me tell you about it 🧵
1 reply · 4 retweets · 36 likes
Francesco Orabona@bremen79·
Given that even @BU_ece changed my title on the website, I guess it is now official: I have been awarded tenure and promoted to Associate Professor. This was a long and stressful journey, but I f*cking did it!!! 💪🥳🎉
29 replies · 2 retweets · 328 likes
Zachary Lipton@zacharylipton·
Since this tweet I got enough of a push from kind mentors & strangers to roll out an experimental PhD seminar: CMU 10721: "Philosophical Foundations of Machine Intelligence". As the course concept and reading plan coalesce, I'll assemble the plan here: github.com/acmi-lab/cmu-1…
Zachary Lipton@zacharylipton

I would like to teach a learning theory class that would include both statistical and philosophical theory, including original text readings from Popper & Hacking alongside Wolpert & Vapnik. I doubt my qualifications on both sides, but maybe that needn’t be...disqualifying?

15 replies · 45 retweets · 261 likes
Nikos Karampatziakis@eigenikos·
Our work on confidence bounds that hold uniformly over time for off-policy evaluation in contextual bandits has been accepted at ICML 🥳 arxiv.org/abs/2102.09540 These confidence sequences enable continuous monitoring of experiments without union bounds or peeling tricks.
2 replies · 3 retweets · 44 likes
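For readers new to confidence sequences, the sketch below illustrates the general idea of an interval that remains valid however often you peek at it. It uses a generic Robbins-style normal-mixture bound for i.i.d. observations in [0, 1]; it is not the off-policy construction from the paper, and `rho` and `alpha` are free parameters of this illustration.

```python
# Toy confidence sequence: a time-uniform interval you can inspect after every observation
# without inflating the error rate. Generic normal-mixture bound; NOT the paper's estimator.
import math
import random

def mixture_radius(t, alpha=0.05, rho=1.0, sigma2=0.25):
    """Half-width after t observations; sigma2 = 1/4 covers any values bounded in [0, 1]."""
    v = sigma2 * t  # accumulated variance proxy ("intrinsic time")
    return math.sqrt((v + rho) * math.log((v + rho) / (rho * alpha ** 2))) / t

random.seed(0)
mu, total = 0.3, 0.0
for t in range(1, 10_001):
    total += float(random.random() < mu)   # Bernoulli(mu) observations in [0, 1]
    if t in (10, 100, 1_000, 10_000):      # peek whenever you like; coverage still holds
        mean, r = total / t, mixture_radius(t)
        print(f"t={t:5d}  mean={mean:.3f}  CS=[{max(0.0, mean - r):.3f}, {min(1.0, mean + r):.3f}]")
```

With probability at least 1 - alpha, every interval in the sequence contains the true mean simultaneously, which is what makes continuous monitoring possible without union bounds over stopping times.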