Nikos Karampatziakis

238 posts

@eigenikos

Principal machine learner at Microsoft.

'; drop table location; -- Joined June 2010

530 Following · 291 Followers

Nikos Karampatziakis retweeted

João Gante@joao_gante·7 Aug

A new version of 🤗 transformers has landed 🛬 v4.44 ended up being a performance-oriented upgrade for LLM users: faster compiled models, lower GPU memory requirements, and even support for mobile devices! Let's dive 🤿 (resources in a comment)

1️⃣ `torch.compile()` updates
We've been working to expand and improve our compilation support! Recent upgrades include:
- `model.generate()` can be compiled! This is very experimental, and slightly faster than compiling the forward pass alone. More importantly, it opens the door to exporting the whole generate function as a single graph, to be used on other devices ⚡️
- 3-5x faster compilation time of `model.forward()` when used in `model.generate()`. tl;dr we now prepare a static-shape attention mask in generate. In essence, the same speedup, but less wait time the first time you run it ⌛️
- support for Whisper! Well, this was actually available since v4.43, released two weeks ago, but we haven't communicated much about it 🤫

2️⃣ CPU-offloaded KV cache
When you're using an LLM with a large context window, you'll notice that your GPU RAM quickly gets devoured by the KV cache. You had to spin up more/larger GPUs just to use the same model, or rely on CPU computations. Not anymore! v4.44 includes a CPU-offloaded KV cache that only keeps in GPU memory what it needs. In essence, while computing layer N, it moves the KV cache for layer N-1 to the CPU while prefetching the KV cache for layer N+1 to the GPU. All computations happen on GPU. This is useful for large context windows or large batch sizes -- it allows you to run larger settings without upgrading your GPU, at a minimal speed penalty. Perfect for the GPU poor team 💛 This was kindly added to our library by @eigenikos 🤗

3️⃣ `torch.export()` support for generation
The PyTorch team is working on `torch.export()` -- in a nutshell, ahead-of-time compilation to enable multiple downstream uses. You can then use the exported graph with ExecuTorch on, e.g., a mobile device! We can now export LLMs with `torch.export()`. Support was kindly enabled by Guang Yang from the PyTorch team 🤗

And that's it for now! See you next release 🛫
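The layer-by-layer schedule described in item 2️⃣ of the post above can be sketched as a toy simulation. This is only an illustration of the evict/prefetch pattern, not the transformers implementation: the function name `simulate_offloaded_pass` is hypothetical, devices are plain strings rather than real tensors, and the wrap-around prefetch of layer 0 at the end of a pass (ready for the next token) is an assumption.

```python
def simulate_offloaded_pass(num_layers):
    """Toy model of one forward pass with a layer-wise offloaded KV cache.

    Returns, for each layer computed, the sorted list of layers whose
    cache is resident on the (simulated) GPU during that step.
    """
    # Assumption: every layer's cache starts on the CPU, except layer 0,
    # which is needed first.
    device_of = {layer: "cpu" for layer in range(num_layers)}
    device_of[0] = "gpu"

    resident_per_step = []
    for layer in range(num_layers):
        prev_layer = (layer - 1) % num_layers
        next_layer = (layer + 1) % num_layers

        # Evict the cache of the layer we just finished (layer N-1).
        if prev_layer != layer:
            device_of[prev_layer] = "cpu"
        # Prefetch the cache of the layer we need next (layer N+1).
        # Assumption: after the last layer we wrap around to layer 0.
        device_of[next_layer] = "gpu"

        # The layer being computed must have its cache on the GPU.
        assert device_of[layer] == "gpu"
        resident_per_step.append(
            sorted(l for l, d in device_of.items() if d == "gpu")
        )
    return resident_per_step


print(simulate_offloaded_pass(4))
# → [[0, 1], [1, 2], [2, 3], [0, 3]]
```

In this sketch at most two layers' caches sit on the GPU at any step, however many layers the model has, which is the memory saving the post describes.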

10.8K

Nikos Karampatziakis retweeted

Weizhu Chen@WeizhuChen·23 Apr

I would like to invite you to try phi-3-mini: aka.ms/try-phi3-hf-ch…. You can also download the weights from HF, with more model weights on the way. Besides what was described in the technical report, one specific thing I want to mention is the 128K context support. It took us a lot of effort to fit 128K context in such a mini model with little sacrifice on short-context quality. Go play with it and share your feedback with us: aka.ms/try-phi3-hf-ch…

155

21.1K

Nikos Karampatziakis retweeted

Sebastien Bubeck@SebastienBubeck·23 Apr

Game on! huggingface.co/microsoft/Phi-…

341

57.8K

Nikos Karampatziakis retweeted

Sebastien Bubeck@SebastienBubeck·23 Apr

phi-3 is here, and it's ... good :-). I made a quick short demo to give you a feel of what phi-3-mini (3.8B) can do. Stay tuned for the open weights release and more announcements tomorrow morning! (And ofc this wouldn't be complete without the usual table of benchmarks!)

175

919

485.7K

Nikos Karampatziakis retweeted

AK@_akhaliq·23 Apr

Microsoft announces Phi-3: A Highly Capable Language Model Locally on Your Phone. We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing,

201

876

136.7K

Nikos Karampatziakis@eigenikos·23 Mar

@nutzer_inaktiv @WeizhuChen Thanks for bringing it to our attention, we are investigating this

Nikos Karampatziakis retweeted

Weizhu Chen@WeizhuChen·14 Mar

Meet In the Middle (MIM): A New Pretraining Paradigm. MIM (2.7B) outperforms CodeGen 16B, Incoder 6.7B, PaLM 540B, LLaMA 65B, and FIM 2.7B in code generation tasks. Read arxiv.org/abs/2303.07295 to learn why MIM could be a new pre-training paradigm for left-to-right and infilling LMs.

420

105.5K

Nikos Karampatziakis@eigenikos·16 Mar

@LChoshen I see. Good question. We have not looked into it yet.

Leshem (Legend) Choshen 🤖🤗@LChoshen·16 Mar

@eigenikos You wouldn't know; you might overtrain to the point that you see benefits only from the computation invested. Rephrasing: is the r2l better when just adding the l2r? Also, you made so many other improvements that the paper would have been fine: double inference and sharing and all.

Nikos Karampatziakis retweeted

Leshem (Legend) Choshen 🤖🤗@LChoshen·15 Mar

➡️Mindblowing pretraining paradigm⬅️ Train the same model to predict the two directions separately🔛 Better results, more parallelization arxiv.org/abs/2303.07295 @MSFTResearch @NguynTu24128917 @eigenikos @WeizhuChen #deepRead

8.9K

Nikos Karampatziakis@eigenikos·15 Mar

@LChoshen We did not try training on half the data or for an equal amount of time (the increase in time is modest, not 2x). FWIW, I expect we would come out ahead in equal-time training. Not sure I understand the second part of your question. If it were obstructing, we wouldn't be writing a paper.

Leshem (Legend) Choshen 🤖🤗@LChoshen·15 Mar

@eigenikos Btw, I suppose in terms of training time, it is more costly than one-sided l2r, but better performing than an l2r model trained half the time? In other words, does learning the regularization and the other direction improve or obstruct?

Nikos Karampatziakis@eigenikos·15 Mar

@LChoshen Glad you liked the paper! Note that when the context is just the prefix, you can do inference using the left-to-right model only. The reported held-out perplexity results are computed with only the left-to-right model.

Leshem (Legend) Choshen 🤖🤗@LChoshen·15 Mar

But how can we do inference with that? Do you just run it twice? Sounds slow. On the contrary, it can be quite fast: run in parallel from both sides and, when they meet, see if they agree (on n tokens).

240

Nikos Karampatziakis@eigenikos·14 Mar

@rasbt @crystalwizard While two directions are used during training, the resulting model is unidirectional.

Sebastian Raschka@rasbt·14 Mar

@crystalwizard Have yet to read the paper but based on the abstract I am somewhat skeptical. If you want bidirectional training you are back to BERT. I think that the unidirectional pretraining is more aligned with the target objective of generating text one token at a time.

Crystalwizard@crystalwizard·14 Mar

@rasbt comments on this?

AK@_akhaliq

Meet in the Middle: A New Pre-training Paradigm abs: arxiv.org/abs/2303.07295

Nikos Karampatziakis@eigenikos·14 Mar

@AwokeKnowing @_akhaliq We appreciate any references to any papers that have done something similar and are not cited.

139

James@AwokeKnowing·14 Mar

@_akhaliq How is this new?

174

AK@_akhaliq·14 Mar

Meet in the Middle: A New Pre-training Paradigm abs: arxiv.org/abs/2303.07295

155

22.9K

Nikos Karampatziakis@eigenikos·14 Mar

@ericmitchellai @_akhaliq XLNet does not encourage consistency and agreement among the different permutations.

Eric@ericmitchellai·14 Mar

@_akhaliq Is this a special case of XLNet? It's only briefly mentioned in related work

500

Nikos Karampatziakis@eigenikos·14 Mar

twitter.com/WeizhuChen/sta…

Weizhu Chen@WeizhuChen

152

Nikos Karampatziakis@eigenikos·14 Mar

I don't always write deep learning papers, but when I do it's 🔥 Really proud of how this work with @NguynTu24128917 and @WeizhuChen came together.

183

Nikos Karampatziakis retweeted

Francesco Orabona@bremen79·19 Jul

New paper with @kwangsungjun: Tight Concentrations and Confidence Sequences from the Regret of Universal Portfolio arxiv.org/abs/2110.14099 I particularly like it, let me tell you about it 🧵

Nikos Karampatziakis@eigenikos·16 May

@bremen79 @BU_ece Congratulations! Well deserved!

Francesco Orabona@bremen79·16 May

Given that even @BU_ece changed my title on the website, I guess it is now official: I have been awarded tenure and promoted to Associate Professor. This was a long and stressful journey, but I f*cking did it!!! 💪🥳🎉

328

Nikos Karampatziakis@eigenikos·15 May

@zacharylipton Why the laser eyes? Is Vapnik into bitcoin?

Zachary Lipton@zacharylipton·15 May

With proper preview ♥️ github.com/acmi-lab/cmu-1…

Zachary Lipton@zacharylipton·15 May

Since this tweet I got enough of a push from kind mentors & strangers to roll out an experimental PhD seminar: CMU 10721: "Philosophical Foundations of Machine Intelligence". As the course concept and reading plan coalesce, I'll assemble the plan here: github.com/acmi-lab/cmu-1…

Zachary Lipton@zacharylipton

I would like to teach a learning theory class that would include both statistical and philosophical theory, including original text readings from Popper & Hacking alongside Wolpert & Vapnik. I doubt my qualifications on both sides, but maybe that needn’t be...disqualifying?

261

Nikos Karampatziakis@eigenikos·9 May

@kwangsungjun Thanks Kwang. I was not aware of this work and it is indeed very relevant.

Kwang-Sung (Kwang) Jun@kwangsungjun·9 May

@eigenikos Interesting work! You may want to check out arxiv.org/pdf/1902.01500… I wrote with Francesco Orabona, especially Section 7.2 where we discuss how betting is related to concentration inequalities.

Nikos Karampatziakis@eigenikos·9 May

Our work on confidence bounds that hold uniformly over time for off-policy evaluation in contextual bandits has been accepted at ICML 🥳 arxiv.org/abs/2102.09540 These confidence sequences enable continuous monitoring of experiments without union bounds or peeling tricks.
