Xinyao Niu

14 posts

@sirius_ctrl

Doer

Joined May 2023

39 Following · 20 Followers

Xinyao Niu retweeted

Mr.RC｜𝟎𝐱𝐔@MrRyanChi·10 Mar

x.com/i/article/2031…

166

1.2K

5.1K

2.2M

Xinyao Niu@sirius_ctrl·1 Feb

Agent will become the new frontend of our digital world

Xinyao Niu retweeted

nader dabit@dabit3·9 Jan

x.com/i/article/2009…

128

1.3K

373.4K

Xinyao Niu retweeted

Lech Mazur@LechMazur·31 Jan

All the data is here: github.com/lechmazur/writ… The top three best stories overall are now from R1 (linked there).

1.2K

Xinyao Niu retweeted

leloy!@leloykun·26 Jan

(Linear) Attention Mechanisms as Test-Time Regression By now, you've probably already heard of linear attention, in-context learning, test-time scaling, etc... Here, I'll discuss: 1. The unifying framework that ties them all together; 2. How to derive different linear attention variants from scratch; and 3. How to parallelize training linear attention models

leloy!@leloykun

Deep Learning Optimizers from First Principles Now with more maths! In this thread, I'll discuss: 1. The difference between 1st order gradient dualization approaches and 2nd order optimization approaches. 2. Preconditioning--how to do it and why. 3. How to derive a couple of deep learning optimizers from scratch using both approaches. (1/n)

429

75.1K

Xinyao Niu retweeted

Riccardo Grazzi@riccardograzzi·22 Nov

LLMs can now track states, finally matching this cat! And we prove it. But how? 🧵👇 1/ Paper: arxiv.org/abs/2411.12537 with @julien_siems @jkhfranke @ZelaArber @FrankRHutter @MPontil

GIF

7.5K

Xinyao Niu retweeted

Jack Parker-Holder@jparkerholder·4 Dec

Introducing 🧞Genie 2 🧞 - our most capable large-scale foundation world model, which can generate a diverse array of consistent worlds, playable for up to a minute. We believe Genie 2 could unlock the next wave of capabilities for embodied agents 🧠.

277

462

2.6K

2.6M

Xinyao Niu@sirius_ctrl·30 May

Perhaps this is what translation tasks look like in the new era, and maybe this is the charm of large-scale pre-training. Perhaps, in the context of large-scale synthetic data, the critical factor is the underlying rule governing data generation, rather than the content itself?

Xinyao Niu retweeted

Bonnie Li@bonniesjli·12 Apr

How do LLMs scale to million token context window? Ring Attention is a nice trick to parallelize long sequence across devices and rotate them in a ring with zero overhead scaling. In our new blog, we cover the tricks behind this magic. It looks like this (1/5🧵)

115

678

101.7K

Xinyao Niu@sirius_ctrl·20 Dec

@FireworksAI_HQ May I ask how you get the FP8 version of mixtral?

467

Fireworks AI@FireworksAI_HQ·20 Dec

Mixtral: one more expert to break the tie. Mixtral has 8 experts, but only 2 are active for each token. Do more than 2 help? Surprisingly, it helps in fp8, but not in original 16 bit precision. With this trick, fp8 can almost match fp16 on MMLU! Why is that? 1/5

141

23K

Xinyao Niu retweeted

Horace He@cHHillee·17 Dec

Two additions to gpt-fast this week. The first one is an optimization to tensor-parallelism added by @foofoobuggy which improves our TP perf by 20-50%. This gives us 200 => 330 tok/s for Llama-7B fp16 and 64 => 91 tok/s for Llama-70B int4 with *no* speculative decoding. (1/4)

389

107.3K

Xinyao Niu retweeted

Simon Boehm@Si_Boehm·3 Jan

I wrote the most naive CUDA matrix multiply and iteratively optimised it to ~80% of cuBLAS performance: siboehm.com/articles/22/CU…

166

1.1K

249K

Xinyao Niu retweeted

Dmitry Tuzoff@Tuzoff·22 Nov

Very cool dataset. BTW, recently, I tried ChatGPT 4 on Caribou Contest (online Canadian math Olympiad) tasks for Grade 2 and Grade 7-8 (I photographed each problem from screen). To my surprise, it solved only 1 out of 8 for Grade 1 and 9 out of 14 for Grade 7-8. The problem is that practically all of the problems for lower grades are visual and almost all of the problems for upper grades are textual. Turns out GPT4 is great at OCR but poorer at precise object classification and abstracting higher-level concepts from images. Can be a nice way to test multi-modal models. I’ll be happy if someone develops this further.

1.1K

Xinyao Niu retweeted

Jim Fan@DrJimFan·21 Nov

Instead of taking OAI's merger offer, Anthropic launched major updates for Claude 2.1🎉. I think the below chart is the most interesting: this is how all LLM papers that claim "long context" should report: error rates on "Beginning", "Middle", and "End". There're a bunch of papers making wild claims, all the way up to "1B context tokens". Here's a friendly reminder that the 30-year-old LSTM literally supports infinite context. It's a meaningless number unless you show detailed evaluations at different locations in the context. LLMs tend to be "Lost in the Middle", i.e. struggle to remember and reason on information at the middle section of the context window: arxiv.org/abs/2307.03172 Claude 2.1 also claims "2x hallucination" - please take this with a BIG grain of salt. A while back, I expressed my concerns about Vectara's benchmarking protocol. Same concerns apply here too. The trivial solution to achieve 0% hallucination is simply refusing to answer every query. One cannot claim victory here without a careful Safety vs Usefulness analysis. How many questions that Claude used to answer correctly are now rejected? In any case, kudos to Dario & Anthropic team on assuring us a solid alternative during turmoil! 🩷anthropic.com/index/claude-2…

609

134.4K
