Xinyao Niu

14 posts

@sirius_ctrl

Doer

Joined May 2023
39 Following · 20 Followers
Xinyao Niu@sirius_ctrl·
Agents will become the new frontend of our digital world.
Xinyao Niu retweeted
leloy! @leloykun
(Linear) Attention Mechanisms as Test-Time Regression

By now, you've probably already heard of linear attention, in-context learning, test-time scaling, etc. Here, I'll discuss:
1. The unifying framework that ties them all together;
2. How to derive different linear attention variants from scratch; and
3. How to parallelize training of linear attention models.
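The "test-time regression" view in the thread above can be made concrete with a tiny sketch. This is my own minimal NumPy illustration of causal linear attention (identity feature map, no normalization), not code from the thread: each step folds the current key-value pair into a running state S, and the output is the query read against that state.

```python
import numpy as np

def causal_linear_attention(Q, K, V):
    """Causal linear attention: o_t = q_t @ S_t with S_t = sum_{i<=t} k_i v_i^T."""
    T, d = Q.shape
    S = np.zeros((d, V.shape[1]))    # running "regression" state
    out = np.zeros_like(V)
    for t in range(T):
        S += np.outer(K[t], V[t])    # fold in the t-th key-value pair
        out[t] = Q[t] @ S            # read the state with the t-th query
    return out
```

Because the state update is associative, this recurrence matches the masked quadratic form `np.tril(Q @ K.T) @ V` exactly, which is what makes chunk-wise parallel training possible.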
leloy!@leloykun

Deep Learning Optimizers from First Principles. Now with more maths! In this thread, I'll discuss:
1. The difference between 1st-order gradient dualization approaches and 2nd-order optimization approaches;
2. Preconditioning: how to do it and why;
3. How to derive a couple of deep learning optimizers from scratch using both approaches. (1/n)

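To illustrate the preconditioning point from the quoted thread: below is a sketch of my own (not code from the thread) of a diagonally preconditioned SGD step in the style of RMSProp, where a running second moment of the gradient rescales each coordinate of the update.

```python
import numpy as np

def precond_step(w, g, v, lr=1e-2, beta=0.999, eps=1e-8):
    """One diagonally preconditioned step: v tracks a running mean of g**2."""
    v = beta * v + (1 - beta) * g ** 2       # diagonal preconditioner estimate
    w = w - lr * g / (np.sqrt(v) + eps)      # per-coordinate rescaled update
    return w, v
```

On an ill-conditioned quadratic, this per-coordinate rescaling roughly equalizes step sizes across directions, which is the whole point of preconditioning.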
Xinyao Niu retweeted
Jack Parker-Holder @jparkerholder
Introducing 🧞Genie 2 🧞 - our most capable large-scale foundation world model, which can generate a diverse array of consistent worlds, playable for up to a minute. We believe Genie 2 could unlock the next wave of capabilities for embodied agents 🧠.
Xinyao Niu@sirius_ctrl·
Perhaps this is what translation tasks look like in the new era, and maybe this is the charm of large-scale pre-training. In a world of large-scale synthetic data, perhaps the critical factor is the underlying rule governing data generation, rather than the content itself?
Xinyao Niu retweeted
Bonnie Li @bonniesjli
How do LLMs scale to a million-token context window? Ring Attention is a nice trick: parallelize the long sequence across devices and rotate key-value blocks around them in a ring, scaling with zero overhead. In our new blog, we cover the tricks behind this magic. It looks like this (1/5🧵)
[image]
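Here is a single-process sketch of the rotation the blog describes, written as an illustration rather than taken from it: each "device" keeps its query block fixed while key-value blocks hop around the ring, and an online-softmax accumulator keeps the result exact.

```python
import numpy as np

def ring_attention(Q, K, V, n_dev):
    """Non-causal attention computed as if KV blocks rotate around n_dev devices."""
    Qb, Kb, Vb = np.split(Q, n_dev), np.split(K, n_dev), np.split(V, n_dev)
    outs = []
    for d in range(n_dev):                       # each "device" owns one Q block
        q = Qb[d]
        m = np.full((q.shape[0], 1), -np.inf)    # running row max
        l = np.zeros((q.shape[0], 1))            # running softmax denominator
        acc = np.zeros((q.shape[0], V.shape[1])) # running weighted-value sum
        for step in range(n_dev):
            j = (d + step) % n_dev               # KV block held after `step` hops
            s = q @ Kb[j].T
            m_new = np.maximum(m, s.max(axis=1, keepdims=True))
            scale = np.exp(m - m_new)            # rescale old accumulators
            p = np.exp(s - m_new)
            l = l * scale + p.sum(axis=1, keepdims=True)
            acc = acc * scale + p @ Vb[j]
            m = m_new
        outs.append(acc / l)
    return np.vstack(outs)
```

Causal masking and the overlap of communication with compute (the "zero overhead" part) are omitted here; the sketch only shows why the rotation reproduces full attention exactly.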
Fireworks AI@FireworksAI_HQ·
Mixtral: one more expert to break the tie. Mixtral has 8 experts, but only 2 are active for each token. Does using more than 2 help? Surprisingly, it helps in fp8, but not in the original 16-bit precision. With this trick, fp8 can almost match fp16 on MMLU! Why is that? 1/5
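For reference, top-k expert routing of the kind Mixtral uses looks roughly like this single-token sketch (my own, with made-up names; real implementations route whole batches and follow the model's configured normalization):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_forward(x, router_w, expert_ws, k=2):
    """Route one token through the top-k of len(expert_ws) experts."""
    logits = x @ router_w                        # one router logit per expert
    topk = np.argsort(logits)[-k:]               # indices of the k largest logits
    gates = softmax(logits[topk])                # renormalize over chosen experts
    return sum(g * (x @ expert_ws[i]) for g, i in zip(gates, topk))
```

The fp8-vs-fp16 question in the thread is about whether an extra expert's contribution compensates quantization error; raising `k` here only changes how many expert outputs get blended.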
Xinyao Niu retweeted
Horace He @cHHillee
Two additions to gpt-fast this week. The first one is an optimization to tensor-parallelism added by @foofoobuggy which improves our TP perf by 20-50%. This gives us 200 => 330 tok/s for Llama-7B fp16 and 64 => 91 tok/s for Llama-70B int4 with *no* speculative decoding. (1/4)
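Tensor parallelism for a linear layer can be sketched as a column split (a NumPy illustration of the general idea, not gpt-fast code): each shard computes its slice of the output independently, and a gather stitches the slices together.

```python
import numpy as np

def column_parallel_linear(x, W, n_dev):
    """Column-parallel linear layer: each of n_dev shards owns a slice of W's columns."""
    shards = np.split(W, n_dev, axis=1)        # per-device weight shards
    partials = [x @ s for s in shards]         # independent per-device matmuls
    return np.concatenate(partials, axis=-1)   # "all-gather" of the partial outputs
```

In a real TP setup the concatenate is an all-gather collective across GPUs, and that communication is exactly what optimizations like the one above try to shrink or overlap.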
Xinyao Niu retweeted
Simon Boehm @Si_Boehm
I wrote the most naive CUDA matrix multiply and iteratively optimised it to ~80% of cuBLAS performance: siboehm.com/articles/22/CU…
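The "most naive" starting point is the plain triple loop; here is a Python rendering of that reference algorithm (the article's kernels are CUDA, so this is only the algorithm being optimized, not the optimized code):

```python
import numpy as np

def naive_matmul(A, B):
    """C[i, j] = sum_k A[i, k] * B[k, j], one scalar multiply-add at a time."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i, j] += A[i, k] * B[k, j]
    return C
```

Optimizations such as tiling and coalesced memory access restructure this same loop nest for locality without changing what it computes.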
Xinyao Niu retweeted
Dmitry Tuzoff @Tuzoff
Very cool dataset, BTW. Recently I tried ChatGPT 4 on Caribou Contest (an online Canadian math olympiad) tasks for Grade 2 and Grade 7-8 (I photographed each problem from the screen). To my surprise, it solved only 1 out of 8 for the lower grade and 9 out of 14 for Grade 7-8.

The catch is that practically all of the problems for the lower grades are visual, while almost all of the problems for the upper grades are textual. It turns out GPT-4 is great at OCR but poorer at precise object classification and at abstracting higher-level concepts from images.

This could be a nice way to test multi-modal models. I'll be happy if someone develops this further.
Xinyao Niu retweeted
Jim Fan @DrJimFan
Instead of taking OAI's merger offer, Anthropic launched major updates for Claude 2.1 🎉. I think the chart below is the most interesting part: this is how every LLM paper that claims "long context" should report results, with error rates at the "Beginning", "Middle", and "End" of the context.

There are a bunch of papers making wild claims, all the way up to "1B context tokens". Here's a friendly reminder that the 30-year-old LSTM literally supports infinite context. The number is meaningless unless you show detailed evaluations at different locations in the context. LLMs tend to be "Lost in the Middle", i.e. they struggle to remember and reason over information in the middle section of the context window: arxiv.org/abs/2307.03172

Claude 2.1 also claims "2x hallucination". Please take this with a BIG grain of salt. A while back, I expressed my concerns about Vectara's benchmarking protocol, and the same concerns apply here. The trivial solution to achieve 0% hallucination is simply refusing to answer every query; one cannot claim victory without a careful safety-vs-usefulness analysis. How many questions that Claude used to answer correctly are now rejected?

In any case, kudos to Dario & the Anthropic team for assuring us of a solid alternative during the turmoil! 🩷 anthropic.com/index/claude-2…
[image]
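The evaluation protocol the tweet asks for can be mocked up in a few lines. Everything here is hypothetical (`mock_model` is a toy stand-in; a real harness would call an actual LLM): insert a "needle" fact at different relative depths of a long context and record whether the model retrieves it at each position.

```python
def needle_eval(query_model, haystack, needle, positions=(0.0, 0.5, 1.0)):
    """Insert `needle` at each relative depth of `haystack` and record retrieval."""
    labels = {0.0: "beginning", 0.5: "middle", 1.0: "end"}
    results = {}
    for p in positions:
        idx = int(p * len(haystack))
        ctx = haystack[:idx] + [needle] + haystack[idx:]   # plant the needle
        results[labels.get(p, f"{p:.2f}")] = query_model(ctx, needle)
    return results

def mock_model(ctx, needle):
    """Toy stand-in that only 'remembers' items in the first 60% of its context."""
    return ctx.index(needle) < 0.6 * len(ctx)
```

Reporting per-position accuracy like this is what separates a meaningful context-length claim from a headline number.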