Brian Lester

93 posts


@blester125

Senior Research Engineer at Google DeepMind working on parameter-efficient adaptation and few-shot generalization, mostly within NLP. Views are my own. he/him

Joined July 2013
243 Following · 449 Followers
Brian Lester@blester125·
@jkobject Merging models is also handled by plug-ins! github.com/r-three/git-th… Let me know if you want any guidance on writing the plug-in, it would be nice to have something beyond simple averaging!
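
A minimal sketch of what a merge plug-in beyond simple averaging might look like. The class layout, `name` attribute, and `merge` signature here are illustrative assumptions, not Git-Theta's actual plug-in API (see the linked repo for that):

```python
# Hypothetical merge plug-in sketch. The class shape and merge()
# signature are assumptions for illustration, not Git-Theta's real API.
import numpy as np

class FisherWeightedMerge:
    """Go beyond simple averaging: combine two versions of a parameter
    using per-parameter importance weights (e.g., Fisher information)."""

    name = "fisher-weighted"  # how a plug-in might be selected

    def merge(self, param_a: np.ndarray, param_b: np.ndarray,
              fisher_a: np.ndarray, fisher_b: np.ndarray) -> np.ndarray:
        total = fisher_a + fisher_b + 1e-12  # guard against zero weights
        # Per-parameter convex combination; with uniform weights this
        # reduces to the simple average.
        return (fisher_a * param_a + fisher_b * param_b) / total
```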
Jérémie Kalfon@jkobject·
@blester125 Next you should implement ideas from the paper on weight authoring for merging multiple fine-tuned models!
Brian Lester@blester125·
We just pushed a new update adding support for the (very impressive) safetensors library from our friends at @huggingface! Git-Theta's plug-in system meant that we spent more time waiting on CI/CD than actually adding support (I'll get off my soapbox now 🧼📦).
Brian Lester@blester125

Introducing Git-Theta, a Git extension that enables collaborative and continual development of ML models with merges, diffs, and parameter-efficient updates—all using the standard Git workflow! 📄 arxiv.org/abs/2306.04529 💽 github.com/r-three/git-th… 🗣️ cccml.zulipchat.com 🧵⬇️

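
For reference, the round-trip the new support builds on is just safetensors' own dict-of-arrays save/load (nothing Git-Theta-specific in this snippet):

```python
# The safetensors API the new checkpoint support wraps: a flat dict of
# arrays saved and loaded without pickle.
import numpy as np
from safetensors.numpy import save_file, load_file

params = {
    "encoder.layer0.weight": np.ones((4, 4), dtype=np.float32),
    "encoder.layer0.bias": np.zeros(4, dtype=np.float32),
}
save_file(params, "model.safetensors")
restored = load_file("model.safetensors")
assert np.array_equal(params["encoder.layer0.weight"],
                      restored["encoder.layer0.weight"])
```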
Brian Lester@blester125·
@samcoward @colinraffel What the "leaves" of a model are is controlled by the checkpoint plug-in github.com/r-three/git-th…. A new plug-in that returns layers instead of weights may do what you want (although other parts might need to be tweaked; we made some assumptions about single tensors)
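
A sketch of the idea, with hypothetical function names (not Git-Theta's actual checkpoint plug-in interface): the plug-in decides what counts as a leaf, so a layer-level variant simply stops flattening one level earlier:

```python
# Hypothetical checkpoint plug-in sketch; names are illustrative.
from typing import Any, Dict

def tensor_leaves(ckpt: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Tensor-level leaves: recurse all the way down to single arrays."""
    leaves: Dict[str, Any] = {}
    for key, value in ckpt.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            leaves.update(tensor_leaves(value, path))
        else:
            leaves[path] = value
    return leaves

def layer_leaves(ckpt: Dict[str, Any]) -> Dict[str, Any]:
    """Layer-level leaves: stop at the top level, so each layer's whole
    dict of tensors is tracked as one unit (single-tensor assumptions
    elsewhere would then need tweaking, as noted above)."""
    return dict(ckpt)
```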
Brian Lester@blester125·
Git-Theta is designed around plug-ins—this means that if we don’t support your favorite framework, merging strategy, or parameter-efficient update yet, you can add it! Join us on GitHub github.com/r-three/git-th… or Zulip cccml.zulipchat.com to start contributing!
Brian Lester retweeted
Tu Vu@tuvllms·
While parameter-efficient tuning methods were originally proposed to reduce computation & storage costs, it turns out they can help overcome catastrophic forgetting and thus improve performance on zero-shot cross-lingual generation. Check out our work @GoogleAI @emnlpmeeting 👇 1/10
Brian Lester@blester125·
Am I missing something wrt the name "gradient checkpointing"? Clearing cached activations and recomputing them in the backward pass seems like the opposite of checkpointing. The name makes it sound like we are storing the activations on disk. docs.aws.amazon.com/sagemaker/late…
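
For context, here is the technique under that name in PyTorch: the checkpointed block's activations are dropped in the forward pass and recomputed during backward, trading compute for memory; nothing touches disk:

```python
# "Gradient checkpointing" in PyTorch: activations inside the block are
# not cached on the forward pass; they are recomputed during backward.
# Memory is traded for compute -- nothing is stored on disk.
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(128, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 128), torch.nn.ReLU(),
)
x = torch.randn(32, 128, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)  # forward, no caching
y.sum().backward()  # block re-runs here to rebuild activations
```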
Brian Lester@blester125·
@LiamFedus Shouldn't GPT be earlier in your timeline? The first GPT paper isn't on arXiv (read: timestamped), but it was cited by BERT.
William Fedus@LiamFedus·
A brief 4 year LLM history: enc-only (BERT) -> enc-dec (T5) -> dec-only (GPT) As of 2022, the most compute is in decoder models -- what research supports this? Is this the best approach? Enc-dec: T5, AlphaCode, Switch, ST-MoE, RETRO Dec-only: GPT-{1,2,3}, {🐭, 🐹}, PaLM
Brian Lester retweeted
Tu Vu@tuvllms·
Happy to share our soft prompt transfer (SPoT) paper made it to #ACL2022 🎉. On the SuperGLUE leaderboard, SPoT is the first parameter-efficient approach that is competitive with methods that tune billions of parameters. w/ @blester125, @noahconst, @aboSamoor, @daniel_m_cer
Tu Vu@tuvllms

Sharing my internship work @GoogleAI: 1) w/ Soft Prompt Transfer, Prompt Tuning matches or significantly outperforms Model Tuning across model sizes, 2) tasks can help each other via their prompts & task prompts can be used as task embeddings to formalize task similarity. 🧵 1/8

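
A sketch of the "prompts as task embeddings" idea from the thread: pool each task's learned soft prompt into a single vector and compare tasks by cosine similarity. Shapes and the mean-pooling choice are illustrative assumptions, not the paper's exact recipe:

```python
# Sketch of "task prompts as task embeddings" (shapes illustrative):
# a learned soft prompt is a [prompt_len, d_model] matrix per task.
import numpy as np

def task_embedding(prompt: np.ndarray) -> np.ndarray:
    """Mean-pool the prompt tokens into a single task vector."""
    return prompt.mean(axis=0)

def task_similarity(prompt_a: np.ndarray, prompt_b: np.ndarray) -> float:
    a, b = task_embedding(prompt_a), task_embedding(prompt_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
mnli, rte = rng.normal(size=(100, 768)), rng.normal(size=(100, 768))
print(task_similarity(mnli, rte))  # higher -> better transfer source
```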
Brian Lester@blester125·
@KarimiRabeeh Plus, arxiv.org/abs/2110.04366 reformulates prompt-like approaches as a weighted sum of Attn(Q,K,V) and Attn(Q,Pk,Pv). |K|>>|Pk|, so the overhead is minimal. This reformulation is really cool and shows that prompt-like and adapter methods just differ in where they are applied.
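
The identity behind that claim can be checked numerically: attention over the concatenated keys/values [K;Pk], [V;Pv] decomposes exactly into a gated sum of attention over the original context and attention over the prompt alone. A minimal sketch, omitting the usual 1/sqrt(d) scaling:

```python
# Numeric check of the reformulation in arxiv.org/abs/2110.04366:
# Attn(q,[K;Pk],[V;Pv]) == (1-lam)*Attn(q,K,V) + lam*Attn(q,Pk,Pv).
import numpy as np

rng = np.random.default_rng(0)
d, n_k, n_p = 16, 50, 5            # |K| >> |Pk|, so overhead is small
q = rng.normal(size=d)
K, V = rng.normal(size=(n_k, d)), rng.normal(size=(n_k, d))
Pk, Pv = rng.normal(size=(n_p, d)), rng.normal(size=(n_p, d))

def attn(q, K, V):
    w = np.exp(K @ q)              # unnormalized attention weights
    return (w / w.sum()) @ V

# Gate: total unnormalized weight the query puts on the prompt keys.
lam = np.exp(Pk @ q).sum() / (np.exp(K @ q).sum() + np.exp(Pk @ q).sum())

full = attn(q, np.vstack([K, Pk]), np.vstack([V, Pv]))
split = (1 - lam) * attn(q, K, V) + lam * attn(q, Pk, Pv)
assert np.allclose(full, split)
```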
Rabeeh Karimi@KarimiRabeeh·
Generally, I am not sure why the NLP community is so excited about prompt-tuning methods currently; here are my arguments: 1) attention scales quadratically with sequence length, and prompt tuning adds to token length 2) prompt tuning is usually slow to converge