Alec Radford
@AlecRad
ML developer/researcher at OpenAI
San Francisco, CA · Joined October 2012
297 Following · 61.9K Followers
562 posts

Pinned Tweet
Alec Radford @AlecRad ·
What I've been working on for the past year! blog.openai.com/p/7fa97c36-611… Inspired by CoVe, ELMo, and ULMFiT we show that a single transformer language model can be finetuned to a wide variety of NLP tasks and performs very well with little tuning/tweaking.
Alec Radford retweeted
Grace Luo @graceluo_ ·
We trained diffusion models on a billion LLM activations, and we want you to use them! New preprint: Learning a Generative Meta-Model of LLM Activations Joint work with @feng_jiahai, @trevordarrell, @AlecRad, @JacobSteinhardt. More in thread 🧵
Alec Radford retweeted
Neil Rathi @neil_rathi ·
New paper, w/ @AlecRad. Models acquire a lot of capabilities during pretraining. We show that we can precisely shape what they learn simply by filtering their training data at the token level.
Alec Radford @AlecRad ·
@skornblith @DGBassani It's the max width with 12 layers that could fit in memory on the dev box that trained GPT-1. Also worked out to a month to train which was edge of my patience. The prototypes went 6 layer 512 wide (og tformer paper "base") to 12 layer 512 wide to 12 layer 768 wide.
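A back-of-the-envelope check of that sizing. This is a sketch, not GPT-1's actual code; the 40,478-token BPE vocabulary and 512-token context are assumptions taken from the GPT-1 paper, and biases/LayerNorm parameters are ignored:

```python
def transformer_params(n_layer=12, d_model=768, vocab=40478, n_ctx=512):
    """Rough decoder-only transformer parameter count (ignores biases/LayerNorm)."""
    attn = 4 * d_model ** 2            # Q, K, V, and output projections
    mlp = 8 * d_model ** 2             # two linear layers with 4x hidden expansion
    emb = (vocab + n_ctx) * d_model    # token + learned position embeddings
    return n_layer * (attn + mlp) + emb

print(transformer_params())  # ~116M, in line with GPT-1's commonly cited ~117M
```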
Simon Kornblith @skornblith ·
@DGBassani The original Transformer paper has results for nets with layers of dimension 512 (2^9) and 1024 (2^10). 768 is halfway between 512 and 1024, so it's a logical choice. The BERT paper says they use 768 because that's what GPT did, so @AlecRad might be the person who decided this.
Alec Radford @AlecRad ·
@NPCollapse The raw version used for gpt-2 is available at gs://gpt-2/data/lambada_development.jsonl and gs://gpt-2/data/lambada_test.jsonl
Connor Leahy @NPCollapse ·
As an aside, is it just me or is the LAMBADA dataset website currently down? Anyone know how I can acquire the dataset?
Alec Radford @AlecRad ·
@chipro Dynamic eval improves an AWD-LSTM baseline by 0.11 nats. Can't be sure it'd have equal sized benefits for both architectures (though arxiv.org/abs/1904.08378 suggests it works fine) but if that gain carried over, the Transformer-XL model would be 48.6 test perplexity.
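The arithmetic behind that estimate: perplexity is the exponential of the mean negative log-likelihood in nats, so a 0.11-nat gain scales perplexity by e^-0.11. A quick check (the tweet's 48.6 presumably starts from a slightly different baseline figure than the rounded 54.5):

```python
import math

txl_ppl = 54.5    # Transformer-XL PTB test perplexity, from the thread
gain_nats = 0.11  # dynamic-eval improvement reported for the AWD-LSTM baseline

# ppl = exp(mean NLL), so subtracting nats multiplies perplexity by exp(-gain)
estimate = math.exp(math.log(txl_ppl) - gain_nats)
print(round(estimate, 1))  # 48.8
```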
Chip Huyen @chipro ·
@AlecRad SOTA for PTB without extra data is 46.54 (Transformer-XL 54.5). On paperswithcode, all top models on WikiText-103 & 1billion are transformer, and all top models on small datasets are lstm. Could just be hp but could also be something else paperswithcode.com/sota/language-…
Chip Huyen @chipro ·
NLP folks, I have a quick question: why does RNN work much better than Transformer on small datasets like Penn Treebank and WikiText-2?
Alec Radford @AlecRad ·
This is a really fun live experiment with twitch chat predictably oscillating between love and hate based on the sample.
Alec Radford retweeted
Christine McLeavey @mcleavey ·
Extremely excited to share work I've been doing at OpenAI the past few months: MuseNet, a neural net music generator. It's been a huge team effort pulling this all together!
OpenAI @OpenAI

Introducing MuseNet, a neural network which discovered how to generate music using many different instruments and styles. Listen & interact: openai.com/blog/musenet/ MuseNet will play an experimental concert today from 12–3pm PT on livestream: twitch.tv/openai

Alec Radford retweeted
Jeremy Howard @jeremyphoward ·
@RogerGrosse Do you have a sense of why high dropout is useful in the earliest iterations?
Roger Grosse @RogerGrosse ·
Excited to release our paper on Self-Tuning Networks, a way of adapting regularization hyperparameters online during training. This is the work of Matt MacKay, Paul Vicol, and @jonLorraine9, to appear at ICLR 2019. arxiv.org/abs/1903.03088
Alec Radford @AlecRad ·
@tallinzen @mcxfrank @emilymbender @yoavgo Don't know exact # since there is not a traditional word-level tokenization step. There are 9B tokens total and the ratio is probably around 1.1 tokens per word? You can probably just call those tokens words for the purpose of a # on a slide.
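Spelled out, the estimate implied by those two numbers:

```python
tokens = 9e9            # total BPE tokens in WebText, from the tweet
tokens_per_word = 1.1   # rough tokens-per-word ratio assumed in the tweet

words = tokens / tokens_per_word
print(f"{words / 1e9:.1f}B words")  # 8.2B words
```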
Michael C. Frank @mcxfrank ·
Hey, anyone have a guess about approx. how many words are in the WebText corpus that OpenAI used for their recent language model (or have any other links to approximate amounts of data for training recent language models)? I'm working on a talk and would like to include this...
Alec Radford retweeted
Graham Neubig @gneubig ·
One commonly cited argument about the difficulty of learning common-sense reasoning is that "no-one writes down common sense". A counter-argument is "well, the web is big": instructables.com/id/How-To-Open…
Jacob Andreas @jacobandreas ·
@AlecRad Oh then this was really unclear---I mean that in my part of NLP-land we're conditioned to auto-reject things that don't do well under a particular style of evaluation, and this is a problem with my part of NLP-land!
Jacob Andreas @jacobandreas ·
(Back when I was starting to do research with @nyhabash I remember "discovering" that I could improve translation by 3--4 BLEU points---a huge bump---with an oracle choice among the top 5 candidates. He said this was well known and that I would lose my mind trying to exploit it.)
Alec Radford @AlecRad ·
@jacobandreas Sorry - I interpreted: "if a paper had crossed my desk saying here are some hand-curated best-of-25 samples from our model + PPL comparisons with models trained on other datasets" as about the paper - especially since the second half of the statement is about the paper.
Jacob Andreas @jacobandreas ·
@AlecRad Yeah this was just a comment on the blog post, not the paper. I find the zero-shot focus of the paper much more compelling! (And better evidence of non-copying.) But it's the unicorns everyone's talking about....
Alec Radford @AlecRad ·
@jacobandreas The paper relegates samples to the appendix. The unicorn sample is on page 20 and used to make a qualitative point. Almost everything else in the paper is random samples.
Jacob Andreas @jacobandreas ·
And honestly, if a paper had crossed my desk saying "here are some hand-curated best-of-25 samples from our model + PPL comparisons with models trained on other datasets", I would have said "this is not science" and recommended reject. I think a lot of NLPers would do the same.
Alec Radford @AlecRad ·
@jacobandreas Those samples use a different technique than the ones shown in the blog. The samples you are looking at are temperature=1. We use top_k=40. Unconditional samples with that are here: github.com/openai/gpt-2/b… It's also important to note that conditioning on "real" text helps too.
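For readers unfamiliar with top-k sampling: instead of sampling from the full softmax distribution at temperature 1, only the k highest-probability tokens are kept and renormalized before sampling. A minimal sketch of the idea, not the gpt-2 repo's actual implementation:

```python
import math
import random

def top_k_sample(logits, k=40, temperature=1.0):
    """Sample a token index from the k highest-logit candidates."""
    # Indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)  # subtract max before exp for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return random.choices(top, weights=weights)[0]
```

With k=1 this reduces to greedy decoding; with k equal to the vocabulary size it is ordinary temperature sampling.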
Jacob Andreas @jacobandreas ·
I think the most interesting thing about the current LM discussion is the huge quality difference between the raw samples in github (github.com/openai/gpt-2/b…) and the cherry-picked samples on the blog. The raw ones are good, but not fall-out-of-your-chair good like Zombie Kennedy.
Alec Radford retweeted
Nando de Freitas @NandoDF ·
First, reproducibility is not about rerunning code to get the same results. Science must be more robust, as naive copying has many flaws. Second, reproducibility should never be above public safety. We must publish responsibility, with hope and kindness in our minds.
Volodymyr Kuleshov 🇺🇦 @volokuleshov

@NandoDF @ilyasut @icmlconf @iclr2019 Don't the benefits of increased reproducibility and rigor on the part of the authors greatly outweigh any potential misuses of their work, at least for the vast majority of ICML/ICLR papers? I think the current shift towards empirical work puts a greater need on releasing code.

Alec Radford retweeted
Joshua Achiam @jachiam0 ·
I'd like to weigh in on the #GPT2 discussion. The decision not to release the trained model was carefully considered and important for norm-forming. Serving the public good requires us to draw lines on release somewhere: better long before catastrophe than after.
Alec Radford @AlecRad ·
By the way - I think a valid (if extreme) take on GPT-2 is "lol you need 10,000x the data, 1 billion parameters, and a supercomputer to get current DL models to generalize to Penn Treebank."
Alec Radford @AlecRad ·
@dennybritz @egrefen @OpenAI We are using a training dataset that is 4,289x larger than WikiText-2. The two that have higher overlaps are Wikipedia. People often quote (or just plagiarize) Wikipedia given how often it is used as a source. When you add more Wiki data (WikiText-103) WebText is lower, though.
Denny Britz @dennybritz ·
@AlecRad @egrefen @OpenAI Isn’t it still strange that the WebText overlap is larger for some datasets than the actual dataset overlap? How come?
Edward Grefenstette @egrefen ·
This just caught my eye in the @OpenAI GPT-2 paper: overlap between test-set 8-grams (8!) and the training set. This seems very high to me. Corpus-savvy peeps: Is such 8-gram overlap indeed high?
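The measurement under discussion: the fraction of a test set's 8-grams that also occur anywhere in the training set. A minimal sketch (token lists stand in for the actual corpora; the GPT-2 paper's exact tokenization and normalization may differ):

```python
def ngrams(tokens, n=8):
    """Set of all contiguous n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(test_tokens, train_tokens, n=8):
    # Fraction of the test set's unique n-grams also present in training data
    test_ng = ngrams(test_tokens, n)
    return len(test_ng & ngrams(train_tokens, n)) / max(len(test_ng), 1)
```

For example, `ngram_overlap(list("abcdefgh"), list("xabcdefghy"))` is 1.0, since the test sequence's only 8-gram appears verbatim in the training sequence.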