Alec Radford
@AlecRad
ML developer/researcher at OpenAI
San Francisco, CA · Joined October 2012
297 Following · 61.9K Followers
562 posts

Pinned Tweet
Alec Radford @AlecRad ·
What I've been working on for the past year! blog.openai.com/p/7fa97c36-611… Inspired by CoVe, ELMo, and ULMFiT we show that a single transformer language model can be finetuned to a wide variety of NLP tasks and performs very well with little tuning/tweaking.
Alec Radford retweeted
Grace Luo @graceluo_ ·
We trained diffusion models on a billion LLM activations, and we want you to use them! New preprint: Learning a Generative Meta-Model of LLM Activations Joint work with @feng_jiahai, @trevordarrell, @AlecRad, @JacobSteinhardt. More in thread 🧵
Alec Radford retweeted
Neil Rathi @neil_rathi ·
New paper, w/ @AlecRad. Models acquire a lot of capabilities during pretraining. We show that we can precisely shape what they learn simply by filtering their training data at the token level.
Alec Radford @AlecRad ·
@skornblith @DGBassani It's the max width with 12 layers that could fit in memory on the dev box that trained GPT-1. Also worked out to a month to train which was edge of my patience. The prototypes went 6 layer 512 wide (og tformer paper "base") to 12 layer 512 wide to 12 layer 768 wide.
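A back-of-the-envelope check of that sizing. This is a sketch, not GPT-1's actual code; the 40,478-token BPE vocabulary and 512-token context are assumptions taken from the GPT-1 paper, and biases/LayerNorm parameters are ignored:

```python
def transformer_params(n_layer=12, d_model=768, vocab=40478, n_ctx=512):
    """Rough decoder-only transformer parameter count (ignores biases/LayerNorm)."""
    attn = 4 * d_model ** 2            # Q, K, V, and output projections
    mlp = 8 * d_model ** 2             # two linear layers with 4x hidden expansion
    emb = (vocab + n_ctx) * d_model    # token + learned position embeddings
    return n_layer * (attn + mlp) + emb

print(transformer_params())  # ~116M, in line with GPT-1's commonly cited ~117M
```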
Simon Kornblith @skornblith ·
@DGBassani The original Transformer paper has results for nets with layers of dimension 512 (2^9) and 1024 (2^10). 768 is halfway between 512 and 1024, so it's a logical choice. The BERT paper says they use 768 because that's what GPT did, so @AlecRad might be the person who decided this.
Alec Radford @AlecRad ·
@NPCollapse The raw version used for gpt-2 is available at gs://gpt-2/data/lambada_development.jsonl and gs://gpt-2/data/lambada_test.jsonl
Connor Leahy @NPCollapse ·
As an aside, is it just me or is the LAMBADA dataset website currently down? Anyone know how I can acquire the dataset?
Alec Radford @AlecRad ·
@chipro Dynamic eval improves an AWD-LSTM baseline by 0.11 nats. Can't be sure it'd have equal sized benefits for both architectures (though arxiv.org/abs/1904.08378 suggests it works fine) but if that gain carried over, the Transformer-XL model would be 48.6 test perplexity.
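The arithmetic behind that estimate: perplexity is the exponential of the mean negative log-likelihood in nats, so a 0.11-nat gain scales perplexity by e^-0.11. A quick check (the tweet's 48.6 presumably starts from a slightly different baseline figure than the rounded 54.5):

```python
import math

txl_ppl = 54.5    # Transformer-XL PTB test perplexity, from the thread
gain_nats = 0.11  # dynamic-eval improvement reported for the AWD-LSTM baseline

# ppl = exp(mean NLL), so subtracting nats multiplies perplexity by exp(-gain)
estimate = math.exp(math.log(txl_ppl) - gain_nats)
print(round(estimate, 1))  # 48.8
```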
Chip Huyen @chipro ·
@AlecRad SOTA for PTB without extra data is 46.54 (Transformer-XL 54.5). On paperswithcode, all top models on WikiText-103 & 1billion are transformer, and all top models on small datasets are lstm. Could just be hp but could also be something else paperswithcode.com/sota/language-…
Chip Huyen @chipro ·
NLP folks, I have a quick question: why does RNN work much better than Transformer on small datasets like Penn Treebank and WikiText-2?
Alec Radford @AlecRad ·
This is a really fun live experiment with twitch chat predictably oscillating between love and hate based on the sample.
Alec Radford retweeted
Christine McLeavey @mcleavey ·
Extremely excited to share work I've been doing at OpenAI the past few months: MuseNet, a neural net music generator. It's been a huge team effort pulling this all together!
OpenAI @OpenAI

Introducing MuseNet, a neural network which discovered how to generate music using many different instruments and styles. Listen & interact: openai.com/blog/musenet/ MuseNet will play an experimental concert today from 12–3pm PT on livestream: twitch.tv/openai

Alec Radford retweeted
Jeremy Howard @jeremyphoward ·
@RogerGrosse Do you have a sense of why high dropout is useful in the earliest iterations?
Roger Grosse @RogerGrosse ·
Excited to release our paper on Self-Tuning Networks, a way of adapting regularization hyperparameters online during training. This is the work of Matt MacKay, Paul Vicol, and @jonLorraine9, to appear at ICLR 2019. arxiv.org/abs/1903.03088
Alec Radford @AlecRad ·
@tallinzen @mcxfrank @emilymbender @yoavgo Don't know exact # since there is not a traditional word-level tokenization step. There are 9B tokens total and the ratio is probably around 1.1 tokens per word? You can probably just call those tokens words for the purpose of a # on a slide.
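Spelled out, the estimate implied by those two numbers:

```python
tokens = 9e9            # total BPE tokens in WebText, from the tweet
tokens_per_word = 1.1   # rough tokens-per-word ratio assumed in the tweet

words = tokens / tokens_per_word
print(f"{words / 1e9:.1f}B words")  # 8.2B words
```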
Michael C. Frank @mcxfrank ·
Hey, anyone have a guess about approx. how many words are in the WebText corpus that OpenAI used for their recent language model (or have any other links to approximate amounts of data for training recent language models)? I'm working on a talk and would like to include this...
Alec Radford retweeted
Graham Neubig @gneubig ·
One commonly cited argument about the difficulty of learning common-sense reasoning is that "no-one writes down common sense". A counter-argument is "well, the web is big": instructables.com/id/How-To-Open…
Jacob Andreas @jacobandreas ·
@AlecRad Oh then this was really unclear---I mean that in my part of NLP-land we're conditioned to auto-reject things that don't do well under a particular style of evaluation, and this is a problem with my part of NLP-land!
Jacob Andreas @jacobandreas ·
(Back when I was starting to do research with @nyhabash I remember "discovering" that I could improve translation by 3--4 BLEU points---a huge bump---with an oracle choice among the top 5 candidates. He said this was well known and that I would lose my mind trying to exploit it.)
Alec Radford @AlecRad ·
@jacobandreas Sorry - I interpreted: "if a paper had crossed my desk saying here are some hand-curated best-of-25 samples from our model + PPL comparisons with models trained on other datasets" as about the paper - especially since the second half of the statement is about the paper.
Jacob Andreas @jacobandreas ·
@AlecRad Yeah this was just a comment on the blog post, not the paper. I find the zero-shot focus of the paper much more compelling! (And better evidence of non-copying.) But it's the unicorns everyone's talking about....
Alec Radford @AlecRad ·
@jacobandreas The paper relegates samples to the appendix. The unicorn sample is on page 20 and used to make a qualitative point. Almost everything else in the paper is random samples.
Jacob Andreas @jacobandreas ·
And honestly, if a paper had crossed my desk saying "here are some hand-curated best-of-25 samples from our model + PPL comparisons with models trained on other datasets", I would have said "this is not science" and recommended reject. I think a lot of NLPers would do the same.
Alec Radford @AlecRad ·
@jacobandreas Those samples use a different technique than the ones shown in the blog. The samples you are looking at are temperature=1. We use top_k=40. Unconditional samples with that are here: github.com/openai/gpt-2/b… It's also important to note that conditioning on "real" text helps too.
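For readers unfamiliar with top-k sampling: instead of sampling from the full softmax distribution at temperature 1, only the k highest-probability tokens are kept and renormalized before sampling. A minimal sketch of the idea, not the gpt-2 repo's actual implementation:

```python
import math
import random

def top_k_sample(logits, k=40, temperature=1.0):
    """Sample a token index from the k highest-logit candidates."""
    # Indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)  # subtract max before exp for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return random.choices(top, weights=weights)[0]
```

With k=1 this reduces to greedy decoding; with k equal to the vocabulary size it is ordinary temperature sampling.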
Jacob Andreas @jacobandreas ·
I think the most interesting thing about the current LM discussion is the huge quality difference between the raw samples in github (github.com/openai/gpt-2/b…) and the cherry-picked samples on the blog. The raw ones are good, but not fall-out-of-your-chair good like Zombie Kennedy.
Alec Radford retweeted
Nando de Freitas @NandoDF ·
First, reproducibility is not about rerunning code to get the same results. Science must be more robust, as naive copying has many flaws. Second, reproducibility should never be above public safety. We must publish responsibility, with hope and kindness in our minds.
Volodymyr Kuleshov 🇺🇦 @volokuleshov

@NandoDF @ilyasut @icmlconf @iclr2019 Don't the benefits of increased reproducibility and rigor on the part of the authors greatly outweigh any potential misuses of their work, at least for the vast majority of ICML/ICLR papers? I think the current shift towards empirical work puts a greater need on releasing code.

Alec Radford retweeted
Joshua Achiam @jachiam0 ·
I'd like to weigh in on the #GPT2 discussion. The decision not to release the trained model was carefully considered and important for norm-forming. Serving the public good requires us to draw lines on release somewhere: better long before catastrophe than after.
Alec Radford @AlecRad ·
By the way - I think a valid (if extreme) take on GPT-2 is "lol you need 10,000x the data, 1 billion parameters, and a supercomputer to get current DL models to generalize to Penn Treebank."
Alec Radford @AlecRad ·
@dennybritz @egrefen @OpenAI We are using a training dataset that is 4,289x larger than WikiText-2. The two that have higher overlaps are Wikipedia. People often quote (or just plagiarize) Wikipedia given how often it is used as a source. When you add more Wiki data (WikiText-103) WebText is lower, though.
Denny Britz @dennybritz ·
@AlecRad @egrefen @OpenAI Isn’t it still strange that the WebText overlap is larger for some datasets than the actual dataset overlap? How come?
Edward Grefenstette @egrefen ·
This just caught my eye in the @OpenAI GPT-2 paper: overlap between test-set 8-grams (8!) and the training set. This seems very high to me. Corpus-savvy peeps: Is such 8-gram overlap indeed high?
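The measurement under discussion: the fraction of a test set's 8-grams that also occur anywhere in the training set. A minimal sketch (token lists stand in for the actual corpora; the GPT-2 paper's exact tokenization and normalization may differ):

```python
def ngrams(tokens, n=8):
    """Set of all contiguous n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(test_tokens, train_tokens, n=8):
    # Fraction of the test set's unique n-grams also present in training data
    test_ng = ngrams(test_tokens, n)
    return len(test_ng & ngrams(train_tokens, n)) / max(len(test_ng), 1)
```

For example, `ngram_overlap(list("abcdefgh"), list("xabcdefghy"))` is 1.0, since the test sequence's only 8-gram appears verbatim in the training sequence.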