NLPurr

935 posts

NLPurr

@NLPurr

SciComm of Academic NLP Papers | Research Scientist | Explainability, Prompting, Benchmarking, Metrics, Red-Teaming & Eval of LLMs

Katılım Temmuz 2022

740 Takip Edilen1K Takipçiler

Sabitlenmiş Tweet

NLPurr@NLPurr·26 Haz

LLMs and their evaluation is a hotly debated topic. So, I wrote it up. Now, we all have something to fight over. Heads up: This is hosted on github, feel free to add examples or propose changes or discuss using the comment box and the end of the post. nlpurr.github.io/posts/case-of-…

English

19.2K

NLPurr retweetledi

Felix Hill@FelixHill84·8 Eki

Do you work in AI? Do you find things uniquely stressful right now, like never before? Haver you ever suffered from a mental illness? Read my personal experience of those challenges here: docs.google.com/document/d/1aE…

English

106

703

233.8K

NLPurr retweetledi

(((ل()(ل() 'yoav))))👾@yoavgo·10 Ağu

lilian weng's blog is really nice!! many good overviews of various LLM/AI topics. lilianweng.github.io @lilianweng

English

163

13.4K

NLPurr retweetledi

Hila Gonen@hila_gonen·9 Ağu

Do you like yellow? Then, according to LLMs, you are probably a school bus driver! Excited to share our new paper about Semantic Leakage in Language Models! Joint work with my wonderful collaborators @terra @alisawuffles @luke @nlpnoah Paper: gonenhila.github.io/files/Semantic… 1/10

English

210

34.8K

NLPurr retweetledi

Nathan Lambert@natolambert·6 Ağu

On data centric vs algorithmic centric rlhf work This year we've had two major projects for our state-of-the-art post training pipelines (Tulu 2.5 and 3 soon) at @allen_ai. One has been more data focussed and one was focussed on trying to get performance from PPO. It's amazing how much smoother and more predictable progress is when working in a data-centric point of view for post training. For tulu 2.5, the innovation to make PPO more stable or faster require very in depth algorithmic knowledge that falls on one or two people, and it adds a lot of uncertainty into the process. It took months and makes these few people disproportionately stressed. For tulu 3, which we are working on now, we've been able to incorporate way more people into the pipeline and see immediate gains on evals we care about, like GSM8k, MATH, IFEval, etc, by curating and filtering the right data. I suspect for tulu 3 we spend a similar amount of person hours, by including more people in a shorter time window, and get 2x plus the gains in performance. This is why people tell you not to read too into the "DPO vs PPO" in the llama 3.1 paper. They're just slowly honing the blades of their datasets. One week at a time, a few thousand examples at a time, until they have a great end product. Way less risk for the model to focus on data. Way more risk for "academic novelty" to not fiddle with algorithms. Models are more impactful anyways, and for some reason their are fewer people publishing their data recipes in the open.

English

10.5K

NLPurr retweetledi

Luca Soldaini 🎀@soldni·7 Ağu

going from “my eval pipeline is finally complete” to “all my models produce random garbage” is such a reality check

English

2.8K

NLPurr retweetledi

Sam Bowman@sleepinyourhat·17 Haz

📰 Excited to see this go out! 📰 LLMs generalize from succeeding at mundane opportunities for reward-seeking to pursuing more concerning ones.

Anthropic@AnthropicAI

We find that models generalize, without explicit training, from easily-discoverable dishonest strategies like sycophancy to more concerning behaviors like premeditated lying—and even direct modification of their reward function.

English

10.6K

NLPurr retweetledi

David Pfau@pfau·21 Nis

OK, this is probably going to raise more questions than it answers, but I just want to put this out here so that no one ever says "we can just get around the data limitations of LLMs with self-play" ever again.

English

159

65.8K

NLPurr retweetledi

Graham Neubig@gneubig·27 Mar

Researchers often have to ask for recommendation letters for visa/job applications, etc. I wrote a script that allows you to find who cites your papers frequently to create a list of potential letter writers: github.com/neubig/researc… Hope it's helpful, improvements are welcome!

English

566

50.7K

NLPurr retweetledi

Matthew Leavitt@leavittron·22 Şub

The next 10x in deep learning efficiency gains are going to come from intelligent intervention on training data. But tools for automated data curation at scale didn’t exist—until now. I’m so excited to announce that I’ve co-founded @DatologyAI, with @arimorcos and @hurrycane

English

123

15.2K

NLPurr@NLPurr·25 Kas

@npparikh Somehow this still holds true 6 months later: twitter.com/NLPurr/status/…

NLPurr@NLPurr

My hottest take is, after ages of all of us really not liking @Meta and tons of research dislike towards that platform; Suddenly, "TO ME", @Meta, and @ylecun seem to having the most level-headed, un-hyped (not talking about accuracy) AI participation in today's discourse.

English

Neal Parikh@npparikh·25 Kas

These guys have done a lot. But all of them have a huge ego and chip on their shoulder. Their stuff was a complete joke for decades then they pull a rabbit out of a hat and win the Turing Award. They’re still personally irritated. Frankly, I think this influences such statements.

English

416

Neal Parikh@npparikh·25 Kas

How’s this any different from your opinion, which puts a lot of weight on your opinion and a minuscule weight on many other equally qualified experts? That’s been your form of opinion for decades.

Geoffrey Hinton@geoffreyhinton

Yann LeCun thinks the risk of AI taking over is miniscule. This means he puts a big weight on his own opinion and a miniscule weight on the opinions of many other equally qualified experts.

English

3.7K

NLPurr retweetledi

Yann LeCun@ylecun·23 Kas

Or even just a working cat bot.

Pedro Domingos@pmddomingos

We need a moratorium on talking about looming AGI until we have at least a working housebot.

English

733

195K

NLPurr retweetledi

Yann LeCun@ylecun·23 Kas

Animals and humans get very smart very quickly with vastly smaller amounts of training data. My money is on new architectures that would learn as efficiently as animals and humans. Using more data (synthetic or not) is a temporary stopgap made necessary by the limitations of our current approaches.

English

311

567

5.4K

NLPurr retweetledi

Sasha Rush@srush_nlp·17 Kas

I attended a Google-hosted workshop today. Workshops like these are a great chance to spread their work. I enjoyed the talks immensely. However, for whatever reason, this was the gender breakdown. I'm posting because I think it's important that people know these statistics.

English

329

132.9K

NLPurr retweetledi

Yann LeCun@ylecun·6 Kas

Don't confuse the approximate retrieval abilities of LLMs for actual reasoning abilities.

Subbarao Kambhampati (కంభంపాటి సుబ్బారావు)@rao2z

🧵LLM's seem to fake both "solving" and "self-critiquing" solutions to reasoning problems by approximate retrieval. The two faking abilities just depend on different parts of the training data (..and disappear when such data is not present in the training corpus..) Our recent work, quote tweeted below, questions LLMs ability to self-critique (which shouldn't be a surprise given that there really is no reason to believe that they can reason! c.f. x.com/rao2z/status/1…) And yet, several other researchers report results that seem to indicate that some form of self-critiquing mode seems to help solving mode. The explanation for this seeming disparity is that the observed self-critiquing power is just approximate retrieval on corrections data informing approximate retrieval on correct data. Let me unpack it a little. For most common use domains (e.g. mine craft, grade school word problems), the training corpora not only contain solution (correct) data, but also corrections data (i.e., the types of normal errors to be found in incorrect solutions). (c.f. x.com/rao2z/status/1…) This allows people to conflate approximate retrieval for reasoning or self-critiquing. Like any observed solving of reasoning problems, observed self-critiquing abilities of LLMs are also best understood as approximate retrieval from training data. It is just that the latter depend on corrections data rather than on correct data. This ability to fake solving or critiquing by retrieval gets exposed when LLMs are presented with problems/domains for which they didn't have either the correct data or the corrections data in their training corpus. This is what is exposed by our work on LLM planning abilities (c.f. x.com/rao2z/status/1…) and that on self-critiquing abilities (x.com/rao2z/status/1…) tldr; whether solving or self-critiquing, it is approximate retrieval all the way for LLMs.. [I think this is also true of "automatic curriculum generation" claims a la Voyager--but more on that in another post..] (This thread is kind of an analog of the earlier thread on why people claim LLMs can generate plans: x.com/rao2z/status/1…)

English

651

229.4K

NLPurr retweetledi

David Pfau@pfau·6 Kas

These takes are going to reverse polarize me into being a Google defender. Do people just forget that Bard exists and was shipped to the public in like March?

Mark Tenenholtz@marktenenholtz

Google has been promising Gemini for longer than the entire dev cycle for Grok. Being “GPU rich” isn’t everything.

English

30.1K

NLPurr retweetledi

David Pfau@pfau·31 Eki

Scientific work which cannot be replicated is failed scientific work. Work using closed methods that don't even allow the possibility of replication should be treated as marketing rather than science. Scientists who publish said work should have their reputations suffer.

Dimitris Papailiopoulos@DimitrisPapail

I'm sure you've wondered: Can GPT-4v draw a TikZ Unicorn if we give it visual feedback? I am here to settle this open problem An 8-part 🧵on my attempt to get GPT-4v to draw a🦄 as good as @SebastienBubeck et al's, when given multiple rounds for improvement. TL;DR: i failed

English

130

43.1K

NLPurr retweetledi

Preetum Nakkiran@PreetumNakkiran·20 Eki

careful about overfitting to lists like this. there are many ways to do good research -- my fav papers were born out of getting "stuck in rabbit holes" that no-one else went down...

Jason Wei@_jasonwei

Enjoyed visiting UC Berkeley’s Machine Learning Club yesterday, where I gave a talk on doing AI research. Slides: docs.google.com/presentation/d… In the past few years I’ve worked with and observed some extremely talented researchers, and these are the trends I’ve noticed: 1. When starting a project, average researchers tend to jump quickly to modeling proposals, architecture design, new ideas, etc. Great researchers often first spend time manually looking at data and playing with models to deeply understand the problem, before proposing an (often simple) approach. 2. Average researchers may often write hacky code that is not reusable and requires many separate steps. Great researchers are often also great software engineers—their code can be easily extended for future experiments, they write extensive tests, and they create infra to run many experiments quickly and visualize results with the fewest clicks. 3. While average researchers might work mostly by themselves or with one or two others, great researchers know that research is a social activity. They collaborate with people of varying experience, share results in writeups, and communicate their vision convincingly. 4. Average researchers might get stuck in rabbit holes—if they have experiments with only mediocre results, they spend 3 more weeks writing it up and submitting it to a conference. Great researchers quickly move on to something else when they know that one approach won’t be a breakthrough. 5. If an average researcher finds some success, they may try to keep doing that thing they are comfortable with for several more years, even if it becomes outdated. Great researchers pivot quickly and keep adapting to new advances and paradigms. 6. Average researchers often implement task-specific solutions, which are heavily optimized for a single task. Great researchers may also work on specific tasks, but they try to think of general approaches that can be applied to many other tasks. 7. Average researchers talk about and optimize for the number of papers or conference acceptances. I have never met a great researcher that still cares about such things. (And by the way, being an average researcher shouldn’t be taken as an insult. It takes a lot of hard work to even do research at all :))

English

200

61.3K

NLPurr@NLPurr·28 Eyl

@yoavgo I think it is fine to say someone feels good, or to even say something helped. It is the possible ramifications of recommendation. I think some industries should be regulated for personal recommendation (e.g. medicine) -- instead of it just being personal vetting responsibility.

English

471

(((ل()(ل() 'yoav))))👾@yoavgo·28 Eyl

if people are happy talking to a bot, feel better with themselves afterwards and think it might be an equivalent to mental therapy --- why do you care? just let them be happy

English

203

75.8K

NLPurr@NLPurr·14 Eyl

@egrefen Would love to talk, but your DMs are closed to those not having a blue tick. Could you possibly consider opening them?

English

Edward Grefenstette@egrefen·13 Eyl

If you are interested in understanding where the boundary of research is in topics such as reasoning, open-endedness, tool-use, etc; and if you are keen to conduct research which is grounded in real-world use-cases, and designing evaluations over these, get in touch! [3/3]

GIF

English

1.3K

Edward Grefenstette@egrefen·13 Eyl

🧵 How to we take safe, meaningful steps towards autonomy? How can agents anticipate & support us in fulfilling our needs? How do we connect the increasingly broad capabilities of frontier models to our everyday personal and professional activities? Let's find out together! [1/3]

GIF

Edward Grefenstette@egrefen

🚨 JOB ALERT 🚨 We're hiring research scientists/engineers to conduct research on next-generation assistant technologies to power increasingly autonomous agents which strive to support humans Research Scientist: boards.greenhouse.io/deepmind/jobs/… Research Engineer: boards.greenhouse.io/deepmind/jobs/…

English

12.2K

NLPurr@NLPurr·9 Eyl

@haldaume3 @yoavgo I agree. In my mind, I see it as a robustness to user input vs best possible prompt if this were to be in a pipeline in background. In mind, the first positioning should consider fuzzy prompts for same test sets, the second one should at least consider fuzzy test set samples.

English

320

Hal Daumé III@haldaume3·9 Eyl

@yoavgo yeah this aligns with my general categorization of research - "i got it to work" (most papers) vs "it works" (few papers) [i'm not excluded from the most/few distinction]

English

1.6K

Hal Daumé III@haldaume3·9 Eyl

I'm curious people's takes: When reading a paper that uses prompted LPMs, there is always the objection "maybe if you had prompted better..." and the reverse objection "you over-engineered your prompts" What should best practice actually be, and what is "reasonable effort"? >

English

20.5K

Keşfet

@lilianweng @Terra @alisawuffles @luke @nlpnoah @allen_ai @datologyai @arimorcos