NLPurr
@NLPurr
935 posts

SciComm of Academic NLP Papers | Research Scientist | Explainability, Prompting, Benchmarking, Metrics, Red-Teaming & Eval of LLMs

Joined July 2022
740 Following · 1K Followers

Pinned Tweet
NLPurr @NLPurr ·
LLMs and their evaluation are a hotly debated topic. So, I wrote it up. Now we all have something to fight over. Heads up: this is hosted on GitHub, so feel free to add examples, propose changes, or discuss using the comment box at the end of the post. nlpurr.github.io/posts/case-of-…
3 replies · 10 reposts · 80 likes · 19.2K views

NLPurr retweeted
Felix Hill @FelixHill84 ·
Do you work in AI? Do you find things uniquely stressful right now, like never before? Have you ever suffered from a mental illness? Read my personal experience of those challenges here: docs.google.com/document/d/1aE…
36 replies · 106 reposts · 703 likes · 233.8K views

NLPurr retweeted
Nathan Lambert @natolambert ·
On data-centric vs. algorithm-centric RLHF work:

This year we've had two major projects for our state-of-the-art post-training pipelines (Tulu 2.5, and Tulu 3 soon) at @allen_ai. One has been more data-focused and one was focused on trying to get performance from PPO. It's amazing how much smoother and more predictable progress is when working from a data-centric point of view for post-training.

For Tulu 2.5, the innovations to make PPO more stable or faster required very in-depth algorithmic knowledge that falls on one or two people, and that adds a lot of uncertainty into the process. It took months and made those few people disproportionately stressed.

For Tulu 3, which we are working on now, we've been able to incorporate way more people into the pipeline and see immediate gains on evals we care about, like GSM8K, MATH, IFEval, etc., by curating and filtering the right data. I suspect that for Tulu 3 we'll spend a similar number of person-hours, by including more people in a shorter time window, and get 2x-plus the gains in performance.

This is why people tell you not to read too much into the "DPO vs. PPO" question in the Llama 3.1 paper. They're just slowly honing the blades of their datasets, one week at a time, a few thousand examples at a time, until they have a great end product.

Focusing on data is way less risk for the model; fiddling with algorithms for "academic novelty" is way more risk. Models are more impactful anyway, and for some reason there are fewer people publishing their data recipes in the open.
1 reply · 8 reposts · 90 likes · 10.5K views
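A loose sketch of the data-centric loop described above (hypothetical code, not the actual Tulu pipeline): filter candidate training examples with cheap quality heuristics, retrain, and keep the change only if the held-out evals improve.

```python
# Hypothetical sketch of eval-driven data curation (not the Tulu pipeline):
# drop low-quality candidate examples before fine-tuning, then compare
# held-out eval scores (GSM8K, MATH, IFEval, ...) against the previous run.

def keep(example: dict) -> bool:
    """Cheap quality filters; real pipelines use much richer signals."""
    prompt, response = example["prompt"], example["response"]
    if len(response.split()) < 5:                 # degenerate answers
        return False
    if response.lower().startswith("as an ai"):   # boilerplate refusals
        return False
    if prompt.strip() == response.strip():        # echoed prompts
        return False
    return True


def curate(candidates: list[dict]) -> list[dict]:
    """Keep only examples that pass every filter."""
    return [ex for ex in candidates if keep(ex)]


# Usage: train on curate(candidates), a few thousand examples at a time,
# and only ship the new dataset if the evals you care about improve.
```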
NLPurr retweeted
Luca Soldaini 🎀 @soldni ·
going from “my eval pipeline is finally complete” to “all my models produce random garbage” is such a reality check
3 replies · 2 reposts · 59 likes · 2.8K views

NLPurr retweeted
David Pfau @pfau ·
OK, this is probably going to raise more questions than it answers, but I just want to put this out here so that no one ever says "we can just get around the data limitations of LLMs with self-play" ever again.
David Pfau tweet media
16 replies · 15 reposts · 159 likes · 65.8K views

NLPurr retweeted
Graham Neubig @gneubig ·
Researchers often have to ask for recommendation letters for visa/job applications, etc. I wrote a script that allows you to find who cites your papers frequently to create a list of potential letter writers: github.com/neubig/researc… Hope it's helpful, improvements are welcome!
4 replies · 92 reposts · 566 likes · 50.7K views
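The repo holds Neubig's actual script; as a rough illustration of the idea, the hypothetical snippet below ranks frequent citers using the public Semantic Scholar Graph API (the paper ID is a placeholder).

```python
# A minimal sketch of the idea (not Neubig's actual script): count how often
# each author appears among the papers citing yours, via the public
# Semantic Scholar Graph API.
from collections import Counter

import requests

API = "https://api.semanticscholar.org/graph/v1"


def citing_authors(paper_id: str) -> list[str]:
    """Names of authors whose papers cite `paper_id` (first page only)."""
    resp = requests.get(
        f"{API}/paper/{paper_id}/citations",
        params={"fields": "authors", "limit": 1000},
    )
    resp.raise_for_status()
    return [
        author["name"]
        for item in resp.json().get("data", [])
        for author in item["citingPaper"].get("authors", [])
    ]


def frequent_citers(my_paper_ids: list[str], top_k: int = 20):
    """Rank authors by how many of their papers cite any of yours."""
    counts = Counter()
    for pid in my_paper_ids:
        counts.update(citing_authors(pid))
    return counts.most_common(top_k)


if __name__ == "__main__":
    # Placeholder ID; substitute your own papers' Semantic Scholar IDs.
    for name, n in frequent_citers(["649def34f8be52c8b66281af98ae884c09aef38b"]):
        print(f"{n:4d}  {name}")
```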
NLPurr retweeted
Matthew Leavitt @leavittron ·
The next 10x in deep learning efficiency gains are going to come from intelligent intervention on training data. But tools for automated data curation at scale didn't exist—until now. I'm so excited to announce that I've co-founded @DatologyAI with @arimorcos and @hurrycane.
11 replies · 16 reposts · 123 likes · 15.2K views

Neal Parikh @npparikh ·
These guys have done a lot. But all of them have a huge ego and a chip on their shoulder. Their stuff was a complete joke for decades, then they pulled a rabbit out of a hat and won the Turing Award. They're still personally irritated. Frankly, I think this influences such statements.
3 replies · 0 reposts · 4 likes · 416 views

NLPurr retweeted
Yann LeCun @ylecun ·
Animals and humans get very smart very quickly with vastly smaller amounts of training data. My money is on new architectures that would learn as efficiently as animals and humans. Using more data (synthetic or not) is a temporary stopgap made necessary by the limitations of our current approaches.
311 replies · 567 reposts · 5.4K likes · 3M views

NLPurr retweeted
Sasha Rush @srush_nlp ·
I attended a Google-hosted workshop today. Workshops like these are a great chance to spread their work. I enjoyed the talks immensely. However, for whatever reason, this was the gender breakdown. I'm posting because I think it's important that people know these statistics.
Sasha Rush tweet media
24 replies · 45 reposts · 329 likes · 132.9K views

NLPurr retweeted
Yann LeCun @ylecun ·
Don't confuse the approximate retrieval abilities of LLMs for actual reasoning abilities.
Subbarao Kambhampati (కంభంపాటి సుబ్బారావు) @rao2z

🧵 LLMs seem to fake both "solving" and "self-critiquing" solutions to reasoning problems by approximate retrieval. The two faking abilities just depend on different parts of the training data (..and disappear when such data is not present in the training corpus..)

Our recent work, quote-tweeted below, questions LLMs' ability to self-critique (which shouldn't be a surprise, given that there really is no reason to believe that they can reason! c.f. x.com/rao2z/status/1…). And yet, several other researchers report results that seem to indicate that some form of self-critiquing mode seems to help solving mode.

The explanation for this seeming disparity is that the observed self-critiquing power is just approximate retrieval on corrections data informing approximate retrieval on correct data. Let me unpack it a little.

For most common use domains (e.g. Minecraft, grade-school word problems), the training corpora not only contain solution (correct) data, but also corrections data, i.e., the types of normal errors to be found in incorrect solutions (c.f. x.com/rao2z/status/1…). This allows people to conflate approximate retrieval with reasoning or self-critiquing.

Like any observed solving of reasoning problems, observed self-critiquing abilities of LLMs are also best understood as approximate retrieval from training data. It is just that the latter depends on corrections data rather than on correct data.

This ability to fake solving or critiquing by retrieval gets exposed when LLMs are presented with problems/domains for which they didn't have either the correct data or the corrections data in their training corpus. This is what is exposed by our work on LLM planning abilities (c.f. x.com/rao2z/status/1…) and that on self-critiquing abilities (x.com/rao2z/status/1…).

tl;dr: whether solving or self-critiquing, it is approximate retrieval all the way for LLMs.. [I think this is also true of "automatic curriculum generation" claims à la Voyager--but more on that in another post..]

(This thread is kind of an analog of the earlier thread on why people claim LLMs can generate plans: x.com/rao2z/status/1…)

28 replies · 89 reposts · 651 likes · 229.4K views

NLPurr retweeted
David Pfau @pfau ·
Scientific work which cannot be replicated is failed scientific work. Work using closed methods that don't even allow the possibility of replication should be treated as marketing rather than science. Scientists who publish said work should have their reputations suffer.
Dimitris Papailiopoulos @DimitrisPapail

I'm sure you've wondered: can GPT-4V draw a TikZ unicorn if we give it visual feedback? I am here to settle this open problem. An 8-part 🧵 on my attempt to get GPT-4V to draw a 🦄 as good as @SebastienBubeck et al.'s, when given multiple rounds for improvement. TL;DR: I failed.

7 replies · 14 reposts · 130 likes · 43.1K views

NLPurr retweeted
Preetum Nakkiran @PreetumNakkiran ·
careful about overfitting to lists like this. there are many ways to do good research -- my fav papers were born out of getting "stuck in rabbit holes" that no-one else went down...
Jason Wei @_jasonwei

Enjoyed visiting UC Berkeley's Machine Learning Club yesterday, where I gave a talk on doing AI research. Slides: docs.google.com/presentation/d…

In the past few years I've worked with and observed some extremely talented researchers, and these are the trends I've noticed:

1. When starting a project, average researchers tend to jump quickly to modeling proposals, architecture design, new ideas, etc. Great researchers often first spend time manually looking at data and playing with models to deeply understand the problem, before proposing an (often simple) approach.

2. Average researchers may often write hacky code that is not reusable and requires many separate steps. Great researchers are often also great software engineers—their code can be easily extended for future experiments, they write extensive tests, and they create infra to run many experiments quickly and visualize results with the fewest clicks.

3. While average researchers might work mostly by themselves or with one or two others, great researchers know that research is a social activity. They collaborate with people of varying experience, share results in writeups, and communicate their vision convincingly.

4. Average researchers might get stuck in rabbit holes—if they have experiments with only mediocre results, they spend 3 more weeks writing it up and submitting it to a conference. Great researchers quickly move on to something else when they know that one approach won't be a breakthrough.

5. If an average researcher finds some success, they may try to keep doing that thing they are comfortable with for several more years, even if it becomes outdated. Great researchers pivot quickly and keep adapting to new advances and paradigms.

6. Average researchers often implement task-specific solutions, which are heavily optimized for a single task. Great researchers may also work on specific tasks, but they try to think of general approaches that can be applied to many other tasks.

7. Average researchers talk about and optimize for the number of papers or conference acceptances. I have never met a great researcher that still cares about such things.

(And by the way, being an average researcher shouldn't be taken as an insult. It takes a lot of hard work to even do research at all :))

5 replies · 10 reposts · 200 likes · 61.3K views

NLPurr @NLPurr ·
@yoavgo I think it is fine to say someone feels good, or even to say something helped. My concern is the possible ramifications of recommendation. I think some industries (e.g. medicine) should be regulated for personal recommendations, instead of vetting being left entirely as a personal responsibility.
0 replies · 0 reposts · 0 likes · 471 views

(((ل()(ل() 'yoav))))👾 @yoavgo ·
if people are happy talking to a bot, feel better about themselves afterwards, and think it might be an equivalent of mental therapy --- why do you care? just let them be happy
37 replies · 11 reposts · 203 likes · 75.8K views

NLPurr @NLPurr ·
@egrefen Would love to talk, but your DMs are closed to those who don't have a blue tick. Could you possibly consider opening them?
0 replies · 0 reposts · 0 likes · 72 views

Edward Grefenstette @egrefen ·
If you are interested in understanding where the boundary of research is in topics such as reasoning, open-endedness, tool-use, etc; and if you are keen to conduct research which is grounded in real-world use-cases, and designing evaluations over these, get in touch! [3/3]
GIF
2 replies · 0 reposts · 6 likes · 1.3K views

Edward Grefenstette @egrefen ·
🧵 How do we take safe, meaningful steps towards autonomy? How can agents anticipate & support us in fulfilling our needs? How do we connect the increasingly broad capabilities of frontier models to our everyday personal and professional activities? Let's find out together! [1/3]
GIF
Edward Grefenstette @egrefen

🚨 JOB ALERT 🚨 We're hiring research scientists/engineers to conduct research on next-generation assistant technologies to power increasingly autonomous agents which strive to support humans Research Scientist: boards.greenhouse.io/deepmind/jobs/… Research Engineer: boards.greenhouse.io/deepmind/jobs/…

1 reply · 2 reposts · 21 likes · 12.2K views

NLPurr @NLPurr ·
@haldaume3 @yoavgo I agree. In my mind, it's robustness to user input vs. the best possible prompt if this were to run in a background pipeline. The first positioning should consider fuzzy prompts over the same test sets; the second should at least consider fuzzy test-set samples.
0 replies · 0 reposts · 0 likes · 320 views
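To make the "fuzzy prompts" idea above concrete, a hypothetical sketch: score the same test set under several paraphrased instructions and report the spread, so a claim reflects robustness to wording rather than one hand-tuned prompt. The `query_model` function and the paraphrase list are stand-ins, not any specific benchmark's API.

```python
# Hypothetical sketch of "fuzzy prompt" evaluation: run one test set under
# several paraphrases of the same instruction and report mean/spread, so the
# result measures robustness to wording rather than one hand-tuned prompt.
from statistics import mean, stdev

# Paraphrases of the same instruction (stand-ins; write your own).
PROMPT_VARIANTS = [
    "Answer the question: {q}",
    "Q: {q}\nA:",
    "Please answer the following question.\n{q}",
]


def query_model(prompt: str) -> str:
    """Stand-in for your model call (API or local inference)."""
    raise NotImplementedError


def accuracy(template: str, test_set: list[tuple[str, str]]) -> float:
    """Exact-match accuracy of one prompt variant over the test set."""
    hits = sum(
        query_model(template.format(q=q)).strip().lower() == gold.strip().lower()
        for q, gold in test_set
    )
    return hits / len(test_set)


def fuzzy_eval(test_set: list[tuple[str, str]]) -> None:
    """Report the accuracy spread across prompt variants."""
    scores = [accuracy(t, test_set) for t in PROMPT_VARIANTS]
    print(
        f"mean={mean(scores):.3f} std={stdev(scores):.3f} "
        f"min={min(scores):.3f} max={max(scores):.3f}"
    )
```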
Hal Daumé III @haldaume3 ·
@yoavgo yeah this aligns with my general categorization of research - "i got it to work" (most papers) vs "it works" (few papers) [i'm not excluded from the most/few distinction]
1 reply · 0 reposts · 10 likes · 1.6K views

Hal Daumé III @haldaume3 ·
I'm curious about people's takes: when reading a paper that uses prompted LPMs, there is always the objection "maybe if you had prompted better..." and the reverse objection "you over-engineered your prompts." What should best practice actually be, and what counts as "reasonable effort"? >
8 replies · 4 reposts · 54 likes · 20.5K views