Tilman Beck

274 posts

@devnull90

Clinical Machine Learning at University Hospital Zurich / He, Him

Joined October 2015

1.1K Following · 368 Followers

Tilman Beck@devnull90·16 Mar

@thekaransinghal @shamay___ I would want to use a model for medical advice which is properly aligned to answer given its current state of knowledge (which in non-conversational evaluation is obviously limited, and thus I expect it to default to being cautious)

Tilman Beck@devnull90·16 Mar

@thekaransinghal @shamay___ I'd argue different evaluation protocols surface different reasoning and behavioral patterns of the model. If you want to measure whether the model is cautious in light of uncertainty, I would say both evaluations are meaningful.

Karan Singhal@thekaransinghal·14 Mar

x.com/i/article/2032…

304

64.4K

Tilman Beck@devnull90·30 Jun

@kchonyc Yes, up for it

168

Kyunghyun Cho@kchonyc·30 Jun

don’t worry. i am almost done with neurips reviews. that said, is there anyone either in zurich up for beer-soaked lunch in about two weeks or so?

7.2K

Tilman Beck retweeted

Thomas Wolf@Thom_Wolf·6 Mar

I shared a controversial take the other day at an event and I decided to write it down in a longer format: I’m afraid AI won't give us a "compressed 21st century".
The "compressed 21st century" comes from Dario's "Machines of Loving Grace" and if you haven’t read it, you probably should, it’s a noteworthy essay. In a nutshell the paper claims that, over a year or two, we’ll have a "country of Einsteins sitting in a data center", and it will result in a compressed 21st century during which all the scientific discoveries of the 21st century will happen in the span of only 5-10 years.
I read this essay twice. The first time I was totally amazed: AI will change everything in science in 5 years, I thought! A few days later I came back to it and, re-reading it, I realized that much of it seemed like wishful thinking at best. What we'll actually get, in my opinion, is “a country of yes-men on servers” (if we just continue on current trends).
Let me explain the difference with a small part of my personal story. I’ve always been a straight-A student. Coming from a small village, I joined the top French engineering school before getting accepted to MIT for PhD. School was always quite easy for me. I could just get where the professor was going, where the exam's creators were taking us and could predict the test questions beforehand. That’s why, when I eventually became a researcher (more specifically a PhD student), I was completely shocked to discover that I was a pretty average, underwhelming, mediocre researcher. While many colleagues around me had interesting ideas, I was constantly hitting a wall. If something was not written in a book I could not invent it unless it was a rather useless variation of a known theory. More annoyingly, I found it very hard to challenge the status-quo, to question what I had learned. I was no Einstein, I was just very good at school. Or maybe even: I was no Einstein in part *because* I was good at school.
History is filled with geniuses struggling during their studies. Edison was called "addled" by his teacher. Barbara McClintock got criticized for "weird thinking" before winning a Nobel Prize. Einstein failed his first attempt at the ETH Zurich entrance exam. And the list goes on. The main mistake people usually make is thinking Newton or Einstein were just scaled-up good students, that a genius comes to life when you linearly extrapolate a top-10% student. This perspective misses the most crucial aspect of science: the skill to ask the right questions and to challenge even what one has learned. A real science breakthrough is Copernicus proposing, against all the knowledge of his days -in ML terms we would say “despite all his training dataset”-, that the earth may orbit the sun rather than the other way around. To create an Einstein in a data center, we don't just need a system that knows all the answers, but rather one that can ask questions nobody else has thought of or dared to ask. One that writes 'What if everyone is wrong about this?' when all textbooks, experts, and common knowledge suggest otherwise. Just consider the crazy paradigm shift of special relativity and the guts it took to formulate a first axiom like “let’s assume the speed of light is constant in all frames of reference” defying the common sense of these days (and even of today…) Or take CRISPR, generally considered to be an adaptive bacterial immune system since the 80s until, 25 years after its discovery, Jennifer Doudna and Emmanuelle Charpentier proposed to use it for something much broader and general: gene editing, leading to a Nobel prize. This type of realization –"we've known XX does YY for years, but what if we've been wrong about it all along? Or what if we could apply it to the entirely different concept of ZZ instead?” is an example of out-side-of-knowledge thinking –or paradigm shift– which is essentially making the progress of science. 
Such paradigm shifts happen rarely, maybe 1-2 times a year and are usually awarded Nobel prizes once everybody has taken stock of the impact. However rare they are, I agree with Dario in saying that they take the lion’s share in defining scientific progress over a given century while the rest is mostly noise. Now let’s consider what we’re currently using to benchmark recent AI model intelligence improvement. Some of the most recent AI tests are for instance the grandiosely named "Humanity's Last Exam" or "Frontier Math". They consist of very difficult questions –usually written by PhDs– but with clear, closed-end, answers. These are exactly the kinds of exams where I excelled in my field. These benchmarks test if AI models can find the right answers to a set of questions we already know the answer to. However, real scientific breakthroughs will come not from answering known questions, but from asking challenging new questions and questioning common conceptions and previous ideas. Remember Douglas Adams' Hitchhiker's Guide? The answer is apparently 42, but nobody knows the right question. That's research in a nutshell. In my opinion this is one of the reasons LLMs, while they already have all of humanity's knowledge in memory, haven't generated any new knowledge by connecting previously unrelated facts. They're mostly doing "manifold filling" at the moment - filling in the interpolation gaps between what humans already know, somehow treating knowledge as an intangible fabric of reality. We're currently building very obedient students, not revolutionaries. This is perfect for today’s main goal in the field of creating great assistants and overly compliant helpers. But until we find a way to incentivize them to question their knowledge and propose ideas that potentially go against past training data, they won't give us scientific revolutions yet. 
If we want scientific breakthroughs, we should probably explore how we’re currently measuring the performance of AI models and move to a measure of knowledge and reasoning able to test if scientific AI models can for instance:
- Challenge their own training data knowledge
- Take bold counterfactual approaches
- Make general proposals based on tiny hints
- Ask non-obvious questions that lead to new research paths
We don't need an A+ student who can answer every question with general knowledge. We need a B student who sees and questions what everyone else missed.
---
PS: You might be wondering what such a benchmark could look like. Evaluating it could involve testing a model on some recent discovery it should not know yet (a modern equivalent of special relativity) and explore how the model might start asking the right questions on a topic it has no exposure to the answers or conceptual framework of. This is challenging because most models are trained on virtually all human knowledge available today but it seems essential if we want to benchmark these behaviors. Overall this is really an open question and I’ll be happy to hear your insightful thoughts.

275

492

2.5K

410.5K

Tilman Beck@devnull90·31 Jan

@restoreorderusa I don't see a direct causal connection to the increase in significant air traffic control lapses which you set out to explain. There could be other confounding factors; your explanation is very one-dimensional.

Patrick Casey@restoreorderusa·30 Jan

In 2023, there were 503 air traffic control lapses categorized as “significant” – up 65% (!) from last year. Something has clearly gone wrong. But what? Allow me to explain. 🧵🧵🧵

R A W S A L E R T S@rawsalerts

🚨#BREAKING: New dashcam footage captures the moment a military helicopter collides with an American Airlines jet, triggering a mass casualty event with reports of multiple fatalities  📌#Washington | #DC  Watch dramatic new dashcam footage captured by a couple driving near Reagan National Airport in Washington, D.C. as a Black Hawk military helicopter collides midair with an American Airlines jet carrying 64 passengers. The footage captures the terrifying moment of impact, followed by smoke and debris filling the sky as first responders rush to the scene. The collision triggered a mass casualty event, with reports of multiple injuries and fatalities. Emergency crews are actively working to assess the full extent of the disaster

345

3.5K

16.6K

3.3M

Tilman Beck@devnull90·20 Oct

@egere14 @utn_nuremberg Congrats Steffen! Wish you all the best for the new position

Steffen Eger@egere14·18 Oct

With a bit of delay, I am happy to announce a major career update: I am now Full Professor at @utn_nuremberg, leading the Natural Language Learning & Generation (NLLG) lab. Very excited to be part of a new aspiring AI research team+environment!

849

Tilman Beck@devnull90·13 Jun

@pratyushmaini Great work! I feel "maybe training data was leaked" as an increasing sentiment of folks in NLP, so I am actually interested in checking how many datasets in NLP have been leaked. However, this assumption sounds like it is not directly applicable for that use case, is it?

Pratyush Maini@pratyushmaini·12 Jun

10/Certain assumptions are needed for DI to work. Please pay close attention to them. We require the presence of a suspect & an unseen set that are IID. Puzzled? One such setup could be the actual Harry Potter books, versus various chapter drafts that didn’t make it to the series

2.1K

Pratyush Maini@pratyushmaini·12 Jun

3/In fact, a few years ago when I wrote the original dataset inference paper, we proved that as training set sizes approach infinity, the success of membership inference goes to random chance. This result is ripe for today's era of large-scale pretraining. twitter.com/pratyushmaini/…

Pratyush Maini@pratyushmaini

1/Are you worried that an ML model may be a stolen copy of your model? We introduce *Dataset Inference* in our #ICLR2021 Spotlight paper to resolve model ownership. Paper: arxiv.org/abs/2104.10706 Blog and Video: cleverhans.io/2021/04/28/is-… w/@MYaghini @NicolasPapernot

4.7K

Tilman Beck@devnull90·9 Jun

@ML_Burn Great! Any specific reason why you removed the graph about context information contained in either model or text (Figure 1 in v1)? I found it very informative for my own work (which touches upon the need for more context in stance detection)

Mike Burnham@ML_Burn·9 May

I've posted an updated manuscript on Arxiv: arxiv.org/pdf/2405.02472 If you're interested in applying the method I'm still working on the package but it should be functional: github.com/MLBurnham/entss

Mike Burnham@ML_Burn

Check out my job market paper! I estimate ideal points with large language models. - Works with any population/corpus - Can separate affect from policy preferences - Doesn't require long documents or corpora - Makes no bridging assumptions drive.google.com/file/d/1-jHDTA…

4.8K

Tilman Beck@devnull90·5 Jun

@dongyeopkang @lucy3_li @Ruyuan_Wan I was exactly thinking about something along those lines, great that I found your article :)

114

Dongyeop Kang (DK)@dongyeopkang·4 Jun

@lucy3_li My group also has explored some of these aspects in LLM agent deliberation! Happy to chat more. @Ruyuan_Wan also did a conceptual comparison of these two aspects ruyuanwan.github.io/files/Leveragi…

313

Lucy Li@lucy3_li·4 Jun

Has anyone ever studied what happens if instead of having people annotate data individually you put annotators in small groups and allow them to ✨discuss✨ to reach consensus? e.g. not majority voting, but chatting

103

26.3K

Tilman Beck retweeted

UKP Lab@UKPLab·13 May

Good at language 🟰 good at thinking? Not true for LMs! Even if we might think that when talking to them 💬 Meet Holmes 🔎, a benchmark to assess linguistic competence📚 of LMs untwined from other skills 🔥 Spoiler 🤫 Architecture is one key! (1/🧵) 🌐 holmes-benchmark.github.io

18.8K

Tilman Beck retweeted

UKP Lab@UKPLab·15 May

Stop complaining about the bad review quality. Join forces and start research on #NLProc for #PeerReview! 🚨 A new white paper by over 20 top AI and NLP researchers provides a thorough discussion of AI assistance for scientific quality control. (1/🧵) 📑 arxiv.org/abs/2405.06563

31.8K

Tilman Beck retweeted

Anne Lauscher (she/her)@anne_lauscher·20 Mar

Congrats to all my co-authors and especially, to @devnull90 for receiving this recognition! Thank you, @eaclmeeting !

UKP Lab@UKPLab

We are proud to announce that the contribution »Sensitivity, Performance, Robustness: Deconstructing the Effect of Sociodemographic Prompting« by @devnull90, @HendrikSchuff, @anne_lauscher (@unihh) and @IGurevych (@UKPLab) has just been awarded the #EACL2024 Social Impact Award!

4.9K

Tilman Beck@devnull90·21 Mar

Wow, thanks a lot for the appreciation of our work on sociodemographic prompting at #eacl2024!

UKP Lab@UKPLab

977

Tilman Beck retweeted

UKP Lab@UKPLab·20 Mar

10K

Tilman Beck retweeted

UKP Lab@UKPLab·20 Mar

LLMs are increasingly prompted with different user profiles to solve subjective NLP tasks. What are the factors which determine what the model generates? Discover it in our #EACL2024 paper – learn more in this 🧵 (1/8). 📰 arxiv.org/abs/2309.07034 #NLProc #Prompting

3.8K

Tilman Beck retweeted

UKP Lab@UKPLab·18 Mar

Ever faced a lack of labeled data for multilingual tasks? This #EACL2024, we unveil an effective method for sentiment analysis in low-resource languages solely relying on a multilingual lexicon 💡 – more in this 🧵 (1/8). 📰 arxiv.org/abs/2402.02113 #NLProc #SentimentAnalysis

1.2K

Tilman Beck@devnull90·29 Feb

@ML_Burn @DCInbox Thanks a lot for your contributions and open-sourcing the materials! The stance labels in the dataset are surprisingly well-balanced compared to the datasets I mostly use (where the "neutral" label usually takes up the large majority). Got more info on dataset compilation?

Mike Burnham@ML_Burn·20 Feb

I'm also releasing the training data: huggingface.co/datasets/mlbur… ~27k documents taken from Twitter and the @DCInbox dataset that have been triple coded for expressed stance.

868

Mike Burnham@ML_Burn·20 Feb

Using proprietary non-reproducible LLMs to label data is bad for science and expensive. So I'm creating free, open source LLMs for Zero-shot classification of political texts that require a fraction of the compute. Here are the first models available on @huggingface:

157

30.1K

Tilman Beck@devnull90·29 Feb

daily dose of tokenizer weirdness #NLProc

253

Tilman Beck@devnull90·29 Feb

How long is the maintenance going to last? 🧐 @huggingface

337

Tilman Beck retweeted

UKP Lab@UKPLab·21 Feb

Seven papers authored or co-authored by UKP staff have been accepted for publication at this year's @eaclmeeting! Congratulations to all authors – see you in Malta 🇲🇹! #EACL2024 #NLProc informatik.tu-darmstadt.de/ukp/ukp_home/u…

1.9K
