Ehud Reiter

2.4K posts

@EhudReiter

I am a computer scientist who works on natural language generation and evaluation, often in healthcare contexts. I teach at Aberdeen University.

Aberdeen, Scotland · Joined May 2014
96 Following · 2.4K Followers
Ehud Reiter @EhudReiter
@DamienTeney Qualitative insights about what LLMs as a group can and cannot do, and problems that arise, are very interesting and valuable. But detailed performance numbers (especially in a "horse race" between models) are not interesting if the models investigated are obsolete by the time the paper is read
Damien Teney @DamienTeney
@EhudReiter (genuine question, I have no context about this work; the *causes/conditions* for good/bad performance seem the more interesting question, and should be less fickle?)
Ehud Reiter @EhudReiter
A student is writing up an experiment which includes a comparison of how well two LLMs do at a task. The community expects such comparisons, but why are they useful, since the LLMs being compared will be obsolete by the time people read this?
Ehud Reiter @EhudReiter
PS - If we are making a general claim that LLMs can/cannot do X, then absolutely useful to show this in multiple LLMs. But why do we care that an obsolete version of GPT is better at this task than an obsolete version of Gemini?
Ehud Reiter @EhudReiter
@JaydeepBorkar @yanaiela I think it would be easier for junior researchers if the rules were consistently enforced, so that you could learn from existing ACL papers. But enforcement is inconsistent, which means many existing papers break the rules, which is absolutely confusing for newcomers!
Jaydeep Borkar @JaydeepBorkar
@yanaiela @EhudReiter +1. I’m glad someone said this. A lot of us submit to *CL venues because we really like the community. But having so many rules/checklists makes it complicated to navigate the ARR system (esp. for junior researchers).
Ehud Reiter @EhudReiter
New blog: Please follow the rules for ARR/ACL papers. ACL/ARR have rules and guidelines for how papers are written. Unfortunately many authors (and reviewers) ignore these, which makes their papers harder to read and less useful. Please follow the rules! ehudreiter.com/2026/03/16/ple…
Ehud Reiter @EhudReiter
@yanaiela I appreciate this, hopefully my blog helps in highlighting some key points
Yanai Elazar @yanaiela
@EhudReiter The amount of rules written for ARR has become so large that it barely even fits within an LLM’s context, let alone a human’s.
Ehud Reiter @EhudReiter
Looked at @METR_Evals blogs, some nice material on lessons, challenges, etc in using RCTs to evaluate LLMs. We have had RCTs in medicine for a long time and have built up a lot of knowledge about this, but RCTs in AI are very new, and we need to figure out how to do them well
Ehud Reiter reposted
Arvind Narayanan @random_walker
AI isn't replacing programmers, but it *is* making it harder to survive as a programmer with purely technical skills and no interest or expertise in how those skills translate to business or societal value. Funny thing is, this has always been true—it's just being accelerated a bit due to AI. There's a famous essay by @patio11 from 15 years ago called "Don't Call Yourself A Programmer, And Other Career Advice". kalzumeus.com/2011/10/28/don…
Ehud Reiter @EhudReiter
@simoneballoccu To be honest, I think GitHub repos are often better for providing extra detail, since they contain data and code as well as text. Also authors realise that reviewers probably won't look at repos, so they need to make the main paper self-contained
Simone Balloccu @simoneballoccu
@EhudReiter I'm generally encouraging of long appendices but only when they provide extra detail that is not necessary to assess the goodness of a paper!
Ehud Reiter @EhudReiter
In future, I will not read appendices when reviewing ACL papers (ARR guidelines say this is not expected). Appendices used to be a few pages of details for replication (e.g. hyperparameters), but now it's common to have 10-20 pages (I've seen 50 pages). Don't expect me to read this!
Ehud Reiter reposted
Jaime Sevilla @Jsevillamol
Incredibly cool work! We are not seeing enough manual grading from experts in AI, but it's definitely worth it to verify the most important results for top models.
Quoting Joel Becker @joel_bkr:

new @METR_Evals research note from @whitfill_parker, @cherylwoooo, nate rush, and me. (chiefly parker!) we find that *half* of SWE-bench Verified solutions from Sonnet 3.5-to-4.5 generation AIs *which are graded as passing* are rejected by project maintainers.

Ehud Reiter @EhudReiter
I start my last-ever course today, MSc course on Natural Language Generation. My last lecture (on NLG evaluation) will be on 20 April. Hard to believe...
Ehud Reiter @EhudReiter
New blog: Questions from readers of my book. A group that is reading my book sent me many questions, some of which we discussed in a call last week. I thought I would share the questions and my responses. ehudreiter.com/2026/03/03/que…
Ehud Reiter @EhudReiter
Great to see that my student Jawwad Baig has submitted his PhD! One of my main goals for 2025-26 is to help 6 PhD students submit before I retire. Halfway through the academic year, and three of the six have now submitted, so on track.
Ehud Reiter @EhudReiter
My PhD student Adarsa Sivaprasad is looking for people who have lived experience of IVF to help evaluate an AI chatbot which explains IVF outcome predictions. What is involved: 45 min online MS Teams call. Read details and sign up at: tinyurl.com/cc2aepf5
Ehud Reiter reposted
Mali Barbi, MD MSc | Breast & Gyn Oncologist
Everyone is talking about the new @NatureMedicine paper (rdcu.be/e4ADv), but I think the real story is being buried. Here is the cold reality: The AI passed the medical boards with flying colors (~95% accuracy). But when real humans actually used it for triage, their accuracy dropped to <35%. They performed worse than the control group who just used Google. Practically, this means benchmarks are not safety tests. We are validating tools in a vacuum (simulations) that collapse in the real world. As oncologists, we know this pattern: surrogate endpoints ≠ survival data. Passing the boards is just a surrogate. Safe patient interaction is the only outcome that matters. Right now, we are optimizing for the test and failing the patient. cc: @EricTopol @pranavrajpurkar #ClinicalValidation #AIhype #PatientSafety #EvidenceBasedMedicine
Ehud Reiter @EhudReiter
On holiday in Elche, Spain. Fascinating place, full of palm trees. Below is an old Christian basilica which *may* have originally been an ancient (Roman era) Synagogue.
[Photo: the basilica in Elche]
Ehud Reiter @EhudReiter
New blog: Don't ignore omissions! Evaluation of LLMs focuses on accuracy and hallucination. Completeness and omission are also important; does the text include all the key information? Omissions are a huge problem in medical NLG, as well as other NLG tasks. ehudreiter.com/2026/02/11/don…