Rotem Dror

200 posts


@DrorRotem

Asst. Prof. of Natural Language Processing @UofHaifa

Joined November 2018
249 Following · 318 Followers
Rotem Dror reposted
Yanai Elazar@yanaiela·
A strange trend I've noticed at #ACL2025 is that people are hesitant to reach out to the authors of papers/"academic products". This is unfortunate for both parties! A simple email can save the sender a lot of time, and it is also one of my favorite kinds of email to receive!
Rotem Dror@DrorRotem·
Tomorrow morning (9AM🌅) I'll be giving a keynote talk at LAW (the Linguistic Annotation Workshop) at #ACL2025 on annotations in the era of LLMs - see you there!!🌟🌟
Rotem Dror reposted
Nitay Calderon@NitCal·
The Alternative Annotator Test (alt-test) is a new statistical procedure proposed in our ACL 2025 paper! 🇦🇹🇦🇹 @DrorRotem @roireichart The goal? To help justify using LLMs over humans. If the LLM passes the test, its annotations can be trusted 😎 arxiv.org/abs/2501.10970
Rotem Dror reposted
Nitay Calderon@NitCal·
Everyone uses LLMs to annotate data or evaluate models in their research. But how can we convince others (readers, collaborators, reviewers!!!) that LLMs are reliable? 🤖 Here’s a simple (and low-effort) solution: show the LLM is a *comparable alternative annotator* ✅
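The thread above proposes showing that an LLM is a *comparable alternative annotator*. As a hedged illustration of the general idea only (not the paper's exact alt-test procedure; see arxiv.org/abs/2501.10970 for that), one can hold out each human annotator in turn and check whether the LLM agrees with the held-out human at least as often as the remaining humans' majority vote does:

```python
from collections import Counter

def majority(labels):
    """Most common label; ties broken by first appearance."""
    return Counter(labels).most_common(1)[0][0]

def comparable_annotator_score(human_annotations, llm_annotations):
    """human_annotations: one list of labels per item (one label per human).
    llm_annotations: one LLM label per item.
    Returns the fraction of held-out humans whom the LLM matches at least
    as well as the remaining humans' majority vote does."""
    n_humans = len(human_annotations[0])
    wins = 0
    for j in range(n_humans):  # hold out human j as the "gold" annotator
        llm_agree = rest_agree = 0
        for item_labels, llm_label in zip(human_annotations, llm_annotations):
            held_out = item_labels[j]
            rest = item_labels[:j] + item_labels[j + 1:]
            llm_agree += int(llm_label == held_out)
            rest_agree += int(majority(rest) == held_out)
        wins += int(llm_agree >= rest_agree)
    return wins / n_humans

# Toy example (made-up data): 4 items, 3 human annotators, plus LLM labels.
humans = [["a", "a", "b"], ["b", "b", "b"], ["a", "b", "a"], ["c", "c", "c"]]
llm = ["a", "b", "a", "c"]
score = comparable_annotator_score(humans, llm)  # 1.0 on this toy data
```

On this toy data the LLM matches or beats the remaining humans for every held-out annotator; the actual alt-test builds a proper statistical hypothesis-testing procedure around comparisons of this kind.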
Rotem Dror reposted
Mike Erlihson, Math PhD, AI@MikeE_3_14·
🔥The reviews keep flowing to X🔥 🧵 Mike's paper of the day: 25.06.25 The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs. An Israeli 🇮🇱 paper. An interesting shift has been taking place recently in the world of model evaluation. We no longer ask only how well a model performs on some benchmark, but a more fundamental question: can a language model be trusted to replace a human annotator? This is not a question that traditional metrics such as accuracy, F1, or inter-annotator agreement can properly answer. Instead, the paper reviewed today presents a statistically grounded method for addressing this problem. At the heart of the paper is a call to move away from shallow agreement metrics and toward arguments based on statistical hypothesis testing and cost-benefit analysis.
Rotem Dror reposted
Nitay Calderon@NitCal·
Preferences drive modern LLM research and development: from model alignment to evaluation. But how well do we understand them? Excited to share our new preprint: Multi-domain Explainability of Preferences arxiv.org/abs/2505.20088 @roireichart @LiatEinDor 🧵👇 1/11
Rotem Dror
Rotem Dror@DrorRotem·
This is your chance to uncover revolutionary research, forge invaluable connections, and become part of the AI innovation wave. Conference dates: May 25-27, 2025. Information and registration: lnkd.in/dKqAtP6p
Rotem Dror@DrorRotem·
Keynote speakers: Prof. Hod Lipson, Mechanical Engineering, Columbia University, USA; Prof. Mor Naaman, Information, Cornell Tech, USA; Prof. Tanya Berger-Wolf, Computer Science & Engineering, Ohio State University, USA
Rotem Dror@DrorRotem·
Join us at the University of Haifa for the HiAI Conference, a dynamic event at the forefront of Artificial Intelligence. Immerse yourself in groundbreaking discoveries, gain exclusive insights into the latest advancements, and navigate the future of AI with leading experts.
Vered Shwartz@VeredShwartz·
My "AI-backed" washer sorts cycles by frequency of use and suggests the most common cycle. When were "computer programs" rebranded as "AI" and why didn't I get the memo?
Rotem Dror@DrorRotem·
@jkkummerfeld @VeredShwartz @LChoshen @GabiStanovsky what load is that? Review load or AC load? To be clearer: if I'm an AC and I submitted a paper, am I expected to be the AC of at least 4 papers, or, in addition to the AC load, also review 4 papers?
Vered Shwartz@VeredShwartz·
The ARR policy requiring a qualified author (3 main *CL papers) to review 4 papers per submission makes it impossible for junior faculty to submit papers. I have 9 students, only 2 are qualified reviewers. I will need to review 16 papers in this cycle to avoid desk rejects. 1/2
Vered Shwartz@VeredShwartz·
I'm excited to announce that my nonfiction book, "Lost in Automatic Translation: Navigating Life in English in the Age of Language Technologies", will be published this summer by Cambridge University Press. I can't wait to share it with you! 📖🤖 cambridge.org/core/books/los…
Rotem Dror@DrorRotem·
@miserlis_ @NitCal I believe that if you do it once (i.e., not trying all combinations of annotators from the two sets), and the annotators from both groups have annotated separate sections of the data, then it should be fine to combine them into one set
Alexander Hoyle@miserlis_·
@DrorRotem @NitCal Thanks! So if items per annotator is small, could we compose them into “pseudo-annotators” to improve power (eg,2 sets of 3 annotators see 20 unique items, so we combine random pairs across sets to make one set of 3 “annotators” that each see 40)? Or would that overestimate var.?
Nitay Calderon@NitCal·
Do you use LLM-as-a-judge or LLM annotations in your research? There’s a growing trend of replacing human annotators with LLMs in research—they're fast, cheap, and require less effort. But can we trust them?🤔 Well, we need a rigorous procedure to answer this. 🚨New preprint👇
Rotem Dror@DrorRotem·
@miserlis_ @NitCal Hi @miserlis_, thanks for the great questions! Wilcoxon would work as a non-parametric test; however, when possible, it is recommended to apply a parametric test. For smaller datasets we can no longer apply the t-test, so we propose using its non-parametric counterparts.
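To make the t-test vs. Wilcoxon point above concrete: the paired t-test assumes roughly normal score differences, while the Wilcoxon signed-rank test is its standard non-parametric counterpart for small samples. Below is a minimal, self-contained sketch of the exact small-sample Wilcoxon test via enumeration of all sign assignments (illustrative only; in practice `scipy.stats.wilcoxon` is the standard implementation, and the paired scores are made up):

```python
from itertools import product

def wilcoxon_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank test for small paired samples,
    computed by enumerating all 2^n sign assignments under the null."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    abs_sorted = sorted(abs(d) for d in diffs)
    def rank(v):  # average rank of |d| among all |diffs| (handles ties)
        positions = [i + 1 for i, a in enumerate(abs_sorted) if a == v]
        return sum(positions) / len(positions)
    ranks = [rank(abs(d)) for d in diffs]
    total = sum(ranks)
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_obs = min(w_plus, total - w_plus)
    # Under H0 each difference is equally likely to be positive or negative.
    count = sum(
        1 for signs in product((0, 1), repeat=n)
        if min(wp := sum(r for s, r in zip(signs, ranks) if s), total - wp) <= w_obs
    )
    return count / 2 ** n  # exact two-sided p-value

# Toy paired per-item scores for two annotators (made-up data).
scores_a = [0.90, 0.80, 0.85, 0.70, 0.95, 0.60, 0.75, 0.80]
scores_b = [0.70, 0.75, 0.80, 0.65, 0.90, 0.55, 0.70, 0.78]
p_value = wilcoxon_exact(scores_a, scores_b)  # all diffs positive -> small p
```

Since annotator A scores higher on every item, the observed statistic is extremal and the exact p-value is 2/2^8 = 0.0078125, the smallest possible for n = 8 pairs.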
Alexander Hoyle@miserlis_·
@NitCal Ah, thank you! You also mentioned that a Wilcoxon test would work here for smaller data right? (Sorry to pepper you with questions—this is just highly relevant for our ARR rebuttal ;) )
Rotem Dror reposted
Nitay Calderon@NitCal·
In our new preprint, @roireichart ,@DrorRotem and I propose a new statistical procedure: The Alternative Annotator Test (alt-test) The goal? To help researchers justify using LLMs over humans—if the LLM passes, its annotations can be confidently trusted😎 arxiv.org/abs/2501.10970
Rotem Dror reposted
Roi Reichart@roireichart·
Check out our new pre-print on statistically sound methodology to verify the quality of LLM-as-a-judge annotation. @NitCal @DrorRotem
Nitay Calderon@NitCal

In our new preprint, @roireichart ,@DrorRotem and I propose a new statistical procedure: The Alternative Annotator Test (alt-test) The goal? To help researchers justify using LLMs over humans—if the LLM passes, its annotations can be confidently trusted😎 arxiv.org/abs/2501.10970

Rotem Dror
Rotem Dror@DrorRotem·
A few months ago I told you that I'm working on something awesome and I need some datasets...well here is the first output of that effort and I sincerely think it's amazing 🤩 check it out! @NitCal @roireichart #llm-as-a-judge, #nlpevaluation
Nitay Calderon@NitCal

Do you use LLM-as-a-judge or LLM annotations in your research? There’s a growing trend of replacing human annotators with LLMs in research—they're fast, cheap, and require less effort. But can we trust them?🤔 Well, we need a rigorous procedure to answer this. 🚨New preprint👇
