
Mind the gap when evaluating LLMs with multiple-choice QA 🚨
In our #EMNLP2025 paper, we show that a tiny difference in how a single space is tokenized can shift accuracy by up to 11% and even reshuffle leaderboards.
Big thanks to my great co-authors @minhducbui_nlp & @kelina1124!
NALA (@NALACUJGU)
🧐 Evaluating your LLM with multiple-choice question answering? 🧵 A tiny space in the prompt can make accuracy jump by 11% – and even reshuffle model rankings. #EMNLP2025 #NLP #AI #LLM #Evaluation








