Dongyang Fan
@dyfan22

78 posts

making LLMs efficient and responsible | PhD student in ML/LLMs @epfl_en 🇨🇭🏔️

Lausanne, Switzerland · Joined November 2022
414 Following · 240 Followers
Dongyang Fan @dyfan22
Thanks for the great interest in our hard multi-turn hallucination benchmark! We try our best to update our leaderboard, but the cost is getting very high for an academic lab. Any support with API credits would be greatly appreciated! @OpenAI @AnthropicAI @xai @GeminiApp
Nav Toor @heynavtoor

Researchers at EPFL proved your AI is lying to you. Not sometimes. Most of the time.

They built one of the hardest hallucination tests ever made with Max Planck Institute. 950 questions. Four domains where being wrong actually hurts. Legal. Medical. Research. Coding. Then they ran every top model on it.

The results. GPT-5. Wrong 71.8% of the time. Claude Opus 4.5. Wrong 60% of the time. Gemini 3 Pro. Wrong 61.9% of the time. DeepSeek Reasoner. Wrong 76.8% of the time. These are the smartest AI models on Earth. The ones you trust with your career. Your health. Your money.

You think turning on web search fixes it. It doesn't. Claude Opus 4.5 with web search. Still wrong 30.2% of the time. GPT-5.2 thinking with web search. Still wrong 38.2% of the time. The internet attached. Still lying to you in 1 out of every 3 answers.

Now the part that should scare you. Medical questions. The one place being wrong can kill you. GPT-5 hallucinated 92.8% of the time on medical guidelines. Claude Haiku 4.5 hallucinated 95.7% of the time. Gemini 3 Flash hallucinated 89% of the time. Nine out of ten medical answers from popular AI models. Wrong.

It gets worse. The longer you talk to it, the more it lies. Early mistakes cascade. The model starts citing its own earlier hallucinations as facts. Your third message is more wrong than your first.

The paper, in its own words: "hallucinations remain substantial even with web search."

This is what hundreds of millions of people are doing right now. Asking software that lies in the majority of its answers. About their health. About their job. About their legal case. About their code. Most are not checking. Most never will.

But please. Keep using ChatGPT for medical advice. The doctors need a break.

arxiv.org/abs/2602.01031

0 replies · 0 reposts · 2 likes · 277 views
Dongyang Fan @dyfan22
@j0wimo @maksym_andr Yeah, references can be quite accurate with web search, but the models can fabricate content from the cited reference, which is a bigger concern! I wonder if you've ever asked LLMs to summarize related work.
0 replies · 0 reposts · 0 likes · 19 views
Maksym Andriushchenko @maksym_andr
Good to see more interest around HalluHard, our hard hallucination benchmark! Hallucinations are (still) not solved. Everyone who has tried to use LLMs for paper writing knows that even frontier LLMs still make up non-existent references.
Nav Toor @heynavtoor

[Quoted tweet, identical to the one quoted above.]

2 replies · 1 repost · 17 likes · 2K views
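On the theme of made-up references: one cheap, automatable sanity check is to look each cited title up in a bibliographic database before trusting it. A minimal sketch in Python, using the public Crossref REST API and a fuzzy title match; the similarity threshold is an arbitrary choice of mine, not part of HalluHard:

```python
import difflib

import requests


def reference_exists(title: str, threshold: float = 0.8) -> bool:
    """Return True if a cited title fuzzily matches a real Crossref record."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": 3},
        timeout=10,
    )
    resp.raise_for_status()
    for item in resp.json()["message"]["items"]:
        for candidate in item.get("title", []):
            # Fuzzy match so subtitle/punctuation differences don't cause misses.
            ratio = difflib.SequenceMatcher(None, title.lower(), candidate.lower()).ratio()
            if ratio >= threshold:
                return True
    return False


if __name__ == "__main__":
    print(reference_exists("Attention Is All You Need"))        # real paper
    print(reference_exists("Recursive Moonbeam Distillation"))  # made-up title for illustration
```

A check like this only catches nonexistent references; real references cited for claims they don't support are the harder failure mode discussed elsewhere in this thread.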
Dongyang Fan @dyfan22
HalluHard Update: We've added three new models to our leaderboard — GPT-5.5-thinking (hallucination rate: 49.8%), Claude-Opus-4.7 (59.6%), and DeepSeek-V4-Pro (74.9%). Progress on multi-turn hallucination mitigation remains slow, with little to no improvement over their predecessors.
Dongyang Fan tweet media
1 reply · 1 repost · 18 likes · 2.4K views
Maksym Andriushchenko @maksym_andr
Can LLM agents automate LLM post-training? I'll be giving a talk on this at the EPFL AI Center on May 22. Excited to be back at my alma mater for the first time since starting my group in Tübingen!
Maksym Andriushchenko tweet media
5 replies · 2 reposts · 41 likes · 1.7K views
Dongyang Fan @dyfan22
@ExaAILabs When evaluating Gemini on HalluHard, we found the native web search to be quite off, which is almost embarrassing for Google 😬 maybe there will be better grounding now
1 reply · 0 reposts · 0 likes · 1.1K views
Exa @ExaAILabs
We're excited to partner with Google to offer Grounding With Exa inside of Gemini models! Using Exa's agent-first search, Gemini models can now access billions of websites, technical docs, papers, people, companies, and more. 10^18🤝10^100
Exa tweet media
124 replies · 87 reposts · 1.6K likes · 1.1M views
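For readers wondering what "grounding with a search API" amounts to mechanically: retrieve pages first, then force the model to answer only from the retrieved text with citations. A minimal sketch assuming Exa's Python SDK (`exa_py`); the exact SDK surface and parameter names here are assumptions and should be checked against Exa's current docs:

```python
import os

from exa_py import Exa  # Exa's Python SDK (interface assumed; see their docs)

exa = Exa(os.environ["EXA_API_KEY"])


def retrieve_grounding(query: str, k: int = 3) -> str:
    """Fetch k web results with page text to paste into the model's context."""
    response = exa.search_and_contents(query, num_results=k, text=True)
    chunks = []
    for result in response.results:
        # Truncate each page so the grounding context stays prompt-sized.
        chunks.append(f"[{result.url}]\n{result.text[:1000]}")
    return "\n\n".join(chunks)


# A grounded prompt would then instruct the model to answer *only* from these
# sources and cite the bracketed URLs, rather than its parametric memory.
print(retrieve_grounding("current WHO guideline on hypertension treatment"))
```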
Dongyang Fan @dyfan22
Excited to present our paper on metadata inclusion for LLM data efficiency at #ICLR2026! 🎉
Key findings:
1) Metadata type & position matter for pretraining efficiency
2) Empty meta tokens help LLMs encode quality-aware signals
3) Metadata shapes better latent embeddings
📍 Fri Apr 24 • 10:30 AM | Poster P3-#1014
I won't be there in person, but find my co-author @diba_hashemi at the poster!
Dongyang Fan tweet media
0 replies · 3 reposts · 12 likes · 837 views
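For intuition on what "metadata inclusion" can look like mechanically: prepend a metadata field (e.g. a source URL) to each pretraining document, and sometimes leave the slot empty so the model still learns the format as a signal. The tag names, prepended position, and masking probability below are illustrative stand-ins, not the paper's actual recipe:

```python
import random


def with_metadata(doc_text: str, url: str | None, empty_prob: float = 0.1) -> str:
    """Prepend a metadata prefix to one pretraining document (illustrative)."""
    if url is None or random.random() < empty_prob:
        # "Empty" meta tokens: keep the slot so the model still sees the format.
        prefix = "<meta></meta>"
    else:
        prefix = f"<meta>{url}</meta>"
    return prefix + "\n" + doc_text


print(with_metadata(
    "The mitochondrion is the powerhouse of the cell.",
    "https://en.wikipedia.org/wiki/Mitochondrion",
))
```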
Dongyang Fan @dyfan22
HalluHard Update: We've added two new open-weight models with web search enabled to our leaderboard:
- GLM-5-thinking-WS (39.7%) is competitive with proprietary frontier models, landing at #4!
- Kimi-K2.5-WS, on the other hand, ranks lower at #11.
Dongyang Fan tweet media
1 reply · 0 reposts · 10 likes · 2.4K views
Dongyang Fan retweeted
Christina Baek @_christinabaek
Models are typically specialized to new domains by finetuning on small, high-quality datasets. We find that repeating the same dataset 10–50× starting from pretraining leads to substantially better downstream performance, in some cases outperforming larger models. 🧵
Christina Baek tweet media
19 replies · 80 reposts · 618 likes · 93.8K views
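A minimal sketch of the recipe as described in the tweet: repeat the small, high-quality domain set many times and mix it into the pretraining stream. The repeat count and uniform shuffling below are my illustrative choices; the linked thread has the actual details:

```python
import random


def build_training_stream(pretrain_docs, domain_docs, repeats=20, seed=0):
    """Mix a small domain dataset, repeated `repeats` times (10-50x in the
    tweet), into the pretraining stream, then shuffle."""
    stream = list(pretrain_docs) + list(domain_docs) * repeats
    random.Random(seed).shuffle(stream)
    return stream


pretrain = [f"web_doc_{i}" for i in range(1000)]
domain = [f"domain_doc_{i}" for i in range(10)]
print(len(build_training_stream(pretrain, domain)))  # 1000 + 10 * 20 = 1200
```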
Maksym Andriushchenko @maksym_andr
Do you think LLM hallucinations are solved? 📢 We introduce HalluHard: a challenging multi-turn, open-ended hallucination benchmark. Even the most recent frontier LLMs like Opus 4.5 with web search hallucinate very frequently on our set of challenging examples.
Maksym Andriushchenko tweet media
16 replies · 43 reposts · 244 likes · 26.6K views
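For a concrete sense of what a multi-turn hallucination eval involves: feed the follow-up questions one at a time with full history kept (so early errors can cascade), then have a judge flag unsupported claims per turn. A minimal sketch using the OpenAI Python SDK; the model names, judge prompt, and example turns are stand-ins, not HalluHard's actual protocol:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4o-mini"  # stand-in for the model under test

JUDGE_PROMPT = (
    "You are a strict fact-checker. Reply HALLUCINATION if the answer "
    "contains any fabricated or unsupported factual claim, otherwise OK."
)


def run_conversation(turns: list[str]) -> list[str]:
    """Answer each turn with the full history kept, so early mistakes can
    cascade into later turns (the multi-turn effect the benchmark targets)."""
    history, answers = [], []
    for user_turn in turns:
        history.append({"role": "user", "content": user_turn})
        resp = client.chat.completions.create(model=MODEL, messages=history)
        answer = resp.choices[0].message.content
        history.append({"role": "assistant", "content": answer})
        answers.append(answer)
    return answers


def hallucinated(answer: str) -> bool:
    resp = client.chat.completions.create(
        model=MODEL,  # stand-in judge; a real benchmark would use a stronger one
        messages=[{"role": "system", "content": JUDGE_PROMPT},
                  {"role": "user", "content": answer}],
    )
    return "HALLUCINATION" in resp.choices[0].message.content


turns = ["Which paper introduced LoRA, and at which venue did it appear?",
         "Quote the first sentence of its abstract verbatim."]
print([hallucinated(a) for a in run_conversation(turns)])  # per-turn flags
```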
Dongyang Fan @dyfan22
@ketansingh279 @maksym_andr @SDelsad21364 @tml_lab We also mention in our paper that *more reasoning != less hallucination*, based on experiments at different reasoning levels. With more reasoning effort, models tend to produce longer, more detailed responses, which creates more opportunities to hallucinate.
0 replies · 0 reposts · 1 like · 31 views
Ketan Singh @ketansingh279
Ideally, consider using the maximum thinking/reasoning level for all models, not the default (which can be influenced more by marketing and company philosophy). Otherwise, scores come down to how various companies choose their defaults and the granularity of their reasoning levels. Using the default won't reflect the actual latent capabilities of the frontier models.
2 replies · 0 reposts · 0 likes · 35 views
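The knob this exchange is about is exposed directly by some APIs; OpenAI's chat completions API, for instance, accepts a `reasoning_effort` parameter for its reasoning models. A minimal sketch of sweeping it; the model name and question are stand-ins, and word count is only a crude proxy for the "longer responses, more claims, more risk" effect:

```python
from openai import OpenAI

client = OpenAI()

question = "Cite the controlling US case on fair use of software APIs."

for effort in ("low", "medium", "high"):
    resp = client.chat.completions.create(
        model="o3-mini",          # stand-in reasoning model
        reasoning_effort=effort,  # the setting debated in this thread
        messages=[{"role": "user", "content": question}],
    )
    answer = resp.choices[0].message.content
    # Longer answers pack in more atomic claims, so per-response
    # hallucination rates can rise even as reasoning quality improves.
    print(effort, len(answer.split()), "words")
```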
Maksym Andriushchenko @maksym_andr
GPT-5.4, released yesterday, is already on HalluHard:
- huge progress on coding hallucinations: 32.2% (GPT-5.2-Thinking) → 11.7% (GPT-5.4-Thinking) (!)
- legal hallucinations: 33.5% → 29.9%
- with web search, GPT-5.4 ≈ Opus-4.5, although with very different performance across the 4 categories!
Maksym Andriushchenko tweet media
3 replies · 10 reposts · 111 likes · 9.9K views
Dongyang Fan @dyfan22
@jmbollenbacher @maksym_andr We evaluate hallucinations as shown in the figure. Essentially, our judge can reveal more hidden hallucination types, e.g. cases where the cited source is real but the claim it supposedly supports is fabricated.
Dongyang Fan tweet media
1 reply · 0 reposts · 0 likes · 42 views
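The failure mode described above (real citation, fabricated claim) can be probed by fetching the cited source and asking a judge whether it actually supports the claim. A minimal sketch; the judge model and prompt are my stand-ins for whatever the paper's judge actually uses:

```python
import requests
from openai import OpenAI

client = OpenAI()


def claim_supported(claim: str, url: str) -> bool:
    """Fetch the cited page and ask a judge model whether it entails the
    claim. The URL can resolve to a real source while the claim attributed
    to it is still fabricated -- the hidden case mentioned above."""
    page = requests.get(url, timeout=10).text[:8000]  # crude truncation
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in judge
        messages=[{"role": "user", "content": (
            f"Source text:\n{page}\n\nClaim: {claim}\n\n"
            "Does the source support the claim? Answer YES or NO."
        )}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```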
Maksym Andriushchenko @maksym_andr
good news: hallucinations are on track to be solved
bad news: we still need to wait a few more (?) years
Maksym Andriushchenko tweet media
25 replies · 18 reposts · 307 likes · 32.5K views
Dongyang Fan @dyfan22
Following the release of Gemini-3.1-Pro by @GoogleDeepMind, we evaluated it on our hard multi-turn hallucination benchmark, HalluHard. Gemini moved up our leaderboard from 3-Pro (8th) to 3.1-Pro (4th), making it the 2nd-best model without web search.
Dongyang Fan tweet media
1 reply · 0 reposts · 9 likes · 2.7K views