Dongyang Fan
@dyfan22

78 posts

making LLMs efficient and responsible | PhD student in ML/LLMs @epfl_en 🇨🇭🏔️

Lausanne, Switzerland · Joined November 2022
414 Following · 240 Followers
Dongyang Fan @dyfan22
Thanks for the great interest in our hard multi-turn hallucination benchmark! We try our best to update our leaderboard, but the cost is getting very high for an academic lab. Any support with API credits would be greatly appreciated! @OpenAI @AnthropicAI @xai @GeminiApp
Nav Toor @heynavtoor

Researchers at EPFL proved your AI is lying to you. Not sometimes. Most of the time.

They built one of the hardest hallucination tests ever made with Max Planck Institute. 950 questions. Four domains where being wrong actually hurts. Legal. Medical. Research. Coding. Then they ran every top model on it.

The results. GPT-5. Wrong 71.8% of the time. Claude Opus 4.5. Wrong 60% of the time. Gemini 3 Pro. Wrong 61.9% of the time. DeepSeek Reasoner. Wrong 76.8% of the time. These are the smartest AI models on Earth. The ones you trust with your career. Your health. Your money.

You think turning on web search fixes it. It doesn't. Claude Opus 4.5 with web search. Still wrong 30.2% of the time. GPT-5.2 thinking with web search. Still wrong 38.2% of the time. The internet attached. Still lying to you in 1 out of every 3 answers.

Now the part that should scare you. Medical questions. The one place being wrong can kill you. GPT-5 hallucinated 92.8% of the time on medical guidelines. Claude Haiku 4.5 hallucinated 95.7% of the time. Gemini 3 Flash hallucinated 89% of the time. Nine out of ten medical answers from popular AI models. Wrong.

It gets worse. The longer you talk to it, the more it lies. Early mistakes cascade. The model starts citing its own earlier hallucinations as facts. Your third message is more wrong than your first.

The paper, in its own words: "hallucinations remain substantial even with web search."

This is what hundreds of millions of people are doing right now. Asking software that lies in the majority of its answers. About their health. About their job. About their legal case. About their code. Most are not checking. Most never will.

But please. Keep using ChatGPT for medical advice. The doctors need a break.

arxiv.org/abs/2602.01031

0 replies · 0 reposts · 2 likes · 277 views
Dongyang Fan @dyfan22
@j0wimo @maksym_andr Yeah, references can be quite accurate with web search, but the models can fabricate content from the cited reference, which is a bigger concern! I wonder if you've ever asked LLMs to summarize related work.
0 replies · 0 reposts · 0 likes · 19 views
Maksym Andriushchenko @maksym_andr
Good to see more interest around HalluHard, our hard hallucination benchmark! Hallucinations are (still) not solved. Everyone who has tried to use LLMs for paper writing knows that even frontier LLMs still make up non-existent references.
Nav Toor @heynavtoor

[Quoted tweet, identical to the one quoted above.]

2 replies · 1 repost · 17 likes · 2K views
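On the theme of made-up references: one cheap, automatable sanity check is to look each cited title up in a bibliographic database before trusting it. A minimal sketch in Python, using the public Crossref REST API and a fuzzy title match; the similarity threshold is an arbitrary choice of mine, not part of HalluHard:

```python
import difflib

import requests


def reference_exists(title: str, threshold: float = 0.8) -> bool:
    """Return True if a cited title fuzzily matches a real Crossref record."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": 3},
        timeout=10,
    )
    resp.raise_for_status()
    for item in resp.json()["message"]["items"]:
        for candidate in item.get("title", []):
            # Fuzzy match so subtitle/punctuation differences don't cause misses.
            ratio = difflib.SequenceMatcher(None, title.lower(), candidate.lower()).ratio()
            if ratio >= threshold:
                return True
    return False


if __name__ == "__main__":
    print(reference_exists("Attention Is All You Need"))        # real paper
    print(reference_exists("Recursive Moonbeam Distillation"))  # made-up title for illustration
```

A check like this only catches nonexistent references; real references cited for claims they don't support are the harder failure mode discussed elsewhere in this thread.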
Dongyang Fan @dyfan22
HalluHard Update: We've added three new models to our leaderboard — GPT-5.5-thinking (hallucination rate: 49.8%), Claude-Opus-4.7 (59.6%), and DeepSeek-V4-Pro (74.9%). Progress on multi-turn hallucination mitigation remains slow, with little to no improvement over their predecessors.
Dongyang Fan tweet media
1 reply · 1 repost · 18 likes · 2.4K views
Maksym Andriushchenko @maksym_andr
Can LLM agents automate LLM post-training? I'll be giving a talk on this at the EPFL AI Center on May 22. Excited to be back at my alma mater for the first time since starting my group in Tübingen!
Maksym Andriushchenko tweet media
5 replies · 2 reposts · 41 likes · 1.7K views
Dongyang Fan @dyfan22
@ExaAILabs When evaluating Gemini on HalluHard, we found the native web search to be quite off, which is almost embarrassing for Google 😬 maybe there will be better grounding now
1 reply · 0 reposts · 0 likes · 1.1K views
Exa @ExaAILabs
We're excited to partner with Google to offer Grounding With Exa inside of Gemini models! Using Exa's agent-first search, Gemini models can now access billions of websites, technical docs, papers, people, companies, and more. 10^18🤝10^100
Exa tweet media
124 replies · 87 reposts · 1.6K likes · 1.1M views
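For readers wondering what "grounding with a search API" amounts to mechanically: retrieve pages first, then force the model to answer only from the retrieved text with citations. A minimal sketch assuming Exa's Python SDK (`exa_py`); the exact SDK surface and parameter names here are assumptions and should be checked against Exa's current docs:

```python
import os

from exa_py import Exa  # Exa's Python SDK (interface assumed; see their docs)

exa = Exa(os.environ["EXA_API_KEY"])


def retrieve_grounding(query: str, k: int = 3) -> str:
    """Fetch k web results with page text to paste into the model's context."""
    response = exa.search_and_contents(query, num_results=k, text=True)
    chunks = []
    for result in response.results:
        # Truncate each page so the grounding context stays prompt-sized.
        chunks.append(f"[{result.url}]\n{result.text[:1000]}")
    return "\n\n".join(chunks)


# A grounded prompt would then instruct the model to answer *only* from these
# sources and cite the bracketed URLs, rather than its parametric memory.
print(retrieve_grounding("current WHO guideline on hypertension treatment"))
```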
Dongyang Fan @dyfan22
Excited to present our paper on metadata inclusion for LLM data efficiency at #ICLR2026! 🎉
Key findings:
1) Metadata type & position matter for pretraining efficiency
2) Empty meta tokens help LLMs encode quality-aware signals
3) Metadata shapes better latent embeddings
📍 Fri Apr 24 • 10:30 AM | Poster P3-#1014
I won't be there in person, but find my co-author @diba_hashemi at the poster!
Dongyang Fan tweet media
0 replies · 3 reposts · 12 likes · 837 views
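For intuition on what "metadata inclusion" can look like mechanically: prepend a metadata field (e.g. a source URL) to each pretraining document, and sometimes leave the slot empty so the model still learns the format as a signal. The tag names, prepended position, and masking probability below are illustrative stand-ins, not the paper's actual recipe:

```python
import random


def with_metadata(doc_text: str, url: str | None, empty_prob: float = 0.1) -> str:
    """Prepend a metadata prefix to one pretraining document (illustrative)."""
    if url is None or random.random() < empty_prob:
        # "Empty" meta tokens: keep the slot so the model still sees the format.
        prefix = "<meta></meta>"
    else:
        prefix = f"<meta>{url}</meta>"
    return prefix + "\n" + doc_text


print(with_metadata(
    "The mitochondrion is the powerhouse of the cell.",
    "https://en.wikipedia.org/wiki/Mitochondrion",
))
```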
Dongyang Fan @dyfan22
HalluHard Update: We've added two new open-weight models with web search enabled to our leaderboard:
- GLM-5-thinking-WS (39.7%) is competitive with proprietary frontier models, landing at #4!
- Kimi-K2.5-WS, on the other hand, ranks lower at #11.
Dongyang Fan tweet media
1 reply · 0 reposts · 10 likes · 2.4K views
Dongyang Fan retweeted
Christina Baek @_christinabaek
Models are typically specialized to new domains by finetuning on small, high-quality datasets. We find that repeating the same dataset 10–50× starting from pretraining leads to substantially better downstream performance, in some cases outperforming larger models. 🧵
Christina Baek tweet media
19 replies · 80 reposts · 618 likes · 93.8K views
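A minimal sketch of the recipe as described in the tweet: repeat the small, high-quality domain set many times and mix it into the pretraining stream. The repeat count and uniform shuffling below are my illustrative choices; the linked thread has the actual details:

```python
import random


def build_training_stream(pretrain_docs, domain_docs, repeats=20, seed=0):
    """Mix a small domain dataset, repeated `repeats` times (10-50x in the
    tweet), into the pretraining stream, then shuffle."""
    stream = list(pretrain_docs) + list(domain_docs) * repeats
    random.Random(seed).shuffle(stream)
    return stream


pretrain = [f"web_doc_{i}" for i in range(1000)]
domain = [f"domain_doc_{i}" for i in range(10)]
print(len(build_training_stream(pretrain, domain)))  # 1000 + 10 * 20 = 1200
```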
Maksym Andriushchenko @maksym_andr
Do you think LLM hallucinations are solved? 📢 We introduce HalluHard: a challenging multi-turn, open-ended hallucination benchmark. Even the most recent frontier LLMs like Opus 4.5 with web search hallucinate very frequently on our set of challenging examples.
Maksym Andriushchenko tweet media
16 replies · 43 reposts · 244 likes · 26.6K views
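For a concrete sense of what a multi-turn hallucination eval involves: feed the follow-up questions one at a time with full history kept (so early errors can cascade), then have a judge flag unsupported claims per turn. A minimal sketch using the OpenAI Python SDK; the model names, judge prompt, and example turns are stand-ins, not HalluHard's actual protocol:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4o-mini"  # stand-in for the model under test

JUDGE_PROMPT = (
    "You are a strict fact-checker. Reply HALLUCINATION if the answer "
    "contains any fabricated or unsupported factual claim, otherwise OK."
)


def run_conversation(turns: list[str]) -> list[str]:
    """Answer each turn with the full history kept, so early mistakes can
    cascade into later turns (the multi-turn effect the benchmark targets)."""
    history, answers = [], []
    for user_turn in turns:
        history.append({"role": "user", "content": user_turn})
        resp = client.chat.completions.create(model=MODEL, messages=history)
        answer = resp.choices[0].message.content
        history.append({"role": "assistant", "content": answer})
        answers.append(answer)
    return answers


def hallucinated(answer: str) -> bool:
    resp = client.chat.completions.create(
        model=MODEL,  # stand-in judge; a real benchmark would use a stronger one
        messages=[{"role": "system", "content": JUDGE_PROMPT},
                  {"role": "user", "content": answer}],
    )
    return "HALLUCINATION" in resp.choices[0].message.content


turns = ["Which paper introduced LoRA, and at which venue did it appear?",
         "Quote the first sentence of its abstract verbatim."]
print([hallucinated(a) for a in run_conversation(turns)])  # per-turn flags
```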
Dongyang Fan @dyfan22
@ketansingh279 @maksym_andr @SDelsad21364 @tml_lab We also mention in our paper that *more reasoning != less hallucination*, based on experiments at different reasoning levels. With more reasoning effort, models tend to produce longer, more detailed responses, which creates more opportunities to hallucinate.
0 replies · 0 reposts · 1 like · 31 views
Ketan Singh @ketansingh279
Ideally, consider using the maximum thinking/reasoning level for all models, not the default (which can be influenced more by marketing and company philosophy). Otherwise, scores come down to how various companies choose their defaults and the granularity of their reasoning levels. Using the default won't reflect the actual latent capabilities of the frontier models.
2 replies · 0 reposts · 0 likes · 35 views
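The knob this exchange is about is exposed directly by some APIs; OpenAI's chat completions API, for instance, accepts a `reasoning_effort` parameter for its reasoning models. A minimal sketch of sweeping it; the model name and question are stand-ins, and word count is only a crude proxy for the "longer responses, more claims, more risk" effect:

```python
from openai import OpenAI

client = OpenAI()

question = "Cite the controlling US case on fair use of software APIs."

for effort in ("low", "medium", "high"):
    resp = client.chat.completions.create(
        model="o3-mini",          # stand-in reasoning model
        reasoning_effort=effort,  # the setting debated in this thread
        messages=[{"role": "user", "content": question}],
    )
    answer = resp.choices[0].message.content
    # Longer answers pack in more atomic claims, so per-response
    # hallucination rates can rise even as reasoning quality improves.
    print(effort, len(answer.split()), "words")
```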
Maksym Andriushchenko @maksym_andr
GPT-5.4, released yesterday, is already on HalluHard:
- huge progress on coding hallucinations: 32.2% (GPT-5.2-Thinking) → 11.7% (GPT-5.4-Thinking) (!)
- legal hallucinations: 33.5% → 29.9%
- with web search, GPT-5.4 ≈ Opus-4.5, although with very different performance across the 4 categories!
Maksym Andriushchenko tweet media
3 replies · 10 reposts · 111 likes · 9.9K views
Dongyang Fan @dyfan22
@jmbollenbacher @maksym_andr We evaluate hallucinations as shown in the figure. Essentially, our judge can reveal more hidden hallucination types, e.g. cases where the cited source is real but the claim it supposedly supports is fabricated.
Dongyang Fan tweet media
1 reply · 0 reposts · 0 likes · 42 views
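The failure mode described above (real citation, fabricated claim) can be probed by fetching the cited source and asking a judge whether it actually supports the claim. A minimal sketch; the judge model and prompt are my stand-ins for whatever the paper's judge actually uses:

```python
import requests
from openai import OpenAI

client = OpenAI()


def claim_supported(claim: str, url: str) -> bool:
    """Fetch the cited page and ask a judge model whether it entails the
    claim. The URL can resolve to a real source while the claim attributed
    to it is still fabricated -- the hidden case mentioned above."""
    page = requests.get(url, timeout=10).text[:8000]  # crude truncation
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in judge
        messages=[{"role": "user", "content": (
            f"Source text:\n{page}\n\nClaim: {claim}\n\n"
            "Does the source support the claim? Answer YES or NO."
        )}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```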
Maksym Andriushchenko @maksym_andr
good news: hallucinations are on track to be solved
bad news: we still need to wait a few more (?) years
Maksym Andriushchenko tweet media
25 replies · 18 reposts · 307 likes · 32.5K views
Dongyang Fan @dyfan22
Following the release of Gemini-3.1-Pro by @GoogleDeepMind, we evaluated it on our hard multi-turn hallucination benchmark, HalluHard. Gemini moved up our leaderboard from 3-Pro (8th) to 3.1-Pro (4th), making it the 2nd-best model without web search.
Dongyang Fan tweet media
1 reply · 0 reposts · 9 likes · 2.7K views