Long Phan

6 posts

@longphan3110

AI Safety Research @CAIS | Author of Humanity's Last Exam (HLE)

Joined March 2025
34 Following · 81 Followers
Long Phan retweeted
Calvin Zhang @ ICLR’26 (@calvincbzhang)
1/ A year ago, we released Humanity’s Last Exam, a benchmark to measure reasoning in LLMs. One year later, almost exactly on the day of my one-year anniversary, it’s incredibly rewarding to see this work published in @Nature under open access and to see how much reasoning performance has progressed since.
[Image attached]
6 replies · 32 reposts · 147 likes · 14.9K views
Long Phan retweeted
Center for AI Safety
Humanity's Last Exam is now published in Nature. Since its release, HLE has become a leading frontier benchmark, used by OpenAI, Anthropic, DeepMind, and xAI. Thank you to our partners at @scale_AI and the 1,000+ co-authors who made this benchmark possible.
[Image attached]
3 replies · 15 reposts · 95 likes · 7.6K views
Florian Brand (@xeophon)
some might say "stop looking at HLE altogether" but people are not ready for this discussion yet
[Image attached]
4 replies · 0 reposts · 46 likes · 2K views
Long Phan retweeted
Dan Hendrycks (@hendrycks)
Just how significant is the jump with Gemini 3? We just released a new leaderboard to track AI developments. Gemini 3 is the largest leap in a long time.
[4 images attached]
31 replies · 79 reposts · 545 likes · 117.6K views
Long Phan retweeted
Dan Hendrycks (@hendrycks)
Can AI automate jobs? We created the Remote Labor Index to test AI’s ability to automate hundreds of long, real-world, economically valuable projects from remote work platforms. While AIs are smart, they are not yet that useful: the current automation rate is less than 3%.
[3 images attached]
100 replies · 189 reposts · 1K likes · 425.3K views
Long Phan retweeted
Dan Hendrycks (@hendrycks)
Can AIs beat long video games? We made TextQuests to test GPT-5, Grok 4, Deepseek, etc. These games can often take people dozens of hours to beat.
- AIs can't beat any of the games (without clues)
- some AIs behave more viciously than others
- AIs are getting better rapidly
[3 images attached]
17 replies · 17 reposts · 73 likes · 16.9K views