Long Phan

6 posts

@longphan3110

AI Safety Research @CAIS | Author of Humanity's Last Exam (HLE)

Joined March 2025

34 Following · 81 Followers

Long Phan reposted

Calvin Zhang @ ICLR’26 @calvincbzhang · Jan 28

1/ A year ago, we released Humanity’s Last Exam, a benchmark to measure reasoning in LLMs. One year later, almost exactly on the day of my one-year anniversary, it’s incredibly rewarding to see this work published in @Nature under open access and to see how much reasoning performance has progressed since.

Long Phan reposted

Center for AI Safety @CAIS · Jan 28

Humanity's Last Exam is now published in Nature. Since its release, HLE has become a leading frontier benchmark, used by OpenAI, Anthropic, DeepMind, and xAI. Thank you to our partners at @scale_AI and the 1,000+ co-authors who made this benchmark possible.

Long Phan @longphan3110 · Nov 29

@xeophon maybe scroll down a bit

Florian Brand @xeophon · Nov 28

some might say "stop looking at HLE altogether" but people are not ready for this discussion yet

Florian Brand @xeophon · Nov 28

stop looking at HLE (with tools). most of these mean "has web access". the answers to HLE are easily accessible in ungated mirrors (and prob a dozen other places). the only question is why those agents don't score 100%

Ivan Fioravanti ᯅ @ivanfioravanti

This 8B beast from NVIDIA is a fine-tuning of Qwen3-8B! 37.1 on Humanity's Last Exam!

Long Phan reposted

Dan Hendrycks @hendrycks · Nov 19

Just how significant is the jump with Gemini 3? We just released a new leaderboard to track AI developments. Gemini 3 is the largest leap in a long time.

Long Phan reposted

Dan Hendrycks @hendrycks · Oct 29

Can AI automate jobs? We created the Remote Labor Index to test AI’s ability to automate hundreds of long, real-world, economically valuable projects from remote work platforms. While AIs are smart, they are not yet that useful: the current automation rate is less than 3%.

Long Phan reposted

Dan Hendrycks @hendrycks · Aug 12

Can AIs beat long video games? We made TextQuests to test GPT-5, Grok 4, Deepseek, etc. These games can often take people dozens of hours to beat.
- AIs can't beat any of the games (without clues)
- some AIs behave more viciously than others
- AIs are getting better rapidly
