nano

3.4K posts

@nanulled

longtermism

United States · Joined October 2019
56 Following · 3K Followers
nano retweeted
METR
METR@METR_Evals·
We ran GPT-5.4 (xhigh) on our tasks. Its time-horizon depends greatly on our treatment of reward hacks: the point estimate would be 5.7hrs (95% CI of 3hrs to 13.5hrs) under our standard methodology, but 13hrs (95% CI of 5hrs to 74hrs) if we allow reward hacks.
28
61
770
258K
nano
nano@nanulled·
I seriously think that OpenAI started purposely hurting ML research capabilities with this model; it's literally worse at taste than 5.2 high. I understand the competitive advantage of withholding capabilities, but they should still just admit it and not waste anyone's time.
0
0
4
371
nano
nano@nanulled·
5.4 xhigh is worse than 5.3 codex at ML research: running experiments, patching gated features, and debugging inference and evals. It's maybe better at moonshot proposals, just like 5.2 high, but it does not have robust experimentation hygiene. Same with 5.4 pro vs 5.2 pro.
3
0
12
978
nano retweeted
Google DeepMind
Google DeepMind@GoogleDeepMind·
Step inside Project Genie: our experimental research prototype that lets you create, edit, and explore virtual worlds. 🌎
984
4.3K
34.5K
13.4M
nano retweeted
Google DeepMind
Google DeepMind@GoogleDeepMind·
What if you could not only watch a generated video, but explore it too? 🌐 Genie 3 is our groundbreaking world model that creates interactive, playable environments from a single text prompt. From photorealistic landscapes to fantasy realms, the possibilities are endless. 🧵
814
2.6K
13.4K
3.7M
nano
nano@nanulled·
Gemini 2.5 Deep Think model card: it's not superhuman, but it is similar to the gold IMO model and “approaches human level” on stealth evals. More interested in learning about the “novel RL techniques that can leverage more multi-step reasoning” (candidate: MARL with verification/voting for each step).
0
0
18
1.7K
nano
nano@nanulled·
The new stealth model, Horizon Alpha, has the ability to think in CoT, but you really have to try to get it to do so. Here is the CoT it generated. It's very terse, and I see some o3 in its writing. I think it's safe to say it's an OpenAI open-source model.
1
1
35
2.5K
nano
nano@nanulled·
Good to see HLE's uselessness being confirmed after ~3 months of this thread. Actually, it's not just useless, it's a harmful signal that somewhat slowed down progress imo x.com/andrewwhite01/…
Andrew White 🐦‍⬛@andrewwhite01

HLE has recently become the benchmark to beat for frontier agents. We @FutureHouseSF took a closer look at the chem and bio questions and found about 30% of them are likely invalid based on our analysis and third-party PhD evaluations. 1/7

0
0
3
419
nano
nano@nanulled·
Yes, there would be differences in taste and preferences, but a horrible game or piece of software can be recognized as such by the majority. Vague objectives would be set for the AI to complete, and most humans would be able to verify whether those objectives were achieved fully or partially.
1
0
4
667
nano
nano@nanulled·
Benchmarks like Humanity’s Last Exam and Codeforces nerd-sniped researchers and could prevent AI labs from developing genuine AGI capable of performing real-world tasks.
1
0
21
1.6K
Sam Altman
Sam Altman@sama·
we achieved gold medal level performance on the 2025 IMO competition with a general-purpose reasoning system! to emphasize, this is an LLM doing math and not a specific formal math system; it is part of our main push towards general intelligence. when we first started openai, this was a dream but not one that felt very realistic to us; it is a significant marker of how far AI has come over the past decade. we are releasing GPT-5 soon but want to set accurate expectations: this is an experimental model that incorporates new research techniques we will use in future models. we think you will love GPT-5, but we don't plan to release a model with IMO gold level of capability for many months.
Alexander Wei@alexwei_

1/N I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s most prestigious math competition—the International Math Olympiad (IMO).

510
701
6.2K
1.2M
ID_law
ID_law@brutalmog·
@nanulled @polynoamial @OpenAI it fell to 30 because the resolution depends on the IMO Grand Challenge, which requires the model to be open source
1
0
1
37
Noam Brown
Noam Brown@polynoamial·
I think it's safe to say this @OpenAI IMO gold result came as a bit of a surprise to folks
80
145
2.6K
461.4K
nano
nano@nanulled·
@bennetkrause They've said it's a reasoning model, so a scratchpad with some form of memory-maintenance mechanism that they've probably RL'd using a general reasoning breakthrough. I bet that it's not just a verifier; it would be much more bullish if it were a model itself, so I bet on that.
0
0
1
61
Bennet
Bennet@bennetkrause·
@nanulled How does it do that? Does it use some scaffolding with a scratchpad? Otherwise even the biggest context window would not suffice
1
0
1
70
nano
nano@nanulled·
OpenAI’s reasoning model thinks for hours and got gold medal-level performance on the IMO. Progress is faster than most thought.
Noam Brown@polynoamial

Today, we at @OpenAI achieved a milestone that many considered years away: gold medal-level performance on the 2025 IMO with a general reasoning LLM—under the same time limits as humans, without tools. As remarkable as that sounds, it’s even more significant than the headline 🧵

6
3
122
6.7K
nano
nano@nanulled·
@kalomaze and embodied agi is: can it do it using mouse and keyboard or controller like a human would
0
0
5
244
kalomaze
kalomaze@kalomaze·
my bar for agi is "can the system beat portal 2 co-op with another human being (or another replicant of itself) without game-specific scaffolding"
24
5
294
17.3K
nano
nano@nanulled·
@polynoamial Amazing, Noam. Q: was this IMO solved by an LLM, by a system of agents with MARL, or by something even better?
2
0
3
1K
nano
nano@nanulled·
@zephyr_z9 @epicarism If I didn't want to disclose something, I would not say anything related to the multi-agent system. Perhaps he really wanted to share it but couldn't do it directly.
0
0
3
93
Zephyr
Zephyr@zephyr_z9·
@nanulled @epicarism What if they don't want to disclose it?? Sheryl & Noam are working on the multi-agent RL team. This is the only concrete info we have.
1
0
4
142
nano
nano@nanulled·
@epicarism @zephyr_z9 I know that. Pay attention to the wording: "models", not "a model" or "reasoning LLM". They should've said that the IMO was solved by agents or a system, not by a "model".
1
0
2
129
Epicarism
Epicarism@epicarism·
@nanulled @zephyr_z9 all agents are models in use, you can't have "agents" without a "model". This was likely from MARL / experimental model fine tuned for multi agents etc
1
0
2
69
nano
nano@nanulled·
@zephyr_z9 Interesting. From what I've seen, they claimed it's a reasoning model, not a system of agents.
1
0
3
265
Zephyr
Zephyr@zephyr_z9·
@nanulled They used some kind of multi-agent setup
2
0
7
262
nano
nano@nanulled·
Do you understand that this model can definitely code and debug for hours at the level of a top 1% expert? If they release this model, the progress would be unprecedented and the "sci fi" intelligence explosion theory becomes a reality.
0
0
13
554