Psyho

2.4K posts

@FakePsyho

Humanity's Last Programmer; Game Designer; Problem Solver; past: OpenAI (Dota), Pro Competitive Programmer, Poker

I don't know anymore · Joined May 2012

397 Following · 29.5K Followers

Pinned Tweet

Psyho@FakePsyho·Jul 16

Humanity has prevailed (for now!) I'm completely exhausted. I figured, I had 10h of sleep in the last 3 days and I'm barely alive. I'll post more about the contest when I get some rest. (To be clear, those are provisional results, but my lead should be big enough)

Psyho@FakePsyho·4h

apologies if this is the 5th time you see this today

Psyho@FakePsyho·4h

twitter today

Psyho@FakePsyho·16h

I'm not even sure why I'm QTing this. On one hand it feels important, on the other hand, feels devoid of any substance.

Psyho@FakePsyho·16h

tl;dr: "maybe we should regulate AI?" corposlop padded with technomessianism

Demis Hassabis@demishassabis

x.com/i/article/2076…

Psyho@FakePsyho·1d

@Nim_game_btc re: harness, imho it's still required for bigger/harder/more esoteric ones; but yeah observations in those puzzles should be way easier than stuff in AWTF

Psyho@FakePsyho·1d

@Nim_game_btc My tests were more qualitative than quantitative, i.e. I was looking at solving paths checking etc. The difference is huge, because basically none of the earlier models were able to give me satisfying results before. I'd need to finish my whole benchmark in order to give stats.

Psyho@FakePsyho·1d

gpt 5.6 sol pro is the very first model that routinely solves some of the harder logic puzzles and so far looks like it actually uses logic rather than pure guessing
so far my experience is that the improvement 5.5->5.6 is bigger than 5.2->5.5

Psyho@FakePsyho·1d

@Meronix15 My diplomatic answer is: I don't think "computer science" people should fear more than any other knowledge workers.

Meronix@Meronix15·1d

@FakePsyho Do you think in 2-3 years that there will be any jobs left in computer science fields?

Psyho@FakePsyho·1d

@jurgen_kleis I used webui for this

Kleis Jürgen@jurgen_kleis·1d

@FakePsyho what is pro? you mean ultra or xhigh?

Psyho@FakePsyho·1d

@gurselcakar Most of the puzzles that I use for testing are not freely available online. You can find similar puzzles at gp.worldpuzzle.org/content/puzzle… (requires account)

Gürsel Çakar@gurselcakar·1d

@FakePsyho is there a public repo or website where i can find some of these puzzle sets out of curiosity?

Psyho@FakePsyho·1d

classic sudoku is an easy case, because even the hardest ones are generally easily solvable via backtracking (or more precisely by applying basic patterns + backtracking)
I'm mainly testing on puzzles that require advanced break-ins and are hard to "brute force". I'd say we're not there yet, but we got very close in the past few months.
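
For readers unfamiliar with the approach mentioned in the tweet above, here is a minimal sketch of "basic patterns + backtracking" for classic sudoku. It is an illustration only, not code from the thread: the names `candidates` and `solve`, and the grid representation (9x9 list of ints, 0 for empty), are assumptions made for this example. The only "basic pattern" used is always branching on the cell with the fewest legal digits, so a cell with exactly one option is filled without any real guessing.

```python
from typing import List, Optional, Tuple

Grid = List[List[int]]  # hypothetical representation: 9x9, 0 marks an empty cell


def candidates(grid: Grid, r: int, c: int) -> List[int]:
    """Digits not yet used in the row, column, or 3x3 box of cell (r, c)."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return [d for d in range(1, 10) if d not in used]


def solve(grid: Grid) -> bool:
    """Fill the grid in place; return True if a full solution was found."""
    best: Optional[Tuple[int, int]] = None
    best_opts: List[int] = []
    for r in range(9):
        for c in range(9):
            if grid[r][c] != 0:
                continue
            opts = candidates(grid, r, c)
            if not opts:
                return False  # empty cell with no legal digit: dead end
            if best is None or len(opts) < len(best_opts):
                best, best_opts = (r, c), opts
    if best is None:
        return True  # no empty cells left
    r, c = best
    for d in best_opts:
        grid[r][c] = d  # tentative placement
        if solve(grid):
            return True
    grid[r][c] = 0  # undo and backtrack
    return False
```

Calling `solve(grid)` on a grid parsed from nine strings of digits mutates it into a completed grid when one exists. The sketch assumes the given digits do not already conflict with each other. Puzzles that need "advanced break-ins", as described in the tweet, are exactly the ones where this kind of search stops being practical.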

Gürsel Çakar@gurselcakar·1d

@FakePsyho what is your knowledge with respect to llms' ability to solve, for instance, logic-based puzzles like sudoku? e.g. can they now solve those extremely hard sudoku (or similar) puzzles?

Psyho@FakePsyho·1d

I can no longer claim that if I ever finished that benchmark, the solve rate would be essentially 0%
maybe this will motivate me to actually finish it, since it's probably a few more months until it would get saturated

Psyho@FakePsyho·1d

for the record, that was the experience with fable 5 on the same set of puzzles x.com/FakePsyho/stat…

Psyho@FakePsyho

I tried testing Fable 5 on a logic puzzle in webUI:
> Claude reached its max length for this message.
> Claude reached its max length for this message.
> Claude reached its max length for this message.
> You're out of usage credits
> Claude reached its max length for this message.
> Claude reached its max length for this message.
> Claude reached its max length for this message.
> You're out of usage credits
> Claude reached its max length for this message.
> Claude reached its max length for this message.
> Claude reached its max length for this message.
> You're out of usage credits
> Claude reached its max length for this message.
> WebUI froze
I don't think we're there yet

Psyho@FakePsyho·1d

@anpaure x.com/anpaure/status… we've exchanged a lot of comments and like no meaningful information 😅

anpaure@anpaure

@FakePsyho is that true? by hiring employees and directing effort towards anything but building rsi you're making an explicit bet against it unless there's an amdahl law kinda thing, you're compute constrained, and marginal capability researchers' value is 0, you're unserious by definition

anpaure@anpaure·1d

@FakePsyho wait which post, like the very first one?

Psyho@FakePsyho·1d

I really dislike posts like this
Literally all of the labs are trying to build "God Himself". The difference is the marketing and that most of the labs are not good at it.

Air Katakana@airkatakana

i don’t understand how anyone could still be surprised
anthropic has made it clear that they are trying to build God Himself. your usage limits are of no concern to them
all claude subs are just donations to this end
if you want to consume a product you should be on chatgpt

Psyho@FakePsyho·1d

@anpaure sure, but your original post sounded like you have major doubts, so I was hoping for something more tangible

anpaure@anpaure·1d

@FakePsyho i'm not saying i don't believe but there's an is-ought gap, they're probably trying, but that doesn't imply they couldn't obviously be much better, and they aren't because [some random internal reason] the basis is talking to friends at labs and reading all the news on twitter

Psyho@FakePsyho·1d

@anpaure yes, labs (OpenAI and Anthropic) do things I'd expect them to do given that they're trying to build machine god
I still don't get why you believe it's not the case

anpaure@anpaure·1d

@FakePsyho you've said it, now do you think the labs are doing a good job with these? i feel like they're not even trying properly on "people to like you" and they're struggling despite putting out *some* effort on the gov side

Psyho@FakePsyho·1d

@anpaure For RSI -> In general; I've switched my line in the middle of the sentence

Psyho@FakePsyho·1d

For RSI you need good researchers, compute, data, governments and people to like you, income, etc. If you're in the lead, I'd imagine you want to minimize variance as well. Any specific actions that you think go against that? One of the worst I can think of was sora app, but I don't see anything substantial lately (for OpenAI at least).

Psyho@FakePsyho·2d

@argus96_adam I hope AI doesn't surpass humans at entertaining people anytime soon

Adam Adamowski@argus96_adam·2d

@FakePsyho The interesting question is whether LLMs can create interesting/creative problems for programming competitions. Has anybody tested that?

Psyho@FakePsyho·2d

Back to rainy Warsaw!
There are still a few things I want to write wrt AWTF: future of competitive programming, what does this result mean and maybe some analysis of the heuristic problem
I have to trim my backlog though, so I'm probably back to my regular tweeting schedule
