Shailesh

5.9K posts

@0xThoughtVector

Idea Guy | Independent AI Researcher

Bengaluru · Joined January 2010
1.8K Following · 288 Followers
Pinned Tweet
Shailesh @0xThoughtVector
No more gentle singularity, anon.
[image]
Shailesh @0xThoughtVector
@fidjissimo Please don't make me open Atlas to code with my bud, GPT 5.4, Fidji!
Shailesh @0xThoughtVector
@tensorqt oh. Nice. Bet they were all about the Lich King fucking chicks
tensorqt @tensorqt
@0xThoughtVector i read all of them of course, they're for my entertainment. i use my own vibe-coded app to build the backbone of the stories and have agents build them
Shailesh @0xThoughtVector
I bet you could write a novel with Opus 4.6 in Claude Code.
Shailesh @0xThoughtVector
@tensorqt wait... I have so many questions about the 80k word pieces. what were they about? did you proof-read them? did other humans read them? Anyway. I think writers SHOULD start using these tools, just like coders.
Shailesh @0xThoughtVector
Most criticisms of this paper have been that the chosen languages are unintuitive and would be extremely hard for even humans to adapt to. This raises the question of how much generalization you can, or should, actually expect. Translating Python to JavaScript is expected, but Python to Brainfuck is not. How do you measure the degree of generalization a task needs, and how much of it do you expect?
Lossfunk @lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵

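To make the Python-to-Brainfuck gap concrete, here is a minimal sketch, not taken from the thread or the paper: a tiny Brainfuck interpreter written in Python (`run_bf` is an illustrative helper name I'm introducing), run on `,[.,]`, the classic Brainfuck "cat" program.

```python
def run_bf(code: str, stdin: str = "") -> str:
    """Interpret Brainfuck: 8 single-character commands over a byte tape."""
    tape = [0] * 30000
    ptr = pc = inp = 0
    out = []
    # Pre-match brackets so [ and ] can jump to their partner in O(1).
    jump, stack = {}, []
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jump[i], jump[j] = j, i
    while pc < len(code):
        c = code[pc]
        if c == ">":
            ptr += 1
        elif c == "<":
            ptr -= 1
        elif c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".":
            out.append(chr(tape[ptr]))
        elif c == ",":
            if inp < len(stdin):
                tape[ptr] = ord(stdin[inp])
                inp += 1
            else:
                tape[ptr] = 0  # EOF convention: write a zero byte
        elif c == "[" and tape[ptr] == 0:
            pc = jump[pc]  # skip the loop body
        elif c == "]" and tape[ptr] != 0:
            pc = jump[pc]  # jump back to the loop start
        pc += 1
    return "".join(out)

# ",[.,]" is the classic Brainfuck "cat" program: echo stdin until EOF.
# The Python equivalent is a one-liner; the semantics are identical,
# but the surface syntax shares nothing with mainstream languages.
print(run_bf(",[.,]", "hi"))  # → hi
```

The cat program's logic (read, loop while nonzero, print, read) is trivial in any mainstream language; what collapses is the mapping from intent to an unfamiliar surface syntax, which is exactly the axis the benchmark varies.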
Chase Brower @ChaseBrowe32432
Opus 4.6 in webui can solve even the "extremely hard" problems btw, not sure what their precise methodology was but they must have heavily hamstrung the models.
Quoting Lossfunk @lossfunk
Shailesh @0xThoughtVector
@ChaseBrowe32432 they don't mention Opus 4.5, but they do mention GPT 5.2, which was better than Opus 4.5. Maybe the jump has been that large.
Chase Brower @ChaseBrowe32432
@0xThoughtVector I sincerely doubt we went from "only able to solve easy problems, no mediums, no hards, no extra hards" to "consistently solved every single type of problem including extra hards" from 4.5 -> 4.6. Indeed, if you examine the paper and code, it's full of methodological issues.
Shailesh retweeted
dax @thdxr
opencode 1.3.0 will no longer autoload the claude max plugin. we did our best to convince anthropic to support developer choice, but they sent lawyers. it's your right to access services however you wish, but it is also their right to block whoever they want. we can't maintain an official plugin, so it's been removed from github and marked deprecated on npm. appreciate our partners at openai, github and gitlab who are going the other direction and supporting developer freedom
Shailesh @0xThoughtVector
@petergostev The paper doesn't claim LLMs have no practical utility. The claim is around generalization to input structures where the core logic is the same but the training data is scarce.
Peter Gostev @petergostev
These kinds of claims never pass the sniff test. Benchmarks can be gamed, but if LLMs worked 0-11% of the time on real tasks (which are not part of benchmarks), nobody would ever use them for coding.
Quoting Lossfunk @lossfunk
Shailesh retweeted
Paradigma @paradigmainc
introducing Campaigns (Beta).
[image]
Shailesh @0xThoughtVector
@lossfunk I would have expected a drop in performance, but a complete collapse is very unexpected. Need to update a lot of priors.
Shailesh retweeted
Lossfunk @lossfunk
🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵
Pratyaksh Patel @baldwin_IVth
How much ML or DL do platforms like @KnotDating use? I wonder how they solve the latency problem as the models get bigger/more complex. Would love to try working on something like this.
Dhruv Trehan @dhruvtrehan9
bruh i don't know if i'm just being nitpicky, but high quality original evals / benchmarks for tasks with no easy natural language ground truth data are so hard to build
Shailesh @0xThoughtVector
@elonmusk too much emphasis on video content
Elon Musk @elonmusk
Algorithm is better today than 3 months ago?
Shailesh @0xThoughtVector
Guys I know you don't care but I cooked Butter Chicken for the first time in my life and it was almost perfect. So proud, so I wanted to share.
[image]