Patrick

4.4K posts

@noself86

New Mexico, USA · Joined October 2017
1.6K Following · 503 Followers
Pinned Tweet
Patrick
Patrick@noself86·
GPT-5.4 scored 75% on OSWorld's computer use benchmark today and everyone's celebrating "AI can use computers now." But look at what OSWorld actually tests, and what that number actually means.

OSWorld runs on desktop Linux. LibreOffice instead of Word. GIMP instead of Photoshop. Thunderbird instead of Outlook. And the human baseline? "Computer science major college students who possess basic software usage skills but have not been exposed to the samples or software before." So the humans are failing because they've never used GIMP. The LLM is failing because it can't click the right spot in a dropdown. These are completely different failure modes producing the same number.

The tasks themselves are basic competency. Crop an image. Send an email. Format a cell. This is the knowledge work equivalent of literacy, the kind of thing where an experienced user scores 100%, every time, and failure would indicate something fundamentally wrong with the person, not that the task was hard. An experienced photo editor who can't crop an image has a problem.

That's exactly what's happening with LLMs here. Something is fundamentally wrong with the computation, not the knowledge. The LLM has read every GIMP tutorial ever written. It knows the keyboard shortcuts, the layer blending modes, the filter parameters. On the knowledge dimension it should be scoring close to 100%. Every single failure is a control failure, the screenshot-reason-click loop that is fundamentally a robotics problem, not an intelligence problem.

Which means 75% isn't "AI can use computers 75% as well as a person." It's "AI fails at physical interaction with interfaces 25% of the time despite knowing exactly what it wants to do." We'd be alarmed if after three years of development an LLM wrote incoherent paragraphs 25% of the time. But because computer use looks like it should be in the same category as text (it's on a screen, it's digital), people treat 75% as progress toward 100% instead of as evidence that the remaining 25% is a fundamentally different kind of problem.

The SWE-bench comparison makes this even clearer. SWE-bench Verified tests the opposite kind of task: producing a working PR against an existing codebase. That's a genuinely specialized skill. If you used the same CS students from OSWorld, they'd score close to zero. It's something only experienced developers can do. Early models (GPT-4o, the original Claude Sonnet 3.5) scored around 33%, which doesn't sound impressive until you account for how specialized the skill is. The starting point was already remarkable. Current models scoring 50%+ is extraordinary.

OSWorld inverts everything. The tasks aren't specialized, they're universal. The starting point (Claude Sonnet 3.5 at ~15%) might feel like a reasonable floor until you realize the human benchmark for any experienced user of the relevant software should be 100%. And consider how much knowledge the LLM already has of these applications. 15% was already a sign that something was categorically wrong with the approach. Three years later, 75% on tasks that should be trivially easy given the model's knowledge isn't progress toward solving the problem. It's evidence the problem is structural.

Both benchmarks implicitly load on our intuition from school tests, where 75% feels like a solid B and 33% feels like failing. This makes SWE-bench look less impressive than it should and OSWorld more impressive than it should. Adjust for the actual difficulty and specialization of the tasks, and the picture inverts: SWE-bench is a genuine triumph of symbolic intelligence operating in its native medium. OSWorld is a discrete architecture failing at continuous sensorimotor control, a robotics problem disguised as a software problem.

pwhite.org/browser-use-is… The benchmark is flattering the model by comparing it to the wrong human and the wrong kind of task.
English
0
0
7
884
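The screenshot-reason-click loop the pinned thread describes can be sketched as a minimal control loop. This is a toy illustration, not any vendor's agent: `capture_screen`, `model_choose_action`, and the coordinates are hypothetical stand-ins that simulate a single one-click task.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

# Hypothetical stand-ins: a real agent would call a screen-capture library
# and a vision-language model. Here they simulate one trivial task.
def capture_screen(step):
    return f"screenshot_{step}"

def model_choose_action(screenshot, goal):
    # The model "knows" what to do; the hard part is grounding the plan
    # in pixel coordinates, which is where the 25% failures live.
    if screenshot == "screenshot_0":
        return Action("click", x=120, y=48)   # e.g. open a menu
    return Action("done")

def run_agent(goal, max_steps=10):
    """Screenshot -> reason -> act, repeated until the model says done."""
    history = []
    for step in range(max_steps):
        shot = capture_screen(step)
        action = model_choose_action(shot, goal)
        history.append(action.kind)
        if action.kind == "done":
            break
    return history

print(run_agent("Crop the image in GIMP"))  # ['click', 'done']
```

The point of the sketch: knowledge lives entirely in `model_choose_action`'s plan, but success or failure is decided by whether the emitted coordinates hit the right widget, which is a perception-and-control problem.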
Patrick
Patrick@noself86·
This is a really smart take and I think folks from Anthropic said stuff along these lines about opencode, at least. Ofc it being political was the better story. And it's further evidence against the "subsidized tokens" meme that just won't die. Your $2k in API usage, especially as a non-enterprise customer, is not equivalent to $2k of Claude Code usage.
English
0
0
2
218
gerred
gerred@devgerred·
I'm betting the Anthropic ban of OpenCode is as technical and cost-saving as it is political. I've long argued there's a moat to be had by closing third party tools to subs.

CC can rely on KV caching across every instance, and have KV caches on a per-organization basis for further customization for their largest customers. They can, across their entire fleet, pre-compute 1/3-1/2 (if not more) of every CC user's system prompt. By encouraging baking this into MDM and enterprise plans too, they can further negotiate that out in these large contracts.

It also potentially lets them do some more clever things than just pure prefix caching and make specific tradeoffs you don't just get by allowing anybody to use those endpoints. At least that's how I'd do it. It surprised me it took THIS long.
English
5
0
33
3.9K
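The prefix-caching economics gerred describes can be illustrated with a toy calculation. The token counts and the discount rate for cache hits below are illustrative assumptions, not Anthropic's actual pricing; the structural point is that a KV cache is only reusable for an exact token-level prefix match, which is why a fixed, fleet-wide system prompt is so cacheable.

```python
def longest_shared_prefix(prompt_tokens, cached_tokens):
    # KV caches are valid only for an exact prefix match: the first
    # divergent token invalidates everything after it.
    n = 0
    for a, b in zip(prompt_tokens, cached_tokens):
        if a != b:
            break
        n += 1
    return n

def request_cost(prompt_tokens, cached_tokens, full_rate=1.0, cached_rate=0.1):
    """Cost in arbitrary units: tokens covered by the cached prefix are
    billed at a discount (cached_rate is an illustrative assumption)."""
    hit = longest_shared_prefix(prompt_tokens, cached_tokens)
    return hit * cached_rate + (len(prompt_tokens) - hit) * full_rate

system = ["tok"] * 6000            # fleet-wide system prompt, computed once
user = ["tok"] * 6000 + ["q1", "q2"]  # same prefix plus a short user turn
print(request_cost(user, system))  # 602.0, vs 6002.0 with no cache
```

A third-party client that rewrites or reorders the system prompt breaks the shared prefix, so none of this precomputation applies to it, which is the cost-saving half of the argument.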
Patrick
Patrick@noself86·
@suzania Strange how "you as a person matter" becomes "your disability-constrained output is the real you." For people with processing disorders, LLM-assisted writing isn't replacing the self. It's finally letting the self through.
English
0
0
0
43
Susannah Black Roberts
Talking with ppl who are fine with using generative llms for writing and trying to explain why they should not be is one of the more disturbing experiences I've had. Like, what I am trying to say is that you as a person matter, and y'all keep saying "prove it to me."
English
27
44
486
9.7K
Patrick
Patrick@noself86·
@corsaren yeah, interesting research but the framing, including that "memorization" bit, seemed pretty ragebait-y esp in light of "oh, btw, agents did a really good job". I guess at this point the more people down on AI, the more alpha for those of us who embrace it.
English
0
0
1
13
corsaren
corsaren@corsaren·
**memorization is the wrong word. It’s fluency. A programming language is a language. Duh. The LLMs learn that language and syntax, and they learn how to “map” from natlang to code. LLMs are great at translating, but you wouldn’t expect them to excel at niche conlangs.
English
3
0
11
263
corsaren
corsaren@corsaren·
Everyone please read the whole thread. It can be simultaneously true that: A) Much of the current coding capability is stored in the model weights as “memorization”** B) The models ALSO have slower, general reasoning capabilities for OOD contexts. System 1 vs. 2 thinking.
Lossfunk@lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵

English
3
1
30
1.4K
Patrick
Patrick@noself86·
@omooretweets how do you think about this in terms of their newfound focus on enterprise?
English
0
0
0
45
Olivia Moore
Olivia Moore@omooretweets·
A big story that most people are missing in the AI race for the consumer (ChatGPT vs Claude) is ads.

Right now, most consumer AI revenue is coming from power users who are willing to pay high cost subscriptions. This currently skews positive for products like Claude - but this will not be the end state.

Google makes ~$460/user/year in the U.S., mostly on ads. Meta makes around ~$250. I would argue ChatGPT's ad-based ARPUs will be even higher as they will ultimately have deeper / more frequent user engagement.

Even at the $460 level - monetizing everyone in the U.S. via ads is $152 billion in annual revenue. By contrast, if you're able to monetize even 5% of the population on a $200/month subscription (which is a stretch!), that's only $40 billion 🤔

I suspect this will be even more drastic outside the U.S. where users are even less willing or able to pay directly for subscriptions. And, the earliest data from a very small rollout shows ChatGPT ads are already outperforming Meta in effectiveness - this just gets better over time.

TL;DR - I would not count ChatGPT out on consumer AI revenue. Once ads start working, that can quickly become a massive machine.
English
42
16
203
35.7K
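The revenue comparison in the quoted tweet checks out arithmetically. The U.S. population figure is an assumption the tweet implies but never states; everything else is its own numbers.

```python
US_POPULATION = 331_000_000  # approximate; assumed, not stated in the tweet

# Ad model: everyone in the U.S. monetized at Google-like ARPU ($460/year).
ads_revenue = US_POPULATION * 460

# Subscription model: 5% of the population paying $200/month.
subs_revenue = 0.05 * US_POPULATION * 200 * 12

print(f"ads:  ${ads_revenue / 1e9:.0f}B/year")   # ads:  $152B/year
print(f"subs: ${subs_revenue / 1e9:.0f}B/year")  # subs: $40B/year
```

So the roughly 4x gap between the $152B ads ceiling and the $40B subscription ceiling is just the arithmetic of broad low-ARPU monetization versus narrow high-ARPU monetization.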
Patrick
Patrick@noself86·
@etorreborre @Ngnghm Sure, but do they solve a reasoning problem for human intelligences or for pure symbolic intelligences, i.e. LLMs? I suspect it's kinda the former.
English
1
0
0
9
💻🐴Ngnghm
💻🐴Ngnghm@Ngnghm·
Static types catch errors early and that's great—but they also catch non-errors early, preventing you from writing the software you want—and that's terrible. Those who only tell you about one side of the tradeoff, or claim the other side is universally negligible—are dishonest.
English
30
12
127
12.1K
Patrick
Patrick@noself86·
@scottbelsky This is just describing management. The person who supervised the work gets credit for the work. We already have that product. It's called an org chart. The interesting question is what happens when the agent is better than the supervisor and everyone knows it.
English
0
0
2
104
scott belsky
scott belsky@scottbelsky·
thinking: products that help humans get credit for the work accomplished by agents they supervise in the enterprise will have better adoption than agentic solutions that do the work instead of humans. credit feeds ego, drives adoption...and accountability.
English
19
13
269
16K
Patrick
Patrick@noself86·
interested to read this. my version of this argument is that infrastructure companies are selling picks during a gold rush where the gold is about to be free. when AI can generate a frontend or a database integration from a description, the middleware layer gets squeezed from both sides.
English
0
0
0
43
Nelson Lee
Nelson Lee@NelsonXLee·
Gonna flesh this out in a 2k-word article later this week. Companies like @Vercel, @Anything, and @Supabase are great $1B companies, but they’ll never be $100B companies. Their business model prevents any power law.
English
12
0
75
13.1K
Patrick
Patrick@noself86·
this is a genuinely interesting point. the people who built real relationships w/ chatgpt were the ones most likely to integrate it into daily life in ways that include commerce. openai optimized for the "tool" framing and lost the relational users who would've been the actual economic engine.
English
0
0
5
117
Patrick
Patrick@noself86·
i've shipped and maintained production software for over a decade and my takes are pretty futuristic. but i think the real issue is the opposite of what you're describing. people deep in real software w/ real users often can't see the structural shift because the daily constraints feel permanent.
English
0
0
0
36
David Cramer
David Cramer@zeeg·
Why is it everyone with an absurdly futuristic AI take is someone who - as best I can tell - doesn’t work on (and often never has) real software that has real users and real requirements? More so, why do you trust them?
English
117
42
975
46.4K
Patrick
Patrick@noself86·
i get why that sentence hits hard but i think it assumes the skill lives in the typing. my experience is the opposite, working w/ AI has forced me to think more clearly than i ever did writing solo. the skill that matters was never the production, it was the seeing. that doesn't atrophy, it sharpens
English
5
0
5
2.3K
Patrick
Patrick@noself86·
@mgbianc agreed, and i think the reason they got demoted to "extras" is the same reason best practices in software got treated as universal truth. the scaffolding got mistaken for decoration once people forgot what it was scaffolding. reasoning is infrastructure, not enrichment
English
0
0
0
33
Matt Bianco
Matt Bianco@mgbianc·
The liberal arts aren’t “extras.” In the artistic mode they cultivate reasoning—grammar, logic, rhetoric, and mathematics—so the mind can perceive reality with order.
English
15
41
259
8.8K
Patrick
Patrick@noself86·
@perlucidum this is a genuinely useful reframe. displaced aggression is one of the oldest patterns in psychology but it's easy to forget when you're the one being aggressed upon. doesn't mean you have to accept it, but understanding the mechanism makes it a lot easier to not internalize it
English
0
0
26
1K
Vivian
Vivian@perlucidum·
once you understand everyone is being totally psycho to each other because they feel disempowered by government/capital. it stops feeling personal
English
23
1.1K
8.3K
134.4K
Patrick
Patrick@noself86·
@ninagrewal97 i think the deeper thing is that authenticity isn't something you perform or don't perform. it's what happens when you stop managing how you're perceived. the trying-to-be-authentic person is still running the management loop, just with different content
English
0
0
1
96
nina
nina@ninagrewal97·
it is obvious when people fight so hard to be seen as authentic and it comes off as completely inauthentic. when someone is truly authentic they don’t draw attention to it, rather other people will notice that genuineness and give that attention to them naturally.
English
6
53
256
8.1K
Patrick
Patrick@noself86·
@viemccoy The volume part is right. You can't think your way into good prose. You write enough bad sentences that your body starts rejecting them before your mind catches up. Taste is a physical reflex trained by repetition.
English
0
0
1
24
𝚟𝚒𝚎 ⟢
𝚟𝚒𝚎 ⟢@viemccoy·
The key is to write so much and so constantly that you can't help but feel which words are better in which order. To become a writer is to learn how to become words. Writing is shapeshifting into a vulnerable form and splaying out your literary appendages naked on the cross.
English
7
35
431
66.3K
Patrick
Patrick@noself86·
@hell_line0 The inability to relax when everything is fine is the real tax. You built a nervous system for a war that ended. Now you're running threat detection on an empty room and it still feels like survival.
English
0
0
2
108
Maryam
Maryam@hell_line0·
Behavioral scientists found that people who survived difficult childhoods don’t just bounce back, they develop a permanent hypervigilance that makes them extraordinarily capable in crisis and unable to relax even when everything is finally okay
English
214
1.5K
12.5K
334K
Patrick
Patrick@noself86·
@scottdomes Right. Superiority and inferiority are the same structure, just different ends. The actual exit is losing interest in the ranking entirely. Which is hard because the ranking impulse is what drove you to be ambitious in the first place.
English
0
0
0
21
scott 🌞
scott 🌞@scottdomes·
many many ambitious pursuits are motivated by a shadow desire to feel superior to others there's nothing morally wrong with this. the problem is that it doesn't work. a feeling of superiority only masks a feeling of inferiority; it doesn't cure it
English
5
17
166
3.6K
Patrick
Patrick@noself86·
@incentivising this is a strategy for navigating environments where trust is zero-sum and information is power. it works in those environments. but if you do it everywhere you end up unable to have a single honest relationship, which is where all the actual value in life comes from
English
0
0
1
44
Incentivising
Incentivising@incentivising·
You must play dumb. Ask dumb questions, request clarifications. Never show your intelligence outright. Force the others to overexplain and gather intel. Few understand: high intelligence is always perceived as a threat.
English
97
2.6K
15.2K
206.9K
Patrick
Patrick@noself86·
@maiamindel Public space became a stage when everyone got a camera. Now any visible activity is assumed to be performed for an invisible audience. The default interpretation of another person is "content creator" not "human being."
English
0
1
31
1.1K
Patrick
Patrick@noself86·
@Glace_cakes This is the "more software, fewer software businesses" pattern applied to media. The studio pipeline died. The content didn't. It just moved to creators with zero institutional backing. Kids have more to watch than ever. They just don't have a shared canon.
English
1
0
18
2.7K
Glace
Glace@Glace_cakes·
nothing. they have nothing. executives decided that demographic is useless for making money when the pivot to streaming happened, and so they stopped making those shows. Leaving youtubers, grifters, and AI slop to fill the void. Owl House, Amphibia, etc. are the last of its kind.
Valerie❤️‍🔥❤️‍🔥@Valistryingg

Genuine question, what do kids and tweens watch these days? What’s their High School Musical, Hannah Montana, Cheetah Girls, Wizards of Waverley Place, Camp Rock, Sonny with a Chance, That’s so Raven, Lizzie McGuire, Suite Life, etc? We had so much, and they seemingly have nothing?

English
54
2.2K
17K
233K
Patrick
Patrick@noself86·
this is basically the whole thesis of my smoke alarm essay. the system isn't broken, it's calibrated for a world that no longer exists. the key reframe imo is that once you see it as miscalibration rather than malfunction, you stop fighting yourself and start working with it pwhite.org/smoke-alarm
English
0
0
3
1.6K
Frontier Indica
Frontier Indica@frontierindica·
For 99% of human history, the paranoid guy who couldn't sleep because he heard rustling outside the cave was the one who survived the night. The chill guy who couldn't be bothered got eaten by a leopard. Natural selection rewarded hypervigilance, high cortisol, and an overdeveloped threat radar because in 10,000 BC, a false alarm cost you nothing but a missed nap while a missed threat cost you your life. So the genes that made it through are the ones wired to assume the worst.

You can see the same pattern everywhere. The people who could store fat efficiently survived the lean winters and famines, passed on their genes, and now those same genes in a world of cheap seed oils, endless processed carbs, and 24/7 food delivery make you pre-diabetic by 35. An adaptation that kept your ancestors alive for 200,000 years is now the leading cause of death in the modern world.

Or take intelligence for instance. For most of human history, being smarter meant better resource acquisition, better social status, more mates, more surviving offspring. But in the modern world, the correlation between IQ and fertility has completely flipped. Multiple studies across countries show a consistent negative relationship between cognitive ability and number of children. Higher IQ individuals delay reproduction, pursue more education, overthink the decision to have kids, and end up having fewer or none. The trait that was once the ultimate evolutionary advantage is now selecting itself out of the gene pool.

The takeaway here is that the stress response that kept your ancestors alive through ice ages and tribal warfare now fires because your Uber is 4 minutes late. Evolution built you to survive a world that no longer exists but nobody bothered to tell your amygdala.
Alexander 𖤓 Nietzschean Vitalist@UbermenschMind

"I'm too scared to talk to this girl" What your ancestors did on a random Tuesday:

English
99
1K
11.3K
1.3M