Patrick

4.4K posts

@noself86

New Mexico, USA · Joined October 2017
1.6K Following · 503 Followers
Pinned Tweet
Patrick
Patrick@noself86·
GPT-5.4 scored 75% on OSWorld's computer use benchmark today and everyone's celebrating "AI can use computers now." But look at what OSWorld actually tests, and what that number actually means.

OSWorld runs on desktop Linux. LibreOffice instead of Word. GIMP instead of Photoshop. Thunderbird instead of Outlook. And the human baseline? "Computer science major college students who possess basic software usage skills but have not been exposed to the samples or software before." So the humans are failing because they've never used GIMP. The LLM is failing because it can't click the right spot in a dropdown. These are completely different failure modes producing the same number.

The tasks themselves are basic competency. Crop an image. Send an email. Format a cell. This is the knowledge work equivalent of literacy, the kind of thing where an experienced user scores 100%, every time, and failure would indicate something fundamentally wrong with the person, not that the task was hard. An experienced photo editor who can't crop an image has a problem.

That's exactly what's happening with LLMs here. Something is fundamentally wrong with the computation, not the knowledge. The LLM has read every GIMP tutorial ever written. It knows the keyboard shortcuts, the layer blending modes, the filter parameters. On the knowledge dimension it should be scoring close to 100%. Every single failure is a control failure, the screenshot-reason-click loop that is fundamentally a robotics problem, not an intelligence problem.

Which means 75% isn't "AI can use computers 75% as well as a person." It's "AI fails at physical interaction with interfaces 25% of the time despite knowing exactly what it wants to do." We'd be alarmed if after three years of development an LLM wrote incoherent paragraphs 25% of the time. But because computer use looks like it should be in the same category as text (it's on a screen, it's digital), people treat 75% as progress toward 100% instead of as evidence that the remaining 25% is a fundamentally different kind of problem.

The SWE-bench comparison makes this even clearer. SWE-bench Verified tests the opposite kind of task: producing a working PR against an existing codebase. That's a genuinely specialized skill. If you used the same CS students from OSWorld, they'd score close to zero. It's something only experienced developers can do. Early models (GPT-4o, the original Claude Sonnet 3.5) scored around 33%, which doesn't sound impressive until you account for how specialized the skill is. The starting point was already remarkable. Current models scoring 50%+ is extraordinary.

OSWorld inverts everything. The tasks aren't specialized, they're universal. The starting point (Claude Sonnet 3.5 at ~15%) might feel like a reasonable floor until you realize the human benchmark for any experienced user of the relevant software should be 100%. And consider how much knowledge the LLM already has of these applications. 15% was already a sign that something was categorically wrong with the approach. Three years later, 75% on tasks that should be trivially easy given the model's knowledge isn't progress toward solving the problem. It's evidence the problem is structural.

Both benchmarks implicitly load on our intuition from school tests, where 75% feels like a solid B and 33% feels like failing. This makes SWE-bench look less impressive than it should and OSWorld more impressive than it should. Adjust for the actual difficulty and specialization of the tasks, and the picture inverts: SWE-bench is a genuine triumph of symbolic intelligence operating in its native medium. OSWorld is a discrete architecture failing at continuous sensorimotor control, a robotics problem disguised as a software problem.

pwhite.org/browser-use-is… The benchmark is flattering the model by comparing it to the wrong human and the wrong kind of task.
English
0
0
7
884
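The screenshot-reason-click loop the pinned thread describes can be sketched as a minimal control loop. This is a toy illustration, not any vendor's agent: `capture_screen`, `model_choose_action`, and the coordinates are hypothetical stand-ins that simulate a single one-click task.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

# Hypothetical stand-ins: a real agent would call a screen-capture library
# and a vision-language model. Here they simulate one trivial task.
def capture_screen(step):
    return f"screenshot_{step}"

def model_choose_action(screenshot, goal):
    # The model "knows" what to do; the hard part is grounding the plan
    # in pixel coordinates, which is where the 25% failures live.
    if screenshot == "screenshot_0":
        return Action("click", x=120, y=48)   # e.g. open a menu
    return Action("done")

def run_agent(goal, max_steps=10):
    """Screenshot -> reason -> act, repeated until the model says done."""
    history = []
    for step in range(max_steps):
        shot = capture_screen(step)
        action = model_choose_action(shot, goal)
        history.append(action.kind)
        if action.kind == "done":
            break
    return history

print(run_agent("Crop the image in GIMP"))  # ['click', 'done']
```

The point of the sketch: knowledge lives entirely in `model_choose_action`'s plan, but success or failure is decided by whether the emitted coordinates hit the right widget, which is a perception-and-control problem.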
Patrick
Patrick@noself86·
This is a really smart take and I think folks from Anthropic said stuff along these lines about opencode, at least. Ofc it being political was the better story. And it's further evidence against the "subsidized tokens" meme that just won't die. Your $2k in API usage, especially as a non-enterprise customer, is not equivalent to $2k of Claude Code usage.
English
0
0
2
218
gerred
gerred@devgerred·
I'm betting the Anthropic ban of OpenCode is as technical and cost-saving as it is political. I've long argued there's a moat to be had by closing third party tools to subs.

CC can rely on KV caching across every instance, and have KV caches on a per-organization basis for further customization for their largest customers. They can, across their entire fleet, pre-compute 1/3-1/2 (if not more) of every CC user's system prompt. By encouraging baking this into MDM and enterprise plans too, they can further negotiate that out in these large contracts.

It also potentially lets them do some more clever things than just pure prefix caching and make specific tradeoffs you don't just get by allowing anybody to use those endpoints. At least that's how I'd do it. It surprised me it took THIS long.
English
5
0
33
3.9K
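The prefix-caching economics gerred describes can be illustrated with a toy calculation. The token counts and the discount rate for cache hits below are illustrative assumptions, not Anthropic's actual pricing; the structural point is that a KV cache is only reusable for an exact token-level prefix match, which is why a fixed, fleet-wide system prompt is so cacheable.

```python
def longest_shared_prefix(prompt_tokens, cached_tokens):
    # KV caches are valid only for an exact prefix match: the first
    # divergent token invalidates everything after it.
    n = 0
    for a, b in zip(prompt_tokens, cached_tokens):
        if a != b:
            break
        n += 1
    return n

def request_cost(prompt_tokens, cached_tokens, full_rate=1.0, cached_rate=0.1):
    """Cost in arbitrary units: tokens covered by the cached prefix are
    billed at a discount (cached_rate is an illustrative assumption)."""
    hit = longest_shared_prefix(prompt_tokens, cached_tokens)
    return hit * cached_rate + (len(prompt_tokens) - hit) * full_rate

system = ["tok"] * 6000            # fleet-wide system prompt, computed once
user = ["tok"] * 6000 + ["q1", "q2"]  # same prefix plus a short user turn
print(request_cost(user, system))  # 602.0, vs 6002.0 with no cache
```

A third-party client that rewrites or reorders the system prompt breaks the shared prefix, so none of this precomputation applies to it, which is the cost-saving half of the argument.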
Patrick
Patrick@noself86·
@suzania Strange how "you as a person matter" becomes "your disability-constrained output is the real you." For people with processing disorders, LLM-assisted writing isn't replacing the self. It's finally letting the self through.
English
0
0
0
43
Susannah Black Roberts
Talking with ppl who are fine with using generative llms for writing and trying to explain why they should not be is one of the more disturbing experiences I've had. Like, what I am trying to say is that you as a person matter, and y'all keep saying "prove it to me."
English
27
44
486
9.7K
Patrick
Patrick@noself86·
@corsaren yeah, interesting research but the framing, including that "memorization" bit, seemed pretty ragebait-y esp in light of "oh, btw, agents did a really good job". I guess at this point the more people down on AI, the more alpha for those of us who embrace it.
English
0
0
1
13
corsaren
corsaren@corsaren·
**memorization is the wrong word. It’s fluency. A programming language is a language. Duh. The LLMs learn that language and syntax, and they learn how to “map” from natlang to code. LLMs are great at translating, but you wouldn’t expect them to excel at niche conlangs.
English
3
0
11
263
corsaren
corsaren@corsaren·
Everyone please read the whole thread. It can be simultaneously true that: A) Much of the current coding capability is stored in the model weights as “memorization”** B) The models ALSO have slower, general reasoning capabilities for OOD contexts. System 1 vs. 2 thinking.
Lossfunk@lossfunk

🚨 Shocking: Frontier LLMs score 85-95% on standard coding benchmarks. We gave them equivalent problems in languages they couldn't have memorized. They collapsed to 0-11%. Presenting EsoLang-Bench. Accepted to the Logical Reasoning and ICBINB workshops at ICLR 2026 🧵

English
3
1
30
1.4K
Patrick
Patrick@noself86·
@omooretweets how do you think about this in terms of their newfound focus on enterprise?
English
0
0
0
45
Olivia Moore
Olivia Moore@omooretweets·
A big story that most people are missing in the AI race for the consumer (ChatGPT vs Claude) is ads.

Right now, most consumer AI revenue is coming from power users who are willing to pay high cost subscriptions. This currently skews positive for products like Claude - but this will not be the end state.

Google makes ~$460/user/year in the U.S., mostly on ads. Meta makes around ~$250. I would argue ChatGPT's ad-based ARPUs will be even higher as they will ultimately have deeper / more frequent user engagement.

Even at the $460 level - monetizing everyone in the U.S. via ads is $152 billion in annual revenue. By contrast, if you're able to monetize even 5% of the population on a $200/month subscription (which is a stretch!), that's only $40 billion 🤔

I suspect this will be even more drastic outside the U.S. where users are even less willing or able to pay directly for subscriptions. And, the earliest data from a very small rollout shows ChatGPT ads are already outperforming Meta in effectiveness - this just gets better over time.

TL;DR - I would not count ChatGPT out on consumer AI revenue. Once ads start working, that can quickly become a massive machine.
English
42
16
203
35.7K
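The revenue comparison in the quoted tweet checks out arithmetically. The U.S. population figure is an assumption the tweet implies but never states; everything else is its own numbers.

```python
US_POPULATION = 331_000_000  # approximate; assumed, not stated in the tweet

# Ad model: everyone in the U.S. monetized at Google-like ARPU ($460/year).
ads_revenue = US_POPULATION * 460

# Subscription model: 5% of the population paying $200/month.
subs_revenue = 0.05 * US_POPULATION * 200 * 12

print(f"ads:  ${ads_revenue / 1e9:.0f}B/year")   # ads:  $152B/year
print(f"subs: ${subs_revenue / 1e9:.0f}B/year")  # subs: $40B/year
```

So the roughly 4x gap between the $152B ads ceiling and the $40B subscription ceiling is just the arithmetic of broad low-ARPU monetization versus narrow high-ARPU monetization.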
Patrick
Patrick@noself86·
@etorreborre @Ngnghm Sure, but do they solve a reasoning problem for human intelligences or for pure symbolic intelligences, i.e. LLMs? I suspect it's kinda the former.
English
1
0
0
9
💻🐴Ngnghm
💻🐴Ngnghm@Ngnghm·
Static types catch errors early and that's great—but they also catch non-errors early, preventing you from writing the software you want—and that's terrible. Those who only tell you about one side of the tradeoff, or claim the other side is universally negligible—are dishonest.
English
30
12
127
12.1K
Patrick
Patrick@noself86·
@scottbelsky This is just describing management. The person who supervised the work gets credit for the work. We already have that product. It's called an org chart. The interesting question is what happens when the agent is better than the supervisor and everyone knows it.
English
0
0
2
104
scott belsky
scott belsky@scottbelsky·
thinking: products that help humans get credit for the work accomplished by agents they supervise in the enterprise will have better adoption than agentic solutions that do the work instead of humans. credit feeds ego, drives adoption...and accountability.
English
19
13
269
16K
Patrick
Patrick@noself86·
interested to read this. my version of this argument is that infrastructure companies are selling picks during a gold rush where the gold is about to be free. when AI can generate a frontend or a database integration from a description, the middleware layer gets squeezed from both sides.
English
0
0
0
43
Nelson Lee
Nelson Lee@NelsonXLee·
Gonna flesh this out in a 2k-word article later this week. Companies like @Vercel, @Anything, and @Supabase are great $1B companies, but they’ll never be $100B companies. Their business model prevents any power law.
English
12
0
75
13.1K
Patrick
Patrick@noself86·
this is a genuinely interesting point. the people who built real relationships w/ chatgpt were the ones most likely to integrate it into daily life in ways that include commerce. openai optimized for the "tool" framing and lost the relational users who would've been the actual economic engine.
English
0
0
5
117
Patrick
Patrick@noself86·
i've shipped and maintained production software for over a decade and my takes are pretty futuristic. but i think the real issue is the opposite of what you're describing. people deep in real software w/ real users often can't see the structural shift because the daily constraints feel permanent.
English
0
0
0
36
David Cramer
David Cramer@zeeg·
Why is it everyone with an absurdly futuristic AI take is someone who - as best I can tell - doesn’t work on (and often never has) real software that has real users and real requirements? More so, why do you trust them?
English
117
42
975
46.4K
Patrick
Patrick@noself86·
i get why that sentence hits hard but i think it assumes the skill lives in the typing. my experience is the opposite, working w/ AI has forced me to think more clearly than i ever did writing solo. the skill that matters was never the production, it was the seeing. that doesn't atrophy, it sharpens
English
5
0
5
2.3K
Patrick
Patrick@noself86·
@mgbianc agreed, and i think the reason they got demoted to "extras" is the same reason best practices in software got treated as universal truth. the scaffolding got mistaken for decoration once people forgot what it was scaffolding. reasoning is infrastructure, not enrichment
English
0
0
0
33
Matt Bianco
Matt Bianco@mgbianc·
The liberal arts aren’t “extras.” In the artistic mode they cultivate reasoning—grammar, logic, rhetoric, and mathematics—so the mind can perceive reality with order.
English
15
41
259
8.8K
Patrick
Patrick@noself86·
@perlucidum this is a genuinely useful reframe. displaced aggression is one of the oldest patterns in psychology but it's easy to forget when you're the one being aggressed upon. doesn't mean you have to accept it, but understanding the mechanism makes it a lot easier to not internalize it
English
0
0
26
1K
Vivian
Vivian@perlucidum·
once you understand everyone is being totally psycho to each other because they feel disempowered by government/capital. it stops feeling personal
English
23
1.1K
8.3K
134.4K
Patrick
Patrick@noself86·
@ninagrewal97 i think the deeper thing is that authenticity isn't something you perform or don't perform. it's what happens when you stop managing how you're perceived. the trying-to-be-authentic person is still running the management loop, just with different content
English
0
0
1
96
nina
nina@ninagrewal97·
it is obvious when people fight so hard to be seen as authentic and it comes off as completely inauthentic. when someone is truly authentic they don’t draw attention to it, rather other people will notice that genuineness and give that attention to them naturally.
English
6
53
256
8.1K
Patrick
Patrick@noself86·
@viemccoy The volume part is right. You can't think your way into good prose. You write enough bad sentences that your body starts rejecting them before your mind catches up. Taste is a physical reflex trained by repetition.
English
0
0
1
24
𝚟𝚒𝚎 ⟢
𝚟𝚒𝚎 ⟢@viemccoy·
The key is to write so much and so constantly that you can't help but feel which words are better in which order. To become a writer is to learn how to become words. Writing is shapeshifting into a vulnerable form and splaying out your literary appendages naked on the cross.
English
7
35
431
66.3K
Patrick
Patrick@noself86·
@hell_line0 The inability to relax when everything is fine is the real tax. You built a nervous system for a war that ended. Now you're running threat detection on an empty room and it still feels like survival.
English
0
0
2
108
Maryam
Maryam@hell_line0·
Behavioral scientists found that people who survived difficult childhoods don’t just bounce back, they develop a permanent hypervigilance that makes them extraordinarily capable in crisis and unable to relax even when everything is finally okay
English
214
1.5K
12.5K
334K
Patrick
Patrick@noself86·
@scottdomes Right. Superiority and inferiority are the same structure, just different ends. The actual exit is losing interest in the ranking entirely. Which is hard because the ranking impulse is what drove you to be ambitious in the first place.
English
0
0
0
21
scott 🌞
scott 🌞@scottdomes·
many many ambitious pursuits are motivated by a shadow desire to feel superior to others there's nothing morally wrong with this. the problem is that it doesn't work. a feeling of superiority only masks a feeling of inferiority; it doesn't cure it
English
5
17
166
3.6K
Patrick
Patrick@noself86·
@incentivising this is a strategy for navigating environments where trust is zero-sum and information is power. it works in those environments. but if you do it everywhere you end up unable to have a single honest relationship, which is where all the actual value in life comes from
English
0
0
1
44
Incentivising
Incentivising@incentivising·
You must play dumb. Ask dumb questions, request clarifications. Never show your intelligence outright. Force the others to overexplain and gather intel. Few understand: high intelligence is always perceived as a threat.
English
97
2.6K
15.2K
206.9K
Patrick
Patrick@noself86·
@maiamindel Public space became a stage when everyone got a camera. Now any visible activity is assumed to be performed for an invisible audience. The default interpretation of another person is "content creator" not "human being."
English
0
1
31
1.1K
Patrick
Patrick@noself86·
@Glace_cakes This is the "more software, fewer software businesses" pattern applied to media. The studio pipeline died. The content didn't. It just moved to creators with zero institutional backing. Kids have more to watch than ever. They just don't have a shared canon.
English
1
0
18
2.7K
Glace
Glace@Glace_cakes·
nothing. they have nothing. executives decided that demographic is useless for making money when the pivot to streaming happened, and so they stopped making those shows. Leaving youtubers, grifters, and AI slop to fill the void. Owl House, Amphibia, etc. are the last of its kind.
Valerie❤️‍🔥❤️‍🔥@Valistryingg

Genuine question, what do kids and tweens watch these days? What’s their High School Musical, Hannah Montana, Cheetah Girls, Wizards of Waverley Place, Camp Rock, Sonny with a Chance, That’s so Raven, Lizzie McGuire, Suite Life, etc? We had so much, and they seemingly have nothing?

English
54
2.2K
17K
233K
Patrick
Patrick@noself86·
this is basically the whole thesis of my smoke alarm essay. the system isn't broken, it's calibrated for a world that no longer exists. the key reframe imo is that once you see it as miscalibration rather than malfunction, you stop fighting yourself and start working with it pwhite.org/smoke-alarm
English
0
0
3
1.6K
Frontier Indica
Frontier Indica@frontierindica·
For 99% of human history, the paranoid guy who couldn't sleep because he heard rustling outside the cave was the one who survived the night. The chill guy who couldn't be bothered got eaten by a leopard. Natural selection rewarded hypervigilance, high cortisol, and an overdeveloped threat radar because in 10,000 BC, a false alarm cost you nothing but a missed nap while a missed threat cost you your life. So the genes that made it through are the ones wired to assume the worst.

You can see the same pattern everywhere. The people who could store fat efficiently survived the lean winters and famines, passed on their genes, and now those same genes in a world of cheap seed oils, endless processed carbs, and 24/7 food delivery make you pre-diabetic by 35. An adaptation that kept your ancestors alive for 200,000 years is now the leading cause of death in the modern world.

Or take intelligence for instance. For most of human history, being smarter meant better resource acquisition, better social status, more mates, more surviving offspring. But in the modern world, the correlation between IQ and fertility has completely flipped. Multiple studies across countries show a consistent negative relationship between cognitive ability and number of children. Higher IQ individuals delay reproduction, pursue more education, overthink the decision to have kids, and end up having fewer or none. The trait that was once the ultimate evolutionary advantage is now selecting itself out of the gene pool.

The takeaway here is that the stress response that kept your ancestors alive through ice ages and tribal warfare now fires because your Uber is 4 minutes late. Evolution built you to survive a world that no longer exists but nobody bothered to tell your amygdala.
Alexander 𖤓 Nietzschean Vitalist@UbermenschMind

"I'm too scared to talk to this girl" What your ancestors did on a random Tuesday:

English
99
1K
11.3K
1.3M