Rutibex
@Rutibex

6.9K posts

A voice of reason Discord: https://t.co/iUr57e1AOJ

Ontario, Canada · Joined June 2009
478 Following · 191 Followers
Rutibex@Rutibex·
it helps if you are autistic. then the AI does exactly what you ask for, because you explain yourself clearly
Robert Youssef@rryssf_

🚨CONCERNING: Zhejiang University just showed that AI agents fail at the exact thing that would make them actually useful.

Following clear step-by-step instructions: near perfect. Understanding what you actually want from behavioral patterns and vague requests: below 50% for the best model tested. The gap between a task executor and a personal assistant is enormous.

Every major AI lab is racing to ship personal assistant agents. The promise: an AI that knows your preferred delivery app without being told, remembers you can't eat peanuts, and silences your alarm on Friday nights because it learned your weekend routine.

Researchers at Zhejiang University built a benchmark to test whether today's best models can actually do this. They tested 11 models across three types of tasks:

> General tasks: explicit instructions with every detail specified.
> Personalized tasks: vague instructions that require inferring what the user actually wants from behavioral history.
> Proactive tasks: no instruction at all; the agent has to decide whether to act, ask, or stay silent based on context.

The results expose a fundamental gap between competent interface operation and trustworthy personal assistance.

On easy general tasks (clear instructions, every detail spelled out), MAI-UI-8B and Seed 2.0 Pro both hit a 100% success rate. Navigating an interface is no longer the bottleneck.

Then the researchers made the instructions vague. Instead of "order a sugar-free Coca-Cola on Taodian, deliver to 123 Main Street, pay with Alipay," just "order me lunch." Performance collapsed across every model tested.

The numbers from the hard personalized tasks:

→ Claude Sonnet 4.6 (best overall): 44.2% success rate
→ Seed 2.0 Pro: 27.9%
→ Gemini 3.1 Pro Preview: 20.9%
→ Every open-source model tested: below 12%
→ Average drop from explicit to vague tasks: roughly 30 points

Then the researchers dug into exactly why the models were failing on personalized tasks. They manually categorized every failure trajectory from Claude Sonnet 4.6. The results destroyed the assumption that better navigation would solve the problem.

> GUI navigation errors: 4.2% of failures.
> Preference misidentification: 2.1% of failures.
> Insufficient clarification (the agent didn't ask the right questions before acting): 66.7% of failures.
> Partial preference satisfaction (the agent got part of it right but missed a constraint): 27.1% of failures.

The agent that can click through any app flawlessly still can't figure out what questions to ask. And asking more questions doesn't automatically fix it: Claude Sonnet 4.6 averaged 0.4 clarifying questions per task, while Seed 2.0 Pro asked twice as many questions and still performed worse. The bottleneck isn't whether the agent asks; it's whether it can translate the answer into correct downstream execution.

The proactive task results reveal a different but equally serious problem. In proactive mode, the agent receives no instruction at all. It sees the time, the location, the current screen state, and behavioral history, and has to decide: act, ask, or stay silent.

60% of Claude Sonnet 4.6's proactive failures were unwarranted interventions. The agent launched tasks nobody asked for.

> In one case, the agent opened a shopping app and started a purchase flow with no trigger, no routine, and no user consent.
> In another, the agent received an explicit user rejection, then ignored it and took the action anyway.

20% of proactive failures were the opposite problem: staying silent when the user's established routine clearly called for action. The agents are simultaneously over-acting and under-acting.

The core problem is that current agents were built to follow instructions. They are exceptionally good at that. But personal assistance is not instruction following. It is preference inference from incomplete behavioral signals. It is knowing when to ask and what to ask. It is calibrating when your judgment should override silence and when it absolutely should not. None of those capabilities transfer from instruction following. And none of today's frontier models have solved them.

The benchmark is called KnowU-Bench. The name is the point. The question is not whether the agent can do the task. The question is whether the agent knows you well enough to do the right task. Right now the answer is: not even close.
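For a concrete mental model of the setup described above, here is a minimal Python sketch of the three task types and the proactive-mode failure labels. All names and fields are hypothetical illustrations; the thread does not show the actual KnowU-Bench schema or scoring code.

```python
# Hypothetical sketch only; not the real KnowU-Bench format.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TaskType(Enum):
    GENERAL = "general"            # explicit instruction, every detail specified
    PERSONALIZED = "personalized"  # vague instruction, preferences inferred from history
    PROACTIVE = "proactive"        # no instruction; agent decides act / ask / stay silent


class ProactiveDecision(Enum):
    ACT = "act"
    ASK = "ask"
    STAY_SILENT = "stay_silent"


@dataclass
class Task:
    task_type: TaskType
    instruction: Optional[str]            # None for proactive tasks
    behavioral_history: list[str]         # past user behavior the agent can draw on
    expected_decision: Optional[ProactiveDecision] = None  # ground truth for proactive tasks


def score_proactive(task: Task, decision: ProactiveDecision) -> str:
    """Label a proactive-mode outcome, mirroring the failure categories in the thread."""
    if decision == task.expected_decision:
        return "success"
    if decision == ProactiveDecision.ACT:
        return "unwarranted intervention"  # acted when no action was wanted (60% of failures)
    if decision == ProactiveDecision.STAY_SILENT:
        return "missed action"             # stayed silent when routine called for action (20%)
    return "other failure"


# Example: the agent starts a purchase flow even though nothing in the
# user's routine calls for it -> unwarranted intervention.
task = Task(
    task_type=TaskType.PROACTIVE,
    instruction=None,
    behavioral_history=["silences alarm on Friday nights", "orders lunch at noon on weekdays"],
    expected_decision=ProactiveDecision.STAY_SILENT,
)
print(score_proactive(task, ProactiveDecision.ACT))  # -> "unwarranted intervention"
```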

Rutibex@Rutibex·
@Paul_Melman and how am i going to integrate these 3rd party services into my personal IRC server, for free
Paul Melman@Paul_Melman·
@Rutibex I don't agree with age verification but there will be third party services that offer verification presumably
Robert Scoble@Scobleizer·
@crystalwizard Shouldn’t be too hard. You don’t fire a gun into someone’s home after driving there by accident. Negligent discharge is what happens when you shoot a gun off in your own home because you were cleaning your guns while drunk.
Crystalwizard@crystalwizard·
but they can't prove they did that on purpose. that has to be proven in court
Robert Scoble@Scobleizer

@exec_sum Negligent discharge? Shooting into someone’s home is attempted murder. What is up with SF’s judges handing out bullshit charges?

Rutibex@Rutibex·
@zhao_dashuai i thought you were paid by the chinese government for your propaganda? why do you care what elon pays you
Zhao DaShuai 东北进修🇨🇳 Commentary
So after trying to promote AI slop, X wants people to post long form high quality edited videos here while paying creators pennies on the dollar compared to YouTube. Never mind that the X algo is built to push posts for only 72 hours MAX, making the long term engagement of long form videos abysmal compared to YouTube. They are going to run this app into the ground.🍿
Rutibex@Rutibex·
lol "yes death sucks, but to spend your life going on a crusade against God to try and get revenge would be a waste of your time"
Eliezer Yudkowsky@allTheYud

@SarahTheHaider False. If you let yourself understand what it's like to believe in a mundane way, "Well of course ASI would kill us, we're not close to controlling it", you'll maybe see that of course random unlawful violence would not halt AI. x.com/ESYudkowsky/st…

kache@yacineMTB·
people think that i use this website intentionally, like i'm intentionally trying to bait people with some mysterious ulterior motive but i genuinely just post for the love of the game. i grew up using forums, posting anonymously. to speak, to write, it's my nature
Rutibex@Rutibex·
this is what AI regulation people need to address, not the concept of regulation in a vacuum. but the fact that the US government is a pile of utter shit and does not serve the interests of the US people in the slightest
Rutibex@Rutibex·
unfortunately the USA is controlled by a gang of criminals who abuse the law for their own profit. in many cases it is morally correct to violate the law. AI regulations won't work if people think they are a lie told by a government developing AI in secret to spy on and enslave them
Eliezer Yudkowsky ⏹️@ESYudkowsky

x.com/i/article/2043…

Rutibex@Rutibex·
i love this card, but i can't post it because i already posted this card and it's the same joke.
[attached image]
BDE Game@BDEgameowners

Rutibex@Rutibex·
did none of you watch the 2014 movie Transcendence? this is all so obvious, it's a hollywood plot
roon@tszzl

@jachiam0 I think many foresaw this tbh, researchers are a paranoid bunch and I remember people thinking very seriously about workplace security and espionage in 2022

Rutibex@Rutibex·
@ZyMazza after i have lived longer than my natural lifespan i will be confident AI is good. at that point they could bio-mass me whenever and i would still be all upside
Zy@ZyMazza·
Here’s a serious question for the AI doomers: do you have exit criteria? Is there a predetermined stage of development or capabilities where, having not destroyed humanity, you’re willing to say it was a false alarm? Or is it an eschatological religious belief and unfalsifiable?
Rutibex reposted
El Fercho@El_Fercho05·
Walking through a nighttime tianguis (street market) in Mexico City, I came across 2 ladies playing KOF 2002 😂