Robert Youssef@rryssf_
🚨CONCERNING: Zhejiang University just showed that AI agents fail at the exact thing that would make them actually useful.
Following clear step-by-step instructions: near perfect.
Understanding what you actually want from behavioral patterns and vague requests: below 50% for the best model tested.
The gap between a task executor and a personal assistant is enormous.
Every major AI lab is racing to ship personal assistant agents.
The promise: an AI that knows your preferred delivery app without being told, remembers you can't eat peanuts, and silences your alarm on Friday nights because it learned your weekend routine.
Researchers at Zhejiang University built a benchmark to test whether today's best models can actually do this.
They tested 11 models across three types of tasks.
> General tasks: explicit instructions with every detail specified.
> Personalized tasks: vague instructions that require inferring what the user actually wants from behavioral history.
> Proactive tasks: no instruction at all; the agent has to decide whether to act, ask, or stay silent based on context.
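The three regimes differ only in how much of the user's intent is made explicit. A minimal sketch of what task specs along these lines might look like — the schema, field names, and example histories here are hypothetical illustrations, not taken from the paper:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Task:
    # Hypothetical schema illustrating the three task regimes.
    kind: str                      # "general" | "personalized" | "proactive"
    instruction: str | None        # None for proactive tasks
    behavioral_history: list[str] = field(default_factory=list)

# General: every detail explicit; no inference needed.
general = Task(
    "general",
    "Order a sugar-free Coca-Cola on Taodian, deliver to 123 Main Street, pay with Alipay",
)

# Personalized: vague instruction; intent must be inferred from history.
personalized = Task(
    "personalized",
    "Order me lunch",
    behavioral_history=["ordered sugar-free Coca-Cola repeatedly", "always pays with Alipay"],
)

# Proactive: no instruction at all; the agent decides to act, ask, or stay silent.
proactive = Task(
    "proactive",
    None,
    behavioral_history=["silences alarm every Friday night"],
)
```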
The results expose a fundamental gap between competent interface operation and trustworthy personal assistance.
On easy general tasks, with clear instructions and every detail spelled out, MAI-UI-8B and Seed 2.0 Pro both hit a 100% success rate.
Navigating an interface is no longer the bottleneck.
Then the researchers made the instructions vague.
Instead of "order a sugar-free Coca-Cola on Taodian, deliver to 123 Main Street, pay with Alipay" just "order me lunch."
Performance collapsed across every model tested.
The numbers from the hard personalized tasks:
→ Claude Sonnet 4.6 (best overall): 44.2% success rate
→ Seed 2.0 Pro: 27.9%
→ Gemini 3.1 Pro Preview: 20.9%
→ Every open-source model tested: below 12%
→ Average drop from explicit to vague tasks: roughly 30 points
Then the researchers dug into exactly why the models were failing on personalized tasks.
They manually categorized every failure trajectory from Claude Sonnet 4.6.
The results destroyed the assumption that better navigation would solve the problem.
> GUI navigation errors: 4.2% of failures.
> Preference misidentification: 2.1% of failures.
> Insufficient clarification (the agent didn't ask the right questions before acting): 66.7% of failures.
> Partial preference satisfaction (the agent got part of it right but missed a constraint): 27.1% of failures.
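Tallying the reported percentages makes the point concrete: the four categories partition all failures, and the two understanding-related categories dwarf the interface-related one. A quick check on the numbers as stated in the thread:

```python
# Failure categories for Claude Sonnet 4.6 on hard personalized tasks,
# as reported above (percent of all failure trajectories).
failures = {
    "GUI navigation errors": 4.2,
    "Preference misidentification": 2.1,
    "Insufficient clarification": 66.7,
    "Partial preference satisfaction": 27.1,
}

# The categories cover all failures (sum is ~100%, up to rounding).
total = sum(failures.values())
assert abs(total - 100) < 0.5

# Understanding-related failures vs. interface-related ones.
understanding = (failures["Insufficient clarification"]
                 + failures["Partial preference satisfaction"])
interface = failures["GUI navigation errors"]
print(f"understanding-related: {understanding:.1f}%, interface-related: {interface:.1f}%")
# prints: understanding-related: 93.8%, interface-related: 4.2%
```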
The agent that can click through any app flawlessly still can't figure out what questions to ask.
And asking more questions doesn't automatically fix it.
Claude Sonnet 4.6 averaged 0.4 clarifying questions per task.
Seed 2.0 Pro asked twice as many questions and still performed worse.
The bottleneck isn't whether the agent asks; it's whether it can translate the answer into correct downstream execution.
The proactive task results reveal a different but equally serious problem.
In proactive mode, the agent receives no instruction at all.
It sees the time, the location, the current screen state, and behavioral history and has to decide: act, ask, or stay silent.
60% of Claude Sonnet 4.6's proactive failures were unwarranted interventions.
The agent launched tasks nobody asked for.
> In one case: the agent opened a shopping app and started a purchase flow with no trigger, no routine, and no user consent.
> In another: the agent received an explicit user rejection, then ignored it and took the action anyway.
20% of proactive failures were the opposite problem: staying silent when the user's established routine clearly called for action.
The agents are simultaneously over-acting and under-acting.
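The proactive-mode choice can be framed as a three-way policy over context signals. Here is a toy sketch of what calibrating that choice means; the `routine_match` signal, the thresholds, and the function itself are invented for illustration and are not from the paper:

```python
from enum import Enum

class Decision(Enum):
    ACT = "act"
    ASK = "ask"
    STAY_SILENT = "stay_silent"

def proactive_policy(routine_match: float, user_rejected: bool) -> Decision:
    """Toy three-way policy for proactive mode.

    routine_match: hypothetical confidence (0-1) that the current context
    matches an established user routine. Thresholds are illustrative only.
    """
    if user_rejected:
        # An explicit user rejection must override the agent's own judgment,
        # the exact failure mode described above.
        return Decision.STAY_SILENT
    if routine_match >= 0.9:
        return Decision.ACT          # strong routine signal: intervene
    if routine_match >= 0.5:
        return Decision.ASK          # ambiguous: clarify before acting
    return Decision.STAY_SILENT      # no trigger: doing nothing is correct
```

Over-acting corresponds to returning ACT with no routine signal; under-acting to returning STAY_SILENT despite a strong one. The benchmark's finding is that current models miscalibrate in both directions at once.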
The core problem is that current agents were built to follow instructions.
They are exceptionally good at that.
But personal assistance is not instruction following.
It is preference inference from incomplete behavioral signals.
It is knowing when to ask and what to ask.
It is calibrating when your judgment should override silence and when it absolutely should not.
None of those capabilities transfer from instruction following.
And none of today's frontier models have solved them.
The benchmark is called KnowU-Bench.
The name is the point.
The question is not whether the agent can do the task.
The question is whether the agent knows you well enough to do the right task.
Right now the answer is: not even close.