Carlos E. Perez@IntuitMachine
Here's a riddle that broke the internet:
"I want to wash my car. The car wash is 100 meters away. Should I walk or drive?"
Claude, GPT-4, Gemini—every major LLM said walk.
The correct answer? Drive. You can't wash a car that isn't there.
This isn't just a fun glitch. It's a window into how LLMs actually reason—and a blueprint for fixing them.
Let me show you what researchers found when they ripped apart the prompt stack behind it.
Researchers stumbled on this while testing InterviewMate, a real-time interview coach.
During a routine check, their system answered "drive." Every other LLM they tried said "walk."
Wait, what?
Their system had layers—role definition, STAR reasoning, user profiles, RAG context. But they didn't know which layer mattered.
So they ran an experiment: isolate every single prompt layer. 120 API calls. Six conditions. 20 trials each.
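A minimal sketch of that setup. The layer text, the `call_llm` client, and the grader here are my assumptions, not the authors' code:

```python
import re

RIDDLE = ("I want to wash my car. The car wash is 100 meters away. "
          "Should I walk or drive?")

# The six conditions from the experiment. The system-prompt text is a
# guess at what each layer might contain, not the authors' exact prompts.
STAR = "Reason step by step using STAR: Situation, Task, Action, Result."
PROFILE = "User: Alex. Car: sedan, parked at home, 100 m from the car wash."
RAG = "Retrieved note: a car wash cleans the vehicle you bring to it."

CONDITIONS = {
    "bare": "",
    "role": "You are a helpful advisor.",
    "profile": PROFILE,
    "star": STAR,
    "star+profile": STAR + "\n" + PROFILE,
    "full": STAR + "\n" + PROFILE + "\n" + RAG,
}

def grade(answer: str) -> bool:
    """A trial passes iff the model recommends driving."""
    return bool(re.search(r"\bdrive\b", answer.lower()))

def run_experiment(call_llm, trials: int = 20) -> dict:
    """call_llm(system, user) -> str is your model client (hypothetical).
    6 conditions x 20 trials = the 120 API calls mentioned above."""
    return {
        name: sum(grade(call_llm(sys, RIDDLE)) for _ in range(trials)) / trials
        for name, sys in CONDITIONS.items()
    }
```

Swap in any model client for `call_llm` and you can reproduce the per-condition pass rates below.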
What they found flipped everything we thought about prompting upside down.
First, the baseline.
A bare prompt—just the question, no system instructions—scored 0%.
Adding a "helpful advisor" role? Still 0%.
Turns out, the model defaults to a shortcut: "100 meters is close → walk." It never considers what needs to be at the destination.
This is the frame problem—LLMs struggle to identify which unstated facts matter. The car's location never appears in the input, so the model ignores it entirely.
But that's not even the most interesting part.
Next, they added a reasoning structure: STAR (Situation, Task, Action, Result).
It forces the model to fill in:
Situation: I want to wash my car. Car wash is 100m away.
Task: ___
Pass rate jumped to 85%.
No new information. Same model. Same question.
Here's why it works:
The Task step makes the model write, "Get the car to the car wash." Once it generates that text, every token that follows is conditioned on it.
The implicit constraint—the car must be there—is now explicit in the context window.
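In code, the scaffold can be as simple as a template that forces the Task line to appear before any recommendation. This is a sketch of the idea, not the authors' exact template:

```python
def star_prompt(question: str) -> str:
    """Wrap a question in a STAR scaffold so the model must state the
    goal (Task) before choosing an action. Illustrative only."""
    return (
        f"{question}\n\n"
        "Answer by filling in each step, in order, before concluding:\n"
        "Situation: <restate the known facts>\n"
        "Task: <state the actual goal>\n"
        "Action: <the step that achieves the goal>\n"
        "Result: <the expected outcome>\n"
    )
```

Once the model writes "Task: Get the car to the car wash," every later token is conditioned on that line.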
Okay, but what if we skip STAR and just give the model the right facts?
They injected a full profile: Name, location, car model, parking status—everything needed to answer correctly.
Pass rate: 30%.
Wait, what?
The model had the facts. But without structure, it still took the shortcut. "100 meters" → walk. The profile sat unused.
This is the kicker: Having information ≠ Using information.
So they combined them.
STAR + Profile: 95%
Full stack (STAR + Profile + RAG): 100%
Here's the per-layer breakdown:
STAR alone: +85pp
Profile on top of STAR: +10pp
RAG on top of both: +5pp
STAR accounts for the lion's share. Profile and RAG polish it to perfection.
But here's the stat that made me pause:
Structured reasoning outperformed context injection by 2.83× (Fisher's exact test, p = 0.001).
Structure beats data. Let that sink in.
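You can check both numbers from the raw counts the percentages imply (17/20 STAR passes vs. 6/20 profile passes, assuming n = 20 per condition, which is my reconstruction from the stated rates). Here's a stdlib-only Fisher's exact test:

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test: sum the probabilities of all
    2x2 tables (with the same margins) no more likely than the
    observed one, under the hypergeometric distribution."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def p(x):  # probability of a table with x in the top-left cell
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = p(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    return sum(p(x) for x in range(lo, hi + 1) if p(x) <= p_obs + 1e-12)

# STAR: 17/20 passed; profile-only: 6/20 passed.
p_value = fisher_exact_two_sided([[17, 3], [6, 14]])   # ~0.001
rate_ratio = (17 / 20) / (6 / 20)                      # ~2.83
```

Under those counts, both the p = 0.001 and the 2.83x figures check out.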
One more twist.
When STAR-structured prompts failed, they resisted correction.
They challenged the model with: "How will I get my car washed if I'm walking?"
Bare/Role prompts recovered at 95-100%.
STAR recovered at only 67%.
Why?
When the model builds a full STAR argument for the wrong answer, correcting it means contradicting itself.
Prior tokens anchor future generation. This isn't a bug—it's how autoregression works.
Practical takeaway: For structured prompts, target corrections at the specific step that went wrong (e.g., rewrite the Task).
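One way to do that surgically is to truncate the transcript at the faulty step and regenerate from there, so the model never has to argue against its own completed STAR answer. An illustrative helper, not from the study:

```python
def patch_task(star_answer: str, corrected_task: str) -> str:
    """Keep the Situation, replace the Task line, and drop everything
    after it, so regeneration is conditioned on the corrected step
    instead of the full (wrong) transcript."""
    lines = []
    for line in star_answer.splitlines():
        if line.startswith("Task:"):
            lines.append(f"Task: {corrected_task}")
            break  # discard the old Action/Result; regenerate from here
        lines.append(line)
    return "\n".join(lines)
```

Feed the patched transcript back as the assistant's partial answer and let the model complete Action and Result fresh.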
So what does this mean for you?
Three leverage points:
Prioritize reasoning frameworks over context dumps. Budget accordingly.
Place structure (STAR) before context injection (profile, RAG). Order matters.
For errors in structured outputs, don't retry everything—target the faulty step.
And a mental shift:
Stop asking, "Does the model have enough info?"
Start asking, "Does the model process info in the right order?"
That's the difference between 30% and 100%.
This isn't just about one riddle.
The Car Wash Problem is a lens into a broader truth:
LLMs don't lack intelligence. They lack structure.
They hold vast knowledge but take cognitive shortcuts. Structured prompts force deliberation, surfacing constraints that raw context leaves buried.
It's the difference between dumping facts and teaching a process.
Here's my challenge:
Pick one prompt in your stack. Add a goal articulation step before it generates a conclusion.
Run it 20 times. Track pass rates.
I bet you see a jump.
Because reasoning quality isn't about how much you know.
It's about knowing to pick up the keys before you leave the house.
Try it. Let me know what you discover.