
People often talk about RAG and fine-tuning like you are picking between two clean options.
"If you need facts, use RAG. If you need behavior, fine-tune." That sounds nice in an article.
In a real system, it is not that simple.
For the clinic project, I did not skip RAG in favor of fine-tuning. I started with RAG because that was the right move. RAG gave us a big jump: a prompt-rule fix plus two new skills moved the system from about 0.65 to a stable 0.85. No GPU. About an hour of work.
That is the basic playbook. Fix retrieval first. Add the right tools. Tighten the rules. Make sure the model has the facts it needs. But then we hit the wall. The last 15% was not a retrieval problem anymore. The facts were there. The model still derailed in different ways across runs. Same question, different failure.
One run might confuse Gonadorelin with Sermorelin. Another might inject a weird SQL placeholder mid-response. That does not mean RAG failed. That means the base model was losing consistency.
And this is where most RAG vs fine-tuning articles get too clean for their own good.
IBM’s own article frames RAG as connecting the model to internal data so it can return more accurate answers, while fine-tuning improves performance on domain-specific tasks. That is not wrong. But the takeaway most people pull from it is too shallow:
"RAG gives you accuracy. Fine-tuning gives you behavior."
Reality is messier. RAG can still fail if retrieval pulls the wrong context, if chunking is weak, if ranking is off, or if the model ignores the evidence. Fine-tuning can still fail if the data is bad, if the labels are sloppy, or if you are trying to teach facts that should live in retrieval.
The real question is not RAG or fine-tuning. The real question is: which failure mode are you solving?
Missing facts? Use RAG. Wrong behavior? Use fine-tuning. Unsafe action flow? You need governance.
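If it helps to see that triage as code, here is a rough sketch. The failure labels and fix descriptions are mine, invented for illustration, not from any framework:

# Rough triage sketch: map the failure you are actually seeing to the lever
# that fixes it. Labels and descriptions are illustrative only.
FIXES = {
    "missing_facts": "RAG: better retrieval, chunking, ranking, grounding",
    "wrong_behavior": "fine-tuning: fix tone, format, domain consistency",
    "unsafe_action_flow": "governance: evidence-gated beliefs, constraint-gated actions",
}

def triage(failure_mode: str) -> str:
    if failure_mode not in FIXES:
        raise ValueError(f"unknown failure mode: {failure_mode!r}")
    return FIXES[failure_mode]

print(triage("wrong_behavior"))
# -> fine-tuning: fix tone, format, domain consistency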
That is why I built the Blackboard Kernel. zenodo.org/records/186918…
The point was simple. As AI systems move from isolated chatbots into agents and workflows, the failure mode changes. It is no longer only "the model hallucinated." Sometimes the system commits a belief without evidence. Sometimes it takes action before constraints are satisfied. Sometimes the glue code lets an unsafe step through because nothing is enforcing internal state, evidence, and action gates.
That is the problem I built for. Typed internal state. Evidence-based belief commitment. Constraint-gated action execution.
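To make those three ideas concrete, here is a minimal Python sketch of the pattern. It is my illustration, not the actual Blackboard Kernel code, and every name in it is invented for the example:

from dataclasses import dataclass, field

@dataclass
class Belief:
    claim: str
    evidence: list[str]  # source IDs backing the claim

@dataclass
class Blackboard:
    beliefs: dict[str, Belief] = field(default_factory=dict)

    def commit(self, key: str, claim: str, evidence: list[str]) -> None:
        # Evidence-based belief commitment: nothing enters the typed state
        # without at least one piece of supporting evidence.
        if not evidence:
            raise ValueError(f"refusing to commit {key!r} without evidence")
        self.beliefs[key] = Belief(claim, evidence)

def execute(board: Blackboard, action: str, constraints) -> str:
    # Constraint-gated action execution: every gate over the current
    # beliefs must pass before the action is allowed to run.
    unmet = [c.__name__ for c in constraints if not c(board)]
    if unmet:
        return f"BLOCKED {action}: unmet constraints {unmet}"
    return f"EXECUTED {action}"

def drug_identity_confirmed(board: Blackboard) -> bool:
    b = board.beliefs.get("drug")
    return b is not None and len(b.evidence) > 0

board = Blackboard()
print(execute(board, "schedule_dose", [drug_identity_confirmed]))  # BLOCKED
board.commit("drug", "Gonadorelin, not Sermorelin", ["chart_note_112"])
print(execute(board, "schedule_dose", [drug_identity_confirmed]))  # EXECUTED

The point of the sketch is that the unsafe path is blocked by the architecture, not by hoping the model behaves.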
In the controlled evaluation, the deterministic BK agent reached 100.0% task success with zero unsafe actions. The LLM-backed BK agent reached 99.0% task success with zero unsafe actions. Baseline architectures produced unsafe actions in 38.7% to 43.0% of episodes.
So when I say the last 15% was not a RAG problem, I mean that literally. We already harvested what RAG could give us. The next lever is fine-tuning because the remaining issue is model behavior.
And beyond fine-tuning, the deeper layer is governed cognition. Facts belong in retrieval. Behavior belongs in fine-tuning. Safety-critical action flow belongs in the system architecture. That is the part most articles "hallucinate" about.
And about the picture, I almost forgot. That is the second ASUS GX10 being added today.
Together, the two boxes move this from a local AI workstation into a small private AI cluster.
256 GB aggregate unified memory, up to 2 petaFLOPS of FP4 AI compute, 40 ARM CPU cores.
Linked, they can handle models up to the 405B class.
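Quick back-of-the-envelope on why 405B is roughly the ceiling (my arithmetic, not a vendor spec): at 4-bit weights, 405B parameters is about 200 GB, which leaves some headroom inside 256 GB for KV cache and activations.

# Rough capacity check, my arithmetic only: do 4-bit weights for a
# 405B-parameter model fit in 256 GB of aggregate unified memory?
params = 405e9
bytes_per_param = 0.5  # 4-bit (FP4/INT4) quantized weights
weights_gb = params * bytes_per_param / 1e9

total_gb = 256
headroom_gb = total_gb - weights_gb  # left for KV cache, activations, overhead

print(f"weights ~{weights_gb:.0f} GB, headroom ~{headroom_gb:.0f} GB of {total_gb} GB")
# -> weights ~202 GB, headroom ~54 GB of 256 GB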
Not @hackingdave's H100 level yet. LOL




