Noonien Soong @mlcarldev
I gave two AI coding agents the same complex build. Different models, different harnesses. 15 hours later, both still working autonomously.
Claude 4.8 in Claude Code Ultracode vs GLM 5.2 in Droid /missions mode. Same mission, same repo, same 25K-char spec.
Two different model architectures solving the same engineering problem in parallel. Watching where they diverge is the interesting part.
They are building a platform that generates differentiated, professional written output through a multi-stage LLM pipeline… synthesizing from complex intelligence inputs and contextual preparation to produce calibrated document variants across multiple product modes.
This is not a CRUD app or a chatbot wrapper. It’s a multi-stage document synthesis engine.
Pipeline architecture. The engine runs eight discrete stages. Each stage is a separate LLM operation with its own model assignment and reasoning-mode control.
The thinking mode (deep chain of thought vs. fast generation) is toggled per stage via configuration... reasoning on for the stages that need it, off for speed-critical stages.
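A minimal sketch of what per-stage configuration could look like, assuming Python. The stage names, model identifiers, the on/off pattern, and the build_request helper are all hypothetical placeholders, not the actual project code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    name: str       # hypothetical stage label
    model: str      # hypothetical model identifier
    thinking: bool  # True = deep reasoning, False = fast generation


# Eight stages, as described above; names, models and flags are placeholders.
PIPELINE = [
    StageConfig(f"stage_{i}", "model-deep" if deep else "model-fast", deep)
    for i, deep in enumerate(
        [True, True, False, True, False, False, True, False], start=1
    )
]


def build_request(stage: StageConfig, prompt: str) -> dict:
    """Turn one stage's config into the parameters for a single LLM call."""
    return {"model": stage.model, "thinking": stage.thinking, "prompt": prompt}
```

Changing a stage from deep reasoning to fast generation is then a one-line config edit rather than a code change.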
A GroundingStrategy interface means the verification and grounding logic is swappable per use case. Different product modes reuse the same pipeline engine by changing strategy configuration, not by rewriting code.
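To illustrate the idea of a swappable strategy, here is a rough Python sketch. Only the name GroundingStrategy comes from the spec; the unsupported method, the two example strategies, and verify_stage are assumptions made for illustration:

```python
from typing import Protocol


class GroundingStrategy(Protocol):
    def unsupported(self, draft: str, sources: list[str]) -> list[str]:
        """Return the sentences in draft that the sources do not back up."""
        ...


class VerbatimGrounding:
    """Hypothetical strict mode: a sentence must appear verbatim in a source."""

    def unsupported(self, draft: str, sources: list[str]) -> list[str]:
        sentences = [s.strip() for s in draft.split(".") if s.strip()]
        return [s for s in sentences if not any(s in src for src in sources)]


class NoGrounding:
    """Hypothetical permissive mode: nothing is checked."""

    def unsupported(self, draft: str, sources: list[str]) -> list[str]:
        return []


def verify_stage(draft: str, sources: list[str], strategy: GroundingStrategy) -> dict:
    # The engine only knows the interface; the product mode picks the strategy.
    problems = strategy.unsupported(draft, sources)
    return {"ok": not problems, "unsupported": problems}
```

The same verify_stage call serves every product mode; only the strategy object passed in differs.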
The architecture is designed so the engine produces different categories of written output... long form reference documents, structured items, modular content blocks... from the same core by reconfiguration.
Checkpoint and resume. Generation jobs are long-running. The pipeline checkpoints state so a failed or interrupted stage doesn’t burn hours of prior work. A restarted job resumes from the last good checkpoint.
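One simple way to get this behavior, sketched in Python. The JSON file layout and the run_with_checkpoints helper are assumptions; the real pipeline may persist its state elsewhere, for example in the database:

```python
import json
from pathlib import Path
from typing import Callable

Stage = Callable[[dict], dict]


def run_with_checkpoints(
    stages: list[tuple[str, Stage]], state: dict, path: Path
) -> dict:
    """Run stages in order, saving state after each one; skip finished stages."""
    done: list[str] = []
    if path.exists():
        # Resume: restore the state and the list of stages already completed.
        saved = json.loads(path.read_text())
        state, done = saved["state"], saved["done"]
    for name, fn in stages:
        if name in done:
            continue
        state = fn(state)  # if this raises, the previous checkpoint is untouched
        done.append(name)
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps({"state": state, "done": done}))
        tmp.replace(path)  # swap the new checkpoint in only once fully written
    return state
```

If a stage raises, calling run_with_checkpoints again with the same path picks up at the failed stage instead of starting over.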
Async job processing. A queue-backed worker architecture decouples heavy generation work from the request layer. Workers pull jobs, execute pipeline stages, and heartbeat their status. The same worker code runs locally as a process and in production as a managed container service.
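A rough Python sketch of the pull, execute, heartbeat loop. An in-memory queue.Queue stands in for the SQS-compatible queue, and worker_loop, the heartbeats dict, and the interval are hypothetical names chosen for illustration:

```python
import queue
import threading
import time
from typing import Callable


def worker_loop(
    jobs: queue.Queue,
    run_job: Callable[[object], None],
    heartbeats: dict,
    interval: float = 5.0,
) -> None:
    """Pull jobs until the queue is empty, heartbeating while each one runs."""
    while True:
        try:
            job_id, payload = jobs.get_nowait()
        except queue.Empty:
            return

        stop = threading.Event()

        def beat(job_id=job_id, stop=stop) -> None:
            # Record a timestamp right away, then once per interval until stopped.
            while True:
                heartbeats[job_id] = time.monotonic()
                if stop.wait(interval):
                    return

        thread = threading.Thread(target=beat, daemon=True)
        thread.start()
        try:
            run_job(payload)
        finally:
            stop.set()
            thread.join()
```

A monitor can then treat any job whose last heartbeat is older than a few intervals as stalled and hand it to another worker.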
What makes this hard for an agent. The agent has to internalize a 25,000-character PRD plus architecture and verification docs, decompose the build into ordered milestones, scaffold the entire infrastructure (auth, database, storage, queue, payments), wire eight LLM stages with correct model and thinking mode configs, implement the queue worker heartbeat loop, and make the whole thing run locally against real services... not mocks.
Architecture and stack
Full local-first stack via Docker Compose. Every cloud dependency has a local equivalent that speaks the same protocol, so the application is fully functional during development with zero cloud accounts:
Database + Auth + Storage
Supabase self-hosted (real Postgres, GoTrue authentication, Row Level Security, auto-generated REST API, file storage). Same as Supabase cloud, running in a container.
Object storage
MinIO (S3-compatible API). Swap the endpoint URL and the same code talks to S3 or R2 in production.
Job queue
LocalStack (SQS-compatible API). Same code, different endpoint.
Payments
Stripe CLI in test mode with webhook forwarding.
Frontend
Vite dev server.
The code is identical for local and production. Only connection strings change... environment variables. Deploy means swapping localhost URLs for cloud endpoints. No code forks, no feature flags, no parallel branches.
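A small Python sketch of the "only environment variables change" idea. The variable names (SUPABASE_URL, S3_ENDPOINT_URL, SQS_ENDPOINT_URL) and the load_config helper are assumptions for illustration, not the project's actual settings:

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Hypothetical variable names; the real project may use different ones.
REQUIRED = ("SUPABASE_URL", "S3_ENDPOINT_URL", "SQS_ENDPOINT_URL")


@dataclass(frozen=True)
class ServiceConfig:
    supabase_url: str
    s3_endpoint: str
    sqs_endpoint: str


def load_config(env: Mapping[str, str] = os.environ) -> ServiceConfig:
    """Build the service config from the environment; fail fast on gaps."""
    missing = [key for key in REQUIRED if not env.get(key)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return ServiceConfig(
        supabase_url=env["SUPABASE_URL"],
        s3_endpoint=env["S3_ENDPOINT_URL"],
        sqs_endpoint=env["SQS_ENDPOINT_URL"],
    )
```

Locally the values point at the Docker Compose services, in production at the cloud endpoints; the application code that consumes ServiceConfig stays the same.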
Three-process model. Frontend + API + async worker, all containerized, all healthchecked. Services run on non-default ports with namespaced Docker projects so multiple stacks coexist on the same machine without port or project-name collisions.
Mission mode autonomous execution.
The repository is deliberately naked... just AGENTS.md (behavioral guardrails and repo hazards) and docs/ (the full specification). No workflow framework, no step-by-step instructions. The agent reads everything, decomposes the build into milestones, and executes. Fire and forget: start the mission, come back hours later to a working application.
The agent never blocks mid build to ask the owner for a Supabase URL, AWS credentials, or an S3 bucket... because it doesn’t need them.
Row Level Security. Multi-tenant isolation is enforced at the database layer, not the application layer. One database, strict tenant boundaries, no cross-contamination possible even if the application has a bug.
Cross-model adversarial validation. The Droid harness supports bring your own key... any model. The build agent (GLM 5.2) and the validation agent (a different model) have fundamentally different architectures, so they don’t share blind spots. One builds, the other scrutinizes. Claude Code can only do Claude reviewing Claude. Cross-model review is structurally stronger validation.
Git-native. Every change the agent makes is version controlled. Auditable, reversible, diffable. You can reconstruct exactly what the agent did at any point.
Endurance as a feature. Judging by how many of the milestones have been implemented after 15 hours, I think the whole project might run for 25-30 hours, non-stop.
Claude Code has already asked me a few things a few times, but it is still quite autonomous in the Ultra Code mode.
Droid definitely is more autonomous if you send it into a mission and provide it with everything that it needs.
In this case, you also have to think ahead and prepare (for example, an .env file with API keys if you want to run real-life tests). Essentially, anything that the agent could need should be provided.
Then, you can have it run for two or three days and create a professional, full-stack application.
Alternatively, you can sit at your computer and observe it. If you are not as well prepared, just give it what it needs when it asks for something.
We are really at the point now where an excellent harness like Droid, paired with a very capable model like GLM 5.2, can work for days and create whatever you want, as long as you describe it well enough.
Essentially, fully autonomously.
That's pretty crazy, to be honest. And it's accessible to anyone, and it doesn't even cost much.
I am not a developer. I learned what I learned just by being one of these idiots who actually read every output that the models provide during coding.
I started as a spontaneous vibecoder a while ago, things got more and more serious, and that is where I am today.
Models like Fable or the next two or three versions of GLM will make less and less prior knowledge necessary.
However, it will still take a while until even the best model will be able to design on its own the features that a product like the one I'm building right now needs.
I think we are still far away from that.