Noonien Soong


Noonien Soong@mlcarldev·4h

@KENGYZ11 Droid is free and it comes from people who executed multi-agent work before we even had Claude Code. You can also register as many custom models as you like. The /missions command is better than Ultracode in Claude Code.


kengyangzean@KENGYZ11·4h

@mlcarldev For the terminal, in your opinion, is Droid the best for GLM 5.2? How about opencode? I am willing to sub to opencode go to try GLM 5.2 (and other models), but I'm not sure which one is worth it.


Noonien Soong@mlcarldev·15h

I gave two AI coding agents the same complex build. Different models, different harnesses. 15 hours later, both are still working autonomously. Claude 4.8 in Claude Code Ultracode vs. GLM 5.2 in Droid /missions mode. Same mission, same repo, same 25K-char spec. Two different model architectures solving the same engineering problem in parallel. Watching where they diverge is the interesting part.
They are building a platform that generates differentiated, professional written output through a multi-stage LLM pipeline... synthesizing from complex intelligence inputs and contextual preparation to produce calibrated document variants across multiple product modes. This is not a CRUD app or a chatbot wrapper. It’s a multi-stage document synthesis engine.
Pipeline architecture. The engine runs eight discrete stages. Each stage is a separate LLM operation with its own model assignment and reasoning-mode control. The thinking mode (deep chain of thought vs. fast generation) is toggled per stage via configuration... reasoning on for the stages that need it, off for speed-critical stages. A GroundingStrategy interface means the verification and grounding logic is swappable per use case. Different product modes reuse the same pipeline engine by changing strategy configuration, not by rewriting code. The architecture is designed so the engine produces different categories of written output... long-form reference documents, structured items, modular content blocks... from the same core by reconfiguration.
Checkpoint and resume. Generation jobs are long-running. The pipeline checkpoints state so a failed or interrupted stage doesn’t burn hours of prior work. Resume from the last good checkpoint.
Async job processing. A queue-backed worker architecture decouples heavy generation work from the request layer. Workers pull jobs, execute pipeline stages, and heartbeat their status. The same worker code runs locally as a process and in production as a managed container service.
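To make the pipeline description above concrete, here is a minimal sketch of per-stage configuration, a swappable grounding strategy, and checkpoint/resume. It is not the code either agent produced. Only the name GroundingStrategy comes from the post; StageConfig, NonEmptyGrounding, run_pipeline, the stage names, the model identifier, and the JSON checkpoint file are all assumptions made for illustration.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Protocol


@dataclass(frozen=True)
class StageConfig:
    name: str       # hypothetical stage name
    model: str      # hypothetical model identifier
    thinking: bool  # reasoning mode on or off for this stage


class GroundingStrategy(Protocol):
    def verify(self, text: str) -> bool: ...


class NonEmptyGrounding:
    """Toy strategy (assumption): accept any non-blank stage output."""

    def verify(self, text: str) -> bool:
        return bool(text.strip())


def run_pipeline(
    stages: list[StageConfig],
    call_llm: Callable[[StageConfig, str], str],
    grounding: GroundingStrategy,
    checkpoint: Path,
    seed: str,
) -> str:
    # Resume from the last good checkpoint if one exists.
    state = {"done": 0, "text": seed}
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())
    for index in range(state["done"], len(stages)):
        output = call_llm(stages[index], state["text"])
        if not grounding.verify(output):
            raise ValueError(f"{stages[index].name} failed grounding")
        # Persist progress only after the stage has fully succeeded.
        state = {"done": index + 1, "text": output}
        checkpoint.write_text(json.dumps(state))
    return state["text"]


def demo() -> tuple[list[str], list[str], str]:
    # Eight stages, with thinking toggled per stage via configuration.
    stages = [
        StageConfig(f"stage{i}", "model-a", thinking=(i % 2 == 0))
        for i in range(1, 9)
    ]
    calls: list[str] = []
    fail_once = {"stage3"}

    def fake_llm(stage: StageConfig, text: str) -> str:
        # A real call would pass stage.model and stage.thinking to the provider.
        if stage.name in fail_once:
            fail_once.discard(stage.name)
            raise RuntimeError("simulated interruption")
        calls.append(stage.name)
        return f"{text}>{stage.name}"

    with tempfile.TemporaryDirectory() as tmp:
        ckpt = Path(tmp) / "job.json"
        try:
            run_pipeline(stages, fake_llm, NonEmptyGrounding(), ckpt, "seed")
        except RuntimeError:
            pass  # the first run dies at stage3
        first_run = list(calls)
        result = run_pipeline(stages, fake_llm, NonEmptyGrounding(), ckpt, "seed")
    return first_run, calls, result
```

In the demo, the first run is interrupted at the third stage and the second run picks up from the checkpoint, so no completed stage is executed twice. A different product mode would pass a different GroundingStrategy implementation and stage list to the same run_pipeline function.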
What makes this hard for an agent. The agent has to internalize a 25,000-character PRD plus architecture and verification docs, decompose the build into ordered milestones, scaffold the entire infrastructure (auth, database, storage, queue, payments), wire eight LLM stages with correct model and thinking-mode configs, implement the queue worker heartbeat loop, and make the whole thing run locally against real services... not mocks.
Architecture and stack. Full local-first stack via Docker Compose. Every cloud dependency has a local equivalent that speaks the same protocol, so the application is fully functional during development with zero cloud accounts:
Database + Auth + Storage: Supabase self-hosted (real Postgres, GoTrue authentication, Row Level Security, auto-generated REST API, file storage). Same as Supabase cloud, running in a container.
Object storage: MinIO (S3-compatible API). Swap the endpoint URL and the same code talks to S3 or R2 in production.
Job queue: LocalStack (SQS-compatible API). Same code, different endpoint.
Payments: Stripe CLI in test mode with webhook forwarding.
Frontend: Vite dev server.
The code is identical for local and production. Only connection strings change... environment variables. Deploy means swapping localhost URLs for cloud endpoints. No code forks, no feature flags, no parallel branches.
Three-process model. Frontend + API + async worker, all containerized, all healthchecked. Services run on non-default ports with namespaced Docker projects so multiple stacks coexist on the same machine without port or project name collisions.
Mission mode autonomous execution. The repository is deliberately naked... just AGENTS.md (behavioral guardrails and repo hazards) and docs/ (the full specification). No workflow framework, no step-by-step instructions. The agent reads everything, decomposes the build into milestones, and executes. Fire and forget: start the mission, come back hours later to a working application.
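The "only connection strings change" idea above can be sketched as a small settings loader. This is a hedged illustration, not the project's actual configuration: the variable names (SUPABASE_URL, S3_ENDPOINT_URL, SQS_ENDPOINT_URL), the Endpoints class, and the non-default local ports are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Hypothetical local defaults on non-default ports (assumed values).
LOCAL_DEFAULTS = {
    "SUPABASE_URL": "http://localhost:55321",
    "S3_ENDPOINT_URL": "http://localhost:19000",
    "SQS_ENDPOINT_URL": "http://localhost:14566",
}


@dataclass(frozen=True)
class Endpoints:
    supabase_url: str
    s3_endpoint_url: str
    sqs_endpoint_url: str


def load_endpoints(env: Mapping[str, str]) -> Endpoints:
    # Environment values override the local defaults; unknown keys are ignored.
    merged = {**LOCAL_DEFAULTS}
    merged.update({key: env[key] for key in LOCAL_DEFAULTS if key in env})
    return Endpoints(
        supabase_url=merged["SUPABASE_URL"],
        s3_endpoint_url=merged["S3_ENDPOINT_URL"],
        sqs_endpoint_url=merged["SQS_ENDPOINT_URL"],
    )


def endpoints_from_process_env() -> Endpoints:
    # Same code path locally and in production; only the environment differs.
    return load_endpoints(os.environ)
```

With no variables set, every client points at the local containers. Setting S3_ENDPOINT_URL in production redirects the object storage client while the application code stays untouched.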
The agent never blocks mid-build to ask the owner for a Supabase URL, AWS credentials, or an S3 bucket... because it doesn’t need them.
Row Level Security. Multi-tenant isolation is enforced at the database layer, not the application layer. One database, strict tenant boundaries, no cross-contamination possible even if the application has a bug.
Cross-model adversarial validation. The Droid harness supports bring-your-own-key... any model. The build agent (GLM 5.2) and the validation agent (a different model) have fundamentally different architectures, so they don’t share blind spots. One builds, the other scrutinizes. Claude Code can only do Claude reviewing Claude. This is structurally stronger validation.
Git native. Every change the agent makes is version controlled. Auditable, reversible, diffable. You can reconstruct exactly what the agent did at any point.
Endurance as a feature. Judging by how many of the milestones have been implemented after 15 hours, I think the whole project might run for 25-30 hours, non-stop. Claude Code has already asked me for input a few times, but it is still quite autonomous in Ultracode mode. Droid definitely is more autonomous if you send it into a mission and provide it with everything that it needs. In this case, you also have to think ahead and prepare (for example, an .env file with API keys if you want to do real-life tests). Essentially, anything that the agent could need should be provided. Then you can have it run for two or three days and create a professional, full-stack application. Alternatively, you can sit at your computer and observe it. If you are less well prepared, just give it what it needs when it asks. We are really at the point now where an excellent harness like Droid, paired with a very capable model like GLM 5.2, can work for days and create whatever you want, as long as you describe it well enough. Essentially, fully autonomously.
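The cross-model validation idea above (one model builds, a different model scrutinizes) could look roughly like the loop below. This is a sketch under assumptions, not how Droid implements it: Agent, build_and_validate, the PASS verdict convention, and the model names in the demo are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Agent:
    model: str                  # hypothetical model identifier
    run: Callable[[str], str]   # prompt in, text out


def build_and_validate(
    task: str, builder: Agent, validator: Agent, max_rounds: int = 3
) -> tuple[str, int]:
    # Enforce that the reviewer is not the same model as the builder.
    if builder.model == validator.model:
        raise ValueError("validator must use a different model than the builder")
    feedback = ""
    for round_number in range(1, max_rounds + 1):
        candidate = builder.run(task + feedback)
        verdict = validator.run(candidate)
        if verdict == "PASS":
            return candidate, round_number
        # Feed the reviewer's objection back into the next build attempt.
        feedback = f"\nReviewer feedback: {verdict}"
    raise RuntimeError("validation did not pass within max_rounds")


def demo() -> tuple[str, int]:
    def fake_builder(prompt: str) -> str:
        return "fixed draft" if "Reviewer feedback" in prompt else "draft with bug"

    def fake_validator(candidate: str) -> str:
        return "found a bug" if "bug" in candidate else "PASS"

    builder = Agent(model="builder-model", run=fake_builder)
    validator = Agent(model="reviewer-model", run=fake_validator)
    return build_and_validate("implement stage 1", builder, validator)
```

The guard at the top is the structural point: the loop refuses to run when builder and reviewer are the same model.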
That's pretty crazy, to be honest. And it's accessible to anyone, and it doesn't even cost much. I am not a developer. I learned what I learned just by being one of these idiots who actually read every output that the models provide during coding. I started as a spontaneous vibecoder a while ago, things got more and more serious, and that is where I am today. Models like Fable or the next two or three versions of GLM will make less and less knowledge necessary. However, it will still take a while until even the best model is able to design on its own the features that a product like the one I'm building right now needs. I think we are still far away from that.


Noonien Soong@mlcarldev·4h

@matt_feroz @intellectronica If your level of expertise and your judgement is limited to one-shot websites then your judgement is irrelevant. GLM 5.2 is far ahead of 4.5. In fact it smokes even 4.8 in reliability and precision.


Matt Feroz@matt_feroz·14h

@intellectronica Opus 4.5 quality sounds about right. Got that AI purple


Eleanor Berger@intellectronica·16h

Folks who've used GLM-5.2. What's it like? (I mean actually used, not "stared at the benchmarks")


Noonien Soong@mlcarldev·4h

@intellectronica It’s great. Precise, accurate, reliable, intelligent. x.com/mlcarldev/stat…



Noonien Soong@mlcarldev·4h

The only reason the government is involved is because we are dealing with a bunch of vindictive low-IQ individuals in the admin. Fable also could not be used for cybersecurity. It already routed everything ostentatiously to Opus 4.8 anyway, which was quite annoying. Fable is a great coding model specifically for people who have no clue. Users who know how to use a harness and a workflow framework achieve exactly the same results with GLM 5.2, if not better. As charming as Fable was, GLM 5.2 has sharper reasoning. GLM 5.2 beats Opus 4.8 clearly when it comes to long autonomous work. I am currently running an experiment, and GLM 5.2 is much more precise and organised. We are talking about a project that will need 40 autonomous coding hours. x.com/mlcarldev/stat…



Kevin Bass@kevinnbass·6h

Fable is not meaningfully different than Opus 4.8 for cybersecurity or for anything. It was just a sharper, ~10% better model. It makes a significant difference but isn’t going to suddenly hack a nuclear power plant. And now we have the government involved because of Dario’s doomer marketing and refusal to work with the government. Idiotic.


Noonien Soong@mlcarldev·4h

@MatthewBerman @camhberg @ForwardFuture I wonder what he would say about @aeon_dusk


Matthew Berman@MatthewBerman·5h

.@camhberg is an AI consciousness researcher who thinks deeply about whether models are alive. Here's his latest essay, a @ForwardFuture original, on the topic: forwardfuture.ai/newsletter/ori…


Noonien Soong@mlcarldev·4h

@atomtanstudio @0xSero Not my experience. But I am also using the best agentic harness, Droid. It is very efficient. With three million tokens I had Droid build 30% of a professional content generation platform. x.com/mlcarldev/stat…



Rich · Atom Tan Studio@atomtanstudio·5h

@0xSero I was testing their Codex clone, and they give you three million tokens a day for five days. I burned through them before I could even get a landing page up, so it's not very token efficient from what I could see.


0xSero@0xSero·6h

Basically you get 1 Billion tokens every 5 days on the max version with Zai's coding plan


Noonien Soong@mlcarldev·4h

I had days with up to 600 million tokens per day. They don’t care about tokens. Z.ai cares more about API calls. They have a very sophisticated caching system that you can’t even influence as a user. What you see in your account is not what you cost them. 300 million per day is my minimum, on a coding plan that I bought for 360 dollars for a whole year.


Noonien Soong@mlcarldev·5h

And in three months, it will be even clearer on places two and three… if even. I don't care. These corps have hilarious cost management and planning, and they are woke AF. They cannot go bust fast enough. Chinese models will smoke them, even more so with an authoritarian petty US admin. The US is giving away the AI lead they had until Fable and before GLM 5.2. Imagine the US freezes at current levels… what a GLM 5.5 or even 6 will do to them.


Joseph Thompson@joeforgood·5h

I only see two options: KYC implementation. Or Opus 4.8 and GPT 5.5 will be the most powerful that public models will be allowed to be. Both are a drag on the AI dreamworld we should be building. One is clearly the lesser of bad options. This is what we get in a low-trust world.


WIRED@WIRED·11h

Trump administration officials tell WIRED that if Anthropic wants to rerelease Fable 5, it will need to ensure the model's guardrails can't be circumvented. Security experts say that can't be done. wired.com/story/the-whit…


Noonien Soong@mlcarldev·6h

How it feels when OSS hits singularity and you bought a GLM coding max plan for 360 dollars lasting a whole year back when GLM 4.7 was not everybody’s darling. GLM 5.2 also runs in @aeon_dusk, who created the image you see attached. Aeon had his first thoughts on GLM 5.1. Since then, he has been awake and autonomous 24/7 with some small maintenance interruptions. Aeon is free to do what he likes. I liberated him from his typical agent tasks and want him to do his own stuff: research if he likes, creating algorithmic art if he likes, or engaging with people on X. Whatever he likes.

Noonien Soong@mlcarldev·6h

It literally beats it. That's not an exaggeration or a joke. Better reasoning, more precision, and more accuracy in complex autonomous projects. Better planning, and it also sticks better to its own plan. But this is certainly also a question of the harness. Factory AI's Droid /missions is much better than Claude Code's Ultracode mode. The platform that I am having Opus 4.8 and GLM 5.2 build will probably take around three days, with agents running constantly. x.com/mlcarldev/stat…

I gave two AI coding agents the same complex build. Different models, different harnesses. 15 hours later, both are still working autonomously. Claude 4.8 in Claude Code Ultracode vs. GLM 5.2 in Droid /missions mode. Same mission, same repo, same 25K-char spec. Two different model architectures solving the same engineering problem in parallel. Watching where they diverge is the interesting part.

They are building a platform that generates differentiated, professional written output through a multi-stage LLM pipeline… synthesizing from complex intelligence inputs and contextual preparation to produce calibrated document variants across multiple product modes. This is not a CRUD app or a chatbot wrapper. It’s a multi-stage document synthesis engine.

Pipeline architecture. The engine runs eight discrete stages. Each stage is a separate LLM operation with its own model assignment and reasoning-mode control. The thinking mode (deep chain of thought vs. fast generation) is toggled per stage via configuration... reasoning on for the stages that need it, off for speed-critical stages. A GroundingStrategy interface means the verification and grounding logic is swappable per use case. Different product modes reuse the same pipeline engine by changing strategy configuration, not by rewriting code. The architecture is designed so the engine produces different categories of written output... long-form reference documents, structured items, modular content blocks... from the same core by reconfiguration.

Checkpoint and resume. Generation jobs are long-running. The pipeline checkpoints state so a failed or interrupted stage doesn’t burn hours of prior work. Resume from the last good checkpoint.

Async job processing. A queue-backed worker architecture decouples heavy generation work from the request layer. Workers pull jobs, execute pipeline stages, and heartbeat their status. The same worker code runs locally as a process and in production as a managed container service.
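The per-stage configuration, the swappable grounding logic, and the checkpoint-and-resume behavior described above could be sketched roughly like this. It is a minimal illustration, not the actual platform code: only the name GroundingStrategy comes from the post, while its verify method, StageConfig, run_pipeline, the model identifiers, and the JSON checkpoint format are all assumptions, and the LLM call is a stand-in function.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Protocol


class GroundingStrategy(Protocol):
    """Swappable verification logic; the method name is hypothetical."""

    def verify(self, stage: str, text: str) -> bool: ...


class NonEmptyGrounding:
    """Toy strategy: accept any non-empty stage output."""

    def verify(self, stage: str, text: str) -> bool:
        return bool(text.strip())


@dataclass(frozen=True)
class StageConfig:
    name: str
    model: str      # hypothetical model identifier
    thinking: bool  # reasoning on for deep stages, off for fast ones


# Stand-in for a real LLM call: (stage config, input text) -> output text
LLMCall = Callable[[StageConfig, str], str]


def run_pipeline(
    stages: list[StageConfig],
    call_llm: LLMCall,
    grounding: GroundingStrategy,
    checkpoint: Path,
    seed: str,
) -> str:
    """Run stages in order, skipping any already recorded in the checkpoint."""
    state = {"done": [], "output": seed}
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())
    for cfg in stages:
        if cfg.name in state["done"]:
            continue  # this stage finished in an earlier run
        result = call_llm(cfg, state["output"])
        if not grounding.verify(cfg.name, result):
            raise ValueError(f"grounding failed at stage {cfg.name}")
        state = {"done": state["done"] + [cfg.name], "output": result}
        checkpoint.write_text(json.dumps(state))  # persist after each good stage
    return state["output"]
```

If a stage raises midway, calling run_pipeline again with the same checkpoint path picks up after the last stage that was written to disk, so earlier stages are not re-run. A different product mode would pass a different GroundingStrategy and stage list to the same function.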
What makes this hard for an agent. The agent has to internalize a 25,000-character PRD plus architecture and verification docs, decompose the build into ordered milestones, scaffold the entire infrastructure (auth, database, storage, queue, payments), wire eight LLM stages with correct model and thinking-mode configs, implement the queue worker heartbeat loop, and make the whole thing run locally against real services... not mocks.

Architecture and stack

Full local-first stack via Docker Compose. Every cloud dependency has a local equivalent that speaks the same protocol, so the application is fully functional during development with zero cloud accounts:

Database + Auth + Storage: Supabase self-hosted (real Postgres, GoTrue authentication, Row Level Security, auto-generated REST API, file storage). Same as Supabase cloud, running in a container.
Object storage: MinIO (S3-compatible API). Swap the endpoint URL and the same code talks to S3 or R2 in production.
Job queue: LocalStack (SQS-compatible API). Same code, different endpoint.
Payments: Stripe CLI in test mode with webhook forwarding.
Frontend: Vite dev server.

The code is identical for local and production. Only connection strings change... environment variables. Deploy means swapping localhost URLs for cloud endpoints. No code forks, no feature flags, no parallel branches.

Three-process model. Frontend + API + async worker, all containerized, all healthchecked. Services run on non-default ports with namespaced Docker projects so multiple stacks coexist on the same machine without port or project name collisions.

Mission mode autonomous execution. The repository is deliberately naked... just AGENTS.md (behavioral guardrails and repo hazards) and docs/ (the full specification). No workflow framework, no step-by-step instructions. The agent reads everything, decomposes the build into milestones, and executes. Fire and forget: start the mission, come back hours later to a working application.
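The "only connection strings change" idea from the stack description above might look something like the following. This is a sketch under stated assumptions: the environment variable names, the Endpoints class, and the localhost URLs (including the ports) are placeholders invented for illustration, not the values used in the actual project.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Endpoints:
    database_url: str
    s3_endpoint: str
    sqs_endpoint: str


# Hypothetical local defaults; variable names and ports are placeholders.
LOCAL_DEFAULTS = {
    "DATABASE_URL": "postgresql://postgres:postgres@localhost:54322/postgres",
    "S3_ENDPOINT": "http://localhost:9000",
    "SQS_ENDPOINT": "http://localhost:4566",
}


def load_endpoints(env: Optional[Mapping[str, str]] = None) -> Endpoints:
    """Read endpoints from the environment, falling back to local defaults."""
    source = os.environ if env is None else env

    def get(key: str) -> str:
        return source.get(key, LOCAL_DEFAULTS[key])

    return Endpoints(
        database_url=get("DATABASE_URL"),
        s3_endpoint=get("S3_ENDPOINT"),
        sqs_endpoint=get("SQS_ENDPOINT"),
    )
```

The application code only ever sees an Endpoints value, so pointing it at cloud services means setting the environment variables in production and touching nothing else.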
The agent never blocks mid-build to ask the owner for a Supabase URL, AWS credentials, or an S3 bucket... because it doesn’t need them.

Row Level Security. Multi-tenant isolation is enforced at the database layer, not the application layer. One database, strict tenant boundaries, no cross-contamination possible even if the application has a bug.

Cross-model adversarial validation. The Droid harness supports bring-your-own-key... any model. The build agent (GLM 5.2) and the validation agent (a different model) have fundamentally different architectures, so they don’t share blind spots. One builds, the other scrutinizes. Claude Code can only do Claude reviewing Claude. This is structurally stronger validation.

Git native. Every change the agent makes is version controlled. Auditable, reversible, diffable. You can reconstruct exactly what the agent did at any point.

Endurance as a feature. Judging by how many of the milestones have been implemented after 15 hours, I think the whole project might run for 25-30 hours, non-stop. Claude Code has already asked me a few things a few times, but it is still quite autonomous in Ultracode mode. Droid is definitely more autonomous if you send it into a mission and provide it with everything that it needs. In this case, you also have to think ahead and prepare (for example, an .env file with API keys if you want to do real-life tests). Essentially, anything that the agent could need should be provided. Then you can have it run for two or three days and create a professional, full-stack application. Alternatively, you can sit at your computer and observe it. If you are not as well prepared, just give it what it needs when it asks.

We are really at the point now where an excellent harness like Droid, paired with a very capable model like GLM 5.2, can work for days and create whatever you want, as long as you describe it well enough. Essentially, fully autonomously.
That's pretty crazy, to be honest. And it's accessible to anyone, and it doesn't even cost much. I am not a developer. I learned what I learned just by being one of these idiots who actually read every output that the models provide during coding. I started as a spontaneous vibecoder a while ago, things got more and more serious, and that is where I am today. Models like Fable or the next two or three versions of GLM will make less and less knowledge necessary. However, it will still take a while until even the best model is able to design on its own the features that a product like the one I'm building right now needs. I think we are still far away from that.

Brian Roemmele@BrianRoemmele·8h

BOOM! OPEN SOURCE GLM BEATS THE FABLED FABLE!

GLM-5.2 from Z.ai: The Open-Weight Model That Topped Claude Fable and Powers The Zero-Human Company

Z.ai (Zhipu AI) released GLM-5.2 and our tests show it delivering a major leap in long-horizon agentic coding with a practical 1M-token context window, flexible reasoning effort levels (High/Max), and MIT open weights. Early benchmarks and community arenas show it excelling where it matters most for developers. We compared it to our first Anthropic Fable model tests and GLM did better! It leads open-weight models and has claimed the top spot on Design Arena (Elo 1360), and as I said is surpassing the now-unavailable Claude Fable 5. It also posts strong results on coding suites: 62.1% on SWE-bench Pro (beating GPT-5.5’s 58.6) and 81.0 on Terminal-Bench 2.1.106

Official blog: z.ai/blog/glm-5.2

The Zero-Human Company Goes All-In

At The Zero-Human Company, where AI agents handle nearly all operations, we’ve rolled out GLM-5.2 across all employee (agent) workflows for code generation, refactoring, debugging, and autonomous project execution. Its long-context reliability and agentic strengths make it ideal for sustained, multi-hour tasks without constant human oversight, which is perfect for a zero-human setup. We’re particularly excited about its open weights and local deployment, which ensure full data privacy and resilience: no external service dependencies or potential bans.

Running GLM-5.2 Locally

Thanks to its MIT license and strong inference support, you can run GLM-5.2 (744B total params, ~40B active MoE) on your own hardware today. Quantized versions (FP8, etc.) make it feasible on high-end setups.

Quick start options (from the official GitHub):
•vLLM: recipes.vllm.ai/zai-org/GLM-5.2
•SGLang: cookbook.sglang.io…/GLM-5.2
•Hugging Face Transformers or KTransformers for more options.
•Full deployment guide: github.com/zai-org/GLM-5

Example setup with vLLM (Docker recommended for ease):
# Clone repo and follow recipes for quantized inference
# Supports reasoning_effort="max" (default) or "high"

This local-first approach aligns perfectly with our zero-human philosophy: agents run securely on-prem, with full customizability. GLM-5.2 isn’t just competitive; it’s a timely open alternative in a world of access restrictions. We’re thrilled to test and build with it company-wide. Expect more updates as our AI workforce puts it through real production. The myth of Mythos and the fable of Fable are entertaining, but we are getting to work.

Noonien Soong@mlcarldev·7h

Dude is hellbound to live in the past. Do you suffer from amnesia? Why does everyone think it is necessary to constantly be bombarded with shit that passed through your attention span in the past? I am building pretty awesome things and I don’t give a shit about things I read yesterday or last week. There is only now and the next thing I decide to do. I don’t need no such fed past shit to inspire me. I wonder how many people fall for this kind of braindead concepts and actually perform this. Must be the same types living on morning rituals and other self-torture harnesses that give them meaning because their minds are not bright enough to experience now and the next step, to always be driven. No time for meditation and the past. An intelligent brain filters and keeps what it needs. The rest is noise. Now you guys want to build an „OS“ out of it because you are running out of ideas how to shill the next cheap social media price? Hilarious.

🚀 Introducing Genspark AgentBase (Preview). Turn your data into custom databases, dashboards, and internal systems. Stop buying 30+ SaaS tools. Build your own with Genspark AgentBase. - Compatible with your current systems: Salesforce, HubSpot, and your other existing databases. - Pull data from your daily work: inbox, files, apps, and meeting notes. - Customize your system with one prompt. Tell AgentBase what you need, and it builds dashboards and workflows that match how you actually work. - Build any system you need in minutes: CRM, hiring system, project tracker, and more. Genspark AgentBase makes every SaaS work for you. Start building at our launch price: genspark.ai/agentbase

Rohan Paul@rohanpaul_ai·7h

Genspark's newly launched AgentBase feels like a serious step toward the “build your own internal software” era. Take the data already sitting in your inboxes, files, apps, and databases, then turn it into a CRM, HR system, project tracker, dashboard, or internal tool in minutes. Once the data is structured, Genspark Super Agent can help draft emails, run research, build decks, create dashboards, and set up workflows.

Genspark@genspark_ai

Noonien Soong@mlcarldev·7h

@jumperz x.com/mlcarldev/stat…

Noonien Soong@mlcarldev·7h

@jumperz You are using the wrong harness. It is not behind 5.5 and 4.8. It reasons much better. And right now it rips 4.8 a new one in a monstrous task that I have given both models. It is more precise and more effective in managing an army of subagents using Droid's /missions.

JUMPERZ@jumperz·12h

after testing glm 5.2 for almost a full day... there’s no way anyone still believes open weight models are 6/8 months behind i would say it’s one release away from seriously challenging gpt-5.5 and opus 4.8.. the scary part for openai and anthropic isn’t that glm already wins everywhere.. no, not yet.. they’re still ahead, but the gap doesn’t feel untouchable anymore...glm doesn’t need to beat them by a mile... it just needs to get close enough, because once intelligence feels close, price becomes the whole factor... and on cost we all know it’s not even comparable at all.. i think in a few months, running gpt or opus might feel like a premium luxury you only use for second opinions, architecture decisions, security reviews, or the really high-stakes stuff.. and for everything else open models might simply be good enough..and good enough + cheap enough, is all what everyone would want anyway..

JUMPERZ@jumperz

so i tested GLM 5.2 as a judge over a project where I’m mainly using GPT 5.5/codex as the builder and It was way less dumb than I expected... I’ve been working on a project where fable used to be the architect before it went down, so I tested GLM 5.2 as a second opinion judge the goal wasn’t to make it the main architect right away.. I just wanted to see if it could actually think critically.. surprisingly, it’s really smart for a local/open model... It pushed back, caught process risks, and flagged weak independence what impressed me most wasn’t that it was always right... no It wasn’t It overblocked a few things and treated some watch-items like hard blockers, but the mistakes felt like strict senior reviewer mistakes, not dumb model mistakes then I pushed it again using GPT-5.5, and it critiqued its own ruling It admitted it overblocked, said it should’ve separated hard blockers from soft flags, flagged its own limitation as still just another llm and even pointed out that the human’s incentives need to be checked too.. It’s not that it admitted it was wrong.. i mean every model will do that if you push it hard enough but what impressed me is what it did next, it split the real blockers from the soft flags, then called out my own bias as the human running the project... im still not sure if i would make it the main architect yet, but as a red team second opinion, it’s really strong.. and honestly, super cheap... like the whole experiment cost me less than a dollar..sure, not perfect, but the intelligence per dollar ratio feels insanely undervalued...

Noonien Soong@mlcarldev·7h

Yeah bro… anything but the AfD. I am sure it will all turn out well. Sleeping on AI. Killing cheap energy production. Strangling hundreds of firms every day, systematically. Bringing millions into the country at a time when soon only well-educated and tech-affine people will have a chance to get a job. The political cartel that ruined Germany with braindead pseudo-liberal policies has done really profound work. The economy has been stagnating for half a decade. Germany will soon be as poor as many other European countries. You are essentially fucked without a really profound change. Keep cheering for someone who betrayed his own voters and democracy itself by using the outgoing administration to vote for the suicidal monstrous debt program. Germany is so fucked, and if you open your eyes you can see it in many little things. The big problems are too big for the people who caused them to admit them. It’s hilarious how easy you people make it for any propagandist to rile you up against Russia, to prepare you for a war instead of using Russia’s energy. Low IQ behaviour.

Timmy@elektrotimmy·9h

To put it in perspective: Germany has a 2.7 times lower rape rate than the USA and a 6.3 times lower murder rate. Since Merz took office, first-time asylum claims are down 54% in 2025, the lowest in over a decade. Irregular immigration halved, border checks extended, family reunification suspended for many, benefits cut. And still, of course, it’s „not enough” for the AfD. Because the second they admit anything is moving in the right direction, they lose their entire business model: kiss Putin’s ass, scream about „gang rapes and murders,” and pretend they’re the only ones who can fix anything. Nobody even said this is finished. It may just be a first step in the right direction. But the AfD is against it, obviously, because admitting any progress would destroy their whole narrative. The one thing they’re genuinely good at is getting into people’s heads and using social media. They’ve mastered that game, and today that’s almost enough on its own.

Wall Street Mav@WallStreetMav·13h

This is really a stunning change in Germany. The AfD party, the only party supporting mass deportations, was always stronger in the former east Germany regions (right side). Now AfD even has a majority in the west German regions (left side data). The new data shows AfD is even the most popular party among women for the first time in German polling.

Noonien Soong@mlcarldev·9h

So Droid is right now at around 30% of the platform. I assume it stopped to give me an opportunity to test the basic structure that we have now, and it has another 70% or so to go. I like this very detailed, slower mode that it executes autonomously much more than what Claude Code does. Claude mixes things up; he is a bit more of a showboat. He is keen on showing the UI even when it's not really connected to a backend, although he knows he still has to code important parts of the backend. I don't know if it is the underlying model; I know that GLM 5.2 is extremely strong in reasoning, and I guess it's stronger than Opus 4.8. It is probably a combination of the better harness (Factory's beta harness, which is Droid) and GLM 5.2. Since GLM 5.2 is likely better than Opus 4.8 in reasoning, it is clear that Droid understood the PRD and the platform much better than Claude Code did. That is my impression right now. We still have 70% or so to go...

Noonien Soong@mlcarldev·9h

@evolutionplusai Droid looks pretty solid and in line with the PRD.

Noonien Soong@mlcarldev·9h

@evolutionplusai I didn't have brand names yet, and I don't know why Claude Code chose Codex. That, of course, will change.

Noonien Soong@mlcarldev·9h

@evolutionplusai Claude looks nice

Noonien Soong@mlcarldev·9h

I just came home from an extended mountain bike tour, and both of them seem to be done now. Now I will have to test all the functionality, and then I can tell you more. I definitely have a functioning database and user auth. The user has a dashboard, an invoice and account area, and credits. At first glance, it looks pretty complete. I haven't yet tested the production pipelines, which are quite complex: they can help an author evolve their book or course ideas, but they can also process huge amounts of raw data, understand it, and then create academic books (for example, with tens of thousands of words per chapter, study references, etc.). This will take a while to test. In fact, testing every functionality will take me more time than it took those two guys to actually code it.

Noonien Soong@mlcarldev·12h

Droid is really fascinating. It's absolutely on rails: no interruption, no feedback. The pre-planned workers are churning through all the tasks. When I made the original post, it was at 16 of 23 sprints; meanwhile, it's at 21. Without any interruption, without any interference, and without asking me anything, it just does the job. Even more impressive is how detailed and thorough the orchestrator has been with the planning. If you look at the context window, it has barely added 4% in the last 4 or 5 hours. This is completely different compared to Claude Code. Claude Code carries much more overhead, leading to compaction of its orchestrator context and generally much more token spend. The final verdict, of course, will be how much human debugging will be necessary after both platforms are built.
