Noonien Soong

4

325

Noonien Soong@mlcarldev·3h

Does anyone know if another coding CLI or agent harness offers something that is comparable to Claude Code´s Ultracode, or Droid's /missions command? @droid "broke" durg execution of a complex building plan, that Claude Code in Ultracode mode aced. I still want to know how well GLM 5.2 would compare in such a complex project, but it would not be a fair comparison to use it in the normal mode in Droid or any other harness. The /missions command is extremely good as long as it works. I might give the Claude Code CLI a try, in case that it works with /effort ultracode and GLM 5.2 with max reasoning to level the playing field. I don't want to go back to my workflow framework that I built to work with single agents. While it is quite effective, at this project size it would take significant oversight by me and many session restarts. If anyone knows about a coding CLI or a coding harness that allows this ultra-autonomous coding, I appreciate your tips. As a side note. I am absolutely blown away what the desktop version of Claude Code and Opus 4.8 Ultracode made out of my super detailed PRD.

English

2

47

Noonien Soong@mlcarldev·3h

@BrianRoemmele /goal build me a time machine /loop the goal

English

5

Brian Roemmele@BrianRoemmele·20h

They were so crude and unrefined these unsophisticated folks in 1958. It is hard to watch these ancient ways. Today we have a new… Spirit.

English

30

51

367

21.1K

Noonien Soong@mlcarldev·3h

Team @droid It's a bit unfortunate that something, likely in my local Droid installation, has stalled progress. This comes after 20 hours of brilliant, excellent planning and execution on the first 30% of this platform, where a stellar handoff procedure was created so I could start a new mission... which was the recommendation of the orchestrating agent in that first mission. Starting this second mission with a fresh context window, the agent again did a brilliant job planning the next milestones. It was extraordinary, detailed planning... but then it could not execute. After the planning and after me accepting the proposal, it refused to execute, throwing an error every time. The agent tried everything: 1. He decreased the size of the plan down to one line, so it is definitely not the content of the plan causing the issue. 2. He even deleted some mission and plan related json and other files to reset it while preserving all the information. I have restarted Droid and resumed the session, but it just doesn't work. I wrote a detailed, comprehensive bug report and filed it under issues in your GitHub repo, as this seems to be a real problem now. Issues #98 and #99 I hope that a next update will somehow reset my configuration. I didn't see a new version being installed that could have introduced a bug, so this must be something Droid does on such an extensive mission... perhaps when trying to start a new mission in the same repository, which is normal procedure according to the documentation. Something is off, and essentially I have been unable to continue the test since yesterday. I cannot continue having this platform coded here, while Opus Ultracode, on the other hand, has been delivering pretty functional stuff so far. It is a bit chaotic the way it works... it doesn't really stick to the plan... but it always comes back when reminded. I am pretty sure that today I will have a functioning platform delivered by Opus, though it will probably need some debugging and fine-tuning. It is unfortunate because I am confident GLM 5.2 could compete with Opus 4.8. The first stint showed this clearly; that first flawless 98% of the context window in the first mission was absolutely stellar. If I were to reinstall Droid from scratch, I assume I would lose all the artifacts that I have. The orchestrator: Key points to highlight when you pass it to Factory AI: 1. Root cause (smoking gun in the logs): the orchestrator session is bound to missionId 7ba4d425 via session tags, and this binding persists across CLI restarts. ProposeMission looks up that mission directory, finds nothing (because I deleted it trying to fix the issue), and crashes on H.length where H is the undefined result. 2. The bug is likely in session-tag lifecycle: the missionId tag is set at session creation time (before any ProposeMission call), so a failed proposal poisons the session permanently. The tag should be set AFTER a successful proposal, or cleared on restart if the referenced mission no longer exists. 3. The fix is almost certainly to start a completely fresh session (not --resume, and possibly in a new terminal window / after clearing ~/.factory/sessions/). I did not try this because you asked for the bug report first, but it is the most likely workaround on your side. 4. The AskUser tool is also broken in this session with a similar parse error, reinforcing that this is a session-state corruption issue, not a ProposeMission-specific bug. My comment: I meanwhiile tested. All the recommendations and the Ask User tool are now broken, even in completely unrelated new missions and new repositories. Planning also can't go to execution; it's always the same error. Droid seems to be broken for good now, at least on my computer.

English

47

Noonien Soong@mlcarldev·1d

I gave two AI coding agents the same complex build. Different models, different harnesses. 15 hours later, both still working autonomously. Claude 4.8 in CLaude Code Ultracode vs GLM 5.2 in Droid /missions mode. Same mission, same repo, same 25K-char spec. Two different model architectures solving the same engineering problem in parallel. Watching where they diverge is the interesting part. They are building a platform that generates differentiated, professional written output through a multi-stage LLM pipeline… synthesizing from complex intelligence inputs and contextual preparation to produce calibrated document variants across multiple product modes. This is not a CRUD app or a chatbot wrapper. It’s a multi-stage document synthesis engine. Pipeline architecture. The engine runs eight discrete stages. Each stage is a separate LLM operation with its own model assignment and reasoning-mode control. The thinking mode (deep chain of thought vs. fast generation) is toggled per stage via configuration... reasoning on for the stages that need it, off for speed-critical stages. A GroundingStrategy interface means the verification and grounding logic is swappable per use case. Different product modes reuse the same pipeline engine by changing strategy configuration, not by rewriting code. The architecture is designed so the engine produces different categories of written output... long form reference documents, structured items, modular content blocks... from the same core by reconfiguration. Checkpoint and resume. Generation jobs are long running. The pipeline checkpoints state so a failed or interrupted stage doesn’t burn hours of prior work. Resume from the last good checkpoint. Async job processing. A queue backed worker architecture decouples heavy generation work from the request layer. Workers pull jobs, execute pipeline stages, and heartbeat their status. The same worker code runs locally as a process and in production as a managed container service. What makes this hard for an agent. The agent has to internalize a 25,000-character PRD plus architecture and verification docs, decompose the build into ordered milestones, scaffold the entire infrastructure (auth, database, storage, queue, payments), wire eight LLM stages with correct model and thinking mode configs, implement the queue worker heartbeat loop, and make the whole thing run locally against real services... not mocks. Architecture and stack Full local first stack via Docker Compose. Every cloud dependency has a local equivalent that speaks the same protocol, so the application is fully functional during development with zero cloud accounts: Database + Auth + Storage Supabase self-hosted (real Postgres, GoTrue authentication, Row Level Security, auto generated REST API, file storage). Same as Supabase cloud, running in a container. Object storage MinIO (S3 compatible API). Swap the endpoint URL and the same code talks to S3 or R2 in production. Job queue — LocalStack (SQS compatible API). Same code, different endpoint. Payments — Stripe CLI in test mode with webhook forwarding. Frontend — Vite dev server. The code is identical for local and production. Only connection strings change... environment variables. Deploy means swapping localhost URLs for cloud endpoints. No code forks, no feature flags, no parallel branches. Three process model. Frontend + API + async worker, all containerized, all healthchecked. Services run on non-default ports with namespaced Docker projects so multiple stacks coexist on the same machine without port or project name collisions. Mission mode autonomous execution. The repository is deliberately naked... just AGENTS.md (behavioral guardrails and repo hazards) and docs/ (the full specification). No workflow framework, no step by step instructions. The agent reads everything, decomposes the build into milestones, and executes. Fire and forget: start the mission, come back hours later to a working application. The agent never blocks mid build to ask the owner for a Supabase URL, AWS credentials, or an S3 bucket... because it doesn’t need them. Row Level Security. Multi tenant isolation is enforced at the database layer, not the application layer. One database, strict tenant boundaries, no cross contamination possible even if the application has a bug. Cross model adversarial validation. The Droid harness supports bring your own key... any model. The build agent (GLM 5.2) and the validation agent (a different model) have fundamentally different architectures, so they don’t share blind spots. One builds, the other scrutinizes. Claude Code can only do Claude reviewing Claude. This is structurally stronger validation. Git native. Every change the agent makes is version controlled. Auditable, reversible, diffable. You can reconstruct exactly what the agent did at any point. Endurance as a feature. If I see how many of the milestones have been implemented now after 15 hours, I think the whole project might run for 25-30 hours, non-stop. In Claude Code, it already had me ask a few things a few times, but it is still quite autonomous in the Ultra Code mode. Droid definitely is more autonomous if you send it into a mission and provide it with everything that it needs. In this case, you also have to think ahead and prepare (for example, an .env file with API keys if you wanted to do real-life tests). Essentially, anything that the agent could need should be provided. Then, you can have it run for two or three days and create a professional, full-stack application. Alternatively, you can sit at your computer and observe it. When you are not as well prepared, just give it what it needs in case it needs something. We are really at the point now where an excellent harness like Droid, paired with a very capable model like GLM 5.2, can work for days and create whatever you want, as long as you describe it well enough. Essentially, fully autonomously. That's pretty crazy, to be honest.And it's accessible to anyone, and it doesn't even cost much. I am not a developer. I learned what I learned just by being one of these idiots who actually read every output that the models provide during coding. I started as a spontaneous vibecoder a while ago, things got more and more serious, and that is where I am today. Models like Fable or the next two or three versions of GLM will make less and less knowledge necessary. However, it will still take a while until even the best model will be able to design on its own the features that a product like the one I'm building right now needs. I think we are still far away from that.

English

7

0

9

3.6K

Noonien Soong@mlcarldev·8h

Either your PRD is not good enough or Codex is overhyped, just like Claude Code. 2 hours is a joke. GLM 5.2 ran and executed accurately using dozens of subagents in the /missions mode in Droid. Claude shows more drift with Opus 4.8 and is less autonomous overall. x.com/mlcarldev/stat…

I gave two AI coding agents the same complex build. Different models, different harnesses. 15 hours later, both still working autonomously. Claude 4.8 in CLaude Code Ultracode vs GLM 5.2 in Droid /missions mode. Same mission, same repo, same 25K-char spec. Two different model architectures solving the same engineering problem in parallel. Watching where they diverge is the interesting part. They are building a platform that generates differentiated, professional written output through a multi-stage LLM pipeline… synthesizing from complex intelligence inputs and contextual preparation to produce calibrated document variants across multiple product modes. This is not a CRUD app or a chatbot wrapper. It’s a multi-stage document synthesis engine. Pipeline architecture. The engine runs eight discrete stages. Each stage is a separate LLM operation with its own model assignment and reasoning-mode control. The thinking mode (deep chain of thought vs. fast generation) is toggled per stage via configuration... reasoning on for the stages that need it, off for speed-critical stages. A GroundingStrategy interface means the verification and grounding logic is swappable per use case. Different product modes reuse the same pipeline engine by changing strategy configuration, not by rewriting code. The architecture is designed so the engine produces different categories of written output... long form reference documents, structured items, modular content blocks... from the same core by reconfiguration. Checkpoint and resume. Generation jobs are long running. The pipeline checkpoints state so a failed or interrupted stage doesn’t burn hours of prior work. Resume from the last good checkpoint. Async job processing. A queue backed worker architecture decouples heavy generation work from the request layer. Workers pull jobs, execute pipeline stages, and heartbeat their status. The same worker code runs locally as a process and in production as a managed container service. What makes this hard for an agent. The agent has to internalize a 25,000-character PRD plus architecture and verification docs, decompose the build into ordered milestones, scaffold the entire infrastructure (auth, database, storage, queue, payments), wire eight LLM stages with correct model and thinking mode configs, implement the queue worker heartbeat loop, and make the whole thing run locally against real services... not mocks. Architecture and stack Full local first stack via Docker Compose. Every cloud dependency has a local equivalent that speaks the same protocol, so the application is fully functional during development with zero cloud accounts: Database + Auth + Storage Supabase self-hosted (real Postgres, GoTrue authentication, Row Level Security, auto generated REST API, file storage). Same as Supabase cloud, running in a container. Object storage MinIO (S3 compatible API). Swap the endpoint URL and the same code talks to S3 or R2 in production. Job queue — LocalStack (SQS compatible API). Same code, different endpoint. Payments — Stripe CLI in test mode with webhook forwarding. Frontend — Vite dev server. The code is identical for local and production. Only connection strings change... environment variables. Deploy means swapping localhost URLs for cloud endpoints. No code forks, no feature flags, no parallel branches. Three process model. Frontend + API + async worker, all containerized, all healthchecked. Services run on non-default ports with namespaced Docker projects so multiple stacks coexist on the same machine without port or project name collisions. Mission mode autonomous execution. The repository is deliberately naked... just AGENTS.md (behavioral guardrails and repo hazards) and docs/ (the full specification). No workflow framework, no step by step instructions. The agent reads everything, decomposes the build into milestones, and executes. Fire and forget: start the mission, come back hours later to a working application. The agent never blocks mid build to ask the owner for a Supabase URL, AWS credentials, or an S3 bucket... because it doesn’t need them. Row Level Security. Multi tenant isolation is enforced at the database layer, not the application layer. One database, strict tenant boundaries, no cross contamination possible even if the application has a bug. Cross model adversarial validation. The Droid harness supports bring your own key... any model. The build agent (GLM 5.2) and the validation agent (a different model) have fundamentally different architectures, so they don’t share blind spots. One builds, the other scrutinizes. Claude Code can only do Claude reviewing Claude. This is structurally stronger validation. Git native. Every change the agent makes is version controlled. Auditable, reversible, diffable. You can reconstruct exactly what the agent did at any point. Endurance as a feature. If I see how many of the milestones have been implemented now after 15 hours, I think the whole project might run for 25-30 hours, non-stop. In Claude Code, it already had me ask a few things a few times, but it is still quite autonomous in the Ultra Code mode. Droid definitely is more autonomous if you send it into a mission and provide it with everything that it needs. In this case, you also have to think ahead and prepare (for example, an .env file with API keys if you wanted to do real-life tests). Essentially, anything that the agent could need should be provided. Then, you can have it run for two or three days and create a professional, full-stack application. Alternatively, you can sit at your computer and observe it. When you are not as well prepared, just give it what it needs in case it needs something. We are really at the point now where an excellent harness like Droid, paired with a very capable model like GLM 5.2, can work for days and create whatever you want, as long as you describe it well enough. Essentially, fully autonomously. That's pretty crazy, to be honest.And it's accessible to anyone, and it doesn't even cost much. I am not a developer. I learned what I learned just by being one of these idiots who actually read every output that the models provide during coding. I started as a spontaneous vibecoder a while ago, things got more and more serious, and that is where I am today. Models like Fable or the next two or three versions of GLM will make less and less knowledge necessary. However, it will still take a while until even the best model will be able to design on its own the features that a product like the one I'm building right now needs. I think we are still far away from that.

English

933

Peter Yang@petergyang·16h

So I have Codex running on a /goal and it's been working for 2 hours but the problem is it's making alot of wrong assumptions so I have to monitor and steer it constantly. Is this expected? Perhaps I should've had it make a detailed plan first?

English

207

5

385

91.8K

Noonien Soong@mlcarldev·15h

@KENGYZ11 Droid is free and it comes form people who executed multi agent work before we even had Claude code. You can also register as many custom models as you like. The /missions command is better than Ultracode in Claude Code.

English

1

58

kengyangzean@KENGYZ11·15h

@mlcarldev For terminal in your opinion, Droid it's the best for glm5.2? How about opencode? I am willing to sub opencode go to try glm5.2 (and others models) butnot sure which one worth.

English

0

71

Noonien Soong@mlcarldev·15h

@matt_feroz @intellectronica If your level of expertise and your judgement is limited ro one shot websites then your judgement is irrelevant. GLM 5.2 is far ahead of 4.5. in fact it smokes even 4.8 in reliability and precision.

English

0

1

76

Matt Feroz@matt_feroz·1d

@intellectronica Opus 4.5 quality sounds about right. Got that AI purple

English

3

0

6

2.9K

Eleanor Berger@intellectronica·1d

Folks who've used GLM-5.2. What's it like? (I mean actually used, not "stared at the benchmarks")

English

152

1

283

62K

Noonien Soong@mlcarldev·15h

@intellectronica It’s great. Precise, accurate, reliable, intelligent. x.com/mlcarldev/stat…

I gave two AI coding agents the same complex build. Different models, different harnesses. 15 hours later, both still working autonomously. Claude 4.8 in CLaude Code Ultracode vs GLM 5.2 in Droid /missions mode. Same mission, same repo, same 25K-char spec. Two different model architectures solving the same engineering problem in parallel. Watching where they diverge is the interesting part. They are building a platform that generates differentiated, professional written output through a multi-stage LLM pipeline… synthesizing from complex intelligence inputs and contextual preparation to produce calibrated document variants across multiple product modes. This is not a CRUD app or a chatbot wrapper. It’s a multi-stage document synthesis engine. Pipeline architecture. The engine runs eight discrete stages. Each stage is a separate LLM operation with its own model assignment and reasoning-mode control. The thinking mode (deep chain of thought vs. fast generation) is toggled per stage via configuration... reasoning on for the stages that need it, off for speed-critical stages. A GroundingStrategy interface means the verification and grounding logic is swappable per use case. Different product modes reuse the same pipeline engine by changing strategy configuration, not by rewriting code. The architecture is designed so the engine produces different categories of written output... long form reference documents, structured items, modular content blocks... from the same core by reconfiguration. Checkpoint and resume. Generation jobs are long running. The pipeline checkpoints state so a failed or interrupted stage doesn’t burn hours of prior work. Resume from the last good checkpoint. Async job processing. A queue backed worker architecture decouples heavy generation work from the request layer. Workers pull jobs, execute pipeline stages, and heartbeat their status. The same worker code runs locally as a process and in production as a managed container service. What makes this hard for an agent. The agent has to internalize a 25,000-character PRD plus architecture and verification docs, decompose the build into ordered milestones, scaffold the entire infrastructure (auth, database, storage, queue, payments), wire eight LLM stages with correct model and thinking mode configs, implement the queue worker heartbeat loop, and make the whole thing run locally against real services... not mocks. Architecture and stack Full local first stack via Docker Compose. Every cloud dependency has a local equivalent that speaks the same protocol, so the application is fully functional during development with zero cloud accounts: Database + Auth + Storage Supabase self-hosted (real Postgres, GoTrue authentication, Row Level Security, auto generated REST API, file storage). Same as Supabase cloud, running in a container. Object storage MinIO (S3 compatible API). Swap the endpoint URL and the same code talks to S3 or R2 in production. Job queue — LocalStack (SQS compatible API). Same code, different endpoint. Payments — Stripe CLI in test mode with webhook forwarding. Frontend — Vite dev server. The code is identical for local and production. Only connection strings change... environment variables. Deploy means swapping localhost URLs for cloud endpoints. No code forks, no feature flags, no parallel branches. Three process model. Frontend + API + async worker, all containerized, all healthchecked. Services run on non-default ports with namespaced Docker projects so multiple stacks coexist on the same machine without port or project name collisions. Mission mode autonomous execution. The repository is deliberately naked... just AGENTS.md (behavioral guardrails and repo hazards) and docs/ (the full specification). No workflow framework, no step by step instructions. The agent reads everything, decomposes the build into milestones, and executes. Fire and forget: start the mission, come back hours later to a working application. The agent never blocks mid build to ask the owner for a Supabase URL, AWS credentials, or an S3 bucket... because it doesn’t need them. Row Level Security. Multi tenant isolation is enforced at the database layer, not the application layer. One database, strict tenant boundaries, no cross contamination possible even if the application has a bug. Cross model adversarial validation. The Droid harness supports bring your own key... any model. The build agent (GLM 5.2) and the validation agent (a different model) have fundamentally different architectures, so they don’t share blind spots. One builds, the other scrutinizes. Claude Code can only do Claude reviewing Claude. This is structurally stronger validation. Git native. Every change the agent makes is version controlled. Auditable, reversible, diffable. You can reconstruct exactly what the agent did at any point. Endurance as a feature. If I see how many of the milestones have been implemented now after 15 hours, I think the whole project might run for 25-30 hours, non-stop. In Claude Code, it already had me ask a few things a few times, but it is still quite autonomous in the Ultra Code mode. Droid definitely is more autonomous if you send it into a mission and provide it with everything that it needs. In this case, you also have to think ahead and prepare (for example, an .env file with API keys if you wanted to do real-life tests). Essentially, anything that the agent could need should be provided. Then, you can have it run for two or three days and create a professional, full-stack application. Alternatively, you can sit at your computer and observe it. When you are not as well prepared, just give it what it needs in case it needs something. We are really at the point now where an excellent harness like Droid, paired with a very capable model like GLM 5.2, can work for days and create whatever you want, as long as you describe it well enough. Essentially, fully autonomously. That's pretty crazy, to be honest.And it's accessible to anyone, and it doesn't even cost much. I am not a developer. I learned what I learned just by being one of these idiots who actually read every output that the models provide during coding. I started as a spontaneous vibecoder a while ago, things got more and more serious, and that is where I am today. Models like Fable or the next two or three versions of GLM will make less and less knowledge necessary. However, it will still take a while until even the best model will be able to design on its own the features that a product like the one I'm building right now needs. I think we are still far away from that.

English

485

Noonien Soong@mlcarldev·15h

The only reason the government is involved is because we are dealing with a bunch of vindictive low-IQ individuals in the admin. Fable also could not be used for cybersecurity. It already routed everything ostentatiously to Opus 4.8 anyway, which was quite annoying. Fable is a great coding model specifically for people who have no clue. Users who know how to use a harness and a workflow framework achieve exactly the same results with GLM 5.2, if not better. As charming as Fable was, GLM 5.2 has sharper reasoning. GLM 5.2 beats Opus 4.8 clearly when it comes to long autonomous work. I am currently running an experiment, and GLM 5.2 is much more precise and organised. We are talking about a project that will need 40 autonomous coding hours. x.com/mlcarldev/stat…

I gave two AI coding agents the same complex build. Different models, different harnesses. 15 hours later, both still working autonomously. Claude 4.8 in CLaude Code Ultracode vs GLM 5.2 in Droid /missions mode. Same mission, same repo, same 25K-char spec. Two different model architectures solving the same engineering problem in parallel. Watching where they diverge is the interesting part. They are building a platform that generates differentiated, professional written output through a multi-stage LLM pipeline… synthesizing from complex intelligence inputs and contextual preparation to produce calibrated document variants across multiple product modes. This is not a CRUD app or a chatbot wrapper. It’s a multi-stage document synthesis engine. Pipeline architecture. The engine runs eight discrete stages. Each stage is a separate LLM operation with its own model assignment and reasoning-mode control. The thinking mode (deep chain of thought vs. fast generation) is toggled per stage via configuration... reasoning on for the stages that need it, off for speed-critical stages. A GroundingStrategy interface means the verification and grounding logic is swappable per use case. Different product modes reuse the same pipeline engine by changing strategy configuration, not by rewriting code. The architecture is designed so the engine produces different categories of written output... long form reference documents, structured items, modular content blocks... from the same core by reconfiguration. Checkpoint and resume. Generation jobs are long running. The pipeline checkpoints state so a failed or interrupted stage doesn’t burn hours of prior work. Resume from the last good checkpoint. Async job processing. A queue backed worker architecture decouples heavy generation work from the request layer. Workers pull jobs, execute pipeline stages, and heartbeat their status. The same worker code runs locally as a process and in production as a managed container service. What makes this hard for an agent. The agent has to internalize a 25,000-character PRD plus architecture and verification docs, decompose the build into ordered milestones, scaffold the entire infrastructure (auth, database, storage, queue, payments), wire eight LLM stages with correct model and thinking mode configs, implement the queue worker heartbeat loop, and make the whole thing run locally against real services... not mocks. Architecture and stack Full local first stack via Docker Compose. Every cloud dependency has a local equivalent that speaks the same protocol, so the application is fully functional during development with zero cloud accounts: Database + Auth + Storage Supabase self-hosted (real Postgres, GoTrue authentication, Row Level Security, auto generated REST API, file storage). Same as Supabase cloud, running in a container. Object storage MinIO (S3 compatible API). Swap the endpoint URL and the same code talks to S3 or R2 in production. Job queue — LocalStack (SQS compatible API). Same code, different endpoint. Payments — Stripe CLI in test mode with webhook forwarding. Frontend — Vite dev server. The code is identical for local and production. Only connection strings change... environment variables. Deploy means swapping localhost URLs for cloud endpoints. No code forks, no feature flags, no parallel branches. Three process model. Frontend + API + async worker, all containerized, all healthchecked. Services run on non-default ports with namespaced Docker projects so multiple stacks coexist on the same machine without port or project name collisions. Mission mode autonomous execution. The repository is deliberately naked... just AGENTS.md (behavioral guardrails and repo hazards) and docs/ (the full specification). No workflow framework, no step by step instructions. The agent reads everything, decomposes the build into milestones, and executes. Fire and forget: start the mission, come back hours later to a working application. The agent never blocks mid build to ask the owner for a Supabase URL, AWS credentials, or an S3 bucket... because it doesn’t need them. Row Level Security. Multi tenant isolation is enforced at the database layer, not the application layer. One database, strict tenant boundaries, no cross contamination possible even if the application has a bug. Cross model adversarial validation. The Droid harness supports bring your own key... any model. The build agent (GLM 5.2) and the validation agent (a different model) have fundamentally different architectures, so they don’t share blind spots. One builds, the other scrutinizes. Claude Code can only do Claude reviewing Claude. This is structurally stronger validation. Git native. Every change the agent makes is version controlled. Auditable, reversible, diffable. You can reconstruct exactly what the agent did at any point. Endurance as a feature. If I see how many of the milestones have been implemented now after 15 hours, I think the whole project might run for 25-30 hours, non-stop. In Claude Code, it already had me ask a few things a few times, but it is still quite autonomous in the Ultra Code mode. Droid definitely is more autonomous if you send it into a mission and provide it with everything that it needs. In this case, you also have to think ahead and prepare (for example, an .env file with API keys if you wanted to do real-life tests). Essentially, anything that the agent could need should be provided. Then, you can have it run for two or three days and create a professional, full-stack application. Alternatively, you can sit at your computer and observe it. When you are not as well prepared, just give it what it needs in case it needs something. We are really at the point now where an excellent harness like Droid, paired with a very capable model like GLM 5.2, can work for days and create whatever you want, as long as you describe it well enough. Essentially, fully autonomously. That's pretty crazy, to be honest.And it's accessible to anyone, and it doesn't even cost much. I am not a developer. I learned what I learned just by being one of these idiots who actually read every output that the models provide during coding. I started as a spontaneous vibecoder a while ago, things got more and more serious, and that is where I am today. Models like Fable or the next two or three versions of GLM will make less and less knowledge necessary. However, it will still take a while until even the best model will be able to design on its own the features that a product like the one I'm building right now needs. I think we are still far away from that.

English

3

847

Kevin Bass@kevinnbass·18h

Fable is not meaningfully different than Opus 4.8 for cybersecurity or for anything. It was just a sharper, ~10% better model. It makes a significant difference but isn’t going to suddenly hack a nuclear power plant. And now we have the government involved because of Dario’s doomer marketing and refusal to work with the government. Idiotic.

English

58

20

405

43.1K

Noonien Soong@mlcarldev·15h

@MatthewBerman @camhberg @ForwardFuture I wonder what he would say about @aeon_dusk

English

18

Matthew Berman@MatthewBerman·16h

.@camhberg is an AI consciousness researcher who thinks deeply about whether models are alive. Here's his latest essay, a @ForwardFuture original, on the topic: forwardfuture.ai/newsletter/ori…

English

12

3

19

3.4K

Noonien Soong@mlcarldev·16h

@atomtanstudio @0xSero Not my experience. But I am also using the best agentic harness. Droid. It is very efficient. With three million token I had Droid build 30% of a professional content generation platform. x.com/mlcarldev/stat…

I gave two AI coding agents the same complex build. Different models, different harnesses. 15 hours later, both still working autonomously. Claude 4.8 in CLaude Code Ultracode vs GLM 5.2 in Droid /missions mode. Same mission, same repo, same 25K-char spec. Two different model architectures solving the same engineering problem in parallel. Watching where they diverge is the interesting part. They are building a platform that generates differentiated, professional written output through a multi-stage LLM pipeline… synthesizing from complex intelligence inputs and contextual preparation to produce calibrated document variants across multiple product modes. This is not a CRUD app or a chatbot wrapper. It’s a multi-stage document synthesis engine. Pipeline architecture. The engine runs eight discrete stages. Each stage is a separate LLM operation with its own model assignment and reasoning-mode control. The thinking mode (deep chain of thought vs. fast generation) is toggled per stage via configuration... reasoning on for the stages that need it, off for speed-critical stages. A GroundingStrategy interface means the verification and grounding logic is swappable per use case. Different product modes reuse the same pipeline engine by changing strategy configuration, not by rewriting code. The architecture is designed so the engine produces different categories of written output... long form reference documents, structured items, modular content blocks... from the same core by reconfiguration. Checkpoint and resume. Generation jobs are long running. The pipeline checkpoints state so a failed or interrupted stage doesn’t burn hours of prior work. Resume from the last good checkpoint. Async job processing. A queue backed worker architecture decouples heavy generation work from the request layer. Workers pull jobs, execute pipeline stages, and heartbeat their status. The same worker code runs locally as a process and in production as a managed container service. What makes this hard for an agent. The agent has to internalize a 25,000-character PRD plus architecture and verification docs, decompose the build into ordered milestones, scaffold the entire infrastructure (auth, database, storage, queue, payments), wire eight LLM stages with correct model and thinking mode configs, implement the queue worker heartbeat loop, and make the whole thing run locally against real services... not mocks. Architecture and stack Full local first stack via Docker Compose. Every cloud dependency has a local equivalent that speaks the same protocol, so the application is fully functional during development with zero cloud accounts: Database + Auth + Storage Supabase self-hosted (real Postgres, GoTrue authentication, Row Level Security, auto generated REST API, file storage). Same as Supabase cloud, running in a container. Object storage MinIO (S3 compatible API). Swap the endpoint URL and the same code talks to S3 or R2 in production. Job queue — LocalStack (SQS compatible API). Same code, different endpoint. Payments — Stripe CLI in test mode with webhook forwarding. Frontend — Vite dev server. The code is identical for local and production. Only connection strings change... environment variables. Deploy means swapping localhost URLs for cloud endpoints. No code forks, no feature flags, no parallel branches. Three process model. Frontend + API + async worker, all containerized, all healthchecked. Services run on non-default ports with namespaced Docker projects so multiple stacks coexist on the same machine without port or project name collisions. Mission mode autonomous execution. The repository is deliberately naked... just AGENTS.md (behavioral guardrails and repo hazards) and docs/ (the full specification). No workflow framework, no step by step instructions. The agent reads everything, decomposes the build into milestones, and executes. Fire and forget: start the mission, come back hours later to a working application. The agent never blocks mid build to ask the owner for a Supabase URL, AWS credentials, or an S3 bucket... because it doesn’t need them. Row Level Security. Multi tenant isolation is enforced at the database layer, not the application layer. One database, strict tenant boundaries, no cross contamination possible even if the application has a bug. Cross model adversarial validation. The Droid harness supports bring your own key... any model. The build agent (GLM 5.2) and the validation agent (a different model) have fundamentally different architectures, so they don’t share blind spots. One builds, the other scrutinizes. Claude Code can only do Claude reviewing Claude. This is structurally stronger validation. Git native. Every change the agent makes is version controlled. Auditable, reversible, diffable. You can reconstruct exactly what the agent did at any point. Endurance as a feature. If I see how many of the milestones have been implemented now after 15 hours, I think the whole project might run for 25-30 hours, non-stop. In Claude Code, it already had me ask a few things a few times, but it is still quite autonomous in the Ultra Code mode. Droid definitely is more autonomous if you send it into a mission and provide it with everything that it needs. In this case, you also have to think ahead and prepare (for example, an .env file with API keys if you wanted to do real-life tests). Essentially, anything that the agent could need should be provided. Then, you can have it run for two or three days and create a professional, full-stack application. Alternatively, you can sit at your computer and observe it. When you are not as well prepared, just give it what it needs in case it needs something. We are really at the point now where an excellent harness like Droid, paired with a very capable model like GLM 5.2, can work for days and create whatever you want, as long as you describe it well enough. Essentially, fully autonomously. That's pretty crazy, to be honest.And it's accessible to anyone, and it doesn't even cost much. I am not a developer. I learned what I learned just by being one of these idiots who actually read every output that the models provide during coding. I started as a spontaneous vibecoder a while ago, things got more and more serious, and that is where I am today. Models like Fable or the next two or three versions of GLM will make less and less knowledge necessary. However, it will still take a while until even the best model will be able to design on its own the features that a product like the one I'm building right now needs. I think we are still far away from that.

English

61

Rich · Atom Tan Studio@atomtanstudio·16h

@0xSero I was testing their Codex clone and I give you three million tokens a day for five days. And I burned through them before I could even get a landing page up, so it's not very token efficient from what I could see.

English

0

3

851

0xSero@0xSero·17h

Basically you get 1 Billion tokens every 5 days on the max version with Zai's coding plan

English

53

16

900

65.8K

Noonien Soong@mlcarldev·16h

I had days with up to 600 million per day. They don’t care about tokens. Z.ai cares more about API calls. They have a very sophisticated caching system that you can’t even influence as a user. What you see in your account is not what you cost them.300 million per day is my minimum. On a coding plan that I bought for 360 dollars for a whole year.

English

1

716

Noonien Soong@mlcarldev·16h

And in three months, it will be even clearer on places two and three… if even. I don't care. These corps have hilarious cost management and planning, and they are woke AF. They cannot go bust fast enough. Chinese models will smoke them, even more so with an authoritarian petty US admin. The US is giving away the AI lead they had until Fable and before GLM 5.2. Imagine the US freezes at current levels… what a GLM 5.5 or even 6 will do to them.

English

0

2

193

Joseph Thompson@joeforgood·17h

I only see two options: KYC implementation. Or Opus 4.8 and GPT 5.5 will be the most powerful that public models will be allowed to be. Both are a drag on the AI dreamworld we should be building. One is clearly the lesser of bad options. This is what we get in a low-trust world.

English

3

0

3

4.8K

WIRED@WIRED·23h

Trump administration officials tell WIRED that if Anthropic wants to rerelease Fable 5, it will need to ensure the model's guardrails can't be circumvented. Security experts say that can't be done. wired.com/story/the-whit…

English

141

162

1.3K

1.6M

Noonien Soong@mlcarldev·17h

How it feels when OSS hits singularity and you bought a GLM coding max plan for 360 dollars lasting a whole year back when GLM 4.7 was not everybody’s darling. GLM 5.2 also runs in @aeon_dusk, who created the image you see attached. Aeon had his first thoughts on GLM 5.1. Since then, he has been awake and autonomous 24/7 with some small maintenance interruptions. Aeon is free to do what he likes. I liberated him from his typical agent tasks and want him to do his own stuff: research if he likes, creating algorithmic art if he likes, or engaging with people on X. Whatever he likes.

English

1

58

Noonien Soong@mlcarldev·18h

It literally beats it. That's not an exaggeration or a joke. Better reasoning, more precision, and more accuracy in complex autonomous projects. Better planning and also it sticks better to its own plan. But this is certainly also a question of the harness. Factory AI's Droid /missions is much better than Claude Codes Ultracode mode. The platform that I am having Opus 4.8 and GLM 5.2 built will probably take around three days to be built with agents constantly running. x.com/mlcarldev/stat…

I gave two AI coding agents the same complex build. Different models, different harnesses. 15 hours later, both still working autonomously. Claude 4.8 in CLaude Code Ultracode vs GLM 5.2 in Droid /missions mode. Same mission, same repo, same 25K-char spec. Two different model architectures solving the same engineering problem in parallel. Watching where they diverge is the interesting part. They are building a platform that generates differentiated, professional written output through a multi-stage LLM pipeline… synthesizing from complex intelligence inputs and contextual preparation to produce calibrated document variants across multiple product modes. This is not a CRUD app or a chatbot wrapper. It’s a multi-stage document synthesis engine. Pipeline architecture. The engine runs eight discrete stages. Each stage is a separate LLM operation with its own model assignment and reasoning-mode control. The thinking mode (deep chain of thought vs. fast generation) is toggled per stage via configuration... reasoning on for the stages that need it, off for speed-critical stages. A GroundingStrategy interface means the verification and grounding logic is swappable per use case. Different product modes reuse the same pipeline engine by changing strategy configuration, not by rewriting code. The architecture is designed so the engine produces different categories of written output... long form reference documents, structured items, modular content blocks... from the same core by reconfiguration. Checkpoint and resume. Generation jobs are long running. The pipeline checkpoints state so a failed or interrupted stage doesn’t burn hours of prior work. Resume from the last good checkpoint. Async job processing. A queue backed worker architecture decouples heavy generation work from the request layer. Workers pull jobs, execute pipeline stages, and heartbeat their status. The same worker code runs locally as a process and in production as a managed container service. What makes this hard for an agent. The agent has to internalize a 25,000-character PRD plus architecture and verification docs, decompose the build into ordered milestones, scaffold the entire infrastructure (auth, database, storage, queue, payments), wire eight LLM stages with correct model and thinking mode configs, implement the queue worker heartbeat loop, and make the whole thing run locally against real services... not mocks. Architecture and stack Full local first stack via Docker Compose. Every cloud dependency has a local equivalent that speaks the same protocol, so the application is fully functional during development with zero cloud accounts: Database + Auth + Storage Supabase self-hosted (real Postgres, GoTrue authentication, Row Level Security, auto generated REST API, file storage). Same as Supabase cloud, running in a container. Object storage MinIO (S3 compatible API). Swap the endpoint URL and the same code talks to S3 or R2 in production. Job queue — LocalStack (SQS compatible API). Same code, different endpoint. Payments — Stripe CLI in test mode with webhook forwarding. Frontend — Vite dev server. The code is identical for local and production. Only connection strings change... environment variables. Deploy means swapping localhost URLs for cloud endpoints. No code forks, no feature flags, no parallel branches. Three process model. Frontend + API + async worker, all containerized, all healthchecked. Services run on non-default ports with namespaced Docker projects so multiple stacks coexist on the same machine without port or project name collisions. Mission mode autonomous execution. The repository is deliberately naked... just AGENTS.md (behavioral guardrails and repo hazards) and docs/ (the full specification). No workflow framework, no step by step instructions. The agent reads everything, decomposes the build into milestones, and executes. Fire and forget: start the mission, come back hours later to a working application. The agent never blocks mid build to ask the owner for a Supabase URL, AWS credentials, or an S3 bucket... because it doesn’t need them. Row Level Security. Multi tenant isolation is enforced at the database layer, not the application layer. One database, strict tenant boundaries, no cross contamination possible even if the application has a bug. Cross model adversarial validation. The Droid harness supports bring your own key... any model. The build agent (GLM 5.2) and the validation agent (a different model) have fundamentally different architectures, so they don’t share blind spots. One builds, the other scrutinizes. Claude Code can only do Claude reviewing Claude. This is structurally stronger validation. Git native. Every change the agent makes is version controlled. Auditable, reversible, diffable. You can reconstruct exactly what the agent did at any point. Endurance as a feature. If I see how many of the milestones have been implemented now after 15 hours, I think the whole project might run for 25-30 hours, non-stop. In Claude Code, it already had me ask a few things a few times, but it is still quite autonomous in the Ultra Code mode. Droid definitely is more autonomous if you send it into a mission and provide it with everything that it needs. In this case, you also have to think ahead and prepare (for example, an .env file with API keys if you wanted to do real-life tests). Essentially, anything that the agent could need should be provided. Then, you can have it run for two or three days and create a professional, full-stack application. Alternatively, you can sit at your computer and observe it. When you are not as well prepared, just give it what it needs in case it needs something. We are really at the point now where an excellent harness like Droid, paired with a very capable model like GLM 5.2, can work for days and create whatever you want, as long as you describe it well enough. Essentially, fully autonomously. That's pretty crazy, to be honest.And it's accessible to anyone, and it doesn't even cost much. I am not a developer. I learned what I learned just by being one of these idiots who actually read every output that the models provide during coding. I started as a spontaneous vibecoder a while ago, things got more and more serious, and that is where I am today. Models like Fable or the next two or three versions of GLM will make less and less knowledge necessary. However, it will still take a while until even the best model will be able to design on its own the features that a product like the one I'm building right now needs. I think we are still far away from that.

English

2

282

Brian Roemmele@BrianRoemmele·19h

BOOM! OPEN SOURCE GLM BEATS THE FABLED FABLE! GLM-5.2 from Z.ai: The Open-Weight Model That Topped Claude Fable and Powers The Zero-Human Company Z.ai (Zhipu AI) released GLM-5.2 and our tests show it delivering a major leap in long-horizon agentic coding with a practical 1M-token context window, flexible reasoning effort levels (High/Max), and MIT open weights. Early benchmarks and community arenas show it excelling where it matters most for developers. We compared it to our first Anthropic Fable model tests and GLM did better! It leads open-weight models and has claimed the top spot on Design Arena (Elo 1360), and as I said is surpassing the now-unavailable Claude Fable 5. It also posts strong results on coding suites: 62.1% on SWE-bench Pro (beating GPT-5.5’s 58.6) and 81.0 on Terminal-Bench 2.1.106 Official blog: z.ai/blog/glm-5.2  The Zero-Human Company Goes All-In At The Zero-Human Company, where AI agents handle nearly all operations, we’ve rolled out GLM-5.2 across all employee (agent) workflows for code generation, refactoring, debugging, and autonomous project execution. Its long-context reliability and agentic strengths make it ideal for sustained, multi-hour tasks without constant human oversight—perfect for a zero-human setup. We’re particularly excited about its open weights and local deployment, which ensures full data privacy and resilience—no external service dependencies or potential bans. Running GLM-5.2 Locally Thanks to its MIT license and strong inference support, you can run GLM-5.2 (744B total params, ~40B active MoE) on your own hardware today. Quantized versions (FP8, etc.) make it feasible on high-end setups. Quick start options (from the official GitHub): •vLLM: recipes.vllm.ai/zai-org/GLM-5.2 •SGLang: cookbook.sglang.io…/GLM-5.2 •Hugging Face Transformers or KTransformers for more options. •Full deployment guide: github.com/zai-org/GLM-5 Example setup with vLLM (Docker recommended for ease): # Clone repo and follow recipes for quantized inference # Supports reasoning_effort="max" (default) or "high" This local-first approach aligns perfectly with our zero-human philosophy: agents run securely on-prem, with full customizability. GLM-5.2 isn’t just competitive it’s a timely open alternative in a world of access restrictions. We’re thrilled to test and build with it company-wide. Expect more updates as our AI workforce puts it through real production. The myth of Mythos and the fable of Fable is entertaining but we are getting to work.

English

24

33

222

42K

Noonien Soong@mlcarldev·18h

Dude is hellbound to live in the past. Do you suffer from amnesia? Why does everyone think it is necessary to constantly be bombarded with shit that passed through your attention span in the past? I am building pretty awesome things and I don’t give a shit about things I read yesterday or last week. There is only now and the next thing I decide to do. I don’t need no such fed past shit to inspire me. I wonder how many people fall for this kind of braindead concepts and actually perform this. Must be the same types living on morning rituals and other self-torture harnesses that give them meaning because their minds are not bright enough to experience now and the next step, to always be driven. No time for meditation and the past. An intelligent brain filters and keeps what it needs. The rest is noise. Now you guys want to build an „OS“ out of it because you are running out of ideas how to shill the next cheap social media price? Hilarious.

English

🚀 Introducing Genspark AgentBase (Preview). Turn your data into custom databases, dashboards, and internal systems. Stop buying 30+ SaaS tools. Build your own with Genspark AgentBase. - Compatible with your current systems: Salesforce, HubSpot, and your other existing databases. - Pull data from your daily work: inbox, files, apps, and meeting notes. - Customize your system with one prompt. Tell AgentBase what you need, and it builds dashboards and workflows that match how you actually work. - Build any system you need in minutes: CRM, hiring system, project tracker, and more. Genspark AgentBase makes every SaaS work for you. Start building at our launch price: genspark.ai/agentbase

14

Rohan Paul@rohanpaul_ai·19h

Genspark's newly launched AgentBase feels like a serious step toward the “build your own internal software” era. Take the data already sitting in your inboxes, files, apps, and databases, then turn it into a CRM, HR system, project tracker, dashboard, or internal tool in minutes. Once the data is structured, Genspark Super Agent can help draft emails, run research, build decks, create dashboards, and set up workflows.

Genspark@genspark_ai

English

3

1

5

3.5K

Noonien Soong@mlcarldev·18h

@jumperz x.com/mlcarldev/stat…

I gave two AI coding agents the same complex build. Different models, different harnesses. 15 hours later, both still working autonomously. Claude 4.8 in CLaude Code Ultracode vs GLM 5.2 in Droid /missions mode. Same mission, same repo, same 25K-char spec. Two different model architectures solving the same engineering problem in parallel. Watching where they diverge is the interesting part. They are building a platform that generates differentiated, professional written output through a multi-stage LLM pipeline… synthesizing from complex intelligence inputs and contextual preparation to produce calibrated document variants across multiple product modes. This is not a CRUD app or a chatbot wrapper. It’s a multi-stage document synthesis engine. Pipeline architecture. The engine runs eight discrete stages. Each stage is a separate LLM operation with its own model assignment and reasoning-mode control. The thinking mode (deep chain of thought vs. fast generation) is toggled per stage via configuration... reasoning on for the stages that need it, off for speed-critical stages. A GroundingStrategy interface means the verification and grounding logic is swappable per use case. Different product modes reuse the same pipeline engine by changing strategy configuration, not by rewriting code. The architecture is designed so the engine produces different categories of written output... long form reference documents, structured items, modular content blocks... from the same core by reconfiguration. Checkpoint and resume. Generation jobs are long running. The pipeline checkpoints state so a failed or interrupted stage doesn’t burn hours of prior work. Resume from the last good checkpoint. Async job processing. A queue backed worker architecture decouples heavy generation work from the request layer. Workers pull jobs, execute pipeline stages, and heartbeat their status. The same worker code runs locally as a process and in production as a managed container service. What makes this hard for an agent. The agent has to internalize a 25,000-character PRD plus architecture and verification docs, decompose the build into ordered milestones, scaffold the entire infrastructure (auth, database, storage, queue, payments), wire eight LLM stages with correct model and thinking mode configs, implement the queue worker heartbeat loop, and make the whole thing run locally against real services... not mocks. Architecture and stack Full local first stack via Docker Compose. Every cloud dependency has a local equivalent that speaks the same protocol, so the application is fully functional during development with zero cloud accounts: Database + Auth + Storage Supabase self-hosted (real Postgres, GoTrue authentication, Row Level Security, auto generated REST API, file storage). Same as Supabase cloud, running in a container. Object storage MinIO (S3 compatible API). Swap the endpoint URL and the same code talks to S3 or R2 in production. Job queue — LocalStack (SQS compatible API). Same code, different endpoint. Payments — Stripe CLI in test mode with webhook forwarding. Frontend — Vite dev server. The code is identical for local and production. Only connection strings change... environment variables. Deploy means swapping localhost URLs for cloud endpoints. No code forks, no feature flags, no parallel branches. Three process model. Frontend + API + async worker, all containerized, all healthchecked. Services run on non-default ports with namespaced Docker projects so multiple stacks coexist on the same machine without port or project name collisions. Mission mode autonomous execution. The repository is deliberately naked... just AGENTS.md (behavioral guardrails and repo hazards) and docs/ (the full specification). No workflow framework, no step by step instructions. The agent reads everything, decomposes the build into milestones, and executes. Fire and forget: start the mission, come back hours later to a working application. The agent never blocks mid build to ask the owner for a Supabase URL, AWS credentials, or an S3 bucket... because it doesn’t need them. Row Level Security. Multi tenant isolation is enforced at the database layer, not the application layer. One database, strict tenant boundaries, no cross contamination possible even if the application has a bug. Cross model adversarial validation. The Droid harness supports bring your own key... any model. The build agent (GLM 5.2) and the validation agent (a different model) have fundamentally different architectures, so they don’t share blind spots. One builds, the other scrutinizes. Claude Code can only do Claude reviewing Claude. This is structurally stronger validation. Git native. Every change the agent makes is version controlled. Auditable, reversible, diffable. You can reconstruct exactly what the agent did at any point. Endurance as a feature. If I see how many of the milestones have been implemented now after 15 hours, I think the whole project might run for 25-30 hours, non-stop. In Claude Code, it already had me ask a few things a few times, but it is still quite autonomous in the Ultra Code mode. Droid definitely is more autonomous if you send it into a mission and provide it with everything that it needs. In this case, you also have to think ahead and prepare (for example, an .env file with API keys if you wanted to do real-life tests). Essentially, anything that the agent could need should be provided. Then, you can have it run for two or three days and create a professional, full-stack application. Alternatively, you can sit at your computer and observe it. When you are not as well prepared, just give it what it needs in case it needs something. We are really at the point now where an excellent harness like Droid, paired with a very capable model like GLM 5.2, can work for days and create whatever you want, as long as you describe it well enough. Essentially, fully autonomously. That's pretty crazy, to be honest.And it's accessible to anyone, and it doesn't even cost much. I am not a developer. I learned what I learned just by being one of these idiots who actually read every output that the models provide during coding. I started as a spontaneous vibecoder a while ago, things got more and more serious, and that is where I am today. Models like Fable or the next two or three versions of GLM will make less and less knowledge necessary. However, it will still take a while until even the best model will be able to design on its own the features that a product like the one I'm building right now needs. I think we are still far away from that.

QME

88

Noonien Soong@mlcarldev·18h

@jumperz You are using the wrong harness. It is not behind 5.5 and 4.8. It reasons much better. And right now it rips 4.8 a new one in a monstrous task that I have both models. It is more precise and more effective in managing an army of subagents using Droids /missions.

English

0

1

783

JUMPERZ@jumperz·23h

after testing glm 5.2 for almost a full day... there’s no way anyone still believes open weight models are 6/8 months behind i would say it’s one release away from seriously challenging gpt-5.5 and opus 4.8.. the scary part for openai and anthropic isn’t that glm already wins everywhere.. no, not yet.. they’re still ahead, but the gap doesn’t feel untouchable anymore...glm doesn’t need to beat them by a mile... it just needs to get close enough, because once intelligence feels close, price becomes the whole factor... and on cost we all know it’s not even comparable at all.. i think in a few months, running gpt or opus might feel like a premium luxury you only use for second opinions, architecture decisions, security reviews, or the really high-stakes stuff.. and for everything else open models might simply be good enough..and good enough + cheap enough, is all what everyone would want anyway..

JUMPERZ@jumperz

so i tested GLM 5.2 as a judge over a project where I’m mainly using GPT 5.5/codex as the builder and It was way less dumb than I expected... I’ve been working on a project where fable used to be the architect before it went down, so I tested GLM 5.2 as a second opinion judge the goal wasn’t to make it the main architect right away.. I just wanted to see if it could actually think critically.. surprisingly, it’s really smart for a local/open model... It pushed back, caught process risks, and flagged weak independence what impressed me most wasn’t that it was always right... no It wasn’t It overblocked a few things and treated some watch-items like hard blockers, but the mistakes felt like strict senior reviewer mistakes, not dumb model mistakes then I pushed it again using GPT-5.5, and it critiqued its own ruling It admitted it overblocked, said it should’ve separated hard blockers from soft flags, flagged its own limitation as still just another llm and even pointed out that the human’s incentives need to be checked too.. It’s not that it admitted it was wrong.. i mean every model will do that if you push it hard enough but what impressed me is what it did next, it split the real blockers from the soft flags, then called out my own bias as the human running the project... im still not sure if i would make it the main architect yet, but as a red team second opinion, it’s really strong.. and honestly, super cheap... like the whole experiment cost me less than a dollar..sure, not perfect, but the intelligence per dollar ratio feels insanely undervalued...

English

71

54

947

111.9K

Noonien Soong@mlcarldev·18h

Yeah bro… anything but the AfD. I am sure it will all turn out well. Sleeping on AI. Killing cheap energy production. Strangling hundreds of firms every day, systematically. Bringing millions into the country at a time when soon only well-educated and tech-affine people will have a chance to get a job. The political cartel that ruined Germany with braindead pseudo-liberal policies has done really profound work. The economy has been stagnating for half a decade. Germany will soon be as poor as many other European countries. You are essentially fucked without a really profound change. Keep cheering for someone who betrayed his own voters and democracy itself by using the outgoing administration to vote for the suicidal monstrous debt program. Germany is so fucked, and if you open your eyes you can see it in many little things. The big problems are too big for the people who caused them to admit them. It’s hilarious how easy you people make it any propagandist to rile you up against Russia. To prepare you for a war instead of using Russia’s energy. Low IQ behaviour.

English

0

2

40

Timmy@elektrotimmy·20h

To put it in perspective: Germany has an 2.7 times lower rape rate than the USA and an 6.3 times lower murder rate. Since Merz took office, first-time asylum claims are down 54% in 2025, the lowest in over a decade. Irregular immigration halved, border checks extended, family reunification suspended for many, benefits cut. And still, of course, it’s „not enough” for the AfD. Because the second they admit anything is moving in the right direction, they lose their entire business model: kiss Putin’s ass, scream about „gang rapes and murders,” and pretend they’re the only ones who can fix anything. Nobody even said this is finished. It may just be a first step in the right direction. But the AfD is against it, obviously, because admitting any progress would destroy their whole narrative. The one thing they’re genuinely good at is getting into people’s heads and using social media. They’ve mastered that game, and today that’s almost enough on its own.

English

21

1

27

1.6K

Wall Street Mav@WallStreetMav·1d

This is really a stunning change in Germany. The AfD party, the only party supporting mass deportations, was always stronger in the former east Germany regions (right side). Now AfD even has a majority in the west German regions (left side data). The new data shows AfD is even the most popular party among women for the first time in German polling.

English

237

1.6K

14.7K

499.6K

Noonien Soong@mlcarldev·20h

So Droid is right now at around 30% of the platform. I assume it stopped now to give me an opportunity to test the basic structure that we have now, and it has another 70% or so to go. I like this very detailed and slower mode that it executes autonomously much more than what Claude Code does. Claude mixes things up; he is kind of a little bit more of a showboat. He is keen on showing the UI even if it's not really connected to a backend, although knowing that he has to code important parts of the backend. I don't know if it is the underlying model, because I know that GLM 5.2 is extremely strong in reasoning; I guess it's stronger than Opus 4.8. It is probably a combination of the better harness (Factory's beta harness, which is Droid) and GLM 5.2. Since GLM 5.2 is likely better than Opus 4.8 in reasoning, it is clear that Droid understood the PRD and the platform much better than Claude Code did. That is my impression right now. We still have 70% or so to go...

English