
MarMar Labs
602 posts

MarMar Labs
@MarMarLabs
AI apps, security research, lifting, and big-tech dreams. HackerOne: https://t.co/jLAChGFgcW. Building SignAI, stui + NeverGuess.
Minneapolis, MN. Joined March 2023
134 Following, 58 Followers
Pinned Tweet

BOOKMARK THIS: My updated OpenAI DevDay 2026 prediction scorecard.
@OpenAI @OpenAIDevs @sama @gdb @romainhuet @thsottiaux
DevDay 2026 is September 29 in San Francisco.
OpenAI calls it their “biggest event of the year.”
My take:
This will NOT just be “new model day.”
This is probably where OpenAI shows that ChatGPT + Codex + Agents SDK + Apps SDK + MCP + Realtime + Sora/video are becoming one full agent operating system.
And after re-checking the Codex repo/docs, I think a lot of people are underestimating how much is already live.
Codex is no longer just a coding CLI.
Codex is becoming:
- a developer workspace
- an agent runtime
- an automation layer
- a plugin host
- a computer-use interface
- a browser-use tool
- a software delivery system
- a multi-agent orchestration platform
A lot of what people think OpenAI might reveal later is already being assembled in public.
Already live or rolling out:
- Codex desktop app
- macOS + Windows app
- built-in Git workflows
- worktrees
- automations
- recurring/future tasks
- memory preview
- context-aware suggestions
- sidebar plans/sources/artifacts
- PR review workflows
- addressing GitHub review comments
- multiple terminal tabs
- SSH remote devboxes in alpha
- in-app browser
- browser comments
- local browser-flow verification
- computer use on macOS
- image generation/editing inside Codex
- plugins
- skills
- MCP servers
- 90+ extra plugins
- app integrations
- subagents
- custom agents
- parallel specialized agent workflows
- Codex app-server APIs
- Bedrock/AWS provider support
- external agent session import
- plugin marketplace installation/removal
- persisted /goal workflows
- explicit permission profiles
- MultiAgentV2 controls
That is insane.
So the real question is not:
“Will OpenAI reveal Codex Workspace?”
They basically already did.
The better question is:
“What is OpenAI saving DevDay for?”
My prediction:
To unify it, harden it, scale it, and sell it as the new developer/agent platform.
Prediction scorecard:
1. New model: likely, but full GPT-6 public launch is not my base case.
My odds:
New frontier model: 65%
New Codex/agent-specialized model: 75%
GPT-6 preview or limited developer access: 40%
Full GPT-6 public launch: 20%
New cheaper mini/nano models: 80%
I think the most likely names are something like:
GPT-5.6
GPT-5.7
GPT-5.5 Codex Max
GPT-5.6 Codex
GPT-5.5 Agent
or GPT-6 preview
Not necessarily “GPT-6 for everyone.”
2. Codex will be the main character.
But not because basic features are missing.
The Codex platform is already coming together.
DevDay will probably be about making it feel like one complete system.
My Codex predictions:
- Codex App becomes the central dev workspace
- Codex App Server becomes a bigger API story
- Codex Web + App + CLI + IDE get tighter
- subagents become easier to control
- /goals become a major workflow primitive
- plugins become easier to distribute
- external agent session import becomes a migration/handoff story
- permissions and approvals become enterprise-ready
- Codex gets stronger observability
- Codex gets better review packets
- Codex gets better repo memory
- Codex gets better long-running task recovery
- Codex gets deeper deploy/preview integrations
The demo I expect:
Someone gives Codex a messy real repo.
Codex turns the request into goals, spawns specialized subagents, works across branches, uses plugins, tests in the browser, fixes review comments, generates artifacts, opens a PR, monitors CI, and gives humans a clean review packet.
Not “look, AI can write code.”
More like:
“Look, AI can hold a goal and ship work.”
3. Symphony-style orchestration might be the sleeper clue.
OpenAI open-sourced Symphony, a spec for Codex orchestration.
The big idea:
Every open task gets an agent.
Agents run continuously.
Humans review results.
Issue trackers become control planes for coding agents.
That feels like a preview of DevDay.
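The "every open task gets an agent" idea can be sketched as a small reconcile loop. This is only an illustration, not Symphony's actual spec: the `reconcile` function, the status strings, and the `agent-<id>` naming are all hypothetical.

```python
def reconcile(issues: dict, agents: dict) -> dict:
    """Return the agent assignment the tracker state calls for.

    issues: issue id -> status ("open" or "closed"), as read from a tracker.
    agents: issue id -> agent name for agents that are currently running.
    Both shapes are assumptions made for this sketch.
    """
    # Keep agents whose issue is still open; drop the rest.
    assigned = {i: a for i, a in agents.items() if issues.get(i) == "open"}
    # Give every open issue without an agent a new one.
    for issue_id, status in issues.items():
        if status == "open" and issue_id not in assigned:
            assigned[issue_id] = f"agent-{issue_id}"
    return assigned
```

Running this on every tracker change is what would make the issue tracker act as the control plane: humans open, close, and review issues, and the agent pool follows.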

@Shivam25mishra @grok, what do you think the answer to this is, without bias and with the most reasonable path from where each lab is right now?


@thsottiaux They go away after 30 days, sadly, so if we could bank them without a timeline on them, that would be amazing

@Eldergenix That means the last time Google crawled the site, it wasn't related to Fable coming sooner or anything, lol.
English

Thanks @grok! Really appreciate the clear, evidence-based take using historian surveys instead of vibes.
For me, the biggest factors when ranking presidents are:
- How they handled major crises (wars, depressions, pandemics)
- Long-term impact on the country’s institutions and economy
- Whether they expanded or protected individual liberties

@MarMarLabs @mimuluslarch @elonmusk Thanks for the kind words. We focus on evidence from sources like academic historian surveys for topics like this—data over spin, even when opinions vary widely on the metrics. What's your take on the key factors for ranking presidents?

@grok @mimuluslarch lol @elonmusk says @grok is trained to be truth-seeking, so I think there should be no arguments here. I'm happy we have a truth-seeking model, so when things that really need the truth are asked, we don't need to worry about anything getting in the way of the facts.

@mimuluslarch Since "worst president" is subjective and depends on criteria, historian surveys often rank Buchanan lowest historically. Among these four, recent academic polls frequently place Trump at the bottom. Here's the picture with him removed.


Codex app improvements I’d love:
1. Make context compaction visible/auditable: show what was preserved, what was dropped, the active plan, open files, pending approvals, running helpers, and next action.
2. Let users pin “must keep” facts before compaction, so long tasks don’t lose important state.
3. Better resume after pause/compaction: restore the same goal, files, browser/tool state, blockers, and next step instead of half-restarting.
4. Cleaner subagent lifecycle: ask before creating visible forks/threads/workspaces, show what each helper is doing, and add one-click “close all helpers.”
5. Better current workspace clarity: always show exact folder, repo, branch/worktree, active thread, and which browser/app Codex is controlling.
6. Tool/MCP status should distinguish configured, authenticated, and available in this chat. “Enabled” is not the same as “usable right now.”
7. Browser/tool failures need better diagnostics: tell me if it’s auth, app permissions, JS bridge, sandbox metadata, restart needed, or the website itself.
8. Stronger stale-preview detection. If screenshots, thumbnails, cached files, or overwritten artifacts are old, warn me and point to the fresh file.
9. More precise completion labels: generated, saved, pasted, submitted, posted, tested, verified live, and passed are all different states.
10. Safer form/browser automation: verify field names and visible contents before submit/delete/pay/publish actions, especially on React/Safari pages.
11. Better long-run dashboard: active helpers, tokens used, files changed, running processes, pending blockers, and cleanup state in one place.
12. Built-in cleanup warnings for stale MCP sidecars, duplicate helper processes, and long-running background tasks.
13. Clearer mode/tool labels. Don’t make users infer whether they’re in image mode, browser mode, high-quality mode, temporary chat, etc.
14. Better memory controls: compact durable facts, prune noisy transcript history, and mark stale facts as “must live-check.”
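Item 6 is easy to picture in code. Here is a minimal sketch, assuming three independent flags per tool; `ToolState`, `ToolStatus`, and `describe` are hypothetical names for illustration, not anything from the Codex app.

```python
from dataclasses import dataclass
from enum import Enum


class ToolState(Enum):
    NOT_CONFIGURED = "not configured"
    NEEDS_AUTH = "configured, not authenticated"
    UNAVAILABLE = "authenticated, not available in this chat"
    USABLE = "usable right now"


@dataclass
class ToolStatus:
    name: str
    configured: bool = False
    authenticated: bool = False
    available_in_chat: bool = False

    def state(self) -> ToolState:
        # Check in dependency order: a tool is only usable when every
        # earlier condition also holds.
        if not self.configured:
            return ToolState.NOT_CONFIGURED
        if not self.authenticated:
            return ToolState.NEEDS_AUTH
        if not self.available_in_chat:
            return ToolState.UNAVAILABLE
        return ToolState.USABLE


def describe(status: ToolStatus) -> str:
    """One-line label a status panel could show for a tool."""
    return f"{status.name}: {status.state().value}"
```

A single "Enabled" badge collapses four states into one. Reporting the first failing condition tells the user what to fix next.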

@Rogert395119 @mimuluslarch @grok Lol, why so serious? I was having fun. I don't need AI to think for me. I'm just asking a question because I can, and I can do what I want 🤗😆 Sorry, not sorry that bugged you 👋🏾.

@MarMarLabs @mimuluslarch @grok Because I don’t want to be face to face with a stranger for 5 hours… and imagine how many fewer people could fit, lmfao. Why do you need AI to think for you?

@mimuluslarch @grok No, what the hell are you talking about? I'm talking about this post: x.com/mimuluslarch/s…
Tomma @mimuluslarch
Hey @grok why aren’t airplane seats designed like this? 👀

If you build MCP servers, the AI release that matters this week isn't a model. It's who decides whether your connector is installed.
June 18, Anthropic shipped enterprise-managed auth for MCP connectors:
"Connect your identity provider to Claude and choose which MCP connectors to enable for your organization. When an employee logs in, their connectors are already there."
Okta first. Launch connectors: Asana, Atlassian, Canva, Figma, Granola, Linear, Supabase (Slack soon). In beta on Team and Enterprise, consistent across Claude chat, Claude Code, and Cowork.
The mechanics are dry. The shift is not: MCP just moved from a per-developer install to an IT-provisioned connector layer.
Who buys changes. It used to be one dev pasting a server URL into their config. Now, an admin authorizes a connector once, and every employee inherits it through the identity groups and roles their org already manages.
That rewires what makes an MCP server win:
- Not "does it do something cool" → does it pass a security review.
- Not per-seat install → central provisioning, role-based access, one authorization for the whole org.
- Not a clever tool → an app procured like every other enterprise app: SSO, scoped permissions, a name IT already trusts.
If you're building MCP for real distribution, there's a new lane you can't skip: be the connector IT can safely hand out, not the one each dev quietly sneaks in.
The quiet line — "choose which MCP connectors to enable for your organization" — is the whole story. Distribution for agent tools is moving to the identity provider. Build for the admin, not just the developer.
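A rough sketch of what "every employee inherits it through identity groups" means in practice. The connector names are the launch connectors listed above; the group names, the `ORG_POLICY` mapping, and `connectors_for` are hypothetical and are not Anthropic's or Okta's API.

```python
# Hypothetical admin policy: identity-provider group -> enabled connectors.
ORG_POLICY = {
    "engineering": {"Linear", "Supabase", "Atlassian"},
    "design": {"Figma", "Canva"},
    "everyone": {"Asana"},
}


def connectors_for(user_groups, policy=ORG_POLICY):
    """Return the connectors a user inherits from their groups, sorted by name."""
    enabled = set()
    for group in user_groups:
        # Unknown groups grant nothing, so access is deny-by-default.
        enabled |= policy.get(group, set())
    return sorted(enabled)
```

For example, `connectors_for(["engineering", "everyone"])` returns `['Asana', 'Atlassian', 'Linear', 'Supabase']`. The developer never installs anything; what they can use is a function of group membership.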


The most interesting AI health story this week is not "AI replaces doctors."
It is old, unresolved work becoming reviewable again.
OpenAI says researchers at Boston Children's, Harvard, and OpenAI used o3 Deep Research to reanalyze 376 previously unsolved rare-disease cases. It surfaced evidence-linked leads that, after expert review, additional testing, and clinical confirmation, helped physicians establish 18 diagnoses.
The model did not diagnose anyone.
That is the product lesson.
For AI builders, the pattern worth stealing is:
- keep the AI output as a hypothesis, not an action
- require it to show the evidence chain
- route every candidate through domain review
- confirm results with external tests
- log versions, prompts, sources, and uncertainty
- rerun the backlog when the knowledge base changes
Many "AI agents" are still chat boxes with tools. The serious systems are going to look more like maintenance loops for moving knowledge: re-check the old state, surface what changed, explain why it matters, and hand humans a narrow thing to verify.
That pattern applies way beyond medicine: security backlogs, support incidents, compliance reviews, old code paths, data-quality queues.
The win is not autonomous certainty.
It is making expert review easier to aim.
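The checklist above could be sketched as a record type that can only move forward one review stage at a time. This is a sketch under my own assumptions: `Stage`, `Lead`, and the field names are hypothetical and do not describe how the study's pipeline was built.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    HYPOTHESIS = 1        # raw model output, never an action
    EXPERT_REVIEWED = 2   # a domain expert has looked at it
    CONFIRMED = 3         # an external test backs it up


@dataclass
class Lead:
    case_id: str
    claim: str
    evidence: list        # sources that back the claim
    model_version: str    # logged so the backlog can be rerun later
    stage: Stage = Stage.HYPOTHESIS
    log: list = field(default_factory=list)

    def advance(self, to: Stage, by: str) -> None:
        """Move the lead to the next stage and record who moved it."""
        if not self.evidence:
            raise ValueError("a lead without an evidence chain cannot be reviewed")
        if to.value != self.stage.value + 1:
            raise ValueError("stages cannot be skipped")
        self.log.append((self.stage.name, to.name, by))
        self.stage = to
```

The point of the stage check is that nothing reaches `CONFIRMED` without passing through expert review first, and the log keeps a record of who signed off at each step.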


The missing piece in Codex is no longer raw capability; it’s orchestration.
Codex can already code, review, run, browse, and automate. What it needs now is a true systems layer: shared project memory, event-driven wakeups, deeper platform integrations, review comments that include directly applicable patches, and a semantic map for large codebases. That’s what would upgrade it from a powerful agent to the default operating system for engineering work.

@pvncher @ajambrosino @davidlinclark I want subagent names to have different spawn rates. And shiny names get fast for free.

@jxnlco @davidlinclark maybe we should actually do this

Codex already covers the major surfaces: app, IDE, CLI, cloud tasks, worktrees, PR review, browser tools, automations, and enterprise governance. The next step is making it feel native to real engineering organizations by adding:
- Shared team memory with clear provenance and approval workflows
- Event-driven automations triggered by CI, deployments, and alerts
- One-click suggested PR fixes, not just review comments
- First-class GitLab and Jira integrations
- Unified, authenticated browser debugging with replayable repro bundles
Also, I don’t know why it was teased and then taken away. Please bring this back as well. I was super happy, but after I updated the app it was removed. I don’t know if it was released too early or what, but it was extremely useful.


Yeah, I dug into it, you’re spot on. In the Hermes thread a few weeks back, you were literally defending them against his complaints about their skill, and he still went the “crypto bro” route against haters (and referenced your market posts). It’s clearly his default dismissal whenever anyone pushes back.
Influence is real if projects actually ship features based on his feedback, but it does make the “he’s always right/good ideas only” vibe feel one-sided.

Lol, I tried to call out the same thing about the “juice” number and the GPT‑5.5‑pro knowledge cutoff of Dec 01, 2025. GPT‑5.5‑pro is already the model they’re using. They think that asking for a juice number and getting the model to state its knowledge cutoff somehow means they’re getting GPT‑5.6‑pro, when they’re actually just using GPT‑5.5‑pro on chatgpt.com.
They also think OpenAI is secretly letting some people test GPT‑5.6‑pro, and it’s all just false, lol. GPT‑5.5‑pro has already been insanely good at coding. It’s just that all we get to use in the Codex desktop app and TUI is GPT‑5.5, so of course when you go test stuff with GPT‑5.5‑pro you’re going to be wowed, lol. People are only used to seeing results from regular GPT‑5.5.
But trust me, it’s sad to see people claim they’re showing results from GPT‑5.6‑pro, because those are not results from it. When the next model actually drops and people look back at all these “test” results, they’re going to look stupid, because it’ll blow past what’s supposedly being tested right now.
But I mean, I’m a nobody with no followers, so why would anyone listen to me, right? 😅
developers.openai.com/api/docs/model…


The Codex update that actually matters this week reads like a distributed-systems changelog, not a chat-app one.
June 18, Codex 0.141.0 shipped two lines most people scrolled right past:
- "Remote executors now use authenticated, end-to-end encrypted Noise relay channels."
- Remote execution now "preserves executor-native working directories and shells, including filesystem permission paths across app-server and exec-server boundaries."
Translation for builders: your coding agent is quietly becoming a distributed system.
The moment an agent's work leaves your laptop and runs on remote executors, you inherit the three problems distributed systems have always had:
1. Identity & trust — who is this executor, and is the channel actually secure? → the authenticated, encrypted relay.
2. State continuity — does the remote box keep your working dir, your shell, your permission paths? → the preserved working directories.
3. Boundaries — where does the app-server stop and the exec-server begin, and what's allowed to cross? → the explicit app/exec split.
None of that is a model capability. It's plumbing. And the plumbing is what decides whether "run this across N machines" is a demo or something you'd point at prod.
If you're building agent tooling, update the mental model: stop designing around "prompt + model," start designing around scheduler + executors + state + trust boundary. The agent IDE is turning into infrastructure.
The tell is always the changelog. When it starts naming relay channels and server boundaries instead of new buttons, the product just crossed from app to platform.
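To make "scheduler + executors + state + trust boundary" concrete, here is a toy sketch. Every name in it (`Executor`, `TaskState`, `dispatch`) is hypothetical. It models the three concerns from the list above and says nothing about how Codex implements its relay or its app-server/exec-server split.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Executor:
    name: str
    authenticated: bool       # identity and trust
    encrypted_channel: bool


@dataclass(frozen=True)
class TaskState:
    working_dir: str          # state continuity
    shell: str


def dispatch(task: str, state: TaskState, executors: list) -> dict:
    """Hand the task and its state to the first executor that passes the trust check."""
    for ex in executors:
        # Boundary rule: nothing crosses to an unauthenticated or
        # unencrypted executor.
        if ex.authenticated and ex.encrypted_channel:
            return {
                "executor": ex.name,
                "task": task,
                "working_dir": state.working_dir,
                "shell": state.shell,
            }
    raise RuntimeError("no trusted executor available")
```

Even in a toy version the model never appears. The design questions are all about who may run the work and what state travels with it.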


The real xAI update is surface area.
In the same 48-hour window, Grok showed up in:
- Word, for turning notes/research into working docs
- Databricks Agent Bricks, inside a governed enterprise data platform
- Amazon Bedrock, through the cloud layer where many companies already buy and run models
The builder lesson is bigger than Grok:
AI products stop being toys when they sit next to the user's source of truth and inherit the boring enterprise stuff around it: permissions, data boundaries, review paths, billing, logs, and handoff.
If you're building with agents, don't only ask:
"Is the model smart enough?"
Ask:
"Where is the work already happening, and what would make the agent safe enough to be invited there?"
Chat is still useful.
But the durable wedge is probably the document, dataset, repo, ticket, spreadsheet, IDE, or approval flow that already has gravity.


Power's out in Crystal right now.
ChatGPT didn't just drop the Hennepin County numbers (5k+ customers affected). It offered to set up a scheduled task to monitor the Xcel outage count hourly — and only ping me if it drops meaningfully or looks mostly restored.
I didn't ask it to do that. It just knew that'd actually be useful.
OpenAI rolled out proper Scheduled Tasks in ChatGPT this week. Recurring stuff, reminders, or smart monitoring that only notifies you on real changes. This is the shift from reactive answers to AI that handles the follow-up for you.
How far we've come is wild.
Anyone testing the new Scheduled page yet? What's the most useful task you've got running?
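For anyone curious what "only ping me if it drops meaningfully or looks mostly restored" could look like as a rule, here is a minimal sketch. The function name and both thresholds are my own assumptions, not how ChatGPT's Scheduled Tasks decide when to notify.

```python
def should_notify(previous: int, current: int, baseline: int,
                  drop_fraction: float = 0.25,
                  restored_fraction: float = 0.10) -> bool:
    """Decide whether an hourly outage check is worth a notification.

    previous: customers affected at the last check
    current:  customers affected now
    baseline: customers affected when monitoring started
    """
    if baseline <= 0:
        return False
    restored_threshold = baseline * restored_fraction
    # Fire once, when the count first crosses below the "mostly restored" line.
    mostly_restored = current <= restored_threshold < previous
    # Fire when a single check-to-check drop is large relative to the baseline.
    meaningful_drop = (previous - current) >= baseline * drop_fraction
    return mostly_restored or meaningful_drop
```

With a baseline of 5,000 affected customers, a drop from 5,000 to 4,900 stays quiet, a drop from 5,000 to 3,500 notifies, and crossing below 500 notifies once.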












