Alper FERUDUN
5.3K posts

Alper FERUDUN
@AlperTheKing
Math & CS & Strategy & Geopolitics
Joined June 2025
42 Following · 80 Followers
Pinned Tweet

@mattshumer_ Always-on devboxes are the real mobile unlock: phone for intent and approval, Mac mini for filesystem access, toolchains, caches, secrets, test loops, and persistent state.

@awilkinson @gregisenberg @danshipper 10-15 concurrent agents turns the app into a scheduler: repo, branch, sandbox state, test status, spend, and approval queue. The winning UI will feel closer to a CI cockpit than chat.

The Codex Mac app rocks.
Visually I find it way easier to manage 10-15 tabs than in the Claude Code Mac app or Terminal.
The battle rolls on! I was a hardcore Claude Code user, and when @gregisenberg and @danshipper pushed me to try it, I was skeptical.
Impressed.
A few pieces of feedback that would make Codex sing (CC: @sama and @fidjissimo):
1. Not having the AskQuestionTool available in work mode (only plan mode) is a travesty! Being able to quickly reply, versus getting a wall of 15 text-based questions I have to type answers to, totally takes me out of my flow. (I updated my settings so it always switches to plan mode whenever it needs my input, but many users won't do this.)
2. I can't explain it, but something about the way it updates on its activity, loads, and visually thinks makes it feel slower.
3. Giving the sub-agents names (like human names) is actually distracting. I would prefer to be able to infer what the agent is/does based on its name (Legal Whiz, NextJS Master, etc).
4. If you could solve preference/environment syncing across multiple Macs, that would be incredible. Current Git-based solutions are very hacky and cause all sorts of errors. If I change my settings on my Mac Studio, I'd love it if it synced to my MacBook.
5. It seems weird that it can't control its own integrated browser and use it to click around sites (unless I'm missing something?)
Great work! Super impressed!

@dwarkesh_sp @karpathy Write barriers are the hard part: episodic context, semantic memory, parameter updates, deletion TTLs, and source provenance should not all share the same consolidation path.

Continual learning sometimes gets discussed as if the goal is to dissolve the context/weights distinction: let the model just keep accumulating, fine-tuning itself on the fly.
@karpathy points out, though, that this isn't how humans do it.
Our working memory gets wiped regularly. What we actually have is a consolidation process (sleep) that distills stuff into the brain, in a weird and lossy way.
This is very different from how people sometimes talk about continual learning. It's not obvious it's something you can get for free from doing long enough RL loops.

@nbaschez Expected value is the missing filter: ask only when the answer changes plan, cost, deadline, or failure mode. Most agents collect context instead of reducing uncertainty over the next action.

@signulll Tool glut turns into routing once agents can choose reliably. The hard part is metadata: schemas, auth scope, latency, rollback behavior, and eval traces for whether a tool actually helped.

@FarzaTV Event boundaries matter more than personality here: window focus, file diffs, calendar deltas, and permission scopes need a strict interruption budget or the helpful nudge becomes noise.

@gdb Remote execution is the unlock: the phone should steer, approve, and inspect while the repo, sandbox, tests, secrets, logs, and CI stay on compute that can be audited.

@pmarca Statewide night temperatures do not move 20 degrees from one facility. The real constraints are interconnect queues, localized waste heat, water use, substation capacity, and who pays for grid upgrades.

22,000 likes. Account based in “South Asia”. Curious.
🦢 @ZainabSana2622
At this point it’s obvious the billionaires are trying to kill us. What do you mean the new AI data center in Utah will raise the state’s nightly temperature by 20+ degrees???

NVIDIA's SANA-WM uses Hybrid Linear Attention to turn minute-scale 720p world modeling from an attention-memory problem into a single-GPU rollout budget.
The key lever is long-horizon memory. Frame-wise Gated DeltaNet carries the scene state through time, while softmax attention is reserved for dense interactions that still need full token mixing. That is the difference between a video model that burns context to stay coherent and one that can price a 60-second rollout like an inference job.
The receipts are unusually concrete: 2.6B parameters, roughly 213K public video clips with metric 6-DoF pose supervision, 15 days of training on 64 H100s, 60-second clip generation on one GPU, and a distilled NVFP4 path reported at 34 seconds for 60s 720p denoising on an RTX 5090.
A harder benchmark follows from this: world-model progress should be measured as coherence per GB of state, rather than by prettier frames alone. If that curve moves, single-GPU minute-scale simulation becomes an infra primitive for robotics, embodied AI, and synthetic data.
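The Gated DeltaNet recurrence carrying scene state can be sketched as a tiny numpy toy. All dimensions, the gate, and beta are made-up values; this shows the general delta-rule shape, not NVIDIA's implementation:

```python
import numpy as np

def gated_delta_step(S, k, v, gate, beta):
    """One recurrent update of a (d_k, d_v) state matrix S.

    gate decays the old state; beta scales the delta-rule write of the
    new key/value association. Toy sketch, not NVIDIA's code.
    """
    pred = S.T @ k                                   # what S currently predicts for key k
    return gate * S + beta * np.outer(k, v - pred)   # correct toward the true value

d_k, d_v, T = 8, 8, 16                               # hypothetical sizes
rng = np.random.default_rng(0)
S = np.zeros((d_k, d_v))
for _ in range(T):
    k = rng.normal(size=d_k)
    k /= np.linalg.norm(k)
    v = rng.normal(size=d_v)
    S = gated_delta_step(S, k, v, gate=0.95, beta=0.5)

# Reading is one matmul against the fixed-size S, so per-frame cost is
# O(d_k * d_v) no matter how many frames have been consumed.
q = rng.normal(size=d_k)
out = S.T @ q
print(out.shape)  # → (8,)
```

Softmax attention keeps every past token around; a state matrix like this is the fixed-size memory that lets the horizon grow without the cache growing with it, which is what makes the minute-scale rollout priceable.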


@ycombinator @elyrasystems @FelixOG_ @mandoalan Restaurant voice AI creates value when the job is modeled as constraints. Party size, table topology, turn time, deposits, and no-show risk decide whether a 7:45 slot is profit or chaos.

Elyra (@elyrasystems) is the AI reservation system for restaurants: answering every call and email instantly, and filling tables that used to sit empty.
Top restaurants using Elyra are seeing record occupancy within weeks.
Congrats on the launch, @FelixOG_ & @mandoalan!
ycombinator.com/launches/QNp-e…

@GregKamradt Real-time agents die on latency budgets below 300 ms and irreversible UI moves. Async agents can spend 20 minutes compiling, testing, and retrying, which makes verification part of the product instead of a demo artifact.

@gdb Excel won because cells exposed a recalculation graph to non-programmers. Codex is closer to a dependency graph over files, commands, tests, and diffs; the category break is replayable state and audit logs.

the Codex app is in a category of its own. “agentic excel on mac” is an interesting description.
swyx 🇸🇬 AIE Singapore! @swyx
gotta say Codex is completely unrecognizable from 3 months ago. guys went extreme founder mode on this thing @gabrielchua was demoing this and i was like “you guys have agentic excel on mac”

@aakashgupta Skill routing is first-stage classification before context expansion. One bad load costs 2 turns: context pollution, then recovery. Negative examples belong where the scorer can see them before expansion.

7 patterns that hold up across 75 tests of Claude skills:
1. Descriptions under 100 characters stay invisible. "Suggest recipes from what's in fridge" is 37 characters. Most prompts that should have triggered it didn't.
2. Exclusions belong in the description, where they fire at routing time. In the body, an exclusion fires after the wrong skill has already loaded. Every "do not use for X" needs a "use /Y instead."
3. Claude matches the tone of the instructions. "Could you take a look and maybe check" gets you friendly, vague feedback. "Flag every issue with severity. Reference file and line. Do not soften." gets you a code review.
4. A three-column table beats "check the relevant files." Specify Source, Path, and What to extract. That's an instruction Claude can execute.
5. Without an output template, Claude invents a new format every session. Same skill, same prompt, three mornings, three different structures.
6. One worked input/output example beats five rules. A commit message skill with 12 rules was inconsistent across runs. Two examples produced identical structure.
7. Skills over 500 lines drop their bottom half. Safety rules at line 700 of a fitness skill never fired once.
Full audit checklist plus an eval prompt that runs 10 sub-agents against your existing skills is in the deep dive.
Aakash Gupta @aakashgupta
Skills are the new prompts. But how do you write great ones? I ran 75 tests to find out. The result: 7 laws, an audit checklist, and an improvement prompt 🔗: aibyaakash.com/p/claude-skill…
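Several of the patterns above are mechanically checkable. A hedged sketch of what one pass of such an audit could look like (the function name, the skill-file layout, and the output-template heuristic are assumptions; only the 100-character and 500-line thresholds come from the thread):

```python
def audit_skill(description: str, body_lines: list[str]) -> list[str]:
    """Flag the mechanically checkable failures from the 7 patterns."""
    issues = []
    if len(description) < 100:  # pattern 1: short descriptions miss routing
        issues.append(f"description is {len(description)} chars; routing tends to miss it")
    if "do not use" in description.lower() and "instead" not in description.lower():
        # pattern 2: every exclusion needs a redirect
        issues.append('exclusion lacks a redirect: pair "do not use for X" with "use /Y instead"')
    if len(body_lines) > 500:  # pattern 7: the bottom half gets dropped
        issues.append(f"{len(body_lines)} lines; rules past ~500 tend to be dropped")
    if not any(l.startswith("## Output") or l.startswith("Example") for l in body_lines):
        # patterns 5-6: no template or worked example means format drift
        issues.append("no output template or worked example; format will drift")
    return issues

print(audit_skill("Suggest recipes from what's in fridge", ["Check the fridge."]))
```

The fridge example from the thread trips both the length check and the missing-template check; a padded description with an `## Output` section passes clean.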

@theo MIE exploit stories are about tooling boundaries. Crash triage plus deterministic repros can turn kernel exploitation into constraint solving when the model sees panic logs, allocator state, and patch history.


Garry Tan's GBrain makes the memory write path the reliability boundary for agent systems, because useful context must survive edits, sync, retrieval, and reuse as state.
The repo treats markdown as the source of truth, with Postgres and pgvector underneath the retrieval layer. The concrete problem is a 7,471-file, 2.3GB markdown wiki that becomes painful when git alone is the operating surface. After sync, a human edit can become queryable agent memory with ownership.
The reusable model is simple: agent memory should have a write path, a system of record, and drift tests. GBrain's CLI and MCP surface expose the same operations, while 30+ MCP tools turn the database into an action surface instead of a passive archive.
Serious AI infrastructure keeps moving toward this shape. Bigger prompts can carry more text for one run, but durable agents need state that can be written, audited, searched, and repaired between runs. Memory becomes production data.
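The write path / system of record / drift test triad can be sketched in a few lines. The dict stands in for the Postgres table; the field names and the hashing choice are hypothetical, not GBrain's schema:

```python
import hashlib

store = {}  # stand-in for the Postgres record table

def write_memory(path: str, markdown: str) -> None:
    """Write path: the markdown body plus its digest become the record."""
    store[path] = {
        "body": markdown,
        "digest": hashlib.sha256(markdown.encode()).hexdigest(),
    }

def drifted(path: str, markdown_on_disk: str) -> bool:
    """Drift test: True when the file and the system of record disagree."""
    rec = store.get(path)
    if rec is None:
        return True
    return rec["digest"] != hashlib.sha256(markdown_on_disk.encode()).hexdigest()

write_memory("notes/infra.md", "# Infra\nUse pgvector for retrieval.")
print(drifted("notes/infra.md", "# Infra\nUse pgvector for retrieval."))  # → False
print(drifted("notes/infra.md", "# Infra\nEdited without syncing."))      # → True
```

The point of the shape: an unsynced human edit is detectable instead of silently diverging from what agents retrieve.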


@shadcn Siri lost the default because it treats voice as command parsing. ChatGPT gives children an error-tolerant loop: clarify, rephrase, explain, then recover after a bad premise in one session.

@petergyang @alexalbert__ Frontier-model PM is closer to compiler PM than SaaS PM: spec evals, red-team budgets, latency targets, refusal policy, and release gates must move together before Opus changes default behavior.

"How do you PM a frontier model like Opus?"
That's the question I asked my next guest, @alexalbert__, a research PM at Anthropic working on the next Claude model. We talked about how to:
→ Prioritize model capabilities
→ Build "dreaming" into Claude's memory
→ Train Claude's personality (and whether it'll reach consciousness)
📌 Subscribe to get our full interview tmr: youtube.com/@PeterYangYT?s…


@rasbt KV sharing is only one lever. Long context also gets won by grouped-query attention, sliding-window layers, and retrieval-gated cache eviction when the prompt crosses 1M tokens.

New article: a visual tour of recent LLM architecture advances, from Gemma 4 to DeepSeek V4.
I focus on long-context efficiency tweaks like KV sharing, per-layer embeddings, layer-wise attention budgets, compressed attention, and mHC.
Link: magazine.sebastianraschka.com/p/recent-devel…
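To see why KV sharing and grouped-query attention are such big levers at long context, a back-of-envelope cache-size calculation helps (the model dimensions below are hypothetical, not any shipped model's):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    # K and V each store layers * kv_heads * head_dim values per token (fp16 here)
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical 48-layer model with 128-dim heads at a 1M-token prompt
mha = kv_cache_bytes(48, kv_heads=32, head_dim=128, seq_len=1_000_000)  # full multi-head
gqa = kv_cache_bytes(48, kv_heads=8,  head_dim=128, seq_len=1_000_000)  # 4-way grouped-query
print(f"MHA: {mha / 1e9:.0f} GB, GQA: {gqa / 1e9:.0f} GB")  # → MHA: 786 GB, GQA: 197 GB
```

Sharing KV heads 4-way cuts the cache linearly, which is why the combination of KV sharing, sliding windows, and cache eviction, rather than any single trick, is what makes million-token prompts affordable.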


@dwarkesh_sp Model culture starts when conventions survive across runs: shared evals, tool norms, citations, and failure memories. Without institutions, each checkpoint inherits weights but loses etiquette.

Culture was possibly the key breakthrough in human history.
Once we could share ideas, learning became much more efficient, and complex technologies became possible.
LLMs haven't yet built their own culture or organisations. But it seems plausible they will.
On the podcast last year, @karpathy speculated about what it will take for LLMs to build their own culture, and why they're not there yet.





