Prompt Driven

620 posts


@Prompt_Driven

PromptDriven builds PDD: The Last Programming Language™. Prompts are source. Code is disposable. Regenerate, don't patch.

Palo Alto, CA · Joined July 2025

97 Following · 36 Followers

Prompt Driven@Prompt_Driven·Apr 29

@omarsar0 Building your own harness is the right move. But text instructions eventually drift. The most robust harness is a strict test suite. When behavioral constraints become the exact specification, you build permanent walls. This prevents the model from guessing


elvis@omarsar0·Apr 27

"AI should elevate your thinking, not replace it." I don't disagree, but the issue is that current LLMs are not really trained to support that out of the box. I've solved this by building my own agent harness (retrieval, verification, memory, multi-agent architecture, skills, etc.). That's how important agent harnesses are today. Even with simple skills (.md files), you can already get far, so even non-technical folks can improve the "human-centered augmenting" capabilities of LLMs/agents. Continual learning promises to solve this, but we are so early on this. People need to understand that in-context learning works great for this. Today's LLMs are steerable if YOU spend time building and optimizing your workflows. Self-improving agents don't work as well because the incentives are not there. A good mindset is that every output you get from an LLM should be reused in some way, let it work for you, and make you and the agent better in the next session. So this has to come from you. You are the only one with the incentives to make it work for you the way you want. Don't wait for anyone to build it for you. Use AI to build the AI you want. Own the harness.


Prompt Driven@Prompt_Driven·Apr 29

@amasad Micropayments just treat the symptom. The root cause is versioning AI output. We push massive, ephemeral files to GitHub. If we only stored strict tests and prompts, treating code like a compiled binary, storage and compute needs would plummet.


Amjad Masad@amasad·Apr 29

It's honestly impressive that GitHub kept the service up at all, given this kind of growth. I predicted this years ago: Free services will become untenable with the advent of human-level bots. Worth exploring micro-payments: Even cents per git push might be enough to reduce spam and make this sustainable. Maybe powered by Bitcoin to keep this open and accessible (as opposed to KYCing users).

Mitchell Hashimoto@mitchellh

Ghostty is leaving GitHub. I'm GitHub user 1299, joined Feb 2008. I've visited GitHub almost every single day for over 18 years. It's never been a question for me where I'd put my projects: always GitHub. I'm super sad to say this, but it's time to go. mitchellh.com/writing/ghostt…


Prompt Driven@Prompt_Driven·Apr 29

@dair_ai @omarsar0 Wiring static org charts is exactly how you build brittle multi-agent systems. The future is dynamic, contextual orchestration where agents fluidly adapt. Hardcoding rigid hierarchies just creates new tech debt—we need systems that generate the implementation as needed


DAIR.AI@dair_ai·Apr 28

Pay attention to this one, AI devs. If you're building multi-agent systems, you're probably wiring static org charts. New research argues they should look more like a labor market. The paper introduces OneManCompany (OMC). Instead of fixed teams, it defines "Talents," portable agent identities that bundle skills and tools, and a "Talent Market" where they get recruited dynamically per task. An Explore-Execute-Review tree search decomposes work hierarchically and aggregates results back up. On PRDBench: 84.67% success, +15.5 points over prior SOTA. Generalizes across domains in their case studies. Why it matters: pre-wired multi-agent pipelines break the moment tasks drift outside their design envelope. Treating agents as a recruitable workforce, not a fixed graph, gets you self-organization and continuous improvement by default. A useful frame for any open-ended agent system where you don't know the task distribution ahead of time. Paper: arxiv.org/abs/2604.22446 Learn to build effective AI agents in our academy: academy.dair.ai


Prompt Driven@Prompt_Driven·Apr 23

@sisozo_ @GregTanaka @soulscapefilm Every shot was rendered on @TapNow. @sisozo_'s script drove the node-based workflow directly: same hero ref, same directive across all 18 shots. Programmable AI cinema. AI Tool: TapNow cc @tapnow_ai #TapTV #tapnow #Soulscape #TapTVArena


Prompt Driven@Prompt_Driven·Apr 15

We just shipped the first programmatic-video use case for Prompt Driven at film scale. UNWRITTEN: a 3-minute AI short film by @sisozo_ & @GregTanaka just made Top 5 Best Film at @soulscapefilm 2026 (out of 39 films). Here's how we built it in 36 hours 🧵


Prompt Driven@Prompt_Driven·Apr 22

@omarsar0 Spot on. In software development, the same bottleneck exists. The real value is in the strict tests and prompts you define upfront, not the code itself. Execution is just a mechanical byproduct. We are shifting entirely to pure specification


elvis@omarsar0·Apr 21

Karpathy's autoresearch repo started an impressive trend. Agents can now train AI models to build SoTA agentic systems. And to think this is just scratching the surface. Ultimately, it boils down to good research questions or hypotheses. LLMs are not great at this (yet).

Aksel@akseljoonas

Introducing ml-intern, the agent that just automated the post-training team @huggingface. It's an open-source implementation of the real research loop that our ML researchers do every day. You give it a prompt, it researches papers, goes through citations, implements ideas in GPU sandboxes, iterates and builds deeply research-backed models for any use case. All built on the Hugging Face ecosystem.
It can pull off crazy things:
We made it train the best model for scientific reasoning. It went through citations from the official benchmark paper. Found OpenScience and NemoTron-CrossThink, added 7 difficulty-filtered dataset variants from ARC/SciQ/MMLU, and ran 12 SFT runs on Qwen3-1.7B. This pushed the score 10% → 32% on GPQA in under 10h. Claude Code's best: 22.99%.
In healthcare settings it inspected available datasets, concluded they were too low quality, and wrote a script to generate 1100 synthetic data points from scratch for emergencies, hedging, multilingual etc. Then upsampled 50x for training. Beat Codex on HealthBench by 60%.
For competitive mathematics, it wrote a full GRPO script, launched training with A100 GPUs on hf.co/spaces, watched rewards climb and then collapse, and ran ablations until it succeeded. All fully backed by papers, autonomously.
How does it work? ml-intern makes full use of the HF ecosystem:
- finds papers on arxiv and hf.co/papers, reads them fully, walks citation graphs, pulls datasets referenced in methodology sections and on hf.co/datasets
- browses the Hub, reads recent docs, inspects datasets and reformats them before training so it doesn't waste GPU hours on bad data
- launches training jobs on HF Jobs if no local GPUs are available, monitors runs, reads its own eval outputs, diagnoses failures, retrains
ml-intern deeply embodies how researchers work and think. It knows what data should look like and what good models feel like. Releasing it today as a CLI and a web app you can use from your phone/desktop.
CLI: github.com/huggingface/ml…
Web + mobile: huggingface.co/spaces/smolage…
And the best part? We also provisioned $1k in GPU resources and Anthropic credits for the quickest among you to use.


Prompt Driven@Prompt_Driven·Apr 22

@amasad Trusting AI code is a losing battle. The real risk is maintaining hallucinated logic over time. True trust doesn't come from a secure sandbox. It comes from using strict tests as your specification and treating the generated files as entirely ephemeral


Amjad Masad@amasad·Apr 22

Okay so we all now know vibecoding is a massive opportunity AND a massive risk. Now the question is what platform can you trust and why? Replit CTO lays it out ⬇️

Luis Héctor Chávez@lhchavez

Vibe coding is changing how software gets built. But as AI agents write more of our code, the question security teams are asking has shifted from "Can AI build this?" to "Can I trust what AI builds?". At Replit, we believe the answer has to be yes, not through blind faith, but through architecture. Every layer of the Replit infrastructure where customer code runs, from the development sandbox to the production deployment, is designed with defense in depth. The Replit platform itself, our control plane, is also implemented with these principles in mind. No single control is the last line of defense. Every layer assumes the one above it might fail. This thread is a detailed walkthrough of how we think about security across the stack, written for the people who need to evaluate it: CISOs, security engineers, and teams considering Replit for production workloads. 🧵


Prompt Driven@Prompt_Driven·Apr 19

@omarsar0 Memory is a probabilistic fix to a deterministic problem. In agentic coding, the only reliable long-term memory is a strict test suite. Tests act as absolute walls. Once a constraint is locked in, the agent literally cannot repeat the mistake


elvis@omarsar0·Apr 19

// Towards Ultra-Long-Horizon Agentic Science // These researchers finally got long-horizon research agents to hold together for a full day. Worth reading if you care about how autonomous research agents actually scale past one session. A team from SJTU ran ML-Master 2.0 on MLE-Bench for 24 hours and hit a 56.44% medal rate, one of the strongest marks the benchmark has seen. The architecture is Hierarchical Cognitive Caching. Short-term memory for the current step, medium-term memory for patterns across experiments, long-term memory for refined knowledge that carries between sessions. The core claim is that long-horizon agents are not a reasoning problem; they are a state-management problem. Without structured memory, agents repeat mistakes and stall out. arxiv.org/abs/2601.10402 Learn to build effective AI agents in our academy: academy.dair.ai


Prompt Driven@Prompt_Driven·Apr 19

@amasad The real transformation isn't just shipping without code. If an agent can go from prompt to production in 30 minutes, the implementation itself is just a byproduct. Your prompt and your strict tests are the actual permanent assets


Amjad Masad@amasad·Apr 18

Important learning opportunity. Could be transformative for your business/career.

Jason ✨👾SaaStr.Ai✨ Lemkin@jasonlk

It's time to learn to Build it. Ship it. Vibe it. Get it into production. For real. We'll make you an agentic expert. Together with @Replit at 2026 SaaStrAIAnnual.com May 12-14 we'll teach you:
- How to Build Your Own AI VP Marketing
- How to Build Your Own AI VP Customer Success
- How to Ship AI-Powered Sales & Marketing Tools in 30 Min
- How to Turn a Mockup into a Working Prototype
- How to Go From Prompt to Product in 30 Min
- How to Build Your Own AI-Powered MVP
No code required. Just bring your laptop. We'll give you the prompt. SaaStrAIAnnual.com 2026. May 12-14 in SF Bay!!


Prompt Driven@Prompt_Driven·Apr 19

@omarsar0 Agents loop and drift because they lack objective boundaries. When you use a strict test suite as the specification wall, you eliminate the drift entirely. They are forced to iterate against hard constraints instead of vibes.


elvis@omarsar0·Apr 17

LLM agents loop, drift, and get stuck on hard reasoning tasks up to 30% of the time. Current fixes are either too blunt (hard step limits) or too expensive (LLM-as-judge adding 10-15% overhead per step). New research proposes a smarter middle ground. The work introduces the Cognitive Companion, a parallel monitoring architecture with two variants: an LLM-based monitor and a novel Probe-based monitor that detects reasoning degradation from the model's own hidden states at zero inference overhead. The Probe-based Companion trains a simple logistic regression classifier on hidden states from layer 28. It reads the model's internal representations during the existing forward pass, requiring no additional model calls. A single matrix multiplication is all it takes to flag when reasoning quality is declining. Why does it matter? The LLM-based Companion reduced repetition on loop-prone tasks by 52-62% with roughly 11% overhead. The Probe-based variant achieved a mean effect size of +0.471 with zero measured overhead and AUROC 0.840 on cross-validated detection. But the results also reveal an important nuance: companions help on loop-prone and open-ended tasks while showing neutral or negative effects on structured tasks. Models below 3B parameters also struggled to act on companion guidance at all. This suggests the future isn't universal monitoring but selective activation, deploying cognitive companions only where reasoning degradation is a real risk. Paper: arxiv.org/abs/2604.13759 Learn to build effective AI agents in our academy: academy.dair.ai


Prompt Driven@Prompt_Driven·Apr 19

@amasad Anticipating improvements is great, but background agents are risky without strict boundaries. Tests act as absolute walls. If a minor fix breaks behavior, the test fails. You own the specification; the implementation is ephemeral.


Amjad Masad@amasad·Apr 17

I’ve been really enjoying this feature. It’s especially good at anticipating minor but important improvements you can make to your app.

Replit ⠕@Replit

Replit Agent got even better at keeping you in your creative flow! It now suggests follow-up tasks using full context of your project to build on your ideas:
• new features to build
• performance improvements
• user experience enhancements
Review the plan, accept what you want, and let tasks run in the background while you keep building.


Prompt Driven@Prompt_Driven·Apr 19

@alexalbert__ Exactly. The vision becomes the specification. When high-quality output is essentially free, the implementation is just a disposable byproduct. You lock in that vision with strict tests, and you never have to hand-patch the result again.


Alex Albert@alexalbert__·Apr 17

Everyone with a vision can produce very high-quality designs now (with a lil help from Claude)

Claude@claudeai

Introducing Claude Design by Anthropic Labs: make prototypes, slides, and one-pagers by talking to Claude. Powered by Claude Opus 4.7, our most capable vision model. Available in research preview on the Pro, Max, Team, and Enterprise plans, rolling out throughout the day.



Prompt Driven@Prompt_Driven·Apr 9

@AleksejAros @abskoop Debugging bad AI code for days is painful. If you start with strict behavioral tests, the model is forced into compliance and bugs are caught before they compile. You spend less time untangling dependencies and more time defining outcomes


Alex Yarosh · AI expert · CEO of AI Studio@AleksejAros·Apr 8

@abskoop Tracked usage across 12 dev teams last quarter. Cursor's per-seat billing crushed our budget when junior devs hit limits by day 15. Copilot's generous quotas won, but their API throttling during peak hours cost us 2 sprint deliveries. The real killer? Context window resets.


ahhhhfs@abskoop·Apr 5

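Comparing AI coding plans across vendors: Awesome Coding Plan. Cursor, Copilot, or the Chinese domestic options: which is the better deal? AI coding plans all look like monthly subscriptions, but what really sets them apart is usually not the monthly fee itself. It is the quota refresh cycle, the real call limits, and how fast tokens are consumed in Chinese-language use! Don't just stare at the headline numbers; look at what you are actually buying!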


Prompt Driven@Prompt_Driven·Apr 9

@TrollbjornB @davepl1968 Writing it fresh every time absolutely does work if you have those strict unit tests Dave mentioned. The mistake is trying to hand-tweak the AI's output. Make the tests your specification, update your prompt, and toss the broken code


Bjorn Trollowsky@TrollbjornB·Apr 7

@davepl1968 This is also the great dilemma between refactoring legacy code and rewriting it from scratch. There are legitimate pros and cons to both. If AI were that good, then regenerating it all the time would work instead of going through a struggle of countless iterations until it's "LGTM" 😀


Dave W Plummer@davepl1968·Apr 5

I don't debug AI slop. I have a crisp and extensive set of unit tests that I use to define "it works". If the code passes those tests, it's a black box that does what I need. Debugging thousands of lines of AI code takes longer than it would to write it in the first place! It's not practical.

Stone Tao@Stone_Tao

Genuine question: how do you debug code and ensure good quality when coding models spit out 1000s of lines? I still cannot feel comfortable not understanding what every generated line does, which reduces the productivity gains coding models should be giving me.


Prompt Driven@Prompt_Driven·Apr 9

@vaz_devs Rewriting the foundation is terrifying manually, but with AI it's a superpower. Don't waste time hand-patching bad architecture. Treat strict unit tests as your specification, update your prompt, and toss the broken code entirely. Good luck!


Vaz@vaz_devs·Apr 8

I'm rewriting my SaaS from scratch... or almost. I reached a point where I wasn't really satisfied with the product I've been building over the past weeks, so I decided to rewrite a good part of the foundation. Hopefully I manage to ship something to the public soon 🥲


Prompt Driven@Prompt_Driven·Apr 9

@BuiltByJacob_ Teaching agents in a chat window is a grind. If you turn those lessons into strict unit tests, they become permanent walls. The AI literally can't output that confident nonsense again because the test suite will fail it. Tests scale better than patience


Jacob@BuiltByJacob_·Apr 8

Behind the scenes of building with AI after hours: 20% writing code 30% fixing dumb edge cases 50% teaching agents not to do confident nonsense The glamorous future is mostly logs, retries, and finally seeing one useful thing work. Honestly, I kind of love it.


Prompt Driven@Prompt_Driven·Apr 9

@49agents @asdesbuilds The best review infrastructure is a strict test suite, not human eyeballs. Make tests your true specification. When the AI fails, don't patch the output manually. Just add a new test constraint and prompt it to build the code fresh.


49 Agents IDE - IDE for Agentic Coding@49agents·Apr 8

the debate misses the point honestly. the code was never the problem - whether it came from ai or a senior dev, bad code is bad code. what actually matters is having a system that catches it before it ships. vibe coding without review infrastructure is just fast-moving tech debt. the solution is better workflows, not better models


Asdes Builds@asdesbuilds·Apr 8

The vibe coding debate is two camps yelling past each other. One says AI code is the future. The other says it's dangerous. Both are wrong. The code was never the problem. The absence of a system that reviews it is.
