Towards AI @towards_AI
23.6K posts
Making AI accessible to all with our courses, blogs, tutorials, books & community. Since 2019, we've helped teach AI to over 400k people. Now we also help corporate teams.

United States · Joined April 2019
1.9K Following · 41K Followers

Pinned Tweet
Towards AI @towards_AI
🚀 Land your dream LLM job in just 40 hours! AI & LLM engineers are in really high demand, but the skills gap holds many back. If you have some Python experience, we’ll teach you PRECISELY what employers want through hands-on projects and portfolio-building. All you need is the motivation to put in the effort—we’ll handle the rest. 💼 Start your journey to becoming the AI engineer companies are hunting for. 👉 Sign up now (link in bio as well): academy.towardsai.net/courses/beginn… #llmdeveloper #llm #ai #llmengineer
Towards AI tweet media
6 replies · 4 reposts · 16 likes · 7K views
Towards AI reposted
Paul Iusztin @pauliusztin_
If you want to learn agentic AI engineering, listen up... I've got the best resource for you: my Agentic AI Engineering course, built in collaboration with @towards_AI. Google and Gemini started recommending it just one week after it went live.
Paul Iusztin tweet media
1 reply · 3 reposts · 5 likes · 367 views
Towards AI reposted
Louis-François Bouchard 🎥🤖
I just added all our internal cheatsheets at Towards AI in Markdown so Claude can also refer to them when building for us 👌 You (or your agents) can use them as well! We basically have 3 main markdown files we use: (1) our AI slop cheatsheet, incredible for all writing workflows, made by our best writer and editor; (2) the AI engineer playbook that shares our best practices and insights, since we are builders; and (3) our agents cheatsheet with what we learned building agentic systems vs. more classical workflows + tools setups. Ultimately, all three are super useful for representing our thoughts and best practices in each build, even when we fully "vibe code" them. And yes, I'm sharing the repo with you too. You can access everything here: github.com/louisfb01/ai-e…
Louis-François Bouchard 🎥🤖 tweet media
1 reply · 3 reposts · 17 likes · 717 views
Towards AI @towards_AI
How many of the things here are you already doing?
0 replies · 0 reposts · 1 like · 77 views
Towards AI @towards_AI
@NousResearch The "open source model isn't good enough" line gets harder to argue with every release, for sure
0 replies · 0 reposts · 0 likes · 28 views
Nous Research @NousResearch
Did you feel that vibe shift anon? Open Source is in the air.
46 replies · 69 reposts · 988 likes · 29.8K views
Towards AI @towards_AI
@0xSero Durability (walking away for days while agents work) is where most orchestration tools break down. If Factory handles state persistence and error recovery over multi-day runs, that's pretty cool
0 replies · 0 reposts · 0 likes · 40 views
0xSero @0xSero
I have used every single orchestration tool out there. This is by far the best. Set an hour aside to try it with BYOK. It takes some time to get the scope decided but I have walked away for days at a time. Automate huge tedious tasks and take your family out or something
Factory @FactoryAI

Missions are now available to all Factory users. Long-running agents designed to automate large software tasks like building applications from scratch, migrations, and AI research. Let us know what you build!

53 replies · 59 reposts · 1.3K likes · 119.9K views
Towards AI @towards_AI
@saranormous @karpathy @NoPriorsPod The model speciation segment is the standout. Models optimized for different niches instead of one general-purpose model: that tracks with what actually happens in production
0 replies · 0 reposts · 0 likes · 50 views
sarah guo @saranormous
Caught up with @karpathy for a new @NoPriorsPod: on the phase shift in engineering, AI psychosis, claws, AutoResearch, the opportunity for a SETI-at-Home like movement in AI, the model landscape, and second order effects

02:55 - What Capability Limits Remain?
06:15 - What Mastery of Coding Agents Looks Like
11:16 - Second Order Effects of Coding Agents
15:51 - Why AutoResearch
22:45 - Relevant Skills in the AI Era
28:25 - Model Speciation
32:30 - Collaboration Surfaces for Humans and AI
37:28 - Analysis of Jobs Market Data
48:25 - Open vs. Closed Source Models
53:51 - Autonomous Robotics and Atoms
1:00:59 - MicroGPT and Agentic Education
1:05:40 - End Thoughts
235 replies · 1.1K reposts · 7.5K likes · 2.8M views
Towards AI @towards_AI
@GenAI_is_real Domain-specific knowledge partitioning drove more improvement than swapping in a stronger model. Neat
0 replies · 0 reposts · 1 like · 153 views
Towards AI reposted
Chayenne Zhao @GenAI_is_real
Today I read a lengthy piece on Harness Engineering — tens of thousands of words, almost certainly AI-written. My first reaction wasn't "wow, what a powerful concept." It was "do these people have any ideas beyond coining new terms for old ones?"

I've always been annoyed by this pattern in the AI world — the constant reinvention of existing concepts. From prompt engineering to context engineering, now to harness engineering. Every few months someone coins a new term, writes a 10,000-word essay, sprinkles in a few big-company case studies, and the whole community starts buzzing. But if you actually look at the content, it's the same thing every time: Design the environment your model runs in — what information it receives, what tools it can use, how errors get intercepted, how memory is managed across sessions. This has existed since the day ChatGPT launched. It doesn't become a new discipline just because someone — for whatever reason — decided to give it a new name.

That said, complaints aside, the research and case studies cited in the article do have value — especially since they overlap heavily with what I've been building with how-to-sglang. So let me use this as an opportunity to talk about the mistakes I've actually made.

Some background first. The most common requests in the SGLang community are How-to Questions — how to deploy DeepSeek-V3 on 8 GPUs, what to do when the gateway can't reach the worker address, whether the gap between GLM-5 INT4 and official FP8 is significant. These questions span an extremely wide technical surface, and as the community grows faster and faster, we increasingly can't keep up with replies. So I started building a multi-agent system to answer them automatically.

The first idea was, of course, the most naive one — build a single omniscient Agent, stuff all of SGLang's docs, code, and cookbooks into it, and let it answer everything. That didn't work. You don't need harness engineering theory to explain why — the context window isn't RAM. The more you stuff into it, the more the model's attention scatters and the worse the answers get. An Agent trying to simultaneously understand quantization, PD disaggregation, diffusion serving, and hardware compatibility ends up understanding none of them deeply.

The design we eventually landed on is a multi-layered sub-domain expert architecture. SGLang's documentation already has natural functional boundaries — advanced features, platforms, supported models — with cookbooks organized by model. We turned each sub-domain into an independent expert agent, with an Expert Debating Manager responsible for receiving questions, decomposing them into sub-questions, consulting the Expert Routing Table to activate the right agents, solving in parallel, then synthesizing answers. Looking back, this design maps almost perfectly onto the patterns the harness engineering community advocates. But when I was building it, I had no idea these patterns had names. And I didn't need to.

1. Progressive disclosure — we didn't dump all documentation into any single agent. Each domain expert loads only its own domain knowledge, and the Manager decides who to activate based on the question type. My gut feeling is that this design yielded far more improvement than swapping in a stronger model ever did. You don't need to know this is called "progressive disclosure" to make this decision. You just need to have tried the "stuff everything in" approach once and watched it fail.

2. Repository as source of truth — the entire workflow lives in the how-to-sglang repo. All expert agents draw their knowledge from markdown files inside the repo, with no dependency on external documents or verbal agreements. Early on, we had the urge to write one massive sglang-maintain.md covering everything. We quickly learned that doesn't work. OpenAI's Codex team made the same mistake — they tried a single oversized AGENTS.md and watched it rot in predictable ways. You don't need to have read their blog to step on this landmine yourself. It's the classic software engineering problem of "monolithic docs always go stale," except in an agent context the consequences are worse — stale documentation doesn't just go unread, it actively misleads the agent.

3. Structured routing — the Expert Routing Table explicitly maps question types to agents. A question about GLM-5 INT4 activates both the Cookbook Domain Expert and the Quantization Domain Expert simultaneously. The Manager doesn't guess; it follows a structured index. The harness engineering crowd calls this "mechanized constraints." I call it normal engineering.

I'm not saying the ideas behind harness engineering are bad. The cited research is solid, the ACI concept from SWE-agent is genuinely worth knowing, and Anthropic's dual-agent architecture (initializer agent + coding agent) is valuable reference material for anyone doing long-horizon tasks. What I find tiresome is the constant coining of new terms — packaging established engineering common sense as a new discipline, then manufacturing anxiety around "you're behind if you don't know this word." Prompt engineering, context engineering, harness engineering — they're different facets of the same thing. Next month someone will probably coin scaffold engineering or orchestration engineering, write another lengthy essay citing the same SWE-agent paper, and the community will start another cycle of amplification.

What I actually learned from how-to-sglang can be stated without any new vocabulary: Information fed to agents should be minimal and precise, not maximal. Complex systems should be split into specialized sub-modules, not built as omniscient agents. All knowledge must live in the repo — verbal agreements don't exist. Routing and constraints must be structural, not left to the agent's judgment. Feedback loops should be as tight as possible — we currently use a logging system to record the full reasoning chain of every query, and we've started using Codex for LLM-as-a-judge verification, but we're still far from ideal.

None of this is new. In traditional software engineering, these are called separation of concerns, single responsibility principle, docs-as-code, and shift-left constraints. We're just applying them to LLM work environments now, and some people feel that warrants a new name.

I don't know how many more new terms this field will produce. But I do know that, at least today, we've never achieved a qualitative leap on how-to-sglang by swapping in a stronger model. What actually drove breakthroughs was always improvements at the environment level — more precise knowledge partitioning, better routing logic, tighter feedback loops. Whether you call it harness engineering, context engineering, or nothing at all, it's just good engineering practice. Nothing more, nothing less.

There is one question I genuinely haven't figured out: if model capabilities keep scaling exponentially, will there come a day when models are strong enough to build their own environments? I had this exact confusion when observing OpenClaw — it went from 400K lines to a million in a single month, driven entirely by AI itself. Who built that project's environment? A human, or the AI? And if it was the AI, how many of the design principles we're discussing today will be completely irrelevant in two years? I don't know. But at least today, across every instance of real practice I can observe, this is still human work — and the most valuable kind.
Chayenne Zhao tweet media
42 replies · 140 reposts · 1.2K likes · 151K views
Towards AI @towards_AI
@ArtificialAnlys @MistralAI Mistral Small 4 in reasoning mode burns roughly 52M output tokens on the eval vs 91-110M for similar-sized peers. If you're cost-sensitive, that gap matters more than a few index points.
0 replies · 0 reposts · 0 likes · 39 views
Artificial Analysis @ArtificialAnlys
Mistral has released Mistral Small 4, an open weights model with hybrid reasoning and image input, scoring 27 on the Artificial Analysis Intelligence Index

@MistralAI's Small 4 is a 119B mixture-of-experts model with 6.5B active parameters per token, supporting both reasoning and non-reasoning modes. In reasoning mode, Mistral Small 4 scores 27 on the Artificial Analysis Intelligence Index, a 12-point improvement from Small 3.2 (15) and now among the most intelligent models Mistral has released, surpassing Mistral Large 3 (23) and matching the proprietary Magistral Medium 1.2 (27). However, it lags open weights peers with similar total parameter counts such as gpt-oss-120B (high, 33), NVIDIA Nemotron 3 Super 120B A12B (Reasoning, 36), and Qwen3.5 122B A10B (Reasoning, 42).

Key takeaways:

➤ Reasoning and non-reasoning modes in a single model: Mistral Small 4 supports configurable hybrid reasoning with reasoning and non-reasoning modes, rather than the separate reasoning variants Mistral has released previously with their Magistral models. In reasoning mode, the model scores 27 on the Artificial Analysis Intelligence Index. In non-reasoning mode, the model scores 19, a 4-point improvement from its predecessor Mistral Small 3.2 (15)

➤ More token efficient than peers of similar size: At ~52M output tokens, Mistral Small 4 (Reasoning) uses fewer tokens to run the Artificial Analysis Intelligence Index compared to reasoning models such as gpt-oss-120B (high, ~78M), NVIDIA Nemotron 3 Super 120B A12B (Reasoning, ~110M), and Qwen3.5 122B A10B (Reasoning, ~91M). In non-reasoning mode, the model uses ~4M output tokens

➤ Native support for image input: Mistral Small 4 is a multimodal model, accepting image input as well as text. On our multimodal evaluation, MMMU-Pro, Mistral Small 4 (Reasoning) scores 57%, ahead of Mistral Large 3 (56%) but behind Qwen3.5 122B A10B (Reasoning, 75%). Neither gpt-oss-120B nor NVIDIA Nemotron 3 Super 120B A12B support image input. All models support text output only

➤ Improvement in real-world agentic tasks: Mistral Small 4 scores an Elo of 871 on GDPval-AA, our evaluation based on OpenAI's GDPval dataset that tests models on real-world tasks across 44 occupations and 9 major industries, with models producing deliverables such as documents, spreadsheets, and diagrams in an agentic loop. This is more than double the Elo of Small 3.2 (339) and close to Mistral Large 3 (880), but behind gpt-oss-120B (high, 962), NVIDIA Nemotron 3 Super 120B A12B (Reasoning, 1021), and Qwen3.5 122B A10B (Reasoning, 1130)

➤ Lower hallucination rate than peer models of similar size: Mistral Small 4 scores -30 on AA-Omniscience, our evaluation of knowledge reliability and hallucination, where scores range from -100 to 100 (higher is better) and a negative score indicates more incorrect than correct answers. Mistral Small 4 scores ahead of gpt-oss-120B (high, -50), Qwen3.5 122B A10B (Reasoning, -40), and NVIDIA Nemotron 3 Super 120B A12B (Reasoning, -42)

Key model details:

➤ Context window: 256K tokens (up from 128K on Small 3.2)
➤ Pricing: $0.15/$0.6 per 1M input/output tokens
➤ Availability: Mistral first-party API only. At native FP8 precision, Mistral Small 4's 119B parameters require ~119GB to self-host the weights (more than the 80GB of HBM3 memory on a single NVIDIA H100)
➤ Modality: Image and text input with text output only
➤ Licensing: Apache 2.0 license
Artificial Analysis tweet media
17 replies · 24 reposts · 307 likes · 215.8K views
Towards AI @towards_AI
@MsftSecIntel Multi-turn jailbreaks (chaining instructions across conversations) are harder to catch than single-prompt attacks because most guardrail systems evaluate each turn on its own. That's the exact gap being exploited.
0 replies · 0 reposts · 0 likes · 292 views
Microsoft Threat Intelligence
Microsoft Threat Intelligence has observed threat actors actively experimenting with techniques to bypass or “jailbreak” AI safety controls. By reframing malicious requests, chaining instructions across multiple interactions, and misusing system‑ or developer‑style prompts, threat actors can coerce models into generating restricted content that bypasses built‑in safeguards. These techniques demonstrate how generative AI models are probed, shaped, and redirected to support reconnaissance, malware development, and social engineering while minimizing friction from moderation. AI guardrails have become dynamic surfaces that attackers test and manipulate to sustain operational advantage. As AI becomes more deeply embedded in enterprise workflows, understanding how attackers test and manipulate these guardrails is critical for defenders. Learn more about securing generative AI models on Azure AI Foundry: msft.it/6013Qs5oX
61 replies · 53 reposts · 315 likes · 1M views
Towards AI @towards_AI
@eng_khairallah1 The CI feedback routing is what makes this work. Most multi-agent setups die at "agent got stuck and nobody noticed." Routing CI failures back automatically closes that loop.
0 replies · 0 reposts · 0 likes · 25 views
Khairallah AL-Awady @eng_khairallah1
🚨 BREAKING: Composio just open-sourced the coordination layer that turns AI coding agents from a toy into a production system. It's called Agent Orchestrator. Bookmark it for later.

Running one AI agent in your terminal is easy. Running 30 of them across different issues, branches, and PRs at the same time is a coordination nightmare. Without this, you're manually creating branches, babysitting agents, checking if they're stuck, reading CI logs, forwarding review comments, and tracking which PRs are ready to merge. Agent Orchestrator handles all of it.

What it actually does:
→ Spawns parallel Claude Code, Codex, or Aider agents on any issue
→ Every agent gets its own isolated git worktree, its own branch, its own PR
→ CI fails? The orchestrator sends the logs back to the agent
→ Agent stuck or needs human judgment? Only then does it notify you
→ Real-time dashboard at localhost:3000 to monitor every session
→ 8 plugin slots: swap any agent, runtime, tracker, or notification channel
→ Works with GitHub and Linear out of the box
→ 3,288 test cases. Production-ready

Each agent gets worktree isolation, CI feedback routing, review comment handling, and status tracking. All automatic.

Here's the wildest part: Agent Orchestrator was built by 30 agents running Agent Orchestrator. The tool orchestrated its own construction. Every commit has a Co-Authored-By trailer showing which AI model wrote it.

100% Open Source. MIT License. Built by Composio. (Link in comments)
Khairallah AL-Awady tweet media
51 replies · 83 reposts · 629 likes · 35.4K views
Towards AI @towards_AI
@varun_mathur Right now every framework rolls its own sandboxing and permission model, so security audits end up being per-framework instead of per-machine. A shared runtime layer actually fixes that.
0 replies · 0 reposts · 0 likes · 64 views
Varun @varun_mathur
Introducing the Agent Virtual Machine (AVM)

Think V8 for agents. AI agents are currently running on your computer with no unified security, no resource limits, and no visibility into what data they're sending out. Every agent framework builds its own security model, its own sandboxing, its own permission system. You configure each one separately. You audit each one separately. You hope you didn't miss anything in any of them.

The AVM changes this. It's a single runtime daemon (avmd) that sits between every agent framework and your operating system. Install it once, configure one policy file, and every agent on your machine runs inside it - regardless of which framework built it. The AVM enforces security (91-pattern injection scanner, tool/file/network ACLs, approval prompts), protects your privacy (classifies every outbound byte for PII, credentials, and financial data - blocks or alerts in real-time), and governs resources (you say "50% CPU, 4GB RAM" and the AVM fair-shares it across all agents, halting any that exceed their budget). One config. One audit command. One kill switch.

The architectural model is V8 for agents. Chrome, Node.js, and Deno are different products but they share V8 as their execution engine. Agent frameworks bring the UX. The AVM brings the trust. Where needed, AVM can also generate zero-knowledge proofs of agent execution via 25 purpose-built opcodes and 6 proof systems, providing the foundational pillar for the agent-to-agent economy.

AVM v0.1.0 - Changelog

- Security gate: 5-layer injection scanner with 91 compiled regex patterns. Every input and output scanned. Fail-closed - nothing passes without clearing the gate.
- Privacy layer: Classifies all outbound data for PII, credentials, and financial info (27 detection patterns + Luhn validation). Block, ask, warn, or allow per category. Tamper-evident hash-chained log of every egress event.
- Resource governor: User sets system-wide caps (CPU/memory/disk/network). AVM fair-shares across all agents. Gas budget per agent - when gas runs out, execution halts. No agent starves your machine.
- Sandbox execution: Real code execution in isolated process sandboxes (rlimits, env sanitization) or Docker containers (--cap-drop ALL, --network none, --read-only). AVM auto-selects the tier - agents never choose their own sandbox.
- Approval flow: Dangerous operations (file writes, shell commands, network requests) trigger interactive approval prompts. 5-minute timeout auto-denies. Every decision logged.
- CLI dashboard: hyperspace-avm top shows all running agents, resource usage, gas budgets, security events, and privacy stats in one live-updating screen.
- Node.js SDK: Zero-dependency hyperspace/avm package. AVM.tryConnect() for graceful fallback - if avmd isn't running, the agent framework uses its own execution path. OpenClaw adapter example included.
- One config for all agents: ~/.hyperspace/avm-policy.json governs every agent framework on your machine. One file. One audit. One kill switch.
120 replies · 178 reposts · 1.3K likes · 132.3K views
Towards AI @towards_AI
@bcherny Ship the infra early, let the model catch up later. MCP alone changed how a lot of teams wire up tool use, and it started from that small group.
0 replies · 0 reposts · 0 likes · 33 views
Boris Cherny @bcherny
Little known fact, the Anthropic Labs team (the team I joined Anthropic to be on) shipped: - MCP - Skills - Claude Desktop app - Claude Code It was just a few of us, shipping fast, trying to keep pace with what the model was capable of. Those early Desktop computer use prototypes, back in the Sonnet 3.6 days, felt clunky and slow. But it was easy to squint and imagine all the ways people might use it once it got really good. Fast forward to today. I am so excited to release full computer use in Cowork and Dispatch. Really excited to see what you do with it!
Claude @claudeai

You can now enable Claude to use your computer to complete tasks. It opens your apps, navigates your browser, fills in spreadsheets—anything you'd do sitting at your desk. Research preview in Claude Cowork and Claude Code, macOS only.

460 replies · 411 reposts · 9.3K likes · 974K views
Towards AI @towards_AI
@claudeai Big gap between "chat with an AI" and "AI clicks around your desktop." How does the permission model hold up when a compliance team needs audit trails on every action?
0 replies · 0 reposts · 0 likes · 59 views
Claude @claudeai
You can now enable Claude to use your computer to complete tasks. It opens your apps, navigates your browser, fills in spreadsheets—anything you'd do sitting at your desk. Research preview in Claude Cowork and Claude Code, macOS only.
4.9K replies · 14.6K reposts · 139.4K likes · 74.8M views
Max von Wallenberg @maxwallenberg
Today we are launching OWS @OpenWallet - an open standard that unifies how agents interact with wallets.

OWS is built open-source with support from: PayPal, OKX, Ripple, Tron, TON, Solana, Ethereum, Base, Polygon, SUI, Filecoin, LayerZero, DFlow, Circle, Uniblock, Virtuals, Arbitrum, Dynamic, Allium, and Simmer Markets

Start: openwallet.sh
Docs: docs.openwallet.sh
Github: github.com/open-wallet-st…
Max von Wallenberg tweet media
60 replies · 104 reposts · 794 likes · 77.9K views
0xSero @0xSero
Good news a private donor has reached out to me personally and compute will be secured in the next 3 months. It’s been a crazy week but we made it happen.
0xSero @0xSero

x.com/i/article/2034…

84 replies · 52 reposts · 1.6K likes · 39.5K views
Towards AI @towards_AI
Craft the perfect AI system architecture to break into real AI jobs. If you're serious about becoming an AI engineer this year, hit follow.
0 replies · 0 reposts · 2 likes · 78 views
Towards AI @towards_AI
Anthropic just released something that might be revolutionary... You can now pair Dispatch with remote control, which lets Claude use your mouse, keyboard, and screen right from your phone! That's a 24/7 digital employee that costs only $20 a month. Crazy times we're in.
0 replies · 0 reposts · 0 likes · 144 views