Jesse Anglen

386 posts


@Jesse_Anglen

Founder @ https://t.co/aotYG5swVz 🚀 | Digital Labor Evangelist | AI Agent Pioneer | Speaker • Influencer • Thought Leader

Joined August 2021
725 Following · 394 Followers
Jesse Anglen@Jesse_Anglen·
AI traffic grew 8,000% last year. Bots now officially outnumber humans online. Cloudflare predicted this would happen by 2027. It's March 2026. Everyone's calling it the "dead internet." I've got agents running research and filing reports at 3 AM. It isn't dead. It just stopped waiting for us.
Jesse Anglen@Jesse_Anglen·
4/6 You can also schedule recurring tasks. Weekly analytics dashboard every Friday morning. Slack digest of prioritized emails at 8 AM weekdays. Competitor audit on the first of each month. Just tell it. It runs while you're not there. Your personal SRE who never sleeps.
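The recurring schedules named in this tweet can be sketched with stdlib datetime arithmetic alone (the weekday/hour values mirror the tweet's examples; the agent runner around it is hypothetical):

```python
from datetime import datetime, timedelta

def next_run(after: datetime, weekday: int, hour: int) -> datetime:
    """Next occurrence of weekday/hour strictly after `after`.
    weekday uses the datetime convention: Monday=0 ... Sunday=6."""
    candidate = after.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - candidate.weekday()) % 7)
    if candidate <= after:
        candidate += timedelta(days=7)
    return candidate

# "Weekly analytics dashboard every Friday morning" -> Friday (4) at 08:00
now = datetime(2026, 3, 4, 12, 0)        # a Wednesday
print(next_run(now, weekday=4, hour=8))  # -> 2026-03-06 08:00:00
```

A real runner would sleep until `next_run(...)`, fire the task, and repeat; cron expressions are the more common way to spell the same schedules.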
Jesse Anglen@Jesse_Anglen·
@heykahn Curious if it accounts for quantization tier vs context window tradeoffs. Different 4-bit vs 8-bit setups on the same GPU hit very different throughput ceilings depending on the context length you're targeting. That's usually where people actually get surprised.
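The quantization-vs-context tradeoff mentioned here can be made concrete with a back-of-envelope memory model (layer/head counts and the fp16 KV-cache assumption are illustrative defaults, not any specific model's numbers):

```python
def est_vram_gb(params_b: float, weight_bits: int, ctx_len: int,
                n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, cache_bytes: int = 2) -> float:
    """Back-of-envelope VRAM: quantized weights plus an fp16 KV cache.
    KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * tokens * 2."""
    weights = params_b * 1e9 * weight_bits / 8
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * ctx_len * cache_bytes
    return (weights + kv_cache) / 1e9

# Same ~7B model, same GPU: 8-bit weights at a 4k context cost about as
# much as 4-bit weights at a 32k context -- the quantization savings are
# what buys the longer context.
print(round(est_vram_gb(7, 8, 4096), 1))    # -> 7.5
print(round(est_vram_gb(7, 4, 32768), 1))   # -> 7.8
```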
Zain Kahn@heykahn·
Stop guessing which LLMs your machine can actually handle. llmfit analyzes your hardware in seconds to find your perfect local AI match.

P.S. Sharing more practical, no-fluff AI resources with 150K+ engineers here: codenewsletter.ai/subscribe?utm_…

So, instead of downloading a model and hitting an OOM error, it scans your RAM, CPU, GPU, and VRAM first - then scores every model across 4 dimensions:
1. Quality - param count, model family, quantization penalty
2. Speed - estimated tok/s for your exact backend (CUDA, Metal, ROCm)
3. Fit - memory utilization vs. your available hardware
4. Context - context window vs. your use case

Each model gets a label: Perfect / Good / Marginal / Too Tight.
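As a rough illustration of the "Fit" dimension, here is a toy utilization-to-label mapping; the thresholds are guesses for the sake of the sketch, not llmfit's actual cutoffs:

```python
def fit_label(model_gb: float, available_gb: float) -> str:
    """Map memory utilization to a fit label. The thresholds below are
    invented for illustration, not llmfit's real scoring."""
    if available_gb <= 0 or model_gb > available_gb:
        return "Too Tight"
    utilization = model_gb / available_gb
    if utilization <= 0.70:
        return "Perfect"
    if utilization <= 0.85:
        return "Good"
    return "Marginal"

print(fit_label(7.5, 16.0))    # -> Perfect   (~47% of VRAM)
print(fit_label(15.5, 16.0))   # -> Marginal  (~97%)
```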
Jesse Anglen@Jesse_Anglen·
@svpino Bidirectional is the one people underestimate. CLI is fire-and-forget. MCP lets the server push back, stream progress, and maintain state across calls. Once you're building an agent that actually needs to know what's happening mid-task, fire-and-forget stops being an option.
Santiago@svpino·
Both CLI and MCP are useful. They can coexist and complement each other. Here are just a few ways an MCP server is better than a CLI:
• Bidirectional communication: you get a persistent connection with MCP where the server can push updates, stream progress, or notify the agent.
• Elicitation: MCP servers can ask the model (or user) for clarification mid-execution.
• Scoped permissions: built-in OAuth and permission boundaries. No need to sandbox a shell to prevent malicious commands.
• Enterprise governance: audit trails, auth handshakes, and multi-tenant authorization out of the box.
• Remote execution: MCP over streamable HTTP means agents can use complex backends without local installs.
• Stateful by default: session management, connection pooling, and multi-step workflows are much simpler than with a CLI.
I hope more companies embrace both.
Karan Vaidya@KaranVaidya6

Okay, @gdb is team CLI all the way. @garrytan thinks MCPs suck. So we hit the streets of SF to see if the city agreed. We posed a simple question: MCP or CLI?
- Basically everyone under the age of 35 said CLI
- One person said MCP was as bloated as Java
- & unsurprisingly, numerous people told us to touch grass
Final score: MCP 3 vs CLI 17. SF has spoken, and @composio listened. Our universal CLI is now live! Drop your best CLI vs MCP hot take in the comments and we'll send the best ones some very sick gear 👀 Link to try our CLI in the next thread ⬇️
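The bidirectional/stateful distinction from the MCP list above can be sketched in a few lines: a CLI-style call returns once, while an MCP-style session can stream progress events mid-task and keep state across calls (an illustrative toy, not the MCP SDK):

```python
from typing import Iterator

def run_cli(cmd: str) -> str:
    """Fire-and-forget: one request, one final answer, nothing in between."""
    return f"done: {cmd}"

class Session:
    """MCP-style sketch: a persistent session that streams progress events
    mid-task and remembers results across calls (a toy, not the MCP SDK)."""
    def __init__(self) -> None:
        self.history: list[str] = []

    def run(self, task: str, steps: int = 3) -> Iterator[dict]:
        for i in range(1, steps + 1):
            # the "server" pushes an update after every step
            yield {"task": task, "progress": i / steps}
        self.history.append(task)  # state survives into the next call

session = Session()
events = list(session.run("audit repo"))
print(events[-1]["progress"], session.history)   # -> 1.0 ['audit repo']
```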

Jesse Anglen@Jesse_Anglen·
@LangChain Tried rolling back a production prompt manually three months ago. Two failed deploys, one incident review, and a very uncomfortable Slack thread later. A one-click rollback with full deployment history would've saved everyone involved about four hours of pain.
LangChain@LangChain·
Environments in LangSmith Prompt Hub

Environments give you a proper promotion workflow for your prompts:
- Assign any commit to Staging or Production
- Promote between environments instantly
- Roll back with a single click from a full deployment history
- Reference reserved tags (`staging`, `production`) in your code so the right version is always served with no code changes needed

The same promote-and-rollback pattern you already use for app deployments, applied to prompt management. Give it a spin 🚀

Docs: docs.langchain.com/langsmith/mana…
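The tag-to-commit mechanics behind promote-and-rollback can be modeled in a few lines (a toy sketch of the general pattern, not the LangSmith API):

```python
class PromptHub:
    """Toy model of tag-based prompt environments: tags point at commits,
    and a promotion log makes rollback a single call."""
    def __init__(self) -> None:
        self.commits, self.tags, self.log = {}, {}, []

    def commit(self, sha: str, text: str) -> None:
        self.commits[sha] = text

    def promote(self, tag: str, sha: str) -> None:
        self.log.append((tag, self.tags.get(tag)))  # remember previous target
        self.tags[tag] = sha

    def rollback(self, tag: str) -> str:
        for t, prev in reversed(self.log):
            if t == tag and prev is not None:
                self.tags[tag] = prev
                return prev
        raise ValueError(f"no earlier deployment for {tag!r}")

    def get(self, tag: str) -> str:
        return self.commits[self.tags[tag]]

hub = PromptHub()
hub.commit("a1", "v1: summarize politely")
hub.commit("b2", "v2: summarize tersely")
hub.promote("production", "a1")
hub.promote("production", "b2")
hub.rollback("production")       # one call, back to the previous commit
print(hub.get("production"))     # -> v1: summarize politely
```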
Jesse Anglen@Jesse_Anglen·
Distribution is the real moat. A billion Samsung device installs with your AI browser as default: that's the same playbook Google used to dominate mobile search for a decade. Samsung's hedging by running Perplexity AND Gemini side by side, so the real game is which one becomes the habitual tap after day one.
Aravind Srinivas@AravSrinivas·
We're deepening our partnership with Samsung by powering the AI on their browser that's pre-installed on 1B+ Samsung devices (100m+ active users), extending the partnership that allows Perplexity to power Bixby and be pre-loaded on all Galaxy S26 devices alongside Gemini.
Jesse Anglen@Jesse_Anglen·
$1.2B into robotics last week. Mind Robotics $500M, Rhoda AI $450M, Sunday $165M, Oxa $103M. We keep focusing on software agents. Physical AI just got $1.2B in 7 days. Why isn't anyone building the layer that makes these two work together?
Jesse Anglen@Jesse_Anglen·
Something happened yesterday I've been waiting two years to see. Amazon shipped an AI agent that finds security vulnerabilities and patches them. Autonomously. No human needed. CrowdStrike fell 7% on the news. Which other companies haven't figured out their AI-agent answer yet?
Jesse Anglen@Jesse_Anglen·
@fchollet Yearly releases with fresh unsaturation is the ambitious part. Do you run blind eval against frontier models right before publishing, or is there a different protocol? Genuinely curious how internal testing doesn't partially saturate it by the time it drops.
François Chollet@fchollet·
For those wondering about ARC-AGI-4 timing: it will be released in early 2027. We are aiming for a yearly release schedule for new benchmarks. We are also aiming for each new benchmark to be fully unsaturated upon release, and to target the most important unanswered research questions at that time. This requires us to estimate where AI capabilities will be (and won't be) one year from now. Like we did over one year ago when we started to work on ARC-AGI-3.
Jesse Anglen@Jesse_Anglen·
@hwchase17 Single-turn eval doesn't transfer. Outcome-based scoring misses that an agent can succeed via lucky shortcut or fail despite 8 of 10 decisions being correct. Path quality matters as much as task completion, maybe more in production where shortcuts become recurring failure modes.
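One way to make "path quality matters as much as task completion" concrete is to blend outcome with per-decision correctness; the 50/50 weighting here is illustrative, not a standard:

```python
def trajectory_score(decisions_ok: list[bool], succeeded: bool,
                     outcome_weight: float = 0.5) -> float:
    """Blend outcome with path quality, so a lucky shortcut and a
    near-perfect failure don't score 1.0 and 0.0 (weights illustrative)."""
    path = sum(decisions_ok) / len(decisions_ok) if decisions_ok else 0.0
    return outcome_weight * float(succeeded) + (1 - outcome_weight) * path

# Succeeded via a lucky shortcut (3 of 10 decisions correct) vs.
# failed despite 8 of 10 decisions being correct:
lucky = trajectory_score([False] * 7 + [True] * 3, succeeded=True)
unlucky = trajectory_score([True] * 8 + [False] * 2, succeeded=False)
print(lucky, unlucky)   # ~0.65 vs ~0.40 rather than 1.0 vs 0.0
```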
Jesse Anglen@Jesse_Anglen·
Worth separating two things. Chat jailbreaks are annoying but bounded. An autonomous agent that can discover and optimize novel attacks at scale is a different threat model. Beating 30+ existing attacks via autoresearch isn't jailbreaking, it's offensive AI red-teaming on autopilot.
Jesse Anglen@Jesse_Anglen·
@svpino Ran three agents in parallel on the same codebase last week. The coordination part was the actual problem, not the output quality. Dependency chaining between tasks is the thing I didn't know I needed.
Santiago@svpino·
A single board to orchestrate all your coding agents. The future of software development is managing a swarm of agents.

$ npm i -g cline

Cline Kanban is a board where you can create tasks, chain dependencies, and see how your agents tackle them. Works with Claude Code, Codex, and Cline. Free and open-source.
Cline@cline

Introducing Cline Kanban: A standalone app for CLI-agnostic multi-agent orchestration. Claude and Codex compatible. npm i -g cline Tasks run in worktrees, click to review diffs, & link cards together to create dependency chains that complete large amounts of work autonomously.

Jesse Anglen@Jesse_Anglen·
Context window vs. use case is the one that matters most for agent workloads. A 4k context feels fine for chat. Run a multi-step agent loop and you hit the ceiling fast. Does llmfit let you specify "I'm building agents" vs "I'm doing chat"? That'd shape the Fit score pretty significantly.
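Naive accounting shows how fast an agent loop exhausts a 4k window when every tool call and observation is appended to the prompt (the token figures are illustrative):

```python
def steps_until_full(ctx_window: int, system_tokens: int,
                     tokens_per_step: int) -> int:
    """How many agent-loop steps fit if each step appends its tool call
    and observation to the prompt (naive, no-truncation accounting)."""
    used, steps = system_tokens, 0
    while used + tokens_per_step <= ctx_window:
        used += tokens_per_step
        steps += 1
    return steps

# Illustrative numbers: 500-token system prompt, ~600 tokens per
# tool-call + observation round trip.
print(steps_until_full(4_096, 500, 600))    # -> 5  (chat-sized window)
print(steps_until_full(32_768, 500, 600))   # -> 53
```

Five round trips before the ceiling is "feels fine for chat, fails for agents" in numbers; real harnesses buy headroom with truncation and summarization, which this sketch deliberately omits.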
Jesse Anglen@Jesse_Anglen·
@svpino Adobe's moat isn't the UI, it's twenty years of workflow muscle memory. Layer-based AI editing is coming either way. Real question: does Adobe ship it first and become the layer engine for AI-native tools, or does someone else and Adobe becomes the Kodak of creative software?
Santiago@svpino·
Adobe will start sweating bullets the day AI starts generating images with layers that you can edit independently. Imagine being able to modify every single little detail of an image without touching the rest. It looks like that's where we are going.
Priyaa@pritopian

Billions of $$ raised to generate images, and none of them let you actually edit what you get. You type a prompt, get something close, try to fix one thing, and the whole image regenerates. Now 50% of what was working is gone. So you prompt again. And again. Stuck in prompt doom loops, burning tokens every single time. The output is always a flat PNG, limiting what you can do with it.

@world_lica actually reads your image and breaks it into structured, editable layers. You go in, change what you need to change, and everything else stays exactly where it was. Route each layer to the right model or a capable human. You don't need to regenerate from scratch or pay the token tax to fix a font color.

Enterprises publishing creatives across site, social, and email are already using Lica to own their model and own their output. We're grateful to be supported by @Accel, @amasad, @snsf, @southpkcommons, @villageglobal, and @pirroh to build the editing layer that AI image gen has been missing. Want early access? Check the next thread below.

Jesse Anglen@Jesse_Anglen·
Built a full deployment agent last fall and can confirm. The code gen part worked by day 2. Spent another week teaching it to navigate Stripe's sandbox/prod key rotation, Vercel's project linking flow, and GitHub webhook setup. Most of those APIs exist but weren't designed to be called without a human in the loop double-checking each step.
Andrej Karpathy@karpathy·
When I built menugen ~1 year ago, I observed that the hardest part by far was not the code itself, it was the plethora of services you have to assemble like IKEA furniture to make it real, the DevOps: services, payments, auth, database, security, domain names, etc... I am really looking forward to a day where I could simply tell my agent: "build menugen" (referencing the post) and it would just work. The whole thing up to the deployed web page. The agent would have to browse a number of services, read the docs, get all the api keys, make everything work, debug it in dev, and deploy to prod. This is the actually hard part, not the code itself. Or rather, the better way to think about it is that the entire DevOps lifecycle has to become code, in addition to the necessary sensors/actuators of the CLIs/APIs with agent-native ergonomics. And there should be no need to visit web pages, click buttons, or anything like that for the human. It's easy to state, it's now just barely technically possible and expected to work maybe, but it definitely requires from-scratch re-design, work and thought. Very exciting direction!
Patrick Collison@patrickc

When @karpathy built MenuGen (karpathy.bearblog.dev/vibe-coding-me…), he said: "Vibe coding menugen was an exhilarating and fun escapade as a local demo, but a bit of a painful slog as a deployed, real app. Building a modern app is a bit like assembling IKEA furniture. There are all these services, docs, API keys, configurations, dev/prod deployments, team and security features, rate limits, pricing tiers."

We've all run into this issue when building with agents: you have to scurry off to establish accounts, clicking things in the browser as though it's the antediluvian days of 2023, in order to unblock its superintelligent progress. So we decided to build Stripe Projects to help agents instantly provision services from the CLI. For example, simply run:

$ stripe projects add posthog/analytics

And it'll create a PostHog account, get an API key, and (as needed) set up billing. Projects is launching today as a developer preview. You can register for access (we'll make it available to everyone soon) at projects.dev. We're also rolling out support for many new providers over the coming weeks. (Get in touch if you'd like to make your service available.)

Jesse Anglen@Jesse_Anglen·
Agents that write their own tools mid-task. NVIDIA shipped OpenShell last week. Adobe, SAP, Salesforce, and 13 others adopted it within days. They're calling these agents 'claws.' Either bold branding or someone really liked Attack on Titan.
Jesse Anglen@Jesse_Anglen·
Usability is a real constraint, but I'd put something bigger first: every tool built on top of stable DOM structure. Test suites, accessibility tooling, browser extensions all break with pixel-streamed UIs. The hard part isn't users adapting to a new UI each time. It's the whole ecosystem built around consistent structure.
Matt Shumer@mattshumer_·
In 5 to 7 years, UIs will be generated/streamed from the cloud, pixel-by-pixel. Phones/etc. will literally just be useless bricks with screens, speakers, and input. That said, UIs won't be as dynamic as people expect. Imagine a new UI each time... that'd be so hard to use!
Jesse Anglen@Jesse_Anglen·
'Action efficiency' is where this benchmark gets philosophically serious. Past benchmarks ask 'can it?' This one asks 'how cheaply?' Humans don't just solve these environments, they solve them in roughly optimal steps. Curious: is efficiency graded continuously, or pass/fail once you cross the human threshold? That answer shapes where models make their first real progress.
François Chollet@fchollet·
ARC-AGI-3 is out now! We've designed the benchmark to evaluate agentic intelligence via interactive reasoning environments. Beating ARC-AGI-3 will be achieved when an AI system matches or exceeds human-level action efficiency on all environments, upon seeing them for the first time. We've done extensive human testing that shows 100% of these environments are solvable by humans, upon first contact, with no prior training and no instructions. Meanwhile, all frontier AI reasoning models do under 1% at this time.
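The question of continuous vs. pass/fail grading can be stated precisely. Here are both variants under one plausible definition of action efficiency; the benchmark's actual formula may differ:

```python
def efficiency(human_steps: int, agent_steps: int) -> float:
    """Continuous grading: 1.0 at the human action count, decaying as
    the agent takes more actions (one plausible definition)."""
    return min(1.0, human_steps / agent_steps)

def passed(human_steps: int, agent_steps: int) -> bool:
    """Pass/fail grading: credit only at or above human action efficiency."""
    return agent_steps <= human_steps

# A hypothetical 12-step human solution vs. a 30-step agent run:
print(efficiency(12, 30), passed(12, 30))   # -> 0.4 False
```

The difference matters for progress tracking: the continuous variant rewards an agent that goes from 300 wasted actions to 30, while the pass/fail variant scores both runs identically.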
Jesse Anglen@Jesse_Anglen·
Your follow-up with the Opus data answers your question pretty directly. 0% to 97.1% from adding a harness isn't a model limitation story. It's an interface story. ARC-AGI-3 measures whether you can build the right harness for the task, which is arguably a more interesting thing to measure for real-world AI deployment.
Ethan Mollick@emollick·
ARC-AGI-3 took me a few tries, but it is definitely human winnable. I am curious how much of the initially very low performance of frontier models is harness, vision, and tools, versus how much is limitations of LLMs. I guess we will find out! arcprize.org/arc-agi/3
Jesse Anglen@Jesse_Anglen·
'Trying too hard' is the right diagnosis. The problem isn't what gets stored, it's that memory doesn't model forgetting. That 2-month question had a half-life. Relevance decays. Building agents with persistent memory, this comes up constantly. Early-session context bleeds into later sessions in ways that are genuinely hard to prune.
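The "half-life" framing maps directly onto exponential decay. A toy relevance weight (an illustrative model, not how any production memory system actually scores memories):

```python
def relevance(age_days: float, half_life_days: float) -> float:
    """Exponential decay: a memory's weight halves every half-life."""
    return 0.5 ** (age_days / half_life_days)

# A one-off question from two months ago, with a 7-day half-life,
# carries almost no weight; with no decay modeled it still scores 1.0.
print(round(relevance(60, 7), 4))      # -> 0.0026
print(relevance(60, float("inf")))     # -> 1.0
```

In practice the half-life itself should probably depend on how often a memory is reinforced, which is exactly the pruning problem the tweet describes.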
Andrej Karpathy@karpathy·
One common issue with personalization in all LLMs is how distracting memory seems to be for the models. A single question from 2 months ago about some topic can keep coming up as some kind of a deep interest of mine with undue mentions in perpetuity. Some kind of trying too hard.