cole murray

8.2K posts

@colemurray

ai/ml | cto | prev founder | former sr. sde @ amazon

San Francisco, CA · Joined February 2015

980 Following · 4.1K Followers

Pinned Tweet

cole murray@colemurray·Oct 25

Advice given to someone asking about AI consulting:
I don't think an ML background is required to be successful in AI consulting, but it obviously helps.
I think the biggest "skill" learned in ML is how to run feedback loops in a system successfully. In an ML system, this typically involves cleaning data, making model tweaks, running performance evals, etc. With LLMs, in nearly every case you won't be fine-tuning the model, but iterating on prompts is a very similar workflow.
I do think it would be helpful to at least get a high-level understanding of how the models "actually" work and become familiar with the basic terms, e.g. tokens, transformers, attention, and what happens on each input -> output iteration as the model is predicting. You don't need to know the underlying math (helpful though), but understanding what is happening is helpful.
Most of the AI consulting market leans on full-stack / product development skills more than ML. These aren't the most lucrative opportunities, but they are available in abundance.
Major areas now and over the next year:
- RAG: this is basically just glorified search lol. Useful in many contexts but severely overhyped
- Agents: The models aren't quite there yet IMO for this to be useful, but in 2025 I think this will be a major theme and a HUGE area of interest/investment. Becoming good at this will be valuable.
- Evals: Performance evaluations are a relatively untapped market. Most AI products you see today are flying by the seat of their pants. Without eval metrics, you can't truly know if your prompt changes are improving the system. This is somewhat more difficult to sell as a consultant as it requires a more sophisticated buyer, but is worth a lot of money if you can do it well

237

48.8K
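
The point about evals in the post above can be made concrete with a minimal sketch: score two prompt variants against the same labeled cases and compare. Everything named here (`EvalCase`, `accuracy`, `fake_model`, the two prompts, exact-match scoring) is a hypothetical illustration and not the author's setup; `fake_model` stands in for a real LLM call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    """One labeled example: a question and the answer we expect back."""
    question: str
    expected: str


def accuracy(prompt_template: str,
             cases: Sequence[EvalCase],
             model: Callable[[str], str]) -> float:
    """Fraction of cases whose reply exactly matches the expected answer."""
    if not cases:
        raise ValueError("need at least one eval case")
    hits = 0
    for case in cases:
        reply = model(prompt_template.format(question=case.question))
        # Exact match (case-insensitive) is an assumed metric; real evals
        # often need fuzzier scoring.
        hits += reply.strip().lower() == case.expected.strip().lower()
    return hits / len(cases)


def fake_model(prompt: str) -> str:
    """Stand-in for an LLM call (assumption, not a real API).

    It returns a bare answer only when the prompt asks for one word.
    """
    answers = {"capital of France": "Paris", "2 + 2": "4"}
    for key, value in answers.items():
        if key in prompt:
            return value if "one word" in prompt else f"The answer is {value}."
    return "unknown"


CASES = [
    EvalCase("What is the capital of France?", "Paris"),
    EvalCase("What is 2 + 2?", "4"),
]

PROMPT_A = "Answer the question: {question}"
PROMPT_B = "Answer in one word: {question}"

if __name__ == "__main__":
    for name, prompt in (("A", PROMPT_A), ("B", PROMPT_B)):
        print(name, accuracy(prompt, CASES, fake_model))
```

With a fixed case set, a prompt change becomes a measurable difference in score rather than a judgment call. Swapping `fake_model` for a real client keeps the loop unchanged.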

cole murray@colemurray·10h

i have a solution to the fable issue
what if we just rename it...

487

cole murray@colemurray·12h

@DennisonBertram @yingyangwins sir, they’re asking us to show something useful we made with those 250 agents

576

Dennison@DennisonBertram·20h

Major life hack: DeepSeek in the Claude Code harness can also build and drive workflows, at a fraction of the cost and Opus 4.6/7 quality. I've got it running over 250 subagents in a workflow in adversarial reviews. Pennies on the dollar. Use my tool "Deep-Claude"

533

64.5K

cole murray@colemurray·13h

@maxktz co-sign x.com/colemurray/sta…

cole murray@colemurray

conversely, i think most teams probably shouldn’t be building their own harness.
it is unlikely you will have novel ideas around sub-agent orchestration, compaction, progressive disclosure etc that are worth owning the entire harness
spend your time investing in the pieces around the harness:
- execution infrastructure
- custom tools, MCPs and skills
- self improvement on trajectories

601

Max Katz@maxktz·1d

life lesson: never bet on a custom harness like Pi
been loving my custom Pi setup for the last few weeks, the fact that I can build any extension, use any models
but things are moving too fast today
huge teams behind Claude / Codex change the way we develop almost every month
so by building and maintaining a custom agent you're more likely to get left behind
most models perform better in their native harnesses anyway, and using external ones is likely to get banned
so I recommend betting your workflow on portable primitives, like prompts, skills or scripts, instead of custom agents

158

48.9K

cole murray@colemurray·23h

@HamelHusain i find MCP-based skill retrieval has issues with non-determinism, where often the "correct" skills don't get loaded when needed. Vercel had some good research on this here:
TLDR: jam all the skills in agents.md
vercel.com/blog/agents-md…

278

Hamel Husain@HamelHusain·1d

Out of all the replies this solution looks especially clean x.com/lolbrandonk/st…

Hamel Husain@HamelHusain

What’s the best way for non developers to
1. share skills with their team
2. automatically enforce that it’s always updated for everyone if changes are made
3. allow others to update it centrally
Github is not the best solution as it’s too clunky and doesn’t solve #2
Notion is a little better but can’t put code there
I’m tempted to create my own tools but someone surely has created this already??

19.6K
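
The "jam all the skills in agents.md" suggestion in the reply above could be automated as a small build step that merges per-skill files into one always-loaded file. This is a sketch under assumptions: the `skills/*.md` layout, the heading format, and the `build_agents_md` helper are hypothetical, and nothing here is taken from the linked Vercel post.

```python
from pathlib import Path


def build_agents_md(skills_dir: Path, out_file: Path) -> str:
    """Concatenate every skill file into one instructions file.

    Files are merged in sorted filename order so the output is
    deterministic from run to run.
    """
    sections = []
    for path in sorted(skills_dir.glob("*.md")):
        body = path.read_text(encoding="utf-8").strip()
        # One heading per skill, named after the file (assumed convention).
        sections.append(f"## Skill: {path.stem}\n\n{body}\n")
    text = "# Agent instructions\n\n" + "\n".join(sections)
    out_file.write_text(text, encoding="utf-8")
    return text
```

Calling `build_agents_md(Path("skills"), Path("AGENTS.md"))` from CI on every merge is one way to keep a single central copy current, at the cost of spending context on every skill in every session.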

cole murray@colemurray·23h

@tekbog nobody escapes uncle jeff

295

terminally onλine εngineer@tekbog·23h

crazy how AWS brings everyone down eventually

Polymarket@Polymarket

NEW: Amazon researchers are reportedly behind the jailbreak report that led to the U.S. crackdown on Anthropic’s top models.

489

14.4K

cole murray@colemurray·23h

assuming it had a similar running cost as an opus-level model, i don't think much would change. Opus is already quite good at cyber capabilities and there is plenty of low-hanging fruit left for motivated actors
Low-level actors do not have the finances to be able to run a Mythos-level model for anything significant (using Anthropic's $20,000 OpenBSD bug as a reference)
For sophisticated actors, i'm not convinced it changes the existing landscape much. Exploit development is a fairly small part of a much larger kill chain and letting a non-deterministic system operate in a stealth-sensitive environment seems like a good way to burn the operation.

542

Zack Korman@ZackKorman·1d

I feel everyone is talking about cyber risk with very little input from cybersecurity. For people in cyber, I want your take: How good or bad would it be for cyber if an open-weight no-guardrails Mythos-level model released tomorrow?

163

220

54.5K

cole murray@colemurray·1d

@emily_yuan_ @UseCorgi would this help with legal fees for an export control directive? asking for a friend

155

Emily Yuan@emily_yuan_·1d

Imagine getting sued because your AI agent messes up. That's why we built AI coverage at @UseCorgi to help cover new types of risks that AI is creating during this technological shift. We give our AI agents a lot of autonomy today (e.g. pushing code to production, talking to customers, processing payments). And sometimes, they get things wrong.

10.3K

cole murray@colemurray·1d

imagine being so gpu poor, you have to stage an export controls ban to avoid hosting your model

726

cole murray@colemurray·1d

@signulll hate to be the bearer of bad news

434

signüll@signulll·1d

i had a convo with someone at a big lab recently where she framed my account as an umpire calling balls & strikes. & i deeply appreciated that.

122

10.1K

cole murray@colemurray·1d

@keennay tire folded under pressure

cole murray@colemurray·1d

@gwenshap custom cli wrapper over the logs api/service

266

Gwen (Chen) Shapira@gwenshap·2d

Folks who use Codex/Claude Code for SRE-like stuff... what's your solution to log files eating up context and tokens like crazy?

23.8K
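
One way to read "custom cli wrapper over the logs api/service" in the exchange above is a filter that condenses raw logs before an agent ever sees them. The reply does not describe the actual wrapper, so this is a hypothetical sketch: the `condense` helper, its parameters, and the plain-substring level match are all assumptions.

```python
from collections import Counter
from typing import Iterable, List


def condense(lines: Iterable[str], keep: str = "ERROR",
             max_lines: int = 20, max_chars: int = 200) -> List[str]:
    """Shrink a log stream to a short, deduplicated summary.

    Keeps only lines containing `keep`, truncates each to `max_chars`,
    collapses exact repeats into a count, and caps the number of rows.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if keep in line:
            counts[line[:max_chars]] += 1
    # Most frequent messages first; ties keep first-seen order.
    rows = [f"{n}x {msg}" for msg, n in counts.most_common(max_lines)]
    dropped = len(counts) - len(rows)
    if dropped > 0:
        rows.append(f"... {dropped} more distinct lines omitted")
    return rows
```

Passing `sys.stdin` as `lines` turns this into a pipe-friendly filter. Lines that differ only by timestamp will not collapse, so a real wrapper would likely strip or normalize timestamps first.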

cole murray@colemurray·1d

@zeeg OpenInspect! github.com/ColeMurray/bac…

147

David Cramer@zeeg·1d

What open source are folks building today?

20.5K

cole murray@colemurray·1d

@liran_tal @vercel_dev where we are going, you'll need way more than 20 containers lol

Liran Tal@liran_tal·1d

@colemurray @vercel_dev Hmmm, I don't know about that. In one of my past engineering orgs about 10 years ago we had a stock config of about 20 running containers as part of the local dev setup
Most of the "precious work" is token generation and that's a remote async i/o job

cole murray@colemurray·2d

After a week or so of using Vercel Sandboxes, @vercel_dev, some thoughts:
- easy to integrate with
- expensive, especially snapshots
- UI/UX in console needs improvements
The good:
Integrating the sandbox into OpenInspect was fairly straightforward. I was able to support all of the existing functionality OpenInspect relies on (snapshots, pre-builds, tunnel ports etc). The sandboxes themselves are snappy enough and comparable to the other providers integrated (at least specific to my needs)
Having both the web-app and sandboxes in one provider is nice, as it's one less set of API keys and service to manage.
The bad:
Vercel Sandbox UI
The UI is pretty lacking. There is no visibility into the stdout/stderr going on in the sandbox. This is a pretty big dealbreaker, as i now need to set up some telemetry to observe what's going on. spoiler: i'm not going to
Additionally, there is no way to bulk delete snapshots from the UI, which is quite tedious / under-thought IMO. Yes it works from CLI, but sometimes it's nice to just click buttons.
Costs
The realized usage costs are significantly more than the other sandbox providers. Specifically, the sandbox data transfer is quite costly and sandbox storage is quite inflated.
Data transfer: $0.15/GB (lol)
Data storage: $0.08/GB (lol)
Within ~50 sessions, I racked up $15 or so in costs doing effectively just Git repo pulls
Data storage specifically is pretty brutal, as OpenInspect snapshots between each turn, and the snapshots are not billed incrementally.
Additionally, the min compute size is 2 CPU / 4 GB RAM, which is more than needed. Would like to see lower options available!
TLDR; overall it's a nice experience and having more services on one provider is nice. Looking forward to the service improving!

3.1K
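
The cost figures in the review above can be turned into a back-of-the-envelope sketch. The two rates come from the post; the helper names and the per-turn snapshot model are assumptions, and the post does not state the storage billing period, so the functions report GB stored rather than a monthly bill.

```python
TRANSFER_USD_PER_GB = 0.15  # quoted in the post
STORAGE_USD_PER_GB = 0.08   # quoted in the post; billing period not stated


def implied_transfer_gb_per_session(total_usd: float, sessions: int) -> float:
    """GB per session if the whole bill were data transfer (an upper bound)."""
    return total_usd / sessions / TRANSFER_USD_PER_GB


def full_snapshot_gb(base_gb: float, delta_gb: float, turns: int) -> float:
    """GB stored when every turn saves a complete copy of the workspace.

    Assumed model: the workspace starts at base_gb and grows by delta_gb
    per turn, and each turn's snapshot is billed at its full size.
    """
    return sum(base_gb + turn * delta_gb for turn in range(1, turns + 1))


def incremental_snapshot_gb(base_gb: float, delta_gb: float, turns: int) -> float:
    """GB stored if only per-turn changes were kept after the first copy."""
    return base_gb + turns * delta_gb
```

Roughly $15 over ~50 sessions is about $0.30 per session, which would correspond to about 2 GB of transfer per session if transfer were the only charge. With hypothetical inputs of a 1 GB workspace growing 0.1 GB per turn over 10 turns, full snapshots store about 15.5 GB against about 2 GB for incremental ones, which illustrates why non-incremental billing hurts a snapshot-per-turn workflow.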

cole murray@colemurray·1d

@liran_tal @vercel_dev x.com/colemurray/sta…

cole murray@colemurray

localhost doesn't scale and it will only get worse as agents improve on long running tasks
there's a better way

Liran Tal@liran_tal·1d

@colemurray @vercel_dev Why would you not run isolated environments locally based on containers?

400

cole murray@colemurray·2d

@natolambert stockholm syndrome

339

Nathan Lambert@natolambert·3d

Props to Anthropic for quick action here. I'm okay with this outcome. Some people may, but I don't think they'd silently degrade performance without telling users.

Max Zeff@ZeffMax

NEW: Anthropic is walking back Claude Fable 5's policy to covertly degrade performance for competing AI researchers, after facing fierce backlash. “We’re changing Fable 5’s safeguards for frontier LLM development to make them visible,” Anthropic tells WIRED. “We made the wrong tradeoff and we apologize for not getting the balance right.”

237

33K

cole murray@colemurray·2d

i'll take the under on this. there is such a thing as too much data.
i don't think it's incremental value to have observability into every change, conversation, doc edit, etc. At a point, it becomes a distraction and more noise than signal.
i find a lot of teams when first rolling out analytics to their product have this same idea. "we're going to instrument everything: every impression, every scroll, every hover, every click, mouse cursor position, time of day, ..."
in practice, they end up making up narratives about the noise they're seeing, rather than validating their actual experiment hypothesis.
simple is often better

John Suh@john_ssuh

Increasingly, I believe companies may need to be rebuilt from the ground up, where you have a single timeline of all observability + product metrics + file changes laid out in a retrievable system, like Datadog + Posthog + Google Drive + Slack (really a unified filesystem of Claude Code chats + Codex chats). This might be the new data foundation for any and all companies to maximize AI.
It needs to be rebuilt because tracking diffs on existing systems well enough to produce longitudinal information on decisions and rollbacks is basically impossible. That is something coding agent storage companies are actively trying to figure out, but it should extend to businesses as a whole.
I'm highly skeptical existing businesses will adopt this though, because it means overhauling everything about their instrumentation and business data, but I think businesses built on this foundation can probably execute 100x better and faster

2.4K

cole murray@colemurray·2d

@joannejang probably an expert in the field someone who just joined 🤔

628

Joanne Jang@joannejang·2d

kinda crazy that someone's full-time job was to steer claude to sabotage ML research capabilities for paying customers

162

3.5K

140.2K

cole murray@colemurray·2d

@bosmeny lol imagine saying this while delve skated by without oversight

244

16.1K

Tyler Bosmeny@bosmeny·2d

I once had to kick a founder out of YC for lying. Thankfully we caught them before the batch started. Today I found out they're running a European startup accelerator 🤷‍♂️

129

1.6K

247.5K

cole murray@colemurray·2d

@vercel_dev limited visibility into the actual sandbox activity

410

cole murray@colemurray·2d

@vercel_dev no bulk delete😢

518
