Tom Firth

1.4K posts

@tdfirth

Thinking about thinking. YC W22.

New York · Joined August 2015

182 Following · 768 Followers

Tom Firth reposted

ibby@StatueofIBBertY·23 Apr

Every day I'll be sharing an agent my customers actually use in their business, and show how well the agent does using Claude vs an open model.

Today: Two popular models (Kimi K2.6 and Mistral Large) against Claude Sonnet - which can make the best testimonial ad for On shoes?

TL;DR Mistral continues to impress me greatly.

The agent is very simple - it looks for google reviews of the product, finds the best testimonial, analyzes the imagery on the site, and then uses a Nano Banana tool (keeps it even) with reference data to create an advertisement.

No order this time, you decide which one won.

Mistral Large 3
Cost: ~$0.03
Quality: 4.5/5
Notes: This did a great job of creating cool imagery, and copying a lot of the close up shots of the shoes. I liked it because it’s the best quality per dollar. You could very cheaply make assets for every single review, if you wanted to. The downside is - even though it found a ton of reviews, it didn’t actually take a verbatim from there, but instead decided to use the On tagline from the site.

Claude Sonnet 4.6
Cost: ~$0.15
Quality: 3/5
Notes: It found an actual verbatim, but the font is weird, and it picked a quote that was far too long to put on a testimonial. I wasn’t that impressed with how it did, and it cost a ton more than Mistral did. Still, it kept up with some of the imagery that On uses (folks using the shoes) which was good.

Kimi K2.6
Cost: ~$0.07
Quality: 4/5
Notes: This one is actually my favorite, but it doesn’t actually look like any advertisement for On that I was able to find. It used an actual verbatim, and made the imagery look super interesting.

All of them are fairly cringe (a real designer would do far better than this), but this is the closest to something that maybe I would put on a slide.

Tom Firth reposted

ibby@StatueofIBBertY·21 Apr

Every day for a while, I'll be posting two agents that my customers use. One will be an OSS model, one will be Claude (or other frontier model). Will share cost, performance, and output quality. Sharing the agent and outputs below.

For today, I'm mixing a prompt that our VC customers use to gauge founding team and adding some flavor around GTM so that sales teams can use it. I had it research the founding team at Modal Labs, and deliver a document that benchmarks the leadership team on their technical acumen, background, and GTM connectedness. Document for each is below. I used Claude, Mistral, and Minimax:

1st Place: Mistral Large 3
Cost: ~$0.03 (1/6th the cost of Claude)
Quality: 4.5/5
Length: 7 pages
Overall, Mistral is the clear winner - it not only produced a beautiful document that was easy to read with tables and well thought out sections, but it wrote more data and the sections were better laid out (basic info, detailed info, takeaway). The analysis was arguably also better.

2nd Place: Claude Sonnet 4.5
Cost: ~$0.20
Quality: 3.5/5
Length: 6 pages
Overall, Claude ended up being fairly good at following the prompts. It didn't do the best job of formatting the document, but it DID fill out all the sections and gave detailed depth. It didn't complete the doc so points off for that.

3rd Place: Minimax 2.5
Cost: ~$0.02
Quality: 1.5/5
Length: 3 pages
Overall, not as good. Basically just summarized their LinkedIn profiles, and didn't really do any analysis.

Sharing the agent and the docs below!

Tom Firth reposted

ibby@StatueofIBBertY·21 Apr

SUPER excited to announce our latest launch, which is focused on solving a problem every single one of our customers has: prompting and creating agents.

So today we're launching Coco - an agent creator inside Cotera to help you build repeatable, scalable agents for work.

Just tell Coco what you want your agent to do, and it builds it for you. It finds the tools, you put in your creds, and that's it.

Connect it to any of our hundreds of integrations, then run it however you need:

⚡ On a trigger: on a regular schedule
💬 One time: chat with the agent on demand
🔁 In a loop: run up to 100 in parallel
🗄️ Via dataset: run it on every row in your data warehouse or CSV

Watch the full walkthrough and join our Slack community :)

Tom Firth reposted

ibby@StatueofIBBertY·20 Apr

ANNOUNCEMENT: Every day for a while, I'll be posting two agents that my customers use. One will be an OSS model, one will be Claude (or other frontier model). Will share cost, performance, and output quality.

Why? Everyone on twitter/LI is still so surprised at what Claude can do that they don't realize just how close behind the open source models are.

For the first detailed breakdown, I've created a basic AI agent that does deep research on an individual (link in comments). It takes a name and a company, and finds their linkedin, company linkedin, and X account and goes through and reads the info it can find. It then outputs a large briefing document on the individual.

My first victim is @JaredSleeper

I ran this agent on Claude Sonnet and Mistral 3 Large. The outputs are as follows:

MISTRAL:
Length: 44 pages
Cost: $0.048
The formatting came out a bit botched, but WOW was it detailed. It wrote a ton of very, very detailed information, and didn't just summarize his background. It was a very very thorough piece of work. Note - it messed up the google doc skill and created the document in two parts, which I had to unify.

CLAUDE:
Length: 11 pages
Cost: $0.233 (almost 5x more for about a third the content)
The formatting is VERY botched, and it didn't use the google docs tool correctly at all. Squinting, it might have elaborated more than mistral, which focused more on HOW to talk to jared. It generally seems a bit more effusive and had more opinions. Potentially more depth.

The accuracy and sourcing seems to be the same for each, which is impressive on the mistral side.

The quality of the output is interesting - see below

Tom Firth@tdfirth·9 Apr

@adamwathan But sometimes I do ask it to run the dev server (so it can see the logs).

Adam Wathan@adamwathan·8 Apr

// AGENTS.md
Never, ever, under any circumstances, ever, not once, no matter what, try to start the fucking dev server, it’s already fucking running.

Tom Firth@tdfirth·6 Apr

@MaartenGr Thanks for sharing this, great work!

Maarten Grootendorst@MaartenGr·3 Apr

A Visual Guide to Gemma 4 With almost 40 (!) custom visuals, explore the new models from Google DeepMind. We explore various techniques, ranging from Mixture of Experts and the Vision Encoder all the way up to Per-Layer Embeddings and the Audio Encoder. Link below 👇

Tom Firth@tdfirth·6 Apr

Great tour of Gemma 4

Maarten Grootendorst@MaartenGr

Tom Firth@tdfirth·5 Apr

While it's clear that we want models to generalize in this way (and that today's models don't), I'm not sure it's so obvious that symbolic language _powers_ the generalization process in humans. Symbolic language is still language, and I think is therefore primarily a tool for communication. It's an extremely powerful tool that allows for lossless transmission of ideas in a given domain, which is very useful for humans, who must pool intelligence by sharing thought across space and time (poor animals that we are). I think it's only secondarily a tool for reasoning though, and I'm generally in the camp that most ideas originate through intuition, and we formalize afterwards as a communication device. I.e., I don't think I believe our internal representation is symbolic. I like this line of thought though. Evidence of compact symbolic representation and manipulation would certainly be good evidence of generalization. Would be fun to train a model by giving it examples of evaluation in some symbolic system, and see if it could learn to recover the rules of that system and represent them in a compact way. Very easy to generate synthetic data for such a problem too.
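A rough sketch of how cheap that synthetic data could be, assuming a toy symbolic system of my own choosing (arithmetic modulo 7 with `+` and `*`). The system, the function names, and the output format are all illustrative assumptions, not anything proposed in the thread. Each example pairs an expression string with its evaluated result; the rules themselves are never shown, so a model would have to recover them.

```python
import random

# Hypothetical toy symbolic system: arithmetic modulo 7 with + and *.
MODULUS = 7
OPS = {
    "+": lambda a, b: (a + b) % MODULUS,
    "*": lambda a, b: (a * b) % MODULUS,
}


def random_expr(rng, depth):
    """Build a random expression tree: an int leaf or a tuple (op, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.randrange(MODULUS)
    op = rng.choice(sorted(OPS))
    return (op, random_expr(rng, depth - 1), random_expr(rng, depth - 1))


def evaluate(expr):
    """Evaluate an expression tree under the hidden rules in OPS."""
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    return OPS[op](evaluate(left), evaluate(right))


def render(expr):
    """Render an expression tree as a fully parenthesized string."""
    if isinstance(expr, int):
        return str(expr)
    op, left, right = expr
    return f"({render(left)} {op} {render(right)})"


def make_dataset(n, seed=0, max_depth=3):
    """Return n (expression_string, value) pairs; reproducible for a given seed."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        expr = random_expr(rng, max_depth)
        examples.append((render(expr), evaluate(expr)))
    return examples
```

Calling `make_dataset(1000)` gives a thousand input/output pairs. Swapping `OPS` for a different rule table changes the system without touching the generator, which is roughly what makes the problem easy to scale.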

François Chollet@fchollet·5 Apr

Science went from the initial observation of radioactivity to a working atom bomb over 47 years via only about 9 distinct key experiments -- extremely few data points -- and symbolic models concise enough they would fit on a single page. This is what extreme generalization looks like, and it is powered entirely by symbolic compression. Turn a handful of data points (deliberately collected) into a tractable plan to completely reshape reality, by reverse-engineering the causal symbolic rules behind the data.

Tom Firth@tdfirth·3 Apr

@theo In the past I have just taken the trial and then downgraded. Dumb af.

Theo - t3.gg@theo·3 Apr

Does Google actually hide all the cheaper plan options when setting up a new Google workspace? There are 3 cheaper options and I'm not allowed to see or select any of them.

Tom Firth@tdfirth·3 Apr

You should all get off this site.

Infinite Books@infinitebooks

This paragraph from Schopenhauer has probably never been more relevant

Tom Firth@tdfirth·2 Apr

This is accurate

Jamie Turner@jamwt

LLMs are not perfect at writing software. In fact, they plain suck in some respects compared to strong human engineers. I broke down why in a video... and how Convex's design makes LLM-generated code actually reliable. The key: feedback, local reasoning, meta minimization.

Tom Firth@tdfirth·2 Apr

@mattyglesias This has to be rage bait. Surely.

Tom Firth@tdfirth·31 Mar

If accurate, I'm surprised by how few are found in southern Italy. I must look that up.

Epic Maps 🗺️@theepicmap

Map of where Roman coins have been found

Tom Firth@tdfirth·25 Mar

These are very cool, I've played four or five now. These first few at least are very easy for any human that's ever played a video game, but it's pretty clear where it's hard for an LLM. Spatial reasoning, learning about the environment as you go, applying that knowledge, and a time constraint... hitting them everywhere it hurts.

François Chollet@fchollet·25 Mar

ARC-AGI-3 is out now! We've designed the benchmark to evaluate agentic intelligence via interactive reasoning environments. Beating ARC-AGI-3 will be achieved when an AI system matches or exceeds human-level action efficiency on all environments, upon seeing them for the first time. We've done extensive human testing that shows 100% of these environments are solvable by humans, upon first contact, with no prior training and no instructions. Meanwhile, all frontier AI reasoning models do under 1% at this time.

Tom Firth@tdfirth·25 Mar

much excite

François Chollet@fchollet

Tom Firth@tdfirth·21 Mar

@auchenberg Scandinavian winter is a totally different ball game. Stockholm is about 20 degrees further north than NYC. Not as dry though; that's the worst bit of winter in NYC.

Kenneth Auchenberg 🛠@auchenberg·20 Mar

NYC too.

Charles@wotancore

Scandinavian winters are so dark, so draining, that when spring comes one becomes manic with joy. Even though it happens every year, one is left overwhelmed by the sense that they had forgotten it was possible to feel this good.

Tom Firth@tdfirth·20 Mar

Of those I have to pick independence, but I think a better definition is just whether control flow is determined by the LLM in a loop or whether it is dictated by a human. I prefer that more technical distinction because it’s a simple clear cut test. You can show me the code and it always neatly passes or fails that test. System prompt counts as code ofc! Following a sequence of steps in a prompt doth not an agent make.
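To make the test concrete, here is a minimal sketch of the two cases. Everything in it is a hypothetical stand-in: `choose_next` plays the role of the LLM call, the tools are plain string functions, and `stub_model` is a hard-coded policy used only so the example runs without a model. The point is where the decision about the next step lives.

```python
from typing import Callable, Dict, List

Tool = Callable[[str], str]


def scripted_pipeline(task: str, tools: Dict[str, Tool], order: List[str]) -> str:
    """Fails the test: a human-written `order` dictates the control flow."""
    state = task
    for name in order:
        state = tools[name](state)
    return state


def agent_loop(
    task: str,
    tools: Dict[str, Tool],
    choose_next: Callable[[str, List[str]], str],
    max_steps: int = 10,
) -> str:
    """Passes the test: `choose_next` (standing in for the LLM) picks each step."""
    state = task
    for _ in range(max_steps):
        name = choose_next(state, sorted(tools))
        if name == "stop":
            break
        state = tools[name](state)
    return state


# Hypothetical tools and a stub "model" so the sketch runs on its own.
TOOLS: Dict[str, Tool] = {
    "upper": str.upper,
    "exclaim": lambda s: s + "!",
}


def stub_model(state: str, tool_names: List[str]) -> str:
    """Stand-in for an LLM: inspects the state and names the next tool."""
    if not state.isupper():
        return "upper"
    if not state.endswith("!"):
        return "exclaim"
    return "stop"
```

Both `scripted_pipeline("hi", TOOLS, ["upper", "exclaim"])` and `agent_loop("hi", TOOLS, stub_model)` end in the same state, but only the second lets the model decide the sequence and when to stop. A prompt that lists the steps is the first case with the `order` moved into text.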

Tom Firth reposted

Jared Sleeper@JaredSleeper·20 Mar

An LLM crosses the threshold and becomes an “agent” when:
1) It reasons/has a chain of thought
2) It calls/uses tools
3) It is persistently used for the same purpose
4) It operates w/o a human trigger (cron, goal-seeking, etc.)

Tom Firth@tdfirth·19 Mar

Noooooooo

Charlie Marsh@charliermarsh

We've entered into an agreement to join OpenAI as part of the Codex team. I'm incredibly proud of the work we've done so far, incredibly grateful to everyone that's supported us, and incredibly excited to keep building tools that make programming feel different.

Tom Firth@tdfirth·18 Mar

@JaredSleeper The prosumer segment is probably a good proxy for the set of knowledge workers that will remain useful the longest.

Jared Sleeper@JaredSleeper·18 Mar

This is so true. The vast majority of the hypergrowth AI companies are built on prosumer-led growth... it is arguably the only way to have "best in class" growth today. They are either prosumer (Lovable, Replit, Cursor, OpenEvidence, Suno, Higgsfield, n8n, Claude, OpenAI), prosumer-like usage dynamics with enterprise SLA aircover (Abridge, Harvey, Cognition, etc.) or somehow derivative of the insane level of usage of/investment in the above (FAL, Temporal, Mercor, etc.). Doesn't mean there aren't great businesses that are being built the old-fashioned way (enterprise sales, heavy integrations, etc.) but it is just going to be hard for those businesses to keep up with the PLG of the above when a new model release can send usage absolutely vertical for a few months.

Nikita Bier@nikitabier

For the rest of the year, the word for everyone working at the frontier of AI will be: Prosumer
