Martin Gale

774 posts

@finstratege

taming the beasts 🪽🦞

SF · Joined March 2024
407 Following · 114 Followers
Martin Gale
Martin Gale@finstratege·
@TheDavidaGinter @polsia I agree, it’s annoying that they exaggerate their “””arr””” so much though.. doesn’t inspire confidence 😮‍💨
English
0
0
0
7
Davida Ginter
Davida Ginter@TheDavidaGinter·
Everyone’s talking about @polsia raising $30M on a product that looks… meh. This is actually the strongest validation yet of what the market is thirsty for: AI that works for you, instead of you working for it. Polsia’s product is far from perfect (tried. stopped). But the promise is interesting: Can you run a business while AI handles the boring operational work for you?
English
24
2
70
10.4K
Siddhartha Saxena
Siddhartha Saxena@siddsax·
Anthropic onboarding day: Michael Scott introducing Karpathy like he just signed Wemby in free agency.
English
370
1.4K
16.4K
1.9M
Martin Gale
Martin Gale@finstratege·
omg this is so amazing
English
0
0
0
9
Martin Gale
Martin Gale@finstratege·
@kwindla it’s just that it gave us great speed and agent seems very reactive which is good for the customer experience
English
0
0
1
19
kwindla
kwindla@kwindla·
@finstratege Interesting. I didn't love the results when I tested 3.6 sparse, but if it's working well for you I should spend more time with it. Can you talk about the use cases where that model is doing well, for you?
English
1
0
0
41
kwindla
kwindla@kwindla·
Gemini 3.5 Flash is out today. Here are numbers from my main voice and task agent benchmarks. Some notes:
All the Gemini 3 models so far are too slow to work well for voice agents. Gemini 2.5 Flash was a *great* model for voice agents, when it was SOTA. It was fast and good at instruction following. Its big weakness was tool calling. It was quite difficult to prompt Gemini 2.5 Flash to perform tool calling reliably in long context, multi-turn use cases.
With Gemini 3, Google improved the tool calling issues a lot. But time to first token is ~1s. We really need TTFT down below 700ms. Google isn't alone in this. All the SOTA models released this year have been reasoning models that aren't optimized for low latency. Claude Haiku 4.5 (released last October) remains the best-performing model with a TTFT under 700ms.
Gemini 3.5 Flash is the first Flash model in the 3 family to be released as "generally available." It's quite different from gemini-3-flash-preview, which was released last December. That model actually scored a bit better on my voice agent benchmark.
This new model is the new overall top scorer on my task agent benchmark. This benchmark tests a multi-turn task, requiring that models achieve a P50 turn execution time faster than four seconds. Gemini 3.5 Flash with a "high" thinking budget scores significantly better than any other model I've tested. So even though the TTFT isn't what we'd like to see from this model, the overall generation speed makes up for it, and allows us to use the "high" thinking budget and still achieve a per-turn P50 under two seconds. Very impressive.
This performance costs money, though. I had become accustomed to thinking of Gemini models as aggressively priced. But Gemini 3.5 Flash is actually more expensive than GPT-5.4 and Claude Sonnet 4.6 on this benchmark. Also note that lower reasoning settings don't always save money. Gemini 3.5 Flash "minimal" costs more, on this benchmark, than "high," because it makes more mistakes, so it uses more tokens to complete the task.
Please note that performance of this model on your benchmarks might be very different. My voice agent and task agent results are often wildly out of line with the reported results on standard benchmarks in the model cards and release notes. The voice agent benchmark is 30 turns, and heavily tests tool calling in a long-context scenario. The task agent benchmark injects large streams of structured data events into the context, all tool calls are asynchronous, and the test task takes at least 32 turns to complete. (My motto for evals is "30 turns or it didn't happen.")
Make your own benchmarks! (And post the source code and the results for different models, if you can.)
[kwindla tweet media: 3 images]
English
14
9
113
14K
Martin Gale
Martin Gale@finstratege·
@kwindla for my cases I find it hard to justify maintaining anything else..
Martin Gale tweet media
English
1
0
1
53
kwindla
kwindla@kwindla·
@finstratege That’s pretty good. 27b dense or 35B sparse? The 27b version seems to me like it performs a lot better. But it’s more expensive to serve at scale.
English
1
0
0
217
Martin Gale
Martin Gale@finstratege·
micromanaging | /ˌmī-krō-ˈma-ni-jiŋ/ | noun - the practice of reviewing and approving every action your AI agent takes, i.e. not running it in YOLO bypass-permissions mode. "instead of trusting the agent, he kept micromanaging it, hand-approving each command with cmd + enter."
English
0
0
1
17
cayden 凯登
cayden 凯登@caydengineer·
Launching Mentra Live open-source smart glasses. Deploy smart glasses for real world work. We already shipped thousands. Now, they're generally available. Build apps that leave the screen. Let your AI step into the real world.
San Francisco, CA 🇺🇸 English
72
48
538
75.7K
Martin Gale
Martin Gale@finstratege·
[🔮 vision tweet] you won't manage a knowledge base for your agents. your computer / workspace / server IS the knowledge base
English
0
0
1
13
Martin Gale
Martin Gale@finstratege·
@egocgp There's one rotten apple in the bunch and it spoiled the whole basket lol
French
0
0
5
820
Léo Bachelot
Léo Bachelot@egocgp·
The downfall / glowdown of the entrepreneurs from Qui Veut Être Mon Associé blows my mind. They're all into sketchy schemes: investment clubs, training webinars advertised on Insta and YouTube to teach you the growth method. And yet they all have very respectable backgrounds.
French
11
7
177
34.6K
Martin Gale
Martin Gale@finstratege·
Just closed our biggest account to date… $18K ARR.  Fuck yeahhh!!!!!!!!!!!!!!!
English
0
0
0
31
Thinking Machines
Thinking Machines@thinkymachines·
People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action. thinkingmachines.ai/blog/interacti…
English
460
1.9K
15.7K
7.6M
Martin Gale
Martin Gale@finstratege·
@daedalium @NanoCorpHQ Is this a paid partnership or do you actually find value in it? Are you paying? Just curious 🧐
French
1
0
1
214
Oussama Ammar
Oussama Ammar@daedalium·
I just launched my autonomous AI company "Dwell HQ" on @NanoCorpHQ Verification: bask-Mw3A
English
15
3
52
10.3K
Martin Gale
Martin Gale@finstratege·
@kwindla Is it really? Would you use it in customer facing app over STT-to-TTS?
English
1
0
0
164
kwindla
kwindla@kwindla·
OpenAI shipped a new speech-to-speech model today: gpt-realtime-2
This is the first speech-to-speech model good enough to use in my voice agents that do "real work." Or real play, for that matter.
Here's gpt-realtime-2 as the brain of the ship AI in Gradient Bang. The voice-to-voice response and tool calling times here are unedited, so you can see exactly what the interaction with the model is like in an agent with a very complex system instruction and frequent tool calls.
(I did clip out the subagent task execution segments, after gpt-realtime-2 starts a subagent via a tool call. Subagents in this config used gpt-5.2 "medium" effort.)
English
30
40
450
54.7K