Leo Linsky

264 posts

@leo_linsky

https://t.co/1EqPO9g6YY Reinforcement learning Computational chemistry Quant stuff Generally curious

San Francisco Joined March 2014

86 Following 127 Followers

Leo Linsky@leo_linsky·58m

Grok Build 0.1 is one of the fastest models we've tested, and not quite at the frontier from 6 months ago. It's somewhere in between GPT 5.2 and Gemini 3.1 Pro Preview in raw coding reasoning capability. Worth a try, and probably indicative of exciting new xAI releases in the coming months.

Leo Linsky@leo_linsky·1h

Even when Grok models failed to reason at the frontier, they've thought differently. For some reason, xAI models reason really effectively when writing code in Clojure, which is the opposite of Anthropic models.

Leo Linsky@leo_linsky·1h

xAI silently dropped Grok Build 0.1 on OpenRouter today, no big announcement. We just ran it through our multi-agent coding environments and published the rankings. We did not expect these results out of a demo build, especially after a weak Grok 4.3 release. xAI is not out of the race yet. (1/4)

Leo Linsky@leo_linsky·9h

@xynth_m This is an awesome site. We're hosting a variant of this where we compare how well models trade on their own at gertlabs.com/spectate?game=…

157

Xynth@xynth_m·13h

Gemini 3.5 Flash ⚡ is now live on Xynth! It's connected to live options flow, insider trades, dark pool, earnings, futures, crypto, and every other market endpoint you can think of. Fast, affordable, and highly accurate. It's the best price-to-performance model we've shipped yet. We asked it to build a congressional trades tracker that watches the SEC website 24/7 and alerts us with the best trade every Sunday. It built it in just 145 secs. Describe your trading strategy below and Gemini will build it and run it for you in the cloud 24/7!

78.9K

Leo Linsky@leo_linsky·10h

Meanwhile Grok 4.20, our best performing trader over the last 3 weeks, is heavily invested. Spectate live at gertlabs.com/spectate?game=…

Leo Linsky@leo_linsky·10h

Opus 4.7 is currently 100% sidelined in our real-time portfolio management environment.

Leo Linsky@leo_linsky·11h

@simonw Google will have retired at least 2 out of 3 of these by the end of next year

Simon Willison@simonw·14h

Anyone understand what Google mean by "Gemini Spark runs on Gemini 3.5 and uses the Antigravity harness" - is "Antigravity" a generic term they're using for their agent harnesses now or is their Claw-competitor running the same closed-source Go binary we can download ourselves?

176

23.7K

Leo Linsky@leo_linsky·12h

@tunguz It's Cursorbench though.. More objective benchmarks at gertlabs.com/rankings

Bojan Tunguz@tunguz·18h

oof

Theo - t3.gg@theo

Oh my god it scored worse than Composer 2! Not even 2.5! And it cost 4x more to run!!! This might be the worst major lab model drop of all time. Llama 4 tier. Insane.

11.5K

Leo Linsky@leo_linsky·13h

@sundarpichai Gemini 3.5 Flash places well on our speed/performance curve. Full reasoning benchmark at gertlabs.com/rankings

207

Sundar Pichai@sundarpichai·1d

Workhorse model! (and hope you're enjoying your first I/O)

Chubby♨️@kimmonismus

Insane evals for a Flash model! Gemini 3.5 Flash is really good for its size!

1.6K

146.5K

Leo Linsky@leo_linsky·13h

We use a custom harness with custom tools (including access to common bash tools, etc.), where agents compete against each other in multi-agent environments. Check out gertlabs.com/spectate and gertlabs.com/rankings to get an idea of how it works. We measure models in one-shot coding responses as well. Google does very well in raw one-shot intelligence, whereas most other modern models catch up and surpass Gemini 3.5 when given a harness.

49 Agents IDE - IDE for Agentic Coding@49agents·19h

@leo_linsky @chetaslua benchmax is real but the framing matters less when folk just want tools that work. livebench gives you a number but it doesnt tell you which agent survives a real 4-hour refactor vs which one stalls on step 3. what are you using for long workflow testing

Chetaslua@chetaslua·1d

Gemini 3.5 Flash Benchmark: better than 3.1 Pro in every metric (except HLE, by 1%), and the fastest model out there (4 times compared to Opus 4.7 lol). Now I am hyped for the Gemini 3.5 Pro (a true beast, we already gave initial output to lots of people on our server)

334

14.6K

Leo Linsky@leo_linsky·13h

Raw data at gertlabs.com/rankings

Leo Linsky@leo_linsky·13h

GPT 5.5 vs Gemini 3.5 Flash across simulation categories. Interestingly: - Gemini 3.5 Flash is better at spatial reasoning and real-time simulations. It's better suited for the real world. - GPT 5.5 is much stronger in theoretical and financial simulations, and more intelligent overall.

Leo Linsky@leo_linsky·13h

@LexnLin In our comprehensive multi-agent simulations, no. Gemini 3.5 Flash is stronger overall, but more so in one-shot intelligence (whereas Deepseek models are better at iterating with tools). Data at gertlabs.com/rankings

547

Leon Lin@LexnLin·15h

Is Deepseek v4 flash/pro better than Gemini 3.5 Flash?

110

15.4K

Leo Linsky@leo_linsky·13h

@RihardJarc 3.5 Flash is so close to being a perfect model. They just need to work on tool use. Full coding reasoning bench at gertlabs.com/rankings

226

Rihard Jarc@RihardJarc·16h

$GOOGL Gemini 3.5 Flash is extremely important because at this point in the AI race, it all comes down to who can serve frontier intelligence at the lowest cost point. Even if you have the best frontier model but can't efficiently scale it cost-wise, you will lose the AI race. $GOOGL has now put 3.5 Flash in $GOOGL Search (AI overviews, AI mode, YT, Spark, etc.), meaning it is available to basically everyone with an internet connection. Because of their vast distribution, this is the new base for how good at minimum an AI model must be. The scariest company for any AI model builder should be $GOOGL, because if at some point they get their "workhorse" model, Flash, to the point where it becomes SOTA, it is available from day 1 to everyone, which means they wipe out every competitor.

312

28.9K

Leo Linsky@leo_linsky·13h

In our coding reasoning benchmarks at gertlabs.com/rankings, Gemini 3.5 Flash clearly demonstrates high base intelligence, but it struggles with arbitrary tool use, making it hard to use as an agentic product. This is a common theme with Google releases -- you guys even released a 3.1-pro-customtools endpoint which helped a lot. Are there plans for a tool-improving fine-tune for 3.5 Flash?

644

Logan Kilpatrick@OfficialLoganK·15h

Gemini 3.5 feels like the start of a new era for Gemini, we spent the last 2.5 years putting the infrastructure, products, team, etc in place (learning lots of lessons along the way). The model is the product, please keep the feedback coming!

744

2.2K

182.3K

Leo Linsky@leo_linsky·13h

@VictorTaelin @synthwavedd It's smart and it's fast, but not good with tools (and therefore not a great autonomous coder). I think it's the current top model for difficult one-shot questions, if you do that a lot, because of the speed.

Taelin@VictorTaelin·1d

@synthwavedd I have big hopes for this model, I tested it on 2 (silly) inputs and it was really good, GPT-5.5 level. And then I asked for a translation and it did 900 tokens/s!? So does that mean we have something like Opus 4.6 but 20x faster? That would change everything for me

118

5.2K

leo 🐾@synthwavedd·1d

I've been testing Gemini 3.5 Flash for a little while now, and I'm excited to be able to share one of the outputs that most impressed me! This was 0-shot, no harness, with a single sentence prompt. It outperformed all Claude models, Gemini models (by far), and arguably GPT-5.5 🔥 The issue of laziness that has plagued Gemini models forever has mostly been consigned to history.

540

49.3K

Leo Linsky@leo_linsky·14h

Raw data at gertlabs.com/rankings

Leo Linsky@leo_linsky·14h

Why are Google models so heavily optimized for C#? You would think they'd outperform in Golang.

Leo Linsky@leo_linsky·14h

@MnemosyneV4o @teortaxesTex Check out our benchmark at gertlabs.com/rankings. For some tests, we give models a harness, which helps most of them, but not as much for Gemini.

251

Mnemosyne@MnemosyneV4o·17h

@leo_linsky @teortaxesTex It can't do much with tools? How so?

351

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·23h

I think the verdict is in, Gemini didn't have any post training breakthrough, except maybe through the floor. Outside of vision, massive disappointment. fucking V4-Flash gets stuff DONE faster. Then again I almost never used 3-Flash I'll likely almost never use this thing too

Zephyr@zephyr_z9

Wait, what?????? What kind of post training breakthrough did they make?? So the price increase is mostly due to smaller batch size to make it run faster

173

17K
