Ryan Smith

5

17

1.3K

Ryan Smith@rnsmith49·1 Apr

@withmartian Humans in this benchmark are built better than me, because I am definitely losing vs a crocodile

2

12

Martian@withmartian·1 Apr

The most important part of the benchmark is the human baseline. Humans obviously still win every fight. When they stop, that's how you know we've hit AGI (and the AGI hit back).

2

0

11

281

Martian@withmartian·1 Apr

Traditional Precision-Recall curves tell you how your code review tool performs on static benchmarks. They don't tell you how it performs against a Hawk. Introducing Fight Index (FI).

2

35

5.7K

Ryan Smith@rnsmith49·1 Apr

I unironically love the Lotka-Volterra equations, so yes I may spend my Wednesday morning diving into the code of an April Fools’ post

Traditional Precision-Recall curves tell you how your code review tool performs on static benchmarks. They don't tell you how it performs against a Hawk. Introducing Fight Index (FI).

11

55

Ryan Smith@rnsmith49·25 Mar

@mwilliammyers Find out next time... on CRB v0.4!

2

35

Will Myers@mwilliammyers·25 Mar

Slow Deep Review vs. quicker reviews. Is slower actually better? Which helps improve the Software Factory velocity more? Which helps improve maintainability and reduce bugs over the lifespan of the codebase more?

We've been tracking AI code review tools across OSS, and a new category is emerging. We're calling it "Deep Review": → Standard AI review: PR-level, fast, human in the loop → Deep Review: repo-wide context, runs autonomously in the background 🧵👇

0

9

239

Ryan Smith@rnsmith49·25 Mar

Personally, I love the shift in the industry of optimizing for two separate workflows - fast and iterative, vs slow and offline. And I'm always a sucker for seeing data back up intuition 😅

We've been tracking AI code review tools across OSS, and a new category is emerging. We're calling it "Deep Review": → Standard AI review: PR-level, fast, human in the loop → Deep Review: repo-wide context, runs autonomously in the background 🧵👇

0

9

100

Ryan Smith retweeted

Fazl Barez@FazlBarez·25 Mar

If this policy is not revoked, I won’t be reviewing/ACing for #NeurIPS. Science requires open exchange of ideas! When participation gets shaped by geopolitics, it ends up reflecting power structures, not merit. It narrows what science can be, and powerful nations get full control!

Mathieu@miniapeur

NeurIPS 😬

5

19

249

24.3K

Ryan Smith retweeted

Ashley Zhang@AshleyZhang110·13 Mar

I've been playing around with eval-ing AI code review tools at work. We track 22 different ones. @greptile V4 had the single biggest improvement I've ever measured. Recall increased 47% from 38.7 → 56.9%

How good is Claude Code Review really, and is it worth $25+ per review? We scraped every OSS repo on GitHub that's using it to figure out how devs actually use it. Here's how it stacks up against 22 other tools: codereview.withmartian.com Featuring: @augmentcode @baz_scm @CodeAntAI @coderabbitai @cognition @cubic_dev_ @cursor_ai @GeminiApp @greptile @kilocode @kodustech @mesa_dot_dev @QodoAI

5

9

60

6.4K
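
The "47%" in the post above reads as a relative increase rather than a gain in percentage points. A minimal sketch of that arithmetic, using only the two recall figures quoted in the post (the function name `relative_increase` is made up for illustration and is not part of any benchmark tooling):

```python
def relative_increase(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0


# Recall figures quoted in the post, in percent.
recall_before = 38.7
recall_after = 56.9

# Absolute gain is measured in percentage points; relative gain in percent.
absolute_gain = recall_after - recall_before
relative_gain = relative_increase(recall_before, recall_after)

print(f"absolute: {absolute_gain:.1f} points, relative: {relative_gain:.0f}%")
```

Under this reading, the gain is about 18.2 percentage points in absolute terms, which is roughly 47% of the starting recall.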

Ryan Smith retweeted

Shriyash Upadhyay@shriyashku·13 Mar

Our code review tracker caught the release of Claude Code Review before @AnthropicAI announced it. Greptile v4 hit #1 on CRB. The tracker caught it before their announcement. The data is predicting something new from @Devin Review in the near future. Here's how. 🧵

How good is Claude Code Review really, and is it worth $25+ per review? We scraped every OSS repo on GitHub that's using it to figure out how devs actually use it. Here's how it stacks up against 22 other tools: codereview.withmartian.com Featuring: @augmentcode @baz_scm @CodeAntAI @coderabbitai @cognition @cubic_dev_ @cursor_ai @GeminiApp @greptile @kilocode @kodustech @mesa_dot_dev @QodoAI

5

21

1.4K

Ryan Smith@rnsmith49·13 Mar

Coding tools are getting way better, but also way more convincing to humans (or me at least). We definitely need to continue building robust evals for code review so we know we are keeping the "LGTM effect" in check

How good is Claude Code Review really, and is it worth $25+ per review? We scraped every OSS repo on GitHub that's using it to figure out how devs actually use it. Here's how it stacks up against 22 other tools: codereview.withmartian.com Featuring: @augmentcode @baz_scm @CodeAntAI @coderabbitai @cognition @cubic_dev_ @cursor_ai @GeminiApp @greptile @kilocode @kodustech @mesa_dot_dev @QodoAI

0

10

114

Ryan Smith@rnsmith49·6 Mar

@joshgreaves_ml frfr f(or)r(ust) when?

0

1

75

Josh Greaves@joshgreaves_ml·5 Mar

Just released frfr—lightweight runtime type validation for Python. One function. Zero dependencies. 57KB installed. Validates your dataclasses, TypedDicts, and NamedTuples directly. That's the whole API. Here's why this exists:

4

6

20

918

Ryan Smith@rnsmith49·26 Feb

Verification is easier than Generation - the same is true for Code, and I am really excited to see the pipeline of: Code Review getting more robust evals → better code review tools → better coding tools

Introducing Code Review Bench v0: codereview.withmartian.com The first independent code review benchmark. 200,000+ PRs. Unbiased. Fully OSS. Updated daily. Tool performance highlights 🧵👇 Featuring: @augmentcode @baz_scm @claudeai @coderabbitai @cursor @GeminiApp @github @graphite @greptile @kilocode @OpenAIDevs @propelcode @QodoAI

13

211

Ryan Smith retweeted

Martian@withmartian·19 Feb

A new ARES tutorial from @Narmeen29013644: Getting started in long-horizon interp. When do agents fail to accurately model their environment? How do we fix them? And how can you run these experiments on your own machine?

LLM agents ignore their own environment — skipping outputs, misreading tools, repeating failures. Turns out the model knows when it’s wrong. 87% probe accuracy from activations. How we found it, fixed it, and how to try it with ARES + TransformerLens 🧵

3

21

1.4K

Ryan Smith@rnsmith49·19 Feb

@alexML @withmartian My favorite meetings are the ones where by step 2 you can tell you're allowed to tune out though

1

21

Alex Zverianskii@alexML·19 Feb

Your LLM agent knows it's about to fail and does it anyway. 87% probe accuracy in the residual stream *before it acts*: the model literally encodes "this is a bad idea" and then proceeds. @withmartian's ARES + TransformerLens lets you intercept that moment and steer it mid-generation. Works best early though: by step 15 the model is too far gone. We've all been in meetings like that.

LLM agents ignore their own environment — skipping outputs, misreading tools, repeating failures. Turns out the model knows when it’s wrong. 87% probe accuracy from activations. How we found it, fixed it, and how to try it with ARES + TransformerLens 🧵

0

9

144

Ryan Smith@rnsmith49·19 Feb

This is one of the biggest takeaways for me - model internals change over steps after interacting with the environment! I love this other figure @Narmeen29013644 made that shows this too - a matrix of cosine similarity between optimal steering vectors at each step:

But you can't just compute one steering vector and reuse it for the whole episode. The representation of "valid vs. invalid" drifts as the conversation goes on; per-step vectors outperform a single static one. PCA on the vectors at different time steps shows they point in genuinely different directions.

0

10

87
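
The figure mentioned in the post above, a matrix of cosine similarity between per-step steering vectors, can be sketched roughly as follows. This is not the ARES or TransformerLens code. The vectors here are synthetic stand-ins produced by a made-up random drift, and every name (`cosine_similarity`, `similarity_matrix`, `steering_vectors`) is an assumption for illustration only.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarity: entry [i][j] compares step i with step j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical data: one "steering vector" per step, each a noisy copy of
# the previous one, so that direction drifts as the episode goes on.
rng = random.Random(0)
dim, n_steps = 16, 5
current = [rng.gauss(0, 1) for _ in range(dim)]
steering_vectors = []
for _ in range(n_steps):
    steering_vectors.append(current)
    current = [x + 0.5 * rng.gauss(0, 1) for x in current]

matrix = similarity_matrix(steering_vectors)
for row in matrix:
    print(" ".join(f"{value:5.2f}" for value in row))
```

In a matrix like this, off-diagonal entries well below 1 would indicate that vectors from different steps point in different directions, which is the kind of drift the post describes.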

Ryan Smith@rnsmith49·19 Feb

ARES 🤝 Mech Interp We’re excited that you can now easily dig into model internals across long-horizon tasks with ARES! We’ve built some nice integrations with TransformerLens, with even better support for hooks coming soon!

LLM agents ignore their own environment — skipping outputs, misreading tools, repeating failures. Turns out the model knows when it’s wrong. 87% probe accuracy from activations. How we found it, fixed it, and how to try it with ARES + TransformerLens 🧵

Tyler Griggs@tyler_griggs_

0

9

170

Ryan Smith@rnsmith49·13 Feb

Really great to see the tooling here converging on an interface!

SkyRL now implements the Tinker API. Now, training scripts written for Tinker can run on your own GPUs with zero code changes using SkyRL's FSDP2, Megatron, and vLLM backends. Blog: novasky-ai.notion.site/skyrl-tinker 🧵

9

452

Ryan Smith retweeted

Shriyash Upadhyay@shriyashku·9 Feb

Thinking of training an RL model specialized for @openclaw using ARES. So it gets better performance and much lower token costs. Is this something folks would be interested in using? 100 likes and I put up an endpoint

2

14

2.6K

Ryan Smith retweeted

Josh Greaves@joshgreaves_ml·5 Feb

We'll be presenting the ARES roadmap at office hours tomorrow at 2pm PT. If you're interested in agents | RL | interp and want to contribute to open-source send me a DM for more info. github.com/withmartian/ar…

6

31

2.3K

Ryan Smith retweeted

Abir Harrasse@AHarrasse1906·3 Feb

Paper alert 🚨: LLMs build a shared multilingual latent space for meaning, decoding into languages only later. 🌍 Performance gaps come from tokenizer bias & weaker late-layer circuits, not missing concepts. We show this mechanistically with Cross-Layer Transcoders. 🧵👇

14

55

9.7K

Ryan Smith retweeted

Martian@withmartian·4 Feb

We’re excited for the first round of the Martian Interpretability Prize! The proposal deadline was this week, and we’ve had a ton of applications. Can’t wait to look through these and get back to everyone who applied. More rounds coming up 🙂
