Shriyash Upadhyay

267 posts

@shriyashku

Founder @withmartian

Joined February 2018

127 Following · 368 Followers

Pinned Tweet

Shriyash Upadhyay@shriyashku·19 May

Safety is being subsumed to capitalism because when there is misalignment between the two, capitalism wins. The only way to make sure AI is safe is to make a strong capitalist case for the technologies that will make AI safe: creating an economic incentive to understand models.

1.2K

Shriyash Upadhyay retweeted

Cognition@cognition·25 Mar

We're happy to announce our collaboration with @withmartian on Code Review Bench v0.3, with a focus on the tradeoffs between precision and latency.

Martian@withmartian

We've been tracking AI code review tools across OSS, and a new category is emerging. We're calling it "Deep Review": → Standard AI review: PR-level, fast, human in the loop → Deep Review: repo-wide context, runs autonomously in the background 🧵👇

13.3K

Shriyash Upadhyay@shriyashku·25 Mar

@alexML @withmartian I like to think of myself as a *very deep* review agent

Martian@withmartian·25 Mar

39K

Shriyash Upadhyay@shriyashku·25 Mar

First was codegen, now code review. Every product category will have background agents. Tools in most fields talk about augmenting humans, but that’s a bad design pattern. It encourages humans to be the bottleneck. Things will just happen in the background, automatically

Martian@withmartian

445

Shriyash Upadhyay@shriyashku·13 Mar

@Piyushkumar420 @withmartian @augmentcode @baz_scm You should read the methodology here!: github.com/withmartian/co…

Shriyash Upadhyay@shriyashku·13 Mar

We caught the same pattern with Claude Code Review. Reviews written by claude-code improved in the data weeks before Anthropic's announcement on Monday. This is how we figured out the launch was coming before the blog post dropped.

Shriyash Upadhyay retweeted

CodeRabbit@coderabbitai·5 Mar

Every AI code review benchmark published so far has one thing in common: they were all made by vendors. And somehow, their own tool always wins. That just changed with the first independent benchmark. Here's how we performed on real OSS PRs! 👇

14.3K

Shriyash Upadhyay retweeted

Rohan Paul@rohanpaul_ai·27 Feb

The developer space is absolutely on fire over the last few days. 🔥 And now we have Martian releasing the largest coding benchmark ever to evaluate how well AI agents review your daily code. And it's open-sourced.

This is also the first unbiased code review benchmark to finally stop AI models from cheating on tests. The real breakthrough is that this is the first "self-correcting" benchmark that can't be gamed by marketing teams or lazy training data.

Most benchmarks are like a fixed school exam that never changes; once the "students" (the AI models) see the questions enough times, they just memorize the answers, and the test becomes useless.

Martian structurally fixed this by introducing a Dual-Layer Evaluation system. They have an "Offline" layer (a fair, side-by-side test on static data) and an "Online" layer (tracking real-world behavior of what developers actually use). If an AI company tries to "cheat" by optimizing their model specifically for the offline test, their score will stop matching the real-world usage in the online layer, and everyone will see they are faking it.

This dual method completely stops companies from rigging the scores and proves which tools actually work. This is the first time we've had a measuring stick for AI that actually survives contact with the real world without breaking down or becoming biased over time.

They combined live data from human behavior with isolated offline tests to evaluate over 200,000 code changes. The system remains totally neutral because the creators do not sell any coding assistants themselves. Software teams finally have a reliable measuring standard that adapts to the real world and never breaks.

Martian@withmartian

Introducing Code Review Bench v0: codereview.withmartian.com The first independent code review benchmark. 200,000+ PRs. Unbiased. Fully OSS. Updated daily. Tool performance highlights 🧵👇 Featuring: @augmentcode @baz_scm @claudeai @coderabbitai @cursor @GeminiApp @github @graphite @greptile @kilocode @OpenAIDevs @propelcode @QodoAI
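
The offline/online cross-check described in the retweeted post above can be illustrated with a rough sketch. This is not Martian's actual implementation: the `ToolScores` structure, the `flag_divergent` helper, the 0-to-1 score scale, the `max_gap` threshold, and the sample numbers are all hypothetical, chosen only to show the idea that a tool scoring far better offline than online stands out.

```python
from dataclasses import dataclass


@dataclass
class ToolScores:
    """Scores for one code review tool (hypothetical structure)."""
    name: str
    offline: float  # score on the static, side-by-side test set, assumed in [0, 1]
    online: float   # score derived from real-world usage, assumed in [0, 1]


def flag_divergent(tools: list[ToolScores], max_gap: float = 0.15) -> list[str]:
    """Return names of tools whose offline score exceeds their online score by more than max_gap."""
    return [t.name for t in tools if t.offline - t.online > max_gap]


# Made-up example data, for illustration only.
tools = [
    ToolScores("tool_a", offline=0.72, online=0.70),  # layers agree
    ToolScores("tool_b", offline=0.91, online=0.58),  # strong offline, weak online
]
flagged = flag_divergent(tools)
print(flagged)
```

With these made-up numbers only `tool_b` is flagged, since its offline score runs well ahead of its online score. A real system would presumably need comparable metrics across the two layers and a statistically motivated threshold rather than a fixed gap.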
