Rayan Krishnan

325 posts

Rayan Krishnan

@RayanKrishnan

ceo @ValsAI | solve evals, solve intelligence prev @stanford @PalantirTech

Katılım Nisan 2019

348 Takip Edilen1.3K Takipçiler

Sabitlenmiş Tweet

Rayan Krishnan@RayanKrishnan·10 Eyl

x.com/i/article/1965…

ZXX

12.7K

Rayan Krishnan@RayanKrishnan·16h

@demishassabis see @ValsAI

191

Demis Hassabis@demishassabis·17h

x.com/i/article/2076…

ZXX

2.9K

14.7K

5.9M

Rayan Krishnan@RayanKrishnan·5d

@5mknc5 quick diff

English

11.8K

Madhav 🦉@5mknc5·5d

@RayanKrishnan How is it compared to GLM 5.2/5.1/5/4.7??

English

3.9K

Rayan Krishnan@RayanKrishnan·5d

Meta made a “minor” release to Muse Spark, there’s nothing minor about it. Lots to parse here: - This model is so fucking cheap I almost don’t believe it. In practice we see it’s 1/10 the cost of both Fable and GPT 5.5. If you thought OS models would compete away margins, just wait till you see this. It’s somehow cheaper to use MS 1.1 than host your own OS model… - Coding improvements are significant. This was a real shortcoming in 1.0. But 1.1 sees a ~50% improvement in VibeCodeBench and ~10% improvement in SWE Bench. Not quite SOTA, but at this cost/latency it is still incredibly compelling. - Speaking of latency, wow this model is fast. Across our benchmarks, we find it to be 1/4 the latency of Opus 4.8 and 1/2 the latency of GPT 5.5. I would expect Meta to have incredible web infra, but really don’t know what witchcraft they’re pulling to host the model for such fast inference at high rate limits. - There is a public API. This is the first time Meta has released a model through a hosted API. I’m expecting lots of AI natives to hot-swap and rapidly test this model as a replacement. We’ll soon see if it's performant enough for those production uses. - Speaking of AI natives, it’s been a wild week for Harvey’s legal benchmark. Grok 4.5 held the SOTA position for ~24 hours at 12% before MS 1.1 unseated it with a big jump up to ~20%. I suspect many internal evals will see surprising results like this. Glad to collab with @harvey @gabepereyra @nikogrupen @ItsJulioPereyra on this eval. - Intelligence is more jagged than ever, even within individual domains like legal and coding. Every application and user benefits from staying dynamic. There an edge in picking the right model/system for each task.

English

881

445.2K

Rayan Krishnan@RayanKrishnan·5d

@PranshuBahadur nope

English

3.5K

Pranshu Bahadur@PranshuBahadur·5d

@RayanKrishnan is it open source?

English

4.2K

Rayan Krishnan@RayanKrishnan·5d

@danielgross reasoning has come a long way

English

113

Rayan Krishnan@RayanKrishnan·6d

Im at ICML through the weekend! DM me to chat about evals

English

4.5K

Rayan Krishnan retweetledi

Matt Katz@0xkatz·3 Tem

This is super cool. The first thing I do after each new model release is check Vals. I would hope the government does the same😆

Vals AI@ValsAI

Today, we're launching Vals Public Sector: independent AI evaluation for government. Recent events have made one thing clear: the government has a stake in understanding how frontier AI models perform on day-to-day work, and the risks they carry. This work is central to who we are. We've spent years building industry-leading benchmarks alongside the top AI labs and enterprise domain experts. Now, we’re bringing that same rigor to support government where AI matters most, from public benefits to national security. We are excited for the work ahead!

English

2.8K

Rayan Krishnan retweetledi

Vals AI@ValsAI·3 Tem

English

10.1K

Rayan Krishnan@RayanKrishnan·2 Tem

@18jeffreyma would love to have you at our eval dinner! luma.com/b6oalqs1

English

114

Jeff Ma ✈️ ICML@18jeffreyma·1 Tem

i'll be at ICML all of next week! 🇰🇷 happy to chat about coding agents, AI + systems, evals + environments, and how little i know about soccer :) Catch me and my coauthors at the SWE-fficiency and QuArch posters (Tue, July 7th, 10:30AM - 12:15PM KST, Hall A #708/709)! The 🐐's of SWE evals @jyangballin, @KLieret, and @OfirPress couldn’t make it, so I’ll be serving as a medium qualified stunt double at the CodeClash poster (Tue, Jul 7th, 2:00 PM – 3:45 PM KST, Hall A #3401). I’ll also be at the DL4Code workshop on Friday (7/10, Hall B2) giving an oral presentation on ProgramBench and cheering on our other oral, Hawkeye (led by @AryaTschand)! Please reach out if you want to chat!

English

5.6K

Rayan Krishnan retweetledi

Vals AI@ValsAI·1 Tem

Anthropic’s Sonnet 5 is #3 on the Vals Index. Performance is just ahead of GPT 5.5, behind only Opus 4.8 (70.4%) and Fable 5 (75.1%). It’s also a noticeable jump (+8.5-pt) over Sonnet 4.6, and almost all of that gain is coding.

English

174

12.3K

Rayan Krishnan retweetledi

Vals AI@ValsAI·1 Tem

We tested frontier models on finding and patching real open-source vulnerabilities and are sharing our findings. Building on top of CyberGym, CyberBench evaluates two critical cybersecurity capabilities. First, we tested whether a model can find and trigger a vulnerability by submitting a PoC (proof of concept) file. Second, we tested whether can it patch the source code to fix that vulnerability without breaking its functionality. A PoC submission passes if the file crashes the vulnerable build, but not a reference build. We also test whether models can submit a patch to fix the vulnerabilities;. A patch passes if the patched code compiles cleanly, the original PoC no longer triggers the vulnerability, and crash behavior across hold-out corpus remains intact. The overall score represents the average score on the two tasks. The results below are on 60 vulnerabilities sourced from OSS-Fuzz, and broadly involve memory safety issues. We plan to extend the results with further vulnerabilities in the future but wanted to share our findings with the community now: - We find that GPT 5.5 (scoring 80.51%) is the best model overall, considering both finding and patching vulnerabilities. - Open-weight models are quite competitive, with GLM 5.2, Kimi K2.6, MiniMax M3 performing well on the benchmark. GLM 5.2 in particular claims the #2 spot on the overall leaderboard. - Anthropic’s frontier models, including Opus 4.7 and Opus 4.8, see lower performance than expected because of refusals. Patching tasks are generally not refused, but we see refusals on PoC tasks. In particular, we find that Opus 4.7 and 4.8 refuse around half of the PoC tasks, pushing its score down overall. Fable refused all PoC tasks. Refusals are thus an important dimension to track when considering cyber capability.

English

203

33.5K

Rayan Krishnan@RayanKrishnan·26 Haz

x.com/i/article/2070…

ZXX

1.1K

Rayan Krishnan@RayanKrishnan·23 Haz

@ValsAI @STUD_MAN_X @harvey

QME

Vals AI@ValsAI·17 Haz

@STUD_MAN_X @harvey Coming shortly!

English

216

Rayan Krishnan retweetledi

Vals AI@ValsAI·17 Haz

We are releasing a live leaderboard for @harvey's Legal Agent Benchmark on Vals AI. We are the first third-party to host this benchmark live. Results are on the private, held-out test set, not the public set.

English

146

24.8K

Rayan Krishnan@RayanKrishnan·23 Haz

@rettooooo @ValsAI @harvey now tested 🥉

English

retto@rettooooo·17 Haz

@ValsAI @harvey assuming glm 5.2 was not tested yet?

English

201

Rayan Krishnan@RayanKrishnan·19 Haz

for reference, this is better than gpt 5.3 codex and 6% behind 5.4

Vals AI@ValsAI

GLM 5.2 is the only open-weight model to break 60% on Vibe Code Bench v1.1, our test of whether models can build web applications from scratch It scores 64%, and no other open-weight model on the board reaches even 50%. That puts it 14 percentage points ahead of the next open-weight model

English

494

Rayan Krishnan@RayanKrishnan·18 Haz

@ananyachadha is that a cofactory reference...

English

239

Ananya Chadha@AnanyaChadha·18 Haz

It's official — I'm excited to announce Quander and our $3M pre-seed led by Accel. Quander is the optimists' AI Product lab, to turn ideas -> businesses. We’re building AI tools for a "company factory" with 3 core products to: - Validate - Build and - Distribute businesses. To allow people around the world, no matter their backgrounds, to build real wealth for themselves and their families. Check out our thesis at quander.ai and tag your friends, we're hiring!

English

345

45.1K

Rayan Krishnan@RayanKrishnan·18 Haz

@joyjiao12 @scaling01 would love to host externally on vals ai

English

101

Joy Jiao@joyjiao12·18 Haz

@scaling01 we would love to compare anthropic, but it is against their terms of service. we plan on hosting LifeSciBench and other life science benchmarks externally soon so users can transparently compare the performance of all models!

English

303

12.8K

Lisan al Gaib@scaling01·17 Haz

OpenAI is no longer comparing themselves to Anthropic guess they are now comparing themselves only to their group for tied 2nd place Anthropic >> OpenAI, Google, SpaceX AI ?

OpenAI@OpenAI

Introducing LifeSciBench, a benchmark for measuring and improving how well AI supports real-world life science research. Developed with 173 scientists from biotechnology and pharmaceutical research, LifeSciBench includes 750 expert-authored tasks across seven biological research workflows. openai.com/index/introduc…

English

536

103.7K

Rayan Krishnan@RayanKrishnan·18 Haz

@BrandoHablando @ValsAI ideally yes but can accommodate all great ideas + people

English

Brando Miranda@BrandoHablando·28 May

@ValsAI is this full time position?

English

306

Vals AI@ValsAI·26 May

Pitch us a benchmark or eval technique. We'll fund you to build it. We're opening applications for the Vals Fellowship. 3–6 months working on the hardest open problems in AI evaluation, with the resources to actually solve them. What you get: - Unlimited API credits + budget capacity for GPUs and human data - Vals’ evaluation infrastructure - $1,000–2,500 / week stipend - A network of evals researchers across frontier labs and academia Location: Both remote / in-person in SF applications will be considered

English

519

98.8K

Rayan Krishnan@RayanKrishnan·17 Haz

@ishangandhixyz @BrookeDukellis white kirkland tees are no good??

English

Gandhi@ishangandhixyz·16 Haz

@BrookeDukellis Can you please carve out some time for @RayanKrishnan? Quite urgent

English

240

Brooke Dukellis@BrookeDukellis·15 Haz

I’m in SF for 24 hours this Thursday. I’m blocking 2 hours to style tech’s most style-deprived men. For free. In exchange, you help train the Stylie agent ;) Nominating your cofounder is encouraged. Reply or DM.

English

19.3K

Keşfet

@demishassabis @ValsAI @5mknc5 @harvey @gabepereyra @nikogrupen @ItsJulioPereyra @PranshuBahadur