Ryan Shar

8 posts

@RyanShar01

Research Scientist @ Apple | CMU ML

Joined August 2024
26 Following · 7 Followers
Ryan Shar retweeted
Wayne Chi (@iamwaynechi)
New preprint alert 🚨 Can LLM agents develop video games? We release GameDevBench, the first benchmark evaluating agentic game development in a game engine, Godot. We also present two simple multimodal feedback mechanisms that lead to immediate performance gains. /🧵
19 replies · 28 reposts · 252 likes · 22.7K views
Ryan Shar retweeted
Wayne Chi (@iamwaynechi)
Tired of evaluating LLMs on made-up problems that look nothing like real tasks? Introducing EDIT-Bench, a code editing benchmark built from in-the-wild user interactions in VSCode. Real-world edits are challenging: only 1/40 models score > 60% pass@1.
2 replies · 12 reposts · 40 likes · 11.1K views
Ryan Shar retweeted
Ameet Talwalkar (@atalwalkar)
I’m excited to share new work from Datadog AI Research! We just released Toto, a new SOTA (by a wide margin!) time series foundation model, and BOOM, the largest benchmark of observability metrics. Both are available under the Apache 2.0 license. 🧵
5 replies · 51 reposts · 242 likes · 38.1K views
Ryan Shar retweeted
Wayne Chi (@iamwaynechi)
What do developers really think of AI coding assistants? In October, we launched @CopilotArena to collect user preferences on real dev workflows. After months of live service, we’re here to share our findings in our recent preprint. Here's what we have learned /🧵
Arena.ai (@arena)

Introducing Copilot Arena - Interactive coding evaluation in the wild. Our extension lets you test top models for free, right in VSCode. Let's vote and build the Copilot leaderboard! Download here: marketplace.visualstudio.com/items?itemName… Led by @iamwaynechi and @valeriechen_ at CMU. 1/🧵

3 replies · 32 reposts · 161 likes · 70.9K views
Ryan Shar retweeted
Jane Pan (@JanePan_)
When benchmarks talk, do LLMs listen? Our new paper shows that evaluating code LLMs with interactive feedback significantly affects model performance compared to standard static benchmarks! Work w/ @RyanShar01, @jacob_pfau, @atalwalkar, @hhexiy, and @valeriechen_! [1/6]
2 replies · 13 reposts · 54 likes · 10.5K views
Ryan Shar retweeted
Misha Khodak (@khodakmoments)
🧵 on surprising revelations from our study of specialized foundation models (FMs beyond vision/text): after evaluating dozens of scientific & time series FMs we found that most weren’t even competitive with simple supervised models, some with as little as 513 parameters. 1/n
3 replies · 60 reposts · 243 likes · 43K views
Ryan Shar retweeted
Arena.ai (@arena)
Which model is best for coding? @CopilotArena leaderboard is out! Our code completions leaderboard contains data collected over the last month, with >100K completions served and >10K votes! Let’s discuss our findings so far🧵
17 replies · 77 reposts · 532 likes · 135.9K views