JS Denain

36 posts

@js_denain

@EpochAIResearch

Joined October 2024

2 Following · 232 Followers

JS Denain reposted

Epoch AI@EpochAIResearch·12 Jan

Frontier labs are investing massively in RL environments, yet most of what happens in this space stays behind closed doors. @chrisbarber and @js_denain interviewed 18 people from RL environment startups, neolabs, and frontier labs. Here's what they found:

JS Denain@js_denain·28 Dec

I rarely use this account. If you want to read more of my thoughts you can follow me at @datagenproc!

JS Denain reposted

Epoch AI@EpochAIResearch·17 Dec

Conventional wisdom says that the US can’t build power but China can, so China’s going to “win the AGI race by default”. We think this is wrong. The US likely can build enough power to support AI scaling through 2030 — as long as they’re willing to spend a lot. A thread:

JS Denain reposted

Epoch AI@EpochAIResearch·12 Sep

Should AI regulations be based on training compute? As training pipelines become more complex, they could undermine compute-based AI policies. In a new piece with Google DeepMind’s AI Policy Perspectives team, we explain why. 🧵

JS Denain@js_denain·26 Jul

Very excited about this analysis by @GregHBurnham!

Epoch AI@EpochAIResearch

xAI commissioned us to analyze Grok 4’s math capabilities. Our findings:
+ It’s good at involved computations, improving at proofs (from a low base), and useful for literature search.
- It favors low-level grinds and leans on background knowledge.

Read on for examples!

JS Denain reposted

Epoch AI@EpochAIResearch·18 Jul

How fast has society been adopting AI? Back in 2022, ChatGPT arguably became the fastest-growing consumer app ever, hitting 100M users in just 2 months. But the field of AI has transformed since then, and it’s time to take a new look at the numbers. 🧵

JS Denain reposted

Epoch AI@EpochAIResearch·17 Jul

We are still hiring for an Engineering Lead on our Benchmarking team! We need a software engineer with outstanding technical expertise (no AI experience necessary) who's excited about leading evaluations on frontier AI models.

JS Denain reposted

Epoch AI@EpochAIResearch·11 Jul

Introducing FrontierMath Tier 4: a benchmark of extremely challenging research-level math problems, designed to test the limits of AI’s reasoning capabilities.

JS Denain reposted

Epoch AI@EpochAIResearch·10 Jul

Running SWE-bench evals is very slow and difficult. To solve this, we created a registry of optimized Docker images that let us run SWE-bench Verified in just one hour on a single 32-core machine. Today, we are open-sourcing these images— anyone can `docker pull` them.

JS Denain@js_denain·1 Jul

@tmkadamcz @EpochAIResearch Tagging @github

JS Denain reposted

Tom Adamczewski@tmkadamcz·1 Jul

The GitHub API doesn't seem to support changing the visibility of an image on the Container Registry. This is a huge problem for me as I have 4,219 images I need to make public for an @EpochAIResearch project :( Anyone at GitHub who could help with this?

JS Denain reposted

Epoch AI@EpochAIResearch·13 Jun

SWE-bench Verified is one of the main benchmarks to assess AI coding skills. But what does it actually measure? We found that it's one of the best tests of AI coding, but limited by its focus on simple bug fixes in familiar repositories. Here’s a summary of our article 🧵

JS Denain reposted

Epoch AI@EpochAIResearch·5 Jun

Three years and 100+ projects in, our mission is the same: give everyone clear, trusted insight into where AI is headed. Our new post unpacks the principles behind every research choice—why we take some ideas on and pass on others. epoch.ai/blog/what-is-e…

JS Denain@js_denain·5 Jun

@tomekkorbak My logs are kind of confusing to read though

JS Denain@js_denain·5 Jun

@tomekkorbak Nice! Here are a few transcripts I generated yesterday (no scoring), showing similar things. js-d.github.io/self-interacti…

Tomek Korbak@tomekkorbak·5 Jun

I reimplemented the bliss attractor eval from Claude 4 System Card. It's fascinating how LLMs reliably fall into attractor basins of their pet obsessions, how different these attractors are across LLMs, and how they say something non-trivial about LLMs' personalities. 🌀🌀🌀

JS Denain@js_denain·30 May

@TeksEdge For more information: cbrower.dev/vpct huggingface.co/datasets/camel…

JS Denain@js_denain·30 May

@TeksEdge Hi! The problems are all of the same form: you're given an image with ramps and buckets like this one, you have to predict which bucket the ball will fall into.

David Hendrickson@TeksEdge·30 May

Very cool, but does anyone know what kind of elementary visual physics problems are? Is this like watching physical YouTube videos (yeah, that's exactly correct)?

Epoch AI@EpochAIResearch

We're expanding the Epoch AI Benchmarking Hub with four more external benchmarks: VPCT, Fiction-liveBench, GeoBench, and SimpleBench! These benchmarks test visual physics understanding, Geoguessr ability, long-context comprehension, and reasoning and logic skills. 🧵

JS Denain reposted

Epoch AI@EpochAIResearch·9 May

We’re hiring an Engineering Lead to help guide our Benchmarking team! Provide independent evaluations of today’s and tomorrow’s AI models, leading to better research, policy, and decision-making. The role is fully remote, and applications are rolling.

JS Denain@js_denain·6 May

@scaling01
- OpenCompass
- HHEM
- Galileo Agent
- XLANG Computer Agent Arena

I haven't looked into them in detail yet, but will H/T you if we add some of them to the hub 🙂

JS Denain@js_denain·6 May

@scaling01 Thanks for making your list though! It did put on the radar some benchmarks I was not tracking:
- Thematic Generalization by LechMazur (I knew of the writing and multi-agent ones, not this one)
- Dubesor LLM
- TrackingAI
- IQ Bench
- Misguided Attention
- Snake-Bench

Lisan al Gaib@scaling01·6 May

a little credit for bringing these on the radar would have been nice

Epoch AI@EpochAIResearch

We’ve added four new benchmarks to the Epoch AI Benchmarking Hub: Aider Polyglot, WeirdML, Balrog, and Factorio Learning Environment! Before we only featured our own evaluation results, but this new data comes from trusted external leaderboards. And we've got more on the way 🧵
