JS Denain

36 posts

@js_denain

@EpochAIResearch

Joined October 2024
2 Following · 232 Followers
JS Denain reposted
Epoch AI @EpochAIResearch
Frontier labs are investing massively in RL environments, yet most of what happens in this space stays behind closed doors. @chrisbarber and @js_denain interviewed 18 people from RL environment startups, neolabs, and frontier labs. Here's what they found:
9
38
356
58.4K
JS Denain @js_denain
I rarely use this account. If you want to read more of my thoughts you can follow me at @datagenproc!
0
0
0
105
JS Denain reposted
Epoch AI @EpochAIResearch
Conventional wisdom says that the US can’t build power but China can, so China’s going to “win the AGI race by default”. We think this is wrong. The US likely can build enough power to support AI scaling through 2030 — as long as they’re willing to spend a lot. A thread:
2
43
183
55.3K
JS Denain reposted
Epoch AI @EpochAIResearch
Should AI regulations be based on training compute? As training pipelines become more complex, they could undermine compute-based AI policies. In a new piece with Google DeepMind’s AI Policy Perspectives team, we explain why. 🧵
8
11
63
8.4K
JS Denain reposted
Epoch AI @EpochAIResearch
How fast has society been adopting AI? Back in 2022, ChatGPT arguably became the fastest-growing consumer app ever, hitting 100M users in just 2 months. But the field of AI has transformed since then, and it’s time to take a new look at the numbers. 🧵
2
50
255
27.8K
JS Denain reposted
Epoch AI @EpochAIResearch
We are still hiring for an Engineering Lead on our Benchmarking team! We need a software engineer with outstanding technical expertise (no AI experience necessary) who's excited about leading evaluations on frontier AI models.
1
1
13
3.2K
JS Denain reposted
Epoch AI @EpochAIResearch
Introducing FrontierMath Tier 4: a benchmark of extremely challenging research-level math problems, designed to test the limits of AI’s reasoning capabilities.
18
60
559
81.8K
JS Denain reposted
Epoch AI @EpochAIResearch
Running SWE-bench evals is very slow and difficult. To solve this, we created a registry of optimized Docker images that let us run SWE-bench Verified in just one hour on a single 32-core machine. Today, we are open-sourcing these images: anyone can `docker pull` them.
3
13
203
10.6K
JS Denain reposted
Tom Adamczewski @tmkadamcz
The GitHub API doesn't seem to support changing the visibility of an image on the Container Registry. This is a huge problem for me as I have 4,219 images I need to make public for an @EpochAIResearch project :( Anyone at GitHub who could help with this?
2
2
5
885
JS Denain reposted
Epoch AI @EpochAIResearch
SWE-bench Verified is one of the main benchmarks to assess AI coding skills. But what does it actually measure? We found that it's one of the best tests of AI coding, but limited by its focus on simple bug fixes in familiar repositories. Here’s a summary of our article 🧵
6
31
407
418.8K
JS Denain reposted
Epoch AI @EpochAIResearch
Three years and 100+ projects in, our mission is the same: give everyone clear, trusted insight into where AI is headed. Our new post unpacks the principles behind every research choice—why we take some ideas on and pass on others. epoch.ai/blog/what-is-e…
0
7
44
3.2K
Tomek Korbak @tomekkorbak
I reimplemented the bliss attractor eval from the Claude 4 System Card. It's fascinating how LLMs reliably fall into attractor basins of their pet obsessions, how different these attractors are across LLMs, and how they say something non-trivial about LLMs' personalities. 🌀🌀🌀
12
23
173
38.4K
JS Denain @js_denain
@TeksEdge Hi! The problems are all of the same form: you're given an image with ramps and buckets like this one, you have to predict which bucket the ball will fall into.
[Attached image: example problem with ramps and buckets]
2
0
1
63
JS Denain reposted
Epoch AI @EpochAIResearch
We’re hiring an Engineering Lead to help guide our Benchmarking team! Provide independent evaluations of today’s and tomorrow’s AI models, leading to better research, policy, and decision-making. The role is fully remote, and applications are rolling.
15
25
170
345.5K
JS Denain @js_denain
@scaling01
- OpenCompass
- HHEM
- Galileo Agent
- XLANG Computer Agent Arena

I haven't looked into them in detail yet, but will H/T you if we add some of them to the hub 🙂
0
0
1
45
JS Denain @js_denain
@scaling01 Thanks for making your list though! It did put on the radar some benchmarks I was not tracking:
- Thematic Generalization by LechMazur (I knew of the writing and multi-agent ones, not this one)
- Dubesor LLM
- TrackingAI
- IQ Bench
- Misguided Attention
- Snake-Bench
1
0
2
68