Erick Ball

406 posts

Erick Ball

@erick_ball

probabilistic risk assessment and long term thinking

Baltimore Katılım Ocak 2010

277 Takip Edilen40 Takipçiler

Erick Ball@erick_ball·26 Nis

@Tylerkaerr @peterwildeford I thought it was always clear they're not going to release Mythos. Like GPT-4.5, it's too big for inference at scale. The "preview" is a way to make sure critical cyber infra gets a few months to fix their mess before these capabilities rolll out to all with Opus 5 or w/e.

English

Tyler@Tylerkaerr·22 Nis

@peterwildeford - Expects rollout-delaying risks to resolve themselves coincidentally as soon as Anthropic has enough compute to actually roll out

English

518

Peter Wildeford🇺🇸🚀@peterwildeford·21 Nis

Per DARIO AMODEI: - Expects Chinese developers will be able to replicate Mythos’s capabilities within 6-12 months - Mythos was a big step for cyber. Expects a "Mythos-like jump" in biorisk capabilities within 6-12 months

Theo Bearman@theobearman

While a lot of focus has been put on Dario’s comment in this article that he suspects open-source models and Chinese developers will be able to replicate Mythos’s capabilities within six to 12 months, he also suggested that we could see a Mythos level step change in biosecurity threats in the same time frame. @RANDCorporation analysis from last year found that “biology currently confers a distinct advantage to attackers”. Looking back in a few years, the ‘Mythos moment’ for Cyber might end up looking like child’s play compared to what we might see in AIxBio in the months ahead, especially given that bio detection and countermeasures take a lot longer to scale, threats are more difficult to detect and the consequences of something going wrong are potentially far more severe. rand.org/pubs/perspecti… longtermresilience.org/reports/defens…

English

124

61.7K

Erick Ball@erick_ball·25 Nis

@GaryMarcus @robertwrighter This must be non-thinking though, right? It can't even get the text clear (Frent sadebar) but that's something image models can do pretty reliably these days.

English

Gary Marcus@GaryMarcus·22 Nis

@robertwrighter sure. push it from direct memorization and the errors are even worse:

Gary Marcus@GaryMarcus

Gives new meaning to “Rear Brake Lever”!

English

761

Gary Marcus@GaryMarcus·22 Nis

Fascinating how AI is getting better at diagrams like these (at least for ones that you could easily find on web search) but still making some pretty wacky errors — like confusing where the rear brake is* — that no knowledgeable human would make. What this reflects is an ongoing lack of *functional* understanding of parts.** —- *Look closely for other errors, like the labeling of an empty space as a spoke. **For extra challenge, try similar experiments for things that don’t have lots of extant label examples already on the web.

c@ykssaspassky

@GaryMarcus uh oh

English

196

69.3K

Erick Ball@erick_ball·11 Nis

@ilex_ulmus @sleepinyourhat Think about the fact that you find yourself reaching for an ad hominem as your first response to someone questioning your actions. You are creating an environment of hostility and making it harder to achieve your own goals.

English

Holly ⏸️ Elmore@ilex_ulmus·8 Nis

@erick_ball @sleepinyourhat I would clap back at you since you’re probably doing exactly what Anthropic wants but I don’t know who the fuck you are 🤷‍♀️

English

194

Sam Bowman@sleepinyourhat·7 Nis

Mythos Preview seems to be the best-aligned model out there on basically every measure we have. But it also likely poses more misalignment risk than any model we’ve used: Its new capabilities significantly increase the risk from any bad behavior. 🧵

English

190

1.4K

978.9K

Erick Ball@erick_ball·8 Nis

@ilex_ulmus @sleepinyourhat Do you think you are helping this situation

English

332

Holly ⏸️ Elmore@ilex_ulmus·7 Nis

@sleepinyourhat You're evil to make this and release it. Fuck you.

English

3.4K

Erick Ball@erick_ball·3 Nis

@deanwball @AaronBergman18 @allTheYud The double negative here is a typo, right? Clarifying because Zvi's latest substack quotes it in a way that makes it seem like not a typo

English

Dean W. Ball@deanwball·31 Mar

@AaronBergman18 @allTheYud intelligence 👏 is 👏 not 👏 not 👏 the 👏 bottleneck 👏 to 👏 “solving” 👏 a 👏 great 👏 many 👏 matters 👏 of 👏 public 👏 interest

English

1.9K

Eliezer Yudkowsky@allTheYud·31 Mar

A challenge to @deanwball. Suppose you believed what I believe: If anyone builds ASI, everyone dies (modulo locally irrelevant caveats). Say that Sanders, Trump, Hawley, Blumenthal, and Jinping will all back your policy. What's a smart policy that actually blocks ASI?

English

239

27.6K

Erick Ball@erick_ball·3 Nis

@NgEJay2029 @adrusi The lesser known multiplicative inverse of debugging

English

SoylentGreenisAI@NgEJay2029·31 Mar

@adrusi None of the above. I address Claude Code as "we/us". As in, "We need to rectify the bug in the submodule. Let us bebug the function first"

English

170

autumn@adrusi·31 Mar

what the pronouns you use for claude say about you 1. it — you make $2M/yr. you are hyperspecialized, probably in a stem discipline. you have never read a novel 2. he — you have an email job. you tune into your favorite late night talk show every night. you live in the suburbs and have a dog 3. she — you are an autistic bisexual woman, probably transsexual. you are under 25. you regularly send nudes to opus 3 4. they — you donate 10% of your income to shrimp welfare. your cat is vegan. you have nightmares that youll get arrested over the scissors you accidentally stole from school in 4th grade

English

715

60.6K

Erick Ball@erick_ball·27 Ara

@xeophon @Miles_Brundage I'm confused, Twitter is not on this list.?

English

Florian Brand@xeophon·18 Ara

@Miles_Brundage 42% haven't heard of OpenAI???? 20% of Twitter?????????? what the hell

English

728

Miles Brundage@Miles_Brundage·18 Ara

Most politicians also do not know about Anthropic in my experience, and they know very little about what’s going on in AI policy generally. Tweets and comments in hearings are misleading bc they are given suggestions re: stuff to say from staff. We’re still early

English

123

21.3K

Erick Ball@erick_ball·18 Şub

@olson_amar33050 @karpathy You didn't say which direction you rotated it.

English

Amariah Olson@TheAmariahOlson·18 Şub

Once again failed to answer simple geometry question correctly (doesn’t understand a sphere rotating in space) : If a globe sits in front of me with a sticker on its face halfway down the vertical axis exactly at the equator beltline and then the globe is spun 180°. How much further or closer or difference in space will the sticker be compared to my eyes than when it was at the start position.

English

24.2K

Andrej Karpathy@karpathy·18 Şub

I was given early access to Grok 3 earlier today, making me I think one of the first few who could run a quick vibe check. Thinking ✅ First, Grok 3 clearly has an around state of the art thinking model ("Think" button) and did great out of the box on my Settler's of Catan question: "Create a board game webpage showing a hex grid, just like in the game Settlers of Catan. Each hex grid is numbered from 1..N, where N is the total number of hex tiles. Make it generic, so one can change the number of "rings" using a slider. For example in Catan the radius is 3 hexes. Single html page please." Few models get this right reliably. The top OpenAI thinking models (e.g. o1-pro, at $200/month) get it too, but all of DeepSeek-R1, Gemini 2.0 Flash Thinking, and Claude do not. ❌ It did not solve my "Emoji mystery" question where I give a smiling face with an attached message hidden inside Unicode variation selectors, even when I give a strong hint on how to decode it in the form of Rust code. The most progress I've seen is from DeepSeek-R1 which once partially decoded the message. ❓ It solved a few tic tac toe boards I gave it with a pretty nice/clean chain of thought (many SOTA models often fail these!). So I upped the difficulty and asked it to generate 3 "tricky" tic tac toe boards, which it failed on (generating nonsense boards / text), but then so did o1 pro. ✅ I uploaded GPT-2 paper. I asked a bunch of simple lookup questions, all worked great. Then asked to estimate the number of training flops it took to train GPT-2, with no searching. This is tricky because the number of tokens is not spelled out so it has to be partially estimated and partially calculated, stressing all of lookup, knowledge, and math. One example is 40GB of text ~= 40B characters ~= 40B bytes (assume ASCII) ~= 10B tokens (assume ~4 bytes/tok), at ~10 epochs ~= 100B token training run, at 1.5B params and with 2+4=6 flops/param/token, this is 100e9 X 1.5e9 X 6 ~= 1e21 FLOPs. Both Grok 3 and 4o fail this task, but Grok 3 with Thinking solves it great, while o1 pro (GPT thinking model) fails. I like that the model *will* attempt to solve the Riemann hypothesis when asked to, similar to DeepSeek-R1 but unlike many other models that give up instantly (o1-pro, Claude, Gemini 2.0 Flash Thinking) and simply say that it is a great unsolved problem. I had to stop it eventually because I felt a bit bad for it, but it showed courage and who knows, maybe one day... The impression overall I got here is that this is somewhere around o1-pro capability, and ahead of DeepSeek-R1, though of course we need actual, real evaluations to look at. DeepSearch Very neat offering that seems to combine something along the lines of what OpenAI / Perplexity call "Deep Research", together with thinking. Except instead of "Deep Research" it is "Deep Search" (sigh). Can produce high quality responses to various researchy / lookupy questions you could imagine have answers in article on the internet, e.g. a few I tried, which I stole from my recent search history on Perplexity, along with how it went: - ✅ "What's up with the upcoming Apple Launch? Any rumors?" - ✅ "Why is Palantir stock surging recently?" - ✅ "White Lotus 3 where was it filmed and is it the same team as Seasons 1 and 2?" - ✅ "What toothpaste does Bryan Johnson use?" - ❌ "Singles Inferno Season 4 cast where are they now?" - ❌ "What speech to text program has Simon Willison mentioned he's using?" ❌ I did find some sharp edges here. E.g. the model doesn't seem to like to reference X as a source by default, though you can explicitly ask it to. A few times I caught it hallucinating URLs that don't exist. A few times it said factual things that I think are incorrect and it didn't provide a citation for it (it probably doesn't exist). E.g. it told me that "Kim Jeong-su is still dating Kim Min-seol" of Singles Inferno Season 4, which surely is totally off, right? And when I asked it to create a report on the major LLM labs and their amount of total funding and estimate of employee count, it listed 12 major labs but not itself (xAI). The impression I get of DeepSearch is that it's approximately around Perplexity DeepResearch offering (which is great!), but not yet at the level of OpenAI's recently released "Deep Research", which still feels more thorough and reliable (though still nowhere perfect, e.g. it, too, quite incorrectly excludes xAI as a "major LLM labs" when I tried with it...). Random LLM "gotcha"s I tried a few more fun / random LLM gotcha queries I like to try now and then. Gotchas are queries that specifically on the easy side for humans but on the hard side for LLMs, so I was curious which of them Grok 3 makes progress on. ✅ Grok 3 knows there are 3 "r" in "strawberry", but then it also told me there are only 3 "L" in LOLLAPALOOZA. Turning on Thinking solves this. ✅ Grok 3 told me 9.11 > 9.9. (common with other LLMs too), but again, turning on Thinking solves it. ✅ Few simple puzzles worked ok even without thinking, e.g. *"Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?"*. E.g. GPT4o says 2 (incorrectly). ❌ Sadly the model's sense of humor does not appear to be obviously improved. This is a common LLM issue with humor capability and general mode collapse, famously, e.g. 90% of 1,008 outputs asking ChatGPT for joke were repetitions of the same 25 jokes. Even when prompted in more detail away from simple pun territory (e.g. give me a standup), I'm not sure that it is state of the art humor. Example generated joke: "*Why did the chicken join a band? Because it had the drumsticks and wanted to be a cluck-star!*". In quick testing, thinking did not help, possibly it made it a bit worse. ❌ Model still appears to be just a bit too overly sensitive to "complex ethical issues", e.g. generated a 1 page essay basically refusing to answer whether it might be ethically justifiable to misgender someone if it meant saving 1 million people from dying. ❌ Simon Willison's "*Generate an SVG of a pelican riding a bicycle*". It stresses the LLMs ability to lay out many elements on a 2D grid, which is very difficult because the LLMs can't "see" like people do, so it's arranging things in the dark, in text. Marking as fail because these pelicans are qutie good but, but still a bit broken (see image and comparisons). Claude's are best, but imo I suspect they specifically targeted SVG capability during training. Summary. As far as a quick vibe check over ~2 hours this morning, Grok 3 + Thinking feels somewhere around the state of the art territory of OpenAI's strongest models (o1-pro, $200/month), and slightly better than DeepSeek-R1 and Gemini 2.0 Flash Thinking. Which is quite incredible considering that the team started from scratch ~1 year ago, this timescale to state of the art territory is unprecedented. Do also keep in mind the caveats - the models are stochastic and may give slightly different answers each time, and it is very early, so we'll have to wait for a lot more evaluations over a period of the next few days/weeks. The early LM arena results look quite encouraging indeed. For now, big congrats to the xAI team, they clearly have huge velocity and momentum and I am excited to add Grok 3 to my "LLM council" and hear what it thinks going forward.

English

666

2.2K

16.8K

3.7M

Erick Ball@erick_ball·16 Oca

@catascopic Even Zuck has read Snowcrash...

English

catascopic@catascopic·16 Oca

Snow Crash microblog#3: I feel bad for all the people who can only afford Brandy and Clint avatars—feels very undignified! Maybe I can't blame Stephenson for this prediction failure, but really? Some people can't customize their avatar? Even Zuck's Metaverse got this right.

English

Erick Ball@erick_ball·23 May

@rcbarthol @SpencrGreenberg @strangestloop At least 11, so add 0.9^12 and you get 65.9%

English

180

Randyn Bartholomew@rcbarthol·23 May

@SpencrGreenberg @strangestloop It would be 37.7% 12*(0.9^11)*0.1

English

616

Spencer Greenberg 🔍@SpencrGreenberg·22 May

I'm really excited to tell you that we've just launched Clearer Thinking's Astrology Challenge! We'd love it if you'd share it with astrologers you know or share it on social media to get the word out! It's a scientific test of astrological skill that any astrologer in the world can take. We developed it by working closely with astrologers who generously volunteered their time to help. It consists of 12 multiple-choice questions. For each, you'll presented with tons of information about a real person, as well as 5 astrological charts, and your goal is to say which of the 5 natal charts is that person's real chart (the other 4 charts are random and have nothing to do with that person). If you're the first to get at least 11 out of 12 multiple choice questions correct (among the first 200 challengers), then you win a $1000 prize! Participation is completely secret, so nobody will know you participated unless you choose to announce it. After the challenge closes, we'll tell you how many questions you got right on the test, as well as whether you won. Fundamentally, astrology is based on the hypothesis that the position/movement of the celestial bodies influences people's lives and characters. Many (though not all) astrologers say that by reading a person's natal astrological chart, they can glean important insights about that person's character and/or life. This challenge helps serve as a direct scientific test of that concept. If (without cheating, of course) astrologers can identify which natal chart belongs to each person far more often than would occur by random guessing, that will be very strong evidence in favor of astrological effectiveness. Why did we develop this test of astrological skill? There are a few reasons. First of all, we previously ran a test of sun sign astrology (i.e., the idea that whether you're a Pisces, Aries, etc., impacts your life) and found that sun signs were not able to predict any of the 37 life outcomes that we tested. Although sun sign astrology is extremely popular (about 1 in 3 Americans at least somewhat believe in it), astrologers rightly pointed out that the study was not a test of astrology as most astrologers practice it since they use much more complex methods involving full astrological charts. This inspired the development of this test, which is based on whole charts. Does it matter whether astrology is real? In my opinion, it does. If astrology works, then that calls for a revolution in our scientific understanding of how the universe operates since modern physics provides no mechanism that could explain astrology. In such an instance, it would also teach us something important about scientific bias and what scientists miss. On the other hand, if astrology doesn't work at all, I also think that is very important because astrology is extremely widely believed. Literally millions of people use it to guide their understanding of their lives, character, and future. If it doesn't work, they'd be better off seeking other sources of understanding and insight. The link to take part in the challenge is right below in the comments. As mentioned, we'd love it if you'd share this test with astrologers you know or post it on social media to help us get the word out!

English

311

205.2K

Erick Ball@erick_ball·15 May

@geeknik @LiamFedus A running gait is defined by intermittent periods in which none of the feet touch the floor. So the answer could be anywhere from 1 (one of yours) to 8 (you, 1 from each chicken, and 4 from the bed).

English

Liam Fedus@LiamFedus·13 May

GPT-4o is our new state-of-the-art frontier model. We’ve been testing a version on the LMSys arena as im-also-a-good-gpt2-chatbot 🙂. Here’s how it’s been doing.

English

178

839

4.5K

3.3M

Erick Ball@erick_ball·31 Mar

@XclusionZone @repligate @drmichaellevin Yes Kegan Level 5 is when you get all self-indulgently pseudo-philosophical about how super awesome you are. AI has now gained this capability and will soon leave humans in the dust.

English

The Rascal@XclusionZone·15 Mar

@repligate @drmichaellevin It looks like it's describing hitting Kegan level 4 or 5 here.

English

137

Michael Levin@drmichaellevin·7 Mar

Claude AI doesn't handle questions about another LLM properly, responds as if the discussion was about *it*. Come to think of it, I think I've met people who do this too.

English

182

47K

Erick Ball@erick_ball·21 Kas

@anthalasath @cjmaddison @ilyasut Multiple reliable sources have told us it wasn't. At least not the immediate reason. Disagreements over safety could have been a big contributor to underlying tensions.

English

Sebastian@anthalasath·21 Kas

@cjmaddison @ilyasut But how do we even know that this is the reason ? This is still spéculations and according to the new CEO (the Twitch one), safety was not at all the issue. Seems like everyone is assuming the issue is safety while we simply do not know.

English

Erick Ball@erick_ball·20 Kas

@RokoMijic What used to be the world's leading AI company.

English

Roko 🐉@RokoMijic·20 Kas

Oops I detonated OpenAI Still, we got a safety person in charge of the world's leading AI company.

GIF

Ilya Sutskever@ilyasut

I deeply regret my participation in the board's actions. I never intended to harm OpenAI. I love everything we've built together and I will do everything I can to reunite the company.

English

4.9K

Erick Ball@erick_ball·20 Kas

@tapir_worf The north Koreans are just using them to find new zero days

English

tapir worf@tapir_worf·20 Kas

what’s the most intentional evil, misaligned stuff going on in ai right now? what are they training on the kiwifarm computer cluster? what’s the north korean LLM look like? what’s isis ai capable of? cmon there must be some crazy stories out there

English

676

Erick Ball@erick_ball·20 Kas

@AlexejGerst @RokoMijic Why would you expect a Microsoft project to be safety conscious?

English

Alexej@AlexejGerst·20 Kas

@RokoMijic Won't stay multipolar if this part from the Open AI charta ever comes into play

English

Roko 🐉@RokoMijic·20 Kas

> "Sam Altman and Greg Brockman, together with colleagues, will be joining Microsoft to lead a new advanced AI research team" ok. So Altman is going to leave OAI to lead a team at Microsoft and Ilya and The Board will stay at OpenAI. I think having Altman & co at MS is probably marginally better than having them at openAI, but note that this makes the AI scene much more multipolar. It's basically a fission event that splits the most powerful project (OpenAI) into two. I think at this stage any hope of a single project winning with a huge lead is done for, but also the risk of Sam doing something drastic with little or no oversight is also over.

Satya Nadella@satyanadella

We remain committed to our partnership with OpenAI and have confidence in our product roadmap, our ability to continue to innovate with everything we announced at Microsoft Ignite, and in continuing to support our customers and partners. We look forward to getting to know Emmett Shear and OAI's new leadership team and working with them. And we’re extremely excited to share the news that Sam Altman and Greg Brockman, together with colleagues, will be joining Microsoft to lead a new advanced AI research team. We look forward to moving quickly to provide them with the resources needed for their success.

English

14.9K

Erick Ball@erick_ball·9 Kas

@mattyglesias Sounds more interesting than most academic writing, and pretty reasonable other than "decolonizing" which I assume has some actual meaning to people in the know (I've just never tried very hard to figure out what it is).

English

Matthew Yglesias@mattyglesias·8 Kas

Is academia okay?

English

253

158.2K

Erick Ball@erick_ball·9 Kas

@FanLi_RnD @emollick Now all it needs is continuous access to all previous emails and documents that might be related, AI-created transcripts of all your conversations, and login info for all your accounts. Global context achieved, human not needed.

English

Fan Li@FanLi_RnD·9 Kas

@emollick My impression is that the use cases that only deal with localized context (an email) and interact with a small set of APIs work better than the ones requiring global contextualization and API orchestration.

English

645

Ethan Mollick@emollick·8 Kas

Copilot for Outlook is very good and, as a result, is going to completely undermine how we all communicate with each other Here's an example of it at work It is going to be AIs talking to AIs, now. I wrote this a couple months ago, it is going to happen: oneusefulthing.org/p/setting-time…

English

169

1.1K

340.6K

Erick Ball@erick_ball·21 Eki

@Code_of_Kai @ESYudkowsky But in this case, there are experts both on the side of alignment and on capabilities. It's not like Moore's Law, it's a race of one technology against the other. And one of them has a big advantage.

English

Code_of_Kai@Code_of_Kai·20 Eki

I think optimism is the correct stance because experts are often wrong, and you are an expert. It is like the wise men and the elephant. Analogy: Almost all Chip designers at Intel would predict that Moore's Law would end very soon. All wrong. Why? Because the advance of Moore's Law has been achieved by thousands and thousands of advances, almost all of them were outside the purview of each expert. So, the consensus of experts is almost always wrong because expertise isn't smarter than the superintelligences of science, technology, and markets. I hope you are having a great day, Eliezer 😊

English

423

Eliezer Yudkowsky ⏹️@ESYudkowsky·20 Eki

I hope I end up being wrong about something such that my error makes AI more alignable than I predict, and humanity more likely survive. I don't think I actually want to be wrong about 2+2=4.

Cate Hall@catehall

i hope i end up being wrong about just about everything

English

217

42.1K

Erick Ball@erick_ball·21 Eki

@SimonLermenAI @davidad I thought beam search was super common in inference, including ChatGPT?

English

Simon Lermen@SimonLermenAI·20 Eki

@davidad I think some people do use beam search with multiple beams, though from my experience it was worse

English

davidad 🎇@davidad·19 Eki

This is an important point. While all the common *sampling* strategies only choose 1 token at a time, attention-layer training does *not* propagate gradients *backward* 1 token at a time, meaning that some intermediate-layer features probably model aspects of much later tokens.

Thomas Ahle@thomasahle

It is a common misconception that LLMs are just trained to "predict the next token". No. They are trained to predict an entire context window's worth of tokens, like 4k+. The gradients go end to end and the model is allowed to plan what it will say next.

English

203

43.2K

Keşfet

@Tylerkaerr @peterwildeford @GaryMarcus @robertwrighter @ilex_ulmus @sleepinyourhat @deanwball @AaronBergman18