Aharon Azulay

518 posts

@AharonAzulay

Applied epistemologist. AI researcher

Joined September 2011
542 Following · 104 Followers
Aharon Azulay reposted
AI Security Institute @AISecurityInst
We conducted cyber evaluations of Claude Mythos Preview and found that it is the first model to complete an AISI cyber range end-to-end. 🧵
AI Security Institute tweet media
110 replies · 549 reposts · 3K likes · 1.2M views
Aharon Azulay @AharonAzulay
@julien_c The cynical reasons: 1) They don't have enough compute to serve it given the crazy demand. 2) They want to keep their moat of being the closest to automating AI R&D.
0 replies · 0 reposts · 1 like · 1.3K views
Julien Chaumond @julien_c
“gpt2-large is too powerful to be publicly released” vibes
69 replies · 156 reposts · 4.3K likes · 329.7K views
Anthropic @AnthropicAI
We do not plan to make Mythos Preview generally available. Our goal is to deploy Mythos-class models safely at scale, but first we need safeguards that reliably block their most dangerous outputs. We’ll begin testing those safeguards with an upcoming Claude Opus model.
95 replies · 290 reposts · 3.6K likes · 886.1K views
Anthropic @AnthropicAI
Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software. It’s powered by our newest frontier model, Claude Mythos Preview, which can find software vulnerabilities better than all but the most skilled humans. anthropic.com/glasswing
2K replies · 6.7K reposts · 43.9K likes · 30.7M views
Aharon Azulay @AharonAzulay
@kimmonismus That's what you get when every employee is expected by leadership to fully embrace Claude Code. It also helps to get unlimited Claude Code with unreleased models, a /super-fast internal mode, longer context windows, etc.
0 replies · 0 reposts · 1 like · 101 views
Mahaoo @mahaoo_ASI
I've moved to using gpt 5.4-high for a few days now and didn't feel the urge to move back to opus 4.6. Seems like Anthropic had better release their next model soon if they want to reach their goal of 10x-ing their revenue once again by the end of the year.
1 reply · 0 reposts · 6 likes · 181 views
Aharon Azulay reposted
Photoroom @photoroom_ML
How far can you push diffusion training in 24 hours and $1500? We ran a diffusion speedrun in the next post of our PRX series. 32× H200, 1 day of training. The result is a surprisingly capable text-to-image model. Full recipe and code open sourced 🧵
Photoroom tweet media
8 replies · 22 reposts · 166 likes · 12.4K views
Mahaoo @mahaoo_ASI
unpopular opinion: if you hold beliefs that are technically incorrect or contain a large number of logical inconsistencies, you are not "entitled to your opinion." Opinions that contradict reality or logic should not, in fact, be tolerated.
2 replies · 0 reposts · 4 likes · 94 views
Thariq @trq212
We've rolled out a new auto-memory feature. Claude now remembers what it learns across sessions — your project context, debugging patterns, preferred approaches — and recalls it later without you having to write anything down.
853 replies · 1.1K reposts · 15.8K likes · 3.2M views
Aharon Azulay @AharonAzulay
@karpathy The crazy thing is that these current abilities are achieved with models designed with compute shortages in mind.
0 replies · 0 reposts · 1 like · 22 views
Andrej Karpathy @karpathy
It is hard to communicate how much programming has changed due to AI in the last 2 months: not gradually and over time in the "progress as usual" way, but specifically this last December. There are a number of asterisks but imo coding agents basically didn’t work before December and basically work since - the models have significantly higher quality, long-term coherence and tenacity and they can power through large and long tasks, well past enough that it is extremely disruptive to the default programming workflow.

Just to give an example, over the weekend I was building a local video analysis dashboard for the cameras of my home so I wrote: “Here is the local IP and username/password of my DGX Spark. Log in, set up ssh keys, set up vLLM, download and bench Qwen3-VL, set up a server endpoint to inference videos, a basic web ui dashboard, test everything, set it up with systemd, record memory notes for yourself and write up a markdown report for me”. The agent went off for ~30 minutes, ran into multiple issues, researched solutions online, resolved them one by one, wrote the code, tested it, debugged it, set up the services, and came back with the report and it was just done. I didn’t touch anything. All of this could easily have been a weekend project just 3 months ago but today it’s something you kick off and forget about for 30 minutes.

As a result, programming is becoming unrecognizable. You’re not typing computer code into an editor like the way things were since computers were invented, that era is over. You're spinning up AI agents, giving them tasks *in English* and managing and reviewing their work in parallel. The biggest prize is in figuring out how you can keep ascending the layers of abstraction to set up long-running orchestrator Claws with all of the right tools, memory and instructions that productively manage multiple parallel Code instances for you. The leverage achievable via top tier "agentic engineering" feels very high right now.
It’s not perfect, it needs high-level direction, judgement, taste, oversight, iteration and hints and ideas. It works a lot better in some scenarios than others (e.g. especially for tasks that are well-specified and where you can verify/test functionality). The key is to build intuition to decompose the task just right to hand off the parts that work and help out around the edges. But imo, this is nowhere near "business as usual" time in software.
1.6K replies · 4.7K reposts · 37.2K likes · 5.1M views
Aharon Azulay @AharonAzulay
@DaveShapi Exactly. You can also plot the exponent of overlapping windows and see that the exponent is increasing.
Aharon Azulay tweet media
0 replies · 0 reposts · 0 likes · 63 views
Aharon Azulay @AharonAzulay
@EMostaque Actually, alignment will be a by-product of optimizing multiple different AIs on different utility functions that are all slightly misaligned with humans, but in different ways that keep them in check. This is not dissimilar from the Sam and Dario situation.
0 replies · 0 reposts · 0 likes · 448 views
Emad @EMostaque
If we can’t align humans how we gonna align AI
Emad tweet media
179 replies · 173 reposts · 2.9K likes · 126.5K views
Google @Google
Meet Lyria 3, our latest music generation model from @GoogleDeepMind. 🎶 Now, you can create custom music tracks in the @GeminiApp — just by describing an idea or uploading an image or video.
219 replies · 393 reposts · 2.8K likes · 1.3M views
Aharon Azulay @AharonAzulay
Suffering = passively resisting reality
0 replies · 0 reposts · 0 likes · 23 views
Zvi Mowshowitz @TheZvi
I confirmed with a Google representative that since this was a runtime improvement and they do not believe these performance gains constitute any additional risk, they believe that no safety explanation is required of them. I found that to be a pretty terrible answer.
Nathan Calvin @_NathanCalvin

Did I miss the Gemini 3 Deep Think system card? Given its dramatic jump in capabilities seems nuts if they just didn't do one. There are really bad incentives if companies that do nothing get a free pass while cos that do disclose risks get (appropriate) scrutiny

13 replies · 15 reposts · 340 likes · 65.5K views
Noam Brown @polynoamial
I appreciate @Anthropic's honesty in their latest system card, but the content of it does not give me confidence that the company will act responsibly with deployment of advanced AI models:
- They primarily relied on an internal survey to determine whether Opus 4.6 crossed their autonomous AI R&D-4 threshold (and would thus require stronger safeguards to release under their Responsible Scaling Policy). This wasn't even an external survey of an impartial 3rd party, but rather a survey of Anthropic employees.
- When 5/16 internal survey respondents initially gave an assessment that suggested stronger safeguards might be needed for model release, Anthropic followed up with those employees specifically and asked them to "clarify their views." They do not mention any similar follow-up for the other 11/16 respondents. There is no discussion in the system card of how this may create bias in the survey results.
- Their reason for relying on surveys is that their existing AI R&D evals are saturated. Some might argue that AI progress has been so fast that it's understandable they don't have more advanced quantitative evaluations yet, but we can and should hold AI labs to a high bar. Also, other labs do have advanced AI R&D evals that aren't saturated. For example, OpenAI has the OPQA benchmark which measures AI models' ability to solve real internal problems that OpenAI research teams encountered and that took the team more than a day to solve.
I don't think Opus 4.6 is actually at the level of a remote entry-level AI researcher, and I don't think it's dangerous to release. But the point of a Responsible Scaling Policy is to build institutional muscle and good habits before things do become serious. Internal surveys, especially as Anthropic has administered them, are not a responsible substitute for quantitative evaluations.
Noam Brown tweet media
60 replies · 66 reposts · 952 likes · 189.6K views
Aharon Azulay @AharonAzulay
StackOverClaw: a collective continual learning platform for coding agents @steipete
0 replies · 0 reposts · 2 likes · 28 views
Aharon Azulay @AharonAzulay
Intelligence is the best way to overcome the bottlenecks for achieving more intelligence
0 replies · 0 reposts · 2 likes · 23 views
Aharon Azulay reposted
Photoroom @photoroom_ML
We’re training a text-to-image model (PRX) from scratch and documenting the whole journey here :)) First major milestone: PRX weights are live in 🤗 Diffusers (Apache 2.0) 🎉 PRX is a 1.3B-param flow-matching T2I model, built on a simplified MMDiT backbone with a multilingual text encoder and multiple VAE / resolution variants. We’ll be sharing the full journey here: experiments, design choices, lessons learned, and future releases. Excited to show more soon. Full announcement & demo 👇 huggingface.co/blog/Photoroom… @huggingface @nvidia @NVIDIAGeForceFR @matthieurouif
1 reply · 29 reposts · 197 likes · 22.4K views
Aharon Azulay @AharonAzulay
I feel like a paper titled: “Reasoning Models are Few-Shot Reinforcement Learners” should be a thing
0 replies · 0 reposts · 0 likes · 19 views