Aharon Azulay

518 posts

@AharonAzulay

Applied epistemologist. AI researcher

Joined September 2011
542 Following · 104 Followers
Aharon Azulay reposted
AI Security Institute @AISecurityInst
We conducted cyber evaluations of Claude Mythos Preview and found that it is the first model to complete an AISI cyber range end-to-end. 🧵
AI Security Institute tweet media
110 replies · 549 reposts · 3K likes · 1.2M views
Aharon Azulay @AharonAzulay
@julien_c The cynical reasons: 1) They don't have enough compute to serve it given the crazy demand. 2) They want to keep their moat of being the closest to automating AI R&D.
0 replies · 0 reposts · 1 like · 1.3K views
Julien Chaumond @julien_c
“gpt2-large is too powerful to be publicly released” vibes
69 replies · 156 reposts · 4.3K likes · 329.7K views
Anthropic @AnthropicAI
We do not plan to make Mythos Preview generally available. Our goal is to deploy Mythos-class models safely at scale, but first we need safeguards that reliably block their most dangerous outputs. We’ll begin testing those safeguards with an upcoming Claude Opus model.
95 replies · 290 reposts · 3.6K likes · 886.1K views
Anthropic @AnthropicAI
Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software. It’s powered by our newest frontier model, Claude Mythos Preview, which can find software vulnerabilities better than all but the most skilled humans. anthropic.com/glasswing
2K replies · 6.7K reposts · 43.9K likes · 30.7M views
Aharon Azulay @AharonAzulay
@kimmonismus That's what you get when every employee is expected by leadership to fully embrace Claude Code. It also helps to get unlimited Claude Code with unreleased models, a /super-fast internal mode, longer context windows, etc.
0 replies · 0 reposts · 1 like · 101 views
Mahaoo @mahaoo_ASI
I've moved to using gpt 5.4-high for a few days now and didn't feel the urge to move back to opus 4.6. Seems like Anthropic had better release their next model soon if they want to reach their goal of 10x-ing their revenue once again by the end of the year.
1 reply · 0 reposts · 6 likes · 181 views
Aharon Azulay reposted
Photoroom @photoroom_ML
How far can you push diffusion training in 24 hours and $1500? We ran a diffusion speedrun in the next post of our PRX series. 32× H200, 1 day of training. The result is a surprisingly capable text-to-image model. Full recipe and code open sourced 🧵
Photoroom tweet media
8 replies · 22 reposts · 166 likes · 12.4K views
Mahaoo @mahaoo_ASI
unpopular opinion: if you hold beliefs that are technically incorrect or contain a large number of logical inconsistencies, you are not "entitled to your opinion." Opinions that contradict reality or logic should not, in fact, be tolerated.
2 replies · 0 reposts · 4 likes · 94 views
Thariq @trq212
We've rolled out a new auto-memory feature. Claude now remembers what it learns across sessions — your project context, debugging patterns, preferred approaches — and recalls it later without you having to write anything down.
853 replies · 1.1K reposts · 15.8K likes · 3.2M views
Aharon Azulay @AharonAzulay
@karpathy The crazy thing is that these current abilities are achieved with models designed with compute shortages in mind.
0 replies · 0 reposts · 1 like · 22 views
Andrej Karpathy @karpathy
It is hard to communicate how much programming has changed due to AI in the last 2 months: not gradually and over time in the "progress as usual" way, but specifically this last December. There are a number of asterisks but imo coding agents basically didn’t work before December and basically work since - the models have significantly higher quality, long-term coherence and tenacity and they can power through large and long tasks, well past enough that it is extremely disruptive to the default programming workflow.

Just to give an example, over the weekend I was building a local video analysis dashboard for the cameras of my home so I wrote: “Here is the local IP and username/password of my DGX Spark. Log in, set up ssh keys, set up vLLM, download and bench Qwen3-VL, set up a server endpoint to inference videos, a basic web ui dashboard, test everything, set it up with systemd, record memory notes for yourself and write up a markdown report for me”. The agent went off for ~30 minutes, ran into multiple issues, researched solutions online, resolved them one by one, wrote the code, tested it, debugged it, set up the services, and came back with the report and it was just done. I didn’t touch anything. All of this could easily have been a weekend project just 3 months ago but today it’s something you kick off and forget about for 30 minutes.

As a result, programming is becoming unrecognizable. You’re not typing computer code into an editor like the way things were since computers were invented, that era is over. You're spinning up AI agents, giving them tasks *in English* and managing and reviewing their work in parallel. The biggest prize is in figuring out how you can keep ascending the layers of abstraction to set up long-running orchestrator Claws with all of the right tools, memory and instructions that productively manage multiple parallel Code instances for you. The leverage achievable via top tier "agentic engineering" feels very high right now.
It’s not perfect, it needs high-level direction, judgement, taste, oversight, iteration and hints and ideas. It works a lot better in some scenarios than others (e.g. especially for tasks that are well-specified and where you can verify/test functionality). The key is to build intuition to decompose the task just right to hand off the parts that work and help out around the edges. But imo, this is nowhere near "business as usual" time in software.
1.6K replies · 4.7K reposts · 37.2K likes · 5.1M views
Aharon Azulay @AharonAzulay
@DaveShapi Exactly. You can also plot the exponent of overlapping windows and see that the exponent is increasing.
Aharon Azulay tweet media
0 replies · 0 reposts · 0 likes · 63 views
Aharon Azulay @AharonAzulay
@EMostaque Actually, alignment will be a by-product of optimizing multiple different AIs on different utility functions that are all slightly misaligned with humans, but in different ways that keep them in check. This is not dissimilar from the Sam and Dario situation.
0 replies · 0 reposts · 0 likes · 448 views
Emad @EMostaque
If we can’t align humans how we gonna align AI
Emad tweet media
179 replies · 173 reposts · 2.9K likes · 126.5K views
Google @Google
Meet Lyria 3, our latest music generation model from @GoogleDeepMind. 🎶 Now, you can create custom music tracks in the @GeminiApp — just by describing an idea or uploading an image or video.
219 replies · 393 reposts · 2.8K likes · 1.3M views
Aharon Azulay @AharonAzulay
Suffering = passively resisting reality
0 replies · 0 reposts · 0 likes · 23 views
Zvi Mowshowitz @TheZvi
I confirmed with a Google representative that since this was a runtime improvement and they do not believe these performance gains constitute any additional risk, they believe that no safety explanation is required of them. I found that to be a pretty terrible answer.
Nathan Calvin @_NathanCalvin

Did I miss the Gemini 3 Deep Think system card? Given its dramatic jump in capabilities seems nuts if they just didn't do one. There are really bad incentives if companies that do nothing get a free pass while cos that do disclose risks get (appropriate) scrutiny

13 replies · 15 reposts · 340 likes · 65.5K views
Noam Brown @polynoamial
I appreciate @Anthropic's honesty in their latest system card, but the content of it does not give me confidence that the company will act responsibly with deployment of advanced AI models:
- They primarily relied on an internal survey to determine whether Opus 4.6 crossed their autonomous AI R&D-4 threshold (and would thus require stronger safeguards to release under their Responsible Scaling Policy). This wasn't even an external survey of an impartial 3rd party, but rather a survey of Anthropic employees.
- When 5/16 internal survey respondents initially gave an assessment that suggested stronger safeguards might be needed for model release, Anthropic followed up with those employees specifically and asked them to "clarify their views." They do not mention any similar follow-up for the other 11/16 respondents. There is no discussion in the system card of how this may create bias in the survey results.
- Their reason for relying on surveys is that their existing AI R&D evals are saturated. Some might argue that AI progress has been so fast that it's understandable they don't have more advanced quantitative evaluations yet, but we can and should hold AI labs to a high bar. Also, other labs do have advanced AI R&D evals that aren't saturated. For example, OpenAI has the OPQA benchmark which measures AI models' ability to solve real internal problems that OpenAI research teams encountered and that took the team more than a day to solve.
I don't think Opus 4.6 is actually at the level of a remote entry-level AI researcher, and I don't think it's dangerous to release. But the point of a Responsible Scaling Policy is to build institutional muscle and good habits before things do become serious. Internal surveys, especially as Anthropic has administered them, are not a responsible substitute for quantitative evaluations.
Noam Brown tweet media
60 replies · 66 reposts · 952 likes · 189.6K views
Aharon Azulay @AharonAzulay
StackOverClaw: a collective continual learning platform for coding agents @steipete
0 replies · 0 reposts · 2 likes · 28 views
Aharon Azulay @AharonAzulay
Intelligence is the best way to overcome the bottlenecks for achieving more intelligence
0 replies · 0 reposts · 2 likes · 23 views
Aharon Azulay reposted
Photoroom @photoroom_ML
We’re training a text-to-image model (PRX) from scratch and documenting the whole journey here :)) First major milestone: PRX weights are live in 🤗 Diffusers (Apache 2.0) 🎉 PRX is a 1.3B-param flow-matching T2I model, built on a simplified MMDiT backbone with a multilingual text encoder and multiple VAE / resolution variants. We’ll be sharing the full journey here: experiments, design choices, lessons learned, and future releases. Excited to show more soon. Full announcement & demo 👇 huggingface.co/blog/Photoroom… @huggingface @nvidia @NVIDIAGeForceFR @matthieurouif
1 reply · 29 reposts · 197 likes · 22.4K views
Aharon Azulay @AharonAzulay
I feel like a paper titled: “Reasoning Models are Few-Shot Reinforcement Learners” should be a thing
0 replies · 0 reposts · 0 likes · 19 views