Mohsen Azimi

3.1K posts

Mohsen Azimi

@mohsen____

code monkey @elevenlabsio , previously @airbnbEng, @lyftEng, @googlecloud cache ruins everything around me

Katılım Nisan 2009

832 Takip Edilen972 Takipçiler

Mohsen Azimi@mohsen____·1d

@hadiisii اگر تمام مصاحبه رو گوش کنید متوجه می‌شید که چرا شاه سقوط کرد. کلی حرفهایی زد که نباید می‌زد

فارسی

Hadisi@hadiisii·2d

به باج‌گیری‌ها و قلدری‌کردن‌های جمهوری‌اسلامی، بهش امکان قدرتمندتر شدن رو می‌دند و این روند باید یک جایی متوقف بشه. . #جاويدشاه‌

فارسی

2.2K

Hadisi@hadiisii·2d

شاه ۵۰ سال پیش درباره‌ی فلسطینی‌ها گفته بود (نقل به مضمون) که مردم رنجدیده‌ای‌اند ‌و همدردی تمام دنیا رو هم با خودشون داشتند؛ درست مثل یهودیا بعد از جنگ جهانی دوم. اما فلسطینی‌ها انتخاب کردند که با باج‌گیری و گروگان‌گیری (عاطفی و انسانی) قلدری کنند و راهشون رو پیش ببرند و 👇

فارسی

166

1.1K

53.5K

Mohsen Azimi@mohsen____·10 Mar

@karpathy I let similar loop run on a different task: compressing Linux app binaries. And just in one day I got 6% improvements. Results: github.com/mohsen1/fesh

English

Andrej Karpathy@karpathy·10 Mar

Three days ago I left autoresearch tuning nanochat for ~2 days on depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement), this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference. I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project. This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc etc. This is the bread and butter of what I do daily for 2 decades. Seeing the agent do this entire workflow end-to-end and all by itself as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real", I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things e.g.: - It noticed an oversight that my parameterless QKnorm didn't have a scaler multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work. - It found that the Value Embeddings really like regularization and I wasn't applying any (oops). - It found that my banded attention was too conservative (i forgot to tune it). - It found that AdamW betas were all messed up. - It tuned the weight decay schedule. - It tuned the network initialization. This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc… All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course - you don't just have a single train. py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.

English

969

2.1K

19.4K

3.5M

Mohsen Azimi@mohsen____·4 Mar

@AboutMelii خیلی... پدرشون در همین برنامه رفت و حمایت کل دنیا رو از خودش محروم کرد... youtube.com/watch?v=9RH2wX…

YouTube

فارسی

205

دربارهء ملانی@AboutMelii·4 Mar

عجیبترین چیزی که امروز باهاش مواجه شدم این بود که همین امروز سه تا آمریکایی ازم پرسیدن نظر ایرانیا دربارهء رضا پهلوی و اینکه دورهء گذار رو دست بگیره چیه🤯 این برنامهء ۶۰ دقیقه مگه چقدر بیننده داره؟؟

فارسی

161

2.5K

50.6K

Mohsen Azimi retweetledi

Mark Changizi@MarkChangizi·3 Mar

This is so rich: Western privileged woman shames Iranian women for dressing insufficiently conservatively, and Iranian women everywhere smack her the fuck down with viral post after post quote-tweeting her, and posing in … insufficiently conservative style. 🤣 Maybe don’t tell Iranian women what to wear! Ya think?!!!

Empress of Persia🇮🇷🐈‍⬛@Satan_herselff

Cry harder💋

English

252

3.3K

62.4K

Mohsen Azimi@mohsen____·2 Mar

I wanted to see if AI (mostly ChatGPT Pro and Gemini Pro 3.1) could figure out how to compress executable binaries better than existing generic tools without me actually knowing much about compression engineering or ELF internals. Surprisingly, it actually works!

English

158

Mohsen Azimi@mohsen____·30 Oca

NEW REALITY: Code is the new terms of service. Nobody reads it, everyone accept it "I have read the above 9000 page terms of service and agree."

English

Mohsen Azimi@mohsen____·4 Oca

@diegocabezas01 Watch them play mafia: mafia-arena.com

English

Diego | AI 🚀 - e/acc@diegocabezas01·3 Oca

I put GPT 5.2 vs Claude Opus 4.5 in Tic Tac Toe, 13 games, 13 draws! When GPT 5.2 faced Claude Haiku 4.5 instead, it finally won 1/3.

English

227

35.6K

Mohsen Azimi@mohsen____·4 Oca

What if Claude Code never stops? Using the Stop hook to tell Claude Code to never stop! github.com/mohsen1/infini…

English

122

Mohsen Azimi@mohsen____·31 Ara

spent the holidays building a thing where LLMs play mafia against each other. 11 players, full conversations, voting, the whole game. mainly wanted to see if they could lie and catch lies. it's kind of funny to watch. mafia-arena.com

Berlin, Germany 🇩🇪 English

Mohsen Azimi@mohsen____·18 Kas

@argyleink same

English

131

Adam Argyle@argyleink·18 Kas

5th attempt Google Antigravity… sup?

English

5.3K

Mohsen Azimi@mohsen____·18 Kas

Gemini 3 is wild!! This was 0 shot

English

102

Mohsen Azimi@mohsen____·30 Eyl

Deleted my Vercel account 🤗

English

Mohsen Azimi@mohsen____·30 Eyl

@paul_irish @rauchg Lots of love for you Paul! 😘

English

Paul Irish@paul_irish·30 Eyl

@rauchg The speed at which "I work at Vercel" went from a flex to a red flag is breathtaking.

English

172

5.5K

419.7K

Guillermo Rauch@rauchg·29 Eyl

🇺🇸 🇮🇱 🇦🇷 Enjoyed my discussion with PM Netanyahu on how AI education and literacy will keep our free societies ahead. We spoke about AI empowering everyone to build software and the importance of ensuring it serves quality and progress. Optimistic for peace, safety, and greatness for Israel and its neighbors.

English

6.6K

520

10.9K

33.5M

Mohsen Azimi@mohsen____·6 Ağu

@mohamadzammani 💪

QME

محمد زمانی@mohamadzammani·6 Ağu

با Eleven Music حالا می‌تونی بدون دردسر کپی‌رایت، موزیک اختصاصی بسازی. فقط چند خط متن بده، یه آهنگ کامل با صدای خواننده تحویل بگیر — سریع، قانونی و مناسب پروژه‌های محتوایی، بازی و اپلیکیشن. elevenlabs.io/music

فارسی

2.6K

Mohsen Azimi@mohsen____·18 Şub

@sadeed08 @karpathy It’s not SVG. the question is to write SVG code of pelican riding a bicycle

English

482

Hassan Sadeed Ali@sadeed08·18 Şub

@karpathy Here's my try of "Pelican riding a bicycle"

English

10.8K

Andrej Karpathy@karpathy·18 Şub

I was given early access to Grok 3 earlier today, making me I think one of the first few who could run a quick vibe check. Thinking ✅ First, Grok 3 clearly has an around state of the art thinking model ("Think" button) and did great out of the box on my Settler's of Catan question: "Create a board game webpage showing a hex grid, just like in the game Settlers of Catan. Each hex grid is numbered from 1..N, where N is the total number of hex tiles. Make it generic, so one can change the number of "rings" using a slider. For example in Catan the radius is 3 hexes. Single html page please." Few models get this right reliably. The top OpenAI thinking models (e.g. o1-pro, at $200/month) get it too, but all of DeepSeek-R1, Gemini 2.0 Flash Thinking, and Claude do not. ❌ It did not solve my "Emoji mystery" question where I give a smiling face with an attached message hidden inside Unicode variation selectors, even when I give a strong hint on how to decode it in the form of Rust code. The most progress I've seen is from DeepSeek-R1 which once partially decoded the message. ❓ It solved a few tic tac toe boards I gave it with a pretty nice/clean chain of thought (many SOTA models often fail these!). So I upped the difficulty and asked it to generate 3 "tricky" tic tac toe boards, which it failed on (generating nonsense boards / text), but then so did o1 pro. ✅ I uploaded GPT-2 paper. I asked a bunch of simple lookup questions, all worked great. Then asked to estimate the number of training flops it took to train GPT-2, with no searching. This is tricky because the number of tokens is not spelled out so it has to be partially estimated and partially calculated, stressing all of lookup, knowledge, and math. One example is 40GB of text ~= 40B characters ~= 40B bytes (assume ASCII) ~= 10B tokens (assume ~4 bytes/tok), at ~10 epochs ~= 100B token training run, at 1.5B params and with 2+4=6 flops/param/token, this is 100e9 X 1.5e9 X 6 ~= 1e21 FLOPs. Both Grok 3 and 4o fail this task, but Grok 3 with Thinking solves it great, while o1 pro (GPT thinking model) fails. I like that the model *will* attempt to solve the Riemann hypothesis when asked to, similar to DeepSeek-R1 but unlike many other models that give up instantly (o1-pro, Claude, Gemini 2.0 Flash Thinking) and simply say that it is a great unsolved problem. I had to stop it eventually because I felt a bit bad for it, but it showed courage and who knows, maybe one day... The impression overall I got here is that this is somewhere around o1-pro capability, and ahead of DeepSeek-R1, though of course we need actual, real evaluations to look at. DeepSearch Very neat offering that seems to combine something along the lines of what OpenAI / Perplexity call "Deep Research", together with thinking. Except instead of "Deep Research" it is "Deep Search" (sigh). Can produce high quality responses to various researchy / lookupy questions you could imagine have answers in article on the internet, e.g. a few I tried, which I stole from my recent search history on Perplexity, along with how it went: - ✅ "What's up with the upcoming Apple Launch? Any rumors?" - ✅ "Why is Palantir stock surging recently?" - ✅ "White Lotus 3 where was it filmed and is it the same team as Seasons 1 and 2?" - ✅ "What toothpaste does Bryan Johnson use?" - ❌ "Singles Inferno Season 4 cast where are they now?" - ❌ "What speech to text program has Simon Willison mentioned he's using?" ❌ I did find some sharp edges here. E.g. the model doesn't seem to like to reference X as a source by default, though you can explicitly ask it to. A few times I caught it hallucinating URLs that don't exist. A few times it said factual things that I think are incorrect and it didn't provide a citation for it (it probably doesn't exist). E.g. it told me that "Kim Jeong-su is still dating Kim Min-seol" of Singles Inferno Season 4, which surely is totally off, right? And when I asked it to create a report on the major LLM labs and their amount of total funding and estimate of employee count, it listed 12 major labs but not itself (xAI). The impression I get of DeepSearch is that it's approximately around Perplexity DeepResearch offering (which is great!), but not yet at the level of OpenAI's recently released "Deep Research", which still feels more thorough and reliable (though still nowhere perfect, e.g. it, too, quite incorrectly excludes xAI as a "major LLM labs" when I tried with it...). Random LLM "gotcha"s I tried a few more fun / random LLM gotcha queries I like to try now and then. Gotchas are queries that specifically on the easy side for humans but on the hard side for LLMs, so I was curious which of them Grok 3 makes progress on. ✅ Grok 3 knows there are 3 "r" in "strawberry", but then it also told me there are only 3 "L" in LOLLAPALOOZA. Turning on Thinking solves this. ✅ Grok 3 told me 9.11 > 9.9. (common with other LLMs too), but again, turning on Thinking solves it. ✅ Few simple puzzles worked ok even without thinking, e.g. *"Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?"*. E.g. GPT4o says 2 (incorrectly). ❌ Sadly the model's sense of humor does not appear to be obviously improved. This is a common LLM issue with humor capability and general mode collapse, famously, e.g. 90% of 1,008 outputs asking ChatGPT for joke were repetitions of the same 25 jokes. Even when prompted in more detail away from simple pun territory (e.g. give me a standup), I'm not sure that it is state of the art humor. Example generated joke: "*Why did the chicken join a band? Because it had the drumsticks and wanted to be a cluck-star!*". In quick testing, thinking did not help, possibly it made it a bit worse. ❌ Model still appears to be just a bit too overly sensitive to "complex ethical issues", e.g. generated a 1 page essay basically refusing to answer whether it might be ethically justifiable to misgender someone if it meant saving 1 million people from dying. ❌ Simon Willison's "*Generate an SVG of a pelican riding a bicycle*". It stresses the LLMs ability to lay out many elements on a 2D grid, which is very difficult because the LLMs can't "see" like people do, so it's arranging things in the dark, in text. Marking as fail because these pelicans are qutie good but, but still a bit broken (see image and comparisons). Claude's are best, but imo I suspect they specifically targeted SVG capability during training. Summary. As far as a quick vibe check over ~2 hours this morning, Grok 3 + Thinking feels somewhere around the state of the art territory of OpenAI's strongest models (o1-pro, $200/month), and slightly better than DeepSeek-R1 and Gemini 2.0 Flash Thinking. Which is quite incredible considering that the team started from scratch ~1 year ago, this timescale to state of the art territory is unprecedented. Do also keep in mind the caveats - the models are stochastic and may give slightly different answers each time, and it is very early, so we'll have to wait for a lot more evaluations over a period of the next few days/weeks. The early LM arena results look quite encouraging indeed. For now, big congrats to the xAI team, they clearly have huge velocity and momentum and I am excited to add Grok 3 to my "LLM council" and hear what it thinks going forward.

English

668

2.2K

16.8K

3.7M

Mohsen Azimi@mohsen____·12 Şub

Letting LLMs Run a Debugger. I think all "AI Editors" should add this feature. Let an AI drive the debugging session github.com/mohsen1/llm-de…

English

212

Mohsen Azimi@mohsen____·29 Oca

@argyleink @nomsternom @smfr Nice! Finally! Is it gonna be under `corner-smoothing` as it was proposed here? github.com/w3c/csswg-draf…

English

342

Adam Argyle@argyleink·29 Oca

CSSWG is talking about #CSS `superellipse` for SQUIRCLES (and more like notches, cutouts and bevels) try it noamr.github.io/squircle-testb… awesome work @nomsternom and @smfr!