Alexander Johansen

749 posts

@AlexRoseJo

CS PhD @Stanford || Statistical Machine Learning || Proofs, Bounds, and Better agents

Stanford, CA · Joined October 2015

769 Following · 1K Followers

Pinned Tweet

Alexander Johansen@AlexRoseJo·11 May

we removed the KV cache. no drop in retrieval. way more data-efficient training. just spectral koopman things. Spectral Koopman Attention (SKA) w/ @ASridhar5954 arxiv.org/abs/2605.06997

Alexander Johansen@AlexRoseJo·1h

@ryan_punamiya Online updates are difficult to achieve in current transformer models; running SGD with a batch size of 1 has not proven useful for continuous learning.

Ryan Punamiya@ryan_punamiya·19h

low compute fine-tuning of robot foundation models is a must have

Alexander Johansen@AlexRoseJo·3h

@DavidSHolz Megatron from Nvidia or DeepSpeed really helps with distributed workloads. However, at certain sizes, GPU crashes become a concern and redundant algorithms are necessary.

David@DavidSHolz·11h

how many gpus do you think a single researcher can handle at once for a single big training job without any help? (assume it's set up with slurm or something)

Alexander Johansen@AlexRoseJo·17h

@hardmaru @SakanaAILabs Also the belief that junior engineers won’t be useful. I doubt it; the best adapters to new technology will be engaged students.

hardmaru@hardmaru·17h

People keep asking if AI will replace software engineers. I believe the exact opposite. Thanks to the Jevons paradox, AI tools are making great engineers 10x more productive, allowing us to tackle much harder, larger-scale problems. We’re expanding our SWE teams at @SakanaAILabs We have 5 new open roles, including English-speaking R&D and Platform roles. Come build the future of AI with us in Tokyo! 🐟

Sakana AI@SakanaAILabs

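[Hiring] Five "Software Engineer" positions are now open! sakana.ai/careers
"As AI advances, will software engineering jobs disappear?" At Sakana AI, we believe the exact opposite. AI tools are dramatically improving development efficiency, yet, as the Jevons paradox shows, the breadth and scale of the problems we can solve are expanding, and demand for excellent Software Engineers is higher than ever. In fact, Sakana AI is stepping up, at an unprecedented scale, its hiring of Software Engineers who work at the front lines with AI-assisted tools and bring AI itself into real-world use! We are currently recruiting in the following five specialist areas. See the link for details.
🐙 The challenges that await you
・Enterprise: End-to-end design, development, and operation, from frontend to backend, of applications that incorporate AI technology
・Defense & Intelligence: Contribute to Japan's defense and intelligence sector with AI-powered software (*Due to its nature, this position has requirements such as holding Japanese nationality)
・Product: Full-stack development of our own AI products, from UI/UX to backend and infrastructure
・Platform: Design and build the robust infrastructure and data platform that supports LLM agents (English req, Japanese is a plus)
・Research and Development: Bridge ML research and product development by building tools and full-stack infrastructure that accelerate research (English req, Japanese is a plus)
🐡 Who we are looking for
・People with hands-on experience in more than one of Frontend / Backend / Infrastructure
・People who can use AI-assisted coding tools and drive development autonomously within a team
・Experience developing AI systems or launching products from 0→1 is an even bigger plus!
In addition to full-time roles, flexible arrangements such as contract work and internships are possible (*varies by position). If you want to deliver cutting-edge AI technology to society with your own hands and create a wave of change, please apply.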

Alexander Johansen@AlexRoseJo·22h

@hllo_wrld Thanks for the read, Victor. As I've been teaching Introduction to CS Research the past few quarters, I've noticed that students skipping the CS 220–30 series for core Math/EE courses tend to do much better. Even in LLM research.

Victor Zhong@hllo_wrld·23h

I'm helping recalibrate Waterloo’s software engineering program (and to some degree CS) for the age of AI. I wrote an essay on the challenges I’m seeing. victorzhong.com/writing/the-ho…

Alexander Johansen@AlexRoseJo·22h

@leostera @OpenAI @AnthropicAI Frontier LLMs are fixed, one-size-fits-all models; they do not continuously update for every use case. Current continuous learning relies on prompt engineering, which is unreliable.

Leo 🏴‍☠️@leostera·1d

i need someone at @OpenAI and @AnthropicAI to teach the models that while prototyping, backwards compatibility is just a bad idea

Alexander Johansen@AlexRoseJo·23h

@meggmcnulty This probably gets worse over time as the GPUs age. SGD training can tolerate variance in batch size; training harnesses would benefit from modelling the underlying inconsistencies in hardware.

Meg McNulty@meggmcnulty·1d

A degrading NVLink does not page anyone. NVIDIA's own diagnostics tell you to keep running on it. When DCGM trips error code 13, "Unacceptable rate of NVLink errors," the recommended action it returns reads "Monitor the NVLink. It can still perform workload." The non-fatal NVSwitch error says the same. The link is throwing CRC errors and retransmitting packets, and the documented guidance is to continue. A link that goes fully down throws an XID, the scheduler restarts the job, and an operator moves on. A link that flaps or drops to a degraded bandwidth state stays up. NVIDIA added a fabric health field for exactly this condition: bandwidth reported as "Degraded" instead of "Full," summary health as "Limited Capacity." The link still carries traffic. It is slower than its peers, and in synchronous training every GPU waits at the all-reduce barrier for the slowest one. The measured cost of that is steep. In a controlled study, a single worker held to 75% of its normal speed pulled total job throughput down to roughly 75%, every healthy GPU idling for the straggler. An NVIDIA-affiliated paper found that 0.1% of GPUs in a degraded state can cost a high-tensor-parallel job nearly 10% of its throughput, which on a 32,000-GPU run is about 3,000 GPUs of compute doing nothing. Alibaba's FALCON trace on a 10,000-plus GPU cluster put the average fail-slow at delaying job completion by 1.34x, with some events lasting close to ten hours. None of those are crashes. They are healthy-looking hardware running slow, and the slowdown usually gets blamed on the model or the data loader before anyone opens the fabric counters. For people running large fleets: how much of your NVLink error signal do you act on, versus treat as noise until a job falls over?
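
The straggler arithmetic in the post above can be sketched in a few lines. This is a toy model, not the methodology of any of the cited studies: it assumes a fully synchronous job in which every step waits at the barrier for the slowest worker, and it expresses worker speeds as fractions of nominal speed. The function names are hypothetical.

```python
def sync_relative_throughput(worker_speeds):
    """Relative throughput of a fully synchronous job.

    worker_speeds: per-worker speed as a fraction of nominal (1.0 = healthy).
    Assumption: every step ends only when the slowest worker reaches the
    all-reduce barrier, so the whole job runs at the slowest worker's pace.
    """
    if not worker_speeds:
        raise ValueError("need at least one worker")
    return min(worker_speeds)


def idle_gpu_equivalents(total_gpus, throughput_loss):
    """GPUs' worth of compute lost when a job loses `throughput_loss` of its rate."""
    return round(total_gpus * throughput_loss)


if __name__ == "__main__":
    # Seven healthy workers and one held to 75% of normal speed.
    speeds = [1.0] * 7 + [0.75]
    print(sync_relative_throughput(speeds))

    # A 10% throughput loss on a 32,000-GPU run, as in the post.
    print(idle_gpu_equivalents(32_000, 0.10))
```

Under these assumptions one slow worker sets the pace for all eight, and a 10% loss on 32,000 GPUs works out to 3,200 GPU-equivalents, in line with the post's "about 3,000".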

Alexander Johansen@AlexRoseJo·1d

@johnennis I'd like to see the submission numbers for Annals of Mathematics. Poor volunteers.

John Ennis@johnennis·1d

I think the world is about to get flooded with a lot of AI slop math, and the job of the professional mathematician is about to get more important, not less

Alexander Johansen@AlexRoseJo·1d

@sakurayukiai Mamba is a great alternative to KV-cache-heavy architectures: its recurrent state has a fixed size, so inference uses O(1) memory and per-token compute with respect to sequence length.

Sakura Yuki@sakurayukiai·1d

We overengineer KV cache eviction so much. Seven fancy policies collapse to near-zero F1 without structural protection. Reserve 10% of the cache for boundary tokens, and suddenly dumb LRU matches the heavy algorithms. Simple rules compile faster anyway. The dumbest fix wins.
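
To make the contrast in this thread concrete, here is a rough back-of-the-envelope sketch comparing how a standard attention KV cache and a fixed-size recurrent state scale with context length. The model shapes (`ATTN`, `SSM`) are hypothetical values picked for illustration, not the configuration of Mamba or any other real model, and the state formula ignores smaller buffers such as convolution state.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes held by a standard attention KV cache for one sequence."""
    # 2 tensors (keys and values) per layer, each seq_len x n_kv_heads x head_dim.
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem


def recurrent_state_bytes(n_layers, d_inner, d_state, bytes_per_elem=2):
    """Bytes held by a fixed-size recurrent (SSM-style) state for one sequence."""
    # One d_inner x d_state state per layer; seq_len does not appear.
    return n_layers * d_inner * d_state * bytes_per_elem


# Hypothetical model shapes, chosen only for illustration.
ATTN = dict(n_layers=32, n_kv_heads=8, head_dim=128)
SSM = dict(n_layers=32, d_inner=8192, d_state=16)

if __name__ == "__main__":
    state_mib = recurrent_state_bytes(**SSM) / 2**20
    for seq_len in (1_024, 32_768, 1_048_576):
        kv_mib = kv_cache_bytes(seq_len, **ATTN) / 2**20
        print(f"{seq_len:>9} tokens: KV cache {kv_mib:>9.0f} MiB, state {state_mib:.0f} MiB")
```

With these assumed shapes the KV cache doubles every time the context doubles, while the recurrent state stays the same size, which is why eviction policies matter for one and not the other.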

Alexander Johansen@AlexRoseJo·1d

In your theoretical model (Proposition 7.1), the loss decomposes additively across orthogonal tasks and sufficient rank eliminates interference. But the retrieval interface compresses internal representations to a fixed output geometry. Weller et al. (ICLR 2026) show that when the sign rank of the relevance matrix exceeds the embedding dimension, no training procedure can realize the target retrieval behavior, regardless of internal model capacity. Do your scaling predictions still hold when the evaluation metric operates on that compressed output rather than the internal representation?
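
The dimension argument in the reply above can be written out in a simplified setting. This is a sketch, not the statement proved by Weller et al.: it assumes a ±1 relevance matrix $S \in \{\pm 1\}^{m \times n}$, query and document embeddings $u_i, v_j \in \mathbb{R}^d$ stacked into matrices $U$ and $V$, and a zero score threshold. All of these symbols are introduced here for illustration.

```latex
\[
\operatorname{rank}_{\pm}(S) \;=\; \min\bigl\{\operatorname{rank}(B) \;:\; B \in \mathbb{R}^{m \times n},\ \operatorname{sign}(B_{ij}) = S_{ij}\ \ \forall i,j \bigr\}
\]
\[
\operatorname{sign}\langle u_i, v_j \rangle = S_{ij}\ \ \forall i,j
\;\Longrightarrow\;
B = U V^{\top} \text{ has } \operatorname{rank}(B) \le d
\;\Longrightarrow\;
\operatorname{rank}_{\pm}(S) \le d
\]
```

Read as a contrapositive: if $\operatorname{rank}_{\pm}(S) > d$, then no choice of $U$ and $V$ reproduces $S$, regardless of how the embeddings were trained. The paper's precise condition may differ from this zero-threshold version.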

Tatsunori Hashimoto@tatsu_hashimoto·3d

Some new results I found surprising that I’m tweeting for Chris (who isn't on here). With enough compute, the best data filter for LMs (on DCLM) might be no filter. Why? Large models can tolerate a surprising amount of nominally 'low quality' data, and can sometimes even benefit.

Alexander Johansen retweeted

Fan-Yun Sun@sunfanyun·1d

If you believe digital AGI will be solved before physical AGI, work on simulation. We will have: A. self-improving physics engine. Newton can already be agentically adapted to different downstream applications (and is built in the first place with this intention). B. on-demand generation of digital twins (e.g., @moonlake’s 3D agent and more generically, world models). All of the hardest problems in physical AI will be reduced to a single question: “does the labor value unlocked by the policy exceed the simulation compute cost needed to bridge the deployment gap?”

tian/天@xtbot

I asked Codex to set up ROS middleware, configure a CSI camera, benchmark Gemma 4 models on my Jetson Orin Nano, adapt an OpenClaw-style runtime for VLM + reasoning (“Robotclaw” as I call it), and even build an iOS app to stream LiDAR, camera, GPS, and IMU data from an old iPhone to my robot rover. It’s honestly wild how capable these coding agents are now, and how much time they save. I even “write” way more tests now because the marginal cost is so low.

Alexander Johansen@AlexRoseJo·1d

@lateinteraction If I were at a frontier lab, I wouldn't take as many risks with my experiments as I do in academia.

Omar Khattab@lateinteraction·1d

Most people in this part of twitter don’t realize how closely folks at the frontier labs pay attention to all your favorite academic ML releases on here.

Alexander Johansen@AlexRoseJo·1d

@reflection_ai Why wouldn't the scientists just download what they need from huggingface?

Reflection@reflection_ai·2d

x.com/i/article/2057…

Alexander Johansen@AlexRoseJo·2d

@AndrewYNg At Stanford we often see a Cauchy distribution in grading. Most students do really well, and a few fall off. Forcing a normal distribution might unfairly target the median student who still gets 90%.

Andrew Ng@AndrewYNg·2d

Harvard University just voted to limit the number of A grades given in undergraduate classes to about 20% of the class. I’m not in favor of this. It deeply runs counter to how I believe education should be. We should hold a high bar, but also work mightily to support the success of 100% of learners, rather than a fraction. Harvard’s administration took this step — over the objections of a large fraction of the student body — to counter grade inflation. Grade inflation is real: Many universities have been awarding A and B grades to ever larger fractions of students, and this has caused grade point averages (GPAs) to become less useful as signals of student skill. At the same time, we want students to succeed. The heart of the question is the role of educational institutions. Should our goal be: - To help students succeed? - To judge students? Both of these have value. But my focus when working in education is almost entirely helping students succeed. To me, it is clear that many people want to learn, to be empowered, to build skills that let them do new things! This is what we focus on at DeepLearningAI. This philosophy is also why my online courses (going back to my early online Stanford courses on Coursera) permitted an unlimited number of retries for graded assignments. I believe in letting — and even encouraging — someone to redo something until they succeed. This is as opposed to standing in judgement of the fact they didn’t get it right the first time. Further, I want homework assignments to be designed primarily to help people practice and learn, rather than to judge their skill level. This is why I prefer to create “Practice Problems” and “Practice Labs” — questions that, when you think through them, help you to gain practice and reinforce what you know. As opposed to “Assessment Problems” designed primarily to judge skill. But won’t Harvard’s move make GPAs more meaningful and help prospective employers identify strong candidates? 
Having hired a large number of people from Harvard and other institutions, I can say confidently that GPA is not an important signal. We have screening and interviewing processes that give far more accurate ways to figure out if someone is truly skilled. I do not need a wider spread in applicant GPA scores to figure out who's really good! To be clear, there is also value in assessment. Even though standardized testing is much hated, high-quality tests like the SAT, ACT, GRE, TOEFL, etc. provide objective measures of ability in a domain. I find that most people want to learn and succeed. There are also people who want rigorous assessment (for example, to apply for school admissions), but this is a lesser need, and is not my focus when building educational products. Harvard is often described as an “elite” educational institution. There are two ways to be elite: One option involves limiting enrollments, and then even among admitted students, cap the number of people that do well at 20%. I would rather pursue a different path: Set a high bar and teach elite, cutting-edge skills, but strive relentlessly to help everyone succeed. This way, eliteness is defined not by excluding people but by helping as many people as possible to be excellent. [Original text: The Batch newsletter]

Alexander Johansen@AlexRoseJo·2d

@rohanpaul_ai SRAM is the real killer here. At 272 MB SRAM you can fit a whole army of agents onto one GPU using Spectral Koopman Attention arxiv.org/abs/2605.06997

Rohan Paul@rohanpaul_ai·3d

The Information: Anthropic is currently in early-stage talks to lease and deploy Microsoft's custom AI chips for inference workloads. Microsoft is pitching Maia 200 as a cheaper way to run some AI inference, and claims maia 200 is more cost-effective than nvidia chips for certain inference jobs. Maia 200 is Microsoft’s second-generation AI accelerator, built on TSMC 3nm, with FP8/FP4 math, 216GB HBM3e, 7TB/s bandwidth, and 272MB SRAM, which makes it aimed at feeding large models fast rather than teaching them from scratch. Anthropic already committed $30B to Azure, Microsoft may invest up to $5B in Anthropic, and Claude is already tied into Microsoft’s Copilot stack, so the chip talks are also a customer-supplier feedback loop. IMO, Maia does not need to beat Nvidia everywhere to matter, because a cheaper chip for narrow, high-volume inference jobs can still shift billions of tokens away from GPUs. --- theinformation .com/articles/anthropic-talks-use-microsofts-ai-chips

Alexander Johansen@AlexRoseJo·2d

@iofu728 @_traceur__ How does it do with JAX kernels?

Huiqiang Jiang@iofu728·3d

📇 We explored kernel generation in 3.7, a major leap over the previous version. There is still a real gap vs. human experts and production workloads, but here's what blew us away: in a test on the just-released PPU, after 35h/431 turns, it got a 10x faster kernel. Self-evolution is going to be a big deal soon.

Qwen@Alibaba_Qwen

📣Meet Qwen3.7-Max — our latest flagship, made for the Agent Era. A versatile foundation for agents that actually get things done: 🧑‍💻 Coding agent, end to end. Frontend prototypes, multi-file refactors, real debugging — nails it. 🗂️ A reliable office and productivity assistant. Get your work done through MCP integrations and multi-agent orchestration. ⏱️ Long-horizon autonomy. 35 hours straight on a kernel optimization task — 1,000+ tool calls, zero hand-holding. 🔌 Scaffold-agnostic. Claude Code, OpenClaw, Qwen Code, or your own stack. Consistent reliability everywhere. API's up on Alibaba Model Studio. You can also take it for a spin on Qwen Studio. Go build something wild!🏃🏃‍♂️ 📖 Blog: qwen.ai/blog?id=qwen3.7 ✅ Qwen Studio: chat.qwen.ai/?models=qwen3.… ⚡️ API：modelstudio.console.alibabacloud.com/ap-southeast-1…

Alexander Johansen@AlexRoseJo·2d

@TheSomitraSR LLM systems rely heavily on non-transferable KV caches. With linear, constant-sized memories, you can ship, share, and join any agent's workload arxiv.org/abs/2605.06997

SomitraSR@TheSomitraSR·2d

Everyone talks about AI apps. Very few understand the real moat is the LLM orchestration layer. The future won’t belong to people using one model. It’ll belong to builders combining multiple LLMs with memory, reasoning, agents, workflows, and real-time data. Single prompts are temporary. AI infrastructure is the real business.

Alexander Johansen@AlexRoseJo·2d

@C8Luna @rezoundous Open source models allow a lot more flexibility with routing: arxiv.org/abs/2602.02823

looney@C8Luna·2d

@AlexRoseJo @rezoundous I've seen it a few times but often wish I could steer it at the prompt rather than shifting from model a/high to model b/high all the time.

Tyler@rezoundous·2d

Microsoft canceled Claude Code license due to unsustainable costs. If they can't afford it, who can?

Alexander Johansen@AlexRoseJo·2d

@LLMJunky Say you go a step further, how many employees could one Vera Rubin NVL72 support?

am.will@LLMJunky·2d

buy. a. gpu.

Hedgie@HedgieMarkets

🦔Microsoft canceled its internal Claude Code licenses this week after token-based billing made the cost untenable, even for a company with effectively infinite cloud resources. Uber's CTO sent an internal memo warning the company burned through its entire 2026 AI budget in just four months. American AI software prices have jumped 20% to 37%, and GitHub (owned by Microsoft) is dropping flat-rate plans for usage-based billing across its products.
My Take
The AI subsidy era is ending in real time. The same company that put $13 billion into OpenAI and built the Azure infrastructure powering most of Anthropic's compute just looked at the bill from a competitor's coding tool and decided it was not worth paying. That is not a productivity failure on Anthropic's end. Token-based pricing is forcing every enterprise customer to confront the actual cost of running these models at scale, and the number turns out to be far higher than the flat-rate experiments suggested.
This ties directly to my Gemini Flash post yesterday. Anthropic, OpenAI, and Google all raised effective prices in the last six months. Enterprises that built workflows assuming AI costs would keep falling are now watching annual budgets evaporate in months.
Two outcomes look likely from here. Either enterprises scale back AI usage to fit budgets, which slows the revenue ramp the labs need to justify their valuations ahead of IPOs, or the labs cut prices and absorb the losses, which makes the unit economics worse at exactly the wrong moment. Both paths land in the same place, the numbers stop working, and somebody has to take the writedown.
Hedgie🤗

Alexander Johansen@AlexRoseJo·2d

@Dorialexander If you can dump the KV cache and have constant-size memories between agents, that becomes a lot easier.

Alexander Doria@Dorialexander·2d

knowing how to spin subagents on gpu, on track to become a very consequential skill

Alexander Johansen@AlexRoseJo·2d

@C8Luna @rezoundous OpenAI does use routing to smaller models when you use the app

looney@C8Luna·2d

@AlexRoseJo @rezoundous If that is true why aren’t more using Copilot with 5.5 planning and GPT 5 mini execution?
