Alexander Johansen

749 posts


@AlexRoseJo

CS PhD @Stanford || Statistical Machine Learning || Proofs, Bounds, and Better agents

Stanford, CA · Joined October 2015
769 Following · 1K Followers
Alexander Johansen
Alexander Johansen@AlexRoseJo·
@ryan_punamiya Online updates are difficult to achieve in current transformer models; running SGD with a batch size of 1 has not proven useful for continuous learning.
English
0
0
0
22
Ryan Punamiya
Ryan Punamiya@ryan_punamiya·
low compute fine-tuning of robot foundation models is a must have
English
5
3
55
4.1K
Alexander Johansen
Alexander Johansen@AlexRoseJo·
@DavidSHolz Megatron from Nvidia or DeepSpeed really helps with distributed workloads. However, at certain sizes, GPUs crashing will be a concern and having redundant algorithms is necessary.
English
0
0
0
226
David
David@DavidSHolz·
how many gpus do you think a single researcher can handle at once for a single big training job without any help? (assume it's set up with slurm or something)
English
33
0
145
28.9K
Alexander Johansen
Alexander Johansen@AlexRoseJo·
@hardmaru @SakanaAILabs Also the belief that junior engineers won’t be useful. I doubt it, the best adapters to new technology will be engaged students.
English
0
0
0
164
hardmaru
hardmaru@hardmaru·
People keep asking if AI will replace software engineers. I believe the exact opposite. Thanks to the Jevons paradox, AI tools are making great engineers 10x more productive, allowing us to tackle much harder, larger-scale problems. We’re expanding our SWE teams at @SakanaAILabs We have 5 new open roles, including English-speaking R&D and Platform roles. Come build the future of AI with us in Tokyo! 🐟
Sakana AI@SakanaAILabs

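[Hiring] Five "Software Engineer" positions are now open! sakana.ai/careers "As AI advances, will software engineering jobs disappear?" Sakana AI believes the exact opposite. While AI tools dramatically improve development efficiency, the Jevons paradox suggests that the breadth and scale of the problems we can solve are expanding, and demand for excellent Software Engineers is higher than ever. In fact, Sakana AI is ramping up hiring, on an unprecedented scale, of Software Engineers who work at the front lines using AI-assisted tools and bring AI itself into real-world use! We are currently recruiting in the following five specialist areas. See the link for details. 🐙 The challenges that await you ・Enterprise: End-to-end design, development, and operation, from frontend to backend, of applications that incorporate AI technology ・Defense & Intelligence: Contribute to Japan's defense and intelligence sector with AI-powered software (*Due to its nature, this position has requirements such as holding Japanese nationality) ・Product: Full-stack development of our own AI products, from UI/UX to backend and infrastructure ・Platform: Design and build the robust infrastructure and data platform that supports LLM agents (English required, Japanese is a plus) ・Research and Development: Bridge ML research and product development, building tools and full-stack infrastructure that accelerate research (English required, Japanese is a plus) 🐡 Who we are looking for ・People with hands-on experience in more than one of Frontend / Backend / Infrastructure ・People who can use AI-assisted coding tools and drive development autonomously within a team ・Experience developing AI systems or launching products from 0→1 is a further plus! In addition to full-time roles, flexible arrangements such as contract work and internships are available (*varies by position). If you want to deliver cutting-edge AI technology to society with your own hands and create a wave of change, please apply.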

English
23
12
196
33.9K
Alexander Johansen
Alexander Johansen@AlexRoseJo·
@hllo_wrld Thanks for the read, Victor. As I've been teaching Introduction to CS Research the past few quarters, I've noticed that students skipping the CS 220–30 series for core Math/EE courses tend to do much better. Even in LLM research.
English
0
0
2
950
Victor Zhong
Victor Zhong@hllo_wrld·
I'm helping recalibrate Waterloo’s software engineering program (and to some degree CS) for the age of AI. I wrote an essay on the challenges I’m seeing. victorzhong.com/writing/the-ho…
English
6
25
139
26.7K
Alexander Johansen
Alexander Johansen@AlexRoseJo·
@leostera @OpenAI @AnthropicAI Frontier LLM models are fixed for one-size-fits-all, they do not continuously update for every use-case. Current continuous learning relies on prompt engineering, which is unreliable.
English
0
0
2
93
Alexander Johansen
Alexander Johansen@AlexRoseJo·
@meggmcnulty This probably gets worse over time as the GPUs age. SGD for training can handle variance in batch size, training harnesses would benefit from modelling underlying inconsistencies in hardware.
English
0
0
1
62
Meg McNulty
Meg McNulty@meggmcnulty·
A degrading NVLink does not page anyone. NVIDIA's own diagnostics tell you to keep running on it. When DCGM trips error code 13, "Unacceptable rate of NVLink errors," the recommended action it returns reads "Monitor the NVLink. It can still perform workload." The non-fatal NVSwitch error says the same. The link is throwing CRC errors and retransmitting packets, and the documented guidance is to continue. A link that goes fully down throws an XID, the scheduler restarts the job, and an operator moves on. A link that flaps or drops to a degraded bandwidth state stays up. NVIDIA added a fabric health field for exactly this condition: bandwidth reported as "Degraded" instead of "Full," summary health as "Limited Capacity." The link still carries traffic. It is slower than its peers, and in synchronous training every GPU waits at the all-reduce barrier for the slowest one. The measured cost of that is steep. In a controlled study, a single worker held to 75% of its normal speed pulled total job throughput down to roughly 75%, every healthy GPU idling for the straggler. An NVIDIA-affiliated paper found that 0.1% of GPUs in a degraded state can cost a high-tensor-parallel job nearly 10% of its throughput, which on a 32,000-GPU run is about 3,000 GPUs of compute doing nothing. Alibaba's FALCON trace on a 10,000-plus GPU cluster put the average fail-slow at delaying job completion by 1.34x, with some events lasting close to ten hours. None of those are crashes. They are healthy-looking hardware running slow, and the slowdown usually gets blamed on the model or the data loader before anyone opens the fabric counters. For people running large fleets: how much of your NVLink error signal do you act on, versus treat as noise until a job falls over?
English
5
2
39
3.4K
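The straggler arithmetic in the post above can be sketched in a few lines of Python. This is a deliberately idealized model, not taken from any of the cited studies: it assumes per-step throughput in synchronous training equals the speed of the slowest worker, and it ignores any overlap of compute and communication. The function names are hypothetical.

```python
def sync_job_throughput(worker_speeds):
    """Relative throughput of a synchronous job: each step ends only when
    the slowest worker reaches the all-reduce barrier (idealized model)."""
    if not worker_speeds:
        raise ValueError("need at least one worker")
    return min(worker_speeds)


def idle_gpu_equivalent(total_gpus, throughput_loss):
    """GPUs' worth of compute lost for a given fractional throughput loss."""
    return total_gpus * throughput_loss


# Seven healthy workers and one held to 75% of its normal speed.
speeds = [1.0] * 7 + [0.75]
job = sync_job_throughput(speeds)          # 0.75
lost = idle_gpu_equivalent(32_000, 0.10)   # roughly 3,200 GPUs
print(f"job throughput: {job:.0%}, idle-equivalent GPUs: {lost:.0f}")
```

Under this model one slow worker sets the pace for the whole job, which matches the roughly 75% figure quoted in the post. The second function only reproduces the post's back-of-envelope step (10% of 32,000 GPUs); it does not derive the 10% loss itself.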
John Ennis
John Ennis@johnennis·
I think the world is about to get flooded with a lot of AI slop math, and the job of the professional mathematician is about to get more important, not less
English
68
38
291
45.8K
Sakura Yuki
Sakura Yuki@sakurayukiai·
We overengineer KV cache eviction so much. Seven fancy policies collapse to near-zero F1 without structural protection. Reserve 10% of the cache for boundary tokens, and suddenly dumb LRU matches the heavy algorithms. Simple rules compile faster anyway. The dumbest fix wins.
English
7
2
35
2.2K
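One way to read "reserve 10% of the cache for boundary tokens" is as plain LRU plus a small protected partition. The sketch below is an illustration only. The post does not say how boundary tokens are identified or how the cache is keyed, so the `is_boundary` flag, the position keys, and the class name are all assumptions.

```python
from collections import OrderedDict


class ProtectedLRUCache:
    """LRU eviction with a reserved partition for boundary tokens.

    Hypothetical sketch: keys are token positions and values stand in
    for KV entries. Boundary tokens compete only with each other.
    """

    def __init__(self, capacity, reserved_fraction=0.10):
        if capacity < 2:
            raise ValueError("capacity must be at least 2")
        self.reserved = max(1, int(capacity * reserved_fraction))
        self.general = capacity - self.reserved
        self.protected = OrderedDict()  # boundary tokens only
        self.lru = OrderedDict()        # everything else

    def put(self, position, kv, is_boundary=False):
        # Keep each position in exactly one partition.
        self.protected.pop(position, None)
        self.lru.pop(position, None)
        if is_boundary:
            store, limit = self.protected, self.reserved
        else:
            store, limit = self.lru, self.general
        store[position] = kv
        while len(store) > limit:
            store.popitem(last=False)  # evict least recently used

    def get(self, position):
        for store in (self.protected, self.lru):
            if position in store:
                store.move_to_end(position)  # mark as recently used
                return store[position]
        return None


cache = ProtectedLRUCache(capacity=10)
cache.put(0, "bos", is_boundary=True)
for pos in range(1, 21):
    cache.put(pos, f"kv{pos}")
print(cache.get(0), cache.get(1), cache.get(20))
```

The boundary entry at position 0 survives twenty later insertions because it never competes with ordinary tokens for a slot, while position 1 is evicted by ordinary LRU.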
Alexander Johansen
Alexander Johansen@AlexRoseJo·
In your theoretical model (Proposition 7.1), the loss decomposes additively across orthogonal tasks and sufficient rank eliminates interference. But the retrieval interface compresses internal representations to a fixed output geometry. Weller et al. (ICLR 2026) show that when the sign rank of the relevance matrix exceeds the embedding dimension, no training procedure can realize the target retrieval behavior, regardless of internal model capacity. Do your scaling predictions still hold when the evaluation metric operates on that compressed output rather than the internal representation?
English
0
0
0
61
Tatsunori Hashimoto
Tatsunori Hashimoto@tatsu_hashimoto·
Some new results I found surprising that I’m tweeting for Chris (who isn't on here). With enough compute, the best data filter for LMs (on DCLM) might be no filter. Why? Large models can tolerate a surprising amount of nominally 'low quality' data, and can sometimes even benefit.
English
31
145
1.2K
197.7K
Alexander Johansen reposted
Fan-Yun Sun
Fan-Yun Sun@sunfanyun·
If you believe digital AGI will be solved before physical AGI, work on simulation. We will have: A. self-improving physics engine. Newton can already be agentically adapted to different downstream applications (and is built in the first place with this intention). B. on-demand generation of digital twins (e.g., @moonlake’s 3D agent and more generically, world models). All of the hardest problems in physical AI will be reduced to a single question: “does the labor value unlocked by the policy exceed the simulation compute cost needed to bridge the deployment gap?”
tian/天@xtbot

I asked Codex to set up ROS middleware, configure a CSI camera, benchmark Gemma 4 models on my Jetson Orin Nano, adapt an OpenClaw-style runtime for VLM + reasoning (“Robotclaw” as I call it), and even build an iOS app to stream LiDAR, camera, GPS, and IMU data from an old iPhone to my robot rover. It’s honestly wild how capable these coding agents are now, and how much time they save. I even “write” way more tests now because the marginal cost is so low.

English
1
3
22
2.5K
Omar Khattab
Omar Khattab@lateinteraction·
Most people in this part of twitter don’t realize how closely folks at the frontier labs pay attention to all your favorite academic ML releases on here.
English
19
10
615
35.7K
Alexander Johansen
Alexander Johansen@AlexRoseJo·
@AndrewYNg At Stanford we often see a Cauchy distribution in grading. Most students do really well, and a few fall off. Forcing a normal distribution might unfairly target the median student who still gets 90%.
English
4
0
10
2K
Andrew Ng
Andrew Ng@AndrewYNg·
Harvard University just voted to limit the number of A grades given in undergraduate classes to about 20% of the class. I’m not in favor of this. It deeply runs counter to how I believe education should be. We should hold a high bar, but also work mightily to support the success of 100% of learners, rather than a fraction. Harvard’s administration took this step — over the objections of a large fraction of the student body — to counter grade inflation. Grade inflation is real: Many universities have been awarding A and B grades to ever larger fractions of students, and this has caused grade point averages (GPAs) to become less useful as signals of student skill. At the same time, we want students to succeed. The heart of the question is the role of educational institutions. Should our goal be: - To help students succeed? - To judge students? Both of these have value. But my focus when working in education is almost entirely helping students succeed. To me, it is clear that many people want to learn, to be empowered, to build skills that let them do new things! This is what we focus on at DeepLearningAI. This philosophy is also why my online courses (going back to my early online Stanford courses on Coursera) permitted an unlimited number of retries for graded assignments. I believe in letting — and even encouraging — someone to redo something until they succeed. This is as opposed to standing in judgement of the fact they didn’t get it right the first time. Further, I want homework assignments to be designed primarily to help people practice and learn, rather than to judge their skill level. This is why I prefer to create “Practice Problems” and “Practice Labs” — questions that, when you think through them, help you to gain practice and reinforce what you know. As opposed to “Assessment Problems” designed primarily to judge skill. But won’t Harvard’s move make GPAs more meaningful and help prospective employers identify strong candidates? 
Having hired a large number of people from Harvard and other institutions, I can say confidently that GPA is not an important signal. We have screening and interviewing processes that give far more accurate ways to figure out if someone is truly skilled. I do not need a wider spread in applicant GPA scores to figure out who's really good! To be clear, there is also value in assessment. Even though standardized testing is much hated, high-quality tests like the SAT, ACT, GRE, TOEFL, etc. provide objective measures of ability in a domain. I find that most people want to learn and succeed. There are also people who want rigorous assessment (for example, to apply for school admissions), but this is a lesser need, and is not my focus when building educational products. Harvard is often described as an “elite” educational institution. There are two ways to be elite: One option involves limiting enrollments, and then even among admitted students, cap the number of people that do well at 20%. I would rather pursue a different path: Set a high bar and teach elite, cutting-edge skills, but strive relentlessly to help everyone succeed. This way, eliteness is defined not by excluding people but by helping as many people as possible to be excellent. [Original text: The Batch newsletter]
English
188
212
2.1K
255.8K
Rohan Paul
Rohan Paul@rohanpaul_ai·
The Information: Anthropic is currently in early-stage talks to lease and deploy Microsoft's custom AI chips for inference workloads. Microsoft is pitching Maia 200 as a cheaper way to run some AI inference, and claims maia 200 is more cost-effective than nvidia chips for certain inference jobs. Maia 200 is Microsoft’s second-generation AI accelerator, built on TSMC 3nm, with FP8/FP4 math, 216GB HBM3e, 7TB/s bandwidth, and 272MB SRAM, which makes it aimed at feeding large models fast rather than teaching them from scratch. Anthropic already committed $30B to Azure, Microsoft may invest up to $5B in Anthropic, and Claude is already tied into Microsoft’s Copilot stack, so the chip talks are also a customer-supplier feedback loop. IMO, Maia does not need to beat Nvidia everywhere to matter, because a cheaper chip for narrow, high-volume inference jobs can still shift billions of tokens away from GPUs. --- theinformation .com/articles/anthropic-talks-use-microsofts-ai-chips
English
5
8
43
3.9K
SomitraSR
SomitraSR@TheSomitraSR·
Everyone talks about AI apps. Very few understand the real moat is the LLM orchestration layer. The future won’t belong to people using one model. It’ll belong to builders combining multiple LLMs with memory, reasoning, agents, workflows, and real-time data. Single prompts are temporary. AI infrastructure is the real business.
English
45
33
107
21.1K
looney
looney@C8Luna·
@AlexRoseJo @rezoundous I've seen it a few times but often wish I could steer it at the prompt rather than shifting from model a/high to model b/high all the time.
English
1
0
0
12
Tyler
Tyler@rezoundous·
Microsoft canceled Claude Code license due to unsustainable costs. If they can't afford it, who can?
English
363
1K
13.1K
763.2K
Alexander Johansen
Alexander Johansen@AlexRoseJo·
@LLMJunky Say you go a step further, how many employees could one Vera Rubin NVL72 support?
English
1
0
1
63
am.will
am.will@LLMJunky·
buy. a. gpu.
Hedgie@HedgieMarkets

🦔Microsoft canceled its internal Claude Code licenses this week after token-based billing made the cost untenable, even for a company with effectively infinite cloud resources. Uber's CTO sent an internal memo warning the company burned through its entire 2026 AI budget in just four months. American AI software prices have jumped 20% to 37%, and GitHub (owned by Microsoft) is dropping flat-rate plans for usage-based billing across its products. My Take The AI subsidy era is ending in real time. The same company that put $13 billion into OpenAI and built the Azure infrastructure powering most of Anthropic's compute just looked at the bill from a competitor's coding tool and decided it was not worth paying. That is not a productivity failure on Anthropic's end. Token-based pricing is forcing every enterprise customer to confront the actual cost of running these models at scale, and the number turns out to be far higher than the flat-rate experiments suggested. This ties directly to my Gemini Flash post yesterday. Anthropic, OpenAI, and Google all raised effective prices in the last six months. Enterprises that built workflows assuming AI costs would keep falling are now watching annual budgets evaporate in months. Two outcomes look likely from here. Either enterprises scale back AI usage to fit budgets, which slows the revenue ramp the labs need to justify their valuations ahead of IPOs, or the labs cut prices and absorb the losses, which makes the unit economics worse at exactly the wrong moment. Both paths land in the same place, the numbers stop working, and somebody has to take the writedown. Hedgie🤗

Español
11
1
77
33.5K
Alexander Doria
Alexander Doria@Dorialexander·
knowing how to spin subagents on gpu, on track to become a very consequential skill
English
8
1
79
4.2K
looney
looney@C8Luna·
@AlexRoseJo @rezoundous If that is true why aren’t more using Copilot with 5.5 planning and GPT 5 mini execution?
English
1
0
0
73