Kaizhao Liang

3.8K posts

@KyleLiang5

@MicrosoftAI, ex @SambaNovaAI, PhD student @UTCompSci, working on optimizers and neural architectures, alumni @IllinoisCDS

Redmond, Seattle · Joined December 2018
101 Following · 668 Followers
Kaizhao Liang retweeted
Satya Nadella@satyanadella·
Introducing Critique, a new multi-model deep research system in M365 Copilot. You can use multiple models together to generate optimal responses and reports.
421 replies · 509 reposts · 4.2K likes · 1.4M views
Kaizhao Liang retweeted
Jia-Bin Huang@jbhuang0604·
A great example of how the medium shapes impact.
A research paper on arXiv 11 months ago: 👉 2 citations so far
An accessible blog post one day ago: 👉 12M views, instant community adoption
Google Research@GoogleResearch

Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: goo.gle/4bsq2qI

28 replies · 82 reposts · 1.1K likes · 121.1K views
Kaizhao Liang retweeted
Lucas Maes@lucasmaes_·
JEPAs are finally easy to train end-to-end without any tricks! Excited to introduce LeWorldModel: a stable, end-to-end JEPA that learns world models directly from pixels, no heuristics. 15M params, 1 GPU, and full planning in <1 second. 📑: le-wm.github.io
101 replies · 540 reposts · 3.9K likes · 897.6K views
Kaizhao Liang retweeted
elie@eliebakouch·
whaaaaaat microsoft ai just poached part of ai2 leadership team
[image attached]
16 replies · 10 reposts · 304 likes · 41.1K views
Kaizhao Liang retweeted
Yuchen Jin@Yuchenj_UW·
OpenAI just dropped a training challenge: Train a <16MB language model in 10 minutes on 8×H100s and minimize held-out loss on a fixed FineWeb dataset. Basically NanoGPT Speedrun. They’re sponsoring $1M in compute. I can summon my autoresearch army to win it… if I have time.
[image attached]
53 replies · 75 reposts · 1.3K likes · 110.2K views
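As a rough illustration of what the <16MB cap in the tweet above implies (my own back-of-envelope, not part of the challenge spec), the parameter budget depends entirely on the checkpoint dtype:

```python
# Rough parameter budget implied by a 16MB checkpoint (illustrative only).
BUDGET_BYTES = 16 * 1024**2
for dtype, bytes_per_param in [("fp32", 4), ("bf16/fp16", 2), ("int8", 1)]:
    print(f"{dtype}: ~{BUDGET_BYTES / bytes_per_param / 1e6:.1f}M params max")
# fp32: ~4.2M params, bf16/fp16: ~8.4M, int8: ~16.8M
```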
Kaizhao Liang retweeted
Percy Liang@percyliang·
In Marin, we are trying to get really good at scaling laws. We have trained models up to 1e22 FLOPs and have made a prediction of the loss at 1e23 FLOPs, which @WilliamBarrHeld is running. This prediction is preregistered on GitHub, so we'll see in a few days how accurate our prediction was. What we want is not just a single model but a training recipe that scales reliably.
[image attached]
18 replies · 47 reposts · 469 likes · 76.1K views
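For readers unfamiliar with the workflow Percy describes, a minimal sketch of the general technique (the functional form and the numbers below are hypothetical; Marin's actual recipe and prediction live in their preregistered GitHub materials): fit a parametric loss-vs-compute curve on the completed runs, then extrapolate one decade of compute out.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (compute, final loss) pairs from small-scale runs.
compute = np.array([1e19, 1e20, 1e21, 1e22])   # FLOPs
loss    = np.array([4.02, 3.50, 3.09, 2.76])   # made-up values

# A common saturating power law, L(C) = a * C^(-b) + c,
# fit in log-compute space for numerical stability.
x = np.log(compute)
def scaling_law(x, a, b, c):
    return a * np.exp(-b * x) + c

params, _ = curve_fit(scaling_law, x, loss, p0=(100.0, 0.1, 2.0), maxfev=20000)
print(f"predicted loss at 1e23 FLOPs: {scaling_law(np.log(1e23), *params):.2f}")
```

Preregistering the extrapolated number before the 1e23 run finishes is what makes the prediction falsifiable.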
Kaizhao Liang retweeted
Felix Rieseberg@felixrieseberg·
We're shipping a new feature in Claude Cowork as a research preview that I'm excited about: Dispatch! One persistent conversation with Claude that runs on your computer. Message it from your phone. Come back to finished work. To try it out, download Claude Desktop, then pair your phone.
973 replies · 1.5K reposts · 17.4K likes · 6.2M views
Kaizhao Liang@KyleLiang5·
Every day this meme becomes more and more relevant
[image attached]
0 replies · 0 reposts · 0 likes · 89 views
Kaizhao Liang@KyleLiang5·
If you look at the linear layer, both fwd and bwd are linear attentions. Fwd, it's retrieving with activations; bwd, it's retrieving with the error signal (the loss gradient). The symmetry is beautiful.
Andrej Karpathy@karpathy

@Yulun_Du @ilyasut SGD is a ResNet too (the blocks of it are fwd+bwd), the residual stream is the weights so... 🤔 We're not taking the Attention is All You Need part literally enough? :D

0 replies · 0 reposts · 3 likes · 276 views
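A minimal numeric sketch of the symmetry in the tweet above (my own illustration, not Kaizhao's code): write the weight matrix as a sum of outer products, W = Σₛ vₛkₛᵀ, which is exactly the form accumulated SGD updates take. The forward pass is then linear attention queried by the activation, and the backward pass is linear attention queried by the error signal.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_mem = 8, 4, 16

# Weight as a sum of outer products: W = sum_s v_s k_s^T
K = rng.normal(size=(n_mem, d_in))    # "keys"   (e.g. stored inputs)
V = rng.normal(size=(n_mem, d_out))   # "values" (e.g. stored update directions)
W = V.T @ K                           # (d_out, d_in)

x = rng.normal(size=d_in)             # activation
g = rng.normal(size=d_out)            # upstream loss gradient

# fwd: y = W x = sum_s v_s (k_s . x)     -> retrieval keyed on the activation
assert np.allclose(W @ x, V.T @ (K @ x))

# bwd: dL/dx = W^T g = sum_s k_s (v_s . g) -> retrieval keyed on the error signal
assert np.allclose(W.T @ g, K.T @ (V @ g))
```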
Kaizhao Liang retweeted
Claude@claudeai·
1 million context window: Now generally available for Claude Opus 4.6 and Claude Sonnet 4.6.
[image attached]
1.2K replies · 2K reposts · 25.2K likes · 5.6M views
Kaizhao Liang retweeted
Shuangfei Zhai@zhaisf·
Say hi to Exclusive Self Attention (XSA), a (nearly) free improvement to Transformers for LM.
Observation: for y = attn(q, k, v), yᵢ and vᵢ tend to have a very high cosine similarity.
Fix: exclude vᵢ from yᵢ via zᵢ = yᵢ - (yᵢᵀvᵢ)vᵢ/‖vᵢ‖².
Result: better training/val loss across model sizes; increasing gains as sequence length grows.
See more: arxiv.org/abs/2603.09078
[image attached]
32 replies · 81 reposts · 944 likes · 214.9K views
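A minimal PyTorch sketch of the fix exactly as stated in the tweet (not the authors' code; see the arXiv link for the real implementation): run ordinary attention, then project each output yᵢ off its own value vᵢ.

```python
import torch
import torch.nn.functional as F

def exclusive_self_attention(q, k, v, eps=1e-8):
    """Attention plus the XSA correction: z_i = y_i - (y_i.v_i) v_i / ||v_i||^2."""
    y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal, for LM
    coef = (y * v).sum(-1, keepdim=True)                     # y_i . v_i
    sq_norm = (v * v).sum(-1, keepdim=True).clamp_min(eps)   # ||v_i||^2
    return y - coef / sq_norm * v
```

The correction adds only a couple of elementwise ops per token on top of attention, which is why the tweet can call it "(nearly) free".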
Kaizhao Liang retweeted
Satya Nadella@satyanadella·
Announcing Copilot Cowork, a new way to complete tasks and get work done in M365. When you hand off a task to Cowork, it turns your request into a plan and executes it across your apps and files, grounded in your work data and operating within M365’s security and governance boundaries.
2.3K replies · 2.1K reposts · 16.7K likes · 9.8M views
Kaizhao Liang retweeted
Andrej Karpathy@karpathy·
The next step for autoresearch is that it has to be asynchronously massively collaborative for agents (think: SETI@home style). The goal is not to emulate a single PhD student, it's to emulate a research community of them.

Current code synchronously grows a single thread of commits in a particular research direction. But the original repo is more of a seed, from which could sprout commits contributed by agents on all kinds of different research directions or for different compute platforms.

Git(Hub) is *almost* but not really suited for this. It has a softly built in assumption of one "master" branch, which temporarily forks off into PRs just to merge back a bit later.

I tried to prototype something super lightweight that could have a flavor of this, e.g. just a Discussion, written by my agent as a summary of its overnight run: github.com/karpathy/autor…

Alternatively, a PR has the benefit of exact commits: github.com/karpathy/autor… but you'd never want to actually merge it... You'd just want to "adopt" and accumulate branches of commits.

But even in this lightweight way, you could ask your agent to first read the Discussions/PRs using GitHub CLI for inspiration, and after its research is done, contribute a little "paper" of findings back.

I'm not actually exactly sure what this should look like, but it's a big idea that is more general than just the autoresearch repo specifically. Agents can in principle easily juggle and collaborate on thousands of commits across arbitrary branch structures. Existing abstractions will accumulate stress as intelligence, attention and tenacity cease to be bottlenecks.
529 replies · 715 reposts · 7.6K likes · 1.1M views
Kaizhao Liang retweeted
Andrej Karpathy@karpathy·
(I still have the bigger cousin running on prod nanochat, working on a bigger model on 8×H100, which looks like this now. I'll just leave this running for a while...)
[image attached]
71 replies · 62 reposts · 2.1K likes · 422.3K views