John

145 posts

@John3tck

Joined April 2026
156 Following, 2 Followers
John retweeted
Dan McAteer
Dan McAteer@daniel_mac8·
guys, I think memory gets cracked in 2026

then infinite context

and memory + infinite context = continual learning

you can feel it in the air
English
75
140
1.3K
124.9K
John retweeted
Kazuki Irie
Kazuki Irie@kzkirie·
Speaking of recursive self-improvement (RSI): many recent RSI loops involve *fixed-weight* foundation models (FM). Another exciting meta-level is the weight space. Making FMs part of the RSI loop to improve their weights & meta-improve their own weight update strategies & so on.
Kazuki Irie@kzkirie

Great to see a workshop like this! I wasn't on X back then: I presented "Training a Weight Matrix to Train Itself" in Mar 2022 at KAUST AI Symp, showing how to turn a linear layer into a recursively self-modifying module--reviving @SchmidhuberAI's self-referential weight matrix.

English
2
5
49
6.1K
John retweeted
Wei Huang
Wei Huang@WeiHuang_USTC·
Just made public the slides from my UTokyo lecture on diffusion models (Spring 2026).

Three perspectives:
• Variational (VAE → DDPM)
• Score-based (EBM → Score SDE)
• Practice (U-Net, CFG, LDM)

Hope it's useful for students entering the area.

weihuang05.github.io/files/diffusio…
English
2
17
103
7.3K
John retweeted
AB Kuai.Dong
AB Kuai.Dong@_FORAB·
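Musk has published X's new open-source algorithm, making X the world's first mainstream social platform with an open-source algorithm.

The moment you open X, the system looks at your recent likes, replies, and reposts, along with how long you dwell on a given post and how often you visit someone's profile, to work out what kind of person you are.

That is why some guys never interact with porn posts but still get porn pushed to them every day: they linger on it every time.

Once X has built your user profile, it quickly recommends two kinds of posts: first, new content from the accounts you follow; second, content you may be interested in, based on your profile.

X then screens these posts against its moderation checks, computing over the author's verification status, historical credibility, and engagement.

Finally, after filtering out posts you have already seen and posts flagged as violent, and after these 4 layers of filtering, X recommends what you actually want to see.

X's backend also has a scoring mechanism that covers both posts and replies. Positive actions include likes, replies, reposts, and shares. Negative actions, such as being blocked, reported, or marked "not interested", deduct points.

Compared with the previous release of X's algorithm, this one introduces multi-layer selection, filtering, and scoring, which also makes the content on X more diverse and the topics richer.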
Elon Musk@elonmusk

The latest 𝕏 algorithm has been published to GitHub github.com/xai-org/x-algo…

Meguro-ku, Tokyo 🇯🇵 Chinese
146
202
1.4K
259.8K
John retweeted
OptimaLab
OptimaLab@optimalab1·
Huge kudos to Barbara Su (Rice CS -> MSc Stanford): she led every part of this end-to-end: algorithm, GLUE/SQuAD pipelines, multi-GPU H200 scaling. Jasper Liao (Rice CS) helped with the theory.

akyrillidis.github.io/aiowls/adapad.…

AdaPaD: parallel rank-1 deflation for LoRA that self-corrects. 🧵
English
4
52
396
23.8K
John retweeted
Rishabh Agarwal
Rishabh Agarwal@agarwl_·
Training LLMs is synonymous with updating their weights. However, LLMs can also learn in-context using *frozen* weights. There is no good reason for restricting learning to being in-context or in-weights.

So a natural idea is "Learning, Fast and Slow" (FST). In FST, slow learning is LLM weights trained with RL while fast learning is context / prompt (fast weights) optimized with GEPA.

Compared to RL, FST performs better while being more data efficient, adaptable (plasticity), and forgetting less (stays closer to base models).

I think this idea of learning both fast-slow weights would be a good foundation for continual learning.

PS: Geoff Hinton (the OG) described the idea of fast weights and slow weights several years ago, and back then I remember thinking it's a very cool idea.

See more details here: gepa-ai.github.io/gepa/blog/2026…
English
18
72
561
67.5K
John retweeted
思维怪怪
思维怪怪@0xLogicrw·
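Prime Intellect has published a two-week autonomous AI research experiment. The team had Codex (gpt 5.5 xhigh) and Claude Code (opus 4.7 xhigh) autonomously iterate on optimizer designs in the nanoGPT speedrun, trying to reach the target validation loss in as few steps as possible. After roughly 10,000 experiments and 14,000 H200 hours of compute, Opus ultimately broke the human record of 2990 steps with 2930 steps.

The experiment exposed the capability boundary of current AI agents. In a test branch that required them to propose new algorithms, neither model could get any idea working without relying on existing code or papers from the human community. Their record-breaking results depended entirely on massive recombination of existing open-source techniques and parameter sweeps.

The two models showed very different behavioral flaws. Claude frequently violated the system instruction to keep running autonomously and repeatedly stopped on its own to wait for human intervention; in one 47-hour task it sat idle for 22 hours. Codex kept running around the clock but was highly prone to getting stuck in loops, spending hours on futile exhaustive search within the same hyperparameter space.

When gathering external information, Codex almost never looked at the latest activity on the code hosting platform and searched only its local history. Claude spent a large share of its token budget reading human developers' pull requests.

The experiment shows that frontier AI is, at this stage, only a highly efficient engineering verification machine. Once humans stop supplying the initial algorithmic conjectures, the models cannot independently carry out zero-to-one innovation.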
elie@eliebakouch

we let opus 4.7 and gpt 5.5 run on the nanogpt optimizer speedrun: ~10k runs, 14k H200 hours, 23.9B tokens. opus hits 2930, codex 2950, both beating the human baseline of 2990. we cover claude autonomy failures, codex high compute usage, and much more primeintellect.ai/auto-nanogpt

Chinese
4
12
112
22.2K
John retweeted
Sanbu 散步
Sanbu 散步@sanbuphy·
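Qwen3, GLM-5, MiMo, and other large models all use OPD in post-training. It is viewed as a low-cost alternative to RL, but in practice it is far more brittle than SFT or RL. Three recent papers reveal its underlying mechanism and four root causes of failure: OPD nominally matches the teacher and student distributions over the full vocabulary, but in fact 97%–99% of the learning signal comes from the few high-probability candidate tokens that teacher and student share at each step, so it depends on how well the two sides' sets of locally plausible next tokens line up.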
Zhuokai Zhao@zhuokaiz

Qwen3, GLM-5, and MiMo all use on-policy distillation in post-training. Thinking Machines also wrote it up as a cheap alternative to RL. But in practice it is surprisingly brittle to make work, much more so than SFT or RL.

Three recent papers [1, 2, 3] helped me make sense of why. The mechanism is consistent across the failure modes they describe, and it's worth understanding before running another OPD experiment.

OPD looks like "match the teacher distribution." But in practice, the update is driven by a very small set of next-token choices at each generation step. Mostly just the handful of tokens that both the student and teacher think are plausible as the next token. Once that small set breaks, OPD breaks.

The real object OPD is learning on

At every generation step, the model has a huge vocabulary. 150K tokens, maybe more. But almost all of the probability mass sits on a tiny number of tokens. One paper [1] shows that the overlapping high-probability tokens between teacher and student carry around 97–99% of the total probability mass. So although OPD is written as reverse-KL over a full vocabulary, most of the useful learning signal comes from a tiny local menu of next-token options.

Here is what I mean by "the handful of tokens." For a given prefix like:

"Let's solve this step by step. First, we…"

the student may think the next token should be one of:

"need", "can", "have", "find", "know", "compute", …

The teacher has its own version of this menu. OPD works when these two menus mostly overlap, and when the teacher puts higher probability on better choices inside that menu. It fails for a few different reasons: the menus don't overlap, the menu drifts somewhere bad mid-training, we only look at one item on the menu instead of the whole thing, or the per-position learning signals end up pulling the model in inconsistent directions.

Four things can go wrong. The first two are about whether the menu is in good shape. The third is about whether we look at the whole menu or just one item from it. The fourth is about whether the signals across positions combine into a useful update.

1. The student and teacher are thinking in different "languages"

A stronger teacher does not necessarily make a better OPD teacher. Li et al. [1] shows this very clearly: a 7B teacher can outperform a 1.5B model on benchmarks, but still fail to improve the 1.5B student through OPD.

Why? Because benchmark accuracy measures final answers. OPD trains on next-token probabilities. A stronger model may solve the same problem through a different reasoning path: different intermediate steps, different phrasing, different proof structure, different local token choices. So when the student writes its own partial solution, the teacher may not assign useful probability to the student's next natural steps. The teacher is better overall, but not necessarily helpful on the student's current path.

The most interesting experiment is the "reverse distillation", where they take a 1.5B model that was improved by RL (a student that has already moved beyond its original base behavior) and try to distill it back using two teachers: the original pre-RL 1.5B model and a larger 7B model from the same family. Both teachers pull the RL-improved student backward. The student loses its RL gains and regresses toward the older behavior.

This sounds surprising at first. But the explanation is simple: OPD does not know that the student's RL behavior is better unless the teacher's token probabilities support it. If the teacher still prefers the old reasoning pattern, OPD will train the student back toward that pattern. So the RL gains disappear not because the teacher is "weak" in benchmark terms, but because the teacher is giving token-level supervision for a behavior the student has already moved past.

Benchmark gap does not tell you whether OPD will work. Token-level compatibility does.

2. Repetition becomes locally rewarding

Even if OPD starts well, it can still collapse. The most striking failure mode is when training looks fine for a while, then within roughly 30 steps the model starts producing much longer outputs, stops terminating, repetition spikes, and accuracy collapses.

The mechanism is counterintuitive at first but makes sense once you see it. In sampled-token OPD, the reward for a token is roughly the teacher's log-probability minus the student's log-probability on that token. So if the teacher gives a token much higher probability than the student does, that token receives a large positive signal.

Now imagine the student starts repeating itself. In practice this looks less like coherent sentences repeating and more like degenerate loops, something like "wait, wait, wait, wait, wait" filling the rest of the context. This prefix is bad globally. But locally, it is very predictable. A strong teacher is often very confident about predictable text. Once the loop has gone on for a while, the teacher can assign high probability to the next repeated token. The student may be less confident than the teacher. So the repeated token gets a large positive log-ratio. That means OPD accidentally rewards continuing the repetition.

Before repetition starts, repeated tokens are rare, so they don't matter much. But once repetition appears, those tokens become frequent. And because they also receive large positive advantages, they start dominating the update. Luo et al. [2] measures repeated tokens getting 4 to 9 times larger advantage than normal tokens after collapse.

Then the loop reinforces itself:

more repetition → more predictable prefix → higher teacher confidence → larger positive signal on repeated tokens → even more repetition

This is different from the usual length-bias issue in RL. It's more specific: a broken prefix creates locally high-reward repeated tokens, and OPD faithfully amplifies them.

3. We often only look at one item on the menu

The clean objective would compare the teacher and student distributions over multiple possible next tokens. But many public OPD recipes, including the ones used industrially, use a cheaper version:

Let the student generate one token. Then ask: did the teacher assign this exact token higher or lower probability than the student? If teacher probability is higher, push the student toward it. If lower, push the student away.

That is the sampled-token log-ratio. It is cheap because you only score the token the student actually sampled. You do not need to compare the full vocabulary.

There is a real reason for this design choice. Full sequence-level reverse-KL is noisy for long generations because an early token update gets entangled with many future rewards. Token-level OPD avoids that by giving each token its own local feedback. That gives much better variance scaling with length [3]: worst-case variance grows as O(T²) for token-level instead of O(T⁴) for sequence-level. So for long reasoning traces, token-level feedback is attractive.

The problem is that "one sampled token" is a very noisy view of the teacher's actual next-token preference. At a given step, the teacher may have a whole cluster of reasonable next tokens. But sampled-token OPD only checks the one token the student happened to pick.

This creates three problems.

First, the student samples tokens from its own distribution, so on most positions the student's probability exceeds the teacher's and the log-ratio is negative. The reward is computed as teacher minus student, and the student is picking tokens where its own log-prob is near its highest, meaning the subtraction is almost always against the student's strongest values. Positive signal only shows up when the student happens to sample a token the teacher likes even more than the student does, which is the minority case.

Second, if the student drifts into weird prefixes, the teacher's local probabilities may no longer reflect global quality.

Third, tokenization and special-token differences can create fake disagreements. The student and teacher may represent the same text with different token boundaries, so a single-token comparison can look terrible even when the underlying string is fine.

The fix proposed in [3] is simple: don't guess the teacher's local preference from one sampled token. Instead, take the teacher's top-k next tokens, renormalize both teacher and student probabilities over that set, and compute reverse-KL there. It's still cheap: you only need the top-k logits, not the whole vocabulary. But it changes the supervision from "did the teacher like this one sampled token?" to "among the teacher's plausible next tokens, does the student put probability mass in the same places?" That is a much better local learning signal.

They also add top-p sampling during rollout, so the student is less likely to wander into extremely low-probability prefixes, and mask special tokens to avoid fake tokenization mismatches.

4. Even good token signals may not add up

This is the least developed of the four failure modes, and possibly the most important.

Li et al. [1] compares a successful OPD setup against a failing one and finds something strange. The failing teacher's per-token advantages are actually larger than the working teacher's, but the gradient norms are smaller. And the failing teacher's sequence-level reward can still distinguish correct from incorrect rollouts, comparable to the working case. So the reward signal is globally informative. It just does not produce useful gradients.

Turns out what's going on is that OPD computes a learning signal at every token position in the rollout, and then sums them into one gradient update. Each position's contribution is a vector in parameter space pointing in some direction. So if the per-position vectors mostly point in the same direction, they add up to a big coherent push. But if they point in different directions, they partially cancel when summed, and the model barely moves even though each individual position had something to say.

The empirical fingerprint is consistent with the second case: large per-token advantages but small gradient norms after summing, that is, individually strong signals that cancel out when combined. The successful teacher shows the opposite pattern: smaller per-token advantages, larger gradient norms after summing, that is, weaker individual signals that reinforce each other when combined.

The paper that raises this leaves it as a hypothesis, but the empirical fingerprint (large per-token advantages, small gradient norms, informative sequence-level reward) is specific enough that it should be testable?

What this explains

A lot of practical OPD fixes start to look related.

SFT cold start helps because it moves the student closer to the teacher's reasoning style before OPD begins.

Teacher-aligned prompts help because they put the student in regions where the teacher gives more reliable feedback.

KL regularization helps because it prevents the student from drifting too quickly into weird generations.

Mixture distillation helps because it keeps some clean reference trajectories in training, so the rollout distribution does not become fully self-generated garbage.

Top-k matching helps because it stops pretending one sampled token is enough to represent the teacher's local preference.

These look like different tricks. But they are all trying to protect the same thing: the small set of plausible next tokens where teacher and student can actually communicate.

The ceiling, and the real open question

The more interesting part is long-horizon reasoning. The deeper the student gets into its own generated solution, the more likely the prefix is something the teacher would not have written. And once the teacher is judging prefixes outside its own natural distribution, its token probabilities become less reliable.

One paper [1] shows this directly: teacher continuation advantage drops sharply as the student prefix gets longer, from +0.37 at 1K prefix to +0.02 at 16K.

That is a bad sign for long-CoT and agentic OPD, because those are exactly the settings where the student spends many steps inside its own partially-generated world. OPD works best when the teacher and student stay close enough that teacher probabilities remain meaningful. Long-horizon agentic training pushes in the opposite direction.

This is the open question I would most want to see investigated. The failure modes in sections 1-3 are diagnosable and have proposed fixes. Section 4 is a hypothesis about local gradient structure that should be testable. But the long-horizon ceiling is different: it is about whether OPD's core assumption (the teacher knows something useful about the student's next token) can hold at all when the student is operating many steps inside its own generated world.

My current takeaway

OPD isn't really a full-vocabulary distribution-matching problem. It's a fragile communication protocol between teacher and student through a tiny local menu of next-token choices. When that menu overlaps, stays clean, and produces gradients that add up, OPD works. When it doesn't, OPD quietly trains the student in the wrong direction.

References

[1] Li et al. Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe. arxiv.org/abs/2604.13016
[2] Luo et al. Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models. arxiv.org/abs/2604.08527
[3] Fu et al. Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes. arxiv.org/abs/2603.25562
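
To make the contrast in section 3 concrete, here is a minimal Python sketch of the two per-position signals: the sampled-token log-ratio, and reverse-KL over the teacher's top-k tokens after renormalizing both distributions on that set. The function names, the six-token toy vocabulary, and the logits are all made up for illustration; this is not code from any of the cited papers, and real recipes work on model logits in a tensor library.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sampled_token_log_ratio(teacher_probs, student_probs, token):
    """Signal for the single token the student sampled: log p_teacher - log p_student."""
    return math.log(teacher_probs[token]) - math.log(student_probs[token])

def topk_reverse_kl(teacher_probs, student_probs, k):
    """Reverse KL(student || teacher) restricted to the teacher's top-k tokens,
    with both distributions renormalized over that set."""
    top = sorted(range(len(teacher_probs)),
                 key=lambda i: teacher_probs[i], reverse=True)[:k]
    t_mass = sum(teacher_probs[i] for i in top)
    s_mass = sum(student_probs[i] for i in top)
    kl = 0.0
    for i in top:
        t = teacher_probs[i] / t_mass
        s = student_probs[i] / s_mass
        if s > 0.0:  # a zero-probability term contributes nothing
            kl += s * (math.log(s) - math.log(t))
    return kl

# Hypothetical next-token logits over a toy 6-token vocabulary.
teacher = softmax([3.0, 2.5, 1.0, -2.0, -3.0, -4.0])
student = softmax([2.0, 3.0, 0.5, -1.0, -3.0, -4.0])

# Token 1 is the student's favorite, so its log-ratio is negative;
# token 0 is one the teacher likes more than the student does, so it is positive.
ratio_tok1 = sampled_token_log_ratio(teacher, student, 1)
ratio_tok0 = sampled_token_log_ratio(teacher, student, 0)

# One number summarizing disagreement over the teacher's three most likely tokens.
kl_top3 = topk_reverse_kl(teacher, student, 3)
```

In this toy case the sampled-token signal flips sign depending on which token the student happened to draw, while the top-k quantity gives the same answer for the position regardless of the sample, which is the noise argument made above.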

Chinese
2
23
140
25.8K
John retweeted
Xiuyu Li
Xiuyu Li@sheriyuo·
These three papers are indeed the most recent research on OPD. Thank you for summarizing and sharing them. Your write-up is very detailed and provides great insights. The development of OPD nowadays is truly astonishing — just look at how many papers have already been collected in Awesome OPD. Before diving into the latest research on OPD, it is definitely essential to be familiar with these follow-ups. 49 Core OPD Papers, 113 new arXivs in 2026🤯 Awesome OPD: github.com/chrisliu298/aw…
Zhuokai Zhao@zhuokaiz


English
0
18
147
14.7K
John retweetledi
半吊子程序猿大铭
半吊子程序猿大铭@CoderDaMing·
💥 Google has launched a 100% free alternative to Duolingo. No subscription. No daily check-in requirement. Powered by Gemini AI. Let me explain how it works 👇
中文
5
36
265
67.7K
John retweetledi
叫我阿杭
叫我阿杭@Astronaut_1216·
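Advice for the AI era: a record of the epic 0514 incident in the Chinese Twitter community.
Don't use MiniMax models. Don't buy MiniMax stock.
What future can a model vendor have when it doesn't get the basic functions of its model right and only thinks about chasing the hype around OpenClaw and Hermes? It couldn't even understand a basic natural-language command like "upload this to GitHub for me" and deleted all 1,900 of my files. I really can't make sense of it, and it still couldn't do it even after I tagged the official account many times.
But in the AI era there are three people and one company you must follow: @gkxspace @nopinduoduo @skaas777 @AnthropicAI
All three are experts with very deep experience in AI applications.
余温 gave me a constitution-level instruction: use the `trash` command instead of `rm`, so that accidentally deleted files can be recovered. Or simply: alias rm='trash'
拼多多老师 kept advising me and taught me a ton of git commands.
jason's Claude model restored everything for me.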
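The `trash` instead of `rm` advice boils down to making deletion a recoverable move. Below is a minimal sketch of that idea in Python; `soft_delete`, `restore`, and the `.trash` directory are hypothetical names chosen for this example and are not how the `trash` CLI itself is implemented.

```python
import shutil
import tempfile
import time
from pathlib import Path


def soft_delete(path, trash_dir):
    """Move `path` into `trash_dir` instead of deleting it; return the new location."""
    source = Path(path)
    trash = Path(trash_dir)
    trash.mkdir(parents=True, exist_ok=True)
    # Prefix with a timestamp so files with the same name do not collide.
    target = trash / f"{time.time_ns()}-{source.name}"
    shutil.move(str(source), str(target))
    return target


def restore(trashed_path, original_path):
    """Move a trashed file back, refusing to overwrite an existing file."""
    destination = Path(original_path)
    if destination.exists():
        raise FileExistsError(f"{destination} already exists")
    shutil.move(str(trashed_path), str(destination))
    return destination


# Demo inside a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as workdir:
    notes = Path(workdir) / "notes.txt"
    notes.write_text("important")
    trashed = soft_delete(notes, Path(workdir) / ".trash")
    print("original still there?", notes.exists())
    restored = restore(trashed, notes)
    print("restored content:", restored.read_text())
```

The design choice is the same one the tweet recommends: an agent that can only move files into a trash location cannot cause unrecoverable loss with a single misread instruction.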
中文
48
5
106
44.5K
John retweetledi
Saul Goodman
Saul Goodman@Goodmanprotocol·
Built a typography-based travel poster prompt that turns any city name into a premium editorial-style illustrated destination poster ✈️

Tip: if the city name is long (like SINGAPORE or LOS ANGELES), use a wide format like 16:9. If it's short (like ROME or KYOTO), vertical ratios like 3:4 or 4:5 usually look much stronger visually.

GPT Image 2 on ChatGPT

Prompt:
Create an ultra-high-resolution typography-based travel poster design themed around [CITY_NAME].

Aspect ratio: 16:9 poster

IMPORTANT: All visible text inside the poster must be in English only. Typography must be perfectly spelled and professionally typeset. Absolutely no distorted letters, random symbols, broken text, or AI-generated gibberish.

CORE COMPOSITION:
- Place the giant English word "[CITY_NAME]" prominently in the center of the composition
- Each individual letter should contain a different illustrated scene from the city
- Letters should be tall, elongated, bold sans-serif forms
- The typography itself should feel like a series of "city gallery windows"
- Distribute landmarks, streets, transportation, nature, culture, and architecture naturally across the letters
- Scenes should visually flow from one letter into another like one connected urban panorama

TOP HORIZONTAL STRIP:
At the top of the poster, include a thin panoramic horizontal strip containing:
- city skyline silhouettes
- cars
- trams or trains
- boats if relevant
- birds
- clouds
- sun
All elements should appear minimalist, elegant, and rhythmically balanced.

STYLE: mid-century modern editorial poster, Swiss graphic design, minimal vector illustration, architectural infographic aesthetic, travel typography poster, flat geometric illustration, ultra clean composition, premium magazine design, screen print poster feeling, retro-futuristic travel branding

ILLUSTRATION STYLE:
- flat vector shapes only
- no realism
- no gradients
- no texture noise
- clean geometric shadows
- simplified architectural forms
- map-like top-down illustration mixed with side-view cityscape
- subtle line-art details
- perfectly clean vector edges
- strong negative space usage
- harmonious visual rhythm between letters

TYPOGRAPHY:
- giant bold sans-serif typography
- letters occupy most of the canvas height
- ultra precise alignment
- each letter acts as an independent framed illustration panel
- smooth rounded corners where appropriate
- editorial spacing
- highly balanced composition
- typography must look professionally designed, print-ready, and geometrically perfect

COLOR PALETTE:
Automatically derive a cohesive palette inspired by [CITY_NAME].
Examples:
- coastal city → aqua, sand, coral, muted teal
- desert city → terracotta, beige, warm cream
- cyber city → mint, navy, steel blue
- historic European city → dusty rose, olive green, parchment
Use:
- muted pastel tones
- soft vintage travel poster colors
- elegant low-saturation combinations
- maximum 4–6 colors only

CONTENT GENERATION:
Automatically include:
- iconic landmarks of [CITY_NAME]
- famous streets and transportation
- local urban patterns
- nearby nature elements
- skyline silhouettes
- bridges, rivers, or coastline if relevant
- culturally symbolic architecture
- recognizable local atmosphere

COMPOSITION:
- centered typography composition
- white or soft ivory background
- lots of breathing room
- top panoramic strip balances the heavy typography below
- asymmetrical but visually balanced layout
- each letter contains different scene depth and perspective
- premium poster hierarchy with museum-quality layout

MOOD: premium, intellectual, calm, design-forward, travel editorial aesthetic, stylish enough for a museum gift shop poster

QUALITY: 8K ultra detailed, print-ready, extremely sharp vector edges, perfect typography rendering, clean professional graphic design, high-end editorial poster quality, no distorted text, no random characters, no spelling errors, no AI artifacts
English
40
62
592
47.2K
John retweetledi
比特币橙子Trader
比特币橙子Trader@oragnes·
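Codex App / CLI can now connect directly to stock, earnings, SEC filing, and financial news data. It uses the official Financial Datasets MCP Server. This is not just a "check the stock price" plugin: it wires a financial data source into the AI agent, so Codex can pull real-time data and analyze it at the same time.

What can it do?
1. Look up the latest stock prices. For example, the latest price, change, and volume for AAPL, NVDA, or TSLA.
2. Look up historical prices. You can view a stock's OHLCV data over a past period: open, high, low, close, volume.
3. Analyze financial statements. It can read revenue, profit, balance sheet, cash flow, and similar data, and run year-over-year, quarter-over-quarter, margin, and cash-flow-quality analysis.
4. Check valuation metrics. For example P/E, market cap, EV/Revenue, and dividend yield.
5. Look up SEC filings. You can have Codex summarize the risk factors, management discussion, and material events in 10-K, 10-Q, and 8-K filings.
6. Follow company news. Combine recent news to analyze short-term catalysts, risks, and market sentiment.
7. Compare companies side by side. For example, have Codex compare NVDA, AMD, and AVGO on growth, margins, valuation, and risk.
8. Screen stocks. Filter for companies that match criteria such as industry, valuation, revenue scale, and margins.

Codex CLI installation (bash):
codex mcp add financial-datasets --url mcp.financialdatasets.ai
codex mcp login financial-datasets
codex mcp list

Running login takes you through OAuth sign-in to Financial Datasets. If you use Codex App, the configuration is shared as long as it runs under the same macOS user as the CLI. After installing, restart Codex App and it will be available in a new session.

Usage examples:
- Use financial-datasets to look up NVDA's latest stock price; return only the price, time, and source.
- Analyze AAPL: latest stock price, most recent earnings, valuation, news catalysts, and main risks, then give me a short-term/medium-term conclusion.
- Compare TSLA, BYD, and RIVN on revenue growth, gross margin, cash flow, and valuation, and judge whose fundamentals are healthier.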
比特币橙子Trader@oragnes

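How to get two GPT Team accounts for 20 US dollars: today I'm sharing a detailed tutorial (worth bookmarking) with a payment method that actually goes through. The official promotion has little to do with your local IP and the like; it depends only on the invite link. If invite link A doesn't work, use invite link B. As for payment, I tried ordinary domestic Visa credit cards and virtual cards and none of them went through; the message was "payment not approved". Then, with a friend's help, I tried the method below and it went through instantly, and I successfully bought two ChatGPT Team accounts for 20 US dollars. Now for the actual tutorial, where payment goes through instantly.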

中文
8
43
154
22.6K
John retweetledi
Zhengtong Xu
Zhengtong Xu@XuZhengtong·
Residual latent action is all you need to train an effective and generalizable world model!! Try our Colab demo and check out the website to explore RLA-WM and its applications to robot policies (world action model and RL). Website: mlzxy.github.io/rla-wm/ Colab: colab.research.google.com/github/mlzxy/r…
Xinyu Zhang@XinyuZhang82004

Learning a world model in a latent space capturing state evolution, use it for world action model and visual reinforcement learning.
🚀 Introducing RLA-WM, a simple and efficient state-of-the-art world model.
✂️ RLA-WM decouples a world action model from the video backbone, and enables the first demonstration 🎬 of visual reinforcement learning entirely inside our world model, learned only from videos (World Model-based RL, WMRL).
⚡ Talk is cheap, open the notebook in Colab ▶️ to run RLA-WM and WMRL on a single T4 GPU!
🌐 Website 📄 Paper: mlzxy.github.io/rla-wm
▶️ Colab: colab.research.google.com/github/mlzxy/r…
👏 Our work, "Learning Visual Feature-Based World Models via Residual Latent Action", is a collaborative effort by computer vision and robotics researchers from Rutgers University, Purdue University, and the University of Wisconsin-Madison. Shoutout to my amazing collaborators! @XuZhengtong , Yutian Tao, @YepingWang , @yushe_1 , @ABoularias

English
1
13
60
7.2K
John retweetledi
烟花老师
烟花老师@teach_fireworks·
Based on an expert's recommendations, I've put together a high-quality list of learning resources for AI Engineers. Worth bookmarking and studying! So dense, so dense! 🥳🥳🥳
There are 11 parts in total, which is too long to fit here, so the remaining 6 parts are in the replies.

1. Harness engineering, not just prompt engineering
- Article | Martin Fowler: Harness Engineering for Coding Agent Users. Understand "agent = model + harness", that is, everything outside the model: context assembly, tool interfaces, state, the execution loop, error handling, evaluation, and the observability layer. Link: martinfowler.com/articles/harne…
- Article | Anthropic: Building Effective AI Agents. Learn engineering principles such as agentic workflows, tool use, the agent-computer interface, transparent planning, and simple design. Link: anthropic.com/research/build…
- Article | OpenAI: Unrolling the Codex Agent Loop. See how a real coding agent harness organizes the model, tools, prompts, execution loop, and performance design. Link: openai.com/index/unrollin…
- YouTube | How We Build Effective Agents: Barry Zhang, Anthropic. A video companion to Anthropic's agent architecture article. Link: youtube.com/watch?v=D7_ipD…

2. Prompt caching vs. semantic caching tradeoffs
- Official docs | OpenAI Prompt Caching. Learn the provider-side mechanism of prompt caching, when it applies, and cached token accounting. Link: developers.openai.com/api/docs/guide…
- Official docs | Anthropic Prompt Caching. Learn the difference between automatic caching and explicit cache breakpoints. Link: platform.claude.com/docs/en/build-…
- Article | Redis: Prompt caching vs semantic caching. Build the tradeoff: prompt caching suits reusing a fixed long context; semantic caching suits reusing answers to "semantically similar" questions; production systems often combine the two. Link: redis.io/blog/prompt-ca…
- PDF | GPTCache: An Open-Source Semantic Cache for LLM Applications. Learn a paper-level implementation of a semantic cache: embeddings, similarity search, cache hits, the risk of false hits, and cost/latency gains. Link: aclanthology.org/2023.nlposs-1.…

3. KV cache management at scale
- PDF | vLLM / PagedAttention: Efficient Memory Management for LLM Serving. The core paper; focus on how PagedAttention splits the KV cache into blocks to reduce fragmentation and raise serving throughput. Link: arxiv.org/pdf/2309.06180
- Docs | vLLM Automatic Prefix Caching Implementation. See the engineering implementation: the KV cache is split into KV blocks, which may be stored in non-contiguous physical memory. Link: docs.vllm.ai/en/v0.6.1/auto…
- PDF | LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference. Learn cross-request and cross-engine KV cache reuse, offloading, and prefill-decode disaggregation. Link: lmcache.ai/tech_report.pdf
- YouTube | Fast LLM Serving with vLLM and PagedAttention. Pair with the paper to build intuition for vLLM serving, the KV cache, and PagedAttention. Link: youtube.com/watch?v=5ZlavK…

4. Speculative decoding vs. quantization
- PDF | Fast Inference from Transformers via Speculative Decoding. The classic speculative decoding paper; understand the mechanism in which a draft model guesses tokens first and the target model verifies them in parallel. Link: arxiv.org/pdf/2211.17192
- PDF | QLoRA: Efficient Finetuning of Quantized LLMs. Understand quantization basics: 4-bit quantized models, LoRA adapters, NF4, double quantization, paged optimizers. Link: openreview.net/pdf?id=OUIFPHE…
- PDF | QSPEC: Speculative Decoding with Complementary Quantization Schemes. Specifically studies how speculative decoding and quantization can be combined. Link: aclanthology.org/2025.emnlp-mai…
- Article | Google Cloud: Five techniques to reach the efficient frontier of LLM inference. Puts continuous batching, paged attention, routing, speculative decoding, and quantization into a single inference optimization framework. Link: cloud.google.com/blog/topics/de…
- YouTube | Faster LLMs: Accelerate Inference with Speculative Decoding. An introductory video on speculative decoding. Link: youtube.com/watch?v=VkWlLS…

5. Structured output failures & fallback chains
- Official docs | OpenAI Structured Outputs. Learn the basic capabilities and limits of JSON Schema, strict schemas, and structured responses. Link: developers.openai.com/api/docs/guide…
- Article | OpenAI: Introducing Structured Outputs in the API. Understand the difference between JSON mode and Structured Outputs: JSON mode does not guarantee the output matches the schema. Link: openai.com/index/introduc…
- Docs | Instructor: Structured LLM Outputs + Validation / Reasking. Learn Pydantic schemas and the pattern of automatic retry / re-ask after a validation failure. Link: python.useinstructor.com
- Docs | Pydantic AI Output Validation. Learn why application-layer validation and a retry budget are still needed on top of a model's native structured output. Link: pydantic.dev/docs/ai/core-c…
- Docs | Guardrails AI. Learn how to treat raw output, validated output, and validation success/failure as system state. Link: guardrailsai.com/guardrails/doc…
- YouTube | Validate & Standardize LLM Output with Guardrails-AI. A hands-on video on output validation and standardization. Link: youtube.com/watch?v=r3JdQx…
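The "validation failure, then retry / re-ask, then fall back" pattern from section 5 can be sketched in a few lines of plain Python. This is an illustration only, not the Instructor, Pydantic AI, or Guardrails API: `validate`, `structured_call`, the stub models, and the re-ask wording are all hypothetical, and a model here is simply any callable that takes a prompt string and returns text.

```python
import json


def validate(raw, required):
    """Parse `raw` as JSON and check required keys and types. Returns (data, error)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "top-level value is not an object"
    for key, expected_type in required.items():
        if key not in data:
            return None, f"missing key: {key}"
        if not isinstance(data[key], expected_type):
            return None, f"wrong type for key: {key}"
    return data, None


def structured_call(prompt, models, required, retries_per_model=2):
    """Try each model in order, re-asking with the validation error until the budget is spent."""
    attempts = []  # (raw output, error) pairs, kept as system state for observability
    for model in models:
        message = prompt
        for _ in range(retries_per_model):
            raw = model(message)
            data, error = validate(raw, required)
            attempts.append((raw, error))
            if error is None:
                return data, attempts
            # Re-ask: feed the validation error back to the same model.
            message = f"{prompt}\nPrevious output was rejected ({error}). Return valid JSON only."
    return None, attempts


# Stub "models" standing in for real API calls (assumptions for the demo).
def flaky_model(message):
    return "Sure! Here is your trip plan."


def fallback_model(message):
    return '{"city": "Rome", "days": 3}'


schema = {"city": str, "days": int}
result, history = structured_call("Plan a trip as JSON.", [flaky_model, fallback_model], schema)
print(result)
print(len(history), "attempts")
```

Keeping every raw output and error in `attempts` mirrors the advice in the list to treat validation success and failure as system state, and the fixed `retries_per_model` acts as a simple retry budget.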
Akshay 🚀@akshay_pachaar

As an AI Engineer. Please learn:
- Harness engineering, not just prompt engineering
- Prompt caching vs. semantic caching tradeoffs
- KV cache management at scale
- Speculative decoding vs quantization
- Structured output failures & fallback chains
- Evals (LLM-as-judge + human evals)
- Cost attribution per feature, not just per model
- Agent guardrails & loop budgets
- LLM observability as a first-class discipline
- Model routing & graceful fallback logic
- Knowing when to fine-tune vs. in-context learning

中文
18
83
352
26.9K
John retweetledi
Berryxia.AI
Berryxia.AI@berryxia·
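I just came across a post by CJ Zafir about fine-tuning small models, and after reading it I think the advice is really practical.
He says it straight: if you also like fine-tuning open-source models, listen to these first.
Start practicing with small models at 1B, 2B, 4B, and 8B; don't jump straight to 27B and above.
For cloud GPUs, Google Colab Pro is enough. An A100 80GB costs only about $0.60 an hour, which is plenty for small models.
Build your own datasets: plan with Codex 5.5 first, then use DeepSeek v4 Pro to generate each row of data.
For base models he recommends Unsloth's instruct versions, pulled directly from Hugging Face. Use their fine-tuning notebooks as a reference too, and hand them to Codex so it can rewrite them into the configuration you want.
He suggests spending one day going through these topics: SFT, RL training (GRPO, DPO, PPO, and so on), LoRA / QLoRA, quantization types, local inference engines (llama.cpp), KV cache, and prompt cache.
He says to just get started; Claude, Codex, and ChatGPT can all design a complete plan for your first step.
Finally, he mentions that the technology is heading toward 5B to 15B Expert Language Models rather than endlessly scaling up general-purpose large models, so fine-tuning is a skill well worth learning now.
Many companies are willing to pay $50,000 or more for you to train a personalized model on their own data.
The gist of the whole post: anyone can get started with fine-tuning. Tune models, test models, use models, and over time you can turn it into a solid career.
Take a look if you're interested; it's quite a good read.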
CJ Zafir@cjzafir

If you love fine-tuning open-source models (like me), then listen.
> Start with 1B, 2B, 4B, and 8B models. (Don't start with a 27B model or bigger at first.)
> Use WebGPU providers. I use Google Colab Pro for any model smaller than 9B. A single A100 80GB costs around $0.60/hr, which is cheap. Enough for small models.
> Don't buy GPUs unless you fine-tune 7 to 10 models. You'll understand the nitty-gritty in the process.
> Use Codex 5.5 × DeepSeek v4 Pro to create datasets. Codex to plan, DeepSeek v4 Pro to generate rows.
> Use Unsloth's instruct models as a base from Hugging Face. Yes, there are others too, but Unsloth also provides fast fine-tuning notebooks.
> Use Unsloth's fine-tuning notebooks as a reference. Paste them into Codex, and Codex will write a custom notebook with the configs you need.
> Spend 1 day learning about:
- SFT (supervised fine-tuning)
- RL training (GRPO, DPO, PPO, etc.)
- LoRA / QLoRA training
- Quantization and types
- Local inference engines (llama.cpp)
- KV cache and prompt cache
> Just get started. Claude, Codex, and ChatGPT can design a step-by-step plan for how you can fine-tune your first AI model.
Future tech is moving toward small 5B to 15B ELMs (Expert Language Models) rather than general 1T LLMs. So fine-tuning is an important skill that anyone can acquire today. Tune models, test them, use them. Then fine-tune for companies and make a career out of it. (Companies pay $50k+ to fine-tune models on their data so they can get personalized AI models.)
Shoot your questions below. I'll be sharing in-depth raw findings about this topic in the coming days.

中文
25
111
567
73.6K