Ruixiang Zhang (@onloglogn) - Twitter Profile

Pinned Tweet

At #NeurIPS2025 Tue-Fri presenting 3 papers from our 🍎Apple ML research team. Interested in LLM, RL, reasoning, and diffusion LLMs. We also have FY26 research intern and full-time positions available. DM me if interested for a chat!

Ruixiang Zhang@onloglogn·3d

Thanks @BoWang87 for posting our work! We have released our model checkpoints on 🤗 at huggingface.co/collections/ap… Please also check out our detailed thread on this work at x.com/YizheZhangNLP/…

Bo Wang@BoWang87

Apple Research just published something really interesting about post-training of coding models. You don't need a better teacher. You don't need a verifier. You don't need RL. A model can just… train on its own outputs. And get dramatically better. Simple Self-Distillation (SSD): sample solutions from your model, don't filter them for correctness at all, fine-tune on the raw outputs. That's it. Qwen3-30B-Instruct: 42.4% → 55.3% pass@1 on LiveCodeBench. +30% relative. On hard problems specifically, pass@5 goes from 31.1% to 54.1%. Works across Qwen and Llama, at 4B, 8B, and 30B. One sample per prompt is enough. No execution environment. No reward model. No labels. SSD sidesteps this by reshaping distributions in a context-dependent way, suppressing distractors at locks while keeping diversity alive at forks. The capability was already in the model. Fixed decoding just couldn't access it. The implication: a lot of coding models are underperforming their own weights. Post-training on self-generated data isn't just a cheap trick; it's recovering latent capacity that greedy decoding leaves on the table. paper: arxiv.org/abs/2604.01193 code: github.com/apple/ml-ssd
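The recipe described above (sample from the model with a chosen temperature and truncation, keep every sample unfiltered, then fine-tune on the result) can be sketched roughly as follows. This is a toy illustration, not the code from github.com/apple/ml-ssd: the function names are hypothetical, top-p (nucleus) truncation is assumed here only because the thread says "truncation" without naming the scheme, and `sample_fn` stands in for whatever model call produces a solution. The fine-tuning step itself is omitted.

```python
import math


def truncated_temperature_probs(logits, temperature=1.0, top_p=1.0):
    """Token probabilities after temperature scaling and top-p truncation.

    Assumption: top-p is used as the truncation rule purely for illustration.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalise over the kept tokens; truncated tokens get probability 0.
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass
    return out


def collect_self_distillation_data(prompts, sample_fn, samples_per_prompt=1):
    """Pair each prompt with raw samples; nothing is filtered for correctness."""
    return [(p, sample_fn(p)) for p in prompts for _ in range(samples_per_prompt)]


def relative_gain(before, after):
    """Relative improvement, e.g. of pass@1 scores given in percent."""
    return (after - before) / before
```

With the figures quoted in the tweet, `relative_gain(42.4, 55.3)` is about 0.304, which matches the "+30% relative" claim; the absolute gain is 12.9 points.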

Ruixiang Zhang retweeted

Huangjie Zheng@UnderGroundJeg·3d

Self-improvement is always an attractive goal, but actually how? 👇 Check out our latest work, SimpleSD (SSD)! We found a simple solution that requires no external verifier. Just by adjusting temperature and truncation in self-sampling, the model got impressive performance gains

Yizhe Zhang @ ICLR 2026@YizheZhangNLP

1/6 The "Self-Improvement" Paradox Can an LLM get smarter using only its own raw, unverified outputs? No verifiers. No teachers. No RL. We found the answer is an emphatic YES. Introducing SimpleSD: Embarrassingly Simple Self-Distillation. By simply sampling solutions from a model with specific temperature and truncation settings and then fine tuning the model on those exact samples, Qwen3-30B jumped from 42.4% to 55.3% (30% improvement) on LiveCodeBench v6 just by training on its own samples! 🚀 The gain is universal across different model sizes (4B, 8B, 30B) and model families (Llama, Qwen). The harder the problem is, the larger the gain. 📈 Kudos to my amazing colleagues @onloglogn, @richard_baihe, @UnderGroundJeg, Navdeep Jaitly, @trebolloc. Check out the paper and code below! 👇 paper: arxiv.org/abs/2604.01193 code: github.com/apple/ml-ssd HF models: huggingface.co/collections/ap…

Ruixiang Zhang retweeted

Richard He Bai@richard_baihe·3d

Excited to share our new work — SimpleSD: Embarrassingly Simple Self-Distillation! 🎉 Huge thanks to my incredible co-authors for making this happen! 📄 Paper: arxiv.org/abs/2604.01193 💻 Code: github.com/apple/ml-ssd 🤗 Models: huggingface.co/collections/ap…

Yizhe Zhang @ ICLR 2026@YizheZhangNLP

1/6 The "Self-Improvement" Paradox Can an LLM get smarter using only its own raw, unverified outputs? No verifiers. No teachers. No RL. We found the answer is an emphatic YES. Introducing SimpleSD: Embarrassingly Simple Self-Distillation. By simply sampling solutions from a model with specific temperature and truncation settings and then fine tuning the model on those exact samples, Qwen3-30B jumped from 42.4% to 55.3% (30% improvement) on LiveCodeBench v6 just by training on its own samples! 🚀 The gain is universal across different model sizes (4B, 8B, 30B) and model families (Llama, Qwen). The harder the problem is, the larger the gain. 📈 Kudos to my amazing colleagues @onloglogn, @richard_baihe, @UnderGroundJeg, Navdeep Jaitly, @trebolloc. Check out the paper and code below! 👇 paper: arxiv.org/abs/2604.01193 code: github.com/apple/ml-ssd HF models: huggingface.co/collections/ap…

Ruixiang Zhang retweeted

Yizhe Zhang @ ICLR 2026@YizheZhangNLP·3d

1/6 The "Self-Improvement" Paradox Can an LLM get smarter using only its own raw, unverified outputs? No verifiers. No teachers. No RL. We found the answer is an emphatic YES. Introducing SimpleSD: Embarrassingly Simple Self-Distillation. By simply sampling solutions from a model with specific temperature and truncation settings and then fine tuning the model on those exact samples, Qwen3-30B jumped from 42.4% to 55.3% (30% improvement) on LiveCodeBench v6 just by training on its own samples! 🚀 The gain is universal across different model sizes (4B, 8B, 30B) and model families (Llama, Qwen). The harder the problem is, the larger the gain. 📈 Kudos to my amazing colleagues @onloglogn, @richard_baihe, @UnderGroundJeg, Navdeep Jaitly, @trebolloc. Check out the paper and code below! 👇 paper: arxiv.org/abs/2604.01193 code: github.com/apple/ml-ssd HF models: huggingface.co/collections/ap…

Ruixiang Zhang retweeted

Yizhe Zhang @ ICLR 2026@YizheZhangNLP·5d

Latent reasoning is an interesting domain. It bridges continuous and discrete modalities, and bridges autoregressive and non-autoregressive thinking.

Yann LeCun@ylecun

@elonmusk Thinking in language has limited applications, largely in coding and mathematics where the language itself can help reasoning. But, as I've been saying for years, thinking manipulates mental models in abstract (continuous) representation space. Soooo, xAI gonna use JEPA now?

Ruixiang Zhang@onloglogn·17 Mar

I remember Shuangfei introducing this idea to me some time ago and walking me through a few promising preliminary results. It’s really great to see this direction now validated at a larger scale!

Shuangfei Zhai@zhaisf

I explored the same thing 2-3 years back and got some positive results, but wasn’t convinced it was worth the overhead/complexity so I quickly put it on the shelf (maybe should have written something about it anyway). One thing the Kimi paper didn’t highlight but I think is worth mentioning: the idea of treating network depth as sequence modeling is exactly what gave rise to the Highway Network, which was all about applying an LSTM along depth. In this sense, taking one step further and replacing the LSTM with attention should come very naturally. The only caveat here is that attention in the standard seq modeling case is fully parallelized, which makes it extremely efficient at training time; applying it along depth unfortunately loses this benefit, and computation overhead could become a real concern (but it appears not as bad as I originally thought based on the new paper’s large scale results).

Ruixiang Zhang retweeted

Shuangfei Zhai@zhaisf·12 Mar

Say hi to Exclusive Self Attention (XSA), a (nearly) free improvement to Transformers for LM. Observation: for y = attn(q, k, v), yᵢ and vᵢ tend to have a very high cosine similarity Fix: exclude vᵢ from yᵢ via zᵢ = yᵢ - (yᵢᵀvᵢ)vᵢ/‖vᵢ‖² Result: better training/val loss across model sizes; increasing gains as sequence length grows. See more: arxiv.org/abs/2603.09078
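The fix in the tweet is a vector projection: zᵢ is what remains of yᵢ after removing its component along vᵢ, so zᵢ is orthogonal to vᵢ. A minimal per-position sketch in plain Python follows. The function name is hypothetical, and where this step sits inside the attention layer (per head, before or after the output projection) and how a zero value vector is handled are assumptions, since the tweet does not say.

```python
def exclude_own_value(y, v):
    """Return z = y - (y . v) v / ||v||^2 for one position's vectors."""
    dot_yv = sum(a * b for a, b in zip(y, v))
    norm_sq = sum(b * b for b in v)
    if norm_sq == 0.0:
        # Assumption: with a zero value vector there is nothing to remove.
        return list(y)
    scale = dot_yv / norm_sq
    return [a - scale * b for a, b in zip(y, v)]


# Example: the attention output y leans heavily on the position's own value v.
y = [3.0, 1.0]
v = [2.0, 0.0]
z = exclude_own_value(y, v)  # the component of y along v is removed
```

Here `z` has zero dot product with `v`, so its cosine similarity with `v` is zero, which is the stated goal of the fix.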

Ruixiang Zhang retweeted

Huangjie Zheng@UnderGroundJeg·9 Feb

It's great to see @KL_Div's derivation here, which reminds me of the CT work series we did a few years ago. (cc @MingyuanZhou). At that time we had the reweighting, and tried using a learnable network or using pretrained features for this map. Excited to see the connections!

Ke Li 🍁@KL_Div

Thanks for pointing out the similarity between drifting and Implicit Maximum Likelihood Estimation! I worked out the mathematical connection - the crux is that drifting fields are similar to the gradient of a soft version of the IMLE loss. So drifting is defined in terms of the gradient, whereas IMLE is defined in terms of the objective, but the behaviour should be similar. It's reminiscent of the formulation of classical mechanics vs. Lagrangian mechanics from physics. One difference is that in drifting the weights on the positive samples and the negative samples are different, whereas they are the same in IMLE. It'd be interesting to see if the negative weights can be replaced with positive weights.

Ruixiang Zhang retweeted

Yizhe Zhang @ ICLR 2026@YizheZhangNLP·10 Feb

We found that latent reasoning + RL achieved 20.5 on AIME25 and 52.7 on LCB v6 for an 8B model with 2x faster reasoning. Also, surprisingly, RL for latent reasoning does not seem to suffer from entropy/diversity collapse.

Murray Kang@haoqik322

1/9 Softmax is the enemy of diversity in reward-maximization RL like GRPO. 📉 Recent analysis reveals: As RL boosts a "correct" token, Softmax automatically suppresses all others to maximize reward. This mechanism aggressively drives down entropy. This is Mode Elicitation: trading creativity for a local optimum. To fix this, we need to escape the discrete space. 🧵👇
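The mechanism in this thread can be checked numerically: because softmax probabilities sum to one, raising one token's logit lowers every other token's probability. A small sketch with made-up logits follows; the helper names are hypothetical. Entropy drops in this example because the boosted token is already the most likely one, and boosting a rare token could raise entropy at first.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def boost(logits, index, amount):
    """Copy of logits with one entry increased, as a reward update might do."""
    out = list(logits)
    out[index] += amount
    return out


logits = [1.0, 0.5, 0.0, -0.5]  # illustrative values only
before = softmax(logits)
after = softmax(boost(logits, 0, 2.0))
print(entropy(before), entropy(after))  # the second value is smaller
```

Only the logit of token 0 changed, yet the probabilities of tokens 1 to 3 all shrink, which is the suppression the thread describes.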

Ruixiang Zhang retweeted

Yuyang Wang@YuyangW95·28 Jan

Now accepted to ICLR 2026! Check our repo if interested in how a "simple" flow-matching model with standard Transformer works for protein folding: github.com/apple/ml-simpl…. See you in Brazil!

Yuyang Wang@YuyangW95

New preprint & open-source! 🚨 “SimpleFold: Folding Proteins is Simpler than You Think” (arxiv.org/abs/2509.18480). We ask: Do protein folding models really need expensive and domain-specific modules like pair representation? We build SimpleFold, a 3B scalable folding model solely built on general-purpose transformers + flow matching, and is trained on 9M structures. SimpleFold supports easy deployment and efficient inference on consumer-level hardware with PyTorch/MLX (try it on your MacBook!) (1/n)

Ruixiang Zhang retweeted

Yuyang Wang@YuyangW95·7 Dec

I’ll present SimpleFold at MLSB workshop tomorrow. Come by if interested!

Yuyang Wang@YuyangW95

New preprint & open-source! 🚨 “SimpleFold: Folding Proteins is Simpler than You Think” (arxiv.org/abs/2509.18480). We ask: Do protein folding models really need expensive and domain-specific modules like pair representation? We build SimpleFold, a 3B scalable folding model solely built on general-purpose transformers + flow matching, and is trained on 9M structures. SimpleFold supports easy deployment and efficient inference on consumer-level hardware with PyTorch/MLX (try it on your MacBook!) (1/n)

Ruixiang Zhang retweeted

Shuangfei Zhai@zhaisf·7 Dec

Check out the new addition to our TarFlow franchise. TLDR: normalizing flows “just work” for generating videos. This adds another strong piece of evidence to our argument that NFs are capable generative models; and I’m now more convinced than ever that they will continue working better.

Jiatao Gu@thoma_gu

STARFlow gets an upgrade: it now works on videos🎥 We present STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flows, an invertible, causal video generator built on autoregressive flows! 📄 Paper huggingface.co/papers/2511.20… 💻 Code github.com/apple/ml-starf… (1/10)

Ruixiang Zhang@onloglogn·5 Dec

@alfcnz @UnderGroundJeg

QAM

Alfredo Canziani@alfcnz·4 Dec

Cute poster! 🥰🥰🥰 #NeurIPS2025

Ruixiang Zhang@onloglogn·2 Dec

@_rabiulawal Just opened! :)

Rabiul Awal@_rabiulawal·2 Dec

@onloglogn dm closed :(

Ruixiang Zhang@onloglogn·2 Dec

At #NeurIPS2025 Tue-Fri presenting 3 papers from our 🍎Apple ML research team. Interested in LLM, RL, reasoning, and diffusion LLMs. We also have FY26 research intern and full-time positions available. DM me if interested for a chat!

Ruixiang Zhang retweeted

Yuyang Wang@YuyangW95·2 Dec

I’ll be in San Diego attending #NeurIPS2025 Dec 3-7. DM me if interested in diffusion models, multimodal, or protein generative models! We’re looking for FTEs to join us working on generative models. You can also find me at the Apple booth on Dec 3, 3-5pm.

Yuyang Wang@YuyangW95

New preprint & open-source! 🚨 “SimpleFold: Folding Proteins is Simpler than You Think” (arxiv.org/abs/2509.18480). We ask: Do protein folding models really need expensive and domain-specific modules like pair representation? We build SimpleFold, a 3B scalable folding model solely built on general-purpose transformers + flow matching, and is trained on 9M structures. SimpleFold supports easy deployment and efficient inference on consumer-level hardware with PyTorch/MLX (try it on your MacBook!) (1/n)

Ruixiang Zhang@onloglogn·27 Nov

Who thought we would get normalizing flows for video generation before GTA6… super impressive work by the @thoma_gu @zhaisf team

Jiatao Gu@thoma_gu

STARFlow gets an upgrade: it now works on videos🎥 We present STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flows, an invertible, causal video generator built on autoregressive flows! 📄 Paper huggingface.co/papers/2511.20… 💻 Code github.com/apple/ml-starf… (1/10)
