
On the input side of things, researchers typically use the training sets of popular benchmarks as "seeds" to elicit knowledge from the teacher (e.g., they ask the teacher to expand on or enrich the data they provide it). It is unclear whether Chinese labs use this method for distillation, and my hunch is that they are doing much more than feeding in benchmark questions as seed knowledge. (For readers who want the mechanics, a hedged sketch of seed-based elicitation appears at the end of this post.)

3/ Distillation vs. other training methods. Another issue is that some of the literature does more than distillation alone. For example, the Phi-1.5 paper distills using billions of teacher-generated tokens, but it also trains the student on 6B tokens of "textbook-quality" data from the web, which was not generated by the teacher. That makes it harder to know how much the teacher-generated data, specifically, influenced the student's performance. (A second sketch at the end of the post illustrates this attribution problem.)

4/ Benchmark issues. Using benchmarks to evaluate the effectiveness of distillation can be problematic. Distillation clearly lifts student models' benchmark scores, but those scores may not reflect the degree to which the teacher's general knowledge and capabilities actually transfer to the student. Moreover, some of the distillation literature runs the risk of benchmaxxing, since the methods it employs lean heavily on enriching, and then fine-tuning on, benchmark data.

5/ Distillation in context. My last point, and perhaps one of the most important, is that Chinese labs' distillation efforts are one piece of a much broader post-training pipeline. It's hard to know how much uplift distillation provides relative to all of the other methods they employ to optimize model performance. This is another reason it's so easy to overstate or understate the role distillation plays in China's overall competitiveness.

6/ What to do. None of this is to suggest that we cannot know the effectiveness of Chinese distillation; rather, the literature paints only half of the picture. There's likely some uplift, but we cannot yet quantify it reliably. We need more research here: a better understanding of how distillation scales, and of the degree to which it provides uplift for strong student models on the most challenging benchmarks. Without this, we really can't know how much distillation helps Chinese labs.

The level of urgency here really does, in my view, boil down to one core question: how much knowledge and capability can you effectively distill from frontier proprietary models? If it's a minor uplift, then maybe we can accept it as a natural byproduct of API access, or apply defenses in ways that actually match the threat. If it's a major uplift that really helps Chinese labs compete, then more aggressive defenses and policies may need to be pursued.
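To make the seed method above concrete, here is a minimal sketch of what seed-based elicitation could look like. It assumes an OpenAI-compatible chat API via the `openai` Python SDK; the teacher model name, the prompt wording, and the use of a GSM8K training question as a seed are illustrative assumptions on my part, not details taken from any specific paper.

```python
# Minimal sketch of seed-based elicitation: prompt a teacher model with a
# benchmark *training* item and collect an enriched response for student SFT.
# Model name, prompt, and seed are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def elicit_from_seed(seed_question: str) -> str:
    """Ask the teacher to expand on / enrich one benchmark seed question."""
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical teacher; stands in for any frontier model
        messages=[
            {"role": "system",
             "content": ("You are a careful tutor. Given a question, write three "
                         "harder variations, each with a full worked solution.")},
            {"role": "user", "content": seed_question},
        ],
        temperature=0.8,  # some diversity helps when generating training data
    )
    return response.choices[0].message.content

# Seeds come from a benchmark's training split (e.g., GSM8K train), not test.
seeds = ["Natalia sold clips to 48 of her friends in April, and then she sold "
         "half as many clips in May. How many clips did Natalia sell altogether?"]

with open("distilled_sft_data.jsonl", "w") as f:
    for q in seeds:
        record = {"seed": q, "teacher_output": elicit_from_seed(q)}
        f.write(json.dumps(record) + "\n")
```

In practice a lab would run something like this over tens of thousands of seeds and then fine-tune the student on the collected outputs, which is part of why the line between benchmark enrichment and general capability transfer is so blurry.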

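And to illustrate the attribution problem from point 3: once teacher-generated data is shuffled into a mixture with data the teacher never produced, downstream benchmark gains cannot be cleanly credited to distillation. The file names and mixing ratio below are hypothetical placeholders, not the actual Phi-1.5 recipe.

```python
# Sketch of the attribution confound: the student trains on a blend of
# teacher-generated data and independent web data, so uplift reflects the
# mixture, not distillation alone. Paths and ratio are placeholders.
import json
import random

def load_jsonl(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]

teacher_generated = load_jsonl("distilled_sft_data.jsonl")    # teacher outputs
web_textbooks = load_jsonl("web_textbook_quality.jsonl")      # not teacher-made

MIX_RATIO = 0.5  # placeholder; papers rarely ablate this cleanly
n = min(len(teacher_generated), len(web_textbooks))
mixture = (random.sample(teacher_generated, int(n * MIX_RATIO))
           + random.sample(web_textbooks, n - int(n * MIX_RATIO)))
random.shuffle(mixture)

with open("student_train.jsonl", "w") as f:
    for example in mixture:
        f.write(json.dumps(example) + "\n")
```

Isolating distillation's contribution would require ablations (e.g., training identical students on each source alone), which most of the literature does not report.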












