Max Kirkby
@maxkirkby
training models @baseten. PhD'ing @OxNeuro @rhodes_trust. hierarchical plans and continual learning

Best career hack is to make sure you’re the person in the room who's always having fun.
We replicated Microsoft Research's Generative Adversarial Distillation (GAD) to distill Qwen3-4B from GPT-5.2. Standard black-box distillation teaches a student to copy teacher outputs, but at inference the student generates from its own prefixes: small errors compound and it drifts off the expert distribution. GAD reframes this as an on-policy distillation problem, training a co-evolving discriminator that provides adaptive reward signals on the student's own generations. Exploring methods like this is how our post-training team surfaces new training patterns. Read here: baseten.co/resources/rese…
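
The loop described in the post has three moving parts: on-policy sampling, a discriminator update, and a reward-weighted student update. Below is a minimal, self-contained PyTorch sketch of that adversarial setup, using tiny stand-in models and random tensors in place of real prompts and teacher completions; TinyLM, Discriminator, and all hyperparameters here are illustrative assumptions, not Baseten's or Microsoft's implementation.

```python
# Hedged sketch of GAD-style on-policy distillation. Toy models and random
# "teacher" outputs stand in for Qwen3-4B and GPT-5; names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, SEQ_LEN, BATCH, HIDDEN = 100, 16, 8, 64

class TinyLM(nn.Module):
    """Stand-in student: per-position next-token logits."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HIDDEN)
        self.head = nn.Linear(HIDDEN, VOCAB)
    def forward(self, tokens):
        return self.head(self.emb(tokens))  # (B, T, VOCAB)

class Discriminator(nn.Module):
    """Scores whole sequences: high means 'looks like teacher output'."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HIDDEN)
        self.head = nn.Linear(HIDDEN, 1)
    def forward(self, tokens):
        return self.head(self.emb(tokens).mean(dim=1)).squeeze(-1)  # (B,)

student, disc = TinyLM(), Discriminator()
opt_s = torch.optim.Adam(student.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)

for step in range(100):
    prompts = torch.randint(0, VOCAB, (BATCH, SEQ_LEN))      # toy prompts
    teacher_out = torch.randint(0, VOCAB, (BATCH, SEQ_LEN))  # stand-in teacher text

    # 1) On-policy: sample completions from the student's *own* distribution,
    #    so training covers the prefixes it will actually see at inference.
    logits = student(prompts)
    dist = torch.distributions.Categorical(logits=logits)
    student_out = dist.sample()                              # (B, T)

    # 2) Discriminator step: separate teacher sequences (1) from student ones (0).
    d_loss = F.binary_cross_entropy_with_logits(
        disc(teacher_out), torch.ones(BATCH)
    ) + F.binary_cross_entropy_with_logits(
        disc(student_out.detach()), torch.zeros(BATCH)
    )
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 3) Student step: REINFORCE, with the co-evolving discriminator's score
    #    as an adaptive reward on the student's own generations.
    reward = torch.sigmoid(disc(student_out)).detach()       # (B,)
    log_prob = dist.log_prob(student_out).sum(dim=1)         # (B,)
    s_loss = -(reward * log_prob).mean()
    opt_s.zero_grad(); s_loss.backward(); opt_s.step()
```

The detail that distinguishes this from standard black-box distillation is step 1: the reward is computed on sequences the student sampled itself, so the training signal matches the distribution the student faces at inference rather than teacher-forced prefixes.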