Most surprising finding: the hardest red-team skill isn't making sabotages, it's predicting which single change actually flips a result. Many of my attempted sabotages didn't move the outcome of the experiment.
Can frontier LLMs and humans catch sabotage in ML research code?
In Auditing Sabotage Bench, I added subtle sabotages to 9 existing ML codebases which change a key finding of the research. Neither LLMs nor LLM-assisted humans reliably caught them.
If Tinker didn't exist I likely wouldn't work on this project at all. I'd tried renting GPUs before but scaling multi-node training is always a pain. I was basically waiting for something reliable to exist before diving into RL work
What I really like about Tinker is that you can directly look at the code, understand it, and modify it. It's much more flexible than other RL APIs, and you don't have to deal with the infra yourself. It's also just really nicely written.
I've been using Tinker at Redwood Research to RL-train long-context models like Qwen3-32B on difficult AI control tasks - specifically teaching models to write unsuspicious backdoors in code similar to the AI control paper. Early stages but seeing some interesting backdoors 👀