We trained Composer to self-summarize through RL rather than through a fixed prompt.
This halves the error introduced by compaction and lets Composer succeed on challenging coding tasks that require hundreds of actions.
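To make the mechanism concrete, here is a minimal sketch of summarization-based compaction in an agent loop. The names, token budget, and `summarize` callback are illustrative assumptions rather than Composer's actual implementation; the distinctive part of our setup is that the summarization behavior itself is learned through RL instead of being written as a prompt.

```python
# A minimal sketch of summarization-based context compaction in an agent loop.
# The names, token budget, and `summarize` callback are illustrative assumptions,
# not Composer's actual implementation.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Message:
    role: str      # "user", "assistant", or "tool"
    content: str

MAX_CONTEXT_TOKENS = 32_000  # assumed budget that triggers compaction
KEEP_RECENT = 8              # assumed number of recent messages kept verbatim

def count_tokens(messages: list[Message]) -> int:
    # Rough proxy; a real system would use the model's tokenizer.
    return sum(len(m.content) // 4 for m in messages)

def compact_history(
    messages: list[Message],
    summarize: Callable[[list[Message]], str],
) -> list[Message]:
    """Replace older messages with a model-written summary once the history
    exceeds the token budget, keeping the most recent turns verbatim."""
    if count_tokens(messages) <= MAX_CONTEXT_TOKENS:
        return messages
    old, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    summary = summarize(old)  # the model summarizes its own earlier work
    return [Message("assistant", f"[Summary of earlier work]\n{summary}")] + recent
```

Because compaction is applied repeatedly over a long run, errors in each summary compound, which is why reducing per-compaction error matters for tasks with hundreds of actions.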
We use a combination of offline benchmarks and online evals to measure model quality.
Combining the two gives a more reliable signal of quality, especially as public benchmarks become increasingly saturated.
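As a hedged sketch of how offline and online signals might be rolled into a single report, the example below aggregates benchmark pass rates with an online acceptance rate; the metric names, data shapes, and numbers are illustrative assumptions, not our actual evaluation pipeline.

```python
# A minimal sketch of combining offline benchmark scores with online eval signals
# into one quality report. The metric names, data shapes, and example numbers are
# illustrative assumptions, not our actual evaluation pipeline.
from statistics import mean

def offline_pass_rate(benchmarks: dict[str, list[bool]]) -> float:
    # Mean pass rate across offline benchmark suites (e.g. held-out coding tasks).
    return mean(mean(passes) for passes in benchmarks.values())

def online_acceptance_rate(accepted: int, shown: int) -> float:
    # Acceptance rate from online evals, e.g. how often suggested changes are kept.
    return accepted / shown if shown else 0.0

def quality_report(benchmarks: dict[str, list[bool]], accepted: int, shown: int) -> dict:
    return {
        "offline_pass_rate": offline_pass_rate(benchmarks),
        "online_acceptance_rate": online_acceptance_rate(accepted, shown),
    }

# Example: two offline suites plus a period of online acceptance data (made-up numbers).
print(quality_report(
    {"internal_suite": [True, True, False], "public_suite": [True, False, False, True]},
    accepted=820,
    shown=1000,
))
```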