
@nateberkopec Yes, but not at massive scale yet.
It's amazing for finding paper cuts that add up fast. But it's also amazing at overfitting to a particular metric.
Main problem is the old one, benchmark noise: agents often pick changes that don't actually move the needle, and vice versa.
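
A minimal sketch of the noise problem, assuming you have repeated timings for the baseline and for the agent's change. The function name, the threshold `k`, and the sample numbers are all hypothetical, and a real setup would probably want a proper statistical test:

```python
import statistics

def is_real_improvement(baseline, candidate, k=2.0):
    """Return True only if the candidate's median timing beats the
    baseline's median by more than k times the observed noise.

    baseline, candidate: lists of repeated timings (lower is better).
    k: hypothetical safety factor; higher means more conservative.
    """
    base_median = statistics.median(baseline)
    cand_median = statistics.median(candidate)
    # Crude noise estimate: the larger of the two sample standard deviations.
    noise = max(statistics.stdev(baseline), statistics.stdev(candidate))
    return (base_median - cand_median) > k * noise

# Made-up timings in milliseconds, for illustration only.
baseline = [100, 102, 98, 101, 99]
within_noise = [99, 101, 97, 100, 98]   # 1 ms faster, inside run-to-run spread
clear_win = [80, 82, 78, 81, 79]        # 20 ms faster, well outside the spread

print(is_real_improvement(baseline, within_noise))
print(is_real_improvement(baseline, clear_win))
```

A single run of each would make the 1 ms "win" look real about as often as not, which is the trap agents fall into.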











