
@davieball @OpenAI Right. Even with finbench or more specific benchmarks, I don’t think test suites rigorously measure whether and how the protocol was followed. Seems like you’d need to inspect and evaluate the raw agent activity and tool calls at every step.








