
Really interesting scoping review that points out numerous flaws in LLM-as-Judge evaluation in healthcare, including minimal human oversight, absent bias testing, model monoculture, ignore implicit eval components, no check for consistency over time (etc)
arxiv.org/abs/2604.25933
English











