
Ivan's Cat

@xriskology It is a combination of name-calling, ad hominem, and in-group signaling of disdain/contempt. It’s communicating that these people are beneath our concern and that critiques like this are taboo and deserving of ridicule. It doesn’t engage with the substance of the article.



München and Hamburg will be the big winners of Berlin's current political decline. München will become the new start-up metropolis. Otherwise: Paris and London.



This paper tests an LLM on real NHS (National Health Service) medication reviews and finds it spots risks but misses safe fixes. The authors ran a medication safety reviewer on structured UK NHS primary care records, mostly coded fields without typed notes, then had an expert clinician grade 277 sampled patients. The AI did not miss any case where an intervention (a clear action, like starting or stopping a drug) was needed, but it produced a fully correct review in only 46.9% of patients. When it failed, the main issue was context: for example, acting confident despite missing details, applying guidelines (the usual best-practice rules) without considering patient goals, or mixing up drug facts. The paper argues this gap between spotting risk and choosing the right next step is why LLMs still need human checking in real clinics.
----
Paper Link – arxiv.org/abs/2512.21127
Paper Title: "A Real-World Evaluation of LLM Medication Safety Reviews in NHS Primary Care"







LLM as a judge has become a dominant way to evaluate how good a model is at solving a task, since it works without a test set and handles cases where answers are not unique. But despite how widely this is used, almost all reported results are highly biased. Excited to share our preprint on how to properly use LLM as a judge. 🧵

===

So how do people actually use LLM as a judge? Most people just use the LLM as an evaluator and report the empirical probability that the LLM says the answer looks correct. When the LLM is perfect, this works fine and gives an unbiased estimator. If the LLM is not perfect, this breaks.

Consider a case where the LLM evaluates correctly 80 percent of the time. More specifically, if the answer is correct, the LLM says "this looks correct" with 80 percent probability, and if the answer is incorrect, it says "this looks incorrect" with 80 percent probability. In this situation, you should not report the empirical probability, because it is biased.

Why? Let the true probability of the tested model being correct be p. Then the empirical probability that the LLM says "correct" (call it q) is

q = 0.8p + 0.2(1 - p) = 0.2 + 0.6p

so the unbiased estimate should be

p = (q - 0.2) / 0.6

Things get even more interesting if the error pattern is asymmetric or if you do not know these error rates a priori.

===

So what does this mean? First, follow the suggested guideline in our preprint. There is no free lunch: you cannot evaluate how good your model is unless your LLM as a judge is known to be perfect at judging it. Depending on how close it is to a perfect evaluator, you need a sufficiently large test set (a calibration set) to estimate the evaluator's error rates, and then you must correct for them.

Second, very unfortunately, many findings we have seen in papers over the past few years need to be revisited. Unless two papers used the exact same LLM as a judge, comparing results across them could have produced false claims; the improvement could simply come from changing the evaluation pipeline slightly. A rigorous meta-study is urgently needed.

===

tldr: (1) Almost all LLM-as-a-judge evaluations in the past few years were reported with a biased estimator. (2) It is easy to fix, so wait for our full preprint. (3) Many LLM-as-a-judge results should be taken with a grain of salt.

Full preprint coming in a few days, so stay tuned! Amazing work by my students and collaborators. @chungpa_lee @tomzeng200 @jongwonjeong123 and @jysohn1108
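To make the correction concrete, here is a minimal Python sketch of the debiasing step described in the thread, assuming the judge's error rates (its sensitivity and specificity) have already been estimated on a small labeled calibration set. The function name corrected_accuracy and its arguments are illustrative, not taken from the preprint.

```python
def corrected_accuracy(judge_said_correct, sensitivity, specificity):
    """Debias the raw 'judge says correct' rate.

    judge_said_correct: list of 0/1 judge verdicts on the test set
    sensitivity: P(judge says "correct" | answer is actually correct)
    specificity: P(judge says "incorrect" | answer is actually incorrect)
    """
    # Raw agreement rate q = fraction of answers the judge calls correct
    q = sum(judge_said_correct) / len(judge_said_correct)
    # q = sensitivity * p + (1 - specificity) * (1 - p)  =>  solve for p
    p = (q - (1 - specificity)) / (sensitivity - (1 - specificity))
    # Clip to [0, 1]: sampling noise can push the estimate outside the range
    return max(0.0, min(1.0, p))

# Example from the thread: a symmetric 80%-accurate judge that labels 68% of
# answers "correct" implies true accuracy (0.68 - 0.2) / 0.6, i.e. ~0.8.
print(corrected_accuracy([1] * 68 + [0] * 32, sensitivity=0.8, specificity=0.8))
```

Note that the denominator (sensitivity + specificity - 1) shrinks as the judge approaches a coin flip, which is why a noisier judge needs a larger calibration set before the corrected estimate is usable.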























