
This paper argues that today's AI detectors are not reliable enough to decide whether a student used AI. The authors built three large datasets from genuine pre-GenAI student work, paired each sample with an AI-generated counterpart, and ran 13 commercial and open-source detection tools on more than 280,000 samples. The detectors performed somewhat better on long theses, but they broke down on short coursework and especially on engineering code, and STEM writing was more likely to be flagged unfairly because technical prose often reads as formulaic. --- sciencedirect.com/science/article/abs/pii/S0360131526000540


