Qx


Totally fair. The 13 hours wasn’t “one prompt thinking really hard,” it was an autonomous loop doing the unglamorous work:
- set up the local fine-tuning project
- generated and labeled training data (with a locally hosted Qwen model)
- found bad labels and built review/adjudication files
- trained multiple MLX LoRA checkpoints
- ran evals after each one
- diagnosed failure modes like reject→revise confusion and false positives
- built new hard-example datasets from those failures
- kept notes/plans/checkpoints so the work could resume instead of vanish

So the result wasn’t “it solved everything in one shot.” The result was: we went from a vague local-model fine-tuning idea to a working training/eval pipeline, several checkpoints, clear metrics, and a much better understanding of what data the judge needs next.

That’s the kind of long-running agent work I actually want: not magic, but steady progress with receipts.
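For anyone wondering what "diagnose failure modes, then build hard-example datasets from them" can look like in practice, here is a rough Python sketch. It is not the actual pipeline code. The record fields (`id`, `gold`, `pred`), the `accept` label, the sample data, and the reading of "false positive" as "judge said accept when the gold label was something else" are all assumptions made for illustration; only `reject` and `revise` come from the post.

```python
from collections import Counter

# "reject" and "revise" appear in the post; "accept" is an assumed third label.
LABELS = ("accept", "revise", "reject")


def confusion_counts(examples):
    """Count (gold, predicted) label pairs across evaluated examples."""
    return Counter((ex["gold"], ex["pred"]) for ex in examples)


def hard_examples(examples, gold, pred):
    """Collect examples with gold label `gold` that the judge labeled `pred`."""
    return [ex for ex in examples if ex["gold"] == gold and ex["pred"] == pred]


def false_positives(examples, positive="accept"):
    """Assumed definition: judge predicted `positive`, gold label disagrees."""
    return [ex for ex in examples
            if ex["pred"] == positive and ex["gold"] != positive]


# Made-up eval records, purely illustrative.
evals = [
    {"id": 1, "gold": "reject", "pred": "revise"},
    {"id": 2, "gold": "reject", "pred": "reject"},
    {"id": 3, "gold": "revise", "pred": "accept"},
    {"id": 4, "gold": "accept", "pred": "accept"},
    {"id": 5, "gold": "reject", "pred": "revise"},
]

counts = confusion_counts(evals)
reject_as_revise = hard_examples(evals, gold="reject", pred="revise")
fps = false_positives(evals)

# Print only the off-diagonal cells, i.e. the mistakes worth mining.
for (gold, pred), n in sorted(counts.items()):
    if gold != pred:
        print(f"{gold} -> {pred}: {n}")
```

The lists returned by `hard_examples` and `false_positives` are the kind of thing that could be written out as the next round's training or review set, which is one way to keep "receipts" between checkpoints.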
