
@brennan__simon Real data reflects the true distribution of language and knowledge, while synthetic data is just a recursive projection of the prior that current models already hold. The improvement is marginal, and it risks losing the long tail of the distribution, which could lead to model collapse.









