
Last week, amid the global headlines surrounding the high-stakes summit between President Trump and Xi Jinping in Beijing, a quieter but profoundly consequential piece of research appeared in Nature. A team of seven researchers from major American universities published the first peer-reviewed evidence that China’s state-controlled media has worked its way into the training data of the AI chatbots the world increasingly relies on. The study shows that scripted articles, official slogans, and party-line phrasings produced daily by outlets such as the Xinhua News Agency and the Communist Party’s study apps are now embedded in ChatGPT and other top models. In a quick test using one of Xi Jinping’s signature political doctrines, global chatbots seamlessly finish the phrases and offer to explain their political significance, a sign of how thoroughly state doctrine has saturated the underlying data.
By combing through CulturaX, a massive open-source dataset containing 189 million Chinese-language documents and widely used to train AI models, the researchers found that state-media content is 41 times more abundant in the corpus than Chinese-language Wikipedia. While the overall overlap with state media sits at a modest 1.64% of documents, that share climbs to roughly one in four when the corpus is filtered for politically sensitive terms such as the Party Congress or the Central Committee.
“What is new here is now they are shaping the systems people increasingly ask to summarize, explain, and interpret the world for them,” explained Molly Roberts, a researcher on the team and co-director of the China Data Lab at the University of California San Diego. She noted that through this mechanism, authoritarian governments can now shape information consumption not just domestically but across international borders.
When the team posed politically sensitive questions about Chinese governance to major commercial chatbots, the Chinese-language answers overwhelmingly came back more favorable to Beijing than their English counterparts. While Western models such as GPT, Claude, Gemini, and Grok showed a distinct divergence between languages, China’s own DeepSeek model remained uniformly pro-Beijing in both English and Chinese, reflecting strict state regulatory control over its data. The phenomenon extends beyond China: a similar pattern appears for queries about Russia and North Korea.
Crucially, this ideological slant did not require covert cyber operations. The propaganda is simply available on the open web in plain, unpaywalled HTML, making it free and easy for any AI lab’s web crawler to collect and ingest. This highlights an uncomfortable asymmetry in the global media ecosystem: while independent, high-quality journalism in democracies increasingly relies on paywalls to sustain its operations, state-run propaganda from authoritarian regimes remains entirely free, creating a large imbalance in the text available for machine learning.
A broader audit spanning 37 nations confirmed that the trend is global: the lower a country’s press freedom, the more regime-friendly the AI’s local-language output becomes. Because large language models do not transparently cite their sources, users have no reliable way to trace the origins of the geopolitical narratives presented to them.
The Beijing summit generated a brief wave of international headlines, but this structural penetration of artificial intelligence demands a policy conversation that lasts years. The study offers peer-reviewed evidence that authoritarian states are shaping global AI outputs; how democracies will counter this invisible influence remains an open question.
wsj.com/world/china/th…
