
Last week, we shared a synthetic population dataset for the United States. This week, we’re sharing one that researchers published for the whole world. 🌏
Marijn Ton et al. released a gigantic synthetic population dataset that represents ~𝟳.𝟯𝟯 𝗯𝗶𝗹𝗹𝗶𝗼𝗻 𝗵𝘂𝗺𝗮𝗻𝘀, which matches the 2015 human population count, and ~𝟭.𝟵𝟵 𝗯𝗶𝗹𝗹𝗶𝗼𝗻 𝗵𝗼𝘂𝘀𝗲𝗵𝗼𝗹𝗱𝘀.
𝗧𝗵𝗲 𝗠𝗼𝘁𝗶𝘃𝗮𝘁𝗶𝗼𝗻
To understand the impact of societal changes like disease, extreme weather, and more, modelers sometimes resort to simplifying assumptions about human behavior.
According to the authors – “𝘍𝘰𝘳 𝘦𝘹𝘢𝘮𝘱𝘭𝘦, 𝘪𝘯𝘵𝘦𝘨𝘳𝘢𝘵𝘦𝘥 𝘢𝘴𝘴𝘦𝘴𝘴𝘮𝘦𝘯𝘵 𝘮𝘰𝘥𝘦𝘭𝘴 𝘰𝘧 𝘤𝘭𝘪𝘮𝘢𝘵𝘦 𝘤𝘩𝘢𝘯𝘨𝘦 𝘵𝘺𝘱𝘪𝘤𝘢𝘭𝘭𝘺 𝘢𝘴𝘴𝘶𝘮𝘦 𝘢 𝘳𝘦𝘱𝘳𝘦𝘴𝘦𝘯𝘵𝘢𝘵𝘪𝘷𝘦 𝘤𝘰𝘯𝘴𝘶𝘮𝘦𝘳 𝘰𝘧 𝘢 𝘴𝘪𝘯𝘨𝘭𝘦 𝘢𝘷𝘦𝘳𝘢𝘨𝘦 𝘨𝘭𝘰𝘣𝘢𝘭 𝘰𝘳 𝘳𝘦𝘨𝘪𝘰𝘯𝘢𝘭 𝘤𝘰𝘯𝘴𝘶𝘮𝘦𝘳.”
By creating a synthetic individuals dataset that’s consistent with published demographic statistics at the state / province level (administrative level 1) for most countries, they’re hoping to improve the data and assumptions used in global impact simulations.
𝗧𝗵𝗲𝗶𝗿 𝗗𝗮𝘁𝗮 𝗦𝗼𝘂𝗿𝗰𝗲𝘀
The team primarily used data from 2 databases:
• Luxembourg Income Study, which has very detailed microdata for 50 countries. LIS data especially shines for middle- and high-income countries.
• Demographic and Health Surveys, which has very detailed microdata for 90 countries. DHS data especially shines for low-income countries.
Households and individuals in the remaining countries were generated using regional statistics. A small number of countries that lacked reliable published statistics were excluded.
This is a great dataset for exploring geospatial visualizations or for building regional or global impact models.
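If you want a starting point for that kind of exploration, here is a minimal sketch of rolling individual records up to region-level counts, which is the shape of data a choropleth map or a regional model would typically need. The records, region codes, and field layout below are made up for illustration; the actual file format and column names of the published dataset may differ, so check its documentation first.

```python
from collections import Counter

# Hypothetical records: (region_code, household_id, age).
# These values and this layout are illustrative assumptions,
# not the real dataset's schema.
records = [
    ("NL.01", 1, 34),
    ("NL.01", 1, 36),
    ("NL.01", 2, 71),
    ("NL.02", 3, 25),
]


def summarize_by_region(rows):
    """Count individuals and distinct households per region code."""
    people = Counter()
    households = {}
    for region, household_id, _age in rows:
        people[region] += 1
        households.setdefault(region, set()).add(household_id)
    return {
        region: {
            "individuals": people[region],
            "households": len(households[region]),
            "mean_household_size": people[region] / len(households[region]),
        }
        for region in people
    }


summary = summarize_by_region(records)
for region, stats in sorted(summary.items()):
    print(region, stats)
```

The same per-region totals could also be compared against published administrative level 1 statistics as a quick sanity check before using the data in a model.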
📚 Link to the paper: nature.com/articles/s4159…
🗄️ Link to the dataset: dataverse.harvard.edu/dataset.xhtml?…
#syntheticdata #machinelearning #generativeai
Kudos to the researchers who made this happen: Michiel Ingels, Jens de Bruijn, Hans de Moel, Lena Reimann, Wouter Botzen, Jeroen Aerts
Credit to Nature and the authors for the image showcasing the population coverage and data source for each country.
