Synthetic Data Vault

260 posts


@sdv_dev

Join our growing ecosystem of #opensource libraries & resources for generating #SyntheticData for different data modalities. Created at @lab_dai, MIT.

Cambridge, MA · Joined September 2020
46 Following · 379 Followers
Pinned Tweet
Synthetic Data Vault @sdv_dev
Today, DataCebo launched SDV Enterprise & raised $8.5M in VC. SDV Enterprise is a commercial model of the source-available Synthetic Data Vault (SDV). It makes it easy to develop, manage & deploy #generativeAI models for apps when real data is limited. bit.ly/4a8eIgF
Synthetic Data Vault retweeted
Akshay 🚀 @akshay_pachaar
Generate synthetic data at scale!

SDV is an open-source Python library that generates tabular synthetic data by using ML algorithms to learn and replicate patterns from your real data.

Here's how it works in 3 steps:

1️⃣ Train: Point SDV at your real table; it will capture the underlying distributions & relationships.
2️⃣ Generate: Run the trained SDV model to pop out as many look-alike rows as you need, with no real data exposed.
3️⃣ Validate: Use SDV's quality report to see how closely the generated data matches the real stuff; tweak and repeat if you want it tighter.

Class imbalance, solved in one shot! ✨

Key features:
🧠 Multiple models from GaussianCopula to CTGAN
🔗 Single, multi & sequential-table support
🔒 Built-in anonymization & logical constraints
⚙️ Single call does it all `sdv.sample()`

Link to the GitHub repo in next tweet!
____
Share this with your network if you found this insightful ♻️
Follow me ( @akshay_pachaar ) for more insights and tutorials on AI and Machine Learning!
Synthetic Data Vault @sdv_dev
Generating synthetic data that maintains realistic relationships between columns is crucial for testing and analysis. Traditional random generation approaches often create unrealistic patterns, like luxury hotel rooms priced lower than basic rooms.

GaussianCopulaSynthesizer automatically learns and maintains these relationships, creating synthetic data that preserves the statistical patterns of your original dataset.

⭐️ Full code: datacebo.com/dev-posts/sdv.…
Synthetic Data Vault @sdv_dev
Many businesses collect and store their customers' GPS locations to help improve their products. But GPS data may contain the precise locations of people's homes, so businesses are wary of sharing it, even with internal teams, because it may reveal private information about people they know.

For example, a food delivery application stores the GPS location associated with each delivery. An internal product team wants to use this data to improve the local restaurant recommendations the application makes to users for future orders. The company needs a way to preserve local insights on the best restaurants from the GPS location data without exposing sensitive user locations.

One anonymization approach they could take is replacing every collected GPS location with a randomly chosen one from within the same postal code. Users tend to order from restaurants in the same or neighboring postal codes, so the integrity of local trends is still preserved. To implement this approach, they would need a dataset that contains the geographic boundaries for each postal code and an algorithm for identifying the postal code from a GPS location.

To make this process seamless, we created the MetroAreaAnonymizer. With just a few lines of code, you can use the MetroAreaAnonymizer to replace each GPS location with a randomly chosen one from the same postal code. MetroAreaAnonymizer is part of our RDT library, which contains many helpful transformations for your raw data.

📚 Learn more about MetroAreaAnonymizer here: docs.sdv.dev/rdt/transforme…
📚 Learn about RDT here: github.com/sdv-dev/RDT
📚 Learn more about the SDV here: sdv.dev

#syntheticdata #machinelearning #anonymization #geospatial
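The approach described above (look up the postal code, then draw a random point inside it) can be sketched in a few lines of plain Python. This is not the MetroAreaAnonymizer API; the postal codes, bounding boxes, and function names below are made up for illustration, and real postal-code boundaries are polygons rather than rectangles.

```python
import random

# Hypothetical postal-code boundaries as bounding boxes:
# (min_lat, max_lat, min_lon, max_lon).
POSTAL_CODE_BOXES = {
    "00001": (42.355, 42.375, -71.115, -71.090),
    "00002": (42.340, 42.355, -71.090, -71.075),
}

def find_postal_code(lat, lon):
    """Return the first postal code whose box contains the point, else None."""
    for code, (min_lat, max_lat, min_lon, max_lon) in POSTAL_CODE_BOXES.items():
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
            return code
    return None

def anonymize_location(lat, lon, rng):
    """Replace a point with a random point from the same postal code."""
    code = find_postal_code(lat, lon)
    if code is None:
        return None  # outside every known area: nothing safe to return
    min_lat, max_lat, min_lon, max_lon = POSTAL_CODE_BOXES[code]
    return rng.uniform(min_lat, max_lat), rng.uniform(min_lon, max_lon)

rng = random.Random(42)
delivery = (42.3601, -71.0942)  # a made-up delivery location
anonymized = anonymize_location(*delivery, rng)
```

The anonymized point no longer identifies the customer's home, but it still falls in the same postal code, so per-area restaurant trends survive.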
Synthetic Data Vault @sdv_dev
Synthetic tabular data can help you test software applications because it resembles the key properties and patterns in your real data.

Consider a news publication that wants to use synthetic data to test a new software change for their mobile application before it rolls out to their entire reader base. They trained an AI model on their real data and used it to generate synthetic data. Before they can incorporate this synthetic data into the test environment, however, it must meet some minimum criteria for the application to function properly.

Here are some examples of criteria that the synthetic data must meet:

1. Data Validity: Primary keys must be unique and non-null. Many features need to retrieve a specific row in a table using a unique identifier. For example, to authenticate a user, the application needs to find the specific row corresponding to their unique user_id value.

2. Data Structure: Data types, column names, and table names should match those in the real data. Application code that retrieves or updates data using specific column names, column types, and table names will error if they do not match, for example when the application needs to update a user's settings.

3. Relationship Validity: Each foreign key must have a reference to a valid primary key (also known as referential integrity). Many features in the app require joining data from multiple tables, like the recommended articles feature. Without referential integrity, the retrieved data might contain a subset or none of the recommended articles for the user.

To help them validate that the synthetic data meets the minimum criteria for usability, they could use the SDV's Diagnostic Report. This report runs all of our basic data format and validity checks by comparing the real and synthetic data. The Diagnostic Report is part of our open-source and vendor-neutral SDMetrics library.

Synthetic data generated by the default synthesizers in the SDV will always result in 100% diagnostic scores. We call this the SDV Guarantee. If the SDV ever generates synthetic data that doesn't score 100% on the Diagnostic Report, then you've identified a bug! Please reach out to us on GitHub or Slack and we will prioritize investigating it.

📚 Learn more about the single-table Diagnostic Report: docs.sdv.dev/sdmetrics/repo…
📚 Learn more about the multi-table Diagnostic Report: docs.sdv.dev/sdmetrics/repo…
📚 Learn more about the SDV here: sdv.dev

#dataquality #generativeai #machinelearning #softwaretesting #syntheticdata
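The three criteria listed above can each be expressed as a small check. The sketch below uses plain Python lists of dictionaries rather than the SDMetrics Diagnostic Report itself; the table contents and function names are hypothetical.

```python
def check_primary_key(rows, key):
    """Data validity: primary key values are non-null and unique."""
    values = [row.get(key) for row in rows]
    return None not in values and len(values) == len(set(values))

def check_structure(real_rows, synthetic_rows):
    """Data structure: same column names and Python types as the real data."""
    real_cols = {name: type(value) for name, value in real_rows[0].items()}
    return all(
        set(row) == set(real_cols)
        and all(isinstance(row[name], real_cols[name]) for name in real_cols)
        for row in synthetic_rows
    )

def check_foreign_key(child_rows, foreign_key, parent_rows, primary_key):
    """Relationship validity: every foreign key points at an existing parent."""
    parent_keys = {row[primary_key] for row in parent_rows}
    return all(row[foreign_key] in parent_keys for row in child_rows)

# Made-up tables for a news app.
real_users = [{"user_id": 1, "plan": "free"}, {"user_id": 2, "plan": "paid"}]
synthetic_users = [{"user_id": 10, "plan": "paid"}, {"user_id": 11, "plan": "free"}]
synthetic_reads = [{"read_id": 1, "user_id": 10}, {"read_id": 2, "user_id": 11}]
broken_reads = [{"read_id": 3, "user_id": 99}]  # user 99 does not exist

all_checks_pass = (
    check_primary_key(synthetic_users, "user_id")
    and check_structure(real_users, synthetic_users)
    and check_foreign_key(synthetic_reads, "user_id", synthetic_users, "user_id")
)
```

A row like the one in `broken_reads` is exactly what would break a JOIN-based feature in the test environment, which is why referential integrity is checked before the data is loaded.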
Synthetic Data Vault @sdv_dev
One challenge in training AI models to generate valid synthetic data is teaching them to mimic the rules-based business logic that exists in real datasets. Let's explore an example of one such rule.

The one-to-many relationship is a common pattern in database schemas. An interesting variation of this pattern occurs when only some rows are allowed to have connections while others aren't.

For example, a gym offers a premium membership tier that gives access to extra benefits (like pool access and sauna access). To record the perks available to each member, they use a members table and a benefits table. Only the rows representing premium members are allowed to have connections to rows in the benefits table, while the rows representing basic members are not. This lets the gym store specific information for a subset of its membership in a separate table in a simple way.

We call this the ForeignToPrimaryKeySubset pattern because only a subset of the primary keys in the parent table have a 1-to-many relationship with the foreign keys in the child table.

If your data contains this pattern, you can now generate multi-table synthetic data using the SDV that also adheres to this pattern. This pattern is part of our Constraint Augmented Generation bundle, or CAG, in the SDV Enterprise.

📚 Learn more about the ForeignToPrimaryKeySubset pattern here: docs.sdv.dev/sdv/reference/…
📚 Learn more about the CAG bundle here: docs.sdv.dev/sdv/reference/…
📚 Learn more about the SDV here: sdv.dev

#syntheticdata #generativeai #databases #machinelearning #datamodeling
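As a rough illustration of the rule itself (not of the SDV Enterprise API), the gym example can be written as a check that every row in the child table references a parent row from the eligible subset. The table contents and the helper name are assumptions.

```python
members = [
    {"member_id": 1, "tier": "premium"},
    {"member_id": 2, "tier": "basic"},
    {"member_id": 3, "tier": "premium"},
]
benefits = [
    {"benefit_id": 10, "member_id": 1, "perk": "pool"},
    {"benefit_id": 11, "member_id": 1, "perk": "sauna"},
    {"benefit_id": 12, "member_id": 3, "perk": "pool"},
]

def follows_subset_pattern(parents, children, key, is_eligible):
    """True if every child row points at a parent from the eligible subset."""
    eligible_keys = {parent[key] for parent in parents if is_eligible(parent)}
    return all(child[key] in eligible_keys for child in children)

def is_premium(member):
    return member["tier"] == "premium"

pattern_holds = follows_subset_pattern(members, benefits, "member_id", is_premium)

# A benefit attached to basic member 2 violates the pattern.
invalid_benefits = benefits + [{"benefit_id": 13, "member_id": 2, "perk": "pool"}]
pattern_broken = not follows_subset_pattern(members, invalid_benefits, "member_id", is_premium)
```

Note that plain referential integrity would accept the invalid row, because member 2 exists; the subset pattern is the stricter rule.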
Synthetic Data Vault @sdv_dev
✈️ @Expedia recently shared a very interesting methodology on how they collect and use synthetic data to improve their flight price forecasting models.

When a user makes a flight search, Expedia retrieves the latest pricing data from their data providers for the specified search parameters: route, fare class, trip dates, etc. To build interesting price prediction features for their customers, the Expedia team trains forecasting models on the data they've collected, but they wanted to improve prediction accuracy even further.

🛑 The Challenge
Even though users make millions of searches daily, the number of possible combinations of routes, trip dates, and passenger counts is so large that the team had no price for many of them. To develop a robust forecasting model, the team would ideally have at least one search a day for each combination of the search parameters.

🤖 How They Incorporated Synthetic Data
To fill these gaps, they built automated software that requests flight prices for specific search parameters.

🎯 Their goal with synthetic searches is to have at least one search a day for their most popular routes, for trip dates that fall within the upcoming months. During the model training phase, they combine data from real user searches and from synthetic searches to ensure better data coverage.

✅ User Impact
When a user searches for a flight, Expedia shows a chart that visualizes how prices are forecasted to change between now and takeoff. By improving the accuracy of their price forecasts, Expedia helps their users decide whether to book a flight immediately or wait for a forecasted price drop.

🚧 Limitations
Automated searches based on synthetically created search parameters could interfere with the experience of onsite users who are trying to search for prices. The team took this into consideration and was deliberate about balancing the data retrieval needs of real user searches with the team's needs for synthetic searches.

📚 Read the Dec 2024 @thenewstack article by Shiyi Pickrell, the SVP of Data and AI at Expedia: thenewstack.io/the-future-of-…
📚 Read the Oct 2023 @Medium article by Andrew Reuben, Senior Machine Learning Scientist at Expedia: medium.com/expedia-group-…

#syntheticdata #generativeai #machinelearning #openai #travel

Image credit: Expedia
Synthetic Data Vault @sdv_dev
One challenge in training AI models to generate valid synthetic data is teaching them to mimic the rules-based business logic that exists in real datasets. Let's explore an example of one such rule.

Some applications need to store numerical data with different units of measurement in the same column. For example, an online retailer accepts payments in many different currencies and records every transaction in a table. They use an amount column to record the transaction amount and a currency column to record the currency for each transaction.

The transaction amounts associated with each currency might have radically different scales (min-max ranges and distributions) because of the exchange rate. 1 USD (American Dollar) is equivalent to ~1063 ARS (Argentinian Pesos), which is reflected in the transaction amounts. We need a way to instruct the AI model to learn the scales for each currency separately.

To enable SDV synthesizers to model this business logic and generate synthetic data that adheres to it, we created the MixedScales constraint. You can use this constraint whenever the value of one or more categorical columns (like the currency column) determines the scale of a numerical column (like the amount column).

The MixedScales constraint is part of our Constraint Augmented Generation bundle, or CAG, in the SDV Enterprise.

📚 Learn more about the MixedScales constraint here: docs.sdv.dev/sdv/reference/…
📚 Learn more about the CAG bundle here: docs.sdv.dev/sdv/reference/…

#syntheticdata #generativeai #databases #finance #datamodeling
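The core idea, learning the scale of the amount column separately for each currency, can be sketched as follows. This is an illustrative stand-in, not the MixedScales constraint: the transactions are invented, and a simple min-max range stands in for a learned distribution.

```python
import random
from collections import defaultdict

# Made-up transactions: (currency, amount).
transactions = [
    ("USD", 12.5), ("USD", 80.0), ("USD", 45.0),
    ("ARS", 13000.0), ("ARS", 85000.0), ("ARS", 47000.0),
]

def learn_scales(rows):
    """Learn a separate (min, max) range for each currency."""
    grouped = defaultdict(list)
    for currency, amount in rows:
        grouped[currency].append(amount)
    return {currency: (min(v), max(v)) for currency, v in grouped.items()}

def sample_amount(currency, scales, rng):
    """Sample an amount on the scale that belongs to the given currency."""
    low, high = scales[currency]
    return rng.uniform(low, high)

scales = learn_scales(transactions)
rng = random.Random(0)
usd_amount = sample_amount("USD", scales, rng)
ars_amount = sample_amount("ARS", scales, rng)
```

A model that learned one global range (12.5 to 85000.0) could happily emit an 85,000 USD purchase; conditioning the scale on the currency avoids that.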
Synthetic Data Vault @sdv_dev
Today, we're excited to introduce a powerful new bundle to The Synthetic Data Vault: AI connectors. AI connectors address 2 key challenges that SDV users face when training generative AI models on datasets from enterprise data stores. (Link to the announcement: bit.ly/3EURLCB)

❎ Creating accurate metadata is time-consuming, especially for complex multi-table schemas
Metadata provides deeper context (semantic and statistical) about your data, and the synthesizers use this context to generate high quality synthetic data. Without AI connectors, SDV users have to export data from the database, use SDV's metadata auto-detection feature to establish metadata, and then manually update the metadata to be accurate.

✅ AI connectors automatically generate higher quality metadata
AI connectors automatically infer higher quality metadata using the database schema and our own inference engine, without having to read tables into memory from the database. When benchmarked with 55 datasets stored in 4 different database platforms, metadata generated using AI connectors resulted in 35% higher quality metadata (average score of 0.98) compared to metadata generated using the auto-detection approach (average score of 0.73).

❎ Identifying a referentially sound and representative sample for training data is tricky
Training SDV synthesizers requires loading a representative sample of data from your database into memory. In addition, the data needs to have referential integrity for the synthesizers to learn the proper relationships. Approaches to identifying a high quality, referentially sound sample of data can be tedious and time-consuming to implement.

✅ AI connectors use a built-in algorithm to generate a training dataset and guarantee referential integrity
With AI connectors, we created an algorithm called Referential First Search (RFS) that guarantees that the real data used to train the model is a subset with referential integrity. When benchmarked with 7 datasets stored in 5 different databases, training data created using AI connectors achieved an average of 18% higher data quality score over the standard approach of random subsampling and then enforcing referential integrity afterwards.

Read more about AI connectors and how to access them in our latest product announcement here: bit.ly/3EURLCB

#syntheticdata #generativeai #machinelearning #databases
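For context, the "standard approach" that RFS is compared against (subsample first, then enforce referential integrity) might look like the sketch below. This is not the RFS algorithm, whose details are not given here; the tables and the function name are hypothetical.

```python
import random

def subsample_with_integrity(parents, children, primary_key, foreign_key,
                             num_parents, seed=0):
    """Randomly sample parent rows, then keep only children that reference them."""
    rng = random.Random(seed)
    sampled_parents = rng.sample(parents, num_parents)
    kept_keys = {parent[primary_key] for parent in sampled_parents}
    sampled_children = [c for c in children if c[foreign_key] in kept_keys]
    return sampled_parents, sampled_children

# Made-up parent and child tables: 100 customers with 3 orders each.
customers = [{"customer_id": i} for i in range(100)]
orders = [{"order_id": i, "customer_id": i % 100} for i in range(300)]

sample_customers, sample_orders = subsample_with_integrity(
    customers, orders, "customer_id", "customer_id", num_parents=10
)
sampled_ids = {c["customer_id"] for c in sample_customers}
```

The result has referential integrity by construction, but nothing in this procedure ensures the sample is representative of the full database, which is the gap the announcement says RFS targets.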
Synthetic Data Vault @sdv_dev
SDV Enterprise v0.23.0 is out 🎉

This release enhances your ability to program your synthesizer to find certain patterns and recreate them, whether it's through multi-table CAG patterns, single-table constraints, or pre-processing techniques that transform your data.

🏆 Improved CAG patterns. Use CarryOverColumns to specify a column that is repeated across many tables with different relationships. The PrimaryToPrimaryKeySubset pattern now works with missing values. See more about these interesting data patterns SDV Enterprise supports in the slides below.

💡 Experiment with new transformers to improve your synthetic data quality. Try applying the new LogScaler and LogitScaler on data that exhibits exponential properties.

📚 Read the full Release Notes here: bit.ly/4152LVn
📚 Learn more about the SDV: bit.ly/4b858Lu

#syntheticdata #generativeai #machinelearning #ai
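To see why a log transform helps with exponential-looking data, here is a small reversible log-scaling sketch in plain Python. It illustrates the general technique only and is not the LogScaler transformer; the function names and the `constant` offset parameter are assumptions.

```python
import math

def log_scale(values, constant=0.0):
    """Transform values > constant onto a log scale."""
    return [math.log(value - constant) for value in values]

def inverse_log_scale(scaled, constant=0.0):
    """Undo log_scale, returning values on the original scale."""
    return [math.exp(value) + constant for value in scaled]

exponential_data = [1.0, 10.0, 100.0, 1000.0]
scaled = log_scale(exponential_data)
gaps = [b - a for a, b in zip(scaled, scaled[1:])]  # evenly spaced after scaling
restored = inverse_log_scale(scaled)
```

Values that span several orders of magnitude become evenly spaced on the log scale, which is generally easier for a model to fit, and the inverse transform maps generated values back to the original units.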
Synthetic Data Vault @sdv_dev
Synthetic data is a powerful way to generate test data that looks and feels like real production data. You can either insert the synthetic data back into the database in an environment for manual testing or use the data for running automated tests.

But if you need to test a new application that has no real-world usage or collected data, then you need to adopt a different approach. Instead of training models on your real data to generate synthetic data, you can generate fake test data from scratch that adheres to your database schema.

In the SDV, we created a dedicated synthesizer called DayZSynthesizer to support this workflow. Here are the 3 main steps:

1. Generate baseline metadata
Auto-generate baseline metadata from your database's schema (for supported databases) or use our Metadata APIs to create a JSON representation of your metadata that mirrors your database schema.

2. Improve the data realism
You can update sdtypes to add semantic meaning to special columns like social security numbers, postal codes, and addresses to improve the format and type of fake data that's generated. You can also define min-max value ranges for numerical columns, define a fixed set of categories for categorical columns, define datetime ranges, and control the proportion of missing data you'd like for each column.

3. Generate and export fake data 🚀
Generate the rows you need for each table and export the data into your database.

The beauty of this workflow is that every time you make a software change that requires a change in the database schema, you can re-generate fake data with minimal changes to the code you already wrote.

📚 Learn more about DayZSynthesizer here: bit.ly/41j5ADs
📚 Learn more about the Metadata Creation API here: bit.ly/3QnPVfX
📚 Learn more about the SDV here: bit.ly/4b858Lu

#syntheticdata #fakedata #machinelearning #generativeai
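The three steps above can be mimicked with a toy generator that reads a schema description and produces rows without ever seeing real data. This is a plain-Python sketch, not the DayZSynthesizer API; the schema format, column names, and `generate_rows` helper are all assumptions.

```python
import random

# Steps 1 and 2: a hypothetical schema with ranges, categories, and missing rates.
schema = {
    "user_id": {"sdtype": "id"},
    "age": {"sdtype": "numerical", "min": 18, "max": 90},
    "plan": {"sdtype": "categorical", "categories": ["free", "pro"]},
    "referrer": {"sdtype": "categorical", "categories": ["ad", "friend"], "missing": 0.3},
}

def generate_rows(schema, num_rows, seed=0):
    """Step 3: generate fake rows that adhere to the schema."""
    rng = random.Random(seed)
    rows = []
    for i in range(num_rows):
        row = {}
        for name, spec in schema.items():
            if rng.random() < spec.get("missing", 0.0):
                row[name] = None  # honor the requested proportion of missing data
            elif spec["sdtype"] == "id":
                row[name] = i  # sequential ids are unique by construction
            elif spec["sdtype"] == "numerical":
                row[name] = rng.randint(spec["min"], spec["max"])
            else:
                row[name] = rng.choice(spec["categories"])
        rows.append(row)
    return rows

fake_users = generate_rows(schema, num_rows=5)
```

When the database schema changes, only the `schema` dictionary needs updating; the generation code stays the same, which mirrors the benefit described in the post.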
Synthetic Data Vault @sdv_dev
Last week, we shared a synthetic populations dataset for the United States, but this week we're sharing one published by researchers for the whole world. 🌍

Marijn Ton et al. released a gigantic synthetic population dataset that represents ~7.33 billion humans, which matches the 2015 human population count, and ~1.99 billion households.

The Motivation
To understand the impact of societal changes like disease, extreme weather, and more, modelers sometimes resort to simplifying assumptions of human behavior. According to the authors: "For example, integrated assessment models of climate change typically assume a representative consumer of a single average global or regional consumer."

By creating a synthetic individuals dataset that's consistent with published demographic statistics at the state / province level (administrative level 1) for most countries, they're hoping to improve the data and assumptions used in global impact simulations.

Their Data Sources
The team primarily used data from 2 databases:
• Luxembourg Income Study, which has very detailed microdata for 50 countries. LIS data especially shines for medium and high income countries.
• Demographic and Health Surveys, which has very detailed microdata for 90 countries. DHS data especially shines for low-income countries.

Households and individuals in the remaining countries were generated using regional statistics. A small number of countries that were missing reliable, published statistics were excluded.

This is a great dataset to explore geospatial visualizations or to build regional or global impact models.

📚 Link to the paper: nature.com/articles/s4159…
🗄️ Link to the dataset: dataverse.harvard.edu/dataset.xhtml?…

#syntheticdata #machinelearning #generativeai

Kudos to the researchers who made this happen: Michiel Ingels, Jens de Bruijn, Hans de Moel, Lena Reimann, Wouter Botzen, Jeroen Aerts

Credit to the Nature Magazine and the authors for the image showcasing the population coverage and data source for each country.
Synthetic Data Vault @sdv_dev
Some multi-table datasets have interesting data patterns, like mirroring 1 or more columns in a child table from its parent table. This design pattern helps the database user avoid the need to run a time-consuming or expensive JOIN query, especially if one of the tables is extremely large or if the database is column-oriented, as OLAP databases are.

For example, imagine you're building an #ecommerce orders dashboard that frequently needs to analyze order volume and amounts by the user's country of origin. With a fully normalized table design, this application would need to accumulate this information by frequently querying and joining both the orders and users tables. If this query was slow or expensive, you could instead mirror the country of origin information from the users table to the orders table.

We call this the CarryOverColumns pattern because 1 or more columns are carried over from one table to another.

If your real data contains this pattern, you can now generate multi-table synthetic data using the SDV that also adheres to this pattern. This pattern is part of our Constraint Augmented Generation bundle, or CAG, in the SDV Enterprise.

📚 Learn more about the CarryOverColumns pattern here: bit.ly/40WbYza
📚 Learn more about the CAG bundle here: bit.ly/410V4Q3

#syntheticdata #generativeai #databases #machinelearning #datamodeling
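The orders dashboard example can be sketched in plain Python to show what "carrying over" a column means and why it removes the JOIN. This illustrates the data pattern only, not the SDV Enterprise API; the tables and helper names are made up.

```python
from collections import defaultdict

# Made-up parent table (keyed by user_id) and child table.
users = {1: {"country": "US"}, 2: {"country": "AR"}}
orders = [
    {"order_id": 100, "user_id": 1, "amount": 30.0},
    {"order_id": 101, "user_id": 2, "amount": 12.0},
    {"order_id": 102, "user_id": 1, "amount": 8.0},
]

def carry_over(parents, children, foreign_key, column):
    """Copy `column` from each child's parent row into the child row."""
    return [{**child, column: parents[child[foreign_key]][column]} for child in children]

def is_carried_over(parents, children, foreign_key, column):
    """Check that every child row mirrors its parent's value."""
    return all(child[column] == parents[child[foreign_key]][column] for child in children)

orders_with_country = carry_over(users, orders, "user_id", "country")

# The dashboard aggregation now reads a single table, with no JOIN.
amount_by_country = defaultdict(float)
for order in orders_with_country:
    amount_by_country[order["country"]] += order["amount"]
```

For synthetic data, `is_carried_over` is the property that must keep holding: a synthetic order whose country disagrees with its synthetic user would make the dashboard and the normalized tables tell different stories.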
Synthetic Data Vault @sdv_dev
James Rineer et al. just released a new dataset containing millions of #syntheticdata records about households and individuals in the US. Using publicly available census data from the U.S. Census Bureau, they generated:

🏘️ 120,754,708 synthetic households
👥 303,128,287 synthetic individuals
🗄️ 3 gigabytes of compressed parquet files

The team was very meticulous with many aspects of the data generation. For example, they used external population density sources to place households inside real census block groups instead of just randomly generating locations inside the US.

This is a great dataset for practicing spatiotemporal analysis and visualization. 🗺️📊

Link to the paper: nature.com/articles/s4159…
Link to the dataset: springernature.figshare.com/articles/datas…

#gis #machinelearning #ai #openai

Collaborators: Nicholas Kruskamp, Caroline Kery, Kasey Jones, Rainer Hilscher, Georgiy Bobashev

Credit to the @Nature magazine and the authors for the excellent image.
Synthetic Data Vault @sdv_dev
In 2024, synthetic data routinely made headlines alongside many AI product launches. Here are our predictions for 2025 🔮

1. The rise of generative AI will result in a number of LLM-based synthetic data generation tools for tabular data. None will deliver on the promise, but this process will help enterprises define requirements.
Researchers have started to use LLMs to generate synthetic tabular data. We predict that these efforts will show promise on toy or single-table datasets but will fall short for complex, enterprise-grade, multi-table databases that contain lots of hidden context. Even though these tools will be tested and will fail to deliver, the process will lead to the development of much more concrete requirements for tabular synthetic data generators.

2. Companies will face a freeze in data asset availability due to regulations and declining customer consent.
Increased privacy and security regulations and increased customer privacy consciousness will make it harder to use customer data to train AI models. This will lead companies to run out of usable data and turn to synthetic data as a viable solution.

3. Every company will, at the very least, experiment with synthetic data in 2025 as part of their broader AI data strategy.
Synthetic data is often better than real data in AI training and can be more freely shared across the organization. AI models simply perform better when trained with upsampled, augmented, and bias-corrected synthetic data, as they can identify patterns more efficiently without overfitting. We are already seeing this: the SDV software has been downloaded more than 7 million times, and as many as 10% of global Fortune 500 companies currently experiment with SDV. We predict this number will grow exponentially next year.

4. Synthetic data for training AI agents will become a more pressing need.
Enterprises will need additional data to train more robust AI agents, and synthetic data can help fill the gap.

5. Enterprises will gain big from synthetic tabular data and synthetic data to train AI agents.
While big tech focuses on improving LLMs, most enterprises will gain more immediate value from synthetic tabular data to improve data access, train more robust ML models, or train better AI agents.

📖 Read more about our 2025 predictions and our 2024 recap here: datacebo.com/blog/synthetic…

#generativeai #ai #openai #syntheticdata #machinelearning
Synthetic Data Vault @sdv_dev
If you want to use AI-generated synthetic data in place of your sensitive real data, then you need to be confident that the synthetic data adheres to the same business rules.

For example, imagine that you're an online retailer that wants to test, using realistic data, how a new version of your website displays order history. Each order contains product names, their SKUs (stock keeping units), along with some other fields.

Every SKU value is linked to a unique product name, and the generated synthetic data needs to reflect this pattern to help you accurately test the change. A SKU value can't appear next to different product names in the synthetic data.

In the SDV, you can define this business rule using the FixedCombinations Constraint and require your synthesizer to generate synthetic data that adheres to it.

📖 Learn more about the FixedCombinations Constraint here: docs.sdv.dev/sdv/reference/…

🤝 Join the SDV community here: bit.ly/sdv-slack-invi…

#generativeai #syntheticdata #machinelearning #openai
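The SKU rule can be illustrated with a tiny sampler that only ever emits (SKU, product name) pairs observed in the real data. This is a sketch of the idea behind a fixed-combinations rule, not the SDV constraint itself; the product rows and the function name are invented.

```python
import random

# Made-up real order lines: (sku, product_name).
real_rows = [
    ("SKU-001", "Kettle"),
    ("SKU-002", "Toaster"),
    ("SKU-001", "Kettle"),
    ("SKU-003", "Blender"),
]
allowed_pairs = sorted(set(real_rows))  # the only combinations that may appear

def sample_fixed_combinations(num_rows, seed=0):
    """Sample rows whose (sku, product_name) pair always exists in the real data."""
    rng = random.Random(seed)
    return [rng.choice(allowed_pairs) for _ in range(num_rows)]

synthetic_rows = sample_fixed_combinations(20)

# Verify the business rule: each SKU appears next to exactly one product name.
names_per_sku = {}
for sku, name in synthetic_rows:
    names_per_sku.setdefault(sku, set()).add(name)
```

Sampling the two columns independently could pair "SKU-001" with "Toaster"; treating the pair as a single unit rules that out.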
Synthetic Data Vault @sdv_dev
An easy way to improve the quality of the synthetic data that the SDV generates is to accurately define each column's sdtype. Sdtypes are a key part of the SDV's Metadata model, which lets you, the expert of the data, provide additional context for the SDV to incorporate.

For example, a column containing the values 75023, 10002, and 10003 could represent any of the following sdtypes based on the dataset:
- Numerical
- Categorical
- Postal Code
- Identifier (or ID)

Each sdtype results in different synthetic data generation behavior for a column, as you can tell from the diagram below.

Start by establishing baseline metadata using SDV's auto-detection feature and then update the sdtype for specific columns to better align with the behavior you expect.

Learn more about sdtypes here: docs.sdv.dev/sdv/reference/…

#generativeAI #syntheticdata #AI
Synthetic Data Vault @sdv_dev
Many real-world classification datasets have severe class imbalance. For example, imagine a fraud dataset where 99.9% of the rows are labelled non-fraudulent and only 0.1% are labelled fraudulent.

By incorporating synthetic data in your training data, you can achieve a more desirable label balance. Start by training a generative AI model in the SDV on your real data. Then, use the Conditional Sampling feature to generate synthetic data for just the rows in the minority label class.

Because the model is trained on your real data, the generated synthetic data will mirror the column distributions and correlations between the columns in your real data. By supplementing your training data with synthetic data that's conditionally sampled from the minority class, you can even achieve a 50-50 class balance.

Learn more about our Conditional Sampling feature here: docs.sdv.dev/sdv/single-tab…
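As a quick worked example of the arithmetic, the helper below computes how many synthetic minority rows are needed to reach a target class share. The dataset size and the function name are hypothetical; the formula solves (minority + x) / (majority + minority + x) = target share for x.

```python
import math

def minority_rows_needed(num_majority, num_minority, target_minority_share=0.5):
    """Number of synthetic minority rows to add to reach the target share."""
    total = num_majority + num_minority
    needed = (target_minority_share * total - num_minority) / (1 - target_minority_share)
    return max(0, math.ceil(needed))

# Hypothetical fraud dataset: 1,000,000 rows, of which 1,000 are fraudulent.
num_fraud = 1_000
num_legit = 999_000

rows_for_balance = minority_rows_needed(num_legit, num_fraud)        # 50-50 balance
rows_for_ten_percent = minority_rows_needed(num_legit, num_fraud, 0.1)
```

For a 50-50 balance the answer is simply the gap between the two classes (998,000 rows here); a milder target such as a 10% fraud share needs far fewer synthetic rows (110,000), which may be a more practical starting point.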