Study shows synthetic data can be game-changer for AI development

WEDNESDAY, JULY 17, 2024

Acquiring and labelling real-world data is a major hurdle in developing AI models. A recent Gartner survey of 644 organisations identified data availability as a top barrier to implementing generative AI (GenAI).

The study, released on Tuesday, cited synthetic data as a potential solution. It poses significantly lower privacy risks than real data, allowing for the training of machine learning models and the analysis of previously inaccessible data.

Alys Woodward, Gartner's senior director analyst, emphasised the value of synthetic data in addressing privacy concerns and data scarcity in AI development. 

"Synthetic data can open a range of opportunities to train machine learning models and analyse data that would not be available if real data were the only option," she explained.

The report then discussed the primary benefits of synthetic data. Among the advantages is its ability to connect information silos without jeopardising sensitive data. 

This is especially important in fields like ID verification and automated driver assistance systems (ADAS), where personally identifiable information is frequently involved. For ADAS training, synthetic data can be used to create datasets covering a range of variations, such as different facial expressions, skin colours, and even low-light conditions.

Furthermore, synthetic data provides an alternative to the time-consuming and error-prone process of manual data anonymisation. It enables faster, cheaper, and more convenient access to data that closely resembles the original source while protecting privacy. 

Differential privacy techniques can be used to ensure that synthetic data derived from real data is highly unlikely to be deanonymised.

However, there are challenges to the adoption of synthetic data. Maintaining data utility while also ensuring privacy is a delicate balance.

There are also misconceptions about the quality of synthetic data, with some believing it to be inferior to real data.

Woodward disagrees with this idea, saying, "Synthetic data can be better than real data, not in how it represents the current world, but in how it can train AI models to work with the ideal or future world."
 

Challenges on the road to widespread adoption
Creating synthetic tabular datasets requires striking a balance between privacy and utility. The data must remain both useful and accurate to the original dataset. If the utility is too high, privacy may be jeopardised, particularly for unique records, because the synthetic dataset can be matched with other data sources.

In contrast, privacy-enhancing techniques such as disconnecting attributes or introducing noise via differential privacy can inherently reduce the dataset's usefulness.
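The trade-off described above can be illustrated with a minimal sketch. It is not taken from the Gartner report, and it is not a synthetic data generator. It applies the Laplace mechanism, a standard differential privacy technique, to a single counting query, assuming a sensitivity of 1. The function names, the true count of 100, and the epsilon values are hypothetical and chosen only for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one Laplace(0, scale) sample.

    The difference of two independent exponential samples with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count using the Laplace mechanism.

    Assumption: adding or removing one record changes the count by at
    most 1 (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(true_count: int, epsilon: float,
                   trials: int = 2000, seed: int = 0) -> float:
    """Average absolute error of the noisy count over many trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += abs(private_count(true_count, epsilon, rng) - true_count)
    return total / trials


if __name__ == "__main__":
    # Smaller epsilon means stronger privacy and, on average, more error.
    for eps in (0.1, 1.0, 10.0):
        print(f"epsilon={eps}: mean abs error ~ {mean_abs_error(100, eps):.2f}")
```

In this sketch a smaller epsilon gives stronger privacy protection but a larger average error, which is the kind of loss of usefulness the paragraph above refers to.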

Just as data quality concerns have historically plagued data management, some people are sceptical of synthetic data, considering it inferior to real data. Synthetic data, however, has the potential to outperform real-world data in training AI models for idealised or future scenarios.

While a synthetic dataset is similar to the original, it may lack unusual occurrences or "edge cases" that were not present in the original data. 

This is especially important for image and video data used in applications such as autonomous driving, where AI models are trained on massive amounts of driving footage. 

However, synthetic data can be used to simulate unusual situations such as emergency vehicles, snow driving, or roadside animals.

Woodward concluded that synthetic data is a promising solution for overcoming privacy issues in AI development. As the technology matures and these challenges are addressed, synthetic data has the potential to become a powerful tool for training robust and responsible AI models, while also pushing the boundaries of what is possible in machine learning and data analysis.