Tech Translated: Synthetic data

  • December 07, 2023

What is synthetic data?

Synthetic data is exactly what it sounds like: data that has been artificially created (usually via algorithms, statistical models, or generative AI), rather than generated directly from real-world activities. To develop synthetic data, information from almost any source is analyzed to detect structures and patterns, which are then used as the foundation for creating new datasets that mimic the core characteristics of the original.
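The analyze-then-generate idea can be illustrated with a deliberately simple statistical model. The sketch below is not how any particular product works. It assumes a tiny, made-up table of two numeric columns (age and monthly spend), learns each column's mean and spread plus the correlation between them, and then draws new rows from a bell-curve model with those properties. The function names, the sample values, and the choice of a Gaussian model are all illustrative assumptions.

```python
import random
import statistics

def fit_gaussian_model(rows):
    """Learn the 'structure' of a two-column table: means, spreads, correlation."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    # Sample covariance, then normalize to a correlation coefficient.
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mean": (mx, my), "std": (sx, sy), "corr": cov / (sx * sy)}

def sample_synthetic(model, n, seed=0):
    """Draw n new rows that follow the learned means, spreads, and correlation."""
    rng = random.Random(seed)
    mx, my = model["mean"]
    sx, sy = model["std"]
    rho = model["corr"]
    # Guard against tiny negative values caused by floating-point rounding.
    residual = max(0.0, 1.0 - rho ** 2) ** 0.5
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        y = my + sy * (rho * z1 + residual * z2)  # mixes in z1 to keep the correlation
        rows.append((x, y))
    return rows

# Hypothetical "real" records: (age, monthly spend). Values are made up.
real_rows = [(23, 120), (35, 180), (41, 210), (52, 260),
             (29, 150), (60, 300), (47, 240), (38, 190)]

model = fit_gaussian_model(real_rows)
synthetic_rows = sample_synthetic(model, n=5000)
synthetic_model = fit_gaussian_model(synthetic_rows)
```

None of the 5,000 generated rows is copied from the original eight, yet refitting the model on them gives means and a correlation close to the originals. Real generators use far richer models, but the shape of the process is the same.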

What business problems can it address?

Identifying, collecting, and structuring relevant data in ways that enable it to inform business decisions is time-consuming, expensive, and potentially risky. At the same time, “Every business is dealing with data protection rules and the right handling of sensitive data,” says Marcus Hartmann, partner and chief data officer for PwC Germany and Europe. “Synthetic data can give a clear impression of something without exposing the underlying origins and sensitive information.”

When appropriate data is inaccessible due to concerns about confidentiality, privacy, or regulatory compliance—or simply doesn’t exist in sufficient quantities to be useful—synthetic datasets can sidestep these restrictions.

How does it create value?

Synthetic data is often a lower-cost, faster way to access vast quantities of data than traditional data collection and curation methods. This means it has the potential to turbocharge the data-driven transformation of every industry by becoming the foundation for training machine-learning models and AI, which in turn enables the development of new products, services, and ways of working—finally delivering on the promise of “big data” that got us all so excited a few years back.

Synthetic data is already being used in many industries. Amazon used synthetic data about speech patterns, syntax, and semantics to improve multilingual speech recognition in its Alexa virtual assistant. The UK’s National Health Service (NHS) has converted real-world data on patient admissions for accident and emergency (A&E) treatment into a statistically similar but anonymized open-source dataset to help NHS care organizations better understand and meet the needs of patients and healthcare providers. This kind of health data has also been leveraged by Alphabet and US insurance company Anthem to improve insurance fraud detection.

However, this is still relatively early-stage tech, and as with any other machine-generated information, the output is only as good as the inputs and algorithms. Anomalies and outliers in the source data can be amplified or lost altogether; either option will make the end product less representative of the real data it’s meant to replace. Synthetic datasets might also accidentally retain some personally identifiable information from the source, which could violate people’s privacy and expose organizations using the data to legal action.
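Both risks can be spot-checked before a synthetic dataset is released. The sketch below shows two basic checks on made-up data: one flags synthetic rows that are exact copies of source rows (a crude signal of retained personal information), and one compares how often extreme values appear in each dataset (a crude signal of lost or amplified outliers). The function names, records, and threshold are hypothetical, and passing these checks would not by itself show that a dataset is private or representative.

```python
def exact_match_leaks(real_rows, synthetic_rows):
    """Return synthetic rows that are verbatim copies of real rows."""
    real_set = set(real_rows)
    return [row for row in synthetic_rows if row in real_set]

def tail_share(values, threshold):
    """Fraction of values above the threshold, i.e. how common 'outliers' are."""
    return sum(1 for v in values if v > threshold) / len(values)

# Hypothetical source records: (name, age, postal code). Values are made up.
real_records = [
    ("A. Example", 34, "10115"),
    ("B. Sample", 58, "80331"),
    ("C. Placeholder", 41, "20095"),
]
# Hypothetical generator output; the second row was copied from the source.
synthetic_records = [
    ("Person 1", 36, "10117"),
    ("B. Sample", 58, "80331"),
    ("Person 3", 40, "20099"),
]
leaks = exact_match_leaks(real_records, synthetic_records)

# Hypothetical claim amounts: the source contains one extreme value.
real_amounts = [10, 12, 11, 13, 95]
synthetic_amounts = [11, 12, 12, 13, 11]
real_tail = tail_share(real_amounts, threshold=50)
synthetic_tail = tail_share(synthetic_amounts, threshold=50)
```

Here `leaks` contains the copied record, and the tail shares differ (0.2 in the source, 0.0 in the synthetic set), showing that the outlier was lost in generation.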

Generative AI has been known to “hallucinate” incorrect information when it fails to recognize anomalies in the underlying model and draws conclusions that seem statistically likely but are not supported by the actual data. Any synthetic datasets created from those hallucinations inherit the same errors. Some fear that because of this phenomenon, the proliferation of synthetic data could, over time, introduce feedback loops that would make AI-generated information less reliable.

Ensuring the value of synthetic data will require robust human due diligence. Following the guidance of PwC’s “Responsible AI” toolkit can help.

Who should be paying attention?

There are potential applications for synthetic data in almost every business, with CIOs, CTOs, CISOs, and the research and development, data and analytics, legal and compliance, and marketing and sales departments likely already exploring their options. Industries that deal with issues of data privacy and access (notably healthcare, pharmaceuticals and life sciences, and financial services) are likely to see the greatest benefits.
