Tech Translated: Synthetic data

s+b a PwC publication

December 07, 2023

What is synthetic data?

Synthetic data is exactly what it sounds like: data that has been artificially created (usually via algorithms, statistical models, or generative AI), rather than generated directly from real-world activities. To develop synthetic data, information from almost any source is analyzed to detect structures and patterns, which are then used as the foundation for creating new datasets that mimic the core characteristics of the original.
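To make the "analyze, then generate" idea concrete, here is a minimal Python sketch. It is an illustration only, not a method described by PwC: it assumes purely numeric columns, treats each column as an independent bell curve (so any relationship between columns is ignored), and uses made-up source rows. The function names `fit_profile` and `sample_synthetic` are hypothetical.

```python
import random
import statistics

def fit_profile(rows):
    """Learn a simple 'pattern' from the source: mean and standard deviation per column."""
    columns = list(zip(*rows))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]

def sample_synthetic(profile, n, seed=0):
    """Draw n new rows from the learned per-column distributions.

    Assumption: columns are independent and roughly normal. Real tools
    model far richer structure than this.
    """
    rng = random.Random(seed)
    return [tuple(rng.gauss(mu, sigma) for mu, sigma in profile) for _ in range(n)]

# Hypothetical source data: (age, annual_spend)
source = [(34, 1200.0), (45, 1850.0), (29, 900.0), (52, 2100.0), (41, 1600.0)]

profile = fit_profile(source)
synthetic = sample_synthetic(profile, n=1000)

# The synthetic rows are new values, but their averages track the source.
print(round(profile[0][0], 1), round(statistics.mean([row[0] for row in synthetic]), 1))
```

None of the 1,000 generated rows is copied from the source, yet their column averages and spreads stay close to the original five rows, which is the sense in which synthetic data mimics core characteristics.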

What business problems can it address?

Identifying, collecting, and structuring relevant data in ways that enable it to inform business decisions is time-consuming, expensive, and potentially risky. At the same time, “Every business is dealing with data protection rules and the right handling of sensitive data,” says Marcus Hartmann, partner and chief data officer for PwC Germany and Europe. “Synthetic data can give a clear impression of something without exposing the underlying origins and sensitive information.”

When appropriate data is inaccessible due to concerns about confidentiality, privacy, or regulatory compliance—or simply doesn’t exist in sufficient quantities to be useful—synthetic datasets can sidestep these restrictions.

How does it create value?

Synthetic data is often a lower-cost, faster way to access vast quantities of data than traditional data collection and curation methods. This means it has the potential to turbocharge the data-driven transformation of every industry by becoming the foundation for training machine-learning models and AI, which in turn enables the development of new products, services, and ways of working—finally delivering on the promise of “big data” that got us all so excited a few years back.

Synthetic data is already being used in many industries. Amazon used synthetic data about speech patterns, syntax, and semantics to improve multilingual speech recognition in its Alexa virtual assistant. The UK’s National Health Service (NHS) has converted real-world data on patient admissions for accident and emergency (A&E) treatment into a statistically similar but anonymized open-source dataset to help NHS care organizations better understand and meet the needs of patients and healthcare providers. This kind of health data has also been leveraged by Alphabet and US insurance company Anthem to improve insurance fraud detection.

However, this is still relatively early-stage tech, and as with any other machine-generated information, the output is only as good as the inputs and algorithms. Anomalies and outliers in the source data can be amplified or lost altogether; either option will make the end product less representative of the real data it’s meant to replace. Synthetic datasets might also accidentally retain some personally identifiable information from the source, which could violate people’s privacy and expose organizations using the data to legal action.
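Both risks can be checked for, at least crudely. The Python sketch below is an assumption-laden illustration rather than an established audit procedure: `leaked_records` flags synthetic rows that exactly duplicate a source row, and `range_coverage` measures how much of the source's value range the synthetic values span, so a low score suggests outliers were lost. The function names and the sample records are hypothetical.

```python
def leaked_records(source_rows, synthetic_rows):
    """Return synthetic rows that exactly duplicate a row in the source."""
    source_set = set(source_rows)
    return [row for row in synthetic_rows if row in source_set]

def range_coverage(source_values, synthetic_values):
    """Fraction of the source's min-to-max range that synthetic values span."""
    src_span = max(source_values) - min(source_values)
    if src_span == 0:
        return 1.0
    low = max(min(source_values), min(synthetic_values))
    high = min(max(source_values), max(synthetic_values))
    return max(0.0, high - low) / src_span

# Hypothetical records: (name, age). The 97-year-old is an outlier.
source = [("A. Smith", 34), ("B. Jones", 45), ("C. Lee", 97)]
synthetic = [("X1", 36), ("B. Jones", 45), ("X2", 41)]

leaked = leaked_records(source, synthetic)
coverage = range_coverage([age for _, age in source], [age for _, age in synthetic])

print(leaked)              # one real record survived into the synthetic set
print(round(coverage, 2))  # a low value: the outlier age is not represented
```

Exact-match checks catch only the most obvious leaks. A record that is nearly identical to a real person's can still be identifying, which is one reason human review remains necessary.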

Generative AI has been known to “hallucinate” incorrect information when it fails to recognize anomalies in the underlying model and draws conclusions that seem statistically likely but are not supported by the actual data. Any synthetic datasets created from those hallucinations inherit the same errors. Some fear that because of this phenomenon, the proliferation of synthetic data could, over time, introduce feedback loops that would make AI-generated information less reliable.
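The feedback-loop concern can be illustrated with a toy simulation in Python. This is a simplified sketch under stated assumptions, not a model of any real AI system: each "generation" learns only a mean and a spread from a small sample produced by the previous generation, then generates the next sample from what it learned. The function name, sample size, and number of generations are all hypothetical choices.

```python
import random
import statistics

def simulate_feedback_loop(generations=30, sample_size=20, seed=0):
    """Repeatedly refit a normal distribution to data sampled from the previous fit.

    Returns the estimated spread (standard deviation) after each generation,
    starting from a 'real-world' spread of 1.0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    spreads = [sigma]
    for _ in range(generations):
        # Generate synthetic data from the current model...
        sample = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # ...then train the next model on that synthetic data alone.
        mu = statistics.mean(sample)
        sigma = statistics.stdev(sample)
        spreads.append(sigma)
    return spreads

spreads = simulate_feedback_loop()
print([round(s, 3) for s in spreads[::10]])
```

Because no generation ever sees the original data again, small sampling errors are never corrected and compound from one round to the next. In runs like this, the estimated spread tends to wander away from the true value of 1.0, and often shrinks, which mirrors the worry that models trained on their own output lose touch with reality.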

Ensuring the value of synthetic data will require robust human due diligence. Following the guidance of PwC’s “Responsible AI” toolkit can help.

Who should be paying attention?

There are potential applications for synthetic data in almost every business, with CIOs, CTOs, CISOs, and the research and development, data and analytics, legal and compliance, and marketing and sales departments likely already exploring their options. Industries that deal with issues of data privacy and access—notably, healthcare, pharmaceuticals and life sciences, and financial services—are likely to see the greatest benefits.

Last updated on 7 December 2023

Contact us

Joe Atkinson

Global Chief AI Officer for the PwC Network of Firms, PwC US

Tel: +1 215-704-0372

Matt Wood

Global and US Commercial Technology & Innovation Officer (CTIO), PwC US

© 2017 - 2025 PwC. All rights reserved. PwC refers to the PwC network and/or one or more of its member firms, each of which is a separate legal entity. Please see www.pwc.com/structure for further details.