How Do Companies Use Synthetic Data in Machine Learning?

Dr. Sigal Shaked, CTO of Datomize

You can’t launch a machine learning project without data, but you can’t use your data for machine learning if doing so poses a disclosure risk. This conundrum has frustrated the plans of companies across all sectors for years. Sure, there are plenty of privacy-enhancing technologies out there, but none of them offers a perfect solution. The risks of data leaks and re-identification are a constant barrier.

Unless, that is, you use synthetic data in machine learning projects. Synthetic data generation is a method that creates an entirely artificial dataset from the original data, retaining its statistical distributions and the insights they carry, but without tracing back to any real people. You can then use this synthetic test data for any data science purpose you like without worrying about the disclosure risk. Exactly what you decide to do with that data next depends on your industry and the specific needs of your business. For example, synthetic data for financial services enables banks and other financial institutions to build models that vastly improve KYC and customer onboarding while minimizing lending risk. Other industries might want to focus their efforts on improving marketing or removing inefficiencies from their supply chains.
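To make the idea concrete, here is a minimal Python sketch of the principle described above: learn the statistical shape of a dataset, then draw brand-new rows from that shape. It is a toy under stated assumptions, not how any particular product works. The "real" table is itself simulated, the column meanings (income and loan amount) are hypothetical, and a simple multivariate Gaussian stands in for the far richer generative models used in practice. It shows that averages and correlations can carry over to the artificial rows; it does not demonstrate a privacy guarantee.

```python
import math
import random


def fit_gaussian(rows):
    """Estimate the mean vector and sample covariance matrix of numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
         for j in range(d)]
        for i in range(d)
    ]
    return means, cov


def cholesky(matrix):
    """Return lower-triangular L such that L times its transpose equals matrix."""
    d = len(matrix)
    lower = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def sample_synthetic(means, cov, count, seed=0):
    """Draw new rows from the fitted Gaussian; no original row is copied."""
    rng = random.Random(seed)
    lower = cholesky(cov)
    d = len(means)
    rows = []
    for _ in range(count):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        rows.append([
            means[i] + sum(lower[i][k] * z[k] for k in range(i + 1))
            for i in range(d)
        ])
    return rows


def correlation(rows):
    """Pearson correlation between the first two columns."""
    _, cov = fit_gaussian(rows)
    return cov[0][1] / math.sqrt(cov[0][0] * cov[1][1])


# Hypothetical stand-in for a sensitive table: (income, loan amount) pairs.
_rng = random.Random(42)
real = []
for _ in range(2000):
    income = _rng.gauss(50.0, 10.0)
    loan = 0.5 * income + _rng.gauss(0.0, 3.0)
    real.append([income, loan])

means, cov = fit_gaussian(real)
synthetic = sample_synthetic(means, cov, count=2000, seed=7)

print("real correlation:     ", round(correlation(real), 3))
print("synthetic correlation:", round(correlation(synthetic), 3))
```

The two printed correlations should land close to each other, which is the property that lets a model trained on the synthetic rows pick up the same relationships as one trained on the originals. Real datasets with categorical fields, skewed distributions or time dependencies need more capable generators than this Gaussian stand-in.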

Synthetic Data Generation Opens Up Machine Learning Opportunities in Three Key Ways:

1. It Gives You a Risk-Free Training Dataset for Machine Learning

How much of your valuable data is currently locked away in silos or behind walls of encryption, keeping it safe? How many months or years does it take to jump through the requisite compliance hoops to free up that data so you can use it in machine learning projects? With synthetic data generation, you can skip these steps and jump straight to experimenting with your data. Rather than waiting around until the underlying data is no longer fresh enough to deliver genuinely useful insights, your synthetic test data is ready to use immediately, without any risk of disclosure or re-identification.

This broadens your scope, too. Many privacy-enhancing technologies function by obscuring, encrypting or scrambling data, or dividing up datasets between different parties so that no one stakeholder has complete access. Essentially, each of these approaches is designed to make it harder for you to truly understand what you’re looking at or make sense of how all the data points interlink.

This is entirely at odds with the point of machine learning and predictive analytics, which is all about teasing out vital connections, nuance and context, providing deep-level understanding of the business and giving you a competitive edge. Using synthetic data means you can fully unlock that potential.

2. It Helps You Monetize Your Data

Far-reaching regulation like GDPR places tight restrictions on what you can do with the data you collect on customers, users and website visitors. It’s highly unlikely you’d be able to sell or repackage this for third parties without express permission. You may not even be able to store certain types of historical data for very long or to use it for internal data science projects that deviate from the original purpose of collection. This greatly limits the commercial opportunities of your data.

However, the rules apply to real people’s data that you had to seek consent to collect, not artificial data that you create yourself. As such, synthetic data opens up a myriad of potentially lucrative new revenue streams, including machine-learning-backed products.

3. It Allows You to Collaborate with External Vendors and Partners

It’s hard enough to share data between internal teams for data science projects. Getting the green light to hand your data over to external developers like fintech companies is typically a long and arduous process.

This can nip otherwise exciting partnerships in the bud – especially when you’re in the early stages of assessing potential vendors and just want to see what they can do with your data. No company wants to go through a year or more of bureaucracy before they can share a dataset with a third party, only to decide that this collaboration isn’t right for them – or, worse, discover that a competitor has now beaten them to it.

If you have a synthetic dataset ready to go, you can share this swiftly and easily, even across national borders. Plus, since you don’t need to store the data safely on-premises, you can make use of cloud storage services and applications that remove pain points from the process and free you up to innovate.

Final thoughts

As privacy and security legislation intensifies around the world and compliance obligations become ever more onerous, more companies are exploring how to generate synthetic data for their machine learning projects. The next few years are likely to see an explosion of game-changing, machine-learning-backed products and collaborations in industries that are usually hamstrung by rules around using sensitive data.

Without a versatile, privacy-risk-free training dataset for machine learning, you’ll be stuck at the back of the queue. Don’t leave it too late to embrace the inevitable.