Synthetic Data Generation in eCommerce

As ecommerce businesses make more use of machine learning to improve pricing decisions, assortment planning and demand forecasting, it becomes more important to train the underlying models accurately, which in turn depends on large sets of training data. For businesses without large data sets, and for data that raises privacy concerns, synthetic data generation offers a solution.

Synthetic Data Generation is a data science technique for producing data that mimics known data sets statistically, but without being copied from them.

How does it work?

The input into the synthetic data generator is a set of training data, for example actual sales transactions.

The generator is a machine learning algorithm that is trained to produce an output data set, the synthetic data, that is statistically indistinguishable from the training data.

During training, each time the generator produces an output data set, a discriminator (a second machine learning algorithm) judges whether that output can still be told apart from the training data. If it can, that result is fed back to the generator to help it improve. If it cannot, the generator has successfully created synthetic data.
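To make this feedback loop concrete, here is a minimal one-dimensional sketch in Python. Everything in it is an illustrative assumption rather than a production method: the generator learns only a single shift value, the discriminator is a plain logistic regression, and the names (train_toy_adversarial, REAL_MEAN), learning rate and step count are hypothetical.

```python
import math
import random

REAL_MEAN = 2.0  # assumed mean of one (standardized) real column


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_adversarial(steps=4000, batch=64, lr=0.05, seed=0):
    """Alternate discriminator and generator updates on 1-D data."""
    rng = random.Random(seed)
    shift = 0.0      # generator parameter: fake = noise + shift
    w, c = 0.0, 0.0  # discriminator: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        real = [rng.gauss(REAL_MEAN, 1.0) for _ in range(batch)]
        fake = [rng.gauss(0.0, 1.0) + shift for _ in range(batch)]

        # Discriminator step: push D(real) up and D(fake) down.
        grad_w = grad_c = 0.0
        for x in real:
            err = 1.0 - sigmoid(w * x + c)
            grad_w += err * x
            grad_c += err
        for x in fake:
            err = -sigmoid(w * x + c)
            grad_w += err * x
            grad_c += err
        w += lr * grad_w / batch
        c += lr * grad_c / batch

        # Generator step: the discriminator's verdict is the feedback.
        # d/d(shift) of log D(fake) equals (1 - D(fake)) * w.
        grad_shift = sum((1.0 - sigmoid(w * x + c)) * w for x in fake) / batch
        shift += lr * grad_shift
    return shift, w, c
```

In this toy setting the learned shift should drift toward REAL_MEAN as the discriminator finds it harder to separate the two samples. Real tabular generators are far larger, but they follow the same alternating pattern of updates.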

Since there is no limit to the quantity of synthetic data that a trained generator can produce, these synthetic data sets can be used to train deep neural networks that need vast quantities of training data.

Machine Learning Models For Synthetic Data Generation

The two main ML approaches for generating synthetic data are generative adversarial networks (GAN) and variational autoencoders (VAE). Both are suitable, each with pros and cons.

In ecommerce, most synthetic data needs to be in the form of tables, for example a table of sales transactions, with columns such as date, time, value, customer ID, and so on. To generate synthetic tabular data correctly, the generator must match not only the distribution of values in each column of the training data (for example, most sales transactions are made during the day) but also the relationships between columns (for example, some customers place higher-value orders than others).
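One way to check both requirements is to compare simple statistics of the real and synthetic tables. The sketch below is an assumption about how such a check might look, not an established metric: it compares the mean and standard deviation of two columns, plus the Pearson correlation between them. The function names and the dictionary-per-row layout are hypothetical.

```python
import math
from statistics import mean, pstdev


def column(rows, name):
    return [row[name] for row in rows]


def pearson(xs, ys):
    # Pearson correlation coefficient of two equally long sequences.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def fidelity_gaps(real_rows, synth_rows, col_a, col_b):
    """Absolute gaps in per-column statistics and in the column-pair correlation."""
    gaps = {}
    for name in (col_a, col_b):
        real, synth = column(real_rows, name), column(synth_rows, name)
        gaps[f"mean:{name}"] = abs(mean(real) - mean(synth))
        gaps[f"std:{name}"] = abs(pstdev(real) - pstdev(synth))
    real_corr = pearson(column(real_rows, col_a), column(real_rows, col_b))
    synth_corr = pearson(column(synth_rows, col_a), column(synth_rows, col_b))
    gaps["corr"] = abs(real_corr - synth_corr)
    return gaps
```

A synthetic table can score perfectly on every per-column gap and still fail on the correlation gap, for instance if order values are correct individually but attached to the wrong hours. That is why both kinds of check are needed.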

Neural networks are well suited to mimicking the statistical properties of individual columns and of multi-column data sets. GANs and VAEs are both types of neural network.


With a VAE, an encoder maps the original training data to a latent distribution, random noise is introduced when a sample is drawn from that distribution, and a decoder transforms the sample back into the original space. Training measures the error between the decoder's output and the input data and works to minimize it.
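The pieces of that training objective can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the encoder and decoder networks are omitted, mu and log_var stand for what a hypothetical encoder would output, and squared error is assumed as the reconstruction measure. The standard VAE objective also adds a regularization term (the KL divergence to a standard normal) alongside the reconstruction error, so it is included here.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    # Draw z = mu + sigma * eps with eps ~ N(0, 1); this is where noise enters.
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions.
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def reconstruction_error(x, x_hat):
    # Squared error between the input row and the decoder's output.
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    # Quantity minimized during training; beta is a hypothetical weighting knob.
    return reconstruction_error(x, x_hat) + beta * kl_to_standard_normal(mu, log_var)
```

Once trained, new synthetic rows would come from feeding random latent samples through the decoder alone.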

VAEs are simple to implement and easy to train, but they struggle when the data mixes several data types: text (e.g. name), binary (e.g. password hash), categorical (e.g. product category) and continuous (e.g. time of day).
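A common preprocessing step, assumed here rather than taken from any particular tool, is to turn each row into a purely numeric vector before it reaches the network: categorical columns are one-hot encoded and continuous columns are scaled to the range 0 to 1. The sketch below covers only those two types, and the function and column names are hypothetical.

```python
def fit_encoder(rows, categorical, continuous):
    """Record the category lists and value ranges seen in the training rows."""
    spec = {"categories": {}, "ranges": {}}
    for name in categorical:
        spec["categories"][name] = sorted({row[name] for row in rows})
    for name in continuous:
        values = [row[name] for row in rows]
        spec["ranges"][name] = (min(values), max(values))
    return spec


def encode_row(row, spec):
    """Return one numeric vector: one-hot slots first, scaled values after."""
    vector = []
    for name, cats in spec["categories"].items():
        vector.extend(1.0 if row[name] == cat else 0.0 for cat in cats)
    for name, (lo, hi) in spec["ranges"].items():
        span = hi - lo
        vector.append((row[name] - lo) / span if span else 0.0)
    return vector
```

Free text and binary columns are deliberately left out; they usually need separate handling, which is part of why mixed tables are hard for a simple VAE.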


If a VAE cannot adequately support the data, a GAN can be used. GANs were developed to assist with unsupervised learning. A GAN trains two neural networks, a generator and a discriminator. The generator is trained to produce data that mimics the training data, and the discriminator is trained to tell synthetic data apart from original data. GANs are good for synthetic image generation and other applications dealing with unstructured data. The downsides of a GAN are the effort required to train the models and the risk that training fails to converge, so that it can effectively go on forever.
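One practical guard against open-ended training is to give it a fixed step budget and a stopping rule. The sketch below is a heuristic under stated assumptions, not a guarantee of quality: train_step is a hypothetical callable that runs one round of GAN training and returns the discriminator's accuracy, and training stops early once that accuracy has hovered near 50% (no better than guessing) over a recent window.

```python
from collections import deque


def train_with_budget(train_step, max_steps=10_000, window=50, tolerance=0.05):
    """Run train_step until the discriminator is near chance or the budget ends."""
    recent = deque(maxlen=window)  # rolling record of discriminator accuracy
    for step in range(1, max_steps + 1):
        recent.append(train_step())
        if len(recent) == window:
            average = sum(recent) / window
            if abs(average - 0.5) <= tolerance:
                return step, "discriminator near chance"
    return max_steps, "step budget exhausted"
```

A discriminator at chance level can also mean that the discriminator itself is weak, so the synthetic output should still be checked against the real data before it is used.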

Future of Synthetic Data Generation

Gartner claims synthetic data is fundamental to producing well-trained ML models.


This article was updated on April 9, 2022

M Ryan

M Ryan is an ecommerce consultant with twenty years' experience working with retailers, consumer brand manufacturers and other consumer-facing businesses, helping them to develop their ecommerce strategy, implement ecommerce technology and improve their ecommerce operations. He works extensively throughout the US and Europe, with clients including global brands, large retailers and household names in consumer goods.