Guide to synthetic test data

Synthetic data can replicate real-world scenarios in software testing. Discover how synthetic data addresses challenges posed by real-world data while preserving its advantages.

Testers need high-quality, diverse and secure data to succeed, but such data can be hard to come by.

For example, testers might encounter a situation where they need to test a bug or new feature but don't have enough unique data in the test environment to exercise the tests. They might also run into situations where they need large data sets to run the proper performance tests against the test environment. Synthetic test data gives testers an avenue for overcoming these problems.

What is synthetic test data?

The word synthetic describes something artificial that is created to imitate an original. Synthetic test data is data that's artificially created to imitate data from a production environment. It is useful in both exploratory testing and automated testing.

Testers use synthetic data to increase the amount of unique data they have to work with without creating it manually. A production environment typically holds more data than a test environment, which means more ways the software will be used and, in turn, more scenarios and code paths that will be executed. Effective testing exercises as much data and as many unique scenarios as possible before the software's release, so having synthetic data readily available rather than creating it by hand can save testers a lot of time and effort.

Testers also use synthetic data to ensure data privacy for the customers of their product. Synthetic data can function in a test environment just as real data can, without exposing the sensitive customer information associated with real data.

Synthetic data can imitate real-world data in automated tests. Testers can build scripts and tools that enable them to generate synthetic data during automated tests, which can help find edge case bugs that would otherwise go unnoticed.
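For instance, a test could generate a unique user on every run instead of relying on a fixed fixture. The following is a minimal sketch in Python using the open-source Faker library; the create-user endpoint, URL and payload fields are hypothetical, not taken from any specific product:

    import requests
    from faker import Faker

    fake = Faker()

    def test_create_user_with_synthetic_data():
        # Each run generates a brand-new, unique user payload.
        payload = {
            "first_name": fake.first_name(),
            "last_name": fake.last_name(),
            "email": fake.email(),
        }
        # Hypothetical endpoint; replace with the application under test.
        response = requests.post("https://app.example.com/api/users", json=payload)
        assert response.status_code == 201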

Use cases for synthetic test data

In testing software, there are two main ways that teams discover issues: manual testing, in which humans exercise tests, and automated testing, in which the computer exercises tests.

In manual testing, synthetic data has several use cases:

  • Eliminates privacy and security risks.
  • Generates data similar in shape to that in a production environment.
  • Creates massive amounts of data very quickly to simulate a production-sized data set and get more realistic test results.

Synthetic data also has several use cases in automated testing:

  • Ensures automated tests are deterministic, meaning the tests are more reliable and repeatable due to the type and quality of the data that is generated (a seeding sketch follows this list).
  • Can be applied to negative test scenarios when there is a need to validate proper error handling, both in the user interface and in the API.
  • Can be created programmatically on the fly during an automated test run.
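Seeding the generator is one common way to achieve the determinism described in the first bullet. As a sketch, the Python Faker library lets a test fix the random seed so every run produces identical values:

    from faker import Faker

    Faker.seed(4321)  # fixing the seed makes every run reproduce the same values
    fake = Faker()

    print(fake.first_name())  # same name on every test run
    print(fake.email())       # same email on every test run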

Beyond the two main testing types above, synthetic test data is useful in many other specific testing types, such as the following:

  • Integration testing. Synthetic test data can help testers evaluate different integration points between applications and identify places where data does not transmit between them correctly.
  • Performance testing. Testers can generate high volumes of synthetic data to test an application's ability to function under high levels of traffic.
  • Regression testing. Testers can use synthetic data to determine if code changes introduce new bugs or break application functionality.
  • Unit testing. Testers can use synthetic data to test individual application units in isolation.

Benefits of synthetic test data

There are several benefits to using synthetic data in testing as opposed to using real data.

  • Quality. Test teams can design synthetic data to meet quality standards and limit the number of errors in the data, leading to more reliable tests overall.
  • Quantity. Test teams can generate large quantities of synthetic data where manual data generation might be limited in terms of quantity.
  • Diversity. Synthetic data is customizable. Testers can augment it to suit their use case. They can add data to represent situations that are not present in authentic data sets or mold synthetic data sets to mimic authentic ones.
  • Privacy, security and compliance. Synthetic data protects customers' sensitive personal information by creating fake data with no sensitive elements that behave similarly to authentic data. By protecting customer data, synthetic data also helps companies comply with data protection regulations and reduces legal risk and potential reputational damage for the company.

Challenges of synthetic test data in software testing

Despite the benefits of synthetic test data, there are several challenges development teams can face when using it:

  • Time-consuming. Generating synthetic data for every single test case or scenario can be time-consuming, depending on your approach and needs.
  • Tooling. Synthetic data creation requires specific tooling and techniques. Creating a bulk amount of synthetic test data without the right tools is a tedious task.
  • Learning curve. Learning about data models and how data is structured within a data store can be a new challenge for software testing professionals.
  • Programming. Architecting and building a reliable method for generating synthetic data will require programming skills.

How to generate and use synthetic test data

Testers have several methods at their disposal when it comes to synthetic data creation.

Random data generation

There are libraries for all the popular programming languages that provide ways to generate synthetic data. These libraries can create random first names, last names, phone numbers, email addresses, passwords, URLs, company names and more, giving users randomness in a data set or in test data for automation. Many of these libraries are also locale- and format-aware, depending on the library and the data being requested. For example, a synthetically generated postal code could be a five-digit number for a United States address or a six-character alphanumeric code for a Canadian address.
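As a sketch, the Python Faker library covers the data types mentioned above, and its locale support produces the postal code behavior just described (the printed values are examples only):

    from faker import Faker

    fake_us = Faker("en_US")  # United States locale
    fake_ca = Faker("en_CA")  # Canadian locale

    print(fake_us.first_name(), fake_us.last_name())
    print(fake_us.phone_number())
    print(fake_us.email())
    print(fake_us.company())
    print(fake_us.url())

    # Postal codes follow each locale's format.
    print(fake_us.postcode())  # e.g., 90210
    print(fake_ca.postcode())  # e.g., K1A 0B1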

Rule-based generation

Similar to random data generation, the same types of libraries can generate rule-based synthetic data. A few examples of rule-based generation include the following (a sketch follows the list):

  • A number range with a minimum of 1 and a maximum of 500.
  • A five-digit number between 00000 and 99999.
  • A company name that is less than 25 characters.
  • Data based on an array of eight different items from a dropdown box.
  • A string that can check for cross-site scripting (XSS) injection, such as "<script>alert('XSS')</script>".

This rule-based data generation is more precise than random data generation. Testers can use it within automated tests not only to generate certain data types, but also to vary the data based on certain rules. For example, suppose a test runs on the weekend but the software doesn't allow invoices to be created with a due date of Saturday or Sunday. Testers could create a rule that always generates a future due date falling on a weekday. This makes the data more deterministic in nature, which is always a goal in test automation.
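A minimal sketch of that weekday rule in Python, assuming due dates between 1 and 30 days out:

    from datetime import date, timedelta
    import random

    def future_weekday_due_date(min_days: int = 1, max_days: int = 30) -> date:
        # Pick a random future date, then push it forward past any weekend.
        due = date.today() + timedelta(days=random.randint(min_days, max_days))
        while due.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            due += timedelta(days=1)
        return due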

Data masking and anonymization

Anonymization and masking can be a risky and complex strategy. The approach starts with copying real data from a production environment -- which might include sensitive or personally identifiable information -- and attempting to anonymize it by changing certain values, such as first and last names, to random data. It is easy to miss a data scenario or column when working with larger data sets.

Data anonymization can also include dissociation of data -- in which testers use the actual data but swap individuals' first and last names randomly or reassociate actual addresses with different users. The goal is to make customers' information or data no longer recognizable, in an irreversible way.
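A minimal sketch of dissociation in Python, assuming rows of dictionaries with hypothetical first_name and last_name columns:

    import random

    def dissociate_names(rows: list[dict]) -> list[dict]:
        # Collect all name pairs, shuffle them, then reassign them
        # so no name remains attached to its original record.
        names = [(r["first_name"], r["last_name"]) for r in rows]
        random.shuffle(names)
        for row, (first, last) in zip(rows, names):
            row["first_name"], row["last_name"] = first, last
        return rows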

Another approach might be to mask certain information within the data. This could mean keeping the first and last letters of a name and replacing the rest of the characters with asterisks, which can be useful for any string data type. This can be tricky depending on what data is allowed within the application through the user interface or API. The user interface might block special characters, such as "*", which could cause errors when testing or editing these data fields during test case execution.
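A minimal sketch of that masking rule in Python, keeping the first and last characters of any string:

    def mask(value: str) -> str:
        # Keep the first and last characters; replace the rest with asterisks.
        if len(value) <= 2:
            return value  # too short to mask meaningfully
        return value[0] + "*" * (len(value) - 2) + value[-1]

    print(mask("Johnson"))  # J*****n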

In both anonymization and masking, data transformation is required. This is typically the "T" in the extract, transform, load (ETL) process. For each column or row of data that's used in a data store, data must be extracted from the source system, transformed using either anonymization or masking logic -- which must be built and maintained -- and then loaded into the target system. Many commercial tools offer this ability, but they are costly, so many teams elect to build and maintain their own. Of all the ways to generate data, this is the costliest, but it provides the most realistic data to test with.
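A minimal sketch of that transform step in Python, using the same masking rule as the sketch above; the CSV layout and column names are hypothetical:

    import csv

    def mask(value: str) -> str:
        # Same masking rule as the earlier sketch.
        return value if len(value) <= 2 else value[0] + "*" * (len(value) - 2) + value[-1]

    def etl_anonymize(source_path: str, target_path: str) -> None:
        with open(source_path, newline="") as src, \
             open(target_path, "w", newline="") as dst:
            reader = csv.DictReader(src)  # extract from the source system
            writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
            writer.writeheader()
            for row in reader:
                # Transform: apply masking logic to each sensitive column.
                row["first_name"] = mask(row["first_name"])
                row["last_name"] = mask(row["last_name"])
                writer.writerow(row)  # load into the target file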

Generative models

With the recent advancement of artificial intelligence, it's possible to use generative models to assist in creating test data. Many tools on the market assist in this process. Look for tools that can generate large volumes of synthetic data using a statistical model; the tool should be able to reproduce the correlations and distributions of the data set provided. The benefit of a generative AI approach to synthetic test data is that the model learns to mimic the production data set through training.

The typical steps to generate test data using generative models include the following (a sketch using one open-source library follows the list):

  1. Provide a data source or CSV data file with columns and rows -- e.g., a database table.
  2. Train a new AI model on this production data.
  3. Provide output parameters to the model outlining how much data should be generated.
  4. Receive a CSV data file containing fully anonymized, synthetic test data generated by artificial intelligence.
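As a sketch, these four steps map onto the open-source SDV library as follows; the customers.csv file name and the choice of synthesizer are illustrative:

    import pandas as pd
    from sdv.metadata import SingleTableMetadata
    from sdv.single_table import GaussianCopulaSynthesizer

    # Step 1: load a production-shaped CSV file (name is illustrative).
    real_data = pd.read_csv("customers.csv")

    # Step 2: train a statistical model on that data.
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(real_data)
    synthesizer = GaussianCopulaSynthesizer(metadata)
    synthesizer.fit(real_data)

    # Step 3: tell the model how many rows to generate.
    synthetic_data = synthesizer.sample(num_rows=10_000)

    # Step 4: write the fully synthetic result for import into a test database.
    synthetic_data.to_csv("synthetic_customers.csv", index=False)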

Then, import the newly created data file into the test database. Once testers have this data set, they typically won't need to regenerate it unless new data columns are added, in which case they can repeat the same process for the entire table or just the new column.

Butch Mayhew is a Playwright Ambassador and a writer at Playwright Solutions with many years of experience building software with a focus on quality.
