Tip

How to build a master data index: Static vs. dynamic indexing

Expert David Loshin explores the differences between static and dynamic indexing in master data management systems, and which queries each approach can support.

Master data management systems are intended to present a unified view of information about key data domains, such as customers or products, using data pulled from original sources located within and outside the organization. Identity resolution and record linkage techniques are used to load all of the input source records, block the records according to predefined strategies, look for similarities and link records presumed to represent the same real-world entities. In some MDM systems, data collected from the linked records is combined into a single master record.

However, there is a risk that such artificially produced master records are inconsistent with the original source records. An alternative way to provide access to information about a sought-after entity is to use a searchable master data index. The goal of the master index is to allow consumer applications to request information about a named entity and retrieve all the original records that have been linked together.

Identity resolution is typically performed as a batch operation: data is pulled from the original sources, the values of the data attributes to be used for similarity scoring (we will call them the "matching attributes") are extracted, and sets of similar records are linked into groups. Each group of linked records is assigned a unique identifier, and this unique identifier becomes the key for building the master data index. That index consists of two mapping tables: the search table maps each record's set of matching attribute values to the unique identifier, and the index table maps the assigned unique identifier to all records assigned that identifier.
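As a rough illustration of the two mapping tables, the following Python sketch builds them from a handful of records. The record layout, the source names, the identifiers "E1" and "E2" and the `links` mapping (standing in for the output of the batch identity resolution step) are all assumptions made for the example, not part of any particular MDM product.

```python
from collections import defaultdict

# Hypothetical source records; the three name/phone fields play the
# role of the matching attributes.
records = [
    {"source": "crm", "id": "c-1", "last": "Smith", "first": "John", "phone": "555-0100"},
    {"source": "billing", "id": "b-7", "last": "Smith", "first": "Jon", "phone": "555-0100"},
    {"source": "crm", "id": "c-2", "last": "Jones", "first": "Mary", "phone": "555-0199"},
]

# Assumed result of batch identity resolution: each record is mapped to
# the unique identifier of the group it was linked into.
links = {("crm", "c-1"): "E1", ("billing", "b-7"): "E1", ("crm", "c-2"): "E2"}


def build_master_index(records, links):
    """Return (search_table, index_table) for the linked records."""
    search_table = {}               # matching attributes -> unique identifier
    index_table = defaultdict(list)  # unique identifier -> linked source records
    for rec in records:
        record_key = (rec["source"], rec["id"])
        entity_id = links[record_key]
        matching_values = (rec["last"], rec["first"], rec["phone"])
        # Simplification: identical matching values are assumed to belong
        # to a single group, so a plain dict is enough here.
        search_table[matching_values] = entity_id
        index_table[entity_id].append(record_key)
    return search_table, dict(index_table)


search_table, index_table = build_master_index(records, links)
```

In this sketch both spellings of the first name ("John" and "Jon") point to the same identifier in the search table, and that identifier leads to both source records in the index table.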

Static indexing vs. dynamic indexing

This index configuration provides what could be called a static master index used for search and retrieve. The search process begins with a consumer request for any records associated with a set of presented matching values (such as a customer's last name, first name and telephone number). The search table is queried to find any records with the presented matching values. If any records are found, it means there was a match in the data set, and for each of the found records, the corresponding unique identifier is looked up in the index table to find all other records linked to the found record. All those associated records can be retrieved and assembled into a result set given back to the data consumer.
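The search-and-retrieve flow just described can be sketched in a few lines of Python. The table contents, the identifiers and the function name `static_lookup` are hypothetical; a real system would more likely issue queries against database tables than look up in-memory dictionaries.

```python
# Hypothetical search table: matching attributes -> unique identifier.
search_table = {
    ("Smith", "John", "555-0100"): "E1",
    ("Smith", "Jon", "555-0100"): "E1",
    ("Jones", "Mary", "555-0199"): "E2",
}

# Hypothetical index table: unique identifier -> linked source records.
index_table = {
    "E1": [("crm", "c-1"), ("billing", "b-7")],
    "E2": [("crm", "c-2")],
}


def static_lookup(last, first, phone):
    """Exact-match search followed by retrieval of all linked records."""
    entity_id = search_table.get((last, first, phone))
    if entity_id is None:
        return []  # no record carries exactly these matching values
    return list(index_table.get(entity_id, []))


linked = static_lookup("Smith", "Jon", "555-0100")
no_match = static_lookup("Smyth", "John", "555-0100")
```

Here `linked` contains both records assigned to "E1", even though only one of them was stored under the presented values. The second call returns an empty list, because no entry in the search table carries the spelling "Smyth".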

This master index solution works well, as long as there is an exact match for the attributes provided by the consumer seeking the data. The challenge is that even though this configuration is designed to link records in the presence of data variation, it does not support approximate searching, in which there is tolerance for variation in the presented matching values. In other words, unless you know the exact values for at least one of the indexed records, you won't be able to find any matches.

This suggests the need for a second type of master data index that can be called a dynamic index. A dynamic indexing system uses the same two mapping tables, but it also relies on the same type of identity resolution techniques used to create the master data index in the first place.

The records in the search table need to be blocked according to the same blocking keys used for the identity resolution process to create the master data index. Any set of matching values presented by a data consumer is used to determine the blocks that might contain matching records, and the records in those blocks are selected from the search table. Instead of executing an exact match, each selected record is paired with the presented matching values and is subjected to the same similarity scoring method used for the batch identity resolution. At this point, the mapped unique identifiers for any search records with scores at or above the matching threshold are used to search the index table to find all other records that are linked to the found record. All the associated records can be retrieved and assembled into a result set given back to the data consumer.
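A minimal Python sketch of this dynamic lookup follows. Everything specific in it is an assumption chosen for illustration: the blocking key (first letter of the last name), the similarity score (the average of `difflib.SequenceMatcher` ratios over the three attributes), the 0.85 threshold and the sample data. A production system would reuse whatever blocking strategy and scoring method its batch identity resolution already applies.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical search table rows: (matching attributes, unique identifier).
search_rows = [
    (("Smith", "John", "555-0100"), "E1"),
    (("Smith", "Jon", "555-0100"), "E1"),
    (("Stone", "Sara", "555-0142"), "E3"),
    (("Jones", "Mary", "555-0199"), "E2"),
]

# Hypothetical index table: unique identifier -> linked source records.
index_table = {
    "E1": [("crm", "c-1"), ("billing", "b-7")],
    "E2": [("crm", "c-2")],
    "E3": [("crm", "c-3")],
}


def blocking_key(attrs):
    # Assumed blocking strategy: first letter of the last name.
    return attrs[0][:1].lower()


def similarity(a, b):
    # Assumed scoring method: mean per-attribute string similarity.
    ratios = [SequenceMatcher(None, x.lower(), y.lower()).ratio()
              for x, y in zip(a, b)]
    return sum(ratios) / len(ratios)


# Block the search table rows once, ahead of any lookups.
blocks = defaultdict(list)
for attrs, entity_id in search_rows:
    blocks[blocking_key(attrs)].append((attrs, entity_id))


def dynamic_lookup(presented, threshold=0.85):
    """Score candidates in the presented values' block, then follow links."""
    candidates = blocks.get(blocking_key(presented), [])
    matched_ids = {entity_id for attrs, entity_id in candidates
                   if similarity(presented, attrs) >= threshold}
    results = []
    for entity_id in sorted(matched_ids):
        results.extend(index_table.get(entity_id, []))
    return results


found = dynamic_lookup(("Smyth", "John", "555-0100"))
```

With these sample values, the misspelled "Smyth" still scores above the threshold against both "Smith" rows, so `found` holds the two records linked under "E1". The "Stone" row sits in the same block but scores well below the threshold, and the "Jones" row is never scored because it falls in a different block.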

The response time for the static master data index is relatively fast, as a search can be executed using a standard query followed by table accesses using that query's results. The searching process for dynamic indexing may take longer, because every candidate record in the selected blocks must be scored, but it finds matching records even when the presented values differ from the stored ones, providing more complete visibility of information about the sought-after entity.