Sergey Nivens - Fotolia
Data lineage documentation imperative to data quality
Understanding the detailed journey of data elements throughout the data pipeline can help an enterprise maintain data quality and improve trustworthiness.
There are many reasons why you might need to understand the history -- or lineage -- of a piece of data. If you see a sales figure in a report then, you want to know where that figure came from, what system or systems stored it and when it was produced.
Just as your passport has stamps that indicate which countries you have visited, a data element similarly has a journey through the systems of a company. From where it was first entered to being copied to other systems, each stage of the journey has a timestamp, just like the stamps on your passport show the progress of your travels around the world. This ability to trace the path of data through an enterprise is called data lineage.
Data may go through one or more business processes and have controls applied to it at different stages, such as data quality validation -- e.g., verifying a postcode or checking that a value is within a valid range. Think of data lineage documentation as a kind of treasure map for your data that shows the passage of data through your systems.
Data lineage tools
How does an enterprise provide such a data audit trail or map of data's journey? It is possible to draw diagrams to illustrate the flow of data from system to system, but to do so manually would be impractical at scale.
Fortunately, there is no shortage of data lineage tools to help. There are many meta repositories from vendors such as Collibra, Alation, Infogix, Erwin and others. There are open source tools too, such as data lineage tools from Octopai and Talend.
These tools vary, but they all provide at least some degree of assistance with tracing data lineage. Software such as this can automatically search database catalogs and, in some cases, even program code in order to produce dependency diagrams and visually show the data lifecycle. In one case study, a large American bank estimated that their compliance project would have taken 80 times more effort had it been done without the use of automated tools.
Who is responsible for data lineage?
Deciding who is responsible for data lineage is important. Ideally data lineage documentation would sit within the remit of the data governance team. Data governance bodies define the ownership of data in an enterprise, usually with a steering group consisting of a small team to coordinate things and a network of data stewards embedded within the business lines.
Those on the data governance team are responsible for determining the golden copy versions of key data like customer or product hierarchies and the quality of data, so adding data lineage documentation to these responsibilities seems like a logical extension. Otherwise responsibility may end up in the data management function of the IT department, who may lack the business knowledge to prioritize things.
Why data lineage is important
Knowing the lineage of data is not just of curiosity to a data analyst. In many industries, such as finance and pharmaceuticals, there are extensive regulations covering how data should have a verifiable audit trail.
Examples of these types of regulations for data lineage documentation include standards from the Basel Committee on Banking Supervision and the International Financial Reporting Standards. Compliance with such regulations is not optional.
These are different from regulatory compliance issues. And there are many other reasons organizations may want proper data lineage documentation. Some data is sensitive -- such as medical records -- and it is imperative to know who has accessed these records, as well as when and by whom the records were updated.
Additionally, transactional data in a corporation may be entered in one system but moved to data warehouses or data marts and be transformed according to business rules along the way. Having an audit trail of where the data went and whether it was modified is important in many situations.
During the pandemic, virologists have used genetic sequencing to trace the exact viral lineage of COVID-19 as it slowly mutates, enabling them to see whether an outbreak originated locally or involved slight mutations of the virus transmitted by travelers from other countries. In a more mundane example, trying to debug a report showing spurious data involves tracing the lineage of data used in the report, which systems the data originally came from and what transformation functions were used on it.
The ability to trace the lifecycle of data through an organization is important for operational and regulatory reasons. Governments more frequently demand companies demonstrate that their data is trustworthy and are dishing out increasingly heavier fines in cases where companies are lax.
This bodes well for the somewhat neglected area of metadata technology that can automate the tricky and time-consuming tasks associated with data lineage documentation. Avoiding potentially heavy fines puts real impetus on documenting your data and its lifecycle. A data treasure map indeed.