https://www.techtarget.com/searchdatamanagement/tip/How-data-lineage-tools-boost-data-governance-policies
Data lineage documents the journey that data takes through an organization's IT systems, showing how it flows between them and gets transformed for different uses along the way. It uses metadata -- data about the data -- to enable both end users and data management professionals to track the history of data assets and get information about their business meaning or technical attributes.
For example, data lineage records can help data scientists, other data analysts and business users understand the data they work with and ensure that it's relevant to their information needs. Data lineage also plays a valuable role in data governance, master data management and regulatory compliance programs. Among other aspects of those initiatives, it simplifies two critical data governance procedures: analyzing the root causes of data quality issues and the impact of changes to data sets.
Data lineage information is collected from operational systems as data is processed and from the data warehouses and data lakes that store data sets for BI and analytics applications. In addition to the detailed documentation, data flow maps and diagrams can be created to provide visualized views of data lineage mapped to business processes. To simplify end-user access to lineage information, it's often incorporated into data catalogs, which inventory data assets and the metadata associated with them.
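To make the idea of lineage metadata in a data catalog more concrete, here is a minimal sketch of what a single catalog entry might hold. The class names, field names and data set names are all hypothetical and chosen for illustration; actual data catalog products define their own metadata models.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LineageEdge:
    """One hop in a data flow: where data came from, where it went, what changed."""
    source: str
    target: str
    transformation: str


@dataclass
class CatalogEntry:
    """A hypothetical catalog record: business meaning plus upstream lineage."""
    name: str
    description: str
    upstream: list[LineageEdge] = field(default_factory=list)

    def sources(self) -> list[str]:
        # Names of the data sets this entry is directly derived from.
        return [edge.source for edge in self.upstream]


# Illustrative data only; none of these names come from a real system.
entry = CatalogEntry(
    name="warehouse.dim_customer",
    description="One row per customer, deduplicated",
    upstream=[
        LineageEdge(
            source="crm.customers",
            target="warehouse.dim_customer",
            transformation="deduplicate on customer_id",
        )
    ],
)
print(entry.sources())
```

Keeping the transformation description on each edge is one possible design choice: it lets a user see not just where a data set came from but also how it was changed along the way.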
Information on data lineage is crucial to data management and analytics efforts. Lineage details shine a light on data to help organizations manage and use it effectively. Without access to those details, it becomes much harder to take full advantage of data's potential business value.
The following are some of the benefits that data lineage provides.
Data lineage is also closely aligned with data classification and data provenance, two other data management processes. Here's a look at what they are and how they differ from and relate to data lineage.
Data classification. This involves assigning data to different categories based on its characteristics, primarily for security and compliance purposes. Classification is used to categorize data based on how sensitive it is -- for example, as personal, proprietary, confidential or public information. Doing so separates data sets that need higher levels of security and more restrictive access controls from ones that don't. Data lineage provides information about data sets that can aid in classifying them.
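One way lineage could aid classification is by suggesting a label for a derived data set based on the labels of its sources. The sketch below assumes that a derived data set should be treated as at least as sensitive as its most sensitive source. The ordering of the sensitivity levels, the data set names and the function name are assumptions made for illustration, and the lineage graph is assumed to contain no cycles.

```python
from typing import Optional

# Assumed ordering from least to most sensitive; real policies will differ.
SENSITIVITY = {"public": 0, "proprietary": 1, "confidential": 2, "personal": 3}

# Hypothetical lineage: each derived data set maps to its direct sources.
UPSTREAM = {
    "warehouse.dim_customer": ["crm.customers", "web.page_views"],
    "bi.churn_dashboard": ["warehouse.dim_customer"],
}

# Labels already assigned to source data sets by a data steward.
SOURCE_LABELS = {"crm.customers": "personal", "web.page_views": "public"}


def suggest_label(dataset: str) -> Optional[str]:
    """Suggest the most restrictive label found among a data set's sources."""
    if dataset in SOURCE_LABELS:
        return SOURCE_LABELS[dataset]
    labels = [suggest_label(src) for src in UPSTREAM.get(dataset, [])]
    known = [label for label in labels if label is not None]
    # No classified sources means there is nothing to suggest.
    return max(known, key=lambda label: SENSITIVITY[label], default=None)


print(suggest_label("bi.churn_dashboard"))
```

A suggestion like this would still need review by a data steward, since a transformation such as aggregation or masking can lower the sensitivity of the output.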
Data provenance. This is sometimes treated as a synonym for data lineage. Alternatively, data provenance is seen as being more narrowly focused on the origins of data, including its source system and how it's generated. In that context, data lineage and data provenance can work hand in hand, with provenance providing high-level documentation of where data comes from and what it entails.
The essence of data governance is creating corporate data policies and ensuring that people comply with them. Such policies can span an array of intents, including directives on data protection, validation and usage. Data governance managers and data stewards must solicit data requirements from business users and work with members of the decision-making data governance committee to agree on common data definitions, specify data quality metrics and develop the policies and associated procedures.
It's a big challenge, though, to bridge the gap between defining data governance policies and implementing them. That's where data lineage comes in. It documents data sources and flows, enabling governance teams to monitor how data moves through systems and is modified and used. The lineage information helps them ensure that proper data security and access controls are in place and that data is stored, maintained and used in accordance with governance policies.
Data lineage can also ease specific governance-related tasks. For example, without a way to determine where data errors are introduced into systems, it's difficult for data stewards and data quality analysts to identify and fix them. That has consequences: If data flaws aren't caught, an organization may be plagued by inconsistent or inaccurate analytics results that lead to bad business decisions.
In root cause analysis of data errors, lineage records provide visibility into the sequence of processing stages a data set goes through. Quality levels can be examined at each stage to find where data errors originate. Working backward from where an error is first identified, a data steward can check whether the data conformed to expectations at earlier points or included the error then. By pinpointing the stage at which the data was compliant upon entry but flawed upon exit, workers involved in a data governance program can eliminate the error's root cause instead of just correcting the bad data.
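The backward walk described above can be sketched in a few lines. This is a simplified illustration, not a description of how any particular lineage tool works: it assumes the lineage record is an ordered list of stages with a snapshot of each stage's output, and the stage names, sample rows and validity rule are all hypothetical.

```python
from typing import Callable, Optional

Row = dict
Stage = tuple[str, list[Row]]  # (stage name, rows as they left that stage)


def find_root_cause_stage(
    stages: list[Stage], is_valid: Callable[[Row], bool]
) -> Optional[str]:
    """Walk backward from the final stage to the one that introduced the error.

    Returns the earliest stage in the trailing run of flawed outputs, or
    None if the final output is already compliant.
    """
    culprit = None
    for name, rows in reversed(stages):
        if all(is_valid(row) for row in rows):
            break  # data was still compliant when it left this stage
        culprit = name  # flawed on exit; keep checking earlier stages
    return culprit


# Illustrative pipeline: amounts must never be negative.
pipeline = [
    ("extract", [{"amount": 10}]),
    ("convert_currency", [{"amount": -10}]),
    ("load", [{"amount": -10}]),
]
print(find_root_cause_stage(pipeline, lambda row: row["amount"] >= 0))
```

In this example, the data is compliant when it leaves the extract stage and flawed when it leaves convert_currency, so that stage is the one to fix. If the first stage's output is already flawed, the function returns that stage, which points to a problem in the source data itself.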
Data lineage is useful, too, in doing impact analysis to stay on top of issues caused by changes to source data formats and structures, a common problem in today's increasingly dynamic data environments.
When data is changed, there may be unintended consequences downstream. By working forward from the point of data creation or collection, a data steward can rely on data lineage documentation to help trace data dependencies and identify processing stages that are affected by the changes. Those stages can then be reengineered to accommodate the changes and ensure that data remains consistent in different systems.
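Tracing dependencies forward amounts to a traversal of the lineage graph. The following sketch assumes lineage is available as a simple mapping from each data asset to the assets that read from it; the asset names and the function name are hypothetical, and real lineage tools expose this information through their own interfaces.

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to its direct consumers.
LINEAGE = {
    "crm.customers": ["staging.customers"],
    "staging.customers": ["warehouse.dim_customer"],
    "warehouse.dim_customer": ["bi.churn_dashboard", "ml.churn_features"],
    "erp.orders": ["warehouse.fact_orders"],
    "warehouse.fact_orders": ["bi.churn_dashboard"],
}


def downstream_impact(graph: dict[str, list[str]], changed: str) -> list[str]:
    """Return every asset affected by a change, nearest dependencies first."""
    seen = {changed}  # also guards against cycles in the graph
    queue = deque([changed])
    affected = []
    while queue:
        node = queue.popleft()
        for consumer in graph.get(node, []):
            if consumer not in seen:
                seen.add(consumer)
                affected.append(consumer)
                queue.append(consumer)
    return affected


print(downstream_impact(LINEAGE, "crm.customers"))
```

The breadth-first order lists the closest dependencies first, which matches the order in which a data steward would typically review and reengineer the affected stages.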
Various techniques can be used to collect and document data lineage information. They aren't necessarily mutually exclusive -- an organization may use more than one lineage technique, depending on its application needs and the nature of its data environment. The available methods include the following:
Another approach is fully manual: interviewing business users, BI analysts, data scientists, data stewards, data integration developers and other workers about how data moves through systems and gets used and modified. The information that's gathered can be used to map out data flows and transformations, perhaps as a starting point for a data lineage initiative before implementing more automated techniques.
Here are some best practices to help keep the data lineage process on track and ensure that it provides accurate and useful information about data sets.
Manually collecting metadata and documenting data lineage requires a significant resource investment. It's also prone to error, which can cause big problems, especially as organizations increasingly rely on data analytics to drive business operations. As a result, data governance efforts benefit from tools that manage representations of data lineage and automatically map them across the enterprise.
If you do decide to move forward on the technology evaluation process for a possible purchase, you should look for data lineage tools that can do the following:
There's a plethora of data lineage technology options available to consider. Tools for documenting and managing data lineage are offered by several different types of vendors, including the following:
Vendors that offer self-service data preparation software for data engineers and analytics teams, such as DataRobot and Alteryx's Trifacta unit, also offer data lineage capabilities. So do various vendors of BI and analytics tools, for use within the applications run on those tools.
28 Jul 2022