Key components of an effective data virtualization architecture
Experts break down which elements -- both technical and nontechnical -- are most crucial to successfully deploying and managing a data virtualization architecture.
Is your organization prepared to implement a data virtualization architecture? A successful implementation often requires coordination between several technical pieces for capturing data, organizing it and ensuring data quality and governance.
The ultimate goal of a data virtualization architecture, which can provide integrated views of data from different source systems, is to enable users and applications to access data without having to understand the intricacies of the underlying data technology.
Sometimes that's easier said than done, but fortunately a set of specific components can help ensure effective implementation and management. These range from an abstraction tier that hides some of the underlying complexity, to a metadata management layer that orchestrates important data virtualization processes, to data quality capabilities that identify problems and clean data. It's also important to work out the governance and security issues around the underlying data and how it's shared.
Here, experts detail the elements that are most important to a data virtualization architecture.
Abstraction tier
A data virtualization architecture requires a layer of technology that acts as an abstraction layer between the user and the one or more data stacks needed in the framework, said Avi Perez, CTO of Pyramid Analytics, a data analytics software company.
In the data analytics space, the abstraction layer usually comes in the form of the end-user tools themselves. Many analytic tool sets offer the user the ability to explore the data without needing to write queries -- or even know how the underlying data technology works. Using data models, the complexity of the underlying data structure can be significantly hidden in a way that exposes only the schematic model to end users, Perez said.
This involves constructing virtual schematic models of the data. A robust analytic platform will allow such models to be written against multiple types of data stacks in a consistent manner. Look for tools that have the flexibility to handle all underlying data structures, technologies and stack idiosyncrasies, Perez advised.
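To make the idea concrete, here is a minimal Python sketch of an abstraction tier in which a virtual model maps business-facing field names onto physical columns in different data stacks, so consumers query the model rather than the sources. The class, field and source names are illustrative assumptions, not any vendor's API.

```python
# A minimal sketch of an abstraction tier: logical fields are resolved to
# physical columns per source, hiding stack-specific naming from end users.
from dataclasses import dataclass

@dataclass
class FieldMapping:
    logical_name: str    # name exposed to end users
    source: str          # which underlying data stack holds the field
    physical_name: str   # column/attribute name in that stack

class VirtualModel:
    def __init__(self, name, mappings):
        self.name = name
        self.mappings = {m.logical_name: m for m in mappings}

    def translate(self, logical_fields):
        """Resolve business-facing field names to per-source physical columns."""
        plan = {}
        for field_name in logical_fields:
            m = self.mappings[field_name]
            plan.setdefault(m.source, []).append(m.physical_name)
        return plan

sales_model = VirtualModel("sales", [
    FieldMapping("customer_name", "crm_postgres", "cust_full_nm"),
    FieldMapping("order_total", "erp_snowflake", "ord_amt_usd"),
])

# The consumer asks for logical fields; the model decides which stacks to hit.
print(sales_model.translate(["customer_name", "order_total"]))
# {'crm_postgres': ['cust_full_nm'], 'erp_snowflake': ['ord_amt_usd']}
```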
Dynamic data catalog
In order to surface data across the enterprise, it's important to provide business-friendly data exploration and preparation capabilities using the features of a traditional data catalog, said Saptarshi Sengupta, director of product marketing at Denodo, a data virtualization vendor.
This includes classification and tagging, data lineage and descriptions. It's also useful to enable keyword-based search and discovery. This may also require a translation tier for mapping data labels to terms businesspeople may be more familiar with, Sengupta said.
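As a rough illustration of those catalog features, the following Python sketch models a catalog entry with tags, lineage and a translation map from physical labels to business-friendly terms, plus keyword-based search over all of them. The entry fields and dataset names are hypothetical.

```python
# A minimal sketch of a dynamic data catalog entry and keyword search.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    description: str
    tags: list = field(default_factory=list)
    lineage: list = field(default_factory=list)          # upstream datasets
    business_terms: dict = field(default_factory=dict)   # physical label -> business term

catalog = [
    CatalogEntry(
        name="ord_amt_usd",
        description="Order amount in US dollars from the ERP system",
        tags=["finance", "pii:none"],
        lineage=["erp_snowflake.orders"],
        business_terms={"ord_amt_usd": "Order Total"},
    ),
]

def search(keyword):
    """Keyword-based discovery over names, descriptions, tags and business terms."""
    kw = keyword.lower()
    return [
        e for e in catalog
        if kw in e.name.lower()
        or kw in e.description.lower()
        or any(kw in t.lower() for t in e.tags)
        or any(kw in v.lower() for v in e.business_terms.values())
    ]

print([e.name for e in search("order total")])  # ['ord_amt_usd']
```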
Model governance
The virtual data model or schema is the central element in data virtualization architecture, especially in relation to analytics. It then follows that the governance of the model definitions is central to successful virtualization, Perez said. The governance of the model should include its inputs -- the data in the data stack -- and its outputs -- analytic artifacts like reports, calculations and machine learning logic.
The model and its definition need to be fully governed through tools like proper documentation, solid security measures and definition versioning. Sanctioned watermarking to denote the quality of a published model is an example of good governance, Perez said.
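One way to picture that kind of governance is a version record for each published model that carries documentation, a pointer to its access policy and a sanctioned watermark, as in this Python sketch. The fields and policy names are illustrative assumptions.

```python
# A minimal sketch of model governance metadata: versioned definitions with a
# "sanctioned" watermark so consumers can tell a vetted model from a draft.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ModelVersion:
    model_name: str
    version: str
    published: date
    doc_url: str
    access_policy: str        # pointer to the security policy governing the model
    sanctioned: bool = False  # governance watermark: True only after review

versions = [
    ModelVersion("sales", "1.2.0", date(2020, 3, 1), "wiki/sales-model", "policy-finance", True),
    ModelVersion("sales", "1.3.0-draft", date(2020, 4, 2), "wiki/sales-model", "policy-finance", False),
]

def latest_sanctioned(model_name):
    """Return the newest version that carries the sanctioned watermark."""
    candidates = [v for v in versions if v.model_name == model_name and v.sanctioned]
    return max(candidates, key=lambda v: v.published, default=None)

print(latest_sanctioned("sales").version)  # 1.2.0
```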
Metadata and semantics mediation
The most important component of a data virtualization architecture is the metadata layer that captures syntax and semantics of source schemas, said Kris Lahiri, CSO and co-founder of Egnyte, a secure file-sharing and synchronization platform.
This component will need to dynamically observe schema changes and either merge or escalate differences over time. A virtual view that exposes a unified schema for analytics is also important to hide implementation differences between sources. Some may be accessed in real time, while others will need to work off of snapshots, Lahiri said.
Often, the meaning of specific fields and codes is not described in the data itself. To reuse the data, metadata -- including schemas, glossaries of terms, as well as governance and security rules -- must be documented and retained.
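A simple way to think about the merge-or-escalate behavior Lahiri describes is a schema-drift check like the Python sketch below, which compares the documented schema against what is currently observed in a source. The column names and the merge policy are illustrative assumptions.

```python
# A minimal sketch of schema-drift handling in the metadata layer: additive
# changes are merged into the unified virtual view; breaking changes escalate.
DOCUMENTED = {"order_id": "int", "cust_id": "int", "ord_amt_usd": "decimal"}

def diff_schema(documented, observed):
    added = {c: t for c, t in observed.items() if c not in documented}
    removed = {c: t for c, t in documented.items() if c not in observed}
    retyped = {c: (documented[c], observed[c])
               for c in documented.keys() & observed.keys()
               if documented[c] != observed[c]}
    return added, removed, retyped

def reconcile(documented, observed):
    added, removed, retyped = diff_schema(documented, observed)
    if removed or retyped:
        # Dropped or retyped columns can break downstream views: escalate.
        return "escalate", {"removed": removed, "retyped": retyped}
    # New columns are additive and safe to merge into the unified virtual view.
    return "merge", {**documented, **added}

observed = {"order_id": "int", "cust_id": "int", "ord_amt_usd": "decimal", "channel": "varchar"}
print(reconcile(DOCUMENTED, observed))  # ('merge', {..., 'channel': 'varchar'})
```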
"While this information is often documented separately from the technical architectures, an integrated system encompassing data and metadata allows for far more data reuse with less risk," said Aaron Rosenbaum, chief strategy officer at MarkLogic.
Security and governance controls
Secure sharing requires that the data virtualization architecture maintains and enforces the rules and authorities around data access, as well as retains auditable records of where data came from and where it was used, Rosenbaum said. The source system cannot implement security over the data when multiple systems are being integrated -- the data virtualization tier must control this access.
This isn't only a technical capability. "Groups and individuals sharing data for uses must trust that their data will be protected," Rosenbaum said. Some systems automatically protect data and preserve the lineage information, allowing for simpler governance with less risk than systems requiring extensive configuration for every combination of usage.
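As a rough sketch of enforcement at the virtualization tier rather than in the sources, the Python below checks each request against a policy and writes an auditable record of where the data came from and who used it. The policy structure, roles and dataset names are hypothetical.

```python
# A minimal sketch of access enforcement plus an audit trail in the
# data virtualization tier.
from datetime import datetime, timezone

POLICIES = {"sales": {"analyst", "finance"}}   # dataset -> roles allowed to read it
AUDIT_LOG = []

def fetch(dataset, user, roles, source_system):
    allowed = bool(POLICIES.get(dataset, set()) & set(roles))
    AUDIT_LOG.append({
        "ts": datetime.now(timezone.utc).isoformat(),
        "dataset": dataset,
        "source": source_system,   # provenance: where the data came from
        "user": user,
        "allowed": allowed,
    })
    if not allowed:
        raise PermissionError(f"{user} may not read {dataset}")
    return f"rows from {source_system}.{dataset}"   # placeholder for the real fetch

print(fetch("sales", "dana", ["analyst"], "erp_snowflake"))
print(AUDIT_LOG[-1]["allowed"])  # True
```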
Clear governance rules
In enterprises, "it's too hard to extract data from our legacy system" is sometimes code for "I'm not prepared to share this data," Rosenbaum said. While removing technical barriers to sharing, it's important to remove organizational barriers as well. Clear governance rules can help so that each group doesn't feel compelled to establish its own policies ad hoc.
Visibility into downstream usage can also help with trust across groups. Tracking lineage and provenance so that data owners can see where their data is being used can help surface organizational issues more rapidly.
Data quality
To ensure that the data delivered by the enterprise data layer is correct, a data virtualization architecture should include data validations and on-the-fly transformations applied in an agile, flexible way, Denodo's Sengupta said.
Data quality capabilities ensure that data validations are applied to any application connecting to the enterprise data layer. Filtering out rows that fail the validation logic ensures that data consumers receive only correct data. Flagging incorrect values adds extra columns with a flag indicating that particular values are wrong, and restorative functions can replace incorrect values through transformation logic.
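The three behaviors just described -- filtering, flagging and restoration -- can be pictured with a short Python sketch. The validation rule and field names are illustrative only.

```python
# A minimal sketch of filtering, flagging and restoring rows that fail validation.
rows = [
    {"order_id": 1, "ord_amt_usd": 120.0},
    {"order_id": 2, "ord_amt_usd": -5.0},    # fails validation
    {"order_id": 3, "ord_amt_usd": None},    # fails validation
]

def is_valid(row):
    return row["ord_amt_usd"] is not None and row["ord_amt_usd"] >= 0

# 1. Filtering: consumers receive only rows that pass validation.
filtered = [r for r in rows if is_valid(r)]

# 2. Flagging: keep every row but add a column marking suspect values.
flagged = [{**r, "amt_is_valid": is_valid(r)} for r in rows]

# 3. Restoration: replace bad values through transformation logic (here, a default).
restored = [r if is_valid(r) else {**r, "ord_amt_usd": 0.0} for r in rows]

print(len(filtered), flagged[1]["amt_is_valid"], restored[2]["ord_amt_usd"])
# 1 False 0.0
```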
Consider graph databases
Casey Phillips, director of IT at StrategyWise, a data science consultancy, said that graph databases are making solid inroads with the consultancy's enterprise clients for data virtualization projects. Products like Neo4j can be installed on top of or alongside traditional disparate data sources and used to ingest cherry-picked data from the traditional systems.
A data model for the graph is defined by attaching attributes such as tags, labels and even directional connections to other bits of incoming data sourced from completely different locations. These additions allow advanced machine learning and AI applications to work more easily with the different data sets to perform predictions and glean other insights, Phillips said.
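For a sense of what that ingestion looks like, here is a short Python sketch using the official neo4j driver to load a cherry-picked record, attach labels and properties, and create a directional relationship. The connection details, node labels and Cypher pattern are assumptions for illustration, not a prescribed schema, and a running Neo4j instance is required.

```python
# A minimal sketch of ingesting records from a traditional source into Neo4j
# with labels, tag-style properties and a directional relationship.
# Uses the neo4j Python driver (5.x API).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def ingest_order(tx, customer_name, order_id, amount, source):
    # MERGE avoids duplicates on re-ingest; properties carry provenance tags.
    tx.run(
        """
        MERGE (c:Customer {name: $customer_name})
        MERGE (o:Order {id: $order_id})
          SET o.amount = $amount, o.source = $source
        MERGE (c)-[:PLACED]->(o)
        """,
        customer_name=customer_name, order_id=order_id, amount=amount, source=source,
    )

with driver.session() as session:
    # Rows would normally come from a query against the traditional source system.
    session.execute_write(ingest_order, "Acme Corp", 42, 120.0, "erp_snowflake")

driver.close()
```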
"When engaging with a client who is interested in rolling out a graph, we often suggest the formation of data government councils and Data stewards for the underlying systems and related processes," he said.
Having a static team that becomes the main enforcement body for the organization that enforces rules and standards for data cleanliness, data identification and any other issues related to the body of a company's data is critical.
"Without having these guidelines and stakeholders defined, most data virtualization projects at scale will be extremely hard, if not impossible, to pull off successfully," Phillips said.