The 5 components of a DataOps architecture
Reaping the benefits of DataOps requires good architecture. Use five core components to design a DataOps architecture that best fits organizational needs.
DataOps promises better data-driven decision-making through increased data quality and security. The process starts with understanding the core components needed to create a DataOps architecture.
From assessing which data to capture and metrics to track to turning data into value, the path to DataOps can be long and tricky. A continuous flow of data -- possibly dramatically increasing in volume and variety over time -- and evolving goals require a streamlined, automated process with regular tests and security checks.
Data teams should address five core components when building a DataOps architecture, along with best practices and solutions to frequent problems. Success depends on the kind of data and the goals you want to achieve. Knowing the peculiarities of your specific situation helps you select a vendor or build an in-house architecture that meets business needs. It all starts with the source: the type of data. Then, teams must address data collection and storage, organization into a database or multiple data repositories, and protection. The final component is the data lifecycle.
1. Data type
A data architecture design starts with several key questions:
- Is the data internal, such as purchase transactions or product accesses and downloads?
- Is the data gathered from web crawling -- for example, to feed into a generative AI system?
- Is the data structured or unstructured?
- Does it integrate third-party data, such as demographic information to augment collected user data?
- Is the data a potential source of liability -- for example, if an unauthorized person gains access to it?
- Is it real time or near real time?
The answers to these questions help you determine the architecture you need. In project design and vendor selection, they shape the available options, the expected data volume and costs, and the expected ROI after those costs. Start small, and test various vendors' capabilities against the expected architecture.
The problem is a bit different for organizations that already have big data and want to make changes to reduce costs, improve efficiency, modernize or scale. Ease of integration and the tool set learning curve may be important factors in that case, in addition to product capabilities.
2. Data gathering and storage
Data may come from various sources, such as financial transactions or sales, weather or traffic monitoring, hardware-associated performance measurements, sensors in medical devices and other IoT devices, and user-uploaded data.
Data is usually captured in a raw state, with minimal validation or preprocessing. Data recorded manually must first be scanned or keyed in and converted into digital form. Checksums can reduce the amount of invalid data, and many applications need encryption. Eventually, the data ends up in a database or file repository. Depending on the needs, it resides in a centralized location or locally, and the storage may be on data center infrastructure owned by the business or on a vendor's cloud platform.
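For instance, a lightweight checksum test at ingestion can flag corrupted or truncated files before they enter the repository. The following is a minimal Python sketch of that idea; the expected hash would come from the source system, a detail assumed here rather than tied to any specific product.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def validate_upload(path: Path, expected_sha256: str) -> bool:
    """Accept a raw upload only if its checksum matches the value
    reported by the source system (a hypothetical manifest field)."""
    return sha256_of(path) == expected_sha256
```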
3. Data organization
If the data comes from multiple sources with timestamps, you can blend it into one database. Use the aggregated data to produce summarized data at various levels of granularity, such as daily aggregate reports.
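Below is a minimal pandas sketch of that blending and rollup, assuming two illustrative sources -- sales and web traffic -- that share a timestamp column; the column names and values are invented for the example.

```python
import pandas as pd

# Two hypothetical sources that share a timestamp column.
sales = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01 09:00", "2024-01-01 17:30", "2024-01-02 11:15"]),
    "revenue": [120.0, 80.0, 200.0],
})
traffic = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01 08:55", "2024-01-02 11:10"]),
    "visits": [1500, 1720],
})

# Blend on the timestamp, then roll up to one summary row per day.
blended = pd.merge_asof(sales.sort_values("ts"), traffic.sort_values("ts"), on="ts")
daily = blended.set_index("ts").resample("D").agg({"revenue": "sum", "visits": "max"})
print(daily)
```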
Databases can maintain satellite tables, link them together via keys and update them automatically. For example, a user ID in a transactions table can serve as the key to personal information stored in a separate user table. If the data sits in standardized fields, use metadata to record each field's type -- text, integer or category -- along with its labels and size.
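As a rough illustration, the SQLite sketch below links a transactions table back to a users table through the shared user ID; the schema is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Core table of users; user_id is the key shared with satellite tables.
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    -- Satellite table of transactions, linked back to users by key.
    CREATE TABLE transactions (
        tx_id   INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(user_id),
        amount  REAL,
        tx_date TEXT
    );
""")
conn.execute("INSERT INTO users VALUES (1, 'Ada', 'UK')")
conn.execute("INSERT INTO transactions VALUES (100, 1, 42.50, '2024-01-01')")

# Join the satellite table to personal information via the shared key.
for row in conn.execute("""
        SELECT t.tx_id, u.name, t.amount
        FROM transactions t JOIN users u ON u.user_id = t.user_id"""):
    print(row)  # (100, 'Ada', 42.5)
```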
Graph databases show relationships between different data points, and you can use a distributed architecture to increase efficiency. Examples of data to organize in a graph database include LinkedIn connections, Facebook friends or keyword associations.
Key-value databases offer another option for storing sparse, high-dimensional or binned data with millions of multivariate bins. Both the key and the value can be multidimensional. For example, a key-value database can store keyword co-occurrence counts, where Key = {Keyword1, Keyword2} is a pair of keywords and Value = count is the number of instances in which both keywords appear in the same piece of text.
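A toy Python sketch of that keyword-pair pattern, using an in-memory counter as a stand-in for a key-value store; the sample documents are invented.

```python
from collections import Counter
from itertools import combinations

documents = [
    "dataops improves data quality",
    "data quality drives better decisions",
]

# Key = an unordered keyword pair, Value = number of documents
# in which both keywords appear together.
pair_counts = Counter()
for doc in documents:
    words = set(doc.split())
    for pair in combinations(sorted(words), 2):
        pair_counts[pair] += 1

print(pair_counts[("data", "quality")])  # 2
```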
You can cache large, frequently downloaded chunks of data to accelerate access without burdening production servers or use a database vendor that offers caching. Older data can reside on slower servers, possibly in an archived format. Documenting all changes and additions to data is critical because new data sources can be incompatible with current data sources or require different metric definitions. In some cases, it is a good idea to take measurements from multiple sources and track discrepancies over time. For example, compare internal web traffic statistics with the same data obtained from Google Analytics.
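As a rough sketch of the caching idea, the example below uses Python's built-in functools.lru_cache in place of a dedicated caching layer or vendor feature; the loader function is a hypothetical stand-in for an expensive production query.

```python
from functools import lru_cache

def fetch_from_production_db(report_id: str) -> bytes:
    """Stand-in for an expensive query against the production database."""
    print(f"hitting production for {report_id}")
    return f"report body for {report_id}".encode()

@lru_cache(maxsize=32)
def load_report_chunk(report_id: str) -> bytes:
    """Serve a large, frequently requested chunk from an in-process cache
    so repeat reads do not burden the production server."""
    return fetch_from_production_db(report_id)

load_report_chunk("daily-summary")  # first call hits production
load_report_chunk("daily-summary")  # second call is served from the cache
```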
4. Data security and privacy issues
Security and privacy present multiple issues for DataOps. If you're working with a data hosting vendor, make sure the data is secure. If data storage is internal, the data team should handle protection. Data backups are an important part of this process.
Decide who has access to what data at any given time. Typically, business analysts have read-only access to production tables. They use dashboards and tools to access core databases and create local tables to help reduce bandwidth consumption and avoid server overload. Decision-makers and stakeholders can receive a customized report with summary data and actionable highlights relevant to their needs.
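In its simplest form, this amounts to a role-to-permission map. The sketch below is a simplified application-level illustration; in practice, access is typically enforced in the database itself or through an access-management layer, and the roles shown are examples only.

```python
# Hypothetical role-to-permission map; real deployments enforce this in the
# database (for example, via GRANT statements) or an access-management layer.
PERMISSIONS = {
    "analyst":     {"read"},            # read-only access to production tables
    "engineer":    {"read", "write"},
    "stakeholder": set(),               # receives summary reports only
}

def can(role: str, action: str) -> bool:
    return action in PERMISSIONS.get(role, set())

assert can("analyst", "read")
assert not can("analyst", "write")
```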
Privacy is another issue. Not everyone should have access to every field in an organization's databases, especially when sensitive data is retained. To comply with various privacy laws, users who want their data deleted should be able to request deletion easily.
5. Data lifecycle
DataOps is not just about the acquisition and use of business data. Archive rarely used or no longer relevant data, such as inactive accounts. Some satellite tables may need regular updates, such as lists of blocklisted IP addresses. Back up data regularly, and update lists of users who can access it. Data teams should decide the best schedule for the various updates and tests to guarantee continuous data integrity.
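As one concrete example of a lifecycle step, the SQLite sketch below moves accounts inactive for more than two years into an archive table; the cutoff and schema are assumptions made for illustration.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (user_id INTEGER, last_login TEXT);
    CREATE TABLE accounts_archive AS SELECT * FROM accounts WHERE 0;
""")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [(1, "2021-06-01"), (2, "2024-05-20")])

# Accounts with no login in the last two years move to the archive table.
cutoff = (datetime.now() - timedelta(days=730)).strftime("%Y-%m-%d")
conn.execute("INSERT INTO accounts_archive SELECT * FROM accounts WHERE last_login < ?", (cutoff,))
conn.execute("DELETE FROM accounts WHERE last_login < ?", (cutoff,))
conn.commit()
```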
The DataOps process may seem complex, but chances are you already have plenty of data and some of the components in place. Adding security layers may even be easy.
Organizations that need to change vendors to improve or implement DataOps should look for companies that offer automated conversion and integration. Automation simplifies the process of turning CSV files into a database and can handle millions of documents. An auditor can help prioritize the team's needs, make any necessary changes at an acceptable pace, test upgrades -- with backward compatibility, if possible -- before they go into production, and archive legacy data.
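The sketch below shows the basic shape of such a CSV-to-database step in Python with SQLite; it loads every column as text, whereas a production tool would infer types, validate rows and scale to millions of documents.

```python
import csv
import sqlite3

def load_csv_into_table(csv_path: str, conn: sqlite3.Connection, table: str) -> None:
    """Create a table from a CSV header row and bulk-insert the remaining rows.
    Every column is stored as TEXT in this sketch; a real pipeline would
    infer or declare a proper type per column."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{c}" TEXT' for c in header)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        placeholders = ", ".join("?" for _ in header)
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()

# Usage: load_csv_into_table("transactions.csv", sqlite3.connect("dataops.db"), "transactions")
```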
Vincent Granville is an author and publisher at MLtechniques.com, machine learning scientist, mathematician, book author, patent owner and former VC-funded executive, with 20-plus years of corporate experience.