5 best practices for managing real-time data integration
Real-time data integration isn't like traditional data integration -- "it's moving, it's dirty and it's temporal," cautions one data pro. Experts offer up some best practices.
Enterprises are starting to embrace a variety of real-time data streams as part of their data management infrastructures. This runs the gamut from extremely high-frequency market trading data to less frequent updates like IoT readings, weather data and customer counters.
The term real time is somewhat relative. An autonomous vehicle or market trading app has a much lower tolerance for processing delays than weather updates or a passenger counter. However, all of these real-time streams represent a shift toward apps that respond to constantly changing data inputs, rather than traditional batch-oriented data integration.
This shift also creates new challenges. "With real-time data integration, there is not as much opportunity to fully cleanse and validate the data," said Tony Baer, principal analyst at Ovum. "That means that the heavy lifting must be performed upstream, carefully tracking and documenting the lineage of the data sources, and the trustworthiness of the sources."
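A minimal sketch of that upstream bookkeeping, assuming a hypothetical ingestion step: each incoming record is wrapped with its source, a trust label and an ingestion timestamp before it enters the stream, so downstream consumers inherit the lineage rather than reconstructing it later.

    from dataclasses import dataclass, field
    from datetime import datetime, timezone
    from typing import Any

    @dataclass
    class StreamRecord:
        """A raw reading wrapped with lineage metadata at ingestion time."""
        payload: dict[str, Any]
        source: str        # e.g. "plant-3/temperature-sensor-17" (hypothetical)
        trust: str         # "verified" or "unverified"
        ingested_at: datetime = field(
            default_factory=lambda: datetime.now(timezone.utc))

    def ingest(raw: dict, source: str, trusted_sources: set[str]) -> StreamRecord:
        # Tag every record with where it came from and how much it is trusted,
        # so downstream consumers never have to guess.
        trust = "verified" if source in trusted_sources else "unverified"
        return StreamRecord(payload=raw, source=source, trust=trust)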
The properties of real-time data
Real-time data integration is different than traditional data integration because the physics of the data are different. "It's moving, it's dirty and it's temporal," said Mark Palmer, senior vice president of analytics at Tibco.
Because it's moving, enterprises need real-time data preparation technology that can complement traditional extract, transform and load (ETL) technologies. ETL can still help load context from corporate data warehouses, ERP or customer relationship management systems. In contrast, real-time data integration can add dynamic context closer to the source of the streaming data, and even at the edge using an emerging class of edge computing architectures.
Because it's dirty, enterprises need to find ways to use stream processing to correct errors, aggregate based on streaming windows and smooth data on the fly.
Because it's temporal, enterprises can implement time-based continuous queries that act on real-time and historical data. For example, when a temperature sensor reading increases by more than 5% in any five-minute sliding window, an app could count this as a spike of great interest. A streaming query could use smoothing to eliminate spurious readings and run continuously over a sliding window looking for matches against historical patterns.
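A minimal sketch of such a continuous query, assuming a hypothetical sensor feed of timestamped readings: a short moving average smooths spurious values, and a five-minute sliding window flags any increase of more than 5%.

    from collections import deque
    from datetime import datetime, timedelta

    WINDOW = timedelta(minutes=5)
    SPIKE_THRESHOLD = 0.05   # 5% increase within the window

    raw_history = []         # all raw readings seen so far
    window = deque()         # (timestamp, smoothed_reading), oldest first

    def smooth(values, k=3):
        """Simple moving average over the last k raw readings."""
        tail = values[-k:]
        return sum(tail) / len(tail)

    def on_reading(ts: datetime, value: float) -> bool:
        """Return True when the smoothed reading rises >5% within any
        five-minute sliding window."""
        raw_history.append(value)
        window.append((ts, smooth(raw_history)))
        # Evict readings that have slid out of the window.
        while window and ts - window[0][0] > WINDOW:
            window.popleft()
        oldest = window[0][1]
        newest = window[-1][1]
        return oldest > 0 and (newest - oldest) / oldest > SPIKE_THRESHOLD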
Here are five best practices for data management professionals to follow when developing real-time data integration strategies:
1. Simulate the integration
Real-time data integration requires more up-front simulation and testing than traditional data integration, Palmer said. In the old days, some algorithmic trading desks on Wall Street would build a new trading algorithm for real-time data, test its logic a bit and start trading.
"That works sometimes, but real time cuts both ways," Palmer said. Knight Capital lost $440M in less than 40 minutes due to what was essentially a bug.
2. Don't put a Tesla engine into a Model T
Real time should disrupt old batch-oriented ETL applications. "Too often, we see companies use real time to 'speed up' the same old manual systems," Palmer said. As a result, the enterprise gets a faster mess than it had before.
For example, an airport's management team might decide to use real-time data integration to rebuild its gate agent app. Although this gives real-time data to the gate agent, it does not create new kinds of value. A better strategy might be to provide real-time monitors that let passengers check flight status on their own, or on their mobile phones. In the early days of real-time airport operations, this mistake was common, Palmer said. Now, the right real-time data apps are deployed all over the world.
3. Process in parallel
Real-time data integration, by its very nature, requires real-time action by the system consuming it. It is also usually the case that the volume of the received data is large. "High-speed and high-volume create a difficult problem for systems that were not originally designed for these challenges," said Scott McMahon, senior solutions architect at Hazelcast, an in-memory data grid platform.
The critical design approach for handling these streams is to operate in a highly parallel fashion, using multiple coordinated ingestion engines that can grow and shrink elastically to meet the processing needs of the data. Many architecture schemes have been attempted over the years, with varying degrees of success. The real breakthroughs for handling today's high-speed data streams have come from recent advances in parallel processing and execution.
There are now a few open source and proprietary platforms that have done the hard work of building the processing engines to enable developed applications to run in highly parallel configurations. "The first, best practice for organizations developing new applications for real-time computation is to start with one of these platforms to leverage the work that's already done and concentrate on their own application logic instead," McMahon said.
For example, Hazelcast worked with one company that wanted to analyze omnichannel communications with its subscriber base. To do this, the company normalized messages from all channels and sent them through a single-server analytics application. It worked well in testing, but once it received the full data load, the hardware simply couldn't keep up. Because the system was not designed to operate in parallel, there was no way for it to scale. It had to be scrapped and redesigned for parallel execution.
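A minimal sketch of that partitioned, parallel layout, using Python's multiprocessing as a stand-in for a dedicated stream processing platform; the subscriber_id partition key, the worker count and the analyze function are assumptions for illustration, not details of the actual system.

    import multiprocessing as mp

    NUM_WORKERS = 4  # scale with volume; real platforms resize this elastically

    def partition(message: dict) -> int:
        """Route messages for the same subscriber to the same worker so
        per-subscriber analytics stay consistent."""
        return hash(message["subscriber_id"]) % NUM_WORKERS

    def analyze(message: dict):
        pass  # placeholder for the normalized omnichannel analysis

    def worker(inbox: mp.Queue):
        while True:
            message = inbox.get()
            if message is None:     # poison pill shuts the worker down
                break
            analyze(message)

    if __name__ == "__main__":
        queues = [mp.Queue() for _ in range(NUM_WORKERS)]
        procs = [mp.Process(target=worker, args=(q,)) for q in queues]
        for p in procs:
            p.start()
        incoming = [{"subscriber_id": "a1", "channel": "sms"}]  # stand-in feed
        for message in incoming:
            queues[partition(message)].put(message)
        for q in queues:
            q.put(None)
        for p in procs:
            p.join()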
4. Plan for component failure
One big real-time data integration challenge is component failure in some part of the pipeline. "If not properly designed, component failures can lead to data loss, stale or out-of-order data and system outage," said Venkat Krishnamurthy, vice president of product management at OmniSci, a GPU-accelerated analytics platform. Decoupling each phase of the pipeline and building resiliency into each one makes the system as a whole run more smoothly.
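A minimal sketch of that decoupling, assuming each pipeline phase communicates only through queues: a failing stage retries with backoff and parks unprocessable records in a dead-letter sink, rather than losing data or stalling the rest of the pipeline.

    import queue
    import time

    def dead_letter(record):
        """Hypothetical dead-letter sink; failed records can be replayed later."""
        print("parking failed record for replay:", record)

    def resilient_stage(inbox: queue.Queue, outbox: queue.Queue,
                        process, max_retries: int = 3):
        """One decoupled pipeline stage: pull, process, push.
        Failures retry with exponential backoff instead of dropping data."""
        while True:
            record = inbox.get()
            if record is None:       # propagate shutdown downstream
                outbox.put(None)
                break
            for attempt in range(1, max_retries + 1):
                try:
                    outbox.put(process(record))
                    break
                except Exception:
                    if attempt == max_retries:
                        dead_letter(record)
                    else:
                        time.sleep(2 ** attempt)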
5. Package streams for better insights
Real-time data streams can only generate business value when developers can weave this data into new applications. Untapped data streams can render businesses data rich but information poor when there is no strategy for pulling actionable insights from the data, said David Chao, head of industry solutions at MuleSoft. To address these challenges, businesses first need clear visibility into where their data resides and how all of their applications, systems and devices are interacting.
One strategy is to package data sources as APIs in an application network where all applications, data and devices serve as pluggable, reusable building blocks. "By standardizing data sources with APIs and aligning them to specific processes, businesses can reduce the complexity of multiple sources of truth and make their data quickly usable across the business," Chao said.
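A minimal sketch of that packaging, assuming a hypothetical flight-status stream and using FastAPI to expose it: the ingestion pipeline keeps a store current, and the API becomes the stable, reusable surface that other applications plug into without needing to know how the data was ingested.

    from fastapi import FastAPI

    app = FastAPI()

    # Hypothetical stream-backed store kept current by the ingestion pipeline.
    latest_status: dict[str, dict] = {}

    @app.get("/flights/{flight_id}/status")
    def flight_status(flight_id: str) -> dict:
        """Expose the real-time stream as a standardized API so any app
        can reuse it as a building block."""
        return latest_status.get(flight_id,
                                 {"flight_id": flight_id, "status": "unknown"})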