Data streaming platforms fuel agile decision-making
Real-time analysis is critical as organizations try to compete amid economic uncertainty. Continuous intelligence powered by data streaming technology is one way to enable it.
With real-time decision-making becoming critical for organizations as they deal with trying economic times, deploying a data streaming platform is a growing means of enabling speed and agility.
Data streaming is the continuous, high-speed ingestion of data as it's created. The alternative is batch ingestion, in which data is scheduled for upload in organized, serial loads rather than being ingested in real time.
Naturally, data streaming enables organizations to view and act on data more quickly than they can with data that's ingested in batches.
That need for fast data analysis and agile decision-making has been apparent since the onset of the COVID-19 pandemic in March 2020, when economic conditions changed suddenly and drastically, and it has continued amid repeated supply chain disruptions, the war in Ukraine and fears of a recession.
As a result, data streaming -- already established in data management -- is a growing trend in analytics.
Now, data ingestion and integration specialists such as Confluent, Equalum, Informatica and Striim offer streaming data platforms. In addition, full-featured analytics vendors including SAS and Tibco provide data streaming tools, as do tech giants such as AWS, Google, IBM, Microsoft and Oracle.
In a recent interview, Michael Drogalis, principal technologist for the office of the CTO at Confluent, discussed data streaming, including the technology needed to build a platform for it, its practical applications and its evolution.
How do you define data streaming?
Michael Drogalis: Data streaming is a new paradigm for working with data. When people go to school and they learn software engineering, they learn a very standard pattern. There's some data in a file or in a data structure that gets sucked up, processed and spit out somewhere, carrying out data processing activities step by step until goals are accomplished. Streaming flips all that on its head.
Instead of the data structure being bounded, it's unbounded, and data is constantly flowing. Instead of working with the data's whole corpus at once, streaming data gets processed incrementally as it shows up.
That drives a whole range of benefits, capabilities and challenges.
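To make that contrast concrete, here is a minimal Python sketch -- not drawn from Confluent's tooling -- in which the batch version loads a finite data set and answers once, while the streaming version updates its answer incrementally as each event arrives. The file path and order amounts are illustrative assumptions.

```python
# Batch: the data set is finite and processed as a whole.
def batch_total(path):
    with open(path) as f:                      # e.g. yesterday's orders, exported once
        orders = [float(line) for line in f]   # load the entire corpus up front
    return sum(orders)                         # one answer, produced after the fact

# Streaming: the data set is unbounded and processed incrementally.
def streaming_totals(events):
    total = 0.0
    for amount in events:          # events keep arriving; this loop never "finishes"
        total += amount            # update state as each event shows up
        yield total                # emit an up-to-date answer immediately

# Hypothetical usage: any generator of order amounts stands in for a live stream.
for running_total in streaming_totals(iter([19.99, 5.00, 42.50])):
    print(running_total)
```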
When did data streaming platforms gain prominence and become part of organizations' toolkit for analytics?
Drogalis: Everything new is old, and streaming had been dabbled with in the late 1990s and early 2000s by vendors that are no longer around. None of those efforts was really successful. Apache Kafka was the first to take this idea of streaming and bring it to people with a usable interface and decent performance characteristics. That came out in 2011, but for the first five or so years, nothing really happened. It was on the periphery of everybody's radar.
I started to play with it around 2013, and I remember being at a conference where there was a talk on Kafka. There was a group of us who were interested in it, but it was still very niche. Apache Storm was just getting off the ground as a stream processor. In 2013, there was a critical inflection point when everyone knew this was going to be something, but none of the pieces connected with each other yet. Since then, everything has slowly developed. I began to hear more of my circle talking about data streaming in 2016, and around 2018 is when it became mainstream. It's been bubbling slowly, with things gradually getting easier, and over time, things have clicked into place.
Why is it important for organizations to have a platform for data streaming?
Drogalis: The main thing that people really like to talk about is being able to process data with low latency. There are use cases like ride-sharing where a person wants a ride someplace, and they don't want to have to wait for someone at Uber headquarters to come back from their coffee break and match them up with a driver. That's really important. But more than that, data streaming allows organizations to express push-oriented actions that drive business operations based on the data that's received.
Imagine watching a baseball game. Data streaming is like being able to watch the game, see the batter hit the ball and get thrown out at first base, and view the entire process as it happens with no interference in the data. Batch-oriented data ingestion, or pulling data, is like looking at the game through photographs -- the batter hit the ball, the batter is on the bench. But it doesn't tell you what happened. Did the batter ground out? Did the batter reach base and get thrown out trying to steal second?
That high resolution of data is what enables companies to act faster and take more intelligent action. When you consider business intelligence, the whole picture becomes so much more interesting when organizations can feed off the data as it's happening.
What technological capabilities make up a data streaming platform and enable organizations to view and analyze data in real time?
Drogalis: We would call that an event streaming platform. It's a category of technology like databases or data warehouses. It's more than just an event broker that's open to receive streaming data.
You need a few capabilities. You need connectivity to the hundreds of thousands of systems out there -- some way of capturing events as they are being created. You need storage -- someplace to put the data so it's accessible whenever someone wants to consume it. You need an ecosystem of client bindings in different programming languages to be able to interact with those event brokers. Very critically, you need stream processing -- it's not enough to just take the streams that show up; you actually want to curate them, join them and aggregate them. And then there's a slew of secondary capabilities that are just as important, such as good security, governance for that data and metrics around it.
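As a rough illustration of the stream processing capability Drogalis describes, here is a dependency-free Python sketch that aggregates a hypothetical clickstream into one-minute tumbling windows as events arrive. In practice a job like this would typically run in a stream processor such as Kafka Streams, ksqlDB or Flink; the event shape and page names below are invented for the example.

```python
from collections import defaultdict

# Aggregate events into one-minute tumbling windows as they arrive,
# instead of querying a finished table after the fact.
WINDOW_SECONDS = 60

def windowed_page_counts(events):
    counts = defaultdict(int)
    current_window = None
    for ts, page in events:                       # events: iterable of (unix_ts, page_name)
        window = int(ts // WINDOW_SECONDS)        # assign the event to its window
        if current_window is not None and window != current_window:
            yield current_window, dict(counts)    # window closed: emit its aggregate
            counts.clear()
        current_window = window
        counts[page] += 1                         # incremental update, no full rescan
    if current_window is not None:
        yield current_window, dict(counts)        # flush the last open window

# Hypothetical clickstream standing in for a live topic.
clicks = [(0, "/home"), (10, "/cart"), (65, "/home"), (70, "/checkout")]
for window, totals in windowed_page_counts(clicks):
    print(window, totals)
```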
How does streaming data get consumed -- does it connect to a BI platform for visualization, or is there another means of interacting with the data and analyzing it?
Drogalis: The cool thing is that it's very open, so there are many different ways the data can be consumed. It can be consumed programmatically by opening client bindings and connecting to an API to get the actual data in its rawest form. At another level, it can connect to any range of systems, and the stream can pour out into a MySQL database or some other system that a BI platform like Tableau could turn into a visualization.
There's a spectrum of how low or high level a consumer wants to get.
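At the low-level, programmatic end of that spectrum, a consumer might subscribe to the broker directly through a client binding. The sketch below uses the confluent-kafka Python client; the broker address, consumer group and topic name are placeholder assumptions.

```python
from confluent_kafka import Consumer   # requires the confluent-kafka package

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder broker address
    "group.id": "demo-dashboard",            # placeholder consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])               # placeholder topic name

try:
    while True:
        msg = consumer.poll(1.0)             # wait up to 1s for the next event
        if msg is None:
            continue
        if msg.error():
            print("consumer error:", msg.error())
            continue
        # Raw event in hand; from here it could feed a dashboard, a database
        # sink for a BI tool, or any other downstream consumer.
        print(msg.key(), msg.value())
finally:
    consumer.close()
```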
What are some common use cases for a data streaming platform?
Drogalis: It's hard to pick. I've seen every industry and scale at this point. Ride-sharing comes to mind because it's very relatable. There are lots of retail use cases. Timing is so critical when there's a customer on a website who's browsing. If they put something in a basket and then don't do anything for 20 minutes, data streaming can enable a company to send that customer a coupon [to reengage them]. Stock exchanges use data streaming tools, sometimes Kafka and sometimes others. That's a mission-critical example.
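The abandoned-cart scenario can be thought of as a timeout over a stream of cart events. The Python below is a hypothetical sketch, not a production pattern: it remembers each customer's last activity and triggers a stand-in coupon action once 20 minutes pass with no further events.

```python
import time

IDLE_SECONDS = 20 * 60
last_activity = {}   # customer_id -> timestamp of most recent cart event

def on_cart_event(customer_id, now=None):
    # Called for every cart event that streams in.
    last_activity[customer_id] = now if now is not None else time.time()

def check_for_abandoned_carts(now=None):
    # Run periodically; fires once a customer has been idle for 20 minutes.
    now = now if now is not None else time.time()
    for customer_id, seen in list(last_activity.items()):
        if now - seen >= IDLE_SECONDS:
            send_coupon(customer_id)           # hypothetical downstream action
            del last_activity[customer_id]     # don't send the coupon twice

def send_coupon(customer_id):
    print(f"send reengagement coupon to {customer_id}")
```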
What are some of the major challenges organizations face when adopting a data streaming platform and trying to make that real-time data actionable?
Drogalis: Data streaming is kind of like driving on the opposite side of the road from what someone is used to. There is a typical way to process data, and streaming turns it around.
One of the concepts that is critical to understand is that data can arrive in a different order than it actually happened. For example, if someone is on a mobile device and they're out of cellphone range and suddenly get back in range, their phone will upload all the data it missed while out of range. The organization needs to know what to do with that data, which is both a power and a responsibility. That's a point where people struggle.
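One common way to cope with that reordering is to buffer events briefly and release them in event-time order once a watermark has passed. The Python sketch below assumes an illustrative 30-second allowed lateness; real stream processors expose this as configurable event-time and watermark semantics.

```python
import heapq

ALLOWED_LATENESS = 30   # assumed lateness bound, in the same units as event_time

def reorder(events):
    buffer = []                                   # min-heap keyed by event time
    max_seen = 0
    for event_time, payload in events:            # events: iterable of (event_time, payload)
        heapq.heappush(buffer, (event_time, payload))
        max_seen = max(max_seen, event_time)
        watermark = max_seen - ALLOWED_LATENESS   # nothing older is expected anymore
        while buffer and buffer[0][0] <= watermark:
            yield heapq.heappop(buffer)           # safe to emit in event-time order
    while buffer:
        yield heapq.heappop(buffer)               # flush whatever is left at the end

# A phone that was offline uploads an old reading after newer ones have arrived.
readings = [(100, "a"), (140, "b"), (110, "late"), (180, "c")]
print(list(reorder(readings)))
```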
Asynchronicity is another big challenge. We're generally taught to program in a single-threaded, synchronous environment -- A happens, then B happens, and then C happens. Streaming is asynchronous by default. You're dealing with things as they arrive, and it's generally multithreaded across different systems. That ends up being a software development hurdle, and it's an area that's been slow to mature. Maturing it is the critical piece that will make data streaming tools simple to use.
What are common pitfalls organizations face when trying to build data streaming workflows?
Drogalis: Thinking about data governance too late is a mistake some organizations make. It's not always obvious where they should be [placing restrictions]. An organization may have a stream, derive a cache of data from it and do something with that cache. They can govern the stream, but they also have to follow the footprint of that data into the cache and create lineage for it.
Is data governance a challenge as well as a common pitfall?
Drogalis: I don't think governance is more complex when it comes to data streaming, but it is a piece that's been underdeveloped. It tends to come later in the technological maturity curve when building something new. The first thing is to get it working, and then security and governance are added. It's critical. Once something is working, the next step is to roll it out to a bunch of developers, and when doing that, organizations have to be careful.
What's the future of data streaming?
Drogalis: My primary interest is in stream processing, and I've watched stream processing evolve as its own arm of technology. In an adjacent but different part of the industry, I've seen real-time analytics rise up. I see them on a track of unification.
In stream processing, we're on a track of being able to do everything in real time. That started with performing an action, like canceling a credit card or sending an email. It's led to being able to do light analytics in real time. On the other side of the industry are real-time analytical databases that are focused on answering pretty complex questions -- not doing an action, but giving real-time insight into the data. My personal belief is that the two technologies will come together in the next five or so years.
Editor's note: This Q&A has been edited for clarity and conciseness.
Eric Avidon is a senior news writer for TechTarget Editorial. He covers analytics and data management.