Data integration is hard. Over the years, of all the technologies and processes that make up an organization's analytics stack and lifecycle, data integration has consistently been cited as a challenge. In fact, according to recent ESG research, more than 1 in 3 (36%) organizations say data integration is one of their top challenges with data analytics processes and technologies. The data silo problem is very real, but it's about so much more than having data in a bunch of locations and needing to consolidate it. It's increasingly about the need to merge data of different types and change rates; the need to leverage metadata to understand where the data came from, who owns it, and how it's relevant to the business; the need to properly govern data as more folks ask for access; and the need to ensure trust in data, because if there isn't trust in the data, how can you trust the outcomes derived from it?
Whether ETL or ELT, the underlying story is the same: at some point you need to extract data from its source, transform it to fit the destination and/or the data sets it's being merged with, and load it into a destination like a data warehouse or data lake for analysis; the order of those last two steps is what separates the two approaches. While we won't get into the pros and cons of ETL versus ELT, the ETL process is still prevalent today, due in part to the mature list of incumbents in the ETL space, like Oracle, IBM, SAP, SAS, Microsoft, and Informatica. These are proven vendors that have been in the market for multiple decades and continue to serve many of the largest businesses on the planet. There are also several new(ish) vendors looking to transform the data integration market. Big companies like Google (via its Alooma acquisition), Salesforce (via MuleSoft), and Qlik (via its Attunity acquisition), along with independents like Matillion, all have growing customer bases that are embracing speed, simplicity, automation, and self-service.
Now, whichever approach you take to data integration, I keep hearing the same things from customers: “Vendor X is missing a feature,” or “I wish I could…,” or “I can't get buy-in to try a new solution because the technology isn't mature,” or “That sounds great, but it's a lot of work and we're set in our ways,” or “I'm just going to keep using Vendor Y because it's too disruptive to change.” And every time I hear these common responses, I ask the same follow-up question: what's your ideal tool? Everyone wants the technology to be secure, reliable, scalable, performant, and cost-effective, but I wanted to understand the more pointed wants of the folks who struggle with data integration challenges day in and day out.
Without further ado, I present to you the top list of “wants” when it comes to an ideal data integration tool/product/solution/technology:
- Container-based architecture – Flexibility, portability, and agility are king. As organizations transform, become more data-driven, and evolve their operating environments, containers provide consistency across modern software environments and fit naturally with the microservice-based application platforms organizations are embracing.
- GUI and code – Embrace the diversity of personas that will want access to data. A common way I've seen organizations look at this is that, generally speaking, the GUI is for the generalists and the code behind it is for the experts and tinkerers. That mentality is evolving, by the way, as modern tools look to help generalists and experts alike with more automation via no-code/low-code environments and drag-and-drop workflow interfaces.
- Mass working sets – Common logic or semantic layers are desired. The last thing an engineer or analyst wants to do is write unique code for each individual table; that approach doesn't scale and becomes a nightmare to maintain (see the first sketch after this list for what a parameterized alternative can look like).
- Historic and streaming – Supporting batch and ad-hoc workloads over both historical and streaming data will ensure relevant outcomes. Organizations increasingly want hooks to better meet the real-time needs of the business, and that means real-time availability of, and access to, relevant data without having to jump through hoops (the second sketch after this list shows one way this can look in code).
- Source control with branching and merging – Code changes over time. Ensure source control is in place to understand how and why code has changed. Going hand in hand with source control is the ability to support branching and/or merging of code to address new use cases, new data sources, or new APIs.
- Automatic operationalization – This one is for the DevOps groups. Ensure new workflows can easily go from source control to dev/test or production. Deployment is the first priority, but don't lose sight of management and the iterative nature of data integration processes as users, third-party applications, and data change and evolve.
- Third-party integrations and APIs – The analytics space is massive and fragmented. The more integrations with processing engines, BI platforms, visualization tools, etc., the better. And ensure the future of the business is covered, too. That means incorporating more advanced technology that feeds data science teams, like AI and ML platforms and services.
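To make the "mass working sets" point concrete, here's a minimal Python sketch of one parameterized load routine driven by a shared table config rather than a hand-written script per table. The table names, keys, and SQLite-flavored dedup query are hypothetical stand-ins, not any vendor's actual API; the point is only that adding a table means adding a config entry, not another script.

```python
# A minimal sketch of a metadata-driven pipeline: one load routine applied to
# every configured table. Table names, keys, and columns are hypothetical.
import sqlite3

TABLES = [
    {"name": "orders",    "key": "order_id",    "updated_col": "updated_at"},
    {"name": "customers", "key": "customer_id", "updated_col": "updated_at"},
    {"name": "products",  "key": "product_id",  "updated_col": "updated_at"},
]

def load_latest(conn: sqlite3.Connection, table: dict) -> None:
    """Copy the newest row per key from a staging table into the target table."""
    sql = f"""
        INSERT OR REPLACE INTO {table['name']}
        SELECT * FROM staging_{table['name']} AS s
        WHERE s.{table['updated_col']} = (
            SELECT MAX({table['updated_col']})
            FROM staging_{table['name']}
            WHERE {table['key']} = s.{table['key']}
        )
    """
    conn.execute(sql)

def run_pipeline(db_path: str) -> None:
    # The same logic fans out across every configured table.
    with sqlite3.connect(db_path) as conn:
        for table in TABLES:
            load_latest(conn, table)
```

Swap sqlite3 for whatever engine or orchestrator is actually in play; the pattern is the same either way: keep the per-table differences in metadata and keep the logic common.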
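And to illustrate the historic-and-streaming point, here's a hedged PySpark sketch. Spark is just one example engine, not something the teams above necessarily use, and the paths and schema are made up for illustration. The idea is that the transformation is written once and applied both to a batch read of historical files and to a stream of newly arriving files.

```python
# One transformation, two execution modes: a batch pass over historical data and
# a streaming pass over newly arriving data. Paths and schema are hypothetical.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-and-streaming-demo").getOrCreate()

def enrich(events: DataFrame) -> DataFrame:
    # Shared business logic: identical whether the input is historic or live.
    return (events
            .withColumn("event_date", F.to_date("event_time"))
            .filter(F.col("status") == "complete"))

schema = "event_time TIMESTAMP, status STRING, user_id STRING"

# Historical data: one-off batch read from the data lake.
historic = spark.read.schema(schema).json("s3://example-lake/events/history/")
enrich(historic).write.mode("append").parquet("s3://example-warehouse/events_clean/")

# Streaming data: the same transform over files landing in an incoming folder.
incoming = spark.readStream.schema(schema).json("s3://example-lake/events/incoming/")
query = (enrich(incoming)
         .writeStream
         .format("parquet")
         .option("path", "s3://example-warehouse/events_clean/")
         .option("checkpointLocation", "s3://example-warehouse/_checkpoints/events_clean/")
         .outputMode("append")
         .start())
```

The design choice worth noticing: the real-time hook doesn't require a second implementation of the business logic; only the read and write plumbing changes.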
While this list is by no means complete or all-encompassing, it speaks to where the market is headed. Take it from the data engineers and data architects: they're still primarily ETLing and ELTing their lives away, but they want change and recognize there are opportunities for vast improvement. Marginal improvement without massive disruption is the preferred approach. So a note for the vendors: it's about meeting customers where they are today and minimizing risk as they continue on their data transformation journeys.