Surging AI development will heighten focus on data in 2025
Without proper training, models and applications will fail. As a result, enterprises will pay increased attention to the information used to train artificial intelligence tools.
As interest in AI development surges, the underlying data used to train models and applications becomes increasingly important.
As a result, 2025 will likely be a year when enterprises place greater emphasis than ever on the basic tenets of proper data management -- how it's governed, stored, prepared and analyzed -- according to Tony Baer, principal at dbInsight.
"I see 2025 being the year of the renaissance of data," he said. "As AI projects get closer to production, enterprises will start to pay attention again to data."
Enterprise interest in developing AI-powered applications has grown exponentially in the two years since OpenAI's launch of ChatGPT marked a significant improvement in the capabilities of generative AI (GenAI) models. Large language model technology has improved since then, with AI developers Anthropic, Google, Meta, Mistral and others all striving to top one another.
But without an organization's proprietary data, those models are of little use.
It's only when GenAI is combined with proprietary data and trained to understand an organization's operations that the models become useful. Only then can they deliver benefits such as smarter decision-making and improved efficiency that make generative AI such an attractive proposition for businesses.
So as enterprises invest more in AI development, they will also need to take steps to ensure that their data is properly prepared.
Data lakehouses and data catalogs will be front and center. So will data and AI governance. And accessing and operationalizing unstructured data will be critical.
In organizations that get their data in shape to properly develop AI-powered applications, traditional business intelligence will be transformed, according to Yigal Edery, senior vice president of product and strategy at Sisense.
"In 2025, AI will completely obliterate the boundaries of traditional BI, enabling anyone to develop and use analytics without specialized knowledge," he said. "Emerging AI-driven platforms will make analytics as intuitive as natural conversation, eliminating the need for clunky dialog boxes and complex interfaces."
Governance and preparation
Successful AI development begins with properly prepared data.
Without high-quality, well-governed data, AI projects will fail to deliver their desired outcomes. At the very least, improperly trained AI tools will lead to a lack of trust in their outputs, which will result in them going unused. Of greater concern is when they are used anyway, and decisions based on bad outputs lead to customer dissatisfaction, regulatory noncompliance and organizational embarrassment.
To better ensure that high-quality data is used to train AI tools, and to make that data discoverable, semantic models will become more popular in 2025, according to Jeff Hollan, head of applications and developer platform at Snowflake.
Semantics are descriptions of data sets that give them meaning and enable users to understand the data's characteristics. When implemented across an organization, semantic models ensure that the organization's data is consistent and can be trusted. In addition, they make data sets discoverable.
Traditionally, semantic models have been used to define the data sets that inform data products, such as reports and dashboards. Now, they can -- and perhaps should -- be used to define the data sets used to train AI models and applications.
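To make the idea concrete, here is a minimal sketch of what a semantic model for a single data set might look like, expressed as plain Python. The table, column and measure names are hypothetical, and production semantic layers are considerably richer than this.

```python
# A minimal sketch of a semantic model for a sales data set. All names
# (tables, columns, measures) are hypothetical and would come from the
# organization's own warehouse in practice.
SEMANTIC_MODEL = {
    "dataset": "sales.orders",
    "description": "One row per customer order, updated daily.",
    "dimensions": {
        "region": {"column": "ship_region", "synonyms": ["territory", "area"]},
        "order_date": {"column": "ordered_at", "grain": "day"},
    },
    "measures": {
        "revenue": {"expression": "SUM(order_total)", "format": "currency"},
        "order_count": {"expression": "COUNT(*)"},
    },
}

def resolve_term(term: str) -> str | None:
    """Map a business term (or synonym) to the physical column it describes."""
    for name, dim in SEMANTIC_MODEL["dimensions"].items():
        if term == name or term in dim.get("synonyms", []):
            return dim["column"]
    return None

print(resolve_term("territory"))  # -> "ship_region"
```

The point of such a layer is that an AI assistant, a dashboard and a data scientist all resolve "revenue" or "territory" to the same governed definitions.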
"Investing in high-quality, well-governed semantic data models will become a top priority for organizations in 2025," Hollan said. "The growing adoption of AI-powered applications, chatbots and data agents highlights the critical need for curated models that organize and structure data effectively."
Enterprises that don't invest in semantic modeling often wind up trying to develop AI tools with fragmented data that leads to poor accuracy, he continued.
"As a result, this area is poised to see significant investment and innovation in tools, paving the way to fully realize AI's potential," Hollan said.
Effective data and AI governance will also help deliver desired outcomes, according to Sanjeev Mohan, founder and principal of analyst firm SanjMo.
A data governance framework is a documented set of guidelines to determine an organization's proper use of data, including policies addressing who can do what with data, along with data privacy, quality and security. AI governance is an extension of data governance, applying the same policies and standards as data governance to AI models and applications.
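As a rough illustration of the "who can do what with data" portion of such a framework, the sketch below encodes a few hypothetical access policies as data and checks requests against them; real governance platforms enforce far more nuanced rules around privacy, quality and security.

```python
# Illustrative only: a toy policy check showing how a governance framework's
# "who can do what with which data" rules might be expressed as data.
# Roles, classifications and rules here are hypothetical.
POLICIES = [
    {"role": "analyst", "action": "read", "classification": "internal"},
    {"role": "data_engineer", "action": "read", "classification": "restricted"},
    {"role": "data_engineer", "action": "write", "classification": "internal"},
]

def is_allowed(role: str, action: str, classification: str) -> bool:
    """Return True if any policy grants the role this action on this data class."""
    return any(
        p["role"] == role
        and p["action"] == action
        and p["classification"] == classification
        for p in POLICIES
    )

print(is_allowed("analyst", "read", "restricted"))  # -> False
```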
"In 2024, most organizations are still grappling with picking the appropriate use cases and doing experimentations," Mohan said. "But as generative AI workloads become more pervasive, the need for AI governance will grow."
Like semantic models, catalog tools are a means of governing data and AI and making sure they're both high-quality and used effectively.
Catalogs are applications that use metadata to inventory and index an organization's data and AI assets -- including data sets, reports, dashboards, models and applications -- to make them discoverable for analytics and AI-driven analysis. In addition, they are where administrators can put governance measures in place, including access controls.
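A toy sketch of that inventory-and-index idea might look like the following, with hypothetical asset records and a simple search over names and tags; commercial catalogs add lineage, quality scores, access controls and much more.

```python
# A minimal sketch of how a catalog might inventory data and AI assets with
# metadata and make them searchable. The records here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    asset_type: str           # "dataset", "dashboard", "model", ...
    owner: str
    tags: list[str] = field(default_factory=list)

CATALOG = [
    CatalogEntry("sales.orders", "dataset", "data-eng", ["finance", "daily"]),
    CatalogEntry("churn_model_v3", "model", "ml-team", ["customer", "pii"]),
    CatalogEntry("exec_revenue", "dashboard", "bi-team", ["finance"]),
]

def search(term: str) -> list[CatalogEntry]:
    """Find assets whose name or tags mention the search term."""
    term = term.lower()
    return [
        e for e in CATALOG
        if term in e.name.lower() or term in (t.lower() for t in e.tags)
    ]

print([e.name for e in search("finance")])  # -> ['sales.orders', 'exec_revenue']
```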
Given their key capabilities, they will only grow in importance as the use of AI tools increases. However, because many enterprises use open table formats such as Apache Iceberg and Delta Lake to federate data across multiple systems, the catalogs will need to be open as well, according to James Malone, Snowflake's head of data storage and engineering.
"It's already clear that an open table format can't truly exist without an open catalog," he said. "In the coming year, I expect all open catalog solutions to prioritize federation between each other because customers simply don't have the time or resources to constantly switch and migrate between catalogs. Every catalog provider will need to offer seamless federation to win in the market."
Storage and development
One of the keys to developing successful AI applications is the amount of data used to train them.
Models that aren't trained with enough data are prone to hallucinations -- incorrect and sometimes even bizarre outputs -- whereas those trained with an appropriate amount of data are more likely to be accurate.
How much data is needed for proper training depends on the use case. Narrow use cases naturally require less data than broader ones. But even applications developed for hyper-specific use cases need enough data to draw on so they don't make up responses when a query falls outside the information they were given.
Enter unstructured data.
Historically, analytics has focused largely on structured data, such as financial records and point-of-sale transactions. However, structured data now makes up less than 20% of all data. Unstructured data such as text, audio files and images makes up the rest. For AI models and applications to be accurate, enterprises need to access their unstructured data and use it to inform their AI tools.
Many data management vendors have added capabilities such as vector search and retrieval-augmented generation to make unstructured data discoverable and actionable.
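At its core, the retrieval step those capabilities rely on can be sketched in a few lines: documents are embedded as vectors, and the ones closest to a query are handed to the model as context. In the hypothetical sketch below, embed() is only a placeholder for whatever embedding model an organization actually uses.

```python
# A bare-bones sketch of the retrieval step behind vector search and
# retrieval-augmented generation: documents are embedded as vectors, and the
# closest ones to a query are passed to the model as context.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: a real system would call an embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

documents = [
    "Q3 support tickets for the Acme account",
    "2024 purchase history, Acme Corp",
    "Employee handbook, travel policy section",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    q = embed(query)
    scores = doc_vectors @ q
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

context = retrieve("What did Acme order last year?")
print(context)  # top-k passages to prepend to the LLM prompt
```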
But to truly get value from unstructured data, enterprises need to start treating it with the same care as they do their structured data, according to Mohan.
"All the best practices for structured data need to be applied to unstructured data like modeling, security and access governance," he said. "Unstructured data needs to become a first-class citizen, just like its structured data brethren."
Meanwhile, as unstructured data gains importance, so will the ways in which it is stored.
Traditional data warehouses store mainly structured data. As unstructured data became more ubiquitous, data lakes were developed to provide organizations with a repository for text and other forms of data that lack a traditional structure.
However, with structured data in one location and unstructured data in another, the two were isolated from each other. Organizations either had to undertake painstaking work to unify them or leave them separate and accept that any analysis would draw on only part of their pertinent information.
Data lakehouses, first developed about 10 years ago, are a hybrid of data lakes and warehouses, enabling structured and unstructured data to be stored together. Engineers still need to use methods such as vectorization -- the algorithmic conversion of data into numerical representations -- to make unstructured data compatible with structured data, but at least the two aren't isolated from one another in lakehouses.
With unstructured data essential to AI development, Mohan said he expects lakehouses to continue gaining popularity. But not just any lakehouses.
Just as increasing use of open table formats will result in increased use of open catalogs, open table format-based lakehouses will gain popularity, according to Mohan. AWS' recent introduction of S3 Tables and S3 Metadata will help fuel the trend.
"Open table format-based lakehouses will become the de facto analytical approach," he said.
In addition, the preferred open table format will become Apache Iceberg, Mohan continued.
"Apache Iceberg will increase in its prominence at the cost of Delta format," he said.
Open table formats won't be the only open source capabilities that gain popularity in 2025 and spur adoption of tools that support open source, according to JB Onofre, principal software engineer at Dremio and a member of the Apache Software Foundation's board of directors.
Instead, an increased emphasis on interoperability between systems and a corresponding fear of vendor lock-in will drive widespread open source adoption.
"Projects that support hybrid architectures and are extensible across diverse environments will thrive," Onofre said. "In particular, we'll see open source communities focusing on AI-ready data, developing tools that not only democratize access but also ensure data governance and security meet enterprise-grade standards."
Transforming analysis
The heightened emphasis on data preparation, storage, governance and access that many expect to mark 2025 is merely a means to an end: building successful AI models and applications aimed at transforming business.
Because generative AI can enable virtually any employee to query and analyze data using natural language, and can let engineers and other experts automate processes, it is expected to be as significant as technologies such as the internet in the 1990s and the telephone a century ago.
In 2023, in the months after the launch of ChatGPT, enterprises and vendors alike began investing in generative AI tools. In 2024, they started delivering them. As the year progressed, so did the capabilities of those applications. At first, many were assistants that enabled users to ask questions of their data. By the end of the year, agentic AI -- AI applications that not only respond when prompted but can also act autonomously to make suggestions, surface insights and even carry out actions -- was becoming a trend.
That will continue in 2025, according to Saurabh Abhyankar, chief product officer of MicroStrategy.
"Contextual, agentic AI," he said when asked what the biggest trend in data will be in 2025. "Instead of wasting time seeking information from disparate applications and dashboards, information should come to you when and where you need it. Agentic AI can proactively provide insights in-stream based on the context of an employee's work."
For example, when a sales representative prepares for a customer meeting, they typically need to use multiple systems and reports to research things such as past orders and recent support tickets, Abhyankar continued.
"Contextual, agentic AI will change that."
The result will be a paradigm shift in the way data is delivered and users consume information, according to Thomas Bodenski, COO and chief data and analytics officer of TS Imagine, a financial technology firm that uses Snowflake for its data management needs.
"Generative AI will redefine the data delivery lifecycle in 2025, driving significant productivity gains for data professionals," he said. "SaaS companies with strong data strategies will lead the way, enabling the rapid deployment of AI-driven features and empowering users with self-service analytics. Firms that prioritize complete, trustworthy data will outpace competitors."
TS Imagine is one of those firms, he continued.
"We've dedicated significant effort to leveraging generative AI in every step of our data delivery lifecycle -- from conception to deployment, and even troubleshooting production problems," he said. "Now, we're automating the entire data delivery lifecycle, bringing these individual steps together to achieve a new level of efficiency and scalability."
While BI will be transformed by AI, software development will be affected as well, according to Ariel Katz, chief executive officer of analytics vendor Sisense.
Enterprises will train AI applications to write code, enabling the tools to develop entire applications with little human intervention beyond verifying that the code is correct. In addition, enterprises will deploy AI tools to automate complex tasks such as bug detection.
"Major tech companies such as Microsoft and Google have already integrated AI tools into their development pipelines, marking a turning point in software creation," Katz said.
But while the benefits might be substantial for organizations, he added that they could come at a human cost.
"While AI liberates developers from routine work, it also threatens job security, forcing them to evolve or risk being left behind in this new era of coding innovation," Katz said.
Eric Avidon is a senior news writer for TechTarget Editorial and a journalist with more than 25 years of experience. He covers analytics and data management.