AWS reimagines SageMaker as suite for data, analytics, AI
The service is now a unified platform featuring a data catalog, a lakehouse and integrations with third-party applications.
AWS on Tuesday introduced a reimagined version of SageMaker, transforming what had been a service for training and deploying machine learning models into a unified platform for data management, analytics and AI development.
The new incarnation of SageMaker includes Unified Studio to connect previously disparate AWS data management, analytics and AI development capabilities. In addition, it features a data catalog to provide governance capabilities; a data lakehouse to unify data previously stored in lakes, warehouses and databases; and integrations that simplify access to data in third-party applications.
The new version of SageMaker was unveiled during AWS re:Invent 2024, a user conference hosted by the tech giant in Las Vegas. The updated version of SageMaker is generally available, except for Unified Studio, which is in preview and scheduled for general availability in 2025.
AWS is in a fierce race against fellow tech giants Google Cloud and Microsoft to provide customers with unified platforms for data and AI, according to David Menninger, an analyst at ISG's Ventana Research.
AWS was rated slightly ahead of Google Cloud and Microsoft in a recent ISG Buyers Guide, he noted. Once generally available, Unified Studio, in concert with the other new features that comprise the reimagined SageMaker, should help AWS boost its standing.
"All the cloud service and data platform providers are working to provide a unified platform for data and AI," Menninger said. "They are all adding capabilities to their platforms, and it will continue to be a competitive market. But these new announcements will improve AWS ratings once they are generally available."
First launched in November 2017, SageMaker was initially a fully managed service for developing and deploying machine learning models. Since then, AWS has steadily expanded it, adding SageMaker JumpStart to simplify access to prebuilt models in 2020, new model-building features in 2021, and governance and geospatial data tools in 2022.
Now, SageMaker is not simply a managed service for machine learning but has expanded into a unified environment for data management, AI development and analytics.
A new SageMaker
Enterprises view generative AI as a means of making employees smarter as well as more efficient.
As a result, their investment in developing AI-powered applications, some that enable natural language interactions with data and others that automate repetitive processes, has surged in the two years since OpenAI's launch of ChatGPT significantly improved generative AI technology.
Key to that development is proprietary data, without which generative AI models and applications cannot understand the unique characteristics of a given organization.
Therefore, in response to the surging investment in AI development, many data management and analytics vendors have created environments aimed at making it easy for enterprises to develop AI applications that combine proprietary data with generative AI technology.
While not aimed solely at enabling AI development -- AWS also provides Bedrock, a platform tailored specifically to building generative AI-powered applications -- the reimagined SageMaker brings together the elements needed to develop AI tools.
Unified Studio builds on SageMaker's pre-existing machine learning development capabilities by combining them with previously disparate data management and application development services in a single, integrated environment. Among them are capabilities from AWS Glue for data integration, EMR for data processing, Redshift and S3 for data storage, and Bedrock for generative AI development.
In addition, the suite includes Amazon Q Developer, a generative AI-powered assistant that enables developers to use natural language to seek advice on such topics as data discovery and coding as they build applications for specific uses.
Given that enterprises use myriad platforms to ingest, integrate, prepare and analyze data -- including developing data and AI products -- any steps vendors such as AWS can take to make their tools more interoperable with one another are important, according to Menninger.
As a result, the addition of Unified Studio is significant for AWS customers.
"Unifying data and analytics processes, including AI, is a real challenge today," Menninger said. "There are just too many tools and technologies that need to be integrated even when you are working with a single vendor. Anything the software providers can do to bring all these components together will be welcome improvements."
Kevin Petrie, an analyst at BARC U.S., similarly said that vendors should reduce complexity by making tools easier to use together and potentially even reducing the number of different tools needed to reach data-informed insights.
Business intelligence, machine learning and AI are all converging and overlapping, so AWS' attempt to make its data management and AI development easier to use together is important, he continued.
"The more you can intermingle model types and data types, the more you can enrich your analytical outputs and enrich business workflows," Petrie said. "So it is critical to reduce the number of tools and platforms that companies use to manage multifaceted data and multifaceted analytics. AWS is taking a good step in this regard."
Beyond Unified Studio, the new SageMaker includes SageMaker Catalog and SageMaker Lakehouse.
Data catalogs are connective tissue for data that can otherwise be isolated within the many systems enterprises use across departments and, in some organizations, physical locations. They can serve as an index for datasets and data products such as reports and dashboards so those assets can be discovered and reused to inform decisions. They can also act as metadata management tools and semantic modeling layers that help ensure consistent data across all of an organization's domains.
Perhaps most importantly, they can serve as governance frameworks through which administrators can ensure the safety and security of their organization's data.
AWS' new SageMaker Catalog, built on the Amazon DataZone data cataloging service, enables administrators to define and implement governance policies that ensure the proper use of their organization's data and AI assets. For example, customized permissions can be set and enforced across data products, AI products, datasets and data sources to ensure data remains secure and compliant.
While beneficial, SageMaker Catalog has a significant shortcoming -- as do data catalogs from Google Cloud and Microsoft -- according to Petrie.
Many enterprises use multiple clouds for data storage. In addition, they don't necessarily have all their data in the cloud and use on-premises databases in addition to cloud data warehouses, lakes and lakehouses.
"The challenge is that AWS, like Google and Azure, does not adequately integrate with or support hybrid, heterogeneous and multi-cloud environments that are predominant in modern enterprises," Petrie said. "The catalog capabilities are limited in this regard."
While data catalogs make it easier for enterprises to govern their data and AI assets, lakehouses make it easier for them to integrate their data in preparation for analytics and AI development.
Lakehouses combine the structured data storage capabilities of data warehouses with the unstructured data storage capabilities of data lakes to enable organizations to combine and operationalize all their data rather than just a portion of it.
Enterprises traditionally based analysis only on structured data such as financial records and point-of-sale transactions. Now, however, unstructured data such as text, images and audio files accounts, by most estimates, for more than 80% of all enterprise data. Accessing unstructured data and combining it with structured data is, therefore, important for organizations to get a complete understanding of their operations.
Meanwhile, to simplify processing of large datasets, data lakes and lakehouses use open table formats such as Delta Lake, Apache Hudi and Apache Iceberg.
AWS' SageMaker Lakehouse unifies data stored in S3 data lakes and Redshift data warehouses to reduce data isolation and is compatible with Apache Iceberg, which is the most popular table storage format. Using SageMaker Lakehouse, AWS customers can access their data from within Unified Studio to train and develop AI models and applications as well as inform data products such as reports and dashboards.
Compatibility with Apache Iceberg is perhaps SageMaker Lakehouse's most significant feature, according to Menninger. Apache Iceberg enables SageMaker Lakehouse to interact with Iceberg-compatible tools from other vendors, reducing the need to move or replicate data.
"There is a groundswell of support for Apache Iceberg in the market, and for good reason," Menninger said. "Less data movement means less cost and less effort. Less data redundancy means better control and governance of the data. It also gets enterprises closer to a single version of the truth."
In addition to the updated SageMaker, AWS launched new integrations that eliminate the need for traditional extract, transform and load (ETL) pipelines when ingesting data from SaaS applications into AWS databases, data warehouses, data lakes and now data lakehouses.
Using the zero-ETL integrations, customers can capture data from applications such as SAP and Zendesk and move it into Redshift; SageMaker Lakehouse; and a host of AWS databases, such as Amazon Aurora and Amazon RDS.
The integrations are designed to reduce the cost and labor generally associated with data ingestion, including developing and managing data pipelines.
Looking ahead
While AWS' reimagining of SageMaker unifies previously disparate systems and processes, it doesn't unify all data management, analytics and AI development processes, Petrie said.
Therefore, there is room for improvement.
Successful analytics and AI initiatives encompass the lifecycles of data, models and the applications built on the underlying data and models, according to Petrie.
"This announcement addresses the data and model lifecycles," he said. "I will be interested to see how AWS helps customers optimize the application lifecycle and integrate it with the data and model lifecycles as well."
Menninger, meanwhile, noted that while zero-ETL integrations with SaaS applications are intriguing, they're only valuable if they're with the right applications for a given organization.
AWS mentioned SAP and Zendesk but provided no further details. For the zero-ETL integrations to have a significant effect, there needs to be more connectivity with major enterprise resource planning, customer relationship management and healthcare management applications.
"It would be extremely helpful to be able to apply those capabilities to the major ERP, CRM, HCM and other business applications enterprises are using today," Menninger said.
Eric Avidon is a senior news writer for TechTarget Editorial and a journalist with more than 25 years of experience. He covers analytics and data management.