Explore real-world use cases for multimodal generative AI

Multimodal generative AI can integrate and interpret multiple data types within a single model, offering enterprises a new way to improve everyday business processes.

With multimodal generative AI, teams can build machine learning models that handle multiple data types, such as text, images and audio. These capabilities open up new applications in content creation, customer service, and research and development.

Many generative AI offerings from Google, Microsoft, AWS, OpenAI and the open source community now support at least text and images within a single model. Efforts are also underway to support other inputs, such as data from IoT devices, robot controls, enterprise records and code.

"Multimodality in AI for business applications is best understood by first recognizing the variety and complexity of data types businesses deal with every day," said Christian Ward, executive vice president and chief data officer at digital experience platform Yext.

Multimodal generative AI can help with financial data, customer profiles, store statistics, geographical information, search trends and marketing insights -- all of which are stored in diverse forms, including images, charts, text, voice and dialogues. Multimodal AI can automatically find connections among different data sets representing entities such as customers, equipment and processes.

"We are so used to seeing these data sets as separate, often different software packages, but multimodality is also about merging and meshing this into completely new output forms," Ward said.

Getting started with multimodal models

Major AI services, including OpenAI's GPT-4 and Google's Gemini, are starting to support multimodal capabilities. These models can understand and generate content across multiple formats, including text, images and audio.

"The advent of capable generative multimodal models, such as GPT-4 and Gemini, marks a significant milestone in AI development," said Samuel Hamway, research analyst at technology research firm Nucleus Research.

Hamway recommends that businesses start by experimenting with consumer-facing chatbots such as ChatGPT and Gemini, formerly called Bard. With their multimodal functionality, these platforms give businesses an accessible way to boost productivity in several areas. For example, ChatGPT and Gemini can automate routine customer interactions, assist in creative content generation, simplify complex data analysis and interpret visual data in conjunction with text queries.
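
The same multimodal functionality is also available programmatically. As a minimal sketch, assuming the OpenAI Python SDK and access to a multimodal model such as GPT-4o, a team could pair a chart image with a text question in a single request; the model name, image URL and prompt below are illustrative placeholders, not a prescribed setup.

# Minimal sketch: asking a multimodal model a text question about an image.
# Assumes the OpenAI Python SDK (openai>=1.0) and an API key in OPENAI_API_KEY.
# The model name, image URL and prompt are placeholders.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # any multimodal chat model available to the account
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize the key trends shown in this sales chart."},
                {"type": "image_url", "image_url": {"url": "https://example.com/q3-sales-chart.png"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)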

Despite recent progress, multimodal AI is generally less mature than text-only large language models (LLMs), primarily due to the difficulty of obtaining high-quality multimodal training data. In addition, multimodal models are typically more expensive to train and run than their text-only counterparts.

Vishal Gupta, partner at advisory firm Everest Group, observed that current multimodal AI models predominantly focus on text and images, with some models including speech at experimental stages. That said, Gupta expects that the market will gain momentum in the coming years, given multimodal AI's broad applicability across industries and job functions.

8 multimodal generative AI use cases

Here are eight real-world use cases where multimodal generative AI can provide value to enterprises today or in the near future.

1. Marketing and advertising

Marketing content creation is among the multimodal generative AI use cases gaining the most traction, Gupta said. Multimodal models can integrate audio, images, video and text to help develop dynamic images and videos for marketing campaigns.

"This has huge potential to further elevate the customer experience by dynamically personalizing content for users, as well as improving efficiency and productivity for content teams," Gupta said.

However, enterprises need to balance personalization with privacy concerns, Hamway cautioned. In addition, they must develop data infrastructures capable of effectively managing large and diverse data sets to glean actionable insights.

2. Image and video labeling

Multimodal generative AI models can generate text descriptions for sets of images, Gupta said. This capability can be used to caption videos, annotate and label images, generate product descriptions for e-commerce, and generate medical reports.
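
For teams that prefer to prototype captioning with open source tooling rather than a commercial API, a small vision-language model can draft captions locally. The sketch below assumes the Hugging Face Transformers library and the publicly available BLIP captioning model; the image file name and text prefix are placeholders.

# Minimal sketch: drafting a caption for a product image with an open source
# captioning model. Assumes transformers, torch and Pillow are installed;
# the file name and prompt prefix are placeholders.
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("product_photo.jpg").convert("RGB")

# Conditional captioning: the text prefix steers the style of the caption.
inputs = processor(images=image, text="a product photo of", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)

print(processor.decode(output_ids[0], skip_special_tokens=True))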

3. Customer support and interactions

Yaad Oren, managing director of SAP Labs U.S. and global head of SAP Innovation Center Network, believes that the most promising multimodal generative AI use case is customer support. Multimodal generative AI can enhance customer support interactions by simultaneously analyzing text, images and voice data, leading to more context-aware and personalized responses that improve the overall customer experience.

Chatbots can also use multimodality to understand and respond to customer queries in a more nuanced manner by incorporating visual and contextual information. One key challenge, however, is ensuring accurate and ethical handling of diverse data types, especially with sensitive customer information.

4. Supply chain optimization

Multimodal generative AI can optimize supply chain processes by analyzing text and image data to provide real-time insights into inventory management, demand forecasting and quality control. Oren said SAP Labs U.S. is exploring analyzing images for quality assurance in manufacturing processes and identifying defects or irregularities. The company is also examining how natural language processing models can analyze textual data from various sources to predict demand fluctuations and optimize inventory levels.
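
As a rough illustration of the image-based inspection idea (a generic sketch, not SAP's implementation), a local photo of a part can be base64-encoded and sent to a general-purpose multimodal model with a defect-inspection prompt. The example assumes the OpenAI Python SDK; the model name, file name and prompt are placeholders.

# Hedged sketch: asking a multimodal model to flag visible defects in a photo.
# Assumes the OpenAI Python SDK; the model, file and prompt are placeholders.
import base64

from openai import OpenAI

client = OpenAI()

with open("assembly_line_part.jpg", "rb") as f:
    encoded_image = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Inspect this manufactured part. List any visible defects or irregularities, or reply 'no visible defects'.",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{encoded_image}"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)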

5. Improved healthcare

Taylor Dolezal, head of ecosystem at the Cloud Native Computing Foundation, sees considerable promise in the healthcare sector for integrating various data types to enable more accurate diagnostics and personalized patient care. Multimodal generative AI is particularly useful for diagnostic tools, surgical robots and remote monitoring devices.

"While these advancements promise improved patient outcomes and accelerated medical research, they pose challenges in data integration, accuracy and patient privacy," Dolezal said.

6. Improving manufacturing and product design

Multimodal generative AI can improve manufacturing and design processes, Dolezal said. Models trained on design and manufacturing data, defect reports, and customer feedback can enhance the design process, increase quality control and improve manufacturing efficiency.

In product design, AI can analyze market trends and consumer feedback; in manufacturing, it can support quality control and predictive maintenance. The main challenge lies in integrating multiple data sources and ensuring the interpretability of AI decisions, Dolezal said.

7. Employee training

Multimodal generative AI can enhance learning and mastery in employee training programs, Ward said. By drawing on diverse instructional materials and data, AI can create a custom learning experience for each role. From there, employees can "teach" the material back to the AI through an audio or video recording, creating an interactive feedback mechanism. As employees articulate their understanding of the material to the AI system, it assesses their comprehension and identifies learning gaps.
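
A minimal sketch of this teach-back loop, assuming the OpenAI Python SDK, might transcribe the employee's recording and then ask a language model to compare the explanation against the training material. The file names, models and prompts below are illustrative, not a specific vendor's implementation.

# Hedged sketch of a teach-back loop: transcribe an employee's explanation,
# then compare it against the training material to surface learning gaps.
# Assumes the OpenAI Python SDK; file names, models and prompts are placeholders.
from openai import OpenAI

client = OpenAI()

# Step 1: Transcribe the employee's audio recording.
with open("teach_back_recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# Step 2: Ask a text model to assess comprehension against the source material.
with open("module_summary.txt") as f:
    training_material = f.read()

review = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a corporate trainer reviewing a learner's explanation."},
        {
            "role": "user",
            "content": (
                f"Training material:\n{training_material}\n\n"
                f"Learner's explanation:\n{transcript.text}\n\n"
                "List the concepts the learner explained correctly and the concepts they missed or misunderstood."
            ),
        },
    ],
)

print(review.choices[0].message.content)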

Ward cautioned that this approach could face challenges, particularly in human adoption of AI feedback. Nevertheless, it promises a more personalized and effective learning experience.

8. Multimodal question answering

Ajay Divakaran is the technical director of the Vision and Learning Laboratory in the Center for Vision Technologies at SRI International, a nonprofit scientific research institute. His team is exploring how to improve question answering by combining images and text, as well as audio when possible.

This is particularly useful for applications that involve carrying out ordered steps. For example, someone querying an AI system with a home repair question could receive a combination of textual steps along with generated images and videos, with the text and visuals working together to explain the steps to the user.
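
One rough way to prototype this pattern is to request the answer as short numbered steps and then generate an illustrative image for each step. The sketch below assumes the OpenAI Python SDK; the models and the question are placeholders, and a production system might retrieve verified imagery rather than generating it.

# Hedged sketch: answer a how-to question as numbered steps, then generate an
# illustrative image per step. Assumes the OpenAI Python SDK; placeholders only.
from openai import OpenAI

client = OpenAI()

question = "How do I replace the washer in a leaking kitchen faucet?"

# Step 1: Get the answer as short numbered text steps.
steps_response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"{question} Answer in short numbered steps."}],
)
steps = [line for line in steps_response.choices[0].message.content.splitlines() if line.strip()]

# Step 2: Generate a simple instructional illustration for each step.
for step in steps:
    image = client.images.generate(
        model="dall-e-3",
        prompt=f"Clear instructional illustration for a home repair guide: {step}",
        size="1024x1024",
        n=1,
    )
    print(step, "->", image.data[0].url)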

George Lawton is a journalist based in London. Over the last 30 years, he has written more than 3,000 stories about computers, communications, knowledge management, business, health and other areas that interest him.
