Getty Images/iStockphoto

News

OpenAI shows developers what is possible with speech tool

The AI vendor introduced four new offerings for enterprises. Its Realtime API is notable for its speech-to-speech capability. Meanwhile, the startup is growing explosively.

Esther Shittu, News Writer

Published: 03 Oct 2024

OpenAI and its partner Microsoft introduced generative AI tools for enterprise developers and general users.

At its DevDay San Francisco event earlier this week, the ChatGPT maker introduced Realtime API, model distillation, Prompt Caching and vision fine-tuning.

The new products arrive as the independent generative AI vendor closed a new funding round in which the generative AI vendor received $6.6 billion in new funding from Thrive Capital, Microsoft, Khosla Ventures, Fidelity Management & Research Co. and Nvidia, among other investors.

The cash infusion -- one of the biggest private investments -- brings OpenAI's valuation to about $157 billion, making the vendor one of the world's biggest startups. The new funding also comes as OpenAI is trying to restructure itself from a nonprofit research lab into a fully for-profit company.

Microsoft also introduced Copilot Labs and Copilot Vision.

Realtime API

Of all the new OpenAI tools, Realtime API stands out, according to Kjell Carlsson, a former tech analyst who is now head of AI strategy at vendor Domino Data Lab.

"It's very impressive technology, and wonderful that they're making it available," Carlsson said. Now in public beta, Realtime API enables developers to build multimodal experiences in their applications and supports natural speech-to-speech conversations using six preset voices already supported in the API.

OpenAI also introduced audio input and output in the Chat Completions API. The chat completions API lets users create prompts by providing messages that contain instructions for the large language model.

With the introduction of audio input and output, developers can pass any text or audio into GPT-4o and have the model respond with their choice of text or audio or both, OpenAI said.

With Realtime API and soon the audio in Chat Completions API, developers can build natural conversational experiences with a single API call.

"Now [OpenAI is] enabling folks to build these voice-based chatbots and systems in a way that is much harder to do previously and do this in a lower latency fashion," Carlsson said.

Realtime API could be used in scenarios in which typing and text are hard to do, said Gartner analyst Arun Chandrasekaran.

For example, voice input and output are helpful when someone is driving a car.

Moreover, car makers might be interested in using a ChatGPT-style conversation chatbot to boost the quality of in-car experiences, Chandrasekaran said.

Now we are enabling folks to build these voice-based chatbots and systems in a way that is much harder to do previously and do this in a lower latency fashion.

Kjell CarlssonHead of AI strategy, Domino Data Lab

"Realtime API and similar APIs potentially that can come from other providers could be a significant differentiator in terms of enabling those use cases," he said.

Although there are hypothetical use cases for tools like Realtime API or other voice technologies such as the idea of them replacing interactions with a call center, good examples of those applications working are lacking, Carlsson said.

"We have a lot of examples of folks liking, adopting and sometimes preferring the interaction with the generative AI model over a human. Nut when it comes to doing that through voice interactions and providing an end-to-end voice voice-based experience, we haven't seen it work well," he said.

However, generative AI technology is about the art of the possible, Chandrasekaran said.

"Once you showcase something possible, people will come up with ideas in terms of ways in which they could potentially use it," he said.

Vision fine-tuning

The new vision fine-tuning tool follows a similar process as text fine-tuning and allows developers to fine-tune the model for a better understanding of the images they want to use.

"The vision fine-tuning one is … a useful feature, but it's not a feature which is building on where OpenAI has its strengths," Carlsson said.

He added that OpenAI's Dall-E is not as strong as some competitors' image models, such as those from Stable Diffusion and others.

Despite this, vision fine-tuning is important because it supports different use cases such as medical image analysis or self-driving cars, Constellation Research analyst Andy Thurai said.

Microsoft Copilot Vision and Copilot Labs

With Microsoft's Copilot Vision, Copilot sits within the Microsoft Edge browser and can understand web pages users view and understand questions about their content. Copilot Vision is opt-in, and users can decide when or how they want to use it.

In its preview version, none of the content and interactions users have with Copilot will be stored or used for training, Microsoft said. The sessions are deleted as soon as the feature is closed.

The service is also being blocked from being used on paywall-protected sites and sensitive content.

"Microsoft has good reason for containing these capabilities because it doesn't want to get in trouble," Carlsson said. "It doesn't want you going in and misusing the Copilot so that it goes in and generates harmful content."

Copilot Labs is also in preview. The first feature available in Labs is Think Deeper. This gives Copilot the ability to reason through more complex problems.

It's unclear if Microsoft uses any OpenAI models to support Think Deeper or Copilot vision.

Model Distillation and Prompt Caching

Meanwhile, OpenAI's Model Distillation tool enables developers to use the outputs of models like GPT-o1 and GPT-4o to fine-tune and improve models like GPT-4o mini.

This feature makes model costs considerably lower, especially when users need to deploy the smaller models at edge locations, according to OpenAI.

"This has been a major pain point for almost everyone," Thurai said. "While I am not sure how good their solution is … this is probably the most welcome one for small language model trainers."

Finally, Prompt Caching is seen as a cost saver for enterprises that use large amounts of API calls. It is a response to OpenAI competitor Anthropic, another independent generative AI vendor that introduced similar capabilities earlier this year.

"At this point. Anthropic is catching up with them pretty fast and ahead in certain areas," Thurai said. "It is mostly a two-horse race though others are trying to implement the technology as fast as they can."

Esther Ajao is a TechTarget Editorial news writer and podcast host covering artificial intelligence software and systems.

OpenAI shows developers what is possible with speech tool

The AI vendor introduced four new offerings for enterprises. Its Realtime API is notable for its speech-to-speech capability. Meanwhile, the startup is growing explosively.

Realtime API

Vision fine-tuning

Microsoft Copilot Vision and Copilot Labs

Model Distillation and Prompt Caching

Dig Deeper on AI technologies

Microsoft Copilot vs. Google Gemini: How do they compare?

Building an internal AI call simulator: Lessons for CIOs

The implications of OpenAI's speech model gpt-realtime

Interview: How OpenAI is making ChatGPT public and private sector-ready

Realtime API

Vision fine-tuning

Microsoft Copilot Vision and Copilot Labs

Model Distillation and Prompt Caching

Related Resources

Dig Deeper on AI technologies

Microsoft Copilot vs. Google Gemini: How do they compare?

Building an internal AI call simulator: Lessons for CIOs

The implications of OpenAI's speech model gpt-realtime

Interview: How OpenAI is making ChatGPT public and private sector-ready