OpenAI shows developers what is possible with speech tool
The AI vendor introduced four new offerings for enterprises. Its Realtime API is notable for its speech-to-speech capability. Meanwhile, the startup is growing explosively.
OpenAI and its partner Microsoft introduced generative AI tools for enterprise developers and general users.
At its DevDay San Francisco event earlier this week, the ChatGPT maker introduced Realtime API, model distillation, Prompt Caching and vision fine-tuning.
The new products arrive as the independent generative AI vendor closed a funding round in which it received $6.6 billion from Thrive Capital, Microsoft, Khosla Ventures, Fidelity Management & Research Co. and Nvidia, among other investors.
The cash infusion -- one of the largest private investments on record -- brings OpenAI's valuation to about $157 billion, making the vendor one of the world's most valuable startups. The new funding also comes as OpenAI is trying to restructure itself from a nonprofit research lab into a fully for-profit company.
Microsoft also introduced Copilot Labs and Copilot Vision.
Realtime API
Of all the new OpenAI tools, Realtime API stands out, according to Kjell Carlsson, a former tech analyst who is now head of AI strategy at vendor Domino Data Lab.
"It's very impressive technology, and wonderful that they're making it available," Carlsson said. Now in public beta, Realtime API enables developers to build multimodal experiences in their applications and supports natural speech-to-speech conversations using six preset voices already supported in the API.
OpenAI also introduced audio input and output in the Chat Completions API. The Chat Completions API lets developers create prompts by providing messages that contain instructions for the large language model.
With the introduction of audio input and output, developers can pass any text or audio into GPT-4o and have the model respond with their choice of text or audio or both, OpenAI said.
With Realtime API and soon the audio in Chat Completions API, developers can build natural conversational experiences with a single API call.
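To illustrate the single-call pattern, here is a minimal sketch of an audio-in, audio-out Chat Completions request body. The model name `gpt-4o-audio-preview` and the `modalities` and `audio` fields follow the API shape OpenAI described at launch; treat the exact names as assumptions.

```python
def build_audio_request(prompt: str, voice: str = "alloy") -> dict:
    """Build a request body asking GPT-4o to reply with both text and spoken audio.

    Field names mirror OpenAI's announced Chat Completions audio support
    and may differ from the current API reference.
    """
    return {
        "model": "gpt-4o-audio-preview",        # assumed audio-capable model ID
        "modalities": ["text", "audio"],        # request text plus spoken output
        "audio": {"voice": voice, "format": "wav"},
        "messages": [{"role": "user", "content": prompt}],
    }

# With the official SDK, the body would be sent roughly as:
# client.chat.completions.create(**build_audio_request("Summarize my day"))
```

The same body, pointed at a WebSocket session instead of a REST call, is conceptually what the Realtime API streams for lower-latency speech-to-speech conversations.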
"Now [OpenAI is] enabling folks to build these voice-based chatbots and systems in a way that is much harder to do previously and do this in a lower latency fashion," Carlsson said.
Realtime API could be used in scenarios in which typing and text are hard to do, said Gartner analyst Arun Chandrasekaran.
For example, voice input and output are helpful when someone is driving a car.
Moreover, car makers might be interested in using a ChatGPT-style conversation chatbot to boost the quality of in-car experiences, Chandrasekaran said.
"Realtime API and similar APIs potentially that can come from other providers could be a significant differentiator in terms of enabling those use cases," he said.
Although there are hypothetical use cases for tools like Realtime API and other voice technologies -- such as replacing interactions with a call center -- good examples of those applications working are lacking, Carlsson said.
"We have a lot of examples of folks liking, adopting and sometimes preferring the interaction with the generative AI model over a human. But when it comes to doing that through voice interactions and providing an end-to-end voice-based experience, we haven't seen it work well," he said.
However, generative AI technology is about the art of the possible, Chandrasekaran said.
"Once you showcase something possible, people will come up with ideas in terms of ways in which they could potentially use it," he said.
Vision fine-tuning
The new vision fine-tuning tool follows a process similar to text fine-tuning and lets developers tune the model to better understand the images they want to use.
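In practice, that process means preparing a JSONL training file in which each line pairs an image with the answer the model should learn to give. The record shape below is a sketch modeled on OpenAI's chat-format fine-tuning data; the `image_url` content part is an assumption based on how images are passed to GPT-4o.

```python
import json

def make_vision_example(image_url: str, question: str, answer: str) -> str:
    """Return one JSONL line pairing an image and question with the target answer."""
    record = {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    # Images are referenced by URL, as in GPT-4o chat requests.
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
            # The assistant turn is the label the fine-tuned model learns to produce.
            {"role": "assistant", "content": answer},
        ]
    }
    return json.dumps(record)
```

Writing a few hundred such lines to a file and uploading it for fine-tuning is, in outline, how a team might adapt GPT-4o to a narrow visual domain such as medical imagery.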
"The vision fine-tuning one is … a useful feature, but it's not a feature which is building on where OpenAI has its strengths," Carlsson said.
He added that OpenAI's Dall-E is not as strong as some competing image models, such as Stable Diffusion.
Despite this, vision fine-tuning is important because it supports different use cases such as medical image analysis or self-driving cars, Constellation Research analyst Andy Thurai said.
Microsoft Copilot Vision and Copilot Labs
With Microsoft's Copilot Vision, Copilot sits within the Microsoft Edge browser, where it can read the web pages users view and answer questions about their content. Copilot Vision is opt-in, and users can decide when or how they want to use it.
In its preview version, none of the content and interactions users have with Copilot will be stored or used for training, Microsoft said. The sessions are deleted as soon as the feature is closed.
The service is also blocked on paywall-protected sites and sensitive content.
"Microsoft has good reason for containing these capabilities because it doesn't want to get in trouble," Carlsson said. "It doesn't want you going in and misusing the Copilot so that it goes in and generates harmful content."
Copilot Labs is also in preview. The first feature available in Labs is Think Deeper. This gives Copilot the ability to reason through more complex problems.
It's unclear whether Microsoft uses any OpenAI models to support Think Deeper or Copilot Vision.
Model Distillation and Prompt Caching
Meanwhile, OpenAI's Model Distillation tool enables developers to use the outputs of larger models such as o1-preview and GPT-4o to fine-tune and improve smaller models like GPT-4o mini.
This feature makes model costs considerably lower, especially when users need to deploy the smaller models at edge locations, according to OpenAI.
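The core idea -- reusing a large model's answers as training targets for a small one -- can be sketched in a few lines. The helper names and dataset layout below are illustrative; the chat-format JSONL mirrors OpenAI's fine-tuning data format, but the workflow around it is an assumption.

```python
import json

def to_distillation_example(prompt: str, teacher_output: str) -> dict:
    """Pair a prompt with a larger 'teacher' model's answer as a fine-tuning target."""
    return {
        "messages": [
            {"role": "user", "content": prompt},
            # The teacher's completion becomes the label the smaller model imitates.
            {"role": "assistant", "content": teacher_output},
        ]
    }

def write_dataset(pairs, path):
    """Write (prompt, teacher answer) pairs as one JSON object per line (JSONL)."""
    with open(path, "w") as f:
        for prompt, output in pairs:
            f.write(json.dumps(to_distillation_example(prompt, output)) + "\n")
```

A team would collect GPT-4o completions on its own prompts, write them out with `write_dataset`, then fine-tune GPT-4o mini on the resulting file -- getting much of the larger model's behavior at the smaller model's serving cost.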
"This has been a major pain point for almost everyone," Thurai said. "While I am not sure how good their solution is … this is probably the most welcome one for small language model trainers."
Finally, Prompt Caching is seen as a cost saver for enterprises that make large numbers of API calls. It is a response to OpenAI competitor Anthropic, another independent generative AI vendor, which introduced similar capabilities earlier this year.
"At this point, Anthropic is catching up with them pretty fast and ahead in certain areas," Thurai said. "It is mostly a two-horse race, though others are trying to implement the technology as fast as they can."
Esther Ajao is a TechTarget Editorial news writer and podcast host covering artificial intelligence software and systems.