Google Cloud Speech-to-Text, Text-to-Speech see upgrades

Google upgrades its cloud-based speech and text products with higher accuracy levels, new WaveNet voice options and support for several new languages.

Mark Labbe

Published: 22 Feb 2019

Google revealed a number of enhancements to its Google Cloud Speech-to-Text and Text-to-Speech products that the tech giant said will make the AI-driven speech tools less expensive and more useful to enterprises.

Overall, the technologies, which are for developers and support easy integration with a variety of devices and applications, are now more accurate and less costly compared to last year's iterations, according to a Feb. 21 blog post on the Google Cloud website by Dan Aharon, product manager of the Cloud Speech line.

The news represents "a continuation of general industry improvement of speech-to-text and text-to-speech services in the market over the last two or three years," said Dave Schubmehl, research director for cognitive and AI systems and content analytics at IDC.

Both Google speech products are typically used by "enterprises and technology vendors that want to include speech-to-text and text-to-speech capabilities in their applications," Schubmehl said.

A better case for speech use cases

Krish Ramineni, CEO of conversation transcription and tracking platform vendor Fireflies.ai, based in San Francisco, shared similar sentiments. He said the updates spell out "good news for the industry as a whole."

"Just like we saw with the decrease in cloud storage and compute prices a decade ago that allowed for the rise of companies like Dropbox, Box and many cloud-enabled SaaS providers, the speech space, I believe, will see similar trends," Ramineni said.

"Lower costs and increased accuracy will be an enabler for startups to leverage this technology and build unique use cases for their particular customers," he added.

Of a DeepMind to turn audio into text

First introduced in 2016 as Google Cloud Speech API, Google Cloud Speech-to-Text enables developers to use speech technologies developed by parent company Alphabet Inc.'s DeepMind division to automatically transcribe audio from sources that include video, phone calls and regular speech.

Image caption: Google Cloud Speech APIs enable developers to turn speech sound waves into text and text into speech.

Google rebranded and revamped the speech-to-text product last year, with premium services for video and enhanced phone capabilities in beta. Now in general availability, the models are customized for their unique purposes and have higher accuracy rates than the generic services.

The update also brings general availability to a multichannel speech capability that enables Google Cloud Speech-to-Text to better differentiate between multiple speakers. Google introduced the tool in beta form last year.

New ways to put a voice to things

As for Google Cloud Text-to-Speech, a developer service that was introduced last year, Google added a variety of new capabilities, including beta support for new languages and several new voice options.

The new release adds support for Danish, Portuguese, Russian, Polish, Slovak, Ukrainian and Norwegian Bokmål, which brings the total number of supported languages to 21.

Google also added new voices, including dozens of WaveNet voices. WaveNet, a technology developed by DeepMind, is essentially a deep neural network designed to mimic real human voices. Google has also used WaveNet for the last few years to create Google Assistant voices.

In addition, Google made generally available a feature that optimizes audio playback across different devices.

Unlike some prebuilt models and services, "using APIs and services like this provides device-agnostic capabilities for developers," Schubmehl said. Google also sells some customized versions of its products.

Users "don't need to worry about what device the speech comes from, and they also don't necessarily need to tie themselves to a specific device or hardware supplier," Schubmehl continued. "It also offers the developers more fine-grained control over the functions that they are using."

Google competes with several developer-focused providers of AI-driven speech technologies, including AWS and Microsoft. Each of the big tech vendors provides its products and services in the cloud, and they all use advanced machine learning and deep learning models to enable high degrees of accuracy.

For Google Cloud Speech-to-Text, Google charges in 15-second increments, at about $0.006 to $0.009 per increment for up to 1 million minutes, for standard and premium models. For Google Cloud Text-to-Speech, standard voice models are priced at $4 per 1 million characters, while WaveNet voices cost $16 per 1 million characters. Customers get a free 1 million-character tier.
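
To make the pricing concrete, the arithmetic can be sketched in a few lines of Python. This is only an illustration built on the figures quoted above, not Google's billing logic: the function names are hypothetical, and the sketch assumes that partial 15-second increments are rounded up and that the free tier is simply subtracted from the character count before billing.

```python
import math

# Figures quoted in the article; actual billing rules may differ.
INCREMENT_SECONDS = 15
FREE_TTS_CHARACTERS = 1_000_000


def speech_to_text_cost(audio_seconds, rate_per_increment):
    """Estimate transcription cost in dollars.

    Assumption: any partial 15-second increment is billed as a full one.
    """
    increments = math.ceil(audio_seconds / INCREMENT_SECONDS)
    return increments * rate_per_increment


def text_to_speech_cost(characters, rate_per_million):
    """Estimate synthesis cost in dollars.

    Assumption: the free tier is deducted first, then the remainder is
    billed pro rata at the per-million-character rate.
    """
    billable = max(0, characters - FREE_TTS_CHARACTERS)
    return billable / 1_000_000 * rate_per_million


if __name__ == "__main__":
    # One hour of audio at the low and high ends of the quoted range.
    print(speech_to_text_cost(3600, 0.006))
    print(speech_to_text_cost(3600, 0.009))
    # Three million characters with standard ($4) and WaveNet ($16) voices.
    print(text_to_speech_cost(3_000_000, 4))
    print(text_to_speech_cost(3_000_000, 16))
```

Under these assumptions, an hour of audio is 240 increments, and 3 million characters of WaveNet output leaves 2 million billable characters after the free tier.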
