Automatic speech recognition may be better than you think

Even as more enterprises turn to speech recognition systems to process unstructured audio and build virtual assistants, many organizations lack confidence in those systems' accuracy.

In the touchless economy accelerated by COVID-19, automatic speech recognition (ASR) has seen a sharp uptick in use. As the world rapidly shifted to remote work and expanded online contact centers and storefronts, businesses turned quickly to virtual assistants, chatbots and automated transcription services.

Yet, even before COVID-19, enterprises were steadily moving towards ASR to augment their workflows.

ASR uses AI-based technologies, including machine learning and deep learning, to identify and process human speech and turn it into text. The technology can power voice-based AI systems and virtual assistants, such as Google Home and Amazon Alexa, as well as voice-to-text software.
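
For a concrete sense of what that looks like in practice, the sketch below transcribes an audio file with the open source Whisper model -- one of many possible toolkits, chosen here purely for illustration; the file name is a placeholder.

```python
# Minimal speech-to-text sketch using the open source Whisper model.
# Assumes `pip install openai-whisper`; "meeting.wav" is a placeholder.
import whisper

model = whisper.load_model("base")        # small general-purpose model
result = model.transcribe("meeting.wav")  # run ASR on the audio file
print(result["text"])                     # the recognized transcript
```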

More ASR

Organizations have increasingly turned to ASR over the last couple of years, as advances in AI, particularly machine learning and deep learning, have greatly improved ASR systems' accuracy, said Hayley Sutherland, a senior research analyst for conversational AI and intelligent knowledge discovery at IDC.

Right now, most systems are 75% to 85% accurate off the shelf, but training can improve that, she noted.
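
Accuracy figures like these are typically derived from word error rate (WER), the standard ASR metric. The short sketch below shows how WER is computed from a reference transcript and a system's output; the sample sentences are invented for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word in four: WER of 0.25, i.e., roughly 75% accurate.
print(word_error_rate("please transcribe this call", "please transcribe a call"))
```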

COVID-19 further increased interest in ASR systems, as the pandemic drove a rapid shift to remote work and education and sparked a profusion of virtual meetings.

Scott Stephenson, CEO of ASR vendor Deepgram, said that before the pandemic, organizations that hadn't yet adopted ASR technology expected they would do so when they eventually upgraded their infrastructure.

"They would say, if you had talked to them a year prior to the pandemic, 'in the next three years, we're going to update our infrastructure,'" he said, adding that the same organization likely had been saying that for the past decade.

"Now when you talk to them," Stephenson continued, "they say, 'We have already upgraded our infrastructure; we had to because we wouldn't be able to operate if we didn't.'"

Deepgram, in partnership with Opus Research, recently surveyed 400 North American decision-makers in various industries to determine if and how respondents use ASR.

About 99% of the respondents indicated they are currently using ASR in some form. Most, about 78%, are using ASR systems to transcribe and analyze voice data from consumer-facing devices -- largely voice assistants within mobile apps.

Common applications

Indeed, outside of broadcast subtitling, one of the most common use cases for ASR is within voice-enabled virtual assistants, most of which rely on speech-to-text software to first convert the spoken word to text, Sutherland said.

"Once in text format, advanced natural language processing can be performed to help conversational AI systems 'understand' what users are saying and determine how to respond," she noted.

Other common applications include enterprise meeting transcription, class transcription and medical notes dictation, she said.

Deepgram's survey found that, after using ASR with consumer-facing devices, organizations are most commonly integrating ASR systems with their collaboration platforms (such as Zoom, Webex, Skype and Slack), with their customer-facing contact centers and with their internal help desks.

Still, despite respondents' near-universal use of ASR, more than half of them said they don't believe they are making proper use of their recorded audio.

According to Stephenson, that's a silo problem.

Potential problems

Since the advent of big data years ago, organizations have stored as much data as they could. Until recently, though, they largely left more complex data, such as images, audio and video, unstructured.

Years ago, this data would have required manual curation, so it sat in older systems as organizations focused on using more straightforward information, such as website clicks or emails.

While audio processing technology has become more advanced over the last few years, "we're still stuck in the legacy way of capturing and storing this audio," Stephenson said.

But modern technology enables organizations to run audio through an accurate model, load the results into a data warehouse and give their data scientists access to it, just as they previously did with information such as clicks on their websites, he continued.
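
That pipeline -- transcribe, load, query -- can be sketched in a few lines. SQLite stands in for a real data warehouse here, and transcribe() is a placeholder for whichever ASR model or service an organization has adopted.

```python
# Hedged sketch: run stored call recordings through ASR and land the
# transcripts in a queryable table. SQLite is a stand-in warehouse.
import sqlite3
from pathlib import Path

def transcribe(path: Path) -> str:
    # Placeholder: call out to the ASR model or service in use.
    return ""

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS transcripts (file TEXT, text TEXT)")

for audio_file in Path("call_recordings").glob("*.wav"):
    conn.execute("INSERT INTO transcripts VALUES (?, ?)",
                 (audio_file.name, transcribe(audio_file)))
conn.commit()
# Data scientists can now query the audio like any other warehouse table.
```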

"Now you can do this with previously untouchable data," Stephenson said.

The problem here, though, is that many organizations don't realize how much better ASR systems have gotten over the past few years, according to Sutherland.

"Early experiences with less accurate ASR [systems] have made some business leaders leery of adopting them," she noted.

In addition, organizations may find that their audio quality is lacking; the accuracy of ASR systems depends in part on the quality of the source audio, Sutherland said.

In certain industry use cases -- for example, voice-enabled applications on manufacturing floors -- audio quality may be poor, she continued.

"Similarly, some of these systems struggle with heavy accents while others are better at adapting to different speakers' voices," she said.  "Pre-processing of the audio may be needed, and this can require additional work and investment."

But, she added, vendors are making advances in audio quality.

Vendors such as Speech Processing Solutions are creating higher-powered, AI-enhanced recording devices to address this problem, while others are building better noise-canceling and audio-enhancing software.

Enterprises interested in ASR should evaluate their options and understand the strengths and limitations of current systems. Even with those limitations, the technology in its current form is promising.
