How Amazon Transcribe opens the door for speech recognition
Amazon's audio transcription service, Transcribe, has caught developers' eyes for a number of reasons, but make sure its early limitations don't land you in trouble.
Amazon Transcribe provides speech-to-text transcription at scale, and like similar APIs from Microsoft, Google and IBM, the service allows developers to convert lengthy audio and video files into formatted text.
Developers can easily bake automated speech recognition from Amazon Transcribe into their workflows and feed the transcribed output into other native or third-party services that run on the AWS platform. Transcribe supports common audio and video file formats, including WAV, MP3, FLAC and MP4, and it automatically adds time stamps for each word, along with inferred punctuation.
The service can also transcribe lower-quality audio files, including those from a telephone. The early version of Amazon Transcribe supports speech-to-text transcription for English and Spanish, with more languages planned for the future.
The Transcribe API includes three calls: StartTranscriptionJob, ListTranscriptionJobs and GetTranscriptionJob. StartTranscriptionJob starts the process of audio or video file conversion into text. ListTranscriptionJobs returns a list of pending, completed and failed jobs. GetTranscriptionJob returns a link to a JSON file with time-coded text output.
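For a rough sense of how those calls fit together, here is a minimal sketch using Python and boto3, which exposes the same operations as snake_case methods; the job name, bucket and file URI are hypothetical placeholders.

```python
import boto3

# Create a Transcribe client; region and credentials come from the AWS config.
transcribe = boto3.client("transcribe", region_name="us-east-1")

# Kick off a transcription job for a hypothetical MP3 stored in S3.
transcribe.start_transcription_job(
    TranscriptionJobName="demo-call-recording",   # must be unique in the account and region
    LanguageCode="en-US",                         # Spanish is "es-US"
    MediaFormat="mp3",                            # WAV, MP3, FLAC and MP4 are supported
    Media={"MediaFileUri": "s3://example-bucket/calls/recording.mp3"},
)

# List recent jobs, then check on the one just started.
print(transcribe.list_transcription_jobs(MaxResults=10))
job = transcribe.get_transcription_job(TranscriptionJobName="demo-call-recording")
print(job["TranscriptionJob"]["TranscriptionJobStatus"])  # IN_PROGRESS, COMPLETED or FAILED
```

Once a job completes, the GetTranscriptionJob response includes the link to the time-coded JSON transcript.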
Amazon Transcribe costs $0.0004 per second, billed in one-second increments. Developers can test the service through a free tier that includes up to 60 minutes of transcription per month.
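At that rate, for example, a one-hour recording, or 3,600 seconds, works out to 3,600 × $0.0004, or $1.44, once the free tier is exhausted.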
New voice use cases, possibilities
Amazon Transcribe is still in its infancy, so it's not likely to match the accuracy of human transcriptionists in the immediate future. But the deep learning-based AI service will enable new enterprise application use cases around call center analytics, call logging, sentiment analysis, automated captioning, targeted advertising and improved audio and video searches.
Amazon plans to add features that enable developers to expand and customize the speech recognition vocabulary. These capabilities could make the service easier to use with specialized vocabulary, such as notes taken by medical professionals or equipment repair instructions dictated by technicians in the field.
The service will also add the ability to recognize different speakers in a call or recording. This feature would help developers distinguish between multiple voices on audio files, such as call agents and customers, or actors in a movie.
Amazon Transcribe could also make it easier to track compliance in regulated industries, such as finance, where firms must record customer interactions, or to automatically record, transcribe and index meeting notes. Other services, like Amazon Comprehend, could integrate with Transcribe to automatically extract meaning and intent from conversations.
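As a rough sketch of that kind of pipeline, the snippet below pulls a finished transcript and runs it through Comprehend's sentiment detection; the job name is hypothetical, and error handling is omitted for brevity.

```python
import json
import urllib.request

import boto3

transcribe = boto3.client("transcribe")
comprehend = boto3.client("comprehend")

# Fetch the completed job and download its time-coded JSON transcript.
job = transcribe.get_transcription_job(TranscriptionJobName="demo-call-recording")
uri = job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"]
with urllib.request.urlopen(uri) as response:
    results = json.loads(response.read())["results"]
    transcript = results["transcripts"][0]["transcript"]

# Ask Comprehend for the overall sentiment; detect_sentiment accepts only a
# limited amount of text, so long transcripts are trimmed here for simplicity.
sentiment = comprehend.detect_sentiment(Text=transcript[:5000], LanguageCode="en")
print(sentiment["Sentiment"], sentiment["SentimentScore"])
```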
Transcribe's time stamps could also integrate with subtitle files associated with movies and TV shows to provide captions for the hearing impaired or with translation applications to generate foreign language subtitles. When coupled with the Amazon Polly text-to-speech engine, Transcribe could even help automatically generate audio in the target language.
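One way to build captions, sketched below, is to walk the word-level items in Transcribe's JSON output and group them into numbered SRT entries; the caption length and field names here follow the documented output format but should be treated as assumptions.

```python
def to_srt(items, words_per_caption=8):
    """Group Transcribe word items into numbered SRT captions."""

    def fmt(seconds):
        # SRT timestamps look like 00:01:02,500
        ms = int(float(seconds) * 1000)
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1_000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    # Punctuation items carry no timestamps, so keep only the spoken words.
    words = [item for item in items if item["type"] == "pronunciation"]
    captions = []
    for n, offset in enumerate(range(0, len(words), words_per_caption), start=1):
        chunk = words[offset:offset + words_per_caption]
        text = " ".join(word["alternatives"][0]["content"] for word in chunk)
        captions.append(
            f"{n}\n{fmt(chunk[0]['start_time'])} --> {fmt(chunk[-1]['end_time'])}\n{text}\n"
        )
    return "\n".join(captions)


# "items" comes from the transcript JSON that GetTranscriptionJob links to:
# srt_text = to_srt(results["items"])
```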
In the long run, the core technology might also improve the Alexa conversational interface to enable long-form dictation of emails and office notes.
Proceed with caution
Amazon Transcribe seems like a good fit for workflow automation around predictive analytics and trend analysis. But most speech-to-text transcription engines tend to deliver excellent results only with high-quality microphones in quiet environments. Additionally, enterprises could face liability issues in use cases that affect a person's health or safety, such as medical records or prescriptions; these applications will likely require humans to verify accuracy.
The Transcribe API will make it easier to integrate transcription into these verification workflows. But it's important to note that Amazon currently offers Transcribe as an asynchronous API only, which means there are no guarantees on how long it will take to return results.
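In practice, that means client code has to poll for completion or wire up its own notification; a minimal polling loop, with a hypothetical job name and timeout, might look like this.

```python
import time

import boto3

transcribe = boto3.client("transcribe")


def wait_for_job(job_name, poll_seconds=30, timeout_seconds=3600):
    """Poll an asynchronous Transcribe job until it completes, fails or times out."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        job = transcribe.get_transcription_job(TranscriptionJobName=job_name)
        status = job["TranscriptionJob"]["TranscriptionJobStatus"]
        if status in ("COMPLETED", "FAILED"):
            return job
        time.sleep(poll_seconds)  # no latency guarantees, so back off between checks
    raise TimeoutError(f"Transcription job {job_name} did not finish in time")
```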
Good transcription also depends in part on providing feedback to the user, who might be mumbling. A real-time API that provides immediate feedback and makes it easier to customize the algorithm would return better results in safety-critical scenarios.