Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text, is a capability which enables a program to identify human speech and convert it into readable text.
Whilst the more basic speech recognition software has a limited vocabulary, we are now seeing the emergence of more sophisticated software that can handle natural speech, different accents and various languages, whilst also achieving much higher accuracy rates. We are also using speech recognition technology much more in our everyday lives, with an increasing number of people taking advantage of digital assistants like Google Home, Siri, and Amazon Alexa.
So, how has the technology evolved, how does it work, and what are the opportunities for businesses and professionals across numerous industries and sectors to exploit speech recognition in their everyday work?
Here’s a quick overview of how speech recognition has developed from the early prototypes:
- 1952 – The first-ever speech recognition system, known as “Audrey”, was built by Bell Laboratories. It was capable of recognising the sound of a spoken digit, zero to nine, with more than 90% accuracy when uttered by a single voice (that of its developer, K. H. Davis).
- 1962 – IBM created the “Shoebox”, a device that could recognise and differentiate between 16 spoken English words.
- 1970s – As part of a US Department of Defense-funded program, Carnegie Mellon University developed the “Harpy” system that could recognise entire sentences and had a vocabulary of 1,011 words.
- 1980s – IBM developed a voice-activated typewriter called Tangora which used a statistical prediction model for word identification with a vocabulary of 20,000 words.
- 1996 – IBM were involved again, this time with VoiceType Simply Speaking, a speech recognition application that had a 42,000-word vocabulary, supported English and Spanish, and included a spelling dictionary of 100,000 words.
- 2000s – With speech recognition now achieving close to an 80% accuracy rate, voice assistants (also commonly referred to as digital assistants) came to the fore: first Google Voice, followed a few years later by Apple’s Siri and Amazon’s Alexa.
How it works
A wide range of speech recognition applications and devices are available, with the more advanced solutions now using Artificial Intelligence (AI) and machine learning. They are typically based on the following models:
- Acoustic models – relating the audio signal to the phonemes (the units of sound) it is most likely to contain.
- Pronunciation models – defining how the phonemes can be combined to make words.
- Language models – matching sounds with word sequences in order to distinguish between words that sound the same.
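As a rough illustration of the language model's role, the sketch below picks between two homophones by scoring each against its neighbouring words. The bigram counts, the phoneme string and the function names are all invented for this example; a real system would learn its statistics from a large text corpus and combine them with acoustic scores.

```python
import math

# Hypothetical bigram counts, invented purely for illustration.
BIGRAM_COUNTS = {
    ("over", "there"): 40,
    ("over", "their"): 1,
    ("their", "house"): 30,
    ("there", "house"): 1,
}

# Toy pronunciation lookup: one phoneme string, two candidate words.
HOMOPHONES = {"DH EH R": ["their", "there"]}


def bigram_log_prob(prev_word, word, smoothing=1.0):
    """Smoothed log-probability of `word` following `prev_word`."""
    count = BIGRAM_COUNTS.get((prev_word, word), 0)
    total = sum(c for (p, _), c in BIGRAM_COUNTS.items() if p == prev_word)
    vocab = len({w for _, w in BIGRAM_COUNTS})
    return math.log((count + smoothing) / (total + smoothing * vocab))


def pick_word(prev_word, next_word, phonemes):
    """Choose the homophone that best fits the words on either side."""
    def score(candidate):
        return (bigram_log_prob(prev_word, candidate)
                + bigram_log_prob(candidate, next_word))
    return max(HOMOPHONES[phonemes], key=score)


print(pick_word("over", "now", "DH EH R"))
print(pick_word("in", "house", "DH EH R"))
```

The two candidates sound identical, so only the surrounding words can separate them; this is the sense in which a language model distinguishes between words that sound the same.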
Initially, the Hidden Markov Model (HMM) was widely adopted as an acoustic modelling approach. However, it has largely been replaced by deep neural networks. The use of deep learning in speech recognition has had the effect of significantly lowering the word error rate.
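To give a feel for the HMM approach, here is a minimal sketch of Viterbi decoding, the standard way of finding the most likely sequence of hidden states in an HMM. It assumes a toy model in which the hidden states are the phonemes of "cat" and each audio frame has already been reduced to one of three made-up labels. All probabilities are invented for illustration; real acoustic models work on continuous audio features.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               state that path came from)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
        best.append(column)

    # Trace back from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


# Hypothetical toy model: phoneme states and coarse frame labels.
STATES = ["k", "ae", "t"]
START_P = {"k": 0.8, "ae": 0.1, "t": 0.1}
TRANS_P = {
    "k": {"k": 0.5, "ae": 0.5, "t": 0.0},
    "ae": {"k": 0.0, "ae": 0.5, "t": 0.5},
    "t": {"k": 0.0, "ae": 0.0, "t": 1.0},
}
EMIT_P = {
    "k": {"burst": 0.7, "vowel": 0.1, "stop": 0.2},
    "ae": {"burst": 0.1, "vowel": 0.8, "stop": 0.1},
    "t": {"burst": 0.2, "vowel": 0.1, "stop": 0.7},
}

frames = ["burst", "vowel", "vowel", "stop"]
print(viterbi(frames, STATES, START_P, TRANS_P, EMIT_P))
```

The decoder assigns one phoneme to each frame, so a sound that lasts for several frames simply repeats in the output.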
Word error rate
A key measure of speech recognition technology is its accuracy, which is commonly expressed as the word error rate (WER): the lower the WER, the more accurate the transcription. A number of factors can affect the WER, for example different speech patterns, speaking styles, languages, dialects, accents and phrasings. The challenge for the software algorithms that process and organise audio into text is to address these effectively, whilst also being able to separate the spoken audio from the background noise that often accompanies the signal.
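WER is conventionally calculated as (S + D + I) / N, where S, D and I are the numbers of substituted, deleted and inserted words needed to turn the recognised text into a reference transcript, and N is the number of words in the reference. The sketch below is one simple way to compute it using a word-level edit distance. The function name is our own, and the lower-casing and whitespace splitting are simplifying assumptions; evaluation tools usually apply more careful text normalisation.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six in the reference.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, the WER can exceed 100% when the recognised text is much longer than the reference.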
The application of speech recognition
Thanks to laptops, tablets and smartphones, together with the rapid development of AI, speech recognition software has entered all aspects of our everyday life. Examples include:
Voice assistants integrate with a range of different platforms and enable us to command our devices just by talking. At the personal level, examples include Siri, Alexa and Google Assistant. In the office, they can be used to complement the work of human employees by taking responsibility for repetitive, time-consuming tasks and allowing employees to focus their energy on higher-priority activities.
Speech recognition technology is not only affecting the way businesses perform daily tasks but also how their customers are able to reach them. Voice search is typically used on devices such as smartphones, laptops and tablets, allowing users to speak a search query instead of typing it into a search engine. Spoken queries tend to be more conversational than typed ones, so they introduce new voice search keywords and can produce a different search engine results page (SERP).
Speech to text solutions
And finally, the most significant area as far as business users are concerned is speech to text software. This area is growing rapidly, due in no small part to the availability of cloud-based solutions that enable users to access fully featured versions of speech to text apps from their smartphones or tablets, irrespective of their location. Furthermore, speech recognition technology can reduce repetitive tasks and free up professionals to use their time more productively, whilst also allowing businesses to save money by automating processes and completing administrative tasks more quickly.