Tech Explained: Voice Recognition

AI-powered voice recognition systems are getting closer to human accuracy. This 2-minute read will bring you up to speed with the technology behind digital assistants such as Apple’s Siri and Amazon’s Alexa.

Voice-search-enabled digital assistants like Apple’s Siri, Microsoft’s Cortana and Amazon’s Alexa are quickly becoming the go-to search mode for many people. In fact, consumer research firm ComScore has reported that by 2020, fully half of all searches will be voice searches. The voice recognition technology that makes voice search possible has taken a long time to develop. Early researchers were hampered by the need for software that could cope with the wide variation in human speech, including differences in dialect, speed, emphasis, and gender.

Until the 1990s, most voice recognition systems required the speaker to speak slowly and clearly, and were prone to errors like turning ‘Give me a new display’ into ‘give me a nudist play’. More recently, artificial intelligence software and an increase in computing speeds have allowed voice recognition to develop into a practical, everyday tool. At Springwise, we have covered voice recognition software used for banking, reading bedtime stories and dispensing medication.

So how does it work? The analog sound waves of speech are first converted into digital data and adjusted to account for differences in pitch, volume and speed. The signal is then divided into small segments, which are matched to known phonemes (a phoneme is a distinct unit of sound used in speech; there are around 44 phonemes in the English language). The software then feeds each phoneme into a statistical model, which compares it with a huge library of words and phrases. In the most common approach, the hidden Markov model, the software assigns a probability score to each candidate phoneme to assess the likelihood that it is the correct one. The program then analyses what came before and after the phoneme to determine whether that candidate is correct.
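To make the probability scoring concrete, here is a minimal sketch of Viterbi decoding, a standard way of finding the most likely sequence of hidden states in a hidden Markov model. It is an illustration only, not the code of any particular assistant. The two phoneme labels, the "nasal" and "vowel" segment labels and every probability in the tables are invented for the example.

```python
import math

# Hypothetical toy model: two phonemes and two kinds of acoustic segment.
# Every probability here is made up for illustration and must be non-zero.
STATES = ("n", "uw")
START_P = {"n": 0.6, "uw": 0.4}
TRANS_P = {"n": {"n": 0.3, "uw": 0.7}, "uw": {"n": 0.4, "uw": 0.6}}
EMIT_P = {"n": {"nasal": 0.8, "vowel": 0.2}, "uw": {"nasal": 0.1, "vowel": 0.9}}
OBSERVATIONS = ["nasal", "vowel", "vowel"]


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence and its log probability."""
    # Each column maps a state to (best log score so far, previous state).
    first = observations[0]
    columns = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s][first]), None)
        for s in states
    }]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Pick the best predecessor for state s at this step.
            score, prev = max(
                (columns[-1][p][0] + math.log(trans_p[p][s]), p)
                for p in states
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        columns.append(column)

    # Walk backwards from the best final state to recover the path.
    last = max(states, key=lambda s: columns[-1][s][0])
    path = [last]
    for column in reversed(columns[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, columns[-1][last][0]


path, log_prob = viterbi(OBSERVATIONS, STATES, START_P, TRANS_P, EMIT_P)
print(path)  # ['n', 'uw', 'uw']
```

The score for each phoneme depends on both the sound segment itself and the phoneme chosen for the previous segment, which is how context before and after a sound influences the final decision.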

All voice recognition models are trained on large amounts of real human speech. The software makes its ‘guesses’ about what has been said, and calculates the odds that it has found the right word or phrase, based on what it has seen in that training data. This training greatly reduces the software’s guesswork and increases accuracy. The more training, the greater the accuracy. Accuracy can also be improved by teaching the software what types of words the speaker is likely to use.
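One simple way to picture how training data shapes those odds is a bigram word model, which counts how often each word follows another. The sketch below is an assumption-laden illustration, not the method used by any specific product: the four-sentence corpus is invented, and add-one smoothing is just one common way of handling word pairs never seen in training.

```python
from collections import Counter, defaultdict

# Hypothetical training sentences, invented for this example.
TOY_CORPUS = [
    "give me a new display",
    "show me a new window",
    "give me a new file",
    "open a new display",
]


def train_bigrams(sentences):
    """Count how often each word follows the previous one."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()  # <s> marks sentence start
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts


def phrase_score(counts, phrase, vocab_size, alpha=1.0):
    """Smoothed bigram probability of a phrase (add-alpha smoothing)."""
    words = ["<s>"] + phrase.lower().split()
    score = 1.0
    for prev, word in zip(words, words[1:]):
        seen = counts[prev][word] if prev in counts else 0
        total = sum(counts[prev].values()) if prev in counts else 0
        score *= (seen + alpha) / (total + alpha * vocab_size)
    return score


counts = train_bigrams(TOY_CORPUS)
vocab_size = len({w for s in TOY_CORPUS for w in s.lower().split()})

likely = phrase_score(counts, "give me a new display", vocab_size)
unlikely = phrase_score(counts, "give me a nudist play", vocab_size)
print(likely > unlikely)  # True
```

Because "a new display" appears in the training sentences and "a nudist play" does not, the first phrase scores higher. Adding more sentences, or sentences from the vocabulary a particular speaker tends to use, shifts these scores further.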

The use of deep learning algorithms has also greatly improved accuracy. Microsoft recently announced that its latest speech-recognition system, which uses six neural networks running in parallel, has brought error rates down to just 5.9 percent. This is the same rate as that achieved by human transcribers.
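Error rates for speech recognition are usually reported as word error rate: the number of word substitutions, insertions and deletions needed to turn the system's output into the correct transcript, divided by the number of words in that transcript. The function below is a minimal sketch of that calculation, assuming simple whitespace tokenisation and case-insensitive matching. The function name is chosen for this example, and real evaluations normalise text more carefully.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                         # word missing from output
                curr[j - 1] + 1,                     # extra word in output
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("give me a new display", "give me a nudist play"))  # 0.4
```

In this example two of the five reference words are wrong, giving a rate of 40 percent. A rate of 5.9 percent corresponds to roughly six errors per hundred words.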

Despite these advancements, researchers are hoping to achieve even more from voice recognition. The Defense Advanced Research Projects Agency (DARPA) and other organisations are working to develop a universal, simultaneous translator. This will require a program capable of handling a range of slang, dialects, complex grammatical structures, accents and background noise. Will AI one day allow the creation of speech recognition software that can truly understand speech?