Big Data

IBM speech recognition becoming as accurate as humans

Wednesday, March 8, 2017

IBM has used the SWITCHBOARD linguistic corpus to set a new record-low word error rate in speech recognition.

According to IBM, humans mishear or misunderstand 5 to 10% of the words they hear from other people in a typical conversation. That might seem like a lot, but our minds compensate so well that we rarely notice. Computers mishear words too, and their job is harder: they have to piece sentences together without the help of our well-adjusted brains.

IBM has just published a blog post outlining its latest achievement in its quest for perfect conversational speech recognition. The company has built a system that has “reached a new industry record of 5.5 percent” in word error rate, the share of words the software gets wrong in a given conversation. Last year IBM reached a major milestone in this area with a system that achieved a word error rate of 6.9%. It was on the SWITCHBOARD linguistic corpus that IBM achieved its newest record, which brings the company closer than ever to what it considers the human error rate, 5.1%.
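
Word error rate is conventionally computed as the number of word substitutions, deletions and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of words in the reference. The sketch below illustrates that arithmetic under simple assumptions: whitespace tokenization, case-sensitive matching, and a hypothetical function name `word_error_rate`. It is not IBM's evaluation tooling, and real benchmarks apply additional text normalization before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the quick brown fox jumps over the lazy dog"
    hypothesis = "the quick brown fox jumped over the lazy dog"
    # One substituted word out of nine reference words.
    print(f"{word_error_rate(reference, hypothesis):.1%}")  # prints 11.1%
```

By this measure, a 5.5% word error rate corresponds to roughly one error in every 18 words of the reference transcript.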

“To reach this 5.5 percent breakthrough, IBM researchers focused on extending our application of deep learning technologies. We combined LSTM (Long Short Term Memory) and WaveNet language models with three strong acoustic models. Within the acoustic models used, the first two were six-layer bidirectional LSTMs. One of these has multiple feature inputs, while the other is trained with speaker-adversarial multi-task learning. The unique thing about the last model is that it not only learns from positive examples but also takes advantage of negative examples - so it gets smarter as it goes and performs better where similar speech patterns are repeated,” wrote George Saon, Principal Research Scientist at IBM, in the blog post.

“The ability to recognize speech as well as humans do is a continuing challenge, since human speech, especially during spontaneous conversation, is extremely complex. It’s also difficult to define human performance, since humans also vary in their ability to understand the speech of others. When we compare automatic recognition to human performance it’s extremely important to take both these things into account: the performance of the recognizer and the way human performance on the same speech is estimated,” said Julia Hirschberg, professor and chair of the Department of Computer Science at Columbia University.

IBM worked with its partner Appen to reproduce human-level results and determined that human performance, at 5.1%, is still slightly better than the machine’s. Others in the industry have been chasing this milestone for a while, and some have recently claimed to have reached it, putting the human error rate at 5.9%. IBM’s 5.5% is a significant breakthrough, but the 5.1% figure shows there is much more ground to cover before anyone can truly say the technology is on par with humans. Finding a standard measurement for human parity across the industry is more complex than it seems, and IBM argues that the industry must remain accountable to the highest standards of accuracy when measuring it.