Friday, 20 February 2009

Freedom of Speech

Speech-based transcription, dictation, and translation capabilities have improved markedly over the past decade due to refined algorithms, increased computational power, and general advancements in digital audio and recording equipment. As a result, desktop speech recognition technology now promises to enhance the productivity and effectiveness of the workforce not only in niche markets but also in the more general areas of office/process automation and multimodal applications. In addition, speech recognition is appearing in automotive systems, customer service solutions, and an increasing array of consumer products such as mobile phones.

Speech recognition is generally defined as the ability of a machine or computer program to recognise and carry out voice commands or to take dictation. Speech recognition involves the ability to match a voice pattern against a provided or acquired vocabulary. In the early days of speech recognition, systems usually held a very limited vocabulary, as computing resources and the sophistication of the algorithms in use were not what they are today. Now, however, a large vocabulary (200,000+) is provided with most products, and the user can add extra words, phrases, or specialised vocabularies as required.

Research has been carried out for many years, and although commercial products were available, it was not until the 1990s that vendors started to really understand the way this technology could be applied to business processes, usage patterns, and niche requirements in greater detail. It is interesting to note how the cost of this technology has dropped over the years, beginning in 1990 when DragonDictate for DOS was US$9,000 for a single-user licence. By 1997 the price had fallen to around US$1,000, and today Amazon is selling the entry-level version of Dragon NaturallySpeaking 10 for around US$60.

High prices and issues relating to accuracy and usability, i.e. long periods of system ‘training’, meant that the business adoption of desktop speech recognition got off to a very slow start. The market for PC and server-based speech recognition was around the US$60 million mark in 2007, with services bringing the market total up to around US$200 million. Sales were fairly flat before this point, as early adopters found that the technology did not meet their expectations; it therefore failed to ‘cross the chasm’ into mainstream use and remained a niche product in specialised markets. It is, perhaps, somewhat ironic to note that one of the key factors behind recent sales increases is people adopting desktop speech recognition tools to ease computer-related ailments, such as RSI. Previously these solutions were used to help individuals with disabilities or accessibility issues to use a computer.

My own first contact with speech recognition technology came in the mid-1990s when I evaluated an early system from Marconi Speech & Information Systems. I can remember sitting in a quiet office wearing an expensive microphone headset and being asked to read out three-digit numbers displayed on a computer screen. Without any form of training or tuning the system recognised 19 out of 20 numbers – 95% accuracy. However, with the office door and window open this fell to 85% (and I worked in a relatively quiet office building). Even though the technology seemed like magic at the time, the system was a non-starter for the warehousing application we had envisaged, as delicate headsets and quiet conditions were quite incongruous with the hostile environment of the factory floor.

Herein lies one of the persistent challenges for even today’s speech recognition systems: noise. To get reasonable accuracy with speech recognition software it is usually necessary to place the microphone close to the mouth by wearing a quality headset or using a handheld microphone, which immediately precludes the casual use of this technology. Having said that, we are now starting to see array microphones coming down in price (UK£145 excluding VAT), and in some cases appearing in notebook computers. With such devices, users can be located up to two feet away from the microphone and their voices can still be recognised. However, for even greater flexibility, my advice would be to consider an over-the-ear Bluetooth headset. For around UK£150 (excluding VAT), organisations can supply users with a lightweight, high-end, wireless headset that can be used with mobile phones, Internet telephony applications, and speech recognition software.

For some speech recognition scenarios, such as digital dictation, the user need not even be present when their voice is being processed. By recording their voice into a digital recording device (costing approximately UK£100), while either at their desk or on the move, users submit the digital dictation file and have this transcribed by a server, elsewhere, at another time. This does not mean, however, that today’s general purpose speech recognition software can be used to transcribe meetings or interviews, as most products tend to be speaker-dependent, i.e. they are trained to recognise the voice of a single user and cannot distinguish between individuals having a conversation.

The accuracy of speech recognition has improved significantly in recent years, with the market leader in the desktop arena, Nuance, claiming accuracy rates of up to 99% with its latest offering. However, in real-life dictation situations, users should expect 3 to 4 errors per minute, given that the average person speaks around 130 words per minute. Industry-specific vocabularies, and the ability to centrally manage a customer word list, also make applications based on this technology far more usable than was previously the case. The legal and medical professions are especially well catered for in this regard, as off-the-shelf products can be purchased with preconfigured vocabularies of over 30,000 legal-specific terms and phrases. In the medical field, tailored speech recognition products are starting to become commonplace, and are often integrated with Electronic Medical Record (EMR) systems.
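To put the accuracy figures above side by side, here is a rough back-of-the-envelope sketch in Python. It assumes that every error is a single misrecognised word, which is a simplification of my own rather than anything the vendors state, and the function names are purely illustrative.

```python
# Rough arithmetic relating speaking rate, accuracy, and errors per minute.
# Assumption: one error = one misrecognised word (a simplification).

def errors_per_minute(words_per_minute: float, accuracy: float) -> float:
    """Expected word errors per minute at a given word-level accuracy (0..1)."""
    return words_per_minute * (1.0 - accuracy)


def implied_accuracy(words_per_minute: float, errors: float) -> float:
    """Word-level accuracy (0..1) implied by an observed error count per minute."""
    return 1.0 - errors / words_per_minute


WPM = 130  # average speaking rate quoted in the text

# The headline claim of 99% accuracy
print(round(errors_per_minute(WPM, 0.99), 2))    # 1.3

# The real-life expectation of 3 to 4 errors per minute
print(round(implied_accuracy(WPM, 3) * 100, 1))  # 97.7
print(round(implied_accuracy(WPM, 4) * 100, 1))  # 96.9
```

On these assumptions, a 99% claim works out at a little over one error per minute, while 3 to 4 errors per minute corresponds to word accuracy of roughly 97%.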

‘Speech recognition’ is a somewhat generic term, and so it is not uncommon to hear vendors, system integrators, and solution providers using the following terms when describing voice- and speech-based offerings: Speech Synthesis, Intelligent Speech Interpretation, Telephony Speech Recognition, Automatic Speech Recognition, Speech Processing, and Voice Processing. The use of so many terms clearly highlights the sophistication of the market, and the range of offerings in this space. A domain with a lot of jargon usually has a services market attached to it too, and while the overall revenue for desktop speech recognition-based products looks likely to grow by over 20% in the next five years, I expect the speech technology market as a whole to grow even more as the technology finds its way into more areas and products.

Despite Microsoft’s efforts in the desktop arena, it was Dragon NaturallySpeaking (sold by Nuance Communications) and ViaVoice from IBM that established speech recognition as a viable business tool. In 2003, IBM gave ScanSoft (which owned the competing product Dragon NaturallySpeaking) exclusive global distribution rights to ViaVoice for the desktop (IBM continues to offer Embedded ViaVoice for use in mobile devices and automotive products). Then, two years later, Nuance Communications merged with ScanSoft to form Nuance.

Despite the dominant position held by Nuance in the speech recognition market, there are several vendors participating in the three main desktop speech recognition markets of dictation, transcription, and command and control (see Figure 2). Dictation is probably the most visible market at present due to the marketing of products such as Dragon NaturallySpeaking. This real-time use of speech recognition has benefited significantly from the advent of faster processors and the introduction of new instruction set innovations by Intel and AMD.

The digital dictation industry has developed through advances in recording devices and is gradually replacing older tape-based recording systems. Now, with digital dictation, the process is much slicker: the dictator speaks into a digital recording device, PC, or even a smartphone; a digital audio file is created; and then this is transferred (if not already there) to a PC via a USB port or wireless connection. Once the file is stored on the computer it can easily be transcribed there and then, or put into a workflow system from where it can be sent via the network to be processed elsewhere. Philips Speech Recognition Systems (acquired by Nuance in September 2008 for €66 million) claims that it can scale to 15,000 users and 1,500 hours of dictation throughput per day with its network-based SpeechMagic offering, and the solution works in Citrix/Microsoft Terminal Server environments too.

Of course, the main challenge for vendors is persuading users that speech recognition has improved in terms of accuracy and latency. They need to prove that the technology has matured, and that deployments will provide immediate value to the user rather than being a hindrance to productivity.

As with handwriting recognition, speech recognition technology is maturing rapidly, and software solutions are now industrial-grade. With a Compound Annual Growth Rate (CAGR) of around 23% between 2008 and 2013 (source: Datamonitor), the desktop speech recognition market looks set to bring this technology to a wider audience. Although Nuance is the headline act of the speech recognition market today, there are other players in the game. IBM, Loquendo, and LumenVox are ingraining their speech recognition engines into the fabric of enterprise applications and IT infrastructures. The keyboard may well become a thing of the past in the years ahead.
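For readers who want to see what a CAGR of around 23% over the five years from 2008 to 2013 means in practice, the following Python sketch compounds the rate year by year. It uses an index of 100 for 2008 as a purely hypothetical starting point, since the forecast's actual base figure is not quoted here, and the function names are my own.

```python
# Compound growth sketch for the quoted CAGR of ~23% (2008 to 2013).
# Assumption: the 2008 market is indexed at 100; this is not a real revenue figure.

def growth_multiple(cagr: float, years: int) -> float:
    """Overall multiple after compounding `cagr` for `years` years."""
    return (1.0 + cagr) ** years


def cagr_from_values(start: float, end: float, years: int) -> float:
    """Recover the compound annual growth rate from start and end values."""
    return (end / start) ** (1.0 / years) - 1.0


BASE_INDEX = 100.0  # hypothetical 2008 index value
CAGR = 0.23
YEARS = 5

# Print the indexed market size for each year of the forecast period
for year in range(YEARS + 1):
    print(2008 + year, round(BASE_INDEX * growth_multiple(CAGR, year), 1))

# Overall growth multiple across the five years
print(round(growth_multiple(CAGR, YEARS), 2))  # 2.82
```

In other words, if the forecast holds, the market would be a little under three times its 2008 size by 2013.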
