IT'S GOOD TO TALK

Everyone's talking about the latest speech recognition systems and the prices have come tumbling down. In theory any of them should whisk translators one step closer to heaven. But how do they shape up in practice? The ITI Bulletin tested all the major systems on the market over a three-month period and evaluated their performance specifically as a productivity aid for translators. The good news is that, though still far from perfect, speech recognition is definitely a workable option and has been for several years. What's more, it's now easier to use than ever. So, here's everything you need to know to make 1999 a more productive and comfortable year.

What do translators stand to gain?

First, there is the productivity argument. The ITI Rates & Salaries Survey showed that in some cases translators using speech recognition achieved outputs of almost twice the average. Much of that was unquestionably due to experience. It naturally helps productivity if you're not continually stopping to peer in a dictionary or stumbling over how to unravel your syntax. Likewise, dictation simply isn't suited to a small minority who find they need to see and play with the words on the screen.

Not having to do so, however, is one of the main reasons why dictation is so successful a productivity aid. It means you can keep your eyes on the source text. You don't need to look at the keyboard, because you're not using it. And you don't have to look at the screen because you know every word will be spelled correctly. Because you're not constantly switching from one text to another, you're less likely to lose your place or skip paragraphs by mistake. As a result you don't waste valuable minutes over the day finding your place again.

Then, naturally, there's the fact that most of us can talk faster than we can type. But talking also has a further advantage - a quality advantage. Many people who dictate comment that they are far less likely to come out with a clumsy construction that inelegantly bridges the gap between two languages if they actually have to say it out loud.

Finally, your body and mind will thank you. The best programs available today allow you to format and navigate your texts by voice as well as dictating them, meaning that in one fell swoop you get rid of the two major causes of repetitive strain injury - keyboard work and mouse work. Instead of being hunched up over your keyboard peering at the screen, you can lean back, printed text in hand, or even move about while you're working. There will soon even be cordless microphones on the market, so that while dictating your translation of a Russian novel, you'll also be able to pace up and down like the characters in it. You could start off a whole new school of method translation. Even then, you'll still feel much more relaxed after a hard day's work.

Why not simply use a typist?

Well, firstly, there's the sordid matter of cost. Then there's the fact that unless you have a typist physically present in your office, some short-to-medium length jobs requiring a fast turnaround simply aren't suitable for "human" dictation.

Until recently, a human typist certainly had the edge for people who need to dictate on the move, such as translators who are also interpreters. But now that has changed too, with Dragon, IBM and Lernout & Hauspie all offering solutions that allow you to combine their systems with portable dictation devices. You'll find more information on this below and in Roger Fletcher's review of the Olympus recorder.

Sounds too good to be true, eh? Well, in some cases it is. You can't treat a computer like a human typist. It won't accept vague commands like "do a table more or less like the one on page 5". And above all you have to speak more clearly than you can get away with if there's some poor human being on the other end deciphering your ramblings.  This can mean that you end up dictating slower, at least in the beginning. In some cases your computer will make even worse mistakes than the most ignorant typists.  The results can sometimes be hilarious (which means you have to be particularly vigilant and make sure that it's YOU that spots them). But, generally speaking, you only have to correct the system once or twice and it won't make the same mistakes again.  This is a particular advantage for translators working in very technical fields.

To get the best out of the systems, you have to teach them how you talk and learn the most effective way of talking to them. It's a bit like marriage really. Except that using speech recognition keeps getting easier. Not so long ago you had to perfect the art of "discrete speech". That basically involved talking like a Dalek because the systems available at the time were only able to recognise one word at a time. Thankfully, things have moved on since then, and everyone now offers some form of continuous speech recognition that allows you to talk more or less like a normal human being. It takes a little perseverance in the beginning but you'll soon reap the rewards. In short, don't expect miracles and you'll be more than satisfied.  But then that applies to most things, doesn't it?


What you need, what you get and why

This is the technical bit. If you're an information technology boffin, lick your lips and read on. If you aren't, save it for late-night bedtime reading. Hopefully you'll understand it, but if not it will at least help you get to sleep.

All the systems basically come with five components: a recognition engine with associated correction utility, an enrolment facility with scripts, a microphone wizard, a vocabulary expander and a text to speech facility.

Vroom, vroom!

The recognition engine is the animal that does all the work. It basically works in two ways. Firstly, it recognises the sound of the words you speak. To do this it needs a sound card and microphone, more of which later. The problem here is, of course, that everyone speaks differently. By which I mean not only that one person speaks differently from another, but that we don't always pronounce the same words in the same way. This causes human beings problems with understanding one another, so you can imagine the furrowed brow and puzzled expression it's likely to inspire in your computer.  All the systems on test get round this problem in precisely the same way that we do ourselves.  They analyse what the speaker is likely to say.  Meaning they look at the words in context.

This is both simpler and more complex than you'd suppose. Essentially, the systems analyse three words at a time in what is, surprise surprise, referred to as a trigram recognition pattern. The words selected are provisional or "infirm" until confirmed by the contextual analysis of the words around them. Because the sets of three overlap, the choices of up to five different words could be shuffled around at the same time. To borrow an example of IBM's, if the system hears "I rode..." it would carry on quite delighted with itself if it then heard "my bike", but quickly correct "rode" to "rowed" if it heard "my boat" instead.
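For the boffins, the "rode/rowed" example can be sketched in a few lines of Python. This is a toy illustration only, not a description of how any of the products reviewed works internally: the trigram counts are invented, and the names TRIGRAM_COUNTS, HOMOPHONES and pick_word are hypothetical.

```python
# Toy illustration of trigram-based disambiguation of soundalike words.
# The counts below are invented for the example, not taken from any real system.
TRIGRAM_COUNTS = {
    ("rode", "my", "bike"): 40,
    ("rowed", "my", "bike"): 0,
    ("rode", "my", "boat"): 1,
    ("rowed", "my", "boat"): 25,
}

# Words the recogniser cannot tell apart by sound alone (hypothetical table).
HOMOPHONES = {"rode": ["rode", "rowed"], "rowed": ["rode", "rowed"]}


def pick_word(provisional, next_two_words):
    """Return the soundalike of `provisional` that best fits the next two words."""
    candidates = HOMOPHONES.get(provisional, [provisional])
    w2, w3 = next_two_words
    # Choose the candidate whose trigram has been seen most often.
    return max(candidates, key=lambda w: TRIGRAM_COUNTS.get((w, w2, w3), 0))


print(pick_word("rode", ("my", "bike")))  # rode
print(pick_word("rode", ("my", "boat")))  # rowed
```

The provisional word stays "infirm" until the two words after it have been heard, at which point the most probable candidate wins. A real engine would combine such statistics with the acoustic evidence rather than rely on word counts alone.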

As you can now imagine, the complex bit concerns the mass of statistics regarding the probability of different words occurring next to one another, and the processing carried out using these statistics in conjunction with the voice files for each individual speaker.  This work is performed by the speech engine, which improves in performance in relation to processing power, system RAM and the way your speech files (word usage statistics) and voice files (sound) are fine tuned as you use the system.

Getting it right

The correction utility is an essential part of this fine-tuning process. It allows you to correct misrecognised words and therefore refine the statistics, and also "train" words by speaking them into the system to refine your voice files. The systems will typically misrecognise between 10 and 20 words in every 100, so the extent to which the correction utility is user-friendly has a significant influence on productivity. That said, the correction process can be built into a translator's first check of their translation, so that you correct words at the same time as you correct your own translation against the source text. This adds very little time to the overall process and means that one final read-through is all that's required for any additional stylistic editing and quality control.
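To put the 10 to 20 words in every 100 into perspective, here is a back-of-an-envelope calculation in Python. The job length (2,000 words) and the time per correction (5 seconds) are assumed figures chosen purely for illustration, and the function names are hypothetical.

```python
def expected_corrections(word_count, errors_per_100):
    """Estimate how many words will need correcting in a dictated text."""
    return word_count * errors_per_100 / 100


def correction_minutes(word_count, errors_per_100, seconds_per_fix):
    """Estimate the time spent on corrections, in minutes."""
    return expected_corrections(word_count, errors_per_100) * seconds_per_fix / 60


# Hypothetical job: 2,000 words at 5 seconds per correction (both assumed).
for rate in (10, 20):
    fixes = expected_corrections(2000, rate)
    minutes = correction_minutes(2000, rate, 5)
    print(f"{rate} errors per 100 words: {fixes:.0f} corrections, "
          f"about {minutes:.0f} minutes")
```

Halving either the error rate or the time per correction halves the total, which is why both recognition accuracy and a user-friendly correction utility matter.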

One note of warning: when using these systems it's very important to separate editing from correction. If you change your mind about how you have translated or worded a phrase, always dictate it afresh over the previous translation. Never use the correction facility to make these changes, since this could corrupt your speech and/or voice files and cause a significant deterioration in recognition accuracy over time.

Sending out the all right signals

The microphone wizard does exactly what you'd expect it to. It makes sure your microphone is set up correctly. This is absolutely essential, because the cleaner the signal you put into the system, the better the results you'll get out of it. None of the wizards works perfectly, however, and this is one of those areas where an expert's advice can make all the difference.

The enrolment facility speeds up the learning process. On most of the systems it works in two parts. The first requires you to read for anything between fifteen minutes and half an hour from one or more scripts that are then processed by the computer to build your voice files, thereby ensuring good recognition from the start.  The second part allows you to perform additional enrolments from other scripts as and when you please. These will help further improve recognition accuracy. That said, it's usually a good idea to just complete the long initial enrolment and then enrol all over again from scratch after you have been using the system for one or two months. By then you will have a much clearer idea of how you need to talk to the system and this will give you even better results.

Show it what you want

The vocabulary expander does a similar job of speeding up the acquisition of statistics about your vocabulary usage. You use it to open existing documents on the same subject as the one you are about to dictate. These documents are then fed into the vocabulary expander, which analyses the statistics, on the one hand, and offers you the option of training any words for which it does not have any voice file information, on the other. You can do this as soon as you get the system to ensure higher recognition accuracy, but it's also a handy tool when changing from one subject area to another.  Using it this way minimises any hiccups when you switch from completing a long dictation about automotive technology to a new one about Tuscan regional cooking, for example.  If you don't use the vocabulary expander, the system will still be expecting to hear you talk about automotive technology and the resultant misrecognitions could create some very strange recipes for the first few minutes' dictation.
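In outline, what a vocabulary expander does with your documents might look something like the Python sketch below. It is a simplified guess at the principle, not any vendor's actual method: the function name analyse_documents, the sample sentences and the tiny "known words" list are all made up for the example.

```python
import re
from collections import Counter


def analyse_documents(texts, known_words):
    """Count word usage in `texts` and list words missing from `known_words`."""
    counts = Counter()
    for text in texts:
        # Lower-case and split into simple alphabetic tokens (a simplification).
        counts.update(re.findall(r"[a-z']+", text.lower()))
    unknown = sorted(word for word in counts if word not in known_words)
    return counts, unknown


# Hypothetical existing vocabulary and two sample documents on the new subject.
known = {"the", "a", "and", "oil", "add", "to", "engine"}
docs = ["Add the pancetta to the soffritto.", "Add oil and a ribollita."]

counts, to_train = analyse_documents(docs, known)
print(counts.most_common(2))  # usage statistics for the new subject area
print(to_train)               # words the user would be offered for training
```

The counts stand in for the usage statistics, while the list of unknown words corresponds to those for which the system has no voice file information and which you would be invited to train.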

Giving as good as it gets

Finally, the text to speech facility uses a voice synthesiser to read texts back to you. This can be a very useful productivity tool in its own right and, indeed, a more sophisticated version is available as a stand-alone product from Talking Technologies, reviewed below. The drawback is that the voices the systems use make the Daleks sound like Pavarotti. You'll either get used to it and love having something else that gets you away from the screen or hate the poor little thing and never use it again. A further disadvantage is that the systems are unable to recognise uncommon or specialist technical terms. Talking Technologies' "Talk Back" product is the only exception, offering an additional module that allows you to teach it new words. Note, however, that there is a disadvantage to using these systems to check your dictations; namely that they won't help you detect "soundalike" recognition errors.

What you need to get up and running

So, that's what you get. The question now is, what do you need to get the most out of it?  The bad news for Mac owners is that you need a PC.

All the manufacturers suggest a minimum specification. These are a waste of time. Anything less than a Pentium 200 MHz MMX and it's not really worth giving any of the systems a try, although if you've got a 166 it will give decent enough performance for you to evaluate whether you want to upgrade your PC for speech recognition alone. Dragon's Naturally Speaking is an exception to this rule, but only if you use it without Best Match technology, which basically means using it with a bigram recognition system (the speech engine analyses two words at a time) instead of the trigram system described above.  Recognition performance will be significantly below the system's true capabilities as a result. But you can upgrade performance when you upgrade your PC.

Ideally you'll need a minimum of a 233 MHz processor for them to be accurate and usable, particularly if you want to dictate directly into Microsoft Word. What's more, that's assuming your processor is a Pentium MMX, Celeron with on-die L2 cache, Pentium II or AMD K6/K6-2. The processing performed by these applications is all floating-point-unit intensive, meaning you'll need a higher processor speed if your computer has a Cyrix CPU or, above all, an IDT WinChip. Also check that you have at least 256K of L2 cache.

The story is the same when it comes to RAM. 64 megabytes is your minimum, but frankly I'd recommend at least 128.

If you're in the market for a new machine, however, life is much simpler. Pretty much any entry-level machine comes with a 300 or 350 MHz processor that should cope with most things. Just make sure you opt for 128 MB RAM. You'll also require about 250 MB of free hard disk space and up to 100 MB more during installation and training.
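The rules of thumb above can be boiled down to a simple checklist. The Python sketch below is only an illustration of the figures quoted in this article (233 MHz, 64 MB minimum and 128 MB recommended RAM, 250 MB plus up to 100 MB of disk space, 256K of L2 cache); the function name check_pc is hypothetical, and it deliberately ignores processor type, so the caveat about Cyrix and WinChip CPUs still applies.

```python
def check_pc(cpu_mhz, ram_mb, free_disk_mb, l2_cache_kb):
    """Return a list of shortfalls against the article's rule-of-thumb figures."""
    problems = []
    if cpu_mhz < 233:
        problems.append("processor below 233 MHz")
    if ram_mb < 64:
        problems.append("less than 64 MB RAM")
    elif ram_mb < 128:
        problems.append("RAM meets the 64 MB minimum but 128 MB is recommended")
    # 250 MB for the program plus up to 100 MB during installation and training.
    if free_disk_mb < 350:
        problems.append("less than 350 MB free disk space")
    if l2_cache_kb < 256:
        problems.append("less than 256K of L2 cache")
    return problems


# Hypothetical machine: Pentium 200 MMX, 64 MB RAM, 400 MB free, 512K cache.
for problem in check_pc(200, 64, 400, 512):
    print(problem)
```

An empty list means the machine clears every hurdle mentioned here; it says nothing about the sound card or microphone, which matter just as much.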

Where you need to be careful buying a new machine is where the sound card is concerned. DON'T get a machine with an on-board sound chip. The manufacturers all recommend that you get a Sound Blaster compatible card, but the general consensus is that you should avoid the Vibra or Value models with the exception of the Live! Value. The latter is the card that Sound Blaster (Creative Labs) themselves recommend as providing the best performance for speech recognition, although you will not be disappointed with the old Sound Blaster 64 Gold. The Turtle Beach Montego also has a good reputation, as does the Ensoniq.

Last but not least comes the microphone. As the first element in the chain that hopefully provides a clear voice signal for the systems to work on, the importance of a good microphone should not be overlooked, which is why we've provided a separate section on this below. With the exception of L&H Voice Xpress Professional and Dragon's Naturally Speaking Professional, which are both supplied with high-quality microphones, all the systems come with an economy speech recognition microphone that's more than good enough to get you up and running, but which you may want to replace shortly.

As the cherry on the cake to these preliminaries, there is something else you'd be well advised to consider investing in, namely training - particularly if your love of computers is combined with elements of fear and loathing. Approaching a specialist reseller who can optimise your installation and provide, say, two half-days' training will cost you more than buying the system itself but save you much fretting and heartache while ensuring you benefit from significant productivity increases much more quickly. There's no doubt that someone who isn't computer shy will learn how to fine-tune and make the most of the system on their own, but getting a specialist who can show you all the tricks immediately will save a great deal of time and frustrating experimentation.

A word on Word

So much for the theory. Let's get down to how the systems worked in practice.  But first, I'm going to make you wait for one more tantalising moment. There's a good reason for this. Most of us use Microsoft Word, whether we like it or not. Because it's the industry-standard word processor, all the packages reviewed allow you to dictate directly into it. That doesn't, however, mean any of them will necessarily work faultlessly with your own version of Word 97. Indeed, this also applies to any other programs with which you use Word, such as Trados' Translator's Workbench.  Word 97 is a renowned memory-hog that can cause even the most powerful systems to slow down dramatically. Check your version of Word 97 before blaming the other application for poor performance.  Click "Help" and then "About Word" on the Word toolbar.  If you find SR-1 appended to the words Microsoft Word 97, then you'll enjoy optimum performance.  If not, you'll need to get Service Release 1, which can either be ordered from Microsoft or downloaded directly over the Internet.  If you need it, make sure you get it. SR-1 makes a massive difference.

Philips FreeSpeech

Philips were in fact the first company to market a continuous speech recognition product, albeit a highly specialised one not for the mass market.  FreeSpeech, on the other hand, is a recent arrival on this burgeoning market.  It's available for the very competitive price of £29.99 on its own or bundled with the very dinky SpeechMike at £69.99 for a saving of almost £10.

The system was very easy to install using the usual Windows procedures and was also easy to train. Once it was up and working, dictation was surprisingly quick because the system has a dedicated dictation function that disables everything else while it's running, including the mouse and keyboard. The downside to this, and it's a drawback as far as productivity is concerned, is that you have to dictate and format your documents separately. You also have to stop dictating to correct misrecognised words. Correction is a two-stage process: you have to select the word first and then correct it.

Another very nice feature of the Philips system, which it shares with those offered by Dragon and IBM, is that it will play back exactly what you said, just as if it were a sound recorder.  The advantage to this is that when a word is misrecognised you can immediately hear what you said and don't have to think before correcting it. Considering misrecognised words can have absolutely nothing to do with what you actually said, this can save quite a lot of time and that, of course, means increased productivity. Every little bit counts!

Unfortunately, however, this adds a third stage to the correction process.

All the usual functions are provided. There is a vocabulary expander, in this case quite aptly named a ConText Tuner, and correcting misrecognised words calls up a menu from which you can select the correct alternative. Saying "What can I say?", on the other hand, calls up a context-sensitive window displaying a list of all the speech commands that are available in the application concerned.

With the Philips system, you dictate directly into the application of your choice, starting it by saying, for example, "launch Word" or "launch notepad". These commands worked quite reliably and the command and control interface was in general more than acceptable. Spelling during dictation also works well. Accuracy, on the other hand, was about average for a "raw" system, which is to say one that has not yet had time to get used to your voice and subject matters.

Continued in part two.

First published in ITI Bulletin, 1999.