Speech and Speaker Recognition

______________________________________________________________________________________

Introduction

Speech recognition is used by thousands of people every day. Systems such as calling cards and phone banking services use speech recognition by prompting the user to answer questions by voice rather than by pressing digits on the phone pad to send Dual Tone Multi-Frequency (DTMF) signals. Speech recognition is the ability to detect human speech in audio and parse that speech in order to generate a string of words or sounds that represents what a person has said. Speaker recognition is similar to speech recognition, except that in addition to identifying the speech spoken, the system must also identify the individual who spoke.

Modern speech recognition began in the 1950s, when researchers extracted features from speech that allowed them to discriminate between different words. In the 1960s, advances were made in the segmentation of speech into units and in pattern matching. The 1970s led to the techniques used today to design speech recognition systems with high recognition rates. This original research in the 1970s was done by the Defense Advanced Research Projects Agency (DARPA).

Five factors are said to be involved in controlling and simplifying speech recognition:

Any combination of these improves the recognition rate of word sequences spoken.

Speech recognition systems are mostly composed of the following hardware and software components: a speech capture device, which could be a telephone or a microphone; a digital signal processor (DSP), whose main function is to separate speech from noise and so pass less information to the pattern-matching algorithm; reference speech patterns, against which the speech sample is compared; and a pattern-matching algorithm, which chooses the reference pattern that best fits the speech sample in order to identify the spoken words. The speech capture device converts the audio speech to digital form. This digital form is passed to the DSP, which filters out irrelevant noise, and is then stored so that it can be compared with the reference speech patterns by one of the system's pattern-matching algorithms.
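The flow through these components can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product works: the function names, the use of frame energy as the sole feature, the energy threshold standing in for the DSP stage, and the toy reference patterns are all invented for the example.

```python
import math

def split_frames(samples, frame_len):
    """Cut the digitized signal into consecutive fixed-length frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def remove_noise(frames, threshold):
    """Stand-in for the DSP stage (assumption): drop frames too quiet to be speech."""
    return [f for f in frames if frame_energy(f) >= threshold]

def distance(a, b):
    """Euclidean distance between feature lists; infinite if lengths differ."""
    if len(a) != len(b):
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(samples, references, frame_len=4, threshold=0.01):
    """Capture -> DSP -> pattern matching: return the best-fitting reference word."""
    frames = remove_noise(split_frames(samples, frame_len), threshold)
    feats = [frame_energy(f) for f in frames]  # one feature per frame
    return min(references, key=lambda word: distance(feats, references[word]))

# Hypothetical reference patterns: per-frame energies for two words.
references = {"yes": [1.0, 0.25], "no": [0.25, 1.0]}
# One near-silent frame followed by a loud frame and a softer frame.
samples = [0.01, -0.01, 0.01, -0.01, 1, -1, 1, -1, 0.5, -0.5, 0.5, -0.5]
print(recognize(samples, references))  # -> yes
```

The near-silent first frame is discarded before matching, which is the sense in which the DSP stage hands less information to the pattern matcher.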

To understand how speech recognition operates, some background is needed on how speech sounds are produced. A person's vocal cords cause air to vibrate and generate sound waves. These waves travel through the air to the ear, where the brain interprets the sounds. Words consist of speech sounds, which are known as phonemes. Phonemes have characteristics that allow humans to identify them.

Speech recognition, although it continues to be perfected, has reached a point in its history where it can be used successfully in different applications. This is due to cheaper memory and faster processors, which allow for larger vocabularies and higher recognition accuracy.

There are two types of speech recognition: speaker-dependent and speaker-independent.

Speech Recognition (Independent)

In speaker-independent computer systems, phonemes are extracted from the audio provided, converted into ASCII characters, and then assembled into words so that applications using speech recognition can act upon the input. Mathematical formulas and models are used to identify the most likely word spoken. These models match spoken words against known word models and select the one that has the greatest likelihood of being the correct word. In order to identify the "greatest likelihood", large amounts of training data are used to create the models. This type of statistical model is known as the Hidden Markov Model (HMM).

An HMM is characterized by a finite-state Markov model and a set of output distributions. The essence of speech recognition is captured in two types of variability: temporal and spectral. The transition parameters in the Markov chain model the former, while the parameters in the output distributions model the latter.
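As a rough illustration of selecting the word model with the greatest likelihood, the sketch below scores a sequence of observations against two toy discrete HMMs using the standard forward algorithm. The two word models, their probabilities, and the two-symbol observation alphabet (0 for a "low" sound, 1 for a "high" sound) are assumptions made up for the example; real systems train these parameters from large amounts of data and use far richer acoustic features.

```python
def forward_likelihood(obs, start, trans, emit):
    """P(obs | model) for a discrete HMM, computed with the forward algorithm.

    start[s]    : probability of starting in state s
    trans[p][s] : transition probability p -> s (models temporal variability)
    emit[s][o]  : probability that state s outputs symbol o (spectral variability)
    """
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
                 for s in range(n)]
    return sum(alpha)

def most_likely_word(obs, models):
    """Pick the word whose model assigns the observations the highest likelihood."""
    return max(models, key=lambda w: forward_likelihood(obs, *models[w]))

# Hypothetical two-state, left-to-right word models.
left_to_right = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "A": ([1.0, 0.0], left_to_right, [[0.9, 0.1], [0.1, 0.9]]),  # low, then high
    "B": ([1.0, 0.0], left_to_right, [[0.1, 0.9], [0.9, 0.1]]),  # high, then low
}

print(most_likely_word([0, 0, 1, 1], models))  # -> A
print(most_likely_word([1, 1, 0, 0], models))  # -> B
```

Because the self-transition in each state can repeat any number of times, the same model also scores a slower utterance such as [0, 0, 0, 1, 1, 1] sensibly, which is how the transition parameters absorb differences in timing.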

HMMs are based on a sound probabilistic framework and have an integrated framework for simultaneously solving the segmentation and classification problem (the difficulty for a computer in distinguishing speech from silence in order to segment the speech into words), which makes them suitable for continuous speech recognition. In other systems, where detection of middle silence (a pause or some unintelligible utterance in the middle of speech) is difficult, the user is asked to utter each word separately and wait for the system to recognize it. This makes it difficult for users to have a "natural" interface to the machine, one in which the flow of conversation is not interrupted by forced pausing.
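One way to see segmentation and classification being solved together is Viterbi decoding, which finds the single most probable state sequence for an observation sequence. The sketch below is a toy, with invented probabilities: it assumes a two-state HMM (0 = silence, 1 = speech) and observations that are frame energies already quantized to 0 (low) or 1 (high).

```python
import math

def viterbi(obs, start, trans, emit):
    """Most probable state sequence for obs under a discrete HMM.

    Works in log-probabilities, so all probabilities used must be non-zero.
    """
    n = len(start)
    score = [math.log(start[s]) + math.log(emit[s][obs[0]]) for s in range(n)]
    back = []  # back[t][s] = best predecessor of state s at time t + 1
    for o in obs[1:]:
        prev, score, pointers = score, [], []
        for s in range(n):
            best = max(range(n), key=lambda p: prev[p] + math.log(trans[p][s]))
            pointers.append(best)
            score.append(prev[best] + math.log(trans[best][s]) + math.log(emit[s][o]))
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical parameters: states tend to persist; silence is mostly low energy.
start = [0.5, 0.5]
trans = [[0.8, 0.2], [0.2, 0.8]]
emit = [[0.9, 0.1], [0.2, 0.8]]

energies = [0, 0, 1, 1, 0, 1, 1, 0, 0]
print(viterbi(energies, start, trans, emit))  # -> [0, 0, 1, 1, 1, 1, 1, 0, 0]
```

Note that the single low-energy frame in the middle is still labeled as speech: the model treats a brief dip as part of the utterance rather than as a word boundary, so the speaker is not forced to pause between words.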

By considering speech to be an ordered collection of phonemes, it has become easier to recognize speech independently of the speaker's accent. Training is still required, but in speaker-independent systems it is done when the model is constructed, using large samples.

In addition to HMMs, another pattern-matching technique known as dynamic time warping is used. This method compares the preprocessed speech against a reference template by summing the differences between speech frames. Some words are out of alignment with the given template, so the misalignment is corrected by stretching and compressing the speech along the time axis.
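A minimal sketch of dynamic time warping follows. It assumes each speech frame has already been reduced to a single number, which is a simplification for the example (real front ends produce a vector of features per frame); the templates and function names are likewise hypothetical.

```python
def dtw_distance(sample, template):
    """Summed frame differences along the best time alignment of two sequences."""
    inf = float("inf")
    n, m = len(sample), len(template)
    # cost[i][j] = best total cost of aligning sample[:i] with template[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diff = abs(sample[i - 1] - template[j - 1])
            cost[i][j] = diff + min(cost[i - 1][j],      # stretch the template
                                    cost[i][j - 1],      # compress the template
                                    cost[i - 1][j - 1])  # advance both together
    return cost[n][m]

def recognize(sample, templates):
    """Return the word whose template is closest to the sample after warping."""
    return min(templates, key=lambda w: dtw_distance(sample, templates[w]))

templates = {"one": [1, 2, 3], "two": [3, 2, 1]}
slow_one = [1, 1, 2, 2, 3]  # the same shape as "one", spoken more slowly
print(recognize(slow_one, templates))  # -> one
```

The slow utterance has five frames against the template's three, yet its warped distance to "one" is zero, because repeated frames are absorbed by stretching the template in time.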

A more recent approach to speaker-independent speech recognition is to use neural networks. As stated above, HMM technology works by making certain assumptions about the structure of speech, and then estimating system parameters as though those structures were correct. This technique may fail if the assumptions are incorrect. The neural network approach does not require such assumptions to be made. It uses a distributed representation of simple nodes, whose connections are trained to recognize speech. Unlike in HMMs, knowledge or constraints are not encoded in individual units, rules, or procedures, but are distributed across many simple computing units. Uncertainty is modeled not as the unlikelihood of a single unit, but by the pattern of activity in many units. These computing units are simple in nature, and knowledge is not programmed into any individual unit's function; rather, it lies in the connections and interactions between linked processing elements.
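The idea that knowledge lives in trained connections rather than in programmed rules can be shown with a deliberately tiny network. The sketch below assumes one sigmoid unit per word, two made-up input features per utterance, and plain gradient-style weight updates; it is far smaller than any network used for real speech, and every name and number in it is an assumption for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def activation(weights, feats):
    """Output of one simple unit: weighted sum of inputs passed through a sigmoid."""
    x = list(feats) + [1.0]  # constant input acts as the bias connection
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

def train(examples, words, n_features, epochs=200, rate=0.5):
    """Adjust connection weights so each word's unit fires for its own examples."""
    weights = {w: [0.0] * (n_features + 1) for w in words}
    for _ in range(epochs):
        for feats, label in examples:
            x = list(feats) + [1.0]
            for w in words:
                error = (1.0 if w == label else 0.0) - activation(weights[w], feats)
                weights[w] = [wi + rate * error * xi for wi, xi in zip(weights[w], x)]
    return weights

def classify(feats, weights):
    """The recognized word is the one whose unit is most active."""
    return max(weights, key=lambda w: activation(weights[w], feats))

# Hypothetical training data: two features per utterance.
examples = [([0.9, 0.1], "yes"), ([0.8, 0.2], "yes"),
            ([0.1, 0.9], "no"), ([0.2, 0.8], "no")]
weights = train(examples, ["yes", "no"], n_features=2)
print(classify([0.85, 0.15], weights))
```

Nothing in the code states what distinguishes "yes" from "no"; that information ends up only in the learned weight values, and the decision comes from comparing the activity of the units.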

Speech Recognition (Dependent)

A speaker-dependent system is developed to operate for a single speaker. These systems are usually easier to develop, cheaper to buy, and more accurate than speaker-independent systems. The system is trained to understand one user's pronunciations, inflections, and accents, and can therefore run much more efficiently and accurately. It requires users to participate in training sessions that "teach" the computer to recognize the user's voice. The computer then builds a voice profile from the required training.

In the past few years, however, the programs and the necessary hardware have become more available, less expensive, and much more efficient. System vocabularies have grown greatly, from 30,000 to 60,000 words.

The speaker-dependent side of the technology, where the system is "trained" through repetition to recognize a certain vocabulary of words and accept no substitute, is fairly well established. This technology is generally based on template, or acoustical, representations of speech.

Users "train" the system in their particular voice patterns by speaking the words that will need to be recognized, or by supplying voice samples/prints of them. These "voice prints", or templates, are then stored on the system. When the system is working, these voice prints are compared with the spoken command of the user. If the voice print and the spoken word match, the speech recognition system "recognizes" the word and executes the command. Template-based dependent recognition is used for relatively small to medium-sized vocabularies (generally up to a few thousand words).
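The train-then-match cycle described above might look like the following sketch. It assumes each utterance has already been reduced to a short, fixed-length list of features, that a template is simply the average of a user's repetitions, and that a distance threshold decides whether a spoken word matches; these choices and all names are illustrative assumptions rather than a description of any specific product.

```python
import math

def enroll(repetitions):
    """Build a template by averaging several equal-length utterances of one word."""
    count = len(repetitions)
    return [sum(values) / count for values in zip(*repetitions)]

def distance(a, b):
    """Euclidean distance between two equal-length feature lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(sample, templates, max_distance):
    """Return the matching command, or None if no stored template is close enough."""
    word = min(templates, key=lambda w: distance(sample, templates[w]))
    return word if distance(sample, templates[word]) <= max_distance else None

# Training session: the user repeats each command twice (hypothetical features).
templates = {
    "open": enroll([[1.0, 0.0], [0.8, 0.2]]),
    "close": enroll([[0.0, 1.0], [0.2, 0.8]]),
}

print(recognize([0.85, 0.1], templates, max_distance=0.3))  # -> open
print(recognize([0.5, 0.5], templates, max_distance=0.3))   # -> None
```

The threshold is what lets the system "accept no substitute": an utterance that resembles no enrolled word is rejected instead of being forced onto the nearest command.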

Other speaker-dependent recognition systems operate by matching phonemes, multiple words, and triphones. The phoneme/multi-word approach is typically used for systems with larger vocabularies, up to tens of thousands of words. Typically, speaker-dependent systems work well with medium to large vocabularies, and with either isolated or connected word recognition.

These speaker-dependent systems are by no means perfect yet, and improvements need to be made, but this new technology should not be limited to word processing and dictation. There are many areas that can be explored, and the technology is likely to be implemented and adopted by many in the years to come.

Speaker Recognition

For speaker recognition, the speech sample is processed to obtain speaker variability instead of being processed for phonemes. Speaker recognition consists of verification or identification. Verification is easier because the speech sample is compared only against the reference template of the claimed speaker, and the only decision is whether the sample is a good enough match. Identification is more complicated because it involves matching a speech sample against N templates and choosing the best match among them.
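The difference between the two tasks can be made concrete with a short sketch. It assumes each speaker is represented by one stored feature vector and that closeness is measured by Euclidean distance; the speaker names, feature values, and threshold are invented for the example.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def verify(sample, claimed_template, threshold):
    """Verification: one comparison, yes/no answer for a claimed identity."""
    return distance(sample, claimed_template) <= threshold

def identify(sample, templates):
    """Identification: compare against all N templates and pick the closest."""
    return min(templates, key=lambda name: distance(sample, templates[name]))

# Hypothetical enrolled speakers.
templates = {"alice": [0.2, 0.7], "bob": [0.6, 0.3], "carol": [0.9, 0.9]}
sample = [0.25, 0.65]

print(verify(sample, templates["alice"], threshold=0.1))  # -> True
print(verify(sample, templates["bob"], threshold=0.1))    # -> False
print(identify(sample, templates))                        # -> alice
```

Verification costs one comparison regardless of how many speakers are enrolled, while identification grows with N and has more ways to go wrong as templates crowd together.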

Natural Language Understanding

Natural Language Understanding (NLU) is the "true" idea of humans communicating with computers: systems that comprehend the task users are trying to complete, without the limited vocabularies required by speech recognition systems. Instead of focusing on phonemes, NLU looks at the context of the speech, much as a human processes speech. Currently, this field of study has not been greatly integrated with speech recognition, but it is believed to be the future foundation of speech recognition.

Business Markets for Speech Recognition

There are many examples of speech recognition in use in the business world today. Most revolve around transaction-based services such as banking, finance, and reservation systems. Companies offering services to their customers via speech recognition include Charles Schwab and American Express. These companies have built their systems on the following products.

Products

Demonstrations

To hear some of the types of speech recognition systems available, see the following demonstrations from some of the companies listed above.

______________________________________________________________________________________