Text-to-speech refers to the ability to convert computer-readable text into natural-sounding speech based on in-depth linguistic understanding. It transforms linguistic information stored as data or text into speech.

Current Systems

Text-to-speech systems have been in development for nearly 50 years and have been commercially available for about 15 years. However, many systems offer limited intelligibility and mostly produce very "unnatural" speech that can be instantly recognized as machine-generated.

The conversion from written text to speech can be broken down into three major tasks: linguistic analysis, prosodic modeling, and speech synthesis. Text analysis modules compute the linguistic representation from written text. Speech synthesis then transforms this representation, a chain of phonetic symbols enriched by information on phrasing, intonation and stress, into artificial, machine-generated speech by means of an appropriate synthesis method.
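The three tasks above can be sketched as a small pipeline. This is only an illustrative toy under stated assumptions: the lexicon, the `Segment` record, the pitch and duration values, and all function names are hypothetical, and the final stage returns a description of what would be rendered instead of producing audio.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon: word -> (phonetic symbols, index of the stressed symbol).
LEXICON = {
    "hello": (["HH", "AH", "L", "OW"], 3),
    "world": (["W", "ER", "L", "D"], 1),
}


@dataclass
class Segment:
    phoneme: str
    stressed: bool = False
    pitch_hz: float = 0.0
    duration_ms: int = 0


def linguistic_analysis(text):
    """Compute a chain of phonetic symbols with stress marks from written text."""
    segments = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        phonemes, stress_index = LEXICON[word]  # raises KeyError for unknown words
        for i, phoneme in enumerate(phonemes):
            segments.append(Segment(phoneme, stressed=(i == stress_index)))
    return segments


def prosodic_modeling(segments, base_pitch_hz=120.0):
    """Assign assumed pitch and duration values: pitch falls over the utterance,
    and stressed segments are made higher and longer."""
    n = len(segments)
    for i, seg in enumerate(segments):
        declination = 1.0 - 0.2 * i / max(n - 1, 1)
        boost = 1.1 if seg.stressed else 1.0
        seg.pitch_hz = base_pitch_hz * declination * boost
        seg.duration_ms = 120 if seg.stressed else 80
    return segments


def speech_synthesis(segments):
    """Stand-in for a synthesis method: list what would be rendered."""
    return [(s.phoneme, round(s.pitch_hz, 1), s.duration_ms) for s in segments]


def text_to_speech(text):
    return speech_synthesis(prosodic_modeling(linguistic_analysis(text)))
```

Calling `text_to_speech("Hello, world.")` yields one `(phoneme, pitch, duration)` tuple per phonetic symbol. A real system would replace the lexicon lookup with full text analysis and the last stage with an actual synthesis method.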

System Architecture

One of the main features of the Text-To-Speech system architecture is that it can function as a synthesizer for multiple languages. Currently, there are systems for English, Spanish, Italian, German, Russian, Romanian, Chinese and Japanese. These systems are multilingual in the sense that the underlying software for both linguistic analysis and speech synthesis is identical for all languages, with the exception of English. Some language-specific information is of course necessary: each language has its own acoustic inventories and its own special rules for linguistic analysis. This information is stored externally in tables and parameter files, and is loaded by the Text-To-Speech engine at run-time. Thus, in applications such as dialog or email reading, it is possible to switch voices and languages as desired at run-time.

The multilingual character of the system can be compared to a text processing program that allows the user to edit text in almost any language by providing language-specific fonts, while the same underlying principles and options for text formatting and output apply regardless of the language currently being processed.
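The separation between a language-independent engine and externally stored language data might look roughly like the following sketch. Everything in it is assumed for illustration: the `Engine` class, the JSON table layout, the voice names and the two rewrite rules per language are hypothetical, and the tables are held in a dictionary here where a real system would read them from external files.

```python
import json

# Hypothetical language tables. In a real system these would be external
# table and parameter files; they are inlined to keep the sketch self-contained.
LANGUAGE_TABLES = {
    "es": '{"voice": "es-voice-1", "rules": {"ll": "y", "qu": "k"}}',
    "de": '{"voice": "de-voice-1", "rules": {"sch": "sh", "w": "v"}}',
}


class Engine:
    """Language-independent engine: all language knowledge comes from tables."""

    def __init__(self):
        self.language = None
        self.voice = None
        self.rules = {}

    def load_language(self, code):
        """Load the table for one language at run-time, replacing the current one."""
        table = json.loads(LANGUAGE_TABLES[code])
        self.language = code
        self.voice = table["voice"]
        self.rules = table["rules"]

    def transcribe(self, text):
        """Apply the loaded rewrite rules, longest pattern first (toy analysis)."""
        out = text.lower()
        for pattern in sorted(self.rules, key=len, reverse=True):
            out = out.replace(pattern, self.rules[pattern])
        return out


engine = Engine()
engine.load_language("es")
spanish = engine.transcribe("Llama")
engine.load_language("de")  # switch language and voice without restarting
german = engine.transcribe("Schwan")
```

The engine code never mentions a specific language, so supporting another one means adding a table and leaving the software unchanged. The sketch does not model the English exception mentioned above.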

Hardware Implementation

Thanks to recent advances in hardware, Text-To-Speech has been successfully implemented on a wide variety of hardware platforms.


Text-to-speech is particularly attractive where there is a large amount of variable data. Typical applications are: reading of text such as email, retrieving database information, alert messages, help files, educational programs, game applications and embedded systems.

