Speech Synthesis

Seminar on Speech Synthesis

Speech synthesis is the generation of an acoustic speech signal by a machine (computer). Applications include text-to-speech (TTS) systems such as e-mail or news readers, dialogue systems such as train schedule information or flight reservation services, and automatic speech-to-speech translation systems.

The beginnings of speech synthesis date back to the 18th century, when Wolfgang von Kempelen built a mechanical speaking machine for Empress Maria Theresia. Electronic speech synthesizers evolved in the 20th century, and today digital signal processing allows advanced speech processing algorithms to be implemented on PCs.

In this seminar, we will primarily be concerned with the signal generation part of speech synthesis, reviewing state-of-the-art algorithms. However, we may also touch on the related topics of text-to-phoneme conversion, prosody generation, synthesis of multi-lingual text or emotional speech, and voice conversion.

Course organization

Each student should work on a selected topic and give an oral presentation in class during a 45-minute discussion session. Working in small groups of 2 or 3 students is strongly encouraged.

The first meeting will be held in the seminar room of INW at TU Graz, Inffeldgasse 12, first floor, on Wednesday, Oct. 8, 2003 at 2:00 p.m. It will be used to assign groups and topics and to coordinate the seminar schedule.

Suggested topics for seminar presentations

  1. Overview
    • Text-to-speech systems
    • Parameter generation (text pre-processing)
    • Prosody
    • Signal generation
  2. Text-to-phoneme conversion
    • Text normalization
    • Pronunciation
    • Stressing (prosody)
  3. Physical model based signal generation (articulatory synthesis)
    • Models for the vocal folds
    • Sound propagation in the vocal tract
  4. Source-filter modeling
    • Formant synthesizer
    • Linear prediction (LP)
    • Acoustic tube models
  5. Database-driven synthesis
    • Concatenative synthesis
    • Unit selection
    • Synthesis with stochastic Markov graphs
  6. Prosodic manipulations
    • PSOLA (pitch-synchronous overlap-and-add)
    • LP-PSOLA, RELP (residual excited LP)
    • Sinusoidal/harmonic-plus-noise modeling
  7. Various
    • Multi-lingual speech synthesis
    • Synthesis of emotional speech
    • Speaking styles
    • Voice conversion
    • etc.
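
As a rough illustration of topic 4 (source-filter modeling), the following minimal sketch passes an impulse train (a crude stand-in for the glottal source) through a cascade of second-order digital resonators, in the style of a cascade formant synthesizer. The function names, the default formant frequencies and bandwidths (loosely chosen for an /a/-like vowel), the 16 kHz sampling rate, and the use of a plain impulse train are all assumptions made for illustration. It is not the method of any particular system covered in the seminar.

```python
import math


def resonator_coeffs(freq_hz, bw_hz, fs):
    """Coefficients of a second-order digital resonator
    y[n] = a*x[n] + b*y[n-1] + c*y[n-2] with unity gain at 0 Hz."""
    t = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c


def resonate(x, freq_hz, bw_hz, fs):
    """Filter the sample list x with one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs)
    y1 = y2 = 0.0  # previous two output samples
    out = []
    for s in x:
        y = a * s + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, duration_s, fs):
    """Unit impulses spaced one pitch period apart (assumed voiced source)."""
    n = int(duration_s * fs)
    period = int(round(fs / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(f0_hz=100.0,
                     formants=((700.0, 130.0), (1220.0, 70.0), (2600.0, 160.0)),
                     duration_s=0.5, fs=16000):
    """Cascade the source through one resonator per (frequency, bandwidth) pair.
    The default formant values are illustrative assumptions only."""
    signal = impulse_train(f0_hz, duration_s, fs)
    for freq_hz, bw_hz in formants:
        signal = resonate(signal, freq_hz, bw_hz, fs)
    return signal


if __name__ == "__main__":
    samples = synthesize_vowel()
    print(len(samples), "samples synthesized")
```

Changing `f0_hz` alters the pitch while the formant list shapes the vowel quality, which is the separation of source and filter that this family of methods relies on. To listen to the result, the samples would still need to be normalized and written to an audio file.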

First meeting, and group and topic assignment

8. 10. 2003    Seminar Room INW

List of participants and assigned topics

Date            Participant's name

22. 10. 2003    Erhard Rank
29. 10. 2003    Marco Piccolino
5. 11. 2003     Hannes Pirker, Martin Hagmüller
12. 11. 2003    Helmuth Ploner-Bernard
19. 11. 2003    Markus Flohberger
26. 11. 2003    Thomas Wiener
3. 12. 2003     David Ludwig
17. 12. 2003    Franz Zotter
14. 1. 2004     Robin Hofe


Examples and Demos

Freely available synthesizers

References/Course material

Physical modeling

K. Ishizaka and J. L. Flanagan: Synthesis of Voiced Sounds from a Two-Mass Model of the Vocal Cords, Bell System Technical Journal, vol 51, pp 1233-1267, 1972.
G. Bailly: Learning to speak. Sensori-motor control of speech movements, Speech Communication 22, iss. 2-3, pp 251-267, 1997.
P. Badin, G. Bailly, M. Raybaudi, C. Segebarth: A Three-Dimensional Linear Articulatory Model Based on MRI Data, ESCA/COCOSDA Workshop on Speech Synthesis, Jenolan Caves, Australia, pages 249-254, 1998.

Source-filter modeling

J. D. Markel and A. H. Gray, Jr.: Linear Prediction of Speech, Springer, 1976.
D. H. Klatt: Software for a Cascade/Parallel Formant Synthesizer, J. Acoust. Soc. Am. 67, pp 971-995, 1980.
G. Fant: The Voice Source in Connected Speech, Speech Communication, vol 22, iss 2-3, pp 125-139, 1997.

Unit selection

A. Hunt and A. Black: Unit selection in a concatenative speech synthesis system using a large speech database, in Proc. of ICASSP 1996, vol.1, pp.373-376, Atlanta, Georgia.
A. W. Black and P. Taylor: Automatically clustering similar units for unit selection in speech synthesis, Proc. of Eurospeech 1997, Rhodes, Greece.
B. Bozkurt, M. Bagein, and T. Dutoit: From MBROLA to NU-MBROLA, Multitel-TCTS Lab, Faculté Polytechnique de Mons, Belgium, 2001.

Prosodic manipulations

E. Moulines and F. Charpentier: Pitch-Synchronous Waveform Processing Techniques for Text-to-Speech Synthesis using Diphones, Speech Communication, vol 9, pp 452-467, 1990.
T. Dutoit and H. Leich: MBR-PSOLA: Text-To-Speech synthesis based on an MBE re-synthesis of the segments database, Speech Communication, vol 13, pp 435-440, 1993.
J. Laroche, Y. Stylianou, and E. Moulines: HNS: Speech Modification Based on a Harmonic+Noise Model, Proc. of ICASSP 1993, vol.2, pp.550-553.
J. Laroche, Y. Stylianou, and E. Moulines: HNM: A Simple, Efficient Harmonic + Noise Model for Speech, Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics 1993, pp.169-172.
Y. Stylianou: Applying the harmonic plus noise model in concatenative speech synthesis, IEEE Trans. on Speech and Audio Processing, vol. 9, no.1, pp.21-29, Jan. 2001.
G. Bailly: A parametric harmonic + noise model, in Keller et al.: Improvements in Speech Synthesis, Wiley, 2002.

Emotions, speaking styles, voice conversion

Book chapters 19-28 (Part III): Issues in Styles of Speech, in Keller et al.: Improvements in Speech Synthesis, Wiley, 2002.
D. G. Childers: Glottal Source Modeling for Voice Conversion, Speech Communication, vol 16, pp 127-138, 1995.
I. Titze, D. Wong, B. Story and R. Long: Considerations in voice transformation with physiologic scaling principles, Speech Communication 22, iss 2-3, pp 113-123, 1997.
A. Kain and M. Macon: Spectral Voice Conversion for Text-to-Speech Synthesis, Proc. ICASSP 1998, vol 1, pp 285-288.

Other sources


Created by marian
Last modified 2005-10-25 16:43