Speech Synthesis

Advanced Signal Processing Seminar on the topic of Speech Synthesis, held in the summer term 2008.

This seminar will focus on the two dominant state-of-the-art corpus-based methods for text-to-speech synthesis, namely unit-selection based speech synthesis and the more recently developed Hidden Markov Model (HMM) based speech synthesis. Today's commercial systems mostly employ the unit-selection method.
In unit-selection synthesis, a large speech corpus is recorded and segmented into units. During synthesis, the system concatenates the sequence of units that minimizes both the distance between adjacent units (concatenation cost) and the distance to the target units (target cost).
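The search over units described here can be sketched as a dynamic program. This is a toy illustration only: units are reduced to a single number (for instance a pitch value), and the names `select_units`, `target_cost`, and `concat_cost` are hypothetical, not taken from any of the systems or papers listed below.

```python
from typing import Callable, List, Sequence, Tuple


def select_units(
    targets: Sequence[float],
    candidates: Sequence[Sequence[float]],
    target_cost: Callable[[float, float], float],
    concat_cost: Callable[[float, float], float],
) -> Tuple[List[int], float]:
    """Pick one candidate per target position so that the summed
    target and concatenation costs are minimal.

    Returns the chosen candidate index per position and the total cost.
    """
    if not targets:
        return [], 0.0

    # best[i][j]: cheapest cost of any unit sequence ending in candidate j at position i
    best = [[target_cost(targets[0], u) for u in candidates[0]]]
    # back[i][j]: index of the predecessor candidate on that cheapest sequence
    back = [[-1] * len(candidates[0])]

    for i in range(1, len(targets)):
        row, ptr = [], []
        for u in candidates[i]:
            # Cost of reaching u from each candidate at the previous position
            costs = [
                best[i - 1][k] + concat_cost(prev, u)
                for k, prev in enumerate(candidates[i - 1])
            ]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_min] + target_cost(targets[i], u))
            ptr.append(k_min)
        best.append(row)
        back.append(ptr)

    # Trace back from the cheapest final candidate
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path = [j]
    for i in range(len(targets) - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return path, total
```

With targets `[100.0, 110.0, 120.0]`, candidates `[[90.0, 101.0], [111.0, 150.0], [80.0, 119.0]]`, a target cost of `abs(t - u)` and a concatenation cost of `0.5 * abs(p - u)`, the function returns the path `[1, 0, 1]` with total cost `12.0`. Real systems compute both costs from many acoustic and linguistic features and prune the candidate lists.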
In HMM-based speech synthesis, HMMs are trained on a corpus of speech data. During synthesis, a sequence of features (spectral, pitch, and duration features) is generated from the HMMs and used to synthesize the signal.
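The generation step can be illustrated with a heavily simplified sketch. It assumes a one-dimensional feature, a fixed duration per state, Gaussian statistics for the static feature and for a first-order delta feature defined as c[t] - c[t-1], and it finds the maximum-likelihood trajectory by solving the resulting tridiagonal system. The names `State` and `generate_trajectory` are hypothetical; actual systems use multi-dimensional features and wider delta windows.

```python
from typing import List, NamedTuple


class State(NamedTuple):
    mean: float        # mean of the static feature
    var: float         # variance of the static feature
    delta_mean: float  # mean of the delta feature c[t] - c[t-1]
    delta_var: float   # variance of the delta feature
    duration: int      # number of frames spent in this state


def generate_trajectory(states: List[State]) -> List[float]:
    """Most likely static feature sequence under static and delta statistics."""
    # Expand the state sequence into one set of statistics per frame
    frames = [s for s in states for _ in range(s.duration)]
    n = len(frames)
    if n == 0:
        return []

    # Normal equations A c = b; A is symmetric and tridiagonal
    diag = [1.0 / f.var for f in frames]
    rhs = [f.mean / f.var for f in frames]
    off = [0.0] * n  # off[t] couples c[t-1] and c[t]; off[0] is unused
    for t in range(1, n):
        w = 1.0 / frames[t].delta_var
        diag[t] += w
        diag[t - 1] += w
        off[t] = -w
        rhs[t] += w * frames[t].delta_mean
        rhs[t - 1] -= w * frames[t].delta_mean

    # Forward elimination (Thomas algorithm for tridiagonal systems)
    for t in range(1, n):
        m = off[t] / diag[t - 1]
        diag[t] -= m * off[t]
        rhs[t] -= m * rhs[t - 1]

    # Back substitution
    c = [0.0] * n
    c[-1] = rhs[-1] / diag[-1]
    for t in range(n - 2, -1, -1):
        c[t] = (rhs[t] - off[t + 1] * c[t + 1]) / diag[t]
    return c
```

For two states with static means 0 and 10, the delta term pulls neighbouring frames together, so the output moves gradually between the two means instead of jumping at the state boundary. With a very large `delta_var` the delta term has almost no influence and the output approaches the piecewise-constant sequence of state means.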
The following list suggests topics for presentation. It is not exhaustive and you can also use different papers to present a topic.

The first meeting (Vorbesprechung) will be on Tuesday 11.3.2008, 16:00-18:00, SR-INW.

General topics

Automatic speech segmentation

A. Ljolje, M. D. Riley (1993), Automatic segmentation of speech for TTS. In Proceedings of EUROSPEECH 1993, pages 1445-1448, Berlin, Germany.
F. Malfrere, T. Dutoit (1997), High-quality speech synthesis for phonetic speech segmentation. In Proceedings of EUROSPEECH 1997, pages 2631-2634, Rhodes, Greece.
L. Wang, Y. Zhao, M. Chu, J. Zhou, Z. Cao (2004), Refining segmental boundaries for TTS database using fine contextual-dependent boundary models. In Proceedings of ICASSP 2004, pages 641-644, Montreal, Canada.
A. Park, J. R. Glass (2005), Towards Unsupervised Pattern Discovery in Speech. In Proceedings of ASRU 2005, pages 53-58, San Juan.

Conversational speech

Y. Liu, E. Shriberg, A. Stolcke (2003), Automatic disfluency identification in conversational speech using multiple knowledge sources. In Proceedings of EUROSPEECH 2003, pages 957-960, Geneva.
E. Shriberg (2005), Spontaneous Speech: How People Really Talk, and Why Engineers Should Care. In Proceedings of EUROSPEECH 2005, pages 1781-1784, Lisbon.
N. Campbell (2006), Conversational speech synthesis and the need for some laughter. IEEE Transactions on Speech and Audio Processing, 14(4), pages 1171-1178.

Synthesis of singing

K. Saino, H. Zen, Y. Nankaku, A. Lee, K. Tokuda (2006), An HMM-based singing voice synthesis system. In Proceedings of INTERSPEECH 2006, pages 1141-1144, Pittsburgh.
T. Saitou, M. Goto, M. Unoki, M. Akagi (2007), Vocal Conversion from Speaking Voice to Singing Voice using STRAIGHT. In Proceedings of INTERSPEECH 2007, Antwerp, Belgium.
Synthesis of Singing Challenge. In Proceedings of INTERSPEECH 2007, Antwerp, Belgium.

Unit-selection synthesis related topics

Basics and history of unit selection speech synthesis

Y. Sagisaka (1988), Speech synthesis by rule using an optimal selection of non-uniform synthesis units. In Proceedings of ICASSP 1988, pages 679-682.
A. Hunt, A. Black (1996), Unit selection in a concatenative speech synthesis system using a large speech database. In Proceedings of ICASSP 1996, pages 373-376, Atlanta, Georgia.

Concatenation costs and target costs

A. Black, P. Taylor (1997), Automatically clustering similar units for unit selection in speech synthesis. In Proceedings of EUROSPEECH 1997, pages 601-604, Rhodes, Greece.
Y. Pantazis, Y. Stylianou, E. Klabbers (2005), Discontinuity detection in concatenated speech synthesis based on nonlinear speech analysis. In Proceedings of INTERSPEECH 2005, pages 2817-2820, Lisbon, Portugal.

HMM synthesis related topics

Basics of HMM-based speech synthesis

K. Tokuda, T. Kobayashi, S. Imai (1995), Speech parameter generation from HMM using dynamic features. In Proceedings of ICASSP 1995, pages 660-663.
T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, T. Kitamura (1999), Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proceedings of EUROSPEECH 1999, pages 2347-2350.
K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, T. Kitamura (2000), Speech parameter generation algorithms for HMM-based speech synthesis. In Proceedings of ICASSP 2000, pages 1315-1318.
K. Tokuda, T. Masuko, N. Miyazaki, T. Kobayashi (2002), Multi-space probability distribution HMM. IEICE Transactions on Information & Systems, E85-D(3), pages 455-464.

Speaker interpolation

T. Yoshimura, T. Masuko, K. Tokuda, T. Kobayashi, T. Kitamura (1997), Speaker interpolation in HMM-based speech synthesis system. In Proceedings of EUROSPEECH 1997, pages 2523-2526.
T. Yoshimura, T. Masuko, K. Tokuda, T. Kobayashi, T. Kitamura (2000), Speaker interpolation for HMM-based speech synthesis system. J. Acoust. Soc. Jpn., 21(4).
M. Tachibana, J. Yamagishi, T. Masuko, T. Kobayashi (2005), Speech synthesis with various emotional expressions and speaking styles by style interpolation and morphing. IEICE Transactions on Information & Systems, E88-D(11), pages 2484-2491.

Speaker adaptation

M. Tamura, T. Masuko, K. Tokuda, T. Kobayashi (2001), Text-to-speech synthesis with arbitrary speaker's voice from average voice. In Proceedings of EUROSPEECH 2001, pages 345-348.
J. Yamagishi, M. Tamura, T. Masuko, K. Tokuda, T. Kobayashi (2003), A training method of average voice model for HMM-based speech synthesis. IEICE Transactions on Fundamentals, E86-A(8), pages 1956-1963.
Y. Nakano, M. Tachibana, J. Yamagishi, T. Kobayashi (2006), Constrained structural maximum a posteriori linear regression for average-voice-based speech synthesis. In Proceedings of INTERSPEECH 2006.
J. Yamagishi, T. Kobayashi (2007), Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training. IEICE Transactions on Information & Systems, E90-D(2), pages 533-543.

Signal generation

S. Imai (1983), Cepstral analysis synthesis on the mel frequency scale. In Proceedings of ICASSP 1983, pages 93-96.
T. Fukada, K. Tokuda, T. Kobayashi, S. Imai (1992), An adaptive algorithm for mel-cepstral analysis of speech. In Proceedings of ICASSP 1992, pages 137-140.
H. Kawahara, I. Masuda-Katsuse, A. de Cheveigné (1999), Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication, 27(3-4), pages 187-207.
H. Zen, T. Toda, M. Nakamura, K. Tokuda (2007), Details of Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005. IEICE Transactions on Information & Systems, E90-D(1), pages 25-33.
R. Maia, T. Toda, H. Zen, Y. Nankaku, K. Tokuda (2007), An excitation model for HMM-based speech synthesis based on residual modeling. In Proceedings of SSW6 workshop.

Context clustering

S.J. Young, J.J. Odell, P.C. Woodland (1994), Tree-Based State Tying for High Accuracy Modelling. In Proceedings of ARPA Human Language Technology Workshop, pages 307-312, New Jersey, USA.
K. Shinoda, T. Watanabe (1997), Acoustic modeling based on the MDL principle for speech recognition. In Proceedings of EUROSPEECH 1997, pages 99-102.
J. Yamagishi, M. Tamura, T. Masuko, K. Tokuda, T. Kobayashi (2003), A context clustering technique for average voice models. IEICE Transactions on Information and Systems, E86-D(3), pages 534-542.

References

HMM-based speech synthesis system

The Festival Speech Synthesis System

Viennese Sociolect and Dialect Synthesis project

Timetable

Tue 11.3.2008 16:00 - 18:00	First meeting (Vorbesprechung)	M. Pucher	Presentation
Tue 15.4.2008 16:00 - 19:00	Signal Generation	C. Caruncho	Presentation	Paper
Tue 29.4.2008 16:00 - 19:00	Basics of HMM-based speech synthesis	P. Gampp, A. Sereinig	Presentation1 Presentation2	Paper1 Paper2
	Speaker interpolation	S. Rexeis, M. Stracka	Presentation	Paper
Tue 10.6.2008 16:00 - 19:00	Synthesis of singing	R. Peharz, P. Meissner	Presentation1 Presentation2	Paper
	Conversational speech	J. Luig	Presentation	Paper
Tue 24.6.2008 16:00 - 19:00	VSDS/ftw. Presentation	M. Pucher, F. Neubarth, C. Kranzler, M. Bruss, D. Schabus, G. Schuchmann

Contact

Michael Pucher

Telecommunications Research Center Vienna (FTW), Tech Gate Vienna, Donau-City-Strasse 1, 3rd floor, A-1220 Vienna, Austria

Phone: +43 1 505 2830-46

Fax: +43 1 505 2830-99

E-mail: pucher at ftw.at

Web: http://dialect-tts.ftw.at, http://userver.ftw.at/~pucher, http://www.ftw.at

Created by klaus
Last modified 2008-07-07 10:47
