Síntese de fala a partir de texto com reduzidos requisitos computacionais
Carlos Miguel Duarte Mendes
Dissertação para obtenção do Grau de Mestre em Engenharia Electrotécnica e de Computadores
Júri
Presidente: Doutor Carlos Jorge Ferreira Silvestre
Orientador: Doutor Luís Miguel Veiga Vaz Caldas de Oliveira
Vogal: Doutora Isabel Maria Martins Trancoso
13 de Novembro de 2008
Acknowledgments
First, I would like to express my gratitude to Professor Luís Caldas de Oliveira, my adviser, for
his support, encouragement and guidance. I would like to thank Sérgio Paulo and Luís Figueira,
for their substantial help and work on the Tecnovoz corpus, without which this thesis would have
been impossible; Renato Cassaca and David Matos, for their C/C++ programming suggestions
that helped me solve many problems; Helena Moniz, for helping me with the linguistic issues;
Professor João Paulo Neto and all my colleagues from the Tecnovoz project, for their contagious
enthusiasm while achieving the impossible; and everyone else at L2F, for the great work environment
that they built.
Finally, I would like to give my special thanks to my friends and family for all their support over
the last two years.
Lisbon, September 29, 2008
Carlos Miguel Duarte Mendes
Abstract
In recent years, TTS systems have become an important output device in human-machine
interfaces, and they are used in many applications such as car navigation systems, information
retrieval over the telephone, voice mail and so on. Although most concatenation-based TTS systems
are able to synthesize speech with high quality, their performance decreases when the computational
requirements must be kept very low, usually due to the large amount of pre-recorded speech stored
in the database.
The main objective of this thesis was the development of a small-footprint text-to-speech synthesis
system, using HMM models to generate artificial speech. The developed system was applied
to European Portuguese, but is general enough to be extended to other languages.
Keywords
Text-to-Speech Systems, Small footprint synthesis, HMM-based Synthesis, Context-dependent clustered models
Resumo
Nos últimos anos, sistemas de texto para fala têm-se tornado importantes dispositivos de
saída em interfaces homem-máquina, pelo que são usados em muitas aplicações como sis-
temas de navegação, obtenção de informações via telefone, voice mail, etc. Apesar da maioria
dos sistemas de síntese de fala, baseados em concatenação de segmentos, serem capazes de
gerar fala sintética com uma grande qualidade, o seu desempenho decresce quando os requisitos
computacionais são muito baixos, na maioria das vezes devido à grande quantidade de fala
pré-gravada armazenada na base de dados.
O objectivo deste trabalho foi o desenvolvimento de um sistema de síntese de fala a partir de
texto com baixos requisitos, tanto de poder de cálculo como de memória, recorrendo a modelos
HMM para a geração do sinal de fala. O sistema foi aplicado ao Português Europeu, mas tem a
generalidade suficiente para ser alargado a outras línguas.
Palavras Chave
Sistemas de texto para fala, Síntese com baixos requisitos computacionais, Síntese baseada
em HMMs, Grupos de modelos com dependência contextual.
Contents
1 Introduction 1
1.1 State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 HMM-based Text-To-Speech Synthesis . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Context-Dependent Clustered Models 7
2.1 Language Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Tree-Based Context Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Towards Language Independency . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 Configurable Grammar Features Module . . . . . . . . . . . . . . . . . . . . 10
2.3.2 Configurable Context Factors Module . . . . . . . . . . . . . . . . . . . . . 15
2.3.3 Feature Label Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 Grammar Features and Context Factors for European Portuguese . . . . . . . . . 18
2.4.1 Grammar Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.2 Context Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.3 Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3 Voice Building with HTS 25
3.1 Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1.1 Design of the Recording Prompts . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1.2 Speakers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.3 Recordings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.4 Phonetic Segmentation and Multi-Level Utterance Descriptions . . . . . . . 29
3.1.5 Corpus sub-set for HMM Training . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 HMM Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.1 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.2 Models for Sandhi Phenomena . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.3 Segmentation Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4 Speech Synthesis with Dixi TTS Engine 37
4.1 Dixi Text-To-Speech Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 HMM Based Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3 HMM-based Waveform Generation with HTS Engine API . . . . . . . . . . . . . . 39
4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4.1 System Footprint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4.2 Waveform Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5 Conclusions 45
Bibliography 47
A Phonetic Alphabets 51
List of Figures
1.1 HMM TTS System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Context-Dependent Clustered Decision Tree . . . . . . . . . . . . . . . . . . . . . . 5
1.3 HMM-based Speech Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Decision tree-based state tying . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1 Recording Room . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Unit coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 HMM Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1 Dixi system architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 HMM Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3 Dixi Component for HMM Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4 Original waveform for sentence "O de Aveiro custou-me vinte." . . . . . . . . . . . 42
4.5 Synthesized waveform for sentence "O de Aveiro custou-me vinte." . . . . . . . . . 42
List of Tables
2.1 Chomsky distinctive features for PT-SAMPA vowels . . . . . . . . . . . . . . . . . . 19
2.2 Chomsky distinctive features for PT-SAMPA consonants . . . . . . . . . . . . . . . 20
2.3 POS tag system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Prosodic Markers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1 Mel generalized cepstrum coefficients extraction parameters . . . . . . . . . . . . 32
3.2 Models and questions count for duration, pitch and spectral coefficients . . . . . . 33
3.3 Question usage for each linguistic level . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 Context at states 1 and 2 for one of the /j/ HMM models . . . . . . . . . . . . . . . 35
4.1 System execution footprint using HTS-based and Unit Selection based waveform
generators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 System memory footprint using HTS-based and Unit Selection based waveform
generators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
A.1 Phonetic Alphabet for Standard European Portuguese Dialect . . . . . . . . . . . . 52
A.2 Phonetic Alphabet for Standard English Dialect . . . . . . . . . . . . . . . . . . . . 53
List of Acronyms
DOM Document Object Model
HMM Hidden Markov Model
HMMs Hidden Markov Models
HRG Heterogeneous Relation Graph
HRGs Heterogeneous Relation Graphs
HTS HMM-based TTS System
IPA International Phonetic Alphabet
MDL Minimum Description Length
MFCC Mel Frequency Cepstrum Coefficients
MGC Mel frequency Generalized Cepstrum coefficients
MLSA Mel-Log Spectra Approximation
MSDs Multi-Space probability Distributions
POS Part-Of-Speech
1 Introduction
Contents
1.1 State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 HMM-based Text-To-Speech Synthesis . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Mobile devices like cellphones and PDAs have serious input/output limitations, especially in
situations like driving, where the user cannot maintain eye contact with the device. This accessibility
problem suggests the use of spoken interfaces.
Today’s most successful speech synthesis methods use approaches based on speech databases,
where speech segments of variable duration are selected and concatenated to produce
the desired speech signal. The selection criterion consists of the simultaneous optimization of
two costs: the target cost and the concatenation cost. The first evaluates the differences
between the desired synthesized sound and the sounds available in the database, in terms
of segmental and prosodic differences. The second evaluates the concatenation quality of the
candidate segments. The joint optimization of these two costs allows the selection of the speech
unit sequence that best produces the desired acoustic realization. The problem inherent in
this approach arises when the text-to-speech system must handle any type of input text. In this
case, the speech database must contain a wide variety of speech segments to allow the optimization
process to find an acoustic sequence with acceptable quality. This usually corresponds to several
hours of recorded speech and high computational resources for the selection process. This disadvantage
makes concatenation-based synthesis less fit for mobile devices. Recently a parametric synthesis
method re-emerged, in which the speech signal is generated from source-filter models that simulate
the human vocal tract, instead of from recorded speech databases. The model parameters
are generated from statistical models, trained on a speech database of considerable size.
Since the parameters are generated from statistical models, there is no need to store
large amounts of speech data to synthesize speech, which substantially reduces the
required computational resources.
The main objective of this thesis was the development of a small-footprint text-to-speech synthesis
system, using parametric models to generate artificial speech. This system was developed for
European Portuguese but is general enough to be extended to other languages.
1.1 State of the Art
Currently there are two common approaches for small-footprint text-to-speech systems.
The first is diphone concatenation and the second is parametric synthesis with Hidden Markov
Models (HMMs).
Diphone concatenation synthesis consists of producing the desired acoustic-phonetic
sequence by concatenating diphone segments available in a pre-collected diphone
database. The number of diphones in a language is at most equal to the square of the total number
of phonetic segments. However, not all combinations of phone segments exist, meaning that
the resulting database is actually smaller than it would be if all combinations occurred.
The collected databases are relatively small, since there is a limited number of phones per language,
typically between 40 and 60. Although these systems require a very small amount
of resources, they present a few problems. The first resides in the concatenation
process, where pitch discontinuities may appear, resulting in noticeable acoustic
distortions. Perfectly matching the end boundary of one diphone with the beginning boundary
of the next is very difficult, often leading to tedious work during database setup. Recent
techniques have emerged as a solution to this problem, in which speech transformations are
performed in the neighborhood of the concatenation point to reduce possible signal distortions.
Still, there is another problem with this method: although the produced
speech has very good segmental quality, the generated intonation is frequently unnatural, and
listeners often find it tedious. One of the procedures to avoid distortions at the concatenation
points consists of recording all diphone units at the same pitch level, i.e. as flat as possible.
This technique helps the concatenation process, but has the disadvantage of making the speech
intonation unnatural.
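As a rough, back-of-the-envelope check of the inventory sizes mentioned above (the figures are illustrative only):

```python
def diphone_upper_bound(n_phones: int) -> int:
    """All ordered phone pairs; real inventories are smaller,
    since not every pair actually occurs in the language."""
    return n_phones ** 2

# For the 40-60 phones typical of a language:
bounds = {n: diphone_upper_bound(n) for n in (40, 50, 60)}
# 40 phones -> at most 1600 diphones; 60 phones -> at most 3600
```

Even the upper bound stays in the low thousands of short units, which is why diphone databases fit comfortably on constrained devices.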
The other approach to small-footprint systems is parametric synthesis with Hidden Markov
Models (HMMs). HMMs have been successfully applied to model sequences of speech
spectra in speech recognition systems, and the performance of HMM-based speech recognition
systems has improved through techniques that make use of their flexibility: context-dependent modeling,
dynamic feature parameters, mixtures of Gaussian densities, tying mechanisms, and speaker and
environment adaptation techniques. HMM-based speech synthesis systems are becoming increasingly
popular. They were first conceived for Japanese by Tokuda, Kobayashi, and Imai (1995)
and further developed by Masuko et al. (1996) and Yoshimura et al. (1999). This technique has
also been developed for several other languages such as Korean, English, Brazilian Portuguese,
Slovenian, Chinese and German, as indicated in Black, Zen, and Tokuda (2007). It has been
shown that HMM-based speech synthesis can successfully be applied in a wide range of
languages. In Yoshimura et al. (1999), the HMM-based TTS system in figure 1.1 was proposed,
where the training and synthesis parts of the system are depicted. In the training phase, spectral
parameters and excitation parameters are extracted from the speech database. The extracted
parameters are modeled by context-dependent HMMs. In the synthesis phase, a context-dependent
label sequence is obtained from the input text by linguistic analysis. A sentence HMM is constructed
by concatenating context-dependent HMMs according to the context-dependent label sequence;
using a parameter generation algorithm, spectral and excitation parameters are then generated
from the sentence HMM. Finally, by using a synthesis filter, speech is synthesized from the
generated spectral and excitation parameters.
Figure 1.1: HMM TTS System [Yoshimura]
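The two phases can be sketched in miniature as follows. Everything here is illustrative (toy labels, toy parameter values, invented names), not the actual HTS toolkit API; the point is only the data flow: a trained voice is a table of context-dependent models, and synthesis concatenates them and reads out parameter streams.

```python
from dataclasses import dataclass

@dataclass
class ContextHMM:
    label: str        # context-dependent label, e.g. "sil-o+l"
    spectral: list    # spectral parameter stream (toy values)
    excitation: list  # excitation (F0) stream (toy values)

# "Training" outcome: context-dependent models indexed by label.
voice = {m.label: m for m in [
    ContextHMM("sil-o+l", [0.1, 0.2], [120.0, 118.0]),
    ContextHMM("o-l+a",   [0.3, 0.4], [115.0, 110.0]),
]}

def synthesize(labels):
    """Concatenate context-dependent HMMs into a sentence model and
    return the generated parameter streams (a synthesis filter would
    then turn these into a waveform)."""
    sentence = [voice[l] for l in labels]
    spectral = [v for m in sentence for v in m.spectral]
    f0 = [v for m in sentence for v in m.excitation]
    return spectral, f0

spec, f0 = synthesize(["sil-o+l", "o-l+a"])
```

Note that the voice footprint is just the model table: no recorded waveforms are carried into synthesis, which is the source of the small-footprint property discussed above.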
1.2 HMM-based Text-To-Speech Synthesis
Phonetic and prosodic parameters are modeled simultaneously with HMMs. In
the system proposed by Yoshimura (2002), the mel-cepstrum, the fundamental frequency (F0) and the state
durations are modeled by continuous-density HMMs, multi-space probability distribution HMMs
and multi-dimensional Gaussian distributions, respectively. The distributions for spectrum, F0
and state duration are clustered independently by using a decision-tree based context clustering
technique, as depicted in figure 1.2. These decision trees rely on features that are language
dependent.
Figure 1.2: Context-Dependent Clustered Decision Tree [Yoshimura]

Synthetic speech is generated by using a speech parameter generation algorithm from the
HMMs and a mel-cepstrum based vocoding technique. A more detailed illustration of the HMM-based
text-to-speech synthesis system is shown in figure 1.3. An arbitrarily given text to be
synthesized is converted to a context-based label sequence. Then, according to the label sequence,
a sentence HMM is constructed by concatenating context-dependent HMMs. State durations
of the sentence HMM are determined so as to maximize the likelihood of the state duration
densities [Yoshimura et al.]. According to the obtained state durations, a sequence of mel-cepstral
coefficients and F0 values, including voiced/unvoiced decisions, is generated from the sentence
HMM by using a speech parameter generation algorithm [Tokuda et al., Yoshimura]. Speech is
then synthesized directly from the generated mel-cepstral coefficients and F0 values using a
synthesis filter referred to as the MLSA filter [Fukada et al., Imai, Yoshimura].
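For Gaussian duration densities, this maximization has a closed-form solution commonly quoted in the HTS literature; the following is a sketch of that formulation, where \(T\) is the total utterance length in frames and \(m_k\) and \(\sigma_k^2\) are the mean and variance of the \(k\)-th of the \(K\) state duration densities:

```latex
d_k = m_k + \rho\,\sigma_k^2,
\qquad
\rho = \frac{T - \sum_{k=1}^{K} m_k}{\sum_{k=1}^{K} \sigma_k^2}
```

When no total length \(T\) is imposed, \(\rho = 0\) and each state simply takes its mean duration \(m_k\); a nonzero \(\rho\) stretches or compresses states in proportion to their variance, which is also how speaking-rate control is obtained in this framework.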
Figure 1.3: HMM-based Speech Synthesis [Yoshimura]
1.3 Thesis Outline
The main objective of this thesis is the development of a small-footprint text-to-speech synthesis
system, using parametric models to generate artificial speech. The developed system was
applied to European Portuguese, but is general enough to be extended to other languages.
In chapter 2, a decision-tree-based context-dependent clustering technique for HMMs is
described. Language dependency issues inherent to this technique are analyzed and a solution
towards language independency is proposed, along with the work developed for European Portuguese.
In chapter 3, the voice building process for European Portuguese is presented, including corpus design and
HMM training. In chapter 4, the work conducted to integrate an HMM-based system into the Dixi
TTS system is presented. Finally, in chapter 5, conclusions are presented.
1.4 Main Contributions
The first main contribution of this thesis is a tool that automatically generates feature
labels according to a language specification. Described in chapter 2, this tool overcomes the
language barrier when developing synthesis engines based on context-dependent clustered models.

The second main contribution of this thesis is the integration of an HMM-based waveform
generation module into the Dixi TTS system as a solution for small-footprint speech synthesis.
2 Context-Dependent Clustered Models
Contents
2.1 Language Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Tree-Based Context Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Towards Language Independency . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Grammar Features and Context Factors for European Portuguese . . . . . . . . 18
2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
In continuous speech, parameter sequences of a particular speech unit vary according to
linguistic patterns. To model these patterns accurately, context-dependent models are clustered
in a structured representation of context features. However, as the number of context factors
increases, the number of their combinations increases exponentially. Moreover, it is impossible to prepare training
data that covers all possible contexts and compensates for the large variations in the frequency
of each context-dependent unit. To alleviate these problems, a decision-tree-based context-dependent
clustering technique is used to cluster HMM states and share model parameters, such as spectral
coefficients, F0 and duration, among states. Since the spectrum, F0 and duration models have their
own influential patterns, their distributions are clustered independently, as shown in figure 1.2.
2.1 Language Patterns
Speech production is the reproduction of sounds constrained by a language-specific set of
rules. Additionally, each speaker adds his/her own particularities, giving speech the variability that
characterizes it as natural. These variations make it difficult to quantify and/or qualify speech with one
unique set of rules. To simplify the idea of natural speech, the concept of speech patterns is used
instead. This statistical perspective of natural speech production allows a better management of
language concepts and thus better modeling.
Further analysis of speech patterns shows that they may be represented as a composition
of grammar features and context factors. Grammar features represent the language structure
and are identified by phones, syllables and morphological categories. Context factors, on the
other hand, are local features that are used to identify a certain linguistic context; examples are
the left/central/right phone, the left/right part-of-speech and the distance (in syllables) to the
previous accented syllable. To put it differently, a group of grammar features can be used to
describe a certain acoustic segment, but distinct local grammar features may influence its
characteristics, so that it presents differently under different contexts, thus representing
different speech patterns.
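The combination of grammar features and context factors can be pictured as a full-context label attached to each segment. The sketch below uses an invented, much simplified label syntax (real systems such as HTS use far richer formats); all names are illustrative:

```python
def context_label(prev, phone, nxt, pos, dist_to_stress):
    """Combine grammar features (the phone identities) with context
    factors (neighbouring part-of-speech, distance in syllables to
    the previous stressed syllable) into one label string."""
    return f"{prev}-{phone}+{nxt}/POS:{pos}/DS:{dist_to_stress}"

# The same central phone gets distinct labels under distinct contexts,
# and therefore may be modeled by distinct clustered states.
label = context_label("s", "aw", "n", "noun", 2)
```

Two occurrences of the same phone with different neighbours or prosodic positions yield different labels, which is exactly the explosion of combinations that the clustering of the next section has to tame.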
This linguistic decomposition allows the construction of context-dependent models and,
considering the many possible combinations of contextual factors, accurate model parameters
should be expected. However, as the number of contextual factors increases, the number of their
combinations also increases exponentially. Therefore, model parameters cannot be estimated with
sufficient accuracy from limited training data. Furthermore, it is impossible to prepare a speech database with
all combinations of contextual factors. To overcome this difficulty, the next section presents the
solution usually adopted in HMM-based synthesis systems.
2.2 Tree-Based Context Clustering
HTS [Tokuda et al.]
tree-based context clusters are translated into binary trees in which a yes/no context question is
attached to each node. Trees are built using a top-down sequential optimization process. Initially
attached to each node. Trees are built using a top-down sequential optimization process. Initially
all models are placed in a single cluster at the root of the tree. A question is then found according
to the Minimum Description Length (MDL) criterion, which gives the optimal split of the root node.
The MDL principle is an information criterion introduced by Shinoda and Watanabe
(1996) as an optimal probabilistic model selector. Maximum-likelihood based methods, previously
used in tree-based context clustering, have the major disadvantage of producing over-specialized
or under-specialized context trees. This is mostly a consequence of the difficulty in
determining the correct threshold for the stopping rule. MDL's main contribution is its ability to
produce optimal context trees without any externally given parameters. The splitting decision
is based on the models' description length: if the change in description length is below zero,
the node is divided; otherwise it is not.
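The split criterion just described can be written schematically. The following is a sketch, following Shinoda and Watanabe's Gaussian formulation up to constants: \(\Gamma(\cdot)\) denotes state occupancy, \(\Sigma\) the covariance of a cluster, \(\Gamma_0\) the total occupancy, and \(\lambda\) stands in for the number of free parameters per cluster (an assumption of this sketch, not a quantity defined in this thesis):

```latex
\Delta\ell(q) =
\tfrac{1}{2}\Big[\,\Gamma(S_y)\log\lvert\Sigma_{S_y}\rvert
               + \Gamma(S_n)\log\lvert\Sigma_{S_n}\rvert
               - \Gamma(S)\log\lvert\Sigma_{S}\rvert\,\Big]
+ \lambda\log\Gamma_0
```

A question \(q\) splitting cluster \(S\) into \(S_y\) and \(S_n\) is accepted when \(\Delta\ell(q) < 0\): the first term (the data term) always decreases with a split, while the penalty term grows with every cluster added, so the tree stops growing without any externally tuned threshold.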
The splitting process continues for the following nodes until the minimum description length
change is above zero. As a final stage, the minimum description length is calculated for
merging terminal nodes with different parents; any pair of nodes for which the length is above the
zero threshold is then merged, or tied. For example, figure 2.1 illustrates the case of tying the center
states of all triphones of the phone /aw/. All of the states trickle down the tree and, depending on
the answers to the questions, they end up at one of the shaded terminal nodes. In the illustrated
case, the center state of /s/-/aw/+/n/ would join the second leaf node from the right,
since its right context is a central consonant, its right context is nasal, but its left context is not
a central stop.
Figure 2.1: Decision tree-based state tying [Young et al.]
An important advantage of tree-based clustering is that it allows models which have no
training data to be synthesized. This is done by descending the previously constructed trees for
that phone and answering the questions at each node based on the new unseen context. When
a leaf node is reached, the model representing that cluster is used for the corresponding model
in the unseen context.
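The descent just described is a plain walk down a binary question tree. The sketch below mirrors the figure 2.1 example; the questions, phone classes and leaf names are illustrative, with leaves standing for tied (shared) states:

```python
def descend(node, context):
    """Walk a binary yes/no question tree until a leaf is reached."""
    while isinstance(node, dict):                      # internal node
        node = node["y"] if node["q"](context) else node["n"]
    return node                                        # leaf = shared model

tree = {
    "q": lambda c: c["R"] in {"n", "t", "d"},          # R = central consonant?
    "y": {"q": lambda c: c["R"] == "n",                # R = nasal?
          "y": {"q": lambda c: c["L"] in {"t", "d"},   # L = central stop?
                "y": "leaf_1", "n": "leaf_2"},
          "n": "leaf_3"},
    "n": "leaf_4",
}

# Centre state of s-aw+n: right context is nasal, left context not a stop.
cluster = descend(tree, {"L": "s", "R": "n"})          # -> "leaf_2"
```

Because the walk only asks questions about the context, it works equally well for a triphone never seen in training, which is precisely the advantage noted above.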
Information sharing of training data in the same cluster or leaf node is the essential concept;
therefore, the construction of context factors and the design of the tree structure for decision-tree
based context clustering must be done appropriately. Since the spectrum, F0 and duration models have
their own influential contextual factors, the distributions for spectral parameters, F0
and state durations are clustered independently, as seen in figure 1.2.
2.3 Towards Language Independency
The main problem in context cluster synthesis is the dependency on the target language.
To adapt a system to a new language, most context feature extraction algorithms need to be
changed to the new language's specifications. Also, since most context factors are not statically
available during synthesis itself, on-the-fly generation is necessary, which in turn requires modifications
to the system core. To overcome these language issues, configurable language-dependent modules for
on-line context feature extraction are proposed.
The next sections describe the proposed solution for language-independent context cluster
synthesis, using configurable language-dependent modules. As a support language for these
configurable modules, the XML mark-up language was chosen due to its adaptability to new
features.
2.3.1 Configurable Grammar Features Module
A configurable grammar features module is introduced to solve language dependency issues
in synthesis based on context-dependent clustered models. The goal of this module is to provide an
easy way of configuring grammar features without performing any modifications to the system
core, avoiding long and tedious implementation work. Next, some of the main language
features are described.
Phone Set
The first major grammar feature is the phone set. A phone set is a symbolic representa-
tion of the phonological basis of a spoken language. In language independent systems, general
phonetic symbolic representations, like the International Phonetic Alphabet (IPA) system, would
be preferred. The IPA goal is to find symbolic representations of every human language phonetic
forms. However, IPA presents a problem: its symbolic representation is not easily computable.
Also, there is a certain difficulty establishing the correct symbolic representation when particu-
10
lar forms of a certain phone are involved. Hence, each language dialect has its own phonetical
representation, for example in American English it is common to use the Darpa phone set, or its
subset, the Radio Phones phone set. The PT-SAMPA phone set is commonly accepted as the
best representation for European Portuguese. See appendix A for the cross-reference between
these systems and the IPA system.
In DTD 1 the Document Type Definition for the XML implementation is presented. The root
entity of the phoneset is the PhoneSet entity. This entity may have more than one element of type
Phone, and has only one attribute named name, which specifies the name and type of the phone
set being used. The Phone entity may have one or more elements of type PhoneFeature. This
entity has two mandatory attributes and two optional ones. The first two are name and maintype.
The name attribute sets the name of the phone in question, and the maintype attribute specifies
the general phone type; for example, main type "Vowels" can be a vowel, a semi-vowel, a
diphthong or a triphthong. The other two attributes of the Phone entity are translation and nuclearphone.
The translation attribute exists for compatibility reasons: graphic symbols, like /@/ and
/~/ used in the PT-SAMPA phone set, may present parsing problems on some
systems. As for the nuclearphone attribute, its purpose is to help some context factors determine
the main vowel in a specific syllable or diphthong. With the Darpa phone set this problem is
not very important, since most diphthongs are static and any feature processor would have no
trouble identifying the main vowel in a diphthong or syllable. However, when dealing with a
phone set like PT-SAMPA, which allows multiple characters for a vowel, extracting this information
becomes a problem; the additional information is supplied by the nuclearphone attribute.
The last entity is the PhoneFeature entity, which has only one attribute, named name. Phone
characteristics are set in this entity to allow phone discrimination using, for example, Chomsky
distinctive features (see section 2.4.1).
In XML 1 a very simple usage example of the phone set is presented. In this example,
two phones from the PT-SAMPA phone set are used: the vowel /6/ and the nasal
diphthong /6~w~/. Notice the use of the translation attribute to translate /6/ into /A/ and /~/ into /y/.
This translation assists, for instance, the parser of the HTK Toolkit [Young et al.] in interpreting special
phone symbols, since HTK is known to have parsing problems with these special characters.
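A phone set description following DTD 1 can be consumed with any standard XML parser. The sketch below uses Python's standard-library ElementTree (rather than the DOM interface listed in the acronyms) and a trimmed-down document; the element and attribute names follow the DTD, while everything else is illustrative:

```python
import xml.etree.ElementTree as ET

doc = """<PhoneSet name="PT-SAMPA">
  <Phone name="6" maintype="Vowels" translation="A">
    <PhoneFeature name="Nasal">NonNasal</PhoneFeature>
    <PhoneFeature name="Back">Back</PhoneFeature>
  </Phone>
</PhoneSet>"""

root = ET.fromstring(doc)
# Build a lookup table: phone name -> {feature name: feature value}.
phones = {
    p.get("name"): {f.get("name"): f.text for f in p.findall("PhoneFeature")}
    for p in root.findall("Phone")
}
# phones["6"]["Nasal"] == "NonNasal"
```

A table of this shape is all a feature label generator needs in order to answer per-phone questions (nasal? back? etc.) without any language-specific code in the system core.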
Part-Of-Speech
The next grammar feature is Part-Of-Speech (POS). POS is a word-level feature type and
an important linguistic resource in context cluster synthesis. Information obtained from a
morphosyntactic tagging system can be relevant in several areas of natural language processing
[Ribeiro et al.]. For example, knowing the POS of a given word allows one
to predict which words or word types can occur in its neighborhood.

DTD 1: Phone set Document Type Definition.

<!ENTITY % PhoneSetElements "Phone" >
<!ENTITY % PhoneElements "PhoneFeature" >

<!ELEMENT PhoneSet (%PhoneSetElements;)* >
<!ATTLIST PhoneSet name CDATA #REQUIRED>

<!ELEMENT Phone (%PhoneElements;)* >
<!ATTLIST Phone name CDATA #REQUIRED
                maintype CDATA #REQUIRED
                nuclearphone CDATA #IMPLIED
                translation CDATA #IMPLIED>

<!ELEMENT PhoneFeature (#PCDATA)* >
<!ATTLIST PhoneFeature name CDATA #REQUIRED>

XML Code 1: A very simple PhoneSet XML usage example.

<PhoneSet name="PT-SAMPA">
  <Phone name="6" maintype="Vowels" translation="A">
    <PhoneFeature name="Syllabic">Syllabic</PhoneFeature>
    <PhoneFeature name="High">NonHigh</PhoneFeature>
    <PhoneFeature name="Low">NonLow</PhoneFeature>
    <PhoneFeature name="Back">Back</PhoneFeature>
    <PhoneFeature name="Labial">NonLabial</PhoneFeature>
    <PhoneFeature name="Round">NonRound</PhoneFeature>
    <PhoneFeature name="Nasal">NonNasal</PhoneFeature>
    <PhoneFeature name="Dorsal">Dorsal</PhoneFeature>
  </Phone>
  <Phone name="6~w~" maintype="Vowels" translation="Aywy" nuclearphone="6~">
    <PhoneFeature name="Syllabic">Syllabic</PhoneFeature>
    <PhoneFeature name="High">NonHigh</PhoneFeature>
    <PhoneFeature name="Low">NonLow</PhoneFeature>
    <PhoneFeature name="Back">Back</PhoneFeature>
    <PhoneFeature name="Labial">NonLabial</PhoneFeature>
    <PhoneFeature name="Round">NonRound</PhoneFeature>
    <PhoneFeature name="Nasal">Nasal</PhoneFeature>
    <PhoneFeature name="Dorsal">Dorsal</PhoneFeature>
  </Phone>
</PhoneSet>

Morphosyntactic information
can also be used to select special words (or word types) or to know which affixes a given
word can take. In the same way, a morphosyntactic tagger can help context-dependent
clustered models to improve the quality of the produced speech. POS plays an important
role in the prediction of prosodic phrasing and accentuation. Certain POS categories, such
as the content and functional categories, have a strong influence on word accentuation.
Content words belong to major open-class lexical categories such as noun, verb, adjective
and adverb, and to closed-class words such as negatives and some quantifiers. Decision
methods based on content words have been widely used in word accentuation and have
proven to be very effective [Ribeiro et al., Huang et al.].
In DTD 2 the Document Type Definition for the XML implementation is presented. The
root entity of the POS is the POS entity. This entity may have two types of elements: the
first is LexicalCategorys, where morphosyntactic information is defined, and the second is
LexicalCategoryGroups, where POS category groups, such as the content and functional
categories, are defined. The LexicalCategorys entity may have one or more elements of
type Category. This last entity has two mandatory attributes, name and tag. The name
attribute is used to specify the morphosyntactic classification, while the tag attribute is
used to set its respective POS tag. The LexicalCategoryGroups entity may have one or
more elements of type Group. The Group entity only has one mandatory attribute, named
name. This entity is used to define groups of lexical categories, by means of the
LexicalCategory entity.
DTD 2 Part-Of-Speech "Document Type Definition".

<!ENTITY % POSElements "LexicalCategorys|LexicalCategoryGroups" >
<!ENTITY % LexicalCategorysElements "Category" >
<!ENTITY % LexicalCategoryGroupsElements "Group" >
<!ENTITY % GroupElements "LexicalCategory" >

<!ELEMENT LexicalCategory (#PCDATA)* >

<!ELEMENT Group (%GroupElements;)* >
<!ATTLIST Group name CDATA #REQUIRED>

<!ELEMENT Category EMPTY >
<!ATTLIST Category name CDATA #REQUIRED tag CDATA #REQUIRED>

<!ELEMENT LexicalCategoryGroups (%LexicalCategoryGroupsElements;)* >

<!ELEMENT LexicalCategorys (%LexicalCategorysElements;)* >

<!ELEMENT POS (%POSElements;)* >
In XML 2 a simple usage example of the POS grammar feature is presented. In this example,
the definition of three morphosyntactic categories (Noun, Verb and Preposition) and their
respective POS tags can be observed. Also note the grouping mechanism used to define two
POS categories (content and functional), where content is defined with the Noun and Verb
lexical categories and functional with its only member, Preposition.
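The category-to-tag mapping and the content/functional grouping can be mirrored by a small lookup. The following Python sketch is illustrative only (the data mirrors XML Code 2; the function names are not part of the system):

```python
# Categories carry a POS tag; groups are defined over category names,
# mirroring the LexicalCategorys / LexicalCategoryGroups entities.
CATEGORIES = {"Noun": "N", "Verb": "V", "Preposition": "P"}
GROUPS = {"content": {"Noun", "Verb"}, "functional": {"Preposition"}}

def group_of(category: str) -> str:
    """Return the lexical category group a category belongs to."""
    for group, members in GROUPS.items():
        if category in members:
            return group
    return "other"

print(CATEGORIES["Noun"], group_of("Noun"))                # N content
print(CATEGORIES["Preposition"], group_of("Preposition"))  # P functional
```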
Prosodic Markers
Speech events such as tone characterization are important linguistic resources. A good
tonal description allows the resulting synthesized speech to have a more natural behavior.
The purpose of this grammar feature is to define linguistic events of this kind.
In DTD 3 the Document Type Definition for the XML implementation is presented. The root
entity for prosodic markers is the ProsodicMarkers entity. This entity defines a set of
prosodic marker types and may have elements of type MarkerType. The MarkerType entity is
used to define linguistic markers that reproduce a known prosodic pattern, such as
punctuation or word breaks. This entity only has one mandatory attribute, named name. The
MarkerType entity has elements of type ProsodicMarker. The ProsodicMarker entity has two
mandatory attributes.
XML Code 2 A very simple Part-Of-Speech XML usage example.

<POS>
  <LexicalCategorys>
    <Category name="Noun" tag="N"/>
    <Category name="Verb" tag="V"/>
    <Category name="Preposition" tag="P"/>
  </LexicalCategorys>
  <LexicalCategoryGroups>
    <Group name="content">
      <LexicalCategory>Noun</LexicalCategory>
      <LexicalCategory>Verb</LexicalCategory>
    </Group>
    <Group name="functional">
      <LexicalCategory>Preposition</LexicalCategory>
    </Group>
  </LexicalCategoryGroups>
</POS>
The first, the name attribute, is used to identify the linguistic marker and the second, the tag
attribute, is used as the contextual factor identifier.
DTD 3 Prosodic Markers "Document Type Definition".

<!ENTITY % ProsodicMarkersElements "MarkerType" >
<!ENTITY % MarkerTypeElements "ProsodicMarker" >

<!ELEMENT ProsodicMarker EMPTY >
<!ATTLIST ProsodicMarker name CDATA #REQUIRED tag CDATA #REQUIRED>

<!ELEMENT MarkerType (%MarkerTypeElements;)* >
<!ATTLIST MarkerType name CDATA #REQUIRED>

<!ELEMENT ProsodicMarkers (%ProsodicMarkersElements;)* >
In XML 3, a simple usage example is shown, where the definitions of the WordBreak and
Punctuation prosodic markers can be observed.
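The marker definitions amount to a lookup from a marker name to its contextual-factor tag. A minimal illustrative sketch (data taken from XML Code 3; the default tag "O" for unlisted punctuation is an assumption based on the empty-name marker):

```python
# Prosodic marker tables, mirroring the MarkerType elements in XML Code 3.
MARKERS = {
    "WordBreak": {"NB": "0", "B": "3", "BB": "4"},
    "PunctuationType": {".": "A", "?": "I", "!": "E", ",": "C"},
}

def marker_tag(marker_type: str, name: str, default: str = "O") -> str:
    """Look up the contextual-factor tag for a prosodic marker."""
    return MARKERS.get(marker_type, {}).get(name, default)

print(marker_tag("WordBreak", "BB"))        # -> 4
print(marker_tag("PunctuationType", "?"))   # -> I
print(marker_tag("PunctuationType", ";"))   # -> O (fallback tag)
```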
2.3.2 Configurable Context Factors Module
A configurable Context Factors Module is introduced here to provide an easy way of
configuring specific context factors for a specific language, without performing any
modifications to the system core. The idea consists of retrieving context factor
information from a linguistic storage facilitator. In this work, Heterogeneous Relation
Graphs (HRGs) [Taylor et al.] were used to describe the linguistic structures. The HRG
formalism was developed for use in the Festival speech synthesis system [Black et al.].
In this formalism, linguistic objects such as words, syllables and phonemes are
represented by objects termed linguistic items. These items exist in relation structures,
which specify the relationships between the items. A relation exists for each required
linguistic type. An HRG contains all the relations and items for an utterance.
XML Code 3 Example of Prosodic Markers XML usage.

<ProsodicMarkers>
  <MarkerType name="WordBreak">
    <ProsodicMarker name="NB" tag="0"/>
    <ProsodicMarker name="B" tag="3"/>
    <ProsodicMarker name="BB" tag="4"/>
  </MarkerType>
  <MarkerType name="PunctuationType">
    <ProsodicMarker name="." tag="A"/>
    <ProsodicMarker name="?" tag="I"/>
    <ProsodicMarker name="!" tag="E"/>
    <ProsodicMarker name="," tag="C"/>
    <ProsodicMarker name="" tag="O"/>
  </MarkerType>
</ProsodicMarkers>
Next, the basic HRG feature access mechanism for this module is analyzed. In DTD 4 the
Document Type Definition for the XML implementation is presented. The root entity of this
module is the Label entity. This entity may only have elements of type BaseRelation, which
define sets of contextual factors for different targets. The idea is to allow sets of
contextual factors to be defined under different conditions. A set of context factors can
be defined for the training process of tree-based context clustering, while the same
context factors can be seen differently during the synthesis stage. For example, the
effect of post-lexical rules applied at the synthesis stage can only be observed in the
real phone sequence during the training stage, requiring a different disposition of the
linguistic data for different targets.
The BaseRelation entity has two attributes: the name attribute, which establishes the
Relation in the HRG structure from which all context factors are generated, and the
target attribute, which is used to identify the set of context factors that will be
returned. The BaseRelation entity may only have elements of type Level. These define
the context factor linguistic levels, such as the Word, Syllable or Phone levels.
The Level entity has two attributes, the name and switch attributes. The first gives
information about the linguistic level. The second works much like the switch flow
control statement of the C/C++ languages. Some context factors may change their
linguistic references under certain conditions. For example, consider the contextual
factor for the central phone. When its value becomes a silence, certain context factors
carry little linguistic information and sometimes none at all, meaning that they are
missing from the HRG structure. To handle this, the switch attribute determines the HRG
base relation linguistic feature to be observed, whose value will cause a modification
of the underlying context factor specifications.
The Level entity only has Features elements. These elements have only one attribute, the
case attribute. Its value triggers the change in the context factors previously defined
by the switch attribute on the Level entity. When the linguistic feature defined by the
switch attribute reaches the value defined by the case attribute, a change in the
specification of the context
DTD 4 Feature Extraction Configuration "Document Type Definition".

<!ENTITY % LabelElements "BaseRelation" >
<!ENTITY % BaseRelationElements "Level" >
<!ENTITY % LevelElements "Features" >
<!ENTITY % FeaturesElements "BaseItem" >
<!ENTITY % BaseItemElements "Feature" >

<!ELEMENT Feature EMPTY >
<!ATTLIST Feature position CDATA #REQUIRED pre CDATA #REQUIRED
          post CDATA #REQUIRED name CDATA #REQUIRED
          arg CDATA #REQUIRED null CDATA #REQUIRED
          tag CDATA #REQUIRED question CDATA #IMPLIED
          lower CDATA #IMPLIED upper CDATA #IMPLIED
          status CDATA #IMPLIED required CDATA #IMPLIED>

<!ELEMENT BaseItem (%BaseItemElements;)* >
<!ATTLIST BaseItem name CDATA #REQUIRED >

<!ELEMENT Features (%FeaturesElements;)* >
<!ATTLIST Features case CDATA #REQUIRED >

<!ELEMENT Level (%LevelElements;)* >
<!ATTLIST Level name CDATA #REQUIRED switch CDATA #REQUIRED >

<!ELEMENT BaseRelation (%BaseRelationElements;)* >
<!ATTLIST BaseRelation name CDATA #REQUIRED target CDATA #REQUIRED>

<!ELEMENT Label (%LabelElements;)* >
factors will occur at the current linguistic level.
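The switch/case mechanism can be illustrated with a minimal sketch. The data and function names below are assumptions for illustration; in the real module this information is read from the XML configuration:

```python
# One Level, mirroring the Syllable level in XML Code 4: the "switch"
# attribute names the feature to observe; the Features element whose
# "case" matches the observed value supplies the active specification.
LEVEL = {
    "switch": "name",
    "cases": {
        "Default": ["stress"],  # normal syllable features
        "#": ["null"],          # silence: no syllable information
    },
}

def active_features(observed_value: str):
    """Select the feature specification matching the switched value."""
    cases = LEVEL["cases"]
    return cases.get(observed_value, cases["Default"])

print(active_features("a"))  # -> ['stress']
print(active_features("#"))  # -> ['null']
```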
The Features entity only has BaseItem type elements, which have only one attribute,
named name. This attribute specifies the base path, in the HRG, from the base relation
linguistic item to the underlying linguistic features. The BaseItem entity exists for
optimization purposes: defining the HRG base path to the underlying features minimizes
the use of search algorithms and thus maximizes performance on feature retrieval.
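The role of the base path can be pictured as dotted-path navigation over a nested structure. The sketch below is a loose Python analogy, not Festival's actual HRG API; the item layout and feature names are invented for illustration, and the "x" default mirrors the null attribute described later:

```python
# Follow a dotted path such as "parent.stress" link by link from a base
# item, returning a default value when any step is missing.
def fetch(item: dict, path: str, default: str = "x"):
    node = item
    for step in path.split("."):
        node = node.get(step) if isinstance(node, dict) else None
        if node is None:
            return default
    return node

phone = {"name": "6", "parent": {"stress": "1"}}
print(fetch(phone, "parent.stress"))  # -> 1
print(fetch(phone, "parent.accent"))  # -> x (missing feature)
```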
The BaseItem entity may only have elements of type Feature. These are responsible for
the context factor definitions. The Feature entity has twelve attributes, seven mandatory
and five optional. To ease description, these twelve attributes are divided into the
following four sets: output, context feature, question generation and flow control. In
the first set, position, pre and post are responsible for the context label format. The
position attribute specifies the context factor's order in the context label, and the pre
and post attributes the context factor separators. In the second set, the name attribute
is responsible for fetching linguistic features from the HRG structure. Since not all
linguistic features are statically available in the HRG structure, the arg attribute may
be used to dynamically retrieve linguistic features, by using pre-defined processing
functions specified in the name attribute. The arg attribute specifies the processing
function input parameters and the null attribute the default return value in case of
missing features. In the third set, the question generation set, the attributes tag,
question, lower and upper are used to automatically generate context
tree questions. The tag attribute identifies the set of context feature questions that
are generated by the question attribute. The question attribute identifies a pre-defined
processing function for automatic question generation, and the lower and upper attributes
its constraints. For example, in XML 4, at the syllable level, the feature named stress
applies the automatic question generation function eq, to generate stress questions equal
to 0 and 1. The last set, the flow control set, is used to control the context feature
flow. The status attribute is used to enable or disable a certain context feature; while
assembling context factors it is often useful to enable or disable certain context
factors for testing. Finally, the required attribute forces the current context factor
to disregard the current context label if the feature value is null or nonexistent. In
XML 4, a simple XML usage example of context factor specification is shown.
XML Code 4 XML example for the Context Factors Module.

<Label>
  <BaseRelation name="Observed" target="Simple-Training-Full">
    <Level name="RealPhone" switch="name">
      <Features case="Default">
        <BaseItem name="">
          <Feature position="0" pre="^" post="-" name="p.trans_name" arg="" null="x"
                   tag="L.Phone" question="FeaturedPhones" lower="" upper=""/>
          <Feature position="1" pre="-" post="+" name="trans_name" arg="" null="x"
                   tag="C.Phone" question="FeaturedPhones" lower="" upper=""/>
          <Feature position="2" pre="+" post="=" name="n.trans_name" arg="" null="x"
                   tag="R.Phone" question="FeaturedPhones" lower="" upper=""/>
        </BaseItem>
      </Features>
    </Level>
    <Level name="Syllable" switch="name">
      <Features case="Default">
        <BaseItem name="R:SylStructure.parent.R:Syllable">
          <Feature position="4" pre="-" post="!" name="stress" arg="" null="x"
                   tag="C.Syllable.Stressed" question="eq" lower="0" upper="1"/>
        </BaseItem>
      </Features>
      <Features case="#">
        <BaseItem name="R:SylStructure.parent.R:Syllable">
          <Feature position="4" pre="-" post="!" name="null" arg="" null="x"
                   tag="C.Syllable.Stressed"/>
        </BaseItem>
      </Features>
    </Level>
  </BaseRelation>
</Label>
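Given the position, pre and post attributes above, a context label fragment could be assembled roughly as follows. This is a simplified sketch, not the system's actual label builder; it assumes each factor's pre separator coincides with the previous factor's post separator, as in XML Code 4:

```python
# Assemble a context-label fragment from feature values and separators.
# Feature values here are invented; separators come from XML Code 4.
features = [
    {"position": 0, "post": "-", "value": "x"},  # previous phone (null -> x)
    {"position": 1, "post": "+", "value": "6"},  # central phone
    {"position": 2, "post": "=", "value": "w"},  # next phone
]

def build_label(feats):
    ordered = sorted(feats, key=lambda f: f["position"])
    # Adjacent factors share a separator (each pre equals the previous
    # post), so emitting value + post for each factor is sufficient.
    return "".join(f["value"] + f["post"] for f in ordered)

print(build_label(features))  # -> x-6+w=
```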
Although the structure of this module could be designed in a number of different ways,
this implementation can be easily modified and adapts well to the target context factors.
2.3.3 Feature Label Generator
In the previous sections, configurable language-dependent modules for on-line context
feature extraction were proposed. XML was chosen as the support language for these
configurable modules. In this section, a C++ interface for these modules is described,
which performs automatic context-dependent label extraction and automatic question
generation. The interface consists of an XML parser, a question generator and a
context-dependent label generator.
The XML parser was implemented using the Xerces-C library. Xerces-C is a validating XML
parser written in a portable subset of C++, faithful to the XML 1.0 recommendation and
associated standards. The parser uses the Xerces-C DOM to load grammar features and
context factors into an object representation similar to the entities described in the
previous sections. After the XML (containing grammar features and context factors) is
loaded, questions can be automatically generated in HTK format. Questions are generated
for all context factors, by using the question, lower and upper attributes of the Feature
entity. The context-dependent label generator uses the loaded XML to retrieve all
context-dependent linguistic information. Given an input utterance, all items in the
relation specified by the name attribute of the BaseRelation entity are sequentially
processed and the underlying linguistic information fetched as described in section
2.3.2. After the linguistic information is retrieved from the utterance, the gathered
data is dumped as an HTK context-dependent label.
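As an illustration of the automatic question generation step, the eq function for the stress feature in XML Code 4 could expand into HTK-style QS lines along the following lines. The question naming scheme shown is an assumption for illustration, not the system's exact output; only the QS pattern idea (matching a factor through its pre/post separators) is taken from the text:

```python
# Generate one yes/no question per value in [lower, upper] for an "eq"
# question function, matching the factor through its separators.
def eq_questions(tag, pre, post, lower, upper):
    return [
        'QS "%s==%d" {*%s%d%s*}' % (tag, v, pre, v, post)
        for v in range(lower, upper + 1)
    ]

for q in eq_questions("C.Syllable.Stressed", "-", "!", 0, 1):
    print(q)
# QS "C.Syllable.Stressed==0" {*-0!*}
# QS "C.Syllable.Stressed==1" {*-1!*}
```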
2.4 Grammar Features and Context Factors for European Portuguese
In this section, the work done in the context of this thesis for European Portuguese is
presented, with detailed descriptions of the grammar features and context factors.
2.4.1 Grammar Features
Phoneset
The phonetic representation used for the European Portuguese dialect was the PT-SAMPA
phoneset. In appendix A a reference to the IPA phonetic alphabet can be found.
The need to generalize across languages and to express phonological rules clearly led
many linguists to the creation of distinctive feature systems. The most notable one is
that of Chomsky and Halle (1968), following the pioneering work of Jakobson and Halle
(1956) on distinctive feature theory. In the Chomsky and Halle (1968) system, features
are binary, where /+/ indicates the presence of a property and /−/ its absence. Each
feature represents an independently controllable articulatory aspect.
The sets of distinctive features used in this thesis are based on the work of Mateus and
d'Andrade (2000); Mateus, Andrade, Viana, and Villalva (1990); Oliveira (1996), which
applies Chomsky's distinctive feature system to European Portuguese. In table 2.1 the
distinctive features for PT-SAMPA vowels are presented, and in table 2.2 the distinctive
features for PT-SAMPA consonants. These distinctive features were used in this thesis as
phone features, to build the phonetic question set for context-clustered models.
Name  Syllabic  High  Low  Back  Labial  Round  Nasal  Dorsal
i        +       +     −     −      −       −      −      −
e        +       −     −     −      −       −      −      −
E        +       −     +     −      −       −      −      −
6        +       −     −     +      −       −      −      +
a        +       −     +     +      −       −      −      +
O        +       −     +     +      +       +      −      −
o        +       −     −     +      +       +      −      −
u        +       +     −     +      +       +      −      −
@        +       +     −     +      −       −      −      +
i˜       +       +     −     +      −       −      +      −
e˜       +       −     −     +      −       −      +      −
6˜       +       −     −     +      −       −      +      +
o˜       +       −     −     +      +       +      +      −
u˜       +       +     −     +      +       +      +      −
j        −       +     −     −      −       −      −      −
w        −       +     −     +      −       +      −      +
j˜       −       +     −     −      −       −      +      −
w˜       −       +     −     +      −       +      +      +
Table 2.1: Chomsky distinctive features for PT-SAMPA vowels
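A feature table like table 2.1 can be turned into phonetic question sets by grouping the phones that share a feature value. The sketch below is illustrative: it encodes only a small subset of the vowel table (with ASCII ~ standing in for the nasal diacritic) and the function name is not part of the system:

```python
# Subset of table 2.1: High and Nasal values for four vowels.
VOWELS = {
    "i":  {"High": "+", "Nasal": "-"},
    "6":  {"High": "-", "Nasal": "-"},
    "6~": {"High": "-", "Nasal": "+"},
    "u~": {"High": "+", "Nasal": "+"},
}

def phones_with(feature: str, value: str):
    """Collect the phones whose distinctive feature has the given value."""
    return sorted(p for p, f in VOWELS.items() if f[feature] == value)

print(phones_with("Nasal", "+"))  # -> ['6~', 'u~']
print(phones_with("High", "+"))   # -> ['i', 'u~']
```

Each such group corresponds to one yes/no question usable by the tree-based clustering.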
POS
As seen in section 2.3.1, POS is a word level feature type and an important linguistic
resource. In synthesis based on context-dependent clustered models, POS assists the
prediction of prosodic phrasing and accentuation. The POS tagging system used was the
one defined in Ribeiro (2003). The tag set used is a subset of the POS tag set of the
PAROLE corpus [Nascimento et al.]. This subset retains information on the lexical
category and sub-category, discarding any other information. The resulting set has a
total of 28 tags, which can be observed in table 2.3.
Content words belong to major open-class lexical categories such as noun, verb,
adjective and adverb, and to certain closed-class words such as negatives and some
quantifiers. Likewise, functional words belong to closed-class lexical categories such
as articles, conjunctions, pronouns, prepositions and numerals. The grouping of lexical
categories has a strong influence on the prediction of accents and of prosodic events
that can contribute to better modeling of the HMM model trees. In table 2.3, the lexical
category grouping carried out for European Portuguese can be observed.
Name  Continuant  Sonorant  Prior  Coronal  Back  Distributed  Nasal
p         −          −        +       −       −        −         −
b         −          −        +       −       −        −         −
t         −          −        +       +       −        −         −
d         −          −        +       +       −        −         −
k         −          −        −       −       +        −         −
g         −          −        −       −       +        −         −
f         +          −        +       −       −        −         −
v         +          −        +       −       −        −         −
s         +          −        +       +       −        +         −
z         +          −        +       +       −        +         −
S         +          −        −       +       −        +         −
Z         +          −        −       +       −        +         −
l         +          +        +       +       −        −         −
l˜        +          +        −       −       +        −         −
L         −          +        −       −       −        +         −
m         −          +        +       −       −        −         +
n         −          +        +       +       −        −         +
J         −          +        −       +       −        +         +
r         +          +        +       +       −        −         −
R         +          +        −       −       +        −         −

Name  High  Strident  Voiced  Lateral  Laryngeal  Labial  Dorsal
p       −       −        −       −         +         +       −
b       −       −        +       −         +         +       −
t       −       −        −       −         +         −       −
d       −       −        +       −         +         −       −
k       +       −        −       −         +         −       +
g       +       −        +       −         +         −       +
f       −       +        −       −         +         +       −
v       −       +        +       −         +         +       −
s       −       +        −       −         +         −       −
z       −       +        +       −         +         −       −
S       +       +        −       −         +         −       −
Z       +       +        +       −         +         −       −
l       −       −        +       +         −         −       −
l˜      −       −        +       +         −         −       +
L       +       −        +       +         −         −       −
m       −       −        +       −         −         +       −
n       −       −        +       −         −         −       −
J       +       −        +       −         −         −       −
r       −       −        +       −         −         −       −
R       +       −        −       −         −         −       +
Table 2.2: Chomsky distinctive features for PT-SAMPA consonants
Prosodic Markers
Tone characterization is an important linguistic feature at the word level. Its main
influence in context clustered models is on the pitch parameter. A good description of
intonation allows the resulting synthesized speech to have a more natural behavior.
POS Tag  Lexical Category         Lexical Category Group
AA       None                     none
Nc       Noun.Common              content
Np       Noun.Proper              content
V=       Verb                     content
A=       Adjective                content
R=       Adverb                   content
Td       Article.Definite         functional
Ti       Article.Indefinite       functional
Cc       Conjunction.Coordinate   functional
Cs       Conjunction.Subordinate  functional
Mc       Numeral.Cardinal         functional
Mo       Numeral.Ordinal          functional
Pp       Pronoun.Personal         functional
Pd       Pronoun.Demonstrative    functional
Pi       Pronoun.Indefinite       functional
Po       Pronoun.Possessive       functional
Pt       Pronoun.Interrogative    functional
Pr       Pronoun.Relative         functional
Pe       Pronoun.Exclamative      functional
Pf       Pronoun.Reflexive        functional
S=       Preposition              functional
I=       Interjection             emotional
Xf       Residual.LoanWords       other
Xa       Residual.Abbreviation    other
Xy       Residual.Acronym         other
Xs       Residual.Symbol          other
U=       PassiveMarker            other
O=       Punctuation              other

Table 2.3: POS tag system

In recent years, research on intonation for the European Portuguese dialect has been
carried out [Hirschberg and Prieto, Viana et al., Vigário and Frota]. However, tools for
automatic tone stylization are still under development, and thus a workaround solution
was needed. To enhance the models' intonation, syntactic information such as word breaks
and punctuation was used instead. In table 2.4, word break and punctuation based
prosodic markers are presented.
(a) Wordbreak

name  tag
NB    0
B     3
BB    4

(b) Punctuation

name     tag
.        A
?        I
!        E
,        C
/other/  O
Table 2.4: Prosodic Markers
2.4.2 Context Factors
In this section, the sets of context factors for European Portuguese are presented. The
following 24 context factor sets, distributed across 5 levels of speech categories, were
designed for European Portuguese to build context-dependent models. Note that all
context factors are relative to the current phone.
Phone level
At the phone level, penta-phones were used. Penta-phone models have been used by many
HMM-based speech synthesis systems, and have proved their importance in HMM modeling for
synthesis purposes [Black et al.].
In many spoken languages, including European Portuguese, sandhi phenomena contribute to
the naturalness of the language. Sandhi consists of acoustic phone modifications produced
between two consecutive words inside a sentence. To reproduce these phenomena in speech
synthesis, post-lexical rules are applied to canonical phone sequences. To capture the
importance of these phenomena in the models, observed phone sequences should be used.
However, using observed phone sequences to train context-dependent clustered HMM models
may reduce their accuracy: the clustering of HMMs may disregard the neighborhood
influences of important phone transitions when using observed phone sequences. To force
the training to take possible post-lexical influences into account, a mixed solution is
proposed: both canonical and observed phone sequences were included as context factor
sets, with the expectation that the decision trees would reflect post-lexical rules by
building different contexts for specific phone transitions.
Phone level context factors:
1. {previous previous, previous, current, next, next next} observed phone;
2. {previous previous, previous, current, next, next next} canonical phone;
3. {backward, forward} position in syllable;
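The penta-phone factors above amount to pairing each phone with its two left and two right neighbours. A minimal sketch follows (the sil padding symbol at utterance edges is an assumption for illustration):

```python
# Build quinphone (penta-phone) context tuples for a phone sequence,
# padding with a silence symbol at the utterance boundaries.
def quinphones(phones, pad="sil"):
    seq = [pad, pad] + phones + [pad, pad]
    return [tuple(seq[i:i + 5]) for i in range(len(phones))]

print(quinphones(["6", "w~"]))
# [('sil', 'sil', '6', 'w~', 'sil'), ('sil', '6', 'w~', 'sil', 'sil')]
```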
Syllable level
In this level, syllable information is retrieved. Stress, accent and proximity of major phrase
breaks are important to prosody. Consequently, related information was included.
Syllable level context factors:
1. {previous, current, next} syllable stress;
2. {previous, current, next} syllable accent;
3. {previous, current, next} number of phones in syllable;
4. {backward, forward} position of current syllable in word;
5. number of syllables to {previous, next} phrase break;
6. number of stressed syllables to {previous, next} phrase break;
7. number of accented syllables to {previous, next} phrase break;
8. distance to {previous, next} stressed syllable;
9. distance to {previous, next} accented syllable;
10. syllable nuclear phone;
Word level
At the word level, intonation and prosodic ruptures are fundamental. Accordingly, word
breaks, punctuation, POS and distance measures to content words were used.
Word level context factors:
1. {previous, current, next} part-of-speech;
2. {previous, current, next} word breaks;
3. {previous, current, next} punctuation type;
4. {previous, current, next} word number of syllables;
5. {backward, forward} position of word in phrase;
6. {backward, forward} number of content words in phrase;
7. distance to {previous,next} content word;
Phrase level
During the recordings, speakers tend to reproduce strong reading patterns, such as
intonation patterns. These patterns are usually a function of the phrase length and of
the distance to the last speech rupture. The phrase level attempts to capture these
reading patterns.
Phrase level context factors:
1. {previous, current, next} phrase number of syllables;
2. {previous, current, next} phrase number of words;
3. number of non-major phrase breaks to {previous, next} major phrase break;
Utterance level
The utterance level complements the phrase level. It attempts to capture other reading
patterns, such as intonation transition patterns between consecutive phrases.
Utterance level context factors:
1. total number of {syllables, words, phrases} in utterance;
2.4.3 Questions
As seen before, tree-based context clusters are represented as binary trees in which a
yes/no context question is attached to each node. For each context factor, a question is
automatically generated by the feature label generation module. The context sets
described in the previous section were automatically transformed into questions,
resulting in a total of 3231 questions.
2.5 Conclusions
In this chapter, context-dependent clustered models for speech synthesis were
overviewed. As stated before, the main problem in context cluster synthesis is the
dependency on the target language. To overcome language issues, configurable grammar
feature and context factor modules for on-line processing were proposed. The goal of
these modules is to provide an easy way of configuring context-dependent clustered
models, without modifications to the system core, and to avoid long and tedious
implementation work.
The work developed on grammar features and context factors for European Portuguese was
presented; although it was only applied to European Portuguese, it is general enough to
be applied to other languages.
3 Voice Building with HTS
Contents

3.1 Corpora
3.2 HMM Training
3.3 Results
3.4 Conclusions
Voice building is the process of preparing the recordings of a voice to be used by a
speech synthesis system. This process involves a set of procedures such as data
preparation, parameter extraction and model construction. HMM synthesis was the
technique chosen for small-footprint synthesis, concretely in an HTS-based system. In
this chapter, all HTS voice building procedures are described.
3.1 Corpora
The corpora used in this thesis were constructed in the scope of the Tecnovoz project
[National Project TECNOVOZ number 03/165], by the L2F laboratory at INESC-ID and its
Tecnovoz partner INOV. The Tecnovoz project was a joint effort to disseminate the use of
spoken language technologies in a wide range of domains. The project consortium included
4 research centers and 9 companies specialized in areas such as banking, health systems,
fleet management, security, media, alternative and augmentative communication, computer
desktop applications, etc.

The Tecnovoz road-map for speech databases was to build an inventory of a considerable
amount of speech recordings (from 3 to 10 hours or more) with carefully selected
contents. The database's original target was to feed a unit selection based TTS system.
In this section, the methodologies used in designing and recording the Tecnovoz speech
database are described, as well as the procedures taken while building a corpus for
HMM-based synthesis. More information on the speech database design methodologies can
be found in [Oliveira et al.].
3.1.1 Design of the Recording Prompts
Since the original purpose of the inventory was to feed a unit selection based TTS system,
the text prompts to be recorded were selected to cover the acoustic patterns observed in the
general use of the language. This coverage cannot be achieved at the word level, as the number
of words in a language is virtually infinite. Therefore, the candidate prompts must be represented
by smaller sized acoustic units. Three levels of representation were used: syllables, triphones
and diphones. These levels make up finite sets and can carry information that spans from the
phonetic level up to the prosodic level.
The text corpora mainly consist of newspaper texts and books, which do not always cover
the specific requirements of certain applications such as speech-to-speech translation,
medical systems, customer support, etc. The gathered text corpora contained a total of
70 million words and 420 thousand distinct words.
In order to have a proper sub-word selection scheme, very high confidence in the
estimated phone sequence of every sentence is needed. For this reason, all sentences
containing words not included in a manually corrected pronunciation lexicon were
discarded. The text corpus was thus reduced to approximately 400 thousand sentences. A
greedy selection algorithm was then used to select a representative sub-set of the
sentences in the text corpora [Johnson, Chevelu et al.], aiming at token coverage at
three selected levels: syllables, triphones and diphones. The greedy algorithm stopped
after a predefined coverage threshold, resulting in a total of 8260 selected text
prompts.
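The greedy selection loop can be sketched as follows. This is a toy illustration with invented data; the actual algorithm and its stopping criterion are described in the cited works:

```python
# Repeatedly pick the sentence covering the most not-yet-covered units
# (e.g. diphones) until a coverage threshold is reached.
def greedy_select(sentences, units_of, threshold):
    all_units = set().union(*(units_of(s) for s in sentences))
    covered, selected = set(), []
    while len(covered) < threshold * len(all_units):
        best = max(sentences, key=lambda s: len(units_of(s) - covered))
        gain = units_of(best) - covered
        if not gain:  # no sentence adds coverage; stop early
            break
        selected.append(best)
        covered |= gain
    return selected

# Toy data: three sentences with their diphone sets.
diphones = {"s1": {"a-b", "b-c"}, "s2": {"b-c"}, "s3": {"c-d", "a-b"}}
print(greedy_select(list(diphones), lambda s: diphones[s], 1.0))
# -> ['s1', 's3']
```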
3.1.2 Speakers
The original Tecnovoz speech database included four speakers, two of whom were male.
The speaker selection was based on the results of test recording sessions with several
candidates, who were pre-selected through personal contacts and through a voice talent
recording studio. All candidates were Portuguese native speakers from the Lisbon area.

The test consisted of recording a session of 600 sentences, selected to have a good
diphone coverage. Using these recordings, a synthesizer was built with each voice,
making it possible to evaluate not only the quality of the voice itself but also its
suitability for synthesis purposes. The decision was taken by listening to several
phonetically rich prompts synthesized with a variable size unit selection voice built
from the recordings of each speaker. The decision criteria were:
• the recording naturalness;
• number of repetitions per prompt per session;
• voice quality consistency;
• pleasantness of the synthesized voice;
• voice ability to mask concatenation errors.
In this thesis, only one of the male speakers was used. The choice of this speaker was
influenced by the following characteristics:
• professional voice talent;
• consistent reading naturalness;
• high voice quality;
• low pitch frequency.
3.1.3 Recordings
The recording of the inventory required a large number of recording sessions and a
strict recording procedure to ensure the uniformity of the database [Bonafonte et al.,
Oliver and Szklanny, Saratxaga et al.]. The recordings were conducted in the L2F
recording studio, which includes a sound-proof room and a control station (see figure
3.1), where the supervision of the recording process took place. The equipment in the
sound-proof room includes:
• a Studio Projects T3 Dual Triode microphone;
• an anti-pop filter;
• a Brüel & Kjær Type 2230 microphone probe;
• an LCD monitor;
• a set of headphones;
• a web camera;
• and a small mirror on the wall.
The supervisor could check the speaker's position at the beginning of each session by
comparing the web camera image with pictures taken in previous sessions. The small mirror on
the wall helped the speakers maintain a fixed distance to the microphones during the sessions:
they were asked to check the position of their face in the mirror periodically. Also, to keep
the speaker's voice level and quality consistent across sessions, blocks of recorded prompts from
previous sessions were played back and compared with newly recorded prompts at the beginning of
each session.
(a) Recording booth (b) Control Room
Figure 3.1: Recording Room [Oliveira et al.]
In the control station, the signals from both microphones were digitized using an RME
Fireface 800 digital mixing desk, with a sampling frequency of 44.1 kHz and 24-bit quantization.
The audio feedback and the supervisor's instructions were also routed through the mixing
desk to the speakers' headphones.
The control station had two display monitors, one of them mirrored inside the sound-
proof room. These monitors were used to display the recording prompts under the control of the
recording supervisor. Since speaker throat relaxation and list effects have an important effect
on the recorded speech, recordings were done in sessions of two hours with a 10-minute break
every half hour. Each recording session produced, on average, 40 minutes of recorded speech.
By the end of the recordings, 20 sessions per speaker had been made, resulting in 13 hours of
speech per speaker.
3.1.4 Phonetic Segmentation and Multi-Level Utterance Descriptions
The phonetic segmentation of the databases was performed in three stages [Weiss et al.;
Paulo et al.]. In the first stage, the speech files were segmented by Audimus [Neto and Meinedo]
working in forced alignment mode. Next, these segmentations were used by the HTK programs
[Young et al.] for training context-independent speaker-specific phone models. The
speaker-adapted models were subsequently provided to a phonetic segmentation tool based on
weighted finite state transducers, which allows many alternative word pronunciations [Paulo
and Oliveira].
The utterances' orthographic transcriptions are then combined with the respective phonetic
segmentations using the procedure described in [Paulo and Oliveira], in order to obtain a
realistic, multi-level description of the spoken utterances. Moreover, these descriptions are
enhanced with additional information, such as F0 values of the speech signal and prosodic
annotations. The F0 values are assigned to the respective phonetic segments based on a
temporal inclusion criterion.
3.1.5 Corpus sub-set for HMM Training
The selected male speaker from the Tecnovoz speech database has a total of 8260 prompts
and 13 hours of recorded speech. The large size of this database poses a problem for tree-based
context HMM clustering. As seen in section 2.2, tree-based context HMM clusters are
represented as binary trees in which a yes/no context question is attached to each node. These
trees are built using a top-down sequential optimization process, where initially all models are
placed in a single cluster at the root of the tree. At this point all HMM models must be
available for processing, meaning that considerable computational resources are required to build
the context tree. Consequently, a smaller subset of the database should be used in development
to guarantee enough computational resources in the building process.
The original corpus was designed to cover many important linguistic features, such as phones,
diphones, triphones, syllables and words. This coverage technique allows the design of databases
that are very rich in terms of intonation and special acoustic units like diphthongs and
triphthongs. Diphthongs and triphthongs have very specific characteristics, such as accent
co-articulation. In section 2.4.2, the use of penta-phones as context factors was described.
Their importance is reflected in the building of specific models for diphthongs and triphthongs.
Total penta-phone coverage is very hard to achieve, even in a large database like the Tecnovoz
database: the number of penta-phone combinations is in the order of 70 million.
To guarantee that the subset contains at least the most common and important linguistic
features, a statistical approach to subset corpus design was chosen. If the large corpus is
sampled uniformly at random until a certain number of prompts is reached, the probability of a
given unit being in the subset is asymptotically the same as that of being in the large corpus.
This means that the units in the subset will appear in asymptotically the same relative
proportions as in the large corpus. In figure 3.2, an example of this procedure for 1k prompts
is illustrated. Following this procedure, a subset for HMM training was built, amounting to a
total of 1500 prompts and 2.3 hours of recorded speech.
(a) Diphone coverage: in green, the diphone coverage of the Tecnovoz database and, in red, the coverage of the same diphones in the subset
(b) Triphone coverage: in green, the triphone coverage of the Tecnovoz database and, in red, the coverage of the same triphones in the subset
Figure 3.2: Unit coverage
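The uniform random sampling step can be sketched as below. This is a minimal illustration, assuming a toy corpus of labeled prompts and a fixed seed for reproducibility; the statistical point is simply that unit proportions in the sample track those of the full corpus.

```python
import random

def sample_subset(prompts, n, seed=0):
    """Uniform random sampling of n prompts: each unit's relative
    frequency in the subset approaches its frequency in the full corpus."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    return rng.sample(prompts, n)

# Illustrative check on a toy "corpus" with a 70/30 split of prompt types.
corpus = ["type_a"] * 700 + ["type_b"] * 300
subset = sample_subset(corpus, 100)
share_a = subset.count("type_a") / len(subset)  # close to 0.7 on average
```

With 100 samples drawn from a 70/30 population, the sampled proportion concentrates around 0.7 (hypergeometric standard deviation of roughly 4–5 percentage points), mirroring the asymptotic argument in the text.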
3.2 HMM Training
In the training phase, spectral parameters (vocal tract parameters) and excitation
parameters (F0 parameters) are extracted from a speech database and then modeled by context-
dependent HMMs. In figure 3.3, HMM training for synthesis based on context-dependent
clustered models is illustrated [Yoshimura].
Continuous density HMMs are usually adopted for vocal tract modeling, in the same way as in
speech recognition systems. A continuous density HMM is a finite state machine
which makes one state transition at each time unit. First, a decision is made on which state to
transition to. Then an output vector is generated according to the probability density function
of the current state. An HMM is thus a doubly stochastic random process, modeling both the
transition probabilities between states and the output probabilities at each state [Rabiner].

Figure 3.3: HMM Training [Yoshimura]
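The doubly stochastic process just described can be sketched in a few lines. The two-state transition and emission parameters below are illustrative, not trained values; each time step draws an emission from the current state's Gaussian and then a transition to the next state.

```python
import random

# Minimal sketch of an HMM as a doubly stochastic process:
# a hidden state transition plus a Gaussian output draw per time step.
# The two-state parameters are purely illustrative.
TRANS = {0: [(0, 0.8), (1, 0.2)], 1: [(1, 0.9), (0, 0.1)]}
EMIT = {0: (0.0, 1.0), 1: (5.0, 0.5)}   # state -> (mean, stddev)

def sample_hmm(n_steps, seed=0):
    rng = random.Random(seed)
    state, outputs = 0, []
    for _ in range(n_steps):
        mean, std = EMIT[state]
        outputs.append(rng.gauss(mean, std))        # emission draw
        nxt, probs = zip(*TRANS[state])
        state = rng.choices(nxt, weights=probs)[0]  # transition draw
    return outputs

obs = sample_hmm(50)
```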
The F0 pattern is composed of continuous values in voiced regions and a discrete symbol
in unvoiced regions. This dual nature makes F0 difficult to model with either discrete
or continuous HMMs. For this reason, the state output probabilities for F0 pattern modeling
are defined by Multi-Space probability Distributions (MSDs) [Yoshimura; Masuko].
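The idea behind an MSD state output distribution can be sketched as follows: one zero-dimensional space carries the "unvoiced" symbol with weight 1 − w, and a one-dimensional Gaussian space carries voiced log-F0 values with weight w. The weight and Gaussian parameters below are illustrative assumptions, not values from the trained models.

```python
import math

# Hedged sketch of a Multi-Space probability Distribution for F0:
# a 0-dimensional space for the "unvoiced" symbol and a 1-D Gaussian
# space for voiced log-F0 values. Parameters are illustrative.

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def msd_likelihood(obs, w_voiced, mean, var):
    """obs is either the symbol 'unvoiced' or a continuous log-F0 value."""
    if obs == "unvoiced":
        return 1.0 - w_voiced          # probability mass of the 0-D space
    return w_voiced * gaussian_pdf(obs, mean, var)

p_uv = msd_likelihood("unvoiced", w_voiced=0.8, mean=4.7, var=0.01)
p_v = msd_likelihood(4.7, w_voiced=0.8, mean=4.7, var=0.01)
```

A single distribution thus scores both kinds of observation, which is exactly what discrete-only or continuous-only HMMs cannot do.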
In this thesis, HMM training was performed with the HTS 2.1 toolkit [Tokuda et al.] and a
training script from one of the speaker-dependent training demos available on-line [Tokuda et
al.]. HTS 2.1 is integrated into the HTK 3.4 toolkit [Young et al.].
3.2.1 Setup
The audio of the previously selected corpus subset was recorded with 44.1 kHz sampling
frequency and 24-bit quantization, as stated in section 3.1.3. The first step in the data
setup procedure was to lowpass-filter the audio database with a cut-off frequency of 8 kHz and
downsample it to 16 kHz, producing a speech database with 16 kHz sampling frequency and 16-bit
quantization. The use of a lower sampling frequency and bit depth makes the training procedure
much lighter, without compromising either the quality of the corpus or the quality of the
synthesis.
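The rate and depth conversion can be sketched in pure Python. This is a simplified stand-in: a real pipeline applies the 8 kHz lowpass filter before resampling, as described above, whereas here plain linear interpolation and bit truncation illustrate the two operations.

```python
# Hedged sketch of the 44.1 kHz/24-bit -> 16 kHz/16-bit conversion.
# A real pipeline lowpass-filters at 8 kHz first; linear interpolation
# stands in here for brevity.

def resample_linear(samples, src_rate, dst_rate):
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in source samples
        j = min(int(pos), len(samples) - 2)
        frac = pos - j
        out.append(samples[j] * (1 - frac) + samples[j + 1] * frac)
    return out

def requantize_24_to_16(sample_24bit):
    return sample_24bit >> 8                   # drop the 8 least significant bits

x = list(range(441))                           # 10 ms of fake 44.1 kHz audio
y = resample_linear(x, 44100, 16000)           # 10 ms at 16 kHz -> 160 samples
```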
The next steps were spectral and F0 parameter extraction. The method used for spectral
extraction was Mel Generalized Cepstrum coefficients (MGC). MGC extraction is available in the
SPTK toolkit [Tokuda et al.] and is a variation of the well-known Mel Frequency Cepstrum
Coefficients (MFCC). This method is used by default by the training scripts of the HTS demos.
The input configuration parameters for MGC extraction used in this thesis are shown in table 3.1.
Analysis order: 32
Window type: Hamming
Frame length: 400 points
Frame shift: 80 points
FFT length: 2048 points
Frequency warping factor: 0.42
Table 3.1: Mel generalized cepstrum coefficients extraction parameters
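The point-based values in table 3.1 translate into time units at the 16 kHz rate used for training; the short check below works them out (the constants come straight from the table, the arithmetic is elementary).

```python
# Converting the analysis parameters of table 3.1 into time units
# at the 16 kHz sampling rate used for training.

FS = 16000            # sampling frequency, Hz
FRAME_LENGTH = 400    # points
FRAME_SHIFT = 80      # points

frame_ms = 1000 * FRAME_LENGTH / FS   # 25.0 ms analysis window
shift_ms = 1000 * FRAME_SHIFT / FS    # 5.0 ms shift
frames_per_second = FS / FRAME_SHIFT  # 200.0 frames per second
```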
For F0 parameter extraction, the ESPS method from the SNACK toolkit [Sjölander] was used.
The chosen F0 boundary parameters were 55 Hz and 400 Hz for the lower and upper limits,
respectively. These values were chosen after careful analysis of the selected speaker's pitch
variations. The final step in the data setup was the HMM specification. In this thesis, the
default values from the demo scripts were used: 5 states per HMM, 3 delta windows and 1 mixture
component for both the MGC coefficients and the F0 values.
3.2.2 Training
As stated before, HMM training was performed with the HTS 2.1 toolkit [Tokuda et al.] and a
training script from one of the speaker-dependent training demos available on-line [Tokuda et
al.]. However, this demo relies on Festival [Black et al.] scripts to generate the
context-dependent label sequences for context-dependent cluster training. These scripts are
also language dependent, more concretely specific to American English. Therefore, adaptations
were made to use the multi-language platform described in chapter 2. In essence, the changes
consisted of using the feature label generator, described in section 2.3.3, to generate
context-dependent label sequences for European Portuguese context-dependent cluster HMM
training. The training script consists of a series of steps to produce context-clustered HMMs.
Some of the main steps performed by the HTS demo scripts are:
1. Global variance computation;
2. Initialization and re-estimation;
3. Embedded re-estimation for mono-phones;
4. Embedded re-estimation for full-context;
5. Tree-based context clustering for mel generalized cepstral coefficients and log(F0);
6. Clustered embedded re-estimation;
7. Untied parameter sharing structure;
8. Untied embedded re-estimation;
9. Tree-based context clustering for mel generalized cepstral coefficients and log(F0);
10. Re-clustered embedded re-estimation;
11. Tree-based context clustering for duration.
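The tree-based clustering steps above can be illustrated by a single greedy split decision. This is a hedged sketch, not the HTS implementation: the data, the question, and the flat penalty standing in for the MDL criterion are all assumptions made for the example.

```python
import math

# Illustrative sketch of one step of top-down tree-based context
# clustering: choose the yes/no context question with the largest
# log-likelihood gain, subject to an MDL-style penalty.

def loglik(values):
    """Log-likelihood of values under a single Gaussian fit to them."""
    n = len(values)
    mean = sum(values) / n
    var = max(sum((v - mean) ** 2 for v in values) / n, 1e-6)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def best_split(models, questions, penalty):
    """models: list of (context, value). questions: name -> predicate."""
    base = loglik([v for _, v in models])
    best = None
    for name, pred in questions.items():
        yes = [v for c, v in models if pred(c)]
        no = [v for c, v in models if not pred(c)]
        if not yes or not no:
            continue
        gain = loglik(yes) + loglik(no) - base
        if gain > penalty and (best is None or gain > best[1]):
            best = (name, gain)
    return best

# Toy pool: contexts ending in "v" (right neighbor is a vowel) behave
# differently from the rest, so the vowel question should be selected.
models = [("a+v", 1.0), ("a+v", 1.2), ("k+c", 5.0), ("t+c", 5.1)]
questions = {"R-Vowel?": lambda c: c.endswith("v")}
split = best_split(models, questions, penalty=0.5)
```

Applied recursively to each resulting node until no split clears the penalty, this yields the binary question trees described in section 3.1.5.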
3.3 Results
The training process was performed on an Intel® Core™ 2 CPU platform at 2.40 GHz with 4 GB
of physical memory, and took approximately 24 hours, including parameter generation. The memory
peak occurred while performing tree-based context-dependent clustering, where approximately
1 GB of memory was allocated. The final voice (context-dependent trees, models and configuration
files) occupied 3.5 MB.
3.3.1 Data Analysis
Once the training process was concluded, the trees and models were analyzed. In table 3.2,
the model and question counts for duration, pitch and spectral coefficients in the context-
dependent trees are presented. The first noticeable result is the number of models used in pitch
modeling: over 7000 models were constructed and 50% of the total number of questions were
used. This reveals an important result: despite the fact that no pitch stylization
information was used to train the context-dependent models, the system used a lot of context
information to generalize the corpus. Moreover, since the MDL principle was used to train the
context-dependent trees, over-specialized context-dependent trees for pitch are unlikely.
Another interesting result is the low number of models and questions used to train the MGC
context-dependent trees. Spectral coefficient discrimination is very important in HMM-based
speech synthesis and, given the complexity of the spectrum, more models were expected.
Stream   # Models   # Questions   Questions Selected
dur      677        440           13.6 %
f0       7051       1615          50.0 %
mgc      2006       553           17.1 %

Table 3.2: Models and questions count for duration, pitch and spectral coefficients
To better understand the previous results, a second analysis was performed. In table 3.3,
the question usage for each linguistic level is presented. From this table, the low number of
models for the MGC context-dependent trees becomes clear. Most context-dependent models are
concentrated at the phone level, meaning that the only relevant information for spectral training
is at the phone level and possibly at the syllable level. This explains the low number of used
questions and therefore the low number of generated models. As for the pitch context-dependent
trees, many questions from higher levels were used. Interestingly, most context features usually
used in pitch stylization are at the syllable, word and phrase levels. This result has an
important consequence: if the usual pitch stylization context features were selected by HTS to
model pitch, then using pitch stylization based information for training is pointless. In
section 4.4.2, results from synthesis confirm that the use of pitch stylization in HMM-based
synthesis can indeed be disregarded.
Stream   Phone    Syllable   Word     Phrase   Utt
dur      65.0 %   16.6 %     8.6 %    4.1 %    5.7 %
f0       47.1 %   18.5 %     10.1 %   14.4 %   9.9 %
mgc      77.6 %   12.8 %     3.4 %    2.0 %    4.2 %
Table 3.3: Question usage for each linguistic level
3.3.2 Models for Sandhi Phenomena
Sandhi phenomena are acoustic phone modifications occurring at the boundaries of words
inside a sentence. To reproduce these phenomena in speech synthesis, post-lexical rules are
applied to the canonical phone sequences. As stated in section 2.4.2, observed and canonical
penta-phones were used to train the context-dependent models, in the expectation that the models
would reflect the influence of the post-lexical rules and thus enhance naturalness. To test
this, the resulting context-dependent trees were analyzed using some of the most common Sandhi
phenomena. The idea consisted of finding explicit post-lexical context questions that would
lead to different models for the same phone.
The first analyzed set of Sandhi phenomena was consonant modifications. There are
three situations in which consonant modifications occur. The first two concern word-final
/S/, which is realized as /z/ when the next word begins with a vowel (e.g.
dias antes), or as /Z/ when the next word begins with a voiced consonant (e.g. bons dias). The
third consonant modification occurs when a word-final /l/ stops being velar when
followed by a word-initial vowel (e.g. mal entendido).
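The consonant rules above can be written down as a small post-lexical rule function. This is a hedged sketch: the SAMPA-like symbol sets and the `l-nonvelar` placeholder are assumptions for illustration, not the rule inventory actually used in Dixi.

```python
# Illustrative sketch of the post-lexical (Sandhi) consonant rules:
# the realization of a word-final phone depends on the first phone of
# the following word. Symbol sets are simplified assumptions.

VOWELS = {"a", "6", "e", "E", "i", "o", "O", "u", "@"}
VOICED_CONS = {"b", "d", "g", "v", "z", "Z", "m", "n", "l", "r"}

def apply_sandhi(final_phone, next_initial):
    """Return the realized word-final phone given the next word's onset."""
    if final_phone == "S":
        if next_initial in VOWELS:
            return "z"                 # e.g. "dias antes"
        if next_initial in VOICED_CONS:
            return "Z"                 # e.g. "bons dias"
    if final_phone == "l" and next_initial in VOWELS:
        return "l-nonvelar"            # placeholder: /l/ loses velarization
    return final_phone
```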
The second analyzed set of Sandhi phenomena was vowel related. The majority of phone
modifications happen in vowels, in word-initial and word-final positions. However, only one of
the most common vowel modifications was analyzed, because of its strong context. The analyzed
vowel modification concerns the unstressed vowel [a], which would normally be realized as /6/
but, in the case of an identical vowel sequence, is realized as /a/ (e.g. visse a Antónia).
Given the strong context of this phenomenon, if the context-dependent models do not reflect any
influence in this case, then most likely they do not in the other cases either.
After careful examination of the context-dependent trees, the test results were discouraging:
they revealed no influence whatsoever from the use of canonical penta-phones. Although
canonical questions were explicitly used, they do not produce a node split that explicitly
reveals a post-lexical influence or that would lead to different models. The analysis was not
exhaustive; however, since the most common cases revealed no influence, there is no point in
analyzing all the other particular cases. Even if there is an actual influence in a particular
model, the probability of that model being used during synthesis is very small, making it
irrelevant to speech synthesis.
3.3.3 Segmentation Errors
From the analysis of the context trees, an interesting phenomenon was observed. While
walking through the question nodes, the MGC context-dependent model in table 3.4 was found.
This table presents the context at states 1 and 2 for one of the /j/ HMM models. A simple
observation reveals that every context combination produces an impossible phone sequence: in
European Portuguese, the semi-vowel /j/ cannot occur surrounded by consonants.
Phone Type   Left Context                   Center Context   Right Context
Canonical    /f/, /v/, /s/, /z/, /S/, /Z/   Vowel            Any
Observed     /d/, /g/, /z/, /Z/             /j/              Consonant
Table 3.4: Context at states 1 and 2 for one of the /j/ HMM models
This model arises from corpus segmentation errors, where some realizations of the vowel /i/
were labeled as the semi-vowel /j/. Moreover, since HMM-based techniques are usually
insensitive to small errors in the corpus, one can conclude that these errors happen at a
significant rate in the training corpus.
3.4 Conclusions
In this chapter, the process of constructing an HMM-based voice was described. Although
the training process is time consuming, the benefits of this technique are clear. One of the
most important results was the adaptability of the system to language patterns, specifically to
pitch patterns. The high number of models generated for pitch suggests a speech synthesis
system with very natural intonation. Another important result is the high compressibility of an
HMM-based voice. The ability to produce voices four hundred times smaller than the corpus is a
clear advantage, especially in small-footprint synthesis.
4 Speech Synthesis with Dixi TTS Engine

Contents
4.1 Dixi Text-To-Speech Engine
4.2 HMM Based Synthesis
4.3 HMM-based Waveform Generation with HTS Engine API
4.4 Results
4.5 Conclusions
HMM synthesis was the technique chosen for small-footprint synthesis. In this thesis, the HTS
waveform generation module was integrated into the Dixi TTS engine. This chapter describes the
work performed for this integration. Results concerning system performance and footprint are
also reported.
4.1 Dixi Text-To-Speech Engine
Dixi [Oliveira et al.] is a generic text-to-speech synthesis system, developed by the L2F
Laboratory at INESC-ID in the scope of the Tecnovoz project. Although it was primarily targeted
at speech synthesis for European Portuguese, its modular architecture and flexible components
allow its use for other languages. Moreover, the same synthesis framework can be used for either
concatenation-based or HMM-based speech synthesis applications. The HMM-based components were
integrated by the author in the scope of this thesis. Dixi currently runs on Windows and Linux,
and can be accessed, in both operating systems, through an API provided by a set of Dynamic
Link Libraries and Shared Objects, respectively.
The system architecture is based on a pipeline of components, interconnected by means
of intermediate buffers, as depicted in figure 4.1. Every component runs independently from all
others, loads the to-be-processed utterances from its input buffer and, subsequently, dumps them
into its output buffer. Buffers, as the name suggests, store the utterances already processed
by the previous component while the following one is still processing earlier submitted data.
Dixi's internal utterance representation follows the HRG formalism [Taylor et al.], first
developed for the Festival speech synthesis system [Black et al.]. In this formalism, linguistic
objects such as words, syllables and phonemes are represented by objects termed linguistic
items. These items exist in relation structures, which specify the relationships between the
items. A relation exists for each required linguistic type. An HRG contains all the relations
and items for an utterance.
Figure 4.1: Dixi system architecture [Oliveira et al.]
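The pipeline-of-components architecture can be sketched with threads and queues. This is a minimal illustration, not Dixi code: the stand-in "components" and the sentinel end-of-stream protocol are assumptions made for the example.

```python
import queue
import threading

# Hedged sketch of a pipeline of threaded components: each component
# pulls utterances from its input buffer and pushes the processed
# result to its output buffer. Stages and sentinel are illustrative.

SENTINEL = None  # marks the end of the utterance stream

def component(process, inbuf, outbuf):
    while True:
        utt = inbuf.get()
        if utt is SENTINEL:
            outbuf.put(SENTINEL)      # propagate shutdown downstream
            break
        outbuf.put(process(utt))

stages = [str.lower, lambda u: u.split()]      # stand-ins for real components
buffers = [queue.Queue() for _ in range(len(stages) + 1)]
threads = [threading.Thread(target=component, args=(p, buffers[i], buffers[i + 1]))
           for i, p in enumerate(stages)]
for t in threads:
    t.start()

buffers[0].put("Ola Mundo")
buffers[0].put(SENTINEL)
results = []
while (item := buffers[-1].get()) is not SENTINEL:
    results.append(item)
for t in threads:
    t.join()
```

Because every stage runs concurrently, a downstream component can start on one utterance while upstream components are still producing the next, which is the buffering behavior described in the text.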
Dixi comprises five main components: text pre-processing, part-of-speech tagging,
grapheme-to-phone conversion, phonological analysis and waveform generation, as depicted in
figure 4.1.
4.2 HMM Based Synthesis
In HMM-based waveform generation, a context-dependent label sequence is obtained from
the input text by text analysis. A sentence HMM is constructed by concatenating context-
dependent HMMs according to the context-dependent label sequence. State durations are determined
so as to maximize the likelihood of the state duration densities [Yoshimura et al.]. According
to the obtained state durations, a sequence of mel-cepstral coefficients and F0 values,
including voiced/unvoiced decisions, is generated from the sentence HMM using a maximum
likelihood speech parameter generation algorithm [Tokuda et al.]. Finally, speech is synthesized
directly from the generated mel-cepstral coefficients and F0 values by the MLSA filter [Fukada
et al.; Imai]. In figure 4.2 the HMM-based waveform generation process is illustrated
[Yoshimura].
Figure 4.2: HMM Synthesis [Yoshimura]
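The maximum-likelihood state duration step has a simple closed form when the duration densities are Gaussian (as in Yoshimura's formulation): each state receives its mean duration plus a variance-weighted share of the remaining frames, d_k = m_k + ρ σ_k² with ρ = (T − Σ m_k) / Σ σ_k². The sketch below illustrates this; the means and variances are made-up values, not trained ones.

```python
# Hedged sketch of ML state duration assignment with Gaussian duration
# densities: d_k = m_k + rho * var_k, where rho is a common factor
# fixing the total utterance length. Parameters are illustrative.

def state_durations(means, variances, total_frames):
    rho = (total_frames - sum(means)) / sum(variances)
    return [m + rho * v for m, v in zip(means, variances)]

means = [4.0, 10.0, 6.0]         # mean duration per state, in frames
variances = [1.0, 4.0, 1.0]
durs = state_durations(means, variances, total_frames=26)
```

Note how the state with the largest duration variance absorbs most of the extra frames, while the total exactly matches the requested length.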
4.3 HMM-based Waveform Generation with HTS Engine API
As stated in section 4.1, the Dixi system architecture is based on a pipeline of components
interconnected by means of intermediate buffers, as depicted in figure 4.1. The goal at this
point of the thesis was to design an HMM-based waveform generation component for the Dixi
TTS system. The HTS engine API version 1.0 [Tokuda et al.] was used to design this component.
One of the drawbacks of the HTS engine is its inability to generate context-dependent labels
on-line. To overcome this problem, the feature label generator described in section 2.3.3 was
integrated into this component as a context-dependent label generator.
In figure 4.3, the HMM-based waveform generation component for the Dixi engine is presented.
The to-be-processed utterances are loaded from the input buffer, where linguistic features are
available. Each input utterance goes through the feature label generator, where context-
dependent labels are automatically generated. These labels are then used by the HTS engine to
build sentence HMMs from the HMM trees and generate the corresponding waveform, as described
in section 4.2. After processing, the generated speech is placed in the utterance and the
utterance in the component's output buffer.
Figure 4.3: Dixi Component for HMM Synthesis
The major implementation difficulty was the process of loading the resources (trees and
models) that feed the HTS engine. In the Dixi system, voices (trees and models) are separated
from the component pipeline. This separation exists as a memory optimization measure: by
separating voices from the pipeline, the same voice can feed several parallel pipelines, thus
optimizing resource usage. In the HTS engine API, models and trees are more or less integrated
in the processing part, making integration difficult. The solution to this problem was to port
some of the HTS engine API code to the Dixi system, thus separating the models and trees from
the component pipeline.
4.4 Results
Results for synthesis were obtained in two sets of tests. The first set was designed to test
the system footprint and the second to evaluate the HMM-based waveform generation technique.
Tests were conducted on an Intel® Core™ 2 CPU platform at 2.40 GHz with 4 GB of physical
memory.
4.4.1 System Footprint
The Dixi system architecture is based on a pipeline of threaded components, interconnected
by means of intermediate buffers, as depicted in figure 4.1. Testing multi-threaded
environments can be difficult: the Dixi components run independently from each other, and
consequently real-time execution measures are inconsistent. To obtain more accurate measures, a
load test was conducted. Approximately 860 sentences from a specific text domain were used. The
selected sentences are part of a human-machine interface whose main topic is a religious art
object. The test was conducted using both HMM-based waveform generation and unit selection
based waveform generation, for comparison. The same speaker was used in both techniques.
In table 4.1, the results for the system execution footprint are presented. In this table,
the unit selection technique has the best performance, being approximately twice as fast as the
HMM-based technique. However, unit selection techniques perform best with limited-domain input
text and more inconsistently with general-domain text. With general-domain input, the hit rate
for small units is much higher and synthesis is therefore slower. Under these conditions, the
Dixi unit selection based waveform generator is much slower than the HMM-based waveform
generator. HMM-based techniques have a constant performance under any type of domain, and in
terms of execution time this technique is generally more efficient than unit selection based
techniques.
Waveform Generator   Total Speech Length   Execution Time   Real-Time Speed
HTS                  48 m                  4 m 55 s         10
CLUnits              37 m                  1 m 54 s         19

Table 4.1: System execution footprint using HTS-based and Unit Selection based waveform generators.
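The real-time speeds in table 4.1 follow directly from the raw figures; the quick check below recomputes them and the resulting ratio between the two generators.

```python
# Recomputing the real-time speeds in table 4.1 from the raw figures:
# speed = synthesized speech length / execution time.

def real_time_speed(speech_minutes, exec_minutes, exec_seconds):
    return speech_minutes * 60 / (exec_minutes * 60 + exec_seconds)

hts = real_time_speed(48, 4, 55)       # ~9.8x real time
clunits = real_time_speed(37, 1, 54)   # ~19.5x real time
ratio = clunits / hts                  # unit selection is roughly 2x faster
```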
In table 4.2, the results for the system memory footprint are presented. In this table, the
HMM-based technique has the best performance, requiring only half of the memory required by the
unit selection technique. Although the memory peaks seem a little high, given the amount of
input sentences the system in fact exhibits normal behavior. The test logs showed that all 860
sentences were in the waveform generator's input buffer approximately 20 seconds after the
beginning of execution. In normal conditions, the HTS-based waveform generator allocates
between 5 and 10 MB of memory per sentence, and the unit selection waveform generator between
20 and 50 MB per sentence.
Waveform Generator   Voice Size   Pre-Synthesis Memory   Memory Peak
HTS                  3.5 MB       40 MB                  382 MB
CLUnits              1.2 GB       82 MB                  712 MB

Table 4.2: System memory footprint using HTS-based and Unit Selection based waveform generators.
4.4.2 Waveform Generation
Waveform evaluation is one of the most difficult tasks in speech synthesis. In recent years,
specifications and evaluation procedures for speech synthesis have been presented [Bonafonte et
al.; Black and Tokuda]. Usually, listening tests are performed by a selected population and the
evaluation is done using MOS scores. In this thesis, the waveforms were evaluated by listening
tests and by visual assessment of spectrograms and pitch curves.
Recorded audio prompts were selected from a test corpus and compared to their synthesized
versions. From the listening tests, the first clear difference was the vocoder-like sound of
the synthesized speech. On the other hand, the intonation of the synthesized speech is very
natural. From the visual assessment, the results showed that the synthesized intonation is well
behaved compared to the corresponding original speech intonation. Also, by comparing the
spectra it was easy to identify the similarities between formants.
In figures 4.4 and 4.5, the spectrograms and pitch curves for one of the comparison tests are
presented. Similarities and differences between the two audio realizations can be observed.
Figure 4.4: Original waveform for sentence "O de Aveiro custou-me vinte."
Figure 4.5: Synthesized waveform for sentence "O de Aveiro custou-me vinte."
4.5 Conclusions
The Dixi system architecture is based on a pipeline of threaded components, interconnected
by means of intermediate buffers. This architecture has advantages on the new multi-core CPU
platforms when processing large amounts of data. The primary goal of this architecture was to
manage the available computational resources in the best way possible, in order to achieve the
fastest possible speech synthesis. In terms of small footprint this can be a disadvantage;
however, Dixi's modularity and dynamic configuration can be tuned to meet lower footprint
requirements.

A very important result concerns the need for pitch stylization in intonation prediction.
Previous work on HMM-based synthesis for a tonal language [Chomphan and Kobayashi] shows that
the inclusion of tonal information yields good intonation results. However, in the results
obtained here, where no pitch stylization information is given, the synthesized speech
intonation is well behaved compared to the original speech intonation. Syllable- and word-level
linguistic features are fundamental in tone prediction and, as seen in section 3.3.1, many
features from these levels were selected for the pitch context-dependent trees. Therefore,
given the results, it is safe to say that the context-dependent models are well generalized.
Based on these results, the use of pitch stylization information can be disregarded (at least
for non-tonal languages) without compromising the quality of the synthesized speech.
Additionally, tone prediction algorithms rely on dynamic programming, which can lower
computational performance during synthesis.
5 Conclusions
The goal of TTS systems is to synthesize speech with natural human voice characteristics.
The increasing availability of large speech databases makes it possible to construct TTS systems
by applying statistical learning algorithms. These systems, which can be automatically trained,
can generate natural, good quality synthetic speech.

In this thesis, a small-footprint speech synthesis system using HMMs was proposed. In such
a system, grammar features and context factors play an important role in the generation of the
speech parameter sequences. To overcome language issues, configurable grammar feature and
context factor modules for on-line processing were proposed. The goal of these modules was to
provide an easy way of configuring context-dependent clustered models without modifying the
system core. An important result was the adaptability of the HTS system to language patterns,
specifically to pitch patterns. The high number of models generated for pitch produced a speech
synthesis system with very natural intonation. Another important result concerned the need for
pitch stylization in intonation prediction. The high number of models generated for pitch, and
the natural intonation of the synthesized speech, showed that the use of pitch stylization
information in context-dependent tree models can be disregarded without compromising the
quality of the synthesized speech. The Dixi system performance also showed that the proposed
architecture is well suited for small-footprint synthesis.
Bibliography
Black, A. and K. Tokuda (2005, September). The Blizzard Challenge 2005: Evaluating corpus-based speech synthesis on common datasets. In Proc. EUROSPEECH 2005, pp. 77–80. ISCA.
Black, A. W., P. Taylor, and R. Caley (1996–2002). The Festival Speech Synthesis System. Manual and source code available at http://www.cstr.ed.ac.uk/projects/festival.html.
Black, A. W., H. Zen, and K. Tokuda (2007). Statistical parametric speech synthesis. In Proc. ICASSP 2007, pp. 1229–1232.
Bonafonte, A., H. Höge, I. Kiss, A. Moreno, U. Ziegenhain, H. van den Heuvel, H.-U. Hain, X. S. Wang, and M. N. Garcia (2006, May). TC-STAR: Specifications of language resources and evaluation for speech synthesis. In LREC-2006: Fifth International Conference on Language Resources and Evaluation, Genoa, Italy, pp. 311–314.
Chevelu, J., N. Barbot, O. Boeffard, and A. Delhay (2007). Lagrangian relaxation for optimal corpus design. In Proceedings of the 6th ISCA Tutorial and Research Workshop on Speech Synthesis (SSW6), pp. 211–216. ISCA.
Chomphan, S. and T. Kobayashi (2006, August). Design of tree-based context clustering for an HMM-based Thai speech synthesis system. In Proc. of 6th ISCA Speech Synthesis Workshop, pp. 160–165.
Chomsky, N. and M. Halle (1968). The Sound Pattern of English. New York: Harper and Row.
Fukada, T., K. Tokuda, T. Kobayashi, and S. Imai (1992). An adaptive algorithm for mel-cepstral analysis of speech. In Proc. of ICASSP, Volume 1, pp. 137–140.
Hirschberg, J. and P. Prieto (1996). Training intonational phrasing rules automatically for English and Spanish text-to-speech. Speech Communication 18.
Huang, X., A. Acero, and H. Hon (2001). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall.
Imai, S. (1983). Cepstral analysis synthesis on the mel frequency scale. In Proc. of ICASSP, pp. 93–96.
Jakobson, R. and M. Halle (1956). Fundamentals of Language. The Hague: Mouton.
Johnson, D. S. (1973). Approximation algorithms for combinatorial problems. In STOC '73: Proceedings of the Fifth Annual ACM Symposium on Theory of Computing, New York, NY, USA, pp. 38–49. ACM.
Keating, P. A. (1997, October). Word-level phonetic variation in large speech corpora. In The Word as a Phonetic Unit, Berlin. Phonetics Lab, Linguistics Department, UCLA.
Masuko, T. (2002, November). HMM-Based Speech Synthesis and Its Applications. Ph.D. thesis, Tokyo Institute of Technology.
Masuko, T., K. Tokuda, T. Kobayashi, and S. Imai (1996). Speech synthesis using HMMs with dynamic features. In Proc. ICASSP-96, pp. 389–392.
Mateus, M. H., A. Andrade, M. C. Viana, and A. Villalva (1990). Fonética, Fonologia e Morfologia do Português (1st ed.). Lisboa: Universidade Aberta.
Mateus, M. H. and E. d'Andrade (2000). The Phonology of Portuguese. Oxford University Press.
Nascimento, M. F. B., J. Bettencourt, P. Marrafa, R. Ribeiro, R. Veloso, and L. Wittmann (1997). LE-PAROLE – Do corpus à modelização da informação lexical num sistema multifunção. Actas do XIII Encontro da Associação Portuguesa de Linguística. Lisboa, Portugal.
National Project TECNOVOZ number 03/165, P.
Neto, J. P. and H. Meinedo (2000). Combination of acoustic models in continuous speech recognition hybrid systems. In ICSLP 2000.
Oliveira, L. C. (1996). Síntese de Fala a Partir de Texto. Ph.D. thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa.
Oliveira, L. C., S. Paulo, L. Figueira, C. Mendes, R. Cassaca, M. do Céu Viana, and H. Moniz (2008, September). DIXI TTS System. In PROPOR 08 – XIII Encontro para o Processamento Computacional da Língua Portuguesa Escrita e Falada.
Oliveira, L. C., S. Paulo, L. Figueira, C. Mendes, A. Nunes, and J. Godinho (2008, May). Methodologies for designing and recording speech databases for corpus based synthesis. In Proceedings of the Sixth International Language Resources and Evaluation (LREC 08), Marrakech, Morocco. ELRA.
Oliver, D. and K. Szklanny (2006, May). Creation and analysis of a Polish speech database for use in unit selection synthesis. In LREC-2006: Fifth International Conference on Language Resources and Evaluation, Genoa, Italy.
Paulo, S., L. A. Figueira, C. Mendes, and L. C. Oliveira (2008, September). The INESC-ID Blizzard entry: Unsupervised voice building and synthesis.
Paulo, S. and L. C. Oliveira (2005, September). Generation of word alternative pronunciations using weighted finite state transducers. In Interspeech 2005, pp. 1157–1160. ISCA.
Paulo, S. and L. C. Oliveira (2007). MuLAS: A framework for automatically building multi-tier corpora. In Interspeech 2007.
Rabiner, L. R. (1989, February). A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, Volume 77.
Ribeiro, R. D., L. C. Oliveira, and I. Trancoso (2003, June). Using morphosyntactic information in TTS systems: Comparing strategies for European Portuguese. In PROPOR 2003 – 6th Workshop on Computational Processing of the Portuguese Language, Lecture Notes in Artificial Intelligence, pp. 143–150. Springer-Verlag, Heidelberg.
Ribeiro, R. D. F. M. (2003, March). Anotação Morfossintáctica Desambiguada do Português. Master's thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa.
Saratxaga, I., E. Navas, I. Hernaez, and I. Luengo (2006, May). Designing and recording an emotional speech database for corpus based synthesis in Basque. In LREC-2006: Fifth International Conference on Language Resources and Evaluation, Genoa, Italy.
Shinoda, K. and T. Watanabe (1996, May). Speaker adaptation with autonomous model complexity control by MDL principle. In Proc. of ICASSP, pp. 717–720.
Sjölander, K. (2003). The Snack Sound Toolkit. http://www.speech.kth.se/snack/index.html.
Taylor, P., A. W. Black, and R. Caley (2001, January). Heterogeneous relation graphs as a formalism for representing linguistic information. Speech Communication 33(1–2), 153–174.
Tokuda, K., T. Kobayashi, and S. Imai (1995). Speech parameter generation from HMM using dynamic features. In Proc. ICASSP-95, Volume 1, pp. 660–663.
Tokuda, K., T. Masuko, T. Yamada, T. Kobayashi, and S. Imai (1995). An algorithm for speech parameter generation from continuous mixture HMMs with dynamic features. In Proc. of EUROSPEECH, pp. 757–760.
Tokuda, K., H. Zen, S. Sako, J. Yamagishi, T. Masuko, and Y. Nankaku. SPTK: Speech Signal Processing Toolkit. http://sp-tk.sourceforge.net/.
Tokuda, K., H. Zen, J. Yamagishi, A. W. Black, T. Masuko, S. Sako, T. Toda, T. Nose, and K. Oura. HTS: HMM-based Speech Synthesis System. http://hts.sp.nitech.ac.jp/.
Viana, C., L. C. Oliveira, and A. I. Mata (2003). Prosodic phrasing: Machine and human evaluation. Speech Technology 6.
Vigário, M. and S. Frota (2003). The intonation of Standard and Northern European Portuguese. Journal of Portuguese Linguistics 2-2.
Weiss, C., S. Paulo, L. A. Figueira, and L. C. Oliveira (2007, August). Blizzard entry: Integrated voice building and synthesis for unit-selection TTS.
Yoshimura, T. (2002, January). Simultaneous Modeling of Phonetic and Prosodic Parameters, and Characteristic Conversion for HMM-Based Text-To-Speech Systems. Ph.D. thesis, Nagoya Institute of Technology.
Yoshimura, T., K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura (1998). Duration modeling in HMM-based speech synthesis system. In Proceedings of ICSLP, Volume 2, pp. 29–32.
Yoshimura, T., K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura (1999). Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Eurospeech 99, pp. 2347–2350.
Young, S., G. Evermann, M. Gales, T. Hain, D. Kershaw, X. A. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland (2006). The HTK Book (for HTK Version 3.4).
A Phonetic Alphabets
Vowels and glides:

  IPA   SAM-PA   Graphemes       Example
  i     i        i, í, y, e      vi
  e     e        e, ê            vê
  ɛ     E        e, é            pé
  a     a        a, á, à         pá
  ɐ     6        a               cama
  ə     @        e               de
  ɔ     O        ó, o            pó
  o     o        ô, o            avô
  u     u        ú, u            tudo
  j     j        i, e            pai
  w     w        u, o            pau
  ĩ     i~       i, í            sim
  ẽ     e~       e, ê            pente
  ɐ̃     6~       ã, a, e         branco
  õ     o~       õ, o, ô         ponte
  ũ     u~       u, ú            atum
  j̃     j~       i, e            põe
  w̃     w~       o               mão

Consonants:

  IPA   SAM-PA   Graphemes       Example
  p     p        p               pá
  b     b        b               bem
  t     t        t               tu
  d     d        d               dou
  k     k        c, k            casa
  g     g        g               gato
  f     f        f               fé
  v     v        v               vê
  s     s        s, ç, c         sol
  z     z        z, s, x         casa
  ʃ     S        ch, s, z, x     chave
  ʒ     Z        j, g, s, z, x   já
  l     l        l               lá
  ɫ     l~       l               mal
  ʎ     L        lh              valha
  m     m        m               mão
  n     n        n               não
  ɲ     J        nh              senha
  ɾ     r        r               caro
  ʁ     R        r               carro

Table A.1: Phonetic alphabet for the standard European Portuguese dialect (Oliveira, 1996).
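The SAM-PA column of Table A.1 is ASCII-safe, so a phoneme inventory like this one can also be carried in code, for example to validate that a lexicon only uses symbols from the inventory. The mapping below is a small excerpt from the table (symbol to example word), and the `unknown_symbols` helper is a hypothetical illustration, not part of the described system.

```python
# Excerpt of the SAM-PA inventory of Table A.1 (symbol -> example word).
# Hypothetical helper: flag transcription symbols outside the inventory.

EP_SAMPA_EXAMPLES = {
    "i": "vi", "e": "vê", "E": "pé", "a": "pá", "6": "cama",
    "@": "de", "O": "pó", "o": "avô", "u": "tudo",
    "p": "pá", "b": "bem", "S": "chave", "Z": "já", "J": "senha",
}

def unknown_symbols(transcription, inventory=EP_SAMPA_EXAMPLES):
    """Return the symbols of a space-separated transcription not in the inventory."""
    return [s for s in transcription.split() if s not in inventory]

unknown_symbols("S a 6 J")   # → [] (all symbols belong to the inventory)
```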
Vowels:

  IPA   Darpa    Graphemes   Example
  i     iy       ee          seek
  ɪ     ih       i           sick
  ɨ     ix       i           equipment
  ɛ     eh       e           set
  æ     ae       a           sat
  ɑ     aa       o           Bob
  ʌ     ah       u           but
  ɔ     ao       ou          bought
  ʊ     uh       oo          book
  u     uw       u           due
  ʉ     ux       u           suit
  ə     ax       e           the
  ə̥     ax-h     o           to go
  ɚ     axr      er          butter
  ɝ     er       ir          bird
  eɪ    ey       ai          bait
  aɪ    ay       uy          buy
  aʊ    aw       ow          down
  oʊ    ow       ow          show
  ɔɪ    oy       oy          boy

Consonants:

  IPA   Darpa    Graphemes   Example
  p     p        p           pan
  b     b        b           ban
  t     t        t           tan
  d     d        d           Dan
  k     k        c, k        can
  g     g        g           gander
  ʔ     q        q           (glottal stop)
  f     f        f           fan
  v     v        v           van
  θ     th       th          thing
  ð     dh       th          that
  s     s        s           seen
  z     z        z           zone
  ʃ     sh       sh          sheen
  ʒ     zh       z           azure
  h     hh       h           hope
  ɦ     hv       h           ahead
  j     y        y           you
  w     w        w           we
  tʃ    ch       ch          church
  dʒ    jh       g           gin
  m     m        m           me
  n     n        n           knee
  n̩     en       n           button
  ŋ     ng       ng          weeping
  ɾ     dx       dd          ladder
  ɾ̃     nx       n           banter
  l     l        l           long
  l̩     el       l           bottle
  ɹ     r        r           rent

Table A.2: Phonetic alphabet for the standard European English dialect (Keating, 1997).