Síntese de fala a partir de texto com reduzidos requisitos computacionais
Carlos Miguel Duarte Mendes
Dissertação para obtenção do Grau de Mestre em Engenharia Electrotécnica e de Computadores
Júri
Presidente: Doutor Carlos Jorge Ferreira Silvestre
Orientador: Doutor Luís Miguel Veiga Vaz Caldas de Oliveira
Vogal: Doutora Isabel Maria Martins Trancoso
13 de Novembro de 2008
Acknowledgments
First, I would like to express my gratitude to Professor Luís Caldas de Oliveira, my adviser, for
his support, encouragement and guidance. I would like to thank Sérgio Paulo and Luís Figueira,
for their substantial help and work on the Tecnovoz corpus, without which this thesis would have
been impossible; Renato Cassaca and David Matos, for their C/C++ programming suggestions
that helped me solve many problems; Helena Moniz, for helping me with the linguistic issues;
Professor João Paulo Neto and all my colleagues from the Tecnovoz project, for their contagious
enthusiasm while achieving the impossible; and everyone else at L2F, for the great work environment
that they built.
Finally, I would like to give my special thanks to my friends and family for all their support over
the last two years.
Lisbon, September 29, 2008
Carlos Miguel Duarte Mendes
Abstract
In recent years, TTS systems have become an important output device in human-machine
interfaces, and they are used in many applications such as car navigation systems, information
retrieval over the telephone, voice mail and so on. Although most concatenation-based TTS systems
are able to synthesize speech with high quality, their performance decreases when the computational
requirements must be kept very low, usually due to the large amount of pre-recorded speech stored
in the database.
The main objective of this thesis was the development of a small-footprint text-to-speech synthesis
system, using HMM models to generate artificial speech. The developed system was applied
to European Portuguese, but is general enough to be extended to other languages.
Keywords
Text-to-Speech Systems, Small footprint synthesis, HMM-based Synthesis, Context-dependent clustered models
Resumo
Nos últimos anos, sistemas de texto para fala têm-se tornado importantes dispositivos de
saída em interfaces homem-máquina, pelo que são usados em muitas aplicações como sis-
temas de navegação, obtenção de informações via telefone, voice mail, etc. Apesar da maioria
dos sistemas de síntese de fala, baseados em concatenação de segmentos, serem capazes de
gerar fala sintética com uma grande qualidade, o seu desempenho decresce quando os requisitos
computacionais são muito baixos, na maioria das vezes devido à grande quantidade de fala
pré-gravada armazenada na base de dados.
O objectivo deste trabalho foi o desenvolvimento de um sistema de síntese de fala a partir de
texto com baixos requisitos, tanto de poder de cálculo como de memória, recorrendo a modelos
HMM para a geração do sinal de fala. O sistema foi aplicado ao Português Europeu, mas tem a
generalidade suficiente para ser alargado a outras línguas.
Palavras Chave
Sistemas de texto para fala, Síntese com baixos requisitos computacionais, Síntese baseada
em HMMs, Grupos de modelos com dependência contextual.
Contents
1 Introduction 1
1.1 State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 HMM-based Text-To-Speech Synthesis . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Context-Dependent Clustered Models 7
2.1 Language Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Tree-Based Context Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Towards Language Independency . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 Configurable Grammar Features Module . . . . . . . . . . . . . . . . . . . . 10
2.3.2 Configurable Context Factors Module . . . . . . . . . . . . . . . . . . . . . 15
2.3.3 Feature Label Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 Grammar Features and Context Factors for European Portuguese . . . . . . . . . 18
2.4.1 Grammar Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.2 Context Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.3 Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3 Voice Building with HTS 25
3.1 Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1.1 Design of the Recording Prompts . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1.2 Speakers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.3 Recordings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.4 Phonetic Segmentation and Multi-Level Utterance Descriptions . . . . . . . 29
3.1.5 Corpus sub-set for HMM Training . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 HMM Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.1 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.2 Models for Sandhi Phenomena . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.3 Segmentation Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4 Speech Synthesis with Dixi TTS Engine 37
4.1 Dixi Text-To-Speech Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 HMM Based Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3 HMM-based Waveform Generation with HTS Engine API . . . . . . . . . . . . . . 39
4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4.1 System Footprint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4.2 Waveform Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5 Conclusions 45
Bibliography 47
A Phonetic Alphabets 51
List of Figures
1.1 HMM TTS System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Context-Dependent Clustered Decision Tree . . . . . . . . . . . . . . . . . . . . . . 5
1.3 HMM-based Speech Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Decision tree-based state tying . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1 Recording Room . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Unit coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 HMM Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1 Dixi system architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 HMM Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3 Dixi Component for HMM Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4 Original waveform for sentence "O de Aveiro custou-me vinte." . . . . . . . . . . . 42
4.5 Synthesized waveform for sentence "O de Aveiro custou-me vinte." . . . . . . . . . 42
List of Tables
2.1 Chomsky distinctive features for PT-SAMPA vowels . . . . . . . . . . . . . . . . . . 19
2.2 Chomsky distinctive features for PT-SAMPA consonants . . . . . . . . . . . . . . . 20
2.3 POS tag system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Prosodic Markers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1 Mel generalized cepstrum coefficients extraction parameters . . . . . . . . . . . . 32
3.2 Models and questions count for duration, pitch and spectral coefficients . . . . . . 33
3.3 Question usage for each linguistic level . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 Context at states 1 and 2 for one of the /j/ HMM models . . . . . . . . . . . . . . . 35
4.1 System execution footprint using HTS-based and Unit Selection based waveform
generators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 System memory footprint using HTS-based and Unit Selection based waveform
generators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
A.1 Phonetic Alphabet for Standard European Portuguese Dialect . . . . . . . . . . . . 52
A.2 Phonetic Alphabet for Standard English Dialect . . . . . . . . . . . . . . . . . . . . 53
List of Acronyms
DOM Document Object Model
HMM Hidden Markov Model
HMMs Hidden Markov Models
HRG Heterogeneous Relation Graph
HRGs Heterogeneous Relation Graphs
HTS HMM-based TTS System
IPA International Phonetic Alphabet
MDL Minimum Description Length
MFCC Mel Frequency Cepstrum Coefficients
MGC Mel frequency Generalized Cepstrum coefficients
MLSA Mel-Log Spectra Approximation
MSDs Multi-Space probability Distributions
POS Part-Of-Speech
1 Introduction
Contents
1.1 State of the Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 HMM-based Text-To-Speech Synthesis . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Mobile devices like cellphones and PDAs have serious input/output limitations, especially in
situations like driving, where the user cannot maintain eye contact with the device. This accessibility
problem suggests the use of spoken interfaces.
Today’s most successful speech synthesis methods use approaches based on speech databases,
where speech segments of variable duration are selected and concatenated to produce
the desired speech signal. The selection criterion consists of the simultaneous optimization of
two costs: the target cost and the concatenation cost. The first evaluates the differences
between the desired synthesized sound and the sounds available in the database, in terms
of segmental and prosodic differences. The second evaluates the concatenation quality of the
candidate segments. The joint optimization of these two costs allows the selection of the speech
unit sequence that best produces the desired acoustic realization. The problem inherent in
this approach arises when the text-to-speech system must handle any type of input text. In this
case, the speech database must contain a wide variety of speech segments to allow the optimization
process to find an acoustic sequence with acceptable quality. This usually corresponds to several
hours of recorded speech and high computational resources for the selection process. This disadvantage
makes concatenation-based synthesis less fit for mobile devices. Recently a parametric synthesis
method re-emerged, in which the speech signal is generated from source-filter models that simulate
the human vocal tract, instead of from recorded speech databases. The model parameters
are generated from statistical models, trained on a speech database of considerable size.
Since the parameters are generated from statistical models, there is no need to store
large amounts of speech data to synthesize speech, which substantially reduces the
required computational resources.
The main objective of this thesis was the development of a small-footprint text-to-speech synthesis
system, using parametric models to generate artificial speech. This system was developed for
European Portuguese but is general enough to be extended to other languages.
1.1 State of the Art
Currently there are two common approaches for small-footprint text-to-speech systems.
The first is diphone concatenation and the second is parametric synthesis with Hidden Markov
Models (HMMs).
Diphone concatenation synthesis consists of producing the desired acoustic-phonetic
sequence by concatenating diphone segments available in a pre-collected diphone
database. The number of diphones in a language is at most equal to the square of the total number
of phonetic segments. However, not all combinations of phone segments exist, meaning that
the resulting database is actually smaller than it would be if all combinations occurred.
The collected databases are relatively small, since there is a limited number of phones per language,
typically between 40 and 60. Although these systems require a very small amount
of resources, they present a few problems. The first resides in the concatenation
process, where pitch discontinuities may appear, resulting in noticeable acoustic
distortions. Perfectly matching the end boundary of one diphone with the beginning boundary
of the next is very difficult, often leading to tedious work during database setup. Recent
techniques have emerged as a solution to this problem, in which speech transformations are
performed in the neighborhood of the concatenation point to reduce possible signal distortions.
Still, there is another problem with this method: although the produced
speech has very good segmental quality, the generated intonation is frequently unnatural, and
listeners often find it tedious. One of the procedures to avoid distortions at the concatenation
points consists of recording all diphone units at the same pitch level, i.e. as flat as possible.
This technique helps the concatenation process, but has the disadvantage of making the speech
intonation unnatural.
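As a rough, back-of-the-envelope check of the inventory sizes mentioned above (the figures are illustrative only):

```python
def diphone_upper_bound(n_phones: int) -> int:
    """All ordered phone pairs; real inventories are smaller,
    since not every pair actually occurs in the language."""
    return n_phones ** 2

# For the 40-60 phones typical of a language:
bounds = {n: diphone_upper_bound(n) for n in (40, 50, 60)}
# 40 phones -> at most 1600 diphones; 60 phones -> at most 3600
```

Even the upper bound stays in the low thousands of short units, which is why diphone databases fit comfortably on constrained devices.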
The other approach to small-footprint systems is parametric synthesis with Hidden Markov
Models (HMMs). HMMs have been successfully applied to model sequences of speech
spectra in speech recognition systems, and the performance of HMM-based speech recognition
systems has improved through techniques that make use of their flexibility: context-dependent modeling,
dynamic feature parameters, mixtures of Gaussian densities, tying mechanisms, and speaker and
environment adaptation techniques. HMM-based speech synthesis systems are becoming increasingly
popular. They were first conceived for Japanese by Tokuda, Kobayashi, and Imai (1995)
and further developed by Masuko et al. (1996) and Yoshimura et al. (1999). This technique has
also been developed for several other languages such as Korean, English, Brazilian Portuguese,
Slovenian, Chinese and German, as indicated in Black, Zen, and Tokuda (2007). It has been
shown that HMM-based speech synthesis can successfully be applied in a wide range of
languages. In Yoshimura et al. (1999), the HMM-based TTS system in figure 1.1 was proposed,
where the training and synthesis parts of the system are depicted. In the training phase, spectral
parameters and excitation parameters are extracted from the speech database. The extracted
parameters are modeled by context-dependent HMMs. In the synthesis phase, a context-dependent
label sequence is obtained from the input text by linguistic analysis. A sentence HMM is constructed
by concatenating context-dependent HMMs according to the context-dependent label sequence;
using a parameter generation algorithm, spectral and excitation parameters are then generated
from the sentence HMM. Finally, by using a synthesis filter, speech is synthesized from the
generated spectral and excitation parameters.
Figure 1.1: HMM TTS System [Yoshimura]
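The two phases can be sketched in miniature as follows. Everything here is illustrative (toy labels, toy parameter values, invented names), not the actual HTS toolkit API; the point is only the data flow: a trained voice is a table of context-dependent models, and synthesis concatenates them and reads out parameter streams.

```python
from dataclasses import dataclass

@dataclass
class ContextHMM:
    label: str        # context-dependent label, e.g. "sil-o+l"
    spectral: list    # spectral parameter stream (toy values)
    excitation: list  # excitation (F0) stream (toy values)

# "Training" outcome: context-dependent models indexed by label.
voice = {m.label: m for m in [
    ContextHMM("sil-o+l", [0.1, 0.2], [120.0, 118.0]),
    ContextHMM("o-l+a",   [0.3, 0.4], [115.0, 110.0]),
]}

def synthesize(labels):
    """Concatenate context-dependent HMMs into a sentence model and
    return the generated parameter streams (a synthesis filter would
    then turn these into a waveform)."""
    sentence = [voice[l] for l in labels]
    spectral = [v for m in sentence for v in m.spectral]
    f0 = [v for m in sentence for v in m.excitation]
    return spectral, f0

spec, f0 = synthesize(["sil-o+l", "o-l+a"])
```

Note that the voice footprint is just the model table: no recorded waveforms are carried into synthesis, which is the source of the small-footprint property discussed above.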
1.2 HMM-based Text-To-Speech Synthesis
Phonetic and prosodic parameters are modeled simultaneously with HMMs. In
the system proposed by Yoshimura (2002), the mel-cepstrum, the fundamental frequency (F0) and the state
durations are modeled by continuous-density HMMs, multi-space probability distribution HMMs
and multi-dimensional Gaussian distributions, respectively. The distributions for spectrum, F0
and state duration are clustered independently by using a decision-tree based context clustering
technique, as depicted in figure 1.2. These decision trees rely on features that are language
dependent.
Figure 1.2: Context-Dependent Clustered Decision Tree [Yoshimura]

Synthetic speech is generated by using a speech parameter generation algorithm from the
HMMs and a mel-cepstrum based vocoding technique. A more detailed illustration of the HMM-based
text-to-speech synthesis system is shown in figure 1.3. An arbitrarily given text to be
synthesized is converted to a context-based label sequence. Then, according to the label sequence,
a sentence HMM is constructed by concatenating context-dependent HMMs. State durations
of the sentence HMM are determined so as to maximize the likelihood of the state duration
densities [Yoshimura et al.]. According to the obtained state durations, a sequence of mel-cepstral
coefficients and F0 values, including voiced/unvoiced decisions, is generated from the sentence
HMM by using a speech parameter generation algorithm [Tokuda et al., Yoshimura]. Speech is
then synthesized directly from the generated mel-cepstral coefficients and F0 values using a
synthesis filter referred to as the MLSA filter [Fukada et al., Imai, Yoshimura].
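For Gaussian duration densities, this maximization has a closed-form solution commonly quoted in the HTS literature; the following is a sketch of that formulation, where \(T\) is the total utterance length in frames and \(m_k\) and \(\sigma_k^2\) are the mean and variance of the \(k\)-th of the \(K\) state duration densities:

```latex
d_k = m_k + \rho\,\sigma_k^2,
\qquad
\rho = \frac{T - \sum_{k=1}^{K} m_k}{\sum_{k=1}^{K} \sigma_k^2}
```

When no total length \(T\) is imposed, \(\rho = 0\) and each state simply takes its mean duration \(m_k\); a nonzero \(\rho\) stretches or compresses states in proportion to their variance, which is also how speaking-rate control is obtained in this framework.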
Figure 1.3: HMM-based Speech Synthesis [Yoshimura]
1.3 Thesis Outline
The main objective of this thesis is the development of a small-footprint text-to-speech synthesis
system, using parametric models to generate artificial speech. The developed system was
applied to European Portuguese, but is general enough to be extended to other languages.
In chapter 2, a decision-tree-based context-dependent clustering technique for HMMs is
described. Language dependency issues inherent to this technique are analyzed and a solution
towards language independency is proposed, along with the work developed for European Portuguese.
In chapter 3, the voice building process for European Portuguese is presented, including corpus design and
HMM training. In chapter 4, the work conducted to integrate an HMM-based system into the Dixi
TTS system is presented. Finally, in chapter 5, conclusions are presented.
1.4 Main Contributions
The first main contribution of this thesis is a tool that automatically generates feature
labels according to a language specification. Described in chapter 2, this tool overcomes the
language barrier when developing synthesis engines based on context-dependent clustered models.

The second main contribution of this thesis is the integration of an HMM-based waveform
generation module into the Dixi TTS system as a solution for small-footprint speech synthesis.
2 Context-Dependent Clustered Models
Contents
2.1 Language Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Tree-Based Context Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Towards Language Independency . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Grammar Features and Context Factors for European Portuguese . . . . . . . . 18
2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
In continuous speech, parameter sequences of a particular speech unit vary according to
linguistic patterns. To model these patterns accurately, context-dependent models are clustered
in a structured representation of context features. However, as the number of context factors
increases, the number of their combinations increases exponentially. Moreover, it is impossible to prepare training
data that covers all possible contexts and compensates for the large variations in the frequency
of each context-dependent unit. To alleviate these problems, a decision-tree-based context-dependent
clustering technique is used to cluster HMM states and share model parameters, such as spectral
coefficients, F0 and duration, among states. Since the spectrum, F0 and duration models have their
own influential patterns, their distributions are clustered independently, as shown in figure 1.2.
2.1 Language Patterns
Speech production is the reproduction of sounds constrained by a language-specific set of
rules. Additionally, each speaker adds his/her own particularities, giving speech the variability that
characterizes it as natural. These variations make it difficult to quantify and/or qualify speech with one
unique set of rules. To simplify the idea of natural speech, the concept of speech patterns is used
instead. This statistical perspective of natural speech production allows a better management of
language concepts and thus better modeling.
Further analysis of speech patterns shows that they may be represented as a composition
of grammar features and context factors. Grammar features represent the language structure
and are identified by phones, syllables and morphological categories. Context factors, on the
other hand, are local features that are used to identify a certain linguistic context; examples are
the left/central/right phone, the left/right part-of-speech and the distance (in syllables) to the
previous accented syllable. To put it differently, a group of grammar features can be used to
describe a certain acoustic segment, but distinct local grammar features may influence its
characteristics, so that it presents differently under different contexts, thus representing
different speech patterns.
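The combination of grammar features and context factors can be pictured as a full-context label attached to each segment. The sketch below uses an invented, much simplified label syntax (real systems such as HTS use far richer formats); all names are illustrative:

```python
def context_label(prev, phone, nxt, pos, dist_to_stress):
    """Combine grammar features (the phone identities) with context
    factors (neighbouring part-of-speech, distance in syllables to
    the previous stressed syllable) into one label string."""
    return f"{prev}-{phone}+{nxt}/POS:{pos}/DS:{dist_to_stress}"

# The same central phone gets distinct labels under distinct contexts,
# and therefore may be modeled by distinct clustered states.
label = context_label("s", "aw", "n", "noun", 2)
```

Two occurrences of the same phone with different neighbours or prosodic positions yield different labels, which is exactly the explosion of combinations that the clustering of the next section has to tame.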
This linguistic decomposition allows the construction of context-dependent models and,
considering the many possible combinations of contextual factors, accurate model parameters
should be expected. However, as the number of contextual factors increases, the number of their
combinations also increases exponentially. Therefore, model parameters cannot be estimated with
sufficient accuracy from limited training data. Furthermore, it is impossible to prepare a speech database with
all combinations of contextual factors. To overcome this difficulty, the next section presents the
solution usually adopted in HMM-based synthesis systems.
2.2 Tree-Based Context Clustering
HTS [Tokuda et al.]
tree-based context clusters are translated into binary trees in which a yes/no context question is
attached to each node. Trees are built using a top-down sequential optimization process. Initially
attached to each node. Trees are built using a top-down sequential optimization process. Initially
all models are placed in a single cluster at the root of the tree. A question is then found according
to the Minimum Description Length (MDL) criterion, which gives the optimal split of the root node.
The MDL principle is an information criterion introduced by Shinoda and Watanabe
(1996) as an optimal probabilistic model selector. Maximum-likelihood based methods, previously
used in tree-based context clustering, have the major disadvantage of producing over-specialized
or under-specialized context trees. This is mostly a consequence of the difficulty in
determining the correct threshold for the stopping rule. MDL's main contribution is its ability to
produce optimal context trees without any externally given parameters. The splitting decision
is based on the models' description length: if the change in description length is below zero,
the node is divided; otherwise it is not.
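The split criterion just described can be written schematically. The following is a sketch, following Shinoda and Watanabe's Gaussian formulation up to constants: \(\Gamma(\cdot)\) denotes state occupancy, \(\Sigma\) the covariance of a cluster, \(\Gamma_0\) the total occupancy, and \(\lambda\) stands in for the number of free parameters per cluster (an assumption of this sketch, not a quantity defined in this thesis):

```latex
\Delta\ell(q) =
\tfrac{1}{2}\Big[\,\Gamma(S_y)\log\lvert\Sigma_{S_y}\rvert
               + \Gamma(S_n)\log\lvert\Sigma_{S_n}\rvert
               - \Gamma(S)\log\lvert\Sigma_{S}\rvert\,\Big]
+ \lambda\log\Gamma_0
```

A question \(q\) splitting cluster \(S\) into \(S_y\) and \(S_n\) is accepted when \(\Delta\ell(q) < 0\): the first term (the data term) always decreases with a split, while the penalty term grows with every cluster added, so the tree stops growing without any externally tuned threshold.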
The splitting process continues for the following nodes until the minimum description length
change is above zero. As a final stage, the minimum description length is calculated for
merging terminal nodes with different parents; any pair of nodes for which the length is above the
zero threshold is then merged, or tied. For example, figure 2.1 illustrates the case of tying the center
states of all triphones of the phone /aw/. All of the states trickle down the tree and, depending on
the answers to the questions, they end up at one of the shaded terminal nodes. In the illustrated
case, the center state of /s/-/aw/+/n/ would join the second leaf node from the right,
since its right context is a central consonant, its right context is nasal, but its left context is not
a central stop.
Figure 2.1: Decision tree-based state tying [Young et al.]
An important advantage of tree-based clustering is that it allows models which have no
training data to be synthesized. This is done by descending the previously constructed trees for
that phone and answering the questions at each node based on the new unseen context. When
a leaf node is reached, the model representing that cluster is used for the corresponding model
in the unseen context.
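The descent just described is a plain walk down a binary question tree. The sketch below mirrors the figure 2.1 example; the questions, phone classes and leaf names are illustrative, with leaves standing for tied (shared) states:

```python
def descend(node, context):
    """Walk a binary yes/no question tree until a leaf is reached."""
    while isinstance(node, dict):                      # internal node
        node = node["y"] if node["q"](context) else node["n"]
    return node                                        # leaf = shared model

tree = {
    "q": lambda c: c["R"] in {"n", "t", "d"},          # R = central consonant?
    "y": {"q": lambda c: c["R"] == "n",                # R = nasal?
          "y": {"q": lambda c: c["L"] in {"t", "d"},   # L = central stop?
                "y": "leaf_1", "n": "leaf_2"},
          "n": "leaf_3"},
    "n": "leaf_4",
}

# Centre state of s-aw+n: right context is nasal, left context not a stop.
cluster = descend(tree, {"L": "s", "R": "n"})          # -> "leaf_2"
```

Because the walk only asks questions about the context, it works equally well for a triphone never seen in training, which is precisely the advantage noted above.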
Information sharing of training data in the same cluster or leaf node is the essential concept;
therefore, the construction of context factors and the design of the tree structure for decision-tree
based context clustering must be done appropriately. Since the spectrum, F0 and duration models have
their own influential contextual factors, the distributions for spectral parameters, F0
and state durations are clustered independently, as seen in figure 1.2.
2.3 Towards Language Independency
The main problem in context cluster synthesis is the dependency on the target language.
To adapt a system to a new language, most context feature extraction algorithms need to be
changed to the new language's specifications. Also, since most context factors are not statically
available during synthesis itself, on-the-fly generation is necessary, which in turn requires modifications
to the system core. To overcome these language issues, configurable language-dependent modules for
on-line context feature extraction are proposed.
The next sections describe the proposed solution for language-independent context cluster
synthesis, using configurable language-dependent modules. As a support language for these
configurable modules, the XML mark-up language was chosen due to its adaptability to new
features.
2.3.1 Configurable Grammar Features Module
A configurable grammar features module is introduced to solve language dependency issues
in synthesis based on context-dependent clustered models. The goal of this module is to provide an
easy way of configuring grammar features without performing any modifications to the system
core, avoiding long and tedious implementation work. Next, some of the main language
features are described.
Phone Set
The first major grammar feature is the phone set. A phone set is a symbolic representa-
tion of the phonological basis of a spoken language. In language independent systems, general
phonetic symbolic representations, like the International Phonetic Alphabet (IPA) system, would
be preferred. The IPA goal is to find symbolic representations of every human language phonetic
forms. However, IPA presents a problem: its symbolic representation is not easily computable.
Also, there is a certain difficulty establishing the correct symbolic representation when particu-
10
lar forms of a certain phone are involved. Hence, each language dialect has its own phonetical
representation, for example in American English it is common to use the Darpa phone set, or its
subset, the Radio Phones phone set. The PT-SAMPA phone set is commonly accepted as the
best representation for European Portuguese. See appendix A for the cross-reference between
these systems and the IPA system.
In DTD 1 the Document Type Definition for the XML implementation is presented. The root
entity of the phoneset is the PhoneSet entity. This entity may have more than one element of type
Phone, and has only one attribute named name, which specifies the name and type of the phone
set being used. The Phone entity may have one or more elements of type PhoneFeature. This
entity has two mandatory attributes and two optional ones. The first two are name and maintype.
The name attribute sets the name of the phone in question, and the maintype attribute specifies
the general phone type; for example, main type "Vowels" can be a vowel, a semi-vowel, a
diphthong or a triphthong. The other two attributes of the Phone entity are translation and nuclearphone.
The translation attribute exists for compatibility reasons: graphic symbols, like /@/ and
/~/ used in the PT-SAMPA phone set, may present parsing problems on some
systems. As for the nuclearphone attribute, its purpose is to help some context factors determine
the main vowel in a specific syllable or diphthong. With the Darpa phone set this problem is
not very important, since most diphthongs are static and any feature processor would have no
trouble identifying the main vowel in a diphthong or syllable. However, when dealing with a
phone set like PT-SAMPA, which allows multiple characters for a vowel, extracting this information
becomes a problem; the additional information is supplied by the nuclearphone attribute.
The last entity is the PhoneFeature entity, which has only one attribute, named name. Phone
characteristics are set in this entity to allow phone discrimination using, for example, Chomsky
distinctive features (see section 2.4.1).
In XML 1 a very simple usage example of the phone set is presented. In this example,
two phones from the PT-SAMPA phone set are used: the vowel /6/ and the nasal
diphthong /6~w~/. Notice the use of the translation attribute to translate /6/ into /A/ and /~/ into /y/.
This translation assists, for instance, the parser of the HTK Toolkit [Young et al.] in interpreting special
phone symbols, since HTK is known to have parsing problems with these special characters.
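A phone set description following DTD 1 can be consumed with any standard XML parser. The sketch below uses Python's standard-library ElementTree (rather than the DOM interface listed in the acronyms) and a trimmed-down document; the element and attribute names follow the DTD, while everything else is illustrative:

```python
import xml.etree.ElementTree as ET

doc = """<PhoneSet name="PT-SAMPA">
  <Phone name="6" maintype="Vowels" translation="A">
    <PhoneFeature name="Nasal">NonNasal</PhoneFeature>
    <PhoneFeature name="Back">Back</PhoneFeature>
  </Phone>
</PhoneSet>"""

root = ET.fromstring(doc)
# Build a lookup table: phone name -> {feature name: feature value}.
phones = {
    p.get("name"): {f.get("name"): f.text for f in p.findall("PhoneFeature")}
    for p in root.findall("Phone")
}
# phones["6"]["Nasal"] == "NonNasal"
```

A table of this shape is all a feature label generator needs in order to answer per-phone questions (nasal? back? etc.) without any language-specific code in the system core.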
Part-Of-Speech
The next grammar feature is Part-Of-Speech (POS). POS is a word-level feature type and
an important linguistic resource in context cluster synthesis. Information obtained from a
morphosyntactic tagging system can be relevant in several areas of natural language processing
[Ribeiro et al.]. For example, knowing the POS of a given word allows one
to predict which words or word types can occur in its neighborhood.

DTD 1: Phone set Document Type Definition.

<!ENTITY % PhoneSetElements "Phone" >
<!ENTITY % PhoneElements "PhoneFeature" >

<!ELEMENT PhoneSet (%PhoneSetElements;)* >
<!ATTLIST PhoneSet name CDATA #REQUIRED>

<!ELEMENT Phone (%PhoneElements;)* >
<!ATTLIST Phone name CDATA #REQUIRED
                maintype CDATA #REQUIRED
                nuclearphone CDATA #IMPLIED
                translation CDATA #IMPLIED>

<!ELEMENT PhoneFeature (#PCDATA)* >
<!ATTLIST PhoneFeature name CDATA #REQUIRED>

XML Code 1: A very simple PhoneSet XML usage example.

<PhoneSet name="PT-SAMPA">
  <Phone name="6" maintype="Vowels" translation="A">
    <PhoneFeature name="Syllabic">Syllabic</PhoneFeature>
    <PhoneFeature name="High">NonHigh</PhoneFeature>
    <PhoneFeature name="Low">NonLow</PhoneFeature>
    <PhoneFeature name="Back">Back</PhoneFeature>
    <PhoneFeature name="Labial">NonLabial</PhoneFeature>
    <PhoneFeature name="Round">NonRound</PhoneFeature>
    <PhoneFeature name="Nasal">NonNasal</PhoneFeature>
    <PhoneFeature name="Dorsal">Dorsal</PhoneFeature>
  </Phone>
  <Phone name="6~w~" maintype="Vowels" translation="Aywy" nuclearphone="6~">
    <PhoneFeature name="Syllabic">Syllabic</PhoneFeature>
    <PhoneFeature name="High">NonHigh</PhoneFeature>
    <PhoneFeature name="Low">NonLow</PhoneFeature>
    <PhoneFeature name="Back">Back</PhoneFeature>
    <PhoneFeature name="Labial">NonLabial</PhoneFeature>
    <PhoneFeature name="Round">NonRound</PhoneFeature>
    <PhoneFeature name="Nasal">Nasal</PhoneFeature>
    <PhoneFeature name="Dorsal">Dorsal</PhoneFeature>
  </Phone>
</PhoneSet>

Morphosyntactic information
can also be used to select special words (or word types) or to know which affixes a given
word can take. In the same way, a morphosyntactic tagger can help context-dependent
clustered models to improve the quality of the produced speech. POS plays an important
role in the prediction of prosodic phrasing and accentuation. Certain POS categories, such
as the content and functional categories, have a strong influence on word accentuation.
Content words belong to major open-class lexical categories such as noun, verb, adjective
and adverb, and to closed-class words such as negatives and some quantifiers. Decision
methods based on content words have been widely used in word accentuation and have
proven to be very effective [Ribeiro et al., Huang et al.].
In DTD 2 the Document Type Definition for the XML implementation is presented. The
root entity of the POS is the POS entity. This entity may have two types of elements: the
first is LexicalCategorys, where morphosyntactic information is defined, and the second is
LexicalCategoryGroups, where POS category groups, such as the content and functional
categories, are defined. The LexicalCategorys entity may have one or more elements of
type Category. This last entity has two mandatory attributes, name and tag. The name
attribute is used to specify the morphosyntactic classification, while the tag attribute is
used to set its respective POS tag. The LexicalCategoryGroups entity may have one or
more elements of type Group. The Group entity only has one mandatory attribute, named
name. This entity is used to define groups of lexical categories, by means of the
LexicalCategory entity.
DTD 2 Part-Of-Speech "Document Type Definition".

<!ENTITY % POSElements "LexicalCategorys|LexicalCategoryGroups" >
<!ENTITY % LexicalCategorysElements "Category" >
<!ENTITY % LexicalCategoryGroupsElements "Group" >
<!ENTITY % GroupElements "LexicalCategory" >

<!ELEMENT LexicalCategory (#PCDATA)* >

<!ELEMENT Group (%GroupElements;)* >
<!ATTLIST Group name CDATA #REQUIRED>

<!ELEMENT Category EMPTY >
<!ATTLIST Category name CDATA #REQUIRED tag CDATA #REQUIRED>

<!ELEMENT LexicalCategoryGroups (%LexicalCategoryGroupsElements;)* >

<!ELEMENT LexicalCategorys (%LexicalCategorysElements;)* >

<!ELEMENT POS (%POSElements;)* >
In XML 2 a simple usage example of the POS grammar feature is presented. In this example,
the definition of three morphosyntactic categories (Noun, Verb and Preposition) and their
respective POS tags can be observed. Also note the grouping mechanism used to define two
POS categories (content and functional), where content is defined with the Noun and Verb
lexical categories and functional with its only member, Preposition.
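The category-to-tag mapping and the content/functional grouping can be mirrored by a small lookup. The following Python sketch is illustrative only (the data mirrors XML Code 2; the function names are not part of the system):

```python
# Categories carry a POS tag; groups are defined over category names,
# mirroring the LexicalCategorys / LexicalCategoryGroups entities.
CATEGORIES = {"Noun": "N", "Verb": "V", "Preposition": "P"}
GROUPS = {"content": {"Noun", "Verb"}, "functional": {"Preposition"}}

def group_of(category: str) -> str:
    """Return the lexical category group a category belongs to."""
    for group, members in GROUPS.items():
        if category in members:
            return group
    return "other"

print(CATEGORIES["Noun"], group_of("Noun"))                # N content
print(CATEGORIES["Preposition"], group_of("Preposition"))  # P functional
```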
Prosodic Markers
Speech events such as tone characterization are important linguistic resources. A good
tonal description allows the resulting synthesized speech to have a more natural behavior.
The purpose of this grammar feature is to define linguistic events of this kind.
In DTD 3 the Document Type Definition for the XML implementation is presented. The root
entity for prosodic markers is the ProsodicMarkers entity. This entity defines a set of
prosodic marker types and may have elements of type MarkerType. The MarkerType entity is
used to define linguistic markers that reproduce a known prosodic pattern, such as
punctuation or word breaks. This entity only has one mandatory attribute, named name. The
MarkerType entity has elements of type ProsodicMarker. The ProsodicMarker entity has two
mandatory attributes.
XML Code 2 A very simple Part-Of-Speech XML usage example.

<POS>
  <LexicalCategorys>
    <Category name="Noun" tag="N"/>
    <Category name="Verb" tag="V"/>
    <Category name="Preposition" tag="P"/>
  </LexicalCategorys>
  <LexicalCategoryGroups>
    <Group name="content">
      <LexicalCategory>Noun</LexicalCategory>
      <LexicalCategory>Verb</LexicalCategory>
    </Group>
    <Group name="functional">
      <LexicalCategory>Preposition</LexicalCategory>
    </Group>
  </LexicalCategoryGroups>
</POS>
The first, the name attribute, is used to identify the linguistic marker and the second, the tag
attribute, is used as the contextual factor identifier.
DTD 3 Prosodic Markers "Document Type Definition".

<!ENTITY % ProsodicMarkersElements "MarkerType" >
<!ENTITY % MarkerTypeElements "ProsodicMarker" >

<!ELEMENT ProsodicMarker EMPTY >
<!ATTLIST ProsodicMarker name CDATA #REQUIRED tag CDATA #REQUIRED>

<!ELEMENT MarkerType (%MarkerTypeElements;)* >
<!ATTLIST MarkerType name CDATA #REQUIRED>

<!ELEMENT ProsodicMarkers (%ProsodicMarkersElements;)* >
In XML 3, a simple usage example is shown, where the definitions of the WordBreak and
Punctuation prosodic markers can be observed.
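The marker definitions amount to a lookup from a marker name to its contextual-factor tag. A minimal illustrative sketch (data taken from XML Code 3; the default tag "O" for unlisted punctuation is an assumption based on the empty-name marker):

```python
# Prosodic marker tables, mirroring the MarkerType elements in XML Code 3.
MARKERS = {
    "WordBreak": {"NB": "0", "B": "3", "BB": "4"},
    "PunctuationType": {".": "A", "?": "I", "!": "E", ",": "C"},
}

def marker_tag(marker_type: str, name: str, default: str = "O") -> str:
    """Look up the contextual-factor tag for a prosodic marker."""
    return MARKERS.get(marker_type, {}).get(name, default)

print(marker_tag("WordBreak", "BB"))        # -> 4
print(marker_tag("PunctuationType", "?"))   # -> I
print(marker_tag("PunctuationType", ";"))   # -> O (fallback tag)
```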
2.3.2 Configurable Context Factors Module
A configurable Context Factors Module is introduced here to provide an easy way of
configuring specific context factors for a specific language, without performing any
modifications to the system core. The idea consists of retrieving context factor
information from a linguistic storage facilitator. In this work, Heterogeneous Relation
Graphs (HRGs) [Taylor et al.] were used to describe the linguistic structures. The HRG
formalism was developed for use in the Festival speech synthesis system [Black et al.].
In this formalism, linguistic objects such as words, syllables and phonemes are
represented by objects termed linguistic items. These items exist in relation structures,
which specify the relationships between the items. A relation exists for each required
linguistic type. An HRG contains all the relations and items for an utterance.
XML Code 3 Example of Prosodic Markers XML usage.

<ProsodicMarkers>
  <MarkerType name="WordBreak">
    <ProsodicMarker name="NB" tag="0"/>
    <ProsodicMarker name="B" tag="3"/>
    <ProsodicMarker name="BB" tag="4"/>
  </MarkerType>
  <MarkerType name="PunctuationType">
    <ProsodicMarker name="." tag="A"/>
    <ProsodicMarker name="?" tag="I"/>
    <ProsodicMarker name="!" tag="E"/>
    <ProsodicMarker name="," tag="C"/>
    <ProsodicMarker name="" tag="O"/>
  </MarkerType>
</ProsodicMarkers>
Next, the basic HRG feature access mechanism for this module is analyzed. In DTD 4 the
Document Type Definition for the XML implementation is presented. The root entity of this
module is the Label entity. This entity may only have elements of type BaseRelation, which
define sets of contextual factors for different targets. The idea is to allow sets of
contextual factors to be defined under different conditions. A set of context factors can
be defined for the training process of tree-based context clustering, while the same
context factors can be seen differently during the synthesis stage. For example, the
effect of post-lexical rules applied at the synthesis stage can only be observed in the
real phone sequence during the training stage, requiring a different disposition of the
linguistic data for different targets.
The BaseRelation entity has two attributes: the name attribute, which establishes the
Relation in the HRG structure from which all context factors are generated, and the
target attribute, which is used to identify the set of context factors that will be
returned. The BaseRelation entity may only have elements of type Level. These define
the context factor linguistic levels, such as the Word, Syllable or Phone levels.
The Level entity has two attributes, the name and switch attributes. The first gives
information about the linguistic level. The second works much like the switch flow
control statement of the C/C++ languages. Some context factors may change their
linguistic references under certain conditions. For example, consider the contextual
factor for the central phone. When its value becomes a silence, certain context factors
carry little linguistic information and sometimes none at all, meaning that they are
missing from the HRG structure. To handle this, the switch attribute determines the HRG
base relation linguistic feature to be observed, whose value will cause a modification
of the underlying context factor specifications.
The Level entity only has Features elements. These elements have only one attribute, the
case attribute. Its value triggers the change in the context factors previously defined
by the switch attribute on the Level entity. When the linguistic feature defined by the
switch attribute reaches the value defined by the case attribute, a change in the
specification of the context
DTD 4 Feature Extraction Configuration "Document Type Definition".

<!ENTITY % LabelElements "BaseRelation" >
<!ENTITY % BaseRelationElements "Level" >
<!ENTITY % LevelElements "Features" >
<!ENTITY % FeaturesElements "BaseItem" >
<!ENTITY % BaseItemElements "Feature" >

<!ELEMENT Feature EMPTY >
<!ATTLIST Feature position CDATA #REQUIRED pre CDATA #REQUIRED
          post CDATA #REQUIRED name CDATA #REQUIRED
          arg CDATA #REQUIRED null CDATA #REQUIRED
          tag CDATA #REQUIRED question CDATA #IMPLIED
          lower CDATA #IMPLIED upper CDATA #IMPLIED
          status CDATA #IMPLIED required CDATA #IMPLIED>

<!ELEMENT BaseItem (%BaseItemElements;)* >
<!ATTLIST BaseItem name CDATA #REQUIRED >

<!ELEMENT Features (%FeaturesElements;)* >
<!ATTLIST Features case CDATA #REQUIRED >

<!ELEMENT Level (%LevelElements;)* >
<!ATTLIST Level name CDATA #REQUIRED switch CDATA #REQUIRED >

<!ELEMENT BaseRelation (%BaseRelationElements;)* >
<!ATTLIST BaseRelation name CDATA #REQUIRED target CDATA #REQUIRED>

<!ELEMENT Label (%LabelElements;)* >
factors will occur at the current linguistic level.
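The switch/case mechanism can be illustrated with a minimal sketch. The data and function names below are assumptions for illustration; in the real module this information is read from the XML configuration:

```python
# One Level, mirroring the Syllable level in XML Code 4: the "switch"
# attribute names the feature to observe; the Features element whose
# "case" matches the observed value supplies the active specification.
LEVEL = {
    "switch": "name",
    "cases": {
        "Default": ["stress"],  # normal syllable features
        "#": ["null"],          # silence: no syllable information
    },
}

def active_features(observed_value: str):
    """Select the feature specification matching the switched value."""
    cases = LEVEL["cases"]
    return cases.get(observed_value, cases["Default"])

print(active_features("a"))  # -> ['stress']
print(active_features("#"))  # -> ['null']
```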
The Features entity only has BaseItem type elements, which have only one attribute,
named name. This attribute specifies the base path, in the HRG, from the base relation
linguistic item to the underlying linguistic features. The BaseItem entity exists for
optimization purposes: defining the HRG base path to the underlying features minimizes
the use of search algorithms and thus maximizes performance on feature retrieval.
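The role of the base path can be pictured as dotted-path navigation over a nested structure. The sketch below is a loose Python analogy, not Festival's actual HRG API; the item layout and feature names are invented for illustration, and the "x" default mirrors the null attribute described later:

```python
# Follow a dotted path such as "parent.stress" link by link from a base
# item, returning a default value when any step is missing.
def fetch(item: dict, path: str, default: str = "x"):
    node = item
    for step in path.split("."):
        node = node.get(step) if isinstance(node, dict) else None
        if node is None:
            return default
    return node

phone = {"name": "6", "parent": {"stress": "1"}}
print(fetch(phone, "parent.stress"))  # -> 1
print(fetch(phone, "parent.accent"))  # -> x (missing feature)
```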
The BaseItem entity may only have elements of type Feature. These are responsible for
the context factor definitions. The Feature entity has twelve attributes, seven mandatory
and five optional. To ease description, these twelve attributes are divided into the
following four sets: output, context feature, question generation and flow control. In
the first set, position, pre and post are responsible for the context label format. The
position attribute specifies the context factor's order in the context label, and the pre
and post attributes the context factor separators. In the second set, the name attribute
is responsible for fetching linguistic features from the HRG structure. Since not all
linguistic features are statically available in the HRG structure, the arg attribute may
be used to dynamically retrieve linguistic features, by using pre-defined processing
functions specified in the name attribute. The arg attribute specifies the processing
function input parameters and the null attribute the default return value in case of
missing features. In the third set, the question generation set, the attributes tag,
question, lower and upper are used to automatically generate context
tree questions. The tag attribute identifies the set of context feature questions that
are generated by the question attribute. The question attribute identifies a pre-defined
processing function for automatic question generation, and the lower and upper attributes
its constraints. For example, in XML 4, at the syllable level, the feature named stress
applies the automatic question generation function eq, to generate stress questions equal
to 0 and 1. The last set, the flow control set, is used to control the context feature
flow. The status attribute is used to enable or disable a certain context feature; while
assembling context factors it is often useful to enable or disable certain context
factors for testing. Finally, the required attribute forces the current context factor
to disregard the current context label if the feature value is null or nonexistent. In
XML 4, a simple XML usage example of context factor specification is shown.
XML Code 4 XML example for the Context Factors Module.

<Label>
  <BaseRelation name="Observed" target="Simple-Training-Full">
    <Level name="RealPhone" switch="name">
      <Features case="Default">
        <BaseItem name="">
          <Feature position="0" pre="^" post="-" name="p.trans_name" arg="" null="x"
                   tag="L.Phone" question="FeaturedPhones" lower="" upper=""/>
          <Feature position="1" pre="-" post="+" name="trans_name" arg="" null="x"
                   tag="C.Phone" question="FeaturedPhones" lower="" upper=""/>
          <Feature position="2" pre="+" post="=" name="n.trans_name" arg="" null="x"
                   tag="R.Phone" question="FeaturedPhones" lower="" upper=""/>
        </BaseItem>
      </Features>
    </Level>
    <Level name="Syllable" switch="name">
      <Features case="Default">
        <BaseItem name="R:SylStructure.parent.R:Syllable">
          <Feature position="4" pre="-" post="!" name="stress" arg="" null="x"
                   tag="C.Syllable.Stressed" question="eq" lower="0" upper="1"/>
        </BaseItem>
      </Features>
      <Features case="#">
        <BaseItem name="R:SylStructure.parent.R:Syllable">
          <Feature position="4" pre="-" post="!" name="null" arg="" null="x"
                   tag="C.Syllable.Stressed"/>
        </BaseItem>
      </Features>
    </Level>
  </BaseRelation>
</Label>
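Given the position, pre and post attributes above, a context label fragment could be assembled roughly as follows. This is a simplified sketch, not the system's actual label builder; it assumes each factor's pre separator coincides with the previous factor's post separator, as in XML Code 4:

```python
# Assemble a context-label fragment from feature values and separators.
# Feature values here are invented; separators come from XML Code 4.
features = [
    {"position": 0, "post": "-", "value": "x"},  # previous phone (null -> x)
    {"position": 1, "post": "+", "value": "6"},  # central phone
    {"position": 2, "post": "=", "value": "w"},  # next phone
]

def build_label(feats):
    ordered = sorted(feats, key=lambda f: f["position"])
    # Adjacent factors share a separator (each pre equals the previous
    # post), so emitting value + post for each factor is sufficient.
    return "".join(f["value"] + f["post"] for f in ordered)

print(build_label(features))  # -> x-6+w=
```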
Although the structure of this module could be designed in a number of different ways,
this implementation can be easily modified and adapts well to the target context factors.
2.3.3 Feature Label Generator
In the previous sections, configurable language-dependent modules for on-line context
feature extraction were proposed. XML was chosen as the support language for these
configurable modules. In this section, a C++ interface for these modules is described,
which performs automatic context-dependent label extraction and automatic question
generation. The interface consists of an XML parser, a question generator and a
context-dependent label generator.
The XML parser was implemented using the Xerces-C library. Xerces-C is a validating XML
parser written in a portable subset of C++, faithful to the XML 1.0 recommendation and
associated standards. The parser uses the Xerces-C DOM to load grammar features and
context factors into an object representation similar to the entities described in the
previous sections. After the XML (containing grammar features and context factors) is
loaded, questions can be automatically generated in HTK format. Questions are generated
for all context factors, by using the question, lower and upper attributes of the Feature
entity. The context-dependent label generator uses the loaded XML to retrieve all
context-dependent linguistic information. Given an input utterance, all items in the
relation specified by the name attribute of the BaseRelation entity are sequentially
processed and the underlying linguistic information fetched as described in section
2.3.2. After the linguistic information is retrieved from the utterance, the gathered
data is dumped as an HTK context-dependent label.
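As an illustration of the automatic question generation step, the eq function for the stress feature in XML Code 4 could expand into HTK-style QS lines along the following lines. The question naming scheme shown is an assumption for illustration, not the system's exact output; only the QS pattern idea (matching a factor through its pre/post separators) is taken from the text:

```python
# Generate one yes/no question per value in [lower, upper] for an "eq"
# question function, matching the factor through its separators.
def eq_questions(tag, pre, post, lower, upper):
    return [
        'QS "%s==%d" {*%s%d%s*}' % (tag, v, pre, v, post)
        for v in range(lower, upper + 1)
    ]

for q in eq_questions("C.Syllable.Stressed", "-", "!", 0, 1):
    print(q)
# QS "C.Syllable.Stressed==0" {*-0!*}
# QS "C.Syllable.Stressed==1" {*-1!*}
```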
2.4 Grammar Features and Context Factors for European Portuguese
In this section, the work done in the context of this thesis for European Portuguese is
presented, with detailed descriptions of the grammar features and context factors.
2.4.1 Grammar Features
Phoneset
The phonetic representation used for the European Portuguese dialect was the PT-SAMPA
phoneset. In appendix A a reference to the IPA phonetic alphabet can be found.
The need to generalize across languages and to express phonological rules clearly led
many linguists to the creation of distinctive feature systems. The most notable one is
that of Chomsky and Halle (1968), following the pioneering work of Jakobson and Halle
(1956) on distinctive feature theory. In the Chomsky and Halle (1968) system, features
are binary, where /+/ indicates the presence of a property and /−/ its absence. Each
feature represents an independently controllable articulatory aspect.
The sets of distinctive features used in this thesis are based on the work of Mateus and
d'Andrade (2000); Mateus, Andrade, Viana, and Villalva (1990); Oliveira (1996), which
applies Chomsky's distinctive feature system to European Portuguese. In table 2.1 the
distinctive features for PT-SAMPA vowels are presented, and in table 2.2 the distinctive
features for PT-SAMPA consonants. These distinctive features were used in this thesis as
phone features, to build the phonetic question set for context-clustered models.
Name  Syllabic  High  Low  Back  Labial  Round  Nasal  Dorsal
i        +       +     −     −      −       −      −      −
e        +       −     −     −      −       −      −      −
E        +       −     +     −      −       −      −      −
6        +       −     −     +      −       −      −      +
a        +       −     +     +      −       −      −      +
O        +       −     +     +      +       +      −      −
o        +       −     −     +      +       +      −      −
u        +       +     −     +      +       +      −      −
@        +       +     −     +      −       −      −      +
i˜       +       +     −     +      −       −      +      −
e˜       +       −     −     +      −       −      +      −
6˜       +       −     −     +      −       −      +      +
o˜       +       −     −     +      +       +      +      −
u˜       +       +     −     +      +       +      +      −
j        −       +     −     −      −       −      −      −
w        −       +     −     +      −       +      −      +
j˜       −       +     −     −      −       −      +      −
w˜       −       +     −     +      −       +      +      +
Table 2.1: Chomsky distinctive features for PT-SAMPA vowels
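A feature table like table 2.1 can be turned into phonetic question sets by grouping the phones that share a feature value. The sketch below is illustrative: it encodes only a small subset of the vowel table (with ASCII ~ standing in for the nasal diacritic) and the function name is not part of the system:

```python
# Subset of table 2.1: High and Nasal values for four vowels.
VOWELS = {
    "i":  {"High": "+", "Nasal": "-"},
    "6":  {"High": "-", "Nasal": "-"},
    "6~": {"High": "-", "Nasal": "+"},
    "u~": {"High": "+", "Nasal": "+"},
}

def phones_with(feature: str, value: str):
    """Collect the phones whose distinctive feature has the given value."""
    return sorted(p for p, f in VOWELS.items() if f[feature] == value)

print(phones_with("Nasal", "+"))  # -> ['6~', 'u~']
print(phones_with("High", "+"))   # -> ['i', 'u~']
```

Each such group corresponds to one yes/no question usable by the tree-based clustering.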
POS
As seen in section 2.3.1, POS is a word level feature type and an important linguistic
resource. In synthesis based on context-dependent clustered models, POS assists the
prediction of prosodic phrasing and accentuation. The POS tagging system used was the
one defined in Ribeiro (2003). The tag set used is a subset of the POS tag set of the
PAROLE corpus [Nascimento et al.]. This subset retains information on the lexical
category and sub-category, discarding any other information. The resulting set has a
total of 28 tags, which can be observed in table 2.3.
Content words belong to major open-class lexical categories such as noun, verb,
adjective and adverb, and to certain closed-class words such as negatives and some
quantifiers. Likewise, functional words belong to closed-class lexical categories such
as articles, conjunctions, pronouns, prepositions and numerals. The grouping of lexical
categories has a strong influence on the prediction of accents and of prosodic events
that can contribute to better modeling of the HMM model trees. In table 2.3, the lexical
category grouping carried out for European Portuguese can be observed.
Name  Continuant  Sonorant  Prior  Coronal  Back  Distributed  Nasal
p         −          −        +       −       −        −         −
b         −          −        +       −       −        −         −
t         −          −        +       +       −        −         −
d         −          −        +       +       −        −         −
k         −          −        −       −       +        −         −
g         −          −        −       −       +        −         −
f         +          −        +       −       −        −         −
v         +          −        +       −       −        −         −
s         +          −        +       +       −        +         −
z         +          −        +       +       −        +         −
S         +          −        −       +       −        +         −
Z         +          −        −       +       −        +         −
l         +          +        +       +       −        −         −
l˜        +          +        −       −       +        −         −
L         −          +        −       −       −        +         −
m         −          +        +       −       −        −         +
n         −          +        +       +       −        −         +
J         −          +        −       +       −        +         +
r         +          +        +       +       −        −         −
R         +          +        −       −       +        −         −

Name  High  Strident  Voiced  Lateral  Laryngeal  Labial  Dorsal
p       −       −        −       −         +         +       −
b       −       −        +       −         +         +       −
t       −       −        −       −         +         −       −
d       −       −        +       −         +         −       −
k       +       −        −       −         +         −       +
g       +       −        +       −         +         −       +
f       −       +        −       −         +         +       −
v       −       +        +       −         +         +       −
s       −       +        −       −         +         −       −
z       −       +        +       −         +         −       −
S       +       +        −       −         +         −       −
Z       +       +        +       −         +         −       −
l       −       −        +       +         −         −       −
l˜      −       −        +       +         −         −       +
L       +       −        +       +         −         −       −
m       −       −        +       −         −         +       −
n       −       −        +       −         −         −       −
J       +       −        +       −         −         −       −
r       −       −        +       −         −         −       −
R       +       −        −       −         −         −       +
Table 2.2: Chomsky distinctive features for PT-SAMPA consonants
Prosodic Markers
Tone characterization is an important linguistic feature at the word level. Its main
influence in context clustered models is on the pitch parameter. A good description of
intonation allows the resulting synthesized speech to have a more natural behavior.
POS Tag  Lexical Category         Lexical Category Group
AA       None                     none
Nc       Noun.Common              content
Np       Noun.Proper              content
V=       Verb                     content
A=       Adjective                content
R=       Adverb                   content
Td       Article.Definite         functional
Ti       Article.Indefinite       functional
Cc       Conjunction.Coordinate   functional
Cs       Conjunction.Subordinate  functional
Mc       Numeral.Cardinal         functional
Mo       Numeral.Ordinal          functional
Pp       Pronoun.Personal         functional
Pd       Pronoun.Demonstrative    functional
Pi       Pronoun.Indefinite       functional
Po       Pronoun.Possessive       functional
Pt       Pronoun.Interrogative    functional
Pr       Pronoun.Relative         functional
Pe       Pronoun.Exclamative      functional
Pf       Pronoun.Reflexive        functional
S=       Preposition              functional
I=       Interjection             emotional
Xf       Residual.LoanWords       other
Xa       Residual.Abbreviation    other
Xy       Residual.Acronym         other
Xs       Residual.Symbol          other
U=       PassiveMarker            other
O=       Punctuation              other

Table 2.3: POS tag system

In recent years, research on intonation for the European Portuguese dialect has been
carried out [Hirschberg and Prieto, Viana et al., Vigário and Frota]. However, tools for
automatic tone stylization are still under development, and thus a workaround solution
was needed. To enhance the models' intonation, syntactic information such as word breaks
and punctuation was used instead. In table 2.4, word break and punctuation based
prosodic markers are presented.
(a) Wordbreak

name  tag
NB    0
B     3
BB    4

(b) Punctuation

name     tag
.        A
?        I
!        E
,        C
/other/  O
Table 2.4: Prosodic Markers
2.4.2 Context Factors
In this section, the sets of context factors for European Portuguese are presented. The
following 24 context factor sets, distributed across 5 levels of speech categories, were
designed for European Portuguese to build context-dependent models. Note that all
context factors are relative to the current phone.
Phone level
At the phone level, penta-phones were used. Penta-phone models have been used by many
HMM-based speech synthesis systems, and have proved their importance in HMM modeling for
synthesis purposes [Black et al.].
In many spoken languages, including European Portuguese, sandhi phenomena contribute to
the naturalness of the language. Sandhi consists of acoustic phone modifications produced
between two consecutive words inside a sentence. To reproduce these phenomena in speech
synthesis, post-lexical rules are applied to canonical phone sequences. To capture the
importance of these phenomena in the models, observed phone sequences should be used.
However, using observed phone sequences to train context-dependent clustered HMM models
may reduce their accuracy: the clustering of HMMs may disregard the neighborhood
influences of important phone transitions when using observed phone sequences. To force
the training to take possible post-lexical influences into account, a mixed solution is
proposed: both canonical and observed phone sequences were included as context factor
sets, with the expectation that the decision trees would reflect post-lexical rules by
building different contexts for specific phone transitions.
Phone level context factors:
1. {previous previous, previous, current, next, next next} observed phone;
2. {previous previous, previous, current, next, next next} canonical phone;
3. {backward, forward} position in syllable;
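The penta-phone factors above amount to pairing each phone with its two left and two right neighbours. A minimal sketch follows (the sil padding symbol at utterance edges is an assumption for illustration):

```python
# Build quinphone (penta-phone) context tuples for a phone sequence,
# padding with a silence symbol at the utterance boundaries.
def quinphones(phones, pad="sil"):
    seq = [pad, pad] + phones + [pad, pad]
    return [tuple(seq[i:i + 5]) for i in range(len(phones))]

print(quinphones(["6", "w~"]))
# [('sil', 'sil', '6', 'w~', 'sil'), ('sil', '6', 'w~', 'sil', 'sil')]
```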
Syllable level
In this level, syllable information is retrieved. Stress, accent and proximity of major phrase
breaks are important to prosody. Consequently, related information was included.
Syllable level context factors:
1. {previous, current, next} syllable stress;
2. {previous, current, next} syllable accent;
3. {previous, current, next} number of phones in syllable;
4. {backward, forward} position of current syllable in word;
5. number of syllables to {previous, next} phrase break;
6. number of stressed syllables to {previous, next} phrase break;
7. number of accented syllables to {previous, next} phrase break;
8. distance to {previous, next} stressed syllable;
9. distance to {previous, next} accented syllable;
10. syllable nuclear phone;
Word level
At the word level, intonation and prosodic ruptures are fundamental. Accordingly, word
breaks, punctuation, POS and distance measures to content words were used.
Word level context factors:
1. {previous, current, next} part-of-speech;
2. {previous, current, next} word breaks;
3. {previous, current, next} punctuation type;
4. {previous, current, next} word number of syllables;
5. {backward, forward} position of word in phrase;
6. {backward, forward} number of content words in phrase;
7. distance to {previous,next} content word;
Phrase level
During the recordings, speakers tend to reproduce strong reading patterns, such as
intonation patterns. These patterns are usually a function of the phrase length and of
the distance to the last speech rupture. The phrase level attempts to capture these
reading patterns.
Phrase level context factors:
1. {previous, current, next} phrase number of syllables;
2. {previous, current, next} phrase number of words;
3. number of non-major phrase breaks to {previous, next} major phrase break;
Utterance level
The utterance level complements the phrase level. It attempts to capture other reading
patterns, such as intonation transition patterns between consecutive phrases.
Utterance level context factors:
1. total number of {syllables, words, phrases} in utterance;
2.4.3 Questions
As seen before, tree-based context clusters are represented as binary trees in which a
yes/no context question is attached to each node. For each context factor, a question is
automatically generated by the feature label generation module. The context sets
described in the previous section were automatically transformed into questions,
resulting in a total of 3231 questions.
2.5 Conclusions
In this chapter, context-dependent clustered models for speech synthesis were
overviewed. As stated before, the main problem in context cluster synthesis is the
dependency on the target language. To overcome language issues, configurable grammar
feature and context factor modules for on-line processing were proposed. The goal of
these modules is to provide an easy way of configuring context-dependent clustered
models, without modifications to the system core, and to avoid long and tedious
implementation work.
The work developed on grammar features and context factors for European Portuguese was
presented; although it was only applied to European Portuguese, it is general enough to
be applied to other languages.
3 Voice Building with HTS
Contents

3.1 Corpora
3.2 HMM Training
3.3 Results
3.4 Conclusions
Voice building is the process of preparing the recordings of a voice to be used by a
speech synthesis system. This process involves a set of procedures such as data
preparation, parameter extraction and model construction. HMM synthesis was the
technique chosen for small-footprint synthesis, concretely in an HTS-based system. In
this chapter, all HTS voice building procedures are described.
3.1 Corpora
The corpora used in this thesis were constructed in the scope of the Tecnovoz project
[National Project TECNOVOZ number 03/165], by the L2F laboratory at INESC-ID and its
Tecnovoz partner INOV. The Tecnovoz project was a joint effort to disseminate the use of
spoken language technologies in a wide range of domains. The project consortium included
4 research centers and 9 companies specialized in areas such as banking, health systems,
fleet management, security, media, alternative and augmentative communication, computer
desktop applications, etc.

The Tecnovoz road-map for speech databases was to build an inventory of a considerable
amount of speech recordings (from 3 to 10 hours or more) with carefully selected
contents. The database's original target was to feed a unit selection based TTS system.
In this section, the methodologies used in designing and recording the Tecnovoz speech
database are described, as well as the procedures taken while building a corpus for
HMM-based synthesis. More information on the speech database design methodologies can
be found in [Oliveira et al.].
3.1.1 Design of the Recording Prompts
Since the original purpose of the inventory was to feed a unit selection based TTS system,
the text prompts to be recorded were selected to cover the acoustic patterns observed in the
general use of the language. This coverage cannot be achieved at the word level, as the number
of words in a language is virtually infinite. Therefore, the candidate prompts must be represented
by smaller sized acoustic units. Three levels of representation were used: syllables, triphones
and diphones. These levels make up finite sets and can carry information that spans from the
phonetic level up to the prosodic level.
The text corpora mainly consist of newspaper texts and books, which do not always cover
the specific requirements of certain applications such as speech-to-speech translation,
medical systems, customer support, etc. The gathered text corpora contained a total of
70 million words and 420 thousand distinct words.
In order to have a proper sub-word selection scheme, very high confidence in the
estimated phone sequence of every sentence is needed. For this reason, all sentences
containing words not included in a manually corrected pronunciation lexicon were
discarded. The text corpus was thus reduced to approximately 400 thousand sentences. A
greedy selection algorithm was then used to select a representative sub-set of the
sentences in the text corpora [Johnson, Chevelu et al.], aiming at token coverage at
three selected levels: syllables, triphones and diphones. The greedy algorithm stopped
after a predefined coverage threshold, resulting in a total of 8260 selected text
prompts.
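The greedy selection loop can be sketched as follows. This is a toy illustration with invented data; the actual algorithm and its stopping criterion are described in the cited works:

```python
# Repeatedly pick the sentence covering the most not-yet-covered units
# (e.g. diphones) until a coverage threshold is reached.
def greedy_select(sentences, units_of, threshold):
    all_units = set().union(*(units_of(s) for s in sentences))
    covered, selected = set(), []
    while len(covered) < threshold * len(all_units):
        best = max(sentences, key=lambda s: len(units_of(s) - covered))
        gain = units_of(best) - covered
        if not gain:  # no sentence adds coverage; stop early
            break
        selected.append(best)
        covered |= gain
    return selected

# Toy data: three sentences with their diphone sets.
diphones = {"s1": {"a-b", "b-c"}, "s2": {"b-c"}, "s3": {"c-d", "a-b"}}
print(greedy_select(list(diphones), lambda s: diphones[s], 1.0))
# -> ['s1', 's3']
```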
3.1.2 Speakers
The original Tecnovoz speech database included four speakers, two of whom were male.
The speaker selection was based on the results of test recording sessions with several
candidates, who were pre-selected through personal contacts and through a voice talent
recording studio. All candidates were Portuguese native speakers from the Lisbon area.

The test consisted of recording a session of 600 sentences, selected to have a good
diphone coverage. Using these recordings, a synthesizer was built with each voice,
making it possible to evaluate not only the quality of the voice itself but also its
suitability for synthesis purposes. The decision was taken by listening to several
phonetically rich prompts synthesized with a variable size unit selection voice built
from the recordings of each speaker. The decision criteria were:
• the recording naturalness;
• number of repetitions per prompt per session;
• voice quality consistency;
• pleasantness of the synthesized voice;
• voice ability to mask concatenation errors.
In this thesis, only one of the male speakers was used. The choice of this speaker was
influenced by the following characteristics:
• professional voice talent;
• consistent reading naturalness;
• high voice quality;
• low pitch frequency.
3.1.3 Recordings
The recording of the inventory required a large number of recording sessions and a
strict recording procedure to ensure the uniformity of the database [Bonafonte et al.,
Oliver and Szklanny, Saratxaga et al.]. The recordings were conducted in the L2F
recording studio, which includes a sound-proof room and a control station (see figure
3.1), where the supervision of the recording process took place. The equipment in the
sound-proof room includes:
• a Studio Projects T3 Dual Triode microphone;
• an anti-pop filter;
• a Brüel & Kjær Type 2230 microphone probe;
• an LCD monitor;
• a set of headphones;
• a web camera;
• and a small mirror on the wall.
The supervisor could check the speaker's position at the beginning of each session by
comparing the web camera image with pictures taken in previous sessions. The small mirror on
the wall helped the speakers maintain a fixed distance to the microphones during the sessions:
they were asked to check the position of their face in the mirror periodically. Also, to keep
the speaker's voice level and quality consistent across sessions, blocks of recorded prompts from
previous sessions were played back and compared with newly recorded prompts at the beginning of
each session.
(a) Recording booth (b) Control Room
Figure 3.1: Recording Room [Oliveira et al.]
In the control station, the signals from both microphones were digitized using an RME
Fireface 800 digital mixing desk, with a sampling frequency of 44.1 kHz and 24-bit quantization.
The audio feedback and the supervisor's instructions were also routed through the mixing
desk to the speakers' headphones.
The control station had two display monitors, one of them mirrored inside the sound-
proof room. These monitors were used to display the recording prompts under the control of the
recording supervisor. Since speaker throat relaxation and list effects have an important effect
on the recorded speech, recordings were done in sessions of two hours with a 10-minute break
every half hour. Each recording session produced, on average, 40 minutes of recorded speech.
By the end of the recordings, 20 sessions per speaker had been made, resulting in 13 hours of
speech per speaker.
3.1.4 Phonetic Segmentation and Multi-Level Utterance Descriptions
The phonetic segmentation of the databases was performed in three stages [Weiss et al.;
Paulo et al.]. In the first stage, the speech files were segmented by Audimus [Neto and Meinedo]
working in forced alignment mode. Next, these segmentations were used by the HTK programs
[Young et al.] for training context-independent speaker-specific phone models. The
speaker-adapted models were subsequently provided to a phonetic segmentation tool based on
weighted finite state transducers, which allows many alternative word pronunciations [Paulo
and Oliveira].
The utterances' orthographic transcriptions are then combined with the respective phonetic
segmentations using the procedure described in [Paulo and Oliveira], in order to obtain a
realistic, multi-level description of the spoken utterances. Moreover, these descriptions are
enhanced with additional information, such as F0 values of the speech signal and prosodic
annotations. The F0 values are assigned to the respective phonetic segments based on a
temporal inclusion criterion.
3.1.5 Corpus sub-set for HMM Training
The selected male speaker from the Tecnovoz speech database has a total of 8260 prompts
and 13 hours of recorded speech. The large size of this database poses a problem for tree-based
context HMM clustering. As seen in section 2.2, tree-based context HMM clusters are
represented as binary trees in which a yes/no context question is attached to each node. These
trees are built using a top-down sequential optimization process, where initially all models are
placed in a single cluster at the root of the tree. At this point all HMM models must be
available for processing, meaning that considerable computational resources are required to build
the context tree. Consequently, a smaller subset of the database should be used in development
to guarantee enough computational resources in the building process.
The original corpus was designed to cover many important linguistic features, such as phones,
diphones, triphones, syllables and words. This coverage technique allows the design of databases
that are very rich in terms of intonation and special acoustic units like diphthongs and
triphthongs. Diphthongs and triphthongs have very specific characteristics, such as accent
co-articulation. In section 2.4.2, the use of penta-phones as context factors was described.
Their importance is reflected in the building of specific models for diphthongs and triphthongs.
Total penta-phone coverage is very hard to achieve, even in a large database like the Tecnovoz
database: the number of penta-phone combinations is in the order of 70 million.
To guarantee that the subset contains at least the most common and important linguistic
features, a statistical approach to subset corpus design was chosen. If the large corpus is
sampled uniformly at random until a certain number of prompts is reached, the probability of a
given unit being in the subset is asymptotically the same as that of being in the large corpus.
This means that the units in the subset will appear in asymptotically the same relative
proportions as in the large corpus. In figure 3.2, an example of this procedure for 1k prompts
is illustrated. Following this procedure, a subset for HMM training was built, amounting to a
total of 1500 prompts and 2.3 hours of recorded speech.
(a) Diphone coverage: in green, the diphone coverage of the Tecnovoz database and, in red, the coverage of the same diphones in the subset
(b) Triphone coverage: in green, the triphone coverage of the Tecnovoz database and, in red, the coverage of the same triphones in the subset
Figure 3.2: Unit coverage
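The uniform random sampling step can be sketched as below. This is a minimal illustration, assuming a toy corpus of labeled prompts and a fixed seed for reproducibility; the statistical point is simply that unit proportions in the sample track those of the full corpus.

```python
import random

def sample_subset(prompts, n, seed=0):
    """Uniform random sampling of n prompts: each unit's relative
    frequency in the subset approaches its frequency in the full corpus."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    return rng.sample(prompts, n)

# Illustrative check on a toy "corpus" with a 70/30 split of prompt types.
corpus = ["type_a"] * 700 + ["type_b"] * 300
subset = sample_subset(corpus, 100)
share_a = subset.count("type_a") / len(subset)  # close to 0.7 on average
```

With 100 samples drawn from a 70/30 population, the sampled proportion concentrates around 0.7 (hypergeometric standard deviation of roughly 4–5 percentage points), mirroring the asymptotic argument in the text.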
3.2 HMM Training
In the training phase, spectral parameters (vocal tract parameters) and excitation
parameters (F0 parameters) are extracted from a speech database and then modeled by context-
dependent HMMs. In figure 3.3, HMM training for synthesis based on context-dependent
clustered models is illustrated [Yoshimura].
Continuous density HMMs are usually adopted for vocal tract modeling, in the same way as in
speech recognition systems. A continuous density HMM is a finite state machine
which makes one state transition at each time unit. First, a decision is made on which state to
transition to. Then an output vector is generated according to the probability density function
of the current state. An HMM is thus a doubly stochastic random process, modeling both the
transition probabilities between states and the output probabilities at each state [Rabiner].

Figure 3.3: HMM Training [Yoshimura]
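The doubly stochastic process just described can be sketched in a few lines. The two-state transition and emission parameters below are illustrative, not trained values; each time step draws an emission from the current state's Gaussian and then a transition to the next state.

```python
import random

# Minimal sketch of an HMM as a doubly stochastic process:
# a hidden state transition plus a Gaussian output draw per time step.
# The two-state parameters are purely illustrative.
TRANS = {0: [(0, 0.8), (1, 0.2)], 1: [(1, 0.9), (0, 0.1)]}
EMIT = {0: (0.0, 1.0), 1: (5.0, 0.5)}   # state -> (mean, stddev)

def sample_hmm(n_steps, seed=0):
    rng = random.Random(seed)
    state, outputs = 0, []
    for _ in range(n_steps):
        mean, std = EMIT[state]
        outputs.append(rng.gauss(mean, std))        # emission draw
        nxt, probs = zip(*TRANS[state])
        state = rng.choices(nxt, weights=probs)[0]  # transition draw
    return outputs

obs = sample_hmm(50)
```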
The F0 pattern is composed of continuous values in voiced regions and a discrete symbol
in unvoiced regions. This dual nature makes F0 difficult to model with either discrete
or continuous HMMs. For this reason, the state output probabilities for F0 pattern modeling
are defined by Multi-Space probability Distributions (MSDs) [Yoshimura; Masuko].
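The idea behind an MSD state output distribution can be sketched as follows: one zero-dimensional space carries the "unvoiced" symbol with weight 1 − w, and a one-dimensional Gaussian space carries voiced log-F0 values with weight w. The weight and Gaussian parameters below are illustrative assumptions, not values from the trained models.

```python
import math

# Hedged sketch of a Multi-Space probability Distribution for F0:
# a 0-dimensional space for the "unvoiced" symbol and a 1-D Gaussian
# space for voiced log-F0 values. Parameters are illustrative.

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def msd_likelihood(obs, w_voiced, mean, var):
    """obs is either the symbol 'unvoiced' or a continuous log-F0 value."""
    if obs == "unvoiced":
        return 1.0 - w_voiced          # probability mass of the 0-D space
    return w_voiced * gaussian_pdf(obs, mean, var)

p_uv = msd_likelihood("unvoiced", w_voiced=0.8, mean=4.7, var=0.01)
p_v = msd_likelihood(4.7, w_voiced=0.8, mean=4.7, var=0.01)
```

A single distribution thus scores both kinds of observation, which is exactly what discrete-only or continuous-only HMMs cannot do.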
In this thesis, HMM training was performed with the HTS 2.1 toolkit [Tokuda et al.] and a
training script from one of the speaker-dependent training demos available on-line [Tokuda et
al.]. HTS 2.1 is integrated into the HTK 3.4 toolkit [Young et al.].
3.2.1 Setup
The audio of the previously selected corpus subset was recorded with 44.1 kHz sampling
frequency and 24-bit quantization, as stated in section 3.1.3. The first step in the data
setup procedure was to lowpass-filter the audio database with a cut-off frequency of 8 kHz and
downsample it to 16 kHz, producing a speech database with 16 kHz sampling frequency and 16-bit
quantization. The use of a lower sampling frequency and bit depth makes the training procedure
much lighter, without compromising either the quality of the corpus or the quality of the
synthesis.
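The rate and depth conversion can be sketched in pure Python. This is a simplified stand-in: a real pipeline applies the 8 kHz lowpass filter before resampling, as described above, whereas here plain linear interpolation and bit truncation illustrate the two operations.

```python
# Hedged sketch of the 44.1 kHz/24-bit -> 16 kHz/16-bit conversion.
# A real pipeline lowpass-filters at 8 kHz first; linear interpolation
# stands in here for brevity.

def resample_linear(samples, src_rate, dst_rate):
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in source samples
        j = min(int(pos), len(samples) - 2)
        frac = pos - j
        out.append(samples[j] * (1 - frac) + samples[j + 1] * frac)
    return out

def requantize_24_to_16(sample_24bit):
    return sample_24bit >> 8                   # drop the 8 least significant bits

x = list(range(441))                           # 10 ms of fake 44.1 kHz audio
y = resample_linear(x, 44100, 16000)           # 10 ms at 16 kHz -> 160 samples
```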
The next steps were spectral and F0 parameter extraction. The method used for spectral
extraction was Mel Generalized Cepstrum coefficients (MGC). MGC extraction is available in the
SPTK toolkit [Tokuda et al.] and is a variation of the well-known Mel Frequency Cepstrum
Coefficients (MFCC). This method is used by default by the training scripts of the HTS demos.
The input configuration parameters for MGC extraction used in this thesis are shown in table 3.1.
Analysis order: 32
Window type: Hamming
Frame length: 400 points
Frame shift: 80 points
FFT length: 2048 points
Frequency warping factor: 0.42
Table 3.1: Mel generalized cepstrum coefficients extraction parameters
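The point-based values in table 3.1 translate into time units at the 16 kHz rate used for training; the short check below works them out (the constants come straight from the table, the arithmetic is elementary).

```python
# Converting the analysis parameters of table 3.1 into time units
# at the 16 kHz sampling rate used for training.

FS = 16000            # sampling frequency, Hz
FRAME_LENGTH = 400    # points
FRAME_SHIFT = 80      # points

frame_ms = 1000 * FRAME_LENGTH / FS   # 25.0 ms analysis window
shift_ms = 1000 * FRAME_SHIFT / FS    # 5.0 ms shift
frames_per_second = FS / FRAME_SHIFT  # 200.0 frames per second
```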
For F0 parameter extraction, the ESPS method from the SNACK toolkit [Sjölander] was used.
The chosen F0 boundary parameters were 55 Hz and 400 Hz for the lower and upper limits,
respectively. These values were chosen after careful analysis of the selected speaker's pitch
variations. The final step in the data setup was the HMM specification. In this thesis, the
default values from the demo scripts were used: 5 states per HMM, 3 delta windows and 1 mixture
component for both the MGC coefficients and the F0 values.
3.2.2 Training
As stated before, HMM training was performed with the HTS 2.1 toolkit [Tokuda et al.] and a
training script from one of the speaker-dependent training demos available on-line [Tokuda et
al.]. However, this demo relies on Festival [Black et al.] scripts to generate the
context-dependent label sequences for context-dependent cluster training. These scripts are
also language dependent, more concretely specific to American English. Therefore, adaptations
were made to use the multi-language platform described in chapter 2. In essence, the changes
consisted of using the feature label generator, described in section 2.3.3, to generate
context-dependent label sequences for European Portuguese context-dependent cluster HMM
training. The training script consists of a series of steps to produce context-clustered HMMs.
Some of the main steps performed by the HTS demo scripts are:
1. Global variance computation;
2. Initialization and re-estimation;
3. Embedded re-estimation for mono-phones;
4. Embedded re-estimation for full-context;
5. Tree-based context clustering for mel generalized cepstral coefficients and log(F0);
6. Clustered embedded re-estimation;
7. Untied parameter sharing structure;
8. Untied embedded re-estimation;
9. Tree-based context clustering for mel generalized cepstral coefficients and log(F0);
10. Re-clustered embedded re-estimation;
11. Tree-based context clustering for duration.
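The tree-based clustering steps above can be illustrated by a single greedy split decision. This is a hedged sketch, not the HTS implementation: the data, the question, and the flat penalty standing in for the MDL criterion are all assumptions made for the example.

```python
import math

# Illustrative sketch of one step of top-down tree-based context
# clustering: choose the yes/no context question with the largest
# log-likelihood gain, subject to an MDL-style penalty.

def loglik(values):
    """Log-likelihood of values under a single Gaussian fit to them."""
    n = len(values)
    mean = sum(values) / n
    var = max(sum((v - mean) ** 2 for v in values) / n, 1e-6)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def best_split(models, questions, penalty):
    """models: list of (context, value). questions: name -> predicate."""
    base = loglik([v for _, v in models])
    best = None
    for name, pred in questions.items():
        yes = [v for c, v in models if pred(c)]
        no = [v for c, v in models if not pred(c)]
        if not yes or not no:
            continue
        gain = loglik(yes) + loglik(no) - base
        if gain > penalty and (best is None or gain > best[1]):
            best = (name, gain)
    return best

# Toy pool: contexts ending in "v" (right neighbor is a vowel) behave
# differently from the rest, so the vowel question should be selected.
models = [("a+v", 1.0), ("a+v", 1.2), ("k+c", 5.0), ("t+c", 5.1)]
questions = {"R-Vowel?": lambda c: c.endswith("v")}
split = best_split(models, questions, penalty=0.5)
```

Applied recursively to each resulting node until no split clears the penalty, this yields the binary question trees described in section 3.1.5.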
3.3 Results
The training process was performed on an Intel® Core™ 2 CPU platform at 2.40 GHz with 4 GB
of physical memory, and took approximately 24 hours, including parameter generation. The memory
peak occurred while performing tree-based context-dependent clustering, where approximately
1 GB of memory was allocated. The final voice (context-dependent trees, models and configuration
files) occupied 3.5 MB.
3.3.1 Data Analysis
Once the training process was concluded, the trees and models were analyzed. In table 3.2,
the model and question counts for duration, pitch and spectral coefficients in the context-
dependent trees are presented. The first noticeable result is the number of models used in pitch
modeling: over 7000 models were constructed and 50% of the total number of questions were
used. This reveals an important result: despite the fact that no pitch stylization
information was used to train the context-dependent models, the system used a lot of context
information to generalize the corpus. Moreover, since the MDL principle was used to train the
context-dependent trees, over-specialized context-dependent trees for pitch are unlikely.
Another interesting result is the low number of models and questions used to train the MGC
context-dependent trees. Spectral coefficient discrimination is very important in HMM-based
speech synthesis and, given the complexity of the spectrum, more models were expected.
Stream   # Models   # Questions   Questions Selected
dur      677        440           13.6 %
f0       7051       1615          50.0 %
mgc      2006       553           17.1 %

Table 3.2: Models and questions count for duration, pitch and spectral coefficients
To better understand the previous results, a second analysis was performed. In table 3.3,
the question usage for each linguistic level is presented. From this table, the low number of
models for the MGC context-dependent trees becomes clear. Most context-dependent models are
concentrated at the phone level, meaning that the only relevant information for spectral training
is at the phone level and possibly at the syllable level. This explains the low number of used
questions and therefore the low number of generated models. As for the pitch context-dependent
trees, many questions from higher levels were used. Interestingly, most context features usually
used in pitch stylization are at the syllable, word and phrase levels. This result has an
important consequence: if the usual pitch stylization context features were selected by HTS to
model pitch, then using pitch stylization based information for training is pointless. In
section 4.4.2, results from synthesis confirm that the use of pitch stylization in HMM-based
synthesis can indeed be disregarded.
Stream   Phone    Syllable   Word     Phrase   Utt
dur      65.0 %   16.6 %     8.6 %    4.1 %    5.7 %
f0       47.1 %   18.5 %     10.1 %   14.4 %   9.9 %
mgc      77.6 %   12.8 %     3.4 %    2.0 %    4.2 %
Table 3.3: Question usage for each linguistic level
3.3.2 Models for Sandhi Phenomena
Sandhi phenomena are acoustic phone modifications occurring at the boundaries of words
inside a sentence. To reproduce these phenomena in speech synthesis, post-lexical rules are
applied to the canonical phone sequences. As stated in section 2.4.2, observed and canonical
penta-phones were used to train the context-dependent models, in the expectation that the models
would reflect the influence of the post-lexical rules and thus enhance naturalness. To test
this, the resulting context-dependent trees were analyzed using some of the most common Sandhi
phenomena. The idea consisted of finding explicit post-lexical context questions that would
lead to different models for the same phone.
The first analyzed set of Sandhi phenomena was consonant modifications. There are
three situations in which consonant modifications occur. The first two concern word-final
/S/, which is realized as /z/ when the next word begins with a vowel (e.g.
dias antes), or as /Z/ when the next word begins with a voiced consonant (e.g. bons dias). The
third consonant modification occurs when a word-final /l/ stops being velar when
followed by a word-initial vowel (e.g. mal entendido).
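The consonant rules above can be written down as a small post-lexical rule function. This is a hedged sketch: the SAMPA-like symbol sets and the `l-nonvelar` placeholder are assumptions for illustration, not the rule inventory actually used in Dixi.

```python
# Illustrative sketch of the post-lexical (Sandhi) consonant rules:
# the realization of a word-final phone depends on the first phone of
# the following word. Symbol sets are simplified assumptions.

VOWELS = {"a", "6", "e", "E", "i", "o", "O", "u", "@"}
VOICED_CONS = {"b", "d", "g", "v", "z", "Z", "m", "n", "l", "r"}

def apply_sandhi(final_phone, next_initial):
    """Return the realized word-final phone given the next word's onset."""
    if final_phone == "S":
        if next_initial in VOWELS:
            return "z"                 # e.g. "dias antes"
        if next_initial in VOICED_CONS:
            return "Z"                 # e.g. "bons dias"
    if final_phone == "l" and next_initial in VOWELS:
        return "l-nonvelar"            # placeholder: /l/ loses velarization
    return final_phone
```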
The second analyzed set of Sandhi phenomena was vowel related. The majority of phone
modifications happen in vowels, in word-initial and word-final positions. However, only one of
the most common vowel modifications was analyzed, because of its strong context. The analyzed
vowel modification concerns the unstressed vowel [a], which would normally be realized as /6/
but, in the case of an identical vowel sequence, is realized as /a/ (e.g. visse a Antónia).
Given the strong context of this phenomenon, if the context-dependent models do not reflect any
influence in this case, then most likely they do not in the other cases either.
After careful examination of the context-dependent trees, the test results were discouraging:
they revealed no influence whatsoever from the use of canonical penta-phones. Although
canonical questions were explicitly used, they do not produce a node split that explicitly
reveals a post-lexical influence or that would lead to different models. The analysis was not
exhaustive; however, since the most common cases revealed no influence, there is no point in
analyzing all the other particular cases. Even if there is an actual influence in a particular
model, the probability of that model being used during synthesis is very small, making it
irrelevant to speech synthesis.
3.3.3 Segmentation Errors
From the analysis of the context trees, an interesting phenomenon was observed. While
walking through the question nodes, the MGC context-dependent model in table 3.4 was found.
This table presents the context at states 1 and 2 for one of the /j/ HMM models. A simple
observation reveals that every context combination produces an impossible phone sequence: in
European Portuguese, the semi-vowel /j/ cannot occur surrounded by consonants.
Phone Type   Left Context                   Center Context   Right Context
Canonical    /f/, /v/, /s/, /z/, /S/, /Z/   Vowel            Any
Observed     /d/, /g/, /z/, /Z/             /j/              Consonant
Table 3.4: Context at states 1 and 2 for one of the /j/ HMM models
This model arises from corpus segmentation errors, where some realizations of the vowel /i/
were labeled as the semi-vowel /j/. Moreover, since HMM-based techniques are usually
insensitive to small errors in the corpus, one can conclude that these errors happen at a
significant rate in the training corpus.
3.4 Conclusions
In this chapter, the process of constructing an HMM-based voice was described. Although
the training process is time consuming, the benefits of this technique are clear. One of the
most important results was the adaptability of the system to language patterns, specifically to
pitch patterns. The high number of models generated for pitch suggests a speech synthesis
system with very natural intonation. Another important result is the high compressibility of an
HMM-based voice. The ability to produce voices four hundred times smaller than the corpus is a
clear advantage, especially in small-footprint synthesis.
4 Speech Synthesis with Dixi TTS Engine

Contents
4.1 Dixi Text-To-Speech Engine
4.2 HMM Based Synthesis
4.3 HMM-based Waveform Generation with HTS Engine API
4.4 Results
4.5 Conclusions
HMM synthesis was the technique chosen for small-footprint synthesis. In this thesis, the HTS
waveform generation module was integrated into the Dixi TTS engine. This chapter describes the
work performed for this integration. Results concerning system performance and footprint are
also reported.
4.1 Dixi Text-To-Speech Engine
Dixi [Oliveira et al.] is a generic text-to-speech synthesis system, developed by the L2F
Laboratory at INESC-ID in the scope of the Tecnovoz project. Although it was primarily targeted
at speech synthesis for European Portuguese, its modular architecture and flexible components
allow its use for other languages. Moreover, the same synthesis framework can be used for either
concatenation-based or HMM-based speech synthesis applications. The HMM-based components were
integrated by the author in the scope of this thesis. Dixi currently runs on Windows and Linux,
and can be accessed, in both operating systems, through an API provided by a set of Dynamic
Link Libraries and Shared Objects, respectively.
The system architecture is based on a pipeline of components, interconnected by means
of intermediate buffers, as depicted in figure 4.1. Every component runs independently from all
others, loads the to-be-processed utterances from its input buffer and, subsequently, dumps them
into its output buffer. Buffers, as the name suggests, store the utterances already processed
by the previous component while the following one is still processing earlier submitted data.
Dixi's internal utterance representation follows the HRG formalism [Taylor et al.], first
developed for the Festival speech synthesis system [Black et al.]. In this formalism, linguistic
objects such as words, syllables and phonemes are represented by objects termed linguistic
items. These items exist in relation structures, which specify the relationships between the
items. A relation exists for each required linguistic type. An HRG contains all the relations
and items for an utterance.
Figure 4.1: Dixi system architecture [Oliveira et al.]
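The pipeline-of-components architecture can be sketched with threads and queues. This is a minimal illustration, not Dixi code: the stand-in "components" and the sentinel end-of-stream protocol are assumptions made for the example.

```python
import queue
import threading

# Hedged sketch of a pipeline of threaded components: each component
# pulls utterances from its input buffer and pushes the processed
# result to its output buffer. Stages and sentinel are illustrative.

SENTINEL = None  # marks the end of the utterance stream

def component(process, inbuf, outbuf):
    while True:
        utt = inbuf.get()
        if utt is SENTINEL:
            outbuf.put(SENTINEL)      # propagate shutdown downstream
            break
        outbuf.put(process(utt))

stages = [str.lower, lambda u: u.split()]      # stand-ins for real components
buffers = [queue.Queue() for _ in range(len(stages) + 1)]
threads = [threading.Thread(target=component, args=(p, buffers[i], buffers[i + 1]))
           for i, p in enumerate(stages)]
for t in threads:
    t.start()

buffers[0].put("Ola Mundo")
buffers[0].put(SENTINEL)
results = []
while (item := buffers[-1].get()) is not SENTINEL:
    results.append(item)
for t in threads:
    t.join()
```

Because every stage runs concurrently, a downstream component can start on one utterance while upstream components are still producing the next, which is the buffering behavior described in the text.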
Dixi comprises five main components: text pre-processing, part-of-speech tagging,
grapheme-to-phone conversion, phonological analysis and waveform generation, as depicted in
figure 4.1.
4.2 HMM Based Synthesis
In HMM-based waveform generation, a context-dependent label sequence is obtained from
the input text by text analysis. A sentence HMM is constructed by concatenating context-
dependent HMMs according to the context-dependent label sequence. State durations are determined
so as to maximize the likelihood of the state duration densities [Yoshimura et al.]. According
to the obtained state durations, a sequence of mel-cepstral coefficients and F0 values,
including voiced/unvoiced decisions, is generated from the sentence HMM using a maximum
likelihood speech parameter generation algorithm [Tokuda et al.]. Finally, speech is synthesized
directly from the generated mel-cepstral coefficients and F0 values by the MLSA filter [Fukada
et al.; Imai]. In figure 4.2 the HMM-based waveform generation process is illustrated
[Yoshimura].
Figure 4.2: HMM Synthesis [Yoshimura]
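The maximum-likelihood state duration step has a simple closed form when the duration densities are Gaussian (as in Yoshimura's formulation): each state receives its mean duration plus a variance-weighted share of the remaining frames, d_k = m_k + ρ σ_k² with ρ = (T − Σ m_k) / Σ σ_k². The sketch below illustrates this; the means and variances are made-up values, not trained ones.

```python
# Hedged sketch of ML state duration assignment with Gaussian duration
# densities: d_k = m_k + rho * var_k, where rho is a common factor
# fixing the total utterance length. Parameters are illustrative.

def state_durations(means, variances, total_frames):
    rho = (total_frames - sum(means)) / sum(variances)
    return [m + rho * v for m, v in zip(means, variances)]

means = [4.0, 10.0, 6.0]         # mean duration per state, in frames
variances = [1.0, 4.0, 1.0]
durs = state_durations(means, variances, total_frames=26)
```

Note how the state with the largest duration variance absorbs most of the extra frames, while the total exactly matches the requested length.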
4.3 HMM-based Waveform Generation with HTS Engine API
As stated in section 4.1, the Dixi system architecture is based on a pipeline of components
interconnected by means of intermediate buffers, as depicted in figure 4.1. The goal at this
point of the thesis was to design an HMM-based waveform generation component for the Dixi
TTS system. The HTS engine API version 1.0 [Tokuda et al.] was used to design this component.
One of the drawbacks of the HTS engine is its inability to generate context-dependent labels
on-line. To overcome this problem, the feature label generator described in section 2.3.3 was
integrated into this component as a context-dependent label generator.
In figure 4.3, the HMM-based waveform generation component for the Dixi engine is presented.
The to-be-processed utterances are loaded from the input buffer, where linguistic features are
available. Each input utterance goes through the feature label generator, where context-
dependent labels are automatically generated. These labels are then used by the HTS engine to
build sentence HMMs from the HMM trees and generate the corresponding waveform, as described
in section 4.2. After processing, the generated speech is placed in the utterance and the
utterance in the component's output buffer.
Figure 4.3: Dixi Component for HMM Synthesis
The major implementation difficulty was the process of loading the resources (trees and
models) that feed the HTS engine. In the Dixi system, voices (trees and models) are separated
from the component pipeline. This separation exists as a memory optimization measure: by
separating voices from the pipeline, the same voice can feed several parallel pipelines, thus
optimizing resource usage. In the HTS engine API, models and trees are more or less integrated
in the processing part, making integration difficult. The solution to this problem was to port
some of the HTS engine API code to the Dixi system, thus separating the models and trees from
the component pipeline.
4.4 Results
Results for synthesis were obtained in two sets of tests. The first set was designed to test
the system footprint and the second to evaluate the HMM-based waveform generation technique.
Tests were conducted on an Intel® Core™ 2 CPU platform at 2.40 GHz with 4 GB of physical
memory.
4.4.1 System Footprint
The Dixi system architecture is based on a pipeline of threaded components, interconnected
by means of intermediate buffers, as depicted in figure 4.1. Testing multi-threaded
environments can be difficult: the Dixi components run independently from each other, and
consequently real-time execution measures are inconsistent. To obtain more accurate measures, a
load test was conducted. Approximately 860 sentences from a specific text domain were used. The
selected sentences are part of a human-machine interface whose main topic is a religious art
object. The test was conducted using both HMM-based waveform generation and unit selection
based waveform generation, for comparison. The same speaker was used in both techniques.
In table 4.1, the results for the system execution footprint are presented. In this table,
the unit selection technique has the best performance, being approximately twice as fast as the
HMM-based technique. However, unit selection techniques perform best with limited-domain input
text and more inconsistently with general-domain text. With general-domain input, the hit rate
for small units is much higher and synthesis is therefore slower. Under these conditions, the
Dixi unit selection based waveform generator is much slower than the HMM-based waveform
generator. HMM-based techniques have a constant performance under any type of domain, and in
terms of execution time this technique is generally more efficient than unit selection based
techniques.
Waveform Generator   Total Speech Length   Execution Time   Real-Time Speed
HTS                  48 m                  4 m 55 s         10
CLUnits              37 m                  1 m 54 s         19

Table 4.1: System execution footprint using HTS-based and Unit Selection based waveform generators.
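The real-time speeds in table 4.1 follow directly from the raw figures; the quick check below recomputes them and the resulting ratio between the two generators.

```python
# Recomputing the real-time speeds in table 4.1 from the raw figures:
# speed = synthesized speech length / execution time.

def real_time_speed(speech_minutes, exec_minutes, exec_seconds):
    return speech_minutes * 60 / (exec_minutes * 60 + exec_seconds)

hts = real_time_speed(48, 4, 55)       # ~9.8x real time
clunits = real_time_speed(37, 1, 54)   # ~19.5x real time
ratio = clunits / hts                  # unit selection is roughly 2x faster
```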
In table 4.2, the results for the system memory footprint are presented. In this table, the
HMM-based technique has the best performance, requiring only half of the memory required by the
unit selection technique. Although the memory peaks seem a little high, given the amount of
input sentences the system in fact exhibits normal behavior. The test logs showed that all 860
sentences were in the waveform generator's input buffer approximately 20 seconds after the
beginning of execution. In normal conditions, the HTS-based waveform generator allocates
between 5 and 10 MB of memory per sentence, and the unit selection waveform generator between
20 and 50 MB per sentence.
Waveform Generator   Voice Size   Pre-Synthesis Memory   Memory Peak
HTS                  3.5 MB       40 MB                  382 MB
CLUnits              1.2 GB       82 MB                  712 MB

Table 4.2: System memory footprint using HTS-based and Unit Selection based waveform generators.
4.4.2 Waveform Generation
Waveform evaluation is one of the most difficult tasks in speech synthesis. In recent years,
specifications and evaluation procedures for speech synthesis have been presented [Bonafonte et
al.; Black and Tokuda]. Usually, listening tests are performed by a selected population and the
evaluation is done using MOS scores. In this thesis, the waveforms were evaluated by listening
tests and by visual assessment of spectrograms and pitch curves.
Recorded audio prompts were selected from a test corpus and compared to their synthesized
versions. From the listening tests, the first clear difference was the vocoder-like sound of
the synthesized speech. On the other hand, the intonation of the synthesized speech is very
natural. From the visual assessment, the results showed that the synthesized intonation is well
behaved compared to the corresponding original speech intonation. Also, by comparing the
spectra it was easy to identify the similarities between formants.
In figures 4.4 and 4.5, the spectrograms and pitch curves for one of the comparison tests are
presented. Similarities and differences between the two audio realizations can be observed.
Figure 4.4: Original waveform for sentence "O de Aveiro custou-me vinte."
Figure 4.5: Synthesized waveform for sentence "O de Aveiro custou-me vinte."
4.5 Conclusions
The Dixi system architecture is based on a pipeline of threaded components, interconnected
by means of intermediate buffers. This architecture has advantages on the new multi-core CPU
platforms when processing large amounts of data. The primary goal of this architecture was to
manage the available computational resources in the best way possible, in order to achieve the
fastest possible speech synthesis. In terms of small footprint this can be a disadvantage;
however, Dixi's modularity and dynamic configuration can be tuned to meet lower footprint
requirements.

A very important result concerns the need for pitch stylization in intonation prediction.
Previous work on HMM-based synthesis for a tonal language [Chomphan and Kobayashi] shows that
the inclusion of tonal information yields good intonation results. However, in the results
obtained here, where no pitch stylization information is given, the synthesized speech
intonation is well behaved compared to the original speech intonation. Syllable- and word-level
linguistic features are fundamental in tone prediction and, as seen in section 3.3.1, many
features from these levels were selected for the pitch context-dependent trees. Therefore,
given the results, it is safe to say that the context-dependent models are well generalized.
Based on these results, the use of pitch stylization information can be disregarded (at least
for non-tonal languages) without compromising the quality of the synthesized speech.
Additionally, tone prediction algorithms rely on dynamic programming, which can lower
computational performance during synthesis.
5 Conclusions
The goal of TTS systems is to synthesize speech with natural human voice characteristics.
The increasing availability of large speech databases makes it possible to construct TTS systems
by applying statistical learning algorithms. These systems, which can be automatically trained,
can generate natural, good quality synthetic speech.

In this thesis, a small-footprint speech synthesis system using HMMs was proposed. In such
a system, grammar features and context factors play an important role in the generation of the
speech parameter sequences. To overcome language issues, configurable grammar feature and
context factor modules for on-line processing were proposed. The goal of these modules was to
provide an easy way of configuring context-dependent clustered models without modifying the
system core. An important result was the adaptability of the HTS system to language patterns,
specifically to pitch patterns. The high number of models generated for pitch produced a speech
synthesis system with very natural intonation. Another important result concerned the need for
pitch stylization in intonation prediction. The high number of models generated for pitch, and
the natural intonation of the synthesized speech, showed that the use of pitch stylization
information in context-dependent tree models can be disregarded without compromising the
quality of the synthesized speech. The Dixi system performance also showed that the proposed
architecture is well suited for small-footprint synthesis.
Bibliography
Black, A. and K. Tokuda (2005, September). The Blizzard Challenge 2005: Evaluating corpus-based speech synthesis on common datasets. In Proc. EUROSPEECH 2005, pp. 77–80. ISCA.
Black, A. W., P. Taylor, and R. Caley (1996–2002). The Festival Speech Synthesis System. Manual and source code available at http://www.cstr.ed.ac.uk/projects/festival.html.
Black, A. W., H. Zen, and K. Tokuda (2007). Statistical parametric speech synthesis. In Proc. ICASSP 2007, pp. 1229–1232.
Bonafonte, A., H. Höge, I. Kiss, A. Moreno, U. Ziegenhain, H. van den Heuvel, H.-U. Hain, X. S. Wang, and M. N. Garcia (2006, May). TC-STAR: Specifications of language resources and evaluation for speech synthesis. In LREC-2006: Fifth International Conference on Language Resources and Evaluation, Genoa, Italy, pp. 311–314.
Chevelu, J., N. Barbot, O. Boeffard, and A. Delhay (2007). Lagrangian relaxation for optimal corpus design. In Proceedings of the 6th ISCA Tutorial and Research Workshop on Speech Synthesis (SSW6), pp. 211–216. ISCA.
Chomphan, S. and T. Kobayashi (2006, August). Design of tree-based context clustering for an HMM-based Thai speech synthesis system. In Proc. of 6th ISCA Speech Synthesis Workshop, pp. 160–165.
Chomsky, N. and M. Halle (1968). The Sound Pattern of English. New York: Harper and Row.
Fukada, T., K. Tokuda, T. Kobayashi, and S. Imai (1992). An adaptive algorithm for mel-cepstral analysis of speech. In Proc. of ICASSP, Volume 1, pp. 137–140.
Hirschberg, J. and P. Prieto (1996). Training intonational phrasing rules automatically for English and Spanish text-to-speech. Speech Communication 18.
Huang, X., A. Acero, and H. Hon (2001). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall.
Imai, S. (1983). Cepstral analysis synthesis on the mel frequency scale. In Proc. of ICASSP, pp. 93–96.
Jakobson, R. and M. Halle (1956). Fundamentals of Language. The Hague: Mouton.
Johnson, D. S. (1973). Approximation algorithms for combinatorial problems. In STOC '73: Proceedings of the Fifth Annual ACM Symposium on Theory of Computing, New York, NY, USA, pp. 38–49. ACM.
Keating, P. A. (1997, October). Word-level phonetic variation in large speech corpora. In The Word as a Phonetic Unit, Berlin. Phonetics Lab, Linguistics Department, UCLA.
Masuko, T. (2002, November). HMM-Based Speech Synthesis and Its Applications. Ph.D. thesis, Tokyo Institute of Technology.
Masuko, T., K. Tokuda, T. Kobayashi, and S. Imai (1996). Speech synthesis using HMMs with dynamic features. In Proc. ICASSP-96, pp. 389–392.
Mateus, M. H., A. Andrade, M. C. Viana, and A. Villalva (1990). Fonética, Fonologia e Morfologia do Português (1st ed.). Lisboa: Universidade Aberta.
Mateus, M. H. and E. d'Andrade (2000). The Phonology of Portuguese. Oxford University Press.
Nascimento, M. F. B., J. Bettencourt, P. Marrafa, R. Ribeiro, R. Veloso, and L. Wittmann (1997). LE-PAROLE – Do corpus à modelização da informação lexical num sistema multifunção. Actas do XIII Encontro da Associação Portuguesa de Linguística. Lisboa, Portugal.
National Project TECNOVOZ number 03/165, P.
Neto, J. P. and H. Meinedo (2000). Combination of acoustic models in continuous speech recognition hybrid systems. In ICSLP 2000.
Oliveira, L. C. (1996). Síntese de Fala a Partir de Texto. Ph.D. thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa.
Oliveira, L. C., S. Paulo, L. Figueira, C. Mendes, R. Cassaca, M. do Céu Viana, and H. Moniz (2008, September). DIXI TTS System. In PROPOR 08 – XIII Encontro para o Processamento Computacional da Língua Portuguesa Escrita e Falada.
Oliveira, L. C., S. Paulo, L. Figueira, C. Mendes, A. Nunes, and J. Godinho (2008, May). Methodologies for designing and recording speech databases for corpus based synthesis. In Proceedings of the Sixth International Language Resources and Evaluation (LREC 08), Marrakech, Morocco. ELRA.
Oliver, D. and K. Szklanny (2006, May). Creation and analysis of a Polish speech database for use in unit selection synthesis. In LREC-2006: Fifth International Conference on Language Resources and Evaluation, Genoa, Italy.
Paulo, S., L. A. Figueira, C. Mendes, and L. C. Oliveira (2008, September). The INESC-ID Blizzard entry: Unsupervised voice building and synthesis.
Paulo, S. and L. C. Oliveira (2005, September). Generation of word alternative pronunciations using weighted finite state transducers. In Interspeech 2005, pp. 1157–1160. ISCA.
Paulo, S. and L. C. Oliveira (2007). MuLAS: A framework for automatically building multi-tier corpora. In Interspeech 2007.
Rabiner, L. R. (1989, February). A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, Volume 77.
Ribeiro, R. D., L. C. Oliveira, and I. Trancoso (2003, June). Using morphosyntactic information in TTS systems: Comparing strategies for European Portuguese. In PROPOR 2003 – 6th Workshop on Computational Processing of the Portuguese Language, Lecture Notes in Artificial Intelligence, pp. 143–150. Springer-Verlag, Heidelberg.
Ribeiro, R. D. F. M. (2003, March). Anotação Morfossintáctica Desambiguada do Português. Master's thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa.
Saratxaga, I., E. Navas, I. Hernaez, and I. Luengo (2006, May). Designing and recording an emotional speech database for corpus based synthesis in Basque. In LREC-2006: Fifth International Conference on Language Resources and Evaluation, Genoa, Italy.
Shinoda, K. and T. Watanabe (1996, May). Speaker adaptation with autonomous model complexity control by MDL principle. In Proc. of ICASSP, pp. 717–720.
Sjölander, K. (2003). The Snack Sound Toolkit. http://www.speech.kth.se/snack/index.html.
Taylor, P., A. W. Black, and R. Caley (2001, January). Heterogeneous relation graphs as a formalism for representing linguistic information. Speech Communication 33(1–2), 153–174.
Tokuda, K., T. Kobayashi, and S. Imai (1995). Speech parameter generation from HMM using dynamic features. In Proc. ICASSP-95, Volume 1, pp. 660–663.
Tokuda, K., T. Masuko, T. Yamada, T. Kobayashi, and S. Imai (1995). An algorithm for speech parameter generation from continuous mixture HMMs with dynamic features. In Proc. of EUROSPEECH, pp. 757–760.
Tokuda, K., H. Zen, S. Sako, J. Yamagishi, T. Masuko, and Y. Nankaku. SPTK: Speech Signal Processing Toolkit. http://sp-tk.sourceforge.net/.
Tokuda, K., H. Zen, J. Yamagishi, A. W. Black, T. Masuko, S. Sako, T. Toda, T. Nose, and K. Oura. HTS: HMM-based Speech Synthesis System. http://hts.sp.nitech.ac.jp/.
Viana, C., L. C. Oliveira, and A. I. Mata (2003). Prosodic phrasing: Machine and human evaluation. Speech Technology 6.
Vigário, M. and S. Frota (2003). The intonation of Standard and Northern European Portuguese. Journal of Portuguese Linguistics 2-2.
Weiss, C., S. Paulo, L. A. Figueira, and L. C. Oliveira (2007, August). Blizzard entry: Integrated voice building and synthesis for unit-selection TTS.
Yoshimura, T. (2002, January). Simultaneous Modeling of Phonetic and Prosodic Parameters, and Characteristic Conversion for HMM-Based Text-To-Speech Systems. Ph.D. thesis, Nagoya Institute of Technology.
Yoshimura, T., K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura (1998). Duration modeling in HMM-based speech synthesis system. In Proceedings of ICSLP, Volume 2, pp. 29–32.
Yoshimura, T., K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura (1999). Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Eurospeech 99, pp. 2347–2350.
Young, S., G. Evermann, M. Gales, T. Hain, D. Kershaw, X. A. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland (2006). The HTK Book (for HTK Version 3.4).
A Phonetic Alphabets
Vowels and glides:

  IPA   SAM-PA   Graphemes       Example
  i     i        i, í, y, e      vi
  e     e        e, ê            vê
  ɛ     E        e, é            pé
  a     a        a, á, à         pá
  ɐ     6        a               cama
  ə     @        e               de
  ɔ     O        ó, o            pó
  o     o        ô, o            avô
  u     u        ú, u            tudo
  j     j        i, e            pai
  w     w        u, o            pau
  ĩ     i~       i, í            sim
  ẽ     e~       e, ê            pente
  ɐ̃     6~       ã, a, e         branco
  õ     o~       õ, o, ô         ponte
  ũ     u~       u, ú            atum
  j̃     j~       i, e            põe
  w̃     w~       o               mão

Consonants:

  IPA   SAM-PA   Graphemes       Example
  p     p        p               pá
  b     b        b               bem
  t     t        t               tu
  d     d        d               dou
  k     k        c, k            casa
  g     g        g               gato
  f     f        f               fé
  v     v        v               vê
  s     s        s, ç, c         sol
  z     z        z, s, x         casa
  ʃ     S        ch, s, z, x     chave
  ʒ     Z        j, g, s, z, x   já
  l     l        l               lá
  ɫ     l~       l               mal
  ʎ     L        lh              valha
  m     m        m               mão
  n     n        n               não
  ɲ     J        nh              senha
  ɾ     r        r               caro
  ʁ     R        r               carro

Table A.1: Phonetic alphabet for the standard European Portuguese dialect (Oliveira, 1996).
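The SAM-PA column of Table A.1 is ASCII-safe, so a phoneme inventory like this one can also be carried in code, for example to validate that a lexicon only uses symbols from the inventory. The mapping below is a small excerpt from the table (symbol to example word), and the `unknown_symbols` helper is a hypothetical illustration, not part of the described system.

```python
# Excerpt of the SAM-PA inventory of Table A.1 (symbol -> example word).
# Hypothetical helper: flag transcription symbols outside the inventory.

EP_SAMPA_EXAMPLES = {
    "i": "vi", "e": "vê", "E": "pé", "a": "pá", "6": "cama",
    "@": "de", "O": "pó", "o": "avô", "u": "tudo",
    "p": "pá", "b": "bem", "S": "chave", "Z": "já", "J": "senha",
}

def unknown_symbols(transcription, inventory=EP_SAMPA_EXAMPLES):
    """Return the symbols of a space-separated transcription not in the inventory."""
    return [s for s in transcription.split() if s not in inventory]

unknown_symbols("S a 6 J")   # → [] (all symbols belong to the inventory)
```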
Vowels:

  IPA   Darpa    Graphemes   Example
  i     iy       ee          seek
  ɪ     ih       i           sick
  ɨ     ix       i           equipment
  ɛ     eh       e           set
  æ     ae       a           sat
  ɑ     aa       o           Bob
  ʌ     ah       u           but
  ɔ     ao       ou          bought
  ʊ     uh       oo          book
  u     uw       u           due
  ʉ     ux       u           suit
  ə     ax       e           the
  ə̥     ax-h     o           to go
  ɚ     axr      er          butter
  ɝ     er       ir          bird
  eɪ    ey       ai          bait
  aɪ    ay       uy          buy
  aʊ    aw       ow          down
  oʊ    ow       ow          show
  ɔɪ    oy       oy          boy

Consonants:

  IPA   Darpa    Graphemes   Example
  p     p        p           pan
  b     b        b           ban
  t     t        t           tan
  d     d        d           Dan
  k     k        c, k        can
  g     g        g           gander
  ʔ     q        q           (glottal stop)
  f     f        f           fan
  v     v        v           van
  θ     th       th          thing
  ð     dh       th          that
  s     s        s           seen
  z     z        z           zone
  ʃ     sh       sh          sheen
  ʒ     zh       z           azure
  h     hh       h           hope
  ɦ     hv       h           ahead
  j     y        y           you
  w     w        w           we
  tʃ    ch       ch          church
  dʒ    jh       g           gin
  m     m        m           me
  n     n        n           knee
  n̩     en       n           button
  ŋ     ng       ng          weeping
  ɾ     dx       dd          ladder
  ɾ̃     nx       n           banter
  l     l        l           long
  l̩     el       l           bottle
  ɹ     r        r           rent

Table A.2: Phonetic alphabet for the standard European English dialect (Keating, 1997).