
A UNIFIED APPROACH TO GRAPHEME-TO-PHONEME CONVERSION FOR THE PLATTOS SLOVENIAN TEXT-TO-SPEECH SYSTEM

Pages 563-603 | Published online: 09 Jul 2007

Abstract

This article presents a new unified approach to modeling grapheme-to-phoneme conversion for the PLATTOS Slovenian text-to-speech system. A cascaded structure consisting of several successive processing steps is proposed for grapheme-to-phoneme conversion. The processing of foreign words and rules for the postprocessing of phonetic transcriptions are also incorporated in the engine. The grapheme-to-phoneme conversion engine is flexible, efficient, and appropriate for multilingual text-to-speech systems. The grapheme-to-phoneme conversion process is described via the finite-state machine formalism. The engine developed for the Slovenian language can be integrated into various applications, and even more efficiently into architectures based on finite-state machine formalisms. Provided the necessary language resources are available, the presented approach can also be used for other languages.

INTRODUCTION

State-of-the-art telecommunication services in multilingual environments demand the development of multilingual text-to-speech (TTS) systems that use language-specific resources. In such systems an efficient and flexible mechanism must be developed for using these resources. A further problem is that native-language input texts for TTS systems can contain many words, or even phrases, in other languages. When not processed correctly in the TTS system, these words significantly degrade the intelligibility of the synthesized speech. Both problems are closely connected with grapheme-to-phoneme (G2P) conversion, the main topic of this article. It is therefore necessary to develop language-independent G2P engines capable of acting on various phonetic lexicons. Switching to a new language must be flexible, fast, and easy, without requiring explicit knowledge of the language's specific phonetic regularities.

Grapheme-to-phoneme conversion is one of many steps needed in a TTS system. The traditional options for G2P conversion are the use of explicit rules, lexicon look-up, or a combination of the two. In this article the latter option is followed: phonetic transcriptions are obtained by lexicon look-up whenever a phonetic lexicon is available. However, for some languages huge lexica are needed to account for linguistic exceptions, and an efficient representation of such resources is needed to meet real-time TTS constraints. The problem remains for the grapheme-to-phoneme conversion of unknown words, especially the non-native ones found in many texts.

Obtaining the correct phoneme string from any written text of a given language in an efficient and flexible way is not a trivial task for many languages. Many studies have examined how to extract grapheme-to-phoneme conversion rules automatically using various machine-learning techniques based on HMMs, neural networks (NNs), or classification and regression trees (CART) (Hain, Citation2000; Suontausta and Hakkinenen, Citation2000; Galescu and Allen, Citation2002; Gelfand et al., Citation1991; Pagel, Lenzo, and Black, Citation1998). CART trees make it possible to obtain the phonetic knowledge contained in a phonetic lexicon automatically, that is, without the need for linguistic expertise. Classification trees have already proven to be appropriate prediction models in which various features can be used.

This article presents a new unified approach to grapheme-to-phoneme conversion. The G2P engine is based on the finite-state machine (FSM) formalism. Its cascaded structure consists of several processing steps, and the modules inside the engine can be used in different orders; adding new modules at any position inside the cascade is very flexible. Modules are proposed for processing foreign words and for phonological postprocessing rules, and the language-dependent resources are separated from the language-independent G2P engine. CART models are used in the engine; they are also proposed for homograph disambiguation and syllabification, represented as weighted finite-state transducers (WFSTs). Language-specific processing steps can be considered efficiently during the construction of the G2P engine (e.g., stress prediction must be performed for the Slovenian language before grapheme-to-phoneme mapping). Some work is required to transform the training data into a format appropriate for training CART models; for this, we propose using heterogeneous relation graphs, which allow the flexible construction of complex linguistic features.

The article is organized as follows. The second section presents the general TTS system architecture, in which the language-independent core is separated from the language-dependent resources. A brief description is then given of the FSM machines used in the unified grapheme-to-phoneme conversion process, and the language-independent approach of the proposed unified G2P conversion is presented in more detail. The implementation of the unified G2P engine is then described for the Slovenian language. The article concludes with an analysis of the main benefits of the proposed unified G2P conversion approach.

THE TTS SYSTEM ARCHITECTURE

Figure 1 presents the general architecture of the PLATTOS Slovenian TTS system, with the language-dependent and language-independent parts of the TTS system separated. The language-dependent part comprises external language resources such as lexicons, corpora, rules, and an acoustic database. The language-independent engine consists of the following modules: tokenizer, morphology analyzer, POS tagger, grapheme-to-phoneme conversion module, symbolic and acoustic prosody modules, unit-selection module, and acoustic module. All modules in the PLATTOS TTS system's architecture are based on the finite-state machine formalism. The use of finite-state machines and classification and regression trees (CART) enables the separation of language-dependent resources and trained models from the language-independent TTS engine (Rojc, Citation2003).

FIGURE 1 The architecture of the PLATTOS TTS system, with separated TTS engine and language-dependent resources.


FINITE-STATE MACHINES (FSM) FOR GRAPHEME-TO-PHONEME CONVERSION

The basic idea of handling linguistic analysis in TTS systems using WFSTs, and Yarowsky's decision lists for homograph disambiguation implemented as WFSTs, has already been discussed by Sproat and colleagues at Bell Labs (Sproat, Citation1998). They showed that FSMs are very appropriate formalisms for representing grapheme-to-phoneme conversion engines that are both efficient and flexible. Such engines can be even more efficiently integrated into FSM-based architectures (e.g., TTS and speech recognition systems).

Finite-state machines (FSM) represent an attractive solution as a unified framework for grapheme-to-phoneme conversion, since they have the following very interesting features (Emmanuel and Schabes, Citation1997):

Optimal speed: matching a string with a deterministic finite-state machine takes time linear in the length of the input and independent of the size of the finite-state machine;

Compactness;

Large-scale optimization: a lot of efficient algorithms exist; and

Modularity.

A finite-state automaton (FSA; Kuich and Salomaa, Citation1986; Hopcroft and Ullman, Citation1979; Daciuk, Citation1998; Mohri, Citation1995) can be seen simply as a directed graph with labels on each arc. A finite-state automaton A is a 5-tuple (Σ, Q, i, F, E), where Σ is a finite set called the alphabet, Q is a finite set of states, i ∊ Q is the initial state, F ⊆ Q is the set of final states, and E ⊆ Q × (Σ ∪ {ε}) × Q is the set of edges. FSAs are closed under union, Kleene star, concatenation, intersection, and complementation; these closure properties make them very flexible and allow for natural descriptions (Emmanuel and Schabes, Citation1997).
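The 5-tuple definition above can be sketched directly in code. The following is a minimal illustrative encoding (not the article's implementation): edges are stored as a dictionary keyed by (state, symbol), which makes the automaton deterministic, and matching runs in time linear in the input length, independent of the automaton's size, as claimed in the text.

```python
# A finite-state automaton A = (Sigma, Q, i, F, E), following the 5-tuple
# definition above. A hypothetical dict-based encoding of the edge set E.

def make_fsa(alphabet, states, initial, finals, edges):
    """Bundle the five components; edges maps (state, symbol) -> next state."""
    return {"alphabet": alphabet, "states": states,
            "initial": initial, "finals": finals, "edges": edges}

def accepts(fsa, string):
    """Deterministic matching: one edge look-up per input symbol."""
    state = fsa["initial"]
    for symbol in string:
        key = (state, symbol)
        if key not in fsa["edges"]:
            return False
        state = fsa["edges"][key]
    return state in fsa["finals"]

# A toy automaton accepting exactly the strings "go" and "gori"
fsa = make_fsa(
    alphabet={"g", "o", "r", "i"},
    states={0, 1, 2, 3, 4},
    initial=0,
    finals={2, 4},
    edges={(0, "g"): 1, (1, "o"): 2, (2, "r"): 3, (3, "i"): 4},
)
```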

Finite-state transducers (FST; Mohri and Sproat, Citation1996; Mohri, Citation1994, Citation1997) can be interpreted as defining a class of graphs, a class of relations on strings, or a class of transductions on strings. In the first interpretation, an FST can be seen as an FSA in which each arc is labeled by a pair of symbols rather than by a single symbol. A finite-state transducer T is a 6-tuple (Σ1, Σ2, Q, i, F, E), where Σ1 is a finite alphabet, namely the input alphabet; Σ2 is a finite alphabet, namely the output alphabet; Q is a finite set of states; i ∊ Q is the initial state; F ⊆ Q is the set of final states; and E ⊆ Q × Σ1∗ × Σ2∗ × Q is the set of edges. As with FSAs, FSTs are powerful because of their various closure and algorithmic properties (Emmanuel and Schabes, Citation1997; Mohri, Citation1997). This article uses the following conventions when describing an FSM: final states are depicted by a bold circle, ε represents the empty string, and the initial state (labeled 0) is the leftmost state appearing in the figure.
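The arc-pair interpretation of an FST can likewise be sketched in a few lines. This is an illustrative deterministic stand-in, not the article's implementation: each edge carries an input/output symbol pair, and transduction follows arcs while collecting outputs. The grapheme-to-phoneme pair "c" → "ts" loosely reflects Slovenian orthography, but the toy transducer itself is hypothetical.

```python
# A finite-state transducer T = (Sigma1, Sigma2, Q, i, F, E): each arc is
# labeled by an input/output symbol pair. Edges map (state, input symbol)
# to (next state, output symbol).

def transduce(edges, finals, string, initial=0):
    """Follow arcs on input symbols and collect output symbols.
    Returns None if the input string is not in the transducer's domain."""
    state, output = initial, []
    for symbol in string:
        if (state, symbol) not in edges:
            return None
        state, out = edges[(state, symbol)]
        output.append(out)
    return "".join(output) if state in finals else None

# Toy grapheme-to-phoneme arcs: the letter "c" is output as "ts"
edges = {(0, "c"): (1, "ts"), (1, "a"): (2, "a"), (2, "r"): (3, "r")}
result = transduce(edges, {3}, "car")  # "tsar"
```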

UNIFIED APPROACH TO SLOVENIAN GRAPHEME-TO-PHONEME CONVERSION

FSM properties are the main reason for developing a unified Slovenian grapheme-to-phoneme conversion system based on FSM machines. Firstly, the proposed unified approach enables efficient integration of the Slovenian grapheme-to-phoneme conversion process into the PLATTOS TTS system and, secondly, it provides a clean, general-purpose mechanism for storing and accessing linguistic knowledge. The processing steps included in the proposed approach to the Slovenian grapheme-to-phoneme conversion process are depicted in Figure 2. Implementing such a solution using FSMs yields a time- and space-efficient grapheme-to-phoneme conversion, able to be used in multilingual and polyglot TTS systems. The speed of the grapheme-to-phoneme conversion process is independent of the number of entries in the lexicons and also of the rule-context length (when using automatically obtained rules).

FIGURE 2 Processing steps proposed for Slovenian grapheme-to-phoneme conversion in the PLATTOS TTS system.


The proposed FSM-based cascade module for Slovenian grapheme-to-phoneme conversion, shown in Figure 3, consists of several successive processing steps. Phonetic lexicons, represented as finite-state machines, are first used for simple lexicon look-up. When a word is a homograph, its correct phonetic transcription must be determined using the context information already obtained in previous TTS processing steps, during morphological analysis and part-of-speech (POS) tagging. For unknown words, trained CART models containing automatically obtained weighted linguistic rules are used; they are represented as WFST machines for the overall flexibility and efficiency of the G2P process. Many texts (e.g., newspapers, e-mails) also contain words in one or more foreign languages. Such words, or even phrases, must be detected in order to perform correct overall grapheme-to-phoneme conversion. In the final step, linguistic postprocessing pronunciation rules are applied to adapt the obtained phonetic transcriptions to the context of neighboring words and to the pronunciation rules of the Slovenian language. The described grapheme-to-phoneme conversion engine is appropriate for integration into the architecture of the PLATTOS TTS system; it is flexible, efficient, and open to new languages. Some ideas about handling grapheme-to-phoneme conversion in TTS systems using WFSTs, and Yarowsky's decision lists for homograph disambiguation implemented as WFSTs, have already been discussed by Sproat and colleagues at Bell Labs (Sproat, Citation1998). Nevertheless, this article proposes additional ideas for the grapheme-to-phoneme conversion of foreign words and proper names (also frequently found in Slovenian texts), homograph disambiguation, syllabification, and the FSM-based implementation of phonetic postprocessing rules. All these processing steps have an important influence on the final naturalness of synthesized speech.
In the presented approach, time- and space-efficient incremental algorithms were used for the construction of finite-state transducers representing language-specific resources (phonetic lexicons; Daciuk, Citation1998; Rojc, Citation2000). When using the FSM formalism, the language-specific order of the processing steps inside the unified G2P engine can be implemented in a flexible and efficient way (e.g., for the Slovenian language, stress prediction must be done before the grapheme-to-phoneme conversion of unknown words). The FSM architecture of the G2P engine allows for arbitrary linking of G2P modules and changing of their positions in the engine; it is therefore also open to other languages. The following subsections describe all the processing steps included in the unified G2P conversion in more detail.
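The cascaded, order-configurable behavior described above can be sketched as an ordered list of stages, each standing in for one FSM module. Everything here is an illustrative placeholder: the toy lexicon, the stand-in OOV model, and the fallback policy (the first stage that produces a transcription wins) are assumptions, not the article's actual modules.

```python
# Sketch of the G2P cascade: stages are tried in a language-specific order;
# each stage returns a transcription or None (no decision).

def lexicon_lookup(word, lexicon):
    return lexicon.get(word)

def build_cascade(stages):
    """Return a converter that applies stages in order and keeps the
    first answer -- mirroring the lexicon-first, model-fallback cascade."""
    def convert(word):
        for stage in stages:
            result = stage(word)
            if result is not None:
                return result
        return None
    return convert

lexicon = {"gora": "g O1 r a"}          # hypothetical lexicon entry
cascade = build_cascade([
    lambda w: lexicon_lookup(w, lexicon),            # phonetic lexicon stage
    lambda w: " ".join(w) if w.isalpha() else None,  # stand-in OOV model
])
```

Reordering the list (e.g., placing a stress-prediction stage before the OOV model) is the direct analogue of the language-specific module ordering described in the text.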

FIGURE 3 Proposed FSM based cascade module for grapheme-to-phoneme conversion.


Phonetic Lexicons

Common Words

In any TTS system, external language resources (e.g., phonetic lexicons) pose a problem regarding memory usage and, especially, the required look-up time. Representing phonetic lexicons as finite-state machines (FSM) is time and space efficient: memory usage is reduced considerably and access (look-up) time becomes optimal, the latter being independent of lexicon size. Figure 4 illustrates the compression achieved for 100 lexicon entries when using finite-state transducers. From the shape of the FST it can be seen that prefixes and suffixes of the entries are compressed especially well (fewer states and transitions), whereas in the middle part of the FST the compression is less efficient (more states and transitions remain after minimization).
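The size-independence of look-up time can be illustrated with a plain trie, which shares prefixes the way the front of the FST in the figure does; a true minimized FST would additionally share suffixes. The entries below are illustrative, not taken from SIflex.

```python
# Look-up time depends only on word length, not on the number of entries.
# A trie shares common prefixes; '#' marks a final state carrying the output.

def build_trie(entries):
    root = {}
    for word, phones in entries.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["#"] = phones
    return root

def lookup(trie, word):
    """One dictionary hop per letter -- linear in len(word)."""
    node = trie
    for ch in word:
        if ch not in node:
            return None
        node = node[ch]
    return node.get("#")

# Hypothetical transcriptions; note the shared prefix "gor" stored once
trie = build_trie({
    "gora": "g O1 r a",
    "gori": "g O1 r i",
    "gorivo": "g o r i1 v O",
})
```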

FIGURE 4 Phonetic lexicon SIflex, represented as FST (100 entries).


Proper Names

According to the statistical results given in Liberman and Church (Citation1992), proper names are far more numerous than ordinary words. It is also well known that the pronunciation of proper names typically cannot be accounted for by the same letter-to-sound (LTS) rules as ordinary words. Phonetic lexicons of European proper names were developed during the Onomastica European project (Onomastica, Citation1995). The results of this project could lay the basis for a truly multilingual, lexicon-based proper-name pronunciation system. Since these lexicons are huge, their representation must be time and space efficient in order to maintain the efficiency of the unified grapheme-to-phoneme conversion engine. Normally only the Onomastica lexicon for the native language would be used (e.g., Slovenian). However, since many foreign proper names are found in native texts, other Onomastica lexica can also be used in the G2P engine as part of the external resources, in order to support its polyglot nature. Long string entries are characteristic of Onomastica lexicons; additional compression is therefore achievable when such entries are broken into constituent words before the lexicon compilation process is performed.

Disambiguation of Homographs

Written text can contain so-called natural ambiguities and artificial ambiguities. Natural ambiguities consist of words that are polysemous. The Slovenian language has many homographs that can be used in various contexts and cases. A nice example is the homograph “gori” (top/mountain/burn):

Gori na gori gori (g∗O-ri na g∗O-ri go-r/i:) (It burns on the top of the mountain).

The correct pronunciations of homographs can mainly be resolved by context. POS tags can help to disambiguate a great number of them, but in some cases even POS tags are insufficient, and semantic information is also needed. The proposed G2P solution uses POS tags and word-context information in the homograph disambiguation process. In the proposed G2P engine either Yarowsky's decision lists or a CART homograph disambiguation model can be used; the latter is proposed as an alternative approach in this article. Both models have to be trained on a corpus containing homographs.

Figure 5 presents the proposed approach to the homograph data preparation step needed when training the homograph disambiguation model. First, the tokenizer defines the initial set of sentences to be included in the homograph corpus construction. The homograph detector then automatically tags each homograph in the input sentences and keeps only the sentences containing homographs. The list of homographs can be defined automatically, if such information exists in the phonetic lexicons (e.g., in the SIflex lexicon for the Slovenian language), or otherwise manually by a linguistic expert. Next follows the phase of automatic POS tagging and grapheme-to-phoneme conversion of the homographs in the corpus. In this step, homographs can have more than one assigned phonetic transcription; a linguistic expert then has to manually choose the appropriate transcription by considering the context of the homograph. The processed sentences are coded into heterogeneous relation graph (HRG) structures, which allow for an efficient and flexible feature-vector construction process (Taylor et al., Citation2001). These features are used for training, e.g., a CART model or a decision-list model, as will be shown in the implementation part of the article. Sproat stated that in the case of CART models some work is necessary on the part of the user to transform the data into a format that can be used to train the models (Sproat, Citation2001). This problem is solved when the linguistic phonological knowledge is represented by an HRG structure, since these structures enable the definition of any type of linguistic feature easily and flexibly.
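The POS-and-context features described above can be sketched as a windowed feature extractor around the homograph. The feature names, window size, padding token, and the toy sentence are all illustrative assumptions, not the article's actual feature set.

```python
# Sketch of feature construction for homograph disambiguation: POS tags and
# word context in a fixed window around the homograph occurrence.

def homograph_features(tokens, pos_tags, index, window=1):
    """Build a feature dict for the homograph at `index`; during training
    the chosen transcription would serve as the target value."""
    feats = {}
    for offset in range(-window, window + 1):
        j = index + offset
        in_range = 0 <= j < len(tokens)
        feats[f"word[{offset}]"] = tokens[j] if in_range else "<pad>"
        feats[f"pos[{offset}]"] = pos_tags[j] if in_range else "<pad>"
    return feats

# "Gori na gori gori" -- features for the second "gori" (the noun)
tokens = ["gori", "na", "gori"]
pos = ["VERB", "PREP", "NOUN"]
fv = homograph_features(tokens, pos, 2)
```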

FIGURE 5 The proposed approach for the construction of homograph training corpus for homograph disambiguation model.


Unknown Words

In many languages the orthographic transcription bears some relation to the phonetic transcription; this relation can range from fairly simple to quite complex. It is evident that people are able to pronounce unknown words fairly correctly. In TTS systems, the main motivation for a grapheme-to-phoneme conversion module is to find this relation automatically and, if possible, determine the phonetic transcriptions of unknown (out-of-vocabulary, OOV) words correctly.

Phonetic Transcription Alignment

Before constructing the grapheme-to-phoneme conversion model for OOV words, it is necessary to align the orthographic transcriptions with the corresponding phonetic transcriptions, that is, to align graphemes with phonemes. The number of letters in a word and the number of phonemes in the corresponding transcription are not necessarily equal. In general, a grapheme can map to zero, one, two, or even more phonemes, and the same is true for phonemes. The alignment is not necessarily trivial even when the orthographic and phonetic transcriptions have the same number of letters and phonemes. When not all letters in a given context have a corresponding mapping into phonemes, the empty symbol ε is used as a filler. After the alignment process, heterogeneous relation graphs (HRG) are used as suitable structures for storing and representing linguistic knowledge about the aligned entries in the phonetic lexicons. Elements of the HRG structures are lists of attribute values containing the available linguistic information (grapheme-to-phoneme mapping, syllable positions, stress position, stress type, etc.). The HRG structure is very flexible when linking various phonological features for the construction of complex feature vectors to be used for training CART models. Feature vectors are defined simply by a list of selected feature names (stress type, position in word, vowel length, vowel height, vowel frontness, consonant type, consonant voicing, etc.). When using CART models, each feature vector can be composed of a number of different features (numerical or categorical) and, of course, the target value (e.g., stress type, syllable break, or phoneme). Figure 6 presents the HRG representation of one aligned lexicon entry from the SIflex Slovenian phonetic lexicon.
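The ε-filler alignment can be sketched with edit-distance dynamic programming. This is a deliberately simplified stand-in: a real lexicon aligner would use learned grapheme/phoneme association scores rather than the flat 0/1 cost assumed here.

```python
# Minimal sketch of grapheme/phoneme alignment with an epsilon filler,
# using edit-distance DP. Cost 0 for identical symbols, 1 otherwise
# (an illustrative assumption, not a learned scoring model).

EPS = "eps"

def align(graphemes, phonemes):
    n, m = len(graphemes), len(phonemes)
    # cost[i][j]: cheapest alignment of the first i graphemes / j phonemes
    cost = [[i + j for j in range(m + 1)] for i in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i-1][j-1] + (0 if graphemes[i-1] == phonemes[j-1] else 1)
            cost[i][j] = min(sub, cost[i-1][j] + 1, cost[i][j-1] + 1)
    # backtrack, preferring pairings and padding with EPS elsewhere
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i-1][j-1] + (
                0 if graphemes[i-1] == phonemes[j-1] else 1):
            pairs.append((graphemes[i-1], phonemes[j-1])); i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i-1][j] + 1:
            pairs.append((graphemes[i-1], EPS)); i -= 1   # letter -> epsilon
        else:
            pairs.append((EPS, phonemes[j-1])); j -= 1    # epsilon -> phoneme
    return list(reversed(pairs))
```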

FIGURE 6 Heterogeneous relation graph (HRG) for one aligned lexicon entry, containing linguistic knowledge (e.g., from SIflex lexicon).


CART and WFST Models for Grapheme-to-Phoneme Conversion

CART trees can be used to extract the phonetic knowledge existing in a phonetic lexicon fully automatically, without the need for a linguistic expert. The advantage of CART trees over some other automatic training methods, such as NNs and linear regression, is that their output is more readable and often more understandable to humans, which enables manual modification when necessary. The trees constructed in this article contain yes/no questions about features and provide a probability distribution, since they predict categorical values. Feature vectors can be constructed using the phonetic lexicon; each vector contains the same number of fields, and the first field always contains the value to be predicted (the target value). An additional description must also be defined, containing the set of all possible values for each feature name. The constructed feature vectors are usually separated into training and test sets, frequently in a 9:1 ratio. Well-defined training techniques exist for constructing an optimal tree from a set of training data (Breiman et al., Citation1984).
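The feature-vector layout described above (target value first, then a fixed number of feature fields) and the 9:1 split can be sketched as follows. For simplicity this sketch assumes the letters and phonemes are already aligned one-to-one; the window size and padding symbol are illustrative choices.

```python
# Sketch of CART training-data preparation: one fixed-length vector per
# letter, target phoneme in the first field, then a letter-context window.

def letter_feature_vectors(word, phones, context=2):
    """Assumes a one-to-one letter/phoneme alignment (simplification)."""
    vectors = []
    padded = ["#"] * context + list(word) + ["#"] * context
    for i, target in enumerate(phones):
        window = padded[i:i + 2 * context + 1]
        vectors.append([target] + window)   # target value always first
    return vectors

def split_9_to_1(vectors):
    """The frequently used 9:1 training/test partition."""
    cut = len(vectors) * 9 // 10
    return vectors[:cut], vectors[cut:]

vecs = letter_feature_vectors("gora", ["g", "O", "r", "a"])
```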

Although CART trees are a size-efficient solution, it is hard to integrate them into a system whose architecture is based on FSMs. Fortunately, CART models can be compiled into the corresponding WFST models (Sproat and Riley, Citation1996) giving us time- and space-efficient representation of the CART models.

CART models can be fully described by a set of weighted rewrite rules. Kaplan and Kay (Citation1994) showed that rewrite rules of the form

φ → ψ / λ _ ρ

where φ, ψ, λ, and ρ are regular expressions, can be represented as FSTs. Here φ represents the input of the rule, ψ the output of the rule, and λ and ρ are the left and right contexts of the rewrite rule. Kaplan and Kay (Citation1994) also proposed an algorithm for compiling the rewrite rules into corresponding FSTs. This algorithm can also be extended to probabilistic or weighted rules. Weighted rewrite rules can be compiled into WFSTs by an algorithm proposed by Mohri and Sproat (Mohri and Sproat, Citation1996; Sproat and Riley, Citation1996). Each rewrite rule replaces φ with ψ in the specified context λ _ ρ, considering the calculated rule weight. When compiling CART trees into a corresponding WFST, it is necessary to be aware of two facts:

Final lists define how to rewrite special symbols in the input string,

Decisions in each node should be representable by regular expressions.
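The context-dependent rewrite mechanism can be illustrated with ordinary regular expressions, where lookbehind and lookahead play the roles of λ and ρ. This is only a stand-in for the FST compilation described in the text (it ignores weights), and the final-devoicing rule below is a hypothetical example, not one of the article's rules.

```python
# A rewrite rule phi -> psi / lambda _ rho applied via Python's re module:
# the left context becomes a lookbehind, the right context a lookahead.
import re

def apply_rewrite(string, phi, psi, left, right):
    """Replace phi by psi only when flanked by contexts left and right."""
    pattern = f"(?<={left}){phi}(?={right})"
    return re.sub(pattern, psi, string)

# Toy final-devoicing rule: b -> p / o _ #  ('#' marks the word edge)
result = apply_rewrite("bob#", "b", "p", "o", "#")  # only the final b changes
```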

The whole tree is compiled in such a way that the intersection of all obtained rules is computed. Each of the meta labels used in λ and ρ represents pairs of symbols (sets). Therefore, each tree represents a system of weighted rewrite rules. The rule for leaf L in the tree is defined by:

φ_L → ψ_L / λ_p _ ρ_p, for each branch p on the path P_L

and the whole tree is obtained after performing an intersection of all WFST rules in the CART tree model:

T = ∩_L (φ_L → ψ_L / λ_p _ ρ_p)

φ_L represents the input symbol for the tree model, ψ_L is the output expression defined in the leaf L, P_L represents the path from the root to the leaf L, p is an individual branch on this path, and λ_p and ρ_p are the left and right contexts for branch p for the corresponding question. Figure 7 shows the procedure for compiling the trained decision tree models used in the unified grapheme-to-phoneme conversion into corresponding WFST models. In the first step all questions in the trees are described by regular expressions (using an automatic procedure). Then follows the construction of rewrite rules for all tree leaves, including the automatic definition of the left and right contexts along the paths to the leaves. The rewrite rules are compiled into WFST machines and combined into the final WFST model using the intersection operation.

FIGURE 7 Procedure for compiling decision tree models into corresponding WFST models.


Word Stress Prediction

The importance of word-stress information differs among languages. In some languages (e.g., Slovenian) the correct prediction of stress is vital for correct mapping in the subsequent grapheme-to-phoneme conversion procedure. In English, word stress can be interpreted differently, usually depending on the syntactic stress, and in some morphological derivations it can even change position. In most languages only vowels can be stressed. Predicting word stress for each vowel cannot be accurate enough when using only the letter context; it is often better to construct two separate models, one for predicting word stress and the other for grapheme-to-phoneme conversion. HRG structures can be used for storing and representing linguistic knowledge about the phonetic lexicons' entries, including stress type and position information, and they make the construction of complex feature vectors very flexible when linking various phonological features. The feature vectors can then be used for training a CART word-stress prediction model, which can be compiled into a WFST machine as an optimized representation of the model.

Grapheme-to-Phoneme Conversion of Out-of-Vocabulary (OOV) Words

In the on-line unified grapheme-to-phoneme conversion model, the stress position and type should already be known or predicted by the time the G2P conversion step is reached. Stress information can therefore be included as additional linguistic information in the HRG structures containing linguistic knowledge about the phonetic lexicons' entries. In the next step, the HRG structures and the list of selected feature names are used for the construction of linguistic feature vectors. After training the CART grapheme-to-phoneme conversion model, it can be compiled into a time- and space-efficient representation as a WFST machine.

Syllabification

Information about syllable-break positions is needed at higher levels of TTS processing. In many languages the pronunciation of phonemes is a function of the corresponding syllable-break positions. A phoneme's position in the syllable also has an important impact on phoneme durations and represents important information for subsequent modules in the TTS system that predict speech-segment duration. Syllabification can be performed using, e.g., a declarative grammar describing possible syllable-break positions in polysyllabic words. A very simple description of this kind states that each word inside a sentence is composed of one or more syllables of the structure C∗VC∗: each syllable consists of an obligatory nucleus, marked V, optionally preceded and followed by any number of consonants, marked C. Using rules, it is possible to define the positions where a syllable break must be inserted in the input string. According to the defined syllable structure, several positions may be possible for each syllable break, but usually the first possible one is chosen. In finite-state machine formalism, this can be achieved by assigning a higher weight to the last consonant of each syllable. Such an approach to syllabification is also called “the principle of maximal onset” (Sproat, Citation1998). The grammar can, for example, end each non-final syllable in a word with a corresponding symbol marking the syllable-break position. Later sections present two approaches to automatic syllabification. The first, proposed by Kiraz and Mobius (Citation1998), is based on a statistical analysis performed on a phonetic lexicon containing syllable markers. The other, based on a trained CART model, is proposed in this article and described in more detail later. In both approaches the syllabification models can be represented as WFSTs; they are therefore both appropriate for integration into the unified grapheme-to-phoneme conversion model and represent time- and space-efficient solutions.
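The C∗VC∗ template with maximal onset can be sketched procedurally: every consonant run between two nuclei is attached to the following syllable. This is a simplified stand-in for the weighted-FSM implementation described above; the vowel inventory is an illustrative assumption (Slovenian SAMPA vowels would differ), and real phonotactics would limit which clusters may form an onset.

```python
# Maximal-onset syllabification over a C*VC* template: each syllable has an
# obligatory vowel nucleus, and intervocalic consonants all join the onset
# of the following syllable.

VOWELS = set("aeiou")  # illustrative inventory

def syllabify(phones):
    """Split a phoneme list into syllables under maximal onset."""
    syllables, current = [], []
    seen_vowel = False
    for p in phones:
        if p in VOWELS and seen_vowel:
            # new nucleus: pull the preceding consonant cluster into its onset
            onset = []
            while current and current[-1] not in VOWELS:
                onset.insert(0, current.pop())
            syllables.append(current)
            current = onset
        current.append(p)
        seen_vowel = seen_vowel or p in VOWELS
    syllables.append(current)
    return ["".join(s) for s in syllables]
```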

Foreign Words

People who have mastered several languages are rare. They use their knowledge about a language whenever necessary and are able to decide when to switch to another language. They are also capable of deciding when to perform total assimilation into the primary language, or to assimilate only the prosody and use the corresponding foreign pronunciation.

One of the most important features of a TTS system would therefore be the capability of synthesizing speech from sentences containing words and phrases in different languages. One possibility is to use different speech databases for the synthesis of such sentences; such an approach is, however, unacceptable in many respects. A real polyglot TTS system is also expected to be capable of synthesizing speech using the same voice in sentences where words or word phrases from one, or even more, foreign languages are present. Newspaper texts are especially problematic, as they contain many phrases and proper names in non-primary languages. The main problems in the polyglot approach are the common treatment of foreign-language phrases inside one environment (use of different dictionaries), text analysis, language detection in the input text, and the use of extensive polyglot pronunciation lexicons. The method proposed in this article consists of the following main steps. First, foreign words or phrases are detected; the detector can be built from a list of foreign words constructed from large text corpora available for the given language, and its representation as a finite-state automaton then yields a time- and space-efficient detector of foreign words in input sentences. After detection, grapheme-to-phoneme conversion is performed as defined for the given language. Finally, the foreign phonemes are mapped into the primary-language phonemes, as seen in Figure 8.
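The detect-convert-map pipeline above can be sketched as follows. The word list, the phoneme mapping table, and both stand-in G2P functions are hypothetical placeholders for the FSA detector and FST converters described in the text.

```python
# Sketch of foreign-word handling: detect against a word list, convert with
# the foreign G2P, then map foreign phonemes onto native ones.

FOREIGN_WORDS = {"weekend", "software"}          # stand-in detector list
PHONEME_MAP = {"w": "v", "T": "t", "D": "d"}     # foreign -> native phonemes

def map_to_native(phones):
    """Assimilate into the primary language's phoneme inventory."""
    return [PHONEME_MAP.get(p, p) for p in phones]

def convert_token(word, native_g2p, foreign_g2p):
    if word in FOREIGN_WORDS:
        return map_to_native(foreign_g2p(word))
    return native_g2p(word)

native = lambda w: list(w)   # trivial stand-in for the native G2P
foreign = lambda w: ["w", "i:", "k", "E", "n", "d"] if w == "weekend" else list(w)
result = convert_token("weekend", native, foreign)
```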

FIGURE 8 Proposed approach to foreign words' grapheme-to-phoneme conversion.


Grapheme-to-Phoneme Post-Processing

In continuous speech, the phonemes at word endings often change because they are influenced by coarticulation effects that also spread to the next word. Lexicon items usually contain phonetic transcriptions as spoken in isolation (canonical form). Postprocessing rules are therefore important for any TTS system in order to achieve naturally sounding speech: the phonetic transcriptions must be adapted following the coarticulation and phonological rules of the language. Postprocessing rules represent, of course, a language-dependent step. They can be defined manually by linguistic experts or obtained automatically, if a training corpus is available. The grapheme-to-phoneme postprocessing method used here is described in the implementation part of the article, where the postprocessing rules are defined by an expert and then represented as WFST machines. Such a representation is efficient and suitable for integration into the architecture of any TTS system.
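A cross-word postprocessing rule of the kind described above can be sketched as a function over adjacent transcriptions. The rule here, regressive devoicing of a word-final obstruent before a voiceless onset, is a hypothetical fragment chosen for illustration, not one of the article's expert-defined rules.

```python
# Sketch of an expert-style postprocessing rule applied across a word
# boundary: devoice a word-final obstruent when the next word begins
# with a voiceless obstruent. The rule inventory is illustrative.

DEVOICE = {"b": "p", "d": "t", "g": "k", "z": "s", "Z": "S"}
VOICELESS = set(DEVOICE.values())

def postprocess(words):
    """words: list of per-word phoneme lists in canonical form."""
    out = [list(w) for w in words]
    for cur, nxt in zip(out, out[1:]):
        if cur and nxt and cur[-1] in DEVOICE and nxt[0] in VOICELESS:
            cur[-1] = DEVOICE[cur[-1]]   # adapt to the following word
    return out

# Canonical transcriptions for a two-word phrase; final "d" devoices to "t"
phrase = [["g", "r", "a", "d"], ["s", "t", "o", "j", "i"]]
```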

In the previous sections, problems involving the grapheme-to-phoneme conversion process were exposed and the processing steps used in the proposed grapheme-to-phoneme conversion engine were described. The G2P engine is composed of several FSM machines, each performing a specific task in the cascaded architecture. These tasks can be performed in different orders (the order of FSMs can be language dependent). Two additional FSMs are integrated into the engine: the first for processing foreign words and the second for adapting canonical phonetic transcriptions found in specific word contexts (here, postprocessing phonological rules for a given language are used). In the engine, efficient incremental algorithms are used for the FSM representation of lexicons. The use of CART models, represented as WFST machines, is proposed for homograph disambiguation and syllabification. Implementation examples of the ideas presented in the first part of the article are given for the Slovenian language in the next sections. However, the engine can be efficiently and flexibly adapted to other languages and TTS systems, due to the FSM architecture, its features, and the corresponding mathematical operations.

IMPLEMENTATION OF A UNIFIED GRAPHEME-TO-PHONEME CONVERSION ENGINE FOR THE SLOVENIAN LANGUAGE

This section presents the implementation of the proposed unified grapheme-to-phoneme conversion engine for the Slovenian language.

Efficient Representation of Phonetic Lexicons

Phonetic and morphology lexicons are necessary prerequisites for any TTS system. Their efficient utilization is crucial, particularly in real-time applications. The use of FSM formalism is, therefore, especially attractive for the efficient representation of lexicons. The following presents the compilation results for phonetic lexicons of the Slovenian language, when implemented as FST machines. The SIflex phonetic lexicon is one of the SIlex lexicons and was developed in parallel with the SImlex morphology lexicon (Rojc and Kacic, Citation2000). In the FST construction 68,817 entries were used. The phonetic symbols used are SAMPA (Speech Assessment Methods Phonetic Alphabet) symbols for the Slovenian language. The compilation results using the proprietary FSMHal lexicon compiler (Rojc, Citation2003) are presented in Table . The final size of the FST is only 197 kB. Optimal access (look-up) time is obtained using a determinization operation. In addition, the SIplex phonetic lexicon of proper names for the Slovenian language (237,657 entries) was represented as an FST. Table presents the compilation results when using the FST representation of this lexicon.

TABLE 1 SIflex Phonetic Lexicon Implemented as FST (68,817 entries)

TABLE 2 SIplex Phonetic Lexicon for the Slovenian Language, Presented as FST (23,657 entries)

Homograph Disambiguation

The Slovenian language is rich in homographs because of its lexical variations and stress patterns. Two approaches for an efficient homograph disambiguation process that can be used in the PLATTOS TTS system are presented in more detail.

Homograph Decision List Disambiguator

Yarowsky proposed the use of decision lists for homograph disambiguation (Yarowsky, Citation1997). Decision lists are actually a limited form of decision trees, and Yarowsky showed that they are also appropriate for typical homograph disambiguation. Table shows some context examples for the Slovenian noun/verb homograph “dvigali” (elevators/lift up) as found in the homograph training corpus. Table further illustrates word relations at different context positions relative to the ambiguous word “dvigali” (within a ±k word window), which are helpful for disambiguation. These were obtained during decision-list training (Yarowsky, Citation1994). Figure shows the HRG structure containing all the linguistic information needed for training a homograph decision list. This structure includes linguistic knowledge about the homograph context found in the homograph corpus (POS tags) and was constructed as suggested in Figure .

TABLE 3 Context Examples for the Slovenian Homograph “dvigali” (elevators/lift up)

TABLE 4 Part of Constructed Decision List for Slovenian Homograph “dvigali” (elevators/lift up)

FIGURE 9 HRG structure (homograph level) for a sentence containing homograph “dvigali” (elevators/lift up).

Table 4 shows part of the constructed decision list for the word “dvigali” (elevators/lift up). The first column represents a particular piece of evidence for the possible phonetic transcriptions given in the second and third columns (p, previous word; n, next word). The list is ordered by log likelihood (the numbers given in the fourth column). The largest log-likelihood values are associated with those pieces of contextual evidence that are most strongly biased toward one of the phonetic transcriptions. The cost values, given in the last column of the table, are assigned accordingly; they are used in the WFST representation of the decision lists. The evidence given in Table can be represented by the following rewriting rules:
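The construction of such a list can be sketched as follows. The evidence strings, counts, and smoothing constant are invented for illustration; T1 and T2 stand for the two possible phonetic transcriptions.

```python
import math

# Hypothetical co-occurrence counts of context evidence with the two
# transcriptions, as would be collected from a homograph training corpus.
counts = {
    "p:Vmps-pma-": {"T1": 40, "T2": 1},
    "n:Ncnpi":     {"T1": 2,  "T2": 30},
    "p:ki":        {"T1": 5,  "T2": 4},
}

def decision_list(counts, alpha=0.5):
    """Order evidence by |log P(T1|e)/P(T2|e)|, with add-alpha smoothing."""
    entries = []
    for ev, c in counts.items():
        p1 = c.get("T1", 0) + alpha
        p2 = c.get("T2", 0) + alpha
        ll = math.log(p1 / p2)
        entries.append((ev, "T1" if ll > 0 else "T2", abs(ll)))
    return sorted(entries, key=lambda e: -e[2])

dlist = decision_list(counts)
# Costs for the WFST representation are assigned from the list positions.
costs = {ev: pos for pos, (ev, tag, ll) in enumerate(dlist)}
```

The most strongly biased evidence ends up first in the list and therefore receives the lowest cost, matching the ordering described above.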

The word “dvigali” is assigned the phonetic transcription /dv/i:-ga-li/ if positioned in front of the POS tag Vmps-pma- and /dvi-g/a:-li/ when positioned in front of the POS tag Ncnpi. The problem is that a direct approach to decision-list compilation is too complex, even if an unrealistically small alphabet is presumed. To solve this problem, a lexical transducer L (shown in Figure ) is first constructed, containing the lexical analyses defined in Table for the homograph “dvigali/lift up.” The lexical possibilities given by the WFST (denoted L), shown in Figure , are at first equally weighted, meaning that they are equally probable. The choice of the appropriate form depends entirely on the final WFST model used for disambiguation. In the next steps, the specific WFST models needed for the final homograph disambiguation model are constructed. A WFST model for homograph disambiguation is composed of the following WFST machines: homograph tagger, environmental classifier (two transducers), disambiguator D, and filter F. All partial transducers are finally joined into one WFST homograph disambiguation model (Sproat, Citation1998).

TABLE 5 Possible Lexical Analyses for Homograph “dvigali” (elevators/lift up), Described with Transducer L

FIGURE 10 Ambiguous lattice L with different possible lexical analyses for homograph “dvigali” (elevators/lift up).

The homograph tagger must tag each appearance of the homograph in the input text (in this case “dvigali” (elevators/lift up)) using, e.g., labels H0, H1, …, Hn for n different possible transcriptions. These labels mark which of the n possible phonetic transcriptions corresponds to a given lexical analysis. After instances of a particular homograph are found, the WFST machines in Figure (a) and Figure (a) just insert the homograph tags (H0 and H1). These machines only work locally. To be able to tag all instances of a particular homograph in a given sentence, they must be extended properly; in this article, the local extension algorithm is used for this purpose (Emmanuel and Schabes, Citation1997). The obtained WFST machines in Figure (b) and Figure (b) insert label H0 after “dvigali/Vmps-pma-” and H1 after “dvigali/Ncnpi—.”2 The local extension algorithm was thus used in the construction of the homograph tagger, in order to operate globally on the input text.

FIGURE 11 Homograph tagger for “dvigali Vmps-pma-,” inserting H0 locally (a) and globally (b).

FIGURE 12 Homograph tagger for “dvigali Ncnpi—,” inserting H1 locally (a) and globally (b).

Composition of the transducer L, shown in Figure , with the constructed homograph taggers gives the result shown in Figure .

FIGURE 13 Composition result L ○ H0 ○ H1.

The context classifier is composed of two string-to-string FSTs. The first FST C1 optionally rewrites any symbol, except the end-of-word label and the labels H0, H1, H2, H3, …, with the dummy symbol Δ. The second FST C2 then recognizes context evidence as defined in the decision list, in agreement with its type, and assigns a cost equal to the evidence position in the list (last column in Table ). Some mappings performed by the context classifier C2 are shown in Table . Expressions in the third column contain the tag for the dummy symbol (Δ), the tag for the head verb (V), and the tag marking evidence in the decision list that occurs immediately to the left (L). The additional numbers 0, 1, … denote the pronunciation, so L0 tags an element that, if it occurs immediately to the left, triggers the first pronunciation of the target (H0 - dv/i:-ga-li).

TABLE 6 Mappings Performed by the Context Classifier C 2 for Homograph “dvigali” (Elevators/Lift up)

Composition of the transducer shown in Figure with the context classifiers C 1 and C 2 gives the transducer shown in Figure .

FIGURE 14 Result of the composition between lexical analysis transducer, homograph tagger, and context classifier: L ○ H0 ○ H1 ○ C1 ○ C2.

The disambiguator D must perform a set of optional rewrite rules, as given in Table . The first column contains the homograph tags (φ), the third column the tag “ok,” and the fifth column the left contexts of the rewrite rules (λ; the right context is not defined). As can be seen, the final two rules map the labels H0 and H1 in any context into the label “ok,” with arbitrarily high cost (e.g., 20 and 40).

TABLE 7 Rules for the Disambiguator D

At the end, the filter F is used to remove all paths that still contain unconverted labels H0, H1, … (those for which no corresponding rule is defined for the disambiguator D). This can be achieved by using a filter of the form .

The correct analysis (strmo dvigali Vmps-pma-/steeply lift up Vmps-pma-) is calculated and shown as a finite-state transducer in Figure , and the path with minimal cost is finally determined (Figure ). The next step is to perform a left-projection operation on the obtained WFST (Figure ) and, finally, the intersection with transducer L can be performed. The intersection:

returns as a result only the optimal analysis; all others present in the net L are removed.

FIGURE 15 Result of the composition L ○ H0 ○ H1 ○ C1 ○ C2 ○ D ○ F.

FIGURE 16 Result after the best path search algorithm BestPath[L ○ H0 ○ H1 ○ C1 ○ C2 ○ D ○ F].

FIGURE 17 Left projection π1[BestPath(L ○ H0 ○ H1 ○ C1 ○ C2 ○ D ○ F)].

Homograph CART Model Disambiguator

In this article we also propose the use of decision trees for homograph disambiguation. According to Taylor, decision trees can give better results, but they are bigger and less time and space efficient. Nevertheless, it is possible to compile the constructed trees into corresponding WFSTs and even combine them, thus obtaining an efficient representation of a homograph disambiguation CART model. Such models can be efficiently integrated into the unified G2P conversion engine for the Slovenian language.

As an example, the homograph “dvigala/lift up, elevators” can be chosen again, having two possible phonetic transcriptions, marked with tags H4 and H6. This homograph has two meanings and the following features:

dvigala d v i - g / a: - l a → H4 Ncnpa,

dvigala d v /i: - g a - l a → H6 Vmps-pma-.

In the training corpus, the homograph “dvigala” appears once as a noun and once as a verb. The CART tree can model phonetic transcriptions for the homograph “dvigala” in the various contexts found in the training material. Consider the sequence of questions in Figure , obtained when passing from the root to leaf L of the CART tree. This sequence defines one rewrite rule that must be applied in a given context on the input string. The first question asks whether the current word's POS tag is a noun (Ncnpa), and the second question asks whether the third word to the left is the word “ki/which.” The POS tags were coded according to the Multext specifications (Multext, Citation1996). All questions can be represented using regular expressions, as seen in Table .3 Each question in the figure can have a left context, a right context, both, or none. When there is no left or right context, the regular expression only contains the closure of the vocabulary Σ∗ (Sproat and Riley, Citation1996).

TABLE 8 Questions Presented by Regular Expressions

FIGURE 18 Sequence of questions from the root to leaf L.

Since the regular expressions defining the contexts λ and ρ for the rewrite rule are usually of different lengths, the context λ has to be left-padded, and the context ρ right-padded, using the closure of the vocabulary Σ∗. Regular relations are, in general, not closed under intersection, but the subset of equal-length regular relations is closed. Since the intersections of all left and all right contexts have to be performed, the obtained regular relations must be of the same length. The whole rewrite rule for the questions in Figure can then be written as:

When the WFST shown in Figure , describing the input homograph word, is found in the following left and right contexts:

FIGURE 19 Input I for homograph disambiguator WFST_T_HGRAPH.
and
then the rewrite rule (5) is applied, performing homograph-tag insertions in the given context based on probabilities obtained during training. The final result is obtained after performing the BestPath search algorithm, as shown in Figure , where the arrow indicates the insertion position in the found context.

FIGURE 20 Result after homograph disambiguation BestPath(I ○ WFST_T_HGRAPH).

The proposed homograph disambiguator, constructed from the CART tree and represented as a WFST, is interesting because it preserves the good classification results of CART trees and, at the same time, presents a time- and space-efficient solution that can be integrated efficiently into the whole G2P engine and a TTS system based on WFSTs. The BestPath search algorithm finds the path with the most probable phonetic transcription tag for the homograph in a given context, as would also be predicted by the tree model (shown in Figure for the Slovenian homograph “dvigala/lift up”).
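The compilation of one root-to-leaf question path into a context-dependent rewrite rule can be sketched with ordinary string regular expressions. This is only an illustration of the idea: ".*" stands in for the vocabulary closure Σ∗ used to pad the contexts, and the patterns below are invented, not the ones induced from the real CART tree.

```python
import re

# One rewrite rule phi -> phi tag / lambda __ rho, compiled from a
# root-to-leaf path. The patterns are illustrative: the noun analysis
# of "dvigala" is tagged H4 when "ki/which" occurs to the left.
PHI    = r"dvigala/Ncnpa"   # question 1: current word is a noun
LAMBDA = r".*\bki\b.*"      # question 2: "ki" somewhere in the left context
RHO    = r".*"              # right context unconstrained (Sigma* only)

def rewrite(sentence, tag="H4"):
    """Insert the homograph tag when both padded contexts match."""
    m = re.search(PHI, sentence)
    if m and re.fullmatch(LAMBDA, sentence[:m.start()]) and \
             re.fullmatch(RHO, sentence[m.end():]):
        return sentence[:m.end()] + " " + tag + sentence[m.end():]
    return sentence
```

In the real engine, each such rule is compiled into a WFST and weighted with the probabilities from the tree leaf, so that BestPath can pick the most probable tag.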

Unknown Words

Phonetic Transcription Alignment

The number of letters in a word and the number of phonemes in the corresponding transcription are, in general, unequal, and this holds for the Slovenian language as well. The idea of this approach is to estimate the probability of one grapheme (G) matching one phoneme (P). A DTW algorithm is then used to introduce epsilons at the positions maximizing the probability of the word's alignment path. Once the dictionary is aligned, the association probabilities can be computed again, repeating the procedure until convergence is achieved. The algorithm used is shown in Figure (Pagel et al., Citation1998).

FIGURE 21 Alignment algorithm—automatic epsilon scattering method.
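One pass of this alignment can be sketched as follows, under several simplifying assumptions: association probabilities are initialized from raw co-occurrence counts rather than the published initialization, only a single alignment pass is shown instead of iterating to convergence, the epsilon penalty is an assumed constant, the three-entry lexicon is invented, and a word is assumed to have at least as many letters as phonemes.

```python
import math
from collections import defaultdict

# Invented toy lexicon of (word, phoneme-sequence) pairs.
lexicon = [("dan", ["d", "a", "n"]),
           ("dva", ["d", "v", "a"]),
           ("vaza", ["v", "a", "z", "a"])]

def estimate_probs(lexicon):
    """Crude initial estimate: co-occurrence of grapheme g and phoneme p."""
    pair, total = defaultdict(float), defaultdict(float)
    for word, phones in lexicon:
        for g in word:
            for p in phones:
                pair[(g, p)] += 1.0
                total[g] += 1.0
    return {gp: c / total[gp[0]] for gp, c in pair.items()}

def align(word, phones, prob):
    """DP alignment inserting epsilons ("_") into the phoneme string."""
    n, m = len(word), len(phones)
    EPS_SCORE = math.log(1e-3)          # assumed fixed epsilon penalty
    best = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == -math.inf:
                continue
            if i < n and j < m:         # match grapheme with phoneme
                s = best[i][j] + math.log(prob.get((word[i], phones[j]), 1e-6))
                if s > best[i + 1][j + 1]:
                    best[i + 1][j + 1] = s
                    back[i + 1][j + 1] = (i, j, phones[j])
            if i < n:                   # grapheme maps to epsilon
                s = best[i][j] + EPS_SCORE
                if s > best[i + 1][j]:
                    best[i + 1][j] = s
                    back[i + 1][j] = (i, j, "_")
    out, i, j = [], n, m                # follow back-pointers
    while (i, j) != (0, 0):
        pi, pj, sym = back[i][j]
        out.append(sym)
        i, j = pi, pj
    return list(reversed(out))

probs = estimate_probs(lexicon)
aligned = align("dan", ["d", "a", "n"], probs)
```

A full implementation would re-run `estimate_probs` on the aligned entries and repeat until the alignments stop changing, as the algorithm in Figure 21 prescribes.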

Word Stress

The following presents the construction of a Slovenian word-stress model. Since, in the case of the Slovenian language, the type and position of stress influence grapheme-to-phoneme mappings, this model should be used in the G2P engine before grapheme-to-phoneme mapping.

Feature vectors were constructed from HRG structures containing linguistic knowledge from a phonetic lexicon (syllable position in the word, vowel length, vowel height, number of syllables from the end of the word, and part-of-speech [POS] tag). The HRG structures were constructed automatically using the SIflex lexicon. The features were separated into training and test sets using the ratio 10:1. Table presents the confusion matrix for the word-stress model. It can be seen that the problem lies in deciding when to place a stress marker on a vowel (a large number of confusions). In Table , the first row represents the actual (training) data and the first column represents the data as predicted by the trained model. For example, the stress marker “/” was predicted 554 times in positions where no stress marker was present in the input training data, and was not predicted 381 times in positions where it should have been present. Table presents the results obtained on the test corpus, using the CART model for the prediction of word stress for the Slovenian language.4 The majority of errors occur in decisions about stress/non-stress positions in the input word. Quite often there is also confusion between the stress types “/” and “∗.” On the other hand, confusions between “\” and “∗” are quite rare.

TABLE 9 Confusion Matrix for the Word Stress Module. “0” Marks No Stress, “/” Marks Long and Narrow Stressed Vowels, “\” Marks Short and Wide Stressed Vowels, and “∗” Marks Long and Wide Stressed Vowels

TABLE 10 Prediction of Word Stress for the Slovenian Language
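Reading such a confusion matrix can be sketched as follows. The two counts 554 and 381 are taken from the text above; all the remaining cell values are invented for illustration and are not the published Table 9 figures.

```python
# conf[actual][predicted]: rows are actual stress labels, columns are
# the labels predicted by the trained model. Only 554 and 381 come from
# the text; the other counts are illustrative placeholders.
conf = {
    "0":  {"0": 9000, "/": 554,  "\\": 60,  "*": 40},
    "/":  {"0": 381,  "/": 2000, "\\": 20,  "*": 90},
    "\\": {"0": 70,   "/": 30,   "\\": 800, "*": 5},
    "*":  {"0": 50,   "/": 85,   "\\": 6,   "*": 700},
}

def overall_accuracy(conf):
    """Fraction of predictions on the matrix diagonal."""
    correct = sum(conf[label][label] for label in conf)
    total = sum(sum(row.values()) for row in conf.values())
    return correct / total
```

The off-diagonal mass concentrated in the “0”/“/” cells is exactly the stress/non-stress confusion discussed above.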

The CART tree modeling stress-type prediction (e.g., for the Slovenian language) can be described with sequences of questions, as in Figure .5 This sequence from the root node to leaf L defines the weighted rewrite rule that has to be applied in the specified context of any input string.

FIGURE 22 Sequence of questions from root to the leaf L.

All questions can be represented by using regular expressions, as seen in Table .

TABLE 11 Questions Represented by Regular Expressions (Generated Automatically)

Since the regular expressions defining the contexts λ and ρ for the rewrite rule are usually of different lengths, the context λ has to be left-padded and the context ρ right-padded, using the closure of the vocabulary Σ∗. The whole rewrite rule for the questions defined in Figure can then be written as:

When the grapheme in the input word, described by the WFST in Figure , is found in the following left and right contexts:

FIGURE 23 Input I for word stress prediction model WFST_T_STRESS.
and
then the rewrite rule (6) is applied, performing mappings with probabilities for each stress type in the given context, as seen in Figure (a). The final result is obtained after performing the BestPath search algorithm, as shown in Figure (b).

FIGURE 24 Result I ○ WFST_T_STRESS (a) and of BestPath(I ○ WFST_T_STRESS) (b).

Grapheme-to-Phoneme Conversion of Out-of-Vocabulary (OOV) Words

The next step is the construction of a Slovenian grapheme-to-phoneme conversion model for OOV words. The feature vectors were constructed from HRG structures containing linguistic knowledge for the Slovenian language (syllable position in the word, vowel length, vowel height, number of syllables from the end of the word, and part-of-speech [POS] tag). The HRG structures were constructed automatically using the SIflex lexicon. The features were separated into training and test sets using the ratio 9:1. Table shows the G2P conversion results obtained on the test set. The training parameter defining the minimum number of examples that must fall in a leaf of a decision tree was varied. The size of the prediction model increases when this parameter is reduced; the final size of the model is defined by the number of questions and leaves of the trees in the constructed CART model. It turns out that modeling the training data more closely is always better here: even when this parameter value is only 1, the model is not overtrained. Therefore, the parameter value 1 was used.

TABLE 12 Grapheme-to-Phoneme Conversion (Letter-to-Phoneme) Results for the Slovenian Language

The next experiment illustrates the importance of correct stress type and position prediction for the Slovenian G2P conversion process. CART models for grapheme-to-phoneme conversion were first constructed using training-feature vectors without stress position and type information. In this case, the number of vowel confusions increased, as expected (Table ). According to the results in Table , it can be concluded that additional information about stress type and position is necessary for better G2P conversion results.

TABLE 13 Vowel Confusion Matrix Results, When No Stress Type and Position Information Is Used

TABLE 14 Grapheme-to-Phoneme Conversion (Letter-to-Phoneme) Results for the Slovenian Language, When No Stress Type and Position Information Is Used

The CART tree modeling grapheme-to-phoneme conversion (e.g., for the Slovenian language) can also be described with a WFST, by analogy with the procedure described previously. After composition of the WFST model with the input grapheme, the BestPath search algorithm finds the most probable grapheme-to-phoneme mapping, as would be predicted by the CART model (e.g., in Figure ).

FIGURE 25 Grapheme-to-phoneme mapping found by the BestPath search algorithm for the grapheme “a.”

Syllabification Based on Lexicon Statistics

Two approaches to syllabification are shown in the next two sections. Both are language independent and use weighted finite-state transducers (WFSTs), but they use different methods for calculating probabilities and defining syllable-break insertion positions.

In the first approach the WFST is constructed from data obtained during syllable-structure analysis of the phonetic lexicon's entries. This approach was proposed by Kiraz (Kiraz and Mobius, Citation1998); here the implementation results are presented for the Slovenian language, using the SIflex phonetic lexicon (Rojc and Kacic, Citation2000). In general, however, any phonetic lexicon containing syllable markers can be used. Therefore, the approach is multilingual.

Table presents the classes of Slovenian phonemes. The Slovenian language uses complex consonant clusters in the onset and coda parts of the syllables, as seen in Table . It uses up to four consonants in the onset and up to four consonants in the coda of the syllable.

TABLE 15 Classes of Slovenian Phonemes

TABLE 16 Examples of Consonant Clusters for the Slovenian Language

WFST Syllabification Transducer

A weighted finite-state transducer for syllabification, denoted T syll, must contain in its vocabulary one additional symbol, “-”, for marking syllable-break positions. A corresponding weight is assigned to each transition. The inverse form of T syll, marked T syll −1, can be used as a desyllabifier. The construction of the syllabification transducer proceeds as follows.

First, the set of all syllables in the training database is obtained. The following example shows that four syllables are obtained for the Slovenian phonetic transcription of the word “pacientov/patient's” (p a - ts i - j ∗ E n - t o w). Second, each syllable is separated into onset, nucleus, and coda (this is done automatically). The onsets, nuclei, and codas obtained from the word “pacientov/patient's” are shown in Table .

TABLE 17 Syllable Structure for the Slovenian Word “Pacientov/Patient's”

Weights for the T syll transitions are determined by inverting the frequencies of the syllable onsets, nuclei, and codas found in the SIflex phonetic lexicon. Table presents the set of Slovenian nuclei found in SIflex, with the corresponding frequencies. The onset, nucleus, and coda sets are compiled into weighted finite-state automata (WFSA) by calculating the disjunctions of their elements. In this way three automata, A o, A n, and A c, are defined that are able to recognize the onsets, nuclei, and codas. The following three equations can be defined:

TABLE 18 A Set of Slovenian Nuclei Found in the SIflex Lexicon and the Corresponding Frequencies (f)

In the equations above, the values inside the brackets correspond to the weights assigned to the preceding symbol in the expression. ξ denotes the set of onsets, ψ the set of nuclei, and ζ the set of codas.
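The statistics-gathering step behind these automata can be sketched as follows. The vowel inventory and stress-marker set are assumed, illustrative subsets of the Slovenian SAMPA symbols, and the sketch assumes every syllable contains exactly one nucleus.

```python
from collections import Counter

VOWELS = {"a", "e", "E", "i", "o", "O", "u", "@"}   # assumed nucleus symbols
STRESS = {"/", "\\", "*", "\u2217"}                 # stress markers precede the vowel

def split_syllable(syl):
    """Split one syllable's phoneme list into onset, nucleus, coda."""
    core = [p for p in syl if p not in STRESS]
    i = next(k for k, p in enumerate(core) if p in VOWELS)
    return tuple(core[:i]), (core[i],), tuple(core[i + 1:])

def onc_counts(transcriptions):
    """Count onsets, nuclei, and codas over syllabified lexicon entries."""
    onsets, nuclei, codas = Counter(), Counter(), Counter()
    for tr in transcriptions:
        for syl in tr.split(" - "):
            o, n, c = split_syllable(syl.split())
            onsets[o] += 1
            nuclei[n] += 1
            codas[c] += 1
    return onsets, nuclei, codas
```

Transition weights for T syll would then be obtained by inverting these frequencies (e.g., weight = 1/f), so that frequent onsets, nuclei, and codas yield cheaper paths.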

The construction of all the needed constituent syllable parts was performed in the previous step by the WFSAs A o, A n, and A c. Next, the syllable structure must be described; this is usually a language-dependent step. For example, for English, German, and Slovenian the syllable structure can be defined by the following regular relation:

where Opt denotes the optional operator and ε marks the empty string. Eq. (10) recognizes a syllable structure consisting optionally of an onset (automaton A o), obligatorily followed by a nucleus (automaton A n), and optionally followed by a coda (automaton A c). This is the only language-dependent expression. Other languages can use other syllable-structure descriptions and, therefore, different regular expressions. For example, in some languages, such as Arabic and Syriac (Kiraz and Mobius, Citation1998), the onset is an obligatory part of the syllable, not an optional one. In this case, the regular relation above has to be corrected so that it leaves out the optional operator on automaton A o. As stated, the automaton A syllstructure only recognizes one syllable at a time. However, the syllabification automaton, marked A syll, must recognize a sequence of syllables and must also be capable of adding the syllable-break sign after each syllable except the last. Such an automaton can be described by the following regular expression:

The automaton A syll recognizes one or more syllables found by the automaton A syllstructure. After each but the last syllable there must be a syllable marker “-”. Therefore, a WFST is needed that inserts syllable-break markers after each but the last syllable. The conversion from A syll is very simple: it is performed by calculating the identity transducer for the automaton A syllstructure, where the syllable-break marker is replaced with the pair “ε:-”. The obtained WFST for syllabification can then be defined by the equation:

The operator Id generates the identity transducer of its argument: each transition symbol a in the argument is replaced with the pair a:a. After the weighted T syll transducer is applied to a Slovenian word, a BestPath search algorithm has to be performed to find the path with the most probable syllable-break insertion positions, as seen in Figure .
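The syllable-structure relation Opt(A o) A n Opt(A c), repeated with break insertion, can be imitated with ordinary regular expressions. This is a rough sketch only: it operates on orthographic strings rather than SAMPA phone sequences, the tiny onset, nucleus, and coda inventories are invented, and a shortest-syllable-first backtracking search stands in for the weighted BestPath choice among competing break positions.

```python
import re

# Illustrative inventories, not the full Slovenian sets from SIflex.
ONSET = "(?:pr|dr|[pdtknsvmlrj])"
NUCLEUS = "[aeiou@E]"
CODA = "(?:[nsmrwl])"
SYL = f"(?:{ONSET}?{NUCLEUS}{CODA}?)"   # Opt(onset) nucleus Opt(coda)

def syllabify(word):
    """Recursive parse into syllables, preferring short syllables
    (which maximizes the onset of the following syllable)."""
    if not word:
        return []
    for end in range(1, len(word) + 1):
        if re.fullmatch(SYL, word[:end]):
            rest = syllabify(word[end:])
            if rest is not None:
                return [word[:end]] + rest
    return None    # no parse under this inventory
```

Joining the result with “-” reproduces the break insertion performed by T syll, e.g. "danes" parses as "da-nes" rather than "dan-es" because the short-first preference leaves "n" available as the next onset.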

FIGURE 26 Syllabification performed on the Slovenian word “danes/today”: BestPath(T entry ○ T syll).

Syllabification Using CART Model

The following presents an alternative to the syllabification model described above. The advantages of CART trees are that training is easy and efficient, the resulting models are relatively compact, and they are human-readable. In the following, the construction of a syllabification model for the Slovenian language using CART trees is presented. The presented approach can also be used for other languages. In the data-preparation step, feature vectors are constructed using the linguistic knowledge obtained from the SIflex lexicon and stored in HRG structures. Features such as syllable position in the word, vowel length, vowel height, number of syllables from the end of the word, and part-of-speech (POS) tag are taken into account. The training of the syllabification CART model follows, where the features are separated into training and test sets using the ratio 9:1. Table presents the confusion matrices for the Slovenian syllabification WFST and CART models. A 1 states that a syllable-break symbol must be inserted, and a 0 that there is no syllable break. It can be seen that there is more syllable/non-syllable break confusion in the case of the WFST syllabification model, which is based only on syllable-structure statistics of the Slovenian language. Table shows the results obtained using the WFST and CART models for syllable-break prediction for the Slovenian language. The obtained results clearly favor the CART model, with 98.06% against 81.45%.
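The feature-vector construction can be sketched as follows. This is a hypothetical, simplified feature set: a window of neighboring phonemes and a distance-from-end count stand in for the richer HRG-derived features (syllable position, vowel length and height, POS tag) used in the article, and the function name and keys are illustrative.

```python
# Simplified stand-in for the HRG-derived feature extraction used to
# train the syllabification CART model. "#" marks the word boundary.
def features(phones, i):
    """Feature vector for the break/no-break decision after phones[i]."""
    return {
        "cur": phones[i],
        "prev": phones[i - 1] if i > 0 else "#",
        "next": phones[i + 1] if i + 1 < len(phones) else "#",
        "dist_from_end": len(phones) - i,   # proxy for syllables-from-end
    }
```

Each vector would be paired with a 0/1 break label from the syllabified lexicon and the resulting set split 9:1 into training and test data, as described above.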

TABLE 19 Confusion Matrix for the Syllabification Models. “0” Marks No Syllable Break Is Present and “1” Marks a Syllable Break Is Present. (a) WFST Model and (b) CART Model

TABLE 20 Syllable Break Prediction for the Slovenian Language. (a) WFST Model and (b) CART Model

Fortunately, the CART model can also be compiled into a WFST machine while keeping the prediction accuracy achieved by the CART tree. Namely, CART tree modeling of syllabification (e.g., for the Slovenian language) can also be described with sequences of questions, as in Figure . The sequence from the root node to leaf L defines the weighted rewrite rule that has to be applied in the specified context on any input string. All questions can be represented using regular expressions. After composing the WFST tree model with the input grapheme, the BestPath search algorithm finds the most probable syllable-break insertion positions, as would be predicted by the CART model (Figure ).

FIGURE 27 A sequence of questions from root to leaf L.

FIGURE 28 Information about syllable marker position for grapheme r in given context after performing BestPath search algorithm on syllabification WFST model.

Grapheme-to-Phoneme Approach to Foreign Words

This section proposes a new approach for dealing with words in a foreign language. The first step is shown in Figure , where foreign words or phrases are detected using an FSA detector containing foreign-language words. These detectors can be constructed using large newspaper databases (e.g., the Süddeutsche Zeitung or Reuters corpora). In the second step, the unified grapheme-to-phoneme conversion process is followed for the words of each detected language, and the last step performs the mapping of foreign phonemes into the phonemes of the primary language. The mappings can be defined by a linguistic expert or derived fully automatically, using acoustic distance measures; speech databases for the other languages are needed in the latter case. The problem of words from a non-native language is illustrated by an English phrase inside a Slovenian sentence, as found in newspaper text.

FIGURE 29 Proposed polyglot approach used in the scope of the grapheme-to-phoneme conversion.

George Washington je bil ameriški predsednik. /

George Washington was an American president.

In the following figures only the words in the English language are processed, because of the size of the whole WFST and the resulting visualization problem. In Figure the English word phrase (George Washington) is first represented by a WFST. The next figure shows the WFST obtained after composition with the English grapheme-to-phoneme conversion model; at this point the corresponding graphemes are still mapped to English phonemes. The final result is obtained after using the mapping transducer in Figure (mapping English phonemes into Slovenian phonemes) and the composition operation defined on FSTs, as shown in Figure .
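The chain I ○ WFST_LEX_ANG ○ WFST_ANG2SLO can be imitated with two dictionary look-ups. This is a toy stand-in only: the phoneme symbols and the English-to-Slovenian mapping below are invented for illustration and are not the expert-defined WFST_ANG2SLO table.

```python
# Stand-in for WFST_LEX_ANG: English phonetic lexicon look-up.
LEX_ANG = {
    "george":     ["dZ", "O:", "dZ"],
    "washington": ["w", "O:", "S", "I", "N", "t", "@", "n"],
}

# Stand-in for WFST_ANG2SLO: English phoneme -> Slovenian phoneme.
ANG2SLO = {
    "dZ": "dZ", "O:": "O", "w": "v", "S": "S",
    "I": "i", "N": "n", "t": "t", "@": "@", "n": "n",
}

def transcribe(phrase):
    """Compose lexicon look-up with the phoneme mapping, word by word."""
    out = []
    for word in phrase.lower().split():
        out.extend(ANG2SLO.get(p, p) for p in LEX_ANG[word])
    return out
```

In the real engine both steps are transducers, so the whole chain is computed with a single composition followed by a best-path search.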

FIGURE 30 Input I - “George Washington.”

FIGURE 31 After performing English grapheme-to-phoneme conversion by phonetic lexicon transducer WFST_LEX_ANG.

FIGURE 32 Mapping English phonemes into Slovenian phonemes by using the phoneme mapping transducer WFST_ANG2SLO, defined by the expert.

FIGURE 33 Final grapheme-to-phoneme conversion after composition of input transducer I, English grapheme-to-phoneme conversion transducer WFST_LEX_ANG and phoneme mapping transducer WFST_ANG2SLO: I ○ WFST_LEX_ANG ○ WFST_ANG2SLO.
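The composition chain I ○ WFST_LEX_ANG ○ WFST_ANG2SLO can be imitated with ordinary mappings. In this sketch each transducer is modelled as a plain Python dictionary; the English pronunciations and the phoneme-mapping table are illustrative assumptions, not the paper's actual lexicon or expert-defined mapping.

```python
# Sketch of steps 2-3: English G2P lookup composed with the
# English-to-Slovenian phoneme mapping. Both tables are assumed,
# simplified stand-ins for WFST_LEX_ANG and WFST_ANG2SLO.

LEX_ANG = {                      # hypothetical English pronunciations
    "george": ["dZ", "O:", "dZ"],
    "washington": ["w", "O", "S", "I", "N", "t", "@", "n"],
}

ANG2SLO = {                      # hypothetical phoneme-mapping table
    "dZ": "dZ", "O:": "O", "w": "U", "O": "O", "S": "S",
    "I": "i", "N": "n", "t": "t", "@": "@", "n": "n",
}

def transcribe(word):
    """Compose lexicon lookup with the phoneme-mapping 'transducer'."""
    return [ANG2SLO.get(p, p) for p in LEX_ANG[word.lower()]]
```

Applying the two lookups in sequence mirrors the composition operation on the real transducers, where each intermediate WFST carries the English phonemes before the mapping step rewrites them into the Slovenian inventory.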

Grapheme-to-Phoneme Postprocessing Rules

This article presents an approach for the postprocessing of phonetic transcriptions for the Slovenian language. The presented approach is, however, also applicable to other languages, and a different number of postprocessing rules can be defined for each language. They must be applied after the grapheme-to-phoneme conversion process in order to achieve better synthesized-speech quality.

To generate natural-sounding sentences for the Slovenian language, it is very important to efficiently incorporate the following two phonological rules into the postprocessing module of the unified grapheme-to-phoneme conversion engine:

First rule: When the current word ends with a voiced consonant and precedes a word that begins with one of the vowels (a, /a:, \a, …), this voiced consonant must be replaced with its corresponding unvoiced counterpart (Figure 34a).

Second rule: When a preposition (Sps∗, POS tag according to the Multext specifications) ends with a voiced consonant, it must not be changed when the next word starts with a vowel (Figure 34b).

FIGURE 34 Result obtained after using FST on input phonetic transcription “d r/a: g # /a: U - t O” (expensive # car) (a) and “O d/Sps∗ # O - tS /e: - t a” (from the # father) (b).

Both rules can be compiled into FSTs fully automatically after being rewritten as rewrite rules using regular-expression syntax. The FST in Figure 35 is obtained after composition. However, an FST like the one constructed in Figure 35 works only locally, meaning it is applied only once to the input text (phonetic transcription). A local-extension algorithm must therefore be performed in a second step so that the rules apply globally to any input text. In Figure 34 the phonetic transcriptions “d r /a: g # /a: U - t O” (expensive car) and “O d/Sps∗ # O - tS /e: - t a” (from the father) are adapted into “d r /a: k # /a: U - t O” and “O d/Sps∗ # O - tS /e: - t a,” as defined by the phonological rules shown in Figure 35. In the second example there is no change because the second phonological rule applies: the word is a preposition whose following word starts with a vowel.

FIGURE 35 Finite-state transducer (FST) for rules 1 and 2: FST = ρε(min(det(rule_1 ∪ rule_2))).
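The effect of the two rules, applied globally, can be sketched without the FST machinery. The paper compiles the rules into an FST and uses a local-extension algorithm so the machine fires at every position; here a loop over the word-boundary symbol ‘#’ plays the same role. The voiced/unvoiced pairs and vowel set are partial, assumed tables for illustration.

```python
# Sketch of rule 1 with the rule-2 exemption, applied at every word
# boundary of a tokenized phonetic transcription. The consonant pairs
# and vowel inventory below are assumed, partial tables.

VOICED_TO_UNVOICED = {"g": "k", "d": "t", "b": "p", "z": "s", "Z": "S"}
VOWELS = {"a", "e", "i", "o", "u", "O", "E", "U"}

def devoice_final(phones):
    out = list(phones)
    for i, p in enumerate(out):
        if p == "#" and 0 < i < len(out) - 1:
            prev = out[i - 1]
            nxt = out[i + 1].lstrip("/\\")  # drop stress marks
            # Rule 2: a preposition carries its POS tag (e.g., "d/Sps*"),
            # so it never matches the bare-consonant keys and stays intact.
            if prev in VOICED_TO_UNVOICED and nxt and nxt[0] in VOWELS:
                out[i - 1] = VOICED_TO_UNVOICED[prev]
    return out
```

On the paper's first example this turns “d r /a: g # /a: U - t O” into “d r /a: k # /a: U - t O,” while the tagged preposition in the second example passes through unchanged.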

In this section all the processing steps needed in the G2P conversion were defined in the form of WFST models. It is possible to join these models into one cascade that maps the input string (orthographic string) into the output string (phonetic transcription) using a composition operation, as defined on the WFSTs. It is also possible to insert additional models (transducers) when additional processing steps need to be performed in the unified cascade structure. These features define the unified grapheme-to-phoneme conversion engine that is modular, flexible, and efficient.
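The cascade idea described above can be sketched as function composition: each processing step is one transducer, and the engine is their composition, into which further steps can be spliced. Here transducers are modelled as plain callables and composition as chaining; the two example stages are illustrative placeholders, not the actual PLATTOS module list.

```python
# Sketch of the cascade: compose an ordered list of processing steps
# into a single mapping from input string to output string, mirroring
# the composition operation defined on WFSTs. Stage names are assumed.

from functools import reduce

def compose(*steps):
    """Return the cascade step_n ∘ ... ∘ step_1 as a single callable."""
    return lambda x: reduce(lambda acc, f: f(acc), steps, x)

# Hypothetical stages operating on strings, for illustration only:
lowercase = str.lower
strip_punct = lambda s: "".join(ch for ch in s if ch.isalnum() or ch == " ")

pipeline = compose(lowercase, strip_punct)
```

Inserting an extra step (e.g., a syllabification or postprocessing stage) is just adding one more callable to the `compose` call, which is the flexibility the unified engine claims for its WFST cascade.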

CONCLUSION

In this article, a unified grapheme-to-phoneme conversion engine was proposed for the Slovenian TTS system PLATTOS. It is implemented using FSMs, since they can be used efficiently for linguistic processing as well as for grapheme-to-phoneme conversion. Besides the already-known processing steps involved in the grapheme-to-phoneme conversion process, we focused on integrating additional processing steps into the proposed unified G2P engine, such as the processing of foreign words and phonological postprocessing rules. The engine also follows the idea of separating language-dependent resources from the engine itself, which the FSM architecture of the engine makes flexible. The main motivation for representing CART models with WFSTs is their integration into the G2P engine and into the PLATTOS TTS system, both based on the FSM formalism. The Slovenian language is very rich in homographs; therefore, much attention was also paid to homograph disambiguation. In this area, a new approach was proposed that uses CART tree models for homograph resolution, represented by WFSTs, which can be efficiently integrated into the unified G2P engine as an additional processing step. In addition, an approach for the semi-automatic construction of the homograph corpora needed for training the model was proposed. Within the scope of the syllabification processing steps, an alternative syllabification model was proposed, in which the syllabification tree model, represented as a WFST, can be flexibly added at the end of the grapheme-to-phoneme conversion process. The unified G2P engine for the Slovenian TTS system includes some processing steps (e.g., stress type and position prediction) that are not necessarily needed for other languages. The architecture of the engine allows such processing steps to be removed flexibly, or additional ones added, when the engine is constructed.

Notes

Vmps-pma- (verb, main, participle, past, plural, masculine, active), Ncnpi (noun, common, neuter, plural, instrumental), Spsi (adposition, preposition, simple, instrumental) and Rgp (adverb, general, positive) are codes as specified in the scope of the Multext project (Multext, Citation1996). p marks previous segment feature name and corresponding value. n marks next segment feature name and corresponding value.

Vmps-pma- (verb, main, participle, past, plural, masculine, active) and Ncnpi (noun, common, neuter, plural, instrumental) are codes as specified in the scope of the Multext project (Multext, Citation1996).

The meta symbol “Omega” is an abstract symbol used to mark the whole sequence of features for a given segment (Alpha, predicted homograph tag, POS, Word, Capital Letter) as shown in Figure . p marks the previous segment feature name and corresponding value. N marks a negative question. capletter is an abstract symbol used to mark capital letters.

“/” marks acute accent, “\” marks grave accent, and “∗” marks circumflex.

The “Omega” meta symbol is an abstract symbol used to mark the whole sequence of features for a given segment (Alpha, Grapheme, Stress, Stress type) as shown in Figure . p marks the previous segment feature name and corresponding value. n marks the next segment feature name and corresponding value. N marks a negative question.

REFERENCES

  • Breiman , L. , J. Friedman , R. Olshen , and C. Stone . 1984 . Classification and Regression Trees . New York : Chapman & Hall .
  • Daciuk , J. 1998 . Incremental construction of finite-state automata and transducers and their use in the natural language processing. Ph.D. thesis , Technical University of Gdansk , Gdansk , Poland .
  • Roche , E. and Y. Schabes . 1997 . Finite-State Language Processing . Cambridge , MA : MIT Press .
  • Galescu , L. and J. F. Allen . 2002 . Pronunciation of proper names with a joint n-gram model for bi-directional grapheme-to-phoneme conversion . In Proceedings of the ICSLP , pp. 109–112, Denver , CO, USA .
  • Gelfand , S. , C. Ravishankar , and E. Delp . 1991 . An iterative growing and pruning algorithm for classification tree design . IEEE Trans. PAMI 13 ( February ): 163 – 174 .
  • Hain , H. U. 2000 . A hybrid approach for grapheme-to-phoneme conversion based on combination of partial string matching and a neural network . In Proceedings of the ICSLP , Vol. 3 , pp. 291 – 294 , Beijing , China .
  • Hopcroft , J. E. and J. D. Ullman . 1979 . Introduction to Automata Theory, Languages, and Computation . Reading , Mass .: Addison Wesley .
  • Kaplan , R. M. and M. Kay . 1994 . Regular models of phonological rule systems . Computational Linguistics 20 ( 3 ): 331 – 378 .
  • Kiraz , G. A. and B. Mobius . 1998 . Multilingual syllabification using weighted finite-state transducers . Proceedings of the Third International Workshop on Speech Synthesis , Australia .
  • Kuich , W. and A. Salomaa . 1986 . Semirings, automata, languages . In Monographs on Theoretical Computer Science . Berlin : Springer Verlag .
  • Liberman , M. J. and K. W. Church . 1992 . Text analysis and word pronunciation in text-to-speech synthesis . In Advances in Speech Signal Processing . New York : Dekker .
  • Mohri , M. 1994 . Minimization of sequential transducers . In Lecture Notes in Computer Science . pp. 151–163, Berlin: Springer Verlag .
  • Mohri , M. 1995 . On some applications of finite-state automata theory to natural language processing . Natural Language Engineering 1 : 1 – 20 .
  • Mohri , M. 1997 . Finite-state transducers in language and speech processing . Computational Linguistics 23 ( 2 ): 269 – 311 .
  • Mohri , M. and R. Sproat . 1996 . An efficient compiler for weighted rewrite rules . In 34th Meeting of the Association for Computational Linguistics (ACL 96) . Santa Cruz , California .
  • Multext . 1996 . http://www.lpl.univ-aix.fr/projects/multext
  • Onomastica . 1995 . http://www.elda.fr/catalogue/speech/s0043.html
  • Pagel , V. , K. Lenzo , and A. W. Black . 1998 . Letter to sound rules for accented lexicon compression . In Proceedings of the ICSLP , Sydney , Australia .
  • Rojc , M. 2000 . Use of finite-state machines in automatic text-to-speech synthesis systems. M.Sc. thesis , Faculty of Electrical Engineering and Computer Science, University of Maribor , Maribor , Slovenia .
  • Rojc , M. 2003 . Time and space optimal architecture of the multilingual and polyglot TTS system—architecture with finite-state machines. Ph.D. thesis , Faculty of Electrical Engineering and Computer Science, University of Maribor , Maribor , Slovenia .
  • Rojc , M. and Z. Kacic . 2000 . A computational platform for development of morphologic and phonetic lexica . In LREC 2000 , Athens , Greece .
  • Sproat , R. 1998 . Multilingual Text-to-Speech Synthesis . The Netherlands: Kluwer Academic Publishers .
  • Sproat , R. 2001. Pmtools: A pronunciation modeling toolkit. In Fourth ISCA ITRW on Speech Synthesis , Blair Atholl , Scotland.
  • Sproat , R. and M. Riley . 1996 . Compilation of weighted finite-state transducers from decision trees . In 34th Meeting of the Association for Computational Linguistics (ACL 96) , pp. 215 – 222 , Santa Cruz , California .
  • Suontausta , J. and J. Häkkinen . 2000 . Decision tree based text-to-phoneme mapping for speech recognition . In Proceedings of the ICSLP , pp. 199 – 202 , Beijing , China .
  • Taylor , P. , A. W. Black , and R. Caley . 2001 . Heterogeneous relation graphs as a formalism for representing linguistic information . Speech Communication 33 ( 1–2 ): 153 – 174 .
  • Yarowsky , D. 1994 . Decision lists for lexical ambiguity resolution . In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics , pp. 88 – 95 .
  • Yarowsky , D. 1997 . Homograph disambiguation in text-to-speech synthesis . In Progress in Speech Synthesis , eds. J. Olive , J. van Santen , R. Sproat , and J. Hirschberg . New York : Springer .
