
PHRASE AND NGRAM-BASED STATISTICAL MACHINE TRANSLATION SYSTEM COMBINATION

Pages 694-711 | Published online: 01 Oct 2009

Abstract

Multiple translations can be computed by one machine translation (MT) system or by different MT systems. We may assume that different MT systems make different errors because they use different models, generation strategies, or tweaks. One technique investigated here, inherited from automatic speech recognition (ASR), is so-called system combination, which is based on combining the outputs of multiple MT systems. We combine the outputs of a phrase-based and an Ngram-based statistical MT (SMT) system using statistical criteria and additional rescoring features.

INTRODUCTION

Combining outputs from different systems was shown to be quite successful in automatic speech recognition (ASR). Voting schemes like the ROVER approach of Fiscus (Citation1997) use edit distance alignment and time information to create confusion networks from the output of several ASR systems.

In MT, some approaches combine lattices or n-best lists from several different machine translation (MT) systems (Frederking and Nirenburg, Citation1994). To be successful, such approaches require compatible lattices and comparable scores of the (word) hypotheses in the lattices.

Bangalore, Bordel, and Riccardi (Citation2001) used the edit distance alignment extended to multiple sequences to construct a confusion network from several translation hypotheses. This algorithm produces monotone alignments only, i.e., allows insertion, deletion, and substitution of words. Jayaraman and Lavie (Citation2005) try to deal with translation hypotheses with significantly different word order. They introduce a method that allows nonmonotone alignments of words in different translation hypotheses for the same sentence.

Experiments combining several kinds of MT systems have been presented in Matusov, Ueffing, and Ney (Citation2006), but they are only based on the single best output of each system. They propose an alignment procedure that explicitly models reordering of words in the hypotheses. In contrast to existing approaches, the context of the whole document, rather than a single sentence, is considered in this iterative, unsupervised procedure, yielding a more reliable alignment.

More recently, confusion networks have been generated by choosing one hypothesis as the skeleton and aligning the other hypotheses against it. The skeleton defines the word order of the combination output. Minimum Bayes risk (MBR) was used to choose the skeleton in Sim et al. (Citation2007): the average translation edit rate (TER) score (Snover, Dorr, Schwartz, Micciulla, and Makhoul, Citation2006) was computed between each system's 1-best hypotheses, and the hypothesis with the lowest average TER was selected. This work was extended by Rosti et al. (Citation2007) by introducing system weights for word confidences.

Finally, the most straightforward approach simply selects, for each sentence, one of the provided hypotheses. The selection is based on the scores of translation, language, and other models (Nomoto, Citation2004; Doi et al., Citation2005).

The next section briefly describes the phrase-based and Ngram-based MT systems. In the Straight System Combination section, we combine the output of the two systems. Both systems are statistical and share similar features, which is why their outputs tend not to vary much. We propose a straightforward approach that simply selects, for each sentence, one of the hypotheses. In these systems, the phrase- or Ngram-based models are usually the main features in a log-linear framework, reminiscent of the maximum entropy modeling approach. Two basic issues differentiate the systems: in the Ngram-based model, the training data is sequentially segmented into bilingual units and the probability of these units is estimated as a bilingual Ngram language model, whereas in the phrase-based model no monotonicity restriction is imposed on the segmentation and the probabilities are normally estimated simply by relative frequencies.

In the subsequent section, we propose to rescore the outputs of both systems with the probability given by the other system. The central point here is that the cost of the phrase-based output is computed with the Ngram-based system and combined with the phrase-based cost of that output; for the Ngram-based outputs, the cost is computed with the phrase-based system and combined with the Ngram-based cost. Notice that we have to deal with the particular cases where a translation generated by one system cannot be computed by the other system.

PHRASE- AND NGRAM-BASED SYSTEMS

During the last few years, the use of context in SMT systems has provided great improvements in translation. Statistical MT has evolved from the original word-based approach to phrase-based translation systems (Koehn et al., Citation2003). In parallel to the phrase-based approach, the use of bilingual n-grams gives comparable results as shown by Crego, Costa-Jussà, Mariño, and Fonollosa (Citation2005). Two basic issues differentiate the n-gram-based system from the phrase-based: training data are monotonically segmented into bilingual units and the model considers n-gram probabilities rather than relative frequencies. This translation approach is described in detail by Mariño et al. (Citation2006).

Both systems follow a maximum entropy approach, in which a log-linear combination of multiple models is implemented, as an alternative to the source-channel approach. This simplifies the introduction of several additional models explaining the translation process, as the search becomes:

ê = argmax_e Σ_i λ_i h_i(f, e)

where f and e are sentences in the source and target language, respectively. The feature functions h_i are the system models and the λ_i weights are typically optimized to maximize a scoring function on a development set. Both the n-gram-based and the phrase-based system use a language model on the target language as a feature function, i.e., P(e), but they differ in the translation model. In both cases, it is based on bilingual units. A bilingual unit consists of two monolingual fragments, where each one is supposed to be the translation of its counterpart. During training, each system learns its dictionary of bilingual fragments.
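
To make the log-linear search concrete, the following minimal Python sketch (not the authors' implementation; the feature functions and weights below are illustrative placeholders) scores candidate translations as a weighted sum of feature functions and picks the argmax.

```python
def loglinear_score(f, e, feature_functions, weights):
    """Log-linear score of candidate translation e for source sentence f:
    sum_i lambda_i * h_i(f, e), with feature functions returning log-scores."""
    return sum(w * h(f, e) for w, h in zip(weights, feature_functions))

def best_translation(f, candidates, feature_functions, weights):
    """Select the candidate maximizing the log-linear score (the argmax above)."""
    return max(candidates, key=lambda e: loglinear_score(f, e, feature_functions, weights))

# Toy usage with two hypothetical features: a stand-in "language model" and a word bonus.
lm = lambda f, e: -0.5 * len(e.split())
word_bonus = lambda f, e: float(len(e.split()))
print(best_translation("la casa verde",
                       ["the house is green", "the green house"],
                       [lm, word_bonus], [1.0, 0.6]))
```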

Both SMT approaches were evaluated in the IWSLT'06 evaluation and are described in detail in Costa-Jussà et al. (Citation2006) and Crego et al. (Citation2006). Therefore, we only give a short summary in the following two sections.

Phrase-Based Translation Model

Given a sentence pair and a corresponding word alignment, a phrase (or bilingual phrase) is any pair of m source words and n target words that satisfies two basic constraints:

  1. Words are consecutive along both sides of the bilingual phrase.

  2. No word on either side of the phrase is aligned to a word out of the phrase.

Given the collected phrase pairs, we estimate the phrase translation probability distributions by relative frequency:

P(f|e) = N(f, e) / N(e)        P(e|f) = N(f, e) / N(f)

where N(f, e) means the number of times the phrase f is translated by e; N(e), the number of times the phrase e appears; and N(f), the number of times the phrase f appears. Notice that the phrase-based system has two feature functions, P(f|e) and P(e|f), which are considered translation models.
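
As an illustration of these relative-frequency estimates, the sketch below (a simplification; real systems also apply pruning and smoothing) derives P(f|e) and P(e|f) directly from counts over the extracted phrase pairs.

```python
from collections import Counter

def phrase_translation_probs(phrase_pairs):
    """Relative-frequency estimates P(f|e) = N(f,e)/N(e) and P(e|f) = N(f,e)/N(f)
    from a list of extracted (source_phrase, target_phrase) pairs."""
    n_fe = Counter(phrase_pairs)                 # N(f, e)
    n_f = Counter(f for f, _ in phrase_pairs)    # N(f)
    n_e = Counter(e for _, e in phrase_pairs)    # N(e)
    p_f_given_e = {(f, e): c / n_e[e] for (f, e), c in n_fe.items()}
    p_e_given_f = {(f, e): c / n_f[f] for (f, e), c in n_fe.items()}
    return p_f_given_e, p_e_given_f

# Toy example with four extracted pairs
pairs = [("casa", "house"), ("casa", "house"), ("casa", "home"), ("verde", "green")]
pfe, pef = phrase_translation_probs(pairs)
print(pfe[("casa", "house")])   # 1.0   -> P(casa | house)
print(pef[("casa", "house")])   # 0.666 -> P(house | casa)
```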

N-Gram-Based Translation Model

The translation model can be thought of as a language model of bilingual units (here called tuples). These tuples define a monotonic segmentation of the training sentence pairs (f, e) into K units (t_1, …, t_K).

The translation model is implemented using an n-gram language model (here with N = 4):

p(f, e) = Π_{k=1..K} p(t_k | t_{k−1}, t_{k−2}, t_{k−3})

Bilingual units (tuples) are extracted from any word-to-word alignment according to the following constraints:

a monotonic segmentation of each bilingual sentence pair is produced,

no word inside the tuple is aligned to words outside the tuple, and

no smaller tuples can be extracted without violating the previous constraints.

As a consequence of these constraints, only one segmentation is possible for a given sentence pair.
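
The following sketch illustrates one way to implement such a minimal monotone segmentation. It is not the authors' exact extraction code: in particular, it simply attaches trailing unaligned words to the last tuple rather than applying the NULL-handling rule described below.

```python
def extract_tuples(src_words, trg_words, links):
    """Minimal monotone segmentation of an aligned sentence pair into tuples.
    links: set of (i, j) pairs, i = source position, j = target position."""
    src_of_trg, trg_of_src = {}, {}
    for i, j in links:
        src_of_trg.setdefault(j, set()).add(i)
        trg_of_src.setdefault(i, set()).add(j)

    tuples = []
    s_prev, t_start, s_end = 0, 0, 0      # s_end: exclusive end of current source span
    for j in range(len(trg_words)):
        if j in src_of_trg:
            s_end = max(s_end, max(src_of_trg[j]) + 1)
        # The segment can be closed after target word j only if no source word inside it
        # is aligned to a target word beyond j (constraint 2); closing as early as
        # possible keeps the tuples minimal (constraint 3).
        closed = all(max(trg_of_src.get(i, {j})) <= j for i in range(s_prev, s_end))
        if closed and s_end > s_prev:
            tuples.append((src_words[s_prev:s_end], trg_words[t_start:j + 1]))
            s_prev, t_start = s_end, j + 1

    if s_prev < len(src_words) or t_start < len(trg_words):
        # Attach trailing unaligned words to the last tuple (a simplification; the paper
        # resolves NULL-aligned target words with a POS-entropy rule).
        head_s, head_t = tuples.pop() if tuples else ([], [])
        tuples.append((head_s + src_words[s_prev:], head_t + trg_words[t_start:]))
    return tuples

# Toy example: "la casa verde" / "the green house" with links la-the, casa-house, verde-green
print(extract_tuples(["la", "casa", "verde"], ["the", "green", "house"],
                     {(0, 0), (1, 2), (2, 1)}))
# [(['la'], ['the']), (['casa', 'verde'], ['green', 'house'])]
```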

Two important issues regarding this translation model must be considered. First, it often occurs that a large number of single-word translation probabilities are left out of the model. This happens for all words that are always embedded in tuples containing two or more words; then no translation probability for an independent occurrence of these embedded words will exist. To overcome this problem, the tuple 4-gram model is enhanced by incorporating 1-gram translation probabilities for all the embedded words detected during the tuple extraction step. These 1-gram translation probabilities are computed from the intersection of both the source-to-target and the target-to-source alignments.

The second issue has to do with the fact that some words linked to NULL end up producing tuples with NULL source sides. Since no NULL is actually expected to occur in translation inputs, this type of tuple is not allowed. Any target word that is linked to NULL is attached either to the word that precedes it or to the word that follows it. To determine this, we use the POS entropy approach of de Gispert (Citation2006).

Additional Features

Both systems share the additional features which follow:

A target language model. In the baseline system, this feature consists of a 4-gram model of words, which is trained from the target side of the bilingual corpus.

A source-to-target lexicon model. This feature, which is based on the lexical parameters of the IBM model 1, provides a complementary probability for each tuple in the translation table. These lexicon parameters are obtained from the source-to-target alignments.

A target-to-source lexicon model. Similar to the previous feature, this feature is based on the lexical parameters of the IBM model 1, but in this case, these parameters are obtained from target-to-source alignments.

A word bonus function. This feature introduces a bonus based on the number of target words contained in the partial-translation hypothesis. It is used to compensate for the system's preference for short output sentences.

A phrase bonus function. This feature is used only in the phrase-based system and it introduces a bonus based on the number of target phrases contained in the partial-translation hypothesis.

All these models are combined in the decoder. Additionally, the decoder allows for a nonmonotonic search with the following distortion model:

A word distance-based distortion model:

P(t_1^K) = exp(−Σ_k d_k)

where d_k is the distance between the first word of the kth unit and the word following the last word of the (k − 1)th unit. Distance is measured in words, referring to the units' source side.

To reduce the computational cost, we place limits on the search using two parameters: the distortion limit (the maximum distance, measured in words, that a tuple is allowed to be reordered, m) and the reordering limit (the maximum number of reordering jumps in a sentence, j). This feature is independent of the reordering approach presented in this article, so both can be used simultaneously.
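
A minimal sketch of this distortion feature with the two search limits is given below; the function and parameter names are illustrative, and hypotheses violating a limit are simply marked for pruning by returning None.

```python
def distortion_penalty(unit_spans, dist_limit=None, jump_limit=None):
    """Word-distance distortion for a hypothesis, given the source-side spans of its
    units in output order. unit_spans: list of (first_src, last_src) word indices.
    Returns the accumulated distance sum of d_k, or None if a limit is violated."""
    total, jumps, prev_last = 0, 0, -1
    for first, last in unit_spans:
        d = abs(first - (prev_last + 1))  # d_k: distance to the word after the previous unit
        if d > 0:
            jumps += 1
            if dist_limit is not None and d > dist_limit:
                return None               # jump longer than the distortion limit m
            if jump_limit is not None and jumps > jump_limit:
                return None               # more reordering jumps than the limit j
        total += d
        prev_last = last
    return total

# A monotone hypothesis has zero distortion; swapping two units accumulates distance.
print(distortion_penalty([(0, 0), (1, 2)]))   # 0
print(distortion_penalty([(1, 2), (0, 0)]))   # 4  (d_1 = 1, d_2 = 3)
```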

In order to combine the models in the decoder suitably, an optimization tool based on the Simplex algorithm is used to compute log-linear weights for each model.

STRAIGHT SYSTEM COMBINATION

Integration of the phrase-based and Ngram-based translation models in the search procedure would be a complex task. First, the translation units of the two models are quite different. Second, the fact that the Ngram-based translation model uses context while the phrase-based translation model does not poses severe implementation difficulties.

Some features that are useful for SMT are too complex to include directly in the search process. A clear example is features that require the entire target sentence to be evaluated, as this is not compatible with the pruning and recombination procedures that are necessary for keeping the target sentence generation process manageable. A possible solution to this problem is to apply sentence-level reranking using the outputs of the systems, i.e., to carry out a two-step translation process.

The aim of this preliminary system combination is to select the best translation given the 1-best output of each system (phrase- and Ngram-based) using the following feature functions:

IBM-1 lexical parameters from the source-to-target direction and from the target-to-source direction. The IBM model 1 is a word alignment model that is widely used when working with parallel corpora. Similarly to the feature function used in decoding, the IBM-1 lexical parameters are used to estimate the translation probabilities of each hypothesis in the n-best list (a sketch of this feature appears after this list).

Target language models. Given that a 4-gram language model was used in decoding, we add a 2-gram, 3-gram, and 5-gram. These models should be more useful when trained on additional monolingual data, which is easier to acquire than bilingual data; unfortunately, we were not able to add more data.

Word bonus. Given that the above models tend to shorten the translation, translations receive a bonus for each produced word.
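
The sketch below illustrates the kind of rescoring features listed above: an IBM Model 1 style lexical score over the whole hypothesis and a word bonus. The lexicon table t_table is a hypothetical placeholder for the IBM-1 parameters; the real feature set also includes the target-to-source direction and several n-gram language models.

```python
import math

def ibm1_lexical_score(src_words, hyp_words, t_table, null_token="NULL"):
    """IBM Model 1 style lexical score of a hypothesis: for each hypothesis word, average
    the lexicon probabilities over all source words (plus NULL) and sum the logs.
    t_table: dict mapping (src_word, trg_word) -> probability (hypothetical lexicon)."""
    src = [null_token] + list(src_words)
    score = 0.0
    for e in hyp_words:
        p = sum(t_table.get((f, e), 1e-10) for f in src) / len(src)
        score += math.log(p)
    return score

def word_bonus(hyp_words, bonus=1.0):
    """Word bonus feature: rewards each produced word to counteract short outputs."""
    return bonus * len(hyp_words)
```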

Task and System Description

Translation of four different language pairs is considered: Mandarin to English, Japanese to English, Arabic to English, and Italian to English, using the 2006 IWSLT data (see Table 1) and the corresponding official evaluation test set. Both the phrase-based and Ngram-based systems are described above.

TABLE 1 BTEC Chinese-English Corpus. Basic Statistics for the Training, Development and Test Datasets

The optimization tool used for computing each model weight, both in decoding and rescoring, was based on the simplex method (Nelder and Mead, Citation1965). Following the consensus strategy proposed in Chen et al. (Citation2005), the objective function was set to 100·BLEU + 4·NIST. Parameters of the baseline systems are summarized in Table 4.

TABLE 4 Phrase (Left) and Ngram (Right) Default Parameters
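
A sketch of this weight optimization is shown below, using SciPy's Nelder-Mead (downhill simplex) routine as a stand-in for the in-house tool; rerank_and_score is a hypothetical helper that reranks the development n-best lists with the candidate weights and returns the resulting BLEU and NIST scores.

```python
import numpy as np
from scipy.optimize import minimize

def optimize_weights(rerank_and_score, initial_weights):
    """Downhill simplex (Nelder-Mead) search over the log-linear weights.
    rerank_and_score(weights) -> (bleu, nist) is a hypothetical helper that reranks the
    development n-best lists with the given weights and scores the resulting 1-best."""
    def objective(weights):
        bleu, nist = rerank_and_score(weights)
        return -(100.0 * bleu + 4.0 * nist)   # minimize the negative consensus objective
    result = minimize(objective, np.asarray(initial_weights, dtype=float),
                      method="Nelder-Mead")
    return result.x
```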

Result Analysis

Table 2 shows that the improvements due to system combination are consistent on the internal test set. Moreover, note that in the combined approach a general improvement of the BLEU score is observed, whereas the NIST score seems to decrease. This behavior can be seen in almost all tasks and all test sets. Both the IBM-1 lexical parameters and the language model tend to favor shorter outputs. Although a word bonus was used, we have seen that the outputs produced by the TALPcom system are shorter than the outputs produced by the TALPphr (phrase-based) or TALPtup (Ngram-based) systems, which is why NIST did not improve.

TABLE 2 Results Obtained Using TALPphr (Phrase-Based), TALPtup (Ngram-Based System) and Their Combination

When observing results on the test set, system combination performs well in all tasks, improving by 2 to 3 BLEU points, except in the It2En task, where BLEU stays the same.

When observing the results on the evaluation set, the analysis changes. We should take into account that the 2006 evaluation set was not extracted from the BTEC corpus, which may explain the difference in behavior. The biggest improvement is in Jp2En, with a gain of 1 BLEU point, followed by It2En and then Ar2En. Finally, Zh2En is the only case where system combination decreases translation quality in terms of BLEU.

PHRASE AND NGRAM-BASED COMBINATION SCORE

We extend the method from the previous section by combining the two systems at the level of n-best lists in a two-step translation. The first step generates an n-best list using the models that can be computed at search time. In our approach, the translation candidate lists are produced independently by both systems and are then combined by concatenation (Footnote 1). During the second step, these multiple translation candidates are reranked using as an additional feature function the probability given by the opposite system. Given the phrase-based (or Ngram-based) n-best list, we compute the cost of each target sentence in this n-best list with the Ngram-based (or phrase-based) system. As a consequence, for each sentence we have the costs given by both the phrase-based and the Ngram-based systems.

However, this computation may not be possible in all cases. An example is given in Figure 1. Suppose that the three top sentences in Figure 1 are our bilingual training sentences with their corresponding word alignments. Then the phrase and tuple extraction are computed. For the sake of simplicity, the phrase length was limited to 3. Several units are common to both systems and others appear in only one system, which leads to different translation dictionaries. Finally, Figure 1 shows a test sentence and the corresponding translation from each system. Neither output can be reproduced by the opposite system because it contains target words that are not available in the other system's dictionary for the given source words (the tuple dictionary does not contain the unit translations # traducciones, and the phrase dictionary does not contain They # NULL).

FIGURE 1 Analysis of the phrase- and Ngram-based systems. From top to bottom: training sentences with the corresponding alignment (source word position hyphen target word position), unit extraction, test sentence, and the translation computed by each system using the extracted units. Words marked with ∗ are unknown to the system.


Given the unique and monotonic segmentation of the Ngram-based system, the number of phrase-based translations that can be reproduced by the Ngram-based system may be smaller than the number of Ngram-based translations that can be reproduced by the phrase-based system. Whenever a sentence cannot be reproduced by a given system, the cost of the worst sentence in the n-best list is assigned to it. This is an experimental decision by which we penalize sentences that cannot be produced by both systems.
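
The sketch below illustrates this second step under the stated convention (lower cost is better): the two n-best lists are concatenated with duplicates removed, each hypothesis is rescored with the opposite system, and hypotheses the opposite system cannot reproduce receive the worst in-list cost. The cost_pb and cost_nb callables are hypothetical stand-ins for the two systems.

```python
def combine_nbest(pb_nbest, nb_nbest, cost_pb, cost_nb, w_pb=1.0, w_nb=1.0):
    """Concatenate the phrase-based and Ngram-based n-best lists (removing duplicates)
    and rerank them by a weighted sum of both systems' costs (lower is better).
    pb_nbest / nb_nbest: lists of (hypothesis, own_cost) pairs.
    cost_nb / cost_pb: hypothetical callables giving the opposite system's cost for a
    hypothesis, or None when that system cannot reproduce it."""
    def cross_costs(nbest, other_cost):
        raw = [other_cost(hyp) for hyp, _ in nbest]
        worst = max((c for c in raw if c is not None), default=0.0)
        return [c if c is not None else worst for c in raw]   # worst-cost fallback

    scored = {}
    for (hyp, own), cross in zip(pb_nbest, cross_costs(pb_nbest, cost_nb)):
        scored.setdefault(hyp, w_pb * own + w_nb * cross)
    for (hyp, own), cross in zip(nb_nbest, cross_costs(nb_nbest, cost_pb)):
        scored.setdefault(hyp, w_nb * own + w_pb * cross)     # duplicates keep first score

    return sorted(scored, key=scored.get)                     # best (lowest cost) first
```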

Task and System Description

The translation models used in the following experiments were used by UPC and RWTH in the second evaluation campaign of the TC-STAR project; see Table 3 (Spanish-English).

TABLE 3 TC-STAR Corpus (Spanish-English Task). Basic Statistics for the Training, Development, and Test Datasets

Preprocessing

Standard tools were used for tokenizing and filtering.

Word Alignment

After preprocessing the training corpora, word-to-word alignments were performed in both directions using GIZA++ (Och and Ney, Citation2003), and the union set of both alignment directions was computed.
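
A small sketch of the union symmetrization is given below, assuming each directional alignment is represented as a set of word-position links.

```python
def union_alignment(src2trg_links, trg2src_links):
    """Symmetrize two directional word alignments by taking the union of their links.
    src2trg_links: set of (i, j) links from the source-to-target run.
    trg2src_links: set of (j, i) links from the target-to-source run (note the order)."""
    return set(src2trg_links) | {(i, j) for j, i in trg2src_links}

# Example: each direction misses one link; the union recovers both.
print(union_alignment({(0, 0), (1, 2)}, {(1, 2), (2, 1)}))
# {(0, 0), (1, 2), (2, 1)}
```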

Tuple Modeling

Tuple sets for each translation direction were extracted from the union set of alignments. The resulting tuple vocabularies were pruned, keeping the N best translations for each tuple source side (N = 20 for the Es2En direction and N = 30 for the En2Es direction) in terms of occurrences. The SRI language modeling toolkit (Stolcke, Citation2002) was used to compute the bilingual 4-gram language model. Kneser-Ney smoothing (Kneser and Ney, Citation1995) and interpolation of higher- and lower-order n-grams were used to estimate the 4-gram translation language models.

Phrase Modeling

Phrase sets for each translation direction were extracted from the union set of alignments. The resulting phrase vocabularies were pruned, keeping the N best translations for each phrase source side (N = 20 for the Es2En direction and N = 30 for the En2Es direction) in terms of occurrences. Phrases up to length 10 were considered.

Feature Functions

Lexicon models were used in the source-to-target and target-to-source directions. A word bonus, a phrase bonus (the latter only for the phrase-based system), and a target language model were added in decoding. Again, Kneser-Ney smoothing (Kneser and Ney, Citation1995) and interpolation of higher- and lower-order n-grams were used to estimate the 4-gram target language models.

Optimization

Once models were computed, optimal log-linear coefficients were estimated for each translation direction and system configuration using an in-house implementation of the widely used downhill Simplex method (Nelder and Mead, Citation1965). The BLEU score was used as the objective function.

Decoding

The decoder was set to perform histogram pruning, keeping the best b = 50 hypotheses (during optimization, histogram pruning was set to keep the best b = 10 hypotheses).
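
Histogram pruning itself amounts to keeping only the b best entries of each decoder stack, as in the following sketch (lower cost is better; the stack representation is illustrative).

```python
def histogram_prune(stack, b=50):
    """Keep only the b lowest-cost hypotheses in a decoder stack.
    stack: list of (hypothesis, cost) pairs; b = 50 for decoding, b = 10 during optimization."""
    return sorted(stack, key=lambda item: item[1])[:b]
```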

Parameters of the baseline systems are defined in Table 4.

Result Analysis

Table 5 shows the results of the rescoring and system combination experiments on the development set. λ_pb is the weight given to the cost computed with the phrase-based system and λ_nb is the weight given to the cost computed with the Ngram-based system.

TABLE 5 Results of Rescoring and System Combination on the Development Set. PB-Phrase-Based, NB-Ngram-Based

For each translation direction, the first six rows in Table 5 show the results of the non-rescored systems and of PB (NB) rescored by NB (PB). The last three rows correspond to the system combination, where the PB (NB) outputs are concatenated with the NB (PB) outputs and ranked by their rescored score. The weight given to the Ngram score tends to be higher than the weight given to the phrase score (see λ_pb and λ_nb in Table 5). Moreover, the gain from rescoring is always higher for the phrase-based system, and this improvement grows with the size of the n-best list. The best results are attained by the system combination.

First, we extracted the n-best list with the phrase-based system and then found the corresponding costs of this n-best list given by the Ngram-based system. Second, we extracted the n-best list with the Ngram-based system. Third, we concatenated the first n-best list with the second.

Given that the phrase- and Ngram-based system have different dictionaries, a phrase (Ngram) translation may not be reproduced by the opposite system. In that case, the cost of the worst sentence in the n-best list is assigned to it.

Rescoring the phrase-based system yields an improvement of 0.8 BLEU points on the En2Es development set and a little less in the opposite direction. The Ngram-based baseline is already better than the phrase-based baseline, and rescoring yields a smaller improvement: 0.3 BLEU points on the development set in both directions. The improvement is much lower than when rescoring the phrase-based system; although the n-best lists were more varied in the Ngram-based system, the quality of the n-best translations may be higher in the phrase-based system. The system combination reaches an improvement of almost 0.5 BLEU points over the better of the two baseline systems and 1 BLEU point over the worse one.

Table 6 shows the results of the rescoring and system combination experiments on the test set. Again, the first two rows show the results of the non-rescored systems and of PB (NB) rescored by NB (PB). The third row corresponds to the system combination, where the PB (NB) outputs rescored by NB (PB) are simply merged and reranked.

TABLE 6 Rescoring and System Combination Results

Rescoring the phrase-based system yields an improvement of 0.65 BLEU points on the test set. Rescoring the Ngram-based system yields an improvement of 0.6 BLEU points, a better performance than on the development set. Finally, the Es2En system combination reaches an improvement of 0.4 BLEU points over the better of the two baseline systems and 0.8 over the worse one. In the opposite direction, the gain is about 0.8 BLEU points with respect to both baseline systems. In En2Es, the improvement is consistent across all measures. In Es2En, the improvement from rescoring is consistent, but the improvement from system combination is not; here, the results on the development set do not generalize to the test set.

Structural Comparison

The following experiments were carried out to give a comparison of the translation units used in the phrase- and Ngram-based systems that were previously combined.

Both approaches aim at improving accuracy by including word context in the model. However, the implementations of the models are quite different and may produce variations in several respects. Table 7 shows how decoding time varies with the beam size. Additionally, the number of available translation units is shown, corresponding to the number of available phrases for the phrase-based system and to the 1-gram, 2-gram, and 3-gram entries for the Ngram-based system. Results are computed on the development set.

TABLE 7 Impact on Efficiency of the Beam Size in PB (Top) and NB System (Bottom)

The number of translation units is similar for both systems in both tasks (537k vs. 537k for Spanish-to-English and 594k vs. 651k for English-to-Spanish), while the time consumed during decoding is clearly higher for the phrase-based system. This can be explained by the fact that in the phrase-based approach the same translation can be hypothesized following several segmentations of the input sentence, as phrases appear in (and are collected from) multiple segmentations of the training sentence pairs. In other words, the search graph seems to be overpopulated in the phrase-based approach.

Table 8 shows the effect of the beam size in the search on translation accuracy. Results are computed on the test set for the phrase-based and Ngram-based systems.

TABLE 8 Impact on Accuracy of the Beam Size in PB (Top) and NB System (Bottom)

Results of the Ngram-based system show that decreasing the beam size produces a clear reduction in accuracy, whereas the accuracy of the phrase-based system remains very similar under the different settings. This is because of the way the translation models are used in the search. In the phrase-based approach, every partial hypothesis is scored without context; hence, a single score is used for a given partial hypothesis (phrase). In the Ngram-based approach, the model is intrinsically contextualized, which means that the score of each partial hypothesis (tuple) depends on the preceding sequence of tuples. Thus, if a poorly scored sequence of tuples begins with a well-scored initial sequence, it is placed on top of the first stacks (beam) and may cause the pruning of the remaining hypotheses.

Error Analysis

This section aims at identifying the main problems of the phrase- and Ngram-based systems by performing a human error analysis. The main objective of this analysis is to focus on the differences between the two systems, which justify the system combination. The guidelines for this error analysis can be found in Vilar, Xu, D'Haro, and Ney (Citation2006). We randomly selected 100 sentences to be evaluated by bilingual judges.

This analysis reveals that the two systems produce somewhat different errors (most are the same). For the English-to-Spanish direction, the greatest problem is the generation of the right tense for verbs, with around 20% of all translation errors being of this kind. Reordering also poses an important problem for both the phrase- and Ngram-based systems, with 18% and 15% of the errors, respectively, falling into this category. Missing words are also an important problem; however, most of them (approximately two-thirds for both systems) are filler words (i.e., words that do not convey meaning), so in general the meaning of the sentence is preserved. The most remarkable difference between the two systems is that the Ngram-based system produces a relatively large number of extra words (approximately 10% of the errors), while for the phrase-based system this is only a minor problem (2% of the errors). In contrast, the phrase-based system has more problems with incorrect translations, that is, words for which a human can find a correspondence in the source text but whose translation is incorrect.

Similar conclusions can be drawn for the inverse direction. The verb generation problem is not as acute in this direction due to the much simpler morphology of English. An important problem is the generation of the correct preposition.

The Ngram-based system seems to produce more accurate translations (reflected in a lower percentage of translation errors). However, it generates too many additional (and incorrect) words in the process. The phrase-based system, in contrast, counteracts this effect by producing a more direct correspondence with the words present in the source sentence, at the cost of sometimes not being able to find the exact translation.

CONCLUSIONS

We have presented a straightforward system combination method using several well-known feature functions for rescoring the 1-best outputs of the phrase- and Ngram-based SMT systems. The TALPcom system is the combination of TALPphr and TALPtup using several n-gram language models, a word bonus, and the IBM Model 1 applied to the whole sentence. The combination obtains clear improvements in BLEU score, but not in NIST, since the features used in the combination generally favor shorter outputs.

We have reported a structural comparison between the phrase- and Ngram-based systems. On the one hand, the Ngram-based system outperforms the phrase-based one in terms of search time by avoiding the overpopulation problem present in the phrase-based approach. On the other hand, the phrase-based system shows better performance when decoding under a highly constrained search.

We have carried out a detailed error analysis in order to better determine the differences in performance between the two systems. The Ngram-based system produced more accurate translations, but also a larger number of extra (incorrect) words, when compared to the phrase-based translation system.

Finally, we have presented another system combination method, which consists of concatenating the respective systems' n-best lists and rescoring them using the opposite system as a feature function, i.e., the Ngram-based system is used to rescore the phrase-based outputs and vice versa. For both systems, including the probability given by the opposite system as a rescoring feature function leads to an improvement in BLEU score.

The authors would like to thank Josep M. Crego and David Vilar for their contributions to the sections of structural comparison and error analysis, respectively.

Notes

1. With removal of duplicates.

REFERENCES

  • Bangalore, S., G. Bordel, and G. Riccardi. 2001. Computing consensus translation from multiple machine translation systems. In: IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 351–354, Madonna di Campiglio, Italy.
  • Chen, B., R. Cattoni, N. Bertoldi, M. Cettolo, and M. Federico. 2005. The ITC-irst statistical machine translation system for IWSLT-2005. In: Proc. of the Int. Workshop on Spoken Language Translation, IWSLT'05, pp. 98–104, Pittsburgh, PA.
  • Costa-Jussà, M. R., J. M. Crego, A. de Gispert, P. Lambert, M. Khalilov, J. A. R. Fonollosa, J. B. Mariño, and R. Banchs. 2006. TALP phrase-based statistical machine translation and TALP system combination for the IWSLT 2006. In: Proc. of the Int. Workshop on Spoken Language Translation, IWSLT'06, Kyoto.
  • Crego, J. M., M. R. Costa-Jussà, J. Mariño, and J. A. Fonollosa. 2005. Ngram-based versus phrase-based statistical machine translation. In: Proc. of the Int. Workshop on Spoken Language Translation, IWSLT'05, pp. 177–184.
  • Crego, J. M., A. de Gispert, P. Lambert, M. Khalilov, M. R. Costa-Jussà, J. Mariño, R. Banchs, and J. A. Fonollosa. 2006. The TALP Ngram-based system for the IWSLT 2006. In: Proc. of the Int. Workshop on Spoken Language Translation, IWSLT'06, Kyoto.
  • de Gispert, A. 2006. Introducing linguistic knowledge into statistical machine translation. PhD Thesis, Dep. de Teoria del Senyal i Comunicacions, Universitat Politècnica de Catalunya (UPC).
  • Doi, T., Y. Hwang, K. Imamura, H. Okuma, and E. Sumita. 2005. Nobody is perfect: ATR's hybrid approach to spoken language translation. In: Proc. of the Int. Workshop on Spoken Language Translation, IWSLT'04, pp. 55–62, Pittsburgh, PA.
  • Fiscus, J. G. 1997. A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER). In: IEEE Workshop on Automatic Speech Recognition and Understanding, Santa Barbara, CA.
  • Frederking, R. and S. Nirenburg. 1994. Three heads are better than one. In: 4th Conference on Applied Natural Language Processing, Stuttgart, Germany.
  • Jayaraman, S. and A. Lavie. 2005. Multi-engine machine translation guided by explicit word matching. In: 10th Conference of the European Association for Machine Translation, pp. 143–152, Budapest, Hungary.
  • Kneser, R. and H. Ney. 1995. Improved backing-off for m-gram language modeling. In: Proc. of the ICASSP Conference, vol. 1, pp. 181–184, Detroit, MI.
  • Koehn, P., F. J. Och, and D. Marcu. 2003. Statistical phrase-based translation. In: Proc. of the Human Language Technology Conf., HLT-NAACL'03, pp. 48–54, Edmonton, Canada.
  • Mariño, J. B., R. E. Banchs, J. M. Crego, A. de Gispert, P. Lambert, J. A. R. Fonollosa, and M. R. Costa-Jussà. 2006. N-gram based machine translation. Computational Linguistics 32(4):527–549.
  • Matusov, E., N. Ueffing, and H. Ney. 2006. Computing consensus translation from multiple machine translation systems using enhanced hypotheses alignment. In: Proc. of the 11th Conf. of the European Chapter of the Association for Computational Linguistics, pp. 33–40, Trento.
  • Nelder, J. A. and R. Mead. 1965. A simplex method for function minimization. The Computer Journal 7:308–313.
  • Nomoto, T. 2004. Multi-engine machine translation with voted language model. In: Proc. of the 42nd Annual Meeting of the Association for Computational Linguistics, pp. 494–501.
  • Och, F. J. and H. Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics 29(1):19–51.
  • Rosti, A.-V. I., N. F. Ayan, B. Xiang, S. Matsoukas, R. Schwartz, and B. J. Dorr. 2007. Combining outputs from multiple machine translation systems. In: Proc. of the Human Language Technology Conf., HLT-NAACL'07, pp. 228–235, Rochester, NY.
  • Sim, K. C., W. J. Byrne, M. J. F. Gales, H. Sahbi, and P. C. Woodland. 2007. Consensus network decoding for statistical machine translation system combination. In: Proc. of the ICASSP, vol. 4, pp. 105–108.
  • Snover, M., B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul. 2006. A study of translation edit rate with targeted human annotation. In: Proc. Assoc. for Machine Trans. in the Americas, Cambridge, MA.
  • Stolcke, A. 2002. SRILM - An extensible language modeling toolkit. In: Proc. of the 7th Int. Conf. on Spoken Language Processing, ICSLP'02, pp. 901–904, Denver.
  • Vilar, D., J. Xu, L. F. D'Haro, and H. Ney. 2006. Error analysis of machine translation output. In: 5th Int. Conf. on Language Resources and Evaluation, LREC'06, pp. 697–702, Genoa, Italy.
