
Improved Language Models for Word Prediction and Completion with Application to Hebrew

Pages 232-250 | Published online: 12 May 2017

ABSTRACT

Language models (LMs) are important components of many applications that work with natural language, such as word prediction and completion programs, automatic speech recognition, and machine translation. In this paper, we introduce various types of improvements for LMs dealing with word prediction and completion in Hebrew. Whereas previous systems for the Hebrew language apply known variants of existing LMs without any alteration, this study presents two types of improvements concerning the LMs: one is general and the other is specific to the Hebrew language. These improvements enable all tested LMs to improve their keystroke-saving abilities.

Introduction

Word Prediction is the process in which words that are likely to follow in a given text are suggested by displaying a list of the most relevant words. Word Prediction enables a user to select a word from a given list, reducing the number of keystrokes necessary for typing and thereby saving physical effort and time and reducing spelling errors. The process of Word Prediction is as follows: As the writer types a word (or words), the software produces one word or a list of word choices. Each time a word is added, the list is updated. When the target word appears in the list, it can be selected and inserted into the ongoing text with a single keystroke. Typically, word lists are numbered, and words can be selected by typing the corresponding number. If the word the user seeks is not predicted, the writer must enter the word.

Word Completion is the process in which words that include the entered letter sequence are suggested by displaying a list of the most relevant words. The process of Word Completion is as follows: As the writer types the first letter or letters of a word, the software produces one word or a list of words beginning with the entered letter sequence. Each time a letter is added, the list is updated. When the target word appears in the list, it can be selected and inserted into the ongoing text with a single keystroke. Typically, word lists are numbered and words can be selected by typing the corresponding number. If the word that the user seeks is not predicted, the writer must enter the next letter of the word.

Word prediction and completion features are provided by various software, e.g., database query tools, e-mail programs, search engine interfaces, text editors, web browsers, word processors, and fast-texting tools on small form factor devices. These features speed up human-computer interaction, saving physical effort and time and reducing spelling errors.

The original purpose of word prediction and completion was to assist people with physical disabilities to increase their typing speed (Tam and Wells Citation2009) and to decrease the number of keystrokes required to either write or complete a word (Anson et al. Citation2006). The main aims of word prediction are to increase typing speed and to reduce writing errors (particularly for dyslexic people). However, word prediction and completion have also been shown to be helpful for anyone who types text, especially those who use text editors, mobile phones, search engines, and short message services. In addition, Word Prediction software often allows users to enter their own words into the word prediction dictionaries either manually or by “learning” inputted words (Beukelman and Mirenda Citation2008; Darragh, Witten, and James Citation1990).

Augmentative and alternative communication (AAC) is used by individuals to compensate for their impairments in the expression or comprehension of written or spoken language (Calculator et al. Citation2004; Fossett and Mirenda Citation2009). AAC is utilized by individuals with a variety of congenital conditions, such as autism, cerebral palsy, and intellectual disabilities and acquired conditions, such as amyotrophic lateral sclerosis, aphasia, and traumatic brain injury (Beukelman and Mirenda Citation2005).

AAC users typically input text (letter by letter) at rates of no more than 10–15 words per minute (WPM); many users with motor impairments type at one WPM or less (Trnka et al. Citation2009).

Word prediction and completion programs commonly utilize language models that attempt to capture the properties of a language. These LMs provide the user with one or more suggestions for completing the current or, occasionally, the following word (Darragh and Witten Citation1991; Trnka et al. Citation2009; Wandmacher and Antoine Citation2007).

The main evaluation measure for word prediction is keystroke savings (KS) (Carlberger et al. Citation1997; Newell, Langer, and Hickey Citation1998; Li and Hirst Citation2005; Trnka and McCoy Citation2008), which measures the percentage of key presses saved compared to letter-by-letter text entry. KS is computed using the following formula: KS = (chars − keystrokes) / chars × 100, where chars represents the number of characters in the text, including spaces and newlines, and keystrokes is the minimum number of key presses required to enter the text using word prediction, including the keystroke to select a prediction from the list and a key press at the end of each utterance.
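The KS formula can be computed directly; a minimal sketch (the figures in the example are invented for illustration):

```python
def keystroke_savings(chars: int, keystrokes: int) -> float:
    """Keystroke savings as defined above: KS = (chars - keystrokes) / chars * 100,
    where `chars` counts every character of the text (including spaces and
    newlines) and `keystrokes` is the minimum number of key presses needed
    to enter the text with word prediction enabled."""
    return (chars - keystrokes) / chars * 100

# Hypothetical example: a 100-character utterance entered with 58 key presses.
print(keystroke_savings(100, 58))  # -> 42.0
```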

Whereas previous systems for the Hebrew language present applications of known variants of existing LMs without any alteration, this study presents two improvements concerning the LMs: one is general and the other is specific to the Hebrew language. These improvements contribute to improving the KS rate.

This article is organized as follows: The section ‘Previous Word Completion and Prediction Systems’ presents a number of previous prediction systems as well as several previous prediction systems for Hebrew. The section ‘Language Models and Their Combinations’ describes LMs in general and LM combinations. The section ‘Improved LMs for Hebrew Word Completion and Prediction’ presents the proposed Improved LMs for word prediction and completion in Hebrew. The section ‘Experimental Setup and Results’ describes the examined corpora and the experimental results and analyzes them. The final section summarizes, concludes, and proposes future directions.

Previous word completion and prediction systems

PAL (Swiffin et al. Citation1985), one of the first word prediction systems, suggests most frequent words that match a partially typed word, ignoring the context of the typed word. WordQ (for English) (Shein et al. Citation2001) and Profet (for Swedish) (Carlberger Citation1997; Carlberger et al. Citation1997) use not only word unigrams but also bigrams for word prediction.

Concentrating solely on n-gram distribution statistics for prediction occasionally results in the suggestion of syntactically inappropriate words, whereas utilizing syntactic knowledge, such as the POS tags of a language, can filter out such suggestions. Systems such as Syntax PAL (Morris et al. Citation1992) (for English) and Prophet (Carlberger Citation1997) (for Swedish) base their predictions on syntactic knowledge of a language. Syntax PAL and Prophet are improved versions of the earlier systems PAL and Profet, respectively. Syntax PAL decreased the errors caused by PAL and made it possible for users to write longer and more complicated sentences. Prophet required 33% fewer keystrokes than Profet, its predecessor. A survey of text prediction systems was conducted by Garay-Vitoria and Abascal (Citation2006).

Previous Hebrew word prediction systems

Netzer, Adler, and Elhadad (Citation2008) were the first to present the results of a Hebrew word prediction system. Their system applied three methods: (1) Statistical knowledge—Markov LMs for unigrams, bigrams, and trigrams, (2) Syntactic knowledge—POS tags and phrase structures, and (3) Semantic knowledge—Associating words into categories and finding a group of rules that constrain the prospective following word.

They tested their system on three corpora of 1, 10, and 27 M words. The best results were obtained after training a hidden Markov LM on the largest corpus. Contrary to expectations, the use of morpho-syntactic information, such as POS tags, lowered the prediction results. The best results were achieved using statistical data, resulting in KS of 29% with nine word proposals, 34% for seven proposals, and 54% for a single proposal.

HaCohen-Kerner and Greenfield (Citation2012) presented another Hebrew word prediction system. Their system was composed of the following components: (1) Sorted lists of words, frequent nouns, and frequent verbs, (2) Six corpora containing approximately 177 M words, (3) Three LMs (trigram, bigram, and unigram) that were automatically generated using the Microsoft Research Scalable Language-Model-Building ToolFootnote1 and the aforementioned corpora, (4) Results of queries sent to the Google Search Engine,Footnote2 (5) A morphological analyzerFootnote3 generated by MILA,Footnote4 and (6) A cache containing the 20 most recently typed words.

The KSs rates (between 54% and 72%) reported in HaCohen-Kerner and Greenfield (Citation2012) were higher than those reported in Netzer, Adler, and Elhadad (Citation2008). However, these two systems were trained and tested on different corpora. In any event, it seems that a larger corpus enables improvement in the prediction results. Both studies discovered that when an LM is based on text documents from different domains, better results are achieved without a morphological analyzer.

Whereas Netzer, Adler, and Elhadad (Citation2008) and HaCohen-Kerner and Greenfield (Citation2012) did not define and apply any combinations of LMs (but rather simple n-gram LMs), HaCohen-Kerner, Applebaum, and Bitterman (Citation2014) defined and applied 16 specific variants belonging to five general types of LMs: (1) Basic LMs (unigrams, bigrams, trigrams, and quadgrams), (2) Backoff LMs, (3) LMs Integrated with tagged LMs, (4) Interpolated LMs, and (5) Interpolated LMs Integrated with tagged LMs. These 16 variants were compared in terms of KS for Hebrew word prediction and completion using three types of Israeli web newspaper corpora. The tested corpora contained only 2.1 M tokens to save runtime for the 16 variants of LMs.

The best KS results (18–42% for word prediction and 44–63% for word completion after having the first seven or more letters of the tested words) were achieved with the LMs of the most complex variety, the interpolated LMs integrated with tagged LMs. The improvements in the KS rates for completion of a word after having at least five letters were less than 1% for all corpora. That is, the contribution of the various LMs and the combinations of LMs is primarily expressed in either prediction or completion of words with fewer than five known letters.

However, these three studies only presented applications of known variants of existing LMs without any alteration. In this research, we present two improvements concerning the LMs: one is general and the other is specific to the Hebrew language.

Language models and their combinations

The most commonly applied LMs are statistical n-gram LMs. An n-gram is a contiguous sequence of n items (e.g., characters, words, phonemes, or syllables) from a given sequence of text or speech. These LMs try to capture the syntactic and semantic properties of a language by estimating the probability of an item in a sentence given the preceding n−1 items. An n-gram (N = 1, 2, 3, 4, …) word LM for a specific corpus gives a probability to each sequence of N consecutive words in the corpus according to its distribution in the discussed corpus. LMs with N = 1, 2, 3, and 4 are referred to as unigrams, bigrams, trigrams, and quadgrams (also known as fourgrams), respectively.

The simplest version of an n-gram LM is a list of the specific n-grams that occur in a given corpus (or corpora) and their frequencies of occurrence. Unigram LMs only calculate the probabilities of single words without considering their context, i.e., the words before or after the target word. In the majority of information retrieval (IR) tasks, unigram LMs (as opposed to complex LMs) are most commonly used because they are sufficient to determine the topic of a piece of text (Wang et al. Citation2011). However, for word prediction and completion tasks, unigram LMs are entirely insufficient. N-gram LMs (N > 1) are necessary to capture the context of the word at hand and to enable better word prediction or completion.
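To make the counting and estimation concrete, here is a minimal sketch of maximum-likelihood n-gram probabilities over a toy corpus (the corpus and function names are illustrative, not from the paper):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_prob(tokens, context, word):
    """Maximum-likelihood estimate of P(word | context):
    count(context + word) / count(context)."""
    n = len(context) + 1
    top = ngram_counts(tokens, n)[tuple(context) + (word,)]
    bottom = ngram_counts(tokens, n - 1)[tuple(context)]
    return top / bottom if bottom else 0.0

corpus = "the cat sat on the mat the cat ran".split()
# 2 of the 3 occurrences of "the" are followed by "cat".
print(mle_prob(corpus, ["the"], "cat"))
```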

Wandmacher and Antoine (Citation2006) and Trnka and McCoy (Citation2007) showed that n-gram LMs for word prediction are domain-sensitive. Namely, an n-gram LM trained and tested on a corpus with a different topic and/or style may not perform well when compared to an LM trained and tested on a relevant corpus.

Mikolov (Citation2012), in his doctoral dissertation, presented various limitations of n-gram LMs: (1) N-gram LMs assume an exact match of the tested n-gram, which is not always available. (2) Regular n-gram LMs do not represent patterns that contain more than three or four words. (3) With increasing order of the n-gram LM, the number of possible parameters increases exponentially. (4) There will never be enough training data to estimate the parameters of high-order n-gram LMs.

Part-of-speech taggers enable the generation of more general LMs (Brants Citation2000). Such LMs include sequences of n consecutive tags and their frequencies of occurrence. Ghayoomi and Daroodi (Citation2008) showed that the integration of POS tags in LMs for the Persian language improves the KS rate. Aliprandi et al. (Citation2007) improved a previous POS trigram model for the Italian-inflected language by two types of unification of two trigrams into a single one. These improvements led to an additional 8% KS and a lower number of POS trigrams.

Various additional combinations of LMs have been performed by several systems. For instance, Kimelfeld et al. (Citation2007) used combinations of LMs and HITS algorithms for XML retrieval. They showed that combined LMs generally yield better results in identifying large collections of relevant elements. Kirchhoff et al. (Citation2006) applied various combinations of LMs for large-vocabulary conversational Arabic speech recognition, including an LM that uses a backoff procedure in which words and/or additional morphological features can be used in tandem. They report that combinations of more than one LM usually have a more significant effect; the greatest reduction (0.5% absolute) is obtained by combining stream models with a class-based model involving all three morphological components (roots, stems, and morph classes). McMahon (Citation1994), in the second chapter of his PhD dissertation, supplied an overview of word-based LMs in general and combinations of LMs in particular. McMahon describes a few combination variations using several LMs, such as weighted combinations that produce an interpolated LM, and a backoff use of LMs that either uses one LM or falls back to an additional LM, contingent on a strict set of conditions. Wu et al. (Citation2010) applied and compared combinations of LMs for sentence correction. Stymne (Citation2011) applied combinations of LMs to spell-checking techniques to allow the replacement of unknown words and data cleaning for translation of Haitian Creole SMS. Beyerlein (Citation1998) found that integrating bigram, trigram, and quadgram LMs (with two additional acoustic models) into one LM leads to lower error rates than searching for the “best” combination of a specific LM and a particular acoustic model. Shaik et al. (Citation2012) proposed hierarchical hybrid LMs with hybrid vocabularies for open-vocabulary continuous speech recognition, integrating word LMs and character LMs.
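The backoff idea described above (consult one LM, and fall back to a lower-order LM only under fixed conditions) can be sketched as follows; the flat-dictionary representation of each LM is an assumption made purely for illustration:

```python
def backoff_predict(models, context):
    """Strict backoff: consult the highest-order n-gram LM first; only if
    it yields no candidate, fall back to the (n-1)-gram LM, and so on.

    `models` maps order n -> dict {context_tuple: best_next_word}."""
    for n in sorted(models, reverse=True):
        key = tuple(context[-(n - 1):]) if n > 1 else ()
        word = models[n].get(key)
        if word is not None:
            return word, n   # also report which order produced the hit
    return None, 0

models = {
    3: {("the", "cat"): "sat"},
    2: {("cat",): "ran"},
    1: {(): "the"},
}
print(backoff_predict(models, ["the", "cat"]))  # -> ('sat', 3), a trigram hit
print(backoff_predict(models, ["a", "dog"]))    # -> ('the', 1), unigram fallback
```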

Improved LMs for Hebrew word completion and prediction

Basic LMs

In this subsection, we briefly describe the 16 specific variants belonging to five types of LMs implemented in HaCohen-Kerner, Applebaum, and Bitterman (Citation2014). The tagged LMs used the tagger built by Adler (Citation2007, Citation2008). This tagger achieved 93% accuracy for word segmentation and POS tagging when tested on a corpus of 90 K tokens.

1) The Basic LM type includes four variants of the four most elementary word-based LMs: unigrams, bigrams, trigrams, and quadgrams and solely utilizes a single LM.

2) The Backoff LM type includes three variants: the backoff quadgram LM, the backoff trigram LM, and the backoff bigram LM. The implemented backoff LM is based on the exclusive use of the highest-order n-gram basic LM. If this fails to yield results, we then attempt to use the (n−1)-gram LM.

3) The Backoff integrated LM and Tagged LM types include two variants. The first variant (called the conservative variant) uses the backoff quadgram LM. Only when there are at least two proposals with the same highest result proposed by the n-gram LM do we attempt to choose between them based on the compatible tagged n-gram LM. If no selection is made, we attempt the (n−1)-gram LM, and so on. The second variant is Backoff Integrated Tagged LMs with Basic LMs. This variant first activates the tagged quadgram LM. According to the most likely POS tag, it retrieves in context the word that is most likely to fit this POS tag using the compatible n-gram LM.

4) The Interpolated LM type includes four variants that synthesize all four basic types of n-gram LMs (unigram, bigram, trigram, and quadgram). The implemented variants are: (1) Fixed Equal Weights (0.25) for each n-gram LM; (2) Fixed Unequal Weights that distributes the weights as follows: 0.4, 0.3, 0.2, and 0.1 for the quadgram, trigram, bigram, and unigram LMs, respectively; (3) A different weight is given to each n-gram LM according to its relative rate of successful predictions and completions of words. We call this model interpolated LM with relative weights; and (4) Similar to the previous variant with a special statistical treatment of the first word in each sentence, this treatment is based on the distribution of the first word in all sentences.

5) The synthesized LM type (interpolated and integrated LM and tagged LM) includes three variants as follows: (1) Fixed unequal weights of 0.4, 0.3, 0.2, and 0.1 for the quadgram, trigram, bigram, and unigram LMs, respectively; (2) Weights are given to each n-gram LM according to its relative rate of successful word predictions and completions; and (3) Weights are given to each n-gram LM according to its relative rate of successful word predictions and completions with a special statistical treatment for the first word in each sentence based on the distribution of the first word in all sentences. Table 1 presents the names of the models and their abbreviations, which are used in the rest of the paper.

Table 1. The names of the models and their abbreviations.
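The weighted combination used by the interpolated and synthesized variants above can be sketched as a linear interpolation of n-gram LM estimates (the toy probability tables are invented for illustration):

```python
def interpolated_prob(word, context, prob, weights):
    """Linear interpolation: a weighted sum of the n-gram LM estimates.
    `weights` maps the n-gram order n to its weight (e.g., the fixed
    unequal weights 0.4/0.3/0.2/0.1 for quadgram/trigram/bigram/unigram);
    `prob(n, ctx, word)` returns the order-n LM estimate P(word | ctx)."""
    total = 0.0
    for n, w in weights.items():
        ctx = tuple(context[-(n - 1):]) if n > 1 else ()
        total += w * prob(n, ctx, word)
    return total

# Toy probability tables standing in for trained unigram and bigram LMs.
tables = {
    1: {((), "cat"): 0.1},
    2: {(("the",), "cat"): 0.5},
}

def prob(n, ctx, word):
    return tables[n].get((ctx, word), 0.0)

# Equal weights over the two LMs: 0.5 * 0.1 + 0.5 * 0.5 = 0.3
print(interpolated_prob("cat", ["the"], prob, {1: 0.5, 2: 0.5}))
```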

Improved LMs

In contrast to the three previous systems for the Hebrew language (mentioned in Section 2) that presented applications of known variants of existing LMs without any alteration, this study presents two types of improvements for the LMs: one is general and the other is specific to the Hebrew language. These improvements contribute to improving the KS rate. The LMs presented in this study extend and improve different aspects of the LMs presented in HaCohen-Kerner, Applebaum, and Bitterman (Citation2014), which are briefly presented in Section 4.1.

In the rest of this section, we present two methods related to the test of the LMs on the experimented corpora and two types of improvements concerning the LMs. Some of them were tested on all LMs, and some were tested on the most advanced variant of LM. One improvement is general and technical, and the other is specific to the Hebrew language. These improvements contribute to improving the KS.

The two methods related to the test of the LMs on the experimented corpora are as follows:

  1. Each corpus was split into a training set (90%) and a test set (10%), in contrast to HaCohen-Kerner, Applebaum, and Bitterman (Citation2014), who trained and tested on the same corpus in each experiment. This is probably why the results achieved in the current research are lower than those reported in HaCohen-Kerner, Applebaum, and Bitterman (Citation2014). Testing on the same data from which learning was conducted is methodologically unsound; the real ability of a model should be tested on a corpus different from the one it learned from.

  2. In addition to the calculation of the KS measure that was performed in the three previous systems for Hebrew, we also calculated the perplexity measure in a limited way. A practical, informal definition of perplexity is the number of different tokens that an n-gram LM considers as potential candidates following a given sequence of N−1 tokens. The perplexity measure was computed for the quadgram backoff model, and the values are presented in Table 2.

Table 2. The perplexity’s values for the quadgram backoff model.

The very high values of the perplexity measure for the quadgram backoff model on the three corpora (between 920 and 3514 per word) show that this measure is less relevant than the KS to our research. These values are much higher compared to the lowest perplexity value (approximately 247 per word) that has been published on the Brown Corpus (Brown et al. Citation1992) using a trigram model.

In addition, the perplexity measure in word n-grams LMs usually relates to word prediction, whereas in our research, we also address word completion. Moreover, the time needed for the computation of the perplexity measure for all LMs was extremely high compared to the time needed for the computation of the KS measures for the same LMs. Therefore, we decided to work only with the KS measure.
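For reference, the conventional formal computation of perplexity, which the informal branching-factor reading above approximates, can be sketched as:

```python
import math

def perplexity(probs):
    """Perplexity of a test sequence given the per-token model
    probabilities P(w_i | history): PP = 2 ** (-(1/N) * sum(log2 p_i)).
    This equals the average 'branching factor' of the model."""
    n = len(probs)
    return 2 ** (-sum(math.log2(p) for p in probs) / n)

# A model assigning each of 4 tokens probability 1/8 has perplexity 8.
print(perplexity([0.125] * 4))  # -> 8.0
```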

The two improvements concerning the LMs are as follows: (1) A general technical improvement that is not specific to Hebrew, the non-repetition improvement: the LM does not repeatedly suggest the same word. (2) An improvement specific to the Hebrew language, the prefix improvement: if the model offers a Hebrew word that is a prefix of another Hebrew word and is at least two letters longer than the portion typed so far, the user (or the program itself) will choose it (an offered word that is only one letter longer will not be chosen, because selecting it costs one keystroke). For example, suppose the word that the user wants to write is “followed.” The user types “foll,” and the LM offers the word “follow.” It makes sense to choose this word, because it is a prefix of the desired word “followed.”
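One possible reading of the two rules, as a sketch (the function and its interface are hypothetical, not the authors' implementation):

```python
def apply_improvements(suggestion, typed_prefix, last_suggestion):
    """Filter a candidate suggestion under the two improvements:
    - non-repetition: never offer the same word twice in a row;
    - prefix rule: accept a suggestion extending the typed prefix only if
      it is at least two letters longer than what was typed (a one-letter
      gain costs the same single keystroke as selecting it)."""
    if suggestion == last_suggestion:
        return None                      # non-repetition improvement
    if suggestion.startswith(typed_prefix) and len(suggestion) >= len(typed_prefix) + 2:
        return suggestion                # worth the one selection keystroke
    return None

print(apply_improvements("follow", "foll", last_suggestion=None))  # accepted: two letters gained
print(apply_improvements("folly", "foll", last_suggestion=None))   # rejected: only one letter gained
```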

Experimental setup and results

Examined corpora

In this study, we tested our LMs on three corpora, two of which were tested in HaCohen-Kerner, Applebaum, and Bitterman (Citation2014). The examined corpora are three Israeli newspaper corpora (Arutz Sheva (A7),Footnote5 Haaretz,Footnote6 and Themarker (TM)Footnote7). The three newspaper corpora contain 2,477,947 tokens, slightly more than the 2,130,477 tokens included in the three newspaper corpora used in HaCohen-Kerner, Applebaum, and Bitterman (Citation2014). The only difference is that the Haaretz corpus replaced the NRGFootnote8 corpus. The NRG corpus was removed from the experiments in this study because it contained two huge files, each approximately 150 times larger than any of the other 2500 files in the same corpus.

In this research, we work with a simplified definition of a token in Hebrew as follows: a token is a string of contiguous characters between two spaces, or between a space and any punctuation mark. A token can also be an integer, a real number, or a number with a colon (a time, for example: 2:00). All other symbols are tokens themselves, except apostrophes and quotation marks inside a word (with no space), which in many cases symbolize acronyms or citations. A token can represent a single Hebrew word or a group of words, such as the token “וכשתבואי” (VeKsheTavoee), which includes five words: “and when you will come.”
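A rough regex rendering of this simplified definition (an approximation for illustration, not the authors' exact tokenizer):

```python
import re

# Order matters: times (2:00) before plain numbers; words with internal
# apostrophes/quotation marks (Hebrew acronyms and citations) before
# single symbols; any remaining non-space symbol is its own token.
TOKEN_RE = re.compile(r"\d+:\d+|\d+(?:\.\d+)?|\w+(?:['\"]\w+)*|\S")

print(TOKEN_RE.findall('בשעה 2:00 בדיוק!'))
```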

These corpora represent three different types of Israeli web newspapers: Israel’s political left-center, economic, and national religious press. The diversity of the three corpora allows testing the performance of all LMs on various news genres. Table 3 presents general information regarding the examined corpora.

Table 3. General information regarding the examined corpora.

LM experiments on newspaper corpora

HaCohen-Kerner, Applebaum, and Bitterman (Citation2014) tested 16 specific LMs belonging to five types of LMs. Each LM was tested on the three corpora (NRG, Themarker, and A7). In this research, we investigated 32 specific LMs belonging to the same five types of LMs. Each LM was tested on the three corpora (Haaretz, Themarker, and A7 mentioned in Section 5.1).

In this study, we simulated the process of user interaction in the following manner: we went over each word in each sentence in the test corpus. We attempted to predict the next word in its entirety. If we failed to do so, we tried to predict a single character at a time until a space or a dot (or any other punctuation mark) was reached or until the next word was correctly proposed.
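The simulated interaction described above can be sketched as follows (the `predict` interface is an assumption; in the actual system the proposal comes from the tested LM):

```python
def simulate_entry(word, predict):
    """Simulate the evaluation loop: try to obtain the whole word from the
    model; otherwise type one letter at a time, re-querying after each
    keystroke. `predict(prefix)` is a hypothetical function returning the
    model's single best completion (or None). Returns the keystrokes spent;
    selecting a correct proposal costs one keystroke."""
    keystrokes = 0
    prefix = ""
    while prefix != word:
        if predict(prefix) == word:
            return keystrokes + 1        # one keystroke to select the proposal
        prefix += word[len(prefix)]      # type the next letter
        keystrokes += 1
    return keystrokes

# Toy model that only recognizes the word once 4 letters have been typed.
model = lambda prefix: "follow" if len(prefix) >= 4 else None
print(simulate_entry("follow", model))  # 4 typed letters + 1 selection = 5
```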

The KS (see the definition toward the end of Section 1) values are reported when only one suggestion (with the highest result) is proposed. The KS results of all LMs are shown for each corpus in a separate table (Tables 4–6). The rows of each table show two results for each LM variant: one for the basic variant, which is presented in HaCohen-Kerner, Applebaum, and Bitterman (Citation2014), and one (with a “+” to the right of the LM’s abbreviation) that represents the variant with the prefix and non-repetition additions (see Section 4.2). The leftmost column presents the tested LM (i.e., its abbreviation; the full name can be found in Table 1). The second column shows the KS result for prediction of a word in its entirety. The third column shows the KS result for completion of a word after having its first letter. The fourth through eighth columns present the KS results for completion of a word after having at least the first 2–6 letters of the word, respectively. The ninth column shows the KS result for completion of a word after having at least its first seven letters (7+). The rightmost column shows the relative improvement in % between the “7+” result of the improved variant of the LM (with “+” near the LM’s name) and the “7+” result of the basic variant of the same LM (without “+”).

Table 4. The KS results of LMs for the A7 corpus.

Table 5. The KS results of LMs for the Haaretz corpus.

Table 6. The KS results of LMs for the Themarker corpus.

Conclusions drawn from Table 4:

  • The prefix and non-repetition additions are beneficial. The “+” variants (representing the variants with the prefix and non-repetition additions) are always better than the basic LM variants for word completion after having the first two or more letters. The improvement rate between each LM variant and its corresponding “+” variant varies between 3% and 11%, and is between 7% and 8% for the four best variants (colored red).

  • Regarding the KS results for word prediction, the synthesized LM variants (the three most complex and advanced LMs, # 14–16 in Table 1) are the best LMs, with KS results of approximately 16.8%. Much better results are achieved for word completion after having a few letters.

  • Regarding the KS results for word completion after having at least seven letters, it is unclear which LM’s variant is the best LM. The results of four different advanced “+” variants are ≥ 46% and are rather similar. The best LM’s variant is the most complex one, SSD+ (the “synthesized LM with super dynamic weights” variant) with a KS result of 46.73%.

  • The basic n-gram models can be ranked according to their KS results (for at least three known letters) in the following descending order: bigram, unigram, trigram, and quadgram. The limited success of the trigram and quadgram LMs is probably because the discussed corpora are relatively small, and in many cases the “correct” word is not present at all or lacks the highest frequency as the last word in the n-gram strings (for n ≥ 3). Furthermore, there are many more potential three- or four-gram strings than potential two-gram strings; therefore, it is less likely to predict or complete the correct word. Due to these factors, the quadgram LM produced poorer results than the trigram LM.

  • For most of the LMs’ variants, the more known letters for each word that should be completed, the better the KS results. Improvements greater than 1% along the same variant (row) were achieved until word completion of five letters. After having at least five letters, the KS improvement rates were less than 1% for all LM’s variants.

    Conclusions drawn from Table 5:

    • In contrast to the results for the A7 corpus (Table 4): The synthesized LM variants are the best LMs, with a KS result of approximately 12.8%, lower than the matching result for the A7 corpus (~16.8%).

    • The best KS results for word completion after having at least seven letters for the Haaretz corpus (~37.7%) are significantly lower than the best KS results for the A7 corpus (~46.5%). A logical explanation for this finding is that the Haaretz corpus has a broader range of n-grams (81,104 different tokens, compared with 58,587 different tokens in the A7 corpus) and fewer repetitions of n-grams; therefore, the statistical LMs are less successful.

    • The ranking of the basic n-gram models for the Haaretz corpus (unigram, bigram, trigram, and quadgram) differs from that for the A7 corpus (bigram, unigram, trigram, and quadgram).

  • Similar to the results for the A7’s corpus:

    • The prefix and non-repetition additions are beneficial. The “+” variants are always better than the basic LM’s variants for word completion after having the first two or more letters. The improvement rate between each LM’s variant and its corresponding “+” variant varies between 2% and 11% and between 7% and 8% for almost all of the variants in general and for the three best variants (colored with red).

    • In most cases, all LMs’ variants produced higher KS results the more known letters they had. Improvements greater than 1% along the same variant (row) were achieved until word completion of five letters. After having at least five letters, the KS improvement rates were less than 1% for all LM’s variants.

    Conclusions drawn from Table 6:

  • In contrast to the results for the A7 corpus (Table 4):

    • The synthesized LM’s variants are the best LMs with a KS result of approximately 9.2%, significantly lower than the matching results for the A7 corpus (~16.8%) and the Haaretz corpus (~12.8%).

    • The prefix and non-repetition additions are more successful than those reported for the two previous corpora. The “+” variants are always better than the basic LM’s variants for word completion after having the first two or more letters. The improvement rate between each LM’s variant and its corresponding “+” variant varies between 4% and 74% (!) and between 9% and 11% for almost all of the variants in general and 9% for the three best variants (colored with red). A possible partial explanation of these findings is that this corpus covers a more limited domain (the economic domain) and there are relatively more groups of words that share common prefixes.

    • The basic n-gram models can be ranked according to their KS results (for at least four known letters) in the following descending order: unigram, bigram, trigram, and quadgram. These results are similar to those of the Haaretz corpus and in contrast to those of the A7 corpus.

  • Similar to the results for the A7’s corpus:

    • In most cases, all LMs’ variants produced higher KS results the more known letters they had. Improvements greater than 1% along the same variant (row) were achieved until word completion of five letters. After having at least five letters, the KS improvement rates were less than 1% for all LM’s variants.

The best LM for both word prediction and completion, for all three examined corpora, is the improved synthesized LM type (the interpolated and integrated LM and tagged LM) for all three variants. For word completion, all LM variants that include the two additions are better than the basic LM variants. These additions are not relevant for word prediction.
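The "interpolated" ingredient of the synthesized LM can be illustrated with plain linear interpolation of maximum-likelihood n-gram estimates. This is a minimal sketch under that assumption only; the paper's synthesized model also integrates a tagged LM and uses its own weighting scheme, neither of which is reproduced here.

```python
from collections import Counter

def train_ngram(tokens, n):
    """Maximum-likelihood counts for an n-gram model:
    returns (n-gram counts, context counts)."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ctxs = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 1))
    return grams, ctxs

def interp_prob(word, history, models, weights):
    """P(word | history) as a weighted sum of ML n-gram probabilities
    (linear interpolation); `weights` should sum to 1."""
    p = 0.0
    for (n, (grams, ctxs)), lam in zip(sorted(models.items()), weights):
        ctx = tuple(history[-(n - 1):]) if n > 1 else ()
        if ctxs[ctx]:                      # unseen context contributes 0
            p += lam * grams[ctx + (word,)] / ctxs[ctx]
    return p
```

For example, interpolating a unigram and a bigram model trained on "the cat sat on the mat" with weights 0.4 and 0.6 gives P(cat | the) = 0.4·(1/6) + 0.6·(1/2); a word-completion list is then the highest-probability words whose spelling continues the typed letters.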

To identify which improvement is the best and whether one of the improvements can be omitted, we separately measured the improvement rate of each improvement (1: the prefix addition; 2: the non-repetition addition). Due to the high run time (a few weeks) needed to generate Table 7, we performed the experiments only for the most complex variant, the SSD+, for all three corpora.

Table 7 presents the contribution rate of the prefix and/or the non-repetition additions over the three web newspaper corpora. For each corpus, four rows are presented. The first row presents the basic variant (HaCohen-Kerner, Applebaum, and Bitterman 2014); the second row (with “1” adjacent to the right of the model’s name) presents the result after adding addition #1 (the prefix improvement); the third row (with “2”) presents the result after adding addition #2 (the non-repetition improvement); and finally, the row with “+” presents the result after adding both additions.

Table 7. The contribution of each addition to the SSD+ model.

Conclusions drawn from Table 7:

  • The two additions contribute to improving the KS rate only for word completion. These additions are not relevant for word prediction.

  • The non-repetition addition was better than the prefix addition.

  • The best variant of each LM, for all three examined corpora, is the “+” variant, which uses both improvements; i.e., applying both additions leads to the best improvement.
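As background for the prefix addition: Hebrew writes several function words (and, the, in, to, from, that, as) as single letters attached to the following word. The sketch below generates completion candidates by also matching singly prefixed forms of vocabulary words; the function name, the fixed prefix set, and the simple deduplication are illustrative assumptions, not the paper's implementation of the prefix and non-repetition additions.

```python
# Single-letter Hebrew formative prefixes: and, the, in, to, from, that, as.
HEBREW_PREFIXES = "והבלמשכ"

def completion_candidates(typed, vocab, limit=7):
    """Offer a vocabulary word if its plain form, or a form with one
    attached formative letter, continues the typed letters.  Duplicate
    surface forms are skipped so the list never repeats an entry."""
    out = []
    for w in vocab:
        for form in [w] + [p + w for p in HEBREW_PREFIXES]:
            if form.startswith(typed) and form not in out:
                out.append(form)
    return out[:limit]
```

For instance, with the base word בית ("house") in the vocabulary, typing the letter ה (the definite article) can still surface הבית ("the house"), which a plain surface-form lookup would miss.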

Summary, conclusions, and future work

Whereas previous systems for the Hebrew language apply known variants of existing LMs without any alteration, in this study we introduced four types of improvements concerning the LMs or the methods used to test them. These improvements were applied, either fully or partially, to 16 different variants of five types of LMs dealing with word prediction and completion in Hebrew.

The best LM for both word prediction and completion, for all three examined corpora, is the improved synthesized LM type (the interpolated and integrated LM and tagged LM) for all three variants. For word completion, the non-repetition and the Hebrew-prefix improvements improved the KS rate for all of the LM variants; that is, all LM variants that include these improvements are better than the basic LM variants for word completion. The non-repetition improvement was better than the prefix improvement, but the use of both improvements together led to the best KS results.

Future directions for research are to define and apply: (1) the use of dotted (vocalized) Hebrew to improve the KS rate for Hebrew texts; (2) other types of LMs (e.g., LMs acquired by sampling n-grams from documents, or by sampling documents, in order to speed up LM construction) for this task as well as for other domains, applications, and languages; and (3) integration of LMs with other software components (e.g., a POS tagger, a morphological analyzer, and a cache model).

Notes

5. Arutz Sheva is an Israeli national religious media network. It offers free podcasts, live streaming radio, a daily e-mail news update, streaming video, and 24-hour updated text news.

6. Haaretz is Israel’s oldest daily newspaper. It was founded in 1918 and is now published in both Hebrew and English. Both Hebrew and English editions can be read on the Internet.

7. TheMarker is an economic news website that offers ongoing coverage of the capital markets in Israel and globally, high-tech, advertising and media industries, real estate, labor market, consumer markets, the Israeli legal world, communication, vehicles, and transportation.

8. NRG is the online edition of the Israeli newspaper Maariv. Whereas most of the content on the website comes from the print edition, some of the material is written exclusively for the web edition.

References

  • Adler, M. 2007. Hebrew morphological disambiguation: an unsupervised stochastic word-based approach. Ph.D. diss., Ben-Gurion University of the Negev, Beer-Sheva, Israel.
  • Adler, M., Y. Netzer, D. Gabay, Y. Goldberg, and M. Elhadad. 2008. Tagging a Hebrew corpus: The case of participles. LREC 2008. Marrakech, Morocco: European Language Resources Association.
  • Aliprandi, C., N. Carmignani, P. Mancarella, and M. Rubino 2007. A word predictor for inflected languages: System design and user-centric interface. Proceedings of the 2nd IASTED International Conference on Human-Computer Interaction, Chamonix, France.
  • Anson, D., P. Moist, M. Przywara, H. Wells, H. Saylor, and H. Maxime. 2006. The effects of word completion and word prediction on typing rates using on-screen keyboards. Assistive Technology 18:146–54. doi:10.1080/10400435.2006.10131913.
  • Beukelman, D., and P. Mirenda. 2005. Augmentative & alternative communication: Supporting children & adults with complex communication needs. 3rd ed. Paul H. Baltimore, MD: Brookes Publishing Company. ISBN 978-1-55766-684-0
  • Beukelman, D., and P. Mirenda. 2008. Augmentative and alternative communication: Supporting children and adults with complex communication needs. 3rd ed. p. 77. Baltimore, MD: Brookes Publishing.
  • Beyerlein, P. 1998. Discriminative model combination. Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal, Vol. 1, 481–84.
  • Brants, T. 2000. TnT: A statistical part-of-speech tagger. Proceedings of the 6th Conference on Applied Natural Language Processing, Association for Computational Linguistics, 224–31.
  • Brown, P. F., V. J. D. Pietra, R. L. Mercer, S. A. D. Pietra, and J. C. Lai. 1992. An estimate of an upper bound for the entropy of English. Computational Linguistics 18 (1):31–40.
  • Calculator, S., A. Finch, S. McCloskey, R. Schlosser, and C. Sementelli. 2004. Roles and responsibilities of speech-language pathologists with respect to augmentative and alternative communication. Position statement, American Speech-Language-Hearing Association. http://www.asha.org/docs/html/PS2005-00113.html (retrieved January 23, 2009).
  • Carlberger, J. 1997. Word prediction: design and implementation of a probabilistic word prediction program. Master diss., Royal Institute of Technology, Stockholm, Sweden.
  • Carlberger, A., J. Carlberger, T. Magnuson, M. S. Hunnicutt, S. E. Palazuelos-Cagigas, and S. A. Navarro. 1997. Profet, a new generation of word prediction: An evaluation study. In Natural language processing for communication aids, ed. A. Copestake, S. Langer, and S. Palazuelos-Cagigas, 23–28. Madrid, Spain: Proceedings of a workshop sponsored by ACL.
  • Darragh, J. J., and I. H. Witten. 1991. Adaptive predictive text generation and the reactive keyboard. Interacting with Computers 3 (1):27–50. doi:10.1016/0953-5438(91)90004-L.
  • Darragh, J. J., I. H. Witten, and M. L. James. 1990. The reactive keyboard: A predictive typing aid. Computer 23 (11):41–49. doi:10.1109/2.60879.
  • Fossett, B., and P. Mirenda. (2009). Augmentative and alternative communication. In Handbook of developmental disabilities, ed. S. L. Odom, R. H. Horner, and M. E. Snell, 330–66. New York, NY: Guilford Press. ISBN 978-1-60623-248-4.
  • Garay-Vitoria, N., and J. Abascal. 2006. Text prediction systems: A survey. Universal Access in the Information Society 4:188–203. doi:10.1007/s10209-005-0005-9.
  • Ghayoomi, M., and E. Daroodi. 2008. A POS-based word prediction system for the Persian language. In Advances in natural language processing, Proceedings of 6th International Conference, GoTAL 2008, Gothenburg, Sweden, August 25-27, ed. Ranta, Aarne, Nordström, Bengt, LNAI 5221, 138–47. Springer-Verlag Berlin Heidelberg.
  • HaCohen-Kerner, Y., A. Applebaum, and J. Bitterman. 2014. Experiments with language models for word completion and prediction in Hebrew. In Advances in natural language processing, ed. L. Calderon-Benavides, C. Gonzalez-Caro, E. Chavez, and N. Ziviani, 450–62. Springer-Verlag International Publishing.
  • HaCohen-Kerner, Y., and I. Greenfield. 2012. Basic word completion and prediction for Hebrew. In Proceedings of the 19th Symposium on String Processing and Information Retrieval (SPIRE 2012), ed. L. Calderón-Benavides, C. González-Caro, E. Chávez, and N. Ziviani, 237–44. Cartagena, Colombia, October 21–25, LNCS 7608. Heidelberg: Springer.
  • Kimelfeld, B., E. Kovacs, Y. Sagiv, and D. Yahav. 2007. Using language models and the HITS algorithm for XML retrieval. In Comparative evaluation of XML information retrieval systems, ed. N. Fuhr, M. Lalmas, A. Trotman, 253–60. Berlin/Heidelberg: Springer.
  • Kirchhoff, K., D. Vergyri, J. Bilmes, K. Duh, and A. Stolcke. 2006. Morphology-based language modeling for conversational Arabic speech recognition. Computer Speech and Language 20 (4):589–608. doi:10.1016/j.csl.2005.10.001.
  • Li, J., and G. Hirst. 2005. Semantic Knowledge in Word Completion. Proceedings of the 7th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS), Baltimore, MD, USA, 121–28.
  • McMahon, J. G. G. 1994. Statistical language processing based on self-organising word classification. Doctoral diss., Queen’s University of Belfast, Belfast, Northern Ireland.
  • Mikolov, T. 2012. Statistical language models based on neural networks. Ph.D. diss., Brno University of Technology.
  • Morris, C., A. Newell, L. Booth, I. Ricketts, and J. Arnott. 1992. Syntax pal: A system to improve the written syntax of language-impaired users. Assistive Technology 4 (2):51–59. doi:10.1080/10400435.1992.10132194.
  • Netzer, Y., M. Adler, and M. Elhadad. 2008. Word prediction in Hebrew: Preliminary and surprising results. The International Society for Augmentative and Alternative Communication (ISAAC), Montreal, Canada.
  • Newell, A., S. Langer, and M. Hickey. 1998. The rôle of natural language processing in alternative and augmentative communication. Natural Language Engineering 4 (1):1–16. doi:10.1017/S135132499800182X.
  • Shaik, M. A. B., D. Rybach, S. Hahn, R. Schlüter, and H. Ney. 2012. Hierarchical hybrid language models for open vocabulary continuous speech recognition using WFST. Proceedings of SAPA.
  • Shein, F., T. Nantais, R. Nishiyama, C. Tam, and P. Marshall. 2001. Word cueing for persons with writing difficulties: WordQ. Technology and Persons with Disabilities Conference, Los Angeles, CA. http://www.csun.edu/cod/conf/2001/proceedings/0126shein.htm
  • Stymne, S. 2011. Spell checking techniques for replacement of unknown words and data cleaning for haitian creole SMS translation. Proceedings of the 6th Workshop on Statistical Machine Translation, 470–77, Association for Computational Linguistics.
  • Swiffin, A. L., J. A. Pickering, J. L. Arnott, and A. F. Newell. 1985. PAL: An effort efficient portable communication aid and keyboard emulator. Proceedings of the 8th Annual Conference on Rehabilitation Technology, 197–99.
  • Tam, C., and D. Wells. 2009. Evaluating the benefits of displaying word prediction lists on a personal digital assistant at the keyboard level. Assistive Technology 21:105–14. doi:10.1080/10400430903175473.
  • Trnka, K., J. McCaw, D. Yarrington, and K. F. McCoy. 2009. User interaction with word prediction: The effects of prediction quality. Special Issue of ACM Transactions on Accessible Computing (TACCESS) on Augmentative and Alternative Communication 1 (3):1–34.
  • Trnka, K., and K. F. McCoy. 2007. Corpus studies in word prediction. Proceedings of the 9th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS), 195–202, ACM.
  • Trnka, K., and K. F. McCoy. 2008. Evaluating word prediction: Framing keystroke savings. ACL (Short Papers) 2008:261–64.
  • Wandmacher, T., and J. Y. Antoine. 2006. Training language models without appropriate language resources: Experiments with an AAC system for disabled people. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy.
  • Wandmacher, T., and J.-Y. Antoine. 2007. Methods to integrate a language model with semantic information for a word prediction component. Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), Prague, Czech Republic, 503–13.
  • Wang, D., S. Zhu, T. Li, Y. Chi, and Y. Gong. 2011. Integrating document clustering and multidocument summarization. ACM Transactions on Knowledge Discovery from Data (TKDD) 5 (3):1–26. doi:10.1145/1993077.
  • Wu, C.-H., C.-H. Liu, M. Harris, and L.-C. Yu. 2010. Sentence correction incorporating relative position and parse template language models. IEEE Transactions on Audio, Speech, and Language Processing 18 (6):1170–81. doi:10.1109/TASL.2009.2031237.
