
An Efficient Method for Generating Synthetic Data for Low-Resource Machine Translation

An empirical study of Chinese, Japanese to Vietnamese Neural Machine Translation

Article: 2101755 | Received 30 Apr 2022, Accepted 01 Jul 2022, Published online: 02 Aug 2022

Figures & data

Table 1. An example of synthetic data generated from an original sentence pair in our TED Talks dataset using various thresholds. If the threshold is ths = x, standard translation units (tokens) whose frequency is greater than x in the target vocabulary are replaced by their respective ATUs. In the dummy data pairs, the source sentence is preserved while the target sentence is transformed.
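As a minimal sketch of the replacement rule this caption describes (the helper names and the `atu_of` lookup table are our own illustration under the caption's assumptions, not the authors' released code), a dummy pair can be produced from a whitespace-tokenized sentence pair as follows:

```python
from collections import Counter

def target_vocab_counts(target_sentences):
    """Token frequencies over the target-side (Vietnamese) corpus."""
    counts = Counter()
    for sentence in target_sentences:
        counts.update(sentence.split())
    return counts

def make_dummy_pair(source, target, counts, ths, atu_of):
    """Keep the source sentence unchanged; in the target sentence,
    replace every token whose corpus frequency exceeds `ths` with
    its ATU (looked up in `atu_of`), as described in Table 1.
    Tokens without a known ATU are kept as-is (an assumption)."""
    dummy_target = " ".join(
        atu_of.get(token, token) if counts[token] > ths else token
        for token in target.split()
    )
    return source, dummy_target
```

Under this reading, ths = 0 transforms every target token that has an ATU, while ths = 7 only transforms tokens seen more than seven times, matching the two settings compared in Tables 14 and 15.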

Figure 1. ATUs corresponding to the standard translation units in the target sentence.

Figure 2. Our overall method for generating synthetic data and integrating it into the NMT system.

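Continuing the sketch above (again our own illustration, under the assumption that the synthetic pairs are simply mixed with the original bitext before training, as the figure suggests):

```python
import random

def build_augmented_corpus(bitext, dummy_pairs, seed=1):
    """Concatenate the original bilingual pairs with the generated
    synthetic pairs and shuffle them before NMT training."""
    corpus = list(bitext) + list(dummy_pairs)
    random.Random(seed).shuffle(corpus)
    return corpus
```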

Table 2. The number of sentence pairs in our bilingual datasets.

Table 3. The number of monolingual sentences used for training our BERT model.

Table 4. Our data augmentation systems outperform the baseline systems on the TED Talks datasets with a frequency threshold of 7 for replacing standard translation units with ATUs in the synthetic data. Our method is also compared to the back-translation technique.

Table 5. Our data augmentation systems outperform the baseline systems on the ALT datasets with a frequency threshold of 7 for replacing standard translation units with ATUs in the synthetic data.

Table 6. The number of target-side (Vietnamese) sentences containing ATUs when using the threshold ths = 7 on the TED Talks and ALT datasets.

Table 7. The BLEU scores of backward models from the baseline and vanilla systems on the TED Talks datasets. We use spacy for segmenting Japanese texts.

Table 8. Experimental results of combined translation systems from Chinese and Japanese to Vietnamese on the TED Talks datasets.

Table 9. Experimental results of combined training systems from Chinese and Japanese to Vietnamese on the ALT datasets.

Table 10. Experimental results of translation systems when incorporating the BERT model on the TED Talks datasets. The system in (4) is then trained further, up to epoch 70, to achieve state-of-the-art performance.

Table 11. Experimental results of translation systems when incorporating the BERT model on the ALT datasets. The system in (3) is likewise trained further, up to epoch 70.

Table 12. The number of sentence pairs in the Japanese–Vietnamese bilingual training dataset when Japanese texts are segmented by kytea, spacy, or mecab, with source sentences limited to 150 tokens during training.
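A minimal sketch of this preprocessing step with one of the three segmenters (assuming spaCy's Japanese support, which relies on the SudachiPy tokenizer; the function name and the filtering details are ours):

```python
import spacy

# spaCy's blank Japanese pipeline uses its SudachiPy-based tokenizer,
# so segmentation works without a trained model (requires `sudachipy`).
nlp = spacy.blank("ja")

MAX_SRC_TOKENS = 150  # source-length limit applied during training (Table 12)

def segment_japanese(sentences):
    """Tokenize Japanese source sentences and drop those that exceed
    the 150-token source limit."""
    kept = []
    for sentence in sentences:
        tokens = [tok.text for tok in nlp(sentence)]
        if len(tokens) <= MAX_SRC_TOKENS:
            kept.append(" ".join(tokens))
    return kept
```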

Table 13. An example of translations from Japanese-to-Vietnamese NMT systems employing our data augmentation (aug) method in the bilingual systems.

Table 14. BLEU scores and target vocabulary sizes of combined translation systems with replacement frequency thresholds ths = 0 and ths = 7 on the ALT datasets. Japanese texts are segmented by spacy.

Table 15. BLEU scores and target vocabulary sizes of combined translation systems augmented with dummy data at replacement frequency thresholds ths = 0 and ths = 7 on the TED Talks datasets. Japanese texts are segmented by spacy.