Tackling the multilingual and heterogeneous documents with the pre-trained language identifiers

Pages 391-402 | Received 12 Nov 2022, Accepted 22 May 2023, Published online: 30 May 2023
 

Abstract

The Web has become one of the most important data sources, and the content shared on it is often multilingual, since users belong to different cultures and speak different languages. A multilingual document is not suitable for readers who need content in only one language. Furthermore, dividing a multilingual document into monolingual documents helps researchers extract only the text of the desired language for tasks such as model training or testing. However, cleaning and dividing raw content manually is challenging. This paper presents an automatic approach to dividing a multilingual document and reassembling it into monolingual documents by examining three existing state-of-the-art tools for Language Identification (LI). We prepared several corpora with different heterogeneity characteristics for the evaluation and characterized their code-switching patterns using three code-switching metrics. The proposed approach reached a best accuracy of 99% on long segments (long text) and 90% on mixed segments. In addition, a strong negative correlation was found between the I-Index and accuracy (Pearson’s r = −0.998).
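The divide-and-reassemble idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the line-level segmentation, the function names `toy_identify` and `split_by_language`, and the stopword lists are all assumptions made for illustration. The toy identifier stands in for whichever pre-trained LI tool would be used in practice.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical stand-in for a pre-trained language identifier.
# A real LI tool would replace this function; the stopword lists are
# illustrative only and far too small for practical use.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "fr": {"le", "la", "et", "est", "les", "des"},
    "es": {"el", "los", "y", "es", "las", "del"},
}


def toy_identify(segment: str) -> str:
    """Return the language whose stopwords overlap the segment most."""
    tokens = set(segment.lower().split())
    scores = {lang: len(tokens & words) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    # "und" (undetermined) when no stopword matched at all.
    return best if scores[best] > 0 else "und"


def split_by_language(
    document: str, identify: Callable[[str], str] = toy_identify
) -> Dict[str, List[str]]:
    """Label each non-empty line and regroup lines by predicted language."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in document.splitlines():
        segment = line.strip()
        if segment:  # skip blank lines
            groups[identify(segment)].append(segment)
    return dict(groups)


mixed = (
    "The cat is in the garden\n"
    "Le chat est dans le jardin\n"
    "El gato es el rey del jardín"
)
monolingual = split_by_language(mixed)
for lang, lines in monolingual.items():
    print(lang, lines)
```

Passing the identifier as a parameter keeps the regrouping step independent of the LI tool, so several identifiers could be compared on the same segments.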

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

3 We used the UDHR corpus in our experiments; we ignored blank lines and lines with fewer than three letters, and we also removed numbers and all special characters (e.g., +, !, *, @, #, %, &, $, etc.) from the text.
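The cleaning steps in this note could be sketched roughly as follows. This is an assumption-laden illustration, not the authors' script: the function name `clean_lines` is hypothetical, "letter" is taken to mean any Unicode letter, combining marks are kept so that non-Latin scripts are not damaged, and removed characters are replaced by a space so that adjacent words are not glued together.

```python
import unicodedata
from typing import List


def clean_lines(text: str, min_letters: int = 3) -> List[str]:
    """Drop digits and special characters, then drop lines that are too short.

    Assumptions (not stated in the source): letters are Unicode category L*,
    combining marks (category M*) are preserved, and every other character
    is replaced by a space before whitespace is collapsed.
    """
    cleaned = []
    for line in text.splitlines():
        kept = "".join(
            ch if unicodedata.category(ch)[0] in "LM" else " " for ch in line
        )
        kept = " ".join(kept.split())  # collapse runs of whitespace
        n_letters = sum(
            1 for ch in kept if unicodedata.category(ch).startswith("L")
        )
        # Blank lines have zero letters, so they are filtered here as well.
        if n_letters >= min_letters:
            cleaned.append(kept)
    return cleaned


sample = "Article 1.\n\nAll human beings are born free & equal!\n42\nab\n"
result = clean_lines(sample)
print(result)
```

The threshold `min_letters=3` mirrors the "fewer than three letters" rule in the note; whether apostrophes or hyphens inside words were also removed is not specified in the source, and this sketch removes them.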
