
Towards Malay named entity recognition: an open-source dataset and a multi-task framework

Article: 2159014 | Received 15 Aug 2022, Accepted 06 Dec 2022, Published online: 28 Dec 2022

Abstract

Named entity recognition (NER) is a key component of many natural language processing (NLP) applications. The majority of advanced research, however, has not been widely applied to low-resource languages such as Malay due to the data-hungry problem. In this paper, we present a system for building a Malay NER dataset (MS-NER) of 20,146 sentences through labelled datasets of homologous languages and iterative optimisation. Additionally, we propose a multi-task framework, namely MTBR, to integrate boundary information more effectively for NER. Specifically, boundary detection is treated as an auxiliary task, and an enhanced bidirectional revision module with a gated ignoring mechanism is proposed to undertake conditional label transfer. This can reduce error propagation from the auxiliary task. We conduct extensive experiments on Malay, Indonesian, and English. Experimental results show that MTBR achieves competitive performance and tends to outperform multiple baselines. The constructed dataset and model will be made available to the public as a new, reliable benchmark for Malay NER.

1. Introduction

Named Entity Recognition (NER) plays an essential role in multiple downstream NLP applications such as information retrieval (Guo et al., Citation2009), question answering (Aliod et al., Citation2006), and knowledge graph construction (Etzioni et al., Citation2005). It is intended to identify entity boundaries and classify entities into predefined categories (such as person, location, organisation, etc.). With the recent widespread application of neural networks, various neural approaches for NER have been proposed, based on bidirectional long short-term memory (Bi-LSTM) (Ma & Hovy, Citation2016), convolutional neural networks (CNN) (Chiu & Nichols, Citation2016), or, more recently, pre-trained language models (PLMs) (Akbik et al., Citation2018, Citation2019). Looking at the current development of NER, there are two main challenges:

  • Dependence on massive labelled data: neural NER models are highly successful for languages/domains with a large amount of labelled data. However, most low-resource languages (such as Malay, Tamil, etc.) do not have enough labelled data to train fully supervised models (Lin et al., Citation2020).

  • Boundary recognition error (BRE): One of the significant elements influencing NER performance is BRE (Li et al., Citation2021). In our preliminary experiments, we found that the current NER models are insufficient in recognising the entity boundaries (especially for long entities). Therefore, a solution to the BRE problem is urgently needed.

As the official language of Malaysia, Brunei, and Singapore, Malay belongs to the Malayo-Polynesian branch of the Austronesian language family. According to Word tipsFootnote1, the total number of Malay speakers is over 19 million, of which 16 million are native speakers. Faced with massive amounts of raw Malay data, it is particularly important to utilise NLP technology to extract pattern information. Unfortunately, Malay is a low-resource language with scarce labelled resources and under-developed NLP technology. Malay NER applications are insufficient for the following reasons:

  • Lack of large NER datasets: The construction of Malay NER datasets was the subject of some earlier studies (Asmai et al., Citation2018; Ulanganathan et al., Citation2017). However, to the best of our knowledge, none of them is publicly available, and the data sizes are relatively small. In addition, fully manually labelled datasets are time- and expertise-intensive to build, making them difficult to apply widely to low-resource languages.

  • Poor development of NER techniques: As discussed above, neural techniques rely heavily on abundant labelled data. Malay possesses quite limited labelled resources, which hinders the application of neural methods to Malay (Morsidi et al., Citation2015). Due to this data-hungry restriction, the majority of current studies are still based on rules and machine learning (ML).

Facing the challenges above, it is necessary to explore an effective dataset construction method with automated strategies that alleviate the cost of manual annotation, and to build neural NER models for Malay based on the constructed dataset. In terms of dataset construction, some low-resource languages have little labelled data while their homologous languages have certain labelled resources and share significant similarities with them, as with Spanish-Portuguese and Indonesian-Malay (Ranaivo-Malançon, Citation2006). In such cases, a potential dataset construction method is to borrow from the datasets of the homologous language. To that end, we propose a dataset construction framework for Malay NER, yielding a Malay NER dataset (MS-NER) of 20,146 sentences. The framework has two components: (1) preliminary dataset generation using homologous-language labelled datasets and rules, and (2) iterative optimisation, a semi-manual technique that iteratively refines the dataset using NER models and manual audits. This is a trade-off method: compared with fully manual construction it reduces the overhead of manual annotation, and compared with fully automated methods it ensures dataset quality to a certain degree by means of manual audits. It could potentially work for any language whose homologous languages have labelled datasets.

As for Malay NER models, based on the constructed dataset, we first apply multiple advanced neural methods for Malay NER and systematically compare and analyse their performance. In addition, to alleviate the negative impact brought by BRE, we propose a neural multi-task (MT) framework with a bidirectional revision (Bi-revision) mechanism (MTBR). Instead of treating boundaries as auxiliary features, we treat boundary detection (BD) as an auxiliary task and take advantage of the relationships between the two tasks in a more sophisticated manner: (1) multi-task learning provides general representations of both tasks and has a regularisation effect that can effectively reduce overfitting to NER; (2) we leverage explicit boundary information to guide the model to locate and classify entities precisely. Specifically, the label probabilities of the BD task explicitly revise those of the NER task (the main task). However, one main concern is that BD prediction errors may negatively affect the NER outputs. To tackle this error propagation issue, we introduce a gated ignoring mechanism that uses the hidden features of the NER module to determine the degree of the BD revision. Gated mechanisms are widely used in multiple applications (Zhao et al., Citation2022). Furthermore, a random probability is utilised to control whether the gate mechanism takes effect.

The following are this paper's main contributions:

  1. A dataset construction framework using homologous-language labelled datasets and iterative optimisation is proposed for Malay NER. The framework could be potentially applied to some specific homologous language pairs.

  2. A large Malay NER dataset (MS-NER), which is composed of 20,146 sentences, is constructed and would be publicly available as a benchmark dataset.

  3. A neural multi-task (MT) framework with a bidirectional revision (Bi-revision) mechanism, MTBR, is presented to tackle the Malay NER task. MTBR achieves competitive performance compared with multiple single-task and multi-task baselines on MS-NER. Besides, we also evaluate MTBR on more languages, including English (a representative high-resource language) and Indonesian (a language closely related to Malay), to show its effectiveness.

The structure of this paper is as follows: Section 2 reviews related work on NER datasets and techniques for high-resource and low-resource languages. The procedure for constructing the MS-NER dataset is thoroughly described in Section 3. The specifics of the MTBR framework are discussed in Section 4. Section 5 elaborates on the experimental setup and analyses the results. Finally, Section 6 draws conclusions.

2. Related work

We briefly cover the work associated with this research in this section. We first review the well-known datasets of resource-rich languages and the construction methods for low-resource language datasets. Then we introduce some currently widely used and effective NER models and review some approaches for low-resource NER. Finally, we review some related works on Malay NER.

2.1. Dataset

2.1.1. Well-known NER datasets

Over recent years, quite a few NER datasets have been proposed. Here are some widely used datasets:

  • CoNLL-2003 (Sang & Meulder, Citation2003) is considered to be one of the most widely used NER datasets for English and German. The dataset comes from news sentences in the Reuters RCV1 corpus and includes four coarse-grained entity types: person, location, organisation, and miscellaneous;

  • OntoNotes (Hovy et al., Citation2006) is a fairly large dataset for multiple languages. It consists of many annotation layers, including parse trees, relations, word sense disambiguation, and entity types. As for the NER task, the OntoNotes dataset comprises 18 fine-grained entity types such as organisation, location, percentage, date, etc. With over 2,945,000 tokens in total and a diverse range of data sources, it is one of the largest and most difficult benchmark datasets for NER.

  • WNUT2017 shared task (Derczynski et al., Citation2017) focuses on identifying peculiar, never-before-seen entities in the context of developing debates. It focuses on six entity types: location, person, group, corporation, product, and creative work.

These datasets are completely manually annotated, mainly for high-resource languages. However, building an entirely manually labelled dataset takes time and a great deal of specialised resources, which makes this approach difficult to apply widely to resource-poor languages. To reduce the expense of manual annotation, it is necessary to investigate efficient dataset construction methods with automated techniques.

2.1.2. Low-resource NER datasets

To reduce the workload of manual annotation, one ideal solution is to automatically annotate the dataset. For example, a method is proposed by Menezes et al. (Citation2019) to automatically annotate a labelled dataset for NER using DBpedia and Wikipedia links and structured data. Owing to the enormous size of these sources, the resulting dataset has a total of 87,769,158 tokens. In addition, distant supervision (Mintz et al., Citation2009) is also an effective way to automatically construct labelled datasets by making use of domain gazetteers. Alfina et al. (Citation2016) proposed a dataset construction framework based on distant supervision for Indonesian NER. They first annotate a seed NER dataset with the gazetteer in Indonesian DBpedia and then propose some rules to correct the noisy labels in the seed dataset. ANEA (Hedderich et al., Citation2021) is another tool that provides the functionality to apply distant supervision in practice for many languages and named entity types. It also proposes a framework to iteratively improve the dataset and model during model training. Though automatically annotating datasets can substantially reduce the annotation load, some noise may remain in these fully automatic datasets without manual verification.

Tamil and Indonesian are representative Malay-related languages. Specifically, Tamil is a low-resource language which is also spoken in Singapore and Malaysia. In order to develop benchmark datasets for Indian languages, AU-KBC conducted NER shared tasks at the Forum for Information Retrieval Evaluation (FIRE) 2013 and 2014Footnote2. The dataset is publicly available in English as well as in four Indian languages, including Bengali, Hindi, Malayalam, and Tamil. The NER benchmark for Tamil consists of 6,096 and 1,318 sentences for training and testing, respectively. As for Indonesian, some studies present their efforts on Indonesian NER dataset construction with a series of automatic strategies (Alfina et al., Citation2016, Citation2017; Luthfi et al., Citation2014). The final modified dataset comprises training, development, and test sets of 19,000, 1,240, and 737 sentences, respectively. In addition, Fu et al. (Citation2021) propose a method based on distant supervision and manual audit to build a large Indonesian NER dataset with 50,098 sentences. Since Malay and Indonesian come from the same language family and share a high language similarity, we can potentially construct a Malay NER dataset based on current large Indonesian NER datasets.

2.1.3. Malay NER datasets

Recently, some works have presented efforts on Malay NER dataset construction. Nevertheless, to the best of our knowledge, their quality and size are not sufficient to support the development of neural technology. Additionally, none of them has been released to the community. Table 1 shows the detailed dataset statistics. “/” indicates that the information is not available.

Table 1. Malay NER datasets.

Taking the above issues into account, this work seeks to construct a labelled dataset for Malay by automated means, while introducing a small amount of expert effort to ensure the quality of the dataset. When investigating related datasets, we found that Malay has little labelled resource, while Indonesian (a homologous language of Malay) has certain labelled resources (Alfina et al., Citation2016, Citation2017; Fu et al., Citation2021), and there are significant similarities between the two languages. Therefore, a potential dataset construction method is to borrow from the Indonesian datasets. This method can also be extended to other homologous languages.

2.2. Method

In this subsection, we first discuss some existing NER models for high-resource and low-resource languages, and then provide a thorough literature review on Malay NER.

2.2.1. NER for high-resource language

There has been an increase in NER research in recent years. Conditional Random Fields (CRF) (McCallum, Citation2003) and Hidden Markov Models (Zhou & Su, Citation2002) with handcrafted features are examples of traditional ML approaches to NER. Modern neural NER methods combine character embeddings produced by a CNN layer or Bi-LSTM layers with pre-trained word embeddings (Mikolov et al., Citation2013; Pennington et al., Citation2014). In Lample et al. (Citation2016) and Ma and Hovy (Citation2016), these features are fed into a Bi-LSTM layer, optionally followed by a CRF layer.

Recently, fine-tuned pre-trained language models (PLMs) such as ELMO (Akbik et al., Citation2018) and BERT (Devlin et al., Citation2019) are also popular for NER. These models may encode semantic and syntactic information of tokens in the context. Later, NER is carried out using these context-aware embeddings. In particular, these PLMs are connected to a softmax classifier to perform entity label classification.

Most of these methods are evaluated on the large CoNLL-2003 and OntoNotes datasets. However, current NER models treat NER as a sequence labelling task without explicit use of entity boundary information, and hence they are insufficient in recognising entity boundaries, especially for long entities.

2.2.2. NER for low-resource language

As mentioned above, most languages, especially low-resource languages, do not have enough labelled data to train such fully supervised models. Most NER models for low-resource languages are therefore based on data augmentation, distant supervision, or cross-lingual transfer (Hedderich et al., Citation2021).

Data augmentation modifies features of existing samples without changing the label, thereby obtaining new samples. Multiple data augmentation techniques have been proposed for low-resource NER (Dai & Adel, Citation2020), such as word replacement (Wei & Zou, Citation2019), mention replacement (Raiman & Miller, Citation2017), word swapping (Wei & Zou, Citation2019), generative models (Xia et al., Citation2019), etc.

Cross-lingual NER has recently been put forward as a viable option to effectively address the data-hungry issue for NER with limited resources. This method attempts to transfer knowledge from a source language, which is typically a high-resource language with a wealth of labelled data, to a target language, which is typically a low-resource language with little to no labelled data. Cross-lingual NER techniques can roughly be divided into three categories: knowledge distillation (KD)-based, data transfer-based, and model transfer-based.

  • Data transfer-based approach aims to build a pseudo labelled dataset for a target language by projecting alignment information such as parallel resources and machine translation from a source language (Jain et al., Citation2019; Mayhew et al., Citation2017; Ni et al., Citation2017). Then, the target NER model is trained on this pseudo labelled dataset.

  • Model transfer-based approach aims to directly apply the source-language trained model with language-independent features to target language (Keung et al., Citation2019; Wu & Dredze, Citation2019; Zirikly, Citation2015).

  • Recently, some works extend knowledge distillation (KD) in cross-lingual NER based on a teacher-student training framework (Wu et al., Citation2020aCitation2020b). The idea is to produce pseudo labels for target language data using a source-language NER model without external resources.

Due to large language differences (such as word order, grammar, etc.), cross-lingual NER often struggles to achieve satisfactory results in target languages.

As for the NER task of Tamil and Indonesian, open-source benchmark datasets facilitate the use of supervised methods. For Tamil NER, the FIRE 2014 shared task attracted multiple systems such as CRF (Prabhakar et al., Citation2014), SVM, and their combination (Abinaya et al., Citation2014). Naive Bayes is also used for Tamil NER (Srinivasan & Subalalitha, Citation2019). Recently, Anbukkarasi et al. (Citation2022) employ deep learning technology and construct a GRU model based on the Multilingual Universal Sentence Encoder for Tamil NER. The abundance of NER resources in Indonesian has brought about the widespread use of neural NER techniques. Specifically, Gunawan et al. (Citation2018) construct a hybrid Bi-LSTM and CNN model for Indonesian NER. The performance of neural models incorporating word- and character-level features in Indonesian conversational texts is examined by Kurniawan and Louvan (Citation2018). Besides, transfer learning has recently been introduced to tackle the task. To enhance the performance of Indonesian NER, Ikhwantri (Citation2019) suggests fine-tuning PLMs from high-resource to low-resource languages. Kosasih and Khodra (Citation2018) explore a transfer learning framework to transfer knowledge from a part-of-speech tagging system to a NER system.

2.2.3. Malay NER

Most current research on Malay NER is based on rules and ML. Alfred et al. (Citation2014) utilise a rule-based methodology to recognise named entities in Malay texts. This work centres on designing rules based on POS tagging features and contextual features. Ulanganathan et al. (Citation2017) use CRF to construct a Malay language NER engine (Mi-NER) based on a POS tagger. Furthermore, Sulaiman et al. (Citation2017) explore using Stanford NER and Illinois NER technologies to find Malay named entities. By combining fuzzy c-means and the K-nearest neighbours algorithm, Asmai et al. (Citation2018) propose an improved Malay NER model for crime texts. Salleh et al. (Citation2017) propose a conceptual model of automatic Malay NER that uses the CRF method to identify entities from unstructured Malay text data. By using a projection technique to project entity labels from an English corpus to a Malay corpus, Zamin and Bakar (Citation2015) conduct cross-lingual transfer from English to Malay. The Dice coefficient function and the bigram scoring technique with domain-specific rules are used for projection. Sharum et al. (Citation2011) propose to recognise names in unstructured Malay text documents based on POS features and semantic rules.

The models above are all based on different non-public datasets, and there is no authoritative benchmark for the Malay NER task. This paper aims to develop an effective dataset construction method to relieve the annotation workload for Malay language. In addition, based on the constructed dataset, an effective neural model based on multi-task learning for Malay NER is explored. The dataset and model have the potential to set up a new baseline for follow-up research.

3. Dataset construction

The foundation for the advancement of NLP technology is high-quality datasets. However, as far as we are aware, there are no open-source NER datasets for Malay. Additionally, fully human-annotated NER datasets are costly and time-consuming to build, and are therefore available only in small quantities. In order to lessen the load of manual annotation, a semi-manual strategy is introduced to build a Malay NER dataset.

3.1. Dataset design

Indonesian and Malay are close languages with high lexical similarity (Ranaivo-Malançon, Citation2006). Based on this language feature, we propose to construct a Malay preliminary dataset from the existing Indonesian datasets (Alfina et al., Citation2016, Citation2017; Fu et al., Citation2021). In addition, some rules from Alfred et al. (Citation2014) are leveraged to detect entities in Malay texts to expand the preliminary dataset. After that, a semi-manual method is proposed to iteratively update the dataset and the model (introduced in Section 4).

3.2. Data sources

We build the dataset using Malay news articles. We crawl articles from Malay news websites that cover a variety of subjects, such as politics, economics, society, military, etc. The websites are displayed in Table 2. We separate paragraphs into individual sentences and then choose 30,000 sentences at random for annotation. Besides, we construct a large Malay vocabulary with 81,691 tokens from the news articles.

Table 2. Data source.

3.3. Dataset construction

As shown in Figure 1, the construction process consists of two parts: preliminary construction and iterative optimisation.

  1. Collect the existing Indonesian NER datasets and separate them into individual sentences. Check whether all tokens of each sentence exist in the Malay dictionary. If so, add the sentence to the preliminary dataset; otherwise, discard it (see the sketch after this list).

  2. Expand the preliminary dataset with the unlabelled Malay sentences according to the rules. Only the sentences containing entities are considered.
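
The dictionary-filtering step (1) can be illustrated with a short script. The sketch below assumes the Indonesian source datasets are stored in a CoNLL-style token-per-line format and that the Malay vocabulary is available as a plain set of lowercased tokens; the file layout and function names are illustrative rather than the authors' actual tooling.

```python
def load_conll_sentences(path):
    """Read a CoNLL-style file (token ... label per line, blank line between sentences)."""
    sentences, tokens, labels = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                     # a blank line closes the current sentence
                if tokens:
                    sentences.append((tokens, labels))
                    tokens, labels = [], []
                continue
            parts = line.split()
            tokens.append(parts[0])
            labels.append(parts[-1])
    if tokens:
        sentences.append((tokens, labels))
    return sentences


def build_seed_dataset(indonesian_path, malay_vocab):
    """Keep a labelled Indonesian sentence only if every token appears in the Malay vocabulary."""
    seed = []
    for tokens, labels in load_conll_sentences(indonesian_path):
        if all(tok.lower() in malay_vocab for tok in tokens):
            seed.append((tokens, labels))
    return seed
```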

Figure 1. Dataset construction process. It consists of two parts: preliminary construction and iterative optimisation.


Through the above two steps, a seed dataset is constructed consisting of 20,146 sentences. However, the dataset constructed in these ways often suffers from mislabelling errors. Therefore, we propose the following steps to optimise the dataset.

  1. Use all labelled sentences for both model training and testing.

  2. Manually audit the sentences with conflicting labels in two stages. Two auditors are hired to audit each sentence in order to guarantee the quality of the audited sentences; if the two audit results conflict, a third auditor is asked to review the sentence. The annotators are native speakers of Malay or hold a post-graduate degree in Malay. The audit guideline is shown in Table 3.

  3. Re-train the model with the audited dataset.

  4. Repeat steps (1)–(3) until the dataset and the model converge, as sketched in the code below.
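
The loop below is a high-level, schematic sketch of steps (1)–(4), not the authors' actual pipeline: `train_fn`, `predict_fn`, and `audit_fn` are placeholder callables standing in for model training, inference, and the two-stage human audit described above.

```python
def iterative_optimisation(dataset, train_fn, predict_fn, audit_fn, max_iters=3):
    """dataset: list of (tokens, labels) pairs.
    train_fn(dataset) -> model; predict_fn(model, tokens) -> predicted labels;
    audit_fn(tokens, old_labels, predicted_labels) -> audited labels (human decision)."""
    model = None
    for _ in range(max_iters):
        model = train_fn(dataset)                     # step (1): train on all labelled sentences
        changed = False
        for idx, (tokens, labels) in enumerate(dataset):
            predicted = predict_fn(model, tokens)     # step (1): test on the same sentences
            if predicted != labels:                   # step (2): audit sentences whose labels conflict
                dataset[idx] = (tokens, audit_fn(tokens, labels, predicted))
                changed = True
        if not changed:                               # step (4): dataset and model have converged
            break
        # step (3): the next loop iteration re-trains on the audited dataset
    return dataset, model
```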

Table 3. Audit guideline.

3.4. Analysis of MS-NER

The dataset contains 20,146 sentences with three entity categories: person (e.g. Sabrina Shawali), location (e.g. Brunei Darussalam), and organisation (e.g. UMNO Bahagian Besut). In the actual setup, the number of iterations is set to 3 because after 3 iterations there is no difference between the results of the dataset in the training and testing phases, which in turn indicates that the dataset construction process has converged. To assess the corpus's quality, we randomly choose 2,000 sentences from the corpus for manual verification; the labelling error rate is 0.1%.

Building a seed dataset from existing datasets in a homologous language, combined with a semi-manual iterative optimisation strategy, can greatly increase the speed of dataset annotation and reduce the cost of manual annotation. Thus, the approach has the potential to migrate to specific language pairs and domain pairs in industry. One prerequisite, which tends to be a concern, is the existence of a homologous language for the target language and the availability of relevant labelled data for that homologous language. Also, the similarity between the target language and the homologous language needs to be high enough to ensure the quality of the seed dataset to a certain extent.

For the subsequent experiments, the dataset is split into training, validation, and test sets (80%, 10%, 10%) in BIO format. A labelled example is presented in Table 4, and a simple split routine is sketched below.
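
A simple split routine under these proportions might look as follows; the random seed and the in-memory sentence representation are illustrative assumptions.

```python
import random

def split_dataset(sentences, seed=42):
    """80/10/10 split of labelled sentences (each a (tokens, labels) pair)."""
    sentences = list(sentences)
    random.Random(seed).shuffle(sentences)
    n = len(sentences)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    return (sentences[:n_train],
            sentences[n_train:n_train + n_dev],
            sentences[n_train + n_dev:])
```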

Table 4. An example of MS-NER.

4. Model

4.1. Overview

Consider a sentence $S$ with $L$ tokens $\{w_i\}_{i=1}^{L}$. For the BD task, we assign each token $w_i$ a start label $y_s^i$ and an end label $y_e^i$, and for the NER task an entity label $y_{ner}^i$. In addition, we introduce a span classification in the BD module that assigns one of four entity labels (PER, LOC, ORG, and O), denoted $y_{au\_ner}$, to each span representation.

Following the overview in Figure 2, MTBR is a multi-task model consisting of two modules for the BD and NER tasks, respectively. The BD module is meant to recognise entity spans with two classifiers (for start and end predictions, respectively), along with another classifier that assigns the corresponding entity labels to the spans. The NER module predicts the NER labels for each token. The token representation from the encoder is shared across the two downstream tasks. In addition to implicitly improving the contextual representation of NER through the vanilla multi-task framework, the span classifier is designed to explicitly revise the NER outputs through a gated ignoring mechanism. The two tasks are trained simultaneously.

Figure 2. The MTBR framework structure. Due to space limitation, [B-P, I-P, B-L, I-L, B-O, I-O, O] in the figure represents [B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG, OTHER].


4.2. Encoder

The pre-trained BERT (Devlin et al., Citation2019) is utilised as our encoder for contextual representations $h_i$. It is noted that other network structures such as LSTM are also suitable for the encoder here. It is formulated as:
$$h_i = \mathrm{BERT}(w_i) \qquad (1)$$
After obtaining the token representations from the encoder, we apply two separate MLPs to create different representations ($h_i^{bd}$ / $h_i^{ner}$) for the (BD / NER) modules:
$$h_i^{bd} = \mathrm{MLP}_{bd}(h_i) \qquad (2a)$$
$$h_i^{ner} = \mathrm{MLP}_{ner}(h_i) \qquad (2b)$$
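
A minimal PyTorch sketch of Equations (1)–(2b) is given below. The checkpoint name, projection size, and activation are illustrative assumptions: the paper uses a Bahasa-BERT encoder, and the exact design of the two MLPs is not specified here.

```python
import torch.nn as nn
from transformers import AutoModel

class SharedEncoder(nn.Module):
    """h_i = BERT(w_i); h_i^bd = MLP_bd(h_i); h_i^ner = MLP_ner(h_i)."""

    def __init__(self, model_name="bert-base-multilingual-cased", task_dim=256):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        self.mlp_bd = nn.Sequential(nn.Linear(hidden, task_dim), nn.GELU())   # BD-specific projection
        self.mlp_ner = nn.Sequential(nn.Linear(hidden, task_dim), nn.GELU())  # NER-specific projection

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.mlp_bd(h), self.mlp_ner(h)   # each: (batch, seq_len, task_dim)
```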

4.3. BD module

4.3.1. Standard BD

We use two token-wise classifiers to predict the start and end positions of entities. The contextual representation $h_i^{bd}$ is sent to an MLP classifier for boundary label prediction of $w_i$:
$$p_s^i = \mathrm{softmax}(W_s h_i^{bd} + b_s) \qquad (3a)$$
$$p_e^i = \mathrm{softmax}(W_e h_i^{bd} + b_e) \qquad (3b)$$
where $W_s$ and $W_e$ are fully connected matrices, and $b_s$ and $b_e$ are bias vectors.

4.3.2. Span classification

After obtaining the entity boundaries, to better prompt the training of the NER task, we further introduce a task that classifies the spans into the corresponding entity labels. We define the consecutive tokens between the nearest pair of start and end boundaries as a to-be-labelled entity and the tokens outside the boundaries as non-entities. Assuming that there are $K$ spans in sentence $S$, we calculate the $k$-th summarised span representation $v_{sp}^k$ by averaging the task-specific representations within its corresponding boundaries $(i, j)$:
$$v_{sp}^{k} = \frac{1}{j - i + 1}\sum_{t=i}^{j} h_t^{bd} \qquad (4)$$
The span representation $v_{sp}^k$ is then fed into an MLP classifier to predict its entity tag:
$$p_{sp}^{k} = \mathrm{softmax}(W_{sp} v_{sp}^{k} + b_{sp}) \qquad (5)$$
where $W_{sp}$ is a fully connected matrix and $b_{sp}$ is a bias vector.
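
The BD module of Equations (3a)–(5) can be sketched as follows. The pairing of the nearest start/end boundaries is assumed to happen outside the module (spans are passed in as index triples), and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class BoundaryDetection(nn.Module):
    """Start/end classifiers (Eqs. 3a-3b) plus span classification (Eqs. 4-5)."""

    def __init__(self, task_dim=256, n_span_labels=4):   # PER, LOC, ORG, O
        super().__init__()
        self.start_clf = nn.Linear(task_dim, 2)           # p_s^i: start vs. non-start
        self.end_clf = nn.Linear(task_dim, 2)             # p_e^i: end vs. non-end
        self.span_clf = nn.Linear(task_dim, n_span_labels)

    def forward(self, h_bd, spans):
        # h_bd: (batch, seq_len, task_dim); spans: list of (batch_index, i, j) triples
        # pairing the nearest predicted start/end boundaries.
        p_start = torch.softmax(self.start_clf(h_bd), dim=-1)
        p_end = torch.softmax(self.end_clf(h_bd), dim=-1)
        span_probs = None
        if spans:
            reps = [h_bd[b, i:j + 1].mean(dim=0) for b, i, j in spans]            # v_sp^k, Eq. (4)
            span_probs = torch.softmax(self.span_clf(torch.stack(reps)), dim=-1)  # Eq. (5)
        return p_start, p_end, span_probs
```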

4.3.3. Loss

For the BD task, we minimise two cross-entropy (CE) losses for the start and end boundaries, $\mathcal{L}_{bd} = \mathcal{L}_{bd}^{s} + \mathcal{L}_{bd}^{e}$, and a CE loss $\mathcal{L}_{sp}$ for span classification. The total loss of this module is their weighted sum, with $w_1 = 0.5$ in this paper:
$$\mathcal{L}_{bd}^{s} = \frac{1}{L}\sum_{i=1}^{L}\mathrm{CE}(p_s^i, y_s^i) \qquad (6a)$$
$$\mathcal{L}_{bd}^{e} = \frac{1}{L}\sum_{i=1}^{L}\mathrm{CE}(p_e^i, y_e^i) \qquad (6b)$$
$$\mathcal{L}_{sp} = \frac{1}{K}\sum_{k=1}^{K}\mathrm{CE}(p_{sp}^k, y_{au\_ner}^k) \qquad (7)$$
$$\mathcal{L}_1 = w_1 \mathcal{L}_{bd} + (1 - w_1)\,\mathcal{L}_{sp} \qquad (8)$$

4.4. NER module

4.4.1. Standard NER

The contextual representation $h_i^{ner}$ for the NER module is sent to an MLP classifier for entity label prediction of $w_i$:
$$p_{ner}^{i} = \mathrm{softmax}(W_{ner} h_i^{ner} + b_{ner}) \qquad (9)$$
where $W_{ner}$ is a fully connected matrix and $b_{ner}$ is a bias vector.

4.4.2. Bi-revision mechanism

In addition to implicitly improving the NER performance within the vanilla multi-task framework, we assume that the output probabilities of the span classification can further revise the result of the NER module. Conversely, to reduce the error propagation caused by the BD task, the NER module is conducive to verifying the revision of the BD task. Thus, we utilise the label probabilities of the span classification through a gated ignoring mechanism to obtain an adjusted probability for the NER module. Specifically, we first need to align the label probabilities of the span classification and NER tasks, as in Figure 3. For each token, the span classification module outputs the probability vector $p_{sp} = [p_{per}, p_{loc}, p_{org}, p_{o}]$ over $y_{au\_ner}^i$. Each type of $y_{au\_ner}^i$ except for the Other label corresponds to two $y_{ner}^i$ labels (for instance, the PER label corresponds to the B-PER and I-PER labels). So we transform $p_{sp}$ into $p_{sp\_new} = [p_{b\text{-}per}, p_{i\text{-}per}, p_{b\text{-}loc}, p_{i\text{-}loc}, p_{b\text{-}org}, p_{i\text{-}org}, p_{o}]$. If $w_i$ is the first token of the detected entity, the values of $(p_{per}, p_{loc}, p_{org}, p_{o})$ are assigned to $(p_{b\text{-}per}, p_{b\text{-}loc}, p_{b\text{-}org}, p_{o})$; otherwise, they are assigned to $(p_{i\text{-}per}, p_{i\text{-}loc}, p_{i\text{-}org}, p_{o})$. Only the label probabilities of the predicted spans in the BD module affect those in the NER module. After obtaining the aligned probability $p_{sp\_new}^i$, we calculate the revised NER probability as:
$$\tilde{p}_{ner}^{i} = p_{ner}^{i} + p_{sp\_new}^{i} \qquad (10)$$
As stated above, we introduce the gated ignoring mechanism to alleviate the error propagation of the BD outputs. It consists of two components: a gate mechanism that determines the degree of probability revision,
$$gate_i = \mathrm{sigmoid}(W_g h_i^{ner} + b_g) \qquad (11)$$
$$\tilde{p}_{ner}^{i} = p_{ner}^{i} + gate_i \cdot p_{sp\_new}^{i} \qquad (12)$$
where $W_g$ is a fully connected matrix and $b_g$ is a bias vector; and a random probability $p$ with a threshold $\alpha$ that controls whether the gate mechanism takes effect:
$$p_{final\_ner}^{i} = \begin{cases} \tilde{p}_{ner}^{i}, & p > \alpha \\ p_{ner}^{i}, & \text{otherwise} \end{cases} \qquad (13)$$
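
A simplified sketch of the probability alignment and the gated ignoring mechanism (Equations (10)–(13)) is shown below. It operates on flattened per-token tensors, uses the label order [B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG, O], and assumes that the aligned span probabilities are zero for tokens outside predicted spans; these details are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class GatedBiRevision(nn.Module):
    """Aligns 4-way span probabilities to the 7-way BIO space and gates the revision."""

    def __init__(self, task_dim=256, alpha=0.5):
        super().__init__()
        self.gate = nn.Linear(task_dim, 1)   # gate_i = sigmoid(W_g h_i^{ner} + b_g), Eq. (11)
        self.alpha = alpha

    @staticmethod
    def align(p_span, is_first_token):
        # p_span: (n, 4) probabilities over [PER, LOC, ORG, O] for tokens inside predicted spans;
        # is_first_token: (n,) bool, True for the first token of each predicted span.
        # Output: (n, 7) probabilities over [B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG, O].
        n = p_span.size(0)
        aligned = p_span.new_zeros(n, 7)
        b_idx = torch.tensor([0, 2, 4], device=p_span.device)          # B-* columns
        i_idx = torch.tensor([1, 3, 5], device=p_span.device)          # I-* columns
        cols = torch.where(is_first_token.unsqueeze(1), b_idx, i_idx)  # (n, 3)
        aligned.scatter_(1, cols, p_span[:, :3])                       # place PER/LOC/ORG mass
        aligned[:, 6] = p_span[:, 3]                                   # O mass is copied directly
        return aligned

    def forward(self, p_ner, h_ner, p_span_aligned):
        # p_ner, p_span_aligned: (n, 7); h_ner: (n, task_dim).
        # Rows of p_span_aligned for tokens outside predicted spans are assumed to be zero.
        gate = torch.sigmoid(self.gate(h_ner))                         # (n, 1)
        revised = p_ner + gate * p_span_aligned                        # Eq. (12)
        # Eq. (13): a random draw against threshold alpha decides whether the revision is used.
        return revised if torch.rand(1).item() > self.alpha else p_ner
```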

Figure 3. Probability alignment. If $w_i$ is the first token of the detected entity, the probabilities flow along the black arrows; otherwise they flow along the red arrows.


4.4.3. Loss

We minimise a cross-entropy loss $\mathcal{L}_2$ on $p_{final\_ner}$ for the NER task:
$$\mathcal{L}_2 = \frac{1}{L}\sum_{i=1}^{L}\mathrm{CE}(p_{final\_ner}^{i}, y_{ner}^{i}) \qquad (14)$$

4.5. Training and inference

In each iteration, the proposed model simultaneously trains the two tasks (BD and NER) end-to-end. The overall training loss is as follows:
$$\mathcal{L} = \mathcal{L}_1 + \mathcal{L}_2 \qquad (15)$$
During inference, we utilise the output of the NER module as the final output.
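
The combined objective can be sketched as below. It follows Equations (6a)–(8), (14), and (15) literally, treating the module outputs as probabilities and computing cross-entropy as the negative log-likelihood of their logs; whether the revised NER probabilities are re-normalised in the original implementation is not specified, so this is only an approximation.

```python
import torch
import torch.nn.functional as F

def total_loss(p_start, p_end, p_span, p_final_ner,
               y_start, y_end, y_span, y_ner, w1=0.5, eps=1e-9):
    # p_start, p_end: (batch, seq_len, 2); p_final_ner: (batch, seq_len, 7); p_span: (K, 4).
    # Targets are class indices with matching leading dimensions.
    l_bd = (F.nll_loss(torch.log(p_start + eps).transpose(1, 2), y_start) +
            F.nll_loss(torch.log(p_end + eps).transpose(1, 2), y_end))       # L_bd = L_bd^s + L_bd^e
    l_sp = F.nll_loss(torch.log(p_span + eps), y_span)                       # L_sp, Eq. (7)
    l1 = w1 * l_bd + (1 - w1) * l_sp                                         # Eq. (8)
    l2 = F.nll_loss(torch.log(p_final_ner + eps).transpose(1, 2), y_ner)     # Eq. (14)
    return l1 + l2                                                           # Eq. (15)
```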

5. Experiment

5.1. Baseline models

Most current Malay NER models are based on rules and ML. Because of their high cost and poor portability, they are not used as baselines in this paper. We compare MTBR with some advanced NER single-task and multi-task models:

  • In NeuralNER (Lample et al., Citation2016), word features are extracted using a Bi-LSTM layer, and the features are subsequently fed to a CRF layer for entity label prediction.

  • A CNN layer is used by CNNs-BiLSTM (Chiu & Nichols, Citation2016) to extract character features. These features are sent to a Bi-LSTM layer, and then a softmax layer is used to make the final prediction.

  • For the CNNs-Bi-LSTM-CRF (Ma & Hovy, Citation2016) model's final prediction, a CRF layer is used rather than a softmax layer.

  • ELMO-softmax (Akbik et al., Citation2018) uses a pre-trained ELMO model as the encoder to extract context-aware features. These features are passed to a softmax classifier for final prediction.

  • BERT-softmax (Devlin et al., Citation2019) is similar to ELMO-softmax and uses a pre-trained BERT model as the encoder.

  • BERT-CRF is similar to BERT-softmax, which uses a CRF layer instead of a softmax classifier for final prediction.

  • BERT-MRC (Li et al., Citation2020) formulates the NER task as a machine reading comprehension (MRC) task and proposes a unified MRC framework for different kinds of NER tasks.

  • For the purpose of detecting entity boundaries, Bi-LSTM-PN (Li et al., Citation2021) employs BiLSTM as the encoder and another LSTM with pointer networks as the decoder. The predicted entity chunks are then categorised using a softmax classifier. We re-implement this baseline in accordance with Li et al. (Citation2021) by training two tasks in a multi-task framework and adding one softmax layer for the NER task.

  • Another multi-task system called MT-BERT (Zhao et al., Citation2019) uses pre-trained BERT as the encoder to jointly train a number of tasks in an end-to-end manner. In this study, we use this system to jointly train NER and BD problems.

  • MT-MED (Zhao et al., Citation2019) is another multi-task framework with explicit interaction between multiple tasks. In this paper, we re-implement this framework to jointly train BD and NER tasks.

We also train a GloVe (Pennington et al., Citation2014) embedding for NeuralNER, CNNs-BiLSTM, and CNNs-Bi-LSTM-CRF as well as an ELMO model for ELMO-softmax with massive Malay news articles. Additionally, Bahasa-BERTFootnote3 is utilised as the encoder for all BERT-based models.

5.2. Model setups

Our model is implemented in PyTorchFootnote4 with HuggingFace's TransformersFootnote5 package on a single NVIDIA Quadro RTX 8000. The model setups are shown in Table 5. As for the evaluation metrics, we report the precision, recall, and F1 score of multiple NER models.
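
Entity-level precision, recall, and F1 for BIO-tagged outputs can be computed with the seqeval library, as in the small sketch below; the gold and predicted tag sequences here are invented purely for illustration.

```python
from seqeval.metrics import precision_score, recall_score, f1_score

# Invented gold and predicted BIO tag sequences for two short sentences.
y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["B-ORG", "I-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG"], ["B-ORG", "I-ORG", "O"]]

print("P:", precision_score(y_true, y_pred),
      "R:", recall_score(y_true, y_pred),
      "F1:", f1_score(y_true, y_pred))
```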

Table 5. Model settings.

5.3. Main performance

The first section of Table 6 presents the results of multiple previous top-performing single-task NER systems. We can see that the CNN layer for character-level representations brings some improvements to NER performance, whereas adding a CRF layer on top of Bi-LSTM-CNNs does not bring significant improvements. Among these previous studies, fine-tuned language models (ELMO, mBERT, and Bahasa-BERT) significantly outperform classic neural models represented by Bi-LSTM-CNNs-CRF. In addition, it is worth noting that ELMO works slightly better than mBERT; we attribute this to the fact that our ELMO is Malay-oriented and pre-trained on Malay news articles, whereas mBERT is pre-trained on corpora of multiple languages and focuses on learning language-independent knowledge, leaving it with insufficient language-specific knowledge for Malay NER. Besides, Bahasa-BERT achieves the best performance among all language models, so we leverage it as our encoder for the follow-up experiments. Among the Bahasa-BERT-based models, Bahasa-BERT-CRF tends to work slightly better than Bahasa-BERT-softmax in this case, and Bahasa-BERT-MRC achieves the best performance among all single-task frameworks.

Table 6. Main performance.

The second part of Table 6 reports the results of some multi-task models. In most cases, the multi-task models outperform the single-task models with the same encoder (Bi-LSTM-PN vs Bi-LSTM-CNNs, Bi-LSTM-MT-MED vs Bi-LSTM-CNNs-CRF, MT-BERT vs Bahasa-BERT-softmax). We also experiment with different encoders for the proposed MTBR framework, and the experimental results show that MTBR consistently achieves performance gains over each single-task model, ranging from 1.44% to 1.81% in F1 score. This illustrates the effectiveness of the BD auxiliary task, which provides knowledge that can significantly improve NER performance.

Taking a closer look at the multi-task models, we can see that MT-BERT outperforms Bi-LSTM-PN and MT-MED, which should be credited to pre-trained language models represented by BERT. In addition, MTBR(Bi-LSTM-CNNs) improves the F1 score by 1.44% and 0.60% over Bi-LSTM-PN and MT-MED, respectively, indicating that the proposed framework stimulates the potential of the BD task for the NER task through a more advanced form of interaction. MTBR(Bahasa-BERT) obtains the best performance, improving by more than 1.3% in F1 score over the best baseline model MT-BERT. Because of the novel design of the Bi-revision mechanism, MTBR can utilise not only the general representations of the different tasks but also their label probabilities. Hence, MTBR obtains significant improvements compared with both multi-task and single-task models. Meanwhile, MTBR(Bahasa-BERT) tends to outperform multiple baselines and could be regarded as a new baseline method.

5.4. Ablation study

To evaluate the effectiveness of different components in MTBR, we remove certain parts of the model for the ablation study, and the results are presented in Table 7. The results show that almost every component boosts the whole MTBR, suggesting that MTBR makes full use of the advantages of multi-task learning from different perspectives. Removing any of them generally leads to a performance drop. Our analysis of the observations is as follows.

  1. w/o Boundary Detection: in this experiment, we remove the BD task and MTBR degenerates into a vanilla single-task model. Within the multi-task framework, the BD task contributes to boosting the NER performance with an improvement of 1.81% in F1 score, which should be credited to its BRE correction capability.

  2. w/o Bi-revision: in this experiment, we remove the Bi-revision mechanism and the model degenerates into a vanilla multi-task model. This setting causes a performance drop of 1.35% in F1 score. However, the model still achieves some performance gains compared to the vanilla single-task model. This illustrates that adding the BD auxiliary task can improve NER performance, but training both tasks simultaneously without additional strategies may prevent the two tasks from interacting sufficiently and cannot fully exploit the potential of the BD task. In contrast, Bi-revision can explicitly fuse the outputs of the two tasks, allowing the BD output to effectively revise the NER output, thus further boosting the NER task.

  3. w/o Gated Ignoring Mechanism: in this experiment, we remove $gate_i$ and let the BD output probability add fully to the NER output probability as in Equation (10). This leads to a performance drop of 0.57% in F1 score. The gated ignoring mechanism can verify the correctness of the revision and control the degree of revision from the auxiliary task. It can well alleviate the error propagation caused by prediction errors in the BD module.

  4. w/o random probability: in this experiment, we remove the random probability and use the revised probability all the time as in Equation (12). Using the gated ignoring mechanism alone seems to enhance the model to some extent, and using a random probability to control whether BD revision is applied could further reduce error propagation. However, how BD revision could be more effectively controlled needs further exploration, which is left for future work.

Table 7. Performance of different modules.

5.5. BRE analysis

Since we explicitly introduce an auxiliary task of boundary recognition, BRE tends to decline. Table 8 compares the ratios of BRE in different models. Our proposed MTBR framework with the BD auxiliary task can significantly reduce BRE, thus improving NER performance.

Table 8. BRE analysis.

5.6. Case study

Table 9 shows some representative cases for different modules of the proposed MTBR. From the first and second examples, we can see that the BD module can effectively revise the boundary and label of the entity in the NER module. In the first example, “Tan Sri Dr Mohd Irwan Seregar Abdullah” is an extra-long person entity, and feeding it directly into the NER model for prediction would cause a boundary detection error. In contrast, the BD module correctly identifies the boundary of the entity because it explicitly learns the entity's boundary information; it then passes this information through the Bi-revision mechanism to the NER module and effectively revises the entity boundaries. In the second example, the polysemous word “Thailand” is likely to lead to ambiguity: “Thailand” by itself refers to a LOC, but in “Kelab Wartawan Asing Thailand” it is part of an ORG. The NER module incorrectly identifies “Thailand” as a LOC and “Kelab Wartawan Asing” as an ORG, resulting in an incorrect entity boundary, which is effectively revised by the BD module. Moreover, in the third example, the BD module incorrectly identifies “Brahim's Dewina” as a PER while the NER module correctly identifies it as an ORG; if the output of the BD module were simply used to revise the output of the NER module, it would cause error transmission. In MTBR, however, the gate mechanism of the NER module can lower the degree to which information from the BD module is transmitted and thereby reject its revision. This demonstrates that our model can revise the output of the NER module through the BD module while effectively preventing error transmission from the BD module.

Table 9. Case study. MTBR gives all correct predictions in these cases.

5.7. Evaluation on more languages

To verify the effectiveness of the proposed MTBR model, we compare MTBR and the baseline methods on more languages, including a representative high-resource language, English (en, CoNLL-2003 dataset), and a Malay-related language, Indonesian (id, IDNER dataset). Notably, we use BERT-base-casedFootnote6 and mBERT for English and Indonesian, respectively. The F1 scores are shown in Table 10, and we can see that in addition to MS-NER, the proposed MTBR model also works well on the two evaluated datasets. Besides, MTBR with BERT as the encoder achieves the best performance, with F1 scores of 93.36% and 91.13% in English and Indonesian, respectively. This demonstrates the effectiveness of MTBR and that it can be widely extended to languages other than Malay.

Table 10. MTBR performance on more languages.

6. Conclusion

For Malay named entity recognition (NER), we present a dataset construction framework based on homologous-language labelled datasets and iterative optimisation to construct a Malay NER dataset (MS-NER). The proposed framework tends to work for languages whose homologous languages have labelled datasets. In addition, previous studies demonstrate the potential benefits of the boundary detection (BD) task for NER. Based on MS-NER, we further explore how the BD task can improve NER performance and propose a neural multi-task framework with a bidirectional revision (Bi-revision) mechanism (MTBR), taking both model transfer and label transfer of the BD task into account to improve NER performance. We also systematically evaluate and analyse some advanced neural NER models widely used for English on MS-NER. Experimental results show that MTBR obtains competitive performance over multiple baselines on MS-NER. The dataset and model will be made available to the general public as a new benchmark.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the Science and Technology Program of Guangzhou [grant number 202002030227] and GDUFS Yunshan Office on Public Opinion, one of the Internet culture demonstration projects of higher education institutions in Guangdong.

Notes

References

  • Abinaya, N., John, N., Ganesh, B. H. B., Kumar, A. M., & Soman, K. P. (2014). AMRITA_CENFIRE-2014: Named entity recognition for Indian languages using rich features. In Proceedings of the forum for information retrieval evaluation (pp. 103–111). Association for Computing Machinery. https://doi.org/10.1145/2824864.2824882.
  • Akbik, A., Bergmann, T., & Vollgraf, R. (2019). Pooled contextualized embeddings for named entity recognition. In J. Burstein, C. Doran, and T. Solorio (Eds.), Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies, NAACL-HLT 2019, June 2–7 (Vol. 1, pp. 724–728). Association for Computational Linguistics. https://doi.org/10.18653/v1/n19-1078.
  • Akbik, A., Blythe, D., & Vollgraf, R. (2018). Contextual string embeddings for sequence labelling. In E. M. Bender, L. Derczynski, and P. Isabelle (Eds.), Proceedings of the 27th international conference on computational linguistics, COLING 2018, August 20–26 (pp. 1638–1649). Association for Computational Linguistics. https://aclanthology.org/C18-1139/.
  • Alfina, I., Manurung, R., & Fanany, M. I. (2016). DBpedia entities expansion in automatically building dataset for Indonesian NER. In 2016 international conference on advanced computer science and information systems (ICACSIS) (pp. 335–340). IEEE. https://doi.org/10.1109/ICACSIS.2016.7872784.
  • Alfina, I., Savitri, S., & Fanany, M. I. (2017). Modified DBpedia entities expansion for tagging automatically NER dataset. In 2017 international conference on advanced computer science and information systems (ICACSIS) (pp. 216–221). IEEE. https://doi.org/10.1109/ICACSIS.2017.8355036.
  • Alfred, R., Leong, L. C., On, C. K., & Anthony, P. (2014). Malay named entity recognition based on rule-based approach. International Journal of Machine Learning and Computing, 4(3). https://doi.org/10.7763/IJMLC.2014.V4.428
  • Aliod, D. M., van Zaanen, M., & Smith, D. (2006). Named entity recognition for question answering. In L. Cavedon and I. Zukerman (Eds.), Proceedings of the Australasian language technology workshop, ALTA 2006, November 30–December 1 (pp. 51–58). Australasian Language Technology Association. https://aclanthology.org/U06-1009/.
  • Anbukkarasi, S., Varadhaganapathy, S., Jeevapriya, S., Kaaviyaa, A., Lawvanyapriya, T., & Monisha, S. (2022). Named entity recognition for Tamil text using deep learning. In 2022 international conference on computer communication and informatics (ICCCI) (pp. 1–5). https://doi.org/10.1109/ICCCI54379.2022.9740745.
  • Asmai, S. A., Salleh, M. S., Basiron, H., & Ahmad, S. (2018). An enhanced Malay named entity recognition using combination approach for crime textual data analysis. International Journal of Advanced Computer Science and Applications, 9(9). https://doi.org/10.14569/issn.2156-5570
  • Chiu, J. P. C., & Nichols, E. (2016). Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4, 357–370. https://doi.org/10.1162/tacl_a_00104
  • Dai, X., & Adel, H. (2020). An analysis of simple data augmentation for named entity recognition. In D. Scott, N. Bel, and C. Zong (Eds.), Proceedings of the 28th international conference on computational linguistics, COLING 2020, December 8–13 (pp. 3861–3867). International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.343.
  • Derczynski, L., Nichols, E., van Erp, M., & Limsopatham, N. (2017). Results of the WNUT2017 shared task on novel and emerging entity recognition. In L. Derczynski, W. Xu, A. Ritter, and T. Baldwin (Eds.), Proceedings of the 3rd workshop on noisy user-generated text, nut@emnlp 2017, September 7 (pp. 140–147). Association for Computational Linguistics. https://doi.org/10.18653/v1/w17-4418.
  • Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, and T. Solorio (Eds.), Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies, NAACL-HLT 2019, June 2–7 (Vol. 1, pp. 4171–4186). Association for Computational Linguistics. https://doi.org/10.18653/v1/n19-1423.
  • Etzioni, O., Cafarella, M. J., Downey, D., Popescu, A., Shaked, T., Soderland, S., Weld, D. S., & Yates, A. (2005). Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence, 165(1), 91–134. https://doi.org/10.1016/j.artint.2005.03.001
  • Fu, Y., Lin, N., Lin, X., & Jiang, S. (2021). Towards corpus and model: Hierarchical structured-attention-based features for Indonesian named entity recognition. Journal of Intelligent & Fuzzy Systems, 41(1), 563–574. https://doi.org/10.3233/JIFS-202286
  • Gunawan, W., Suhartono, D., Purnomo, F., & Ongko, A. (2018). Named-entity recognition for Indonesian language using bidirectional LSTM-CNNs. Procedia Computer Science, 135, 425–432. The 3rd International conference on computer science and computational intelligence (ICCSCI 2018), empowering smart technology in digital era for a better life. https://www.sciencedirect.com/science/article/pii/S1877050918314832.
  • Guo, J., Xu, G., Cheng, X., & Li, H. (2009). Named entity recognition in query. In J. Allan, J.A. Aslam, M. Sanderson, C. Zhai, and J. Zobel (Eds.), Proceedings of the 32nd annual international ACM SIGIR conference on research and development in information retrieval, SIGIR 2009, July 19–23 (pp. 267–274). ACM. https://doi.org/10.1145/1571941.1571989.
  • Hedderich, M. A., Lange, L., Adel, H., Strötgen, J., & Klakow, D. (2021). A survey on recent approaches for natural language processing in low-resource scenarios. In K. Toutanova (Eds.), Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: Human language technologies, NAACL-HLT 2021, June 6–11 (pp. 2545–2568). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.201.
  • Hedderich, M. A., Lange, L., & Klakow, D. (2021). ANEA: Distant supervision for low-resource named entity recognition. CoRR abs/2102.13129. https://arxiv.org/abs/2102.13129.
  • Hovy, E. H., Marcus, M. P., Palmer, M., Ramshaw, L. A., & Weischedel, R. M. (2006). OntoNotes: The 90% solution. In R. C. Moore, J. A. Bilmes, J. Chu-Carroll, and M. Sanderson (Eds.), Human language technology conference of the North American chapter of the association of computational linguistics, proceedings, June 4–9. The Association for Computational Linguistics. https://aclanthology.org/N06-2015/.
  • Ikhwantri, F. (2019). Cross-lingual transfer for distantly supervised and low-resources Indonesian NER. arXiv preprint arXiv:1907.11158.
  • Jain, A., Paranjape, B., & Lipton, Z. C. (2019). Entity projection via machine translation for cross-lingual NER. In K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, EMNLP-IJCNLP 2019, November 3–7 (pp. 1083–1092). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1100.
  • Keung, P., Lu, Y., & Bhardwaj, V. (2019). Adversarial learning with contextual embeddings for zero-resource cross-lingual classification and NER. In K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, EMNLP-IJCNLP 2019 November 3–7 (pp. 1355–1360). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1138.
  • Kosasih, J. A., & Khodra, M. L. (2018). Transfer learning for Indonesian named entity recognition. In 2018 international symposium on advanced intelligent informatics (sain) (pp. 173–178). https://doi.org/10.1109/SAIN.2018.8673345.
  • Kurniawan, K., & Louvan, S. (2018). Empirical evaluation of character-based model on neural named-entity recognition in Indonesian conversational texts. In W. Xu, A. Ritter, T. Baldwin, and A. Rahimi (Eds.), Proceedings of the 4th workshop on noisy user-generated text, nut@emnlp 2018, November 1 (pp. 85–92). Association for Computational Linguistics. https://doi.org/10.18653/v1/w18-6112.
  • Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural architectures for named entity recognition. In K. Knight, A. Nenkova, and O. Rambow (Eds.), NAACL HLT 2016, the 2016 conference of the North American chapter of the association for computational linguistics: Human language technologies, June 12–17 (pp. 260–270). The Association for Computational Linguistics. https://doi.org/10.18653/v1/n16-1030.
  • Li, F., Wang, Z., Hui, S. C., Liao, L., Song, D., & Xu, J. (2021). Effective named entity recognition with boundary-aware bidirectional neural networks. In J. Leskovec, M. Grobelnik, M. Najork, J. Tang, and L. Zia (Eds.), WWW '21: The web conference 2021, April 19–23 (pp. 1695–1703). ACM/IW3C2. https://doi.org/10.1145/3442381.3449995.
  • Li, J., Sun, A., & Ma, Y. (2021). Neural named entity boundary detection. IEEE Transactions on Knowledge and Data Engineering, 33(4), 1790–1795. https://doi.org/10.1109/TKDE.69
  • Li, X., Feng, J., Meng, Y., Han, Q., Wu, F., & Li, J. (2020). A unified MRC framework for named entity recognition. In D. Jurafsky, J. Chai, N. Schluter, and J.R. Tetreault (Eds.), Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, July 5–10 (pp. 5849–5859). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.519.
  • Lin, B. Y., Lee, D., Shen, M., Moreno, R., Huang, X., Shiralkar, P., & Ren, X. (2020). TriggerNER: Learning with entity triggers as explanations for named entity recognition. In D. Jurafsky, J. Chai, N. Schluter, and J.R. Tetreault (Eds.), Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, July 5–10 (pp. 8503–8511). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.752.
  • Luthfi, A., Distiawan, B., & Manurung, R. (2014). Building an Indonesian named entity recognizer using Wikipedia and DBPedia. In 2014 international conference on Asian language processing, IALP 2014, October 20–22 (pp. 19–22). IEEE. https://doi.org/10.1109/IALP.2014.6973520.
  • Ma, X., & Hovy, E. H. (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th annual meeting of the association for computational linguistics, ACL 2016, August 7–12 (Vol. 1). The Association for Computer Linguistics. https://doi.org/10.18653/v1/p16-1101.
  • Mayhew, S., Tsai, C., & Roth, D. (2017). Cheap translation for cross-lingual named entity recognition. In M. Palmer, R. Hwa, and S. Riedel (Eds.), Proceedings of the 2017 conference on empirical methods in natural language processing, EMNLP 2017, September 9–11 (pp. 2536–2545). Association for Computational Linguistics. https://doi.org/10.18653/v1/d17-1269.
  • McCallum, A. (2003). Efficiently inducing features of conditional random fields. In C. Meek and U. Kjærulff (Eds.), UAI '03, proceedings of the 19th conference in uncertainty in artificial intelligence, August 7–10 (pp. 403–410). Morgan Kaufmann.
  • Menezes, D. S., Milidiú, R., & Savarese, P. (2019). Building a massive corpus for named entity recognition using free open data sources. In 8th Brazilian conference on intelligent systems, BRACIS 2019, October 15–18 (pp. 6–11). IEEE. https://doi.org/10.1109/BRACIS.2019.00011.
  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. In Y. Bengio and Y. LeCun (Eds.), 1st international conference on learning representations, ICLR 2013, May 2–4. Workshop Track Proceedings. http://arxiv.org/abs/1301.3781.
  • Mintz, M., Bills, S., Snow, R., & Jurafsky, D. (2009). Distant supervision for relation extraction without labeled data. In K. Su, J. Su, and J. Wiebe (Eds.), ACL 2009, proceedings of the 47th annual meeting of the association for computational linguistics and the 4th international joint conference on natural language processing of the AFNLP, August 2–7 (pp. 1003–1011). The Association for Computer Linguistics. https://aclanthology.org/P09-1113/.
  • Morsidi, F., Sarkawi, S., Sulaiman, S., Mohammad, S. A., & Wahid, R. A. (2015). Malay named entity recognition: A review. Journal of ICT in Education, 2, 1–14. https://ejournal.upsi.edu.my/index.php/JICTIE/article/view/2596
  • Ni, J., Dinu, G., & Florian, R. (2017). Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection. In R. Barzilay and M. Kan (Eds.), Proceedings of the 55th annual meeting of the association for computational linguistics, ACL 2017, July 30–August 4 (Vol. 1, pp. 1470–1480). Association for Computational Linguistics. https://doi.org/10.18653/v1/P17-1135.
  • Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In A. Moschitti, B. Pang, and W. Daelemans (Eds.), Proceedings of the 2014 conference on empirical methods in natural language processing, EMNLP 2014, October 25–29 (pp. 1532–1543). ACL. https://doi.org/10.3115/v1/d14-1162.
  • Prabhakar, D. K., Dubey, S., Goel, B., & Pal, S. (2014). ISM@FIRE-2014: Named entity recognition for Indian languages. In Proceedings of the forum for information retrieval evaluation (pp. 98–102). Association for Computing Machinery. https://doi.org/10.1145/2824864.2824881.
  • Raiman, J., & Miller, J. (2017). Globally normalized reader. In M. Palmer, R. Hwa, and S. Riedel (Eds.), Proceedings of the 2017 conference on empirical methods in natural language processing, EMNLP 2017, September 9–11 (pp. 1059–1069). Association for Computational Linguistics. https://doi.org/10.18653/v1/d17-1111.
  • Ranaivo-Malançon, B. (2006). Automatic identification of close languages – case study: Malay and Indonesian. ECTI Transactions on Computer and Information Technology (ECTI-CIT), 2(2), 126–134. https://doi.org/10.37936/ecti-cit.200622.
  • Salleh, M. S., Asmai, S. A., Basiron, H., & Ahmad, S. (2017). A Malay named entity recognition using conditional random fields. In 2017 5th international conference on information and communication technology (ICoIC7) (pp. 1–6). IEEE. https://doi.org/10.1109/ICoICT.2017.8074647.
  • Sang, E. F. T. K., & Meulder, F. D. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In W. Daelemans and M. Osborne (Eds.), Proceedings of the seventh conference on natural language learning, CONLL 2003, held in cooperation with HLT-NAACL 2003, May 31–June 1 (pp. 142–147). ACL. https://aclanthology.org/W03-0419/.
  • Sharum, M. Y., Abdullah, M. T., Sulaiman, M. N., Murad, M. A. A., & Hamzah, Z. A. Z. (2011). Name extraction for unstructured Malay text. In 2011 IEEE symposium on computers & informatics (pp. 787–791). IEEE. https://doi.org/10.1109/ISCI.2011.5959017.
  • Srinivasan, R., & Subalalitha, C. (2019). Automated named entity recognition from Tamil documents. In 2019 IEEE 1st international conference on energy, systems and information processing (ICESIP) (pp. 1–5). IEEE. https://doi.org/10.1109/ICESIP46348.2019.8938383.
  • Sulaiman, S., Wahid, R., Sarkawi, S., & Omar, N. (2017). Using Stanford NER and Illinois NER to detect Malay named entity recognition. International Journal of Computer Theory and Engineering, 9(2), 147–150. https://doi.org/10.7763/IJCTE.2017.V9.1128
  • Ulanganathan, T., Ebrahim, A., Xian, B. C. M., Bouzekri, K., Mahmud, R., & Hoe, O. H. (2017). Benchmarking Mi-NER: Malay entity recognition engine. In 9th international conference on information, process, and knowledge management (pp. 52–58).
  • Wei, J. W., & Zou, K. (2019). EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, EMNLP-IJCNLP 2019, November 3–7 (pp. 6381–6387). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1670.
  • Wu, Q., Lin, Z., Karlsson, B., Lou, J., & Huang, B. (2020a). Single-/multi-source cross-lingual NER via teacher-student learning on unlabeled data in target language. In D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, July 5–10 (pp. 6505–6514). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.581.
  • Wu, Q., Lin, Z., Karlsson, B. F., Huang, B., & Lou, J. (2020b). UniTrans: Unifying model transfer and data transfer for cross-lingual named entity recognition with unlabeled data. In C. Bessiere (Ed.), Proceedings of the twenty-ninth international joint conference on artificial intelligence, IJCAI 2020 (pp. 3926–3932). IJCAI.org. https://doi.org/10.24963/ijcai.2020/543.
  • Wu, S., & Dredze, M. (2019). Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. In K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, EMNLP-IJCNLP 2019, November 3–7 (pp. 833–844). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1077.
  • Xia, M., Kong, X., Anastasopoulos, A., & Neubig, G. (2019). Generalized data augmentation for low-resource translation. In A. Korhonen, D.R. Traum, and L. Màrquez (Eds.), Proceedings of the 57th conference of the association for computational linguistics, ACL 2019, July 28–August 2 (Vol. 1, pp. 5786–5796). Association for Computational Linguistics. https://doi.org/10.18653/v1/p19-1579.
  • Zamin, N., & Bakar, Z. A. (2015). Name entity recognition for Malay texts using cross-lingual annotation projection approach. In O. Gervasi (Eds.), Computational science and its applications – ICCSA 2015 – 15th international conference proceedings, part I, June 22–25 (Vol. 9155, pp. 242–256). Springer. https://doi.org/10.1007/978-3-319-21404-7_18.
  • Zamin, N., Oxley, A., & Bakar, Z. A. (2013). Projecting named entity tags from a resource rich language to a resource poor language. Journal of Information and Communication Technology, 12, 121–146. https://e-journal.uum.edu.my/index.php/jict/article/view/8140.
  • Zhao, S., Liu, T., Zhao, S., & Wang, F. (2019). A neural multi-task learning framework to jointly model medical named entity recognition and normalization. In The thirty-third AAAI conference on artificial intelligence, AAAI 2019, the thirty-first innovative applications of artificial intelligence conference, IAAI 2019, the ninth AAAI symposium on educational advances in artificial intelligence, EAAI 2019, January 27–February 1 (pp. 817–824). AAAI Press. https://doi.org/10.1609/aaai.v33i01.3301817.
  • Zhao, W., Zhao, S., Chen, S., Weng, T. H., & Kang, W. (2022). Entity and relation collaborative extraction approach based on multi-head attention and gated mechanism. Connection Science, 34(1), 670–686. https://doi.org/10.1080/09540091.2022.2026295
  • Zhou, G., & Su, J. (2002). Named entity recognition using an HMM-based chunk tagger. In Proceedings of the 40th annual meeting of the association for computational linguistics, July 6–12 (pp. 473–480). ACL. https://aclanthology.org/P02-1060/.
  • Zirikly, A. (2015). Cross-lingual transfer of named entity recognizers without parallel corpora. In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing of the Asian federation of natural language processing, ACL 2015, July 26–31 (Vol. 2, pp. 390–396). The Association for Computer Linguistics. https://doi.org/10.3115/v1/p15-2064.