Research Article

Cue prompt adapting model for relation extraction

Article: 2161478 | Received 03 Aug 2022, Accepted 19 Dec 2022, Published online: 28 Dec 2022

Abstract

Prompt-tuning models output relation types as verbalised type tokens instead of predicting confidence scores for each relation type. However, existing prompt-tuning models cannot perceive the named entities of a relation instance because they are normally implemented on raw input, which is too weak to encode the contextual features and semantic dependencies of a relation instance. This study proposes a cue prompt adapting (CPA) model for relation extraction (RE) that encodes contextual features and semantic dependencies by implanting task-relevant cues in a sentence. Additionally, a new transformer architecture is proposed to adapt pre-trained language models (PLMs) to perceive named entities in a relation instance. Finally, in the decoding process, a goal-oriented prompt template is designed to take advantage of the potential semantic features of a PLM. The proposed model is evaluated on three public corpora: ACE, ReTACRED, and SemEval. It achieves an impressive performance improvement, outperforming existing state-of-the-art models. Experiments indicate that the proposed model is effective for learning task-specific contextual features and semantic dependencies in a relation instance.

1. Introduction

Relation extraction (RE) identifies predefined semantic relationships between two named entities in a sentence and plays a vital role in many downstream natural language processing (NLP) tasks, such as knowledge base construction (Q. Liu et al., 2016), question answering (K. Xu et al., 2016), and machine translation (Bao et al., 2014). RE is usually implemented as a classification problem that predicts a relation type for each entity pair in a sentence. Although great success has been achieved, RE remains a challenging task that suffers from two problems. First, because some relation types are asymmetric, the order between the entities in each pair must be considered, which leads to a serious data imbalance problem. Second, every entity pair in a sentence must be classified to determine whether a relation holds between its entities. Because these entity pairs share the same contextual features in a sentence, they are difficult to distinguish. To reduce the influence of negative instances and capture the order information between entities, it is important to learn contextual features and semantic dependencies relevant to the considered entities in a sentence.

In traditional type classification models, many techniques have been developed to learn contextual features and semantic dependencies relevant to the considered entities, such as position embedding (Zeng et al., 2015), multi-channel networks (Chen et al., 2020), neuralized feature engineering (Chen et al., 2021), and entity indicators (Qin et al., 2021; W. Zhou & Chen, 2021). In these models, PLMs are mainly used to support token embedding. Subsequently, entity-relevant features (e.g. entity positions or types) are encoded into a task-specific representation. In this case, PLM-based deep architectures are usually designed to compress every relation instance into an abstract representation, and classification depends only on a dense representation of the entire input. Because this representation is usually a single vector, it undoubtedly results in serious semantic loss. Furthermore, the process of initialising PLMs is implemented as a masked token-prediction task (Devlin et al., 2018), so there is also a gap between the pre-training and fine-tuning objectives.

In the prompt-tuning schema, predicting relation types is transformed into a verbalised-type token-prediction task. In general, prompts are defined as templates with slots that take values from a verbalised-type token set. Predefined prompts are concatenated with an input (sentence) and fed into PLMs to predict the masked slots, similar to a cloze-style schema (Schick & Schütze, 2020). For example, an input in sentiment classification is first concatenated with a prompt (e.g. "It was [MASK]"). The input is subsequently fed into PLMs to predict the masked tokens (e.g. "Glad" or "Sad"). This approach is effective in making use of the knowledge within PLMs because prompt tuning helps bridge the gap between the pre-training objective and RE. Therefore, it has been successfully applied to tasks such as text classification and natural language inference (Schick & Schütze, 2020).

In related studies, several prompts have been designed for PLM tuning (Brown et al., 2020). Despite the great success of prompt-tuning models, their effectiveness heavily depends on the quality of the prompt templates. Current prompt-tuning models are often implemented directly on raw inputs concatenated with predefined prompt templates. In this case, it is difficult to encode contextual features and semantic dependencies because the models cannot perceive the named entities of a relation instance.

This study proposes a cue prompt adapting (CPA) model for RE consisting of three components. First, instead of implementing a classifier on the raw input, entity cues are implanted into a sentence to learn the semantic dependencies of a relation instance. Second, a new transformer architecture is proposed to tune PLMs for RE so that they adapt to the proposed entity cues. Third, a goal-oriented prompt template is designed to decode the potential semantic features of a PLM. This model enables PLMs to encode semantic dependencies between type tokens and contextual words. The proposed model achieves state-of-the-art performance on three public evaluation datasets.

The remainder of this paper is organised as follows. Section 2 introduces the related work. The proposed CPA model is presented in Section 3. Section 4 presents experiments to evaluate the proposed CPA model. Finally, the conclusions are presented in Section 5.

2. Related work

The task of extracting relations is typically implemented as a classification problem. Shallow architectures, such as rule-based and feature-engineering-based architectures (Chen et al., 2015; S. Zhao & Grishman, 2005), were widely used in the early research stage because manually designed rules were required to extract the features of a relation instance. However, these models were expensive in human labour, and generalisation to a different domain was difficult. In contrast, deep architectures adopt multi-stacked network layers for feature transformation, such as convolutional neural networks (CNNs) (Nguyen & Grishman, 2015; Zeng et al., 2014) and recurrent neural networks (RNNs) (Geng et al., 2020; Wang et al., 2016; P. Zhou et al., 2016). These networks have the advantage of automatically extracting high-order abstract representations from raw input.

PLMs in deep neural networks, such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018), have been widely adopted to embed tokens into distributed representations and thereby learn better relation representations. PLM tuning in RE has achieved great success (Torfi et al., 2020). PLMs typically consist of billions of parameters that are automatically learned from external resources. These parameters encode rich knowledge of sentences that is valuable for downstream tasks (Brown et al., 2020). Therefore, PLMs are tuned with annotated examples during the training process to learn task-relevant representations. In this field, there are two paradigms for tuning PLMs: fine-tuning and prompt tuning.

In the fine-tuning paradigm, PLMs such as BERT, ALBERT (Lan et al., 2019), and RoBERTa (Y. Liu et al., 2019) are used to map every token into a distributed representation. PLMs are effective in addressing the feature sparsity problem because they are pretrained on external resources using unsupervised methods (C. Li & Tian, 2020; Soares et al., 2019). K. Zhao et al. (2021) proposed a graph neural network for joint entity and relation extraction. R. Li et al. (2021) proposed an entity-correlated attention neural model to extract entities and their relations. W. Zhao, Zhao, et al. (2022) proposed a gated and attentive network to collaboratively extract entities and their relations. Hang et al. (2021) proposed an end-to-end neural network model for the joint extraction of entities and overlapping relations. Chen et al. (2021) combined a neural network with feature engineering and proposed a neuralized feature engineering method. Q. Zhao, Xu, et al. (2022) proposed a knowledge-guided distance supervision model for biomedical RE. Cohen et al. (2020) used a question-answering schema to verify the feasibility of RE. P. Li and Mao (2019) proposed a knowledge-oriented CNN for causal RE. Lyu and Chen (2021) proposed an entity type restriction, where entity types are exploited to restrict candidate relations. K. Zhao, Yang, et al. (2022) proposed a consistent representation learning method for few-shot RE.

Prompt tuning has received considerable attention in recent years and has achieved great success (Hu et al., 2021; P. Liu et al., 2021). In this paradigm, RE is implemented as a masked language modelling task that raises two issues: template design and verbaliser construction. In related work, Han et al. (2021) proposed the PTR model, which applies logic rules to construct prompts from several sub-prompts and can encode prior knowledge of each class into prompt tuning. Shin et al. (2020) proposed a gradient-guided method to create prompts automatically. Gao et al. (2020) presented a prompt model that uses sequence-to-sequence models to generate prompt candidates. Cui et al. (2022) proposed a prototypical verbaliser that learns prototype vectors through contrastive learning. Xiang et al. (2020) proposed knowledge-aware prompt tuning that jointly optimises the representation of a prompt template and the answer words under knowledge constraints.

3. Methodology

To support a formal discussion, the task of extracting entity relations is formalised as follows.

A relation instance is defined as a 3-tuple $I = \langle r, e_1, e_2 \rangle$ that contains a relation mention $r$ and two named entities $e_1$ and $e_2$. The relation mention is a token sequence $r = [t_1, t_2, \ldots, t_n]$. Each entity $e_k = [t_i, \ldots, t_j]$ ($k \in \{1, 2\}$) is a substring of $r$. $Y = \{y_0, y_1, \ldots, y_M\}$ is a relation type set composed of $M$ positive relation types and one negative relation type $y_0$. $\mathcal{I} = \{I_1, I_2, \ldots\}$ represents a relation instance set. RE is then represented as a map between $\mathcal{I}$ and $Y$, expressed as:

(1) $f: \mathcal{I} \to Y$

where $f$ is a function that can be a shallow model (e.g. a support vector machine or maximum entropy classifier) or a deep neural network (e.g. a CNN or RNN).
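
For concreteness, the formalisation above can be sketched as a small data structure. This is illustrative only; the field names and the example instance are assumptions, not taken from the paper or its released code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class RelationInstance:
    """A 3-tuple I = <r, e1, e2>: a relation mention r and two entity spans within it."""
    tokens: List[str]                 # relation mention r = [t1, ..., tn]
    e1_span: Tuple[int, int]          # (i, j) token indices of e1 in r
    e2_span: Tuple[int, int]          # (i, j) token indices of e2 in r
    label: Optional[str] = None       # a relation type y in Y (y0 = negative)

    def entity(self, k: int) -> List[str]:
        """Return e_k as a substring (token slice) of r."""
        i, j = self.e1_span if k == 1 else self.e2_span
        return self.tokens[i:j + 1]

inst = RelationInstance(
    tokens="Mark Twain was born in Florida , Missouri".split(),
    e1_span=(0, 1), e2_span=(5, 7), label="per:place_of_birth",
)
print(inst.entity(1))  # ['Mark', 'Twain']
```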

In a traditional model, a deep architecture (denoted as $N$) is implemented on the original input $r$ to extract its representation. The network $N$ can be embedded with a PLM to support token embedding and encode external knowledge, which is denoted as $N_M$. The output of $N_M$ is represented as $H = [H_1, H_2, \ldots, H_n]$, where $H_i$ is an abstract representation of the token $t_i$. $H$ is often mapped into a vector and then fed into a classifier $C$ to make a prediction. This process is formalised as:

(2) $P(Y \mid I) = \mathrm{Softmax}(C(N_M(r)))$

In prompt tuning, class types are verbalised into a token set $V$ composed of relation types or category labels (e.g. "true" or "false"). Elements of $V$ are referred to as "type tokens". A prompt is defined as a template with slots that can be filled by verbalised type tokens, for example, "$P$ = It is [MASK]". This template is concatenated with the raw input and fed into a deep network to predict the distribution of type tokens at the position of "[MASK]". This is expressed as:

(3) $[H_1, \ldots, H_L] = N_M(r + P)$

where "+" denotes the character string concatenation operation. In prompt tuning, instead of outputting a class label, a confidence score is assigned to each type token $v \in V$ for each slot ([MASK]) in the prompt template, based on the token representations $[H_1, \ldots, H_L]$.
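
The cloze-style prediction of Equation (3) can be sketched with an off-the-shelf masked language model. This is a minimal illustration, assuming the HuggingFace Transformers library, the roberta-base checkpoint, and a two-word sentiment verbaliser; none of these choices are specified by the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base")

r = "The movie was a complete waste of time."
prompt = f"It was {tok.mask_token}."                  # P = "It was [MASK]"
inputs = tok(r + " " + prompt, return_tensors="pt")   # N_M(r + P)

with torch.no_grad():
    logits = mlm(**inputs).logits                     # [1, L, |vocab|]

# Score only the verbalised type tokens at the [MASK] position.
mask_pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
verbaliser = ["great", "terrible"]                    # illustrative token set V
scores = {
    v: logits[0, mask_pos, tok(" " + v, add_special_tokens=False).input_ids[0]].item()
    for v in verbaliser
}
print(max(scores, key=scores.get))                    # likely "terrible"
```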

A CPA model was proposed in this study to perceive named entities and encode the semantic dependencies between them. The proposed model is composed of three components: cue encoding, PLM adapting, and prompt decoding. The architecture of the proposed model is shown in Figure 1.

Figure 1. Architecture of our proposed CPA model.


In the cue encoding component, entity cues were implanted into a sentence to learn the semantic dependencies of a relation instance instead of implementing a classifier on raw input. In the PLM adapting component, two Adapter layers were added to tune PLMs for entity cueing (Houlsby et al., 2019). In the prompt decoding component, a goal-oriented prompt template was designed to decode the potential semantic features of a PLM for RE. Each component is discussed in detail below.

3.1. Cue encoding component

Directly implementing a deep network on $r$ typically causes serious performance degradation because the network is unaware of the positions of the considered entities. To address this problem, entity cues were implanted into the input to control the attention of the deep network for learning task-specific representations. This is formalised as:

(4) $\mathrm{Cueing}(e_k) = [c_k, e_k, /c_k]$

(5) $\mathrm{Cueing}(r) = [\, \ddot{r} \mid e_k / \mathrm{Cueing}(e_k),\ k \in \{1, 2\} \,]$

where $c_k$ and $/c_k$ are specific tokens representing the start and end boundaries of the entity $e_k$ ($k \in \{1, 2\}$), respectively. These are referred to as entity cues.

Equation (4) concatenates the two cue tokens on both sides of $e_k$. In Equation (5), $e_k / \mathrm{Cueing}(e_k)$ denotes the string replacement operation, in which $e_k$ is replaced by $\mathrm{Cueing}(e_k)$. The function $\mathrm{Cueing}(r)$ therefore implants entity cues on both sides of the considered entity pair. With these settings, Equation (2) can be revised as:

(6) $P(Y \mid I) = \mathrm{Softmax}(C(N_M(\mathrm{Cueing}(r))))$

In Equation (6), the entity cues implanted into the input enable the deep network to focus on the considered entity pair. The classification is then based on a sentence representation relevant to the considered entities. This approach is effective for learning the contextual features and semantic dependencies of a relation instance.
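
A minimal sketch of Equations (4) and (5) follows. The concrete cue tokens ("<e1>", "</e1>", ...) are one plausible choice of $c_k$ and $/c_k$; Table 1 lists the strategies actually compared in the paper.

```python
from typing import List, Tuple

def cueing(tokens: List[str], e1: Tuple[int, int], e2: Tuple[int, int]) -> List[str]:
    """Implant boundary cues c_k ... /c_k around both considered entities."""
    out = []
    for i, t in enumerate(tokens):
        if i == e1[0]: out.append("<e1>")
        if i == e2[0]: out.append("<e2>")
        out.append(t)
        if i == e1[1]: out.append("</e1>")
        if i == e2[1]: out.append("</e2>")
    return out

tokens = "Mark Twain was born in Florida".split()
print(" ".join(cueing(tokens, (0, 1), (5, 5))))
# <e1> Mark Twain </e1> was born in <e2> Florida </e2>
```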

In prompt tuning, researchers have mainly focussed on designing prompt templates to tune PLMs for downstream tasks (Brown et al., 2020). It was assumed that tuning PLMs to attend to task-specific information was also valuable in prompt tuning. Therefore, this study focussed on designing and implanting entity cues for tuning PLMs to support RE. Several cueing strategies were proposed in this study, as listed in Table 1.

Table 1. Cueing strategies.

In Table 1, square brackets indicate that the content inside is a token sequence, and Cueing_o indicates that the input relation mention remains unchanged. Cueing_e replaces the entity e_k (k ∈ {1, 2}) with the token sequence "c_k, e_k, /c_k". This is the traditional strategy used in related studies (Chen et al., 2021; Qin et al., 2021). Note that all pairs of closing braces and parentheses are also used as tokens to indicate the positions of the named entities. In Cueing_et(e_k), the "type1" and "type2" tokens denote the head and tail entity types, respectively. Different braces are used to distinguish the different entities.
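
The three strategies can be sketched as simple string rewrites. Since the exact bracket tokens for the tail entity did not survive extraction above, the "[ ( type2 ) e2 ]" pattern below is an assumption; only the head pattern "{ ( type1 ) e1 }" is attested later in the text.

```python
def cueing_o(entity: str, etype: str, head: bool) -> str:
    """Cueing_o: the input relation mention remains unchanged."""
    return entity

def cueing_e(entity: str, etype: str, head: bool) -> str:
    """Cueing_e: generic boundary cues c_k ... /c_k."""
    tag = "e1" if head else "e2"
    return f"<{tag}> {entity} </{tag}>"

def cueing_et(entity: str, etype: str, head: bool) -> str:
    """Cueing_et: typed cues; different brackets distinguish the entities."""
    return f"{{ ( {etype} ) {entity} }}" if head else f"[ ( {etype} ) {entity} ]"

print(cueing_et("Mark Twain", "person", head=True))    # { ( person ) Mark Twain }
print(cueing_et("Florida", "location", head=False))    # [ ( location ) Florida ]
```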

3.2. PLM adapting component

The prompt is concatenated with the cued input and fed into a deep neural network to learn the token representations $H$. This is expressed as:

(7) $[H_1, \ldots, H_L] = N_M(\mathrm{Cueing}(r) + P(e_1, e_2))$

where $P(e_1, e_2)$ is a prompt template used to generate the output for RE. This is discussed in detail in Section 3.3.

The PLM $N_M$ is modified to adapt to the proposed entity cues by adding two Adapter layers, as shown in Figure 1. Each Adapter consists of two linear layers and a nonlinear layer, and its output size is consistent with that of its input. This is formalised as:

(8) $A(x) = (W' \, \mathrm{GeLU}(W x + b) + b') + x$

where $x$ is the input hidden state, and $W$, $W'$, $b$, and $b'$ are trainable parameters. The dimension of $A(x)$ is the same as that of $x$. $N_M$ with Adapters is represented as $N_{MA}$.
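
Equation (8) corresponds to the familiar bottleneck Adapter of Houlsby et al. (2019). A minimal PyTorch sketch follows; the bottleneck size is an assumption, as the paper does not report it here.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """A(x) = (W' GeLU(W x + b) + b') + x, with output size equal to input size."""
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)   # W x + b
        self.up = nn.Linear(bottleneck, hidden_size)     # W'(.) + b'
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.act(self.down(x))) + x       # residual connection

h = torch.randn(2, 150, 1024)     # [batch, max_len 150, hidden] as in RoBERTa-large
print(Adapter(1024)(h).shape)     # torch.Size([2, 150, 1024])
```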

In the output layer, instead of outputting a class label based on the token representations $[H_1, \ldots, H_L]$, $N_{MA}$ assigns a normalised confidence score to each type token $v \in V$ for each slot $[\mathrm{MASK}]_i$ in the prompt template. This is expressed as:

(9) $S([\mathrm{MASK}]_i = v \mid I) = (H_v \cdot H_{M_i}) / 2$

where $H_{M_i} \in H$ is the representation of $[\mathrm{MASK}]_i$, and $H_v$ is the type-token representation of $v \in V$ in the employed PLM. RE is thus transformed into token prediction in the masked slots. Given a relation instance $I$, the distribution of the type token $v$ in slot $[\mathrm{MASK}]_i$ is expressed as:

(10) $P(v \mid I) = \dfrac{\exp(S([\mathrm{MASK}]_i = v \mid I))}{\sum_{v' \in V} \exp(S([\mathrm{MASK}]_i = v' \mid I))}$

Prompt tuning outputs relation types as verbalised type tokens, which has the advantage of making full use of the rich knowledge in PLMs.
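
A small numeric sketch of Equations (9) and (10), with random tensors standing in for the PLM's mask representation $H_{M_i}$ and the type-token representations $H_v$:

```python
import torch

hidden = 1024
H_mask = torch.randn(hidden)             # representation of [MASK]_i from N_MA
H_v = torch.randn(5, hidden)             # one row per type token v in V

scores = (H_v @ H_mask) / 2              # S([MASK]_i = v | I), Eq. (9)
probs = torch.softmax(scores, dim=0)     # P(v | I), Eq. (10)
print(probs, probs.sum().item())         # a distribution over V; sums to 1.0
```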

3.3. Prompt decoding component

Each PLM contains a large number of parameters that are pre-trained from external resources using unsupervised methods, encoding rich knowledge for RE. A goal-oriented prompt template was designed for the prompt-decoding component to decode the potential semantic features of a PLM.

Specifically, a relation prompt was defined as a template with three slots, for example, "P(e1, e2) = the head entity type [MASK]1 e1 [MASK]2 the tail entity type [MASK]3 e2", where each [MASK] takes values from the token set V. V is composed of entity types and associations between types. The corresponding labels for each mask are listed in Table 2.

Table 2. Labels of [MASK] on different datasets.

The masked tokens for the relation representations in the three datasets are listed in Table 2. The column "Relation Types" denotes the redefined relation types in the manually annotated datasets. "[MASK]1", "[MASK]2", and "[MASK]3" are template slots that should be predicted by a prompt-tuning relation classifier. For example, in the ACE corpus, a "PER-SOC" entity relation is identified if the outputs for the three slots are the tokens "person", "was related to", and "social".

Examples demonstrating the use of the proposed entity cues and prompt templates are presented in Figure 2. "Full prompt" is a prompt template whose context contains three slots. Slots [MASK]1 and [MASK]3 take values from {"person", "country", ...}, which denote the types of the named entities. [MASK]2 takes values from {"was born in", "was located in", ...}, which indicates the relation between the named entities. In the "Naive prompt", the three [MASK]s are used directly without any contextual words. This was mainly used for comparison purposes.
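
The two templates can be sketched as plain string builders. The full-prompt wording follows the template quoted above; the slot arrangement of the naive prompt is an assumption, since the text only states that the three [MASK]s appear without contextual words.

```python
MASK = "<mask>"  # RoBERTa's mask token

def full_prompt(e1: str, e2: str) -> str:
    """Goal-oriented template P(e1, e2) with three slots."""
    return (f"the head entity type {MASK} {e1} {MASK} "
            f"the tail entity type {MASK} {e2}")

def naive_prompt(e1: str, e2: str) -> str:
    """Three bare slots with no contextual words (arrangement assumed)."""
    return f"{e1} {MASK} {MASK} {MASK} {e2}"

print(full_prompt("Mark Twain", "Florida"))
```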

Figure 2. Example of entity cues and prompt templates.


The cueing strategies were concatenated with the "Full prompt" and the "Naive prompt", where ⊕ denotes the concatenation operation, as shown in Figure 2. For example, "Cueing_et(e_k) ⊕ Full" indicates that, given a relation instance $\langle r, e_1, e_2 \rangle$, $e_1$ and $e_2$ are first replaced in $r$ by the two strings "{(type1)e1}" and "type2e2", respectively. The revised relation mention ($\ddot{r} \mid e_k / \mathrm{Cueing}_{et}(e_k), k \in \{1, 2\}$) is then concatenated with the full prompt. The output is fed into a PLM to predict the type tokens in each [MASK]. If the PLM outputs "person", "is parent of", and "person", then a "person:parent" relation is identified between $e_1$ and $e_2$.
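
Putting the pieces together, a sketch of the "Cueing_et ⊕ Full" input for a birthplace-style instance follows; the tail-entity brackets are assumed as in the cueing sketch above, and the sentence is illustrative.

```python
MASK = "<mask>"

# Relation mention r with typed entity cues implanted (Cueing_et).
cued_r = "{ ( person ) Mark Twain } was born in [ ( location ) Florida ]"

# Full prompt P(e1, e2) appended after the cued mention (the concatenation step).
prompt = (f"the head entity type {MASK} Mark Twain {MASK} "
          f"the tail entity type {MASK} Florida")

model_input = cued_r + " " + prompt
print(model_input)
# If the PLM fills the slots with "person", "was born in", "location",
# a birthplace relation is identified between the two entities.
```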

4. Experiments

In this section, the proposed CPA model is verified using three popular evaluation datasets and compared with several state-of-the-art models. Experiments were also conducted to demonstrate the advantages of the proposed CPA model in few-shot learning and through a case study.

4.1. Datasets and experimental settings

The experiments were conducted using three evaluation datasets: ACE 2005 English, SemEval 2010 Task 8 (Hendrickx et al., 2019), and ReTACRED (Stoica et al., 2021). The ACE 2005 English corpus is a classic and widely used dataset annotated from newswire, broadcast news, broadcast conversation, weblogs, discussion forums, and conversational telephone speech. The SemEval 2010 corpus was published for the 8th task of the Semantic Evaluation conference in 2010. ReTACRED is a large-scale supervised RE dataset obtained through crowdsourcing and targeted toward TAC KBP relations. The statistics for these datasets are presented in Table 3.

Table 3. Statistics of different datasets.

RoBERTa-large (Y. Liu et al., 2019) was adopted as the PLM in these experiments. The maximum length of each input was set to 150. The Adam optimiser (Kingma & Ba, 2014) was used as the model optimiser. The dropout rate was set to 0.1 to avoid overfitting. The number of epochs, learning rate, and batch size were set to 30, 1e-5, and 64, respectively. For comparison with related work, the experimental settings of Qin et al. (2021) were used for the ACE dataset, and those of Han et al. (2021) were used for the other two datasets.
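
A sketch of how these settings might be wired together with HuggingFace Transformers follows; the paper reports the hyperparameter values, but the exact wiring below is an assumption.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForMaskedLM.from_pretrained(
    "roberta-large",
    hidden_dropout_prob=0.1,            # dropout rate reported in the paper
    attention_probs_dropout_prob=0.1,
)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-5)   # Adam, lr = 1e-5
MAX_LEN, EPOCHS, BATCH_SIZE = 150, 30, 64                   # reported settings
```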

4.2. Comparison with related works

The Cueing_et(e_k) strategy from Table 1 was adopted in this experiment. Here, the entities e1 and e2 in a relation mention were replaced by the strings "{(type1)e1}" and "type2e2", respectively. The revised relation mention was concatenated with the full prompt, as illustrated in Figure 2. Every concatenated string was fed into a PLM with Adapters to predict the type tokens in the masked slots. The results were compared with those of several related studies on three public evaluation datasets: ACE 2005, ReTACRED, and SemEval 2010.

ME (Kambhatla, 2004) and SVM (G. Zhou et al., 2005) represent maximum entropy and support vector machine classifiers for the ACE 2005 dataset, respectively. FCM denotes a feature-rich compositional embedding model for RE (Gormley et al., 2015), and Mixed-CNN (Zheng et al., 2016) is a CNN-based method for RE. SSM is a set space model that calculates features in a sentence to alleviate the sparse-feature problem (Yanping et al., 2017). Dual-PN denotes an integrated dual pointer network with a multi-head attention mechanism (Park & Kim, 2020). BERT-CNN exploits the rich semantics of PLMs in a CNN (Qin et al., 2021). All these performances are presented in Table 4.

Table 4. Performance on ACE 2005 English dataset.

The results in Table 4 demonstrate that the shallow models (ME and SVM) achieved lower performance because they use categorical features, which cannot encode the semantic features of words. Among the deep neural networks, the Mixed-CNN and FCM models also exhibited low performance without the utilisation of PLMs. The performances of Dual-PN, SSM, and BERT-CNN benefited considerably from the word embeddings of PLMs, which are effective in learning semantic features from external resources. However, these models could not make full use of the knowledge in PLMs because there was a gap between the pre-training and fine-tuning objectives. The proposed model achieved state-of-the-art performance compared with the fine-tuning models.

For the ReTACRED dataset, PA-LSTM combines a long short-term memory (LSTM) sequence model with entity-position-aware attention (Zhang et al., 2017). C-GCN denotes a contextualised graph convolutional network used to learn effective representations (Zhang et al., 2018). SpanBERT represents and predicts spans of text and achieved good performance (Joshi et al., 2020). REBEL is an autoregressive approach that frames relation extraction as a seq2seq task (Cabot & Navigli, 2021). These performances are presented in Table 5, where "*" indicates that the performance was generated by running the published source code because its precision and recall were not reported.

Table 5. Performance on the ReTACRED dataset.

The results in Table 5 demonstrate that PA-LSTM and C-GCN achieved low performance because they are fine-tuning models that do not make full use of the rich knowledge in PLMs. The proposed approach achieved the best performance on the ReTACRED dataset compared with other prompt-tuning models (e.g. the PTR model). This is because the proposed model contains cue encoding and PLM adapting components that effectively learn the contextual features and semantic dependency information of a relation instance.

For the SemEval 2010 dataset, MV-RNN (Socher et al., 2012) and SDP-LSTM (Y. Xu et al., 2015) utilise an RNN to learn the dependency information of entity relations. CR-CNN (Santos et al., 2015) and Multi-Channel (Chen et al., 2020) are CNN-based models. TACNN integrates an attention mechanism into a CNN to increase the effect of the relationship matrix weights of the two entities (Geng et al., 2022). TRE (Alt et al., 2019), R-BERT (Wu & He, 2019), and IA-BERT (Tao et al., 2019) are PLM-based fine-tuning methods, whereas PTR (Han et al., 2021) and KnowPrompt (Xiang et al., 2020) use prompt tuning to make PLMs more suitable for RE tasks. All these performances are presented in Table 6.

Table 6. Performance on SemEval 2010 dataset.

The results in Table 6 demonstrate the same trends as those in Tables 4 and 5. On the SemEval dataset, fine-tuning models achieved higher performance than traditional neural networks without PLMs. Prompt-tuning models also exhibited better performance because they are more adept at using the potential knowledge of PLMs. However, prompt-tuning models are not always better than fine-tuning models, because fine-tuning models can also narrow the gap between PLMs and the RE task by integrating external knowledge (e.g. syntactic indicators). The proposed method learned more task-related dependency features from the raw data by integrating the three components and achieved state-of-the-art performance.

4.3. Ablation study

An ablation study was conducted to further quantify the contribution of each component of the proposed model. The ACE dataset was used in this experiment. Here, "w/" indicates that the corresponding component was used, and "w/o" indicates that the component was omitted from the CPA model to determine its influence on the final performance. Omitting the prompt component refers to the use of a naive prompt, as discussed in Section 3.3. "replace" indicates that a component was replaced with another. The results are presented in Table 7.

Table 7. Ablation study on ACE 2005 dataset.

The results in Table 7 demonstrate that each of the three components influenced the final performance. The model could not effectively learn the representation of the mask if the goal-oriented prompt was removed; therefore, the performance degraded. The model was weak in perceiving named entities when the cueing strategy was removed, leading to serious performance degeneration. Therefore, the cueing component and goal-oriented prompt were highly beneficial because they were effective in learning the semantic dependencies of a relation instance. The results also show that the three components were mutually reinforcing: the performance decreased considerably if all of them were omitted.

"Entity hints" (Giorgi et al., 2022) and "entity markers" (W. Zhou & Chen, 2021) have also been proposed for relation extraction in related works, where they were applied in fine-tuning models. Compared with them, our entity cues have a different structure. The results in Table 7 indicate that the entity cues achieved better performance, suggesting that they are more effective for learning sentence representations relevant to the considered named entities in a prompt-tuning method.

The results in Table 7 also indicate that the CPA model based on BART (Lewis et al., 2019) suffers significant performance degradation (Note 2). The reason is that BART was mainly proposed as a decoder to support generative tasks. Because the CPA model must predict verbalised type tokens in the "[MASK]" slots, it is optimised under the same objective as masked-token pre-training during the training process. Therefore, RoBERTa achieved better performance in the CPA model.

4.4. Performance with different cueing strategies

Cueing strategies with naive and full prompts were compared to demonstrate their effectiveness and the influence of the cueing strategies on performance. The ReTACRED and ACE datasets were used in this experiment. The results are presented in Table 8.

Table 8. Performance with different cueing strategies on ReTACRED and ACE 2005 datasets.

  1. In Cueing_o(e_k)+Naive, every original input was directly concatenated with a naive prompt. This was mainly conducted as a prompt-tuning baseline for comparison. Cueing_o(e_k)+Full was the strategy used by Han et al. (2021). The full prompt contains contextual words that considerably improved the performance.

  2. In Cueing_e(e_k)+Naive, entity cues were implanted into the input to indicate the positions of the entities e_i (i ∈ {1, 2}). Compared with the related work in Tables 4 and 5, this approach achieved competitive performance. This result indicates that entity cues are very powerful in prompt-tuning-based models. Cueing_e(e_k)+Full also outperformed the naive version.

  3. In Cueing_et(e_k)+Naive and Cueing_et(e_k)+Full, different entity cues were used to distinguish between the entities. Here, contextual words (e.g. entity types) were used as entity cues instead of specific tags (e.g. c_k or /c_k). This strategy was effective in encoding the contextual features and semantic dependencies of a relation instance because both the entity cues and the prompts contain contextual words. It achieved state-of-the-art performance.

RE achieved more robust performance in all experiments when entity cues and prompts were used simultaneously. The proposed cueing strategy packages the entity order and location together, whereas traditional entity cue methods only mark the location of the entity. This setting achieved the best performance in the experiments.

4.5. Performance in few-shot learning

PLMs encode rich knowledge of a sentence, which is valuable for supporting few-shot learning. In this experiment, the feasibility of the proposed method was evaluated for few-shot learning. The same seed (seed = 42) was used to randomly sample K shots (K ∈ {8, 16, 32}) from each relation class of the training set. These samples were used to tune the PLM, which was subsequently evaluated on the entire test set. For comparison, the typical R-BERT and PTR models were used as representatives of the fine-tuning and prompt-tuning paradigms, respectively. The results are shown in Table 9.
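
The sampling protocol can be sketched as follows; the `train_set` format is illustrative, but the fixed seed (42) and per-class K-shot sampling follow the description above.

```python
import random
from collections import defaultdict

def sample_k_shot(train_set, k, seed=42):
    """Draw K examples per relation class with a fixed random seed."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sentence, label in train_set:
        by_class[label].append((sentence, label))
    shots = []
    for label, examples in by_class.items():
        shots.extend(rng.sample(examples, min(k, len(examples))))
    return shots

train_set = [(f"sentence {i}", "rel_a" if i % 2 else "rel_b") for i in range(100)]
print(len(sample_k_shot(train_set, k=8)))  # 16 (8 shots x 2 classes)
```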

Table 9. Evaluation of few-shot case using SemEval 2010 and ReTACRED datasets.

The prompt-tuning method was more suitable for few-shot learning than the fine-tuning method, and the proposed model outperformed both the traditional fine-tuning (R-BERT) and prompt-tuning (PTR) methods. These results demonstrate that tuning a large model with very few samples is not reliable in the fine-tuning paradigm because it leads to weak performance. In contrast, task-specific information can be learned from PLMs with very few samples in prompt-tuning learning.

The performance improved steadily in all experiments as the number of shots increased. Few-shot learning exhibited impressive performance on the SemEval 2010 dataset and achieved competitive performance when K = 32. However, its performance degraded considerably on the ReTACRED dataset compared with the results in Table 5. This is because the RE task in the ReTACRED dataset is more challenging owing to unbalanced data, and its performance depends significantly on the amount of training data. Nevertheless, compared with the R-BERT model, the prompt-tuning methods still achieved robust performance. This result indicates that the prompt-tuning approach is insensitive to the data imbalance problem in few-shot learning.

The influence of the number of training epochs on few-shot learning is shown in Figure 3. The proposed model demonstrated faster convergence and higher performance, and exhibited stable performance during training. This conclusion is consistent with the results in Table 9.

Figure 3. Influence of Training Epochs on Few-shot Learning using SemEval 2010 dataset.


The effectiveness of prompt tuning in few-shot learning has great potential in real applications. It considerably reduces the requirement for manually annotated datasets, which are expensive in human labour and time. Furthermore, with the support of prompt tuning, migration between different domains becomes much easier. Another advantage of prompt tuning is its potential to make full use of PLMs for natural language processing. PLMs contain a huge number of parameters pre-trained on external resources with unsupervised methods, which demands enormous computational power. Instead of tuning PLMs for specific tasks, prompt-tuning models predict relation types as verbalised type tokens. This avoids the cost of fine-tuning and is an effective way to decode the potential semantic features of a PLM. Our proposed CPA model implants task-relevant cues into the input sentence. A novel transformer architecture is then designed to tune PLMs to perceive named entities in a sentence. This is effective for learning task-specific contextual features and semantic dependencies in a relation instance and for decoding the potential knowledge of a PLM for prediction.

4.6. Case study on ACE

The task of extracting relations from the ACE dataset is more difficult than from the ReTACRED and SemEval 2010 evaluation datasets and is characterised by two challenges. First, the two named entities of a relation instance can overlap in a sentence in the ACE corpus. For example, the phrase "a waiting shed at the Davao City International Airport" was annotated as an FAC (facility) entity. It is also nested with an inner FAC entity, "the Davao City International Airport", and a PART-WHOLE relation is defined between them. Second, all entity pairs in a sentence must be verified to predict the possible relations between them. However, distinguishing them is difficult because all entity pairs in the same sentence share the same contextual features. Two experiments were conducted to demonstrate the influence of these two challenges on performance and the effectiveness of the proposed model.

$I$ represents the entire set of relation instances. Two strategies were used to divide the relation instances. In the first strategy, the data were divided into two parts, $\hat{I}_1$ and $\ddot{I}_1$ ($\hat{I}_1 \cup \ddot{I}_1 = I$ and $\hat{I}_1 \cap \ddot{I}_1 = \emptyset$), where the set $\hat{I}_1$ contained only relation instances composed of nested entity mentions. In the second strategy, the data were divided into two parts, $\hat{I}_2$ and $\ddot{I}_2$ ($\hat{I}_2 \cup \ddot{I}_2 = I$ and $\hat{I}_2 \cap \ddot{I}_2 = \emptyset$), where the set $\hat{I}_2$ contained all relation instances that shared the same relation mention at least twice. Every part was then evaluated with an independent CPA model to compare the performances. The results are presented in Table 10.
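
The two division strategies can be sketched as simple filters; the instance format (token spans and a mention string) is illustrative, not the paper's data schema.

```python
from collections import Counter

def is_nested(inst) -> bool:
    """True if the two entity spans overlap or nest (strategy 1)."""
    (a1, b1), (a2, b2) = inst["e1"], inst["e2"]
    return a1 <= b2 and a2 <= b1

def split_by_nesting(instances):
    part = [x for x in instances if is_nested(x)]          # nested split
    rest = [x for x in instances if not is_nested(x)]      # remaining split
    return part, rest

def split_by_shared_mention(instances):
    counts = Counter(x["r"] for x in instances)
    part = [x for x in instances if counts[x["r"]] >= 2]   # shared-mention split
    rest = [x for x in instances if counts[x["r"]] < 2]    # remaining split
    return part, rest
```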

Table 10. Performance of complex data on ACE dataset.

In the first case, a relation mention is composed of only a single named entity, and contextual features are rare for RE. Therefore, one would expect the performance on $\hat{I}_1$ to be lower. However, the results unexpectedly show that $\hat{I}_1$ performed better than $\ddot{I}_1$. The reason may be that relations within a single nested named entity only occur with specific relation types, which are easily recognisable. Furthermore, named entities with overlapping structures are typically noun phrases, which often have regular structures that are helpful for RE. In the second case, $\hat{I}_2$ achieved lower performance because many relation instances in $\hat{I}_2$ shared the same relation mentions. In this case, different entity pairs in a relation instance shared the same contextual features, but the relations between these entity pairs were entirely different, which confused the model and led to performance degradation.

The proposed CPA model exhibited competitive and stable performance in all cases compared with the traditional prompt-tuning model (PTR). The performance was clearly improved, particularly on the $\ddot{I}_1$ and $\ddot{I}_2$ subsets. The results indicate that the proposed model was effective in learning semantic features and encoding semantic dependencies in a relation instance.

4.7. Limitations of CPA model

The CPA model is a prompt-learning approach based on PLMs, and it has three major limitations. First, manually designed prompt templates are required for prediction; therefore, migration between different corpora is difficult. Second, the quality of the templates strongly influences the final performance, yet there is currently no standard method for generating these templates. Third, the effectiveness of prompt learning heavily depends on PLMs with a growing number of billions of parameters. The resulting computational complexity limits the practicability of this approach.

5. Conclusion and future work

This study proposed a CPA model for RE, in which relation prediction was implemented as a verbalised-type token-prediction task. In this model, a goal-oriented prompt template and a novel cueing strategy were designed to perceive named entities and decode the potential knowledge of PLMs. Furthermore, an Adapter was proposed to learn the interaction between them. Several experiments were conducted to evaluate the effectiveness of our model for relation extraction on three popular public evaluation datasets, where it achieved state-of-the-art performance. In future work, the cueing strategy can be extended to support other NLP tasks. Furthermore, more studies can be conducted to reveal the mechanisms of the cueing, adapting, and prompting frameworks.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the National Natural Science Foundation of China [grant numbers 62066008 and 62166007], and the Key Projects of Science and Technology Foundation of Guizhou Province [grant number [2020]1Z055].

Notes

2. The weights of BART and RoBERTa were downloaded from https://huggingface.co/models.

References

  • Alt, C., Hübner, M., & Hennig, L. (2019). Improving relation extraction by pre-trained language representations. arXiv preprint arXiv:1906.03088.
  • Bao, J., Duan, N., Zhou, M., & Zhao, T.. (2014). Knowledge-based question answering as machine translation. In Proceedings of ACL (pp. 967–976). ACL.
  • Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., & Agarwal, S.. (2020). Language models are few-shot learners. In Proceedings of NeurIPS, 33, 1877–1901. http://doi.org/10.48550/arXiv.2005.14165.
  • Cabot, P. L. H., & Navigli, R. (2021). REBEL: Relation extraction by end-to-end language generation. In Proceedings of EMNLP (pp. 2370–2381). ACL.
  • Chen, Y., Wang, K., Yang, W., Qing, Y., Huang, R., & Chen, P. (2020). A multi-channel deep neural network for relation extraction. IEEE Access, 8, 13195–13203. https://doi.org/10.1109/Access.6287639
  • Chen, Y., Yang, W., Wang, K., Qin, Y., Huang, R., & Zheng, Q. (2021). A neuralized feature engineering method for entity relation extraction. Neural Networks, 141, 249–260. https://doi.org/10.1016/j.neunet.2021.04.010
  • Chen, Y., Zheng, Q., & Chen, P. (2015). Feature assembly method for extracting relations in Chinese. Artificial Intelligence, 228, 179–194. https://doi.org/10.1016/j.artint.2015.07.003
  • Cohen, A. D., Rosenman, S., & Goldberg, Y. (2020). Relation classification as two-way span-prediction. arXiv preprint arXiv:2010.04829.
  • Cui, G., Hu, S., Ding, N., Huang, L., & Liu, Z. (2022). Prototypical verbalizer for prompt-based few-shot tuning. arXiv preprint arXiv:2203.09770.
  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Gao, T., Fisch, A., & Chen, D. (2020). Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723.
  • Geng, Z., Chen, G., Han, Y., Lu, G., & Li, F. (2020). Semantic relation extraction using sequential and tree-structured LSTM with attention. Information Sciences, 509, 183–192. https://doi.org/10.1016/j.ins.2019.09.006
  • Geng, Z., Li, J., Han, Y., & Zhang, Y. (2022). Novel target attention convolutional neural network for relation classification. Information Sciences, 597, 24–37. https://doi.org/10.1016/j.ins.2022.03.024
  • Giorgi, J., Bader, G. D., & Wang, B. (2022). A sequence-to-sequence approach for document-level relation extraction. arXiv preprint arXiv:2204.01098.
  • Gormley, M. R., Yu, M., & Dredze, M. (2015). Improved relation extraction with feature-rich compositional embedding models. arXiv preprint arXiv:1505.02419.
  • Han, X., Zhao, W., Ding, N., Liu, Z., & Sun, M. (2021). PTR: Prompt tuning with rules for text classification. arXiv preprint arXiv:2105.11259.
  • Hang, T., Feng, J., Wu, Y., Yan, L., & Wang, Y. (2021). Joint extraction of entities and overlapping relations using source-target entity labeling. Expert Systems with Applications, 177, 114853. https://doi.org/10.1016/j.eswa.2021.114853
  • Hendrickx, I., Kim, S. N., Kozareva, Z., Nakov, P., Séaghdha, D. O., Padó, S., Pennacchiotti, M., Romano, L., & Szpakowicz, S. (2019). Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. arXiv preprint arXiv:1911.10422.
  • Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., & Gelly, S. (2019). Parameter-efficient transfer learning for NLP. In Proceedings of ICML (pp. 2790–2799). ACM.
  • Hu, S., Ding, N., Wang, H., Liu, Z., Li, J., & Sun, M. (2021). Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. arXiv preprint arXiv:2108.02035.
  • Joshi, M., Chen, D., Liu, Y., Weld, D. S., Zettlemoyer, L., & Levy, O. (2020). Spanbert: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8, 64–77. https://doi.org/10.1162/tacl_a_00300
  • Kambhatla, N. (2004). Combining lexical, syntactic, and semantic features with maximum entropy models for information extraction. In Proceedings of ACL (pp. 178–181). ACL.
  • Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
  • Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., & Zettlemoyer, L. (2019). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
  • Li, C., & Tian, Y. (2020). Downstream model design of pre-trained language model for relation extraction task. arXiv preprint arXiv:2004.03786.
  • Li, P., & Mao, K. (2019). Knowledge-oriented convolutional neural network for causal relation extraction from natural language texts. Expert Systems with Applications, 115, 512–523. https://doi.org/10.1016/j.eswa.2018.08.009
  • Li, R., Li, D., Yang, J., Xiang, F., Ren, H., Jiang, S., & Zhang, L. (2021). Joint extraction of entities and relations via an entity correlated attention neural model. Information Sciences, 581, 179–193. https://doi.org/10.1016/j.ins.2021.09.028
  • Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2021). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.
  • Liu, Q., Li, Y., Duan, H., Liu, Y., & Qin, Z. (2016). A survey of knowledge mapping construction techniques. Journal of Computer Research and Development, 53(3), 582–600. https://doi.org/10.7544/issn1000-1239.2016.20148228.
  • Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
  • Lyu, S., & Chen, H. (2021). Relation classification with entity type restriction. arXiv preprint arXiv:2105.08393.
  • Nguyen, T. H., & Grishman, R. (2015). Relation extraction: Perspective from convolutional neural networks. In Proceedings of ACL (pp. 39–48). ACL.
  • Park, S., & Kim, H. (2020). Dual pointer network for fast extraction of multiple relations in a sentence. Applied Sciences, 10(11), 3851. https://doi.org/10.3390/app10113851
  • Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
  • Qin, Y., Yang, W., Wang, K., Huang, R., Tian, F., Ao, S., & Chen, Y. (2021). Entity relation extraction based on entity indicators. Symmetry, 13(4), 539. https://doi.org/10.3390/sym13040539
  • Santos, C. N. D., Xiang, B., & Zhou, B. (2015). Classifying relations by ranking with convolutional neural networks. arXiv preprint arXiv:1504.06580.
  • Schick, T., & Schütze, H. (2020). Exploiting cloze questions for few shot text classification and natural language inference. arXiv preprint arXiv:2001.07676.
  • Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E., & Singh, S. (2020). Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980.
  • Soares, L. B., FitzGerald, N., Ling, J., & Kwiatkowski, T. (2019). Matching the blanks: Distributional similarity for relation learning. arXiv preprint arXiv:1906.03158.
  • Socher, R., Huval, B., Manning, C. D., & Ng, A. Y. (2012). Semantic compositionality through recursive matrix-vector spaces. In Proceedings of EMNLP-CONLL (pp. 1201–1211). ACL.
  • Stoica, G., Platanios, E. A., & Póczos, B. (2021). Re-tacred: Addressing shortcomings of the tacred dataset. In Proceedings of AAAI.
  • Tao, Q., Luo, X., Wang, H., & Xu, R. (2019). Enhancing relation extraction using syntactic indicators and sentential contexts. In Proceedings of ICTAI (pp. 1574–1580). IEEE.
  • Torfi, A., Shirvani, R. A., Keneshloo, Y., Tavaf, N., & Fox, E. A. (2020). Natural language processing advancements by deep learning: A survey. arXiv preprint arXiv:2003.01200.
  • Wang, L., Cao, Z., De Melo, G., & Liu, Z. (2016). Relation classification via multi-level attention cnns. In Proceedings of ACL (pp. 1298–1307). ACL.
  • Wu, S., & He, Y. (2019). Enriching pre-trained language model with entity information for relation classification. In Proceedings of CIKM (pp. 2361–2364). ACM.
  • Xiang, C., Ningyu, Z., Xin, X., Shumin, D., Yunzhi, Y., Chuanqi, T., Fei, H., Luo, S., & Huajun, C. (2020). KnowPrompt: Knowledge-aware prompt-tuning with synergistic optimization for relation extraction. arXiv preprint arXiv:2104.07650.
  • Xu, K., Reddy, S., Feng, Y., Huang, S., & Zhao, D. (2016). Question answering on freebase via relation extraction and textual evidence. arXiv preprint arXiv:1603.00957.
  • Xu, Y., Mou, L., Li, G., Chen, Y., Peng, H., & Jin, Z. (2015). Classifying relations via long short term memory networks along shortest dependency paths. In Proceedings of EMNLP (pp. 1785–1794). ACL.
  • Yanping, C., Qinghua, Z., & Ping, C. (2017). A set space model for feature calculus. IEEE Intelligent Systems, 32(5), 36–42. https://doi.org/10.1109/MIS.2017.3711651
  • Zeng, D., Liu, K., Chen, Y., & Zhao, J. (2015). Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of EMNLP (pp. 1753–1762). ACL.
  • Zeng, D., Liu, K., Lai, S., Zhou, G., & Zhao, J.. (2014). Relation classification via convolutional deep neural network. In Proceedings of COLING (pp. 2335–2344). ACM.
  • Zhang, Y., Qi, P., & Manning, C. D. (2018). Graph convolution over pruned dependency trees improves relation extraction. arXiv preprint arXiv:1809.10185.
  • Zhang, Y., Zhong, V., Chen, D., Angeli, G., & Manning, C. D. (2017). Position-aware attention and supervised data improve slot filling. In Proceedings of EMNLP (pp. 35–45). ACL.
  • Zhao, K., Xu, H., Cheng, Y., Li, X., & Gao, K. (2021). Representation iterative fusion based on heterogeneous graph neural network for joint entity and relation extraction. Knowledge-Based Systems, 219, 106888. https://doi.org/10.1016/j.knosys.2021.106888
  • Zhao, K., Xu, H., Yang, J., & Gao, K. (2022). Consistent representation learning for continual relation extraction. arXiv preprint arXiv:2203.02721.
  • Zhao, Q., Xu, D., Li, J., Zhao, L., & Rajput, F. A. (2022). Knowledge guided distance supervision for biomedical relation extraction in Chinese electronic medical records. Expert Systems with Applications, 204,117606. https://doi.org/10.1016/j.eswa.2022.117606
  • Zhao, S., & Grishman, R. (2005). Extracting relations with integrated information using kernel methods. In Proceedings of ACL (pp. 419–426). ACL.
  • Zhao, W., Zhao, S., Chen, S., Weng, T. H., & Kang, W. (2022). Entity and relation collaborative extraction approach based on multi-head attention and gated mechanism. Connection Science, 34(1), 670–686. https://doi.org/10.1080/09540091.2022.2026295
  • Zheng, S., Xu, J., Zhou, P., Bao, H., Qi, Z., & Xu, B. (2016). A neural network framework for relation extraction: Learning entity semantic and relation pattern. Knowledge-Based Systems, 114, 12–23. https://doi.org/10.1016/j.knosys.2016.09.019
  • Zhou, G., Su, J., Zhang, J., & Zhang, M. (2005). Exploring various knowledge in relation extraction. In Proceedings of ACL (pp. 427–434). ACL.
  • Zhou, P., Shi, W., Tian, J., Qi, Z., Li, B., Hao, H., & Xu, B. (2016). Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of ACL (pp. 207–212). ACL.
  • Zhou, W., & Chen, M. (2021). An improved baseline for sentence-level relation extraction. arXiv preprint arXiv:2102.01373.