Research Article

MoCoUTRL: a momentum contrastive framework for unsupervised text representation learning

Article: 2221406 | Received 20 Feb 2023, Accepted 31 May 2023, Published online: 16 Jun 2023

Abstract

This paper presents MoCoUTRL: a Momentum Contrastive Framework for Unsupervised Text Representation Learning. The model improves two aspects of recently popular contrastive learning algorithms in natural language processing (NLP). Firstly, MoCoUTRL employs multi-granularity semantic contrastive learning objectives, enabling a more comprehensive understanding of the semantic features of samples. Secondly, MoCoUTRL uses a dynamic dictionary that acts as an approximate ground-truth representation for each token, providing pseudo labels for token-level contrastive learning. MoCoUTRL can turn pre-trained language models (PLMs) and even large-scale language models (LLMs) into plug-and-play semantic feature extractors that can fuel multiple downstream tasks. Experimental results on several publicly available datasets and further theoretical analysis validate the effectiveness and interpretability of the proposed method.

1. Introduction

Discovering universally applicable representations is a core objective of deep learning (DL) (Hjelm et al., Citation2019), and one intuitive way is to train a representation learning function. Representation learning benefits the wide range of machine learning tasks built on top of the learned representations (Bengio et al., Citation2013; Girshick et al., Citation2014; Hill et al., Citation2016; Oquab et al., Citation2014). Deep neural networks for natural language processing (NLP) have advanced rapidly in recent decades. When Bengio et al. presented the so-called neural net language models (NNLMs) (Bengio et al., Citation2003), distributed representations for symbolic data (Hinton, Citation1986) were introduced to NLP, leading to the continuous exploration and application of representation learning in the field.

Research on text representation learning can be roughly divided into two phases: methods before and after neural networks were adopted. The most primitive and naive text representation methods use statistical concepts such as word frequency; commonly used examples are one-hot encoding and TF-IDF (term frequency-inverse document frequency). Despite their intuitiveness and ease of implementation, these methods have obvious shortcomings. On the one hand, representations based on statistical quantities cannot sufficiently capture the semantic similarity of fine-grained text sequences. On the other hand, the text representations generated by these methods have high dimensions and are prone to the curse of dimensionality.

Since the emergence of NNLMs, researchers have focused on utilising pre-trained language models (PLMs) to obtain text representations. These methods are called word embedding techniques or distributed representation methods for words. The resulting embedding of each token is wholly derived from the context of surrounding words, thus providing a more refined reflection of the semantic similarity and correlation between words. Due to limited data and computing capacity, early word embedding models were of smaller scale and could only produce static word vectors. However, with the advancement of computing power, contextual word embeddings generated by large-scale PLMs have become the essential underlying method for various tasks in NLP.

Despite the excellent performance in many NLP applications, the exponentially increasing computing-power threshold and diminishing marginal benefits push research on PTMs into a tricky spot. Therefore, this work approaches text representation learning from another perspective – contrastive learning. Contrastive learning aims to generate a universal representation for each sample by pulling semantically close neighbours together and pushing apart non-neighbours (Hinton & Salakhutdinov, Citation2006). This training framework was first adopted in CV to train unsupervised pre-trained models, since the previous mainstream pre-trained models in CV were obtained by supervised training on ImageNet (Deng et al., Citation2009). The success of contrastive learning in CV led to studies on contrastive learning for text representation learning. However, two problems remain in recent research on text contrastive learning. Firstly, the smallest unit of text data is the word, but existing works on text contrastive learning always use a text sequence as the smallest unit of representation. We deem it necessary to introduce words as the smallest semantic unit in contrastive learning. From conventional static word embedding models like word2vec (Mikolov, Chen, et al., Citation2013; Mikolov, Sutskever, et al., Citation2013), to pre-trained language models like BERT (Devlin et al., Citation2019), and even to current state-of-the-art ultra-large language models such as GPT-4 (OpenAI, Citation2023), PaLM (Chowdhery et al., Citation2022) and LLaMA (Touvron et al., Citation2023), the training process relies on the distributional hypothesis (Harris, Citation1954), where the meaning of a word is jointly determined by its context. In the ideal case of a long text sequence, rich contextual information can provide sufficient semantic detail, but the distributional hypothesis may face significant challenges on short text sequences. For example, “Jason ran away happily without any words” and “Jason ran away angrily without any words” can be considered two text sequences with totally different semantics. However, there is no difference between the two sentences except for “happily” and “angrily”. Even the most powerful models cannot deduce opposite semantic expressions from an identical context. Existing methods (Gao et al., Citation2021; Qian et al., Citation2022) lack direct learning of the smallest semantic units and may therefore perform poorly on short sequences.

Secondly, unlike image data, where high-dimensional representations can be obtained directly from the original documents (He et al., Citation2022), text data lacks a reliable label representation (Lee et al., Citation2022) if contrastive learning is performed at the representation level. Recent advances still follow the in-batch contrastive learning approach based on an NT-Xent-like loss. The data within a batch comprises multiple sample pairs, and each sample pair contains two semantically similar samples; for example, a pair of text sequences with an entailment relation is used as a sample pair in SimCSE (Gao et al., Citation2021). During training, the two samples within each sample pair are mutually positive, while all other data within the batch form mutually negative pairs. However, this in-batch contrastive learning approach depends on the batch size and usually works well only for large batches (usually containing thousands of sample pairs). Also, the two samples in a positive pair are not semantically identical. For example, “Two dogs are running.” and “There are animals outdoors.” are treated as a pair of positive samples. Although the two sentences are semantically related, they do not have exactly the same meaning, yet the NT-Xent training objective drives their representations to converge, which is inconsistent with the actual semantics.

To solve the above problems, we propose MoCoUTRL (a Momentum Contrastive Framework for Unsupervised Text Representation Learning), making the following main contributions. The code of this work is open-sourced at https://github.com/Uchiha-Monroe/MoCoUTRL_pub_ver

  1. MoCoUTRL introduces a contrastive learning loss at the token level in addition to the conventional sentence level, so that the semantic features of the samples can be learned more fully, which especially improves the model’s performance on shorter text sequences.

  2. Unlike the positive and negative sample generation methods used in previous text contrastive learning, this work constructs a dynamic representation dictionary that provides augmented positive counterparts for token-level contrastive learning, avoiding the semantic bias associated with traditional text augmentation.

  3. In addition to the conventional comparative experiments, this work further validates the rationality and interpretability of the proposed method by visualising and quantitatively analysing the representations of the trained MoCoUTRL model using two attributes related to the quality of representation learning: alignment and uniformity (Wang & Isola, Citation2020).

2. Related works

2.1 Contrastive learning

Contrastive learning is an essential unsupervised learning method because it eschews manual labelling. Contrastive learning originated in CV (Dosovitskiy et al., Citation2014; Hadsell et al., Citation2006) and has also made progress in other fields of DL (Bachman et al., Citation2019; Kong et al., Citation2019; Oord et al., Citation2019). The inspiration for contrastive learning was inherited from the InfoMax criterion (Linsker, Citation1988) and has bred a series of innovative research over the years. SimCLR (Chen et al., Citation2020) proposes a straightforward, high-performance image representation learning method using contrastive learning; the nonlinear projector introduced at the end of the model significantly improves performance. MoCo (He et al., Citation2020) uses a momentum encoder to replace the original memory bank, a data structure that stores fixed label representations, dramatically improving the algorithm’s efficiency. BYOL (Grill et al., Citation2020) innovatively proposes a generative contrastive learning method that abandons the negative samples used in traditional contrastive learning. SwAV (Caron et al., Citation2020) creates prototype labels, similar to cluster centres, for samples in contrastive learning, improving the stability of the training process. There is also much recent research applying contrastive learning to various subfields of NLP. Numerous empirical successes reveal the great potential of contrastive learning for representation learning.

2.2 Text representation learning

A good representation contains rich semantic information and can reveal the data distribution, so that this semantic information can be better applied to various downstream tasks. Text data is probably the most abstract and complex data format currently faced by deep learning, and research on its representation learning has always been a focus of natural language processing. The introduction of distributed word representations (Mikolov, Sutskever, et al., Citation2013) moved text representation beyond statistical bag-of-words methods such as TF-IDF. Word embedding assigns a distributed high-dimensional representation to each word. The proposal of neural net language models (Bengio et al., Citation2003) initiated the subsequent exploration of text representation, and in most later studies text representation learning is closely tied to language models. With the advent of contextual word embeddings, research on text representation learning entered a stage of rapid development. We introduce the background of contextual word embeddings and PLMs in the next section.

2.3 Pre-trained language models

Word embedding models, such as Word2vec (Mikolov, Sutskever, et al., Citation2013) and GloVe (Pennington et al., Citation2014), provide each token with a unique and invariable representation and suffer from problems with polysemous words and out-of-vocabulary (OOV) words. With the subsequent emergence of ELMo (Peters et al., Citation2018), GPT (Radford et al., Citation2018) and BERT (Devlin et al., Citation2019), large-scale PTMs have gradually become the mainstream method in NLP. The so-called contextual embeddings they generate are dynamic representations, which solve the problems faced by their static counterparts. Roughly, PTMs can be divided into the following categories according to model structure: autoregressive models represented by GPT (Radford et al., Citation2018), CTRL (Keskar et al., Citation2019), Transformer-XL (Dai et al., Citation2019) and Reformer (Kitaev et al., Citation2020); autoencoding models represented by BERT (Devlin et al., Citation2019), ALBERT (Lan et al., Citation2020) and RoBERTa (Liu et al., Citation2021); and sequence-to-sequence models represented by BART (Lewis et al., Citation2020) and T5 (Raffel et al., Citation2020). Overall, autoencoding language models like BERT were the first to achieve universal breakthroughs across various NLP sub-tasks and are better at handling discriminative downstream tasks. Autoregressive language models represented by GPT follow the design pattern of traditional unidirectional language models and are relatively good at text-generation tasks. Seq2seq pre-trained language models combine the advantages of the above two frameworks in order to cover both discriminative and generative downstream tasks, and experiments have shown that models such as T5 indeed achieved very good results at the time.

3. Methodology

The general architecture of our proposed MoCoUTRL is depicted in Figure 1. We aim to learn a text representation function that maps each input text to a feature vector in an unsupervised way. Generally, the framework learns by comparing the distance between inputs and their augmented samples in the feature space. There are five major components in this framework: (1) a random data augmentation module, (2) a PTM-based encoder, (3) a momentum-based dynamic dictionary, (4) a nonlinear feature projector, and (5) a contrastive loss function module. In the rest of this section, we first go through the overall workflow of the framework and then describe each of these five components.

Figure 1. The overall structure of MoCoUTRL.

In this part, we first give the overall structure of MoCoUTRL, as shown in Figure 1, and briefly describe the proposed method. Later, this work elaborates on the data augmentation module, the momentum-based dynamic dictionary, the PTM-based representation encoder, the nonlinear feature projector and the contrastive loss function in five subsections. Finally, the overall MoCoUTRL algorithm flow is given in detail.

Figure 1 depicts the overall structure of our proposed MoCoUTRL model. SLD and TLD in the figure stand for Semantic Latent Decoder and Token Latent Decoder, respectively. For the input text, MoCoUTRL differs only in the component used to transform the text into an initial vector representation: it uses a regular embedding layer for the prototype sample and the dynamic dictionary for the positive and negative samples, while the remaining components (the encoder, the SLD projector and the TLD projector) are shared.

The whole training corpus $C=\{x_1, x_2, x_3, \ldots, x_N\}$ consists of cleaned sentences from unlabelled text. Before training, data augmentation converts each sample $x_i$ into a positive sample $x_i^{+}$ and a negative sample $x_i^{-}$. Then, all these samples are converted into ordered sequences of tokens by tokenisation. The tokenised $x_i=[w_{cls}, w_1, w_2, \ldots, w_n]$ is transformed into a tensor $\mathbf{x}_i \in \mathbb{R}^{n \times emb}$ through the word embedding layer $e_{\varphi}(\cdot)$, where $n$ is the length of the token sequence and $emb$ is the dimension of the word embeddings. Similarly, the augmented samples $x_i^{+}$ and $x_i^{-}$ are also transformed into tensors $\mathbf{x}_i^{+}$ and $\mathbf{x}_i^{-}$ by tokenisation and vectorisation, but using the dynamic dictionary $d_{\xi}(\cdot)$ instead of $e_{\varphi}(\cdot)$. Afterwards, these tensors are respectively fed into a PTM-based feature encoder $f_{\theta}(\cdot)$ to generate feature representations.

3.1 Data augmentation module

Data augmentation is of essential significance in discriminative contrastive learning. The pattern of data augmentation and the number of augmented samples associated with each sample during training profoundly impact the overall contrastive learning framework. Typical contrastive learning brings positive sample pairs as close as possible and pushes negative sample pairs as far apart as possible in the feature space, so that the final representations generated by the feature encoder $f_{\theta}(\cdot)$ contain rich semantic information. Unlike the many intuitive and effective data augmentation methods in CV, such as cropping, rotation and colour distortion (Chen et al., Citation2020), there is no consensus on text augmentation methods in the NLP field.

MoCoUTRL conducts data augmentation using synonym and antonym substitution. During training, MoCoUTRL uses both sample-level and token-level contrastive learning tasks. Specifically, for a sample-level text denoted as $x^{ori}$, a randomly selected token undergoes a synonym or antonym substitution, producing positive and negative samples labelled $x^{syn}$ and $x^{ant}$, respectively. A minimal sketch of this augmentation step is given below.
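The paper does not specify which lexical resource supplies the synonyms and antonyms, so the sketch below uses NLTK's WordNet purely as an illustrative assumption; the function name `augment` and the random anchor-selection strategy are likewise hypothetical.

```python
# Synonym/antonym substitution sketch; WordNet is an assumed lexical resource,
# not necessarily the one used by the authors.
import random
from nltk.corpus import wordnet  # requires a prior nltk.download("wordnet")

def augment(sentence: str, seed: int = 0):
    """Return (positive, negative) variants of `sentence` by replacing one
    randomly chosen anchor token with a synonym and an antonym."""
    random.seed(seed)
    tokens = sentence.split()
    for idx in random.sample(range(len(tokens)), len(tokens)):
        word = tokens[idx].lower().strip(".,!?")
        synonyms, antonyms = set(), set()
        for syn in wordnet.synsets(word):
            for lemma in syn.lemmas():
                if lemma.name().lower() != word:
                    synonyms.add(lemma.name().replace("_", " "))
                for ant in lemma.antonyms():
                    antonyms.add(ant.name().replace("_", " "))
        if synonyms and antonyms:  # the anchor must admit both substitutions
            pos = tokens[:idx] + [random.choice(sorted(synonyms))] + tokens[idx + 1:]
            neg = tokens[:idx] + [random.choice(sorted(antonyms))] + tokens[idx + 1:]
            return " ".join(pos), " ".join(neg)
    return None  # no suitable anchor token found

print(augment("Jason ran away happily without any words"))
```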

3.2 Momentum update-based dynamic dictionary

MoCoUTRL trains the encoder $f_{\theta}(\cdot)$ by matching an encoded query to the keys produced by a dynamic dictionary $d_{\xi}(\cdot)$ through contrastive training. Unlike traditional contrastive learning methods based on Siamese models, MoCoUTRL employs a conventional encoder to transform input samples into high-dimensional representations, while the positively and negatively augmented samples use an extra, dynamically maintained feature dictionary to obtain real-time feature representations. In principle, a shared encoder could embed the raw samples and the positive and negative samples, which is one of the fundamental paradigms of contrastive learning (or so-called metric learning) in recent years. However, a training framework that utilises a dynamic feature dictionary converges more quickly and stably, which has been validated in CV research. The dynamic dictionary can be seen as holding approximate ground-truth representations of all tokens; its parameters are initialised from the word embedding layer of a PTM and then updated by momentum at each training step. Note that in previous CV studies the whole dynamic dictionary is often too large to be stored in memory. This problem does not exist in NLP, as the size of the full dictionary obtained after tokenisation is acceptable for modern high-performance GPUs. Here, this work opts for the empirical approach of momentum contrast (He et al., Citation2020) to update the parameters of the dynamic dictionary. Specifically, the parameters of the word embedding matrix $e_{\varphi}(\cdot)$ are updated by ordinary backpropagation, while the parameter matrix of the dynamic dictionary is updated by momentum. Denote the representation of the $k$th word in the vocabulary in the word embedding layer and in the dynamic dictionary as $\varphi_k$ and $\xi_k$, respectively. At each training step, the parameters of the dynamic dictionary are iterated according to the following momentum update: (1) $\xi_k \leftarrow m\,\xi_k + (1-m)\,\varphi_k$. Here $m\in[0,1)$ is a hyperparameter controlling the rate of the momentum update of $\xi_k$. In our experiments, we take relatively large values of $m$ (close to 1), allowing $\xi_k$ to evolve in a slower but smoother and more stable manner than $\varphi_k$. In CV, a similar ground-truth table can be implemented intuitively, i.e. the original RGB values in the pixel matrix. However, there is no direct way to transform words into real-valued vectors, so MoCoUTRL uses the word embedding layer of a PTM to initialise this ground-truth table.
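A minimal PyTorch sketch of the momentum update in Eq. (1), assuming both the word embedding layer $e_{\varphi}$ and the dynamic dictionary $d_{\xi}$ are stored as $|V|\times emb$ matrices; the vocabulary size and embedding dimension are placeholders.

```python
# Momentum update of the dynamic dictionary (Eq. 1); sizes are placeholders.
import torch

vocab_size, emb_dim, m = 30_000, 768, 0.999

phi = torch.nn.Embedding(vocab_size, emb_dim)   # e_phi: updated by backpropagation
xi = phi.weight.detach().clone()                # d_xi: initialised from e_phi

@torch.no_grad()
def momentum_update(xi: torch.Tensor, phi_weight: torch.Tensor, m: float) -> None:
    # xi_k <- m * xi_k + (1 - m) * phi_k, applied row-wise to the whole vocabulary
    xi.mul_(m).add_(phi_weight, alpha=1.0 - m)

momentum_update(xi, phi.weight, m)  # called once per training step, after the optimiser step
```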

3.3 PTM-based representation encoder

Deep neural networks built upon PTMs have become the most common paradigm in NLP today: fine-tuning or prompt learning on downstream applications achieves state-of-the-art results on various sub-tasks. These studies demonstrate that Transformer-based PTMs are currently the best method for processing textual data. This work uses pre-trained DeBERTa (He et al., Citation2021) as the feature encoder, a successor to BERT with substantial changes to the training corpus size, parameter settings and pre-training tasks.

3.4 Nonlinear feature projector

Multiple empirical studies have demonstrated that adding a nonlinear transformation layer for contrastive learning after the feature encoding layer helps the feature encoder obtain better representations. Although contrastive learning is currently a powerful and prevalent tool for unsupervised representation learning, especially in the CV field, there is a gap between the objective of contrastive learning and that of most downstream tasks, and the role of the nonlinear transformation layer is to fill this gap. The model can then generate features with universal semantics but less task-specific knowledge. A sketch of such a projector is shown below.
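The following is a minimal sketch of a two-layer nonlinear projector consistent with the form of the projector equations given in Section 3.5; the hidden and output dimensions are assumptions, since the paper does not report them.

```python
# Two-layer nonlinear projector sketch (SLD/TLD style); dimensions are assumed.
import torch.nn as nn

class Projector(nn.Module):
    def __init__(self, in_dim: int = 768, hidden_dim: int = 768, out_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),   # W^(1) x + b^(1)
            nn.GELU(),                       # sigma(.)
            nn.Linear(hidden_dim, out_dim),  # W^(2) (.) + b^(2)
            nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)
```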

3.5 Contrastive loss function

In this work, contrastive loss functions are designed at two semantic granularities: the sentence (long-span) level and the token level.

3.5.1 Long-span contrastive loss

For each $x_i$ ($1\le i\le N$), we obtain its positive variant $x_i^{+}$ and negative variant $x_i^{-}$ by data augmentation. The encoder then produces their instant representations through $f_{\theta}(\cdot)$, and the nonlinear projector is used to obtain the feature representations for the contrastive training task: (2) $z_i=\sigma\!\left(W_{SLD}^{(2)}\,\sigma\!\left(W_{SLD}^{(1)}\,\mathrm{Pooling}(f_{\theta}(x_i))+b_{SLD}^{(1)}\right)+b_{SLD}^{(2)}\right)$ (3) $z_i^{+}=\sigma\!\left(W_{SLD}^{(2)}\,\sigma\!\left(W_{SLD}^{(1)}\,\mathrm{Pooling}(f_{\theta}(x_i^{+}))+b_{SLD}^{(1)}\right)+b_{SLD}^{(2)}\right)$ (4) $z_i^{-}=\sigma\!\left(W_{SLD}^{(2)}\,\sigma\!\left(W_{SLD}^{(1)}\,\mathrm{Pooling}(f_{\theta}(x_i^{-}))+b_{SLD}^{(1)}\right)+b_{SLD}^{(2)}\right)$

In the above equations, $\sigma(\cdot)$ denotes the activation function defined in Eq. (5), and $W$ and $b$ are respectively the weight matrices and bias vectors of the affine transformations. $\mathrm{Pooling}(\cdot)$ in Eqs. (2)–(4) refers to a pooling operation that aggregates the high-dimensional features generated for each input token into a single feature representing the entire text sequence. In downstream tasks, [CLS]-pooling is often used to integrate all output representations; however, many studies have shown that this method is ineffective for representation learning, sometimes performing worse than static word embedding models such as GloVe (Pennington et al., Citation2014). The commonly used pooling method for sentence representation is the average of all token features output by the last layer of the model.
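A common implementation of this mean pooling masks out padding tokens before averaging; the sketch below is illustrative and not the authors' exact code.

```python
# Mean pooling over the last hidden layer, ignoring padding positions.
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """last_hidden_state: (B, n, emb); attention_mask: (B, n), 1 for real tokens."""
    mask = attention_mask.unsqueeze(-1).float()       # (B, n, 1)
    summed = (last_hidden_state * mask).sum(dim=1)    # (B, emb)
    counts = mask.sum(dim=1).clamp(min=1e-9)          # (B, 1)
    return summed / counts
```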

The $\sigma(\cdot)$ in the above equations is the Gaussian Error Linear Unit (GELU) (Wolf et al., Citation2020) activation function: (5) $\sigma(x)=\mathrm{GELU}(x)=x\,\Phi(x)=\frac{x}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^{2}/2}\,dt$. We draw on the idea of InfoNCE (Oord et al., Citation2019) and design a similar but not entirely identical loss function for training. In the original InfoNCE, a batch of data consists of pairs of positive samples, and any unmatched samples form negative pairs. The InfoNCE objective, also known as the NT-Xent loss, was proposed by Oord et al. (Citation2019), who construct a generative prediction task for contrastive learning and build an objective by maximising the mutual information between the current context and the data to be predicted. In this work, we construct a separate negative sample for each sample to increase the diversity between samples with opposite semantics. Therefore, in actual training, negatives come only from the separately created negative samples, not from any unmatched sample. Precisely, assuming a training batch consists of $B$ (original sample, positive sample, negative sample) triplets, the loss for the $i$-th triplet is calculated as follows: (6) $\mathcal{L}_{long\text{-}span}^{(i)}=-\log\frac{\exp(z_i\cdot z_i^{+}/\tau)}{\sum_{j=1}^{B}\exp(z_i\cdot z_j^{-}/\tau)}$

The $\tau$ appearing in Eq. (6) is the temperature factor, which controls the sharpness of the softmax-like loss function.
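A minimal PyTorch sketch of Eq. (6), assuming the inputs are the projected features of a batch of anchors, positives and negatives and that features are L2-normalised so the dot product acts as a cosine similarity; reading the denominator as a sum over the explicit negatives follows the description above.

```python
# Long-span contrastive loss (Eq. 6) sketch.
import torch
import torch.nn.functional as F

def long_span_loss(z: torch.Tensor, z_pos: torch.Tensor, z_neg: torch.Tensor,
                   tau: float = 0.07) -> torch.Tensor:
    """z, z_pos, z_neg: (B, d) projected features of anchors, positives and negatives."""
    z, z_pos, z_neg = (F.normalize(t, dim=-1) for t in (z, z_pos, z_neg))
    pos = (z * z_pos).sum(dim=-1) / tau      # (B,)   z_i . z_i^+ / tau
    neg = z @ z_neg.T / tau                  # (B, B) z_i . z_j^- / tau
    # -log( exp(pos_i) / sum_j exp(neg_ij) )
    return (torch.logsumexp(neg, dim=-1) - pos).mean()
```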

3.5.2 Token-level contrastive loss

Because of the importance of tokens in NLP, MoCoUTRL additionally constructs a token-level contrastive loss function. For each $x_i=[w_{cls}, w_1, w_2, \ldots, w_n]$, the token chosen as the anchor is denoted $w_i^{ori}$. The text obtained by replacing the anchor word $w_i^{ori}$ with a synonym $w_i^{syn}$ is the positive sample $x_i^{+}$, and the text obtained by replacing the anchor word with an antonym $w_i^{ant}$ is the negative sample $x_i^{-}$. For the replacing words $w_i^{syn}$ and $w_i^{ant}$, the output of the last layer of the encoder is passed through the TLD projector to obtain their feature representations: (7) $\mathbf{w}_i^{syn}=\sigma\!\left(W_{TLD}^{(2)}\,\sigma\!\left(W_{TLD}^{(1)} f_{\theta}(w_i^{syn})+b_{TLD}^{(1)}\right)+b_{TLD}^{(2)}\right)$ (8) $\mathbf{w}_i^{ant}=\sigma\!\left(W_{TLD}^{(2)}\,\sigma\!\left(W_{TLD}^{(1)} f_{\theta}(w_i^{ant})+b_{TLD}^{(1)}\right)+b_{TLD}^{(2)}\right)$

For each anchor word, we use the dynamic dictionary $d_{\xi}(\cdot)$ to obtain its feature representation, which serves as the anchor (label) representation for token-level contrastive learning: (9) $e_i^{ori}=d_{\xi}(w_i^{ori})$. In each training batch, the token-level loss is then calculated as follows: (10) $\mathcal{L}_{token\text{-}level}=\frac{1}{N}\sum_{i=1}^{N}\left[\mathrm{MSELoss}(e_i^{ori},\mathbf{w}_i^{syn})-\mathrm{MSELoss}(e_i^{ori},\mathbf{w}_i^{ant})\right]$

$\mathrm{MSELoss}(\cdot,\cdot)$ in Eq. (10) denotes the mean squared error (MSE) between a pair of embeddings. A hyperparameter $\lambda\in[0,1]$ is used to balance the significance of the two losses: (11) $\mathcal{L}=\lambda\,\mathcal{L}_{long\text{-}span}+(1-\lambda)\,\mathcal{L}_{token\text{-}level}$
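A sketch of Eqs. (10) and (11), assuming `e_ori` comes from the dynamic dictionary and `w_syn` / `w_ant` from the TLD projector, all of shape (B, d); the minus sign pulls synonym representations towards the dictionary label and pushes antonym representations away from it.

```python
# Token-level loss (Eq. 10) and combined objective (Eq. 11) sketch.
import torch
import torch.nn.functional as F

def token_level_loss(e_ori: torch.Tensor, w_syn: torch.Tensor, w_ant: torch.Tensor) -> torch.Tensor:
    # MSE(e_ori, w_syn) - MSE(e_ori, w_ant): attract synonyms, repel antonyms
    return F.mse_loss(w_syn, e_ori) - F.mse_loss(w_ant, e_ori)

def total_loss(l_long_span: torch.Tensor, l_token_level: torch.Tensor, lam: float) -> torch.Tensor:
    # L = lambda * L_long-span + (1 - lambda) * L_token-level
    return lam * l_long_span + (1.0 - lam) * l_token_level
```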

3.6 Algorithm framework of MoCoUTRL

After introducing all components of the proposed model, we depict the steps and details of MoCoUTRL as a whole in Algorithm 1.

4. Experiments

4.1 Datasets

The experiments in this work are all carried out on Ubuntu 18.04 with four GPUs (NVIDIA Tesla V100 PCIe-32GB). This work is built upon open-source frameworks such as PyTorch (Paszke et al., Citation2019) and Transformers (Wolf et al., Citation2020).

The BookCorpus (Zhu et al., Citation2015) is used to construct the training data. The BookCorpus dataset consists of free novels written by unpublished authors; it contains 11,038 novels, about 74M sentences and roughly one billion words, and can be divided into 16 themes, such as history and adventure. The Semantic Textual Similarity (STS) Benchmark (Cer et al., Citation2017) and the Web of Science (WOS) dataset (Kowsari et al., Citation2017) are used for the comparison experiments. The STS Benchmark comprises a selection of the English datasets used in the STS tasks organised in SemEval between 2012 and 2017; its corpus is derived from image captions, news headlines and forum texts. The WOS dataset is a text classification dataset collected by Kowsari et al. (Citation2017) from 46,985 published Web of Science papers, containing seven categories and 134 subcategories.

4.2 Experimental settings

First and foremost, in selecting the encoder model, we tried BERT, RoBERTa and DeBERTa. These three have been strong benchmark models for an extended period since large-scale PTMs emerged at the end of 2018. After evaluation, this work selects the DeBERTa model as the backbone of the encoder. In preprocessing the BookCorpus, this paper adopts the method of Kowsari et al. (Citation2017) to convert it into one sentence per sample and then performs data augmentation to construct the training triplets (anchor sample, positive sample, negative sample).

In the experiments, the temperature constant is set to 0.07 by default, the momentum coefficient to 0.999 and the base learning rate to 1e-6, and a cosine annealing schedule dynamically adjusts the learning rate during training. The default batch size in the contrastive learning phase is 32, so each training batch contains 96 samples in total, and each anchor sample has one positive sample at both the long-span and token levels; the corresponding number of explicit negative samples is 3 × 32 − 2 = 94. The momentum update of the dictionary relies on the entire word embedding matrix of the encoder, so the other tokens in the vocabulary are implicitly used as weak negative samples in the token-level loss.
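The sketch below wires the reported hyperparameters into a PyTorch optimiser and cosine-annealing schedule; the choice of AdamW, the stand-in model and the number of training steps are assumptions, not values reported in the paper.

```python
# Hyperparameter setup sketch; optimiser choice and total_steps are assumed.
import torch

config = dict(temperature=0.07, momentum=0.999, lr=1e-6, batch_size=32)

model = torch.nn.Linear(768, 768)   # stand-in for the DeBERTa encoder + projectors
total_steps = 100_000               # placeholder

optimizer = torch.optim.AdamW(model.parameters(), lr=config["lr"])
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
```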

4.3 Comparative experiments on STS Benchmark and WOS

We use the Spearman correlation coefficient as the evaluation criterion on the STS Benchmark and SICK datasets. Denote the ground truth of the whole dataset as $\{y_i\}_{i=1}^{N}$ and the model’s output as $\{\hat{y}_i\}_{i=1}^{N}$; the Spearman correlation coefficient is computed on the rank-transformed scores as follows: (12) $\rho=\frac{\sum_{i=1}^{N}(y_i-\bar{y})(\hat{y}_i-\bar{\hat{y}})}{\sqrt{\sum_{i=1}^{N}(y_i-\bar{y})^{2}}\sqrt{\sum_{i=1}^{N}(\hat{y}_i-\bar{\hat{y}})^{2}}}$. This work adopts both unsupervised and supervised models as baselines in the comparison experiments. The main unsupervised benchmark models include GloVe, BERT, BERT-flow (Li et al., Citation2020), BERT-whitening (Su et al., Citation2021), RoBERTa, CT-BERT (Carlsson et al., Citation2021), SimCSE-BERT (Gao et al., Citation2021), SimCSE-RoBERTa (Gao et al., Citation2021) and DiffCSE (Chuang et al., Citation2022). The adopted supervised benchmark models include InferSent-GloVe, SentenceBERT (Reimers & Gurevych, Citation2019), SentenceBERT-flow (Carlsson et al., Citation2021), SentenceBERT-whitening (Gao et al., Citation2021), SentenceRoBERTa (Reimers & Gurevych, Citation2019) and SimCSE-RoBERTa (Gao et al., Citation2021).
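A minimal evaluation sketch: since Spearman's ρ is the correlation of Eq. (12) computed on rank-transformed scores, scipy provides it directly; the score lists are toy placeholders.

```python
# Spearman correlation between gold similarity scores and model scores.
from scipy.stats import spearmanr

gold = [4.5, 2.0, 3.8, 0.5]        # toy ground-truth similarity scores
pred = [0.91, 0.35, 0.77, 0.10]    # toy cosine similarities from the model

rho, p_value = spearmanr(gold, pred)
print(f"Spearman correlation: {rho:.4f}")
```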

The WOS dataset is a multi-class text classification task, and the proportions of the different categories differ. This work uses Accuracy, Recall-Macro, Precision-Macro and F1score-Macro to compare MoCoUTRL with other models.

The mathematical expressions of the evaluation metrics used on the WOS dataset are listed in Eqs. (13)–(16). Note that this work multiplies the original value of each metric by 100 for straightforward representation. (13) $\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}\times 100$ (14) $\mathrm{Precision}=\frac{TP}{TP+FP}\times 100$ (15) $\mathrm{Recall}=\frac{TP}{TP+FN}\times 100$ (16) $F_{1}\mathrm{Score}=\frac{2\times(\mathrm{Precision}\times \mathrm{Recall})}{\mathrm{Precision}+\mathrm{Recall}}$

In the above formulations, True Positive (TP) is the number of documents of a specific category that are correctly classified into that category, False Negative (FN) is the number of documents of that category that are incorrectly classified into other categories, True Negative (TN) is the number of documents not belonging to that category that are correctly classified as other categories, and False Positive (FP) is the number of documents not belonging to that category that are incorrectly classified into it. The final evaluation metrics over all categories in the WOS dataset are obtained by macro-averaging.
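A sketch of computing the macro-averaged metrics with scikit-learn, scaled by 100 as described above; the label lists are toy placeholders.

```python
# Macro-averaged classification metrics for the WOS experiments (sketch).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 2, 2, 1, 0]   # toy ground-truth category ids
y_pred = [0, 1, 2, 1, 1, 0]   # toy predicted category ids

metrics = {
    "Accuracy": accuracy_score(y_true, y_pred) * 100,
    "Precision-Macro": precision_score(y_true, y_pred, average="macro") * 100,
    "Recall-Macro": recall_score(y_true, y_pred, average="macro") * 100,
    "F1score-Macro": f1_score(y_true, y_pred, average="macro") * 100,
}
print(metrics)
```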

The performance of MoCoUTRL and other methods is presented in Table 1, and the experimental results on the WOS dataset are shown in Table 2. In Table 1, the unsupervised and supervised algorithms are presented separately in the upper and lower parts of the table. The table displays the experimental results of these models on the STS series tasks and the SICK task, with the highest score in each task bolded separately for the unsupervised and supervised models. For the models in Table 1, the content in parentheses represents the pooling method used for feature extraction with the PTM: “last2 avg” represents the average output of the last two layers of the model, while “first-last avg” represents the average output of the first and last layers. Benchmark model names followed by “flow” and “whitening” represent the BERT-flow and BERT-whitening variants, respectively. “MoCoUTRL + MLP” means we add a multi-layer perceptron on top of MoCoUTRL for supervised training on the downstream datasets.

Table 1. Performance on STS tasks.

Table 2. Performance on the WOS dataset.

4.4 Analysis of alignment and uniformity

The problem of anisotropy in text representation learning has long been discussed and studied, and it is especially evident in PTMs. Because the output text representations show strong anisotropy, most features are distributed in a relatively narrow high-dimensional cone, which severely limits their expressive ability. Wang and Isola (Citation2020) propose alignment and uniformity, two properties that a good feature distribution needs to satisfy. Alignment means that similar samples should have similar feature representations, and uniformity means that feature representations should be spread roughly uniformly over the hypersphere so that they retain as much information as possible.

Previous experiments have shown that the text representations produced by many PTMs exhibit obvious anisotropy, contrary to the ideal theoretical properties expected. To further explore the features generated by MoCoUTRL qualitatively and quantitatively, this work analyses the alignment and uniformity of the output features.

Regarding alignment, two methods are used to visualise the behaviour: the first is to compute the L2 distance between the output features of samples of the same category, and the second is to randomly select the representations of 5 tokens for each sample. These features are reduced to $\mathbb{R}^{2}$ using linear-kernel PCA, the feature distributions are plotted with Gaussian kernel density estimation (KDE), and a von Mises-Fisher (vMF) KDE is computed on the angles ($\arctan(x_1,x_2)$ is used in this work) of each sample $(x_1,x_2)$ in $\mathbb{R}^{2}$. For uniformity, we use a procedure similar to the second alignment method, except that the input samples all belong to a single category. The features of four subcategories are selected in the figure for display. A sketch of this visualisation pipeline follows.
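The sketch below reproduces the described pipeline under assumptions: a linear PCA (equivalent to linear-kernel kernel PCA) projects features to $\mathbb{R}^2$, a Gaussian KDE estimates the 2-D density, and a one-dimensional KDE over atan2 angles serves as the angular (vMF-style) view; the feature matrix is a random placeholder.

```python
# Alignment/uniformity visualisation sketch; `features` is a placeholder.
import numpy as np
from sklearn.decomposition import PCA
from scipy.stats import gaussian_kde

features = np.random.randn(500, 768)                 # stand-in for encoder outputs

xy = PCA(n_components=2).fit_transform(features)     # linear PCA to R^2
density_2d = gaussian_kde(xy.T)                      # Gaussian KDE over (x1, x2)
angles = np.arctan2(xy[:, 1], xy[:, 0])              # angle of each projected sample
density_angle = gaussian_kde(angles)                 # 1-D KDE on angles (vMF-style view)

print(density_2d(xy[:5].T), density_angle(angles[:5]))
```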

It can be seen from Figure 2 that the pre-trained DeBERTa model has strong anisotropy: the distance between any two samples is small, and the entire representation space is shaped like a narrow cone. The supervised DeBERTa is better at distinguishing samples of different categories; nevertheless, it performs poorly in terms of uniformity, since most sample representations are confined to several isolated regions. MoCoUTRL exhibits both good alignment and good uniformity. Although no supervised training is applied, MoCoUTRL can still clearly distinguish different types of samples, as can be seen from the distribution of the sample representations. The representations generated by MoCoUTRL are distributed more evenly over the entire representation space, thus preserving maximal information.

Figure 2. Analyses on alignment and uniformity.

4.5 L2 normalisation with Gaussian perturbation

When computing $\mathcal{L}_{token\text{-}level}$, a dynamic vocabulary updated by parameter momentum is used as the label representation for each token. Current contrastive methods use L2 normalisation to ensure that all representations lie on a hypersphere; thus, if cosine similarity is used as the similarity between any two representations, the dot product can be used to simplify the calculation. However, this approach also has problems, one of the most prominent being the sparsity of the data distribution in high-dimensional space, as illustrated in Figure 3.

Figure 3. The sparsity of high-dimensional space.

Instead of distributing all representations exactly on the surface of the hypersphere, this work distributes the features within a thin outer shell of the hypersphere with a certain thickness. Taking the 2D case as an example, assume that the entire circle of radius 1 is the range of all possible representations and that the blue ring is the actual distribution range of the sample representations output by the model. The proportion of the ring relative to the entire space is: (17) $\eta=\frac{V_{ring}}{V_{C}}=\frac{\pi r^{2}-\pi\left(r(1-\varepsilon)\right)^{2}}{\pi r^{2}}=1-(1-\varepsilon)^{2}$. Extending this to an $n$-dimensional space: (18) $\eta=\frac{V_{ring}}{V_{C}}=\frac{k\pi r^{n}-k\pi\left(r(1-\varepsilon)\right)^{n}}{k\pi r^{n}}=1-(1-\varepsilon)^{n}$

Since $\varepsilon\in(0,1)$, when the dimension $n$ of the space approaches infinity, we have: (19) $\eta=\lim_{n\to\infty}1-(1-\varepsilon)^{n}=1$. That is, as long as the thickness of the outer shell of the hypersphere is any constant greater than 0, the proportion of the volume occupied by the shell approaches that of the entire representation space. In practical experiments, since the generated feature representations are finite-dimensional, a more realistic setting must be considered. The benchmark encoder used in the experiments is DeBERTa-base, whose output feature dimension is 768. Assuming the thickness of the shell is 1% of the radius of the hypersphere, then: (20) $\eta=1-(1-0.01)^{768}\approx 0.9996$

It can be seen that in high-dimensional space a large part of the volume of the hypersphere is concentrated in the outermost shell. Therefore, limiting the thickness of the shell to one-hundredth of the radius of the hypersphere already significantly improves the utilisation of the feature space by the sample representations. To achieve this, we subtract a random variable $d\sim\mathcal{N}(\mu,\Sigma)$ from the “label” representation (that is, the representation from the dynamic vocabulary) after L2 normalisation, where $d\in\mathbb{R}^{D}$, $\mu=[0.005,0.005,\ldots,0.005]^{T}$, and $\Sigma$ is the diagonal matrix $\Sigma=\mathrm{diag}(0.005/3)$. This paper adopts the commonly used 3$\sigma$ rule, which ensures that most samples of this distribution (about 99.74%) fall within the 3$\sigma$ interval around the mean. In summary, for the representation $e_i=d_{\xi}(w_i)$ of any token in the dynamic vocabulary after L2 normalisation, the final representation with Gaussian perturbation is: (21) $\tilde{e}_i=e_i-d,\quad d\sim\mathcal{N}(\mu,\Sigma)$
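A short sketch of Eqs. (18)–(21): the shell-volume fraction for the reported dimension, and the Gaussian perturbation subtracted from an L2-normalised dictionary representation. Reading $\Sigma=\mathrm{diag}(0.005/3)$ as a per-dimension standard deviation of 0.005/3 is an assumption of this sketch.

```python
# Shell volume fraction (Eqs. 18-20) and Gaussian perturbation (Eq. 21) sketch.
import torch
import torch.nn.functional as F

def shell_fraction(eps: float, n: int) -> float:
    # eta = 1 - (1 - eps)^n
    return 1.0 - (1.0 - eps) ** n

print(f"eta(eps=0.01, n=768) = {shell_fraction(0.01, 768):.4f}")   # ~0.9996

D = 768
mu = torch.full((D,), 0.005)        # mean vector of the perturbation
sigma = 0.005 / 3                   # per-dimension std (3-sigma rule, assumed reading)

e = F.normalize(torch.randn(D), dim=0)   # L2-normalised token "label" representation
d = mu + sigma * torch.randn(D)          # d ~ N(mu, Sigma) with diagonal Sigma
e_tilde = e - d                          # Eq. (21)
```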

4.6 Ablation study

To comprehensively test the proposed method, we conduct a series of ablation experiments in this section to verify the effectiveness of each component of MoCoUTRL. Specifically, we first experiment with the backbone of MoCoUTRL’s encoder, comparing BERT, RoBERTa and DeBERTa, and finally select DeBERTa as the backbone model. Then we analyse the effects of the two contrastive losses at different granularities. In addition, L2 normalisation with Gaussian perturbation is a component believed to have a significant effect in this work, and we also analyse its effectiveness. All ablation studies are conducted on the BookCorpus dataset.

The experimental data in Table 3 are the best results obtained from five independent repeated runs. From the results, it is easy to see that DeBERTa and RoBERTa achieve significant performance advantages over BERT as backbones, which is why this work chooses DeBERTa as the backbone model for MoCoUTRL. Regarding the losses at the two semantic granularities, repeated experiments show that the contribution of the token-level contrastive loss is significantly smaller than that of the long-span contrastive loss; that is, the performance improvement mainly comes from understanding a text sequence as a whole rather than from the semantic difference of a single token. Intuitively, language diversity means that two text segments with similar semantics may not be the same length, and two text segments with opposite semantics need not differ only by antonyms or by negated predicates or auxiliary verbs. This diversity may be why the token-level contrastive loss cannot dominate and can serve only as an auxiliary training loss. We conducted multiple experiments on L2 normalisation with Gaussian perturbation and found that this operation brings MoCoUTRL a relatively small but stable performance improvement. We conjecture that, with sufficient training, MoCoUTRL can already output representations that satisfy the alignment and uniformity properties and approximately lie on the unit hypersphere; the role of Eq. (21) is only to bring them closer to the theoretically ideal distribution.

Table 3. Ablation study of the proposed MoCoUTRL.

5. Conclusion

Although previous studies have fully demonstrated that large-scale Transformer-based PTMs have strong semantic-capture and generalisation ability across multiple downstream applications, the growing computing requirements and high training costs make it nearly impossible for most research institutions or individuals to train PTMs with trillions of parameters. Moreover, if PTMs are regarded as feature encoders, the representations generated by a single PTM cannot achieve performance matching its parameter size and training costs. Therefore, this paper introduces contrastive learning on top of PTMs. Through contrastive objectives constructed at the sample level and the token level, together with a momentum-updated dynamic vocabulary used as the token’s ground-truth representation, MoCoUTRL achieves results on the STS and WOS datasets that match or even exceed current cutting-edge methods. The quantitative analysis of the alignment and uniformity of the representations output by MoCoUTRL also supports the validity of MoCoUTRL theoretically.

Several recent studies on large language models (LLMs) have been published and have attracted widespread attention; with their impressive performance on many complex downstream tasks, they have become the de facto new research paradigm in NLP. Although these LLMs are capable of many tasks, they still have weaknesses. For example, in information retrieval, even GPT-4, currently in the absolute top tier of performance, still cannot achieve more accurate and effective retrieval than traditional search engines. The contrastive learning framework proposed in MoCoUTRL can better exploit the potential of LLMs in retrieval. By training the text representations output by an LLM with contrastive learning, a semantic index of the documents to be retrieved can be constructed that contains deep semantic information, and document retrieval based on this information can then be achieved with only the most naive similarity calculation. At present, Microsoft Bing’s ChatGPT interface still relies on traditional search engines, which also confirms that LLMs still struggle with large-scale information retrieval; if the MoCoUTRL framework can achieve high-quality deep semantic retrieval, it may bring a new breakthrough to industrial information retrieval.

Whether in academia or industry, training large-scale pre-trained language models from scratch is costly. However, directly applying pre-trained models open-sourced by other institutions or companies to specific business scenarios may encounter problems such as difficult convergence and uneven data distribution. The training framework of MoCoUTRL can alleviate this problem to some extent: by constructing MoCoUTRL’s contrastive learning task on in-domain data, the model can quickly adapt to the data distribution of the application field and generate higher-quality text representations, thus assisting the deployment of downstream applications.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by Defense Industrial Technology Development Program: [Grant Number JCKY2020601B018].

References

  • Bachman, P., Hjelm, R. D., & Buchwalter, W. (2019). Learning representations by maximizing mutual information across views. Advances in Neural Information Processing Systems.
  • Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and New perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798–1828. https://doi.org/10.1109/TPAMI.2013.50
  • Bengio, Y., Ducharme, R., & Vincent, P. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3, 1137–1155. https://dl.acm.org/doi/abs/10.5555/944919.944966
  • Carlsson, F., Gogoulou, E., Ylipaa, E., Gyllensten, A. C., & Sahlgren, M. (2021). Semantic re-tuning with contrastive tension. International Conference on Learning Representations.
  • Caron, M., Goyal, P., Misra, I., Mairal, J., Bojanowski, P., & Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems (pp. 9912–9924).
  • Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., & Specia, L. (2017). SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017) (pp. 1–14). https://doi.org/10.18653/v1/S17-2001
  • Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. Proceedings of the 37th International Conference on Machine Learning (pp. 1597–1607).
  • Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N., Prabhakaran, V., … Fiedel, N. (2022). PaLM: Scaling Language Modeling with Pathways (arXiv:2204.02311). arXiv. http://arxiv.org/abs/2204.02311
  • Chuang, Y.-S., Dangovski, R., Luo, H., Zhang, Y., Chang, S., Soljacic, M., Li, S.-W., Yih, S., Kim, Y., & Glass, J. (2022). Diffcse: Difference-based contrastive learning for sentence embeddings. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 4207–4218). https://doi.org/10.18653/v1/2022.naacl-main.311
  • Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q., & Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 2978–2988). https://doi.org/10.18653/v1/P19-1285
  • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 248–255). https://doi.org/10.1109/CVPR.2009.5206848
  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (arXiv:1810.04805). arXiv. http://arxiv.org/abs/1810.04805
  • Dosovitskiy, A., Springenberg, J. T., Riedmiller, M., & Brox, T. (2014). Discriminative unsupervised feature learning with convolutional neural networks. Advances in Neural Information Processing Systems.
  • Gao, T., Yao, X., & Chen, D. (2021). Simcse: Simple contrastive learning of sentence embeddings. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 6894–6910). https://doi.org/10.18653/v1/2021.emnlp-main.552
  • Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. 2014 IEEE Conference on Computer Vision and Pattern Recognition (pp. 580–587). https://doi.org/10.1109/CVPR.2014.81
  • Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., Doersch, C., Pires, B. A., Guo, Z. D., Azar, M. G., Piot, B., Kavukcuoglu, K., Munos, R., & Valko, M. (2020). Bootstrap your Own latent A New approach to self-supervised learning. Advances in Neural Information Processing Systems (pp. 21271–21284).
  • Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2 (CVPR’06) (pp. 1735–1742). https://doi.org/10.1109/CVPR.2006.100
  • Harris, Z. S. (1954). Distributional structure. WORD, 10(2–3), 146–162. https://doi.org/10.1080/00437956.1954.11659520
  • He, K., Chen, X., Xie, S., Li, Y., Dollar, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 15979–15988). https://doi.org/10.1109/CVPR52688.2022.01553
  • He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 9726–9735). https://doi.org/10.1109/CVPR42600.2020.00975
  • He, P., Liu, X., Gao, J., & Chen, W. (2021). Deberta: Decoding-enhanced bert with disentangled attention. International Conference on Learning Representations.
  • Hill, F., Cho, K., & Korhonen, A. (2016). Learning distributed representations of sentences from unlabelled data. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1367–1377). https://doi.org/10.18653/v1/N16-1162
  • Hinton, G. E. (1986). Learning distributed representations of concepts. Proceedings of the Eighth Annual Conference of the Cognitive Science Society (pp. 12–23).
  • Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507. https://doi.org/10.1126/science.1127647
  • Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., & Bengio, Y. (2019). Learning deep representations by mutual information estimation and maximization. International Conference on Learning Representations.
  • Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C., & Socher, R. (2019). CTRL: A Conditional Transformer Language Model for Controllable Generation (arXiv:1909.05858). arXiv. http://arxiv.org/abs/1909.05858
  • Kitaev, N., Kaiser, L., & Levskaya, A. (2020). Reformer: The efficient transformer. International Conference on Learning Representations 2020.
  • Kong, L., de Masson d’Autume, C., Yu, L., Ling, W., Dai, Z., & Yogatama, D. (2019). A Mutual Information Maximization Perspective of Language Representation Learning. International Conference on Learning Representations.
  • Kowsari, K., Brown, D. E., Heidarysafa, M., Jafari Meimandi, K., Gerber, M. S., & Barnes, L. E. (2017). Hdltex: Hierarchical deep learning for text classification. 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA) (pp. 364–371). https://doi.org/10.1109/ICMLA.2017.0-134
  • Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2020). ALBERT: A lite BERT for self-supervised learning of language representations. International Conference on Learning Representations.
  • Lee, S., Lee, D., Jang, S., & Yu, H. (2022). Toward interpretable semantic textual similarity via optimal transport-based contrastive sentence learning. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 5969–5979). https://doi.org/10.18653/v1/2022.acl-long.412
  • Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., & Zettlemoyer, L. (2020). Bart: Denoising sequence-to-sequence Pre-training for natural language generation, translation, and comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 7871–7880). https://doi.org/10.18653/v1/2020.acl-main.703
  • Li, B., Zhou, H., He, J., Wang, M., Yang, Y., & Li, L. (2020). On the sentence embeddings from Pre-trained language models. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 9119–9130). https://doi.org/10.18653/v1/2020.emnlp-main.733
  • Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21(3), 105–117. https://doi.org/10.1109/2.36
  • Liu, Z., Lin, W., Shi, Y., & Zhao, J. (2021). A robustly optimized BERT Pre-training approach with post-training. In S. Li, M. Sun, Y. Liu, H. Wu, L. Kang, W. Che, S. He, & G. Rao (Eds.), Chinese computational linguistics (Vol. 12869, pp. 471–484). Springer International Publishing. https://doi.org/10.1007/978-3-030-84186-7_31
  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. https://arxiv.org/pdf/1301.3781.pdf
  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems.
  • Oord, A. V. D., Li, Y., & Vinyals, O. (2019). Representation Learning with Contrastive Predictive Coding (arXiv:1807.03748). arXiv. http://arxiv.org/abs/1807.03748
  • OpenAI. (2023). GPT-4 Technical Report. (arXiv:2303.08774). arXiv. http://arxiv.org/abs/2303.08774
  • Oquab, M., Bottou, L., Laptev, I., & Sivic, J. (2014). Learning and transferring Mid-level image representations using convolutional neural networks. 2014 IEEE Conference on Computer Vision and Pattern Recognition (pp. 1717–1724). https://doi.org/10.1109/CVPR.2014.222
  • Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J. & Chintala, S. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library (arXiv:1912.01703). arXiv. https://doi.org/10.48550/arXiv.1912.01703
  • Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543). https://doi.org/10.3115/v1/D14-1162
  • Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations (arXiv:1802.05365). arXiv. http://arxiv.org/abs/1802.05365
  • Qian, J., Dong, L., Shen, Y., Wei, F., & Chen, W. (2022). Controllable natural language generation with contrastive prefixes. Findings of the Association for Computational Linguistics, 2022, 2912–2924. https://doi.org/10.18653/v1/2022.findings-acl.229
  • Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.
  • Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(1), 1–67. https://dl.acm.org/doi/abs/10.5555/3455716.3455856
  • Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using siamese BERT-networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 3980–3990). https://doi.org/10.18653/v1/D19-1410
  • Su, J., Cao, J., Liu, W., & Ou, Y. (2021). Whitening Sentence Representations for Better Semantics and Faster Retrieval (arXiv:2103.15316). arXiv. https://doi.org/10.48550/arXiv.2103.15316
  • Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023). LLaMA: Open and Efficient Foundation Language Models (arXiv:2302.13971). arXiv. https://doi.org/10.48550/arXiv.2302.13971
  • Wang, T., & Isola, P. (2020). Understanding contrastive representation learning through alignment and uniformity on the hypersphere. Proceedings of the 37th International Conference on Machine Learning (pp. 9929–9939).
  • Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q., & Rush, A. (2020). Transformers: State-of-the-Art natural language processing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 38–45). https://doi.org/10.18653/v1/2020.emnlp-demos.6
  • Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Aligning books and movies: Towards story-like visual explanations by watching movies and Reading books. 2015 IEEE International Conference on Computer Vision (ICCV) (pp. 19–27).