Research Article

Exploring graph representation strategies for text classification

Article: 2289832 | Received 16 Apr 2023, Accepted 27 Nov 2023, Published online: 21 Dec 2023

ABSTRACT

Since 2005, the deep learning community has been able to feed graphs as input to its models, and the natural language processing (NLP) community soon began using this technique to process text. However, graph neural networks (GNNs) can be sensitive to the representation format: since different graphs can represent the same text, a model's performance may change depending on the representation used. Even though many practitioners share this intuition, only a few works address this aspect of GNNs. Therefore, we explore twelve text representation strategies that build graphs from text and apply them to the same GNN to investigate how different graphs affect the results. We divide these strategies into four groups: reading order, dependency-based, binary tree, and graph of words; the binary tree group was created for this paper. In our tests, we observed that the dependency-based representations tend to achieve better performance: the dependency-based methods allow us to stay competitive on five relevant datasets and to beat the state-of-the-art on another. These results suggest that representation tuning can be a valuable technique for improving a deep learning model.

1. Introduction

The introduction of graph neural networks (GNNs) in 2005 (Gori et al., Citation2005) expanded deep learning to problems that can only be represented as graphs. In 2013, the natural language processing (NLP) community saw this new technique as an opportunity to represent text using graphs (Rousseau & Vazirgiannis, Citation2013). As the community experimented with this approach, GNNs achieved new state-of-the-art results. Because a graph is an abstract concept, the same deep learning model can process different types of graph representations and data domains. This flexibility also imposes a limitation that may hinder the discovery of promising deep learning models: the chosen representation may be poorly suited to the proposed task. Nevertheless, the deep learning community has no culture of representation tuning when practitioners create a new deep learning model.

Because we observed that few works deal with the impact of text representation on GNNs, we experiment with it here. This work may thus sensitise practitioners to think about the representation used with a deep learning model. In this regard, we present experiments that show the importance of the input given to a deep learning model. Our experiments reveal noticeable performance differences obtained only by changing the graph given as input. On one dataset, we even beat the state-of-the-art while remaining competitive on the others. Therefore, by using representation tuning, a development team may improve both the performance of a GNN model and its own efficiency while creating a novel GNN model. The efficiency gain comes from having a single model that can be trained multiple times with different graph representations. Since GNNs tend to require fewer computational resources than a language model for NLP tasks, we may train identical GNNs multiple times, even simultaneously, and choose the best result. As a result, the team should be able to deploy the best version of its model and representation sooner, possibly reducing some of the costs associated with building a deep learning model.

Since we intend to investigate representation tuning for a GNN model from an NLP perspective, we chose a popular natural language processing task: text classification. We can generalise this task to a graph classification problem and use an established model that classifies graphs. Because we transformed the problem into graph classification, the natural language in which the texts are written should not matter to the classifier. Thus, we should not need to build new deep learning models for each language; we only need to train them again with graphs built from the new language. This paper investigates this property by experimenting with three natural languages: English, Portuguese, and German.

Therefore, we present three contributions: 1) we show that GNN development can benefit from a representation tuning step; 2) we present novel representations based on binary trees; 3) as a result of our study, we achieve the state-of-the-art in a competition dataset only by changing the representation used for the text.

This text begins by presenting the literature on text representation, the theme of this work, regardless of whether a given approach is a graph representation strategy. We identified three main kinds of approaches to representing text, which we present in Section 2. The representation strategies studied in this paper come from three sources: the literature, intuitive representation strategies, and novel representation strategies we created. In total, we tested twelve representation strategies, which can be divided into four categories, as we will present in Section 3.

We present two types of experiments in Section 4: the first investigates the accuracy of each representation strategy on popular and novel datasets covering three distinct natural languages; the second simulates two German competitions on social media text classification. Section 5 concludes the paper with a discussion of the work presented.

2. Related works

In the search for better text representation approaches, researchers and practitioners have explored a wide variety of techniques. This Section presents three types of methodologies for representing text that are prevalent in the state-of-the-art: the traditional feature engineering method, presented in Section 2.1; deep learning-based approaches, presented in Section 2.2; and graph-based approaches, presented in Section 2.3. Because our approach is graph-based, we explore more works using this representation type.

2.1. Traditional feature engineering method

One of the first methods of representing text is the Bag of Words (BoW) (Harris, Citation1954; Joachims, Citation1998). A BoW is a word list recording how many occurrences of each word a document contains. Using the BoW technique, a matrix can be constructed in which the rows represent documents and the columns represent words; each cell of this matrix holds the count of a word in a document. Machine learning techniques, such as SVM (Joachims, Citation1998), can receive text represented as a BoW.
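
A minimal sketch of this matrix construction (whitespace tokenisation only; the helper name is ours, not from any of the cited works):

```python
from collections import Counter

def bag_of_words(documents):
    """Build a document-term count matrix from raw documents."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    index = {word: i for i, word in enumerate(vocabulary)}
    matrix = []
    for doc in documents:
        counts = Counter(doc.split())        # word -> occurrences in this doc
        row = [0] * len(vocabulary)
        for word, count in counts.items():
            row[index[word]] = count
        matrix.append(row)
    return vocabulary, matrix

docs = ["the cat sat", "the dog sat on the mat"]
vocab, bow = bag_of_words(docs)
# vocab: ['cat', 'dog', 'mat', 'on', 'sat', 'the']
# bow:   [[1, 0, 0, 0, 1, 1], [0, 1, 1, 1, 1, 2]]
```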

Another usual way to represent text is by using n-grams. An n-gram uses a sliding window of size n; all the n words that appear in the window form the n-gram. Take, for example, the sentence “I parked my car in your garage”. The 2-gram (also known as bigram) representation is [I parked]; [parked my]; [my car]; [car in]; [in your]; [your garage]. One of the early usages of this technique applied to natural language processing appeared in Broder et al.'s (Citation1997) work, where n-grams were used to analyse text from the web and search for plagiarism. In another work, n-grams were used to classify Arabic text (Ayed et al., Citation2017).
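
A short sketch reproducing the bigram example above (the helper name is ours):

```python
def ngrams(text, n):
    """Return the list of n-grams (word windows of size n) of a text."""
    words = text.split()
    return [words[i:i + n] for i in range(len(words) - n + 1)]

print(ngrams("I parked my car in your garage", 2))
# [['I', 'parked'], ['parked', 'my'], ['my', 'car'],
#  ['car', 'in'], ['in', 'your'], ['your', 'garage']]
```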

Regarding word representation in a collection of documents, we can use a method called Term Frequency-Inverse Document Frequency (TF-IDF). This method generates a number that indicates the importance of a word in a document relative to the whole collection of documents. The TF-IDF is given by Equation (1) (Salton, Citation1989):

(1) TF-IDF_i = tf_i × log(N / df_i)

In Equation (1), tf_i is the frequency of the i-th term, N is the total number of documents, and df_i is the number of documents that contain the i-th term.
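
A worked example of Equation (1); note that the equation does not fix the logarithm base, so the natural logarithm below is our assumption:

```python
import math

def tf_idf(term_frequency, n_documents, document_frequency):
    """TF-IDF of a term, following Equation (1).
    math.log is the natural logarithm; the base is not fixed by Equation (1)."""
    return term_frequency * math.log(n_documents / document_frequency)

# A term that occurs 3 times in a document, in a collection of
# 1,000 documents where 10 documents contain the term:
print(tf_idf(3, 1000, 10))  # 3 * log(100) ≈ 13.82
```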

2.2. Deep learning-based method

Mikolov et al. (Citation2013) explored another approach to represent text. They created one of the first word vectors, the word2vec. This technique uses machine learning to build a vector representation that considers each word’s syntactical and semantical information. This idea spread and inspired the creation of GloVe (Pennington et al., Citation2014), fastText (Joulin et al., Citation2016), A La Carte Embedding (Khodak et al., Citation2018), and another word embedding technique based on n-grams (Gupta et al., Citation2019).

Another initiative for representing text is creating language models. These models build probability distributions over sequences of words and their contexts. Three popular language models are ELMo (Peters et al., Citation2018), BERT (Devlin et al., Citation2019) and GPT-3 (Brown et al., Citation2020). These language models can perform multiple NLP tasks, such as question answering, abstractive summarisation, and sentiment classification. They require an initial pre-training that demands state-of-the-art hardware. When someone needs to use one for a specific task, the user performs fine-tuning, which adjusts the model's parameters to that task.

ELMo, short for Embeddings from Language Models, is one of the pioneering language models (Peters et al., Citation2018) that offers deep contextualised word representations. Its primary objectives are to mitigate two core challenges: 1) addressing the intricate characteristics of word usage, including syntax and semantics; 2) capturing the varying meanings of a word across diverse linguistic contexts, commonly referred to as polysemy. To deal with these challenges, the authors proposed a method for generating highly contextualised word representations employing bi-directional language models, known as biLMs. A biLM comprises both forward and backward language models constructed using BiLSTMs. By joining the representations from each layer of the biLM, ELMo can deliver distinct word embeddings for the same word within disparate sentences.

In contrast, BERT (Devlin et al., Citation2019) builds a language model based on three distinct types of embeddings, each corresponding to a dedicated lookup table: token embeddings, segment embeddings, and position embeddings. The token embeddings map tokens to vectors via a lookup table, ensuring that even rare words can be embedded (Shukri, Citation2019). Segment embeddings accommodate multiple inputs by assigning a numerical tag (0, 1, 2, … , n) to each input. Position embeddings encode the position of each token. BERT calculates the element-wise sum of the three types of embeddings to generate the final token embedding. Then, BERT undergoes pre-training for two primary tasks: the Masked Language Model (MLM) and Next Sentence Prediction (NSP).

BERT’s groundbreaking achievements and its open-source availability by Google have motivated numerous practitioners to employ and train the model with texts in languages other than English. For instance, Souza et al. (Citation2020) applied BERT to Portuguese, Chan et al. (Citation2020) to German, and Cañete et al. (Citation2020) to Spanish.

Furthermore, BERT’s development has inspired other researchers to construct their language models based on BERT’s foundation. Examples include roBERTa (Liu et al., Citation2019), ERNIE (Sun et al., Citation2019), SpanBERT (Joshi et al., Citation2020), and ALBERT (Lan et al., Citation2019).

Researchers have also investigated techniques to improve BERT's performance. According to Xu et al. (Citation2023), BERT's representation power can be further improved by applying a technique called “contrastive learning”, in which a model learns the similarities and differences between examples in a self-supervised learning environment. This technique helps BERT reduce a problem called “anisotropy”, where the embeddings are not well distributed in the vector space, which can bias semantic similarity. Xu et al.'s (Citation2023) work is one of the most recent surveys of the use of contrastive learning with BERT.

Another popular language model, GPT-3 (Brown et al., Citation2020), is an autoregressive language model with 175 billion parameters. It uses Transformers and performs tasks the way humans often do: from a few examples or simple instructions.

2.3. Graph-based method

The evolution of computer hardware has made it possible to apply the successful methods of deep neural networks to data in non-Euclidean domains, such as graphs (Cao et al., Citation2020). These new models allowed the study of distinct ways of representing text in the non-Euclidean domain.

Rousseau and Vazirgiannis (Citation2013) created one of the first graph representation approaches for text classification. Their technique transforms each text document into a graph, the Graph of Words (GoW). The resulting graph represents words as vertices; a sliding window captures the co-occurrence of words, and edges are built between co-occurring words. In the presentation of the method, Rousseau and Vazirgiannis (Citation2013) defined weights for each edge using a function called Term Weights - Inverse Document Frequency; however, according to the authors, the GoW also works without the edge weights.

Malliaros and Skianis (Citation2015) used the graph structure created by Rousseau and Vazirgiannis (Citation2013) and changed the function that calculates the edges’ weight.

Yao et al. (Citation2019) create their graph by considering all documents and words of the corpus. An edge is created between the vertex representing a document and each word of that document. Next, they add an edge between all words that co-occur. They also add a self-loop edge, an edge that links a vertex to itself, to use their representation with the GCN model (Kipf & Welling, Citation2017). Finally, they set each edge's weight based on the types of vertices it connects.

L. Huang et al. (Citation2019) represent text using a method similar to a Graph of Words (Rousseau & Vazirgiannis, Citation2013). They consider each word a node and link each word to its adjacent words. The edge set E is defined as E = {e_ij | i ∈ [1, l]; j ∈ [i − p, i + p]}. In this definition, l is the number of words in the text, and p denotes the number of adjacent words connected to each word in the graph. Therefore, they can choose how many words connect to each word by setting different values of p.

Zhang et al. (Citation2020) also use the Graph of Words proposed by Rousseau and Vazirgiannis (Citation2013). The resulting graphs are undirected and built using a sliding window of size 3.

Ding et al. (Citation2020) created a framework that supports hypergraphs. A hypergraph is a graph that, instead of edges, has hyperedges: edges that can connect more than two vertices. Ding et al.'s framework supports multiple hyperedges, where each hyperedge of a node has a different semantic. Their initial study considered sequential hyperedges, based on sliding windows of size 3, and semantic hyperedges, constructed using Latent Dirichlet Allocation (LDA), a generative probabilistic model for collections of discrete data (Blei et al., Citation2003).

Yang and Cui (Citation2021) use a multi-representation approach: they use word embeddings in a GoW representation and also extract features using BERT. The BERT and GoW outputs are aggregated in a co-attention layer and classified by a fully connected network. Lin et al. (Citation2021), inspired by the TextGCN proposed by Yao et al. (Citation2019), developed BertGCN. This model creates representations of the documents using a BERT-style model (Devlin et al., Citation2019), and then builds a graph following the Yao et al. (Citation2019) methodology to serve as input to a GCN (Kipf & Welling, Citation2017) model.

Mei et al.'s (Citation2021) goal is to create a graph representation for each document, where each node n is a word of the document. In the pre-processing of each document, the TF-IDF of each word is calculated and used as the word's weight w. The words are then sorted in descending order by weight. With this ordered list of words available, one of the two following methods can be applied.

In Method A, the node with the highest weight receives degree n − 1, and the node with the lowest weight receives degree 1. From these two nodes, a line equation (y = kx + b) is created from the points (w_max, n − 1) and (w_min, 1), where w_max is the highest weight and w_min is the lowest. All other degrees are calculated using the line equation generated.

In Method B, the degree is calculated by Equation (2), where w_sum is the sum of all weights and w_{n_i} is the weight of node n_i. After calculating all degrees, the nodes are sorted in descending order by degree.

(2) deg(n_i) = (w_{n_i} × (n − 1)) / w_sum + 1

This ordered list of nodes is used to construct the graph using an idea similar to the Havel–Hakimi algorithm (Hakimi, Citation1962; Havel, Citation1955): we connect the first node (the one with the highest degree) to all nodes; the connected nodes have their degree reduced by one; if a node's degree is reduced to zero, it is removed from the list. The process continues until all nodes are removed from the set.
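
The sketch below shows one possible reading of this construction, combining Equation (2) with the degree-consumption loop; the integer rounding of the real-valued degrees and the function names are our assumptions, not Mei et al.'s implementation:

```python
def degree_sequence(weights):
    """Method B degrees per Equation (2): deg(n_i) = w_i * (n - 1) / w_sum + 1.
    Rounding to integers is our assumption."""
    n, w_sum = len(weights), sum(weights)
    return [round(w * (n - 1) / w_sum + 1) for w in weights]

def build_edges(degrees):
    """Havel-Hakimi-style construction: link the highest-degree node to every
    remaining node, decrement the degrees it touched, drop exhausted nodes."""
    remaining = dict(enumerate(degrees))
    edges = []
    while remaining:
        u = max(remaining, key=remaining.get)   # node with the highest degree
        others = [v for v in remaining if v != u]
        for v in others:
            edges.append((u, v))
            remaining[v] -= 1
        del remaining[u]                        # u has spent its whole degree
        for v in others:
            if remaining[v] <= 0:
                del remaining[v]
    return edges

print(build_edges(degree_sequence([0.9, 0.5, 0.3, 0.1])))
# [(0, 1), (0, 2), (0, 3), (1, 2)]
```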

Huang et al. (Citation2022) construct their graph by tokenising the words into sub-words. They connect the sub-words that occur within a fixed sliding window, and each edge's weight is calculated from the point-wise mutual information (PMI). After building the graph, a context enrichment step creates the node features for each sub-word.

Wang et al. (Citation2023) build their graphs by considering each word a vertex, with edges from four distinct sources: the first is the co-occurrence of words within a fixed-size sliding window of length 3; the second is syntactic dependency; the third is self-loops; the last is the fusion of edges that start and finish in the same vertices. Thus, the resulting graph is not a multigraph.

Wang et al. (Citation2023) create a novel graph convolution model called LDGCN that uses the same graph representation as TextGCN (Yao et al., Citation2019). They also use the same GCN model with one difference: a discriminative regular term module. This module aims to group the texts that belong to the same scene class, which may reduce the cross-entropy loss.

Bugueño and de Melo (Citation2023) investigate intuitive graphs based on sliding windows, similar to the Reading order and Graph of words strategies we present in the following Section. In their work, they define the following types of graphs. The simplest method considers each word a node and builds an edge between two nodes if the words are consecutive terms in the text. The extended version of the former method is the same, but instead of only building edges between consecutive terms, it also builds edges within a window of three terms. Two sequence-based methods build directed arcs between the nodes: the simplified approach has no weights, while the sequence-weighted approach uses the number of times an arc appears as its weight. The last graph-based representation follows the same approach as Huang et al. (Citation2019).

Bugueño and de Melo (Citation2023) then compare these five graph-based representations against a pre-trained and a fine-tuned version of BERT on several tasks. The authors conclude that graph-based approaches are promising but also sensitive to the lexical features of the documents. Their advantages over BERT are their simplicity, both in creation and processing, and their tendency to converge earlier on datasets containing shorter documents.

Wang et al. (Citation2023) used a psychological study to build embeddings that try to mitigate the polysemy of words. Later, they use these embeddings to represent the words that compose a graph. These graphs use the same shape Zhang et al. (Citation2020) used. Then, the graphs are passed to a Dual BiGRU-CNN model to classify the sentiment of the text.

3. Explored representations

In this Section, we describe the representation strategies studied. We explored four groups of representation strategies: Reading order, Binary tree, Dependency tree, and Graph of words. As the following subsections will present, each group may have more than one variation.

All strategies build directed graphs G = (V, E), where V is the set of vertices and E is the set of arcs. Each vertex is the word embedding of a token of the document. A token is a word or a non-word sequence (e.g. a punctuation mark or a number). The word embedding used is the pre-trained fastText (Joulin et al., Citation2016) for the language processed (English, Portuguese or German). We chose fastText because it is pre-trained for many languages and can build embeddings for rare words, improving the final result and enabling other practitioners to reproduce this research with other natural languages. The resulting graph G is the representation of one document.

We do not perform any pre-processing besides choosing the correct language for the word embeddings and tagging (POS tagging and dependency tagging). Therefore, if the document has punctuation marks or emojis, they will be represented in the final graph. We summarise the graph creation process in Figure 1.

Figure 1. Overview of the text-to-graph conversion. The white boxes are the steps used to transform the text dataset into a graph dataset. The black boxes are the representation strategies options used. There are 12 strategies: 2 for Reading order, 3 for Binary tree, 6 for Dependency tree, and 1 for Graph of words.


A Python library called spaCy (Honnibal et al., Citation2020) does all tokenization and dependency parsing when needed. The source code we developed to build all representations is available at https://github.com/HenriqueVarellaEhrenfried/Text_Representation.
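
For readers unfamiliar with spaCy, a minimal usage sketch follows; the model name is only an example, and the attributes shown are the ones our strategies rely on:

```python
import spacy

# Any pre-trained spaCy pipeline works; "en_core_web_sm" is just an example.
nlp = spacy.load("en_core_web_sm")
doc = nlp("I parked my car in your garage")

for token in doc:
    # token.pos_ is the part-of-speech tag, token.dep_ the dependency
    # label, and token.head the word this token depends on.
    print(token.text, token.pos_, token.dep_, token.head.text)
```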

3.1. Reading order

The reading order strategy is based on how humans read text: an arc starts at the first word and ends at the second word; another arc starts at the second word and ends at the third word, and so forth, until the last arc ends at the last word. This description defines the reading order only (ROO) strategy. The difference between the ROO approach and a GoW with a sliding window of size 1 is that the ROO considers each occurrence of a word independently: if the same word appears in two or more parts of the text, it is represented multiple times, while the GoW represents it only once.

There is another strategy based on the reading order, called reading order circular (ROC). Its difference from the ROO is that the last word connects to the first word, with an arc that starts at the last word and ends at the first. Figure 2 shows these two strategies. We present an example of creating the ROO and ROC representations in Algorithm 1, in which we omit the step that adds the word embeddings to improve readability.
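
As a companion to Algorithm 1 (not reproduced here), the sketch below is our own minimal Python rendering of the two strategies; it emits arcs over token indices and, like the Algorithm, omits the word embeddings:

```python
def reading_order_graph(tokens, circular=False):
    """Build ROO (or ROC, if circular=True) arcs over token indices.
    Each token is its own vertex, even if the same word repeats."""
    arcs = [(i, i + 1) for i in range(len(tokens) - 1)]
    if circular and len(tokens) > 1:
        arcs.append((len(tokens) - 1, 0))  # last word points back to the first
    return arcs

tokens = "I parked my car".split()
print(reading_order_graph(tokens))                 # [(0, 1), (1, 2), (2, 3)]
print(reading_order_graph(tokens, circular=True))  # adds (3, 0)
```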

Figure 2. Graphical representation of the Reading order text strategies. The solid arcs appear in both representation strategies, ROC and ROO. The dashed arc only appears on the ROC representation strategy.


3.2. Binary trees

The binary tree idea was motivated by the observation that a word in an utterance has two functions: a morphological (POS) function, where the word by itself carries its own semantics, and a syntactical (DEP) function, where a word influences other words in the utterance to build the meaning intended by the utterance's creator. Thus, if we combine this information into a single value, we could use it to represent an utterance better.

We then developed a way to unite this information. We wanted to represent an entire sentence using this novel technique, so we immediately thought of binary trees, because they group words with similar properties in different branches. Once the original binary tree provided interesting results from a representation perspective, we experimented with variations of it: the Red-Black tree and the AVL tree.

Thus, the Binary tree group contains three representation strategies. To the best of our knowledge, all of them are novel representation strategies that we created, which we present in the following paragraphs. The idea is to use well-known data structures (binary tree, AVL tree, and Red-Black tree) to represent text.

The base for this representation strategy is three different binary tree structures: the Binary tree (Windley, Citation1960), the AVL tree (Adel'son-Velskii & Landis, Citation1962), and the Red-Black tree (Bayer, Citation1972). The Binary tree was developed to minimise the storage needed by rearranging data. The AVL tree and the Red-Black tree are self-balancing binary trees. Natural numbers are the input of the three trees; therefore, we had to create a procedure to transform the text of the documents into natural numbers. Hence, we created a function that converts a word into two natural numbers and maps them to a single natural number.

These two numbers are the position, in two different lists, of the part-of-speech (POS) and the dependency (DEP) information of the word. These lists come from a Python library called spaCy (Honnibal et al., Citation2020) in alphabetical order. Since each language has its particularities, these lists are language dependent.

The formal concepts of part-of-speech (POS) and dependency (DEP) are the following. A part-of-speech is a category assigned to a word under its syntactic functions. In English, some of the main parts-of-speech are nouns, verbs, pronouns, and adjectives. Regarding dependency, one of the first linguists to formalise the concept of dependency was Tesnière (Citation1959). He believed a sentence is an organised set of words containing connections the human mind perceives. The connection between these words is called “dependency”.

Thus, our procedure begins with creating these lists by acquiring the information from the spaCy library and storing them in lists. Let us call the part-of-speech list P and the dependency list D.

Then, we build the domain set of our function by creating a Cartesian plane with two reference points: O = (0, 0) and P = (|D| + 1, |P| + 1). We next consider every natural number in the interval [0, |D|] for the X-axis and [0, |P|] for the Y-axis. For each combination in these intervals, we calculate the Euclidean distance from the generated point Q to point O, which we call the distance x, and the Euclidean distance from point Q to point P, which we call the distance y. The difference x − y is inserted into the list R.

We sort the list R when we finish building it. Then, for each word of a document, we get the word's part-of-speech and dependency tags, search the lists P and D to gather their indexes, calculate the distances x and y, and compute the number k = x − y. Finally, we look for k in the list R and return the index found. We insert this index into one of the binary trees, and the new node receives the embedding of the word provided by fastText (Joulin et al., Citation2016). When all words of the document have been inserted, the tree is ready.
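
To make the key computation concrete, the following sketch shows how we read the description above; the function names are ours, and spaCy's actual tag lists would be passed in as `dep_tags` and `pos_tags`:

```python
import math

def build_key_table(dep_tags, pos_tags):
    """Precompute the sorted list R of distance differences x - y over the
    whole DEP x POS grid, with O = (0, 0) and P = (|D| + 1, |P| + 1)."""
    o = (0.0, 0.0)
    p = (len(dep_tags) + 1, len(pos_tags) + 1)
    r = []
    for i in range(len(dep_tags) + 1):       # X axis: dependency index
        for j in range(len(pos_tags) + 1):   # Y axis: part-of-speech index
            x = math.dist((i, j), o)
            y = math.dist((i, j), p)
            r.append(x - y)
    return sorted(r)

def word_key(word_dep, word_pos, dep_tags, pos_tags, r):
    """Map a word's (DEP, POS) pair to its insertion key: the index of
    k = x - y in the sorted list R."""
    i, j = dep_tags.index(word_dep), pos_tags.index(word_pos)
    x = math.dist((i, j), (0.0, 0.0))
    y = math.dist((i, j), (len(dep_tags) + 1, len(pos_tags) + 1))
    return r.index(x - y)
```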

Observe that the insertion follows the order in which the words appear in the document. Figure 3 illustrates a sentence represented with each binary tree. In this text, we refer to the binary tree as BT, the AVL tree as AT, and the Red-Black tree as RBT. To illustrate the procedure that creates the binary trees, we present Algorithm 2. In this Algorithm, we omit the step that adds the word embeddings to improve readability.

Figure 3. Resulting trees after all insertions. The numbering inside the vertices corresponds to the order in which the word appears in the text. This number is also presented in the first line of numbers at the image’s bottom. The numbers outside the vertices are the keys returned by the function. They are also presented in the last line of numbers at the image’s bottom.


Intuitively, these trees separate the words depending on their POS and DEP values. Since we use the D and P lists from spaCy in alphabetical order, we usually start inserting words around the middle of the domain interval. Thus, the following words are inserted in a binary tree that reasonably splits the word classes, as Figure 3 shows. Notice that although we build all possible combinations of POS and dependency tags, some may never occur (e.g. PUNCT with VB, since a verb is never a punctuation mark).

3.3. Dependency trees

The third method also uses the spaCy library (Honnibal et al., Citation2020). However, this time we construct the dependency tree using the dependency information provided by spaCy. We defined six strategies to build the dependency tree: 1) Only, that is, only the dependency tree, which we call DTOY; 2) Order, where we add the reading order to the DTOY, which we call DTOR; 3) Order Multigraph, where we build the same tree as DTOR but, instead of omitting repeated arcs, represent them again, which we call DTORM; 4) Only Self, the same tree as DTOY with a self-loop arc added for each vertex, which we call DTOYS; 5) Order Self, the same tree as DTOR with a self-loop arc added for each vertex, which we call DTORS; 6) Order Self Multigraph, the same tree as DTORM with a self-loop arc added for each vertex, which we call DTORMS. We illustrate these variations in Figure 4. Algorithm 3 presents the creation procedure. In this Algorithm, we omit the step that adds the word embeddings to improve readability.
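
A minimal sketch of how such variants can be assembled with spaCy (our own rendering, not the paper's Algorithm 3; the model name is an example):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # the model name is only an example

def dependency_arcs(text, order=False, self_loops=False, multigraph=False):
    """Build DTOY-style arc lists over token indices; the flags add the
    Order, Multigraph, and Self components described above."""
    doc = nlp(text)
    # Dependency tree: head -> dependent (the root's head is itself, so skip it).
    arcs = [(t.head.i, t.i) for t in doc if t.head.i != t.i]
    if order:  # reading-order arcs between consecutive tokens
        for i in range(len(doc) - 1):
            arc = (i, i + 1)
            if multigraph or arc not in arcs:  # DTORM keeps duplicates
                arcs.append(arc)
    if self_loops:
        arcs.extend((t.i, t.i) for t in doc)
    return arcs

# DTORS example: dependency tree + reading order + self-loops
print(dependency_arcs("I parked my car", order=True, self_loops=True))
```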

Figure 4. Graph representation using the Dependency tree approach. The solid lines define the Dependency tree. Dotted lines (self-loops) represent the Self Component. Long dashed lines represent the Multigraph component. Dashed lines represent the Order component. The components may be added to the dependency tree, creating a graph that contains all solid arcs and the arcs of the component chosen.


3.4. Graph of words

The fourth method is the Graph of words, as idealised by Rousseau and Vazirgiannis (Citation2013). In this representation strategy, the idea is to link all words that appear within a predefined window size. So, if the window size is four, the first word of the sentence links itself to the following four words; then the window shifts to the next word, and the second word links itself to the following four words. This process continues until the sliding window exceeds the number of words left, in which case we reduce the window size and the process continues. When the window size reaches zero, the graph is ready.

If, during the process, two words have the exact same spelling, they are represented by a single node. However, the words become distinct nodes if there is at least one differing character (e.g. sand, Sand, SAnd, SAND).

To build our graphs using this representation strategy, we followed the procedure proposed by Rousseau and Vazirgiannis (Citation2013), including the recommended parameters: a sliding window of size 4, building unweighted directed graphs. We refer to this representation strategy as GOW. Figure 5 shows an example of a sentence represented with GOW. We illustrate the process of creating a graph of words in Algorithm 4. In this Algorithm, we omit the step that adds the word embeddings to improve readability.
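
The sketch below is our own rendering of this procedure (Algorithm 4 is not reproduced here); it builds unweighted directed arcs and merges identical spellings into one node:

```python
def graph_of_words(tokens, window=4):
    """Directed GoW arcs: each word points to the words that follow it
    inside the sliding window; identical spellings share a node, so a
    word repeated within the window yields a self-loop."""
    nodes, index = [], {}
    for tok in tokens:                  # unique spellings, in order of appearance
        if tok not in index:
            index[tok] = len(nodes)
            nodes.append(tok)
    arcs = set()
    for i, tok in enumerate(tokens):
        # min(...) shrinks the window near the end of the text
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            arcs.add((index[tok], index[tokens[j]]))
    return nodes, sorted(arcs)

nodes, arcs = graph_of_words("I parked my car in your garage".split())
print(nodes, arcs)
```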

Figure 5. Graph representation using the Graph of Words with sliding window of maximum size 4.


From what we presented, Table 1 shows the statistics of the generated graph for each representation strategy. Observe that, as previously presented, the number of nodes is the number of tokens in a document; a token in this context can be a word, a number, a punctuation mark, or even an emoji. Table 1 presents the worst-case number of arcs as a function of the number of nodes. Only the GOW representation depends on an additional parameter, the sliding window size, denoted s. Note that the sliding window cannot exceed the number of nodes, thus 0 ≤ s ≤ n − 1.

Table 1. Statistics of the graphs created in each representation strategy. In this Table, n denotes the number of tokens, and s is the sliding window size.

3.5. Deep learning model

This work used UGformerV1 (Nguyen et al., Citation2022), an open-source model that uses Transformers to classify graphs. We chose it because the UGformer is a generic graph classification model.

We chose UGformerV1 instead of UGformerV2 because the authors (Nguyen et al., Citation2022) state that the first variant should be used for larger graphs, while the second should be used for small and medium graphs. Since they do not provide metrics to qualify a graph as small, medium, or large, and we do not know in advance the size of the graphs to be processed, we always use the variant that can handle larger graphs. We also performed a non-rigorous test of both variants: the first achieved better accuracy scores under similar parameters, which made us suspect that the graphs we process are large.

UGformerV1's implementation is available on the authors' GitHub.Footnote1 Since this work aims to study text representations and their efficiency, we only briefly present the UGformer in the following paragraphs; for more details about this model, please refer to Nguyen et al. (Citation2022).

UGformer's objective is to learn a plausible embedding o_{G_m} for each graph G_m to predict its label y_m. Therefore, it is given a set of M graphs {G_m}_{m=1}^{M} and their corresponding class labels {y_m}_{m=1}^{M}.

Each graph G is defined as G = (V, E, X), where V is a set of vertices, E is a set of edges, and X ∈ R^{|V| × d} represents the feature vectors of the vertices, which in this work are the fastText (Joulin et al., Citation2016) embeddings. These graphs are fed into a Transformer network to learn a vertex embedding o_v for each vertex v ∈ V. All vertex embeddings are later summed to obtain o_G.

To learn the vertex embeddings o_v, a set of neighbours N_v of each vertex v ∈ V is sampled and given, together with the origin vertex, as input to the UGformer.

The vertices Nv ∪ {v} are transformed into a sequence of feature vectors. These vectors are iteratively refined by a Transformer encoder followed by a feed-forward network and layer normalisation.

After T steps, the output representation h_{T,v}^{(k)} is fed to the next, (k + 1)-th, layer. Therefore, multiple layers are stacked on top of each other to capture k-hop neighbourhoods in the UGformer. In the end, the output representations of all K layers are concatenated to infer the vertex embedding o_v. The graph embedding o_G is the sum of the embeddings o_v of all vertices v ∈ V. The model parameters are learned by minimising a cross-entropy loss function.
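
As a schematic of the readout and loss just described (our own sketch, not the authors' implementation; d = 300 matches the fastText embedding dimension):

```python
import torch

def graph_embedding(vertex_embeddings):
    """o_G is the sum of all vertex embeddings o_v (|V| x d -> d)."""
    return vertex_embeddings.sum(dim=0)

# Schematic classification head: one logit per class, cross-entropy loss.
d, num_classes = 300, 5
classifier = torch.nn.Linear(d, num_classes)
loss_fn = torch.nn.CrossEntropyLoss()

o_v = torch.randn(12, d)                   # embeddings for a 12-vertex graph
o_g = graph_embedding(o_v)                 # graph embedding o_G
logits = classifier(o_g.unsqueeze(0))      # shape (1, num_classes)
loss = loss_fn(logits, torch.tensor([2]))  # true label: class 2
```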

3.6. Summary

In this Section, we presented twelve types of text representation strategies using graphs: Reading Order Only (ROO), Reading Order Circular (ROC), Binary Tree (BT), AVL Tree (AT), Red-Black Tree (RBT), Dependency Tree Only (DTOY), Dependency Tree Order (DTOR), Dependency Tree Order Multigraph (DTORM), Dependency Tree Only Self (DTOYS), Dependency Tree Order Self (DTORS), Dependency Tree Order Multigraph Self (DTORMS), and Graph of Words (GOW). We also presented the deep learning model used in our experiments. The following Section will use these strategies to represent distinct classification datasets.

4. Experiments

This part presents the experiments conducted. We present the datasets used, followed by our methodology, results, and results analysis.

4.1. Datasets used

This work used four datasets - MR, Ohsumed, B2W-Rating-Balanced, and 10kGNAD - to evaluate the language independence of the graph classification. They come from different sources: the MR and Ohsumed datasets were extracted from the TextGCN project (Yao et al., Citation2019), while the B2W-Rating-BalancedFootnote2 and 10kGNADFootnote3 were extracted from GitHub. We will refer to these four datasets as the LI datasets in this text.

The MRFootnote4 dataset is a movie review dataset for binary sentiment classification, in which each review contains only one sentence (Pang & Lee, Citation2005). The corpus has 5,331 positive and 5,331 negative reviews.

The Ohsumed corpusFootnote5 (Moschitti & Basili, Citation2004) is from the MEDLINE database, a bibliographic database of crucial medical literature maintained by the National Library of Medicine. In this work, we used the 13,929 unique cardiovascular disease abstracts among the first 20,000 abstracts of 1991. Each document in the set has one or more associated categories from the 23 disease categories. As we focus on single-label text classification, the documents belonging to multiple categories were excluded; thus, only the 7,400 documents belonging to a single category remain.

The B2W-Rating-Balanced is a balanced dataset (Ehrenfried et al., Citation2022) composed of 41,945 reviews written in Portuguese. The classes are the review scores given by the reviews' authors: an integer between 1 and 5, where one is the lowest score and five is the highest.

The 10kGNAD consists of 10,273 news articles in German from an Austrian online newspaper, categorised into nine topics (Web, Panorama, International, Wirtschaft, Sport, Inland, Etat, Wissenschaft, Kultur). It is important to note that 10kGNAD is an unbalanced dataset. The goal of this dataset is to categorise each document into one topic. Table 2 presents a summary of the content of each dataset.

Table 2. LI datasets used in the experiments. The number after the ± sign is the standard deviation.

We also experimented with these twelve representation strategies in two German competitions, GermEval 2018 (Wiegand et al., Citation2018) and GermEval 2021 (Risch et al., Citation2021). These competitions collected text from social media and proposed different classification tasks. Because the text came from social media, it may contain emojis, mentions of other users (which the competition committees anonymised, though the mentions remain in the text), and slang. An overview of the datasets, with information about the number of documents and statistics about the tokens, is available in Table 3.

Table 3. Statistics of GermEval Datasets. The number after the ± sign is the standard deviation.

The GermEval 2018 Shared Task comprises two offensive-language identification tasks with 8,541 comments collected from Twitter. Of these comments, 5,009 compose the training set and 3,532 compose the test set.

The first task is a coarse-grained binary classification, in which each text is classified as either OTHER or OFFENSE. The OFFENSE class comprises texts with abusive language, insults, or profane statements; the OTHER class contains all texts not in the OFFENSE class. The second task is a fine-grained multi-class classification.

The second task contains four classes: a category for texts that are not offensive, called OTHER, and the three subcategories of the OFFENSE class from task one: PROFANITY, INSULT, and ABUSE. More information about the class distribution is in Table 4a. The GermEval 2021 competition aimed to identify toxic, engaging, and fact-claiming comments. The corpus consists of 4,188 comments from the Facebook page of a German political talk show. Of these 4,188 comments, 3,244 are training data and 944 are test data.

Table 4. Composition of GermEval datasets.

In this competition, there are three binary identification tasks: one for toxic texts, one for engaging texts, and one for fact-claiming texts. More information about the class distribution is in Table 4b.

These datasets were selected because they contain the main challenges found in text classification: language ambiguity, language variation, and the use of complex language features such as irony and metaphor. In addition, they present different scenarios where we can apply representation tuning: a data analysis environment and a competitive environment.

4.1.1. Baselines

In this part, we report the baselines for each dataset used. We compare our approaches to other graph-based approaches whenever possible. Since we did not find graph-based experiments on the German datasets (10kGNAD, GermEval 2018, and GermEval 2021), we report the results we found. We also report the only result we found on the Portuguese dataset (B2W-Rating). Table 5 presents the results for the English datasets. Table 6a presents the top five results of GermEval 2018 in each task, while Table 6b presents the top five results of GermEval 2021 in each task.

Table 5. Baseline for the English datasets (MR and Ohsumed). The bold results are the highest.

Table 6. Baselines, in terms of F1-Score, for the GermEval datasets. These baselines come from the results of the competitions (Risch et al., Citation2021; Wiegand et al., Citation2018).

We begin by reporting the datasets for which we found only one baseline result. For text classification on 10kGNAD, the baseline accuracy is 90.5%, obtained with BERT for German (Chan et al., Citation2020).

The Portuguese dataset (B2W-Rating) is a new dataset; therefore, the only baseline found comes from the creators of the dataset (Ehrenfried et al., Citation2022), who scored an accuracy of 60.4% using a TextGCN.

4.2. Methodology

For each dataset from the LI datasets presented in Table 2, we transformed each document into a graph using the methods explained in Section 3, generating one file with all graphs for each representation strategy. Since twelve representations exist for each of the four datasets, we built 48 files. Each of these 48 files takes approximately 1 h to build using the desktop machine described in the following paragraphs.

Then, we ran the UGformerV1 model on each of these files using 10-fold cross-validation: 10% of the dataset forms the test set, and the other 90% forms the training set.

For the GermEval datasets, we experimented with a simulated competition environment. So, we ran each representation strategy once, following the division of train and test data provided by the competition’s committee.

We modified the original model UGformerV1 (Nguyen et al., Citation2022) to allow each vertex to contain the node features referenced in Section 3. Thus, the code mentioned in the previous paragraph adds the word embeddings of each word as the node feature. The modified code is available at our Github.Footnote6

All experiments were conducted using the PyTorch version of the UGformerV1 (Nguyen et al., Citation2022) supervised model briefly explained in Section 3.5. We used most of the default parameters suggested by the authors of UGformerV1 (Nguyen et al., Citation2022), so most of the experiments can run on a GPU with 8 GB of VRAM. The parameters used are:

num_timesteps = 1; num_neighbors = 4; sampled_num = 512; dropout = 0.5; learning_rate = 0.0001; ff_hidden_size = 1024; num_hidden_layers = 2; num_epochs = 30; batch_size = 4.

Regarding the parameters of the UGformer for the GermEval datasets, we only increased the number of epochs from 30 to 70 so we could analyse the effect of more epochs. The five tasks report the test metrics at the last epoch, since the UGformer does not implement early stopping.

We did not perform parameter tuning in any experiment, because this work aims to evaluate how different representation strategies behave; parameter tuning would shift the reader's attention from the representation's impact to the model's performance, which is not intended here. Thus, we standardised the parameters as previously described.

To run the experiments reported in Table 7, we used a desktop machine equipped with an Intel i9 10900KF, 32 GB of RAM, and an RTX 3080 12 GB, running Ubuntu Server 22.04 without a graphical interface. This machine took approximately 120 h to train and infer each LI dataset 10 times, totalling approximately 480 h for all representation strategies and all LI datasets, plus 48 h to build the 48 representations: 528 h to execute the complete experiment.

To run the GermEval experiments, we had access to a cluster equipped with an Intel Xeon CPU E5-2683 v4, two Tesla P100 16 GB, and 512 GB of RAM; however, the experiments used only one Tesla P100 and 8 CPU cores. Regarding training and inference time, each of the 24 GermEval 2018 experiments took approximately 40 min, and each of the 36 GermEval 2021 experiments took approximately 34 min. We spent 16 h on GermEval 2018 and 20 h and 24 min on GermEval 2021.

Regarding metrics, the evaluation of the four datasets presented in Table 2 uses accuracy, as in the literature, while the GermEval experiments use the F1-Score, as the competition committees expect.

4.3. Results

We present the accuracy results in Tables 7 and 8. The values reported in Table 7 are averages of ten executions; the results in Table 8 come from a single execution, as discussed previously. Bold values mark the best result within each type of graph (Reading order, Binary tree, Dependency tree, and GOW); underlined values mark the best accuracy result for the dataset. The number after the ± sign is the standard deviation.

Table 7. Accuracy of each dataset for each graph strategy. The bold values are the highest values of their category (Reading order, Binary tree, Dependency tree, and Graph of words). The underlined values are the best results for the dataset.

Table 8. Accuracy and F1-Score for each task of GermEval 2018 and GermEval2021 competition with each representation strategy presented. The bold values are the highest values of their category (Reading order, Binary tree, Dependency tree, and Graph of words). The underlined values are the best results for the dataset. Observe that the competition uses the F1-Score to rank the participants.

We calculated the p-values for all experiments in Table 7 and report them on our GitHub pageFootnote7 because the resulting tables would consume much space.

4.4. Results analysis

We consider the baselines presented in Section 4.1.1 to analyse our results. Looking at the results in Table 7, we observe that we do not surpass any state-of-the-art result, regardless of the representation strategy used. Nevertheless, our investigation performs better than some graph-based approaches, such as GraphStar (Haonan et al., Citation2019) and TextGCN (Yao et al., Citation2019).

We also observe that the representation strategy can improve accuracy on the same dataset by up to 2.8% when documents average around 30 tokens, and by up to 14.9% when they exceed 150 tokens, as shown in Table 9.

Table 9. Improvement observed in accuracy by changing the representation strategy.

We also observe that the dependency-based trees perform well regardless of document length, being the best representation strategy for three datasets. Even in the MR dataset, where the dependency-based trees are not the best, they perform similarly to the best representation strategy. The GOW strategy performs worst when the text contains more than 150 tokens, probably because it loses semantic information, similarly to the BoW approach.

To compare dependency-based approaches with and without the reading order module, we divide the comparison into two groups: the group with self-loop edges and the group without self-loop edges. In the group without self-loop edges, we notice that, except for the 10kGNAD dataset, all datasets performed better without the reading order module.

In the group with self-loop edges, the results were mixed: MR and 10kGNAD perform better with the reading order module, while Ohsumed and B2W perform better without it.

When a similar comparison is drawn regarding the use of self-loop edges with the same base structure (DTOY vs DTOYS; DTOR vs DTORS; DTORM vs DTORMS), adding a self-loop edge improves the model's performance on the datasets with longer documents (Ohsumed and 10kGNAD). However, on the smaller datasets we did not observe any pattern: the MR dataset had mixed results, and the B2W dataset performed best without self-loop edges.

We present more detailed metrics of the best and worst results in Tables 10 and 11, respectively. Observe that these metrics follow the same pattern observed in the accuracy results.

Table 10. Metrics observed in the best results of each of the four datasets: MR - ROC, B2W - DTOY, Ohsumed - DTOYS, and 10KGNAD - DTORMS. These metrics are the average of 10 executions.

Table 11. Metrics observed in the worst results of each of the four datasets: MR - RBT, B2W - AT, Ohsumed - GOW, and 10KGNAD - GOW. These metrics are the average of 10 executions.

However, if we analyse the AUROC (area under the ROC curve) following the methodology of Hosmer et al. (Citation2013), all experiments achieve at least an acceptable classification (AUROC between 0.7 and 0.8), with the 10kGNAD results achieving an excellent classification (AUROC above 0.8).

Looking at the competition datasets, we observe that when the German texts are relatively short, as in the GermEval 2018 dataset, the methods based on reading order are enough to deliver the best performance, while the Binary tree is always the worst strategy. Regardless of the method, we could not beat the state-of-the-art, but we achieved a top-four result in the fine-grained task.

However, the best results come from GermEval 2021: we beat the state-of-the-art in all tasks, with results averaging 1.12× those achieved by the best team in each competition task. Once again, the dependency-based methods performed best, leading us to suspect that the most reliable representation method to start with when developing a deep learning model may be one based on dependency trees.

Table 12 reports the maximum improvement in F1-Score observed in each task of the two GermEval competitions.

Table 12. Improvement observed in F1-Score by changing the representation strategy.

We observed in GermEval 2018 and GermEval 2021 that, among the dependency trees without self-loops, most results were better when the reading order module was enabled. The only exceptions were the F1-Scores from GermEval 2021 in the Fact-Claiming and Toxic tasks. These results might have happened because the dataset division in those tasks is more balanced than in the Engaging task.

The versions with self-loop edges also performed better when the reading order module was enabled. The exceptions are the Fact-Claiming and Fine-grained tasks, whose results favoured the version without the reading order module. We also attribute these results to the dataset division.

Looking closer at these dependency-based approaches and comparing the same base representation with and without self-loop edges, we observe the following: in the GermEval 2018 Coarse task, accuracy is always better without self-loop edges, while the F1-Score is always better with them. These results indicate that even though the model classifies the graphs more accurately without the self-loop edge, it produces false positives and false negatives more often. However, the Fine task of GermEval 2018 followed this pattern in only one case: when the graph is a multigraph, the best performance in both accuracy and F1-Score comes from the variant without self-loop edges.

Regarding GermEval 2021: in the Engaging task, the multigraph approach did not follow the pattern, performing best with self-loop edges, while the other approaches performed better without them. In the Fact-Claiming task, all three approaches performed best without self-loops. Nonetheless, in the Toxic task, the dependency tree without any modules had its best accuracy with self-loops but its best F1-Score without them, suggesting that the version without self-loops produces false positives and false negatives less often. The variant with the reading order module performed best with self-loop edges, and the multigraph variant performed best without them.

Considering all the results presented, we hypothesise that the dependency-based strategies performed well in most cases because they better capture the relations that define the semantics, whereas the binary trees divide the words by their function in the text, and the other two strategies, reading order and graph of words, capture only the sequence in which words appear in a text.

Thus, we find that no representation model is best in all scenarios, because each classification dataset and task requires different text fragments to be assessed for accurate classification. This phenomenon happens because language is used differently in each dataset: each author builds different representations in their mind to communicate. In addition, the classification labels we create can be subjective, making even humans debate the categorisation of some sentences.

Regardless, if we had to point to the most reliable family of representation strategies, our results suggest it would be the dependency tree family. For dependency-based representations, adding and removing edges in the representation model can help the deep learning model weigh the importance of different text fragments for the desired classification task, improving the final result.

5. Discussion and further work

This work presented an exploration of representation strategies for text using graphs. We explored twelve graph formats divided into four groups: Reading order, Binary tree, Dependency tree, and Graph of words. We applied them to four classification datasets covering three distinct languages, and to two competitions in German.

We showed that the representation strategy used can affect the performance of a deep learning model. In our tests, the best type of graph-based text representation was the dependency-based family, which achieved up to a 145% improvement over the worst strategy used. These results may motivate other researchers and practitioners to consider representation tuning as another technique for improving machine learning systems.

Our exploration of text representations beat the current state-of-the-art in the GermEval 2021 tasks while remaining competitive on the other datasets.

Further work may investigate whether the same improvements occur in other deep learning models or whether they are a characteristic of UGformerV1. Further work may also explore a collaboration with linguists to develop a novel text representation tuned for deep learning processing; the motivation comes from our results with the dependency-based strategies, which were among the best performers of the studied strategies. In future work, parameter tuning could be applied to improve the results we achieved even further.

In addition, a similar methodology could be used to test other datasets and other NLP tasks, to verify whether those scenarios also benefit from representation tuning, as our results suggest.

Acknowledgements

Research financed with funds from the National Development Fund (FNDE, in Portuguese) and the Ministry of Education (MEC, in Portuguese) of the Federal Government of Brazil, carried out by the Center for Scientific Computing and Free Software (C3SL, in Portuguese) of the Federal University of Paraná (UFPR, in Portuguese). We also want to thank the Coordination for the Improvement of Higher Education Personnel (CAPES) - Program of Academic Excellence (PROEX) (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Programa de Excelência Acadêmica, in Portuguese).

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • Adel'son-Velskii, G. M., & Landis, E. M. (1962). An algorithm for the organization of information. Doklady Akademii Nauk, 146, 263–266.
  • Ayed, R., Labidi, M., & Maraoui, M. (2017). Arabic text classification: New study. 2017 International Conference on Engineering & MIS (ICEMIS), 1–7.
  • Bayer, R. (1972). Symmetric binary B-trees: Data structure and maintenance algorithms. Acta Informatica, 1(4), 290–306. https://doi.org/10.1007/BF00289509
  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993–1022.
  • Broder, A. Z., Glassman, S. C., Manasse, M. S., & Zweig, G. (1997). Syntactic clustering of the web. Computer Networks and ISDN Systems, 29(8-13), 1157–1166. https://doi.org/10.1016/S0169-7552(97)00031-7
  • Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., & Agarwal, S. (2020). Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, & H. Lin (Eds.), Advances in neural information processing systems (Vol. 33, pp. 1877–1901). Curran Associates, Inc. https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
  • Bugueño, M., & de Melo, G. (2023). Connecting the dots: What graph-based text representations work best for text classification using graph neural networks?
  • Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., & Pérez, J. (2020). Spanish pre-trained BERT model and evaluation data. In PML4DC at ICLR 2020.
  • Cao, W., Yan, Z., He, Z., & He, Z. (2020). A comprehensive survey on geometric deep learning. IEEE Access, 8, 35929–35949. https://doi.org/10.1109/ACCESS.2020.2975067
  • Chan, B., Schweter, S., & Möller, T. (2020, December). German’s next language model. In Proceedings of the 28th international conference on computational linguistics (pp. 6788–6796). International Committee on Computational Linguistics. https://aclanthology.org/2020.coling-main.598
  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019, June). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186). Association for Computational Linguistics. https://www.aclweb.org/anthology/N19-1423
  • Ding, K., Wang, J., Li, J., Li, D., & Liu, H. (2020, November). Be more with less: Hypergraph attention networks for inductive text classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 4927–4936). Association for Computational Linguistics. https://aclanthology.org/2020.emnlp-main.399
  • Ehrenfried, H. V., & Todt, E. (2022). Should I buy or should I pass: E-commerce datasets in Portuguese. In V. Pinheiro et al. (Eds.), Computational processing of the Portuguese language (pp. 420–425). Springer International Publishing.
  • Gori, M., Monfardini, G., & Scarselli, F. (2005). A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (Vol. 2, pp. 729–734). https://doi.org/10.1109/IJCNN.2005.1555942
  • Gupta, P., Pagliardini, M., & Jaggi, M. (2019). Better word embeddings by disentangling contextual n-gram information. In NAACL-HLT (1) (pp. 933–939). Association for Computational Linguistics.
  • Hakimi, S. L. (1962). On realizability of a set of integers as degrees of the vertices of a linear graph. I. Journal of the Society for Industrial and Applied Mathematics, 10(3), 496–506. Retrieved 2022-11-22, from http://www.jstor.org/stable/2098746
  • Haonan, L., Huang, S. H., Ye, T., & Xiuyan, G. (2019). Graph star net for generalized multi-task learning.
  • Harris, Z. S. (1954). Distributional structure. WORD, 10(2-3), 146–162. https://doi.org/10.1080/00437956.1954.11659520
  • Havel, V. (1955). Poznámka o existenci konečných grafů [A remark on the existence of finite graphs]. Časopis pro pěstování matematiky, 80(4), 477–480. http://eudml.org/doc/19050
  • Honnibal, M., Montani, I., Van Landeghem, S., & Boyd, A. (2020). spaCy: Industrial-strength Natural Language Processing in Python. Zenodo.
  • Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Assessing the fit of the model. In Applied logistic regression (pp. 153–225). John Wiley & Sons, Ltd. https://onlinelibrary.wiley.com/doi/abs/10.1002/9781118548387.ch5
  • Huang, L., Ma, D., Li, S., Zhang, X., & Wang, H. (2019, November). Text level graph neural network for text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 3444–3450). Hong Kong, People's Republic of China: Association for Computational Linguistics. https://aclanthology.org/D19-1345
  • Huang, Y.-H., Chen, Y.-H., & Chen, Y.-S. (2022, October). ConTextING: Granting document-wise contextual embeddings to graph neural networks for inductive text classification. In Proceedings of the 29th International Conference on Computational Linguistics (pp. 1163–1168). Gyeongju, Republic of Korea: International Committee on Computational Linguistics. https://aclanthology.org/2022.coling-1.100
  • Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In C. Nédellec, & C. Rouveirol (Eds.), Machine learning: ECML-98 (pp. 137–142). Springer Berlin Heidelberg.
  • Joshi, M., Chen, D., Liu, Y., Weld, D. S., Zettlemoyer, L., & Levy, O. (2020). SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8, 64–77. https://doi.org/10.1162/tacl_a_00300
  • Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
  • Khodak, M., Saunshi, N., Liang, Y., Ma, T., Stewart, B., & Arora, S. (2018, July). A la carte embedding: Cheap but effective induction of semantic feature vectors. In Proceedings of the 56th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 12–22). Association for Computational Linguistics. https://aclanthology.org/P18-1002
  • Kipf, T. N., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations (ICLR).
  • Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A lite BERT for self-supervised learning of language representations. arXiv. https://arxiv.org/abs/1909.11942
  • Lin, Y., Meng, Y., Sun, X., Han, Q., Kuang, K., Li, J., & Wu, F. (2021). BertGCN: Transductive text classification by combining GCN and BERT. arXiv preprint arXiv:2105.05727. http://arxiv.org/abs/2105.05727
  • Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv. https://arxiv.org/abs/1907.11692
  • Lu, Z., Du, P., & Nie, J.-Y. (2020). VGCN-BERT: Augmenting BERT with graph embedding for text classification. In Advances in Information Retrieval: 42nd European Conference on IR Research, ECIR 2020, Lisbon, Portugal, April 14–17, 2020, Proceedings, Part I (pp. 369–382).
  • Malliaros, F. D., & Skianis, K. (2015, August). Graph-based term weighting for text categorization. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2015 (pp. 1473–1479). Association for Computing Machinery, Inc.
  • Mei, X., Cai, X., Yang, L., & Wang, N. (2021). Graph transformer networks based text representation. Neurocomputing, 463, 91–100. https://www.sciencedirect.com/science/article/pii/S0925231221012169
  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • Moschitti, A., & Basili, R. (2004). Complex linguistic features for text classification: A comprehensive study. In S. McDonald, & J. Tait (Eds.), Advances in information retrieval (pp. 181–196). Springer Berlin Heidelberg.
  • Nguyen, D. Q., Nguyen, T. D., & Phung, D. (2022). Universal graph transformer self-attention networks. In Companion Proceedings of the Web Conference 2022 (WWW '22 Companion).
  • Pang, B., & Lee, L. (2005). Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd annual meeting on association for computational linguistics (pp. 115–124). Association for Computational Linguistics. https://doi.org/10.3115/1219840.1219855.
  • Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In EMNLP (Vol. 14, pp. 1532–1543). https://nlp.stanford.edu/pubs/glove.pdf
  • Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018, June). Deep contextualized word representations. In Proceedings of the 2018 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long papers) (pp. 2227–2237). New Orleans, Louisiana: Association for Computational Linguistics. https://aclanthology.org/N18-1202
  • Risch, J., Stoll, A., Wilms, L., & Wiegand, M. (2021, September). Overview of the GermEval 2021 shared task on the identification of toxic, engaging, and fact-claiming comments. In Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments (pp. 1–12). Association for Computational Linguistics. https://aclanthology.org/2021.germeval-1.1
  • Rousseau, F., & Vazirgiannis, M. (2013). Graph-of-word and TW-IDF: New approach to ad hoc IR. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (pp. 59–68). Association for Computing Machinery. https://doi.org/10.1145/2505515.2505671
  • Salton, G. (1989). Automatic text processing: The transformation, analysis, and retrieval of information by computer. Addison-Wesley Longman Publishing Co., Inc.
  • Shukri, M. (2019). How the embedding layers in BERT were implemented. https://medium.com/@_init_/why-bert-has-3-embedding-layers-and-their-implementation-details-9c261108e28a (Accessed July 16, 2021).
  • Souza, F., Nogueira, R., & Lotufo, R. (2020). BERTimbau: Pretrained BERT models for Brazilian Portuguese. In R. Cerri, & R. C. Prati (Eds.), Intelligent systems (pp. 403–417). Springer International Publishing.
  • Sun, Y., Wang, S., Li, Y., Feng, S., Chen, X., Zhang, H., Tian, X., Zhu, D., Tian, H., & Wu, H. (2019). ERNIE: Enhanced representation through knowledge integration. CoRR, abs/1904.09223. http://arxiv.org/abs/1904.09223
  • Tesnière, L. (1959). Éléments de syntaxe structurale [Elements of structural syntax]. Klincksieck.
  • Wang, B., Sun, Y., Chu, Y., Min, C., Yang, Z., & Lin, H. (2023, May 29). Local discriminative graph convolutional networks for text classification. Multimedia Systems. https://doi.org/10.1007/s00530-023-01112-y
  • Wang, Y., Feng, L., Liu, A., Wang, W., & Hou, Y. (2023a). Dual BiGRU-CNN-based sentiment classification method combining global and local attention. The Journal of Supercomputing. https://doi.org/10.1007/s11227-023-05558-9
  • Wang, Y., Wang, C., Zhan, J., Ma, W., & Jiang, Y. (2023b). Text FCG: Fusing contextual information via graph learning for text classification. Expert Systems with Applications, 219, 119658. https://www.sciencedirect.com/science/article/pii/S0957417423001598
  • Wiegand, M., Siegel, M., & Ruppenhofer, J. (2018). Overview of the GermEval 2018 shared task on the identification of offensive language. Verlag der Österreichischen Akademie der Wissenschaften.
  • Windley, P. F. (1960). Trees, forests and rearranging. The Computer Journal, 3(2), 84–88. https://doi.org/10.1093/comjnl/3.2.84
  • Wu, F., Souza, A., Zhang, T., Fifty, C., Yu, T., & Weinberger, K. (2019). Simplifying graph convolutional networks. In International Conference on Machine Learning, 6861–6871.
  • Xu, L., Xie, H., Li, Z., Wang, F. L., Wang, W., & Li, Q. (2023). Contrastive learning models for sentence representations. ACM Transactions on Intelligent Systems and Technology, 14(4). https://doi.org/10.1145/3593590
  • Yang, Y., & Cui, X. (2021). BERT-enhanced text graph neural network for classification. Entropy, 23(11), 1536. https://www.mdpi.com/1099-4300/23/11/1536
  • Yao, L., Mao, C., & Luo, Y. (2019). Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, pp. 7370–7377). AAAI Press.
  • Zhang, Y., Yu, X., Cui, Z., Wu, S., Wen, Z., & Wang, L. (2020, July). Every document owns its structure: Inductive text classification via graph neural networks. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 334–339). Association for Computational Linguistics. https://aclanthology.org/2020.acl-main.31
  • Zhu, H., & Koniusz, P. (2021). Simple spectral graph convolution. International Conference on Learning Representations (ICLR).