
The detection of distributional discrepancy for language GANs

Pages 1736-1750 | Received 22 Jan 2022, Accepted 11 May 2022, Published online: 14 Jun 2022

ABSTRACT

A pre-trained neural language model (LM) is usually used to generate text. Owing to exposure bias, the generated text is not as good as real text. Many researchers have claimed that generative adversarial nets (GANs) alleviate this issue by feeding reward signals from a discriminator back to update the LM (the generator). However, others have argued that GANs do not work, after evaluating the generated texts with two-dimensional quality-diversity metrics such as Bleu versus self-Bleu, and language model score versus reverse language model score. Unfortunately, these two-dimensional metrics are not reliable. Furthermore, existing methods assessed only the final generated texts, neglecting dynamic evaluation of the adversarial learning process. Unlike the above-mentioned methods, we adopt the most recent metric functions, which measure the distributional discrepancy between real and generated text. In addition, we design a comprehensive experiment to investigate performance during the learning process. First, we evaluate a language model with the two functions and identify a large discrepancy. Then, we try several methods that use the detected discrepancy signal to improve the generator. Experimenting with two language GANs on two benchmark datasets, we find that the distributional discrepancy increases with more adversarial learning rounds. Our research provides convincing evidence that these language GANs fail.

1. Introduction

Text generation based on neural language models (LMs) (e.g. the LSTM; Hochreiter & Schmidhuber, Citation1997) has received much attention and has been used for news generation (Zellers et al., Citation2019), text summarisation (Lin et al., Citation2022) and image captioning (Xu et al., Citation2015). However, the generated sentences are still of low quality with regard to semantics and global coherence, and are often grammatically imperfect (Caccia et al., Citation2020).

These issues give rise to a large discrepancy between generated text and real text. Two underlying causes are the architecture and the number of parameters of the LM itself (Radford et al., Citation2019; Santoro et al., Citation2018). Many researchers attribute the discrepancy to exposure bias (Bengio et al., Citation2015): an LM is trained with maximum likelihood estimation (MLE) and predicts the next word conditioned on words from the ground truth during training, whereas during inference it conditions only on words it has generated itself.

Statistically, this discrepancy means the two distributional functions of real texts and generated texts are different. Reducing this distributional difference may be a practicable way to improve text generation.

Some researchers try to reduce this difference with GANs (Goodfellow et al., Citation2014), following their success in image generation (Wu et al., Citation2021), image classification (Cao et al., Citation2021) and stock prediction (Li et al., Citation2022; Wu et al., Citation2022). They use a discriminator to detect the discrepancy between real and generated samples, and feed the signal back to update the generator (an LM). To solve the non-differentiability that arises from handling discrete tokens, reinforcement learning (RL) (Williams, Citation1992) was adopted by SeqGAN (Yu et al., Citation2017), RankGAN (Lin et al., Citation2017) and LeakGAN (Guo et al., Citation2018); the Gumbel-Softmax was introduced by GSGAN (Jang et al., Citation2017) and RelGAN (Nie et al., Citation2019) for the same purpose. These language GANs pre-train both the generator (G) and the discriminator (D) before adversarial learning.Footnote1 During adversarial learning, in each round G is trained for several epochs and D is trained for tens of epochs; learning does not stop until the model converges. Furthermore, to consider the quality and diversity of generated texts simultaneously (Shi et al., Citation2018), MaskGAN (Fedus et al., Citation2018), DpGAN (Xu et al., Citation2018) and FMGAN (Chen et al., Citation2018) were proposed. They evaluate the generated text with Bleu (Papineni et al., Citation2002) versus self-Bleu (Zhu et al., Citation2018), or LM score versus reverse LM score (Cífka et al., Citation2018), and claim these GANs improve the performance of the generator.

However, questions have recently been raised over these claims. Semeniuta et al. (Citation2019) and Caccia et al. (Citation2020) showed, through more precise experiments and evaluation, that these GAN variants are outperformed by a well-adjusted language model. They draw a performance curve in a quality-diversity space by adjusting the softmax temperature. Bleu and language model scores are usually used to measure local and global quality, respectively; self-Bleu and reverse language model scores measure local and global diversity, respectively. To overcome the limitations of these two-dimensional metrics, de Masson et al. (Citation2019) proposed a single metric, the Fréchet embedding distance (FED), which computes the Fréchet distance between two Gaussian distributions. However, Cai et al. (Citation2021) showed that none of these metrics is appropriate for evaluating an unconditional text generator and proposed a novel metric. In short, whether these language GANs fail is still an open problem.

We investigate this issue in depth. For language GANs, several critical questions remain unanswered: whether D detects the discrepancy, whether the detected discrepancy is severe, and whether the signals from D can improve the generator. In this paper, we address these questions by investigating GANs during both pre-training and adversarial learning. Theoretically analysing the signal from D, we employ the approximate discrepancy and the absolute discrepancy (Cai et al., Citation2021) to measure the distributional discrepancy. With these two functions, we first measure the discrepancy between real text and fake text generated by an MLE-trained (pre-trained) language model. Second, we try several methods that update the generator with the feedback signal from D, and use the two metric functions to evaluate the updated generator. Finally, we analyse the performance of two typical language GANs during adversarial learning with these two functions on two benchmark datasets.

Our contributions are as follows:

  • We are the first to measure the variation of the distributional discrepancy between real and generated text during the training of language GANs, by designing and implementing two metric functions based on a discriminator.

  • Although this discrepancy can be detected by a discriminator (D), the feedback signal from D cannot improve G using existing methods; instead, the discrepancy increases with adversarial learning.

  • Experimenting on two existing language GANs, SeqGAN and RelGAN, we find that the distributional discrepancy between real and generated text increases with more adversarial learning rounds. This demonstrates that existing adversarial learning does not work, and industrial systems need not pursue this approach.

The rest of the paper is organised as follows. Section 2 describes the related work. Section 3 introduces the proposed method to measure the distributional discrepancy. The next section presents the experimental procedure in detail. The experiments and analysis are shown in Section 5. Finally, we give a short summary in Section 6.

2. Related work

Many GAN-based models have been proposed to improve neural language models. SeqGAN (Yu et al., Citation2017) attacked the non-differentiability issue by resorting to RL. Applying a policy gradient method (Sutton et al., Citation2000), it optimises the LSTM generator with rewards obtained through Monte Carlo (MC) sampling. RankGAN and MaliGAN (Che et al., Citation2017) also used this technique, although MC search is inefficient. RL-free models, e.g. GSGAN, instead apply a continuous approximation of the softmax function and work directly on a continuous latent space. TextGAN (Salimans et al., Citation2016) added the Maximum Mean Discrepancy to the original GAN objective based on feature matching. Considering the drawbacks of pre-training a neural language model, Nie et al. (Citation2019) proposed RelGAN, which uses relational memory (Santoro et al., Citation2018) to allow interactions between memory slots via the self-attention mechanism (Vaswani et al., Citation2017). Gu and Cheung (Citation2018) optimised GANs with evolutionary algorithms. We select SeqGAN and RelGAN as representatives for this study. The results show that adversarial learning does not work for either of these models.

Caccia et al. (Citation2020) argued that it is treacherous to assume current evaluation measures correlate with human judgment (Cífka et al., Citation2018). They proposed a temperature sweep, which evaluates a model at many temperature settings rather than only one. By drawing curves in a quality-diversity space, such as Bleu versus self-Bleu or language model score versus reverse language model score, they showed that a well-adjusted language model can beat the language GANs under consideration. Unfortunately, Bleu versus self-Bleu has its own limitations: a 5-gram language model trained by de Masson et al. (Citation2019) achieved scores even better than the training data. Cai et al. (Citation2021) also revealed that these metrics are unreliable and proposed a novel metric for evaluating unconditional text generation by calculating the distributional discrepancy between two text sets; this single metric measures quality and diversity simultaneously. We adopt it and propose a simpler version. Semeniuta et al. (Citation2019) and He et al. (Citation2021) also argued that GAN-based models are weaker than LMs, because they observed a less severe impact of exposure bias; the latter further quantified exposure bias using conditional distributions. Clearly, the existing methods assessed only the final generated texts, neglecting dynamic evaluation of the adversarial learning process. Unlike these methods, we investigate the mechanism of language GANs and quantify the discrepancy between real and generated texts both after pre-training and throughout the adversarial learning process.

3. Method

In a GAN, the generator $G_\theta$ implicitly defines a probability distribution $p_\theta(x)$ to mimic the real data distribution $p_d(x)$. Here $\theta$ denotes the parameters of the language model $G_\theta$, $\phi$ denotes the parameters of the discriminator $D_\phi$ in the value function $V$ below, and $p_\theta$ is the distributional function of $G_\theta$.

(1) $\min_{G_\theta}\max_{D_\phi} V(D_\phi, G_\theta) = \mathbb{E}_{p_d(x)}[\log D_\phi(x)] + \mathbb{E}_{p_\theta(x)}[\log(1 - D_\phi(x))]$

Alternating optimisation of $G_\theta$ and $D_\phi$ is used to solve this equation. Given $\theta$, to detect the discrepancy between $p_\theta(x)$ and $p_d(x)$, we optimise $D_\phi$ as follows:

(2) $\max_{D_\phi} V(D_\phi, G_\theta) = \max_{D_\phi} \mathbb{E}_{p_d(x)}[\log D_\phi(x)] + \mathbb{E}_{p_\theta(x)}[\log(1 - D_\phi(x))]$

Assuming $D_{\phi^*}(x)$ is the optimal solution for a given $\theta$, then according to Goodfellow et al. (Citation2014),

(3) $D_{\phi^*}(x) = \dfrac{p_d(x)}{p_d(x) + p_\theta(x)}$

from which the following is obvious:

(4) $D_{\phi^*}(x) \ge 0.5 \text{ iff } p_d(x) \ge p_\theta(x); \quad D_{\phi^*}(x) < 0.5 \text{ iff } p_d(x) < p_\theta(x)$

Because the real distribution $p_d$ cannot be obtained in practice, the discrepancy cannot be measured directly from Equation (3). Fortunately, we have massive numbers of real sentences, each of which can be regarded as a sample $x$ from $p_d$. Based on these real samples and the above equation, we obtain a way to estimate the distributional discrepancy.
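To make Equation (2) concrete, the following is a minimal PyTorch-style sketch of one discriminator update: maximising $V(D_\phi, G_\theta)$ over $\phi$ is equivalent to minimising the binary cross-entropy with label 1 for real samples and 0 for generated ones. The `discriminator_step` helper and its batch interface are our own illustration, not code from the models under study.

```python
import torch
import torch.nn.functional as F

def discriminator_step(D, real_batch, fake_batch, optimizer):
    """One ascent step on V(D, G) from Equation (2).

    Maximising E_{p_d}[log D(x)] + E_{p_theta}[log(1 - D(x))] is the same as
    minimising binary cross-entropy with label 1 for real samples and label 0
    for generated samples. D is assumed to output a probability in (0, 1).
    """
    optimizer.zero_grad()
    d_real = D(real_batch)   # D(x) for x ~ p_d
    d_fake = D(fake_batch)   # D(x) for x ~ p_theta
    loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
         + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    loss.backward()
    optimizer.step()
    return loss.item()
```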

3.1. Approximate discrepancy

Let

(5) $q_d(x) = \dfrac{p_d(x)}{p_d(x) + p_\theta(x)}, \qquad q_\theta(x) = \dfrac{p_\theta(x)}{p_d(x) + p_\theta(x)}$

Therefore, $q_d(x) = p(x \text{ comes from real data} \mid x)$, $q_\theta(x) = p(x \text{ comes from generated data} \mid x)$, and $q_d(x) + q_\theta(x) = 1$. With Equation (5), we can derive a constraint and an approximate measure of the distributional discrepancy. Figure 1(a) illustrates the relationship between $q_\theta(x)$ and $q_d(x)$.

Figure 1. Illustration of the two measures. (a) Approximate discrepancy: the yellow area denotes the negative part and the green area the positive part. (b) Absolute discrepancy: half of the shaded area equals the result of Equation (10).


Let

(6) $u_d = \mathbb{E}_{p_d(x)}[D_{\phi^*}(x)], \qquad u_\theta = \mathbb{E}_{p_\theta(x)}[D_{\phi^*}(x)]$

These two statistics are the expectations of $D_{\phi^*}$'s predictions on real text and on generated text, respectively. From the above equation, it is easy to obtain

(7) $\tfrac{1}{2}[u_d + u_\theta] = 0.5$

This result gives a constraint on $D_\phi$ converging to $D_{\phi^*}$, which we should take into account when estimating the ideal function $D_{\phi^*}$. By Equation (3), optimising the discriminator increases $u_d$ and decreases $u_\theta$ as far as possible. We can therefore estimate the distributional discrepancy with the following function.

Intuitively, using $u_d$ and $u_\theta$, we obtain a metric function for the discrepancy between $p_\theta(x)$ and $p_d(x)$:

(8) $d_a = |u_d - u_\theta|$

We call this the approximate discrepancy. It is the difference between the average scores that a well-trained discriminator (denoted $\hat{D}_\phi$) assigns to real samples and to generated samples, and it reflects the discrepancy between the two sets to some degree. From Equations (5), (6) and (8), we obtain Equation (9):

(9) $d_a = \int |q_d(x) - q_\theta(x)|\, p_d(x)\, dx = \mathbb{E}_{p_d(x)}|q_d(x) - q_\theta(x)|$

The range of $d_a$ is $[0, 1]$; the larger its value, the larger the discrepancy. When $p_d(x) = p_\theta(x)$, i.e. there is no discrepancy, $d_a = 0$. Conversely, $d_a = 1$ if $p_d(x)\,p_\theta(x) = 0$ for all $x$, i.e. the two supports are disjoint. Figure 1(a) illustrates the discrepancy between the two distributional functions $q_\theta(x)$ and $q_d(x)$; both are symmetric about the line $q = 0.5$.

Cai et al. (Citation2021) proposed a novel metric that is more complete than ours, because it contains both a positive part and a negative part, as presented in Figure 1(b). It is defined as the absolute discrepancy $d_s$:

(10) $d_s = \tfrac{1}{2}\big[\mathbb{E}_{p_d(x)}\mathbb{1}[D_{\phi^*}(x) > 0.5] - \mathbb{E}_{p_d(x)}\mathbb{1}[D_{\phi^*}(x) \le 0.5] + \mathbb{E}_{p_\theta(x)}\mathbb{1}[D_{\phi^*}(x) \le 0.5] - \mathbb{E}_{p_\theta(x)}\mathbb{1}[D_{\phi^*}(x) > 0.5]\big]$

The range of $d_s$ is also $[0, 1]$. The drawback of this metric is that it needs more computation than ours. Both metrics are used in this paper.
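As a sketch of how both metrics can be estimated from a converged discriminator's scores, assuming `scores_real` and `scores_fake` are NumPy arrays of $\hat{D}_\phi(x)$ values on equally sized real and generated test sets (the function names are ours, not from the paper's code):

```python
import numpy as np

def approximate_discrepancy(scores_real, scores_fake):
    """Estimate d_a from Equation (8): |u_d - u_theta|, the absolute
    difference of the mean scores on real and generated samples."""
    return abs(np.mean(scores_real) - np.mean(scores_fake))

def absolute_discrepancy(scores_real, scores_fake):
    """Estimate d_s from Equation (10). Each bracketed pair reduces to
    2 * (fraction classified correctly) - 1, so d_s equals
    (accuracy on real) + (accuracy on generated) - 1."""
    real_correct = np.mean(scores_real > 0.5)   # E_{p_d} 1[D(x) > 0.5]
    fake_correct = np.mean(scores_fake <= 0.5)  # E_{p_theta} 1[D(x) <= 0.5]
    return 0.5 * ((2 * real_correct - 1) + (2 * fake_correct - 1))
```

Under this reading, $d_s$ is 0 when the discriminator does no better than chance and 1 when it separates the two sets perfectly.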

3.2. Using $D_{\phi^*}(x)$ to improve $G_\theta$

Given an instance $x$ generated by $G_\theta$, a larger $D_{\phi^*}(x)$ means a larger probability that $x$ comes from the real data. For example, if $D_{\phi^*}(x) = 0.8$, then $p_\theta(x) < p_d(x)$ by Equation (3), so we should update $G_\theta$ to increase the probability density $p_\theta(x)$; this may improve the performance of $G_\theta$. Based on this, we can select generated instances by the value of $D_{\phi^*}(x)$ to update the generator. In fact, we find this improves performance a little compared with random selection; however, it is still worse than applying no such update at all. Experiment 5.3 shows the results.
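A minimal sketch of this selection step, assuming the generator produces a tensor of candidate sequences and $\hat{D}_\phi$ returns one score per sequence; `top_fraction` is a hypothetical hyper-parameter we introduce for illustration, not a value from the paper:

```python
import torch

def select_by_score(D, generated_batch, top_fraction=0.5):
    """Keep only the generated samples to which D assigns the highest
    scores (i.e. those that look most real); these would then be fed to
    the MLE update of G. A sketch of the selection heuristic above."""
    with torch.no_grad():
        scores = D(generated_batch).squeeze(-1)  # D(x) for each sample
    k = max(1, int(top_fraction * len(scores)))
    top_idx = torch.topk(scores, k).indices      # highest-scoring samples
    return generated_batch[top_idx]
```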

4. Implementation procedure

The optimal function $D_{\phi^*}$ is an ideal function that can only be estimated statistically by an approximating function. We design a function $D_\phi$, sample from the real and generated data, and train $D_\phi$ according to Equation (2). When the training converges, we obtain $\hat{D}_\phi$, the approximation of $D_{\phi^*}$. The degree of approximation is mainly determined by three factors: the architecture and number of parameters of $D_\phi$, the volume of training data, and the hyper-parameter settings.

Based on the above analysis, we obtain two metric functions to measure the distributional discrepancy between datasets A and B (for example, A is composed of real sentences while B consists of machine-generated sentences). The implementation procedure is as follows:

Step 1: Design a discriminator $D_\phi$.

Step 2: Divide sets A and B, respectively, into training sets $D^A_{train}$ and $D^B_{train}$, validation sets $D^A_{dev}$ and $D^B_{dev}$, and test sets $D^A_{test}$ and $D^B_{test}$. Each partition should contain as equal a number of instances from A and B as possible, for classification training.

Step 3: Optimise $D_\phi$ with $D^A_{train}$ and $D^B_{train}$ according to Equation (2). Validating on $D^A_{dev}$ and $D^B_{dev}$, we judge whether $D_\phi$ has converged and thereby obtain $\hat{D}_\phi$.

Step 4: According to Equations (8) and (10), with the two test sets we estimate the discrepancy between the distributional functions of datasets A and B; $\hat{d}_s$ denotes the absolute discrepancy and $\hat{d}_a$ the approximate discrepancy.

Algorithm 1 illustrates this procedure. Generally speaking, $d_s \ge \hat{d}_s$ should hold; because $D_{\phi^*}$ cannot be obtained, it is hard to quantify how closely $\hat{d}_s$ approximates $d_s$. Many research results have shown that deep neural network discriminators are very powerful, some even exceeding human performance on tasks such as image classification (He et al., Citation2016) and text classification (Kim, Citation2014). So, if a $D_\phi$ built with a CNN and an attention mechanism is well trained, $\hat{D}_\phi$ will be a meaningful approximation of $D_{\phi^*}$, and we can obtain meaningful approximations of $d_s$ and $d_a$ via $\hat{D}_\phi$.
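The following sketch strings Steps 1-4 together. Here `train_D` and `score` are assumed stand-ins for one training epoch of the discriminator and its scoring pass; the three-way split and the early-stopping patience are our simplifications of the paper's setup, not its exact configuration:

```python
import numpy as np

def estimate_discrepancy(train_D, score, set_a, set_b,
                         max_epochs=10_000, patience=50):
    """Sketch of Steps 1-4 for two sentence sets A and B."""
    # Step 2: equal-sized train/validation/test partition of each set.
    def split(s):
        n = len(s) // 3
        return s[:n], s[n:2 * n], s[2 * n:]
    train_a, dev_a, test_a = split(set_a)
    train_b, dev_b, test_b = split(set_b)

    # Step 3: optimise D on the training split; stop when validation
    # accuracy stops improving (a proxy for convergence to D_phi_hat).
    best_acc, stale = 0.0, 0
    for _ in range(max_epochs):
        train_D(train_a, train_b)
        acc = 0.5 * (np.mean(score(dev_a) > 0.5) + np.mean(score(dev_b) <= 0.5))
        if acc > best_acc:
            best_acc, stale = acc, 0
        else:
            stale += 1
            if stale >= patience:
                break

    # Step 4: estimate d_a (Eq. 8) and d_s (Eq. 10) on the held-out test split.
    sa, sb = score(test_a), score(test_b)
    d_a = abs(sa.mean() - sb.mean())
    d_s = np.mean(sa > 0.5) + np.mean(sb <= 0.5) - 1.0
    return d_a, d_s
```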

5. Experiment

We select SeqGAN and RelGAN as representative models for our experiment, and the benchmark datasets are the same as those used by these models previously. We then show that the well-trained discriminator $\hat{D}_\phi$ can measure the discrepancy between real and generated texts, and point out that the existing GAN-based methods do not work. Finally, a third-party discriminator is used to evaluate the performance of adversarial learning as training iterations increase.

5.1. Datasets and model settings

Both SeqGAN and RelGAN used a dataset of relatively short sentences (COCO image captions)Footnote2 and a dataset of long sentences (EMNLP2017 WMT news).Footnote3 In the former, the average sentence length is about 11 words, there are 4682 word types in total, and the longest sentence has 37 words; both the training and test sets contain 10,000 sentences. In the latter, the average sentence length is about 20 words, there are 5255 word types in total, and the longest sentence has 51 words; all of the training data, about 280 thousand sentences, is used, and the test set contains 10,000 sentences. Following Section 3, each test set is divided into two parts: half is the validation set and the remaining half is the test set. We always generate the same number of sentences as in each of these two sets for comparison.

For the two models, all hyper-parameters, including word embedding size, learning rate and dropout, are set as in their original papers. For RelGAN, the standard GAN loss (the non-saturating version) is adopted, because the relativistic standard loss used in Nie et al. (Citation2019) does not meet the constraint of Equation (7); when measuring RelGAN's discrepancy during the adversarial stage, however, its own loss function remains the relativistic standard loss. A critical hyper-parameter, the temperature, is set to 100, which gave the best result in their paper. When training $D_\phi$, we always train for 10,000 epochs and observe performance on the validation set.
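The distinction between these two losses is what matters for Equation (7). Below is a sketch of both discriminator losses on raw logits, written from their standard definitions (our own formulation, not code from the RelGAN release):

```python
import torch
import torch.nn.functional as F

def d_loss_non_saturating(c_real, c_fake):
    """Standard (non-saturating) GAN discriminator loss. c_real / c_fake
    are raw logits, so sigmoid(c) is a probability and the constraint of
    Equation (7) applies to the resulting discriminator."""
    return F.binary_cross_entropy_with_logits(c_real, torch.ones_like(c_real)) \
         + F.binary_cross_entropy_with_logits(c_fake, torch.zeros_like(c_fake))

def d_loss_relativistic(c_real, c_fake):
    """Relativistic standard loss as used in the original RelGAN paper:
    it scores real samples only *relative* to paired fake ones, so
    sigmoid(c) is no longer an estimate of D_phi*(x) and Equation (7)
    need not hold."""
    return F.binary_cross_entropy_with_logits(
        c_real - c_fake, torch.ones_like(c_real))
```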

5.2. Distributional differences in pre-training

We estimate the distributional differences caused by the MLE-based generators. We first train the generator for N epochs and then train $D_\phi$ until it converges (this takes 10,000 epochs). For example, following Nie et al. (Citation2019), we train $G_\theta$ for 150 epochs and select the checkpoint with the smallest perplexity (PPL) on the validation set. Then $D_\phi$ is trained following the procedure in Section 4.
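Checkpoint selection can be sketched as follows, where `eval_nll` is an assumed helper returning the mean per-token negative log-likelihood of the validation set under a checkpoint (PPL is the exponential of that mean):

```python
import math

def best_checkpoint_by_ppl(eval_nll, checkpoints, val_set):
    """Pick the pre-trained generator checkpoint with the lowest
    validation perplexity: PPL = exp(mean per-token NLL)."""
    best, best_ppl = None, float("inf")
    for ckpt in checkpoints:
        ppl = math.exp(eval_nll(ckpt, val_set))
        if ppl < best_ppl:
            best, best_ppl = ckpt, ppl
    return best, best_ppl
```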

Figure 2 shows the discrepancy between real and generated texts. The discrepancy increases as the discriminator is trained, until training stabilises.

Figure 2. Discrepancy between real and generated texts for the two generators after 80 epochs of pre-training, across the two datasets. Red lines denote the absolute discrepancy and blue lines the approximate discrepancy. Pale lines show the discrepancy on individual sampled batches, and each solid curve is the exponential moving average over these batches per epoch. (a) Discrepancy of SeqGAN on EMNLP. (b) Discrepancy of SeqGAN on COCO. (c) Discrepancy of RelGAN on EMNLP. (d) Discrepancy of RelGAN on COCO.


Figure 3 shows the discriminator's predictions on real and machine-generated texts, respectively. The larger the difference between the scores on the two sets, the larger the distributional discrepancy. From this figure, we can see that $D_\phi$ converges after about 3000 epochs for RelGAN, whereas SeqGAN needs more epochs to train the discriminator because it uses an LSTM as the generator.

Figure 3. Discriminator scores on real and generated texts for the two generators after 80 epochs of pre-training, across the two datasets. Red lines denote the score on the real validation set and blue lines the score on the machine-generated validation set. Pale lines show scores on individual sampled batches, and each solid curve is the exponential moving average over these batches per epoch. (a) Accuracy with SeqGAN on EMNLP. (b) Accuracy with SeqGAN on COCO. (c) Accuracy with RelGAN on EMNLP. (d) Accuracy with RelGAN on COCO.


Because the curves above show smoothed values on single batches rather than predictions on the whole data, we also use the converged discriminator to predict on all of the validation data and generated data.Footnote4 Table 1 summarises the discrepancy across the two models and two datasets. It shows that the difference between real text and generated text does exist, and that it is large.

Table 1. Discrepancy across two models and two datasets after pre-training.

5.3. The discrepancy detected by $\hat{D}_\phi$ cannot improve the generator

We explore whether $G_\theta$ can be improved with the discrepancy detected by $\hat{D}_\phi$ at the end of pre-training. We select the best pre-training checkpoint for $G_\theta$; note that $\hat{D}_\phi$ is well trained with sufficient real sentences and sentences generated by $G_\theta$. Then $G_\theta$ is updated according to the signals from $\hat{D}_\phi$. To verify the effect of the feedback signals, we generate many batches of instances, rather than only a few, to adjust $\theta$.

Then, fixing $G_\theta$, we re-train $D_\phi$ for 10,000 epochs to obtain a newly converged discriminator and compute the two distributional discrepancies according to Equations (9) and (10). Unfortunately, in terms of both the absolute and the approximate discrepancy, the discrepancy always exceeds the original value computed after pre-training, demonstrating that the generator is not further improved. Figure 4 illustrates the comparison.

Figure 4. Comparison of the discrepancy after pre-training with that after the generator is updated using the feedback signals from $\hat{D}_\phi$ obtained in pre-training. The vertical dashed line marks the end of pre-training. (a) Approximate discrepancy on EMNLP. (b) Absolute discrepancy on EMNLP. (c) Approximate discrepancy on COCO. (d) Absolute discrepancy on COCO.


Besides following Zhu et al. (Citation2018), we also propose a new method to update $G_\theta$ adversarially. Rather than using all generated instances to update $G_\theta$, only those assigned relatively high scores by $\hat{D}_\phi$ are used; we denote this method HW. The rationale is that higher-scoring instances may be more informative than lower-scoring ones. We also experiment with the opposite method, using only the samples with relatively low scores to adjust the generator. Unfortunately, all of these methods fail. Table 2 lists the discrepancy across the two datasets under different settings; the discrepancy is always larger than that after pre-training.

Table 2. Comparison of the absolute discrepancy after pre-training and after $G_\theta$ is updated with $\hat{D}_\phi$'s feedback signal.

5.4. A third-party discriminator evaluates these language GANs

To evaluate the GANs across adversarial learning, we use a third-party discriminator $D_{\phi 3}$, a clone of the discriminator in its counterpart language GAN except for the parameter values. For each adversarial round, we train $D_{\phi 3}$ from scratch for many epochs (verifying its convergence) with real and generated text, and then compute the two distributional discrepancies from its predictions. Figure 5 shows the dynamic evaluation result. In terms of both the approximate and the absolute discrepancy, the distributional difference between real and generated text does not decrease as more adversarial learning rounds are applied. Once again, the results show that the approach of the existing language GANs cannot improve text generation.
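A sketch of this per-round evaluation loop follows; every callable here (`make_fresh_discriminator`, `train_to_convergence`, the generator's `sample`) is an assumed stand-in for the components described above, not an API from the SeqGAN or RelGAN code bases:

```python
def evaluate_gan_rounds(adversarial_rounds, make_fresh_discriminator,
                        train_to_convergence, real_text):
    """For each adversarial round, re-initialise a third-party
    discriminator D_phi3, train it from scratch on real vs. freshly
    generated text, and record both discrepancy metrics."""
    history = []
    for round_idx, generator in enumerate(adversarial_rounds):
        fake_text = generator.sample(len(real_text))  # same size as real set
        d3 = make_fresh_discriminator()               # re-initialised clone
        # Assumed to return NumPy arrays of converged scores on both sets.
        scores_real, scores_fake = train_to_convergence(d3, real_text, fake_text)
        d_a = abs(scores_real.mean() - scores_fake.mean())                    # Eq. (8)
        d_s = (scores_real > 0.5).mean() + (scores_fake <= 0.5).mean() - 1.0  # Eq. (10)
        history.append((round_idx, d_a, d_s))
    return history
```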

Figure 5. A third-party discriminator evaluates the two GANs' performance on COCO. (a) SeqGAN. (b) RelGAN.


6. Conclusion and future work

Unconditional text generation is a stepping stone for conditional text generation such as news generation and text summarisation. It is unclear whether GANs can improve unconditional text generation. We present two metric functions to measure the discrepancy between real and generated text, and numerous experiments show that this discrepancy does exist. We use various methods to update the generator parameters according to the detected discrepancy signals; unfortunately, the distributional difference between real and generated data does not decrease, indicating the difficulty of improving the generator with these signals. Finally, we use a third-party discriminator to evaluate the effectiveness of the GANs and find that, with more adversarial learning epochs, the discrepancy increases rather than decreases. Our study provides valuable information for industry through an in-depth analysis showing that the existing language GANs do not work.

Much remains for future work. First, novel methods for exploiting the reward signals to improve the generator are worth further study. Second, besides the constraints from intrinsic language characteristics, common sense and logic should be introduced to improve text generation. Finally, diverse applications, such as conversation generation on chat platforms, should be further investigated.

Acknowledgments

We thank the anonymous reviewers for their valuable comments.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the National Natural Science Foundation of China [grant numbers 61936012, 61976114, 81373056] and the National Key Research and Development Program of China [grant number 2018YFB1005102].

Notes

1 An exception is RelGAN, which does not need to pre-train D.

4 According to Section 3, we sample as many generated instances as there are test instances.

References

  • Bengio, S., Vinyals, O., Jaitly, N., & Shazeer, N. (2015). Scheduled sampling for sequence prediction with recurrent neural networks. International Conference on Neural Information Processing Systems, (pp. 1171–1179). https://dl.acm.org/doi/10.5555/2969239.2969370
  • Caccia, M., Caccia, L., Fedus, W., Larochelle, H., Pineau, J., & Charlin, L. (2020). Language GANs falling short. International Conference on Learning Representation. https://openreview.net/pdf?id=BJgza6VtPB
  • Cai, P., Chen, X., Jin, P., Wang, H., & Li, T. (2021). Distributional discrepancy: A metric for unconditional text generation. Knowledge-Based Systems, 2021(217), 1–9. https://doi.org/10.1016/j.knosys.2021.106850
  • Cao, Z., Zhou, Y., Yang, A., & Peng, S. (2021). Deep transfer learning mechanism for fine-grained cross-domain sentiment classification. Connection Science, 33(4), 911–928. https://doi.org/10.1080/09540091.2021.1912711
  • Che, T., Li, Y., Zhang, R., Hjelm, D., & Bengio, Y. (2017). Maximum-likelihood augmented discrete generative adversarial networks. https://arxiv.org/abs/1702.07983
  • Chen, L., Dai, S., Tao, C., Shen, D., Gan, Z., Zhang, H., Zhang, Y., & Carin, L. (2018). Adversarial text generation via feature-mover's distance. International Conference on Neural Information Processing Systems, (pp. 4671–4682). https://dl.acm.org/doi/10.5555/3327345.3327377
  • Cífka, O., Severyn, A., Alfonseca, E., & Filippova, K. (2018). Eval all, trust a few, do wrong to none: Comparing sentence generation models. https://arxiv.org/abs/1804.07972
  • de Masson, C., Rosca, M., Rae, J., & Mohamed, S. (2019). Training language GANs from scratch. International Conference on Neural Information Processing Systems, (pp. 4300–4311). https://dl.acm.org/doi/10.5555/3454287.3454674
  • Fedus, W., Goodfellow, I., & Dai, A. (2018). MaskGAN: Better Text Generation via Filling in the ______. International Conference on Learning Representation. https://openreview.net/pdf?id=ByOExmWAb
  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. International Conference on Neural Information Processing Systems, (pp. 2672–2680). https://dl.acm.org/doi/10.5555/2969033.2969125
  • Gu, F., & Cheung, Y. (2018). Self-Organizing Map-Based Weight Design for Decomposition-Based Many-Objective Evolutionary Algorithm. IEEE Transactions on Evolutionary Computation, 22(2), 211–225. https://doi.org/10.1109/TEVC.2017.2695579
  • Guo, J., Lu, S., Han, C., Zhang, W., & Wang, J. (2018). Long text generation via adversarial training with leaked information. AAAI Conference on Artificial Intelligence, (pp. 5141–5148). https://dlnext.acm.org/doi/10.5555/3504035.3504665
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, (pp. 770–778). https://doi.org/10.1109/CVPR.2016.90
  • He, T., Zhang, J., Zhou, Z., & Glass, J. (2021). Exposure bias versus self-Recovery: Are distortions really incremental for autoregressive text generation? https://arxiv.org/abs/1905.10617
  • Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
  • Jang, E., Gu, S., & Poole, B. (2017). Categorical reparameterization with Gumbel-Softmax. International Conference on Learning Representation. https://openreview.net/pdf?id=rkE3y85ee
  • Kim, Y. (2014). Convolutional neural networks for sentence classification. Conference on Empirical Methods in Natural Language Processing, (pp. 1746–1751). https://doi.org/10.3115/v1/D14-1181
  • Li, Y., Dai, H., & Zheng, Z. (2022). Selective transfer learning with adversarial training for stock movement prediction. Connection Science, 34(1), 492–510. https://doi.org/10.1080/09540091.2021.2021143
  • Lin, K., Li, D., He, X., Zhang, Z., & Sun, M. (2017). Adversarial ranking for language generation. International Conference on Neural Information Processing Systems, (pp. 3158–3168). https://dl.acm.org/doi/10.5555/3294996.3295075
  • Lin, N., Li, J., & Jiang, S. (2022). A simple but effective method for Indonesian automatic text summarisation. Connection Science, 34(1), 29–43. https://doi.org/10.1080/09540091.2021.1937942
  • Nie, W., Narodytska, N., & Patel, A. (2019). RelGAN: Relational generative adversarial networks for text generation. International Conference on Learning Representation. https://openreview.net/pdf?id=rJedV3R5tm
  • Papineni, K., Roukos, S., Ward, T., & Zhu, W. (2002). Bleu: A method for automatic evaluation of machine translation. Annual Meeting of the Association for Computational Linguistics, (pp. 311–318). https://doi.org/10.3115/1073083.1073135
  • Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners (Technical Report). OpenAI.
  • Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X. (2016). Improved techniques for training GANs. International Conference on Neural Information Processing Systems, (pp. 2234–2242).
  • Santoro, A., Faulkner, R., Raposo, D., Rae, J., Chrzanowski, M., Weber, T., Wierstra, D., Vinyals, O., Pascanu, R., & Lillicrap, T. (2018). Relational recurrent neural networks. International Conference on Neural Information Processing Systems, (pp. 7299–7310). https://dl.acm.org/doi/epdf/10.5555/3327757.3327832
  • Semeniuta, S., Severyn, A., & Gelly, S. (2019). On accurate evaluation of GANs for language generation. https://arxiv.org/pdf/1806.04936
  • Shi, Z., Chen, X., Qiu, X., & Huang, X. (2018). Toward diverse text generation with inverse reinforcement learning. International Joint Conference on Artificial Intelligence, (pp. 4361–4367). https://dl.acm.org/doi/abs/10.5555/3304222.3304376
  • Sutton, R., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. International Conference on Neural Information Processing Systems, (pp. 1057–1063). https://dl.acm.org/doi/10.5555/3009657.3009806
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. International Conference on Neural Information Processing Systems.
  • Williams, R. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3), 229–256. https://doi.org/10.1007/BF00992696
  • Wu, Q., Zhu, B., Yong, B., Wei, Y., Jiang, X., Zhou, R., & Zhou, Q. (2021). ClothGAN: Generation of fashionable Dunhuang clothes using generative adversarial networks. Connection Science, 33(2), 341–358. https://doi.org/10.1080/09540091.2020.1822780
  • Wu, S., Liu, Y., Zou, Z., & Weng, T. (2022). S_I_LSTM: Stock price prediction based on multiple data sources and sentiment analysis. Connection Science, 34(1), 44–62. https://doi.org/10.1080/09540091.2021.2021143
  • Xu, J., Ren, X., Lin, J., & Sun, X. (2018). Diversity-promoting GAN: A cross-entropy based generative adversarial network for diversified text generation. Conference on Empirical Methods in Natural Language Processing, (pp. 3940–3949). https://doi.org/10.18653/v1/D18-1428
  • Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. International Conference on Machine Learning, (pp. 2048–2057). https://dl.acm.org/doi/10.5555/3045118.3045336
  • Yu, L., Zhang, W., Wang, J., & Yu, Y. (2017). SeqGAN: Sequence generative adversarial nets with policy gradient. AAAI Conference on Artificial Intelligence, (pp. 2852–2858). https://dl.acm.org/doi/10.5555/3298483.3298649
  • Zellers, R., Holtzman, A., Rashkin, H., Bisk, Y., Farhadi, A., Roesner, F., & Choi, Y. (2019). Defending against neural fake news. International Conference on Neural Information Processing Systems, (pp. 9054–9065). https://dl.acm.org/doi/10.5555/3454287.3455099
  • Zhu, Y., Lu, S., Lei, Z., Guo, J., Zhang, W., Wang, J., & Yu, Y. (2018). Texygen: A benchmarking platform for text generation models. International ACM SIGIR Conference on Research & Development in Information Retrieval, (pp. 1097–1100). https://doi.org/10.1145/3209978.3210080