
Towards Autoencoding Variational Inference for Aspect-Based Opinion Summary


ABSTRACT

Aspect-based Opinion Summary (AOS), consisting of aspect discovery and sentiment classification steps, has recently been emerging as one of the most crucial data mining tasks in e-commerce systems. Along this direction, the LDA-based model is considered a notably suitable approach, since this model offers both topic modeling and sentiment classification. However, unlike traditional topic modeling, aspect discovery often requires some initial seed words, whose prior knowledge is not easy to incorporate into LDA models. Moreover, LDA approaches rely on sampling methods, which need to load the whole corpus into memory, making them hard to scale. In this research, we study an alternative approach to the AOS problem, based on Autoencoding Variational Inference (AVI). Firstly, we introduce the Autoencoding Variational Inference for Aspect Discovery (AVIAD) model, which extends the previous work of Autoencoding Variational Inference for Topic Models (AVITM) to embed prior knowledge of seed words. This work includes an enhancement of the previous AVI architecture and a modification of the loss function. Ultimately, we present the Autoencoding Variational Inference for Joint Sentiment/Topic (AVIJST) model, in which we substantially extend the AVI model to support the JST model, which performs topic modeling for the corresponding sentiment. The experimental results show that our proposed models enjoy higher topic coherence, faster convergence time and better accuracy on sentiment classification, as compared to their LDA-based counterparts.

Introduction

Recently, Aspect-based Opinion Summary (AOS) Hu and Liu (Citation2004) has been introduced as an emerging data mining process in e-commerce systems. Generally, this task aims to extract aspects from a product review and subsequently infer the sentiments of the review writer towards the extracted aspects. The result of an AOS task is illustrated in Figure 1. Thus, AOS consists of two major steps, known as aspect discovery and aspect-based sentiment analysis. For the first step, there are two major approaches. The first one relies on linguistic methods, such as part-of-speech and dependency grammar analysis Qiu et al. (Citation2011), or on supervised methods Jin and Ho (Citation2009). However, this approach is generally able to detect only the explicit aspects, i.e., the aspects that are referred to explicitly in the text. For example, the review "The price of this restaurant is quite high" can be inferred as a mention of the aspect price, explicitly discussed in the text. However, in another review, "The foods here are not very affordable", the price aspect is also implied, but implicitly. Thus, it is hard to detect if one relies only on linguistic and supervised methods. The other approach to aspect discovery is based on Latent Dirichlet Allocation (LDA) Blei, Ng, and Jordan (Citation2003), which is widely used for topic modeling Zhao et al. (Citation2010). In this approach, a topic is modeled as a distribution of words in the given corpus and thus can be treated as a discovered aspect. For example, the price aspect can be discovered as a distribution over some major words such as price, expensive, affordable, cheap, etc. This approach is widely applied today to detect hidden topics in documents. For the second step of aspect-based sentiment analysis, various works based on feature-based machine learning have been proposed, e.g., Bespalov et al. (Citation2011). Recently, many works on using deep learning for sentiment classification have also been reported Zhang, Wang, and Liu (Citation2018).

Figure 1. Aspect-based opinion summary result of one product.


However, in the context of topic discovery, perhaps the most remarkable work is the Joint Sentiment/Topic model (JST) Lin and He (Citation2009), since this work extended the usage of LDA for topic modeling into a joint system allowing not only topics to be discovered but also sentiment words associated with those topics. Thus, it is highly promising to solve the full AOS problem completely using an LDA-based approach, as attempted in Wu et al. (Citation2015).

However, aspect discovery is not entirely the same process as topic modeling. As reported in Lu et al. (Citation2011), the task of aspect discovery typically requires some initial words for the aspects, known as the seed words. However, in LDA-based approaches, it is not easy to incorporate the prior knowledge of seed words into the topic modeling systems. Moreover, the LDA-based approaches rely on sampling methods, e.g., Gibbs sampling Blei, Ng, and Jordan (Citation2003), to learn the parameters of the required Dirichlet distributions. These methods require the whole corpus to be loaded into memory for sampling, which incurs heavy computational cost. Thus, in this study, we explore the application of Autoencoding Variational Inference For Topic Models (AVITM) Srivastava and Sutton (Citation2017) as an alternative to LDA for AOS. In AVITM, a deep neural network is integrated into a variational autoencoder Kingma and Welling (Citation2014), and the reparameterization trick is used to simulate the sampling work, which eventually learns the desired topic distribution as LDA does.

Hence, the AVI-based approach can theoretically achieve the same objective of hidden topic detection as LDA, but it avoids the heavy cost of loading the whole corpus for sampling, since the input data can be fed gradually to the input layer of the deep neural network. Further, when the training data are enriched with new documents, the sampling process of LDA must be restarted from the beginning, whereas in AVITM, the new documents can be trained incrementally in the neural network. Lastly, by leaving the distribution of words over topics unnormalized when training the neural networks, AVITM can obtain more coherent generated topics.

Urged by those advantages of the AVI-based approach, we consider further extending this direction to the theme of AOS. To be concrete, we consider using AVI to support aspect discovery, not only topic modeling. In addition, we also aim to introduce an AVI-based version of the JST model, which can perform topic modeling and sentiment classification at the same time while still enjoying the aforementioned advantages offered by AVI. Thus, our research contributes the two following novel models. The first is known as Autoencoding Variational Inference for Aspect Discovery (AVIAD), in which we extend the existing work of AVITM to incorporate prior knowledge from a set of pre-defined seed words of aspects for better discovery performance. The second model is referred to as Autoencoding Variational Inference for Joint Sentiment/Topic (AVIJST). This is our ultimate model, which can be considered a counterpart of the JST model. However, since autoencoders are used instead of sampling, AVIJST is easily scalable. In addition, this model can return not only the sentiment/topic-word matrix ((S×K)×V) but also the sentiment-word matrix (S×V), which is useful in many practical situations. Moreover, AVIJST can take into account the guidance from a small set of labeled data to achieve significant improvement in classification performance. The rest of the paper is organized as follows. In Section 2, we recall background knowledge of LDA and AVITM. In Section 3 and Section 4, we present the AVIAD and AVIJST models, respectively. Section 5 discusses our experimental results on some benchmark datasets. Finally, Section 6 concludes the paper.

Latent Dirichlet Allocation and Autoencoding Variational Inference Approaches for Topic Modeling

In this section, we recall the technique of Autoencoding Variational Inference For Topic Models (AVITM), where an autoencoder is adopted to play the role of Latent Dirichlet Allocation (LDA) for topic modeling.

Latent Dirichlet Allocation and Joint Sentiment/Topic Model

Given a large dataset of documents, or corpus, topic modeling is an unsupervised classification task that determines themes (or topics) in documents. In this context, a topic is treated as a distribution over a fixed vocabulary, and a document can exhibit multiple topics (but typically not many). To fulfill this task, Latent Dirichlet Allocation (LDA) is introduced as a generative process by which each document is assumed to be generated. Meanwhile, the Joint Sentiment/Topic (JST) model Lin and He (Citation2009) is a generative model extended from the popular LDA model, introduced to solve the problem of sentiment classification without prior labeled information. To generate a document, the process randomly chooses a distribution over topics. Then, each word in the document is generated by randomly choosing a topic from the distribution over topics and then randomly choosing a word from the corresponding topic.

Formally, the LDA and JST processes can be visualized as the graphical models given in Figure 2a,b, where

Figure 2. Probabilistic graphical model of LDA and JST.


  • β1:K are the topic distributions, where each βk is a distribution over the vocabulary corresponding to topic k;

  • θd are the topic proportions for document d;

  • zd,n is the topic assignment for word n in document d;

  • wd,n are the observed words for document d;

  • α is the prior parameter of the respective Dirichlet distribution from which θd is drawn.

A graphical model of JST is represented in Figure 2b. Compared to LDA, JST additionally has the following components.

  • πd are the sentiment proportions for document d;

  • ld,n is the sentiment assignment for word n in document d;

  • α and γ are the prior parameters of the respective Dirichlet distributions from which θd,s and πd are drawn, respectively.

Intuitively, the key idea behind the LDA process is that given a set of observed documents over a vocabulary of V words, we try to infer two sets of latent variables, represented by the document-topic distribution and the topic-word distribution. Meanwhile, in JST, we try to infer three sets of latent variables, which are the joint sentiment/topic-document distribution θ, the joint sentiment/topic-word distribution β and the sentiment-document distribution π, given the set of observed documents over a vocabulary of V words.

Then, a document in the LDA process will be generated as

(1) \( p(w \mid \alpha, \beta) = \int_{\theta} p(\theta \mid \alpha) \prod_{n=1}^{N} \sum_{z_n=1}^{K} p(w_n \mid z_n, \beta)\, p(z_n \mid \theta)\, d\theta \)

Meanwhile, in the JST process, each document d is generated from distribution:

(2) \( p(w \mid \alpha, \beta, \gamma) = \int \prod_{s} p(\theta_s \mid \alpha) \left( \int p(\pi \mid \gamma) \prod_{n} \sum_{s_n} p(s_n \mid \pi) \sum_{z_n} p(z_n \mid s_n, \theta_{s_n})\, p(w_n \mid z_n, s_n, \beta)\, d\pi \right) d\theta \)

In the LDA and JST approaches, those distributions are estimated by sampling methods such as Gibbs sampling Lin and He (Citation2009). However, this sampling approach requires the whole corpus to be loaded into memory, which is heavily costly. Moreover, the sampling approaches prevent concurrent processing and need to be restarted when there are changes in the dataset, making this direction hard to scale.

Variational Auto-Encoder

Autoencoder (AE) Rumelhart, Hinton, and Williams (Citation1986) is a neural network architecture which can be seen as a nonlinear function (black box) that includes two parts, an encoder and a decoder, as depicted in Figure 3.

Figure 3. A general autoencoder system.


Given an input x in a higher-dimensional space, the encoder maps x into z in a lower-dimensional space as z = f(x). Meanwhile, the decoder subsequently maps z into x̂ as x̂ = g(z), where x̂ is in the same space as x. The loss function of the overall network is calculated as \( L(x, \hat{x}) = \|x - \hat{x}\|^2 \), whose aim is to make the decoder reconstruct an output as close as possible to the given input. Once the network converges, the encoded z will represent hidden (or latent) features discovered from the input space.

However, the latent space generated by the traditional autoencoder process is generally concrete (not continuous); thus, it can generate good latent features for the samples on which it was previously trained, but it may suffer from poor performance when generating latent information for a new sample.

Variational Auto-Encoder (VAE) Kingma and Welling (Citation2014) is an extension of AE, in which the latent variable z is learned through a posterior probability p(z|x), as depicted in Figure 4. Thus, the encoding-decoding process is performed as follows. For an input data point xi, the encoder draws a latent variable z by sampling based on the posterior probability p(z|x). Then, the decoder generates the output x̂ by sampling based on the posterior probability p(x|z). In other words, the encoder tries to learn p(z|x), whereas the decoder tries to learn p(x|z).

Figure 4. Variational auto encoder model where observable variable x and its correspondent latent z are distributed on Gaussian distribution N(μx|z,Σx|z) and N(μz|x,Σz|x), respectively.


Based on Bayes' theorem, p(z|x) can be evaluated as

(3) \( p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)} \)

However, since the evidence p(x) is generally intractable, we instead approximate the desired probability p(z|x) by a distribution qλ(z|x), where λ is the variational parameter corresponding to the distribution family to which q belongs. In VAE, q is often a Gaussian distribution; hence for each data point xi, we have λxi = {μxi, σxi²}. Moreover, VAE adopts the amortized inference approach Ritchie, Horsfall, and Goodman (Citation2016), in which all data points share (amortize) the same inference network that produces λ.

Thus, the encoder part of a VAE consists of two fully connected modules, whose roles are to learn the parameters μ and σ, respectively, as presented in Figure 4. For an input vector x, a latent vector z is sampled based on the currently learned values of μ and σ. The decoder does a similar thing to reconstruct the output x̂.
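To make this architecture concrete, the following is a minimal PyTorch sketch of such an encoder (our own illustration, not the authors' released code): a shared hidden layer followed by two fully connected heads that output μ and log σ² of the Gaussian q(z|x); all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Encoder q(z|x): one shared hidden layer, two heads for mu and log-variance."""
    def __init__(self, input_dim, hidden_dim, latent_dim):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)      # learns mu
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)  # learns log(sigma^2)

    def forward(self, x):
        h = self.hidden(x)
        return self.fc_mu(h), self.fc_logvar(h)
```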

The goal of the learning process is to make the variational distribution qλ(z|x) as "similar" as possible to p(z|x). To measure such similarity, one can use the Kullback-Leibler divergence Cover and Thomas (Citation1991), which measures the information lost when using q to approximate p:

(4) \( D_{KL}\big(q_\lambda(z \mid x)\,\|\,p(z \mid x)\big) = \mathbb{E}_q\big[\log q_\lambda(z \mid x)\big] - \mathbb{E}_q\big[\log p(x, z)\big] + \log p(x) \)

Our goal is to find the variational parameters that minimize this divergence, i.e., \( q_\lambda^*(z \mid x) = \arg\min_\lambda D_{KL}\big(q_\lambda(z \mid x)\,\|\,p(z \mid x)\big) \). From Equation (4), we have

(5) \( \mathrm{ELBO}(\lambda) = \mathbb{E}_q\big[\log p(x, z)\big] - \mathbb{E}_q\big[\log q_\lambda(z \mid x)\big] \)

and

(6) \( \log p(x) = \mathrm{ELBO}(\lambda) + D_{KL}\big(q_\lambda(z \mid x)\,\|\,p(z \mid x)\big) \)

where ELBO stands for Evidence Lower Bound. It is indeed a lower bound of log p(x) because the Kullback-Leibler divergence is always greater than or equal to zero, by Jensen's inequality Cover and Thomas (Citation1991). Hence, since log p(x) does not depend on λ, instead of minimizing the Kullback-Leibler divergence in Equation (6), one can equivalently maximize ELBO(λ).

When deployed in a VAE, ELBO(λ) is computed as the sum of the ELBOs of all data points. The ELBO of a single data point can be expressed as

(7) \( \mathrm{ELBO}_i(\lambda) = \mathbb{E}_{q_\lambda(z \mid x_i)}\big[\log p(x_i \mid z)\big] - D_{KL}\big(q_\lambda(z \mid x_i)\,\|\,p(z)\big) \)

where p(z) is adopted as the standard normal distribution N(0, I).

Thus, letting ϕ and θ be the weights and biases of the encoder and decoder of the VAE, respectively, Equation (7) can be regarded as the loss function of the VAE:

(8) \( \mathcal{L} = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction loss}} - \underbrace{D_{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z)\big)}_{\text{recognition loss}} \)

where the reconstruction loss measures the error occurring when the VAE reconstructs the output from the input, while the recognition loss measures the error occurring when generating the latent variable and plays the role of a regularizer in this loss function.

The encoder networks usually simulate Gaussian distributions, so the recognition loss has a nice closed-form solution. On the other hand, the reconstruction loss can be estimated using Monte-Carlo sampling. However, sampling is generally non-differentiable and thus cannot be back-propagated through in a neural network. Thus, in Kingma and Welling (Citation2014), the reparameterization trick is proposed, which replaces the sampling step z ∼ q(z|x) = N(μ, σ²) in the training process by z = μ + σϵ, where ϵ is sampled from the trivial distribution N(0, I); this expression is differentiable with respect to μ and σ.
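The sketch below (our own illustration under the assumption of a Bernoulli decoder likelihood; variable names are not from the paper) shows the reparameterization trick and the two loss terms of Equation (8) in PyTorch.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I); differentiable w.r.t. mu and sigma
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

def vae_loss(x, x_recon_logits, mu, logvar):
    # Reconstruction loss: -E_q[log p(x|z)], here with an assumed Bernoulli likelihood
    recon = F.binary_cross_entropy_with_logits(x_recon_logits, x, reduction="sum")
    # Recognition loss: closed-form KL( N(mu, sigma^2) || N(0, I) )
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl  # negative ELBO, to be minimized
```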

VAE for Topic Modeling

From the original probability used by LDA, one can use the collapsing z's technique Srivastava and Sutton (Citation2017) to reduce the number of distributions that we need to approximate:

(9) \( p(w \mid \alpha, \beta) = \int_{\theta} p(\theta \mid \alpha) \prod_{n=1}^{N} \sum_{z_n=1}^{K} p(w_n \mid z_n, \beta)\, p(z_n \mid \theta)\, d\theta = \mathbb{E}_{p(\theta \mid \alpha)}\Big[\prod_{n=1}^{N} p(w_n \mid \theta, \beta)\Big] \)

where \( p(w_n \mid \theta, \beta) = \mathrm{Multinomial}\big(1, \sum_{k} \theta_k \beta_k\big) \).

Hence, one only needs to evaluate the distributions of θ and β. In Srivastava and Sutton (Citation2017), an approach using VAE for topic modeling has been introduced to replace the old approach of LDA, where an autoencoder learns the distributions of θ and β. However, LDA uses Dirichlet distributions, whereas VAE is designed to learn Gaussian distributions, as previously discussed. To solve this, the Laplace approximation Srivastava and Sutton (Citation2017) is applied. Basically, a Dirichlet prior distribution p(θ|α) with parameters α (for K topics) is approximated as N(μ1, Σ1), where μ1 and Σ1 are evaluated as below.

(10) \( \mu_{1k} = \log \alpha_k - \frac{1}{K}\sum_{i}^{K} \log \alpha_i, \qquad \Sigma_{1kk} = \frac{1}{\alpha_k}\Big(1 - \frac{2}{K}\Big) + \frac{1}{K^2}\sum_{i}^{K} \frac{1}{\alpha_i} \)

Thus, we can compute the recognition loss by evaluating the closed form of the KL divergence between two Gaussian distributions. On the other hand, the reconstruction loss is evaluated by computing the probability of the observed words under the distribution \( p(w_n \mid \theta, \beta) = \mathrm{Multinomial}\big(1, \sum_k \theta_k \beta_k\big) \). Therefore, the final loss function is computed as follows.

(11) \( \mathcal{L}(\Theta) = \sum_{d=1}^{D} \bigg[ \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\Big[ w_d^{\top} \log\big(\sigma(\beta)\,\sigma(\mu_0 + \Sigma_0^{1/2}\epsilon)\big) \Big] - \frac{1}{2}\Big( \mathrm{tr}(\Sigma_1^{-1}\Sigma_0) + (\mu_1 - \mu_0)^{\top}\Sigma_1^{-1}(\mu_1 - \mu_0) - K + \log\frac{|\Sigma_1|}{|\Sigma_0|} \Big) \bigg] \)
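To make Equations (10) and (11) more concrete, here is a small PyTorch sketch under our own assumptions (a symmetric Dirichlet parameter, diagonal covariances, illustrative shapes and names); it is not the authors' implementation.

```python
import torch

def laplace_approx_prior(alpha, K):
    # Equation (10): Dirichlet(alpha) over K topics approximated by a diagonal Gaussian
    alpha = torch.full((K,), float(alpha))
    mu1 = torch.log(alpha) - torch.log(alpha).mean()
    var1 = (1.0 / alpha) * (1.0 - 2.0 / K) + (1.0 / K ** 2) * (1.0 / alpha).sum()
    return mu1, var1

def avitm_loss(w_bow, beta, mu0, logvar0, mu1, var1):
    # Reconstruction term of Equation (11): w_d^T log( softmax(theta) softmax(beta) )
    eps = torch.randn_like(mu0)
    theta = torch.softmax(mu0 + torch.exp(0.5 * logvar0) * eps, dim=-1)
    word_probs = theta @ torch.softmax(beta, dim=-1)          # (batch, vocab)
    recon = torch.sum(w_bow * torch.log(word_probs + 1e-10), dim=-1)
    # Recognition term: KL( N(mu0, diag(exp(logvar0))) || N(mu1, diag(var1)) )
    var0 = logvar0.exp()
    kl = 0.5 * torch.sum(var0 / var1 + (mu1 - mu0) ** 2 / var1
                         - 1.0 + torch.log(var1) - logvar0, dim=-1)
    return -(recon - kl).mean()  # negative ELBO, to be minimized
```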

Autoencoding Variational Inference for Aspect Discovery

As previously discussed, aspect discovery Chen, Mukherjee, and Liu (Citation2014) is a problem similar to topic modeling. Instead of discovering topics, one tries to discover aspects of concepts mentioned in a document. For example, when analyzing reviews of restaurants, the aspects that can be mentioned may include food, service, price, etc.

However, the key difference between topic modeling and aspect discovery is that the latter normally requires seed words Lu et al. (Citation2011). Based on those seed words, other aspect-related terms are further discovered. In Lu et al. (Citation2011), this is done by incorporating seed word information into the Gibbs sampling training process via the prior distribution of β.

On the other hand, in AVITM Srivastava and Sutton (Citation2017), the authors use the non-smoothed version of LDA illustrated in Figure 2a, where β has no prior distribution. Therefore, for the VAE direction, we propose to modify the loss function to reflect the prior knowledge conveyed by the seed words. This is realized by our proposed Autoencoding Variational Inference for Aspect Discovery (AVIAD) model, as presented in Figure 6.


Figure 6. Autoencoding variational inference for aspect discovery. As illustrated, the yellow blocks θ and β correspond to the document-topic and topic-word distributions described in Figure 2a, respectively. Meanwhile, γ and γprior are additional blocks which play an important role in the aspect discovery task.

The goal of this model is also to retrieve the topic-word distribution β, like the original model described in Sect. 2.3. However, in this model, we also embed the prior knowledge of seed words into the network structure. For example, the prior distribution γprior of the given seed words is represented as the matrix shown in Figure 5.

Figure 5. Prior distribution matrix.


The idea behind this distribution matrix is that we want to "force", for instance, the seed word salad to belong to the aspect Food, which is represented as the first row in the matrix. In our AVIAD model, this prior distribution γprior is given as the yellow block in Figure 6. To incorporate this distribution in our training process, we introduce a new loss function.

(12) \( \mathcal{L}(\Theta) = \mathbb{E}_{q_\phi(\theta \mid w)}\big[\log p(w \mid \theta, \beta)\big] - D_{KL}\big(q_\phi(\theta \mid w)\,\|\,p(\theta)\big) - \lambda \sum_{n \in S} \big\|\sigma(\gamma_n) - \gamma_n^{prior}\big\|^2 \)

One can see that we have modified the ELBO in (8) by introducing the new term in (12), where each γn is a topic distribution for a word n that exists in the seed-word set S corresponding to document d. Thus, this square loss term \( \|\sigma(\gamma_n) - \gamma_n^{prior}\|^2 \) makes the network try to produce a distribution γn as similar as possible to the prior distribution γprior of the seed words. As a result, not only are the predefined seed words assigned to their corresponding aspects, but other similar words are also discovered in those aspects, as illustrated in Table 1.

Table 1. Discovered aspects (bold text indicates seed words).

For example, assume that at the k-th iteration of the training process, the learned matrix γ, illustrated in Figure 7, is normalized by applying the softmax function σ. Then, by minimizing the Euclidean distance between it and the prior matrix γprior concurrently with the other two terms in Equation (12), not only injected words such as sauce and salad but also words like onion and cheese will converge to the true aspect.

Figure 7. Example of constructing the square loss term. Firstly, every row in the γ matrix is normalized via the softmax function. After that, a submatrix is constructed by choosing only the rows of the normalized matrix γ whose words also exist in the given γprior matrix. Finally, the prior loss is computed via the Euclidean distance between this submatrix and γprior.

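A minimal sketch of this prior loss term, under our own naming assumptions (gamma is the learned word-topic matrix of shape V×K, gamma_prior holds one row per seed word, and seed_ids are their vocabulary indices), could look as follows.

```python
import torch

def seed_prior_loss(gamma, gamma_prior, seed_ids, lam=1.0):
    gamma_norm = torch.softmax(gamma, dim=1)           # normalize every row over topics
    sub = gamma_norm[seed_ids]                         # keep only the seed-word rows
    return lam * torch.sum((sub - gamma_prior) ** 2)   # squared Euclidean distance to the prior
```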

AutoEncoding Variational Inference for Joint Sentiment/Topic

The Proposed Model of AVIJST

In this section, we discuss our proposed ultimate model of Autoencoding Variational Inference for Joint Sentiment/Topic (AVIJST). Instead of training the JST model using Gibbs sampling, we want to bring the advantages of the Variational Autoencoder method, which is fast and scalable on large datasets, to this joint sentiment/topic model. First, inspired by the collapsing z's technique Srivastava and Sutton (Citation2017), we collapse both sets of z and s variables. Thus, we only have to sample from θ and π:

(13) \( p(w \mid \alpha, \beta, \gamma) = \int\!\!\int \prod_{s} p(\theta_s \mid \alpha)\, p(\pi \mid \gamma) \prod_{n} p(w_n \mid \pi, \theta, \beta)\, d\pi\, d\theta = \mathbb{E}_{\prod_s p(\theta_s \mid \alpha)\, p(\pi \mid \gamma)}\Big[\prod_{n} p(w_n \mid \pi, \theta, \beta)\Big] \)

where \( p(w_n \mid \pi, \theta, \beta) = \mathrm{Multinomial}\big(1, \sum_{s} \pi_s \sum_{k} \theta_{sk} \beta_{sk}\big) \).

In AVIJST, we no longer rely on a predefined set of seed words, since it is not easy to construct such a set for a large corpus. Instead, we observe that the distribution π is trained through the reparameterization trick to reflect the sentiment of each document, so it can be seen as a discriminant function trained in a supervised model. Motivated by this, we incorporate prior knowledge by using labeled information. That is, we use a (small) set of sentiment-labeled documents to guide the learning process. In the experiments, we can therefore also treat our model as a semi-supervised model, which needs only a small set of labeled information for the classification problem. The network structure of our AVIJST is given in Figure 8.


Figure 8. Autoencoding variational inference for joint sentiment/topic modeling, where θ, β and π are the corresponding latent variables in the JST graphical model (Figure 2b). Moreover, the yellow block, which is used to map document x to the latent variable π, can be represented by any modern classification deep neural network.

In our model, the classification network for the π distribution, depicted as the yellow block in Figure 8, is compatible with many kinds of neural networks. For simplicity, we only consider a Multi-Layer Perceptron (MLP) network and a Convolutional Neural Network (CNN) combined with a word embedding (WE) layer. Besides, we use the same parameters (a shared network) for all θs in the hidden layers of the encoder network instead of constructing s different encoder networks. We remind the readers that a softmax layer is applied for normalization purposes at the final layer of the σ(θ) and σ(π) variables. Then, our new lower-bound function is given as

(14) \( \mathcal{L} = \underbrace{\mathbb{E}_{\prod_s q(\theta_s \mid w)\, q(\pi \mid w)}\Big[\log \prod_{n} p(w_n \mid \pi, \theta, \beta)\Big]}_{\text{reconstruction loss}} + \underbrace{\mathbb{E}_{\hat{p}(\pi)}\big[\log q(\pi \mid w)\big]}_{\text{classification loss}} - \underbrace{\Big( D_{KL}\big(q(\pi \mid w)\,\|\,p(\pi)\big) + \sum_{s} D_{KL}\big(q(\theta_s \mid w)\,\|\,p(\theta_s)\big) \Big)}_{\text{recognition loss}} \)

As discussed in Sect. 2.3, the reconstruction loss can be computed through Monte-Carlo sampling, while the recognition loss has a nice closed form since the Laplace approximation is applied. In addition, to incorporate labeled information, we integrate the classification loss into our loss function in (14), where p̂(π) is the empirical distribution.
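As a sketch of how the classification term can be realized for the labeled subset (assuming q(π|w) is produced by the yellow classification block as sentiment logits; names are illustrative, not the authors' code):

```python
import torch.nn.functional as F

def classification_loss(pi_logits_labeled, sentiment_labels):
    # -E_{p_hat(pi)}[log q(pi|w)]: cross entropy between predicted pi and the observed labels
    return F.cross_entropy(pi_logits_labeled, sentiment_labels, reduction="mean")

def avijst_objective(recon_loss, kl_pi, kl_theta, cls_loss, cls_weight=1.0):
    # Negative of the lower bound in Equation (14); each argument is assumed to
    # already be a quantity to minimize (the negative of its term in the bound).
    return recon_loss + kl_pi + kl_theta + cls_weight * cls_loss
```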

Sentiment-Word Matrix

One additional advantage of our proposed AVIJST is that we can generate the sentiment-word matrix from the learning results. Firstly, we observe that each word w in a document is generated by p(w|π,θ,β) = Multinomial(1, πθβ), where the sentiment/topic-word distribution β can be seen as a learned weight matrix in the decoder network presented in Figure 8. Inspired by this, we want our model to learn another sentiment-word distribution ν, which shows the top words for each sentiment orientation. We integrate ν into our model by combining it linearly to generate a new generative distribution:

(15) \( p(w \mid \pi, \theta, \beta) = \mathrm{Multinomial}(1, \pi\theta\beta + \lambda\pi\nu) \)

In Equation (15), λ is a regularization weight. In the experiments section, we will illustrate some top words resulting from the constructed sentiment-word matrix.
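The modified decoder of Equation (15) can be sketched as follows (our own illustration; shapes are assumed to be π: (S,), θ: (S, K), β: (S, K, V) and ν: (S, V), all already non-negative, e.g., softmax outputs).

```python
import torch

def word_distribution(pi, theta, beta, nu, lam):
    topic_mix = torch.einsum('s,sk,skv->v', pi, theta, beta)  # pi * theta * beta
    senti_mix = torch.einsum('s,sv->v', pi, nu)               # pi * nu
    probs = topic_mix + lam * senti_mix
    return probs / probs.sum()   # parameters of Multinomial(1, .) over the vocabulary
```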

Experiments

Datasets and Experimental Setup

Aspect Discovery

For evaluating the AVIAD model, the URSAFootnote1 restaurant dataset was used. It contains 2,066,324 tokens and 52,624 documents in total. Documents in this dataset are marked with one or more labels from the standard label set S = {Food, Staff, Ambience, Price, Anecdote, Miscellaneous}. In order to avoid ambiguity, we only consider sentences with a single label from three standard aspects S = {Food, Staff, Ambience}. Moreover, two different kinds of datasets are evaluated in the topic modeling experiments (Section 5.2): an imbalanced and a balanced dataset. While only 10,000 sentences per aspect are used in the balanced dataset, the imbalanced dataset contains 62,348, 23,730 and 13,385 sentences in the Food, Staff and Ambience aspects, respectively. Finally, due to the stability of the balanced dataset, it is also used in the evaluation of supervised performance in Section 5.3.1.

We set the number of topics K = 3, corresponding to the number of chosen aspects, and the Dirichlet parameter α = 0.1 in our AVIAD model. In these experiments, we compare our method with the Weakly supervised LDA (WLDA) Lu et al. (Citation2011), which incorporates seed word information via the prior distribution of the latent variable β in the LDA model. Furthermore, the set of seed words fed into both models is described in Table 2.

Table 2. AVIAD and WLDA seedwords.

Sentiment Classification

Regarding the AVIJST model, we use two different datasets, the Large Movie Review Dataset (IMDB)Footnote2 and the Yelp restaurant dataset.Footnote3 The statistical information of these datasets is described in Table 3. Similar to the restaurant dataset, in the experiment on unsupervised performance, the sentiment datasets are also divided into two subsets of different sizes, namely {10k, 25k} and {20k, 200k} for IMDB and Yelp, respectively. Moreover, due to the requirement of prior information, the subjectivity word list MPQAFootnote4 is also used in the re-implementation of the JST model.

Table 3. Statistics for IMDB and Yelp datasets.

In the experimental setup, since we want to avoid the component collapsing problem reported by Srivastava and Sutton (Citation2017), the learning rate is set high, at 0.001, for both AVIAD and AVIJST. In addition, in AVIJST, we also set the classification learning rate to 0.005, since collapsing also occurs when training the discriminant distribution π.

Experimental Results of Topic Modeling

For topic modeling performance, we compare our proposed models of AVIAD and AVIJST with their LDA-based counterparts, i.e., the Weakly supervised LDA (WLDA) Lu et al. (Citation2011) and the JST model with Gibbs sampling Lin and He (Citation2009), respectively. In this experiment, we adopt the normalized point-wise mutual information (NPMI) Bouma (Citation2009) as the main metric to evaluate the quality of the topics/aspects discovered by our models. This metric has been shown to closely match human judgments of topic quality in Lau, Newman, and Baldwin (Citation2014).
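For reference, NPMI coherence over the top words of a topic can be computed from document co-occurrence counts as in the following sketch (our own illustration of Bouma's definition, not the exact evaluation script used in the paper).

```python
import math
from itertools import combinations

def npmi(topic_top_words, doc_word_sets):
    """topic_top_words: top-N words of one topic; doc_word_sets: one set of tokens per document."""
    n_docs = len(doc_word_sets)
    def p(*words):
        return sum(all(w in d for w in words) for d in doc_word_sets) / n_docs
    scores = []
    for w1, w2 in combinations(topic_top_words, 2):
        p1, p2, p12 = p(w1), p(w2), p(w1, w2)
        if p12 > 0:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / (-math.log(p12)))   # normalize PMI into [-1, 1]
        else:
            scores.append(-1.0)                     # pairs that never co-occur
    return sum(scores) / len(scores)
```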

Figures 9 and 10 show that AVIAD and AVIJST outperform the traditional WLDA and JST models, with the average coherence values of our proposed models converging much faster than those of their counterpart models. Besides, we also report the top words discovered by AVIAD and AVIJST in Table 4 and Tables 5–6, respectively.

Table 4. Topics extracted by AVIAD and WLDA.

Table 5. IMDB Topics extracted by AVIJST and JST.

Table 6. Yelp Topics extracted by AVIJST and JST.

Figure 9. Aspect discovery performance on restaurant dataset.


Figure 10. Joint sentiment/topic modeling performance based on average topic coherent.


Interestingly, Tables 5 and 6 show that our AVIJST not only extracts more coherent topics but also discovers more specific top words for each topic than the JST model. For example, in the IMDB dataset, the word "shaolin", a famous style of Chinese martial arts, as well as "sammo", a well-known Chinese actor, are found in the topic "Kung fu", whereas the JST model only retrieved general words like "action" and "fight". Moreover, our AVIJST can also discover topics that JST cannot find, namely "Starwar" (0.30) and "Wrestle" (0.31). Meanwhile, in the Yelp dataset, because some common words in the documents are independent of polarity, words in topics such as Thai food and Japanese food found by AVIJST are inconsistent with the JST model in terms of polarity. Furthermore, we also report in Table 7 the top words discovered for each sentiment orientation, as mentioned in Sect. 4.2.

Table 7. Sentiment words discovered.

Experimental Results of Classification Performance

Aspect Discovery

Although the AVIAD and WLDA models were first proposed as unsupervised models, the returned θ matrix can be treated as a classifier. Therefore, the classification performance of these models is also evaluated in Table 8 via precision, recall and F1 metrics.

Table 8. Aspect identification results.

In general, with the same set of seed words, our proposed AVIAD model outperforms WLDA in most cases. Regarding the Staff aspect, the precision achieved by the WLDA model (0.662) is much lower than that of AVIAD (0.805), because the number of Food-aspect sentences is significantly larger than that of Staff. Similarly, AVIAD's recall on the Ambience aspect is also 10% greater than that of its counterpart model (0.793).

Sentiment Classification

To compare with our AVIJST, we build two other neural networks. The first one, called MLP, is a multilayer perceptron classification network with two sequential fully connected hidden layers whose input is a bag-of-words representation of each document. For the second network, we construct a CNN, where each document is transformed into word embedding vectors before being fed into convolutional layers. These two networks are integrated into our AVIJST architecture as the classification network, with the corresponding names AVIJST-MLP and AVIJST-CNN, as presented in Figure 8. Furthermore, the model II semi-supervised variational autoencoder (SSVAEII-MLP and SSVAEII-CNN), first proposed for the semi-supervised problem in Kingma et al. (Citation2014), is also evaluated in this experiment.

Due to the limited knowledge provided by the MPQA prior, the supervised performance of the JST model is significantly lower than that of the others. Meanwhile, the latent variables learned by our AVIJST model help it outperform the solitary classification networks as well as the semi-supervised VAE models in most cases, as shown in Tables 9 and 10. In particular, using only 100 labeled documents out of 25,000 in total on the IMDB dataset, AVIJST-CNN shows that the latent variables learned by our method can help the CNN network achieve the highest accuracy (76.0%) among them, whereas the solitary CNN performs well only when given a large number of labeled documents (half or all of the documents in the dataset in this case).

Table 9. Accuracy on test set for IMDB.

Table 10. Accuracy on test set for Yelp.

Conclusion

In this paper, we study the use of the Autoencoding Variational Inference approach for Aspect-based Opinion Summary, instead of the LDA-based approaches widely represented by the Joint Sentiment/Topic model. The motivation is that the deep neural networks of autoencoding allow us to avoid the heavy cost of sampling, making this approach scalable on parallel systems. This approach also enables us to take advantage of prior knowledge from seed words or small pre-labeled guiding sets to achieve better performance.

As a result, we introduce the two models of Autoencoding Variational Inference for Aspect Discovery (AVIAD) and Autoencoding Variational Inference for Joint Sentiment/Topic (AVIJST), which outperform their LDA-based counterparts in experiments on benchmark datasets. In particular, AVIJST is designed flexibly, allowing any neural-network-based classification method to be integrated in an end-to-end manner. For example, in this work we employed MLP and CNN deep networks for classification.

Even though our AVI-based approaches have been shown to outperform their LDA-based counterparts, we have still not fully solved the AOS problem with AVI. Thus, for future work, we aim at a complete solution by investigating a neural network architecture that allows joint distributions of aspects, documents and sentiments to be represented and trained seamlessly.

Additional information

Funding

This research is funded by Vietnam National University HoChiMinh City (VNU-HCM) under grant number [B2018-20-07].


References

  • Bespalov, D., B. Bai, Y. Qi, and A. Shokoufandeh (2011). Sentiment classification based on supervised latent n-gram analysis. In Proceedings of the 20th acm international conference on information and knowledge management (pp. 375–82), Glasgow, Scotland, UK.
  • Blei, D. M., A. Y. Ng, and M. I. Jordan. 2003 March. Latent dirichlet allocation. Journal of Machine Learning Research 3:993–1022.
  • Bouma, G. (2009). Normalized (pointwise) mutual information in collocation extraction. Proceedings of GSCL, 31–40, University of Potsdam, Potsdam, Germany.
  • Chen, Z., A. Mukherjee, and B. Liu. 2014. Aspect extraction with automated prior knowledge learning. Baltimore, Maryland, USA: ACL.
  • Cover, T. M., and J. A. Thomas. 1991. Elements of information theory. New York, NY, USA: Wiley-Interscience.
  • Hu, M., and B. Liu (2004). Mining opinion features in customer reviews. In Proceedings of the 19th national conference on artifical intelligence, San Jose, California.
  • Jin, W., and H. H. Ho (2009). A novel lexicalized hmm-based learning framework for web opinion mining. In Proceedings of the 26th annual international conference on machine learning (pp. 465–72), Montreal, Quebec, Canada.
  • Kingma, D. P., D. J. Rezende, S. Mohamed, and M. Welling (2014). Semi-supervised learning with deep generative models. In Proceedings of the 27th international conference on neural information processing systems - volume 2, Montreal, Canada.
  • Kingma, D. P., and M. Welling. 2014. Auto-encoding variational bayes. Banff, Canada: ICLR.
  • Lau, J. H., D. Newman, and T. Baldwin (2014). Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. Proceedings of the European Chapter of the Association for Computational Linguistics, Gothenburg, Sweden.
  • Lin, C., and Y. He (2009). Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th acm conference on information and knowledge management (pp. 375–84), Hong Kong, China.
  • Lu, B., M. Ott, C. Cardie, and B. K. Tsou (2011). Multi-aspect sentiment analysis with topic models. In Proceedings of the 11th international conference on data mining workshops (pp. 81–88), Vancouver, BC, Canada.
  • Qiu, G., B. Liu, J. Bu, and C. Chen. March 2011. Opinion word expansion and target extraction through double propagation. Computational Linguistics 37(1):9–27. doi: 10.1162/coli_a_00034.
  • Ritchie, D., P. Horsfall, and N. D. Goodman. 2016. Deep amortized inference for probabilistic programs. ArXiv E-Prints, October.
  • Rumelhart, D. E., G. E. Hinton, and R. J. Williams. 1986. Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1, ed. D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, 318–62. Cambridge, MA, USA: MIT Press. Retrieved from http://dl.acm.org/citation.cfm?id=104279.104293.
  • Srivastava, A., and C. Sutton. 2017. Autoencoding variational inference for topic models. In ICLR, ed. Yoshua Bengio and Yann LeCun, Toulon, France.
  • Wu, H., Y. Gu, S. Sun, and X. Gu. 2015. Aspect-based opinion summarization with convolutional neural networks. ArXiv E-Prints, November, 3157–3163.
  • Zhang, L., S. Wang, and B. Liu. 2018. Deep learning for sentiment analysis: A survey. ArXiv E-Prints, January.
  • Zhao, W. X., J. Jiang, H. Yan, and X. Li (2010). Jointly modeling aspects and opinions with a maxent-lda hybrid. In Proceedings of the 2010 conference on empirical methods in natural language processing (pp. 56–65), Massachusetts, USA.

Appendix A. Generative Model

In this section, we show the generative processes as well as their connection to the Variational AutoEncoder decoder network for the two aforementioned generative models, LDA and JST.

A.1. LDA

Latent Dirichlet Allocation (LDA) assumes the following generative process for each document d in a corpus D:

for document d in corpus D do
    Choose θd ∼ Dirichlet(α)
    for position n in d do
        Choose a topic zn ∼ Multinomial(1, θd)
        Choose a word wn ∼ Multinomial(1, βzn)
    end for
end for

Under this generative process, the marginal distribution of document d is

(A1) \( p(w \mid \alpha, \beta) = \int_{\theta} p(\theta \mid \alpha) \prod_{n=1}^{N} \sum_{z_n=1}^{K} p(w_n \mid z_n, \beta)\, p(z_n \mid \theta)\, d\theta \)

Equation (A1) can be written in terms of the model parameters as:

(A2) \( p(w \mid \alpha, \beta) = \frac{\Gamma(\sum_i \alpha_i)}{\prod_i \Gamma(\alpha_i)} \int_{\theta} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1} \prod_{n=1}^{N} \sum_{k=1}^{K} \prod_{j=1}^{V} (\theta_k \beta_{kj})^{w_n^j}\, d\theta \)

Due to the one-hot encoding of wn, from the VAE point of view one can treat the generative distribution p(wn|θ,β) as the decoder network:

(A3) \( p(w_n \mid \theta, \beta) = \sum_{k=1}^{K} \prod_{j=1}^{V} (\theta_k \beta_{kj})^{w_n^j} = \theta\beta w_n \)

where θ is the sampled matrix, i.e., the output obtained by applying the reparameterization trick to the encoder outputs μθ and σθ, while β can be seen as the learned weight matrix of the fully connected layer in the decoder network.
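As a sanity check of the generative process at the start of this appendix, the following is our own small NumPy simulation (the α, β and document-length values are arbitrary illustrations, not settings from the paper).

```python
import numpy as np

def generate_document(alpha, beta, doc_length, rng=None):
    """alpha: length-K Dirichlet parameter; beta: K x V topic-word probability matrix."""
    if rng is None:
        rng = np.random.default_rng()
    K, V = beta.shape
    theta = rng.dirichlet(alpha)          # theta_d ~ Dirichlet(alpha)
    words = []
    for _ in range(doc_length):
        z = rng.choice(K, p=theta)        # z_n ~ Multinomial(1, theta_d)
        w = rng.choice(V, p=beta[z])      # w_n ~ Multinomial(1, beta_{z_n})
        words.append(w)
    return words
```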

A.2. JST

A graphical model of JST is represented in Figure 2b. Compared to LDA, JST additionally has the following components.

  • πd are the sentiment proportions for document d;

  • ld,n is the sentiment assignment for word n in document d;

  • α and γ are the parameters of the respective Dirichlet distributions from which θd,s and πd are drawn, respectively.

Like LDA, each document d is generated through the following generative process:

for document d in corpus D do
    Choose πd ∼ Dirichlet(γ)
    for sentiment label s under document d do
        Choose θd,s ∼ Dirichlet(α)
    end for
    for position n in d do
        Choose a sentiment ln ∼ Multinomial(1, πd)
        Choose a topic zn ∼ Multinomial(1, θd,ln)
        Choose a word wn ∼ Multinomial(1, βln,zn)
    end for
end for

and its corresponding marginal distribution is:

(A4) \( p(w \mid \alpha, \beta, \gamma) = \int\!\!\int \prod_{s} p(\theta_s \mid \alpha)\, p(\pi \mid \gamma) \prod_{n} \sum_{l_n} p(l_n \mid \pi) \sum_{z_n} p(z_n \mid l_n, \theta_{l_n})\, p(w_n \mid z_n, l_n, \beta)\, d\pi\, d\theta \)

where the reconstruction network can also be treated as a multiplication of the three matrices π, θ and β:

(A5) \( p(w_n \mid \theta, \beta, \pi) = \sum_{s=1}^{S} \sum_{k=1}^{K} \prod_{j=1}^{V} (\pi_s \theta_{sk} \beta_{skj})^{w_n^j} = \pi\theta\beta w_n \)
