
A Fully Bayesian Inference with Gibbs Sampling for Finite and Infinite Discrete Exponential Mixture Models

Article: 2043526 | Received 24 Oct 2021, Accepted 11 Feb 2022, Published online: 15 Mar 2022

ABSTRACT

In this paper, we propose clustering algorithms based on finite and infinite mixture models of the exponential approximations to the Multinomial Generalized Dirichlet (EMGD), Multinomial Beta-Liouville (EMBL) and Multinomial Shifted-Scaled Dirichlet (EMSSD) distributions, with Bayesian inference. The corresponding finite mixtures have already shown superior performance in clustering real data sets when learned with the Expectation–Maximization approach. The approaches proposed in this paper are based on a Monte Carlo simulation technique, namely the Gibbs sampling algorithm with an additional Metropolis–Hastings step, and we exploit exponential-family conjugate priors to construct the posteriors following Bayesian theory. Furthermore, we present infinite models based on Dirichlet processes, which yield clustering algorithms that do not require the number of mixture components to be specified in advance and instead select it in a principled manner. The performance of our Bayesian approaches was evaluated on challenging real-world applications concerning text sentiment analysis, fake news detection, and human face gender recognition.

Introduction

Clustering count vectors is a challenging task on large data sets because of their high dimensionality and sparsity (Jain Citation2010). The bag-of-words representation of text systematically exhibits the burstiness phenomenon: if a word appears once in a document, it is much more likely to appear again (Church and Gale Citation1995; Katz Citation1996). This phenomenon is not limited to text and can also be observed in images with visual words (Jegou, Douze, and Schmid Citation2009).

Table 1. Experiment results for IMDB movie reviews.

Table 2. Experiment results for Tripadvisor hotel reviews.

Count data also have a sparse nature: a few words occur with high frequency, while most appear rarely or not at all (Margaritis and Thrun Citation2001). Thus, such data are generally represented as sparse high-dimensional vectors, with a few thousand dimensions and a sparsity of 95–99% (Dhillon and Modha Citation2001).

Hierarchical Bayesian modeling frameworks, such as the Multinomial Generalized Dirichlet mixture model (MGD), the Multinomial Beta-Liouville mixture model (MBL) and the Multinomial Shifted-Scaled Dirichlet mixture model (MSSD) (Alsuroji, Zamzami, and Bouguila Citation2018; Bouguila Citation2008; Elkan Citation2006), have shown excellent performance for high-dimensional count data clustering. However, their estimation procedures are very inefficient when the data collection is large (Zamzami and Bouguila Citation2020a). The exponential family of distributions has finite-sized sufficient statistics (Brown Citation1986), meaning that the data can be compressed into a fixed-size summary without loss of information (DasGupta Citation2011). Efficient exponential-family approximations to the MGD (EMGD), MBL (EMBL) and MSSD (EMSSD) have been previously proposed by Zamzami and Bouguila (Citation2020a, Citation2022, Citation2020b). These distributions have been shown to address the burstiness phenomenon successfully and to be considerably faster computationally than their original forms, especially when dealing with sparse and high-dimensional data (i.e., these exponential approximations are evaluated as functions of non-zero counts only, as we will see in the next section).

The main problem with finite mixture models is the estimation of the model parameters (Brooks Citation2001). The Expectation–Maximization (EM) algorithm is a simple and effective approach for estimating a model's parameters (Emanuel and Herman Citation1987). However, the EM algorithm for finite mixtures has several drawbacks. For example, local maxima and singularities in the likelihood function often cause problems for deterministic gradient methods (Robert Citation2007). Moreover, in high-dimensional estimation it is hard to obtain reliable estimates with the generalization capability to predict densities at new data points (Cai Citation2010; Dias and Wedel Citation2004). Some Bayesian approaches are based on simulation methods, such as Gibbs sampling, which explore high-density regions (Roeder and Wasserman Citation1997). The stochastic aspect of these simulation methods ensures escape from local maxima (e.g., Bouguila, Ziou, and Hammoud Citation2009). Tsionas (Citation2004) proposed an estimation approach for multivariate Gamma distributions using Gibbs sampling with data augmentation. Amirkhani, Manouchehri, and Bouguila (Citation2021) presented a fully Bayesian approach within Monte Carlo simulation for estimating multivariate Beta mixture parameters. Bouguila, Ziou, and Hammoud (Citation2009) successfully adopted a Bayesian algorithm based on Metropolis-within-Gibbs sampling for a finite Generalized Dirichlet mixture. Najar, Zamzami, and Bouguila (Citation2019) used a Monte Carlo simulation method to estimate the parameters of the exponential-family approximation to the Dirichlet Compound Multinomial mixture model (EDCM) and showed excellent results in some real applications. Xuanbo, Bouguila, and Zamzami (Citation2021) successfully proposed a fully Bayesian approach based on the Gibbs sampling technique for the exponential-family approximation to the Multinomial Scaled Dirichlet mixture model (EMSD).

Another challenging aspect of using finite mixture models is estimating the number of clusters that best describes the data without overfitting or underfitting. Many approaches have been suggested for this purpose, and they can be divided into two strategies. The first is the implementation of model selection criteria. The second is resampling from the full posterior distribution with the number of clusters treated as unknown. However, most of these approaches cannot easily be used for high-dimensional data (Bouguila and Ziou Citation2010). Infinite mixture models based on the Dirichlet process (Antoniak Citation1974; Korwar and Hollander Citation1973) have recently attracted wide attention, thanks to the development of MCMC techniques. Dirichlet process mixture (DPM) models resolve the difficulties related to model selection (MacEachern and Muller Citation1998). Rasmussen (Citation1999) successfully applied the Dirichlet process to the Gaussian mixture model with Gibbs sampling to obtain an accurate number of classes. Bouguila and Ziou (Citation2010) also presented a clustering algorithm for a Dirichlet process mixture of Generalized Dirichlet distributions with MCMC techniques. Najar, Zamzami, and Bouguila (Citation2020) proposed an infinite mixture of the exponential-family approximation to the Dirichlet Compound Multinomial mixture model and showed superior experimental results in recognizing human interactions in feature films. Thus, we extend our finite mixture models to infinite mixture models based on the Dirichlet process to tackle model selection in the case of sparse high-dimensional vectors.

In this paper, we present clustering algorithms based on finite and infinite mixtures of EMGD, EMBL and EMSSD from a Bayesian viewpoint using Gibbs sampling with M–H steps. These distributions have already shown excellent performance in clustering real-world high-dimensional count data sets with deterministic approaches. The key contributions of this article are as follows: (1) the determination of conjugate priors for EMGD, EMBL and EMSSD by exploiting the fact that these distributions are members of the exponential family, and (2) through challenging applications concerning text sentiment analysis, fake news detection and human face gender recognition, we show that the proposed algorithms are efficient for clustering sparse high-dimensional count data. The learning of the proposed finite mixtures and their infinite counterparts is based on MCMC algorithms, namely Gibbs sampling and Metropolis–Hastings (M–H) (Favaro and Whye Teh Citation2013).

The rest of this paper is organized as follows. The next two sections review and develop conjugate prior distributions for the EMBL, EMGD and EMSSD distributions. Then, we present a Bayesian estimation of their finite mixture model parameters using Gibbs sampling, and extend these finite mixture models to infinite mixture models while developing complete clustering algorithms. After that, we demonstrate the abilities of the proposed approaches in text sentiment analysis, fake news detection, and human face gender recognition. Concluding remarks and future work directions are given at the end of the paper.

Exponential Approximation of Distributions for Count Data

In this section, we review the approximations to the MGD, the MBL and the MSSD to bring them to the exponential family of distributions.

Exponential Family

The exponential family of distributions is widely used in machine learning research because of its sufficiency property: the sufficient statistics summarize all the information about the parameters contained in the whole data set. For a random variable X and a distribution with M parameters in the exponential family, we have:

(1) \( p(X|\xi) = H(X)\,\exp\Big(\sum_{l=1}^{M} G_l(\xi)\,T_l(X) + \Phi(\xi)\Big) \)

where \( G_l(\xi) \) is the natural parameter, \( T_l(X) \) is the sufficient statistic, \( H(X) \) is the underlying measure, and \( \Phi(\xi) \) is the log-normalizer that ensures the distribution integrates to one (DasGupta Citation2011).
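
As a minimal illustration of the decomposition in Equation (1), the Python sketch below (our own, not from the paper) evaluates a log-density from user-supplied H, G_l, T_l and Φ callables and checks it on a Bernoulli distribution written in exponential-family form; all function and argument names are illustrative assumptions.

```python
import numpy as np

def expfam_log_density(x, xi, log_h, g_fns, t_fns, log_normalizer):
    """Evaluate log p(x | xi) for a density written in the form of Equation (1):
    p(x | xi) = H(x) exp( sum_l G_l(xi) T_l(x) + Phi(xi) )."""
    inner = sum(g(xi) * t(x) for g, t in zip(g_fns, t_fns))
    return log_h(x) + inner + log_normalizer(xi)

# Toy check: a Bernoulli(p) written in exponential-family form, with natural
# parameter log(p / (1 - p)), sufficient statistic x and Phi(p) = log(1 - p).
p = 0.3
log_lik = expfam_log_density(
    x=1,
    xi=p,
    log_h=lambda x: 0.0,                     # H(x) = 1
    g_fns=[lambda p: np.log(p / (1 - p))],   # natural parameter G(p)
    t_fns=[lambda x: float(x)],              # sufficient statistic T(x) = x
    log_normalizer=lambda p: np.log(1 - p),  # Phi(p)
)
print(np.exp(log_lik))  # ~0.3
```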

The Exponential Family Approximation to Multinomial Generalized Dirichlet (MGD) Distribution

We define \( X=(x_1,\dots,x_{D+1}) \) as a sparse count vector describing a text document or an image, where \( x_d \) corresponds to the frequency of appearance of a word or visual word w. The MGD distribution is defined by (Bouguila Citation2008):

(2) \( \mathrm{MGD}(X|\xi) = \frac{\Gamma(n+1)}{\prod_{d=1:\,x_d\geq 1}^{D+1}\Gamma(x_d+1)}\;\prod_{d=1:\,x_d\geq 1}^{D}\frac{\Gamma(\alpha_d+\beta_d)}{\Gamma(\alpha_d)\,\Gamma(\beta_d)}\,\frac{\Gamma(\alpha'_d)\,\Gamma(\beta'_d)}{\Gamma(\alpha'_d+\beta'_d)} \)

where \( n=\sum_{d=1}^{D+1}x_d \), \( \alpha'_d=\alpha_d+x_d \), \( \beta'_d=\beta_d+x_{d+1}+\dots+x_{D+1} \) for \( d=1,\dots,D \), and \( \xi=(\alpha,\beta) \). For count data represented as bags of words, Zamzami and Bouguila (Citation2020a) found experimentally that \( \alpha_d \) and \( \beta_d \) take very small values (much smaller than one) for almost all words w across different data sets. Moreover, for \( x\geq 1 \) we have (Elkan Citation2006):

(3) \( \lim_{\alpha\to 0}\Big[\frac{\Gamma(\alpha+x)}{\Gamma(\alpha)}-\alpha\,\Gamma(x)\Big]=0 \)

Then, the exponential family form for MGD can be written as (Zamzami and Bouguila Citation2022):

(4) \( \mathrm{EMGD}(X|\xi) \approx n!\;\Big(\prod_{d:\,x_d\geq 1}^{D}\frac{1}{x_d}\Big)\Big(\prod_{d:\,x_d\geq 1}^{D}\frac{\Gamma(z_d)}{\Gamma(x_d+z_d)}\Big)\exp\Big\{\sum_{d=1}^{D} I(x_d\geq 1)\,\log\frac{\alpha_d\,\beta_d}{\alpha_d+\beta_d}\Big\} \)

where \( I(x_d\geq 1) \) is an indicator of whether word w appears at any entry of the vector X, and \( z_d=x_{d+1}+\dots+x_{D+1} \).
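
To illustrate why this approximation is cheap on sparse vectors, the sketch below (ours, based on the reconstruction of Equation (4) above) evaluates the EMGD log-density using only the non-zero entries of a count vector; the variable names and the example values are assumptions.

```python
import numpy as np
from scipy.special import gammaln

def log_emgd(x, alpha, beta):
    """Approximate log EMGD(x | alpha, beta) following the reconstructed
    Equation (4): x has D + 1 entries, alpha and beta have D entries, and
    only the entries with x_d >= 1 contribute to the product terms."""
    x = np.asarray(x, dtype=float)
    n = x.sum()
    D = len(alpha)
    z = np.cumsum(x[::-1])[::-1][1:]          # z_d = x_{d+1} + ... + x_{D+1}
    nz = x[:D] >= 1                           # words that actually appear
    log_lik = gammaln(n + 1.0)                # log n!
    log_lik -= np.log(x[:D][nz]).sum()        # prod_{d: x_d >= 1} 1 / x_d
    log_lik += (gammaln(z[nz]) - gammaln(x[:D][nz] + z[nz])).sum()
    log_lik += np.log(alpha[nz] * beta[nz] / (alpha[nz] + beta[nz])).sum()
    return log_lik

x = np.array([3, 0, 1, 0, 2])                 # sparse count vector, D + 1 = 5
alpha = np.full(4, 0.05)
beta = np.full(4, 0.05)
print(log_emgd(x, alpha, beta))
```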

The Exponential-Family Approximation to Multinomial Beta-Liouville (MBL) Distribution

If a random vector \( X=(x_1,\dots,x_{D+1}) \) follows a Multinomial Beta-Liouville distribution, then (Bouguila Citation2011):

(5) \( \mathrm{MBL}(X|\xi) = \frac{\Gamma\big(\sum_{d=1}^{D+1}x_d+1\big)}{\prod_{d=1}^{D+1}\Gamma(x_d+1)}\times\frac{\Gamma\big(\sum_{d=1}^{D}\alpha_d\big)\,\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)\,\prod_{d=1}^{D}\Gamma(\alpha_d)}\times\frac{\Gamma(\alpha')\,\Gamma(\beta')\,\prod_{d=1}^{D}\Gamma(\alpha'_d)}{\Gamma\big(\sum_{d=1}^{D}\alpha'_d\big)\,\Gamma(\alpha'+\beta')} \)

where \( \alpha'_d=\alpha_d+x_d \), \( \alpha'=\alpha+\sum_{d=1}^{D}x_d \), \( \beta'=\beta+x_{D+1} \), and \( \xi=(\alpha,\beta) \).

In several real-world applications, the MBL mixture model has provided high clustering accuracy, comparable to the Multinomial Scaled Dirichlet mixture model (MSD) (Zamzami and Bouguila Citation2019) and the Multinomial Generalized Dirichlet mixture model (MGD) (Bouguila Citation2008); it also outperforms other widely used mixture models, such as mixtures of Multinomial (MM) and Dirichlet Compound Multinomial (DCM) distributions (Bouguila and Ziou Citation2007; Madsen, Kauchak, and Elkan Citation2005). However, the MBL does not belong to the exponential family, and it is not efficient in high-dimensional spaces where many parameters need to be estimated (Zamzami and Bouguila Citation2020a). Approximating the MBL so that it belongs to the exponential family reduces the computational cost and improves its efficiency in modeling sparse high-dimensional count data (Elkan Citation2006).

Zamzami and Bouguila (Citation2020a) found empirically that \( 0<\alpha\leq 1 \) and \( \beta\geq 1 \) for real data sets, and proposed a maximum likelihood method for estimating the model parameters. Thus, relying on Equation (3), we obtain the exponential approximation to the Multinomial Beta-Liouville distribution as:

(6) \( \mathrm{EMBL}(X|\xi) \approx \Big(\prod_{d:\,x_d\geq 1}\frac{1}{x_d}\Big)\,n!\;\frac{\Gamma(S)\,\Gamma(\alpha)\,\Gamma(\beta)\,\alpha}{\Gamma(S+n)\,\Gamma(\alpha+\beta)}\;\exp\Big\{\sum_{d=1}^{D} I(x_d\geq 1)\,\log(\alpha_d)\Big\} \)

where \( I(x_d\geq 1) \), the sufficient statistic, indicates whether word d appears at least once in the vector X, and \( S=\sum_{d=1}^{D}\alpha_d \).
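
A similar sketch can be written for the EMBL approximation. The code below follows the reconstruction of Equation (6), with S = ∑_d α_d; the names alpha_vec, a and b are our stand-ins for the Beta-Liouville parameters and the example values are arbitrary.

```python
import numpy as np
from scipy.special import gammaln

def log_embl(x, alpha_vec, a, b):
    """Approximate log EMBL(x | alpha_vec, a, b) following the reconstructed
    Equation (6), where S = sum_d alpha_d and (a, b) play the role of the
    scalar Beta-Liouville parameters (alpha, beta)."""
    x = np.asarray(x, dtype=float)
    n = x.sum()
    S = alpha_vec.sum()
    nz = x >= 1
    log_lik = gammaln(n + 1.0) - np.log(x[nz]).sum()   # n! * prod 1 / x_d
    log_lik += gammaln(S) - gammaln(S + n)             # Gamma(S) / Gamma(S + n)
    log_lik += gammaln(a) + gammaln(b) + np.log(a) - gammaln(a + b)
    log_lik += np.log(alpha_vec[nz]).sum()             # sum I(x_d >= 1) log(alpha_d)
    return log_lik

x = np.array([2, 0, 0, 5, 1])
alpha_vec = np.full(5, 0.02)
print(log_embl(x, alpha_vec, a=0.5, b=3.0))
```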

The Exponential-Family Approximation to the Multinomial Shifted-Scaled Dirichlet (MSSD) Distribution

Let \( X=(x_1,\dots,x_D) \) be a random vector that follows the Multinomial Shifted-Scaled Dirichlet distribution; then:

(7) \( \mathrm{MSSD}(X|\xi) \approx \frac{n!\,\Gamma(\alpha_+)}{\prod_{d:\,x_d\geq 1}^{D}x_d!\;\Gamma(\alpha_+ +\tau n)}\;\prod_{d:\,x_d\geq 1}^{D}\frac{\Gamma(\alpha_d+\tau x_d)\,\beta_d^{x_d}}{\Gamma(\alpha_d)} \)

where \( \alpha_+=\sum_{d=1}^{D}\alpha_d \), \( n=\sum_{d=1}^{D}x_d \), and \( \xi=(\alpha,\beta,\tau) \).

For high-dimensional data, Zamzami and Bouguila (Citation2020b) found that the values of the α parameters are very small, which, combined with some approximations, gives the exponential Multinomial Shifted-Scaled Dirichlet (EMSSD) as:

(8) \( \mathrm{EMSSD}(X|\xi) \approx \frac{n!\,\Gamma(\alpha_+)\,\tau^{D}}{\Gamma(\alpha_+ +\tau n)}\;\prod_{d:\,x_d\geq 1}^{D}\frac{\alpha_d\,\beta_d^{x_d}}{x_d\,\tau} \)

The Proposed Bayesian Learning Framework

In this section, we propose algorithms to learn the parameters of the finite and infinite mixture models of EMGD, EMBL and EMSSD.

Finite Mixture of Distributions

A finite mixture of distributions with M components is defined as (e.g., (Bouguila and Fan Citation2020)):

(9) \( p(X|\Theta)=\sum_{j=1}^{M}p(X|\xi_j)\,P_j \)

where the \( P_j \) are the mixing weights and \( p(X|\xi_j) \) are the component distributions. \( \Theta=(\xi,P) \) is the entire set of parameters to be estimated, where \( \xi=(\xi_1,\dots,\xi_M) \), \( \xi_j \) represents the parameters of component j, and \( P=(P_1,\dots,P_M) \) is the vector of mixing weights, which must satisfy \( 0\leq P_j\leq 1 \) for \( j=1,\dots,M \) and \( \sum_{j=1}^{M}P_j=1 \).
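
For concreteness, the sketch below accumulates the log-likelihood of a data set under the finite mixture of Equation (9) in log space; component_logpdf stands for any per-component log-density, such as the EMGD or EMBL sketches given earlier, and is an assumption of this example.

```python
import numpy as np
from scipy.special import logsumexp

def mixture_log_likelihood(X, weights, params, component_logpdf):
    """Log-likelihood under the finite mixture of Equation (9):
    p(X_i) = sum_j P_j p(X_i | xi_j), computed with logsumexp for stability.
    `params` is a list of per-component parameter objects xi_j."""
    log_w = np.log(np.asarray(weights))
    total = 0.0
    for x in X:
        log_comp = np.array([component_logpdf(x, xi_j) for xi_j in params])
        total += logsumexp(log_comp + log_w)
    return total
```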

Bayesian Learning for Finite Mixture Weight Parameters

Given a set of N independent vectors \( X=(X_1,\dots,X_N) \) described by a finite mixture model with M components, where M is supposed to be known, the main problem is to estimate the mixture parameters. In this work, we rely on Bayesian techniques to solve this problem.

We define an indicator variable for each \( X_i \) in the data set X and each class j as:

(10) \( Z_{ij}=\begin{cases}1 & \text{if } X_i \text{ belongs to class } j\\ 0 & \text{otherwise}\end{cases} \)

where \( Z=\{Z_1,\dots,Z_N\} \) and \( Z_i=(Z_{i1},\dots,Z_{iM}) \). In the Bayesian paradigm, the information brought by the complete data (X,Z), a realization of \( (X,Z)\sim p(X,Z|\Theta) \), is combined with prior information about the parameters Θ, specified in a prior distribution with density π(Θ), and summarized in a probability distribution π(Θ|X,Z) called the posterior distribution, which can be derived from the joint distribution \( p(X,Z|\Theta)\,\pi(\Theta) \) (Nizar, Ziou, and Hammoud Citation2009). Thus, we have:

(11) \( \pi(\xi|X,Z)=\frac{\pi(\xi)\,p(X,Z|\xi)}{\int\pi(\xi)\,p(X,Z|\xi)\,d\xi}\;\propto\;\pi(\xi)\,p(X,Z|\xi) \)

where \( \int\pi(\xi)\,p(X,Z|\xi)\,d\xi \) is the marginal density of the complete data (X,Z). Rather than computing this posterior directly, we can simulate \( \xi\sim\pi(\xi|X,Z) \) with the well-known Gibbs sampler. Gibbs sampling is widely used for Bayesian mixture models, especially in the case of incomplete data (Train Citation2009; Xuanbo, Bouguila, and Zamzami Citation2021). That is, we associate with each observation \( X_i \) a missing multinomial variable \( Z_i\sim\mathcal{M}(1;\hat{Z}_{i1},\dots,\hat{Z}_{iM}) \), where:

(12) \( \hat{Z}_{ij}=\frac{p(X_i|\xi_j)\,P_j}{\sum_{j'=1}^{M}p(X_i|\xi_{j'})\,P_{j'}} \)

In fact, the weight parameters are independent of X given Z, \( P\sim\pi(P|Z) \) (Samuel, Balakrishnan, and Johnson Citation2000). Since the vector P is defined on the simplex \( \{(P_1,\dots,P_M):\sum_{j=1}^{M-1}P_j<1\} \), the natural prior distribution for P is the Dirichlet distribution, and we specify the prior of P (Lee Citation2012) as:

(13) \( \pi(P|\eta)=\frac{\Gamma\big(\sum_{j=1}^{M}\eta_j\big)}{\prod_{j=1}^{M}\Gamma(\eta_j)}\prod_{j=1}^{M}P_j^{\eta_j-1} \)

where \( \eta=(\eta_1,\dots,\eta_M) \) is the parameter vector of the Dirichlet distribution. Moreover, we have:

(14) \( \pi(Z|P)=\prod_{i=1}^{N}\pi(Z_i|P)=\prod_{i=1}^{N}P_1^{Z_{i1}}\cdots P_M^{Z_{iM}}=\prod_{i=1}^{N}\prod_{j=1}^{M}P_j^{Z_{ij}}=\prod_{j=1}^{M}P_j^{n_j} \)

where \( n_j=\sum_{i=1}^{N}I(Z_{ij}=1) \). Having the prior and the likelihood in hand, we can obtain the posterior of the weight parameters P as follows:

(15) \( \pi(P|Z)\propto\pi(P)\,\pi(Z|P)=\prod_{j=1}^{M}P_j^{n_j}\;\frac{\Gamma\big(\sum_{j=1}^{M}\eta_j\big)}{\prod_{j=1}^{M}\Gamma(\eta_j)}\prod_{j=1}^{M}P_j^{\eta_j-1}=\frac{\Gamma\big(\sum_{j=1}^{M}\eta_j\big)}{\prod_{j=1}^{M}\Gamma(\eta_j)}\prod_{j=1}^{M}P_j^{\eta_j+n_j-1}\;\propto\;\mathcal{D}(\eta_1+n_1,\dots,\eta_M+n_M) \)

where \( \mathcal{D} \) denotes the Dirichlet distribution with parameters \( (\eta_1+n_1,\dots,\eta_M+n_M) \). Note that the prior π(P) and the posterior π(P|Z) are both Dirichlet distributions; in this case we say that the Dirichlet is the conjugate prior for the mixing proportions. Therefore, the weight parameters can be sampled from a Dirichlet distribution. We selected \( \eta_j=1 \), \( j=1,\dots,M \), in our experiments.
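
The sketch below (ours) puts Equations (12) and (15) together into one Gibbs sweep over the indicators and the mixing weights, with the choice η_j = 1 mentioned above; component_logpdf and params are assumed helpers of the same form as in the previous sketches.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)

def gibbs_sweep_weights(X, weights, params, component_logpdf, eta=1.0):
    """One Gibbs update of the indicators Z (Equation (12)) and the mixing
    weights P (Equation (15)): with a Dirichlet(eta, ..., eta) prior, the
    posterior of P given Z is Dirichlet(eta + n_1, ..., eta + n_M)."""
    M = len(weights)
    Z = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        log_resp = np.log(weights) + np.array(
            [component_logpdf(x, xi_j) for xi_j in params])
        resp = np.exp(log_resp - logsumexp(log_resp))   # Z_hat_ij of Equation (12)
        resp /= resp.sum()
        Z[i] = rng.choice(M, p=resp)                    # Z_i ~ M(1; Z_hat_i)
    counts = np.bincount(Z, minlength=M)                # n_j
    weights = rng.dirichlet(eta + counts)               # P ~ D(eta + n_1, ..., eta + n_M)
    return Z, weights
```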

The Bayesian Learning for Infinite Mixture Weight Parameters

In the finite mixture model, we considered M to be a fixed finite quantity. In this section, we explore the limit \( M\to\infty \) and present the conditional posteriors of the indicators and weight parameters based on the Dirichlet process. We take \( (\eta_1,\dots,\eta_M)=(\eta/M,\dots,\eta/M) \) in Equation (13), which yields a simpler form of the prior probability of the infinite mixture weight parameters:

(16) \( \pi(P_{\inf}|\eta)=\frac{\Gamma(\eta)}{\Gamma(\eta/M)^{M}}\prod_{j=1}^{M}P_{\inf j}^{\,\eta/M-1} \)

where \( P_{\inf}=(P_{\inf 1},\dots,P_{\inf M}) \). From Equation (14), the prior distribution of the indicators Z given \( P_{\inf} \) is multinomial. Using the standard Dirichlet integral, we can marginalize out \( P_{\inf} \) to obtain the following prior probability directly in terms of the indicators (Rasmussen Citation1999):

(17) \( p(Z|\eta)=\int P(Z|P_{\inf})\,P(P_{\inf}|\eta)\,dP_{\inf}=\frac{\Gamma(\eta)}{\Gamma(\eta/M)^{M}}\int\prod_{j=1}^{M}P_{\inf j}^{\,n_j+\eta/M-1}\,dP_{\inf j}=\frac{\Gamma(\eta)}{\Gamma(N+\eta)}\prod_{j=1}^{M}\frac{\Gamma(n_j+\eta/M)}{\Gamma(\eta/M)} \)

Based on Bayes' rule, we obtain the conditional posterior distribution of the mixing weight vector:

(18) \( \pi(P_{\inf}|Z)=\frac{p(P_{\inf}|\eta)\,p(Z|P_{\inf})}{p(Z|\eta)}\;\propto\;\prod_{j=1}^{M}P_{\inf j}^{\,n_j}\,\frac{\Gamma\big(\sum_{j=1}^{M}\eta/M\big)}{\prod_{j=1}^{M}\Gamma(\eta/M)}\prod_{j=1}^{M}P_{\inf j}^{\,\eta/M-1}=\frac{\Gamma\big(\sum_{j=1}^{M}\eta/M\big)}{\prod_{j=1}^{M}\Gamma(\eta/M)}\prod_{j=1}^{M}P_{\inf j}^{\,\eta/M+n_j-1}\;\propto\;\mathcal{D}(\eta/M+n_1,\dots,\eta/M+n_M) \)

In order to be able to use Gibbs sampling for the indicators \( Z_i \), we need the conditional prior for a single indicator given all the others; this is easily obtained from Equation (17) by keeping all but a single indicator fixed (Najar, Zamzami, and Bouguila Citation2020):

(19) \( p(Z_i=j|Z_{-i},\eta)=\frac{n_{-i,j}+\eta/M}{N-1+\eta} \)

where the subscript −i indicates all indices except i, and \( n_{-i,j} \) is the number of observations, excluding \( X_i \), that are associated with component j.

Lastly, we choose an inverse Gamma prior for the concentration parameter η:

(20) \( p(\eta|\vartheta,\rho)=\frac{\rho^{\vartheta}\exp(-\rho/\eta)}{\Gamma(\vartheta)\,\eta^{\vartheta+1}} \)

The likelihood for η can be derived from Equation (17), which together with the prior in Equation (20) gives:

(21) \( p(\eta|\vartheta,\rho,M,N)\propto\frac{\rho^{\vartheta}\exp(-\rho/\eta)}{\Gamma(\vartheta)\,\eta^{\vartheta+1}}\times\frac{\eta^{M}\,\Gamma(\eta)}{\Gamma(N+\eta)} \)

We selected \( (\vartheta,\rho)=(4,2) \) in our experiments. These values were previously used in Bouguila and Ziou (Citation2010) because they allow a diffuse range of values for the number of clusters M (more details and discussions can be found in Escobar and West Citation1995). For the indicators, letting \( M\to\infty \) in Equation (19), the conditional prior reaches the following limits (Rasmussen Citation1999):

(22) \( p(Z_i=j|\eta,Z_{-i})=\begin{cases}\dfrac{n_{-i,j}}{N-1+\eta} & \text{if } n_{-i,j}>0\\ \dfrac{\eta}{N-1+\eta} & \text{if } n_{-i,j}=0\end{cases} \)

Having this prior distribution, we obtain the conditional posterior by multiplying by the model likelihood:

(23) \( p(Z_i=j|\eta,Z_{-i},X_i)\propto\begin{cases}\dfrac{n_{-i,j}}{N-1+\eta}\;p(X_i|\xi_j) & \text{if } n_{-i,j}>0\\ \dfrac{\eta}{N-1+\eta}\displaystyle\int p(X_i|\xi)\,p(\xi)\,d\xi & \text{if } n_{-i,j}=0\end{cases} \)

Unfortunately, the integral in Equation (23) is not analytically tractable; hence, we consider a Monte Carlo sampling approximation.
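
One possible form of this approximation is sketched below; it is our own illustration of Equations (22)-(23), not the authors' implementation. Existing components are weighted by n_{-i,j} times the component likelihood, while the intractable integral for an unrepresented component is replaced by an average of the likelihood over draws from the prior. Here params is an assumed dictionary mapping component labels to parameters, and prior_sampler is an assumed helper that returns one draw from the prior p(ξ).

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(1)

def sample_indicator(x_i, Z, i, params, eta, component_logpdf,
                     prior_sampler, n_prior_draws=20):
    """Sample Z_i for the Dirichlet-process mixture (Equations (22)-(23)),
    approximating the marginal likelihood of a new component by Monte Carlo."""
    Z = np.asarray(Z)
    N = len(Z)
    counts = {int(j): int(np.sum(Z == j)) for j in np.unique(Z)}
    counts[int(Z[i])] -= 1                      # exclude observation i itself
    active = [j for j, c in counts.items() if c > 0]
    log_w = [np.log(counts[j]) + component_logpdf(x_i, params[j]) for j in active]
    # Monte Carlo estimate of  integral p(x_i | xi) p(xi) d xi
    draws = [component_logpdf(x_i, prior_sampler(rng)) for _ in range(n_prior_draws)]
    log_w.append(np.log(eta) + logsumexp(draws) - np.log(n_prior_draws))
    log_w = np.array(log_w) - np.log(N - 1 + eta)
    probs = np.exp(log_w - logsumexp(log_w))
    probs /= probs.sum()
    k = rng.choice(len(probs), p=probs)
    return active[k] if k < len(active) else max(counts) + 1   # new label if needed
```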

Learning Algorithm for Finite Mixture Model of EMGD

Let π(ξ) be the prior distribution of the parameters of the EMGD distribution. We use the fact that the EMGD belongs to the exponential family: if an S-parameter density belongs to the exponential family, it can be rewritten in the exponential form shown in Equation (1).

Writing the EMGD in the exponential form gives:

(24) \( H(X)=n!\prod_{w:\,x_w\geq 1}\frac{1}{x_w}\,\frac{\Gamma(z_w)}{\Gamma(x_w+z_w)},\qquad G_w(\xi)=\log\frac{\alpha_w\,\beta_w}{\alpha_w+\beta_w},\qquad T_w(X)=I(x_w\geq 1),\qquad \Phi(\xi)=0 \)

In this case, a prior of ξ is given by (Lee Citation2012) as:

(25) π(ξ)exp(w=1WρlGw(ξ)+kΦ(ξ))(25)

where \( \rho=(\rho_1,\dots,\rho_W) \) and \( k>0 \) are referred to as hyperparameters.

The prior for the EMGD can therefore be written as follows:

(26) \( \pi(\alpha,\beta)\propto\exp\Big(\sum_{w=1}^{W}\rho_w\log\frac{\alpha_w\,\beta_w}{\alpha_w+\beta_w}\Big) \)

Having the prior in hand, the mixture model posterior is (see Appendix A):

(27) \( \pi(\xi_j|M,X)\propto\pi(\xi_j)\prod_{Z_{ij}=1}\mathrm{EMGD}(X_i|\xi_j)\propto\exp\Big[\sum_{w=1}^{W}\log\frac{\alpha_{jw}\,\beta_{jw}}{\alpha_{jw}+\beta_{jw}}\Big(\rho_w+\sum_{Z_{ij}=1}^{N}I(x_{iw}\geq 1)\Big)\Big]\times\prod_{Z_{ij}=1}^{N}\Big(n!\prod_{w:\,x_{iw}\geq 1}\frac{1}{x_{iw}}\,\frac{\Gamma(z_{iw})}{\Gamma(x_{iw}+z_{iw})}\Big) \)

Following (Nizar, Ziou, and Hammoud Citation2009), once the sample X is known, it can be used to set the prior hyperparameters. We then held \( (\rho_1,\dots,\rho_W) \) and \( (\eta_1,\dots,\eta_M) \) fixed at \( \eta_j=1 \), \( j=1,\dots,M \), and \( \rho_w=1 \), \( w=1,\dots,W \).

Algorithm 1 Finite EMGD (FinEMGD) learning algorithm

Initialization: Using MOM and K-means method to initialize model parameters

Input: a data set \( X=\{X_1,\dots,X_N\} \), each a W-dimensional sparse count vector, and the number of clusters M

output: Θ

for t = 1, 2, …:

(1) Generate \( Z_i^t\sim\mathcal{M}(1;\hat{Z}_{i1}^{t-1},\dots,\hat{Z}_{iM}^{t-1}) \) for \( i=1,\dots,N \)

(2) Generate the weight parameters \( P^t \) from Equation (15)

(3) Generate the model parameters \( \xi^t \) from Equation (27) using the M–H algorithm

M–H algorithm:

(1) Generate \( \tilde{\xi}_j \) from \( q(\tilde{\xi}_j|\xi_j^{t-1}) \) and \( u\sim\mathcal{U}_{[0,1]} \)

(2) Compute \( r=\frac{\pi(\tilde{\xi}_j|M,X)\,q(\xi_j^{t-1}|\tilde{\xi}_j)}{\pi(\xi_j^{t-1}|M,X)\,q(\tilde{\xi}_j|\xi_j^{t-1})} \)

(3) If \( u\leq r \), set \( \xi_j^t=\tilde{\xi}_j \); otherwise \( \xi_j^t=\xi_j^{t-1} \)

In Algorithm 1, \( \xi_j=(\alpha_{j1},\beta_{j1},\dots,\alpha_{jW},\beta_{jW}) \), and we use K-means (Hartigan and Wong Citation1979) and the method of moments (MOM) (Wong Citation2010) to initialize the model parameters. In the M–H step, the major factor is the choice of the proposal distribution q (Sorensen and Gianola Citation2002; Train Citation2009). As the model parameters satisfy \( 0<\alpha_{jw},\beta_{jw}\leq 1 \), we choose Gamma distributions as the proposals for \( \alpha_{jw} \) and \( \beta_{jw} \):

(28) \( \alpha_{jw}\sim\mathcal{G}(\alpha,\sigma_1),\qquad \beta_{jw}\sim\mathcal{G}(\beta,\sigma_2) \)

The complexity of the algorithm is determined by the size of the data set (i.e., the number of observations N) and the number of mixture components K; the computational cost of one iteration is O(NK). Here \( \sigma_1 \) and \( \sigma_2 \) are the scale parameters of the Gamma proposal distributions. The complete algorithm for estimating the EMGD parameters using the proposed approach is presented in Algorithm 1.
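
The sketch below shows what one such M–H update could look like with the Gamma proposals of Equation (28). The way the proposal is centred on the current value and the name log_posterior, which stands for the unnormalized log of Equation (27) restricted to the observations assigned to cluster j, are our assumptions.

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(2)

def mh_update(xi_old, log_posterior, scale=0.05):
    """One Metropolis-Hastings update of a positive parameter vector (e.g. the
    concatenation of alpha_j and beta_j) using a component-wise Gamma
    random-walk proposal with mean equal to the current value."""
    shape_old = xi_old / scale
    xi_new = rng.gamma(shape_old, scale)           # proposal xi_tilde ~ Gamma
    shape_new = xi_new / scale
    # log proposal densities q(new | old) and q(old | new)
    log_q_fwd = gamma.logpdf(xi_new, a=shape_old, scale=scale).sum()
    log_q_bwd = gamma.logpdf(xi_old, a=shape_new, scale=scale).sum()
    log_r = (log_posterior(xi_new) + log_q_bwd
             - log_posterior(xi_old) - log_q_fwd)
    # accept xi_tilde with probability min(1, r)
    return xi_new if np.log(rng.uniform()) < log_r else xi_old
```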

Learning Algorithm for Infinite Mixture Model of EMGD

We know that the model parameters α and β of the EMGD satisfy \( 0<\alpha_{jw},\beta_{jw}<1 \), so an appealing and flexible choice of prior is the Beta distribution, with shape parameters (δ, ε) for α and (ϖ, ρ) for β:

(29) \( p(\alpha_j)\propto\frac{\Gamma(\delta+\varepsilon)}{\Gamma(\delta)\,\Gamma(\varepsilon)}\,\alpha_j^{\delta-1}(1-\alpha_j)^{\varepsilon-1} \)
(30) \( p(\beta_j)\propto\frac{\Gamma(\varpi+\rho)}{\Gamma(\varpi)\,\Gamma(\rho)}\,\beta_j^{\varpi-1}(1-\beta_j)^{\rho-1} \)

where \( \alpha_j=(\alpha_{j1},\dots,\alpha_{jD}) \) and \( \beta_j=(\beta_{j1},\dots,\beta_{jD}) \).

Then, the conditional posterior distributions for αj and βj are:

(31) \( p(\alpha_j|X,Z)\propto p(\alpha_j)\prod_{Z_{ij}=1}\mathrm{EMGD}(X_i|\xi_j)\propto\frac{\Gamma(\delta+\varepsilon)}{\Gamma(\delta)\,\Gamma(\varepsilon)}\,\alpha_j^{\delta-1}(1-\alpha_j)^{\varepsilon-1}\prod_{Z_{ij}=1}\Big\{n!\prod_{w:\,x_w\geq 1}\frac{1}{x_w}\,\frac{\Gamma(z_w)}{\Gamma(x_w+z_w)}\,\exp\Big(\sum_{w=1}^{W}I(x_w\geq 1)\log\frac{\alpha_w\,\beta_w}{\alpha_w+\beta_w}\Big)\Big\} \)
(32) \( p(\beta_j|X,Z)\propto p(\beta_j)\prod_{Z_{ij}=1}\mathrm{EMGD}(X_i|\xi_j)\propto\frac{\Gamma(\varpi+\rho)}{\Gamma(\varpi)\,\Gamma(\rho)}\,\beta_j^{\varpi-1}(1-\beta_j)^{\rho-1}\prod_{Z_{ij}=1}\Big\{n!\prod_{w:\,x_w\geq 1}\frac{1}{x_w}\,\frac{\Gamma(z_w)}{\Gamma(x_w+z_w)}\,\exp\Big(\sum_{w=1}^{W}I(x_w\geq 1)\log\frac{\alpha_w\,\beta_w}{\alpha_w+\beta_w}\Big)\Big\} \)

In order to have a more flexible model, we introduce an additional hierarchical level by allowing the hyperparameters to follow selected distributions. The hyperparameters δ and ϖ, associated with α and β respectively, are given Beta priors, while ε and ρ are given exponential priors:

(33) \( p(\delta|\varsigma,\upsilon)=\frac{\Gamma(\varsigma+\upsilon)}{\Gamma(\varsigma)\,\Gamma(\upsilon)}\,\delta^{\varsigma-1}(1-\delta)^{\upsilon-1} \)
(34) \( p(\varepsilon|\lambda)=\lambda\exp(-\lambda\varepsilon) \)
(35) \( p(\varpi|\kappa,\omega)=\frac{\Gamma(\kappa+\omega)}{\Gamma(\kappa)\,\Gamma(\omega)}\,\varpi^{\kappa-1}(1-\varpi)^{\omega-1} \)
(36) \( p(\rho|\iota)=\iota\exp(-\iota\rho) \)

For these hyperparameters δ, ε, ϖ and ρ, the priors of α and β play the role of the likelihood; thus, their conditional posteriors can be obtained (see Appendix C).

Then, we have Algorithm 2 for learning the infinite mixture model of EMGD:

Algorithm 2 Infinite EMGD (InfEMGD) learning algorithm

Initialization: Using MOM to initialize model parameters

Input: a data set \( X=\{X_1,\dots,X_N\} \), each a W-dimensional sparse count vector

output: Θ

for t = 1, 2, …:

(1) Generate \( Z^t \) from Equation (23) with the Monte Carlo sampling approximation

(2) Update the number of represented components

(3) Generate the concentration parameter η from its conditional posterior in Equation (21) using adaptive rejection sampling (ARS)

(4) Generate the weight parameters \( P^t \) from \( \mathcal{D}(\eta/M+n_1,\dots,\eta/M+n_M) \)

(5) Update α, β with the M–H algorithm

M–H algorithm:

for \( \gamma_j \) in \( (\alpha_j,\beta_j) \):

(1) Generate \( \tilde{\gamma}_j \) from \( q(\tilde{\gamma}_j|\gamma_j^{t-1}) \) and \( u\sim\mathcal{U}_{[0,1]} \)

(2) Compute \( r=\frac{p(\tilde{\gamma}_j|M,X)\,q(\gamma_j^{t-1}|\tilde{\gamma}_j)}{p(\gamma_j^{t-1}|M,X)\,q(\tilde{\gamma}_j|\gamma_j^{t-1})} \) from Equation (31) or Equation (32)

(3) If \( u\leq r \), set \( \gamma_j^t=\tilde{\gamma}_j \); otherwise \( \gamma_j^t=\gamma_j^{t-1} \)

Update the hyperparameters δ, ε, ϖ and ρ with MCMC sampling from their conditional posteriors

Learning Algorithm for Finite Mixture Model of EMBL

The EMBL also belongs to the exponential family. We define \( X=\{X_1,\dots,X_N\} \), where \( X_i=(x_{i1},\dots,x_{iW}) \). Following Equation (1), we can show that:

(37) \( H(X)=n!\prod_{w:\,x_w\geq 1}\frac{1}{x_w},\qquad G_w(\xi)=\log(\alpha_w),\qquad T_w(X)=I(x_w\geq 1),\qquad \Phi(\xi)=\log\Big\{\frac{\Gamma(S)\,\Gamma(\alpha)\,\Gamma(\beta)\,\alpha}{\Gamma(S+n)\,\Gamma(\alpha+\beta)}\Big\} \)

Based on Equation (25), we have the following prior:

(38) \( \pi(\alpha,\beta)\propto\exp\Big[\sum_{w=1}^{W}\rho_w\log(\alpha_w)+k\,\log\Big(\frac{\Gamma(S)\,\Gamma(\alpha)\,\Gamma(\beta)\,\alpha}{\Gamma(S+n)\,\Gamma(\alpha+\beta)}\Big)\Big] \)

From Bayesian theory, the posterior can be written as (see Appendix B):

(39) \( \pi(\xi_j|M,X)\propto\pi(\xi_j)\prod_{Z_{ij}=1}\mathrm{EMBL}(X_i|\xi_j)\propto\exp\Big[\sum_{w=1}^{W}\log(\alpha_{jw})\Big(\rho_w+\sum_{Z_{ij}=1}^{N}I(x_{iw}\geq 1)\Big)+k\,\log\Big(\frac{\Gamma(\alpha_j)\,\Gamma(\beta_j)\,\alpha_j}{S\,(S+1)\cdots(S+n-1)\,\Gamma(\alpha_j+\beta_j)}\Big)+\sum_{i:\,Z_{ij}=1}\log\Big(\frac{\Gamma(\alpha)\,\Gamma(\beta)\,\alpha}{S\,(S+1)\cdots(S+n-1)\,\Gamma(\alpha+\beta)}\Big)\Big] \)

Once the sample X is known, the posterior hyperparameters can be fixed; we set \( \rho_w=1 \), \( k=1 \) and \( \eta=1 \) (Bouguila, Ziou, and Hammoud Citation2009).

In a Bayesian approach, choosing an effective proposal distribution is a significant factor for parameter estimation and convergence time. After trying several common proposal distributions, we finally selected the Beta distribution as the proposal for \( \alpha_{jw} \) and the inverse Gamma distribution for β:

Algorithm 3 Finite EMBL (FinEMBL) learning algorithm

Initialization: Using the MOM and the K-means method to initialize model parameters

Input: a data set \( X=\{X_1,\dots,X_N\} \), each a W-dimensional sparse count vector, and the number of clusters M

output: Θ

for t = 1, 2, …:

(1) Generate \( Z_i^t\sim\mathcal{M}(1;\hat{Z}_{i1}^{t-1},\dots,\hat{Z}_{iM}^{t-1}) \) for \( i=1,\dots,N \)

(2) Generate the weight parameters \( P^t \) from Equation (15)

(3) Generate the model parameters \( \xi^t \) from Equation (39) using the M–H algorithm

M–H algorithm:

(1) Generate \( \tilde{\xi}_j \) from \( q(\tilde{\xi}_j|\xi_j^{t-1}) \) and \( u\sim\mathcal{U}_{[0,1]} \)

(2) Compute \( r=\frac{\pi(\tilde{\xi}_j|M,X)\,q(\xi_j^{t-1}|\tilde{\xi}_j)}{\pi(\xi_j^{t-1}|M,X)\,q(\tilde{\xi}_j|\xi_j^{t-1})} \)

(3) If \( u\leq r \), set \( \xi_j^t=\tilde{\xi}_j \); otherwise \( \xi_j^t=\xi_j^{t-1} \)

(40) \( \alpha_{jw}\sim\mathcal{B}(\alpha,\sigma_1),\qquad \beta\sim\mathrm{invG}(\beta,\sigma_2) \)

The complete steps for estimating the EMBL model parameters using the proposed approach are given in Algorithm 3. Note that Algorithm 3 requires a computational cost of O(NK) per iteration.
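
As an illustration of the proposal choice in Equation (40), the sketch below draws M–H candidates from a Beta distribution for each α_jw and from an inverse Gamma distribution for β. The exact way the proposals are parameterized around the current values is not spelled out in the text and is therefore our assumption.

```python
import numpy as np

rng = np.random.default_rng(3)

def propose_embl(alpha_old, beta_old, kappa=50.0, shape=3.0):
    """Draw M-H candidates for the EMBL parameters as in Equation (40):
    a Beta proposal for each alpha_jw (which stays in (0, 1)) and an inverse
    Gamma proposal for beta (which stays positive)."""
    # Beta(kappa * a, kappa * (1 - a)) has mean a and tightens as kappa grows
    a = np.clip(alpha_old, 1e-6, 1.0 - 1e-6)
    alpha_new = rng.beta(kappa * a, kappa * (1.0 - a))
    # inverse Gamma with mean ~ beta_old: invert a Gamma(shape, scale) draw
    beta_new = 1.0 / rng.gamma(shape, 1.0 / (beta_old * (shape - 1.0)))
    return alpha_new, beta_new

alpha_new, beta_new = propose_embl(np.full(4, 0.05), 3.0)
```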

Learning Algorithm for Infinite Mixture Model of EMBL

As shown empirically, the values of α and β satisfy \( 0<\alpha\leq 1 \) and \( \beta\geq 1 \). Thus, we choose the Beta distribution and the inverse Gamma distribution as priors for α and β, with hyperparameters (δ, ε) and (ϖ, ρ) respectively; then:

(41) \( p(\alpha_j)\propto\frac{\Gamma(\delta+\varepsilon)}{\Gamma(\delta)\,\Gamma(\varepsilon)}\,\alpha_j^{\delta-1}(1-\alpha_j)^{\varepsilon-1} \)
(42) \( p(\beta_j)\propto\frac{\rho^{\varpi}\exp(-\rho/\beta_j)}{\Gamma(\varpi)\,\beta_j^{\varpi+1}} \)

Having this prior, the full conditional posteriors for αj and βj are:

(43) \( p(\alpha_j|X,Z)\propto p(\alpha_j)\prod_{Z_{ij}=1}\mathrm{EMBL}(X_i|\xi_j)\propto\frac{\Gamma(\delta+\varepsilon)}{\Gamma(\delta)\,\Gamma(\varepsilon)}\,\alpha_j^{\delta-1}(1-\alpha_j)^{\varepsilon-1}\prod_{Z_{ij}=1}\Big\{n!\prod_{d:\,x_d\geq 1}\frac{1}{x_d}\,\frac{\Gamma(S)\,\Gamma(\alpha)\,\Gamma(\beta)\,\alpha}{\Gamma(S+n)\,\Gamma(\alpha+\beta)}\,\exp\Big(\sum_{d=1}^{D}I(x_d\geq 1)\log(\alpha_d)\Big)\Big\} \)
(44) \( p(\beta_j|X,Z)\propto p(\beta_j)\prod_{Z_{ij}=1}\mathrm{EMBL}(X_i|\xi_j)\propto\frac{\rho^{\varpi}\exp(-\rho/\beta_j)}{\Gamma(\varpi)\,\beta_j^{\varpi+1}}\prod_{Z_{ij}=1}\Big\{n!\prod_{d:\,x_d\geq 1}\frac{1}{x_d}\,\frac{\Gamma(S)\,\Gamma(\alpha)\,\Gamma(\beta)\,\alpha}{\Gamma(S+n)\,\Gamma(\alpha+\beta)}\,\exp\Big(\sum_{d=1}^{D}I(x_d\geq 1)\log(\alpha_d)\Big)\Big\} \)

In order to reduce sensitivity to the hyperparameters, we place priors on δ, ε, ϖ and ρ, choosing a Beta distribution, an exponential distribution, an inverse Gamma distribution and an exponential distribution, respectively:

(45) \( p(\delta|\varsigma,\upsilon)\propto\frac{\Gamma(\varsigma+\upsilon)}{\Gamma(\varsigma)\,\Gamma(\upsilon)}\,\delta^{\varsigma-1}(1-\delta)^{\upsilon-1} \)
(46) \( p(\varepsilon|\lambda)\propto\lambda\exp(-\lambda\varepsilon) \)
(47) \( p(\varpi|\kappa,\omega)\propto\frac{\omega^{\kappa}\exp(-\omega/\varpi)}{\Gamma(\kappa)\,\varpi^{\kappa+1}} \)
(48) \( p(\rho|\iota)\propto\iota\exp(-\iota\rho) \)

For these hyperparameters δ, ε, ϖ and ρ, the priors of α and β play the role of the likelihood; thus, their conditional posteriors can be obtained (see Appendix C). The parameter learning algorithm of this infinite model is similar to that of the infinite mixture of EMGD; we only need to replace the posterior probabilities of α, β and δ, ε, ϖ, ρ in the M–H steps.

Thus, we have the learning algorithm 4:

Algorithm 4 Infinite EMBL (InfEMBL) learning algorithm

Initialization: Using MOM to initialize model parameters

Input: a data set \( X=\{X_1,\dots,X_N\} \), each a W-dimensional sparse count vector

output: Θ

for t = 1, 2, …:

(1) Generate \( Z^t \) from Equation (23) with the Monte Carlo sampling approximation

(2) Update the number of represented components

(3) Generate the concentration parameter η from its conditional posterior in Equation (21) using adaptive rejection sampling (ARS)

(4) Generate the weight parameters \( P^t \) from \( \mathcal{D}(\eta/M+n_1,\dots,\eta/M+n_M) \)

(5) Update α, β with the M–H algorithm

M–H algorithm:

for \( \gamma_j \) in \( (\alpha_j,\beta_j) \):

(1) Generate \( \tilde{\gamma}_j \) from \( q(\tilde{\gamma}_j|\gamma_j^{t-1}) \) and \( u\sim\mathcal{U}_{[0,1]} \)

(2) Compute \( r=\frac{p(\tilde{\gamma}_j|M,X)\,q(\gamma_j^{t-1}|\tilde{\gamma}_j)}{p(\gamma_j^{t-1}|M,X)\,q(\tilde{\gamma}_j|\gamma_j^{t-1})} \) from Equation (43) or Equation (44)

(3) If \( u\leq r \), set \( \gamma_j^t=\tilde{\gamma}_j \); otherwise \( \gamma_j^t=\gamma_j^{t-1} \)

Update the hyperparameters δ, ε, ϖ and ρ with MCMC sampling from their conditional posteriors

Learning Algorithm for Finite Mixture Model of EMSSD

The EMSSD can be written, following Equation (1), as:

(49) \( H(X)=\frac{n!}{\prod_{w:\,x_w\geq 1}^{D}x_w},\qquad G_{w1}(\xi)=\log(\alpha_w)-\log(\tau),\qquad G_{w2}(\xi)=\log(\beta_w),\qquad T_{w1}(X)=I(x_w\geq 1),\qquad T_{w2}(X)=I(x_w\geq 1)\,x_w,\qquad \Phi(\xi)=\log\Big\{\frac{\Gamma(\alpha_+)\,\tau^{D}}{\Gamma(\alpha_+ +\tau n)}\Big\} \)

Based on Equation (25), we have the following prior:

(50) \( \pi(\alpha,\beta,\tau)\propto\exp\Big[\sum_{w=1}^{W}\big\{\rho_{1w}\big(\log(\alpha_w)-\log(\tau)\big)+\rho_{2w}\log(\beta_w)\big\}+k\,\log\Big\{\frac{\Gamma(\alpha_+)\,\tau^{D}}{\Gamma(\alpha_+ +\tau n)}\Big\}\Big] \)

From Bayesian theory, the posterior can be written as

(51) \( p(\xi_j|X,Z)\propto\pi(\alpha,\beta,\tau)\prod_{Z_{ij}=1}^{N}\mathrm{EMSSD}(X_i|\xi_j,M)\propto\exp\Big\{\sum_{w=1}^{W}\log\Big(\frac{\alpha_{jw}}{\tau_j}\Big)\Big(\rho_{1w}+\sum_{Z_{ij}=1}^{N}I(x_{iw}\geq 1)\Big)+\sum_{w=1}^{W}\log(\beta_{jw})\Big(\rho_{2w}+\sum_{Z_{ij}=1}^{N}I(x_{iw}\geq 1)\,x_{iw}\Big)+k\,\log\frac{\Gamma(\alpha_+)\,\tau_j^{D}}{\Gamma(\alpha_+ +\tau_j n)}+\sum_{Z_{ij}=1}^{N}\log\frac{\Gamma(\alpha_+)\,\tau_j^{W}}{\Gamma(\alpha_+ +\tau_j n_i)}\Big\} \)

Once the sample X is known, the posterior hyperparameters can be fixed; we set \( \rho_{1w}=1 \), \( \rho_{2w}=1 \), \( k=1 \) and \( \eta=1 \). Having the posterior in hand, we can state the algorithm for the finite mixture model of EMSSD.

Algorithm 5 Finite EMSSD (FinEMSSD) learning algorithm

Initialization: Using MOM and K-means method to initialize model parameters

Input: a data set \( X=\{X_1,\dots,X_N\} \), each a W-dimensional sparse count vector, and the number of clusters M

output: Θ

for t = 1, 2, …:

(1) Generate \( Z_i^t\sim\mathcal{M}(1;\hat{Z}_{i1}^{t-1},\dots,\hat{Z}_{iM}^{t-1}) \) for \( i=1,\dots,N \)

(2) Generate the weight parameters \( P^t \) from Equation (15)

(3) Generate the model parameters \( \xi^t \) from Equation (51) using the M–H algorithm

M–H algorithm:

(1) Generate \( \tilde{\xi}_j \) from \( q(\tilde{\xi}_j|\xi_j^{t-1}) \) and \( u\sim\mathcal{U}_{[0,1]} \)

(2) Compute \( r=\frac{\pi(\tilde{\xi}_j|M,X)\,q(\xi_j^{t-1}|\tilde{\xi}_j)}{\pi(\xi_j^{t-1}|M,X)\,q(\tilde{\xi}_j|\xi_j^{t-1})} \)

(3) If \( u\leq r \), set \( \xi_j^t=\tilde{\xi}_j \); otherwise \( \xi_j^t=\xi_j^{t-1} \)

In Algorithm 5, \( \xi_j=(\alpha_{j1},\beta_{j1},\tau_{j1},\dots,\alpha_{jW},\beta_{jW},\tau_{jW}) \), and we use K-means and the method of moments (MOM) (Wong Citation2010) to initialize the model parameters α. We initialize β with a constant proportion vector and τ as a vector of ones. Choosing the proposal distribution is a significant part of the M–H steps (Sorensen and Gianola Citation2002; Train Citation2009). As the model parameters satisfy \( 0<\alpha_{jw}\leq 1 \) and \( 0<\beta_{jw}<1 \), we choose the Beta and Gamma distributions as the proposals for \( \alpha_{jw} \) and \( \beta_{jw} \), and the inverse Gamma distribution for τ:

(52) \( \alpha_{jw}\sim\mathcal{B}(\alpha,\sigma_1),\qquad \tau\sim\mathrm{invG}(\tau,\sigma_2),\qquad \beta\sim\mathcal{G}(\beta,\sigma_3) \)

The computational cost of one iteration of the algorithm is O(NK).

Learning Algorithm for Infinite Mixture Model of EMSSD

We found that using the prior (Equation (50)) and the posterior (Equation (51)) for the EMSSD parameters in the infinite mixture model yields superior performance in real applications, so we use them directly in the infinite model. The complete algorithm is as follows:

Algorithm 6 Infinite EMSSD (InfEMSSD) learning algorithm

Initialization: Using MOM to initialize model parameters

Input: a data set \( X=\{X_1,\dots,X_N\} \), each a W-dimensional sparse count vector

output: Θ

for t = 1, 2, …:

(1) Generate \( Z^t \) from Equation (23) with the Monte Carlo sampling approximation

(2) Update the number of represented components

(3) Generate the concentration parameter η from its conditional posterior in Equation (21) using adaptive rejection sampling (ARS)

(4) Generate the weight parameters \( P^t \) from \( \mathcal{D}(\eta/M+n_1,\dots,\eta/M+n_M) \)

(5) Update α, β, τ with the M–H algorithm

M–H algorithm:

(1) Generate \( \tilde{\xi}_j \) from \( q(\tilde{\xi}_j|\xi_j^{t-1}) \) and \( u\sim\mathcal{U}_{[0,1]} \)

(2) Compute \( r=\frac{\pi(\tilde{\xi}_j|M,X)\,q(\xi_j^{t-1}|\tilde{\xi}_j)}{\pi(\xi_j^{t-1}|M,X)\,q(\tilde{\xi}_j|\xi_j^{t-1})} \)

(3) If \( u\leq r \), set \( \xi_j^t=\tilde{\xi}_j \); otherwise \( \xi_j^t=\xi_j^{t-1} \)

Experimental Results

In this section, we compare the proposed algorithms with the corresponding finite mixture models learned in a deterministic way using the EM algorithm in different data clustering applications. The first and second experiments concentrate on textual data for sentiment analysis and fake news detection; the last one considers image data for distinguishing male and female faces. All experiments were conducted using optimized Python code on an Intel(R) Core(TM) i7-9750H PC running Windows 10 Enterprise Service Pack 1 with 16 GB of main memory. The results presented in the following subsections are averages over 20 runs of the proposed algorithms. The empirical assessment of MCMC convergence is delicate, especially in high-dimensional spaces; in our experiments we applied the widely used one-long-run technique proposed in Raftery and Lewis (Citation1992).

Text Sentiment Analysis

Sentiment analysis, also called opinion mining, involves analyzing evaluations, attitudes and emotions expressed in a piece of text toward entities such as products, services or movies (Batista and Ratté Citation2014). In our first experiment, we classify whether a whole opinion document expresses a positive or negative sentiment. The challenges of sentiment analysis as a text clustering application include that reviews are usually limited in length and contain many misspellings and shortened forms of words. Thus, the vocabulary size is immense, and the count vector that represents each review is highly sparse. The experiment used a large data set of IMDB movie reviews with two labels, negative and positive, and TripAdvisor hotel reviews with three labels: negative, neutral and positive. The experimental results are based on comparing recall, precision and F-measure values. We take 50,000 samples of IMDB reviews across the different labels, with 76,340 unique words in total, and 5,000 samples of TripAdvisor hotel reviews with 1,000 unique words. We compare the proposed algorithms with other methods, such as the EMGD mixture model (Zamzami and Bouguila Citation2022), the EMBL mixture model (Zamzami and Bouguila Citation2020a) and the EMSSD mixture model (Zamzami and Bouguila Citation2020b), which have been proposed for modeling count data.

The results are shown in Tables 1 and 2. According to the F-measure values in these tables, the proposed approaches outperform the other compared models and approaches, and the infinite models show better results than the finite mixture models.

Covid-19 Fake News Detection

This data set contains 947 tweets related to Covid-19 that have already been divided into two classes, one containing real news and the other containing fake news. In this experiment, we take all samples and select the 1,000 most frequently used unique words as count data.

As Table 3 shows, our proposed algorithms also perform very well on the fake news detection task. Compared with the other approaches and models, InfEMGD-MCMC yields the best accuracy of 87.45% and FinEMGD-MCMC also reaches 86.48%. Compared with the finite mixture models, our infinite mixture models show a higher accuracy rate.

Table 3. Experiment results for Covid-19 fake news detection.

Human Face Gender Recognition

In this experiment, we use two standard and challenging face recognition databases. The first is the AR face database, which has 4,000 color images corresponding to the faces of 126 people (70 men and 56 women); the images feature frontal-view faces with different facial expressions, illumination conditions and occlusions (sunglasses and scarf). The second is the Caltech faces database from the California Institute of Technology, which consists of 450 face images of around 27 unique people (both genders) with different lighting, expressions and backgrounds (sample images are shown in Figure 1). We apply the bag-of-features (BOF) approach to represent the image vectors, with SIFT used for feature extraction, treating local image patches as the visual equivalent of individual words.

Figure 1. Sample from face recognition database.

Figures 2 and 3 show that our proposed approaches permit good discrimination. The intraclass performance on the AR database using the proposed approaches is shown in Figure 3. We note that InfEMSSD-MCMC shows superior performance in distinguishing the women class (97%) from the men class (94%), and InfEMBL-MCMC achieves 96.01% on the Caltech data set, as we can see in Figure 2. Overall, all of our proposed models and algorithms achieve an accuracy above 85% in this application. Compared with the EM algorithm, our proposed MCMC algorithms show higher accuracy with the corresponding models.

Figure 2. Intraclass accuracy of the proposed approaches on the Caltech data.

Figure 3. Intraclass accuracy of the proposed approaches on the AR data.

Conclusion

In this paper, we have proposed novel approaches for finite mixtures of EMGD, EMBL and EMSSD based on the development of conjugate prior distributions and on the Monte Carlo simulation technique of Gibbs sampling combined with an M–H step. Generally, with the help of prior information and the stochastic aspect of the simulation in Gibbs sampling, our proposed algorithms ensure accurate model learning. Moreover, via a Bayesian nonparametric extension of these mixtures, we show that the problem of determining the number of clusters can be avoided by using infinite mixtures, which model the structure of the data well. Our proposed approaches and infinite models offer excellent modeling capabilities, as shown in the experimental part involving text sentiment analysis, fake news detection and human face gender recognition, compared to the widely used maximum likelihood approaches for high-dimensional count data. However, our modeling framework still has some drawbacks. First, the high computational complexity of the proposed inference leads to slow convergence. A promising future work could be replacing the classical M–H step by the scalable M–H algorithm proposed in Cornish et al. (Citation2019). This scheme is based on a combination of factorized acceptance probabilities, procedures for Bernoulli processes, and the control variate idea; it can reduce the computational complexity by discovering in advance the sampling points that may be rejected. Second, Gibbs sampling might take a long time to converge: when two or more mixture components have similar parameters, the Gibbs sampler can get stuck in a local mode, resulting in inaccurate clustering of data points. A possible solution that could be investigated is the split-merge Markov chain Monte Carlo procedure for the Dirichlet process described in Jain and Neal (Citation2004). Finally, the proposed approaches may be sensitive to the choice of the hyperparameter values. Potential future work could be devoted to developing an approach for the automatic selection of these values depending on the data to be modeled.

Acknowledgments

The authors would like to thank the associate editor and the reviewers for their comments and suggestions that helped to improve the paper.

Disclosure Statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the Natural Sciences and Engineering Research Council of Canada [6656-2017].

References

  • Alsuroji, R., N. Zamzami, and N. Bouguila. 2018. “Model selection and estimation of a finite shifted-scaled dirichlet mixture model.” In 17th IEEE International Conference on Machine Learning and Applications, ICMLA 2018, Orlando, FL, USA, December 17-20, edited by M. A. Wani, M. M. Kantardzic, M. S. Mouchaweh, J. Gama, and E. Lughofer, 707–2542. IEEE.
  • Amirkhani, M., N. Manouchehri, and N. Bouguila. 2021. Birth-death MCMC approach for multivariate beta mixture models in medical applications. In Advances and trends in artificial intelligence. artificial intelligence practices, ed. H. Fujita, A. Selamat, J. C.-W. Lin, and M. Ali, 285–96. Cham: Springer International Publishing.
  • Antoniak, C. E. 1974. Mixtures of dirichlet processes with applications to bayesian nonparametric problems. The Annals of Statistics 2 (6):1152–74. doi:10.1214/aos/1176342871.
  • Batista, L., and S. Ratté. 2014. Multi-classifier system for sentiment analysis and opinion mining. In Encyclopedia of social network analysis and mining, ed. R. Alhajj and J. Rokne, 989–98. New York: Springer New York.
  • Bouguila, N. 2008. Clustering of count data using generalized dirichlet multinomial distributions. IEEE Transactions on Knowledge and Data Engineering 20 (4):462–74. doi:10.1109/TKDE.2007.190726.
  • Bouguila, N. 2011. Count data modeling and classification using finite mixtures of distributions. IEEE Transactions on Neural Networks 22 (2):186–98. doi:10.1109/TNN.2010.2091428.
  • Bouguila, N., and W. Fan. 2020. Mixture models and applications. Springer.
  • Bouguila, N., and D. Ziou. 2007. Unsupervised learning of a finite discrete mixture: Applications to texture modeling and image databases summarization. Journal of Visual Communication and Image Representation 18 (4):295–309. doi:10.1016/j.jvcir.2007.02.005.
  • Bouguila, N., and D. Ziou. 2010. A Dirichlet process mixture of generalized Dirichlet distributions for proportional data modeling. IEEE Transactions on Neural Networks 21 (1):107–22. doi:10.1109/TNN.2009.2034851.
  • Bouguila, N., D. Ziou, and R. I. Hammoud. 2009. On Bayesian analysis of a finite generalized dirichlet mixture via a Metropolis-within-Gibbs sampling. Pattern Analysis and Applications 12 (2):151–66. doi:10.1007/s10044-008-0111-4.
  • Brooks, S. P. 2001. On Bayesian analyses and finite mixtures for proportions. Statistics and Computing 11 (2):179–90. doi:10.1023/A:1008983500916.
  • Brown, L. D. 1986. Fundamentals of statistical exponential families with applications in statistical decision theory. Lecture Notes-Monograph Series 9:100–279.
  • Cai, L. 2010. High-dimensional exploratory item factor analysis by a Metropolis–Hastings Robbins–Monro algorithm. Psychometrika 75 (1):33–57. doi:10.1007/s11336-009-9136-x.
  • Church, K. W., and W. A. Gale. 1995. Poisson mixtures. Natural Language Engineering 1 (2):163–90. doi:10.1017/S1351324900000139.
  • Cornish, R., P. Vanetti, A. Bouchard-Cote, G. Deligiannidis, and A. Doucet. 2019. “Scalable Metropolis–Hastings for exact Bayesian inference with large datasets.” In International Conference on Machine Learning, 1351–60.
  • DasGupta, A. 2011. The exponential family and statistical applications. In Probability for statistics and machine learning. edited by Anirban DasGupta, 583–612. New York, NY: Springer.
  • Dhillon, I. S., and D. S. Modha. 2001. Concept decompositions for large sparse text data using clustering. Machine Learning 42 (1/2):143–75. doi:10.1023/A:1007612920971.
  • Dias, J. G., and M. Wedel. 2004. An empirical comparison of EM, SEM and MCMC performance for problematic gaussian mixture likelihoods. Statistics and Computing 14 (4):323–32. doi:10.1023/B:STCO.0000039481.32211.5a.
  • Elkan, C. 2006. “Clustering documents with an exponential-family approximation of the dirichlet compound multinomial distribution.” In Proceedings of the 23rd International Conference on Machine Learning, ICML ‘06, New York, NY, USA, 289–96. Association for Computing Machinery.
  • Emanuel, L., and G. T. Herman. 1987. A maximum a posteriori probability expectation maximization algorithm for image reconstruction in emission tomography. IEEE Transactions on Medical Imaging 6 (3):185–92. doi:10.1109/TMI.1987.4307826.
  • Escobar, M. D., and M. West. 1995. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 90 (430):577–88. doi:10.1080/01621459.1995.10476550.
  • Favaro, S., and Y. Whye Teh. 2013. MCMC for normalized random measure mixture models. Statistical Science 28 (3):335–59. doi:10.1214/13-STS422.
  • Hartigan, J., and M. Wong. 1979. Algorithm AS 136: A K-Means Clustering Algorithm. Journal of the Royal Statistical Society 28 (1):100–08.
  • Jain, A. K. 2010. Data clustering: 50 years beyond K-means. Pattern Recognition Letters 31 (8):651–66. doi:10.1016/j.patrec.2009.09.011.
  • Jain, S., and R. Neal. 2004. A split-merge markov chain monte carlo procedure for the dirichlet process mixture model. Journal of Computational and Graphical Statistics 13 (1):158–82. doi:10.1198/1061860043001.
  • Jegou, H., M. Douze, and C. Schmid. 2009. On the burstiness of visual elements. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 1169–76.
  • Katz, S. M. 1996. Distribution of content words and phrases in text and language modelling. Natural Language Engineering 2 (1):15–59. doi:10.1017/S1351324996001246.
  • Korwar, R. M., and M. Hollander. 1973. Contributions to the theory of dirichlet processes. The Annals of Probability 1 (4):705–11. doi:10.1214/aop/1176996898.
  • Lee, P. M. 2012. Bayesian Statistics: An Introduction. Wiley.
  • MacEachern, S. N., and P. Muller. 1998. Estimating mixture of dirichlet process models. Journal of Computational and Graphical Statistics 7:223–38.
  • Madsen, R. E., D. Kauchak, and C. Elkan. 2005. “Modeling word burstiness using the Dirichlet distribution.” In Proceedings of the 22nd International Conference on Machine Learning, ICML ‘05, New York, NY, USA, 545–52. Association for Computing Machinery.
  • Margaritis, D., and S. Thrun. 2001. A bayesian multiresolution independence test for continuous variables. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, 346–53. Morgan Kaufmann Publishers Inc.
  • Najar, F., N. Zamzami, and N. Bouguila. 2019. “Fake news detection using bayesian inference.” In 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI), 389–94.
  • Najar, F., N. Zamzami, and N. Bouguila. 2020. “Recognition of human interactions in feature films based on infinite mixture of EDCM.” In 2020 International Symposium on Networks, Computers and Communications (ISNCC), 1–6.
  • Raftery, A. E., and S. M. Lewis. 1992. [Practical markov chain monte carlo]: comment: one long run with diagnostics: implementation strategies for markov chain monte Carlo. Statistical Science 7 (4):493–97. doi:10.1214/ss/1177011143.
  • Rasmussen, C. E. 1999. The infinite Gaussian mixture model. Advances in Neural Information Processing Systems 12:554–60.
  • Robert, C. P. 2007. The Bayesian choice. From decision-theoretic foundations to computational implementation. Springer.
  • Roeder, K., and L. Wasserman. 1997. Practical bayesian density estimation using mixture of normals. Journal of the American Statistical Association 92 (439):894–902. doi:10.1080/01621459.1997.10474044.
  • Samuel, K., N. Balakrishnan, and N. L. Johnson. 2000. Continuous multivariate distributions. New York: Wiley-Interscience.
  • Sorensen, D., and D. Gianola. 2002. Likelihood, Bayesian and MCMC methods in quantitative genetics. Springer.
  • Train, K. E. 2009. Discrete choice methods with simulation. 2nd ed. Cambridge University Press.
  • Tsionas, E. G. 2004. Bayesian inference for multivariate gamma distributions. Statistics and Computing 14 (3):223–33. doi:10.1023/B:STCO.0000035302.87186.be.
  • Wong, T.-T. 2010. Parameter estimation for generalized Dirichlet distributions from the sample estimates of the first and the second moments of random variables. Computational Statistics & Data Analysis 54 (7):1756–65. doi:10.1016/j.csda.2010.02.008.
  • Xuanbo, S., N. Bouguila, and N. Zamzami. 2021. “Covid-19 News Clustering using MCMC-Based Learing of finite EMSD Mixture Models.” In Proceedings of the Thirty-Fourth International Florida Artificial Intelligence Research Society Conference, North Miami Beach, Florida, USA, May 17-19, edited by E. Bell and F. Keshtkar.
  • Zamzami, N., and N. Bouguila. 2019. A novel scaled Dirichlet-based statistical framework for count data modeling: Unsupervised learning and exponential approximation. Pattern Recognition 95:36–47. doi:10.1016/j.patcog.2019.05.038.
  • Zamzami, N., and N. Bouguila. 2020a. High-dimensional count data clustering based on an exponential approximation to the multinomial Beta-Liouville distribution. Information Sciences 524:116–35. doi:10.1016/j.ins.2020.03.028.
  • Zamzami, N., and N. Bouguila. 2020b. Probabilistic modeling for frequency vectors using a flexible shifted-scaled dirichlet distribution prior. ACM Trans. Knowl. Discov. Data 14 (6):1–35. doi:10.1145/3406242.
  • Zamzami, N., and N. Bouguila. 2022. Sparse count data clustering using an exponential approximation to generalized dirichlet multinomial distributions. IEEE Transactions on Neural Networks and Learning Systems 33 (1):89–102. doi:10.1109/TNNLS.2020.3027539.

Appendix A

Proof of Equation (27)

(A1) \( \pi(\xi_j|M,X)\propto\pi(\xi_j)\prod_{Z_{ij}=1}\mathrm{EMGD}(X_i|\xi_j)=\exp\Big(\sum_{w=1}^{W}\rho_w\log\frac{\alpha_w\,\beta_w}{\alpha_w+\beta_w}\Big)\prod_{Z_{ij}=1}\Big\{n!\prod_{w:\,x_w\geq 1}\frac{1}{x_w}\,\frac{\Gamma(z_w)}{\Gamma(x_w+z_w)}\,\exp\Big(\sum_{w=1}^{W}I(x_w\geq 1)\log\frac{\alpha_w\,\beta_w}{\alpha_w+\beta_w}\Big)\Big\} \)

We can drop the factors that depend only on the data set X, because they have no effect on the computation of the acceptance ratio r in the M–H step:

(A2) \( \pi(\xi_j|M,X)\propto\pi(\xi_j)\prod_{Z_{ij}=1}\mathrm{EMGD}(X_i|\xi_j)\propto\exp\Big[\sum_{w=1}^{W}\log\frac{\alpha_w\,\beta_w}{\alpha_w+\beta_w}\Big(\rho_w+\sum_{Z_{ij}=1}^{N}I(x_{iw}\geq 1)\Big)\Big]\times\prod_{Z_{ij}=1}^{N}\Big(n!\prod_{w:\,x_{iw}\geq 1}\frac{1}{x_{iw}}\,\frac{\Gamma(z_{iw})}{\Gamma(x_{iw}+z_{iw})}\Big) \)

Appendix B

Proof of Equation (39)

(B1) \( \pi(\xi_j|M,X)\propto\pi(\xi_j)\prod_{Z_{ij}=1}\mathrm{EMBL}(X_i|\xi_j)=\exp\Big[\sum_{w=1}^{W}\rho_w\log(\alpha_w)+k\,\log\Big(\frac{\Gamma(S)\,\Gamma(\alpha)\,\Gamma(\beta)\,\alpha}{\Gamma(S+n)\,\Gamma(\alpha+\beta)}\Big)\Big]\times\prod_{Z_{ij}=1}\Big\{n!\prod_{d:\,x_d\geq 1}\frac{1}{x_d}\,\frac{\Gamma(S)\,\Gamma(\alpha)\,\Gamma(\beta)\,\alpha}{\Gamma(S+n)\,\Gamma(\alpha+\beta)}\,\exp\Big(\sum_{d=1}^{D}I(x_d\geq 1)\log(\alpha_d)\Big)\Big\}=\prod_{Z_{ij}=1}\Big(n!\prod_{d:\,x_d\geq 1}\frac{1}{x_d}\Big)\exp\Big[\sum_{w=1}^{W}\log(\alpha_{jw})\Big(\rho_w+\sum_{Z_{ij}=1}^{N}I(x_{iw}\geq 1)\Big)+k\,\log\Big(\frac{\Gamma(S)\,\Gamma(\alpha_j)\,\Gamma(\beta_j)\,\alpha_j}{\Gamma(S+n)\,\Gamma(\alpha_j+\beta_j)}\Big)\Big] \)

We remove the factors that depend only on the data set X.

Using the fact that:

(B2) \( \Gamma(S+n)=\Gamma(S)\,S\,(S+1)\cdots(S+n-1) \)

we obtain:

(B3) \( \pi(\xi_j|M,X)\propto\exp\Big[\sum_{w=1}^{W}\log(\alpha_{jw})\Big(\rho_w+\sum_{Z_{ij}=1}^{N}I(x_{iw}\geq 1)\Big)+k\,\log\Big(\frac{\Gamma(\alpha_j)\,\Gamma(\beta_j)\,\alpha_j}{S\,(S+1)\cdots(S+n-1)\,\Gamma(\alpha_j+\beta_j)}\Big)+\sum_{i:\,Z_{ij}=1}\log\Big(\frac{\Gamma(\alpha)\,\Gamma(\beta)\,\alpha}{S\,(S+1)\cdots(S+n-1)\,\Gamma(\alpha+\beta)}\Big)\Big] \)

Appendix C.

Conditional Posteriors of the Model Hyperparameters

For the EMGD, these conditional posteriors become:

(C1) \( p(\varepsilon|\cdot)\propto p(\varepsilon)\prod_{j=1}^{M}p(\alpha_j|\delta,\varepsilon)\propto\lambda\exp(-\lambda\varepsilon)\prod_{j=1}^{M}\frac{\Gamma(\delta+\varepsilon)}{\Gamma(\delta)\,\Gamma(\varepsilon)}\,\alpha_j^{\delta-1}(1-\alpha_j)^{\varepsilon-1} \)
(C2) \( p(\delta|\cdot)\propto p(\delta)\prod_{j=1}^{M}p(\alpha_j|\delta,\varepsilon)\propto\frac{\Gamma(\varsigma+\upsilon)}{\Gamma(\varsigma)\,\Gamma(\upsilon)}\,\delta^{\varsigma-1}(1-\delta)^{\upsilon-1}\prod_{j=1}^{M}\frac{\Gamma(\delta+\varepsilon)}{\Gamma(\delta)\,\Gamma(\varepsilon)}\,\alpha_j^{\delta-1}(1-\alpha_j)^{\varepsilon-1} \)
(C3) \( p(\rho|\cdot)\propto p(\rho)\prod_{j=1}^{M}p(\beta_j|\varpi,\rho)=\iota\exp(-\iota\rho)\prod_{j=1}^{M}\frac{\Gamma(\varpi+\rho)}{\Gamma(\varpi)\,\Gamma(\rho)}\,\beta_j^{\varpi-1}(1-\beta_j)^{\rho-1} \)
(C4) \( p(\varpi|\cdot)\propto p(\varpi)\prod_{j=1}^{M}p(\beta_j|\varpi,\rho)=\frac{\Gamma(\kappa+\omega)}{\Gamma(\kappa)\,\Gamma(\omega)}\,\varpi^{\kappa-1}(1-\varpi)^{\omega-1}\prod_{j=1}^{M}\frac{\Gamma(\varpi+\rho)}{\Gamma(\varpi)\,\Gamma(\rho)}\,\beta_j^{\varpi-1}(1-\beta_j)^{\rho-1} \)

For the EMBL, the forms of \( p(\varepsilon|\cdot) \) and \( p(\delta|\cdot) \) are the same as in Equations (C1) and (C2). For the conditional posteriors of ρ and ϖ, we have:

(C5) \( p(\rho|\cdot)\propto p(\rho)\prod_{j=1}^{M}p(\beta_j|\varpi,\rho)=\iota\exp(-\iota\rho)\prod_{j=1}^{M}\frac{\rho^{\varpi}\exp(-\rho/\beta_j)}{\Gamma(\varpi)\,\beta_j^{\varpi+1}} \)
(C6) \( p(\varpi|\cdot)\propto p(\varpi)\prod_{j=1}^{M}p(\beta_j|\varpi,\rho)=\frac{\omega^{\kappa}\exp(-\omega/\varpi)}{\Gamma(\kappa)\,\varpi^{\kappa+1}}\prod_{j=1}^{M}\frac{\rho^{\varpi}\exp(-\rho/\beta_j)}{\Gamma(\varpi)\,\beta_j^{\varpi+1}} \)