
Variable selection in finite mixture of median regression models using skew-normal distribution

Pages 30-48 | Received 18 Apr 2021, Accepted 25 Jul 2022, Published online: 06 Aug 2022

Abstract

A regression model with skew-normal errors provides a useful extension of traditional normal regression models when the data involve asymmetric outcomes. Moreover, data that arise from a heterogeneous population can be efficiently analysed by a finite mixture of regression models. These observations motivate us to propose a novel finite mixture of median regression model based on a mixture of skew-normal distributions to explore asymmetrical data from several subpopulations. With an appropriate choice of the tuning parameters, we establish the theoretical properties of the proposed procedure, including the consistency of the variable selection method and the oracle property of the estimators. A productive nonparametric clustering method is applied to select the number of components, and an efficient EM algorithm for the numerical computations is developed. Simulation studies and a real data set are used to illustrate the performance of the proposed methodologies.

1. Introduction

When the data involve asymmetrical outcomes, inference under a linear regression model with skewed random errors can be viewed as an alternative to the classical regression models with symmetric errors, since the use of a skewed distribution for the errors can reduce the influence of outliers and thus make the statistical analysis more robust. Specifically, suppose that a response variable Y given a set of predictors x takes the form (1) y = x⊤β + ϵ, where β represents a vector of unknown regression coefficients and the error term ϵ given x follows an unknown distribution with probability density function (pdf) g(ϵ∣x). It is known that if g(ϵ∣x) is symmetric about 0, the estimate of β in (1) coincides with the coefficients obtained by conventional mean linear regression. However, if g(ϵ∣x) is skewed, median regression provides a more reliable statistical analysis with adaptive robustness to outliers, since the median of a distribution is less susceptible to outliers, especially when the data involve asymmetrical outcomes. We refer interested readers to Kottas and Gelfand (Citation2001), Zhou and Liu (Citation2016) and Hu et al. (Citation2019) for relevant research on median regression of population distributions.

It is worth noting that median regression has been widely used for studying the relationship between a response variable Y and a set of predictors x under symmetric distributions, whereas such a median regression may not be suitable for analysing data that exhibit asymmetrical behaviour or that arise from a heterogeneous population. To tackle this difficulty, mixture of regression models (known as switching regression models in econometrics), initially introduced by Goldfeld and Quandt (Citation1973), may be employed as a flexible tool for studying skewed data from two or more subpopulations. Since then, finite mixture of regression (FMR) models have been widely used in a variety of fields, including but not limited to biology, medicine, economics, environmental science, sampling surveys and engineering technology. The book by McLachlan and Peel (Citation2004) contains a comprehensive review of FMR models. An FMR model is obtained when a response variable with a finite mixture distribution depends on a set of covariates, and FMR models have been discussed extensively under the assumption of normality for the regression error in each component.

However, it has been shown that the commonly used normal mixture model tends to overfit, since additional components are usually needed to capture the skewness of the data. To overcome the potential inappropriateness of normal mixtures in some contexts, we may consider the use of skew-normal distributions (Azzalini, Citation1985) as component densities for the errors; see, for example, Wu et al. (Citation2013), Wu (Citation2014), Tang and Tang (Citation2015), and H. Li et al. (Citation2016, Citation2017), to name just a few. These observations motivate us to develop a novel finite mixture of median regression (FMMeR) model based on a mixture of skew-normal distributions to explore asymmetrical data that arise from several subpopulations. Two barriers stand in the way of developing the FMMeR model. The first is the computational aspect of parameter estimation when fitting the FMMeR model with skew-normal errors; we tackle this barrier by utilizing the stochastic representation and the hierarchical representation (see, for example, Liu & Lin, Citation2014) of skew-normal mixtures. The second technical barrier is determining the number of components of the FMMeR model under consideration. The maximized log-likelihood and two information-based criteria, AIC (Akaike, Citation1973) and BIC (Schwarz, Citation1978), are commonly used to select the number of components. Although some success has been achieved with these model choice criteria, choosing the right number of components for a mixture model is known to be difficult. Thus, we consider a clustering procedure to determine the number of components, which has proved very effective in a real-data example; it is introduced in Subsection 5.3.

To enhance predictability and to give a concise model, it is reasonable to include only the significant covariates in the model. As a result, variable selection has become increasingly important for FMR models, and a rich literature has developed in recent decades. All-subset selection methods, such as AIC and BIC and their modifications, have been widely investigated in the context of FMR models; for instance, P. Wang et al. (Citation1996) studied model selection in a finite mixture of Poisson regression models via AIC and BIC. However, all-subset selection methods for FMR models are computationally intensive. To improve computational efficiency, the least absolute shrinkage and selection operator (LASSO) of Tibshirani (Citation1996) and the smoothly clipped absolute deviation (SCAD) method of Fan and Li (Citation2001) were proposed as variable selection methods. Penalized likelihood for FMR models, an extension of penalized least squares methods, was proposed by Khalili and Chen (Citation2007). Recently, Wu et al. (Citation2020) proposed an estimation and variable selection method for mixtures of joint mean and variance models, and Yin, Wu, and Dai (Citation2020) proposed variable selection procedures in FMR models using the skew-normal distribution.

The remainder of this paper is organized as follows. In Section 2, we briefly introduce the skew-normal distribution and its median expression. In Section 3, we develop a variable selection method for FMMeR model via the penalized likelihood-based procedure for analysing asymmetrical data from several subpopulations. Section 4 studies asymptotic properties of the resulting estimators. In Section 5, a numerical algorithm, a productive nonparametric clustering method for determining the number of components and a data-adaptive method for choosing tuning parameters are discussed. In Section 6, we carry out simulation studies to investigate the finite sample performance of the proposed methodology. A real-data example is provided in Section 7 for illustrative purposes. Some concluding remarks are given in Section 8. Brief proofs of theorems and some technical derivations are given in Appendices 1 and 2.

2. The skew-normal mixture of median regression models

2.1. Skew-normal distribution

A random variable Y is said to follow a univariate skew-normal distribution with location parameter μ, scale parameter σ∈(0,∞) and skewness parameter λ∈ℝ, denoted by Y∼SN(μ,σ²,λ), if its pdf is given by (2) f(y∣μ,σ²,λ) = (2/σ) ϕ((y−μ)/σ) Φ(λ(y−μ)/σ), where ϕ(⋅) and Φ(⋅) denote the pdf and cumulative distribution function (cdf) of the standard normal distribution, respectively. It is worth noting that if λ=0, the density of Y reduces to the normal density N(μ,σ²), and that the distribution is positively skewed if λ>0 and negatively skewed if λ<0.
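For concreteness, a minimal Python sketch of the density in (2) follows; it is an illustration only, and we cross-check it against scipy.stats.skewnorm, which uses the same Azzalini parameterization (shape a = λ).

```python
import numpy as np
from scipy.stats import norm, skewnorm

def sn_pdf(y, mu, sigma, lam):
    """Skew-normal pdf (2): (2/sigma) * phi(z) * Phi(lam*z), with z = (y-mu)/sigma."""
    z = (y - mu) / sigma
    return 2.0 / sigma * norm.pdf(z) * norm.cdf(lam * z)

y = np.linspace(-4.0, 6.0, 5)
print(sn_pdf(y, mu=1.0, sigma=2.0, lam=3.0))
print(skewnorm.pdf(y, a=3.0, loc=1.0, scale=2.0))  # agrees with the line above
```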

We represent the skew-normal distribution in an incomplete-data framework. Specifically, the stochastic representation for a random variable Yᵢ∼SN(μ,σ²,λ) is given by (3) Yᵢ = μ + σ(δ(λ)Rᵢ + √(1−δ²(λ)) Vᵢ),

where i=1,…,n for a sample of size n and δ(λ) = λ/√(1+λ²). Here Rᵢ∼TN(0,1)I{rᵢ>0} and Vᵢ∼N(0,1), with Rᵢ and Vᵢ independent. R∼TN(μ,σ²)I{a₁<r<a₂} denotes a truncated normal distribution with density f_R(r∣μ,σ²) = {Φ((a₂−μ)/σ) − Φ((a₁−μ)/σ)}⁻¹ × (1/(√(2π)σ)) exp{−(r−μ)²/(2σ²)}, a₁<r<a₂, where I{⋅} represents an indicator function. For notational simplicity, let Y=(Y₁,…,Yₙ)⊤ and R=(R₁,…,Rₙ)⊤. Furthermore, the skew-normal distribution can be decomposed into a normal distribution and a truncated normal distribution by the hierarchical representation (4) Yᵢ∣Rᵢ=rᵢ ∼ N(μ+σrᵢδ(λ), σ²(1−δ²(λ))), Rᵢ∼TN(0,1)I{rᵢ>0}. Azzalini and Capitanio (Citation2013) adopted the moment-generating function to calculate the mean and variance of the skew-normal distribution in (2): (5) E(Y) = μ + μ₀(λ)σ, Var(Y) = σ₀²(λ)σ², where μ₀(λ) ≔ √(2/π) δ(λ) and σ₀²(λ) ≔ 1 − μ₀²(λ). Of particular note is that Lin et al. (Citation2007) introduced a simple way of obtaining higher moments of the skew-normal distribution without the use of its moment-generating function. Letting m₀(λ) be the mode of the distribution SN(0,1,λ), a quite accurate approximation of m₀(λ), evaluated by numerical maximization, is given by m₀(λ) ≈ μ₀(λ) − t₀(λ)σ₀(λ)/2 − (sign(λ)/2) exp{−2π/|λ|}, where sign(λ) denotes the sign function of λ and t₀(λ) ≔ ((4−π)/2) μ₀³(λ)/σ₀³(λ). It deserves mentioning that the logarithm of the skew-normal density is a concave function, a property unaltered by changes of location and scale. Thus m₀(λ) is unique, and the mode of the skew-normal distribution in (2) can be expressed as Mode(Y) = μ + m₀(λ)σ. When the observations follow a skew-normal distribution, Mean(Y), Mode(Y) and Median(Y) satisfy the approximate relationship Median(Y) ≈ [Mode(Y) + 2 Mean(Y)]/3, that is, (6) Median(Y) ≈ μ + [m₀(λ) + 2μ₀(λ)]σ/3, which facilitates the development of the median regression with skew-normal mixtures discussed below.
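The stochastic representation (3) and the median approximation (6) are easy to verify by simulation; the following sketch (an illustration under the formulas above, not code from the paper; it assumes λ ≠ 0 in the mode approximation) draws from the representation and compares the empirical median with (6).

```python
import numpy as np

rng = np.random.default_rng(0)

def rsn(n, mu, sigma, lam, rng):
    """Draw Y = mu + sigma*(delta*R + sqrt(1-delta^2)*V) as in (3); R = |N(0,1)|."""
    delta = lam / np.sqrt(1.0 + lam**2)
    r = np.abs(rng.standard_normal(n))   # TN(0,1) truncated to (0, inf)
    v = rng.standard_normal(n)
    return mu + sigma * (delta * r + np.sqrt(1.0 - delta**2) * v)

def sn_median_approx(mu, sigma, lam):
    """Median(Y) ~ mu + [m0(lam) + 2*mu0(lam)]*sigma/3 as in (6); requires lam != 0."""
    delta = lam / np.sqrt(1.0 + lam**2)
    mu0 = np.sqrt(2.0 / np.pi) * delta
    sig0 = np.sqrt(1.0 - mu0**2)
    t0 = (4.0 - np.pi) / 2.0 * mu0**3 / sig0**3
    m0 = mu0 - t0 * sig0 / 2.0 - np.sign(lam) / 2.0 * np.exp(-2.0 * np.pi / abs(lam))
    return mu + (m0 + 2.0 * mu0) * sigma / 3.0

y = rsn(200_000, mu=1.0, sigma=2.0, lam=3.0, rng=rng)
print(np.median(y), sn_median_approx(1.0, 2.0, 3.0))  # the two values are close
```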

2.2. Median regression for skew-normal mixtures

In this paper, we assume that the response variable Yᵢ follows a skew-normal distribution with location parameter μᵢ, scale parameter σ and skewness parameter λ, denoted by Yᵢ∼SN(μᵢ,σ²,λ), i=1,…,n. A linear median regression model with skew-normal errors can be expressed as (7) yᵢ = xᵢ⊤β + ϵᵢ, where Median(Yᵢ∣X) = xᵢ⊤β = μᵢ + [m₀(λ)+2μ₀(λ)]σ/3, as defined by (6). Here X=(x₁,…,xₙ) is a p×n design matrix whose i-th column xᵢ=(x_{i1},…,x_{ip})⊤ is the p-dimensional vector of predictors, β=(β₁,…,β_p)⊤ is the p-dimensional vector of unknown regression coefficients, and ϵ=(ϵ₁,…,ϵₙ)⊤ stands for the n-dimensional vector of random errors with ϵᵢ ∼iid SN(−[m₀(λ)+2μ₀(λ)]σ/3, σ², λ), so that Median(ϵᵢ)=0.

We consider the case where the data come from heterogeneous populations. A finite mixture median regression (FMMeR) model with m components of skew-normal distributions is defined as (8) f(yᵢ∣Ψ) = ∑_{j=1}^{m} νⱼ SN(yᵢ∣μᵢⱼ, σⱼ², λⱼ), Median(yᵢⱼ) = xᵢ⊤βⱼ, i=1,…,n, j=1,…,m, where SN(yᵢ∣μᵢⱼ,σⱼ²,λⱼ) = (2/σⱼ) ϕ((yᵢ−μᵢⱼ)/σⱼ) Φ(λⱼ(yᵢ−μᵢⱼ)/σⱼ), ν=(ν₁,…,ν_m)⊤ are the mixing proportions, constrained to be non-negative and to sum to unity, βⱼ=(βⱼ₁,…,βⱼ_p)⊤ and Ψ=(ν₁,…,ν_{m−1},β₁,…,β_m,σ₁,…,σ_m,λ₁,…,λ_m). It follows that (9) μᵢⱼ = xᵢ⊤βⱼ − (σⱼ/3)[m₀(λⱼ) + 2√(2/π) δ(λⱼ)], which shows that the location in the FMMeR model changes with the scale and skewness parameters.
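As an illustration of (8)–(9), a short sketch of the FMMeR mixture density follows; it simply recovers each component location μᵢⱼ from the median relation (9) and sums the weighted skew-normal kernels (illustrative code, not from the paper; λⱼ ≠ 0 is assumed in the mode approximation).

```python
import numpy as np
from scipy.stats import norm

def mu0(lam):
    """mu0(lam) = sqrt(2/pi) * delta(lam)."""
    return np.sqrt(2.0 / np.pi) * lam / np.sqrt(1.0 + lam**2)

def m0(lam):
    """Approximate mode of SN(0,1,lam); requires lam != 0."""
    s0 = np.sqrt(1.0 - mu0(lam)**2)
    t0 = (4.0 - np.pi) / 2.0 * mu0(lam)**3 / s0**3
    return mu0(lam) - t0 * s0 / 2.0 - np.sign(lam) / 2.0 * np.exp(-2.0 * np.pi / abs(lam))

def fmmer_density(y, x, nus, betas, sigmas, lams):
    """f(y|Psi) in (8), with mu_j(x) = x'beta_j - (sigma_j/3)[m0 + 2*mu0] as in (9)."""
    dens = 0.0
    for nu, beta, sig, lam in zip(nus, betas, sigmas, lams):
        mu = x @ beta - sig / 3.0 * (m0(lam) + 2.0 * mu0(lam))
        z = (y - mu) / sig
        dens += nu * 2.0 / sig * norm.pdf(z) * norm.cdf(lam * z)
    return dens
```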

2.3. Identifiability

An important issue in statistical inference for FMR models is identifiability. It is well known that mixture models are not identifiable in full generality; however, in some mixture model settings a weaker sense of identifiability can be established. Titterington et al. (Citation1985) concluded that FMR models with continuous component distributions are identifiable in most cases, and Otiniano et al. (Citation2015) introduced the identifiability of finite mixtures of skew-normal distributions and gave a detailed explanation. Let F_Y denote the cumulative distribution function of Y. Define the skew-normal family as the set F = {F : F_Y(y∣μ,σ²,λ) = ∫_{−∞}^{y} f(t∣μ,σ²,λ) dt} and H = {H : H(y∣Ψ) = ∑_{j=1}^{m} νⱼ Fⱼ(y∣μᵢⱼ,σⱼ²,λⱼ); Fⱼ(y∣μᵢⱼ,σⱼ²,λⱼ)∈F, j=1,…,m} as the class of finite mixtures of skew-normal distributions. The class H of all finite mixtures of F is identifiable if and only if, for any H, H̄∈H with H = ∑_{j=1}^{m} νⱼFⱼ and H̄ = ∑_{j=1}^{m̄} ν̄ⱼF̄ⱼ, the equality H = H̄ implies m = m̄ and that (ν₁,F₁),…,(ν_m,F_m) is a permutation of (ν̄₁,F̄₁),…,(ν̄_m,F̄_m). The following theorem, given by Atienza et al. (Citation2006), provides a sufficient condition for the identifiability of finite mixtures of distributions. Below, A′ denotes the accumulation set of a set A.

Theorem 2.1 (Atienza et al., Citation2006)

Let F be a family of distributions. Let M be a linear mapping which transforms any F∈F into a real function φ_F with domain S_φ(F). Let S₀(F) = {k∈S_φ(F) : φ_F(k)≠0}. Suppose that there exists a total order ≺ on F such that for any F∈F there exists a point k(F) verifying:

  1. if F₁,F₂,…,F_l∈F with F₁≺Fⱼ for 2≤j≤l, then k(F₁) ∈ [S₀(F₁) ∩ (∩_{j=2}^{l} S_φ(Fⱼ))]′;

  2. if F₁≺F₂, then lim_{k→k(F₁)} φ_{F₂}(k)/φ_{F₁}(k) = 0.

Then, the class H of all finite mixture distributions of F is identifiable.

3. The method for variable selection

Various classical variable selection criteria can be viewed as trade-offs between estimation variance and modelling bias in a penalized likelihood. When x is random, we assume that the density f(x) is functionally independent of the parameters of the FMMeR model; hence variable selection can be based solely on the conditional density specified in (8). Denote by {(xᵢ,yᵢ)}ᵢ₌₁ⁿ a sample of observations from the FMMeR model specified in (8). The log-likelihood function of Ψ is given by ℓ(Ψ) = ∑ᵢ₌₁ⁿ log ∑ⱼ₌₁ᵐ νⱼ SN(yᵢ∣μᵢⱼ,σⱼ²,λⱼ). A maximum likelihood estimate (MLE) is obtained by maximizing ℓ(Ψ). When a component of x is not important, its MLE is often close to, but not strictly equal to, 0, and so the covariate is not excluded from the model. To address this problem, following Khalili and Chen (Citation2007), we define a penalized log-likelihood function (10) L(Ψ) = ℓ(Ψ) − p(Ψ), with the penalty function p(Ψ) = n ∑ⱼ₌₁ᵐ νⱼ ∑ₜ₌₁ᵖ p_{τⱼ}(|βⱼₜ|), where p_{τⱼ}(⋅) is a given penalty function with tuning parameter τⱼ≥0 (j=1,…,m); the tuning parameters and the penalty functions are not necessarily the same for all the parameters. A data-driven criterion for determining the tuning parameters is introduced in Subsection 5.2. By choosing appropriate tuning parameters and maximizing L(Ψ) in (10) to obtain the penalized maximum likelihood estimator of Ψ, denoted by Ψˆ, the coefficients in the vicinity of 0 are compressed to 0 and automatically excluded. Thus, the procedure combines parameter estimation with variable selection and reduces the computational burden substantially. We use the following three penalty functions to illustrate the theory developed for the FMMeR model:
LASSO penalty: p_{τⱼ}(|βⱼₜ|) = τⱼ|βⱼₜ|;
HARD penalty: p_{τⱼ}(|βⱼₜ|) = τⱼ² − (|βⱼₜ| − τⱼ)² I(|βⱼₜ| < τⱼ);
SCAD penalty: p′_{τⱼ}(|βⱼₜ|) = τⱼ{I(|βⱼₜ| ≤ τⱼ) + [(aτⱼ − |βⱼₜ|)₊ / ((a−1)τⱼ)] I(|βⱼₜ| > τⱼ)},
where the SCAD penalty is specified through its first derivative and (⋅)₊ denotes the positive part. Following the idea of Fan and Li (Citation2001), we set a = 3.7 for application purposes in this article. The LASSO penalty performs well in numerical computation because of its convexity. The SCAD penalty performs well at selecting important variables. The HARD penalty should work much like SCAD, although less smoothly.
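The three penalties are straightforward to code; the sketch below (illustrative only, with the SCAD line implementing the first derivative as printed above and a = 3.7) can be reused in the tuning and estimation steps of Section 5.

```python
import numpy as np

def lasso_pen(b, tau):
    """LASSO penalty: tau * |b|."""
    return tau * np.abs(b)

def hard_pen(b, tau):
    """HARD penalty: tau^2 - (|b| - tau)^2 on |b| < tau, flat (= tau^2) beyond."""
    ab = np.abs(b)
    return tau**2 - (ab - tau)**2 * (ab < tau)

def scad_pen_deriv(b, tau, a=3.7):
    """First derivative of the SCAD penalty, p'_tau(|b|)."""
    ab = np.abs(b)
    return tau * ((ab <= tau) + np.maximum(a * tau - ab, 0.0) / ((a - 1.0) * tau) * (ab > tau))

print(lasso_pen(0.8, 0.5), hard_pen(0.8, 0.5), scad_pen_deriv(0.8, 0.5))
```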

4. Asymptotic properties

In this section, we consider the consistency of the variable selection method and the oracle property of the estimators. Without loss of generality, the coefficient vector βⱼ (j=1,…,m) of the j-th component is decomposed as βⱼ = (β₁ⱼ⊤, β₂ⱼ⊤)⊤, where β₁ⱼ and β₂ⱼ contain the nonzero effects and zero effects of βⱼ, respectively. Naturally, we also split the parameter as Ψ = (Ψ₁, Ψ₂) such that Ψ₂ contains all the zero effects, that is, the β₂ⱼ in the true model. The vector of true parameters is denoted by Ψ⁰; its components are written with a superscript, namely Ψ⁰ = (ν₁⁰,…,ν_{m−1}⁰, β₁⁰,…,β_m⁰, σ₁⁰,…,σ_m⁰, λ₁⁰,…,λ_m⁰), where βⱼₜ⁰ is the t-th component of βⱼ⁰. Let dⱼ denote the number of nonzero elements βⱼₜ⁰ of the subvector β₁ⱼ⁰ for each j. Let aₙ = max_{j,t} {p′_{τⱼ}(βⱼₜ⁰) : βⱼₜ⁰ ≠ 0} and bₙ = max_{j,t} {p″_{τⱼ}(βⱼₜ⁰) : βⱼₜ⁰ ≠ 0}, where 1≤t≤dⱼ, 1≤j≤m, and p′_{τⱼ}(βⱼₜ⁰) and p″_{τⱼ}(βⱼₜ⁰) are the first and second derivatives of the penalty function p_{τⱼ}(⋅) evaluated at βⱼₜ⁰. The asymptotic results obtained in this article are based on three conditions on the penalty functions p_{τⱼ}(⋅).

C0:

For all j, p_{τⱼ}(0)=0, and p_{τⱼ}(β) is symmetric and non-negative. Furthermore, it is nondecreasing and twice differentiable for all β∈(0,∞) with at most a few exceptions.

C1:

As n→∞, bₙ = o(1).

C2:

For all j and Tₙ = {β : 0 < β ≤ n^{−1/2} log n}, lim_{n→∞} inf_{β∈Tₙ} √n p′_{τⱼ}(β) = ∞.

Condition C1 is used to establish the asymptotic properties of the estimators of the nonzero effects. Conditions C0 and C2 are required for sparsity.

Theorem 4.1 (Consistency)

Let hᵢ=(xᵢ,Yᵢ), i=1,…,n, be a random sample from a density function f(h∣Ψ) that satisfies the regularity conditions R1–R4 in Appendix 1, and assume that the penalty functions p_{τⱼ}(⋅) satisfy conditions C0 and C1. Then there exists a local maximizer Ψˆ of the penalized log-likelihood function L(Ψ) for which ‖Ψˆ − Ψ⁰‖ = O_p{n^{−1/2}(1 + aₙ)}, where ‖⋅‖ represents the Euclidean norm.

Theorem 4.2 (Oracle property)

Assume that the conditions given in Theorem 4.1 are fulfilled, the penalty functions p_{τⱼ}(⋅) satisfy conditions C0–C2, and m is known in parts (a) and (b). We then have the following.

  1. For any Ψ such that ‖Ψ − Ψ⁰‖ = O_p(n^{−1/2}), with probability tending to 1, L(Ψ₁,Ψ₂) − L(Ψ₁,0) < 0.

  2. For any √n-consistent maximum penalized likelihood estimator Ψˆ of Ψ,

    1. sparsity: P(βˆ₂ⱼ = 0) → 1, j=1,…,m, as n→∞;

    2. asymptotic normality: √n{[I₁(Ψ₁⁰) + p″(Ψ₁⁰)/n](Ψˆ₁ − Ψ₁⁰) + p′(Ψ₁⁰)/n} →d N(0, I₁(Ψ₁⁰)), where I₁(Ψ₁⁰) is the Fisher information computed under the true model with all zero effects removed.

Brief proofs of the theorems are given in Appendix 1; detailed proofs can be found in the literature (see, for example, Fan & Li, Citation2001; Khalili & Chen, Citation2007; Yin, Wu, & Dai, Citation2020).

5. Numerical computations

5.1. Maximization algorithm

In general, due to the unboundedness of the likelihood function, the maximum likelihood estimator of a mixture distribution is often inconsistent in the context of finite mixture models. An alternative is to add a regularization term that prevents the likelihood from tending to infinity, yielding a consistent maximum penalized likelihood estimator; see, for example, Chen and Tan (Citation2009), Chen (Citation2017), and the recent works of Chen et al. (Citation2020) and He and Chen (Citation2022a, Citation2022b). McLachlan and Peel (Citation2004) showed that the EM algorithm can compute the maximum likelihood estimate of essentially arbitrary component distributions in finite mixture models. We maximize the regularized log-likelihood function by the EM algorithm. Define the latent component indicators Z=(Z₁,…,Zₙ) with Zᵢ=(z_{i1},…,z_{im}), i=1,…,n. Then Zᵢ is an m-dimensional indicator vector whose j-th element is z_{ij} = 1 if (xᵢ,yᵢ) belongs to the j-th component and z_{ij} = 0 otherwise. Since an observation cannot belong to two components simultaneously, we have ∑ⱼ₌₁ᵐ z_{ij} = 1. By assuming the component indicators Z₁,…,Zₙ to be independent, we obtain the conditional density of the multinomial distribution given the mixing probabilities, (11) f(zᵢ∣ν) = ν₁^{z_{i1}} ν₂^{z_{i2}} ⋯ ν_{m−1}^{z_{i,m−1}} (1 − ∑ⱼ₌₁^{m−1} νⱼ)^{z_{im}}, denoted by Zᵢ∼M(1;ν₁,…,ν_m); it is used in combination with (3) to generate the following hierarchical representation for the skew-normal mixtures: (12) Yᵢ∣(rᵢ, z_{ij}=1) ∼ N(μᵢⱼ + σⱼrᵢδ(λⱼ), σⱼ²(1−δ²(λⱼ))), Rᵢ∣z_{ij}=1 ∼ TN(0,1)I(rᵢ>0), Zᵢ∼M(1;ν₁,ν₂,…,ν_m). It deserves mentioning that the hierarchical representation of the finite skew-normal mixtures in (12) allows us to address the computational barriers of parameter estimation when fitting the FMMeR model. Let Y_obs = {yᵢ}ᵢ₌₁ⁿ be the observed data. For each Yᵢ=yᵢ, we use the latent variables Zᵢ and Rᵢ to form the complete data Y_com = Y_obs ∪ Y_mis = {yᵢ, z_{ij}, rᵢ}, where Y_mis denotes the missing data. From the hierarchical representation (12), the complete-data log-likelihood function, up to an additive constant not depending on Ψ, is (13) ℓ_c(Ψ) = ∑ᵢ₌₁ⁿ ∑ⱼ₌₁ᵐ z_{ij}{log νⱼ − ½log(2πσⱼ²) − ½log(1−δ²(λⱼ)) − [e_{ij}² − 2σⱼe_{ij}δ(λⱼ)rᵢ + σⱼ²rᵢ²δ²(λⱼ)] / [2σⱼ²(1−δ²(λⱼ))]}. Similar to the approach in Fan and Li (Citation2001), p(Ψ) is replaced by the following local quadratic function given the value Ψ⁽⁰⁾: p(Ψ) ≈ p̃(Ψ) = p(Ψ⁽⁰⁾) + [p′(Ψ⁽⁰⁾)/(2Ψ⁽⁰⁾)](Ψ² − (Ψ⁽⁰⁾)²) = n ∑ⱼ₌₁ᵐ νⱼ ∑ₜ₌₁ᵖ [p_{τⱼ}(|βⱼₜ⁽⁰⁾|) + (p′_{τⱼ}(|βⱼₜ⁽⁰⁾|)/(2|βⱼₜ⁽⁰⁾|))(βⱼₜ² − βⱼₜ⁽⁰⁾²)]. This approximation is used in the M-step of the EM algorithm in each iteration. The complete-data penalized log-likelihood function corresponding to (10) is (14) L_c(Ψ) = ℓ_c(Ψ) − p(Ψ).

• E-step. The E-step computes the conditional expectation of L_c(Ψ) with respect to the latent variables. Given the observed data {xᵢ,yᵢ}ᵢ₌₁ⁿ from the FMMeR model (8), let Ψ⁽ᵏ⁾ denote the parameter estimate at the k-th iteration and let θ=(β,σ,λ). The surrogate function can be constructed as (15) Q(Ψ∣Ψ⁽ᵏ⁾) = Q₁(ν∣Ψ⁽ᵏ⁾) + Q₂(θ∣Ψ⁽ᵏ⁾) − p(Ψ∣Ψ⁽ᵏ⁾), where Q₁(ν∣Ψ⁽ᵏ⁾) = ∑ᵢ₌₁ⁿ ∑ⱼ₌₁ᵐ ω_{ij}⁽ᵏ⁾ log νⱼ and Q₂(θ∣Ψ⁽ᵏ⁾) = ∑ᵢ₌₁ⁿ ∑ⱼ₌₁ᵐ ω_{ij}⁽ᵏ⁾ [−½log(2πσⱼ²) − ½log(1−δ²(λⱼ)) − (e_{ij}² − 2σⱼe_{ij}δ(λⱼ)r₁ᵢ⁽ᵏ⁾ + σⱼ²δ²(λⱼ)r₂ᵢ⁽ᵏ⁾)/(2σⱼ²(1−δ²(λⱼ)))]. The required conditional expectations are obtained as follows. First, the conditional expectation ω_{ij}⁽ᵏ⁾ = E_{Ψ⁽ᵏ⁾}(z_{ij}∣yᵢ,xᵢ) is given by (16) ω_{ij}⁽ᵏ⁾ = νⱼ⁽ᵏ⁾ SN(yᵢ; μᵢⱼ⁽ᵏ⁾, σⱼ²⁽ᵏ⁾, λⱼ⁽ᵏ⁾) / ∑ⱼ₌₁ᵐ νⱼ⁽ᵏ⁾ SN(yᵢ; μᵢⱼ⁽ᵏ⁾, σⱼ²⁽ᵏ⁾, λⱼ⁽ᵏ⁾). Then it can easily be shown that r₁ᵢ⁽ᵏ⁾ = E(Rᵢ∣yᵢ,xᵢ,z_{ij}=1,Ψ⁽ᵏ⁾) = e_{ij}⁽ᵏ⁾δ(λⱼ⁽ᵏ⁾)/σⱼ⁽ᵏ⁾ + [δ(λⱼ⁽ᵏ⁾)/λⱼ⁽ᵏ⁾] ϕ(γ_{ij}⁽ᵏ⁾)/Φ(γ_{ij}⁽ᵏ⁾) and r₂ᵢ⁽ᵏ⁾ = E(Rᵢ²∣yᵢ,xᵢ,z_{ij}=1,Ψ⁽ᵏ⁾) = 1/(1+λⱼ⁽ᵏ⁾²) + [e_{ij}⁽ᵏ⁾δ(λⱼ⁽ᵏ⁾)/σⱼ⁽ᵏ⁾] r₁ᵢ⁽ᵏ⁾, with γ_{ij}⁽ᵏ⁾ = λⱼ⁽ᵏ⁾(yᵢ − μᵢⱼ⁽ᵏ⁾)/σⱼ⁽ᵏ⁾ = λⱼ⁽ᵏ⁾e_{ij}⁽ᵏ⁾/σⱼ⁽ᵏ⁾ and μᵢⱼ⁽ᵏ⁾ = xᵢ⊤βⱼ⁽ᵏ⁾ − (σⱼ⁽ᵏ⁾/3)[m₀(λⱼ⁽ᵏ⁾) + 2√(2/π)δ(λⱼ⁽ᵏ⁾)].
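A compact sketch of these E-step quantities follows (illustrative code consistent with (16) and the truncated-normal moments above; `mu_fn` and `sn_pdf` are assumed helpers such as those sketched in Sections 2–3, and λⱼ ≠ 0 is assumed).

```python
import numpy as np
from scipy.stats import norm

def e_step(y, X, nus, betas, sigmas, lams, mu_fn, sn_pdf):
    """Return responsibilities omega_ij (16) and the moments r1_i, r2_i per component."""
    m = len(nus)
    dens = np.stack([nus[j] * sn_pdf(y, mu_fn(X, betas[j], sigmas[j], lams[j]),
                                     sigmas[j], lams[j]) for j in range(m)])
    w = dens / dens.sum(axis=0)                      # omega_ij in (16)
    r1, r2 = [], []
    for j in range(m):
        lam, sig = lams[j], sigmas[j]
        delta = lam / np.sqrt(1.0 + lam**2)
        e = y - mu_fn(X, betas[j], sig, lam)         # residual e_ij = y_i - mu_ij
        gam = lam * e / sig                          # gamma_ij
        r1j = e * delta / sig + (delta / lam) * norm.pdf(gam) / norm.cdf(gam)
        r2j = 1.0 / (1.0 + lam**2) + (e * delta / sig) * r1j
        r1.append(r1j)
        r2.append(r2j)
    return w, np.stack(r1), np.stack(r2)
```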
• M-step. The M-step calculates the parameter vector Ψ⁽ᵏ⁺¹⁾ by maximizing Q(Ψ;Ψ⁽ᵏ⁾) with respect to Ψ. Thus, on the (k+1)-th iteration of the EM algorithm, the mixing proportions are updated by (17) νⱼ⁽ᵏ⁺¹⁾ = (1/n) ∑ᵢ₌₁ⁿ ω_{ij}⁽ᵏ⁾, j=1,…,m. It is worth noting that modelling of the mixing proportions should be considered in mixture-of-experts regression models; see Yin, Wu, Lu, et al. (Citation2020). To improve the efficiency of selecting the number of components in the real-data analysis of this article, we first apply a clustering method to determine the optimal number of components (Subsection 5.3). By maximizing Q(Ψ;Ψ⁽ᵏ⁾) with respect to Ψ without νⱼ, namely maximizing Q₂(θ;Ψ⁽ᵏ⁾), we can compute θⱼ⁽ᵏ⁺¹⁾. To obtain the parameter estimates of the FMMeR model without penalty, start from an initial value θ⁽⁰⁾ and, with k denoting the current iteration, update (18) θ⁽ᵏ⁺¹⁾ = θ⁽ᵏ⁾ + [H(θ⁽ᵏ⁾)]⁻¹ S(θ⁽ᵏ⁾), where S(θ) = ∂Q₂(θ;Ψ⁽ᵏ⁾)/∂θ = [S(β)⊤, S(σ)⊤, S(λ)⊤]⊤ is the score function without penalty and H(θ⁽ᵏ⁾) is the observed information matrix defined as H(θ) = −∂²Q₂(θ;Ψ⁽ᵏ⁾)/∂θ∂θ⊤. A detailed derivation can be found in Appendix 2. We iterate between the E-step and the M-step until the algorithm converges, and the estimators βⱼ⁽⁰⁾, σⱼ⁽⁰⁾, λⱼ⁽⁰⁾ are obtained.

In order to find the non-significant variables and simplify the FMMeR model, we shrink the coefficients by the penalty function. Taking βⱼ⁽⁰⁾ as the initial value of the iteration and k as the current iteration, we update βⱼ⁽ᵏ⁺¹⁾ = βⱼ⁽ᵏ⁾ + [∂²Q₂(θ;Ψ⁽ᵏ⁾)/∂βⱼ∂βⱼ⊤ − nΔ_τ(βⱼ⁽ᵏ⁾)]⁻¹ [nΔ_τ(βⱼ⁽ᵏ⁾)βⱼ⁽ᵏ⁾ − ∂Q₂(θ;Ψ⁽ᵏ⁾)/∂βⱼ], with Δ_τ(βⱼ⁽ᵏ⁾) = diag{p′_{τⱼ}(|βⱼ₁⁽ᵏ⁾|)/|βⱼ₁⁽ᵏ⁾|, p′_{τⱼ}(|βⱼ₂⁽ᵏ⁾|)/|βⱼ₂⁽ᵏ⁾|, …, p′_{τⱼ}(|βⱼ_p⁽ᵏ⁾|)/|βⱼ_p⁽ᵏ⁾|}.
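The update above is a Newton-type step in which the local quadratic approximation turns the penalty into a ridge-type term Δ_τ; a minimal sketch follows (illustrative, with a small guard against division by near-zero coefficients; `hess` denotes ∂²Q₂/∂βⱼ∂βⱼ⊤ and `grad` denotes ∂Q₂/∂βⱼ). Coefficients driven to numerical zero by this iteration are then removed from the model.

```python
import numpy as np

def update_beta(beta, grad, hess, pen_deriv, n, eps=1e-8):
    """One penalized step: beta + [hess - n*Delta]^{-1} (n*Delta*beta - grad)."""
    ab = np.maximum(np.abs(beta), eps)               # guard |beta_t| ~ 0
    delta = np.diag(pen_deriv(np.abs(beta)) / ab)    # Delta_tau(beta)
    A = hess - n * delta
    return beta + np.linalg.solve(A, n * delta @ beta - grad)
```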

5.2. Choice of the tuning parameters

The degree of penalization is controlled by the tuning parameters, which must be chosen when using the method introduced in this article. Various selection criteria, including cross-validation (CV), generalized cross-validation (GCV), the Akaike information criterion (AIC) (Akaike, Citation1973) and the Bayesian information criterion (BIC) (Schwarz, Citation1978), are often used for choosing tuning parameters. GCV has a non-negligible overfitting effect on the final selected model. H. Wang et al. (Citation2007) suggested using BIC for the SCAD estimator in linear models and partially linear models and proved the consistency of this selection method, that is, the optimal tuning parameter chosen by BIC identifies the true model with probability tending to 1. Let Ψₙ be the maximizer of the log-likelihood function; we use Ψₙ to calculate the mixing proportions in (17), which remain fixed throughout the tuning-parameter selection process. For a given value of the tuning parameter τⱼ, let (βˆⱼ, σˆⱼ, λˆⱼ) be the maximum regularized likelihood estimates of the parameters in the j-th component of the FMMeR model, fixing the remaining elements of Ψ at Ψₙ. Denote the likelihood-based deviance statistic corresponding to the j-th component of the FMMeR model, evaluated at (βˆⱼ, σˆⱼ, λˆⱼ), by Dⱼ(βˆⱼ,σˆⱼ,λˆⱼ) = ∑ᵢ₌₁ⁿ ωᵢⱼ[log SN(yᵢ∣μ̃ᵢⱼ, σˆⱼ², λˆⱼ) − log SN(yᵢ∣μˆᵢⱼ, σˆⱼ², λˆⱼ)], where μ̃ᵢⱼ and μˆᵢⱼ are obtained from (9) with the median taken to be yᵢ (the saturated fit) and xᵢ⊤βˆⱼ, respectively, so that μˆᵢⱼ = xᵢ⊤βˆⱼ − (σˆⱼ/3)[m₀(λˆⱼ) + 2√(2/π)δ(λˆⱼ)], and the weights ωᵢⱼ are given in (16). Then we define BIC(τⱼ) = 2Dⱼ(βˆⱼ,σˆⱼ,λˆⱼ) + N(τⱼ) log(nⱼ), j=1,…,m, where N(τⱼ) is the number of nonzero elements of the vector βˆⱼ and nⱼ = ∑ᵢ₌₁ⁿ ωᵢⱼ. The choice of τⱼₜ should be such that the tuning parameter for a zero coefficient is larger than that for a nonzero coefficient; in this way we can simultaneously estimate the larger coefficients with little bias and shrink the small coefficients towards zero. Hence, similar to Wu et al. (Citation2013), we suggest τⱼₜ = τˆⱼ/|βⱼₜ⁽⁰⁾|, j=1,…,m, t=1,…,p, where βⱼₜ⁽⁰⁾ is the MLE of βⱼₜ, and βⱼₜ and τⱼₜ are the t-th components of βⱼ and τⱼ, respectively. The tuning parameters are obtained by calculating τˆⱼ = argmin_{τⱼ} BIC(τⱼ).
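In practice the minimization over τⱼ is a one-dimensional grid search; a sketch follows (illustrative only; `fit_component` is a hypothetical helper that returns the penalized fit of component j at a given τ, and `deviance` evaluates Dⱼ as defined above).

```python
import numpy as np

def select_tau(tau_grid, fit_component, deviance, w_j):
    """Pick tau_j minimizing BIC(tau_j) = 2*D_j + N(tau_j)*log(n_j)."""
    n_j = w_j.sum()                                  # effective sample size of component j
    best_tau, best_bic = None, np.inf
    for tau in tau_grid:
        beta_hat, sigma_hat, lam_hat = fit_component(tau)
        dof = np.count_nonzero(beta_hat)             # N(tau_j): nonzero coefficients
        bic = 2.0 * deviance(beta_hat, sigma_hat, lam_hat) + dof * np.log(n_j)
        if bic < best_bic:
            best_tau, best_bic = tau, bic
    return best_tau
```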

5.3. Determining the number of components

Determining the number of components of an FMR model is challenging. In the discussion above, we assumed that m is known, obtained either from prior information or from a pre-analysis of the data. One feasible method is reversible jump Markov chain Monte Carlo (RJMCMC) (Richardson & Green, Citation1997), but since adding skewness further complicates matters, we did not pursue RJMCMC. Moreover, the component posterior probabilities evaluated in Bayesian mixture modelling can readily be used as a soft clustering scheme. Alternatively, the maximized log-likelihood and the two information-based criteria, AIC and BIC, can be used to select the number of components. Although some success has been achieved with these model choice criteria, choosing the right number of components for a mixture model is known to be difficult.

To improve the efficiency of selecting the number of components in this article, a productive nonparametric clustering method via mode identification is applied; see J. Li et al. (Citation2007). It deserves mentioning that this approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Specifically, a cluster is formed by the sample points that ascend to the same local maximum of the density function, and a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. In this process, the Modal EM (MEM) and Ridgeline EM (REM) algorithms are used. The numerical results in Section 7 illustrate that this clustering procedure works well for determining the number of components in the FMMeR model.
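For intuition, the core of the MEM hill-climbing step is only a few lines; the hedged sketch below follows the general recipe of J. Li et al. (Citation2007) for an equal-bandwidth Gaussian kernel density (it is not the authors' implementation, and the bandwidth h is a user choice): each point ascends the density, and points sharing a limit mode share a cluster.

```python
import numpy as np

def mem_ascend(x0, data, h, n_iter=200, tol=1e-8):
    """Ascend the kernel density f(x) = mean_i N(x; data_i, h^2 I) from x0 via Modal EM."""
    x = x0.copy()
    for _ in range(n_iter):
        logw = -0.5 * np.sum((data - x)**2, axis=1) / h**2
        w = np.exp(logw - logw.max())
        w /= w.sum()                                 # E-step: posterior mixture weights
        x_new = w @ data                             # M-step: weighted mean, increases f
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x                                         # approximate local mode of the KDE
```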

6. Numerical experiments

In this section, we carry out simulation studies to investigate the finite-sample performance of the proposed methodology. To be more specific, in Subsection 6.1 we conduct simulations to study the impact of the sample size on the estimation quality; in Subsection 6.2 we investigate the performance of the variable selection over different values of the skewness; and in Subsection 6.3 we compare the performance of the proposed FMMeR model with the normal mixture regression (NMR) model used in Khalili and Chen (Citation2007).

6.1. Experiment 1

This experiment examines the impact of the sample size on the estimation quality. In addition, we compare the performance of the different variable selection methods from a number of angles. We independently generated samples of size n from the following two-component FMMeR model: (19) f(yᵢ∣Ψ) = ν₁SN(μᵢ₁,σ₁²,λ₁) + (1−ν₁)SN(μᵢ₂,σ₂²,λ₂), Median(yᵢⱼ) = xᵢ⊤βⱼ, i=1,…,n, j=1,2, where μᵢ₁ and μᵢ₂ are defined by (9) and Ψ=(ν₁,β₁,β₂,σ₁,σ₂,λ₁,λ₂). The components of the covariate x in the simulation are generated from a uniform distribution U(−1,1). The true values of the parameters are set to β₁⁽⁰⁾=(1,0,0,1.5,0)⊤, β₂⁽⁰⁾=(1,0,1,0,1.2)⊤ and σ₁⁽⁰⁾=σ₂⁽⁰⁾=2. To test the sensitivity of the FMMeR model to positively or negatively skewed data, we set λ₁⁽⁰⁾=3 and λ₂⁽⁰⁾=−3. Mixing proportions ν₁=0.5 and 0.35 are considered, and y is generated according to model (19). According to Karlis and Xekalaki (Citation2003), a faster convergence rate can be achieved by setting the true parameter values as the initial values of the iteration. The performance of the estimators βˆ, σˆ, λˆ and νˆ is assessed using the mean squared error (MSE), defined as MSE(βˆⱼ) = E(βˆⱼ−βⱼ⁽⁰⁾)⊤(βˆⱼ−βⱼ⁽⁰⁾), MSE(σˆⱼ) = E(σˆⱼ−σⱼ⁽⁰⁾)², MSE(λˆⱼ) = E(λˆⱼ−λⱼ⁽⁰⁾)² and MSE(νˆⱼ) = E(νˆⱼ−νⱼ⁽⁰⁾)². The average numbers of correctly (C) and incorrectly (IC) estimated zero coefficients and their standard deviations (SD) based on 500 repetitions are presented in Table 1, reported separately for mixture components 1 and 2. In addition, we report the MSEs and SDs of the scale, skewness and mixing proportion estimates for ν₁=0.5 across the repetitions in Table 2. Note that when the sample size n increases, as expected, the methods improve for a given penalty. The MSEs of the estimators βˆ, σˆ, λˆ and νˆ tend to decrease as the sample size increases, which illustrates the convergence property of the maximum penalized likelihood estimator of the FMMeR model. For a given n, the SCAD and HARD methods perform similarly in terms of model complexity and better than the LASSO method. When the mixing proportion ν₁ is reduced, so that the sample size for component 1 decreases, all procedures become less satisfactory for component 1 of the FMMeR model. Furthermore, the performances for components 1 and 2 are similar when ν₁=0.5, which indicates that the FMMeR model is insensitive to whether the data are positively or negatively skewed.
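A sketch of the data generator for model (19) follows (illustrative code under the parameter values above, with λ₂ taken as −3; scipy's skewnorm uses the same Azzalini parameterization, and the location shift reproduces Median(yᵢⱼ)=xᵢ⊤βⱼ via (9)).

```python
import numpy as np
from scipy.stats import skewnorm

rng = np.random.default_rng(1)

def mu0(lam):
    return np.sqrt(2.0 / np.pi) * lam / np.sqrt(1.0 + lam**2)

def m0(lam):
    """Approximate mode of SN(0,1,lam); requires lam != 0."""
    s0 = np.sqrt(1.0 - mu0(lam)**2)
    t0 = (4.0 - np.pi) / 2.0 * mu0(lam)**3 / s0**3
    return mu0(lam) - t0 * s0 / 2.0 - np.sign(lam) / 2.0 * np.exp(-2.0 * np.pi / abs(lam))

def simulate(n, nu1=0.5):
    """Generate (X, y, labels) from the two-component FMMeR model (19)."""
    beta = [np.array([1.0, 0.0, 0.0, 1.5, 0.0]), np.array([1.0, 0.0, 1.0, 0.0, 1.2])]
    sigma, lam = [2.0, 2.0], [3.0, -3.0]
    X = rng.uniform(-1.0, 1.0, size=(n, 5))
    z = rng.random(n) < nu1                          # component-1 membership
    y = np.empty(n)
    for j, mask in enumerate([z, ~z]):
        loc = X[mask] @ beta[j] - sigma[j] / 3.0 * (m0(lam[j]) + 2.0 * mu0(lam[j]))
        y[mask] = skewnorm.rvs(lam[j], loc=loc, scale=sigma[j], random_state=rng)
    return X, y, z
```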

Table 1. Results of the variable selection procedure under the three penalty functions.

Table 2. Simulation results of the parameters of scale, skewness and mixing proportion for ν1=0.5.

6.2. Experiment 2

To investigate how the estimation quality changes over different degrees of skewness, in this section we set the mixing proportion ν₁=0.5 and the number of observations n = 400. Observations are generated in the same way as in Experiment 1. Table 3 shows C, IC, MSE(βˆ) and their SDs for the different penalty functions with λ = −3, −1.5, −0.5, 0.5, 1.5, 3 over 500 repetitions. Notice that the variable selection procedures perform similarly across all skewness levels for a given penalty function, but a larger SD is obtained by LASSO. Combined with the relevant conclusions of Experiment 1, this result indicates that the performance of the variable selection method is not affected by the skewness of the data.

Table 3. Varying skewness with n = 400 and ν1=0.5.

6.3. Experiment 3

To demonstrate the ability of the proposed variable selection method to select important variables, we compare the performance of the proposed FMMeR model and the NMR model used in Khalili and Chen (Citation2007) for varying sample sizes n = 200, 400 and ν₁=0.5. The data are generated exactly as in Experiment 1, and each of the two models is used for the inference. The simulated results based on 500 repetitions are reported in Table 4. From Table 4, it is clear that the variable selection procedure based on the FMMeR model performs better than that based on the NMR model in some settings. This confirms that the FMMeR model clearly outperforms the NMR model at identifying important variables when there is skewness in the data. As expected, the MSEs indicate the convergence property of the maximum penalized likelihood estimators of the FMMeR and NMR models.

Table 4. Varying sample size n with λ₁=3, λ₂=−3 and ν₁=0.5.

7. A real-data example

FMR models have been widely used in biomedicine. To further demonstrate the ability of the proposed FMMeR model and variable selection method to identify significant variables, we use a real-data example in this section to illustrate the practical application of the proposed method. The data set, analysed by Cook and Weisberg (Citation1994), concerns the body mass index (BMI) of 102 male and 100 female athletes collected at the Australian Institute of Sport. We are interested in the relationship between BMI and the 10 performance measures given by red cell count (x1), white cell count (x2), haematocrit (x3), haemoglobin (x4), plasma ferritin concentration (x5), sum of skin folds (x6), body fat percentage (x7), lean body mass (x8), height (x9) and weight (x10).

It can be seen from the histogram of the BMI in Figure 1 that the response is right-skewed, indicating a preference for the model with skew-normal random errors. We determine the number of components via the method in Subsection 5.3; the clustering results are shown in Figure 2. At level 3, 4 clusters are formed, as shown by different symbols in Figure 2(a). The 4 modes identified at level 3 are merged into 2 modes at level 4, as shown in Figure 2(b,d). Compared with level 4, two influential observations are excluded from cluster 1 and cluster 2 at level 3. Thus, it seems reasonable to fit the BMI data with the following two-component FMMeR model: (20) f(yᵢ∣Ψ) = ν₁SN(μᵢ₁,σ₁²,λ₁) + (1−ν₁)SN(μᵢ₂,σ₂²,λ₂), Median(yᵢⱼ) = xᵢ⊤βⱼ, i=1,…,202, j=1,2, where μᵢ₁ and μᵢ₂ are defined by (9), Ψ=(ν₁,β₁,β₂,σ₁,σ₂,λ₁,λ₂), and xᵢ is the 10×1 vector consisting of all 10 potential covariates. The three penalty functions are used to select significant variables.

Figure 1. Histogram of the BMI.


Figure 2. Clustering results for the BMI data obtained. (a) The 4 clusters at level 3. (b) The ascending paths from the modes at level 3 to those at level 4 and the contours of the density estimate at level 4. (c) The 2 clusters at level 4. (d) The ascending paths from the modes at level 4 to the next level and the contours of the density estimate at the next level.


We compare the variable selection results of three models: the proposed FMMeR model, the finite mixture of modal linear regression (MODLR) model, and the NMR model, where the modal linear regression model was proposed by Yao and Li (Citation2014). The variable selection results for the three models are given in Tables 5–7. In this data example, the three variable selection procedures perform very similarly for a given model in terms of selecting significant variables. For the FMMeR model and the finite mixture of MODLR model, the same variables are removed for a given penalty function. The NMR model, however, retains more variables, resulting in a failure to select the significant variables; thus, the true structure of the model is not identified. When the data are skewed, HARD and SCAD perform better than LASSO at identifying the authentic structure of the model. In the FMMeR model, seven significant variables (x1, x4, x5, x7, x8, x9, x10) are identified in component 1, and seven (x4, x5, x6, x7, x8, x9, x10) in component 2. This indicates that these variables have a significant effect on the BMI of athletes. We also find that some variables have different effects in the two components: for instance, red cell count (x1) and sum of skin folds (x6) are additional factors affecting athletes' BMI in component 1 and component 2, respectively. Furthermore, x4, x5, x7 and x8 are helpful for achieving a high BMI in both components. In addition, the performance of the variable selection procedure via the FMMeR model differs from that of the procedure via the NMR model.

Table 5. Variable selection for BMI data set via FMMeR model.

Table 6. Variable selection for BMI data set via finite mixture of MODLR model.

Table 7. Variable selection for BMI data set via NMR model.

8. Concluding remarks and future work

In this paper, by utilizing the skew-normal distribution as a component density to overcome the potential inappropriateness of normal mixtures in some contexts, we have developed a novel finite mixture of median regression (FMMeR) model to explore asymmetrical data that arise from several subpopulations. Thanks to the stochastic representation of the skew-normal distribution, we constructed a hierarchical representation of the finite skew-normal mixtures to address the computational barriers of parameter estimation and variable selection when fitting the FMMeR model. In addition, to determine the number of components, we applied the clustering method via mode identification proposed by J. Li et al. (Citation2007), which showed good performance. Numerical results from simulation studies and a real-data example illustrate that the proposed FMMeR methodology performs well in general, even when the data exhibit symmetrical behaviour.

It is worth noting that we have only considered procedures for parameter estimation and variable selection in the FMMeR model based on a mixture of skew-normal distributions, and the scenario of p>n has not been considered in this paper. A natural extension of the proposed methodology is to consider other skewed distributions, such as the skew-t and skew-Laplace distributions, and high-dimensional settings. In addition, another research direction is to model the mixing proportions ν, which extends the proposed model to the framework of mixture-of-experts models. Finally, it will also be of interest to consider Bayesian variable selection and semi-parametric and nonparametric methods for the FMMeR model, which are currently under investigation and will be reported elsewhere.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work is partially supported by the National Natural Science Foundation of China [grant number 11861041], the Natural Science Research Foundation of Kunming University of Science and Technology [grant number KKSY201907003].

References

  • Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. International Symposium on Information Theory, 1, 610–624. https://doi.org/10.1007/978-1-4612-1694-0_15
  • Atienza, N., Garcia-Heras, J., & Muñoz-Pichardo, J. (2006). A new condition for identifiability of finite mixture distributions. Metrika, 63(2), 215–221. https://doi.org/10.1007/s00184-005-0013-z
  • Azzalini, A. (1985). A class of distributions which includes the normal ones. Scandinavian Journal of Statistics, 12(2), 171–178. http://www.jstor.org/stable/4615982
  • Azzalini, A., & Capitanio, A. (2013). The skew-normal and related families. Cambridge University Press.
  • Chen, J. (2017). Consistency of the MLE under mixture models. Statistical Science, 32(1), 47–63. https://doi.org/10.1214/16-sts578
  • Chen, J., Li, P., & Liu, G. (2020). Homogeneity testing under finite location-scale mixtures. Canadian Journal of Statistics, 48(4), 670–684. https://doi.org/10.1002/cjs.11557
  • Chen, J., & Tan, X. (2009). Inference for multivariate normal mixtures. Journal of Multivariate Analysis, 100(7), 1367–1383. https://doi.org/10.1016/j.jmva.2008.12.005
  • Cook, R.-D., & Weisberg, S. (1994). An introduction to regression graphics. John Wiley and Sons.
  • Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360. https://doi.org/10.1198/016214501753382273
  • Goldfeld, S., & Quandt, R. (1973). A Markov model for switching regressions. Journal of Econometrics, 1(1), 3–15. https://doi.org/10.1016/0304-4076(73)90002-X
  • He, M., & Chen, J. (2022a). Consistency of the MLE under a two-parameter gamma mixture model with a structural shape parameter. Metrika. https://doi.org/10.1007/s00184-021-00856-9
  • He, M., & Chen, J. (2022b). Strong consistency of the MLE under two-parameter gamma mixture models with a structural scale parameter. Advances in Data Analysis and Classification, 16(1), 125–154. https://doi.org/10.1007/s11634-021-00472-5
  • Hu, D., Gu, Y., & Zhao, W. (2019). Bayesian variable selection for median regression. Chinese Journal of Applied Probability and Statistics, 35(6), 594–610.
  • Karlis, D., & Xekalaki, E. (2003). Choosing initial values for the EM algorithm for finite mixtures. Computational Statistics & Data Analysis, 41(3–4), 577–590. https://doi.org/10.1016/S0167-9473(02)00177-9
  • Khalili, A., & Chen, J. (2007). Variable selection in finite mixture of regression models. Journal of the American Statistical Association, 102(479), 1025–1038. https://doi.org/10.1198/016214507000000590
  • Kottas, A., & Gelfand, A. (2001). Bayesian semiparametric median regression modeling. Journal of the American Statistical Association, 96(456), 1458–1468. https://doi.org/10.1198/016214501753382363
  • Li, H., Wu, L., & Ma, T. (2017). Variable selection in joint location, scale and skewness models of the skew-normal distribution. Journal of Systems Science and Complexity, 30(3), 694–709. https://doi.org/10.1007/S11424-016-5193-2
  • Li, H., Wu, L., & Yi, J. (2016). A skew-normal mixture of joint location, scale and skewness models. Applied Mathematics-A Journal of Chinese Universities, 31(3), 283–295. https://doi.org/10.1007/S11766-016-3367-2
  • Li, J., Ray, S., & Lindsay, B.-G. (2007). A nonparametric statistical approach to clustering via mode identification. Journal of Machine Learning Research, 8(8), 1687–1723.
  • Lin, T.-I., Lee, J., & Yen, S. (2007). Finite mixture modelling using the skew normal distribution. Statistica Sinica, 17(3), 909–927. http://www.jstor.org/stable/24307705
  • Liu, M., & Lin, T.-I. (2014). A skew-normal mixture regression model. Educational and Psychological Measurement, 74(1), 139–162. https://doi.org/10.1177/0013164413498603
  • McLachlan, G., & Peel, D. (2004). Finite mixture models. John Wiley and Sons.
  • Otiniano, C. E. G., Rathie, P. N., & Ozelim, L. C. S. M. (2015). On the identifiability of finite mixture of skew-normal and skew-t distributions. Statistics & Probability Letters, 106, 103–108. https://doi.org/10.1016/j.spl.2015.07.015
  • Richardson, S., & Green, P. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(4), 731–792. https://doi.org/10.1111/1467-9868.00095
  • Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464. https://doi.org/10.1214/AOS/1176344136
  • Tang, A., & Tang, N. (2015). Semiparametric Bayesian inference on skew-normal joint modeling of multivariate longitudinal and survival data. Statistics in Medicine, 34(5), 824–843. https://doi.org/10.1002/SIM.6373
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B. Statistical Methodology, 58(1), 267–288. https://doi.org/10.1111/J.2517-6161.1996.TB02080.X
  • Titterington, D., Smith, A., & Makov, U. (1985). Statistical analysis of finite mixture distributions. John Wiley and Sons.
  • Wang, H., Li, R., & Tsai, C. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94(3), 553–568. https://doi.org/10.1093/BIOMET/ASM053
  • Wang, P., Puterman, M., Cockburn, I., & Le, N. (1996). Mixed Poisson regression models with covariate dependent rates. Biometrics, 52(2), 381–400. https://doi.org/10.2307/2532881
  • Wu, L. (2014). Variable selection in joint location and scale models of the skew-t-normal distribution. Communications in Statistics. Simulation and Computation, 43(3), 615–630. https://doi.org/10.1080/03610918.2012.712182
  • Wu, L., Li, S., & Tao, Y. (2020). Estimation and variable selection for mixture of joint mean and variance models. Communications in Statistics-Theory and Methods, 50(24), 6081–6098. https://doi.org/10.1080/03610926.2020.1738493
  • Wu, L., Zhang, Z., & Xu, D. (2013). Variable selection in joint location and scale models of the skew-normal distribution. Journal of Statistical Computation and Simulation, 83(7), 1266–1278. https://doi.org/10.1080/00949655.2012.657198
  • Yao, W., & Li, L. (2014). A new regression model: Modal linear regression. Scandinavian Journal of Statistics, 41(3), 656–671. https://doi.org/10.1111/SJOS.12054
  • Yin, J., Wu, L., & Dai, L. (2020). Variable selection in finite mixture of regression models using the skew-normal distribution. Journal of Applied Statistics, 47(16), 2941–2960. https://doi.org/10.1080/02664763.2019.1709051
  • Yin, J., Wu, L., Lu, H., & Dai, L. (2020). New estimation in mixture of experts models using the Pearson type VII distribution. Communications in Statistics. Simulation and Computation, 49(2), 472–483. https://doi.org/10.1080/03610918.2018.1485943
  • Zhou, X., & Liu, G. (2016). LAD-Lasso variable selection for doubly censored median regression models. Communications in Statistics. Theory and Methods, 45(12), 3658–3667. https://doi.org/10.1080/03610926.2014.904357

Appendices

Appendix 1. Regularity conditions and proofs

Regularity conditions R1–R4 on the joint distribution of h=(x,Y) are needed for proving the asymptotic properties of the proposed method. Let f(h∣Ψ) be the joint density function of h with the parameter space Ψ∈Ω. We write Ψ=(ψ₁,ψ₂,…,ψ_s)⊤, where s is the total number of parameters in the FMMeR model. The regularity conditions are as follows.

R1:

The density f(h∣Ψ) has common support in h for all Ψ∈Ω, and f(h∣Ψ) is identifiable in Ψ up to a permutation of the components of the mixture.

R2:

There exists an open subset Ω*⊂Ω containing the true parameter Ψ⁰ such that for almost all h, the density f(h∣Ψ) admits third partial derivatives with respect to Ψ for all Ψ∈Ω*.

R3:

For each Ψ⁰∈Ω* and all t,l,g=1,2,…,s, there exist functions B₁(h) and B₂(h) (possibly depending on Ψ⁰) such that for Ψ in a neighbourhood N(Ψ⁰), |∂f(h∣Ψ)/∂ψₜ| ≤ B₁(h), |∂²f(h∣Ψ)/∂ψₜ∂ψₗ| ≤ B₁(h), |∂³ log f(h∣Ψ)/∂ψₜ∂ψₗ∂ψ_g| ≤ B₂(h), where ∫B₁(h)dh < ∞ and ∫B₂(h)f(h∣Ψ)dh < ∞.

R4:

The Fisher information matrix, I(Ψ) = E{[∂ log f(h∣Ψ)/∂Ψ][∂ log f(h∣Ψ)/∂Ψ]⊤}, is finite and positive definite for each Ψ∈Ω.

Proof of Theorem 4.1

Let ξₙ = n^{−1/2}(1+aₙ). We need only show that for any given ϵ>0 there exists a large constant C such that (A1) lim_{n→∞} P{sup_{‖u‖=C} L(Ψ⁰+ξₙu) < L(Ψ⁰)} ≥ 1−ϵ. This indicates that, for sufficiently large n, with probability at least 1−ϵ there is a local maximum in the ball {Ψ⁰+ξₙu : ‖u‖≤C}. This local maximizer, say Ψˆ, satisfies ‖Ψˆ−Ψ⁰‖ = O_p(ξₙ).

Let ζₙ(u) = L(Ψ⁰+ξₙu) − L(Ψ⁰). Using p_{τⱼ}(0)=0 and the definition of L(⋅), we have ζₙ(u) = [ℓ(Ψ⁰+ξₙu) − ℓ(Ψ⁰)] − [p(Ψ⁰+ξₙu) − p(Ψ⁰)] ≤ [ℓ(Ψ⁰+ξₙu) − ℓ(Ψ⁰)] − [p(Ψ₁⁰+ξₙu_I) − p(Ψ₁⁰)] ≤ ℓ(Ψ⁰+ξₙu) − ℓ(Ψ⁰) − n∑ⱼ₌₁ᵐ νⱼ ∑ₜ₌₁^{dⱼ} [p_{τⱼ}(βⱼₜ⁰+ξₙu_{I,t}) − p_{τⱼ}(βⱼₜ⁰)], where dⱼ is the number of nonzero elements of the vector βⱼ⁰, Ψ₁⁰ is the parameter vector with the zero regression coefficients removed, and u_I is the subvector of u with the corresponding components. By Taylor expansion and the triangle inequality, (A2) ζₙ(u) ≤ ξₙ{ℓ′(Ψ⁰)}⊤u − ½ u⊤I(Ψ⁰)u nξₙ²{1+o_p(1)} + ∑ⱼ₌₁ᵐ νⱼ ∑ₜ₌₁^{dⱼ} [nξₙ p′_{τⱼ}(βⱼₜ⁰)‖u_I‖ + nξₙ² p″_{τⱼ}(βⱼₜ⁰)‖u_I‖²{1+o(1)}] = q₁ + q₂ + q₃. The regularity conditions imply that n^{−1/2}ℓ′(Ψ⁰) = O_p(1) and that the Fisher information matrix I(Ψ⁰) is positive definite; thus q₁ is of order O_p(n^{1/2}ξₙ) = O_p(nξₙ²). By choosing a sufficiently large C, q₁ is controlled uniformly by q₂ on ‖u‖=C. Note that q₃ is bounded by ∑ⱼ₌₁ᵐ νⱼ{d nξₙaₙ‖u‖ + nξₙ²bₙ‖u‖²} = d nξₙaₙ‖u‖ + nξₙ²bₙ‖u‖², where d = maxⱼ dⱼ. By condition C1 on the penalty functions, bₙ = o(1), so this term is also dominated by q₂. Hence, by choosing a sufficiently large C, (A1) holds. This completes the proof.

Proof of Theorem 4.2

To prove part (a), consider the partition Ψ=(Ψ₁,Ψ₂) for any Ψ in the neighbourhood ‖Ψ−Ψ⁰‖=O(n^{−1/2}). By the definition of L(⋅), we obtain L(Ψ₁,Ψ₂) − L(Ψ₁,0) = [ℓ(Ψ₁,Ψ₂) − ℓ(Ψ₁,0)] − [p(Ψ₁,Ψ₂) − p(Ψ₁,0)]. By the mean value theorem, (A3) ℓ(Ψ₁,Ψ₂) − ℓ(Ψ₁,0) = [∂ℓ(Ψ₁,η)/∂Ψ₂]⊤Ψ₂ with ‖η‖ ≤ ‖Ψ₂‖ = O(n^{−1/2}). Furthermore, by regularity condition R3 and the mean value theorem, we have ‖∂ℓ(Ψ₁,η)/∂Ψ₂ − ∂ℓ(Ψ₁⁰,0)/∂Ψ₂‖ ≤ ‖∂ℓ(Ψ₁,η)/∂Ψ₂ − ∂ℓ(Ψ₁,0)/∂Ψ₂‖ + ‖∂ℓ(Ψ₁,0)/∂Ψ₂ − ∂ℓ(Ψ₁⁰,0)/∂Ψ₂‖ ≤ [∑ᵢ₌₁ⁿ B₁(hᵢ)]‖η‖ + [∑ᵢ₌₁ⁿ B₁(hᵢ)]‖Ψ₁−Ψ₁⁰‖ = {‖η‖+‖Ψ₁−Ψ₁⁰‖}O_p(n) = O_p(n^{1/2}). By the regularity conditions R1–R4, ∂ℓ(Ψ₁⁰,0)/∂Ψ₂ = O_p(n^{1/2}); thus ∂ℓ(Ψ₁,η)/∂Ψ₂ = O_p(n^{1/2}). Applying these order assessments to (A3), we obtain, for large n, ℓ(Ψ₁,Ψ₂) − ℓ(Ψ₁,0) = O_p(n^{1/2}) ∑ⱼ₌₁ᵐ ∑ₜ₌dⱼ₊₁ᵖ |βⱼₜ|. On the other hand, p(Ψ₁,Ψ₂) − p(Ψ₁,0) = n∑ⱼ₌₁ᵐ ∑ₜ₌dⱼ₊₁ᵖ νⱼ p_{τⱼ}(βⱼₜ). Thus, L(Ψ₁,Ψ₂) − L(Ψ₁,0) = ∑ⱼ₌₁ᵐ ∑ₜ₌dⱼ₊₁ᵖ {|βⱼₜ| O_p(n^{1/2}) − nνⱼ p_{τⱼ}(βⱼₜ)}. In a shrinking neighbourhood of 0, |βⱼₜ|O_p(n^{1/2}) < nνⱼ p_{τⱼ}(βⱼₜ) in probability by condition C2. This completes the proof of part (a).

To prove sparsity in part (b(1)), we consider the partition Ψ=(Ψ₁,Ψ₂). Let (Ψˆ₁,0) be the maximizer of the penalized log-likelihood function L(Ψ₁,0), considered as a function of Ψ₁. It suffices to show that in the neighbourhood ‖Ψ−Ψ⁰‖=O_p(n^{−1/2}), L(Ψ₁,Ψ₂) − L(Ψˆ₁,0) < 0 with probability tending to 1 as n→∞. By the result in part (a), we obtain L(Ψ₁,Ψ₂) − L(Ψˆ₁,0) = [L(Ψ₁,Ψ₂) − L(Ψ₁,0)] + [L(Ψ₁,0) − L(Ψˆ₁,0)] ≤ [L(Ψ₁,Ψ₂) − L(Ψ₁,0)] < 0. To prove asymptotic normality in part (b(2)), we consider L(Ψ₁,0) as a function of Ψ₁. Using the same argument as in Theorem 4.1, there exists a √n-consistent local maximizer Ψˆ₁ of this function that satisfies ∂L(Ψˆ)/∂Ψ₁ = {∂ℓ(Ψ)/∂Ψ₁ − ∂p(Ψ)/∂Ψ₁}|_{Ψˆ=(Ψˆ₁,0)} = 0. Substituting the first-order Taylor expansions of ∂ℓ(Ψ)/∂Ψ₁ and ∂p(Ψ)/∂Ψ₁ into this expression, we have {∂²ℓ(Ψ₁⁰)/∂Ψ₁∂Ψ₁⊤ − p″(Ψ₁⁰) + o_p(n)}(Ψˆ₁ − Ψ₁⁰) = −∂ℓ(Ψ₁⁰)/∂Ψ₁ + p′(Ψ₁⁰). On the other hand, under the regularity conditions, we obtain (1/n)∂²ℓ(Ψ₁⁰)/∂Ψ₁∂Ψ₁⊤ = −I₁(Ψ₁⁰) + o_p(1) and (1/√n)∂ℓ(Ψ₁⁰)/∂Ψ₁ →d N(0, I₁(Ψ₁⁰)). Using the foregoing facts and Slutsky's theorem, we have √n{[I₁(Ψ₁⁰) + p″(Ψ₁⁰)/n](Ψˆ₁−Ψ₁⁰) + p′(Ψ₁⁰)/n} →d N(0, I₁(Ψ₁⁰)), which is the result in part (b(2)).

Appendix 2. Some technical derivations

In (18), the score function for the j-th component is expressed as
S(βⱼ) = −∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ [(1+λⱼ²)/σⱼ²] (eᵢⱼ − σⱼδ(λⱼ)r₁ᵢ⁽ᵏ⁾) E₁,
S(σⱼ) = −(1/σⱼ) ∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ + [(1+λⱼ²)/σⱼ²] ∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ [eᵢⱼ²/σⱼ − eᵢⱼδ(λⱼ)r₁ᵢ⁽ᵏ⁾ − eᵢⱼE₂ + σⱼδ(λⱼ)r₁ᵢ⁽ᵏ⁾E₂],
S(λⱼ) = ∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ δ²(λⱼ)/λⱼ − ∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ {λⱼeᵢⱼ²/σⱼ² + (1+λⱼ²)eᵢⱼE₃/σⱼ² − (r₁ᵢ⁽ᵏ⁾/σⱼ)[eᵢⱼ(2λⱼδ(λⱼ) + (1+λⱼ²)δ′(λⱼ)) + (1+λⱼ²)δ(λⱼ)E₃] + λⱼr₂ᵢ⁽ᵏ⁾},
where δ′(λⱼ) = ∂δ(λⱼ)/∂λⱼ = (1+λⱼ²)^{−3/2}. H(θ) is defined as H(θ) = −∂²Q₂(θ;Ψ⁽ᵏ⁾)/∂θ∂θ⊤, a symmetric matrix with blocks ∂²Q₂/∂βⱼ∂βⱼ⊤, ∂²Q₂/∂βⱼ∂σⱼ, ∂²Q₂/∂βⱼ∂λⱼ, ∂²Q₂/∂σⱼ∂σⱼ, ∂²Q₂/∂σⱼ∂λⱼ and ∂²Q₂/∂λⱼ∂λⱼ, where
∂²Q₂/∂βⱼ∂βⱼ⊤ = −∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ [(1+λⱼ²)/σⱼ²] E₁E₁⊤,
∂²Q₂/∂βⱼ∂σⱼ = −∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ [(1+λⱼ²)/σⱼ²] [E₂ + δ(λⱼ)r₁ᵢ⁽ᵏ⁾ − 2eᵢⱼ/σⱼ] E₁,
∂²Q₂/∂βⱼ∂λⱼ = −∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ {[2λⱼeᵢⱼ + (1+λⱼ²)E₃]/σⱼ² − (r₁ᵢ⁽ᵏ⁾/σⱼ)[2λⱼδ(λⱼ) + (1+λⱼ²)δ′(λⱼ)]} E₁,
∂²Q₂/∂σⱼ∂σⱼ = (1/σⱼ²)∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ + [(1+λⱼ²)/σⱼ²] ∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ [2eᵢⱼ(2E₂ + δ(λⱼ)r₁ᵢ⁽ᵏ⁾)/σⱼ − 3eᵢⱼ²/σⱼ² − E₂² − 2δ(λⱼ)r₁ᵢ⁽ᵏ⁾E₂],
∂²Q₂/∂σⱼ∂λⱼ = (2λⱼ/σⱼ²)∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ [eᵢⱼ²/σⱼ − eᵢⱼδ(λⱼ)r₁ᵢ⁽ᵏ⁾ − eᵢⱼE₂ + σⱼδ(λⱼ)r₁ᵢ⁽ᵏ⁾E₂] + [(1+λⱼ²)/σⱼ²] ∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ [2eᵢⱼE₃/σⱼ − (E₃δ(λⱼ) + eᵢⱼδ′(λⱼ))r₁ᵢ⁽ᵏ⁾ − E₃E₂ − eᵢⱼE₂₃ + σⱼr₁ᵢ⁽ᵏ⁾(δ′(λⱼ)E₂ + δ(λⱼ)E₂₃)],
∂²Q₂/∂λⱼ∂λⱼ = ∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ (1−λⱼ²)/(1+λⱼ²)² − ∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ {eᵢⱼ(eᵢⱼ + 4λⱼE₃)/σⱼ² + (1+λⱼ²)(E₃² + eᵢⱼE₃₃)/σⱼ² − (r₁ᵢ⁽ᵏ⁾/σⱼ)[2eᵢⱼδ(λⱼ) + eᵢⱼλⱼδ′(λⱼ) + 2λⱼδ(λⱼ)E₃ + √(1+λⱼ²)(2E₃ + λⱼE₃₃)] + r₂ᵢ⁽ᵏ⁾},
with eᵢⱼ = yᵢ − xᵢ⊤βⱼ + (σⱼ/3)[m₀(λⱼ) + 2√(2/π)δ(λⱼ)]. Thus we have
E₁ = ∂eᵢⱼ/∂βⱼ = −xᵢ,
E₂ = ∂eᵢⱼ/∂σⱼ = (1/3)[m₀(λⱼ) + 2√(2/π)δ(λⱼ)],
E₃ = ∂eᵢⱼ/∂λⱼ = (σⱼ/3)[M₁ + 2√(2/π)(1+λⱼ²)^{−3/2}],
E₁₁ = ∂²eᵢⱼ/∂βⱼ∂βⱼ⊤ = 0, E₁₂ = E₂₁ = ∂²eᵢⱼ/∂βⱼ∂σⱼ = 0, E₁₃ = E₃₁ = ∂²eᵢⱼ/∂βⱼ∂λⱼ = 0, E₂₂ = ∂²eᵢⱼ/∂σⱼ∂σⱼ = 0,
E₂₃ = E₃₂ = ∂²eᵢⱼ/∂σⱼ∂λⱼ = (1/3)[M₁ + 2√(2/π)(1+λⱼ²)^{−3/2}],
E₃₃ = ∂²eᵢⱼ/∂λⱼ∂λⱼ = (σⱼ/3)[M₂ − 6√(2/π)λⱼ(1+λⱼ²)^{−5/2}],
and
M₁ = ∂m₀(λⱼ)/∂λⱼ = √(2/π)(1+λⱼ²)^{−3/2} − [T₁σ₀(λⱼ) + S₁t₀(λⱼ)]/2 − (π sign²(λⱼ)/λⱼ²) exp(−2π/|λⱼ|),
M₂ = ∂²m₀(λⱼ)/∂λⱼ² = −3√(2/π)λⱼ(1+λⱼ²)^{−5/2} − [T₂σ₀(λⱼ) + 2T₁S₁ + S₂t₀(λⱼ)]/2 − (2π/λⱼ⁴)[π sign(λⱼ) − λⱼ sign²(λⱼ)] exp(−2π/|λⱼ|),
S₁ = ∂σ₀(λⱼ)/∂λⱼ = −√(2/π) μ₀(λⱼ)/[σ₀(λⱼ)(1+λⱼ²)^{3/2}],
S₂ = ∂²σ₀(λⱼ)/∂λⱼ² = (2/π)[4λⱼ²σ₀(λⱼ) + λⱼ(1+λⱼ²)S₁ − (1+λⱼ²)σ₀(λⱼ)]/[σ₀²(λⱼ)(1+λⱼ²)³],
T₁ = ∂t₀(λⱼ)/∂λⱼ = [3(4−π)/(2σ₀⁴(λⱼ))][√(2/π) μ₀²(λⱼ)σ₀(λⱼ)(1+λⱼ²)^{−3/2} − μ₀³(λⱼ)S₁],
T₂ = ∂²t₀(λⱼ)/∂λⱼ² = [3(4−π)/(2σ₀⁵(λⱼ))][(4/π) μ₀(λⱼ)σ₀²(λⱼ)(1+λⱼ²)^{−3} − 3√(2/π) λⱼμ₀²(λⱼ)σ₀²(λⱼ)(1+λⱼ²)^{−5/2} − 6√(2/π) μ₀²(λⱼ)σ₀(λⱼ)S₁(1+λⱼ²)^{−3/2} − S₂μ₀³(λⱼ)σ₀(λⱼ) + 4μ₀³(λⱼ)S₁²].