
Variable selection in finite mixture of median regression models using skew-normal distribution

Pages 30-48 | Received 18 Apr 2021, Accepted 25 Jul 2022, Published online: 06 Aug 2022

Abstract

A regression model with skew-normal errors provides a useful extension of traditional normal regression models when the data involve asymmetric outcomes. Moreover, data that arise from a heterogeneous population can be efficiently analysed by a finite mixture of regression models. These observations motivate us to propose a novel finite mixture of median regression model based on a mixture of skew-normal distributions to explore asymmetrical data from several subpopulations. With an appropriate choice of the tuning parameters, we establish the theoretical properties of the proposed procedure, including the consistency of the variable selection method and the oracle property of the estimators. A productive nonparametric clustering method is applied to select the number of components, and an efficient EM algorithm for the numerical computations is developed. Simulation studies and a real data set are used to illustrate the performance of the proposed methodologies.

1. Introduction

When the data involve asymmetrical outcomes, inference under a linear regression model with skewed random errors can be viewed as an alternative to the classical regression models with symmetric errors, since the use of a skewed distribution for the errors can reduce the influence of outliers and thus make the statistical analysis more robust. Specifically, suppose that a response variable Y given a set of predictors x takes the form (1) y = x⊤β + ϵ, where β represents a vector of unknown regression coefficients and the error term ϵ given x follows an unknown distribution with probability density function (pdf) g(ϵ∣x). It is known that if g(ϵ∣x) is symmetric about 0, the estimate of β in (1) coincides with the coefficients obtained by conventional mean linear regression. However, if g(ϵ∣x) is skewed, median regression provides a more reliable statistical analysis with adaptive robustness to outliers, since the median of a distribution is less susceptible to outliers, especially when the data involve asymmetrical outcomes. We refer interested readers to Kottas and Gelfand (Citation2001), Zhou and Liu (Citation2016) and Hu et al. (Citation2019) for relevant research on median regression of population distributions.

It is worth noting that median regression has been widely used for studying the relationship between a response variable Y and a set of predictors x under symmetric distributions, whereas such a median regression may not be suitable for analysing data that exhibit asymmetrical behaviour or that arise from a heterogeneous population. To tackle this difficulty, mixture of regression models (known as switching regression models in econometrics), initially introduced by Goldfeld and Quandt (Citation1973), may be employed as a flexible tool for studying skewed data from two or more subpopulations. Since then, finite mixture of regression (FMR) models have been widely used in a variety of fields, including but not limited to biology, medicine, economics, environmental science, sampling surveys and engineering technology. The book by McLachlan and Peel (Citation2004) contains a comprehensive review of FMR models. An FMR model is obtained when a response variable with a finite mixture distribution depends on a set of covariates, and FMR models have been discussed extensively under the assumption of normality for the regression error in each component.

However, it has been shown that the commonly used normal mixture model tends to overfit, since additional components are usually needed to capture the skewness of the data. To overcome the potential inappropriateness of normal mixtures in some contexts, we may consider the use of skew-normal distributions (Azzalini, Citation1985) as component densities for the errors; see, for example, Wu et al. (Citation2013), Wu (Citation2014), Tang and Tang (Citation2015), and H. Li et al. (Citation2016, Citation2017), to name just a few. These observations motivate us to develop a novel finite mixture of median regression (FMMeR) model based on a mixture of skew-normal distributions to explore asymmetrical data that arise from several subpopulations. Two barriers stand in the way of developing the FMMeR model. The first is the computational aspect of parameter estimation when fitting the FMMeR model with skew-normal errors; we tackle this barrier by utilizing the stochastic representation and the hierarchical representation (see, for example, Liu & Lin, Citation2014) of skew-normal mixtures. The second technical barrier is determining the number of components of the FMMeR model under consideration. The maximized log-likelihood and two information-based criteria, AIC (Akaike, Citation1973) and BIC (Schwarz, Citation1978), are commonly used to select the number of components. Although some success has been achieved with these model choice criteria, choosing the right number of components for a mixture model is known to be difficult. Thus, we consider a clustering procedure to determine the number of components, which has proved very effective in a real-data example; it is introduced in Subsection 5.3.

To enhance predictability and to give a concise model, it is reasonable to include only the significant covariates in the model. As a result, variable selection has become increasingly important for FMR models, and a rich literature has developed in recent decades. All-subset selection methods, such as AIC and BIC and their modifications, have been widely investigated in the context of FMR models; for instance, P. Wang et al. (Citation1996) studied model selection in a finite mixture of Poisson regression models via AIC and BIC. However, all-subset selection methods for FMR models are computationally intensive. To improve computational efficiency, the least absolute shrinkage and selection operator (LASSO) of Tibshirani (Citation1996) and the smoothly clipped absolute deviation (SCAD) method of Fan and Li (Citation2001) were proposed as variable selection methods. Penalized likelihood for FMR models, an extension of penalized least squares methods, was proposed by Khalili and Chen (Citation2007). Recently, Wu et al. (Citation2020) proposed an estimation and variable selection method for mixtures of joint mean and variance models, and Yin, Wu, and Dai (Citation2020) proposed variable selection procedures in FMR models using the skew-normal distribution.

The remainder of this paper is organized as follows. In Section 2, we briefly introduce the skew-normal distribution and its median expression. In Section 3, we develop a variable selection method for FMMeR model via the penalized likelihood-based procedure for analysing asymmetrical data from several subpopulations. Section 4 studies asymptotic properties of the resulting estimators. In Section 5, a numerical algorithm, a productive nonparametric clustering method for determining the number of components and a data-adaptive method for choosing tuning parameters are discussed. In Section 6, we carry out simulation studies to investigate the finite sample performance of the proposed methodology. A real-data example is provided in Section 7 for illustrative purposes. Some concluding remarks are given in Section 8. Brief proofs of theorems and some technical derivations are given in Appendices 1 and 2.

2. The skew-normal mixture of median regression models

2.1. Skew-normal distribution

A random variable Y is said to follow a univariate skew-normal distribution with location parameter μ, scale parameter σ∈(0,∞) and skewness parameter λ∈ℝ, denoted by Y∼SN(μ,σ²,λ), if its pdf is given by (2) f(y∣μ,σ²,λ) = (2/σ) ϕ((y−μ)/σ) Φ(λ(y−μ)/σ), where ϕ(⋅) and Φ(⋅) denote the pdf and cumulative distribution function (cdf) of the standard normal distribution, respectively. It is worth noting that if λ=0, the density of Y reduces to the normal density N(μ,σ²), and that the distribution is positively skewed if λ>0 and negatively skewed if λ<0.
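For concreteness, a minimal Python sketch of the density in (2) follows; it is an illustration only, and we cross-check it against scipy.stats.skewnorm, which uses the same Azzalini parameterization (shape a = λ).

```python
import numpy as np
from scipy.stats import norm, skewnorm

def sn_pdf(y, mu, sigma, lam):
    """Skew-normal pdf (2): (2/sigma) * phi(z) * Phi(lam*z), with z = (y-mu)/sigma."""
    z = (y - mu) / sigma
    return 2.0 / sigma * norm.pdf(z) * norm.cdf(lam * z)

y = np.linspace(-4.0, 6.0, 5)
print(sn_pdf(y, mu=1.0, sigma=2.0, lam=3.0))
print(skewnorm.pdf(y, a=3.0, loc=1.0, scale=2.0))  # agrees with the line above
```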

We represent the skew-normal distribution in an incomplete-data framework. Specifically, the stochastic representation for a random variable Yᵢ∼SN(μ,σ²,λ) is given by (3) Yᵢ = μ + σ(δ(λ)Rᵢ + √(1−δ²(λ)) Vᵢ),

where i=1,…,n for a sample of size n and δ(λ) = λ/√(1+λ²). Here Rᵢ∼TN(0,1)I{rᵢ>0} and Vᵢ∼N(0,1), with Rᵢ and Vᵢ independent. R∼TN(μ,σ²)I{a₁<r<a₂} denotes a truncated normal distribution with density f_R(r∣μ,σ²) = {Φ((a₂−μ)/σ) − Φ((a₁−μ)/σ)}⁻¹ × (1/(√(2π)σ)) exp{−(r−μ)²/(2σ²)}, a₁<r<a₂, where I{⋅} represents an indicator function. For notational simplicity, let Y=(Y₁,…,Yₙ)⊤ and R=(R₁,…,Rₙ)⊤. Furthermore, the skew-normal distribution can be decomposed into a normal distribution and a truncated normal distribution by the hierarchical representation (4) Yᵢ∣Rᵢ=rᵢ ∼ N(μ+σrᵢδ(λ), σ²(1−δ²(λ))), Rᵢ∼TN(0,1)I{rᵢ>0}. Azzalini and Capitanio (Citation2013) adopted the moment-generating function to calculate the mean and variance of the skew-normal distribution in (2): (5) E(Y) = μ + μ₀(λ)σ, Var(Y) = σ₀²(λ)σ², where μ₀(λ) ≔ √(2/π) δ(λ) and σ₀²(λ) ≔ 1 − μ₀²(λ). Of particular note is that Lin et al. (Citation2007) introduced a simple way of obtaining higher moments of the skew-normal distribution without the use of its moment-generating function. Letting m₀(λ) be the mode of the distribution SN(0,1,λ), a quite accurate approximation of m₀(λ), evaluated by numerical maximization, is given by m₀(λ) ≈ μ₀(λ) − t₀(λ)σ₀(λ)/2 − (sign(λ)/2) exp{−2π/|λ|}, where sign(λ) denotes the sign function of λ and t₀(λ) ≔ ((4−π)/2) μ₀³(λ)/σ₀³(λ). It deserves mentioning that the logarithm of the skew-normal density is a concave function, a property unaltered by changes of location and scale. Thus m₀(λ) is unique, and the mode of the skew-normal distribution in (2) can be expressed as Mode(Y) = μ + m₀(λ)σ. When the observations follow a skew-normal distribution, Mean(Y), Mode(Y) and Median(Y) satisfy the approximate relationship Median(Y) ≈ [Mode(Y) + 2 Mean(Y)]/3, that is, (6) Median(Y) ≈ μ + [m₀(λ) + 2μ₀(λ)]σ/3, which facilitates the development of the median regression with skew-normal mixtures discussed below.
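The stochastic representation (3) and the median approximation (6) are easy to verify by simulation; the following sketch (an illustration under the formulas above, not code from the paper; it assumes λ ≠ 0 in the mode approximation) draws from the representation and compares the empirical median with (6).

```python
import numpy as np

rng = np.random.default_rng(0)

def rsn(n, mu, sigma, lam, rng):
    """Draw Y = mu + sigma*(delta*R + sqrt(1-delta^2)*V) as in (3); R = |N(0,1)|."""
    delta = lam / np.sqrt(1.0 + lam**2)
    r = np.abs(rng.standard_normal(n))   # TN(0,1) truncated to (0, inf)
    v = rng.standard_normal(n)
    return mu + sigma * (delta * r + np.sqrt(1.0 - delta**2) * v)

def sn_median_approx(mu, sigma, lam):
    """Median(Y) ~ mu + [m0(lam) + 2*mu0(lam)]*sigma/3 as in (6); requires lam != 0."""
    delta = lam / np.sqrt(1.0 + lam**2)
    mu0 = np.sqrt(2.0 / np.pi) * delta
    sig0 = np.sqrt(1.0 - mu0**2)
    t0 = (4.0 - np.pi) / 2.0 * mu0**3 / sig0**3
    m0 = mu0 - t0 * sig0 / 2.0 - np.sign(lam) / 2.0 * np.exp(-2.0 * np.pi / abs(lam))
    return mu + (m0 + 2.0 * mu0) * sigma / 3.0

y = rsn(200_000, mu=1.0, sigma=2.0, lam=3.0, rng=rng)
print(np.median(y), sn_median_approx(1.0, 2.0, 3.0))  # the two values are close
```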

2.2. Median regression for skew-normal mixtures

In this paper, we assume that the response variable Yᵢ follows a skew-normal distribution with location parameter μᵢ, scale parameter σ and skewness parameter λ, denoted by Yᵢ∼SN(μᵢ,σ²,λ), i=1,…,n. A linear median regression model with skew-normal errors can be expressed as (7) yᵢ = xᵢ⊤β + ϵᵢ, where Median(Yᵢ∣X) = xᵢ⊤β = μᵢ + [m₀(λ)+2μ₀(λ)]σ/3, as defined by (6). Here X=(x₁,…,xₙ) is a p×n design matrix whose i-th column xᵢ=(x_{i1},…,x_{ip})⊤ is the p-dimensional vector of predictors, β=(β₁,…,β_p)⊤ is the p-dimensional vector of unknown regression coefficients, and ϵ=(ϵ₁,…,ϵₙ)⊤ stands for the n-dimensional vector of random errors with ϵᵢ ∼iid SN(−[m₀(λ)+2μ₀(λ)]σ/3, σ², λ), so that Median(ϵᵢ)=0.

We consider the case where the data come from heterogeneous populations. A finite mixture median regression (FMMeR) model with m components of skew-normal distributions is defined as (8) f(yᵢ∣Ψ) = ∑_{j=1}^{m} νⱼ SN(yᵢ∣μᵢⱼ, σⱼ², λⱼ), Median(yᵢⱼ) = xᵢ⊤βⱼ, i=1,…,n, j=1,…,m, where SN(yᵢ∣μᵢⱼ,σⱼ²,λⱼ) = (2/σⱼ) ϕ((yᵢ−μᵢⱼ)/σⱼ) Φ(λⱼ(yᵢ−μᵢⱼ)/σⱼ), ν=(ν₁,…,ν_m)⊤ are the mixing proportions, constrained to be non-negative and to sum to unity, βⱼ=(βⱼ₁,…,βⱼ_p)⊤ and Ψ=(ν₁,…,ν_{m−1},β₁,…,β_m,σ₁,…,σ_m,λ₁,…,λ_m). It follows that (9) μᵢⱼ = xᵢ⊤βⱼ − (σⱼ/3)[m₀(λⱼ) + 2√(2/π) δ(λⱼ)], which shows that the location in the FMMeR model changes with the scale and skewness parameters.
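As an illustration of (8)–(9), a short sketch of the FMMeR mixture density follows; it simply recovers each component location μᵢⱼ from the median relation (9) and sums the weighted skew-normal kernels (illustrative code, not from the paper; λⱼ ≠ 0 is assumed in the mode approximation).

```python
import numpy as np
from scipy.stats import norm

def mu0(lam):
    """mu0(lam) = sqrt(2/pi) * delta(lam)."""
    return np.sqrt(2.0 / np.pi) * lam / np.sqrt(1.0 + lam**2)

def m0(lam):
    """Approximate mode of SN(0,1,lam); requires lam != 0."""
    s0 = np.sqrt(1.0 - mu0(lam)**2)
    t0 = (4.0 - np.pi) / 2.0 * mu0(lam)**3 / s0**3
    return mu0(lam) - t0 * s0 / 2.0 - np.sign(lam) / 2.0 * np.exp(-2.0 * np.pi / abs(lam))

def fmmer_density(y, x, nus, betas, sigmas, lams):
    """f(y|Psi) in (8), with mu_j(x) = x'beta_j - (sigma_j/3)[m0 + 2*mu0] as in (9)."""
    dens = 0.0
    for nu, beta, sig, lam in zip(nus, betas, sigmas, lams):
        mu = x @ beta - sig / 3.0 * (m0(lam) + 2.0 * mu0(lam))
        z = (y - mu) / sig
        dens += nu * 2.0 / sig * norm.pdf(z) * norm.cdf(lam * z)
    return dens
```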

2.3. Identifiability

An important issue in statistical inference for FMR models is identifiability. It is well known that mixture models are not identifiable in full generality; however, in some mixture model settings a weaker sense of identifiability can be established. Titterington et al. (Citation1985) concluded that FMR models with continuous component distributions are identifiable in most cases, and Otiniano et al. (Citation2015) introduced the identifiability of finite mixtures of skew-normal distributions and gave a detailed explanation. Let F_Y denote the cumulative distribution function of Y. Define the skew-normal family as the set F = {F : F_Y(y∣μ,σ²,λ) = ∫_{−∞}^{y} f(t∣μ,σ²,λ) dt} and H = {H : H(y∣Ψ) = ∑_{j=1}^{m} νⱼ Fⱼ(y∣μᵢⱼ,σⱼ²,λⱼ); Fⱼ(y∣μᵢⱼ,σⱼ²,λⱼ)∈F, j=1,…,m} as the class of finite mixtures of skew-normal distributions. The class H of all finite mixtures of F is identifiable if and only if, for any H, H̄∈H with H = ∑_{j=1}^{m} νⱼFⱼ and H̄ = ∑_{j=1}^{m̄} ν̄ⱼF̄ⱼ, the equality H = H̄ implies m = m̄ and that (ν₁,F₁),…,(ν_m,F_m) is a permutation of (ν̄₁,F̄₁),…,(ν̄_m,F̄_m). The following theorem, given by Atienza et al. (Citation2006), provides a sufficient condition for the identifiability of finite mixtures of distributions. Below, A′ denotes the accumulation set of a set A.

Theorem 2.1 (Atienza et al., Citation2006)

Let F be a family of distributions. Let M be a linear mapping which transforms any F∈F into a real function φ_F with domain S_φ(F). Let S₀(F) = {k∈S_φ(F) : φ_F(k)≠0}. Suppose that there exists a total order ≺ on F such that for any F∈F there exists a point k(F) verifying:

  1. if F₁,F₂,…,F_l∈F with F₁≺Fⱼ for 2≤j≤l, then k(F₁) ∈ [S₀(F₁) ∩ (∩_{j=2}^{l} S_φ(Fⱼ))]′;

  2. if F₁≺F₂, then lim_{k→k(F₁)} φ_{F₂}(k)/φ_{F₁}(k) = 0.

Then, the class H of all finite mixture distributions of F is identifiable.

3. The method for variable selection

Various classical variable selection criteria can be viewed as trade-offs between estimation variance and modelling bias in a penalized likelihood. When x is random, we assume that the density f(x) is functionally independent of the parameters of the FMMeR model; hence variable selection can be based solely on the conditional density specified in (8). Denote by {(xᵢ,yᵢ)}ᵢ₌₁ⁿ a sample of observations from the FMMeR model specified in (8). The log-likelihood function of Ψ is given by ℓ(Ψ) = ∑ᵢ₌₁ⁿ log ∑ⱼ₌₁ᵐ νⱼ SN(yᵢ∣μᵢⱼ,σⱼ²,λⱼ). A maximum likelihood estimate (MLE) is obtained by maximizing ℓ(Ψ). When a component of x is not important, its MLE is often close to, but not strictly equal to, 0, and so the covariate is not excluded from the model. To address this problem, following Khalili and Chen (Citation2007), we define a penalized log-likelihood function (10) L(Ψ) = ℓ(Ψ) − p(Ψ), with the penalty function p(Ψ) = n ∑ⱼ₌₁ᵐ νⱼ ∑ₜ₌₁ᵖ p_{τⱼ}(|βⱼₜ|), where p_{τⱼ}(⋅) is a given penalty function with tuning parameter τⱼ≥0 (j=1,…,m); the tuning parameters and the penalty functions are not necessarily the same for all the parameters. A data-driven criterion for determining the tuning parameters is introduced in Subsection 5.2. By choosing appropriate tuning parameters and maximizing L(Ψ) in (10) to obtain the penalized maximum likelihood estimator of Ψ, denoted by Ψˆ, the coefficients in the vicinity of 0 are compressed to 0 and automatically excluded. Thus, the procedure combines parameter estimation with variable selection and reduces the computational burden substantially. We use the following three penalty functions to illustrate the theory developed for the FMMeR model:
LASSO penalty: p_{τⱼ}(|βⱼₜ|) = τⱼ|βⱼₜ|;
HARD penalty: p_{τⱼ}(|βⱼₜ|) = τⱼ² − (|βⱼₜ| − τⱼ)² I(|βⱼₜ| < τⱼ);
SCAD penalty: p′_{τⱼ}(|βⱼₜ|) = τⱼ{I(|βⱼₜ| ≤ τⱼ) + [(aτⱼ − |βⱼₜ|)₊ / ((a−1)τⱼ)] I(|βⱼₜ| > τⱼ)},
where the SCAD penalty is specified through its first derivative and (⋅)₊ denotes the positive part. Following the idea of Fan and Li (Citation2001), we set a = 3.7 for application purposes in this article. The LASSO penalty performs well in numerical computation because of its convexity. The SCAD penalty performs well at selecting important variables. The HARD penalty should work much like SCAD, although less smoothly.
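The three penalties are straightforward to code; the sketch below (illustrative only, with the SCAD line implementing the first derivative as printed above and a = 3.7) can be reused in the tuning and estimation steps of Section 5.

```python
import numpy as np

def lasso_pen(b, tau):
    """LASSO penalty: tau * |b|."""
    return tau * np.abs(b)

def hard_pen(b, tau):
    """HARD penalty: tau^2 - (|b| - tau)^2 on |b| < tau, flat (= tau^2) beyond."""
    ab = np.abs(b)
    return tau**2 - (ab - tau)**2 * (ab < tau)

def scad_pen_deriv(b, tau, a=3.7):
    """First derivative of the SCAD penalty, p'_tau(|b|)."""
    ab = np.abs(b)
    return tau * ((ab <= tau) + np.maximum(a * tau - ab, 0.0) / ((a - 1.0) * tau) * (ab > tau))

print(lasso_pen(0.8, 0.5), hard_pen(0.8, 0.5), scad_pen_deriv(0.8, 0.5))
```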

4. Asymptotic properties

In this section, we consider the consistency of the variable selection method and the oracle property of the estimators. Without loss of generality, the coefficient vector βⱼ (j=1,…,m) of the j-th component is decomposed as βⱼ = (β₁ⱼ⊤, β₂ⱼ⊤)⊤, where β₁ⱼ and β₂ⱼ contain the nonzero effects and zero effects of βⱼ, respectively. Naturally, we also split the parameter as Ψ = (Ψ₁, Ψ₂) such that Ψ₂ contains all the zero effects, that is, the β₂ⱼ in the true model. The vector of true parameters is denoted by Ψ⁰; its components are written with a superscript, namely Ψ⁰ = (ν₁⁰,…,ν_{m−1}⁰, β₁⁰,…,β_m⁰, σ₁⁰,…,σ_m⁰, λ₁⁰,…,λ_m⁰), where βⱼₜ⁰ is the t-th component of βⱼ⁰. Let dⱼ denote the number of nonzero elements βⱼₜ⁰ of the subvector β₁ⱼ⁰ for each j. Let aₙ = max_{j,t} {p′_{τⱼ}(βⱼₜ⁰) : βⱼₜ⁰ ≠ 0} and bₙ = max_{j,t} {p″_{τⱼ}(βⱼₜ⁰) : βⱼₜ⁰ ≠ 0}, where 1≤t≤dⱼ, 1≤j≤m, and p′_{τⱼ}(βⱼₜ⁰) and p″_{τⱼ}(βⱼₜ⁰) are the first and second derivatives of the penalty function p_{τⱼ}(⋅) evaluated at βⱼₜ⁰. The asymptotic results obtained in this article are based on three conditions on the penalty functions p_{τⱼ}(⋅).

C0:

For all j, p_{τⱼ}(0)=0, and p_{τⱼ}(β) is symmetric and non-negative. Furthermore, it is nondecreasing and twice differentiable for all β∈(0,∞) with at most a few exceptions.

C1:

As n→∞, bₙ = o(1).

C2:

For all j and Tₙ = {β : 0 < β ≤ n^{−1/2} log n}, lim_{n→∞} inf_{β∈Tₙ} √n p′_{τⱼ}(β) = ∞.

Condition C1 is used to establish the asymptotic properties of the estimators of the nonzero effects. Conditions C0 and C2 are required for sparsity.

Theorem 4.1 (Consistency)

Let hᵢ=(xᵢ,Yᵢ), i=1,…,n, be a random sample from a density function f(h∣Ψ) that satisfies the regularity conditions R1–R4 in Appendix 1, and assume that the penalty functions p_{τⱼ}(⋅) satisfy conditions C0 and C1. Then there exists a local maximizer Ψˆ of the penalized log-likelihood function L(Ψ) for which ‖Ψˆ − Ψ⁰‖ = O_p{n^{−1/2}(1 + aₙ)}, where ‖⋅‖ represents the Euclidean norm.

Theorem 4.2 (Oracle property)

Assume that the conditions given in Theorem 4.1 are fulfilled, the penalty functions p_{τⱼ}(⋅) satisfy conditions C0–C2, and m is known in parts (a) and (b). We then have the following.

  1. For any Ψ such that ‖Ψ − Ψ⁰‖ = O_p(n^{−1/2}), with probability tending to 1, L(Ψ₁,Ψ₂) − L(Ψ₁,0) < 0.

  2. For any √n-consistent maximum penalized likelihood estimator Ψˆ of Ψ,

    1. sparsity: P(βˆ₂ⱼ = 0) → 1, j=1,…,m, as n→∞;

    2. asymptotic normality: √n{[I₁(Ψ₁⁰) + p″(Ψ₁⁰)/n](Ψˆ₁ − Ψ₁⁰) + p′(Ψ₁⁰)/n} →d N(0, I₁(Ψ₁⁰)), where I₁(Ψ₁⁰) is the Fisher information computed under the true model with all zero effects removed.

Brief proofs of the theorems are given in Appendix 1; detailed proofs can be found in the literature (see, for example, Fan & Li, Citation2001; Khalili & Chen, Citation2007; Yin, Wu, & Dai, Citation2020).

5. Numerical computations

5.1. Maximization algorithm

In general, due to the unboundedness of the likelihood function, the maximum likelihood estimator of a mixture distribution is often inconsistent in the context of finite mixture models. An alternative is to add a regularization term that prevents the likelihood from tending to infinity, yielding a consistent maximum penalized likelihood estimator; see, for example, Chen and Tan (Citation2009), Chen (Citation2017), and the recent works of Chen et al. (Citation2020) and He and Chen (Citation2022a, Citation2022b). McLachlan and Peel (Citation2004) showed that the EM algorithm can compute the maximum likelihood estimate of essentially arbitrary component distributions in finite mixture models. We maximize the regularized log-likelihood function by the EM algorithm. Define the latent component indicators Z=(Z₁,…,Zₙ) with Zᵢ=(z_{i1},…,z_{im}), i=1,…,n. Then Zᵢ is an m-dimensional indicator vector whose j-th element is z_{ij} = 1 if (xᵢ,yᵢ) belongs to the j-th component and z_{ij} = 0 otherwise. Since an observation cannot belong to two components simultaneously, we have ∑ⱼ₌₁ᵐ z_{ij} = 1. By assuming the component indicators Z₁,…,Zₙ to be independent, we obtain the conditional density of the multinomial distribution given the mixing probabilities, (11) f(zᵢ∣ν) = ν₁^{z_{i1}} ν₂^{z_{i2}} ⋯ ν_{m−1}^{z_{i,m−1}} (1 − ∑ⱼ₌₁^{m−1} νⱼ)^{z_{im}}, denoted by Zᵢ∼M(1;ν₁,…,ν_m); it is used in combination with (3) to generate the following hierarchical representation for the skew-normal mixtures: (12) Yᵢ∣(rᵢ, z_{ij}=1) ∼ N(μᵢⱼ + σⱼrᵢδ(λⱼ), σⱼ²(1−δ²(λⱼ))), Rᵢ∣z_{ij}=1 ∼ TN(0,1)I(rᵢ>0), Zᵢ∼M(1;ν₁,ν₂,…,ν_m). It deserves mentioning that the hierarchical representation of the finite skew-normal mixtures in (12) allows us to address the computational barriers of parameter estimation when fitting the FMMeR model. Let Y_obs = {yᵢ}ᵢ₌₁ⁿ be the observed data. For each Yᵢ=yᵢ, we use the latent variables Zᵢ and Rᵢ to form the complete data Y_com = Y_obs ∪ Y_mis = {yᵢ, z_{ij}, rᵢ}, where Y_mis denotes the missing data. From the hierarchical representation (12), the complete-data log-likelihood function, up to an additive constant not depending on Ψ, is (13) ℓ_c(Ψ) = ∑ᵢ₌₁ⁿ ∑ⱼ₌₁ᵐ z_{ij}{log νⱼ − ½log(2πσⱼ²) − ½log(1−δ²(λⱼ)) − [e_{ij}² − 2σⱼe_{ij}δ(λⱼ)rᵢ + σⱼ²rᵢ²δ²(λⱼ)] / [2σⱼ²(1−δ²(λⱼ))]}. Similar to the approach in Fan and Li (Citation2001), p(Ψ) is replaced by the following local quadratic function given the value Ψ⁽⁰⁾: p(Ψ) ≈ p̃(Ψ) = p(Ψ⁽⁰⁾) + [p′(Ψ⁽⁰⁾)/(2Ψ⁽⁰⁾)](Ψ² − (Ψ⁽⁰⁾)²) = n ∑ⱼ₌₁ᵐ νⱼ ∑ₜ₌₁ᵖ [p_{τⱼ}(|βⱼₜ⁽⁰⁾|) + (p′_{τⱼ}(|βⱼₜ⁽⁰⁾|)/(2|βⱼₜ⁽⁰⁾|))(βⱼₜ² − βⱼₜ⁽⁰⁾²)]. This approximation is used in the M-step of the EM algorithm in each iteration. The complete-data penalized log-likelihood function corresponding to (10) is (14) L_c(Ψ) = ℓ_c(Ψ) − p(Ψ).

• E-step. The E-step computes the conditional expectation of L_c(Ψ) with respect to the latent variables. Given the observed data {xᵢ,yᵢ}ᵢ₌₁ⁿ from the FMMeR model (8), let Ψ⁽ᵏ⁾ denote the parameter estimate at the k-th iteration and let θ=(β,σ,λ). The surrogate function can be constructed as (15) Q(Ψ∣Ψ⁽ᵏ⁾) = Q₁(ν∣Ψ⁽ᵏ⁾) + Q₂(θ∣Ψ⁽ᵏ⁾) − p(Ψ∣Ψ⁽ᵏ⁾), where Q₁(ν∣Ψ⁽ᵏ⁾) = ∑ᵢ₌₁ⁿ ∑ⱼ₌₁ᵐ ω_{ij}⁽ᵏ⁾ log νⱼ and Q₂(θ∣Ψ⁽ᵏ⁾) = ∑ᵢ₌₁ⁿ ∑ⱼ₌₁ᵐ ω_{ij}⁽ᵏ⁾ [−½log(2πσⱼ²) − ½log(1−δ²(λⱼ)) − (e_{ij}² − 2σⱼe_{ij}δ(λⱼ)r₁ᵢ⁽ᵏ⁾ + σⱼ²δ²(λⱼ)r₂ᵢ⁽ᵏ⁾)/(2σⱼ²(1−δ²(λⱼ)))]. The required conditional expectations are obtained as follows. First, the conditional expectation ω_{ij}⁽ᵏ⁾ = E_{Ψ⁽ᵏ⁾}(z_{ij}∣yᵢ,xᵢ) is given by (16) ω_{ij}⁽ᵏ⁾ = νⱼ⁽ᵏ⁾ SN(yᵢ; μᵢⱼ⁽ᵏ⁾, σⱼ²⁽ᵏ⁾, λⱼ⁽ᵏ⁾) / ∑ⱼ₌₁ᵐ νⱼ⁽ᵏ⁾ SN(yᵢ; μᵢⱼ⁽ᵏ⁾, σⱼ²⁽ᵏ⁾, λⱼ⁽ᵏ⁾). Then it can easily be shown that r₁ᵢ⁽ᵏ⁾ = E(Rᵢ∣yᵢ,xᵢ,z_{ij}=1,Ψ⁽ᵏ⁾) = e_{ij}⁽ᵏ⁾δ(λⱼ⁽ᵏ⁾)/σⱼ⁽ᵏ⁾ + [δ(λⱼ⁽ᵏ⁾)/λⱼ⁽ᵏ⁾] ϕ(γ_{ij}⁽ᵏ⁾)/Φ(γ_{ij}⁽ᵏ⁾) and r₂ᵢ⁽ᵏ⁾ = E(Rᵢ²∣yᵢ,xᵢ,z_{ij}=1,Ψ⁽ᵏ⁾) = 1/(1+λⱼ⁽ᵏ⁾²) + [e_{ij}⁽ᵏ⁾δ(λⱼ⁽ᵏ⁾)/σⱼ⁽ᵏ⁾] r₁ᵢ⁽ᵏ⁾, with γ_{ij}⁽ᵏ⁾ = λⱼ⁽ᵏ⁾(yᵢ − μᵢⱼ⁽ᵏ⁾)/σⱼ⁽ᵏ⁾ = λⱼ⁽ᵏ⁾e_{ij}⁽ᵏ⁾/σⱼ⁽ᵏ⁾ and μᵢⱼ⁽ᵏ⁾ = xᵢ⊤βⱼ⁽ᵏ⁾ − (σⱼ⁽ᵏ⁾/3)[m₀(λⱼ⁽ᵏ⁾) + 2√(2/π)δ(λⱼ⁽ᵏ⁾)].
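A compact sketch of these E-step quantities follows (illustrative code consistent with (16) and the truncated-normal moments above; `mu_fn` and `sn_pdf` are assumed helpers such as those sketched in Sections 2–3, and λⱼ ≠ 0 is assumed).

```python
import numpy as np
from scipy.stats import norm

def e_step(y, X, nus, betas, sigmas, lams, mu_fn, sn_pdf):
    """Return responsibilities omega_ij (16) and the moments r1_i, r2_i per component."""
    m = len(nus)
    dens = np.stack([nus[j] * sn_pdf(y, mu_fn(X, betas[j], sigmas[j], lams[j]),
                                     sigmas[j], lams[j]) for j in range(m)])
    w = dens / dens.sum(axis=0)                      # omega_ij in (16)
    r1, r2 = [], []
    for j in range(m):
        lam, sig = lams[j], sigmas[j]
        delta = lam / np.sqrt(1.0 + lam**2)
        e = y - mu_fn(X, betas[j], sig, lam)         # residual e_ij = y_i - mu_ij
        gam = lam * e / sig                          # gamma_ij
        r1j = e * delta / sig + (delta / lam) * norm.pdf(gam) / norm.cdf(gam)
        r2j = 1.0 / (1.0 + lam**2) + (e * delta / sig) * r1j
        r1.append(r1j)
        r2.append(r2j)
    return w, np.stack(r1), np.stack(r2)
```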
• M-step. The M-step calculates the parameter vector Ψ⁽ᵏ⁺¹⁾ by maximizing Q(Ψ;Ψ⁽ᵏ⁾) with respect to Ψ. Thus, on the (k+1)-th iteration of the EM algorithm, the mixing proportions are updated by (17) νⱼ⁽ᵏ⁺¹⁾ = (1/n) ∑ᵢ₌₁ⁿ ω_{ij}⁽ᵏ⁾, j=1,…,m. It is worth noting that modelling of the mixing proportions should be considered in mixture-of-experts regression models; see Yin, Wu, Lu, et al. (Citation2020). To improve the efficiency of selecting the number of components in the real-data analysis of this article, we first apply a clustering method to determine the optimal number of components (Subsection 5.3). By maximizing Q(Ψ;Ψ⁽ᵏ⁾) with respect to Ψ without νⱼ, namely maximizing Q₂(θ;Ψ⁽ᵏ⁾), we can compute θⱼ⁽ᵏ⁺¹⁾. To obtain the parameter estimates of the FMMeR model without penalty, start from an initial value θ⁽⁰⁾ and, with k denoting the current iteration, update (18) θ⁽ᵏ⁺¹⁾ = θ⁽ᵏ⁾ + [H(θ⁽ᵏ⁾)]⁻¹ S(θ⁽ᵏ⁾), where S(θ) = ∂Q₂(θ;Ψ⁽ᵏ⁾)/∂θ = [S(β)⊤, S(σ)⊤, S(λ)⊤]⊤ is the score function without penalty and H(θ⁽ᵏ⁾) is the observed information matrix defined as H(θ) = −∂²Q₂(θ;Ψ⁽ᵏ⁾)/∂θ∂θ⊤. A detailed derivation can be found in Appendix 2. We iterate between the E-step and the M-step until the algorithm converges, and the estimators βⱼ⁽⁰⁾, σⱼ⁽⁰⁾, λⱼ⁽⁰⁾ are obtained.

In order to find the non-significant variables and simplify the FMMeR model, we shrink the coefficients by the penalty function. Taking βⱼ⁽⁰⁾ as the initial value of the iteration and k as the current iteration, we update βⱼ⁽ᵏ⁺¹⁾ = βⱼ⁽ᵏ⁾ + [∂²Q₂(θ;Ψ⁽ᵏ⁾)/∂βⱼ∂βⱼ⊤ − nΔ_τ(βⱼ⁽ᵏ⁾)]⁻¹ [nΔ_τ(βⱼ⁽ᵏ⁾)βⱼ⁽ᵏ⁾ − ∂Q₂(θ;Ψ⁽ᵏ⁾)/∂βⱼ], with Δ_τ(βⱼ⁽ᵏ⁾) = diag{p′_{τⱼ}(|βⱼ₁⁽ᵏ⁾|)/|βⱼ₁⁽ᵏ⁾|, p′_{τⱼ}(|βⱼ₂⁽ᵏ⁾|)/|βⱼ₂⁽ᵏ⁾|, …, p′_{τⱼ}(|βⱼ_p⁽ᵏ⁾|)/|βⱼ_p⁽ᵏ⁾|}.
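The update above is a Newton-type step in which the local quadratic approximation turns the penalty into a ridge-type term Δ_τ; a minimal sketch follows (illustrative, with a small guard against division by near-zero coefficients; `hess` denotes ∂²Q₂/∂βⱼ∂βⱼ⊤ and `grad` denotes ∂Q₂/∂βⱼ). Coefficients driven to numerical zero by this iteration are then removed from the model.

```python
import numpy as np

def update_beta(beta, grad, hess, pen_deriv, n, eps=1e-8):
    """One penalized step: beta + [hess - n*Delta]^{-1} (n*Delta*beta - grad)."""
    ab = np.maximum(np.abs(beta), eps)               # guard |beta_t| ~ 0
    delta = np.diag(pen_deriv(np.abs(beta)) / ab)    # Delta_tau(beta)
    A = hess - n * delta
    return beta + np.linalg.solve(A, n * delta @ beta - grad)
```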

5.2. Choice of the tuning parameters

The degree of penalization is controlled by the tuning parameters, which must be chosen when using the method introduced in this article. Various selection criteria, including cross-validation (CV), generalized cross-validation (GCV), the Akaike information criterion (AIC) (Akaike, Citation1973) and the Bayesian information criterion (BIC) (Schwarz, Citation1978), are often used for choosing tuning parameters. GCV has a non-negligible overfitting effect on the final selected model. H. Wang et al. (Citation2007) suggested using BIC for the SCAD estimator in linear models and partially linear models and proved the consistency of this selection method, that is, the optimal tuning parameter chosen by BIC identifies the true model with probability tending to 1. Let Ψₙ be the maximizer of the log-likelihood function; we use Ψₙ to calculate the mixing proportions in (17), which remain fixed throughout the tuning-parameter selection process. For a given value of the tuning parameter τⱼ, let (βˆⱼ, σˆⱼ, λˆⱼ) be the maximum regularized likelihood estimates of the parameters in the j-th component of the FMMeR model, fixing the remaining elements of Ψ at Ψₙ. Denote the likelihood-based deviance statistic corresponding to the j-th component of the FMMeR model, evaluated at (βˆⱼ, σˆⱼ, λˆⱼ), by Dⱼ(βˆⱼ,σˆⱼ,λˆⱼ) = ∑ᵢ₌₁ⁿ ωᵢⱼ[log SN(yᵢ∣μ̃ᵢⱼ, σˆⱼ², λˆⱼ) − log SN(yᵢ∣μˆᵢⱼ, σˆⱼ², λˆⱼ)], where μ̃ᵢⱼ and μˆᵢⱼ are obtained from (9) with the median taken to be yᵢ (the saturated fit) and xᵢ⊤βˆⱼ, respectively, so that μˆᵢⱼ = xᵢ⊤βˆⱼ − (σˆⱼ/3)[m₀(λˆⱼ) + 2√(2/π)δ(λˆⱼ)], and the weights ωᵢⱼ are given in (16). Then we define BIC(τⱼ) = 2Dⱼ(βˆⱼ,σˆⱼ,λˆⱼ) + N(τⱼ) log(nⱼ), j=1,…,m, where N(τⱼ) is the number of nonzero elements of the vector βˆⱼ and nⱼ = ∑ᵢ₌₁ⁿ ωᵢⱼ. The choice of τⱼₜ should be such that the tuning parameter for a zero coefficient is larger than that for a nonzero coefficient; in this way we can simultaneously estimate the larger coefficients with little bias and shrink the small coefficients towards zero. Hence, similar to Wu et al. (Citation2013), we suggest τⱼₜ = τˆⱼ/|βⱼₜ⁽⁰⁾|, j=1,…,m, t=1,…,p, where βⱼₜ⁽⁰⁾ is the MLE of βⱼₜ, and βⱼₜ and τⱼₜ are the t-th components of βⱼ and τⱼ, respectively. The tuning parameters are obtained by calculating τˆⱼ = argmin_{τⱼ} BIC(τⱼ).
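In practice the minimization over τⱼ is a one-dimensional grid search; a sketch follows (illustrative only; `fit_component` is a hypothetical helper that returns the penalized fit of component j at a given τ, and `deviance` evaluates Dⱼ as defined above).

```python
import numpy as np

def select_tau(tau_grid, fit_component, deviance, w_j):
    """Pick tau_j minimizing BIC(tau_j) = 2*D_j + N(tau_j)*log(n_j)."""
    n_j = w_j.sum()                                  # effective sample size of component j
    best_tau, best_bic = None, np.inf
    for tau in tau_grid:
        beta_hat, sigma_hat, lam_hat = fit_component(tau)
        dof = np.count_nonzero(beta_hat)             # N(tau_j): nonzero coefficients
        bic = 2.0 * deviance(beta_hat, sigma_hat, lam_hat) + dof * np.log(n_j)
        if bic < best_bic:
            best_tau, best_bic = tau, bic
    return best_tau
```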

5.3. Determining the number of components

Determining the number of components of an FMR model is challenging. In the discussion above, we assumed that m is known, obtained either from prior information or from a pre-analysis of the data. One feasible method is reversible jump Markov chain Monte Carlo (RJMCMC) (Richardson & Green, Citation1997), but since adding skewness further complicates matters, we did not pursue RJMCMC. Moreover, the component posterior probabilities evaluated in Bayesian mixture modelling can readily be used as a soft clustering scheme. Alternatively, the maximized log-likelihood and the two information-based criteria, AIC and BIC, can be used to select the number of components. Although some success has been achieved with these model choice criteria, choosing the right number of components for a mixture model is known to be difficult.

To improve the efficiency of selecting the number of components in this article, a productive nonparametric clustering method via mode identification is applied; see J. Li et al. (Citation2007). It deserves mentioning that this approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Specifically, a cluster is formed by the sample points that ascend to the same local maximum of the density function, and a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. In this process, the Modal EM (MEM) and Ridgeline EM (REM) algorithms are used. The numerical results in Section 7 illustrate that this clustering procedure works well for determining the number of components in the FMMeR model.
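For intuition, the core of the MEM hill-climbing step is only a few lines; the hedged sketch below follows the general recipe of J. Li et al. (Citation2007) for an equal-bandwidth Gaussian kernel density (it is not the authors' implementation, and the bandwidth h is a user choice): each point ascends the density, and points sharing a limit mode share a cluster.

```python
import numpy as np

def mem_ascend(x0, data, h, n_iter=200, tol=1e-8):
    """Ascend the kernel density f(x) = mean_i N(x; data_i, h^2 I) from x0 via Modal EM."""
    x = x0.copy()
    for _ in range(n_iter):
        logw = -0.5 * np.sum((data - x)**2, axis=1) / h**2
        w = np.exp(logw - logw.max())
        w /= w.sum()                                 # E-step: posterior mixture weights
        x_new = w @ data                             # M-step: weighted mean, increases f
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x                                         # approximate local mode of the KDE
```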

6. Numerical experiments

In this section, we carry out simulation studies to investigate the finite-sample performance of the proposed methodology. To be more specific, in Subsection 6.1 we conduct simulations to study the impact of the sample size on the estimation quality; in Subsection 6.2 we investigate the performance of the variable selection over different values of the skewness; and in Subsection 6.3 we compare the performance of the proposed FMMeR model with the normal mixture regression (NMR) model used in Khalili and Chen (Citation2007).

6.1. Experiment 1

This experiment examines the impact of the sample size on the estimation quality. In addition, we compare the performance of the different variable selection methods from a number of angles. We independently generated samples of size n from the following two-component FMMeR model: (19) f(yᵢ∣Ψ) = ν₁SN(μᵢ₁,σ₁²,λ₁) + (1−ν₁)SN(μᵢ₂,σ₂²,λ₂), Median(yᵢⱼ) = xᵢ⊤βⱼ, i=1,…,n, j=1,2, where μᵢ₁ and μᵢ₂ are defined by (9) and Ψ=(ν₁,β₁,β₂,σ₁,σ₂,λ₁,λ₂). The components of the covariate x in the simulation are generated from a uniform distribution U(−1,1). The true values of the parameters are set to β₁⁽⁰⁾=(1,0,0,1.5,0)⊤, β₂⁽⁰⁾=(1,0,1,0,1.2)⊤ and σ₁⁽⁰⁾=σ₂⁽⁰⁾=2. To test the sensitivity of the FMMeR model to positively or negatively skewed data, we set λ₁⁽⁰⁾=3 and λ₂⁽⁰⁾=−3. Mixing proportions ν₁=0.5 and 0.35 are considered, and y is generated according to model (19). According to Karlis and Xekalaki (Citation2003), a faster convergence rate can be achieved by setting the true parameter values as the initial values of the iteration. The performance of the estimators βˆ, σˆ, λˆ and νˆ is assessed using the mean squared error (MSE), defined as MSE(βˆⱼ) = E(βˆⱼ−βⱼ⁽⁰⁾)⊤(βˆⱼ−βⱼ⁽⁰⁾), MSE(σˆⱼ) = E(σˆⱼ−σⱼ⁽⁰⁾)², MSE(λˆⱼ) = E(λˆⱼ−λⱼ⁽⁰⁾)² and MSE(νˆⱼ) = E(νˆⱼ−νⱼ⁽⁰⁾)². The average numbers of correctly (C) and incorrectly (IC) estimated zero coefficients and their standard deviations (SD) based on 500 repetitions are presented in Table 1, reported separately for mixture components 1 and 2. In addition, we report the MSEs and SDs of the scale, skewness and mixing proportion estimates for ν₁=0.5 across the repetitions in Table 2. Note that when the sample size n increases, as expected, the methods improve for a given penalty. The MSEs of the estimators βˆ, σˆ, λˆ and νˆ tend to decrease as the sample size increases, which illustrates the convergence property of the maximum penalized likelihood estimator of the FMMeR model. For a given n, the SCAD and HARD methods perform similarly in terms of model complexity and better than the LASSO method. When the mixing proportion ν₁ is reduced, so that the sample size for component 1 decreases, all procedures become less satisfactory for component 1 of the FMMeR model. Furthermore, the performances for components 1 and 2 are similar when ν₁=0.5, which indicates that the FMMeR model is insensitive to whether the data are positively or negatively skewed.
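A sketch of the data generator for model (19) follows (illustrative code under the parameter values above, with λ₂ taken as −3; scipy's skewnorm uses the same Azzalini parameterization, and the location shift reproduces Median(yᵢⱼ)=xᵢ⊤βⱼ via (9)).

```python
import numpy as np
from scipy.stats import skewnorm

rng = np.random.default_rng(1)

def mu0(lam):
    return np.sqrt(2.0 / np.pi) * lam / np.sqrt(1.0 + lam**2)

def m0(lam):
    """Approximate mode of SN(0,1,lam); requires lam != 0."""
    s0 = np.sqrt(1.0 - mu0(lam)**2)
    t0 = (4.0 - np.pi) / 2.0 * mu0(lam)**3 / s0**3
    return mu0(lam) - t0 * s0 / 2.0 - np.sign(lam) / 2.0 * np.exp(-2.0 * np.pi / abs(lam))

def simulate(n, nu1=0.5):
    """Generate (X, y, labels) from the two-component FMMeR model (19)."""
    beta = [np.array([1.0, 0.0, 0.0, 1.5, 0.0]), np.array([1.0, 0.0, 1.0, 0.0, 1.2])]
    sigma, lam = [2.0, 2.0], [3.0, -3.0]
    X = rng.uniform(-1.0, 1.0, size=(n, 5))
    z = rng.random(n) < nu1                          # component-1 membership
    y = np.empty(n)
    for j, mask in enumerate([z, ~z]):
        loc = X[mask] @ beta[j] - sigma[j] / 3.0 * (m0(lam[j]) + 2.0 * mu0(lam[j]))
        y[mask] = skewnorm.rvs(lam[j], loc=loc, scale=sigma[j], random_state=rng)
    return X, y, z
```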

Table 1. Results of the variable selection procedure under the three penalty functions.

Table 2. Simulation results of the parameters of scale, skewness and mixing proportion for ν1=0.5.

6.2. Experiment 2

To investigate how the estimation quality changes over different degrees of skewness, in this section we set the mixing proportion ν₁=0.5 and the number of observations n = 400. Observations are generated in the same way as in Experiment 1. Table 3 shows C, IC, MSE(βˆ) and their SDs for the different penalty functions with λ = −3, −1.5, −0.5, 0.5, 1.5, 3 over 500 repetitions. Notice that the variable selection procedures perform similarly across all skewness levels for a given penalty function, but a larger SD is obtained by LASSO. Combined with the relevant conclusions of Experiment 1, this result indicates that the performance of the variable selection method is not affected by the skewness of the data.

Table 3. Varying skewness with n = 400 and ν1=0.5.

6.3. Experiment 3

To demonstrate the ability of the proposed variable selection method to select important variables, we compare the performance of the proposed FMMeR model and the NMR model used in Khalili and Chen (Citation2007) for varying sample sizes n = 200, 400 and ν₁=0.5. The data are generated exactly as in Experiment 1, and each of the two models is used for the inference. The simulated results based on 500 repetitions are reported in Table 4. From Table 4, it is clear that the variable selection procedure based on the FMMeR model performs better than that based on the NMR model in some settings. This confirms that the FMMeR model clearly outperforms the NMR model at identifying important variables when there is skewness in the data. As expected, the MSEs indicate the convergence property of the maximum penalized likelihood estimators of the FMMeR and NMR models.

Table 4. Varying sample size n with λ₁=3, λ₂=−3 and ν₁=0.5.

7. A real-data example

FMR models have been widely used in biomedicine. To further demonstrate the ability of the proposed FMMeR model and variable selection method to identify significant variables, we use a real-data example in this section to illustrate the practical application of the proposed method. The data set, analysed by Cook and Weisberg (Citation1994), concerns the body mass index (BMI) of 102 male and 100 female athletes collected at the Australian Institute of Sport. We are interested in the relationship between BMI and the 10 performance measures given by red cell count (x1), white cell count (x2), haematocrit (x3), haemoglobin (x4), plasma ferritin concentration (x5), sum of skin folds (x6), body fat percentage (x7), lean body mass (x8), height (x9) and weight (x10).

It can be seen from the histogram of the BMI in Figure 1 that the response is right-skewed, indicating a preference for the model with skew-normal random errors. We determine the number of components via the method in Subsection 5.3; the clustering results are shown in Figure 2. At level 3, 4 clusters are formed, as shown by different symbols in Figure 2(a). The 4 modes identified at level 3 are merged into 2 modes at level 4, as shown in Figure 2(b,d). Compared with level 4, two influential observations are excluded from cluster 1 and cluster 2 at level 3. Thus, it seems reasonable to fit the BMI data with the following two-component FMMeR model: (20) f(yᵢ∣Ψ) = ν₁SN(μᵢ₁,σ₁²,λ₁) + (1−ν₁)SN(μᵢ₂,σ₂²,λ₂), Median(yᵢⱼ) = xᵢ⊤βⱼ, i=1,…,202, j=1,2, where μᵢ₁ and μᵢ₂ are defined by (9), Ψ=(ν₁,β₁,β₂,σ₁,σ₂,λ₁,λ₂), and xᵢ is the 10×1 vector consisting of all 10 potential covariates. The three penalty functions are used to select significant variables.

Figure 1. Histogram of the BMI.


Figure 2. Clustering results for the BMI data obtained. (a) The 4 clusters at level 3. (b) The ascending paths from the modes at level 3 to those at level 4 and the contours of the density estimate at level 4. (c) The 2 clusters at level 4. (d) The ascending paths from the modes at level 4 to the next level and the contours of the density estimate at the next level.


We compare the variable selection results of three models: the proposed FMMeR model, the finite mixture of modal linear regression (MODLR) model, and the NMR model, where the modal linear regression model was proposed by Yao and Li (Citation2014). The variable selection results for the three models are given in Tables 5–7. In this data example, the three variable selection procedures perform very similarly for a given model in terms of selecting significant variables. For the FMMeR model and the finite mixture of MODLR model, the same variables are removed for a given penalty function. The NMR model, however, retains more variables, resulting in a failure to select the significant variables; thus, the true structure of the model is not identified. When the data are skewed, HARD and SCAD perform better than LASSO at identifying the authentic structure of the model. In the FMMeR model, seven significant variables (x1, x4, x5, x7, x8, x9, x10) are identified in component 1, and seven (x4, x5, x6, x7, x8, x9, x10) in component 2. This indicates that these variables have a significant effect on the BMI of athletes. We also find that some variables have different effects in the two components: for instance, red cell count (x1) and sum of skin folds (x6) are additional factors affecting athletes' BMI in component 1 and component 2, respectively. Furthermore, x4, x5, x7 and x8 are helpful for achieving a high BMI in both components. In addition, the performance of the variable selection procedure via the FMMeR model differs from that of the procedure via the NMR model.

Table 5. Variable selection for BMI data set via FMMeR model.

Table 6. Variable selection for BMI data set via finite mixture of MODLR model.

Table 7. Variable selection for BMI data set via NMR model.

8. Concluding remarks and future work

In this paper, by utilizing the skew-normal distribution as a component density to overcome the potential inappropriateness of normal mixtures in some contexts, we have developed a novel finite mixture of median regression (FMMeR) model to explore asymmetrical data that arise from several subpopulations. Thanks to the stochastic representation of the skew-normal distribution, we constructed a hierarchical representation of the finite skew-normal mixtures to address the computational barriers of parameter estimation and variable selection when fitting the FMMeR model. In addition, to determine the number of components, we applied the clustering method via mode identification proposed by J. Li et al. (Citation2007), which showed good performance. Numerical results from simulation studies and a real-data example illustrate that the proposed FMMeR methodology performs well in general, even when the data exhibit symmetrical behaviour.

It is worth noting that we have only considered procedures for parameter estimation and variable selection in the FMMeR model based on a mixture of skew-normal distributions, and the scenario of p>n has not been considered in this paper. A natural extension of the proposed methodology is to consider other skewed distributions, such as the skew-t and skew-Laplace distributions, and high-dimensional settings. In addition, another research direction is to model the mixing proportions ν, which extends the proposed model to the framework of mixture-of-experts models. Finally, it will also be of interest to consider Bayesian variable selection and semi-parametric and nonparametric methods for the FMMeR model, which are currently under investigation and will be reported elsewhere.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work is partially supported by the National Natural Science Foundation of China [grant number 11861041], the Natural Science Research Foundation of Kunming University of Science and Technology [grant number KKSY201907003].

References

  • Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. International Symposium on Information Theory, 1, 610–624. https://doi.org/10.1007/978-1-4612-1694-0_15
  • Atienza, N., Garcia-Heras, J., & Muñoz-Pichardo, J. (2006). A new condition for identifiability of finite mixture distributions. Metrika, 63(2), 215–221. https://doi.org/10.1007/s00184-005-0013-z
  • Azzalini, A. (1985). A class of distributions which includes the normal ones. Scandinavian Journal of Statistics, 12(2), 171–178. http://www.jstor.org/stable/4615982
  • Azzalini, A., & Capitanio, A. (2013). The skew-normal and related families. Cambridge University Press.
  • Chen, J. (2017). Consistency of the MLE under mixture models. Statistical Science, 32(1), 47–63. https://doi.org/10.1214/16-sts578
  • Chen, J., Li, P., & Liu, G. (2020). Homogeneity testing under finite location-scale mixtures. Canadian Journal of Statistics, 48(4), 670–684. https://doi.org/10.1002/cjs.11557
  • Chen, J., & Tan, X. (2009). Inference for multivariate normal mixtures. Journal of Multivariate Analysis, 100(7), 1367–1383. https://doi.org/10.1016/j.jmva.2008.12.005
  • Cook, R.-D., & Weisberg, S. (1994). An introduction to regression graphics. John Wiley and Sons.
  • Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360. https://doi.org/10.1198/016214501753382273
  • Goldfeld, S., & Quandt, R. (1973). A Markov model for switching regressions. Journal of Econometrics, 1(1), 3–15. https://doi.org/10.1016/0304-4076(73)90002-X
  • He, M., & Chen, J. (2022a). Consistency of the MLE under a two-parameter gamma mixture model with a structural shape parameter. Metrika. https://doi.org/10.1007/s00184-021-00856-9
  • He, M., & Chen, J. (2022b). Strong consistency of the MLE under two-parameter gamma mixture models with a structural scale parameter. Advances in Data Analysis and Classification, 16(1), 125–154. https://doi.org/10.1007/s11634-021-00472-5
  • Hu, D., Gu, Y., & Zhao, W. (2019). Bayesian variable selection for median regression. Chinese Journal of Applied Probability and Statistics, 35(6), 594–610.
  • Karlis, D., & Xekalaki, E. (2003). Choosing initial values for the EM algorithm for finite mixtures. Computational Statistics & Data Analysis, 41(3–4), 577–590. https://doi.org/10.1016/S0167-9473(02)00177-9
  • Khalili, A., & Chen, J. (2007). Variable selection in finite mixture of regression models. Journal of the American Statistical Association, 102(479), 1025–1038. https://doi.org/10.1198/016214507000000590
  • Kottas, A., & Gelfand, A. (2001). Bayesian semiparametric median regression modeling. Journal of the American Statistical Association, 96(456), 1458–1468. https://doi.org/10.1198/016214501753382363
  • Li, H., Wu, L., & Ma, T. (2017). Variable selection in joint location, scale and skewness models of the skew-normal distribution. Journal of Systems Science and Complexity, 30(3), 694–709. https://doi.org/10.1007/S11424-016-5193-2
  • Li, H., Wu, L., & Yi, J. (2016). A skew-normal mixture of joint location, scale and skewness models. Applied Mathematics-A Journal of Chinese Universities, 31(3), 283–295. https://doi.org/10.1007/S11766-016-3367-2
  • Li, J., Ray, S., & Lindsay, B.-G. (2007). A nonparametric statistical approach to clustering via mode identification. Journal of Machine Learning Research, 8(8), 1687–1723.
  • Lin, T.-I., Lee, J., & Yen, S. (2007). Finite mixture modelling using the skew normal distribution. Statistica Sinica, 17(3), 909–927. http://www.jstor.org/stable/24307705
  • Liu, M., & Lin, T.-I. (2014). A skew-normal mixture regression model. Educational and Psychological Measurement, 74(1), 139–162. https://doi.org/10.1177/0013164413498603
  • McLachlan, G., & Peel, D. (2004). Finite mixture models. John Wiley and Sons.
  • Otiniano, C. E. G., Rathie, P. N., & Ozelim, L. C. S. M. (2015). On the identifiability of finite mixture of skew-normal and skew-t distributions. Statistics & Probability Letters, 106, 103–108. https://doi.org/10.1016/j.spl.2015.07.015
  • Richardson, S., & Green, P. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(4), 731–792. https://doi.org/10.1111/1467-9868.00095
  • Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464. https://doi.org/10.1214/AOS/1176344136
  • Tang, A., & Tang, N. (2015). Semiparametric Bayesian inference on skew-normal joint modeling of multivariate longitudinal and survival data. Statistics in Medicine, 34(5), 824–843. https://doi.org/10.1002/SIM.6373
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B. Statistical Methodology, 58(1), 267–288. https://doi.org/10.1111/J.2517-6161.1996.TB02080.X
  • Titterington, D., Smith, A., & Makov, U. (1985). Statistical analysis of finite mixture distributions. John Wiley and Sons.
  • Wang, H., Li, R., & Tsai, C. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94(3), 553–568. https://doi.org/10.1093/BIOMET/ASM053
  • Wang, P., Puterman, M., Cockburn, I., & Le, N. (1996). Mixed Poisson regression models with covariate dependent rates. Biometrics, 52(2), 381–400. https://doi.org/10.2307/2532881
  • Wu, L. (2014). Variable selection in joint location and scale models of the skew-t-normal distribution. Communications in Statistics. Simulation and Computation, 43(3), 615–630. https://doi.org/10.1080/03610918.2012.712182
  • Wu, L., Li, S., & Tao, Y. (2020). Estimation and variable selection for mixture of joint mean and variance models. Communications in Statistics-Theory and Methods, 50(24), 6081–6098. https://doi.org/10.1080/03610926.2020.1738493
  • Wu, L., Zhang, Z., & Xu, D. (2013). Variable selection in joint location and scale models of the skew-normal distribution. Journal of Statistical Computation and Simulation, 83(7), 1266–1278. https://doi.org/10.1080/00949655.2012.657198
  • Yao, W., & Li, L. (2014). A new regression model: Modal linear regression. Scandinavian Journal of Statistics, 41(3), 656–671. https://doi.org/10.1111/SJOS.12054
  • Yin, J., Wu, L., & Dai, L. (2020). Variable selection in finite mixture of regression models using the skew-normal distribution. Journal of Applied Statistics, 47(16), 2941–2960. https://doi.org/10.1080/02664763.2019.1709051
  • Yin, J., Wu, L., Lu, H., & Dai, L. (2020). New estimation in mixture of experts models using the Pearson type VII distribution. Communications in Statistics. Simulation and Computation, 49(2), 472–483. https://doi.org/10.1080/03610918.2018.1485943
  • Zhou, X., & Liu, G. (2016). LAD-Lasso variable selection for doubly censored median regression models. Communications in Statistics. Theory and Methods, 45(12), 3658–3667. https://doi.org/10.1080/03610926.2014.904357

Appendices

Appendix 1. Regularity conditions and proofs

Regularity conditions R1–R4 on the joint distribution of h=(x,Y) are needed for proving the asymptotic properties of the proposed method. Let f(h∣Ψ) be the joint density function of h with the parameter space Ψ∈Ω. We write Ψ=(ψ₁,ψ₂,…,ψ_s)⊤, where s is the total number of parameters in the FMMeR model. The regularity conditions are as follows.

R1:

The density f(h∣Ψ) has common support in h for all Ψ∈Ω, and f(h∣Ψ) is identifiable in Ψ up to a permutation of the components of the mixture.

R2:

There exists an open subset Ω*⊂Ω containing the true parameter Ψ⁰ such that for almost all h, the density f(h∣Ψ) admits third partial derivatives with respect to Ψ for all Ψ∈Ω*.

R3:

For each Ψ⁰∈Ω* and all t,l,g=1,2,…,s, there exist functions B₁(h) and B₂(h) (possibly depending on Ψ⁰) such that for Ψ in a neighbourhood N(Ψ⁰), |∂f(h∣Ψ)/∂ψₜ| ≤ B₁(h), |∂²f(h∣Ψ)/∂ψₜ∂ψₗ| ≤ B₁(h), |∂³ log f(h∣Ψ)/∂ψₜ∂ψₗ∂ψ_g| ≤ B₂(h), where ∫B₁(h)dh < ∞ and ∫B₂(h)f(h∣Ψ)dh < ∞.

R4:

The Fisher information matrix, I(Ψ) = E{[∂ log f(h∣Ψ)/∂Ψ][∂ log f(h∣Ψ)/∂Ψ]⊤}, is finite and positive definite for each Ψ∈Ω.

Proof of Theorem 4.1

Let ξₙ = n^{−1/2}(1+aₙ). We need only show that for any given ϵ>0 there exists a large constant C such that (A1) lim_{n→∞} P{sup_{‖u‖=C} L(Ψ⁰+ξₙu) < L(Ψ⁰)} ≥ 1−ϵ. This indicates that, for sufficiently large n, with probability at least 1−ϵ there is a local maximum in the ball {Ψ⁰+ξₙu : ‖u‖≤C}. This local maximizer, say Ψˆ, satisfies ‖Ψˆ−Ψ⁰‖ = O_p(ξₙ).

Let ζₙ(u) = L(Ψ⁰+ξₙu) − L(Ψ⁰). Using p_{τⱼ}(0)=0 and the definition of L(⋅), we have ζₙ(u) = [ℓ(Ψ⁰+ξₙu) − ℓ(Ψ⁰)] − [p(Ψ⁰+ξₙu) − p(Ψ⁰)] ≤ [ℓ(Ψ⁰+ξₙu) − ℓ(Ψ⁰)] − [p(Ψ₁⁰+ξₙu_I) − p(Ψ₁⁰)] ≤ ℓ(Ψ⁰+ξₙu) − ℓ(Ψ⁰) − n∑ⱼ₌₁ᵐ νⱼ ∑ₜ₌₁^{dⱼ} [p_{τⱼ}(βⱼₜ⁰+ξₙu_{I,t}) − p_{τⱼ}(βⱼₜ⁰)], where dⱼ is the number of nonzero elements of the vector βⱼ⁰, Ψ₁⁰ is the parameter vector with the zero regression coefficients removed, and u_I is the subvector of u with the corresponding components. By Taylor expansion and the triangle inequality, (A2) ζₙ(u) ≤ ξₙ{ℓ′(Ψ⁰)}⊤u − ½ u⊤I(Ψ⁰)u nξₙ²{1+o_p(1)} + ∑ⱼ₌₁ᵐ νⱼ ∑ₜ₌₁^{dⱼ} [nξₙ p′_{τⱼ}(βⱼₜ⁰)‖u_I‖ + nξₙ² p″_{τⱼ}(βⱼₜ⁰)‖u_I‖²{1+o(1)}] = q₁ + q₂ + q₃. The regularity conditions imply that n^{−1/2}ℓ′(Ψ⁰) = O_p(1) and that the Fisher information matrix I(Ψ⁰) is positive definite; thus q₁ is of order O_p(n^{1/2}ξₙ) = O_p(nξₙ²). By choosing a sufficiently large C, q₁ is controlled uniformly by q₂ on ‖u‖=C. Note that q₃ is bounded by ∑ⱼ₌₁ᵐ νⱼ{d nξₙaₙ‖u‖ + nξₙ²bₙ‖u‖²} = d nξₙaₙ‖u‖ + nξₙ²bₙ‖u‖², where d = maxⱼ dⱼ. By condition C1 on the penalty functions, bₙ = o(1), so this term is also dominated by q₂. Hence, by choosing a sufficiently large C, (A1) holds. This completes the proof.

Proof of Theorem 4.2

To prove part (a), consider the partition Ψ=(Ψ₁,Ψ₂) for any Ψ in the neighbourhood ‖Ψ−Ψ⁰‖=O(n^{−1/2}). By the definition of L(⋅), we obtain L(Ψ₁,Ψ₂) − L(Ψ₁,0) = [ℓ(Ψ₁,Ψ₂) − ℓ(Ψ₁,0)] − [p(Ψ₁,Ψ₂) − p(Ψ₁,0)]. By the mean value theorem, (A3) ℓ(Ψ₁,Ψ₂) − ℓ(Ψ₁,0) = [∂ℓ(Ψ₁,η)/∂Ψ₂]⊤Ψ₂ with ‖η‖ ≤ ‖Ψ₂‖ = O(n^{−1/2}). Furthermore, by regularity condition R3 and the mean value theorem, we have ‖∂ℓ(Ψ₁,η)/∂Ψ₂ − ∂ℓ(Ψ₁⁰,0)/∂Ψ₂‖ ≤ ‖∂ℓ(Ψ₁,η)/∂Ψ₂ − ∂ℓ(Ψ₁,0)/∂Ψ₂‖ + ‖∂ℓ(Ψ₁,0)/∂Ψ₂ − ∂ℓ(Ψ₁⁰,0)/∂Ψ₂‖ ≤ [∑ᵢ₌₁ⁿ B₁(hᵢ)]‖η‖ + [∑ᵢ₌₁ⁿ B₁(hᵢ)]‖Ψ₁−Ψ₁⁰‖ = {‖η‖+‖Ψ₁−Ψ₁⁰‖}O_p(n) = O_p(n^{1/2}). By the regularity conditions R1–R4, ∂ℓ(Ψ₁⁰,0)/∂Ψ₂ = O_p(n^{1/2}); thus ∂ℓ(Ψ₁,η)/∂Ψ₂ = O_p(n^{1/2}). Applying these order assessments to (A3), we obtain, for large n, ℓ(Ψ₁,Ψ₂) − ℓ(Ψ₁,0) = O_p(n^{1/2}) ∑ⱼ₌₁ᵐ ∑ₜ₌dⱼ₊₁ᵖ |βⱼₜ|. On the other hand, p(Ψ₁,Ψ₂) − p(Ψ₁,0) = n∑ⱼ₌₁ᵐ ∑ₜ₌dⱼ₊₁ᵖ νⱼ p_{τⱼ}(βⱼₜ). Thus, L(Ψ₁,Ψ₂) − L(Ψ₁,0) = ∑ⱼ₌₁ᵐ ∑ₜ₌dⱼ₊₁ᵖ {|βⱼₜ| O_p(n^{1/2}) − nνⱼ p_{τⱼ}(βⱼₜ)}. In a shrinking neighbourhood of 0, |βⱼₜ|O_p(n^{1/2}) < nνⱼ p_{τⱼ}(βⱼₜ) in probability by condition C2. This completes the proof of part (a).

To prove sparsity in part (b(1)), we consider the partition Ψ=(Ψ₁,Ψ₂). Let (Ψˆ₁,0) be the maximizer of the penalized log-likelihood function L(Ψ₁,0), considered as a function of Ψ₁. It suffices to show that in the neighbourhood ‖Ψ−Ψ⁰‖=O_p(n^{−1/2}), L(Ψ₁,Ψ₂) − L(Ψˆ₁,0) < 0 with probability tending to 1 as n→∞. By the result in part (a), we obtain L(Ψ₁,Ψ₂) − L(Ψˆ₁,0) = [L(Ψ₁,Ψ₂) − L(Ψ₁,0)] + [L(Ψ₁,0) − L(Ψˆ₁,0)] ≤ [L(Ψ₁,Ψ₂) − L(Ψ₁,0)] < 0. To prove asymptotic normality in part (b(2)), we consider L(Ψ₁,0) as a function of Ψ₁. Using the same argument as in Theorem 4.1, there exists a √n-consistent local maximizer Ψˆ₁ of this function that satisfies ∂L(Ψˆ)/∂Ψ₁ = {∂ℓ(Ψ)/∂Ψ₁ − ∂p(Ψ)/∂Ψ₁}|_{Ψˆ=(Ψˆ₁,0)} = 0. Substituting the first-order Taylor expansions of ∂ℓ(Ψ)/∂Ψ₁ and ∂p(Ψ)/∂Ψ₁ into this expression, we have {∂²ℓ(Ψ₁⁰)/∂Ψ₁∂Ψ₁⊤ − p″(Ψ₁⁰) + o_p(n)}(Ψˆ₁ − Ψ₁⁰) = −∂ℓ(Ψ₁⁰)/∂Ψ₁ + p′(Ψ₁⁰). On the other hand, under the regularity conditions, we obtain (1/n)∂²ℓ(Ψ₁⁰)/∂Ψ₁∂Ψ₁⊤ = −I₁(Ψ₁⁰) + o_p(1) and (1/√n)∂ℓ(Ψ₁⁰)/∂Ψ₁ →d N(0, I₁(Ψ₁⁰)). Using the foregoing facts and Slutsky's theorem, we have √n{[I₁(Ψ₁⁰) + p″(Ψ₁⁰)/n](Ψˆ₁−Ψ₁⁰) + p′(Ψ₁⁰)/n} →d N(0, I₁(Ψ₁⁰)), which is the result in part (b(2)).

Appendix 2. Some technical derivations

In (18), the score function for the j-th component is expressed as
S(βⱼ) = −∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ [(1+λⱼ²)/σⱼ²] (eᵢⱼ − σⱼδ(λⱼ)r₁ᵢ⁽ᵏ⁾) E₁,
S(σⱼ) = −(1/σⱼ) ∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ + [(1+λⱼ²)/σⱼ²] ∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ [eᵢⱼ²/σⱼ − eᵢⱼδ(λⱼ)r₁ᵢ⁽ᵏ⁾ − eᵢⱼE₂ + σⱼδ(λⱼ)r₁ᵢ⁽ᵏ⁾E₂],
S(λⱼ) = ∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ δ²(λⱼ)/λⱼ − ∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ {λⱼeᵢⱼ²/σⱼ² + (1+λⱼ²)eᵢⱼE₃/σⱼ² − (r₁ᵢ⁽ᵏ⁾/σⱼ)[eᵢⱼ(2λⱼδ(λⱼ) + (1+λⱼ²)δ′(λⱼ)) + (1+λⱼ²)δ(λⱼ)E₃] + λⱼr₂ᵢ⁽ᵏ⁾},
where δ′(λⱼ) = ∂δ(λⱼ)/∂λⱼ = (1+λⱼ²)^{−3/2}. H(θ) is defined as H(θ) = −∂²Q₂(θ;Ψ⁽ᵏ⁾)/∂θ∂θ⊤, a symmetric matrix with blocks ∂²Q₂/∂βⱼ∂βⱼ⊤, ∂²Q₂/∂βⱼ∂σⱼ, ∂²Q₂/∂βⱼ∂λⱼ, ∂²Q₂/∂σⱼ∂σⱼ, ∂²Q₂/∂σⱼ∂λⱼ and ∂²Q₂/∂λⱼ∂λⱼ, where
∂²Q₂/∂βⱼ∂βⱼ⊤ = −∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ [(1+λⱼ²)/σⱼ²] E₁E₁⊤,
∂²Q₂/∂βⱼ∂σⱼ = −∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ [(1+λⱼ²)/σⱼ²] [E₂ + δ(λⱼ)r₁ᵢ⁽ᵏ⁾ − 2eᵢⱼ/σⱼ] E₁,
∂²Q₂/∂βⱼ∂λⱼ = −∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ {[2λⱼeᵢⱼ + (1+λⱼ²)E₃]/σⱼ² − (r₁ᵢ⁽ᵏ⁾/σⱼ)[2λⱼδ(λⱼ) + (1+λⱼ²)δ′(λⱼ)]} E₁,
∂²Q₂/∂σⱼ∂σⱼ = (1/σⱼ²)∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ + [(1+λⱼ²)/σⱼ²] ∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ [2eᵢⱼ(2E₂ + δ(λⱼ)r₁ᵢ⁽ᵏ⁾)/σⱼ − 3eᵢⱼ²/σⱼ² − E₂² − 2δ(λⱼ)r₁ᵢ⁽ᵏ⁾E₂],
∂²Q₂/∂σⱼ∂λⱼ = (2λⱼ/σⱼ²)∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ [eᵢⱼ²/σⱼ − eᵢⱼδ(λⱼ)r₁ᵢ⁽ᵏ⁾ − eᵢⱼE₂ + σⱼδ(λⱼ)r₁ᵢ⁽ᵏ⁾E₂] + [(1+λⱼ²)/σⱼ²] ∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ [2eᵢⱼE₃/σⱼ − (E₃δ(λⱼ) + eᵢⱼδ′(λⱼ))r₁ᵢ⁽ᵏ⁾ − E₃E₂ − eᵢⱼE₂₃ + σⱼr₁ᵢ⁽ᵏ⁾(δ′(λⱼ)E₂ + δ(λⱼ)E₂₃)],
∂²Q₂/∂λⱼ∂λⱼ = ∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ (1−λⱼ²)/(1+λⱼ²)² − ∑ᵢ₌₁ⁿ ωᵢⱼ⁽ᵏ⁾ {eᵢⱼ(eᵢⱼ + 4λⱼE₃)/σⱼ² + (1+λⱼ²)(E₃² + eᵢⱼE₃₃)/σⱼ² − (r₁ᵢ⁽ᵏ⁾/σⱼ)[2eᵢⱼδ(λⱼ) + eᵢⱼλⱼδ′(λⱼ) + 2λⱼδ(λⱼ)E₃ + √(1+λⱼ²)(2E₃ + λⱼE₃₃)] + r₂ᵢ⁽ᵏ⁾},
with eᵢⱼ = yᵢ − xᵢ⊤βⱼ + (σⱼ/3)[m₀(λⱼ) + 2√(2/π)δ(λⱼ)]. Thus we have
E₁ = ∂eᵢⱼ/∂βⱼ = −xᵢ,
E₂ = ∂eᵢⱼ/∂σⱼ = (1/3)[m₀(λⱼ) + 2√(2/π)δ(λⱼ)],
E₃ = ∂eᵢⱼ/∂λⱼ = (σⱼ/3)[M₁ + 2√(2/π)(1+λⱼ²)^{−3/2}],
E₁₁ = ∂²eᵢⱼ/∂βⱼ∂βⱼ⊤ = 0, E₁₂ = E₂₁ = ∂²eᵢⱼ/∂βⱼ∂σⱼ = 0, E₁₃ = E₃₁ = ∂²eᵢⱼ/∂βⱼ∂λⱼ = 0, E₂₂ = ∂²eᵢⱼ/∂σⱼ∂σⱼ = 0,
E₂₃ = E₃₂ = ∂²eᵢⱼ/∂σⱼ∂λⱼ = (1/3)[M₁ + 2√(2/π)(1+λⱼ²)^{−3/2}],
E₃₃ = ∂²eᵢⱼ/∂λⱼ∂λⱼ = (σⱼ/3)[M₂ − 6√(2/π)λⱼ(1+λⱼ²)^{−5/2}],
and
M₁ = ∂m₀(λⱼ)/∂λⱼ = √(2/π)(1+λⱼ²)^{−3/2} − [T₁σ₀(λⱼ) + S₁t₀(λⱼ)]/2 − (π sign²(λⱼ)/λⱼ²) exp(−2π/|λⱼ|),
M₂ = ∂²m₀(λⱼ)/∂λⱼ² = −3√(2/π)λⱼ(1+λⱼ²)^{−5/2} − [T₂σ₀(λⱼ) + 2T₁S₁ + S₂t₀(λⱼ)]/2 − (2π/λⱼ⁴)[π sign(λⱼ) − λⱼ sign²(λⱼ)] exp(−2π/|λⱼ|),
S₁ = ∂σ₀(λⱼ)/∂λⱼ = −√(2/π) μ₀(λⱼ)/[σ₀(λⱼ)(1+λⱼ²)^{3/2}],
S₂ = ∂²σ₀(λⱼ)/∂λⱼ² = (2/π)[4λⱼ²σ₀(λⱼ) + λⱼ(1+λⱼ²)S₁ − (1+λⱼ²)σ₀(λⱼ)]/[σ₀²(λⱼ)(1+λⱼ²)³],
T₁ = ∂t₀(λⱼ)/∂λⱼ = [3(4−π)/(2σ₀⁴(λⱼ))][√(2/π) μ₀²(λⱼ)σ₀(λⱼ)(1+λⱼ²)^{−3/2} − μ₀³(λⱼ)S₁],
T₂ = ∂²t₀(λⱼ)/∂λⱼ² = [3(4−π)/(2σ₀⁵(λⱼ))][(4/π) μ₀(λⱼ)σ₀²(λⱼ)(1+λⱼ²)^{−3} − 3√(2/π) λⱼμ₀²(λⱼ)σ₀²(λⱼ)(1+λⱼ²)^{−5/2} − 6√(2/π) μ₀²(λⱼ)σ₀(λⱼ)S₁(1+λⱼ²)^{−3/2} − S₂μ₀³(λⱼ)σ₀(λⱼ) + 4μ₀³(λⱼ)S₁²].