
Optimal model averaging estimator for multinomial logit models

Pages 227-240 | Received 29 Mar 2020, Accepted 10 Jan 2022, Published online: 17 Feb 2022

Abstract

In this paper, we study optimal model averaging estimators of regression coefficients in a multinomial logit model, which is commonly used in many scientific fields. A Kullback–Leibler (KL) loss-based weight choice criterion is developed to determine the averaging weights. Under some regularity conditions, we prove that the resulting model averaging estimators are asymptotically optimal. When the true model is one of the candidate models, the averaged estimators are consistent. Simulation studies suggest the superiority of the proposed method over commonly used model selection criteria, model averaging methods, and some other related methods in terms of the KL loss and the mean squared forecast error. Finally, website phishing data are used to illustrate the proposed method.

1. Introduction

Model selection is a traditional data analysis methodology. By minimizing a model selection criterion, such as the Akaike information criterion (AIC) (Akaike, 1973), the Bayesian information criterion (BIC) (Schwarz, 1978) or Mallows' Cp (Mallows, 1973), one model is chosen from a number of candidate models, and statistical inference then proceeds under the selected model. In this process, we ignore the additional uncertainty, or even bias, introduced by the model selection procedure, and thus often underreport the variance of inferences, as discussed in H. Wang et al. (2009). Instead of focusing on one model, the model averaging approach considers a series of candidate models and gives higher weights to the better ones. It is an integrated procedure that accounts for the uncertainty introduced by model selection and reduces the risk of regression estimation.

Model averaging can be classified into Bayesian model averaging (BMA) and frequentist model averaging (FMA). Compared with the FMA approach, there is an enormous literature on the BMA method; see Hoeting et al. (1999) for a comprehensive review. Unlike the BMA approach, which accounts for model uncertainty by assigning a prior probability to each candidate model, the FMA approach does not require priors and the corresponding estimators are determined entirely by the data. Therefore, current studies in statistics and econometrics pay increasing attention to the FMA approach.

In recent years, optimal model averaging methods have received a substantial amount of interest. Hansen (2007) proposed a Mallows model averaging (MMA) method for linear regression models with independent and homoscedastic errors and established its asymptotic optimality for a class of nested models by constraining the model weights to a special discrete set. A. T. Wan et al. (2010) provided a more flexible theoretical framework for MMA that preserves its asymptotic optimality for continuous weights and non-nested models. Hansen and Racine (2012) and Liu and Okui (2013) developed a jackknife model averaging (JMA) method and a heteroscedasticity-robust Cp (HRCp) model averaging method, respectively, for linear regression with independent and heteroscedastic errors. Zhang et al. (2013) extended JMA to linear regression with dependent errors. Cheng et al. (2015) provided a feasible autocovariance-corrected MMA method to select weights across generalized least squares estimators for linear regression with time series errors. Zhu et al. (2018) proposed MMA for multivariate multiple regression models.

Hansen's approach and the subsequent extensions listed above mainly focus on linear models. Recently, optimal model averaging has also been developed for nonlinear models, including partially linear models (Zhang & Wang, 2019), quantile regression (Lu & Su, 2015), generalized linear models and generalized linear mixed-effects models (Zhang et al., 2016), varying coefficient models (Li et al., 2018), varying-coefficient partially linear models (Zhu et al., 2019) and spatial autoregressive models (Zhang & Yu, 2018), among others. All of these methods are asymptotically optimal in the sense of achieving the lowest loss in large samples. To the best of our knowledge, there is little work on optimal averaging estimation for the multinomial logit model that allows all candidate models to be misspecified. The main contribution of this paper is to fill this gap.

The multinomial logit model is widely used in marketing research (Guadagni & Little, 1983), risk analysis (Bayaga, 2010), credit ratings (Ederington, 1985) and other fields involving categorical data. A. T. Wan et al. (2014) developed an 'approximately optimal' (A-opt) method for the multinomial logit model under a local misspecification assumption, but did not establish the asymptotic optimality of the resulting model averaging estimator. Besides, there have been many debates concerning the realism of the local misspecification assumption; see, e.g., Raftery and Zheng (2003). Later, S. Zhao et al. (2019) proposed an M-fold cross-validation (MCV) criterion for the multinomial logit model that yields forecasts superior to the strategy of A. T. Wan et al. (2014); its asymptotic optimality, however, was proved only for a fixed covariate dimension.

Both of the papers on multinomial logit models cited above concern a risk based on the squared estimation error. Different from squared errors, the KL loss measures the closeness between the model and the true data generating process. A number of criteria have been developed from the KL loss, such as the generalised information criterion (GIC) (Konishi & Kitagawa, 1996), the Kullback–Leibler information criterion (KIC) (Cavanaugh, 1999) and an improved version of the Akaike information criterion (AICc) (Hurvich et al., 1998). In addition, Zhang et al. (2015) showed that, for linear regression, model averaging methods based on the KL loss yield better forecasts than model averaging approaches based on squared errors. Motivated by these facts, it is potentially interesting to propose a novel model averaging method based on the KL loss. Our simulation study demonstrates that, for the multinomial logit model, the model averaging method based on the KL loss has a strong competitive advantage over the model averaging strategy based on the squared estimation error.

To develop an optimal model averaging method for the multinomial logit model, the weights are obtained by minimizing the KL loss. That is, we use a plug-in estimator of the KL loss plus a penalty term as the weight choice criterion, which is equivalent to penalizing the negative log-likelihood. It is interesting to note that this criterion reduces to the Mahalanobis Mallows criterion of Zhu et al. (2018), where the distribution of the multiple responses is assumed to be multivariate normal. The asymptotic optimality of the proposed method under the KL loss is built on the consistency of estimators in misspecified models, which is more flexible than the above-mentioned local misspecification assumption. Moreover, the asymptotic optimality is established for both fixed and diverging covariate dimensions.

This article is the first study to propose optimal model averaging estimation for multinomial logit models based on the KL loss. When the number of candidate models is small, the corresponding numerical solutions are obtained almost instantaneously. If the number of candidate models is large, the computational burden of our model averaging procedure becomes heavy; in this case, a model screening step prior to model averaging is desirable. That is, we use penalized regression with the LASSO (Friedman et al., 2010) to prepare candidate models: different tuning parameters may result in different models, each of which is included in the resulting set of candidate models. Using the website phishing data, we demonstrate the superiority of our proposed method.

Our work is related to Zhang et al. (2016), which developed a model averaging method for univariate generalized linear models. We differ from that study by establishing asymptotic optimality under a set of primitive conditions, whereas they prove asymptotic optimality by assuming the validity of certain conclusions. Moreover, we discuss the case where the true model is one of the candidate models and prove that the model averaging estimators based on our weight choice criterion are consistent.

The remainder of this article is organized as follows. In Section 2, we first describe the multinomial logit model. Then, we introduce the model averaging estimation for the multinomial logit model and propose a weight choice criterion by considering the KL loss. The asymptotic optimality of the proposed method and the estimation consistency are discussed in Sections 3 and 4, respectively. Sections 5 and 6 present the numerical results through various simulations and a real data example, respectively. Technical proofs of the main results are presented in the Appendix.

2. Model framework and weight choice

2.1. Multinomial logit model

Consider a general discrete choice model with n independent individuals and d nominal alternatives, where $y_i=j$ means that individual i selects alternative j. We use a multinomial logit regression to describe the discrete choice model (A. T. Wan et al., 2014). The corresponding assumption is that the log odds of category j relative to the reference category (without loss of generality, we take alternative d as the reference) are determined by a linear combination of the regressors. The choice probabilities for the ith individual can then be expressed as
(1) $$f(y_i=j\mid X_i)=\frac{\exp(X_i\beta_j)}{1+\sum_{l=1}^{d-1}\exp(X_i\beta_l)},\quad j=1,\dots,d-1,\qquad f(y_i=d\mid X_i)=\frac{1}{1+\sum_{l=1}^{d-1}\exp(X_i\beta_l)},$$
where X is an $n\times k$ covariate matrix with full column rank, $X_i$ is the ith row of X, and $\beta_j$ is an unknown parameter vector. We first assume k to be fixed and discuss the diverging situation in Section 3. Formula (1) can be rewritten in exponential family form as
(2) $$f(y_i\mid\theta_i,X_i)=\exp\{T(y_i)^{\mathrm T}\theta_i-b(\theta_i)\},\quad i=1,\dots,n,$$
where $\theta_i=(\theta_{i1},\dots,\theta_{i(d-1)})^{\mathrm T}$ is a parameter vector, and the canonical parameter $\theta_{ij}=X_i\beta_j$ connects $\beta_j$ and the k-dimensional covariate vector. Equivalently, $\theta_i=(I_{d-1}\otimes X_i)\beta$, where $\otimes$ is the Kronecker product, $I_{d-1}$ is the $(d-1)\times(d-1)$ identity matrix, and $\beta=(\beta_1^{\mathrm T},\dots,\beta_{d-1}^{\mathrm T})^{\mathrm T}$ is a $k(d-1)\times1$ parameter vector. In addition, $b(\theta_i)=\log\{\sum_{j=1}^{d-1}\exp(\theta_{ij})+1\}$, $T(y)=(I\{y=1\},\dots,I\{y=d-1\})^{\mathrm T}$, and $I\{\cdot\}$ is an indicator function.
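As a concrete illustration of (1), the following sketch (in Python, with purely hypothetical input values) computes the d choice probabilities for one individual from a covariate vector and the coefficient vectors $\beta_1,\dots,\beta_{d-1}$; it is not the authors' code.

```python
import numpy as np

def choice_probabilities(x_i, betas):
    """Choice probabilities of Equation (1) for one individual.

    x_i   : covariate vector of length k (row X_i, including an intercept if used)
    betas : (d-1, k) array; row j holds beta_j for alternative j,
            with alternative d as the reference category
    """
    eta = betas @ x_i                              # linear predictors X_i beta_j, j = 1, ..., d-1
    denom = 1.0 + np.exp(eta).sum()                # 1 + sum_l exp(X_i beta_l)
    return np.append(np.exp(eta), 1.0) / denom    # last entry is the reference category d

# toy example with d = 3 alternatives and k = 2 covariates (hypothetical values)
x_i = np.array([1.0, 0.5])
betas = np.array([[0.2, 1.0],
                  [-0.3, 0.4]])
print(choice_probabilities(x_i, betas))            # probabilities sum to 1
```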

2.2. Model averaging estimator

We consider a set of S candidate models $M_1,\dots,M_S$ given by
(3) $$M_s:\;f(y_i\mid\theta_i\{\beta_s\},X_{(s),i})=\exp\{T(y_i)^{\mathrm T}(I_{d-1}\otimes X_{(s),i})\beta_s-b((I_{d-1}\otimes X_{(s),i})\beta_s)\},$$
where S is fixed and $X_{(s)}$ is an $n\times k_s$ matrix, with full column rank, containing $k_s$ columns of X; its rows are $X_{(s),1},\dots,X_{(s),n}$. Under the sth candidate model, the maximum-likelihood estimator of the regression coefficients is $\hat\beta_s\in\mathbb R^{(d-1)k_s}$, and $\hat\beta_{(s)}\in\mathbb R^{(d-1)k}$ denotes the full-length vector that contains $\hat\beta_s$ as a subvector, with the remaining components restricted to be zero. Let $\theta_{0i}$ be the true value of $\theta_i$. We do not require the existence of a $\beta_0$ such that $\theta_{0i}=(I_{d-1}\otimes X_i)\beta_0$; thus, each of the candidate models may be misspecified. After the maximum-likelihood estimators are obtained, we need to determine the weight of each candidate model. Let $\omega=(w_1,\dots,w_S)$ be a weight vector in the unit simplex of $\mathbb R^S$: $H=\{\omega\in[0,1]^S:\sum_{s=1}^S w_s=1\}$. The model averaging estimator of $\beta$ is then $\hat\beta(\omega)=\sum_{s=1}^S w_s\hat\beta_{(s)}$.

Let $Y=(T(y_1),\dots,T(y_n))^{\mathrm T}$ and $U=E(Y)$ be $n\times(d-1)$ matrices, $\Theta=(\theta_1,\dots,\theta_n)^{\mathrm T}$ be an $n\times(d-1)$ parameter matrix and $\Theta_0=(\theta_{01},\dots,\theta_{0n})^{\mathrm T}$ be the true value of $\Theta$. We put the model estimator in vector form by using the vectorization operator $\mathrm{Vec}(\cdot)$, which creates a column vector by stacking the columns of its argument one below another. Then the model averaging estimator can be expressed as
(4) $$\mathrm{Vec}(\Theta^{\mathrm T}\{\hat\beta(\omega)\})=Z\hat\beta(\omega),$$
where $Z=((I_{d-1}\otimes X_1)^{\mathrm T},\dots,(I_{d-1}\otimes X_n)^{\mathrm T})^{\mathrm T}$ is an $n(d-1)\times k(d-1)$ matrix.
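A minimal sketch of how the averaged coefficient vector might be assembled in practice: each candidate fit is first embedded into the full $k(d-1)$-dimensional parameter vector (zeros for excluded covariates) and then combined with the weights. Variable names and the candidate-model encoding are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def averaged_coefficients(beta_hats, col_indices, weights, k, d):
    """Model averaging estimator beta_hat(omega) = sum_s w_s * beta_hat_(s).

    beta_hats   : list of (d-1, k_s) arrays, ML estimates under each candidate model
    col_indices : list of index arrays giving which columns of X each model uses
    weights     : weight vector on the unit simplex H
    """
    beta_avg = np.zeros((d - 1, k))
    for b_s, cols, w_s in zip(beta_hats, col_indices, weights):
        beta_full = np.zeros((d - 1, k))
        beta_full[:, cols] = b_s              # excluded coefficients stay at zero
        beta_avg += w_s * beta_full
    return beta_avg.reshape(-1)               # stacked as (beta_1^T, ..., beta_{d-1}^T)^T
```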

2.3. KL loss-based weight choice criterion

For linear models, the weight choice criterion is based on the squared prediction error. In this paper, we use the KL divergence in place of the squared prediction error to establish asymptotic optimality. The KL loss of $\Theta\{\hat\beta(\omega)\}$ is
$$\mathrm{KL}(\omega)=2\sum_{i=1}^n E_{y_i^*}\big[\log f(y_i^*\mid\theta_{0i})-\log f(y_i^*\mid\theta_i\{\hat\beta(\omega)\})\big]=2B\{\hat\beta(\omega)\}-2\mathrm{Vec}(U^{\mathrm T})^{\mathrm T}\mathrm{Vec}(\Theta^{\mathrm T}\{\hat\beta(\omega)\})-2B_0+2\mathrm{Vec}(U^{\mathrm T})^{\mathrm T}\mathrm{Vec}(\Theta_0^{\mathrm T})=2J(\omega)-2B_0+2\mathrm{Vec}(U^{\mathrm T})^{\mathrm T}\mathrm{Vec}(\Theta_0^{\mathrm T}),$$
where $Y^*=U+\Xi^*$ is another realization from $f(Y\mid\Theta_0)$ whose ith row is $T(y_i^*)^{\mathrm T}$, $\Xi^*$ is independent of $\Xi=Y-U$, $B_0=\sum_{i=1}^n b(\theta_{0i})$, $J(\omega)=B\{\hat\beta(\omega)\}-\mathrm{Vec}(U^{\mathrm T})^{\mathrm T}\mathrm{Vec}(\Theta^{\mathrm T}\{\hat\beta(\omega)\})$, and $B\{\hat\beta(\omega)\}=\sum_{i=1}^n b(\theta_i\{\hat\beta(\omega)\})$.

Because of the relationship between $J(\omega)$ and $\mathrm{KL}(\omega)$, we can obtain $\omega$ by minimizing $J(\omega)$ instead of $\mathrm{KL}(\omega)$. In practice, minimizing $J(\omega)$ is infeasible because U is unknown. An intuitive idea is to plug Y into $J(\omega)$ in place of U, that is, to obtain $\omega$ by minimizing $J^*(\omega)=B\{\hat\beta(\omega)\}-\mathrm{Vec}(Y^{\mathrm T})^{\mathrm T}\mathrm{Vec}(\Theta^{\mathrm T}\{\hat\beta(\omega)\})$. However, this procedure leads to overfitting. Motivated by the fact that $J^*(\omega)$ equals the negative log-likelihood corresponding to $\mathrm{Vec}(\Theta^{\mathrm T}\{\hat\beta(\omega)\})=Z\hat\beta(\omega)$, we add a penalty term $\lambda_n(d-1)\omega^{\mathrm T}K$ to $2J^*(\omega)$, where $K=(k_1,\dots,k_S)^{\mathrm T}$ and $k_s$ is the number of columns of X used in the sth candidate model. The weight choice criterion is then
$$\wp(\omega)=2B\{\hat\beta(\omega)\}-2\mathrm{Vec}(Y^{\mathrm T})^{\mathrm T}\mathrm{Vec}(\Theta^{\mathrm T}\{\hat\beta(\omega)\})+\lambda_n(d-1)\omega^{\mathrm T}K,$$
and the resulting weight vector is $\hat\omega=\arg\min_{\omega\in H}\wp(\omega)$. Because $\wp(\omega)$ is convex in $\omega$, the global optimization can be performed efficiently through constrained optimization programming; for example, fmincon in MATLAB can be applied for this purpose. Note that when one element of $\omega$ is restricted to 1 and the others to 0, $\wp(\omega)$ is equivalent to AIC and BIC, in the sense of choosing weights, when $\lambda_n=2$ and $\lambda_n=\log(n)$, respectively. In addition, when $\lambda_n=2$, the criterion $\wp(\omega)$ reduces to the Mahalanobis Mallows criterion of Zhu et al. (2018), where the distribution of the multiple responses is assumed to be multivariate normal.
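The paper mentions MATLAB's fmincon for the constrained minimization over the simplex H. A rough Python analogue using scipy.optimize is sketched below; the function and variable names are illustrative assumptions, and this is not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def criterion(w, Z, T_y, beta_mats, K, lam, d):
    """Weight choice criterion 2*B{beta(w)} - 2*Vec(Y^T)^T Z beta(w) + lam*(d-1)*w^T K.

    Z         : n(d-1) x k(d-1) design matrix built from I_{d-1} (x) X_i
    T_y       : length n(d-1) vector, Vec(Y^T) with ith block T(y_i)
    beta_mats : (S, k(d-1)) array; row s is the zero-padded estimate under model s
    K         : vector of candidate-model sizes (k_1, ..., k_S)
    """
    beta_w = beta_mats.T @ w                            # averaged coefficient vector
    theta = (Z @ beta_w).reshape(-1, d - 1)             # theta_i for each individual
    B = np.sum(np.log1p(np.exp(theta).sum(axis=1)))     # sum_i b(theta_i)
    return 2.0 * B - 2.0 * T_y @ (Z @ beta_w) + lam * (d - 1) * w @ K

def optimal_weights(Z, T_y, beta_mats, K, lam, d):
    """Minimize the criterion over the unit simplex (convex problem, SLSQP sketch)."""
    S = beta_mats.shape[0]
    w0 = np.full(S, 1.0 / S)
    cons = ({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},)
    res = minimize(criterion, w0, args=(Z, T_y, beta_mats, np.asarray(K, float), lam, d),
                   bounds=[(0.0, 1.0)] * S, constraints=cons, method='SLSQP')
    return res.x
```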

3. Asymptotic optimality

This section presents the main theoretical results of the paper, which establish the asymptotic optimality of the model averaging estimator $\Theta\{\hat\beta(\hat\omega)\}$. We define the pseudo-true regression parameter vector $\beta_{(s)}^*$ as the minimizer of the KL divergence between the true model and the sth candidate model. From Theorem 3.2 of White (1982), when the dimension k of $\beta_{(s)}^*$ is fixed, under some regularity conditions we have
(5) $$\|\hat\beta_{(s)}-\beta_{(s)}^*\|=O_p(n^{-1/2}).$$
Before providing the relevant theorems, we first list the notation used in this paper. Let $\beta^*(\omega)=\sum_{s=1}^S w_s\beta_{(s)}^*$, $\mathrm{KL}^*(\omega)=2B\{\beta^*(\omega)\}-2\mathrm{Vec}(U^{\mathrm T})^{\mathrm T}\mathrm{Vec}(\Theta^{\mathrm T}\{\beta^*(\omega)\})-2B_0+2\mathrm{Vec}(U^{\mathrm T})^{\mathrm T}\mathrm{Vec}(\Theta_0^{\mathrm T})$, $\xi_n=\inf_{\omega\in H}\mathrm{KL}^*(\omega)$, $\Xi_i$ be the ith row of $\Xi$, $\bar\lambda=\max_{i\in\{1,\dots,n\}}\lambda_{\max}\{\mathrm{Cov}(\Xi_i)\}$, and $\underline\lambda=\min_{i\in\{1,\dots,n\}}\lambda_{\min}\{\mathrm{Cov}(\Xi_i)\}$, where $\mathrm{Cov}(\cdot)$, $\lambda_{\max}\{\cdot\}$ and $\lambda_{\min}\{\cdot\}$ denote the covariance matrix and the maximum and minimum eigenvalues of a matrix, respectively. Note that all limiting properties here and throughout the text hold as $n\to\infty$. The following conditions will be imposed:

R.1

There exist constants $\underline C$ and $\bar C$ such that $0<\underline C<\lambda_{\min}\{X^{\mathrm T}X/n\}<\lambda_{\max}\{X^{\mathrm T}X/n\}<\bar C$.

R.2

$\max_{i\in\{1,\dots,n\}}\|X_i\|^2/n\to0$, and there exist constants $C_1$ and $C_2$ such that $0<C_1<\underline\lambda<\bar\lambda<C_2$.

R.3

$n\,\xi_n^{-2}=o(1)$.

R.4

$n^{-1/2}\lambda_n=O(1)$.

Remark 3.1

Conditions R.1 and R.2 are standard. Condition R.1 is the same as Condition C.2 and the second part of Condition C.3 of Zhang et al. (2020). The first part of Condition R.2 is the same as the first part of Condition C.3 of Zhang et al. (2020), and the second part assumes that the covariance matrix of $\Xi_i$ is positive definite. Condition R.3 requires $\xi_n$ to grow at a rate no slower than $n^{1/2}$. Both $\lambda_n=2$ and $\lambda_n=\log(n)$ satisfy Condition R.4, which means that if one prefers AIC or BIC, this can be achieved by choosing $\lambda_n=2$ or $\lambda_n=\log(n)$. Condition R.4 is also used in Theorem 2 of Zhang et al. (2020).

The following theorem illustrates the asymptotic optimality of the model averaging estimators for fixed k situation.

Theorem 3.1

For fixed k, if Equation (5) and the regularity Conditions R.1–R.4 hold, then $\hat\omega$ is asymptotically optimal in the sense that
(6) $$\frac{\mathrm{KL}(\hat\omega)}{\inf_{\omega\in H}\mathrm{KL}(\omega)}\to1,$$
where the convergence is in probability.

Remark 3.2

Theorem 1 of S. Zhao et al. (2019) is based on the squared loss, which concerns only pointwise distances. In contrast, the KL loss measures the closeness between the model and the true data generating process and thus concerns the full distribution. In addition, from
$$\mathrm{KL}(\omega)=2\int f(Y\mid\Theta_0)\log\frac{f(Y\mid\Theta_0)}{f(Y\mid\Theta\{\hat\beta(\omega)\})}\,\mathrm dY,$$
we see that the KL loss pays more attention to points with high probability, whereas the squared loss treats all points as equally important.

For diverging k, let $\beta_s^*\in\mathbb R^{(d-1)k_s}$ be the corresponding subvector of $\beta_{(s)}^*$ and define $B_n(\beta_s^*\mid\delta)=\{\beta_s\in\mathbb R^{(d-1)k_s}:\|n^{1/2}\{(d-1)k\}^{-1/2}(\beta_s-\beta_s^*)\|\le\delta\}$. Let $b^{(2)}(x)=\partial^2b(x)/\partial x\partial x^{\mathrm T}$, $D_s=\mathrm{diag}\{b^{(2)}[(I_{d-1}\otimes X_{(s),i})\beta_s]\}_{i=1,\dots,n}$ and $Z_{(s)}=((I_{d-1}\otimes X_{(s),1})^{\mathrm T},(I_{d-1}\otimes X_{(s),2})^{\mathrm T},\dots,(I_{d-1}\otimes X_{(s),n})^{\mathrm T})^{\mathrm T}$, which are $n(d-1)\times n(d-1)$ and $n(d-1)\times k_s(d-1)$ matrices, respectively. We list the following conditions required for the case with diverging k.

R.5

There exists a constant $C_3>0$ such that $\sum_{i=1}^n\|X_i\|/(k^{1/2}n)\le C_3<\infty$.

R.6

There exists a constant $C_0>0$ such that, for any fixed $\delta>0$, any $\beta_s\in B_n(\beta_s^*\mid\delta)$ and every $s=1,\dots,S$, the minimum eigenvalue of $n^{-1}Z_{(s)}^{\mathrm T}D_sZ_{(s)}$ is bounded below by $C_0$ for all sufficiently large n.

R.7

$k^2n\,\xi_n^{-2}=o(1)$.

Remark 3.3

Condition R.5 is implied by Condition A.1(iii) of Lu and Su (2015). Condition R.6 guarantees that $\|\hat\beta_{(s)}-\beta_{(s)}^*\|=O_p(n^{-1/2}k_s^{1/2})$, and is an extension of the first part of Condition C.4 of Zhang et al. (2016). Condition R.7 extends Condition R.3 to the diverging-k situation: it allows k to increase with n but restricts its rate. Obviously, as k increases, $\xi_n$ decreases, so Condition R.7 is easier to satisfy when k is smaller. In practice, we can exclude redundant variables from the candidate set prior to model averaging in order to control k.

Theorem 3.2

For diverging k, if Conditions R.1, R.2 and R.4–R.7 are satisfied, then (6) remains valid as $n\to\infty$.

Remark 3.4

Note that both $\lambda_n=2$ and $\lambda_n=\log(n)$ satisfy Condition R.4. In Section 5, the numerical analysis shows that they outperform the corresponding model selection methods (AIC and BIC) and model averaging methods (smoothed AIC and smoothed BIC), respectively. Moreover, when the sample size is small, the optimal value of $\lambda_n$ increases as the level of model misspecification increases.

4. Estimation consistency

Here we comment on the case where the true model is included among the candidate models, that is, $\theta_{0i}=(I_{d-1}\otimes X_i)\beta_0$, where $\beta_0\in\mathbb R^{(d-1)k}$ is the true value of $\beta$ and the number of non-zero coefficients of $\beta_0$ is $k_{\mathrm{true}}$. Let $\omega_{\mathrm{true}}$ be the weight vector in which the element corresponding to the true model is 1 and all others are 0. When k is fixed, from Chapter 3.4.1 of Fahrmeir and Tutz (2013), under some regularity conditions we have
(7) $$\|\hat\beta(\omega_{\mathrm{true}})-\beta_0\|=O_p(n^{-1/2}).$$
For diverging k, from Theorem 2.1 of Portnoy (1988), under some regularity conditions we obtain
(8) $$\|\hat\beta(\omega_{\mathrm{true}})-\beta_0\|=O_p(k_{\mathrm{true}}^{1/2}n^{-1/2}).$$
Denote $D_i(\beta)=b^{(2)}[(I_{d-1}\otimes X_i)\beta]$. In order to prove the estimation consistency, we further impose the following condition.

R.8

There exists $\delta(r)\ge\underline d>0$ such that, uniformly for $\omega\in H$ and $r\in(0,1)$ and for almost all $i\in\{1,\dots,n\}$,
$$\big\|D_i^{1/2}[\beta_0+r(\hat\beta(\omega)-\beta_0)](I_{d-1}\otimes X_i)(\hat\beta(\omega)-\beta_0)\big\|^2\big/\|\hat\beta(\omega)-\beta_0\|^2>\delta(r).$$

Remark 4.1

Condition R.8 states that most of the vectors $D_i^{1/2}[\beta_0+r(\hat\beta(\omega)-\beta_0)](I_{d-1}\otimes X_i)$ do not degenerate, in the sense that their inner products with $\hat\beta(\omega)-\beta_0$ do not approach zero. This is implied by $\lambda_{\min}\{\mathrm{diag}\{D_i[\beta_0+r(\hat\beta(\omega)-\beta_0)]\}_{i=1,\dots,n}\}>0$ uniformly for $\omega\in H$ and $r\in(0,1)$, together with the first part of Condition R.1. Our asymptotic study mainly requires Condition R.8, so that positive definiteness of $\mathrm{diag}\{D_i[\beta_0+r(\hat\beta(\omega)-\beta_0)]\}_{i=1,\dots,n}$ is not necessary.

We now describe the performance of the weighted estimator when the true model is among the candidate models.

Theorem 4.1

When k is fixed and the true model is one of the candidate models, if Conditions R.1, R.2, R.4, R.8 and Equation (7) are satisfied, then the weighted estimator satisfies
(9) $$\|\hat\beta(\hat\omega)-\beta_0\|=O_p(n^{-1/2}\lambda_n^{1/2})=o_p(1).$$

Remark 4.2

Theorem 4.1 states that $\|\hat\beta(\hat\omega)-\beta_0\|=o_p(1)$. We conjecture that it may be possible to improve the convergence rate of the weighted estimator to $n^{-1/2}$, similar to Theorem 2 of Zhang and Liu (2019).

Theorem 4.2

For diverging k, if $k=o(n^{1/2})$ and Conditions R.1, R.2, R.4, R.8 and Equation (8) are satisfied, then the weighted estimator satisfies
(10) $$\|\hat\beta(\hat\omega)-\beta_0\|=O_p(k^{1/2}n^{-1/2}\lambda_n^{1/2})=o_p(1).$$

5. Monte Carlo simulations

In this section, we evaluate the empirical performance of our proposed model averaging strategy for the multinomial logit model. We use two versions of the proposed method, OPT1-KL with $\lambda_n=2$ and OPT2-KL with $\lambda_n=\log(n)$, and compare them with several alternative FMA methods and model selection strategies. The model selection methods include AIC, BIC, and the LASSO of Friedman et al. (2010), whose tuning parameter $\hat\zeta$ is selected by cross-validation. The model averaging strategies include A-OPT (A. T. Wan et al., 2014), MCV (S. Zhao et al., 2019), smoothed AIC (SAIC) and smoothed BIC (SBIC) (Buckland et al., 1997). The SAIC strategy assigns the weight $\exp(-\mathrm{AIC}_s/2)/\sum_{s=1}^S\exp(-\mathrm{AIC}_s/2)$ to the sth model, and SBIC allocates weights analogously.
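For reference, the smoothed-AIC weights mentioned above can be computed as in the following small sketch; SBIC weights are obtained by replacing the AIC scores with BIC scores. The example values are hypothetical.

```python
import numpy as np

def smoothed_ic_weights(ic_scores):
    """Smoothed AIC/BIC weights: exp(-IC_s/2) / sum_s exp(-IC_s/2)."""
    ic = np.asarray(ic_scores, dtype=float)
    ic = ic - ic.min()            # subtracting the minimum leaves the ratios unchanged
    w = np.exp(-ic / 2.0)
    return w / w.sum()

print(smoothed_ic_weights([210.3, 212.1, 215.8]))   # hypothetical AIC values
```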

We use the KL loss for assessment and generate 1000 simulated datasets. For convenient comparison, we normalize all KL losses by dividing by the KL loss of the best method. The sample size is set to 100 and 200. Note that MCV and A-OPT are model averaging methods that average estimates of the probabilities $P(y_i=j)$, which makes computing the KL loss infeasible. Therefore, we compare our methods with MCV and A-OPT in terms of the mean squared forecast error (MSFE) instead of the KL loss. Without loss of generality, we also normalize all MSFEs by dividing by the MSFE of the best method.

Two simulation situations are considered. In the first, the candidate models do not contain the true model; we examine the effects of changing the magnitude of the coefficients and the level of model misspecification, and we also consider the case where the number of covariates diverges with the sample size. Note that all candidate models are misspecified in this situation, so there is no full model, which implies that the A-OPT method is not feasible here. In the second situation, the candidate models include the true model; we compare our methods with other competitive methods and validate the estimation consistency.

Setting 1. We generate $y_i$ from the multinomial logit model (2) with the following specifications: $d=3$, $X_{i1}=1$, $X_{ij}$, $j=2,\dots,6$, follow normal distributions with mean zero and variance one, the correlation between different components of $X_i$ is $\rho=0.75$, and $\theta_i=(\theta_{i1},\theta_{i2})^{\mathrm T}=(I_2\otimes X_i)(\beta_1^{\mathrm T},\beta_2^{\mathrm T})^{\mathrm T}$, where $\beta_1=\gamma_1(1,1,0.2,1.2,0.5,0.7)^{\mathrm T}$ and $\beta_2=\gamma_1(0.7,0.9,0.3,1.1,0.6,0.7)^{\mathrm T}$. To mimic the situation in which all candidate models are misspecified, we pretend that the last covariate is missing and let $X_{i1}$ be contained in all candidate models, so there are $2^4-1=15$ candidate models to combine. The parameter $\gamma_1$ controls the magnitude of the coefficients, and we let it vary in the set $\{0.5,1,2\}$.
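A sketch of the data-generating step of Setting 1 is given below, taking the coefficient values as printed above; the random seed and implementation details are assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_setting1(n, gamma1, rho=0.75):
    """Generate (X, y) from the multinomial logit model of Setting 1 with d = 3."""
    # X_{i1} = 1; X_{i2}, ..., X_{i6} standard normal with pairwise correlation rho
    cov = np.full((5, 5), rho)
    np.fill_diagonal(cov, 1.0)
    X = np.column_stack([np.ones(n), rng.multivariate_normal(np.zeros(5), cov, size=n)])
    beta1 = gamma1 * np.array([1, 1, 0.2, 1.2, 0.5, 0.7])
    beta2 = gamma1 * np.array([0.7, 0.9, 0.3, 1.1, 0.6, 0.7])
    theta = X @ np.column_stack([beta1, beta2])                      # n x 2 linear predictors
    denom = 1.0 + np.exp(theta).sum(axis=1, keepdims=True)
    probs = np.column_stack([np.exp(theta), np.ones(n)]) / denom     # n x 3 choice probabilities
    y = np.array([rng.choice(3, p=p) + 1 for p in probs])            # labels 1, 2, 3
    return X, y
```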

Simulation results are shown in Table 1. One remarkable aspect of the results is that OPT1-KL and OPT2-KL yield smaller mean and standard deviation (SD) values than the other four competitors (SAIC, SBIC, AIC and BIC) for all magnitudes of the coefficients. In the majority of cases, the FMA approaches are superior to the model selection methods, and this pattern is more pronounced when γ1 is small than when it is large. For example, when γ1 = 0.5, all model averaging methods deliver smaller mean values than the model selection strategies, whereas when γ1 increases to 2 and n = 200, AIC has a marginal advantage over SBIC in terms of mean values. This result is reasonable: when γ1 is small, the non-zero coefficients in the true model are all close to zero, which makes it difficult to distinguish them from the zero coefficients of a false model; as a result, the model selection criterion scores can be quite similar across candidate models and the choice of model becomes unstable. On the other hand, when the absolute values of the non-zero coefficients are large, a model selection criterion can identify a non-zero coefficient more readily.

Table 1. Simulation results of the KL loss for Setting 1.

In addition, we calculate the means of $\mathrm{KL}(\hat\omega)/\inf_{\omega\in H}\mathrm{KL}(\omega)$ (the ratio) for OPT1-KL and OPT2-KL with $\gamma_1=1$. The mean values, presented in Figure 1, decrease and approach 1 as the sample size n increases, which verifies the asymptotic optimality numerically. We then compare our strategy with MCV in terms of MSFE. We generate 100 observations as the training sample and 10 observations as the test sample under Setting 1 with $\gamma_1=1$, and the simulation results are based on 1000 replications. The MSFE is defined as
$$\mathrm{MSFE}=\frac{1}{10000}\sum_{r=1}^{1000}\sum_{v=1}^{10}\sum_{j=1}^{3}\big(\hat p^{[r]}_{vj}-p^{[r]}_{vj}\big)^2,$$
where $\hat p^{[r]}_{vj}$ is the forecast of $p^{[r]}_{vj}$, the probability that the vth test observation selects alternative j in the rth replication.
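The MSFE above can be computed as in the following sketch; the array shapes are assumptions.

```python
import numpy as np

def msfe(p_hat, p_true):
    """Mean squared forecast error over R replications, V test points and d alternatives.

    p_hat, p_true : arrays of shape (R, V, d) of forecast and true choice probabilities
    """
    R, V, _ = p_true.shape
    return np.sum((p_hat - p_true) ** 2) / (R * V)
```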

Figure 1. The means of the ratio for OPT1-KL and OPT2-KL with $\gamma_1=1$.

Table 2 shows that our proposed approaches outperform the other strategies, and that SAIC and SBIC perform better than MCV. Note that MCV is a model averaging method based on the squared loss, whereas our strategy is based on the KL loss; this suggests that, for the multinomial logit model, the approach based on the KL loss has a strong competitive advantage over the approach based on the squared loss.

Table 2. Simulation results of MSFE.

Setting 2. In order to examine the effects of changing the level of model misspecification, we set $\theta_{i1}=U_i\beta_1+\gamma_2\exp(0.5X_{i6})$ and $\theta_{i2}=U_i\beta_2+\gamma_2\exp(0.6X_{i6})$, with $\beta_1=(1,1,0.2,1.2,0.5)^{\mathrm T}$ and $\beta_2=(0.7,0.9,0.3,1.1,0.6)^{\mathrm T}$, where $U_i=(1,X_{i2},\dots,X_{i5})$, $X_{i2},\dots,X_{i6}$ have the same specification as in the previous design, and $\gamma_2$ controls the level of model misspecification; we let it vary in the set $\{0.25,0.5,0.75\}$. We fit the data with the multinomial logit model (3), again omitting the last covariate $X_{i6}$, and consider $S=2^4-1$ candidate models.

The simulation results under different levels of model misspecification are shown in Table 3. It is seen that OPT1-KL and OPT2-KL always deliver better mean values than their competitors SAIC/AIC and SBIC/BIC, respectively. In terms of SD values, OPT1-KL always performs much better than SAIC and AIC, and OPT2-KL outperforms SBIC and BIC in most cases. This demonstrates the superiority of our methods.

Table 3. Simulation results of the KL loss for Setting 2.

In addition, we explore our strategies with values of $\lambda_n$ other than 2 and $\log(n)$; specifically, we vary $\lambda_n$ from 0.5 to $n^{0.4}$. The simulation results are presented in Figure 2. For $\gamma_2=0.25$, 0.5 and 0.75 with n = 100, the mean KL loss is minimized at $\lambda_n=2.5$, $\lambda_n=2.75$ and $\lambda_n=3.25$, respectively, indicating that the optimal value of $\lambda_n$ increases slightly with the level of model misspecification for small sample sizes. For the larger sample size n = 200, the optimal value of $\lambda_n$ is the same ($\lambda_n=2$) in all cases.

Figure 2. The relationship between the mean of the KL loss and $\lambda_n$. The points with the smallest losses are indicated by filled circles. (a) n = 100 and (b) n = 200.

Setting 3. This setting considers the case where the number of covariates diverges. The data generating process is the same as in Setting 1, except that we change the covariance matrix to $R=(r_{ij})$ with $r_{ij}=0.40^{|i-j|}$, because the model screening method is not suitable when the covariates are strongly dependent, as implied by the first part of Lemma 3.2 in Ando and Li (2014). Then $\beta_1$ and $\beta_2$ are chosen as $\beta_1=(1,1.2,0,0,1.5,0,0,1.1,0,0,0.1,\dots,0.1,0.9)^{\mathrm T}_{[3n^{1/3}]\times1}$ and $\beta_2=(1,1.3,0,0,2,0,0,1.2,0,0,0.1,\dots,0.1,0.8)^{\mathrm T}_{[3n^{1/3}]\times1}$. Similar to Setting 1, we again pretend that the last covariate is missing, which yields $2^{[3n^{1/3}]-2}-1$ candidate models; the computational burden would be heavy, so a method to screen candidate models is desirable. That is, we use penalized regression with the LASSO (Friedman et al., 2010) to prepare candidate models: different tuning parameters may result in different models, each of which is included in the resulting set of candidate models. Obviously, a candidate model contains many redundant variables when the tuning parameter is very small; to avoid generating candidate models with many redundant variables, we use tuning parameters larger than $\hat\zeta$ to prepare the candidate models.
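A sketch of the model-screening idea: fit an l1-penalized multinomial path and keep the distinct covariate supports along the path as candidate models. The scikit-learn interface below is an assumption made for illustration; the paper uses the glmnet implementation of Friedman et al. (2010), and in practice only penalties no weaker than the cross-validated choice $\hat\zeta$ would be used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def screen_candidate_models(X, y, C_grid):
    """Collect distinct covariate supports along an l1-penalized multinomial path.

    X, y   : design matrix (without the intercept column) and class labels
    C_grid : penalty strengths; in sklearn, smaller C means a stronger penalty
    """
    supports = []
    for C in C_grid:
        fit = LogisticRegression(penalty='l1', C=C, solver='saga', max_iter=5000)
        fit.fit(X, y)
        cols = np.flatnonzero(np.any(fit.coef_ != 0.0, axis=0))   # covariates kept in any class
        if cols.size and not any(np.array_equal(cols, s) for s in supports):
            supports.append(cols)
    return supports   # each support defines one candidate model (intercept added separately)
```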

Simulation results are provided in Table 4. Focusing on the mean values, Table 4 shows that OPT1-KL always performs better than SAIC and AIC, and OPT2-KL still outperforms SBIC and BIC. In addition, OPT2-KL always outperforms LASSO. This indicates the advantages of our proposed method over the competing methods.

Table 4. Simulation results of the KL loss for Setting 3.

Setting 4. This setting verifies the estimation consistency. The data generating process is the same as in Setting 3, except that we choose $\beta_1$ and $\beta_2$ as follows: Case 1: $\beta_1=(1,1.5,3,0,0)^{\mathrm T}$, $\beta_2=(1,1.7,4,0,0)^{\mathrm T}$; Case 2: $\beta_1=(1,1.5,3,1.2,0)^{\mathrm T}$, $\beta_2=(1,1.7,4,1.3,0)^{\mathrm T}$. All candidate models include $X_{i1}$, so we consider a total of $2^4-1=15$ candidate models, and the true model is included among them. We compare our proposed method with the other competitive methods in terms of the KL loss and MSFE; the results are presented in Tables 5 and 6, respectively. They show that OPT2-KL always attains the smallest KL loss and MSFE among these methods, which validates the superiority of our method.

Table 5. Simulation results of the KL loss for Setting 4.

Table 6. Simulation results of MSFE for Setting 4.

We also calculate the mean squared error (MSE) of OPT1-KL and OPT2-KL,
$$\mathrm{MSE}=\frac{1}{1000}\sum_{r=1}^{1000}\big\|\beta^{(r)}(\hat\omega)-\beta\big\|^2,$$
where $\beta^{(r)}(\hat\omega)$ is the estimator of $\beta$ in the rth replication and $\beta=(\beta_1^{\mathrm T},\beta_2^{\mathrm T})^{\mathrm T}$. The MSE curves, shown in Figure 3, decrease and approach zero as the sample size n increases, which confirms the estimation consistency numerically.
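The MSE above is simply the average squared distance between the averaged estimate and the true coefficient vector across replications; a one-function sketch:

```python
import numpy as np

def mse(beta_hats, beta_true):
    """Mean squared error over replications: average of ||beta_hat^(r) - beta||^2."""
    return np.mean([np.sum((b - beta_true) ** 2) for b in beta_hats])
```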

Figure 3. Assessing the estimation consistency of OPT1-KL and OPT2-KL. (a) Case 1 and (b) Case 2.

6. An empirical application

In this section, we apply the proposed method to the website phishing data previously used by Abdelhamid et al. (2014). This data set contains three types of websites (702 phishing websites, 548 legitimate websites and 103 suspicious websites). The explanatory variables consist of Server Form Handler, Using Pop-Up Window, Fake HTTPS Protocol, Request URL, URL of Anchor, Website Traffic, URL Length, Age of Domain and Having IP Address. These variables are categorical (or binary), and we transform them into indicator variables, after which the total number of predictors is 16. After the screening step, we analyse this dataset using the resulting candidate multinomial logit models. We randomly select 677 observations as the training sample and predict the remaining instances, repeating this process 500 times. We use the following KL-type prediction loss $L_{\mathrm{KL}}$ to measure the prediction performance:
$$L_{\mathrm{KL}}=-\frac{2}{n_0}\sum_{v=1}^{n_0}\sum_{j=1}^{d}I\{y_{\mathrm{test},v}=j\}\log\{\hat p(y_{\mathrm{test},v}=j)\},$$
where $\{y_{\mathrm{test},1},\dots,y_{\mathrm{test},n_0}\}$ are the test observations and $\hat p(y_{\mathrm{test},v}=j)$ is the predicted probability that the vth test observation takes the value j.
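A sketch of the KL-type prediction loss $L_{\mathrm{KL}}$ above, assuming the labels are coded $1,\dots,d$ and the predicted probabilities are stored row-wise:

```python
import numpy as np

def kl_prediction_loss(y_test, p_hat):
    """KL-type prediction loss: -(2/n0) * sum_v log p_hat(observed class of test point v).

    y_test : length-n0 vector of observed classes coded 1, ..., d
    p_hat  : (n0, d) matrix of predicted class probabilities
    """
    n0 = len(y_test)
    picked = p_hat[np.arange(n0), np.asarray(y_test) - 1]   # predicted prob. of the observed class
    return -2.0 / n0 * np.sum(np.log(picked))
```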

Figure 4 shows the box plots of the KL-type prediction losses of the seven methods. It is observed that our proposed methods OPT1-KL and OPT2-KL perform better than their competitors SAIC/AIC and SBIC/BIC, respectively. In addition, OPT2-KL outperforms LASSO in terms of the KL loss.

Figure 4. Boxplots of KL-type prediction losses by seven methods in the website phishing data.

In addition, we evaluate the prediction performance based on the hit-rate, computed by dividing the number of correct predictions by the size of the test sample; the predicted value for an observation is j ($1\le j\le3$) if the predicted probability of this observation taking the value j is the largest among the three predicted probabilities. We also calculate the optimal rate (OPR) and the worst rate (WOR) of each method, i.e., the proportion of replications in which the method attains the largest and the smallest hit-rate, respectively. Table 7 presents the mean hit-rate values (HRV), OPR and WOR of these methods, and shows that OPT1-KL and OPT2-KL obtain larger HRV and OPR and smaller WOR than the competitors SAIC, SBIC, AIC, BIC and LASSO, demonstrating the superiority of our proposed strategies.
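The hit-rate is the proportion of test observations whose highest predicted probability matches the observed alternative; a minimal sketch (label coding as above is an assumption):

```python
import numpy as np

def hit_rate(y_test, p_hat):
    """Proportion of correct predictions, predicting the class with the largest probability."""
    pred = np.argmax(p_hat, axis=1) + 1        # back to labels 1, ..., d
    return np.mean(pred == np.asarray(y_test))
```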

Table 7. Out-of-sample performances in the website phishing data.

Table 8 reports the Diebold–Mariano test (Diebold & Mariano, 2002) results for the differences in hit-rate. Note that a positive Diebold–Mariano statistic indicates that the estimator in the numerator produces a larger hit-rate than the estimator in the denominator. The test statistics and p-values show that the differences in hit-rate between our methods and the other strategies are all statistically significant.

Table 8. Diebold–Mariano statistics of hit-rate in the website phishing data.

7. Discussion

In the context of the multinomial logit model, we proposed a model averaging estimator and a weight choice criterion based on the KL loss with a penalty term, and proved the asymptotic optimality of the resulting model averaging estimator under some regularity conditions. When the true model is one of the candidate models, the averaged estimators are consistent. In order to reduce the computational burden, we applied a model screening step before averaging. Numerical experiments demonstrate the superiority of the proposed methods over commonly used model selection strategies, model averaging methods, MCV and LASSO in terms of the KL loss and MSFE.

While we consider the multinomial logit model, the extension to other models, such as the ordered logit model, warrants further investigation, and the data structure of the regressors further complicates this issue. Another interesting question is how to choose an optimal $\lambda_n$. Shen et al. (2004) proposed an adaptive method for choosing $\lambda_n$ in model selection criteria for generalized linear models; building a similar method for choosing an optimal $\lambda_n$ in our model averaging framework warrants future research.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Correction Statement

This article has been republished with minor changes. These changes do not impact the academic content of the article.

Additional information

Funding

The research is supported by the Natural Science Foundation of China (No. 11771268) and the Shanghai Research Center for Data Science and Decision Technology.

References

  • Abdelhamid, N., Ayesh, A., & Thabtah, F. (2014). Phishing detection based associative classification data mining. Expert Systems with Applications, 41(13), 5948–5959. https://doi.org/10.1016/j.eswa.2014.03.019
  • Akaike, H. (1973). Maximum likelihood identification of Gaussian autoregressive moving average models. Biometrika, 60(2), 255–265. https://doi.org/10.1093/biomet/60.2.255
  • Ando, T., & Li, K. C. (2014). A model-averaging approach for high-dimensional regression. Journal of the American Statistical Association, 109, 254–265. https://doi.org/10.1080/01621459.2013.838168
  • Bayaga, A. (2010). Multinomial logistic regression: Usage and application in risk analysis. Journal of Applied Quantitative Methods, 5, 288–297.
  • Buckland, S. T., Burnham, K. P., & Augustin, N. H. (1997). Model selection: An integral part of inference. Biometrics, 53(2), 603–618. https://doi.org/10.2307/2533961
  • Cavanaugh, J. E. (1999). A large-sample model selection criterion based on Kullback's symmetric divergence. Statistics & Probability Letters, 42(4), 333–343. https://doi.org/10.1016/S0167-7152(98)00200-4
  • Cheng, T. C. F., Ing, C. K., & Yu, S. H. (2015). Toward optimal model averaging in regression models with time series errors. Journal of Econometrics, 189(2), 321–334. https://doi.org/10.1016/j.jeconom.2015.03.026
  • Diebold, F. X., & Mariano, R. S. (2002). Comparing predictive accuracy. Journal of Business & Economic Statistics, 20, 134–144. https://doi.org/10.1198/073500102753410444
  • Ederington, L. H. (1985). Classification models and bond ratings. Financial Review, 20, 237–262. https://doi.org/10.1111/fire.1985.20.issue-4
  • Fahrmeir, L., & Tutz, G. (2013). Multivariate statistical modelling based on generalized linear models. Springer Science & Business Media.
  • Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1–22. https://doi.org/10.18637/jss.v033.i01
  • Guadagni, P. M., & Little, J. D. C. (1983). A logit model of brand choice calibrated on scanner data. Marketing Science, 2, 203–238. https://doi.org/10.1287/mksc.2.3.203
  • Hansen, B. E. (2007). Least squares model averaging. Econometrica, 75, 1175–1189. https://doi.org/10.1111/ecta.2007.75.issue-4
  • Hansen, B. E., & Racine, J. S. (2012). Jackknife model averaging. Journal of Econometrics, 167, 38–46. https://doi.org/10.1016/j.jeconom.2011.06.019
  • Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science, 14, 382–417. https://doi.org/10.1214/ss/1009212519
  • Hurvich, C. M., Simonoff, J. S., & Tsai, C. L. (1998). Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60, 271–293. https://doi.org/10.1111/1467-9868.00125
  • Konishi, S., & Kitagawa, G. (1996). Generalised information criteria in model selection. Biometrika, 83, 875–890. https://doi.org/10.1093/biomet/83.4.875
  • Li, C., Li, Q., Racine, J., & Zhang, D. Q. (2018). Optimal model averaging of varying coefficient models. Statistica Sinica, 28, 2795–2809. https://doi.org/10.5705/ss.202017.0034
  • Liu, Q., & Okui, R. (2013). Heteroskedasticity-Robust Cp model averaging. Econometrics Journal, 16(3), 463–472. https://doi.org/10.1111/ectj.12009
  • Lu, X., & Su, L. (2015). Jackknife model averaging for quantile regressions. Journal of Econometrics, 188, 40–58. https://doi.org/10.1016/j.jeconom.2014.11.005
  • Mallows, C. L. (1973). Some comments on Cp. Technometrics, 15, 661–675. https://doi.org/10.1080/00401706.1973.10489103
  • Portnoy, S. (1988). Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to infinity. The Annals of Statistics, 16(1), 356–366. https://doi.org/10.1214/aos/1176350710
  • Raftery, A. E., & Zheng, Y. (2003). Discussion: Performance of Bayesian model averaging. Journal of the American Statistical Association, 98(464), 931–938. https://doi.org/10.1198/016214503000000891
  • Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464. https://doi.org/10.1214/aos/1176344136
  • Shen, X., Huang, H. C., & Ye, J. (2004). Adaptive model selection and assessment for exponential family distributions. Technometrics, 46(3), 306–317. https://doi.org/10.1198/004017004000000338
  • Wan, A. T., Zhang, X., & Wang, S. (2014). Frequentist model averaging for multinomial and ordered logit models. International Journal of Forecasting, 30(1), 118–128. https://doi.org/10.1016/j.ijforecast.2013.07.013
  • Wan, A. T., Zhang, X., & Zou, G. (2010). Least squares model averaging by Mallows criterion. Journal of Econometrics, 156(2), 277–283. https://doi.org/10.1016/j.jeconom.2009.10.030
  • Wang, H., Zhang, X., & Zou, G. (2009). Frequentist model averaging estimation: A review. Journal of Systems Science and Complexity, 22(4), 732–748. https://doi.org/10.1007/s11424-009-9198-y
  • White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50(1), 1–25. https://doi.org/10.2307/1912526
  • Zhang, X., & Liu, C. A. (2019). Inference after model averaging in linear regression models. Econometric Theory, 35(4), 816–841. https://doi.org/10.1017/S0266466618000269
  • Zhang, X., Wan, A. T., & Zou, G. (2013). Model averaging by jackknife criterion in models with dependent data. Journal of Econometrics, 174(2), 82–94. https://doi.org/10.1016/j.jeconom.2013.01.004
  • Zhang, X., & Wang, W. (2019). Optimal model averaging estimation for partially linear models. Statistica Sinica, 29, 693–718. https://doi.org/10.5705/ss.202015.0392
  • Zhang, X., Yu, D., Zou, G., & Liang, H. (2016). Optimal model averaging estimation for generalized linear models and generalized linear mixed-effects models. Journal of the American Statistical Association, 111(516), 1775–1790. https://doi.org/10.1080/01621459.2015.1115762
  • Zhang, X., & Yu, J. (2018). Spatial weights matrix selection and model averaging for spatial autoregressive models. Journal of Econometrics, 203(1), 1–18. https://doi.org/10.1016/j.jeconom.2017.05.021
  • Zhang, X., Zou, G., & Carroll, R. J. (2015). Model averaging based on Kullback-Leibler distance. Statistica Sinica, 25, 1583–1598. https://doi.org/10.5705/ss.2013.326
  • Zhang, X., Zou, G., Liang, H., & Carroll, R. J. (2020). Parsimonious model averaging with a diverging number of parameters. Journal of the American Statistical Association, 115(530), 972–984. https://doi.org/10.1080/01621459.2019.1604363
  • Zhao, P., & Li, Z. (2008). Central limit theorem for weighted sum of multivariate random vector sequences. Journal of Mathematics, 28, 171–176. https://doi.org/10.1007/s12033-007-0073-6
  • Zhao, S., Zhou, J., & Yang, G. (2019). Averaging estimators for discrete choice by M-fold cross-validation. Economics Letters, 174, 65–69. https://doi.org/10.1016/j.econlet.2018.10.014
  • Zhu, R., Wan, A. T., Zhang, X., & Zou, G. (2019). A mallows-type model averaging estimator for the varying-coefficient partially linear model. Journal of the American Statistical Association, 114(526), 882–892. https://doi.org/10.1080/01621459.2018.1456936
  • Zhu, R., Zou, G., & Zhang, X. (2018). Model averaging for multivariate multiple regression models. Statistics, 52(1), 205–227. https://doi.org/10.1080/02331888.2017.1367794

Appendix


Proof of Theorem 3.1

Let $\tilde\wp(\omega)=\wp(\omega)-2B_0+2\mathrm{Vec}(U^{\mathrm T})^{\mathrm T}\mathrm{Vec}(\Theta_0^{\mathrm T})$. It is obvious that $\tilde\wp$ and $\wp$ are equivalent in the sense of choosing weights. From the proof of Theorem 1 in Zhang et al. (2016), Theorem 3.1 is valid if the following hold:
(A1) $$\sup_{\omega\in H}\frac{|\mathrm{KL}(\omega)-\mathrm{KL}^*(\omega)|}{\mathrm{KL}^*(\omega)}=o_p(1)$$
and
(A2) $$\sup_{\omega\in H}\frac{|\tilde\wp(\omega)-\mathrm{KL}^*(\omega)|}{\mathrm{KL}^*(\omega)}=o_p(1).$$
By Equation (5), uniformly for $\omega\in H$,
(A3) $$\|\hat\beta(\omega)-\beta^*(\omega)\|=\Big\|\sum_{s=1}^S\omega_s(\hat\beta_{(s)}-\beta^*_{(s)})\Big\|=O_p(n^{-1/2}).$$
Since every element of $b^{(1)}(\theta_i)$ is a choice probability and hence bounded by 1, from (A3), $\lambda_{\max}(Z^{\mathrm T}Z)=\lambda_{\max}(X^{\mathrm T}X)$, Condition R.1 and a Taylor expansion, uniformly for $\omega\in H$,
(A4) $$\begin{aligned}|B\{\hat\beta(\omega)\}-B\{\beta^*(\omega)\}|&=\Big|\Big(\frac{\partial B(\beta)}{\partial\beta}\Big|_{\beta=\tilde\beta(\omega)}\Big)^{\mathrm T}\{\hat\beta(\omega)-\beta^*(\omega)\}\Big|\le\|\hat\beta(\omega)-\beta^*(\omega)\|\sum_{l=1}^{n(d-1)}\|Z_l\|\\
&\le\|\hat\beta(\omega)-\beta^*(\omega)\|\sum_{l=1}^{n(d-1)}(1+\|Z_l\|^2)=\|\hat\beta(\omega)-\beta^*(\omega)\|\{\mathrm{trace}(Z^{\mathrm T}Z)+n(d-1)\}\\
&\le\|\hat\beta(\omega)-\beta^*(\omega)\|\{\lambda_{\max}(Z^{\mathrm T}Z)k(d-1)+n(d-1)\}=O_p(n^{1/2}),\end{aligned}$$
where $\tilde\beta(\omega)$ lies between $\hat\beta(\omega)$ and $\beta^*(\omega)$ and $Z_l$ denotes the lth row of Z. From $|U_{ij}|<1$, $i=1,\dots,n$, $j=1,\dots,d-1$, we obtain $\|\mathrm{Vec}(U^{\mathrm T})\|^2=O(n)$, which, together with (A3), $\lambda_{\max}(Z^{\mathrm T}Z)=\lambda_{\max}(X^{\mathrm T}X)$ and Condition R.1, gives
(A5) $$\begin{aligned}&\big|\mathrm{Vec}(U^{\mathrm T})^{\mathrm T}[\mathrm{Vec}(\Theta^{\mathrm T}\{\hat\beta(\omega)\})-\mathrm{Vec}(\Theta^{\mathrm T}\{\beta^*(\omega)\})]\big|=\big|\mathrm{Vec}(U^{\mathrm T})^{\mathrm T}Z\{\hat\beta(\omega)-\beta^*(\omega)\}\big|\\
&\quad\le\big[\{\mathrm{Vec}(U^{\mathrm T})^{\mathrm T}Z_{(1)}\}^2+\cdots+\{\mathrm{Vec}(U^{\mathrm T})^{\mathrm T}Z_{(k(d-1))}\}^2\big]^{1/2}\|\hat\beta(\omega)-\beta^*(\omega)\|\\
&\quad\le\big[\|\mathrm{Vec}(U^{\mathrm T})\|^2\{\|Z_{(1)}\|^2+\cdots+\|Z_{(k(d-1))}\|^2\}\big]^{1/2}\|\hat\beta(\omega)-\beta^*(\omega)\|=\big[\|\mathrm{Vec}(U^{\mathrm T})\|^2\,\mathrm{trace}(Z^{\mathrm T}Z)\big]^{1/2}\|\hat\beta(\omega)-\beta^*(\omega)\|\\
&\quad\le\big[\lambda_{\max}(Z^{\mathrm T}Z)k(d-1)\|\mathrm{Vec}(U^{\mathrm T})\|^2\big]^{1/2}\|\hat\beta(\omega)-\beta^*(\omega)\|=O_p(n^{1/2}),\end{aligned}$$
where $Z_{(j)}$ is the jth column of Z. Note that $\sum_{l=1}^{n(d-1)}\|Z_l\|^2=\mathrm{trace}(Z^{\mathrm T}Z)\le\lambda_{\max}(Z^{\mathrm T}Z)k(d-1)$, which, combined with the central limit theorem, Condition R.1 and the second part of Condition R.2, yields $\|\mathrm{Vec}(\Xi^{\mathrm T})^{\mathrm T}Z\|=O_p(n^{1/2})$. From this and (A3), we have
(A6) $$\big|\mathrm{Vec}(\Xi^{\mathrm T})^{\mathrm T}[\mathrm{Vec}(\Theta^{\mathrm T}\{\hat\beta(\omega)\})-\mathrm{Vec}(\Theta^{\mathrm T}\{\beta^*(\omega)\})]\big|=\big|\mathrm{Vec}(\Xi^{\mathrm T})^{\mathrm T}Z\{\hat\beta(\omega)-\beta^*(\omega)\}\big|\le\|\mathrm{Vec}(\Xi^{\mathrm T})^{\mathrm T}Z\|\,\|\hat\beta(\omega)-\beta^*(\omega)\|=O_p(1).$$
From Condition R.1 and the first part of Condition R.2,
$$\sum_{i=1}^n\theta_i^{\mathrm T}(\beta^*_{(s)})\mathrm{Cov}(\Xi_i)\theta_i(\beta^*_{(s)})<C_2\sum_{i=1}^n\|\theta_i(\beta^*_{(s)})\|^2=C_2\,\beta^{*\mathrm T}_{(s)}Z^{\mathrm T}Z\beta^*_{(s)}\le C_2\lambda_{\max}(Z^{\mathrm T}Z)\|\beta^*_{(s)}\|^2=O(n)$$
and
$$\max_{i\in\{1,\dots,n\}}\|\theta_i(\beta^*_{(s)})\|^2\Big/\sum_{i=1}^n\|\theta_i(\beta^*_{(s)})\|^2\le\max_{i\in\{1,\dots,n\}}(d-1)\|X_i\|^2\|\beta^*_{(s)}\|^2\Big/\{\|\beta^*_{(s)}\|^2\lambda_{\min}(Z^{\mathrm T}Z)\}=o(1).$$
These, together with Theorem 1 in P. Zhao and Li (2008) and the second part of Condition R.2, show that, uniformly for $\omega\in H$,
(A7) $$\mathrm{Vec}(\Xi^{\mathrm T})^{\mathrm T}\mathrm{Vec}(\Theta^{\mathrm T}\{\beta^*(\omega)\})=\sum_{i=1}^n\Xi_i\theta_i\{\beta^*(\omega)\}=\sum_{s=1}^S w_s\sum_{i=1}^n\Xi_i\theta_i(\beta^*_{(s)})=O_p(n^{1/2}).$$
Therefore, (A4) and (A5) indicate that
(A8) $$\sup_{\omega\in H}|\mathrm{KL}(\omega)-\mathrm{KL}^*(\omega)|\le2\sup_{\omega\in H}|B\{\hat\beta(\omega)\}-B\{\beta^*(\omega)\}|+2\sup_{\omega\in H}\big|\mathrm{Vec}(U^{\mathrm T})^{\mathrm T}[\mathrm{Vec}(\Theta^{\mathrm T}\{\hat\beta(\omega)\})-\mathrm{Vec}(\Theta^{\mathrm T}\{\beta^*(\omega)\})]\big|=O_p(n^{1/2}),$$
and (A4)–(A7) indicate that
(A9) $$\begin{aligned}\sup_{\omega\in H}|\tilde\wp(\omega)-\mathrm{KL}^*(\omega)|&\le2\sup_{\omega\in H}|B\{\hat\beta(\omega)\}-B\{\beta^*(\omega)\}|+2\sup_{\omega\in H}\big|\mathrm{Vec}(Y^{\mathrm T})^{\mathrm T}\mathrm{Vec}(\Theta^{\mathrm T}\{\hat\beta(\omega)\})-\mathrm{Vec}(U^{\mathrm T})^{\mathrm T}\mathrm{Vec}(\Theta^{\mathrm T}\{\beta^*(\omega)\})\big|+\lambda_n(d-1)\omega^{\mathrm T}K\\
&\le2\sup_{\omega\in H}|B\{\hat\beta(\omega)\}-B\{\beta^*(\omega)\}|+2\sup_{\omega\in H}\big|\mathrm{Vec}(U^{\mathrm T})^{\mathrm T}[\mathrm{Vec}(\Theta^{\mathrm T}\{\hat\beta(\omega)\})-\mathrm{Vec}(\Theta^{\mathrm T}\{\beta^*(\omega)\})]\big|\\
&\quad+2\sup_{\omega\in H}\big|\mathrm{Vec}(\Xi^{\mathrm T})^{\mathrm T}[\mathrm{Vec}(\Theta^{\mathrm T}\{\hat\beta(\omega)\})-\mathrm{Vec}(\Theta^{\mathrm T}\{\beta^*(\omega)\})]\big|+2\sup_{\omega\in H}\big|\mathrm{Vec}(\Xi^{\mathrm T})^{\mathrm T}\mathrm{Vec}(\Theta^{\mathrm T}\{\beta^*(\omega)\})\big|+\lambda_n(d-1)\omega^{\mathrm T}K\\
&=O_p(n^{1/2})+\lambda_n(d-1)\omega^{\mathrm T}K.\end{aligned}$$
Now, from (A8), (A9) and Conditions R.3 and R.4, we obtain (A1) and (A2). This completes the proof of Theorem 3.1.


Proof of Theorem 3.2

From the proof of Theorem 3.1, it suffices to show that, as $n\to\infty$,
(A10) $$\sup_{\omega\in H}\frac{|B\{\hat\beta(\omega)\}-B\{\beta^*(\omega)\}|}{\mathrm{KL}^*(\omega)}=o_p(1),$$
(A11) $$\sup_{\omega\in H}\frac{\big|\mathrm{Vec}(U^{\mathrm T})^{\mathrm T}[\mathrm{Vec}(\Theta^{\mathrm T}\{\hat\beta(\omega)\})-\mathrm{Vec}(\Theta^{\mathrm T}\{\beta^*(\omega)\})]\big|}{\mathrm{KL}^*(\omega)}=o_p(1),$$
(A12) $$\sup_{\omega\in H}\frac{\big|\mathrm{Vec}(\Xi^{\mathrm T})^{\mathrm T}\mathrm{Vec}(\Theta^{\mathrm T}\{\beta^*(\omega)\})\big|}{\mathrm{KL}^*(\omega)}=o_p(1),$$
(A13) $$\sup_{\omega\in H}\frac{\big|\mathrm{Vec}(\Xi^{\mathrm T})^{\mathrm T}[\mathrm{Vec}(\Theta^{\mathrm T}\{\hat\beta(\omega)\})-\mathrm{Vec}(\Theta^{\mathrm T}\{\beta^*(\omega)\})]\big|}{\mathrm{KL}^*(\omega)}=o_p(1),$$
(A14) $$\sup_{\omega\in H}\frac{\lambda_n(d-1)\omega^{\mathrm T}K}{\mathrm{KL}^*(\omega)}=o(1).$$
First, we show that for any fixed $\epsilon>0$ there exists $\delta_\epsilon>0$ such that, for all sufficiently large n,
$$P\big(\|n^{1/2}\{(d-1)k\}^{-1/2}(\hat\beta_s-\beta_s^*)\|\le\delta_\epsilon\big)\ge1-\epsilon.$$
Write $u_s^*=(b^{(1)}[(I_{d-1}\otimes X_{(s),1})\beta_s^*]^{\mathrm T},\dots,b^{(1)}[(I_{d-1}\otimes X_{(s),n})\beta_s^*]^{\mathrm T})^{\mathrm T}$, an $n(d-1)\times1$ vector. The pseudo-true value $\beta_s^*$ minimizes the KL divergence, so that $\partial\{B(\beta_s)-\mathrm{Vec}(U^{\mathrm T})^{\mathrm T}\mathrm{Vec}(\Theta^{\mathrm T}(\beta_s))\}/\partial\beta_s\,|_{\beta_s=\beta_s^*}=0_{k_s(d-1)\times1}$, which implies that $Z_{(s)}^{\mathrm T}u_s^*=Z_{(s)}^{\mathrm T}\mathrm{Vec}(U^{\mathrm T})$. Then, by a first-order Taylor expansion of $\partial\log f(Y\mid Z_{(s)}\beta_s)/\partial\beta_s=0_{k_s(d-1)\times1}$ around $\beta_s^*$, we obtain
$$0_{k_s(d-1)\times1}=Z_{(s)}^{\mathrm T}\{\mathrm{Vec}(Y^{\mathrm T})-\mathrm{Vec}(U^{\mathrm T})\}-Z_{(s)}^{\mathrm T}D_sZ_{(s)}(\hat\beta_s-\beta_s^*),$$
with $D_s$ evaluated at a point $\tilde\beta_s$ between $\hat\beta_s$ and $\beta_s^*$, and hence
$$n^{1/2}\{(d-1)k\}^{-1/2}(\hat\beta_s-\beta_s^*)=\Big(\frac1nZ_{(s)}^{\mathrm T}D_sZ_{(s)}\Big|_{\beta_s=\tilde\beta_s}\Big)^{-1}Z_{(s)}^{\mathrm T}\{\mathrm{Vec}(Y^{\mathrm T})-\mathrm{Vec}(U^{\mathrm T})\}\{(d-1)k\}^{-1/2}n^{-1/2}.$$
It follows from Condition R.6 that, for sufficiently large n,
$$P\big(\|n^{1/2}\{(d-1)k\}^{-1/2}(\hat\beta_s-\beta_s^*)\|>\delta\big)\le P\big(C_0^{-1}\|Z_{(s)}^{\mathrm T}\{\mathrm{Vec}(Y^{\mathrm T})-\mathrm{Vec}(U^{\mathrm T})\}\|\{(d-1)k\}^{-1/2}n^{-1/2}>\delta\big)\le\frac{\bar\lambda(d-1)\sum_{i=1}^n\|X_i\|^2}{C_0^2\delta^2(d-1)kn}\le\frac{C_2\bar C}{C_0^2\delta^2},$$
where the last two bounds use Conditions R.1 and R.2. By taking $\delta_\epsilon=(C_2\bar C)^{1/2}/(\epsilon^{1/2}C_0)$, we obtain $\|\hat\beta_s-\beta_s^*\|=\|\hat\beta_{(s)}-\beta^*_{(s)}\|=O_p(\{k(d-1)\}^{1/2}n^{-1/2})$, and thus
(A15) $$\|\hat\beta(\omega)-\beta^*(\omega)\|\le\sum_{s=1}^S\omega_s\|\hat\beta_{(s)}-\beta^*_{(s)}\|=O_p(\{k(d-1)\}^{1/2}n^{-1/2}).$$
From (A15) and Condition R.5, uniformly for $\omega\in H$,
(A16) $$|B\{\hat\beta(\omega)\}-B\{\beta^*(\omega)\}|\le\|\hat\beta(\omega)-\beta^*(\omega)\|\sum_{l=1}^{n(d-1)}\|Z_l\|=\|\hat\beta(\omega)-\beta^*(\omega)\|(d-1)\sum_{i=1}^n\|X_i\|=O_p(k(d-1)n^{1/2}),$$
and, similarly to the proof of (A5),
(A17) $$\big|\mathrm{Vec}(U^{\mathrm T})^{\mathrm T}[\mathrm{Vec}(\Theta^{\mathrm T}\{\hat\beta(\omega)\})-\mathrm{Vec}(\Theta^{\mathrm T}\{\beta^*(\omega)\})]\big|\le\big[\lambda_{\max}(Z^{\mathrm T}Z)k(d-1)\|\mathrm{Vec}(U^{\mathrm T})\|^2\big]^{1/2}\|\hat\beta(\omega)-\beta^*(\omega)\|=O_p(kn^{1/2}).$$
Combining (A16), (A17) and Condition R.7 yields (A10) and (A11). Using $\sum_{l=1}^{n(d-1)}\|Z_l\|^2=\mathrm{trace}(Z^{\mathrm T}Z)\le\lambda_{\max}(Z^{\mathrm T}Z)k(d-1)$, the central limit theorem, Condition R.1 and the second part of Condition R.2, we obtain $\|\mathrm{Vec}(\Xi^{\mathrm T})^{\mathrm T}Z\|=O_p(\{k(d-1)\}^{1/2}n^{1/2})$, which, combined with (A15), implies
(A18) $$\big|\mathrm{Vec}(\Xi^{\mathrm T})^{\mathrm T}[\mathrm{Vec}(\Theta^{\mathrm T}\{\hat\beta(\omega)\})-\mathrm{Vec}(\Theta^{\mathrm T}\{\beta^*(\omega)\})]\big|=O_p(k(d-1)).$$
From Condition R.1 and the first part of Condition R.2,
$$\sum_{i=1}^n\theta_i^{\mathrm T}(\beta^*_{(s)})\mathrm{Cov}(\Xi_i)\theta_i(\beta^*_{(s)})<C_2\sum_{i=1}^n\|\theta_i(\beta^*_{(s)})\|^2=C_2\,\beta^{*\mathrm T}_{(s)}Z^{\mathrm T}Z\beta^*_{(s)}\le C_2\lambda_{\max}(Z^{\mathrm T}Z)\|\beta^*_{(s)}\|^2=O(nk)$$
and $\max_{i\in\{1,\dots,n\}}\|\theta_i(\beta^*_{(s)})\|^2/\sum_{i=1}^n\|\theta_i(\beta^*_{(s)})\|^2=o(1)$. These, together with Theorem 1 in P. Zhao and Li (2008) and the second part of Condition R.2, show that, uniformly for $\omega\in H$,
(A19) $$\mathrm{Vec}(\Xi^{\mathrm T})^{\mathrm T}\mathrm{Vec}(\Theta^{\mathrm T}\{\beta^*(\omega)\})=\sum_{i=1}^n\Xi_i\theta_i\{\beta^*(\omega)\}=\sum_{s=1}^S w_s\sum_{i=1}^n\Xi_i\theta_i(\beta^*_{(s)})=O_p((nk)^{1/2}).$$
Using (A18), (A19) and Conditions R.4 and R.7, the claims (A12)–(A14) follow. This completes the proof of Theorem 3.2.


Proof of Theorem 4.1

Note that the true value $\beta_0$ minimizes the KL divergence, so that $\partial\{B(\beta)-\mathrm{Vec}(U^{\mathrm T})^{\mathrm T}\mathrm{Vec}(\Theta^{\mathrm T}(\beta))\}/\partial\beta\,|_{\beta=\beta_0}=0_{k(d-1)\times1}$, which implies that
(A20) $$Z^{\mathrm T}u_0=Z^{\mathrm T}\mathrm{Vec}(U^{\mathrm T}),$$
where $u_0=(b^{(1)}[(I_{d-1}\otimes X_1)\beta_0]^{\mathrm T},\dots,b^{(1)}[(I_{d-1}\otimes X_n)\beta_0]^{\mathrm T})^{\mathrm T}$ is an $n(d-1)\times1$ vector. Then, by a second-order Taylor expansion of $B\{\hat\beta(\omega_{\mathrm{true}})\}$ at $\beta_0$, we have
(A21) $$B\{\hat\beta(\omega_{\mathrm{true}})\}=B_0+\{\hat\beta(\omega_{\mathrm{true}})-\beta_0\}^{\mathrm T}Z^{\mathrm T}u_0+\tfrac12\{\hat\beta(\omega_{\mathrm{true}})-\beta_0\}^{\mathrm T}Z^{\mathrm T}\tilde D(\omega_{\mathrm{true}})Z\{\hat\beta(\omega_{\mathrm{true}})-\beta_0\},$$
where $\tilde D(\omega_{\mathrm{true}})=\mathrm{diag}\{D_i[\beta_0+r(\hat\beta(\omega_{\mathrm{true}})-\beta_0)]\}_{i=1,\dots,n}$ for some $r\in(0,1)$. Since every element of the symmetric matrix $D_i[\beta_0+r(\hat\beta(\omega_{\mathrm{true}})-\beta_0)]$ is bounded, from (A20), (A21), Equation (7) and Condition R.1 we have
(A22) $$\begin{aligned}\mathrm{KL}(\omega_{\mathrm{true}})&=2(B\{\hat\beta(\omega_{\mathrm{true}})\}-B_0)-2\mathrm{Vec}(U^{\mathrm T})^{\mathrm T}[\mathrm{Vec}(\Theta^{\mathrm T}\{\hat\beta(\omega_{\mathrm{true}})\})-\mathrm{Vec}(\Theta_0^{\mathrm T})]\\
&=2\{\hat\beta(\omega_{\mathrm{true}})-\beta_0\}^{\mathrm T}Z^{\mathrm T}u_0+\{\hat\beta(\omega_{\mathrm{true}})-\beta_0\}^{\mathrm T}Z^{\mathrm T}\tilde D(\omega_{\mathrm{true}})Z\{\hat\beta(\omega_{\mathrm{true}})-\beta_0\}-2\{\hat\beta(\omega_{\mathrm{true}})-\beta_0\}^{\mathrm T}Z^{\mathrm T}\mathrm{Vec}(U^{\mathrm T})\\
&=\{\hat\beta(\omega_{\mathrm{true}})-\beta_0\}^{\mathrm T}Z^{\mathrm T}\tilde D(\omega_{\mathrm{true}})Z\{\hat\beta(\omega_{\mathrm{true}})-\beta_0\}\le\lambda_{\max}\{\tilde D(\omega_{\mathrm{true}})\}\lambda_{\max}(Z^{\mathrm T}Z)\|\hat\beta(\omega_{\mathrm{true}})-\beta_0\|^2\\
&\le\max_{i=1,\dots,n}\mathrm{trace}\big(b^{(2)}[(I_{d-1}\otimes X_i)\{\beta_0+r(\hat\beta(\omega_{\mathrm{true}})-\beta_0)\}]\big)\lambda_{\max}(X^{\mathrm T}X)\|\hat\beta(\omega_{\mathrm{true}})-\beta_0\|^2\\
&\le(d-1)\lambda_{\max}(X^{\mathrm T}X)\|\hat\beta(\omega_{\mathrm{true}})-\beta_0\|^2=O_p(n)\|\hat\beta(\omega_{\mathrm{true}})-\beta_0\|^2=O_p(1).\end{aligned}$$
In addition, let $\mathcal A$ denote the set of indices i for which the inequality in Condition R.8 holds. From Condition R.8 and the second-order Taylor expansion of $B\{\hat\beta(\hat\omega)\}$ at $\beta_0$, we have
(A23) $$\begin{aligned}\mathrm{KL}(\hat\omega)&=2(B\{\hat\beta(\hat\omega)\}-B_0)-2\mathrm{Vec}(U^{\mathrm T})^{\mathrm T}[\mathrm{Vec}(\Theta^{\mathrm T}\{\hat\beta(\hat\omega)\})-\mathrm{Vec}(\Theta_0^{\mathrm T})]\\
&=2\{\hat\beta(\hat\omega)-\beta_0\}^{\mathrm T}Z^{\mathrm T}u_0-2\{\hat\beta(\hat\omega)-\beta_0\}^{\mathrm T}Z^{\mathrm T}\mathrm{Vec}(U^{\mathrm T})+\sum_{i=1}^n\big\|D_i^{1/2}[\beta_0+r(\hat\beta(\hat\omega)-\beta_0)](I_{d-1}\otimes X_i)\{\hat\beta(\hat\omega)-\beta_0\}\big\|^2\\
&=\sum_{i=1}^n\big\|D_i^{1/2}[\beta_0+r(\hat\beta(\hat\omega)-\beta_0)](I_{d-1}\otimes X_i)\{\hat\beta(\hat\omega)-\beta_0\}\big\|^2\ge\sum_{i\in\mathcal A}\underline d\,\|\hat\beta(\hat\omega)-\beta_0\|^2\ge\underline d\,n_{\mathcal A}\|\hat\beta(\hat\omega)-\beta_0\|^2,\end{aligned}$$
where $n_{\mathcal A}$ is the number of elements of $\mathcal A$; from Condition R.8, $n_{\mathcal A}$ has the same order as n. Note that
$$\tilde\wp(\omega_{\mathrm{true}})=\mathrm{KL}(\omega_{\mathrm{true}})-2\mathrm{Vec}(\Xi^{\mathrm T})^{\mathrm T}\mathrm{Vec}(\Theta^{\mathrm T}\{\hat\beta(\omega_{\mathrm{true}})\})+\lambda_n(d-1)k_{\mathrm{true}}$$
and
(A24) $$\tilde\wp(\hat\omega)=\mathrm{KL}(\hat\omega)-2\mathrm{Vec}(\Xi^{\mathrm T})^{\mathrm T}\mathrm{Vec}(\Theta^{\mathrm T}\{\hat\beta(\hat\omega)\})+\lambda_n(d-1)\hat\omega^{\mathrm T}K.$$
These, together with (A22), (A24) and $\tilde\wp(\omega_{\mathrm{true}})\ge\tilde\wp(\hat\omega)$, give
$$\mathrm{KL}(\omega_{\mathrm{true}})-2\mathrm{Vec}(\Xi^{\mathrm T})^{\mathrm T}[\mathrm{Vec}(\Theta^{\mathrm T}\{\hat\beta(\omega_{\mathrm{true}})\})-\mathrm{Vec}(\Theta_0^{\mathrm T})]+\lambda_n(d-1)k_{\mathrm{true}}+2\mathrm{Vec}(\Xi^{\mathrm T})^{\mathrm T}[\mathrm{Vec}(\Theta^{\mathrm T}\{\hat\beta(\hat\omega)\})-\mathrm{Vec}(\Theta_0^{\mathrm T})]-\lambda_n(d-1)\hat\omega^{\mathrm T}K\ge\mathrm{KL}(\hat\omega).$$
Note that $\sum_{l=1}^{n(d-1)}\|Z_l\|^2=\mathrm{trace}(Z^{\mathrm T}Z)\le\lambda_{\max}(Z^{\mathrm T}Z)k(d-1)$, which, together with the central limit theorem, Condition R.1 and the second part of Condition R.2, yields $\|\mathrm{Vec}(\Xi^{\mathrm T})^{\mathrm T}Z\|=O_p(n^{1/2})$. From this, (A23) and Equation (7), we obtain
$$O_p(1)+\lambda_n(d-1)k_{\mathrm{true}}+O_p(n^{1/2})\|\hat\beta(\hat\omega)-\beta_0\|+\lambda_n(d-1)\hat\omega^{\mathrm T}K\ge\underline d\,n_{\mathcal A}\|\hat\beta(\hat\omega)-\beta_0\|^2.$$

Thus, there exist $\tilde a_n=O_p(n)$ and $\tilde c_n=O_p(n^{1/2})$ such that $\tilde a_n\|\hat\beta(\hat\omega)-\beta_0\|^2-\tilde c_n\|\hat\beta(\hat\omega)-\beta_0\|\le O_p(\lambda_n)$. This leads to $\|\hat\beta(\hat\omega)-\beta_0\|^2-(\tilde c_n/\tilde a_n)\|\hat\beta(\hat\omega)-\beta_0\|\le O_p(\lambda_n/\tilde a_n)$, and thus $\{\|\hat\beta(\hat\omega)-\beta_0\|-\tilde c_n/(2\tilde a_n)\}^2\le O_p(\lambda_n/\tilde a_n)+\tilde c_n^2/(4\tilde a_n^2)$, which implies that $\|\hat\beta(\hat\omega)-\beta_0\|=O_p(n^{-1/2}\lambda_n^{1/2})$. This completes the proof of Theorem 4.1.


Proof of Theorem 4.2

The proof of Theorem 4.2 can be treated analogously to the proof of Theorem 4.1.