Matching a Distribution by Matching Quantiles Estimation


Abstract

Motivated by the problem of selecting representative portfolios for backtesting counterparty credit risks, we propose a matching quantiles estimation (MQE) method for matching a target distribution by that of a linear combination of a set of random variables. An iterative procedure based on the ordinary least-squares estimation (OLS) is proposed to compute MQE. MQE can be easily modified by adding a LASSO penalty term if a sparse representation is desired, or by restricting the matching to a certain range of quantiles to match a part of the target distribution. The convergence of the algorithm and the asymptotic properties of the estimation, both with and without LASSO, are established. A measure and an associated statistical test are proposed to assess the goodness-of-match. The finite sample properties are illustrated by simulation. An application in selecting a counterparty representative portfolio with a real dataset is reported. The proposed MQE also finds applications in portfolio tracking, which demonstrates the usefulness of combining MQE with LASSO.


1. INTRODUCTION

Basel III is a global regulatory standard on bank capital adequacy, stress testing and market liquidity risk put forward by the Basel Committee on Banking Supervision in 2010–2011, in response to the deficiencies in risk management revealed by the late-2000s financial crisis. One of the mandated requirements under Basel III is an extension of the backtesting of internal counterparty credit risk (CCR) models. Backtesting tests the performance of CCR measurement, to determine the need for recalibration of the simulation and/or pricing models and readjustment of capital charges. Since the number of trades between two major banks could easily be on the order of tens of thousands or more, Basel III allows banks to backtest representative portfolios for each counterparty, which consist of subsets of the trades. However, the selected representative portfolios should represent the various characteristics of the total counterparty portfolio including risk exposures, sensitivity to the risk factors, etc. We propose in this article a new method for constructing such a representative portfolio. The basic idea is to match the distribution of the total counterparty portfolio by that of a selected portfolio. However, we do not match the two distribution functions directly. Instead we choose the representative portfolio to minimize the mean squared difference between the quantiles of the two distributions across all levels. This leads to the matching quantiles estimation (MQE) for the purpose of matching a target distribution. To the best of our knowledge, MQE has not been used in this particular context, though the idea of matching quantiles has been explored in other contexts; see, for example, Karian and Dudewicz (1999), Small and McLeish (1994), and Dominicy and Veredas (2013). Furthermore, our inference procedure is different from those in the aforementioned papers due to the different nature of our problem.

Formally, the proposed MQE bears some similarities to the ordinary least squares estimation (OLS) for regression models. However, the fundamental difference is that MQE is for matching (unconditional) distribution functions, while OLS is for estimating conditional mean functions. Unlike OLS, MQE seldom admits an explicit expression. We propose an iterative algorithm applying least-squares estimation repeatedly to the recursively sorted data. We show that the algorithm converges as the mean squared difference of the two-sample quantiles decreases monotonically. Some asymptotic properties of MQE are established based on the Bahadur-Kiefer bounds for the empirical quantile processes.

The MQE method accommodates some variations naturally. First, it can be performed by matching the quantiles between levels α₁ and α₂ only, where 0 ⩽ α₁ < α₂ ⩽ 1. The resulting estimator matches only a part of the target distribution. This could be attractive if we are only interested in mimicking, for example, the behavior at the lower end of the target distribution. Second, MQE can also be performed with a LASSO penalty, leading to a sparser representation. Though MQE was motivated by the problem of estimating representative portfolios, its potential usefulness is wider. We illustrate how it can be used in a portfolio tracking problem. Since MQE does not require the data to be paired, it can also be used for analyzing asynchronous measurements which arise from various applications including atmospheric sciences (He et al. 2012), space physics, and other areas (O’Brien et al. 2001).

MQE is an estimation method for matching unconditional distribution functions. It is different from the popular quantile regression, which refers to the estimation of conditional quantile functions; see Koenker (2005) and references therein. It also differs from the unconditional quantile regression of Firpo et al. (2009), which deals with estimating the impact of explanatory variables on quantiles of the unconditional distribution of an outcome variable. For nonnormal models, sample quantiles have been used for different inference purposes. For example, Kosorok (1999) used quantiles for nonparametric two-sample tests. Gneiting (2011) argued that quantiles should be used as the optimal point forecasts under some circumstances. MQE also differs from the statistical asynchronous regression (SAR) method introduced by O’Brien et al. (2001), although it can provide an alternative way to establish a regression-like relationship based on unpaired data. See Remark 1(v) in Section 2.

The rest of the article is organized as follows. The MQE methodology including an iterative algorithm is presented in Section 2. The convergence of the algorithm is established in Section 3. Section 4 presents some asymptotic properties of MQE. To assess the goodness-of-match, a measure and an associated statistical test are proposed in Section 5. The finite sample properties of MQE are examined in simulation in Section 6. We illustrate in Section 7 how the proposed methodology can be used to select a representative portfolio for CCR backtesting with a real dataset. Section 8 deals with the application of MQE to a different financial problem—tracking portfolios. It also illustrates the usefulness of combining MQE and LASSO together.

2. METHODOLOGY

Let Y be a random variable, and X = (X₁, …, X_p)′ be a collection of p random variables. The goal is to find a linear combination $β' X = β_{1} X_{1} + \dots + β_{p} X_{p}$ such that its distribution matches the distribution of Y. We propose to search for β such that the following integrated squared difference of the two quantile functions is minimized: $\int_{0}^{1} \{Q_{Y}(α) - Q_{β' X}(α)\}^{2} \, dα, \qquad (2.2)$ where Q_ξ(α) denotes the αth quantile of random variable ξ, that is, $P\{ξ \leq Q_{ξ}(α)\} = α$ for α ∈ [0, 1]. In fact, (2.2) is a squared Mallows’ metric introduced by Mallows (1972) and Tanaka (1973). It is also known as the L₂-Wasserstein distance (del Barrio et al. 1999). See also Section 8 of Bickel and Freedman (1981) for a mathematical account of the Mallows metrics.

Given that the goal is to match the two distributions, one may adopt the approach of matching the two distribution functions or density functions directly. However, our approach of matching quantiles provides a better fit at the tails of the distributions, which is important for risk management; see Remark 1(iv) below. Furthermore, it turns out that matching quantiles is easier than matching distribution functions or density functions directly.

Suppose that random samples {Y₁, …, Y_n} and {X₁, …, X_n}, drawn respectively from the distributions of Y and X, are available. Let Y_(1) ⩽ ⋯ ⩽ Y_(n) be the order statistics of Y₁, …, Y_n. Then Y_(j) is the (j/n)th sample quantile. To find the sample counterpart of the minimizer of (2.2), we define the estimator $\hat{β} = \arg\min_{β} \sum_{j=1}^{n} \{Y_{(j)} - (β' X)_{(j)}\}^{2}, \qquad (2.3)$ where (β′X)_(1) ⩽ ⋯ ⩽ (β′X)_(n) are the order statistics of β′X₁, …, β′X_n. We call $\hat{β}$ the matching quantiles estimator (MQE), as it tries to match the quantiles at all possible levels between 0 and 1. Unfortunately, $\hat{β}$ does not admit an explicit solution. We define below an iterative algorithm to evaluate its value. We will show that the algorithm converges. To this end, we introduce some notation first. Suppose that β^(k) is the kth iterated value, and let {X^(k)_(j)} be a permutation of {X_j} such that $(β^{(k)})' X_{(1)}^{(k)} \leq \dots \leq (β^{(k)})' X_{(n)}^{(k)}. \qquad (2.4)$

Step 1. Set an initial value β⁽⁰⁾.
Step 2. For k ⩾ 1, let $β^{(k)} = \arg\min_{β} R_{k}(β)$, where $R_{k}(β) = \frac{1}{n} \sum_{j=1}^{n} (Y_{(j)} - β' X_{(j)}^{(k-1)})^{2}, \qquad (2.5)$ and {X^(k−1)_(j)} is defined as in (2.4). We stop the iteration when |R_k(β^(k)) − R_{k−1}(β^(k−1))| is smaller than a prescribed small positive constant. We then define $\hat{β} = β^{(k)}$.

In the above algorithm, we may take the ordinary least squares estimator (OLS) $\tilde{β}$ as the initial value β⁽⁰⁾, where $\tilde{β} \equiv \arg\min_{β} \sum_{j=1}^{n} (Y_{j} - β' X_{j})^{2} = (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}' \mathbf{Y},$ with $\mathbf{Y} = (Y_{1}, \dots, Y_{n})'$ and $\mathbf{X}$ an n × p matrix with X′_j as its jth row. However, we stress that OLS $\tilde{β}$ is an estimator for the minimizer of the mean squared error $E\{(Y - β' X)^{2}\}, \qquad (2.7)$ which is different from the minimizer of (2.2) in general. Hence, OLS $\tilde{β}$ and MQE $\hat{β}$ are two estimators for two different parameters, although the MQE is obtained by applying least squares estimation repeatedly to the recursively sorted data; see Step 2 above.
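The two steps above can be sketched in a few lines of NumPy. This is only an illustration under our own assumptions, not the authors' implementation: the function name `mqe`, the tolerance `tol`, the iteration cap `max_iter`, and the simulated data (one sample from the toy model Y = X + Z with X and Z independent N(0, 1)) are hypothetical choices.

```python
import numpy as np

def mqe(y, X, beta0=None, tol=1e-10, max_iter=500):
    """Sketch of the iterative MQE: regress sorted y on the rows of X
    re-ordered by the current fitted values, and repeat."""
    y_sorted = np.sort(y)                                # Y_(1) <= ... <= Y_(n)
    if beta0 is None:
        # Step 1: OLS on the original pairing as the initial value
        beta0 = np.linalg.lstsq(X, y, rcond=None)[0]
    beta = np.asarray(beta0, dtype=float)
    history = []                                         # R_k(beta^(k)), k = 1, 2, ...
    for _ in range(max_iter):
        # Permutation of the rows of X so that X_sorted @ beta is increasing
        order = np.argsort(X @ beta, kind="stable")
        X_sorted = X[order]
        # Step 2: least squares of sorted y on the re-ordered rows
        beta = np.linalg.lstsq(X_sorted, y_sorted, rcond=None)[0]
        history.append(np.mean((y_sorted - X_sorted @ beta) ** 2))
        if len(history) > 1 and abs(history[-2] - history[-1]) < tol:
            break
    return beta, history

# Hypothetical data: one sample of size 100 from Y = X + Z
rng = np.random.default_rng(0)
n = 100
x = rng.standard_normal(n)
y = x + rng.standard_normal(n)
X = x.reshape(-1, 1)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_mqe, history = mqe(y, X)
```

Here `history` records the successive values of the least-squares criterion, so it can be inspected to see how quickly the stopping rule is met for a given sample.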

Figure 1. Boxplots of OLS $\tilde{β}$ for the true value 1, and MQE $\hat{β}$ for the true value 1.414, for model (2.8).

Figure 2. Boxplots of OLS $(\tilde{β}_{1}, \tilde{β}_{2})$ for the true value (1, 1), MQE $(\hat{β}_{1}, \hat{β}_{2})$, and $\{\hat{β}_{1}^{2} + \hat{β}_{2}^{2}\}^{1/2}$ for the true value 2, for model (2.9).

To gain some intuitive appreciation of MQE and the difference from OLS, we report below some simulation results with two toy models.

Example 1.

Consider a simple scenario $Y = X + Z, \qquad (2.8)$ where X and Z are independent and N(0, 1), and Z is unobservable. Now p = 1, and the minimizer of (2.7) is β⁽¹⁾ = 1. Note that $L(Y) = N(0, 2) = L(1.414 X)$. Thus, (2.2) admits a minimizer β⁽²⁾ = 1.414. We generate 1000 samples from (2.8), each of size n = 100. For each sample, we calculate MQE $\hat{β}$ using the iterative algorithm above with OLS $\tilde{β}$ as the initial value. Figure 1 presents the boxplots of the 1000 estimates. It is clear that both OLS $\tilde{β}$ and MQE $\hat{β}$ provide accurate estimates for β⁽¹⁾ and β⁽²⁾, respectively. In fact, the mean squared estimation errors over the 1000 replications are, respectively, 0.0107 for $\tilde{β}$ and 0.0109 for $\hat{β}$. The algorithm for computing $\hat{β}$ took only two iterations to reach convergence in all the 1000 replications.
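The two target values in this example can be checked by a short calculation. The following is a worked sketch under the example's assumptions (X and Z independent N(0, 1), scalar β); the symbol Φ⁻¹ for the standard normal quantile function is our own notation.

```latex
% Assumes X ~ N(0,1), Z ~ N(0,1) independent, Y = X + Z, and scalar beta.
% \Phi^{-1} denotes the standard normal quantile function (our notation).
\begin{align*}
Q_Y(\alpha) &= \sqrt{2}\,\Phi^{-1}(\alpha), \qquad
Q_{\beta X}(\alpha) = |\beta|\,\Phi^{-1}(\alpha), \\
\int_0^1 \{Q_Y(\alpha) - Q_{\beta X}(\alpha)\}^2 \, d\alpha
  &= (\sqrt{2} - |\beta|)^2 \int_0^1 \{\Phi^{-1}(\alpha)\}^2 \, d\alpha
   = (\sqrt{2} - |\beta|)^2, \\
E\{(Y - \beta X)^2\} &= (1 - \beta)^2 E(X^2) + E(Z^2) = (1 - \beta)^2 + 1 .
\end{align*}
```

The quantile criterion vanishes exactly when |β| = √2 ≈ 1.414, while the mean squared error is smallest at β = 1. By the symmetry of the normal distribution, −1.414 gives the same quantile criterion, so 1.414 is the positive one of two minimizers.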

Example 2.

Now we repeat the exercise in Example 1 above for the model $Y = X_{1} + X_{2} + 1.414 Z, \qquad (2.9)$ where X₁, X₂, and Z are independent and N(0, 1), and Z is unobservable. The boxplots of the estimates are displayed in Figure 2. Now p = 2, and the minimizer of (2.7) is (β₁⁽¹⁾, β₂⁽¹⁾) = (1, 1). Since $L(Y) = N(0, 4)$, there are infinitely many minimizers of (2.2). In fact, any (β₁, β₂) satisfying the condition $\sqrt{β_{1}^{2} + β_{2}^{2}} = 2$ is a minimizer of (2.2), as then $L(β_{1} X_{1} + β_{2} X_{2}) = N(0, β_{1}^{2} + β_{2}^{2}) = N(0, 4)$. One such minimizer is (β₁⁽²⁾, β₂⁽²⁾) = (1.414, 1.414). It is clear from Figure 2 that, over the 1000 replications, OLS $(\tilde{β}_{1}, \tilde{β}_{2})$ are centered at the minimizer (β₁⁽¹⁾, β₂⁽¹⁾) of (2.7). While MQE $(\hat{β}_{1}, \hat{β}_{2})$ are centered around one minimizer (β₁⁽²⁾, β₂⁽²⁾) of (2.2), their variations over the 1000 replications are significantly larger. On the other hand, the values of $\{\hat{β}_{1}^{2} + \hat{β}_{2}^{2}\}^{1/2}$ are centered around its unique true value 2 with variation comparable to those of the OLS $\tilde{β}_{1}$ and $\tilde{β}_{2}$. In fact, the mean squared estimation errors of $\tilde{β}_{1}$, $\tilde{β}_{2}$, and $\{\hat{β}_{1}^{2} + \hat{β}_{2}^{2}\}^{1/2}$ are, respectively, 0.0191, 0.0196, and 0.0198. The mean squared differences between $\hat{β}_{1}$ and β₁⁽²⁾, and between $\hat{β}_{2}$ and β₂⁽²⁾, are 0.0608 and 0.0661, respectively.
All these clearly indicate that in the 1000 replications, MQE may estimate different minimizers of (2.2). However, the end-product, that is, the estimation of the distribution of Y, is very accurate, as measured by the mean squared error 0.0198 for estimating $\{β_{1}^{2} + β_{2}^{2}\}^{1/2}$. The iterative algorithm for calculating the MQE always converges quickly in the 1000 replications. The average number of iterations is 5.15 with the standard deviation 4.85. As in Example 1, we used the OLS as the initial values for calculating the MQE. We repeated the exercise with the two initial values generated randomly from U[−2, 2]. The boxplots for $\hat{β}_{1}$ and $\hat{β}_{2}$, not presented here to save space, are now centered at 0 with about [−1.5, 1.5] as their inter-half ranges. But remarkably the boxplot for $\{\hat{β}_{1}^{2} + \hat{β}_{2}^{2}\}^{1/2}$ remains about the same. The mean and the standard deviation of the number of iterations required in calculating the MQE are 7.83 and 9.12.

We conclude this section with some remarks.

Remark 1.

(i) When there exists more than one minimizer of (2.2), $\hat{β}$ may estimate different values in different instances. However, the goodness of the resulting approximations for the distribution of Y is about the same, guaranteed by the least squares property. See also Theorem 2 in Section 4.
(ii) If we are interested only in matching a part of the distribution of Y, say, that between the α₁th quantile and the α₂th quantile, 0 ⩽ α₁ < α₂ ⩽ 1, we may replace (2.5) by $R_{k}(β; α_{1}, α_{2}) = \frac{1}{n_{2} - n_{1}} \sum_{j = n_{1} + 1}^{n_{2}} (Y_{(j)} - β' X_{(j)}^{(k-1)})^{2}, \qquad (2.10)$ where n_i = [nα_i] and [x] denotes the integer part of x.
(iii) To obtain a sparse MQE, we change R_k(β) in Step 2 of the iteration to $R_{k}(β) = \frac{1}{n} \sum_{j=1}^{n} (Y_{(j)} - β' X_{(j)}^{(k-1)})^{2} + λ \sum_{i=1}^{p} |β_{i}|, \qquad (2.11)$ where λ > 0 is a constant controlling the penalty on the L₁ norm of β. This is a LASSO estimation, which can be equivalently represented as the problem of minimizing R_k(β) in (2.5) subject to $\sum_{i=1}^{p} |β_{i}| \leq C_{0}$, where C₀ > 0 is a constant. The LARS–LASSO algorithm due to Efron et al. (2004) provides the solution path for the OLS–LASSO optimization problem for all positive values of C₀.
(iv) Since our goal is to match the distribution of Y by that of β′X, a natural approach is to estimate β which minimizes, for example, $\min_{x} \{F_{Y}(x) - F_{β' X}(x)\}^{2}$, where F_ξ( · ) denotes the distribution function of random variable ξ. However, such a β is predominantly determined by the central parts of the distributions, as both distribution functions are close to 1 for extremely large values of x and close to 0 for large negative values of x. For risk management, those extreme values are clearly important.
(v) MQE does not require that Y_j and X_j are paired together. It can be used to recover the nearly perfect linear relationship Y ≈ β′X based on unpaired observations {Y_j} and {X_j}, as then $L(Y) \approx L(β' X)$, where $L(ξ)$ denotes the distribution of random variable ξ. It also applies when the distribution of Y is known and we have only the observations on X. In this case, the methodology described above is still valid with Y_(j) replaced by the true (j/n)th quantile of $L(Y)$ for j = 1, …, n.
(vi) When Y_j and X_j are paired together, as in many applications, the pairing is ignored in the MQE estimation (2.3). Hence, the correlation between Y and $\hat{β}' X$ may be smaller than that between Y and $\tilde{β}' X$. Intuitively, the loss in the correlation should not be substantial unless the noise-to-signal ratio is large, which is confirmed by our numerical experiments with both simulated and real data. See Section 6 and also Section 7 below.
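The partial-matching variant in the remark on matching only between the α₁th and α₂th quantiles can be sketched as follows. This is a hypothetical illustration rather than the authors' code: the name `mqe_partial`, the tolerance, the iteration cap, and the simulated data are our own assumptions. Because the rows entering the sum can change from one iteration to the next, the sketch caps the number of iterations and keeps the smallest criterion value seen.

```python
import numpy as np

def mqe_partial(y, X, alpha1, alpha2, beta0, tol=1e-10, max_iter=200):
    """Sketch of MQE restricted to order statistics n1 + 1, ..., n2."""
    n = len(y)
    # n_i = [n * alpha_i]; the tiny offset guards against floating-point round-off
    n1 = int(np.floor(n * alpha1 + 1e-9))
    n2 = int(np.floor(n * alpha2 + 1e-9))
    y_part = np.sort(y)[n1:n2]                      # Y_(j), j = n1 + 1, ..., n2
    beta = np.asarray(beta0, dtype=float)
    best_beta, best_r, prev_r = beta, np.inf, np.inf
    for _ in range(max_iter):
        order = np.argsort(X @ beta, kind="stable")
        X_part = X[order][n1:n2]                    # matching rows of the re-ordered X
        beta = np.linalg.lstsq(X_part, y_part, rcond=None)[0]
        r = np.mean((y_part - X_part @ beta) ** 2)  # average over n2 - n1 terms
        if r < best_r:
            best_beta, best_r = beta, r
        if abs(prev_r - r) < tol:
            break
        prev_r = r
    return best_beta, best_r

# Hypothetical data from Y = X1 + X2 + 1.414 Z; match the lower half only
rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 2))
y = X[:, 0] + X[:, 1] + 1.414 * rng.standard_normal(n)
beta_start = np.linalg.lstsq(X, y, rcond=None)[0]
beta_low, r_low = mqe_partial(y, X, 0.0, 0.5, beta_start)
```

Setting `alpha1 = 0.0` and `alpha2 = 1.0` recovers the full-range criterion, which makes it easy to compare the two fits on the same data.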

3. CONVERGENCE OF THE ALGORITHMS

We will show in this section that the iterative algorithm proposed in Section 2 above for computing MQE converges, a property reminiscent of the convergence of the EM algorithm (Wu 1983). We introduce a lemma first.

Lemma 1.

Let a₁, …, a_n and b₁, …, b_n be any two sequences of real numbers. Then $\sum_{i=1}^{n} (a_{(i)} - b_{(i)})^{2} \leq \sum_{i=1}^{n} (a_{i} - b_{i})^{2}, \qquad (3.1)$ where {a_(i)} and {b_(i)} are, respectively, the order statistics of {a_i} and {b_i}.

Proof.

We proceed by mathematical induction. When n = 2, we only need to show that $(a_{(1)} - b_{(1)})^{2} + (a_{(2)} - b_{(2)})^{2} \leq (a_{(1)} - b_{(2)})^{2} + (a_{(2)} - b_{(1)})^{2},$ which is equivalent to $0 \leq a_{(1)} (b_{(1)} - b_{(2)}) + a_{(2)} (b_{(2)} - b_{(1)}) = (a_{(2)} - a_{(1)}) (b_{(2)} - b_{(1)}).$ This is true, since both factors are nonnegative.

Assuming the lemma is true for n = k, we show below that it is also true for n = k + 1. Without loss of generality, we may assume that a_{k+1} = a_(1) and b_ℓ = b_(1). If ℓ = k + 1, (3.1) holds for k + 1 by the induction assumption. When ℓ ≠ k + 1, it follows from the proof above for the case n = 2 that $(a_{(1)} - b_{(1)})^{2} + (a_{ℓ} - b_{k+1})^{2} \leq (a_{ℓ} - b_{ℓ})^{2} + (a_{k+1} - b_{k+1})^{2}.$ Consequently, $\begin{matrix} \sum_{i=1}^{k+1} (a_{i} - b_{i})^{2} \geq (a_{(1)} - b_{(1)})^{2} + (a_{ℓ} - b_{k+1})^{2} + \sum_{1 \leq i \leq k, \, i \neq ℓ} (a_{i} - b_{i})^{2} \\ \geq (a_{(1)} - b_{(1)})^{2} + \sum_{i=2}^{k+1} (a_{(i)} - b_{(i)})^{2}. \end{matrix}$ The last inequality follows from the induction assumption for n = k. This completes the proof.
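Lemma 1 is easy to probe numerically. The snippet below is a small sanity check, not part of the proof; the helper names `sorted_gap` and `paired_gap` and the random test sequences are our own illustrative choices.

```python
import numpy as np

def sorted_gap(a, b):
    """Left-hand side of (3.1): squared gaps after sorting both sequences."""
    return np.sum((np.sort(a) - np.sort(b)) ** 2)

def paired_gap(a, b):
    """Right-hand side of (3.1): squared gaps in the given pairing."""
    return np.sum((a - b) ** 2)

# A two-point example: pairing (3, 0), (1, 2) versus sorted pairing (1, 0), (3, 2)
a_small = np.array([3.0, 1.0])
b_small = np.array([0.0, 2.0])
small_case = (sorted_gap(a_small, b_small), paired_gap(a_small, b_small))

# Random sequences of random length
rng = np.random.default_rng(1)
checks = []
for _ in range(1000):
    n = int(rng.integers(2, 20))
    a = rng.standard_normal(n)
    b = rng.standard_normal(n)
    checks.append(sorted_gap(a, b) <= paired_gap(a, b) + 1e-12)
all_hold = all(checks)
```

In the two-point example the sorted pairing gives 1 + 1 = 2 while the original pairing gives 9 + 1 = 10, in line with the n = 2 case of the proof.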

Theorem 1.

For R_k( · ) defined in (2.5) or (2.11), and $β^{(k)} = \arg\min_{β} R_{k}(β)$, it holds that R_k(β^(k)) → c as k → ∞, where c ⩾ 0 is a constant.

Proof.

We show that the LASSO estimation with R_k defined in (2.11) converges. When λ = 0, (2.11) reduces to (2.5).

Since R_k(β^(k)) ⩾ 0 for all k, and a nonincreasing sequence that is bounded below converges, we only need to show that $R_{k+1}(β^{(k+1)}) \leq R_{k}(β^{(k)})$ for k = 1, 2, …. This is true because $\begin{matrix} R_{k+1}(β^{(k+1)}) = \frac{1}{n} \sum_{j=1}^{n} (Y_{(j)} - (β^{(k+1)})' X_{(j)}^{(k)})^{2} + λ \sum_{i=1}^{p} |β_{i}^{(k+1)}| \\ \leq \frac{1}{n} \sum_{j=1}^{n} (Y_{(j)} - (β^{(k)})' X_{(j)}^{(k)})^{2} + λ \sum_{i=1}^{p} |β_{i}^{(k)}| \\ \leq \frac{1}{n} \sum_{j=1}^{n} (Y_{(j)} - (β^{(k)})' X_{(j)}^{(k-1)})^{2} + λ \sum_{i=1}^{p} |β_{i}^{(k)}| = R_{k}(β^{(k)}). \end{matrix}$ In the above expression, the first inequality follows from the definition of β^(k+1) and the second inequality is guaranteed by Lemma 1.

Remark 2.

(i) Theorem 1 shows that the iterations in Step 2 of the algorithm in Section 2 above converge. But it does not guarantee that they will converge to the global minimum. In practice, one may start with multiple initial values selected, for example, randomly, and take the minimum among the converged values from the different initial values. If necessary, one may also treat the algorithm as a function of the initial value and apply, for example, simulated annealing to search for the global minimizer.
(ii) In practice, we may search for β′X to match a part of the distribution of Y only, that is, we use R_k( · ; α₁, α₂) defined in (2.10) instead of R_k( · ) in (2.5). Note that {X^(k)_(j), n₁ < j ⩽ n₂} may be a different subset of {X_j, j = 1, …, n} for different k; see (2.4). Hence Theorem 1 no longer holds. Our numerical experiments indicate that the algorithm still converges as long as p is small in relation to n (e.g., p ⩽ 4n). See Section 6.
(iii) Lemma 1 above can be deduced from Lemmas 8.1 and 8.2 of Bickel and Freedman (1981) in an implicit manner, while the proof presented here is simpler and more direct.
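The multi-start strategy mentioned in the remark above can be sketched as below. This is a minimal illustration under our own assumptions: the function names, the number of starts (20), the U[−2, 2] starting distribution (borrowed from Example 2), and the simulated data are all hypothetical choices, and simulated annealing is not attempted.

```python
import numpy as np

def quantile_mismatch(y_sorted, X, beta):
    """Criterion in (2.3) divided by n: mean squared gap between order statistics."""
    return np.mean((y_sorted - np.sort(X @ beta)) ** 2)

def mqe_from(y_sorted, X, beta, tol=1e-10, max_iter=500):
    """Run the sort-then-regress iteration from a given starting value."""
    prev = np.inf
    for _ in range(max_iter):
        X_sorted = X[np.argsort(X @ beta, kind="stable")]
        beta = np.linalg.lstsq(X_sorted, y_sorted, rcond=None)[0]
        r = np.mean((y_sorted - X_sorted @ beta) ** 2)
        if abs(prev - r) < tol:
            break
        prev = r
    return beta

# Hypothetical data from Y = X1 + X2 + 1.414 Z
rng = np.random.default_rng(0)
n, p = 100, 2
X = rng.standard_normal((n, p))
y_sorted = np.sort(X.sum(axis=1) + 1.414 * rng.standard_normal(n))

# Several random starting values; keep the fit with the smallest criterion
starts = rng.uniform(-2.0, 2.0, size=(20, p))
fits = [mqe_from(y_sorted, X, b0) for b0 in starts]
values = [quantile_mismatch(y_sorted, X, b) for b in fits]
best_beta = fits[int(np.argmin(values))]
best_value = min(values)
```

Comparing the entries of `values` across starts gives a rough picture of how much the converged criterion depends on the initial value for a given dataset.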

4. ASYMPTOTIC PROPERTIES OF THE ESTIMATION

We present the asymptotic properties for a more general setting in which MQE is combined with LASSO, and the estimation is defined to match a part of the distribution between the α₁th quantile and the α₂th quantile, where 0 ⩽ α₁ < α₂ ⩽ 1 are fixed. Obviously, matching the whole distribution is a special case with α₁ = 0 and α₂ = 1. Furthermore, when λ = 0 in (4.1) and (4.3), it reduces to the MQE without LASSO.

For λ ⩾ 0, let $\begin{matrix} β_{0} = \arg\min_{β} S(β), \quad S(β) \equiv S(β; α_{1}, α_{2}) \\ = \int_{α_{1}}^{α_{2}} \{Q_{Y}(α) - Q_{β' X}(α)\}^{2} \, dα + λ \sum_{j=1}^{p} |β_{j}|. \end{matrix} \qquad (4.1)$ Intuitively, β₀ could be regarded as the true value to be estimated. However, it is likely that β₀ so defined is not unique. Such a scenario may occur when, for example, two components of X are identically distributed. Furthermore, it is conceivable that those different β₀ may lead to different distributions $L(β_{0}' X)$ which provide an equally good approximation to $L(Y)$ in the sense that S(β₀) takes the same value for those different β₀.

Similar to (2.3), the MQE for matching a part of the distribution is defined as $\hat{β} = \arg\min_{β} S_{n}(β), \qquad (4.2)$ where $\begin{matrix} S_{n}(β) \equiv S_{n}(β; α_{1}, α_{2}) = \frac{1}{n} \sum_{j = n_{1} + 1}^{n_{2}} \{Y_{(j)} - (β' X)_{(j)}\}^{2} + λ \sum_{j=1}^{p} |β_{j}| \\ = \frac{1}{n} \sum_{j = n_{1} + 1}^{n_{2}} \{Q_{n, Y}(j/n) - Q_{n, β' X}(j/n)\}^{2} + λ \sum_{j=1}^{p} |β_{j}|, \end{matrix} \qquad (4.3)$ n_i = [nα_i], (β′X)_(1) ⩽ ⋯ ⩽ (β′X)_(n) are the order statistics of β′X₁, …, β′X_n, and Q_{n, Y}( · ) is the quantile function corresponding to the empirical distribution of {Y_j}, that is, $Q_{n, Y}(α) = \inf\{y : F_{n, Y}(y) \geq α\}, \quad α \in (0, 1).$ In the above expression, F_{n, Y}(y) = n^{−1}∑_{1 ⩽ j ⩽ n} I(Y_j ⩽ y). $F_{n, β' X}$ and $Q_{n, β' X}$ are defined in the same manner.
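The empirical quantile function defined above, and its link to the order statistics used in the first line of (4.3), can be illustrated with a short sketch. The function name `empirical_quantile` and the five-point sample are hypothetical, and the level is assumed to lie in (0, 1].

```python
import numpy as np

def empirical_quantile(sample, alpha):
    """Q_n(alpha) = inf{y : F_n(y) >= alpha}, assuming 0 < alpha <= 1."""
    s = np.sort(sample)
    n = len(s)
    # j/n is the value of F_n at the jth order statistic when there are no ties
    levels = np.arange(1, n + 1) / n
    # first index j (0-based) with (j + 1)/n >= alpha
    j = np.searchsorted(levels, alpha, side="left")
    return s[j]

y = np.array([5.0, 1.0, 3.0, 2.0, 4.0])
q_half = empirical_quantile(y, 0.5)   # third smallest value, since 3/5 >= 0.5

# At the levels j/n the empirical quantile is the jth order statistic
n = len(y)
matches_order_stats = all(
    empirical_quantile(y, j / n) == np.sort(y)[j - 1] for j in range(1, n + 1)
)
```

This is why the two expressions for S_n(β) in (4.3) agree: evaluating the empirical quantile functions at j/n simply picks out the jth order statistics.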

Similar to its theoretical counterpart β₀ in (4.1), the estimator $\hat{β}$ defined in (4.2) may not be unique either; see Example 2 and Remark 1(i) above. Hence, we show below that $S_{n}(\hat{β})$ converges to S(β₀). This implies that the distribution of $\hat{β}' X$ provides an optimal approximation to the distribution of Y in the sense that the mean square residuals $S_{n}(\hat{β})$ converge to the minimum of S(β), although $L(\hat{β}' X)$ may not converge to a fixed distribution. Furthermore, we also show that $\hat{β}$ is consistent in the sense that $d(\hat{β}, B_{0}) \equiv \min_{β \in B_{0}} \| \hat{β} - β \|$ converges to 0, where ‖ · ‖ denotes the Euclidean norm for vectors, and $B_{0}$ is the set consisting of all the minimizers of S( · ) defined in (4.1), that is, $B_{0} = \{β : S(β) = S(β_{0})\}.$

We introduce some regularity conditions first. We denote by, respectively, F_ξ( · ) and f_ξ( · ) the distribution function and the probability density function of a random variable ξ.

Condition B.

(i) Let {Y_j} be a random sample from the distribution of Y and {X_j} be a random sample from the distribution of X. Both f_Y( · ) and f_X( · ) exist.
(ii) (The Kiefer condition.) It holds for any fixed β that $\sup_{α_{1} \leq α \leq α_{2}} | f_{β' X}'(Q_{β' X}(α)) | < \infty, \quad \inf_{α_{1} \leq α \leq α_{2}} f_{β' X}(Q_{β' X}(α)) > 0. \qquad (4.5)$ Furthermore, $\sup_{α_{1} \leq α \leq α_{2}} | f_{Y}'(Q_{Y}(α)) | < \infty, \quad \inf_{α_{1} \leq α \leq α_{2}} f_{Y}(Q_{Y}(α)) > 0. \qquad (4.6)$
(iii) X has bounded support.

Remark 3.

(i) Condition B(ii) is the Kiefer condition. It ensures the uniform Bahadur–Kiefer bounds for empirical quantile processes for iid samples. More precisely, (4.5) implies that $\begin{matrix} \sup_{α_{1} \leq α \leq α_{2}} | \sqrt{n} f_{β' X}(Q_{β' X}(α)) \{Q_{n, β' X}(α) - Q_{β' X}(α)\} + \sqrt{n} \{F_{n, β' X}(Q_{β' X}(α)) - α\} | \\ = O_{P}(n^{-1/4} (\log n)^{1/2} (\log \log n)^{1/4}), \end{matrix} \qquad (4.7)$ and (4.6) implies that $\begin{matrix} \sup_{α_{1} \leq α \leq α_{2}} | \sqrt{n} f_{Y}(Q_{Y}(α)) \{Q_{n, Y}(α) - Q_{Y}(α)\} + \sqrt{n} \{F_{n, Y}(Q_{Y}(α)) - α\} | \\ = O_{P}(n^{-1/4} (\log n)^{1/2} (\log \log n)^{1/4}). \end{matrix} \qquad (4.8)$ See Kiefer (1970), and also Kulik (2007).
(ii) The assumption of independent samples in Condition B(i) is imposed for simplicity of the technical proofs. In fact, Theorem 2 still holds for some weakly dependent processes, as the Bahadur–Kiefer bounds (4.7) and (4.8) may be established based on the results in Kulik (2007).
(iii) The requirement that X has bounded support is for technical convenience. When α₁ = 0 and α₂ = 1, it is implied by Condition B(ii), as (4.5) entails that β′X has a bounded support for any β.

Theorem 2.

Let Condition B hold and λ in (4.1) and (4.3) be a nonnegative constant. Then as n → ∞, $S_{n}(\hat{β}) \to S(β_{0})$ in probability, and $d(\hat{β}, B_{0}) \to 0$ in probability.

We present the proof of Theorem 2 in Appendix I.

5. GOODNESS OF MATCH

The goal of MQE is to match the distribution of Y by that of a selected linear combination β′X. We introduce below a measure for the goodness of match, and also a statistical test for the hypothesis $\begin{matrix} H_{0} : L (Y) = L (β^{'} X) . \qquad (5.1) \end{matrix}$

5.1 A Measure for the Matching Goodness

Let F( · ) be the distribution function of Y. Let g( · ) be the probability density function of the random variable F(β′X). When Y and β′X have the same distribution, F(β′X) is a random variable uniformly distributed on the interval [0, 1], and g(x) ≡ 1 for x ∈ [0, 1]. We define a measure for the goodness of match as follows: $\begin{matrix} ρ = 1 - \frac{1}{2} \int_{0}^{1} | g (x) - 1 | d x . \qquad (5.2) \end{matrix}$ It is easy to see that ρ ∈ [0, 1], and ρ = 1 if and only if the matching is perfect in the sense that $L (Y) = L (β^{'} X)$. When the difference between g( · ) and 1 (i.e., the density function of U[0, 1]) increases, ρ decreases. Hence the larger the difference between the distributions of Y and β′X, the smaller the value of ρ. For example, ρ = 0.5 if Y ∼ U[0, 1] and β′X ∼ U[0, 0.5], and ρ = 1/m if Y ∼ U[0, 1] and β′X ∼ U[0, 1/m] for any m ⩾ 1.
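The uniform example can be checked directly; the following short derivation is our own working, using only the definition of ρ given above. When Y ∼ U[0, 1], F(x) = x on [0, 1], so F(β′X) = β′X ∼ U[0, 1/m] and g(x) = m for x ∈ [0, 1/m] and g(x) = 0 otherwise. Therefore $\begin{matrix} \int_{0}^{1} | g (x) - 1 | d x = (m - 1) \cdot \frac{1}{m} + 1 \cdot (1 - \frac{1}{m}) = 2 (1 - \frac{1}{m}), \\ ρ = 1 - \frac{1}{2} \cdot 2 (1 - \frac{1}{m}) = \frac{1}{m} . \end{matrix}$ Taking m = 2 gives ρ = 0.5.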

With the given observations {(Y_i, X_i)}, let $\begin{matrix} U_{i} = F_{n} (β^{'} X_{i}), \quad \text{where} \quad F_{n} (x) = \frac{1}{n} \sum_{j = 1}^{n} I (Y_{j} \leq x) . \end{matrix}$ A natural estimator for ρ defined in (5.2) is $\begin{matrix} \hat{ρ} = 1 - \frac{1}{2} \sum_{j = 1}^{[n / k]} | C_{j} - k / n |, \quad \text{where} \quad C_{j} = \frac{1}{n} \sum_{i = 1}^{n} I (\frac{(j - 1) k}{n} < U_{i} \leq \frac{j k}{n}) . \qquad (5.3) \end{matrix}$ In the above expression, k ⩾ 1 is an integer and [x] denotes the integer part of x. It also holds that $\hat{ρ} \in [0, 1]$. Furthermore, $\hat{ρ} = 1$ if and only if n/k is an integer and each of the n/k intervals $(\frac{(j - 1) k}{n}, \frac{j k}{n}]$ (j = 1, …, n/k) contains exactly k points from U₁, …, U_n. This also indicates that we should choose k large enough such that there are enough sample points on each of those [n/k] intervals and, hence, the relative frequency on each interval is a reasonable estimate for its corresponding probability.
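As a rough illustration, the estimator $\hat{ρ}$ could be computed along the following lines. This is a minimal sketch and not the authors' implementation: the function name `rho_hat`, the use of NumPy, and the choice to work with integer ranks n·U_i (to avoid floating-point comparisons at the bin edges) are our own assumptions. It also assumes that the two samples have the same size n.

```python
import numpy as np

def rho_hat(y, bx, k):
    """Sketch of the goodness-of-match estimator built from the counts C_j.

    y  : observations Y_1, ..., Y_n
    bx : fitted values beta'X_1, ..., beta'X_n (assumed same length as y)
    k  : bin size, so that each bin has width k/n on the U scale
    """
    y = np.sort(np.asarray(y, dtype=float))
    bx = np.asarray(bx, dtype=float)
    n = y.size
    # Integer rank r_i = n * U_i = number of Y_j with Y_j <= beta'X_i.
    r = np.searchsorted(y, bx, side="right")
    m = n // k  # [n/k], the number of bins
    # U_i in ((j-1)k/n, jk/n]  <=>  r_i in {(j-1)k + 1, ..., jk}  <=>  j = ceil(r_i / k).
    j = (r + k - 1) // k
    inside = (j >= 1) & (j <= m)  # ranks equal to 0 or beyond m*k fall in no bin
    counts = np.bincount(j[inside], minlength=m + 1)[1:]
    c = counts / n  # relative counts C_j
    return 1.0 - 0.5 * np.abs(c - k / n).sum()
```

For instance, if `bx` is a permutation of `y` with distinct values and n/k is an integer, every bin receives exactly k points and the function returns 1.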

Remark 4.

Formula (5.2) only applies when the distribution of F(β′X) is continuous. If this is not the case, the random variable F(β′X) has nonzero probability masses at 0 and/or 1, and (5.2) should be written in a more general form ρ = 1 − 0.5∫¹₀|dG − dx|, where G( · ) denotes the probability measure of F(β′X). It is clear now that ρ = 0 if and only if the supports of $L (Y)$ and $L (β^{'} X)$ do not overlap. Note that the estimator $\hat{ρ}$ defined in (5.3) still applies.

5.2 A Goodness-of-Match Test

There exist several goodness-of-fit tests for the hypothesis H₀ defined in (5.1); see, for example, Section 2.1 of Serfling (1980). We propose a test statistic T_n below, which is closely associated with the goodness-of-match measure $\hat{ρ}$ in (5.3) and is reminiscent of the Cramér–von Mises goodness-of-fit statistic. Under the hypothesis H₀, U₁, …, U_n behave like a sample from U[0, 1] for large n. Hence based on the relative counts {C_j} defined in (5.3), we may define the following goodness-of-match test statistic for testing hypothesis H₀: $\begin{matrix} T_{n} = \sqrt{n} \sum_{j = 1}^{[n / k]} | C_{j} - k / n | . \qquad (5.4) \end{matrix}$ By Proposition 1, the distribution of T_n under H₀ is distribution-free. The critical values listed below were evaluated from a simulation with 50,000 replications, n = 1000, and both {ξ_i} and {η_i} drawn independently from U[0, 1].

Table


The changes in the critical values caused by different sample sizes n, as long as n ⩾ 300, are smaller than 0.05 when k/n ⩾ 0.05, and are smaller than 0.1 when k/n = 0.025.

Proposition 1.

Let {ξ₁, …, ξ_n} and {η₁, …, η_n} be two independent random samples from two distributions F and G, and let F be a continuous distribution. Let $F_{n} (x) = \frac{1}{n} \sum_{i} I (ξ_{i} \leq x)$ and U_i = F_n(η_i). Let C_j be defined as in (5.3) and T_n as in (5.4). Then the distribution of T_n is independent of F and G provided F( · ) ≡ G( · ).

This proposition follows immediately from the fact that $U_{i} = \frac{1}{n} \sum_{j = 1}^{n} I \{F (ξ_{j}) \leq F (η_{i})\}$ almost surely, and {F(ξ_i)} and {F(η_i)} are two independent samples from U[0, 1] when F( · ) ≡ G( · ).
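Because T_n depends on the two samples only through the ranks of the η_i among the ξ_j, its null distribution can be approximated by Monte Carlo with any continuous F. The sketch below illustrates this; the names `t_stat` and `simulate_critical_values`, the `sampler` argument, and the default seed are hypothetical choices of ours, and the replication count is left to the caller (the critical values in the text were obtained with 50,000 replications and n = 1000).

```python
import numpy as np

def t_stat(xi, eta, k):
    """Sketch of T_n = sqrt(n) * sum_j |C_j - k/n| for two samples of size n."""
    xi = np.sort(np.asarray(xi, dtype=float))
    eta = np.asarray(eta, dtype=float)
    n = xi.size
    r = np.searchsorted(xi, eta, side="right")  # n * U_i, an integer rank
    m = n // k
    j = (r + k - 1) // k                        # bin index ceil(r / k)
    inside = (j >= 1) & (j <= m)
    counts = np.bincount(j[inside], minlength=m + 1)[1:]
    return np.sqrt(n) * np.abs(counts / n - k / n).sum()

def simulate_critical_values(n, k, reps, levels, sampler, seed=0):
    """Approximate upper critical values of T_n under H0 by simulation.

    sampler(rng, n) must return n iid draws from a continuous distribution;
    both samples are drawn from it, so that F = G.
    """
    rng = np.random.default_rng(seed)
    stats = np.array([t_stat(sampler(rng, n), sampler(rng, n), k)
                      for _ in range(reps)])
    return np.quantile(stats, 1.0 - np.asarray(levels, dtype=float))
```

Applying the same strictly increasing transformation to both samples leaves the ranks, and hence `t_stat`, unchanged, which is the content of Proposition 1 in computational form.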

6. SIMULATION

To illustrate the finite-sample properties, we conduct simulations under the setting $\begin{matrix} Y_{j} = β^{'} X_{j} + Z_{j} = β_{1} X_{j 1} + \dots + β_{p} X_{j p} + Z_{j}, \quad j = 1, \dots, n, \qquad (6.1) \end{matrix}$ to check the performance of MQE for β = (β₁, …, β_p)′, where X_j = (X_j1, …, X_jp)′ represent p observed variables, and Z_j represents collectively the unobserved factors. We let X_j be defined by a factor model $\begin{matrix} X_{j} = A U_{j} + ϵ_{j}, \end{matrix}$ where A is a p × 3 constant factor loading matrix, the components of U_j are three independent linear AR(1) processes defined with positive or negative centered log-N(0, 1) innovations, and the components of $ϵ_{j}$ are all independent and t-distributed with 4 degrees of freedom. Hence, the components of X_j are correlated with each other and have skewed and heavy-tailed distributions. We let Z_j in (6.1) be independent N(0, σ²). For each sample, the coefficients β_j are drawn independently from U[ − 0.5, 0.5], the elements of the factor loading matrix A are drawn independently from U[ − 1, 1], and the three autoregressive coefficients in the three AR(1) factor processes are drawn independently from U[ − 0.95, 0.95]. For this example, no linear combinations of X_j can provide a perfect match for the distribution of Y_j.
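One possible reading of this data-generating process is sketched below. Several details are assumptions on our part rather than statements from the text: we interpret "positive or negative centered log-N(0, 1) innovations" as ±(exp(ζ) − e^{1/2}) with ζ ∼ N(0, 1) and one random sign per factor, we start each AR(1) process at zero and discard a burn-in period, and the function name `simulate_sample` and its arguments are hypothetical.

```python
import numpy as np

def simulate_sample(n, p, sigma, rng, burn_in=200):
    """Draw one sample (Y, X, beta) from a factor-model setting like (6.1)."""
    beta = rng.uniform(-0.5, 0.5, size=p)       # regression coefficients
    A = rng.uniform(-1.0, 1.0, size=(p, 3))     # factor loading matrix
    phi = rng.uniform(-0.95, 0.95, size=3)      # AR(1) coefficients
    # Assumption: centered log-normal innovations with a random sign per factor.
    sign = rng.choice([-1.0, 1.0], size=3)
    total = n + burn_in
    innov = sign * (np.exp(rng.standard_normal((total, 3))) - np.exp(0.5))
    U = np.zeros((total, 3))
    for t in range(1, total):
        U[t] = phi * U[t - 1] + innov[t]
    U = U[burn_in:]                             # drop the burn-in period
    eps = rng.standard_t(4, size=(n, p))        # t-distributed noise, 4 d.f.
    X = U @ A.T + eps
    Y = X @ beta + sigma * rng.standard_normal(n)
    return Y, X, beta
```

In such a sketch, σ would be chosen so that the ratio of the standard deviation of Z_j to that of β′X_j takes the desired value.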

Table 1 The means and standard deviations (STD) of the number of iterations required for computing MQE $\hat{β}$ in a simulation with 1000 replications

Display Table

Table 2 The means and standard deviations (in parentheses) of estimated goodness-of-match measure $\hat{ρ}$ defined in (5.3) in a simulation with 1000 replications, calculated for both the sample used for estimating β and the post-sample

Display Table

For comparison purposes, we also compute OLS $\tilde{β}$ defined in (2.6). For computing MQE $\hat{β}$, we use $\tilde{β}$ as the initial value, and let $\hat{β} = β^{(k)}$ and $\begin{matrix} rMSE (\hat{β}) = \{R_{k} (β^{(k)})\}^{1 / 2} \qquad (6.2) \end{matrix}$ when $\begin{matrix} | \{R_{k} (β^{(k)})\}^{1 / 2} - \{R_{k - 1} (β^{(k - 1)})\}^{1 / 2} | < 0.001, \qquad (6.3) \end{matrix}$ where R_k( · ) is defined in (2.5). The reason to use the square root of R_k instead of R_k in the above is that R_k itself can be very small. We set the sample size n = 300 or 800, the dimension p = 50, 100, or 200, and the ratio $\begin{matrix} r \equiv \frac{STD (Z_{j})}{STD (β_{1} X_{j 1} + \dots + β_{p} X_{j p})} = 0.5, 1, \text{ or } 2 . \end{matrix}$ For simplicity, we call r the noise-to-signal ratio, which represents the ratio of the unobserved signal to the observed signal. For each setting, we draw 1000 samples and calculate both $\hat{β}$ and $\tilde{β}$ for each sample.
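The iteration and its stopping rule can be sketched as follows. This is our own hedged reading of the procedure: at step k the rows of X are reordered according to the ranks of β^{(k−1)′}X_j, the sorted Y values are regressed on the reordered rows by OLS, and the iteration stops once the change in the root of R_k falls below the tolerance. The function name `mqe`, the use of `numpy.linalg.lstsq`, and the cap of 500 iterations are assumptions made for this illustration.

```python
import numpy as np

def mqe(Y, X, tol=1e-3, max_iter=500):
    """Sketch of an OLS-based iteration for matching quantiles estimation.

    Returns (beta, rmse, iterations), where rmse = sqrt(R_k(beta^(k))).
    """
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    y_sorted = np.sort(Y)                       # order statistics Y_(1) <= ... <= Y_(n)
    # Initial value: the ordinary least-squares estimate.
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    prev = np.inf
    for k in range(1, max_iter + 1):
        # Reorder the rows of X by the ranks of the current fitted values.
        order = np.argsort(X @ beta, kind="stable")
        X_re = X[order]
        # OLS of the sorted Y on the reordered X gives the next iterate.
        beta, *_ = np.linalg.lstsq(X_re, y_sorted, rcond=None)
        rmse = np.sqrt(np.mean((y_sorted - X_re @ beta) ** 2))
        if abs(rmse - prev) < tol:              # stopping rule on sqrt(R_k)
            return beta, rmse, k
        prev = rmse
    return beta, rmse, max_iter
```

Since the first step has no previous value to compare with, this sketch always performs at least two iterations before it can stop.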

Figure 3 displays the boxplots of the rMSE( $\hat{β}$ ) defined in (6.2). It indicates that the approximation with n = 800 is more accurate than that with n = 300. When the noise-to-signal ratio r increases from 0.5 to 1 and then to 2, the values and also the variation of $rMSE (\hat{β})$ increase. Figure 3 also shows that rMSE( $\hat{β}$ ) is right-skewed, indicating that the algorithm may be stuck at a local minimum. This problem can be significantly alleviated by using multiple initial values generated randomly, which was confirmed in an experiment not reported here.

Figure 3 Boxplots of rMSE( $\hat{β}$ ) defined in (6.2) with sample size n = 300 or 800, dimension p = 50, 100, or 200, and the noise-to-signal ratio r = 0.5, 1, or 2.


Table 1 lists the means and standard deviations of the number of iterations required in calculating MQE $\hat{β}$, controlled by (6.3), over the 1000 replications. Over all tested settings, the algorithm converges fast. The number of iterations tends to decrease when the dimension p increases. This may be because there are more “true values” of β when p is larger, or simply when p becomes really large.

With each drawn sample, we also generate a post-sample of size 300 denoted by {(y_j, x_j), j = 1, …, 300}. We measure the matching power for the distribution of Y by rMME $(\hat{β})$ for MQE, and by rMME $(\tilde{β})$ for OLS, where the root mean matching error rMME is defined as $\begin{matrix} rMME (β) = {(\frac{1}{300} \sum_{j = 1}^{300} {\{y_{(j)} - {(β^{'} x)}_{(j)}\}}^{2})}^{1 / 2}, \end{matrix}$ where y₍₁₎ ⩽ ⋅⋅⋅ ⩽ y₍₃₀₀₎ are the order statistics of {y_j}, and (β′x)₍₁₎ ⩽ ⋅⋅⋅ ⩽ (β′x)₍₃₀₀₎ are the order statistics of {β′x_j}. Figure 4 presents the scatterplots of $rMME (\hat{β})$ against $rMME (\tilde{β})$ with sample size n = 800. The dashed diagonal lines mark the positions y = x. Since most of the dots are below the diagonals, the matching error for the distribution of Y based on MQE $\hat{β}$ is smaller than the corresponding matching error based on OLS $\tilde{β}$ in most cases. When the noise-to-signal ratio r is as small as 0.5, the difference between the two methods is relatively small, as then the minimizers of (2.2) do not differ that much from the minimizer of (2.7). However, when the ratio increases to 1 and 2, the matching based on the MQE is overwhelmingly better. This confirms that MQE should be used when the goal is to match the distribution of Y.
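The root mean matching error compares order statistics rather than paired observations, which the following small sketch makes explicit. The function name `rmme` is hypothetical, and the post-sample size is taken from the input rather than fixed at 300.

```python
import numpy as np

def rmme(y_post, x_post, beta):
    """Root mean matching error between sorted y and sorted beta'x."""
    y_sorted = np.sort(np.asarray(y_post, dtype=float))
    fit_sorted = np.sort(np.asarray(x_post, dtype=float) @ np.asarray(beta, dtype=float))
    return np.sqrt(np.mean((y_sorted - fit_sorted) ** 2))
```

Because both vectors are sorted separately, any permutation of the post-sample responses leaves the value unchanged.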

Figure 4 Scatterplots of $rMME (\hat{β})$ against $rMME (\tilde{β})$ with sample size n = 800 in a simulation with 1000 replications. The dashed lines mark the diagonal y = x.


Figure 5 Scatterplots of $rMME (\hat{β})$ against $rMME (\tilde{β})$ with sample size n = 300 in a simulation with 1000 replications. The dashed lines mark the diagonal y = x.


Table 3 The means and standard deviations (in parentheses) of the sample correlation coefficients between Y and ${\tilde{β}}^{'} X$ , and between Y and ${\hat{β}}^{'} X$ in a simulation with 1000 replications, calculated for both the sample used for estimating β and the post-sample

Display Table

The same plots with sample size n = 300 are presented in Figure 5. When the dimension p is small, such as p = 50 or 100, MQE still provides a better matching performance overall, although the matching errors are greater than those when n = 800. When dimension p = 200 and sample size n = 300, we step into overfitting territory. While the in-sample fitting is fine (see the top panel in Figure 3 and the bottom-left part of Table 2 below), the post-sample matching power of both OLS and MQE is poor and MQE performs even worse than the “wrong” method OLS.

To assess the goodness-of-match, we also calculate the measure $\hat{ρ}$ defined in (5.3) with k = 20. The mean and standard deviation of $\hat{ρ}$ over 1000 replications are reported in Table 2. We line up side by side the results calculated using both the sample used for estimating β and the post-sample. Except in the overfitting cases (i.e., n = 300 and p = 200), the values of $\hat{ρ}$ with MQE are greater (or much greater when r = 2 or 1) than those with OLS, noting the small standard deviations across all the settings. With MQE, $\hat{ρ} \geq 0.92$ for the in-sample matching, and $\hat{ρ} \geq 0.87$ for the post-sample matching (except when n = 300 and p = 200). With OLS, the minimum value of $\hat{ρ}$ is 0.71 for the in-sample matching, and is 0.72 for the post-sample matching.

One side-effect of MQE $\hat{β}$ is the disregard of the pairing of (Y_j, X_j); see (2.3). Hence we expect that the sample correlation between Y and ${\hat{β}}^{'} X$ will be smaller than that between Y and ${\tilde{β}}^{'} X$. Table 3 lists the means and standard deviations of the sample correlation coefficients between Y and ${\hat{β}}^{'} X$, and of those between Y and ${\tilde{β}}^{'} X$ in our simulation. Over all different settings, the mean sample correlation coefficient for both in-samples and post-samples between Y and ${\tilde{β}}^{'} X$ is always greater than that between Y and ${\hat{β}}^{'} X$. However, the difference is small. In fact, if we take the difference of the two means, denoted as D, as the estimator for the “true” difference and treat the two means as independent of each other, the (absolute) value of D is always smaller than its standard error over all the settings.

Finally we investigate the performance of MQE in matching only a part of the distribution. To this end, we repeat the above exercise but using R_k(β) = R_k(β; 0, 0.3) defined in (2.10) instead, that is, the MQE is sought to match the lower 30% of the distribution of Y. Figure 6 presents the boxplots of $rMSE (\hat{β})$. Comparing it with Figure 3, there are no entries for n = 300 and p = 100 or 200, for which the algorithm did not converge after 500 iterations. See Remark 2(ii). For the cases presented in Figure 6, $rMSE (\hat{β})$ are smaller than the corresponding entries in Figure 3. This is because the matching now is easier, as the MQE is sought such that the lower 30% of $L ({\hat{β}}^{'} X)$ matches the counterpart of $L (Y)$, but there are no constraints on the upper 70% of $L ({\hat{β}}^{'} X)$. Table 4 lists the means and standard deviations of the number of iterations required in calculating MQE over the 1000 replications. Comparing it with Table 1, the algorithm converges faster for matching a part of $L (Y)$ than for matching the whole $L (Y)$.

7. A REAL-DATA EXAMPLE

In the context of selecting a representative portfolio for backtesting counterparty credit risks, Y is the total portfolio of a counterparty, and X = (X₁, …, X_p) are the p mark-to-market values of the trades. The goal is to find a linear combination β′X which provides an adequate approximation for the total portfolio Y. Since Basel III requires that a representative portfolio matches various characteristics of the total portfolio, we use the proposed methodology to select β′X to match the whole distribution of Y. We illustrate below how this can be done using the records for a real portfolio.

The data contains 1000 recorded total portfolios at one month tenor (i.e., one month stopping period) and the corresponding mark-to-market values of 146 trades (i.e., p = 146). Those 146 trades were selected from over 2000 trades across different tenors (i.e., from 3 days to 25 years) by the stepwise regression method of An et al. (2008). The data has been rescaled. As some trades are heavily skewed to the left while the total portfolio data are very symmetric for this particular dataset, we truncate those trades at $\hat{μ} - 6 \hat{σ}$, where $\hat{μ}$ and $\hat{σ}$ denote, respectively, the sample mean and the sample standard deviation of the trade concerned. The absence of the heavy left tail in the total portfolio data is because there exist highly correlated trades in opposite directions (i.e., sales in contrast to buys) which were eliminated at the initial stage by the method of An et al. (2008). We estimate both OLS $\tilde{β}$ and MQE $\hat{β}$ using the first 700 (i.e., n = 700) of the 1000 available observations. The algorithm for computing MQE took 7 iterations to converge. We compare Y with ${\tilde{β}}^{'} X$ and ${\hat{β}}^{'} X$ using the last 300 observations. The in-sample and post-sample correlations between Y and ${\tilde{β}}^{'} X$ are 0.566 and 0.248. The in-sample and post-sample correlations between Y and ${\hat{β}}^{'} X$ are 0.558 and 0.230. Once again the loss of correlation with MQE is minor.

Table 4 The means and standard deviations (STD) of the number of iterations required for computing MQE $\hat{β}$ for matching the lower 30% of the distribution of Y

Display Table

Setting k/n = 0.05 in (5.3), the in-sample and post-sample goodness-of-match measures $\hat{ρ}$ are 0.905 and 0.855 with MQE, and are 0.741 and 0.785 with OLS. This indicates that MQE provides a much better matching than OLS. The goodness-of-match test presented in Section 5.2 reinforces this assertion. The test statistic T_n defined in (5.4), when applied to the 300 post-sample points, is equal to 5.023 for the MQE matching, and is 7.448 for the OLS matching. Comparing with the critical values listed in Section 5.2, we reject the OLS matching at the 0.5% significance level, but we cannot reject the MQE matching even at the 10% level. Note that we do not apply the test to the in-sample data as the same data points were used in estimating β (though the conclusions would be the same).
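The reported test statistics can be recovered from the reported values of $\hat{ρ}$; the following check is our own arithmetic and relies only on the definitions of $\hat{ρ}$ and T_n in Section 5. Since both are built from the same sum of absolute deviations of the relative counts C_j, $\begin{matrix} \sum_{j = 1}^{[n / k]} | C_{j} - k / n | = 2 (1 - \hat{ρ}), \quad \text{so that} \quad T_{n} = 2 \sqrt{n} (1 - \hat{ρ}) . \end{matrix}$ With the n = 300 post-sample points, this gives $\begin{matrix} 2 \sqrt{300} (1 - 0.855) \approx 5.023 \ \text{(MQE)}, \quad 2 \sqrt{300} (1 - 0.785) \approx 7.448 \ \text{(OLS)}, \end{matrix}$ in agreement with the values of T_n quoted above.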

To further showcase the improvement of MQE matching over OLS, Figure 7 plots the sample quantiles of the representative portfolios ${\tilde{β}}^{'} X$ and ${\hat{β}}^{'} X$ against the sample quantiles of the total counterparty portfolio Y, based on the 300 post-sample points. It shows clearly that the distribution of the representative portfolio based on MQE $\hat{β}$ provides a much more accurate approximation for the distribution of the total counterparty portfolio than that based on the OLS $\tilde{β}$. For the latter, the discrepancy is alarmingly large at the two tails of the distribution, which matter most for risk management.

Figure 6 Boxplots of rMSE $(\hat{β})$ for matching the lower 30% of the distribution of Y, where n is sample size, p is the dimension of X, and r is the noise-to-signal ratio.


Figure 7 The plots of the sample quantiles of the representative portfolios based on OLS (the left panel) and MQE (the right panel) against the sample quantiles of the total counterparty portfolio. The straight lines mark the diagonal y = x on which the two quantiles are equal. All the quantiles are calculated based on the 300 post-sample points.

8. PORTFOLIO TRACKING

Portfolio tracking refers to a portfolio assembled with securities which mirrors a benchmark index, such as S&P500 or FTSE100 (Jansen and van Dijk 2002; Dose and Cincotti 2005). Tracking portfolios can be used as strategies for investment, hedging, and risk management, or for macroeconomic forecasting (Lamont 2001).

Let Y be the return of an index to be tracked, and let X₁, …, X_p be the returns of the p securities to be used for tracking Y. One way to choose a tracking portfolio is to select weights {w_i} to minimize $\begin{matrix} E {(Y - \sum_{i = 1}^{p} w_{i} X_{i})}^{2} \qquad (8.1) \end{matrix}$ subject to $\begin{matrix} \sum_{i = 1}^{p} w_{i} = 1 \quad \text{and} \quad \sum_{i = 1}^{p} | w_{i} | \leq c, \qquad (8.2) \end{matrix}$ where c ⩾ 1 is a constant. See, for example, Section 3.2 of Fan et al. (2012). In the above expression, w_i is the proportion of the capital invested on the ith security X_i, and w_i < 0 indicates a short sale on X_i. It follows from (8.2) that $\begin{matrix} \sum_{w_{i} > 0} w_{i} \leq \frac{1 + c}{2}, \quad \sum_{w_{i} < 0} | w_{i} | \leq \frac{c - 1}{2} . \qquad (8.3) \end{matrix}$ Hence, the constant c controls the exposure to short sales. When c = 1, short sales are not permitted.
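The two bounds on the long and short positions follow in one step from the two constraints; the notation P and N below is introduced only for this short verification. Write P for the sum of the positive weights and N for the sum of the absolute values of the negative weights. The constraints then read $\begin{matrix} P - N = 1, \quad P + N \leq c . \end{matrix}$ Adding and subtracting these relations gives $\begin{matrix} 2 P \leq 1 + c, \quad 2 N \leq c - 1, \end{matrix}$ which are the stated bounds. In particular, c = 1 forces N = 0, that is, no short sales.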

Instead of using the constrained OLS as above, one alternative in selecting the tracking portfolio is to match the whole (or a part) of the distribution of Y. This leads to a constrained MQE, subject to the constraints in (8.2). Given a set of historical returns {(Y_j, X_j1, …, X_jp), j = 1, …, n}, we use the iterative algorithm in Section 2 to calculate MQE $\hat{β}$ subject to the constraint $\begin{matrix} \sum_{j = 1}^{p} | β_{j} | \leq δ \sum_{i = 1}^{p} | {\hat{β}}_{i}^{(0)} |, \qquad (8.4) \end{matrix}$ where ${\hat{β}}^{(0)} = {({\hat{β}}_{1}^{(0)}, \dots, {\hat{β}}_{p}^{(0)})}^{'}$ is the unconstrained MQE for β, and δ ∈ (0, 1) is a constant which controls, indirectly, the total exposure to short sales. This is the standard MQE-LASSO; see (2.12) in Remark 1(iii) in Section 2. For δ ⩾ 1, $\hat{β} = {\hat{β}}^{(0)}$. We transform the constrained MQE $\hat{β} = {({\hat{β}}_{1}, \dots, {\hat{β}}_{p})}^{'}$ to the estimates for the proportion weights as follows: $\begin{matrix} {\hat{w}}_{i} = {\hat{β}}_{i} / \sum_{1 \leq j \leq p} {\hat{β}}_{j}, \quad i = 1, \dots, p . \end{matrix}$ Then $\{{\hat{w}}_{i}\}$ fulfill the constraints in (8.2) with any c satisfying the following condition: $\begin{matrix} c \geq δ \sum_{i} | {\hat{β}}_{i}^{(0)} | / | \sum_{j} {\hat{β}}_{j} | . \qquad (8.5) \end{matrix}$ Such a c is never smaller than 1, as $\begin{matrix} δ \sum_{i} | {\hat{β}}_{i}^{(0)} | / | \sum_{j} {\hat{β}}_{j} | \geq δ \sum_{i} | {\hat{β}}_{i}^{(0)} | / \sum_{j} | {\hat{β}}_{j} | \geq 1; \end{matrix}$ see (8.4).
Note that the LARS-LASSO algorithm gives the whole solution path for all positive values of δ. Hence for a given value c in (8.2), we can always find the largest possible value δ from the solution path for which (8.5) holds.
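The transformation from a constrained estimate to proportion weights, and the gross exposure it implies, can be illustrated with a few lines of code. This is only a sketch under our own naming (`tracking_weights` is hypothetical), and it takes the constrained estimate as given rather than computing the LASSO path.

```python
import numpy as np

def tracking_weights(beta_hat):
    """Normalize an estimate to weights summing to one.

    Returns (w, gross, short), where gross = sum_i |w_i| is the smallest c
    for which the weights satisfy the gross-exposure constraint, and short
    is the total short position sum_{w_i < 0} |w_i|.
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    total = beta_hat.sum()
    if np.isclose(total, 0.0):
        raise ValueError("the coefficients sum to zero; weights are undefined")
    w = beta_hat / total
    gross = np.abs(w).sum()
    short = np.abs(w[w < 0]).sum()
    return w, gross, short
```

For example, the estimate (0.6, 0.3, −0.1)′ yields weights (0.75, 0.375, −0.125)′, a gross exposure of 1.25, and a short position of 0.125 = (1.25 − 1)/2.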

Table 5 The mean, maximum and minimum daily log returns (in percentages) of FTSE100 and the estimated tracking portfolios in 2007. The estimation was based on the data in 2004–2006. Also included in the table are the number of stocks present in each portfolio, the standard deviations (STD) and the negative mean (NM) of the daily returns, and the percentages (of the capital) for short sales


Figure 8 The plots of the daily log returns of FTSE100 index (thick black circles), the MQE-LASSO portfolio with δ = 0.7 (thin red circles in the top panel), and the MQE-LASSO portfolio with δ = 0.5 and (α₁, α₂) = (0, 0.5) (thin blue circles in the bottom panel).

Figure 9 The plots of the annual means and standard deviations (STD) of daily log returns of FTSE100 index, the OLS portfolio, the MQE portfolio, the OLS-LASSO portfolio and the MQE-LASSO portfolio in the period of 2007–2013.

Figure 10 The plots of the annual means and standard deviations (STD) of daily log returns of FTSE100 index, and the portfolios based on the MQE-LASSO matching the lower half, the middle half, and the upper half of distribution in the period of 2007–2013.

Remark 5.

One would be tempted to absorb the constraint condition ∑_jw_j = 1 in the estimation directly by letting, for example, $\begin{matrix} Y^{'} = Y - X_{p}, X_{i}^{'} = X_{i} - X_{p} for 1 \leq i < p . \end{matrix}$

Then, one could estimate w₁, …, w_{p − 1} directly by regressing Y′ on X′₁, …, X′_{p − 1}. However, this puts the pth security X_p on an unequal footing with the other p − 1 securities, which may lead to an adverse effect.

We illustrate our proposal by tracking FTSE100 using 30 actively traded stocks included in FTSE100. The company names and the symbols of those 30 stocks are listed in Appendix II.

We use the log returns (in percentages) calculated using the adjusted daily close prices in 2004–2006 (n = 758) to estimate the tracking portfolios by MQE with or without the LASSO, and compare their performance with the returns of FTSE100 in 2007 (in total 253 trading days). We also include in the comparison the portfolios estimated by OLS. The market was overall bullish in the period 2004–2007. The data were downloaded from Yahoo!Finance.

Table 5 lists some summary statistics of the daily log-returns in 2007 of FTSE100 and the various tracking portfolios. Both the OLS and the MQE track the FTSE100 index well, with almost identical daily mean 0.014%. In addition to the standard deviations (STD), we also include in the table the negative mean (NM) as a risk measure, which is defined as the mean value of all the negative returns. According to both STD and NM, both the OLS and the MQE are slightly less risky than FTSE100 in 2007.
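The negative mean is simple to compute from a series of returns. The sketch below uses a hypothetical function name `negative_mean` and returns NaN when there are no negative returns, which is a convention we chose for the illustration.

```python
import numpy as np

def negative_mean(returns):
    """Mean of all strictly negative returns (NaN if there are none)."""
    r = np.asarray(returns, dtype=float)
    neg = r[r < 0]
    return neg.mean() if neg.size else float("nan")
```

For the returns (1, −2, 3, −4), the negative returns are −2 and −4, so the function gives −3.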

We also form the portfolios based on OLS-LASSO and MQE-LASSO with the truncation parameter δ = 0.7 and 0.5; see (8.4). Now all four portfolios yield noticeably greater average daily returns than that of FTSE100, with noticeably greater risks. Furthermore, the performances of OLS and MQE diverge, with MQE producing substantially larger returns with larger risks. For example, the MQE-LASSO portfolio with δ = 0.5 yields average daily return of 0.119% and NM −1.336% while the OLS-LASSO yields average daily return of 0.045% and NM −1.119%. The number of stocks selected in the portfolio is 10 by MQE, and 14 by OLS.

We continue the experiment by using the MQE matching the lower half, the middle half and the upper half of the distribution only; see Remark 1(ii). With δ = 0.7, the portfolios resulting from matching either the lower or the upper half of the distribution incur excessive short sales of, respectively, 38.4% and 885% of the initial capital, and are therefore too risky. By using δ = 0.5, short sales are reduced to 1.4% and 22% respectively. In particular, matching the lower half of the distribution with δ = 0.5 leads to a portfolio with average daily return 0.223%, the STD 2.33%, the NM −1.56% and short sales 1.4%.

Figure 8 plots the daily returns of FTSE100 together with the two portfolios estimated by the MQE-LASSO with δ = 0.7, and with δ = 0.5 and (α₁, α₂) = (0, 0.5), respectively. Both portfolios track the index well, with increased volatility. In particular, the portfolio plotted in blue is obtained by matching the lower half of the distribution only. Compared with FTSE100, the increase of the STD is 1.23% while the increase of the NM is merely 0.654%. The increase of the return for this portfolio results from mimicking the loss of FTSE100 and “freeing” the top half of the distribution.

Now we apply the above approach with a rolling window to the data in 2007–2013. More precisely, for each calendar year within the period, we use the data in its previous three years for estimation to form the different portfolios. We then calculate the means and standard deviations for the daily returns in that year based on each of the portfolios. (The data for 2013 were only up to 10 September when this exercise was conducted.) The results for the portfolios based on OLS and MQE, with and without LASSO, are plotted in Figure 9. We set δ = 0.5 in all the LASSO estimations. Figure 9 shows that the MQE-LASSO portfolio generated greater average returns in 5 out of 7 years than the other four portfolios. But it also led to greater losses than the FTSE100 index in both 2008 and 2011. Judging by the standard deviations, it is the most risky strategy among the five portfolios reported in Figure 9. Note that both the OLS and MQE portfolios incur small increases in standard deviation while the gains in average returns in 2011 and 2013 are noticeable. This shows that it is possible to match the overall performance of the index by trading on far fewer stocks.

Figure 10 compares the three portfolios based on the MQE-LASSO matching, respectively, the lower half, the middle half and the upper half of the distributions for the returns of FTSE100 index. The first panel in the figure suggests that matching the upper half of the distribution leads to very volatile average returns which are worse than the returns of FTSE100 index overall. In contrast, matching the lower half or the middle half of the distribution provides better returns than the index in 6 out of 7 years during the period. The risks of those portfolios, measured by the standard deviations, are higher than those of the index; see the second panel in the figure.

Overall the MQE-LASSO portfolios tend to overshoot at both the peaks and the troughs. Therefore they tend to outperform FTSE100 index when the market is bullish, and they may also do worse than the index when the market is bearish (such as 2008 and 2011).

Additional information

Notes on contributors

Nikolaos Sgouropoulos

Nikolaos Sgouropoulos is Quantitative Analyst, QA Exposure Analytics, Barclays, London, UK.

Qiwei Yao

Qiwei Yao is Professor, Department of Statistics, The London School of Economics and Political Science, Houghton Street, London, WC2A 2AE, UK; Guanghua School of Management, Peking University, China.

Claudia Yastremiz

Claudia Yastremiz is Senior Technical Specialist, Market and Counterparty Credit Risk Team, Prudential Regulation Authority, Bank of England, London, UK (E-mail: [email protected]).

REFERENCES

An, H.-Z., Huang, D., Yao, Q., and Zhang, C.-H. (2008), “Stepwise Searching for Feature Variables in High-dimensional Linear Regression,” unpublished manuscript, available at http://stats.lse.ac.uk/q.yao/qyao.links/paper/ahyz08.pdf.
Bickel, P.J., and Freedman, D.A. (1981), “Some Asymptotic Theory for the Bootstrap,” The Annals of Statistics, 9, 1196–1217.
del Barrio, E., Cuesta-Albertos, J.A., Matrán, C., and Rodríguez-Rodríguez, J.M. (1999), “Tests of Goodness of Fit Based on the L2-Wasserstein Distance,” The Annals of Statistics, 27, 1230–1239.
Dominicy, Y., and Veredas, D. (2013), “The Method of Simulated Quantiles,” Journal of Econometrics, 172, 208–221.
Dose, C., and Cincotti, S. (2005), “Clustering of Financial Time Series with Application to Index and Enhanced Index Tracking Portfolio,” Physica A, 355, 145–151.
Efron, B., Johnstone, I., Hastie, T., and Tibshirani, R. (2004), “Least Angle Regression” (with discussions), The Annals of Statistics, 32, 409–499.
Fan, J., Zhang, J., and Yu, K. (2012), “Vast Portfolio Selection with Gross-exposure Constraints,” Journal of the American Statistical Association, 107, 592–606.
Firpo, S., Fortin, N., and Lemieux, T. (2009), “Unconditional Quantile Regressions,” Econometrica, 77, 953–973.
Gneiting, T. (2011), “Quantiles as Optimal Point Forecasts,” International Journal of Forecasting, 27, 197–207.
He, X., Yang, Y., and Zhang, J. (2012), “Bivariate Downscaling with Asynchronous Measurements,” Journal of Agricultural, Biological, and Environmental Statistics, 17, 476–489.
Jansen, R., and van Dijk, R. (2002), “Optimal Benchmark Tracking with Small Portfolios,” The Journal of Portfolio Management, 28, 33–39.
Karian, Z., and Dudewicz, E. (1999), “Fitting the Generalized Lambda Distribution to Data: A Method Based on Percentiles,” Communications in Statistics: Simulation and Computation, 28, 793–819.
Kiefer, J. (1970), “Deviations Between the Sample Quantile Process and the Sample DF,” in Nonparametric Techniques in Statistical Inference, ed. M. L. Puri, pp. 299–319, London: Cambridge University Press.
Koenker, R. (2005), Quantile Regression, Cambridge: Cambridge University Press.
Kosorok, M.R. (1999), “Two-Sample Quantile Tests Under General Conditions,” Biometrika, 86, 909–921.
Kulik, R. (2007), “Bahadur-Kiefer Theory for Sample Quantiles of Weakly Dependent Linear Processes,” Bernoulli, 13, 1071–1090.
Lamont, O.A. (2001), “Economic Tracking Portfolios,” Journal of Econometrics, 105, 161–184.
Mallows, C.L. (1972), “A Note on Asymptotic Joint Normality,” The Annals of Mathematical Statistics, 43, 508–515.
Massart, P. (1990), “The Tight Constant in the Dvoretzky-Kiefer-Wolfowitz Inequality,” The Annals of Probability, 18, 1269–1283.
O’Brien, T.P., Sornette, D., and McPherron, R.L. (2001), “Statistical Asynchronous Regression: Determining the Relationship Between Two Quantities that are not Measured Simultaneously,” Journal of Geophysical Research, 106, 13247–13259.
Serfling, R.J. (1980), Approximation Theorems of Mathematical Statistics, New York: Wiley.
Small, C., and McLeish, D. (1994), Hilbert Space Methods in Probability and Statistical Inference, New York: Wiley.
Tanaka, H. (1973), “An Inequality for a Functional of Probability Distribution and its Application to Kac’s One-Dimensional Model of a Maxwellian Gas,” Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 27, 47–52.
Wu, C. F.J. (1983), “On the Convergence Properties of the EM Algorithm,” The Annals of Statistics, 11, 95–103.

APPENDIX I: PROOF OF THEOREM 2

We split the proof of Theorem 2 into several lemmas.

Lemma A.1.

Under Conditions B(i) and (ii), n^τ{S_n(β) − S(β)} → 0 in probability for any fixed β and τ < 1/2.

Proof.

Put W = β′X. By (4.7) and (4.8),

$\begin{matrix} \frac{1}{n} \sum_{j = n_{1} + 1}^{n_{2}} \{Q_{n, Y} (j / n) - Q_{n, W} (j / n)\}^{2} - \frac{1}{n} \sum_{j = n_{1} + 1}^{n_{2}} \{Q_{Y} (j / n) - Q_{W} (j / n)\}^{2} \\ = \frac{1}{n} \sum_{j = n_{1} + 1}^{n_{2}} \left\{\frac{F_{n, Y} (Q_{Y} (α)) - α}{f_{Y} (Q_{Y} (α))}\right\}^{2} + \frac{1}{n} \sum_{j = n_{1} + 1}^{n_{2}} \left\{\frac{F_{n, W} (Q_{W} (α)) - α}{f_{W} (Q_{W} (α))}\right\}^{2} \\ + \frac{2 R_{n}}{n^{3 / 2}} \sum_{j = n_{1} + 1}^{n_{2}} \left\{Q_{Y} (j / n) - Q_{W} (j / n) + \frac{F_{n, Y} (Q_{Y} (α)) - α}{f_{Y} (Q_{Y} (α))} - \frac{F_{n, W} (Q_{W} (α)) - α}{f_{W} (Q_{W} (α))}\right\} + O_{P} (R_{n}^{2} / n), \qquad \text{(A.1)} \end{matrix}$

where $R_{n} = O_{P} (n^{- 1 / 4} (\log n)^{1 / 2} (\log \log n)^{1 / 4}) = o_{P} (1)$. By the Dvoretzky-Kiefer-Wolfowitz inequality (Massart 1990), it holds for any constant C > 0 and any integer n ⩾ 1 that

$\begin{matrix} P \{\sup_{0 \leq α \leq 1} | F_{n, Y} (Q_{Y} (α)) - α | > C\} \leq 2 e^{- 2 n C^{2}}, \\ P \{\sup_{0 \leq α \leq 1} | F_{n, W} (Q_{W} (α)) - α | > C\} \leq 2 e^{- 2 n C^{2}} . \qquad \text{(A.2)} \end{matrix}$

Let $C = n^{- τ_{1}}$ for some τ₁ ∈ (τ/2, 1/4), and

$\begin{matrix} A_{n} = \{\sup_{0 \leq α \leq 1} | F_{n, Y} (Q_{Y} (α)) - α | \leq C\} \cap \{\sup_{0 \leq α \leq 1} | F_{n, W} (Q_{W} (α)) - α | \leq C\} . \end{matrix}$

Then by (A.2), $P (A_{n}) \geq 1 - 4 e^{- 2 n C^{2}} \to 1$, and on the set A_n,

$\begin{matrix} n^{τ} \Big[\frac{1}{n} \sum_{j = n_{1} + 1}^{n_{2}} \{Q_{n, Y} (j / n) - Q_{n, W} (j / n)\}^{2} - \frac{1}{n} \sum_{j = n_{1} + 1}^{n_{2}} \{Q_{Y} (j / n) - Q_{W} (j / n)\}^{2}\Big] = o_{P} (1), \qquad \text{(A.3)} \end{matrix}$

which is guaranteed by Condition B(ii) and the fact that

$\begin{matrix} \frac{1}{n} \sum_{j = n_{1} + 1}^{n_{2}} \{Q_{Y} (j / n) - Q_{W} (j / n)\} \to \int_{α_{1}}^{α_{2}} \{Q_{Y} (α) - Q_{W} (α)\} d α . \end{matrix}$

Note that

$\begin{matrix} \left| \int_{α_{1}}^{α_{2}} \{Q_{Y} (α) - Q_{W} (α)\} d α \right| \leq \int_{α_{1}}^{α_{2}} \{| Q_{Y} (α) | + | Q_{W} (α) |\} d α \\ = E [| Y | I \{G_{Y} (α_{1}) < Y \leq G_{Y} (α_{2})\}] + E | W | < \infty, \end{matrix}$

as |Y|I{G_Y(α₁) < Y ⩽ G_Y(α₂)} is bounded under Condition B(ii). See also condition B(iii) and Remark 3(iii).

Under Condition B(ii), $| Q_{Y} (α) - Q_{Y} (j / n) | = f_{Y} (j / n)^{- 1} n^{- 1} \{1 + o (1)\}$ for any |α − j/n| ⩽ 1/n. Hence $\begin{matrix} \frac{1}{n} \sum_{j = n_{1} + 1}^{n_{2}} \{Q_{Y} (j / n) - Q_{W} (j / n)\}^{2} = \int_{α_{1}}^{α_{2}} \{Q_{Y} (α) - Q_{W} (α)\}^{2} d α + o (1 / n) . \end{matrix}$ Combining this with (A.3), we obtain the required result.
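To see numerically why the choice C = n^{−τ₁} makes P(A_n) tend to 1, the following sketch evaluates the lower bound 1 − 4e^{−2nC²} used in the proof for increasing n. It is an illustration only: the function names and the value τ₁ = 0.2 are assumptions chosen for the example, not part of the original argument.

```python
import math

def dkw_bound(n, c):
    """Right-hand side of the Dvoretzky-Kiefer-Wolfowitz inequality: 2 * exp(-2 * n * c**2)."""
    return 2.0 * math.exp(-2.0 * n * c * c)

def lower_bound_on_p_an(n, tau1):
    """Lower bound 1 - 4 * exp(-2 * n * C**2) on P(A_n), with C = n**(-tau1)."""
    c = n ** (-tau1)
    # A_n is the intersection of two events, each failing with probability
    # at most dkw_bound(n, c), hence the factor 2 (union bound).
    return 1.0 - 2.0 * dkw_bound(n, c)

tau1 = 0.2  # illustrative value below 1/4 (assumption)
bounds = [lower_bound_on_p_an(n, tau1) for n in (10, 100, 1000, 10000)]
```

With C = n^{−τ₁}, the exponent is −2nC² = −2n^{1−2τ₁}, which diverges to −∞ whenever τ₁ < 1/2, so the bound approaches 1 very quickly.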

Lemma A.2.

Let a₁ ⩽ ⋅⋅⋅ ⩽ a_n be n real numbers. Let b_i = a_i + δ_i for i = 1, …, n, where the δ_i are real numbers. Then $\begin{matrix} \max_{1 \leq i \leq n} | a_{i} - b_{(i)} | \leq \max_{1 \leq j \leq n} | δ_{j} |, \qquad \text{(A.4)} \end{matrix}$ where b₍₁₎ ⩽ ⋅⋅⋅ ⩽ b_(n) is a permutation of {b₁, …, b_n}.

Proof.

We prove the lemma by mathematical induction. Let ε = max _j|δ_j|. It is easy to see that (A.4) is true for n = 2. Suppose that it is also true for n = k. We now prove it for n = k + 1.

Let c_i = b_i for i = 1, …, k. Then by the induction assumption, $\begin{matrix} \max_{1 \leq i \leq k} | a_{i} - c_{(i)} | \leq ϵ . \qquad \text{(A.5)} \end{matrix}$ If b_{k + 1} = a_{k + 1} + δ_{k + 1} ⩾ c_(k), the required result holds. However, if for some 1 ⩽ i < k, $\begin{matrix} c_{(i)} \leq b_{k + 1} < c_{(i + 1)}, \end{matrix}$ then $b_{(j)} = \begin{cases} c_{(j)}, & 1 \leq j \leq i, \\ b_{k + 1}, & j = i + 1, \\ c_{(j - 1)}, & i + 2 \leq j \leq k + 1 . \end{cases}$

Note that |b_{k + 1} − a_{i + 1}| ⩽ ε since $\begin{matrix} b_{k + 1} = a_{k + 1} + δ_{k + 1} \geq a_{i + 1} - ϵ \quad \text{and} \quad b_{k + 1} < c_{(i + 1)} \leq a_{i + 1} + ϵ . \end{matrix}$ The second expression above is implied by (A.5).

On the other hand, for j = i + 2, …, k + 1, we need to show that |c_{(j − 1)} − a_j| ⩽ ε. This is true, as c_{(j − 1)} ⩽ a_{j − 1} + ε ⩽ a_j + ε, and furthermore $\begin{matrix} c_{(j - 1)} > b_{k + 1} = a_{k + 1} + δ_{k + 1} \geq a_{j} - ϵ . \end{matrix}$ Hence |b_(j) − a_j| ⩽ ε for all 1 ⩽ j ⩽ k + 1. This completes the proof.
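Inequality (A.4) of Lemma A.2 can also be checked numerically. The sketch below perturbs a sorted sequence, re-sorts the perturbed values, and compares the two sides of the inequality. The function name `max_sorted_gap` and the random test settings are illustrative assumptions; a simulation of this kind is a sanity check, not a substitute for the proof.

```python
import random

def max_sorted_gap(a, deltas):
    """Return (max_i |a_i - b_(i)|, max_j |delta_j|) for b_i = a_i + delta_i.

    `a` must already be sorted in ascending order; b_(1) <= ... <= b_(n)
    denotes the sorted values of b_1, ..., b_n.
    """
    b_sorted = sorted(x + d for x, d in zip(a, deltas))
    lhs = max(abs(x - y) for x, y in zip(a, b_sorted))
    rhs = max(abs(d) for d in deltas)
    return lhs, rhs

# A small hand-checkable case: the perturbation reverses the order of the points.
example = max_sorted_gap([0.0, 1.0, 2.0], [2.5, 0.0, -2.5])

# Randomised check of (A.4) over many sequences of varying length.
rng = random.Random(0)
violations = 0
for _ in range(1000):
    n = rng.randint(1, 20)
    a = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
    deltas = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    lhs, rhs = max_sorted_gap(a, deltas)
    if lhs > rhs + 1e-12:  # small tolerance for floating-point rounding
        violations += 1
```

In the hand-checkable case the perturbed values are 2.5, 1 and −0.5, which sort to −0.5, 1, 2.5, so the left-hand side is 0.5 while the right-hand side is 2.5.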

Lemma A.3.

Let Condition B hold. Let $B$ be any compact subset of R^p. It holds that $\sup_{β \in B} | S_{n} (β) - S (β) |$ converges to 0 in probability.

Proof.

We denote by ||β|| the Euclidean norm of the vector β, and |β| = ∑_j|β_j|. Note that S(β) is a continuous function in β. For any ε > 0, there exist $β_{1}, \ldots, β_{m} \in B$, where m is finite, such that for any $β \in B$, there exists 1 ⩽ i ⩽ m for which $\begin{matrix} \| β - β_{i} \| < ϵ / \max (M, \sqrt{p}) \quad \text{and} \quad | S (β) - S (β_{i}) | < ϵ, \qquad \text{(A.6)} \end{matrix}$ where M > 0 is a constant such that ||x|| < M for any f_X(x) > 0; see Condition B(iii). Thus $\begin{matrix} | β^{'} x - β_{i}^{'} x | \leq \| x \| \cdot \| β - β_{i} \| \leq ϵ, \qquad \big| | β | - | β_{i} | \big| \leq | β - β_{i} | \leq \sqrt{p} \, \| β - β_{i} \| \leq ϵ . \end{matrix}$ Now it follows from Lemma A.2 that $\begin{matrix} | S_{n} (β) - S_{n} (β_{i}) | \leq \frac{1}{n} \sum_{j = n_{1} + 1}^{n_{2}} \{(β_{i}^{'} X)_{(j)} - (β^{'} X)_{(j)}\}^{2} \\ + \frac{2}{n} \sum_{j = n_{1} + 1}^{n_{2}} | (β_{i}^{'} X)_{(j)} - (β^{'} X)_{(j)} | \, | Y_{(j)} - (β_{i}^{'} X)_{(j)} | + \big| | β | - | β_{i} | \big| \\ \leq ϵ^{2} + ϵ \, \frac{2}{n} \sum_{j = n_{1} + 1}^{n_{2}} | Y_{(j)} - (β_{i}^{'} X)_{(j)} | + ϵ \\ \to ϵ^{2} + 2 ϵ \int_{α_{1}}^{α_{2}} | G_{Y} (α) - G_{β_{i}^{'} X} (α) | d α + ϵ \end{matrix}$

in probability. This limit can be verified in a similar manner to the proof of Lemma A.1. Consequently, there exists a set A with P(A) ⩾ 1 − ε such that on the set A it holds that $\begin{matrix} | S_{n} (β) - S_{n} (β_{i}) | \leq ϵ C, \end{matrix}$ where C > 0 is a constant. Now on the set A, $\begin{matrix} | S_{n} (β) - S (β) | \leq | S_{n} (β) - S_{n} (β_{i}) | + | S_{n} (β_{i}) - S (β_{i}) | + | S (β_{i}) - S (β) | \\ \leq ϵ C + | S_{n} (β_{i}) - S (β_{i}) | + ϵ . \end{matrix}$ See (A.6). Hence it holds on the set A that $\begin{matrix} \sup_{β \in B} | S_{n} (β) - S (β) | \leq ϵ (C + 1) + \sum_{i = 1}^{m} | S_{n} (β_{i}) - S (β_{i}) | . \end{matrix}$ Now the required convergence follows from Lemma A.1.
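The sample objective S_n(β) that appears throughout these lemmas can be sketched in code. The form used here is an assumption inferred from (4.1) and from the bound in the proof of Lemma A.3: a sum of squared differences between the order statistics Y_(j) and (β′X)_(j) for j = n₁ + 1, …, n₂, divided by n, plus a penalty λ∑_j|β_j|. The function name `sample_objective` and the choice n_i = ⌊nα_i⌋ are likewise assumptions made for illustration.

```python
def sample_objective(y, x_rows, beta, alpha1=0.0, alpha2=1.0, lam=0.0):
    """Assumed form of S_n(beta): mean squared gap between matched order
    statistics of Y and beta'X over levels (alpha1, alpha2], plus an L1 penalty."""
    n = len(y)
    n1, n2 = int(n * alpha1), int(n * alpha2)  # assumed index convention
    y_sorted = sorted(y)
    # Order statistics of the linear combination beta'X.
    w_sorted = sorted(sum(b * xj for b, xj in zip(beta, row)) for row in x_rows)
    # 0-based indices n1, ..., n2 - 1 correspond to j = n1 + 1, ..., n2.
    fit = sum((y_sorted[j] - w_sorted[j]) ** 2 for j in range(n1, n2)) / n
    return fit + lam * sum(abs(b) for b in beta)

rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy X, one row per observation
beta = [1.0, 1.0]
value = sample_objective([3.0, 1.0, 2.0], rows, beta)
permuted = sample_objective([1.0, 2.0, 3.0], rows, beta)
penalised = sample_objective([3.0, 1.0, 2.0], rows, beta, lam=0.5)
```

Because both samples are sorted before being compared, permuting the observations of Y leaves the objective unchanged. This reflects the fact that the criterion matches distributions rather than paired observations, which is why Lemma A.2 on sorted sequences is needed in the proof.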

Proof of Theorem 2.

Under Condition B(ii), YI{Q_Y(α₁) ⩽ Y ⩽ Q_Y(α₂)} is bounded. As X is also bounded, the MQE $\hat{β}$ defined in (4.2) is also bounded. Let $B$ be a compact set which contains $\hat{β}$ with probability 1.

By (4.1) and (4.2), $\begin{matrix} S_{n} (β_{0}) - S (β_{0}) \geq S_{n} (\hat{β}) - S (β_{0}) \geq S_{n} (\hat{β}) - S (\hat{β}) . \end{matrix}$ Now it follows from Lemma A.3 that both S_n(β₀) − S(β₀) and $S_{n} (\hat{β}) - S (\hat{β})$ converge to 0 in probability. Hence, $S_{n} (\hat{β}) - S (β_{0})$ also converges to 0 in probability.

For the second assertion, we need to prove that $P \{d ({\hat{β}}_{n}, B_{0}) \geq ϵ\} \to 0$ for any constant ϵ > 0. We now write ${\hat{β}}_{n} = \hat{β}$ to indicate explicitly that the estimator is defined with the sample of size n. We proceed by contradiction. Suppose there exists an ϵ > 0 for which $\begin{matrix} \limsup_{n \to \infty} P \{d ({\hat{β}}_{n}, B_{0}) \geq ϵ\} > 0 . \end{matrix}$ Hence, there exists an integer subsequence n_k such that lim_k P(A_k) = δ > 0, where A_k is defined as $\begin{matrix} A_{k} = \{d ({\hat{β}}_{n_{k}}, B_{0}) \geq ϵ\} . \end{matrix}$ Let $B_{1} = \{β \in B : d (β, B_{0}) \geq ϵ\}$. Then $B_{1}$ is a compact set which is ϵ-distance away from $B_{0}$. By the definition of $B_{0}$ in (4.4), $\begin{matrix} \inf_{β \in B_{1}} S (β) = δ + S (β_{0}), \end{matrix}$ where δ > 0 is a constant. By Lemma A.3, P(B_k) → 1 for $\begin{matrix} B_{k} = \{| S_{n_{k}} ({\hat{β}}_{n_{k}}) - S ({\hat{β}}_{n_{k}}) | < δ / 2\} . \end{matrix}$ Now it holds on the set A_k ∩ B_k that $\begin{matrix} S_{n_{k}} ({\hat{β}}_{n_{k}}) \geq S ({\hat{β}}_{n_{k}}) - δ / 2 \geq \inf_{β \in B_{1}} S (β) - δ / 2 = S (β_{0}) + δ / 2 > S (β_{0}) . \end{matrix}$ This contradicts the fact that $S_{n} (\hat{β})$ converges to S(β₀) in probability, which was established earlier. This completes the proof. $\square$

ACKNOWLEDGMENTS

We thank Professor Wolfgang Polonik for his helpful comments, in particular for drawing our attention to references Kiefer (1970) and Kulik (2007). We also thank the Editor and three reviewers for their critical and helpful comments and suggestions.

APPENDIX II: THE NAMES OF SYMBOLS OF THE 30 STOCKS USED IN TRACKING FTSE100

