
Geometric Classifier for Multiclass, High-Dimensional Data

Pages 279-294 | Received 30 Jul 2014, Accepted 31 May 2015, Published online: 14 Aug 2015

Abstract

In this article, we consider a geometric classifier that is applicable to multiclass classification for high-dimensional data. We show the consistency property and the asymptotic normality of the geometric classifier under certain mild conditions. We discuss sample size determination so that the geometric classifier can ensure that its misclassification rates are less than prespecified thresholds. We give a two-stage procedure to estimate the sample sizes required in such a geometric classifier and propose a misclassification rate–adjusted classifier (MRAC) based on the geometric classifier. We evaluate the performance of the MRAC theoretically and numerically. Finally, we demonstrate the MRAC in actual data analyses by using a microarray data set.


1. INTRODUCTION

High-dimensional data situations occur in many areas of modern science, such as genetic microarrays, medical imaging, text recognition, finance, chemometrics, and so on. A common feature of high-dimensional data is that the data dimension is high but the sample size is relatively low. This is the so-called HDLSS or “large p, small n” situation where p/n → ∞; here p is the data dimension and n is the sample size. Aoshima and Yata (2011a,b) provided a variety of statistical inference procedures for high-dimensional data, such as given-bandwidth confidence regions, two-sample tests, classification, variable selection, regression, pathway analysis, and so on. They considered sample size determination to ensure prespecified high accuracy for high-dimensional, non-Gaussian inference and developed the theory of Stein's (1945, 1949) two-stage procedure that was originally given for inference on the univariate Gaussian mean. Aoshima and Yata (2015a) verified the asymptotic normality of statistics appearing in inference on high-dimensional mean vectors under certain mild conditions. In this article, we focus on high-dimensional classification and attempt to give a multiclass classifier that keeps misclassification rates below prespecified thresholds.

Suppose we have independent and p-variate populations, πi, i = 1,…, k, having an unknown mean vector μi and an unknown covariance matrix Σi (> O) for each i. We assume that lim sup ||μi − μj||2/p < ∞ as p → ∞ for all i ≠ j, where ||·|| denotes the Euclidean norm. Also, we assume that tr(Σi)/p ∈ (0, ∞) as p → ∞ for i = 1,…, k. Here, for a function, f(·), “f(p) ∈ (0, ∞) as p → ∞” implies lim inf f(p) > 0 and lim sup f(p) < ∞ as p → ∞. We do not assume that Σ1 = … = Σk. The eigen-decomposition of Σi is given by Σi = HiΛiHiT, where Λi is a diagonal matrix of eigenvalues, λi1 ≥ … ≥ λip > 0, and Hi is an orthogonal matrix of the corresponding eigenvectors. We have independent and identically distributed (i.i.d.) observations, xi1,…, xini, from each πi. Let xij = HiΛi1/2zij + μi, where zij is considered as a sphered data vector from a distribution with the zero mean vector and the identity covariance matrix. We assume ni ≥ 2, i = 1,…, k. We estimate μi and Σi by the sample mean vector x̄ini = ∑j=1ni xij/ni and the sample covariance matrix Sini = ∑j=1ni (xij − x̄ini)(xij − x̄ini)T/(ni − 1).
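The estimators above are the usual sample mean vector and the (unbiased) sample covariance matrix. A minimal sketch in Python follows; the function name and array layout are chosen for illustration only, and the geometric classifier below in fact requires only x̄ini and tr(Sini), so the full p × p matrix need not be formed.

    import numpy as np

    def sample_estimates(X):
        """Sample mean vector and sample covariance matrix for one class.

        X : (n_i, p) array whose rows are the observations x_i1, ..., x_ini.
        Returns (xbar, S), with S the unbiased estimator (divisor n_i - 1).
        """
        n_i = X.shape[0]
        xbar = X.mean(axis=0)
        centered = X - xbar
        S = centered.T @ centered / (n_i - 1)
        # The geometric classifier only needs tr(S), which equals
        # np.sum(centered ** 2) / (n_i - 1) and avoids forming the p x p matrix.
        return xbar, S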

As for population πi, i = 1,…, k, we make the following assumption:

(A-i) Let yij, j = 1,…, ni, be i.i.d. random qi-vectors having E(yij) = 0 and Var(yij) = Iqi for each i (= 1,…, k), where qi ≥ p. Let yij = (yi1j,…, yiqij)T in which lim sup E(yirj4) < ∞ as p → ∞ for all r, and E(yirj2yisj2) = 1 and E(yirjyisjyitjyiuj) = 0 for all r ≠ s, t, u. Then, the observations, xijs, from each πi (i = 1,…, k) are given by xij = Γiyij + μi, (1.1) where Γi is a p × qi matrix such that ΓiΓiT = Σi.

Here, Iqi denotes the identity matrix of dimension qi. Note that (1.1) includes the case that qi = p, Γi = HiΛi1/2, and yij = zij. Also, note that (A-i) is met when πi is Np(μi, Σi) for i = 1,…, k. In addition, we make the following assumption on the Σis as necessary:

(A-ii) and as p → ∞ for i, j, l = 1,…, k.

Note that “ as p → ∞” is equivalent to the condition that “ as p → ∞”. Also, the sphericity condition such as “ as p → ∞ for i = 1,…, k” holds under (A-ii).

Remark 1.1

If all λijs are bounded such as λij ∈ (0, ∞) as p → ∞, (A-ii) trivially holds. For a spiked model such as λij = aijpαij (j = 1,…, ti) and λij = cij (j = ti + 1,…, p) with positive constants, aijs, cijs and αijs, and positive integers tis, (A-ii) holds under the condition that αij < 1/2 for j = 1,…, ti(< ∞); i = 1,…, k.

Let x0 be an observation vector of an individual belonging to one of the k populations. When k = 2, a typical classification rule is based on a discriminant function involving the inverse matrix of Sini: one classifies the individual into π1 or π2 according to its sign. However, the inverse matrix of Sini does not exist in the HDLSS context (p > ni). Dudoit et al. (2002) considered substituting the inverse of the diagonal matrix formed by the diagonal elements of Sini. Chan and Hall (2009) and Aoshima and Yata (2014) considered distance-based classifiers. In particular, Aoshima and Yata (2014) gave a distance-based classifier for multiclass, non-Gaussian, high-dimensional data and considered sample size determination to keep misclassification rates below prespecified thresholds. When k = 2, the distance-based classifier is simplified as follows: One classifies the individual into π1 if (1.2) holds and into π2 otherwise. Here, −tr(S1n1)/(2n1) + tr(S2n2)/(2n2) is a bias-correction term. Aoshima and Yata (2014) showed that the classifier holds a consistency property in which misclassification rates go to zero as p → ∞ even when (A-i) is not met. In that sense, the classifier is quite robust and applicable to actual high-dimensional data. On the other hand, Aoshima and Yata (2011a) considered substituting {tr(Sini)/p}Ip for Sini in order to use a geometric representation of HDLSS data from each πi and gave a two-class quadratic classifier, called the geometric classifier, as follows: One classifies the individual into π1 if (1.3) holds and into π2 otherwise. Here, −p/n1 + p/n2 is a bias-correction term. Aoshima and Yata (2014, 2015b) showed that the classifier holds the consistency property even when μ1 = μ2. Recently, Aoshima and Yata (2015b) provided a general theory of quadratic classifiers for high-dimensional data in non-sparse settings.

In this article, we extend the geometric classifier given by (1.3) to multiclass classification with k (≥ 2) classes. In Section 2, we show the consistency property and the asymptotic normality of the geometric classifier for multiclass high-dimensional data. In Section 3, we discuss sample size determination so that the geometric classifier can ensure that its misclassification rates are less than prespecified thresholds. We give a two-stage procedure to estimate the sample sizes required in such a geometric classifier and propose a misclassification rate–adjusted classifier (MRAC) based on the geometric classifier. In Section 4, we evaluate the performance of the MRAC numerically as well. Finally, in Section 5, we demonstrate the MRAC in actual data analyses by using a microarray data set.

2. ASYMPTOTIC PROPERTIES OF THE GEOMETRIC CLASSIFIER

Let Wi(x0|ni) = p||x0 − x̄ini||2/tr(Sini) + p log {tr(Sini)/p} − p/ni (2.1) for i = 1,…, k. We consider the geometric classifier when k (≥ 2) as follows: One classifies the individual into πi if i = max {argminj=1,…, kWj(x0|nj)}. (2.2)

When argminj=1,…, kWj(x0|nj) = {i1,…, il} with integers l ∈ [2, k] and i1 < … <il, we have max {argminj=1,…, kWj(x0|nj)} = il. Note that the difference, W1(x0|n1) − W2(x0|n2), is equivalent to (1.3).
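The rule can be coded directly. The sketch below assumes that Wi(x0|ni) takes the scaled-distance, log-trace, and bias-correction form displayed in (2.1); the function names and the 0-based class indexing are illustrative, and tr(Sini) is computed without forming the p × p matrix Sini.

    import numpy as np

    def geometric_scores(x0, class_samples):
        """W_i(x0 | n_i) for each class, following the form in (2.1):
        a squared distance scaled by p / tr(S_i), a log-trace term,
        and the bias correction -p / n_i."""
        scores = []
        for X in class_samples:                    # X is an (n_i, p) array
            n_i, p = X.shape
            xbar = X.mean(axis=0)
            tr_S = np.sum((X - xbar) ** 2) / (n_i - 1)   # tr(S_i) without forming S_i
            W = (p / tr_S) * np.sum((x0 - xbar) ** 2) + p * np.log(tr_S / p) - p / n_i
            scores.append(W)
        return np.array(scores)

    def geometric_classify(x0, class_samples):
        """Rule (2.2): minimize W_i; on ties, take the largest index (0-based here)."""
        W = geometric_scores(x0, class_samples)
        minimizers = np.flatnonzero(np.isclose(W, W.min()))
        return int(minimizers.max())

The tie-breaking line mirrors the convention max {argminj Wj(x0|nj)} noted above.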

2.1. Consistency Property

Let Δij(1) = ||μi − μj||2 and Δij(2) = tr(Σi) − tr(Σj) + tr(Σj) log {tr(Σj)/tr(Σi)} for all i ≠ j. Note that Δij(2) ≥ 0 (i ≠ j) with equality if and only if tr(Σi) = tr(Σj). Let Δij = p{Δij(1) + Δij(2)}/tr(Σj) for all i ≠ j. We assume the following conditions as p → ∞ either when ni is fixed or ni → ∞ for i = 1,…, k:

(A-iii) and for all i ≠ j;

(A-iv) for all i ≠ j.

We denote the error rate of misclassifying an individual from πi (into another class) by e(i). Then, we have the following result.

Theorem 2.1

Under (A-i), (A-iii), and (A-iv), it holds that e(i) → 0 as p → ∞ for i = 1,…, k.

Remark 2.1

When k = 2, Aoshima and Yata (2014) gave partial results of Theorem 2.1 under different conditions.

Remark 2.2

If as p → ∞ for all i ≠ j, (A-iii) and (A-iv) naturally hold. Then, one can claim Theorem 2.1 even when ni is fixed for i = 1,…, k.

2.2. Asymptotic Normality

Let for all i ≠ j. Note that Wi(x0|ni) − Wj(x0|nj) is equivalent to with Σi = Sini and Σj = Sjnj for all i ≠ j. We have that when x0 ∈ πi for all i ≠ j. Under (A-i), it holds that when x0 ∈ πi for all i ≠ j. Let for all i ≠ j. We assume the following extra conditions as p → ∞ and ni → ∞, i = 1,…, k:

(A-v) and for all i ≠ j.

Note that under (A-ii) it holds for all i ≠ j, so that tr(Σi)/tr(Σj) → 1 as p → ∞ for all i ≠ j under (A-ii) and (A-v). Then, we have the following results.

Theorem 2.2

Assume that Δij(1)/tr(Σj) → 0 as p → ∞ for all i ≠ j. Under (A-i), (A-ii) and (A-v), it holds that as p → ∞ and ni → ∞, i = 1,…, k where “ ⇒ ” denotes the convergence in distribution and Yij denotes a random variable distributed as the standard normal distribution.

Remark 2.3

When k = 2, Aoshima and Yata (2011a) gave the asymptotic normality under stronger conditions.

Corollary 2.1

Assume that Δij(1)/tr(Σj) → 0 as p → ∞ for all i ≠ j. Under (A-i), (A-ii), and (A-v), the classification rule by (2.2) has that as p → ∞ and ni → ∞, i = 1,…, k where Φ(·) denotes the cumulative distribution function of the standard normal distribution.

Remark 2.4

When k = 2, the above result is given as

3. SAMPLE SIZE DETERMINATION TO CONTROL MISCLASSIFICATION RATES

Let Δij* = {tr(Σj)/p}Δij = Δij(1) + Δij(2) for all i ≠ j. Let Δi* = min j(≠i)=1,…, kmin {Δij*, Δji*} for i = 1,…, k. We are interested in determining the sample size for (2.2) to ensure the requirement that e(i) ≤ αi for i = 1,…, k, where αi ∈ (0, 1/2) and Δi*L (> 0), i = 1,…, k, are prespecified constants. We assume Δi* ≥ Δi*L, i = 1,…, k.

3.1. Sample Size Determination

Let zα be the upper α point of the standard normal distribution. We consider nis satisfying (3.1) for all i ≠ j, where Δ(ij) = p max {Δi*L, Δj*L}/max {tr(Σi), tr(Σj)} (i ≠ j). Note that Δ(ij) = Δ(ji) and Δ(ij) ≤ min {Δij, Δji} for all i ≠ j. Under (3.1), we have that so that from Theorem 2.2 it follows that for i = 1,…, k under (3.1) and the assumptions of Theorem 2.2. First, we consider the case when lim inf |tr(Σi)/tr(Σj) − 1| > 0 as p → ∞ for i ≠ j. In that case, it holds . Under (A-i) and (A-ii), from Theorem 2.1 we have that even if nis are fixed for i ≠ j. Next, we consider the case when tr(Σ1) = … = tr(Σk). Let for i = 1,…, k. From the fact that (i ≠ j), it holds that for i ≠ j

Let us write σ(i) = max j(≠i)=1,…, kσj and α(i) = min j(≠i)=1,…, kαj for i = 1,…, k. From the above arguments, we can find ni, i = 1,…, k, to satisfy (3.1) by (3.2)

Note that ni → ∞, i = 1,…, k, as p → ∞. For example, when k = 2, tr(Σ1) = tr(Σ2), and Δ1*L = Δ2*L, the smallest integers (n1, n2) satisfying (3.2) enjoy the following optimality:

According to (3.2), we take samples from each πi and calculate Wi(x0|ni), i = 1,…, k, in (2.1). We consider the following classification procedure based on the misclassification rate–adjusted classifier by Aoshima and Yata (2014); a sketch of the step structure is given after Step 4:

Misclassification rate–adjusted classifier (MRAC)

Step 1: Set i = 0.

Step 2: Put i = i + 1. If i = k, go to Step 4; otherwise go to Step 3.

Step 3: If it holds that for all j = i + 1,…, k, go to Step 4; otherwise go to Step 2.

Step 4: Classify x0 into πi.
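The loop structure of Steps 1–4 is sketched below. The inequality checked in Step 3 is abstracted into a caller-supplied predicate, so the function is a skeleton of the procedure rather than a complete implementation; the name passes_step3 is illustrative.

    def mrac_classify(k, passes_step3):
        """Step structure of the MRAC (Steps 1-4).

        k            : number of classes.
        passes_step3 : callable (i, j) -> bool standing in for the adjusted
                       comparison between pi_i and pi_j checked in Step 3.
        Returns the (1-based) index i of the class into which x0 is classified.
        """
        i = 0                                                        # Step 1
        while True:
            i += 1                                                   # Step 2
            if i == k:
                return i                                             # Step 4
            if all(passes_step3(i, j) for j in range(i + 1, k + 1)): # Step 3
                return i                                             # Step 4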

We have the following result.

Theorem 3.1

Under (A-i) to (A-iii), for the MRAC with (3.2), it holds that, as p → ∞, e(i) ≤ αi + o(1) for i = 1,…, k. (3.3)

3.2. Designing a Lower Bound, Δi*L

First, we consider a lower bound of Δij(1). Let . By using the two-sample test by Aoshima and Yata (2015a) under certain regularity conditions, it holds that as p → ∞ and ni → ∞, i = 1,…, k where Yij denotes a random variable distributed as the standard normal distribution and having Winis defined by (9) in Yata and Aoshima (2013). Here, Wini is an unbiased estimator of and as p → ∞ and ni → ∞ under (A-i). See Aoshima and Yata (2014) for the details. It follows that for given α′ ∈ (0, 1/2). Thus, one may design a lower bound of Δij(1) by (3.4) for sufficiently small α′. Next, we consider a lower bound of Δij(2). For i ≠ j it holds that with equality if and only if tr(Σi) = tr(Σj). We note that as p → ∞ and ni → ∞, i = 1,…, k under (A-i). Thus, one may design a lower bound of Δij(2) by for i ≠ j. Let Δij*L = Δij(1)L + Δij(2)L for all i ≠ j. Note that Δij*L = Δji*L for i ≠ j. Finally, we choose a lower bound, Δi*L, by Δi*L = min j(≠i)=1,…, kΔij*L for sufficiently small α′.

3.3. Two-Stage Procedure

In order to estimate Cis in (3.2), we proceed with the following two steps:

1.

Choose mi (≥ 4) satisfying (3.5) for i = 1,…, k. Note that (3.5) holds when mi/Ci ∈ (0, 1) as p → ∞. Take pilot samples, xij, j = 1,…, mi, of size mi from each πi. Then, calculate Wimi for each πi according to (9) in Yata and Aoshima (2013). Let and for i = 1,…, k. Define the total sample size for each πi by (3.6) where ⌈ x ⌉ denotes the smallest integer ≥ x.

2.

For each i, if Ni = mi, do not take any additional samples from πi, and otherwise (that is, if Ni > mi) take additional samples, xij, j = mi + 1,…, Ni, of size Ni − mi from πi. By combining the initial samples and the additional samples, calculate x̄iNi and SiNi, i = 1,…, k. Then, follow the MRAC by using Wi(x0|Ni) and tr(SiNi) instead of Wi(x0|ni) and tr(Sini), as sketched below.
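A sketch of the two-stage determination of the Nis follows. It assumes that (3.6) reduces to Ni = max{mi, ⌈Ĉi⌉}, where Ĉi is an estimate of Ci in (3.2) computed from the pilot quantities (e.g., Wimi and tr(Simi)); since the exact expression is not reproduced here, the estimate is left as a caller-supplied function, and both names are illustrative.

    import math

    def two_stage_sizes(pilot_samples, estimate_C):
        """Two-stage determination of the total sample sizes N_i.

        pilot_samples : list of (m_i, p) arrays, the pilot samples from each pi_i.
        estimate_C    : callable (pilot_samples, i) -> estimate of C_i in (3.2),
                        computed from the pilot quantities; it stands in for the
                        expression in (3.6).
        Returns the list of N_i; if N_i = m_i, no additional samples are taken,
        and otherwise N_i - m_i additional samples are drawn from pi_i.
        """
        sizes = []
        for i, X in enumerate(pilot_samples):
            m_i = X.shape[0]
            N_i = max(m_i, math.ceil(estimate_C(pilot_samples, i)))
            sizes.append(N_i)
        return sizes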

Theorem 3.2

Under (A-i) to (A-iii), (3.3) holds for the MRAC with (3.5) and (3.6).

Remark 3.1

When k = 2, Aoshima and Yata (2011a) gave a two-stage classification rule based on the geometric classifier. See Theorem 4.3 in Aoshima and Yata (2011a) for the details. We emphasize that the MRAC can claim (3.3) for k ≥ 2 even under milder conditions than the original one by Aoshima and Yata (2011a).

Remark 3.2

Under (A-i), (A-ii), and (3.5), it holds that Ni/Ci = 1 + oP(1) as p → ∞, which falls in the HDLSS situation, in the sense that Ni/p = oP(1), under the condition that Ci/p → 0 as p → ∞.

Remark 3.3

Even when mi/Ci > 1 for some i, the assertion in Theorem 3.2 still holds. However, it may cause oversampling in the sense that Ni/Ci > 1 w.p.1.

4. SIMULATION

In order to examine the performance of the MRAC with (3.5) and (3.6), we used computer simulations. First, we considered two classes having Gaussian distributions. Independent pseudorandom observations were generated from πi: Np(μi, Σi), i = 1, 2. We considered Σ1 = B{(−1)|i−j|0.3|i−j|1/3}B and Σ2 = c{(−1)|i−j|0.4|i−j|1/3}, where B = diag[{0.5 + 1/(p + 1)}1/2,…, {0.5 + p/(p + 1)}1/2]. Note that tr(Σ1) = p and tr(Σ2) = cp. We set μ1 = (1,…, 1, 0,…, 0)T whose first 30 elements are 1 and μ2 = (0,…, 0)T, so that Δ12(1) = ||μ1 − μ2||2 = 30. We prespecified Δ1*L = Δ2*L = Δ12(1) = 30. We set (α1, α2) = (0.05, 0.15) and mi = ⌈ 0.5 × (Ci − 1) ⌉ + 1, i = 1, 2, where Ci is defined by (3.2). We considered four cases: (a) p = 500 when c = 1, (b) p = 1,000 when c = 1, (c) p = 500 when c = 1.2, and (d) p = 1,000 when c = 1.2. By averaging the outcomes from 2,000 (= R, say) replications, the findings are summarized in Table 1. Under a fixed scenario, suppose that the rth replication ends with Ni = nir (i = 1, 2) observations for r = 1,…, R. Let and . At the end of the rth replication, we checked whether the classifier does (or does not) classify x0 from πi correctly and defined Pir = 0 (or 1) accordingly for each i. We calculated R−1∑r=1RPir for each i as an estimate of e(i). Their estimated standard errors were given by for each i, where . As observed in Table 1, the two-class MRAC with (3.5) and (3.6) gave adequate performance in all the cases when those standard errors were taken into account. Especially, when tr(Σ1) ≠ tr(Σ2) as in (c) and (d), the MRAC gave good performance because Δi* > Δi*L, i = 1, 2.
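The covariance and mean construction of this two-class setting can be coded as below, reading the (s, t) entry of the bracketed matrices as (−1)|s−t|0.3|s−t|1/3 (and 0.4 in place of 0.3 for Σ2); the identities tr(Σ1) = p and tr(Σ2) = cp provide a numerical check on this reading, and the function name is illustrative.

    import numpy as np

    def two_class_setting(p, c):
        """Covariance matrices and mean vectors of the two-class Gaussian setting."""
        idx = np.arange(1, p + 1)
        d = np.abs(idx[:, None] - idx[None, :])           # |s - t| for entry (s, t)
        B = np.diag(np.sqrt(0.5 + idx / (p + 1)))
        Sigma1 = B @ (((-1.0) ** d) * 0.3 ** (d ** (1 / 3))) @ B
        Sigma2 = c * (((-1.0) ** d) * 0.4 ** (d ** (1 / 3)))
        mu1 = np.zeros(p)
        mu1[:30] = 1.0                                    # first 30 elements equal to 1
        mu2 = np.zeros(p)
        return Sigma1, Sigma2, mu1, mu2

    # Observations from pi_i : N_p(mu_i, Sigma_i) can then be drawn, for example, with
    # np.random.default_rng(0).multivariate_normal(mu1, Sigma1, size=n1).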

Table 1. Accuracy of the two-class MRAC with (3.5) and (3.6)

Next, we considered three classes having non-Gaussian distributions generated by yijl = (8/10)1/2wijl, where the wijl, j = 1,…, p (l = 1, 2,…), are independently distributed as the t-distribution with 10 degrees of freedom for each πi (i = 1, 2, 3). Note that E(yijl) = 0, Var(yijl) = 1, and yijl, j = 1,…, p (i = 1, 2, 3; l = 1, 2,…), are independent. Let xil = Γiyil + μi, where yil = (yi1l,…, yipl)T and Γi is a p × p matrix such that ΓiΓiT = Σi. Then, the distribution of xil satisfies (A-i) for each πi. We considered Σ1 = B{(−1)|i−j|0.3|i−j|1/3}B, Σ2 = B{(−1)|i−j|0.4|i−j|1/3}B, and Σ3 = 1.2{(−1)|i−j|0.4|i−j|1/3}. We set μ1 = (1,…, 1, 0,…, 0)T whose first 40 elements are 1, μ2 = (0,…, 0, 1,…, 1, 0,…, 0)T whose 21st to 60th elements are 1, and μ3 = (0,…, 0)T. Then, we had Δi* ≥ 40 for i = 1, 2, 3. We prespecified Δi*L = 40, i = 1, 2, 3. We set mi = ⌈ 0.5 × (Ci − 1) ⌉ + 1 for each πi. We considered four cases: (a) p = 500 when (α1, α2, α3) = (0.1, 0.1, 0.1), (b) p = 1,000 when (α1, α2, α3) = (0.1, 0.1, 0.1), (c) p = 500 when (α1, α2, α3) = (0.05, 0.1, 0.15), and (d) p = 1,000 when (α1, α2, α3) = (0.05, 0.1, 0.15). By averaging the outcomes from 2,000 (= R) replications, the findings are summarized in Table 2. Throughout, the three-class MRAC with (3.5) and (3.6) gave adequate performance in all cases when those standard errors were taken into account.
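The coordinate generation for this three-class setting is sketched below; the scaling (8/10)1/2 makes the variance of a t10 variate equal to one, and the particular square-root choice for Γi mentioned in the trailing comment is ours.

    import numpy as np

    def generate_y(n, p, rng):
        """Coordinates y_ijl = (8/10)^{1/2} w_ijl with w_ijl ~ t_10, so that
        E(y_ijl) = 0 and Var(y_ijl) = (8/10) * (10/8) = 1."""
        return np.sqrt(8 / 10) * rng.standard_t(df=10, size=(n, p))

    # An observation x_il is then formed as Gamma_i @ y_il + mu_i for some Gamma_i
    # with Gamma_i Gamma_i^T = Sigma_i; a symmetric square root of Sigma_i is one
    # such choice.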

Table 2. Accuracy of the three-class MRAC with (3.5) and (3.6)

5. EXAMPLE

We analyzed gene expression data given by Armstrong et al. (2002) in which the data set consists of 12,582 (= p) genes. We had three classes of leukemia subtypes; that is, π1: acute lymphoblastic leukemia (24 samples), π2: mixed-lineage leukemia (20 samples), and π3: acute myeloid leukemia (28 samples). We used the MRAC and compared the geometric classifier with (3.5) and (3.6) with the distance-based classifier by Aoshima and Yata (2014). The total sample size, Ni*, of the distance-based classifier is defined for each πi in terms of Δi(1) = min j(≠i)=1,…, kΔij(1), i = 1,…, k, and a lower bound Δi(1)L of Δi(1) such that Δi(1) ≥ Δi(1)L. Since Δi* ≥ Δi(1), the Ni*s are larger than the Nis in (3.6) w.p.1 when Δi*L > Δi(1)L.

We prespecified (α1, α2, α3) = (0.05, 0.15, 0.1), so that α(1) = 0.1, α(2) = 0.05, and α(3) = 0.05. We set m1 = m2 = m3 = 10. According to Section 3.2, by setting α′ = 0.05 and ni = mi (= 10), i = 1, 2, 3, we had Δ12*L = 6.11 × 109, Δ13*L = 2.45 × 1010, and Δ23*L = 8.09 × 109. Thus, we prespecified Δ1*L = min (Δ12*L, Δ13*L) = 6.11 × 109, Δ2*L = min (Δ12*L, Δ23*L) = 6.11 × 109, and Δ3*L = min (Δ13*L, Δ23*L) = 8.09 × 109. Also, we had Δ12(1)L = 5.96 × 109, Δ13(1)L = 2.37 × 1010, and Δ23(1)L = 7.81 × 109 according to (3.4). Thus, we prespecified Δ1(1)L = 5.96 × 109, Δ2(1)L = 5.96 × 109, and Δ3(1)L = 7.81 × 109.

By using pilot samples of size m1 = m2 = m3 = 10, we calculated W1m1 = 2.59 × 1019, W2m2 = 2.16 × 1019, and W3m3 = 2.51 × 1019. From (3.6), the total sample size for π1 was calculated as N1 = 19.

Similarly, we had N2 = 16 and N3 = 12. We considered constructing the geometric classifier, Wi(x0|Ni), i = 1, 2, 3, from (N1, N2, N3) = (19, 16, 12) samples and checking the accuracy of the MRAC by using the remaining (24 − N1, 20 − N2, 28 − N3) = (5, 4, 16) samples. We randomly split the data set from each πi into training sets of sizes (N1, N2, N3) = (19, 16, 12) and test sets of sizes (5, 4, 16). We constructed Wi(x0|Ni), i = 1, 2, 3, from the training sets and checked the accuracy of the MRAC by using the test sets. We repeated this procedure 100 times and obtained the averages of the misclassification rates for π1, π2, and π3. Also, for the distance-based classifier by Aoshima and Yata (2014), we calculated the total sample sizes as (N1*, N2*, N3*) = (20, 17, 12) and obtained the corresponding average misclassification rates. Similarly, for various settings of the αis, we investigated the performance of the geometric classifier and the distance-based classifier in the MRAC. Throughout, we used the same settings, m1 = m2 = m3 = 10 and (Δ1*L, Δ2*L, Δ3*L) = (6.11 × 109, 6.11 × 109, 8.09 × 109) or (Δ1(1)L, Δ2(1)L, Δ3(1)L) = (5.96 × 109, 5.96 × 109, 7.81 × 109). We summarized the results in Table 3. Both classifiers seem to give adequate performance in such an HDLSS situation. The geometric classifier would save more observations compared to the distance-based classifier, especially in small sample size settings. On the other hand, the distance-based classifier is very versatile and it holds (3.3) under milder conditions than the geometric classifier. See Sections 3 and 4 in Aoshima and Yata (2014) for details.
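The repeated random-split evaluation described above can be sketched as follows; the classifier is passed in as a function (for example, the MRAC built on the geometric classifier), the class labels are 0-based for convenience, and the function name is illustrative.

    import numpy as np

    def average_error_rates(class_data, train_sizes, classify, n_repeats=100, seed=0):
        """Average misclassification rates over repeated random training/test splits.

        class_data  : list of (n_total_i, p) arrays, all samples from each class.
        train_sizes : training-set sizes, e.g. (19, 16, 12).
        classify    : callable (x0, training_sets) -> predicted class index (0-based).
        """
        rng = np.random.default_rng(seed)
        k = len(class_data)
        errors = np.zeros(k)
        counts = np.zeros(k)
        for _ in range(n_repeats):
            train, test = [], []
            for X, N in zip(class_data, train_sizes):
                perm = rng.permutation(X.shape[0])
                train.append(X[perm[:N]])
                test.append(X[perm[N:]])
            for i, T in enumerate(test):
                for x0 in T:
                    errors[i] += classify(x0, train) != i
                    counts[i] += 1
        return errors / counts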

Table 3. Average misclassification rates of the MRAC by the geometric classifier with (3.5) and (3.6) and by the distance-based classifier by Aoshima and Yata (2014). We set m1 = m2 = m3 = 10 and (Δ1*L, Δ2*L, Δ3*L) = (6.11 × 109, 6.11 × 109, 8.09 × 109) or (Δ1(1)L, Δ2(1)L, Δ3(1)L) = (5.96 × 109, 5.96 × 109, 7.81 × 109). When αi ≤ 0.05 for at least two πis, the result was not available within the data sets

FUNDING

Research of the first author was partially supported by Grants-in-Aid for Scientific Research (B) and Challenging Exploratory Research, Japan Society for the Promotion of Science (JSPS), under Contract Numbers 22300094 and 26540010. Research of the second author was partially supported by Grant-in-Aid for Young Scientists (B), Japan Society for the Promotion of Science (JSPS), under Contract Number 26800078.

ACKNOWLEDGMENT

The authors thank the Editor-in-Chief, Professor Nitis Mukhopadhyay, for giving us the opportunity to contribute to Stein's (1945) 70-Year Celebration Issue.

Notes

Recommended by Nitis Mukhopadhyay

REFERENCES

  • Aoshima, M., and Yata, K., 2011a. Authors’ Response, Sequential Analysis 30, pp. 432–440.
  • Aoshima, M., and Yata, K., 2011b. Two-Stage Procedures for High-Dimensional Data (Editor's special invited paper), Sequential Analysis 30, pp. 356–399.
  • Aoshima, M., and Yata, K., 2014. A Distance-Based, Misclassification Rate Adjusted Classifier for Multiclass, High-Dimensional Data, Annals of the Institute of Statistical Mathematics 66, pp. 983–1010.
  • Aoshima, M., and Yata, K., 2015a. Asymptotic Normality for Inference on Multisample, High-Dimensional Mean Vectors under Mild Conditions, Methodology and Computing in Applied Probability 17, pp. 419–439.
  • Aoshima, M., and Yata, K., 2015b. High-Dimensional Quadratic Classifiers in Non-Sparse Settings, arXiv:1503.04549.
  • Armstrong, S. A., Staunton, J. E., Silverman, L. B., Pieters, R., den Boer, M. L., Minden, M. D., Sallan, S. E., Lander, E. S., Golub, T. R., and Korsmeyer, S. J., 2002. MLL Translocations Specify a Distinct Gene Expression Profile That Distinguishes a Unique Leukemia, Nature Genetics 30, pp. 41–47.
  • Chan, Y.-B., and Hall, P., 2009. Scale Adjustments for Classifiers in High-Dimensional, Low Sample Size Settings, Biometrika 96, pp. 469–478.
  • Dudoit, S., Fridlyand, J., and Speed, T. P., 2002. Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data, Journal of the American Statistical Association 97, pp. 77–87.
  • Stein, C., 1945. A Two-Sample Test for a Linear Hypothesis Whose Power Is Independent of the Variance, Annals of Mathematical Statistics 16, pp. 243–258.
  • Stein, C., 1949. Some Problems in Sequential Estimation (abstract), Econometrica 17, pp. 77–78.
  • Yata, K., and Aoshima, M., 2013. Correlation Tests for High-Dimensional Data Using Extended Cross-Data-Matrix Methodology, Journal of Multivariate Analysis 117, pp. 313–331.

Appendix

Proof of Theorem 2.1

Under (A-iv), it holds that and for all i, j. Note that for all i ≠ j, under (A-iv). Then, it holds that for all i ≠ j, under (A-iii) and (A-iv). Thus, by using Chebyshev's inequality, under (A-iii) and (A-iv) we obtain that when x0 ∈ πi for all i ≠ j. Under (A-i) and (A-iv) we have that and for all i ≠ j, so that tr(Sini) = tr(Σi) + oPij) and when x0 ∈ πi for all i ≠ j. Note that tr(Σi)/p ∈ (0, ∞) as p → ∞ for i = 1,…, k. Then, under (A-i), (A-iii), and (A-iv), we have that (A.1) when x0 ∈ πi for all i ≠ j. Hence, we conclude the results.

Proof of Theorem 2.2

We note that for all i ≠ j under (A-ii). Also, note that for all i ≠ j under (A-ii) and (A-v) since δji/(njδij) = o(1) for all i ≠ j under (A-ii). Let for i ≠ j. Then, similar to (A.1), under (A-i), (A-ii), (A-v), and Δij(1)/tr(Σj) = o(1) for all i ≠ j, we have that (A.2) when x0 ∈ πi for all i ≠ j since tr(Sini)/tr(Σi) − 1 = OPij/p) = oP(1). Here, we note that , under (A-ii) from the fact that under (A-ii). It holds that when x0 ∈ πi and , i = 1,…, k. Then, under (A-ii) and (A-v), we have that (A.3) when x0 ∈ πi for all i ≠ j. On the other hand, under (A-i) and (A-ii), it holds that (A.4) for all i ≠ j. Then, by combining (A.2) with (A.3) and (A.4), under the assumptions of Theorem 2.2 we have that when x0 ∈ πi. Note that for all i ≠ j under (A-ii). Then, in a way similar to the proof of Theorem 3 in Aoshima and Yata (2014), under (A-i) and (A-ii) we can claim that ω(x0|ni, nj)/δij ⇒ Yij for all i ≠ j. This concludes the result.

Proof of Corollary 2.1

By using Theorem 2.2 and Bonferroni's inequality, we have that when x0 ∈ πi. This concludes the proof.

Proof of Theorem 3.1

From (3.2), it holds that δij ≤ 2Δ(ij){1 + o(1)}/(zαi/(k − 1) + zαj/(k − 1)) when tr(Σi)/tr(Σj) = 1 + o(1) for all i ≠ j. We denote the error of misclassifying an individual from πi into πj by e(j|i) for i ≠ j. Then, under (3.2) and the assumptions of Theorem 2.2, we have that

when x0 ∈ πi for i ≠ j, where Yij denotes a random variable distributed as the standard normal distribution. We note that (A-v) holds under (A-iii) when for all i ≠ j. On the other hand, when δij/Δij = o(1) for i ≠ j, from Theorem 2.1 it holds that for x0 ∈ πi

under (A-i) to (A-iii) without (A-v). We note that δij/Δij = o(1) for i ≠ j under (A-ii) when it holds that or . Thus, one can claim e(j|i) ≤ αi/(k − 1) + o(1) for all i ≠ j under (3.2) and (A-i) to (A-iii). Then, from Bonferroni's inequality, we have that when x0 ∈ πi. This concludes the proof.

Proof of Theorem 3.2

Let CiL = ⌊Ci − (ωCi)1/2⌋, i = 1,…, k, where ω (> 0) is a variable such that ω → 0 as p → ∞. Then, from the proof of Theorem 5 in Aoshima and Yata (2014), it holds that max {mi, CiL} ≤ Ni < Ci + (ωCi)1/2 as p → ∞ w.p.1. Then, in a way similar to the proofs of Theorems 2.4 and 2.5 in Aoshima and Yata (2011a), under (A-i) to (A-iii) we have that for all i ≠ j where ω(x0|Ni, Nj) is given in the proof of Theorem 2.2. Similar to the proof of Theorem 2.2, under (A-i) to (A-iii) we have that when x0 ∈ πi for all i ≠ j. Then, in a way similar to the proof of Theorem 3.1, we can conclude the result.