Abstract
Nonparametric two-stage procedures to construct fixed-width confidence intervals are studied to quantify uncertainty. It is shown that the validity of the random central limit theorem (RCLT) accompanied by a consistent and asymptotically unbiased estimator of the asymptotic variance already guarantees consistency and first-order as well as second-order efficiency of the two-stage procedures. This holds under the common asymptotics where the length of the confidence interval tends toward 0 as well as under the novel proposed high-confidence asymptotics where the confidence level tends toward 1. The approach is motivated by and applicable to data analysis from distributed big data with nonnegligible costs of data queries. The following problems are discussed: Fixed-width intervals for the mean, for a projection when observing high-dimensional data, and for the common mean when using nonlinear common mean estimators under order constraints. The procedures are investigated by simulations and illustrated by a real data analysis.
1. INTRODUCTION
In this article, we study fully nonparametric two-stage procedures to construct a fixed-width interval for a parameter to quantify uncertainty. Both the common high-accuracy framework, where the asymptotics assumes that the width of the interval shrinks, and a novel high-confidence framework are studied. Under high-confidence asymptotics the required uncertainty in terms of the width of the interval is fixed and the asymptotics assumes that the confidence level increases. General sufficient conditions are derived that yield consistency and efficiency for both frameworks. We study three statistical problems: Nonparametric fixed-width intervals for the mean of univariate data, which may be the most common setting; for the mean projection of high-dimensional data to illustrate the application to big data; and for the common mean of two samples as a classic statistical problem leading to a nonlinear estimator, which has not yet been treated in the literature from a two-stage sampling perspective. The focus is on two-stage procedures, because they provide a good compromise between the conflicting goals of a minimal sample size, which requires purely sequential sampling, and feasibility in applications in terms of required computing resources and logistic simplicity, which is better matched by one- or two-stage sampling procedures.
Two-stage sequential sampling is a well-established approach motivated by the aim to make statistical statements with minimal sample sizes without relying on purely sequential sampling. Instead, the data are sampled in two batches, a first-stage sample and a second-stage sample, if required. In the second stage, the final sample size is determined using the information contained in the first-stage (pilot) sample. The development of such procedures was mainly motivated by the need to base statistical inference on small samples in a world where large samples are not available. But this technique is also of interest in various areas including emerging ones such as data science and big data, where massive amounts of variables are collected and need to be processed and analyzed: When analyzing big data distributed over many nodes of a network, each single query may be associated with a high response time and substantial data transmission costs, ruling out a purely sequential sampling strategy, because the benefit of fewer required observations on average is outweighed by the high costs for each query. In contrast, the two-stage methods proposed in this article allow efficient estimation of the means of the variables and their projections with preassigned accuracy and confidence. The general construction of the sample size rules mainly follows the established approach. However, compared to the existing literature, we use a slightly modified first-stage sample size rule that takes into account prior knowledge and historical estimates, respectively, of the data uncertainty. Our studies indicate that even if we use only three data points to get a rough guess of variability, the resulting first-stage sample size is much closer to the actually required sample size, thus avoiding oversampling at this stage. In the context of distributed data, the proposed methods with this three-observations rule need at most three database queries.
This article contributes to the existing literature on two-stage procedures (see Stein 1945; Mukhopadhyay 1980; Ghosh, Mukhopadhyay, and Sen 1997; Mukhopadhyay and Silva 2009; Steland 2015, 2017; and references therein) by proposing a concrete nonparametric procedure with the following properties: For any estimator of the mean (or a parameter θ) that satisfies the random central limit theorem (RCLT) and whose asymptotic variance can be estimated by a consistent and asymptotically unbiased estimator, the random sample size leading to the proposed fixed-width confidence interval is consistent and asymptotically unbiased for the optimal sample size. Further, the procedure yields the right asymptotic coverage probability and exhibits first-order as well as second-order efficiency.
Further, and more important, we go beyond the classic framework that establishes the above properties when the width of the confidence interval tends toward 0. We argue that this is to some extent counterintuitive in view of the posed problem to construct a fixed-width confidence interval. It also limits the approximate validity of the results to cases where one aims at high-accuracy estimation. But in many applications it is more appropriate to fix the width of the confidence interval and to assume that a larger number of observations is due to a higher confidence level. Therefore, we propose a novel framework and study the construction of a fixed-width interval under high-confidence asymptotics (equivalently: low error probability asymptotics). This is also motivated by the fact that in many areas such as high-quality, high-throughput production engineering or statistical genetics and brain research, where high-dimensional data are collected, large confidence levels and small significance levels, respectively, are in order and used in practice. For example, in production engineering the accuracy is fixed by the technical specifications and not by the statistician, and in genetics as well as in brain research small error probabilities are required to reach scientific relevance and to take multiple testing into account.
It is shown that the proposed two-stage procedure is valid under high-confidence asymptotics and exhibits first- and second-order efficiency properties, as long as the parameter estimator satisfies the random central limit theorem (RCLT) and a consistent and asymptotically unbiased estimator of the asymptotic variance is at our disposal.
Having in mind big data sets with a large number of variables, we then apply the general results to projections of high-dimensional data. It is assumed that the observations are given by a data stream of (possibly) increasing dimension, which is sampled in batches by our two-stage procedure. Two-stage procedures for high-dimensional data have been studied in-depth in Aoshima and Yata (2011) assuming that the dimension $p$ tends toward $\infty$ and the sample size is either fixed or tends toward $\infty$ as well. Here we consider a projection of high-dimensional data, where, when having sampled $n$ observations, the projection may depend on the sample size $n$. The asymptotic properties (consistency and efficiency) of the fixed-width confidence interval for the mean projection hold for high-accuracy asymptotics as well as high-confidence asymptotics. The dimension $p$ may increase with $n$ in an unconstrained way.
As an interesting and nontrivial classical application, we consider the problem of common mean estimation. Here one aims at estimating the mean from two samples assuming that they have the same mean but possibly different or ordered variances. Many of the estimators proposed and studied in the literature are given by a convex combination of the sample means with convex weights depending on the sample means and the sample variances.
This article is organized as follows. Section 2 studies nonparametric two-stage fixed-width confidence intervals for the mean under both asymptotic frameworks, starting with the usual high-accuracy approach and then discussing the novel high-confidence asymptotics. Section 3 provides the results when dealing with a projection of high-dimensional data. Common mean estimation is treated in Section 4. Results from simulations and a data example are provided in Section 5.
2. NONPARAMETRIC TWO-STAGE FIXED-WIDTH CONFIDENCE INTERVAL
Let $X_1, X_2, \ldots$ be independent and identically distributed with common distribution function $F$, with mean $\mu$ and finite variance $\sigma^2 > 0$. Further, let $\hat\mu_n$ be an estimator for $\mu$ using the first $n$ observations $X_1, \ldots, X_n$. We focus on the mean as the parameter of interest, but it is easy to see that all results remain true for any univariate parameter $\theta$ and an estimator $\hat\theta_n$.

The classical approach to the construction of a confidence interval is based on a sample of fixed (but large) sample size $N$ and determines a random interval depending on the sample(s), such that its coverage probability equals the given confidence level $1-\alpha$ for each $N$, or has asymptotic coverage $1-\alpha$ as $N$ tends toward $\infty$. As a consequence, the length of the interval, which represents the reported accuracy, is random.

There are, however, situations where we want to report an interval of a fixed, preassigned accuracy $d$, symmetric around the point estimator of $\mu$, so that the interval is $[\hat\mu_N - d, \hat\mu_N + d]$. Then the coverage probability of the resulting interval depends on the distribution of $\hat\mu_N$, and the sample size $N$ becomes the parameter we may select to achieve a certain confidence level. In mathematical terms, we wish to find some $N$ so that the interval around the estimator $\hat\mu_N$ based on a sample of size $N$ has coverage probability

$P(\mu \in [\hat\mu_N - d, \hat\mu_N + d]) = 1 - \alpha + o(1)$,   (2.1)

as the precision parameter $d$ tends toward 0. The $o(1)$ term is required as the CLT (respectively RCLT) for $\hat\mu_N$ is used to construct a solution.
Usually, d is small and N increases when d decreases. Thus, it is reasonable to consider asymptotic properties as d tends toward 0. We shall, however, also consider the case of a fixed accuracy d, not necessarily small, and study asymptotic properties when the confidence level tends toward 1.
Suppose that the estimator satisfies the CLT; that is,

$\sqrt{n}\,(\hat\mu_n - \mu) \to N(0, \sigma^2)$ in distribution,   (2.2)

as $n \to \infty$, for some positive constant $\sigma^2$, the asymptotic variance of our estimator for $\mu$. Throughout the article we shall assume that we have an estimator for $\sigma^2$ at our disposal, which we denote by $\hat\sigma_n^2$ if it is based on the first $n$ observations. The most common choice for $\hat\mu_n$ is, of course, the sample average $\bar X_n = n^{-1}\sum_{i=1}^n X_i$, which satisfies Equation (2.2) with $\sigma^2 = \operatorname{Var}(X_1)$. The canonical estimator for $\sigma^2$ is the sample variance $S_n^2 = (n-1)^{-1}\sum_{i=1}^n (X_i - \bar X_n)^2$. When considering a parameter $\theta$ estimated by $\hat\theta_n$ such that the analogue of Equation (2.2) holds—that is, $\sqrt{n}\,(\hat\theta_n - \theta) \to N(0, \sigma_\theta^2)$ in distribution—one formally replaces $\mu$ by $\theta$ and needs an estimator of $\sigma_\theta^2$ having the properties required in Assumption (E) in the next section. For simplicity of presentation and proofs, we stick to the case of the mean, however.
Invoking the CLT for $\hat\mu_n$, it is easy to see that the problem is solved by the asymptotically optimal sample size

$N_d = \left\lceil z_{1-\alpha/2}^2\, \sigma^2 / d^2 \right\rceil$, where $z_{1-\alpha/2} = \Phi^{-1}(1-\alpha/2)$,   (2.3)

because the left-hand side of Equation (2.1) is equal to $\Phi(\sqrt{N}\, d/\sigma) - \Phi(-\sqrt{N}\, d/\sigma) + o(1)$. Observe that $N_d \to \infty$ if $d \to 0$, which in turn justifies the application of the CLT.
If $\sigma^2$ were known, then $N_d$ would solve the posed problem. The proposed two-stage procedure draws a random sample of size $N_0$ at the first stage, which is larger than or equal to a given minimal sample size $m$. The first-stage sample size $N_0$ will be larger if the required precision gets smaller. The first-stage sample is used to estimate the uncertainty of the estimator, and that (random) estimate is then used to specify the final sample size $\hat N_d$ used in the second stage. Before discussing how one should specify $N_0$ and $\hat N_d$, let us summarize the basic algorithm:

Preparations: Specify the minimal sample size $m$, the confidence level $1-\alpha$, and the precision $d$.

Stage I: Draw an initial sample of size $N_0$ in order to estimate unknowns (in our case $\sigma^2$) based on that data, yielding an estimator $\hat N_d$ for $N_d$; that is, a random sample size.

Stage II: Draw $\hat N_d - N_0$ additional observations to obtain a sample of size $\hat N_d$. Estimate $\mu$ by $\hat\mu_{\hat N_d}$.

Solution: Output the fixed-width confidence interval $[\hat\mu_{\hat N_d} - d, \hat\mu_{\hat N_d} + d]$.
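To make the algorithm concrete, the following is a minimal Python sketch of the four steps above. The sampling callback `draw`, the default constants, and the use of the sample variance as the variance estimator are illustrative assumptions; the paper leaves the estimator and the data source generic, and the first-stage rule used here is the three-observations rule proposed later in this section.

```python
import math
import random
from statistics import NormalDist, fmean, variance

def two_stage_interval(draw, d, alpha=0.05, m=10, gamma=0.5):
    """Fixed-width interval for the mean via the two-stage algorithm.

    draw(k) returns k fresh observations (one query of the data source);
    this callback and the default parameter values are assumptions made
    for illustration only.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Pilot guess f of sigma from three extra observations.
    f = math.sqrt(variance(draw(3)))
    # Stage I: first-stage sample size, bounded below by m.
    n0 = max(m, math.ceil((z * f / d) ** (2 / (1 + gamma))))
    sample = draw(n0)
    sigma2_hat = variance(sample)       # first-stage variance estimate
    # Stage II: final (random) sample size from the first-stage estimate.
    n_hat = max(n0, math.ceil(z ** 2 * sigma2_hat / d ** 2))
    sample += draw(n_hat - n0)
    mu_hat = fmean(sample)
    return (mu_hat - d, mu_hat + d), n_hat

random.seed(1)
(lo, hi), n = two_stage_interval(
    lambda k: [random.gauss(5.0, 2.0) for _ in range(k)], d=0.25)
print(n, round(lo, 3), round(hi, 3))
```

Note that the data source is queried at most three times: once for the pilot observations, once for the first stage, and once for the second stage.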
2.1. Fixed-Width Interval under High-Accuracy Asymptotics
Let us first study the classical approach: fix the confidence level and assume that the accuracy is small, suggesting to investigate approximations as $d \to 0$. This framework can be called high-accuracy asymptotics.
In the sequel, we review part of the literature that focused on normal data and the associated optimal estimators. We follow the arguments developed for the Gaussian case to motivate our fully nonparametric proposal, where $\hat\mu_n$ may be an arbitrary estimator satisfying the required regularity assumptions stated below in detail.
The original Stein procedure (see Stein 1945) addresses Gaussian i.i.d. observations and estimates $\mu$ by the sample mean, such that $\hat\mu_n = \bar X_n$, and a natural estimator for $\sigma^2$ based on $X_1, \ldots, X_n$ is the sample variance $S_n^2$. Stein uses the rule $\hat N_d = \max\{m, \lceil t_{m-1,1-\alpha/2}^2\, S_m^2 / d^2 \rceil\}$, where $t_{m-1,1-\alpha/2}$ denotes the $(1-\alpha/2)$-quantile of the $t$-distribution with $m-1$ degrees of freedom. Because the first-stage sample size $m$ is fixed, the procedure turns out to be inconsistent. To overcome this issue, Chow and Robbins (1965) proposed a purely sequential rule, namely,

$N_d^{CR} = \inf\{n \ge m : n \ge z_{1-\alpha/2}^2\, (S_n^2 + n^{-1}) / d^2\}$.

Mukhopadhyay (1980) noted that one gets for small $d$ the lower bound $(z_{1-\alpha/2}/d)^{2/(1+\gamma)}$ and proposed to increase the variance estimate slightly by $n^{-\gamma}$, $0 < \gamma < 1$. Indeed, for small enough $d$ we have this lower bound if one replaces the estimate $S_n^2$ by $S_n^2 + n^{-\gamma}$, because any $n$ with $n \ge z_{1-\alpha/2}^2\, (S_n^2 + n^{-\gamma})/d^2$ satisfies $n \ge z_{1-\alpha/2}^2\, n^{-\gamma}/d^2$ and hence $n^{1+\gamma} \ge (z_{1-\alpha/2}/d)^2$. This leads to $n \ge (z_{1-\alpha/2}/d)^{2/(1+\gamma)}$ for any such $n$, such that we obtain the lower bound $(z_{1-\alpha/2}/d)^{2/(1+\gamma)}$. Therefore, the purely sequential rule with inflated variance estimate satisfies this bound as well. The idea is now to use this lower bound as the first-stage sample size for Gaussian data; for nonnormal samples one replaces the $t$-quantile by the corresponding quantile, $z_{1-\alpha/2}$, of the standard normal distribution. However, this rule does not take into account the scale of the data and can lead to unrealistically large sample sizes; see the data example in Section 5. Mukhopadhyay (1980) has proposed the modified rule

$N_0 = \max\{m, \lceil (z_{1-\alpha/2}/d)^{2/(1+\gamma)} \rceil\}$

for some $0 < \gamma < 1$. Here $\gamma$ can be selected to obtain a reasonable first-stage sample size; see the discussion and example in Mukhopadhyay and Silva (2009, p. 115). But, indeed, Mukhopadhyay's argument also applies when using the inflation term $f^2 n^{-\gamma}$ for some $f > 0$, and then one gets the lower bound $(z_{1-\alpha/2}\, f/d)^{2/(1+\gamma)}$ and the first-stage sample size $\max\{m, \lceil (z_{1-\alpha/2}\, f/d)^{2/(1+\gamma)} \rceil\}$, respectively. It is easy to check that all of the above arguments go through as well, if we replace $f$ by any guess or pilot estimate of $\sigma$ using a (very) small sample.
Three-Observations Rule: Because frequently in applications it is possible to sample at least three observations, we propose to choose $f$ as an estimate of $\sigma$ using three additional observations, on which we condition in what follows. This leads to our proposal for the first-stage sample size, namely,

$N_0 = N_0(d) = \max\{m, \lceil (z_{1-\alpha/2}\, f / d)^{2/(1+\gamma)} \rceil\}$.   (2.4)

Note that $N_0$ depends on the preassigned precision $d$ and satisfies $N_0 \to \infty$ as $d \to 0$.
It is natural to estimate $\sigma^2$ by $\hat\sigma_{N_0}^2$, and this leads to the final sample size of the procedure,

$\hat N_d = \max\{N_0, \lceil z_{1-\alpha/2}^2\, \hat\sigma_{N_0}^2 / d^2 \rceil\}$,   (2.5)

which is a random variable (depending on the first-stage data), because $\hat\sigma_{N_0}^2$ estimates the asymptotic variance of the estimator $\hat\mu_n$ using the first-stage sample of size $N_0$. Observe that we continue to use the ceiling function $\lceil\cdot\rceil$ in the notation for quantities that may not be integer valued.
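As a quick numeric sanity check of the two sample size rules (2.4) and (2.5), suppose $\alpha = 0.05$, precision $d = 0.1$, a pilot guess $f = 2$ of $\sigma$, $\gamma = 0.5$, $m = 10$, and a first-stage variance estimate of $4.41$; all values are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist

alpha, d, f, gamma, m = 0.05, 0.1, 2.0, 0.5, 10
z = NormalDist().inv_cdf(1 - alpha / 2)          # z_{1-alpha/2}, about 1.96

# First-stage sample size (2.4): scale-aware Mukhopadhyay-type rule.
n0 = max(m, math.ceil((z * f / d) ** (2 / (1 + gamma))))

# Final sample size (2.5), given a hypothetical first-stage estimate of
# the asymptotic variance.
sigma2_hat = 4.41
n_hat = max(n0, math.ceil(z ** 2 * sigma2_hat / d ** 2))

print(n0, n_hat)
```

The first-stage size is deliberately much smaller than the final size: its exponent $2/(1+\gamma) < 2$ keeps the initial batch moderate, while the second stage tops the sample up to the estimated optimum.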
Let us briefly review the following facts and considerations leading to the notions of consistency and unbiasedness: Note that $\hat N_d \to \infty$ in probability and $N_d \to \infty$, as $d \to 0$, for any (arbitrary) weakly consistent estimator $\hat\sigma_{N_0}^2$ of the asymptotic variance $\sigma^2$ (based on the first-stage sample of size $N_0$), which slightly complicates their comparison. If we only know that $\hat N_d / N_d \to 1$ in probability, which follows from ratio consistency $\hat\sigma_{N_0}^2/\sigma^2 \to 1$ in probability, then the difference $\hat N_d - N_d$ is not guaranteed to be bounded, because

$\hat N_d - N_d = (\hat N_d/N_d - 1)\, N_d$,

where the first factor is $o_P(1)$ but the second one diverges, as $d \to 0$. For this reason, $\hat N_d$ is called consistent for the asymptotically optimal sample size $N_d$ if the ratio approaches 1 in probability; that is, if

$\hat N_d / N_d \to 1$ in probability,

as $d \to 0$. Having in mind that the second-stage (final) sample size $\hat N_d$ is random whereas the unknown optimal value, $N_d$, is nonrandom, the question arises as to whether $\hat N_d$ is at least close to $N_d$ on average. Therefore, to address this question and going beyond consistency, $\hat N_d$ is called asymptotically first-order efficient in the sense of Chow and Robbins (1965) if

$E(\hat N_d) / N_d \to 1$,

as $d \to 0$. Observe that the last property allows for the case that the difference $E(\hat N_d) - N_d$ tends toward $\infty$, even in the mean, as $d$ tends toward 0. This typically indicates that the procedure is substantially oversampling the optimal sample size. A procedure for which the estimated optimal sample size remains on average in a bounded vicinity of the optimal truth is, of course, preferable. $\hat N_d$ is called second-order asymptotically efficient if

$E(\hat N_d) - N_d = O(1)$,

as $d \to 0$.
The regularity assumptions we need to impose are as follows:
Assumption (E). The estimator $\hat\sigma_n^2$ is consistent and asymptotically unbiased for $\sigma^2$; that is, $\hat\sigma_n^2 \to \sigma^2$ in probability and $E(\hat\sigma_n^2) \to \sigma^2$, as $n \to \infty$.
This assumption is not restrictive and is satisfied by many estimators. For example, the jackknife variance estimator studied in Shao (1993), Shao and Wu (1989), and Steland and Chang (2019) satisfies Assumption (E).
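For illustration, a generic delete-one jackknife variance estimator can be sketched as follows. This is the standard textbook construction, not necessarily the exact estimator analyzed in the cited works; the rescaling is chosen so that it estimates the asymptotic variance of the normalized estimator, and for the sample mean it reproduces the unbiased sample variance.

```python
def jackknife_variance(estimator, sample):
    """Delete-one jackknife estimate of the asymptotic variance of
    sqrt(n) * (estimator - parameter); a standard construction, offered
    here only as an example of an estimator satisfying Assumption (E)."""
    n = len(sample)
    leave_one_out = [estimator(sample[:i] + sample[i + 1:]) for i in range(n)]
    center = sum(leave_one_out) / n
    # (n - 1) * sum of squared deviations of the leave-one-out estimates;
    # for the sample mean this equals the unbiased sample variance.
    return (n - 1) * sum((t - center) ** 2 for t in leave_one_out)

sample = [1.0, 2.0, 4.0, 7.0, 11.0, 6.0, 3.0]
mean = lambda xs: sum(xs) / len(xs)
print(jackknife_variance(mean, sample))
```

The appeal of the jackknife here is that it applies to general (possibly nonlinear) estimators, which matters for the common mean application in Section 4.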
Further, we require the following strengthening of the CLT to hold.
Assumption (A). $\hat\mu_n$ satisfies the RCLT; that is, for any family $\{N_a : a > 0\}$ of stopping times for which $N_a/a \to 1$ in probability, as $a \to \infty$, it holds that

$\sqrt{N_a}\,(\hat\mu_{N_a} - \mu) \to N(0, \sigma^2)$ in distribution,

as $a \to \infty$.
The validity of the RCLT is required, because we have to employ a normal approximation at the final sample size, which is random by construction. Clearly, however, for i.i.d. observations following an arbitrary distribution with finite second moment and the arithmetic mean, Assumption (A) is well known to hold (see, e.g., Ghosh, Mukhopadhyay, and Sen 1997, theorem 2.7.2).
The following theorem summarizes the main asymptotic first-order properties of the proposed two-stage approach to construct a fixed-width confidence interval.
Theorem 2.1.

Suppose that Assumption (E) is satisfied. Then the following two assertions hold true:

i. The estimated optimal sample size $\hat N_d$ is consistent for $N_d$; that is, $\hat N_d / N_d \to 1$ in probability, as $d \to 0$.

ii. $\hat N_d$ is asymptotically first-order efficient for $N_d$; that is, $E(\hat N_d)/N_d \to 1$, as $d \to 0$.

If, in addition, Assumption (A) holds, then we have

iii. The fixed-width confidence interval $[\hat\mu_{\hat N_d} - d, \hat\mu_{\hat N_d} + d]$ has asymptotic coverage $1-\alpha$; that is, $P(\mu \in [\hat\mu_{\hat N_d} - d, \hat\mu_{\hat N_d} + d]) \to 1-\alpha$, as $d \to 0$.
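The coverage statement of Theorem 2.1 (iii) can be checked empirically by Monte Carlo simulation. The sketch below uses Gaussian data and illustrative parameter values (all assumptions of this example, not of the theorem) and reports the fraction of runs in which the fixed-width interval covers the true mean.

```python
import math
import random
from statistics import NormalDist, fmean, variance

def two_stage_sample(draw, d, alpha, m=10, gamma=0.5):
    # Stage I with a three-observations pilot guess f, then Stage II.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    f = math.sqrt(variance(draw(3)))
    n0 = max(m, math.ceil((z * f / d) ** (2 / (1 + gamma))))
    first = draw(n0)
    n_hat = max(n0, math.ceil(z ** 2 * variance(first) / d ** 2))
    return first + draw(n_hat - n0)

random.seed(7)
mu, d, alpha = 3.0, 0.4, 0.05
draw = lambda k: [random.gauss(mu, 1.5) for _ in range(k)]
reps = 500
hits = sum(abs(fmean(two_stage_sample(draw, d, alpha)) - mu) <= d
           for _ in range(reps))
print(hits / reps)   # empirical coverage; should be close to 1 - alpha
```

Because $d$ is not very small in this toy setting, the empirical coverage may deviate slightly from $1-\alpha$, in line with the asymptotic nature of the theorem.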
Remark 2.1.

It is worth mentioning that the proof of Theorem 2.1 (i)–(ii) shows the following stronger properties: $\hat N_d$ is consistent for $N_d$ if and only if $\hat\sigma_{N_0}^2$ is consistent for $\sigma^2$, and $\hat N_d$ is asymptotically unbiased for $N_d$ if and only if $\hat\sigma_{N_0}^2$ is asymptotically unbiased for $\sigma^2$.
Let us now discuss the second-order properties of the fully nonparametric procedure. In the literature, second-order efficiency for the problem at hand has so far been studied for parametric (Gaussian) models (see Mukhopadhyay and Duggan 1999), where the distribution is known: the sample variance follows a (scaled) chi-square law, which converges to the normal law as $d \to 0$. To achieve second-order efficiency, the probability $P(\hat N_d = N_0)$ that the sample size is not increased in the second stage needs to decay at rate $O(1/N_0)$ in terms of the first-stage sample size $N_0$. In a parametric setting that probability can be handled and estimated by means of appropriate Taylor expansions using properties of the known distribution function.

In a fully nonparametric framework, the exact distribution is unknown to us, and estimating the probability under the limiting law is not sufficient, because we have to take into account the error of approximation. But due to the Berry-Esseen bound, that error is of the order $O(1/\sqrt{N_0})$. Therefore, the following result proceeds in a different way than the proofs for parametric settings and bounds the probability $P(\hat N_d = N_0)$ using nonparametric techniques.
Theorem 2.2.
Assume that Assumption (E) holds and $X_1, X_2, \ldots$ are i.i.d. with $E(X_1^4) < \infty$. Then the two-stage procedure given by Equations (2.4) and (2.5) is second-order efficient; that is,

$E(\hat N_d) - N_d = O(1)$,

as $d \to 0$.
2.2. Fixed-Width Interval under High-Confidence Asymptotics
Theorem 2.1 establishes the validity of the proposed sampling strategy for small accuracy $d$; that is, in a high-accuracy framework: The (asymptotic) first-order properties hold if $d$ tends toward zero. To some extent, this is counterintuitive, because we aim at constructing a fixed-width confidence interval, and, for applications, we are then essentially limited to confidence statements when $d$ is small.

In some applications, however, $d$ may be not small (enough), but one aims to ensure the confidence statement that the interval covers the true parameter with high confidence. This suggests considering the case in which $d$ is fixed and the confidence level $1-\alpha$ tends toward 1 (or equivalently $\alpha \to 0$). That type of asymptotics may be of particular importance in fields such as statistical genetics or brain research, where it is common to use very small significance levels $\alpha$.
Recalling formula (2.3) for the asymptotically optimal (unknown) sample size $N_d$ and noticing that, for fixed $d$, $N_d \to \infty$ holds if $\alpha \to 0$, again justifying the application of the CLT, the question arises as to whether consistency and efficiency can be established under this different asymptotic regime.
To begin, let us notice that the notions of consistency, asymptotic first- and second-order efficiency, and asymptotic coverage under high-confidence asymptotics can be defined analogously as under high-accuracy asymptotics by replacing the limit $d \to 0$ by $\alpha \to 0$.
The following theorem asserts that the proposed methodology is valid without any modification under high-confidence asymptotics, although the proof differs.
Theorem 2.3.
Assume that (E) holds. Then

i. $\hat N_d$ is consistent for $N_d$; that is, $\hat N_d/N_d \to 1$ in probability, as $\alpha \to 0$.

ii. If Assumptions (E) and (A) hold, then $[\hat\mu_{\hat N_d} - d, \hat\mu_{\hat N_d} + d]$ has asymptotic coverage $1-\alpha$; that is, $P(\mu \in [\hat\mu_{\hat N_d} - d, \hat\mu_{\hat N_d} + d]) = 1 - \alpha + o(1)$, as $\alpha \to 0$.
Remark 2.2.
The assertions of Theorem 2.3 also hold true if the first-stage sample size $N_0$ is defined as

$N_0 = \max\{m, \lceil (z_{1-\alpha/2}\, f / d)^{2/(1+\gamma)} \rceil\}$

for some $0 < \gamma < 1$ and an arbitrary given constant $f > 0$, as proposed in Mukhopadhyay (1980; with $f = 1$).
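A small computation illustrates how the optimal sample size (2.3) behaves under high-confidence asymptotics. With $d$ and $\sigma$ held fixed (the values below are hypothetical), $z_{1-\alpha/2}^2$ grows like $2\log(1/\alpha)$ for small $\alpha$, so $N_d$ grows only logarithmically as the error probability decreases:

```python
import math
from statistics import NormalDist

sigma, d = 2.0, 0.5          # accuracy and scale held fixed (assumed values)
sizes = []
for alpha in (1e-2, 1e-4, 1e-6, 1e-8):
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n_opt = math.ceil(z ** 2 * sigma ** 2 / d ** 2)   # N_d from (2.3)
    ratio = z ** 2 / (2 * math.log(1 / alpha))        # slowly approaches 1
    sizes.append(n_opt)
    print(alpha, n_opt, round(ratio, 3))
```

In contrast, halving $d$ at fixed $\alpha$ quadruples $N_d$; extreme confidence is thus comparatively cheap relative to extreme accuracy.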
The question arises as to whether the procedure exhibits second-order efficiency under the high-confidence regime as well. The answer is positive.
Theorem 2.4.
Assume that $X_1, X_2, \ldots$ are i.i.d. with $E(X_1^4) < \infty$. Then the two-stage procedure given by Equations (2.4) and (2.5) is second-order efficient. Precisely, we have

$E(\hat N_d) - N_d = O(1)$,

as $\alpha \to 0$.
2.3. Proofs
Proof of Theorem 2.1.
First, notice that, by definition of $\hat N_d$,

$z_{1-\alpha/2}^2\, \hat\sigma_{N_0}^2 / d^2 \le \hat N_d \le \max\{N_0,\ z_{1-\alpha/2}^2\, \hat\sigma_{N_0}^2 / d^2 + 1\}$.   (2.6)

It is easy to see that $\max(a, z) \le a + z$ for all nonnegative real $z$ and any positive constant $a$. Therefore, we have

$\hat N_d \le N_0 + z_{1-\alpha/2}^2\, \hat\sigma_{N_0}^2/d^2 + 1$.

Plugging in the definition of $N_0$, we further obtain

$\hat N_d \le m + (z_{1-\alpha/2}\, f/d)^{2/(1+\gamma)} + z_{1-\alpha/2}^2\, \hat\sigma_{N_0}^2/d^2 + 2$.

Combining the last estimate with Equation (2.6), we arrive at

$z_{1-\alpha/2}^2\, \hat\sigma_{N_0}^2/d^2 \le \hat N_d \le m + (z_{1-\alpha/2}\, f/d)^{2/(1+\gamma)} + z_{1-\alpha/2}^2\, \hat\sigma_{N_0}^2/d^2 + 2$,

which implies, due to Equation (2.3),

$\frac{\hat\sigma_{N_0}^2}{\sigma^2} \le \frac{\hat N_d}{N_d} \le \frac{\hat\sigma_{N_0}^2}{\sigma^2} + \frac{m + (z_{1-\alpha/2}\, f/d)^{2/(1+\gamma)} + 2}{N_d}$.   (2.7)

We are led to

$\left|\frac{\hat N_d}{N_d} - 1\right| \le \left|\frac{\hat\sigma_{N_0}^2}{\sigma^2} - 1\right| + \frac{m + (z_{1-\alpha/2}\, f/d)^{2/(1+\gamma)} + 2}{N_d}$.   (2.8)

Recalling that $N_0 \to \infty$ as $d \to 0$, Assumption (E) implies $\hat\sigma_{N_0}^2 \to \sigma^2$ in probability. Because $(z_{1-\alpha/2}\, f/d)^{2/(1+\gamma)}/N_d \to 0$, as $d \to 0$, since $2/(1+\gamma) < 2$, the first assertion follows: By Equation (2.8), $\hat N_d$ is consistent for $N_d$ if $\hat\sigma_{N_0}^2$ is consistent for $\sigma^2$, as $d \to 0$. Equation (2.7) also immediately yields

$\left|\frac{\hat\sigma_{N_0}^2}{\sigma^2} - 1\right| \le \left|\frac{\hat N_d}{N_d} - 1\right| + o(1)$,

which shows that "only if" holds as well. Next, taking expectations in Equation (2.7) implies the result on the asymptotic unbiasedness. It remains to show that the fixed-width confidence interval has asymptotic coverage $1-\alpha$. First, observe that

$P(\mu \in [\hat\mu_{\hat N_d} - d, \hat\mu_{\hat N_d} + d]) = P\left(-\sqrt{\hat N_d}\, d \le \sqrt{\hat N_d}\,(\hat\mu_{\hat N_d} - \mu) \le \sqrt{\hat N_d}\, d\right)$.

Define $H_d(x) = P(\sqrt{\hat N_d}\,(\hat\mu_{\hat N_d} - \mu) \le x)$, $x \in \mathbb{R}$. By Assumption (A) and Slutzky's lemma, we have

$H_d(x) \to \Phi(x/\sigma)$ for all $x$,

as $d \to 0$. Now, by definition of $H_d$ and $\hat N_d$, and linearity of the function evaluation $G \mapsto G(y) - G(x)$ (for fixed reals $x < y$) for a function $G$ defined on $\mathbb{R}$, the coverage probability is given by the difference of $H_d$ evaluated at $\pm\sqrt{\hat N_d}\, d$, and $\sqrt{\hat N_d}\, d \to z_{1-\alpha/2}\, \sigma$ in probability, by (i). Therefore, we obtain

$P(\mu \in [\hat\mu_{\hat N_d} - d, \hat\mu_{\hat N_d} + d]) \to \Phi(z_{1-\alpha/2}) - \Phi(-z_{1-\alpha/2}) = 1 - \alpha$,

as $d \to 0$, which completes the proof. □
Proof of Theorem 2.2.
First observe that $\hat N_d > N_0$ if and only if $\lceil z_{1-\alpha/2}^2\, \hat\sigma_{N_0}^2/d^2 \rceil > N_0$, so that we can work with $z_{1-\alpha/2}^2\, \hat\sigma_{N_0}^2/d^2$ on that event. We use the refined basic inequality

$z_{1-\alpha/2}^2\, \hat\sigma_{N_0}^2/d^2 \le \hat N_d \le z_{1-\alpha/2}^2\, \hat\sigma_{N_0}^2/d^2 + 1 + N_0\, 1(\hat N_d = N_0)$,

which implies, in view of Assumption (E),

$E(\hat N_d) - N_d \le \frac{z_{1-\alpha/2}^2}{d^2}\left(E(\hat\sigma_{N_0}^2) - \sigma^2\right) + 1 + N_0\, P(\hat N_d = N_0)$.

This means that the result follows if we show that

$N_0\, P(\hat N_d = N_0) = O(1)$.

We may assume that $N_0$ is large enough to ensure that $N_0\, d^2/z_{1-\alpha/2}^2 \le \sigma^2/2$. Because, by definition of $\hat N_d$, $\{\hat N_d = N_0\} \subseteq \{z_{1-\alpha/2}^2\, \hat\sigma_{N_0}^2/d^2 \le N_0\}$, and using the elementary estimates

$P(\hat\sigma_{N_0}^2 \le N_0\, d^2/z_{1-\alpha/2}^2) \le P(\hat\sigma_{N_0}^2 - \sigma^2 \le -\sigma^2/2) \le P(|\hat\sigma_{N_0}^2 - \sigma^2| \ge \sigma^2/2)$

if $N_0\, d^2/z_{1-\alpha/2}^2 \le \sigma^2/2$, we obtain

$P(\hat N_d = N_0) \le P(|\hat\sigma_{N_0}^2 - \sigma^2| \ge \sigma^2/2)$.

Because $N_0\, d^2/z_{1-\alpha/2}^2 \to 0$ as $d \to 0$, this bound holds for all small enough $d$. But if $E(X_1^4) < \infty$, then

$P(|\hat\sigma_{N_0}^2 - \sigma^2| \ge \sigma^2/2) = O(1/N_0)$;

see, for example, lemmas A.1 and A.2 in Steland and Chang (2019). Hence, the assertion follows. □
Let us now prove the corresponding results under the high-confidence asymptotic framework.
Proof of Theorem 2.3. Repeating the purely algebraic calculations from above, we again obtain Equation (2.8) for any $d$:

$\left|\frac{\hat N_d}{N_d} - 1\right| \le \left|\frac{\hat\sigma_{N_0}^2}{\sigma^2} - 1\right| + \frac{m + 2}{N_d} + \frac{(z_{1-\alpha/2}\, f/d)^{2/(1+\gamma)}}{N_d}$.

Clearly, the second and third terms are $o(1)$ if $d$ is fixed and $\alpha \to 0$: the denominator $N_d$ grows like $z_{1-\alpha/2}^2$, whereas the numerator of the third term grows only like $z_{1-\alpha/2}^{2/(1+\gamma)}$ with $2/(1+\gamma) < 2$. Further, noting that $N_0 \to \infty$ if $d$ is fixed and $\alpha \to 0$, the first term is $o_P(1)$ if $d$ is fixed and $\alpha \to 0$. Taking expectations, similar arguments apply. Hence, (i) is shown. To establish (ii), first observe that if $N_a$, $a > 0$, is an arbitrary family of stopping times with $N_a/a \to 1$ in probability, then $\sqrt{N_a/a} \to 1$ in probability, as $a \to \infty$, and therefore by Slutzky's lemma and the RCLT,

$\sqrt{N_a}\,(\hat\mu_{N_a} - \mu)/\sigma \to N(0, 1)$ in distribution,

as $a \to \infty$. Now consider $\hat N_d$ as a family of stopping times parameterized by $a = N_d$. Obviously, $\hat N_d/N_d \to 1$ in probability, by (i). Then, using (i) and recalling Equation (2.3),

$\sqrt{\hat N_d}\, d/\sigma = z_{1-\alpha/2}\sqrt{\hat N_d/N_d}\,(1 + o(1))$,

as $\alpha \to 0$. Because $d$ is fixed, we have $z_{1-\alpha/2} \to \infty$. Consequently,

$\sqrt{\hat N_d}\, d/\sigma = z_{1-\alpha/2}\,(1 + o_P(1))$   (2.9)

as $\alpha \to 0$. Because

$P(\mu \in [\hat\mu_{\hat N_d} - d, \hat\mu_{\hat N_d} + d]) = P\left(\left|\sqrt{\hat N_d}\,(\hat\mu_{\hat N_d} - \mu)/\sigma\right| \le \sqrt{\hat N_d}\, d/\sigma\right)$,

it follows that the coverage probability can be expressed in terms of $H_a$, where $H_a$ denotes the distribution function of $\sqrt{N_a}\,(\hat\mu_{N_a} - \mu)/\sigma$. By Equation (2.9) and Polya's theorem, we obtain

$\sup_{x} |H_a(x) - \Phi(x)| \to 0$.

Now we can conclude that

$P(\mu \in [\hat\mu_{\hat N_d} - d, \hat\mu_{\hat N_d} + d]) = \Phi(z_{1-\alpha/2}) - \Phi(-z_{1-\alpha/2}) + o(1) = 1 - \alpha + o(1)$,

as $\alpha \to 0$, which completes the proof. □
Proof of Remark 2.2.
The proof of the consistency needs a minor modification. Plugging in the new definition of $N_0$, we obtain

$\hat N_d \le m + (z_{1-\alpha/2}\, f/d)^{2/(1+\gamma)} + z_{1-\alpha/2}^2\, \hat\sigma_{N_0}^2/d^2 + 2$.

Combining the last estimate with Equation (2.6), we arrive at

$z_{1-\alpha/2}^2\, \hat\sigma_{N_0}^2/d^2 \le \hat N_d \le m + (z_{1-\alpha/2}\, f/d)^{2/(1+\gamma)} + z_{1-\alpha/2}^2\, \hat\sigma_{N_0}^2/d^2 + 2$,

which implies, due to Equation (2.3), the analogues of Equations (2.7) and (2.8) with the constant $f$. Because $(z_{1-\alpha/2}\, f/d)^{2/(1+\gamma)} = o(N_d)$, the second term is $o(1)$ if $\alpha \to 0$. □
Proof of Theorem 2.4.
Observing that, for fixed $d$, $N_0 \to \infty$ if and only if $\alpha \to 0$, and that $N_0\, d^2/z_{1-\alpha/2}^2 \to 0$ as $\alpha \to 0$, the proof of Theorem 2.2 carries over and provides the bound $N_0\, P(\hat N_d = N_0) = O(1)$. □
3. APPLICATION TO A PROJECTION OF HIGH-DIMENSIONAL DATA
An interesting application is the construction of a fixed-width confidence interval for the mean projection of high-dimensional data. As an example, we may ask for how long we need to observe the asset returns of stocks associated with a portfolio given by a portfolio vector $w_n$ in order to set up a $(1-\alpha)$-confidence interval for the mean portfolio return with a precision $d$. Such uncertainty quantification of a mean projection also arises when projecting a data vector to reduce dimensionality, with calculating a projection $w_n'Y_i^{(n)}$ of a $p_n$-dimensional data vector $Y_i^{(n)}$
being a common approach to handle multivariate data when the dimension of the observed vectors is large. Widely used methods are principal component analysis, where one projects onto eigenvectors of the (estimated) covariance matrix of the data, sparse principal component analysis yielding sparse projection vectors, or LASSO (least absolute shrinkage and selection operator) regressions. For the latter approach, recall that a LASSO regression determines a sparse weighting vector, the regression coefficients, such that the associated projection of the regressors provides a good explanation of the response variable.
Of particular interest is the situation in which the dimension gets larger when the sample size increases, in order to mimic the case of large dimension when relying on asymptotics where the sample size increases. Then the weighting vectors depend on the sample size. We show that the general assumptions established in the previous section apply under mild uniform integrability conditions on the projected data.
3.1. Procedure
Suppose that we are given a series of random vectors of increasing dimension, such that at time instant $n$ we have at our disposal $n$ i.i.d. random vectors

$Y_1^{(n)}, \ldots, Y_n^{(n)} \sim (\mu_n, \Sigma_n)$

of dimension $p_n$, where $p_n \to \infty$ as $n \to \infty$ is allowed. Here the notation $Y \sim (\mu, \Sigma)$ means that the random vector $Y$ follows a distribution with mean vector $\mu$ and covariance matrix $\Sigma$. We are interested in an asymptotic fixed-width confidence interval for the projected mean vector $w_n'\mu_n$; that is, for a preassigned accuracy $d > 0$ and a given confidence level $1-\alpha \in (0,1)$, we aim at finding a sample size $N$ such that the confidence interval

$[T_N - d, T_N + d]$

has asymptotic confidence $1-\alpha$, as $d \to 0$ or $\alpha \to 0$. Here, for any (generic) sample size $n$, $T_n$ is an estimator of $w_n'\mu_n$. For simplicity, we consider the unbiased estimator $T_n = w_n'\bar Y_n$, where $\bar Y_n = n^{-1}\sum_{i=1}^n Y_i^{(n)}$.
The proposed two-stage procedure is as in the previous section: In the first stage, draw

$N_0 = \max\{m, \lceil (z_{1-\alpha/2}\, f/d)^{2/(1+\gamma)} \rceil\}$   (3.1)

observations, where $m$ is a given minimal sample size and $f$ is an initial estimate of the standard deviation of the projections using independent pilot data, such as the three-observations rule estimator. Next, we estimate the variance of the projections from the first-stage sample by $\hat\sigma_{N_0}^2$ and calculate the final sample size

$\hat N_d = \max\{N_0, \lceil z_{1-\alpha/2}^2\, \hat\sigma_{N_0}^2/d^2 \rceil\}$.   (3.2)
In practice, one considers a certain number of variables, so we assume that these variables are observable for the relevant sample sizes $N_0$ and $\hat N_d$, as well as that the projection vectors only have nonzero entries for those variables of interest. The mathematical framework allowing for an increasing dimension is used to justify the procedure when the number of variables (respectively, the dimension) is large compared to the sample sizes in use.
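The two stages (3.1) and (3.2) applied to projected data can be sketched as follows. The vector-drawing callback, the equal-weight projection vector, and the default constants (including the pilot guess `f`) are illustrative assumptions of this example.

```python
import math
import random
from statistics import NormalDist, fmean, variance

def two_stage_projection(draw_vec, w, d, alpha=0.05, m=10, gamma=0.5, f=1.0):
    """Fixed-width interval for the projected mean w'mu.

    draw_vec(k) returns k i.i.d. data vectors; f is a pilot guess of the
    standard deviation of the projections. Both are assumptions made for
    illustration."""
    project = lambda y: sum(wj * yj for wj, yj in zip(w, y))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n0 = max(m, math.ceil((z * f / d) ** (2 / (1 + gamma))))      # Stage I size
    proj = [project(y) for y in draw_vec(n0)]
    n_hat = max(n0, math.ceil(z ** 2 * variance(proj) / d ** 2))  # final size
    proj += [project(y) for y in draw_vec(n_hat - n0)]            # Stage II
    t = fmean(proj)
    return (t - d, t + d), n_hat

# Toy example: 50-dimensional vectors with an equal-weight projection vector.
random.seed(3)
p = 50
w = [1.0 / p] * p
draw_vec = lambda k: [[random.gauss(1.0, 1.0) for _ in range(p)] for _ in range(k)]
(lo, hi), n = two_stage_projection(draw_vec, w, d=0.1)
print(n, round(lo, 3), round(hi, 3))
```

Only the scalar projections enter the sample size rules, so the dimension $p$ affects the procedure solely through the variance of the projected observations.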
Let us assume that

$\liminf_{n\to\infty} w_n'\Sigma_n w_n > 0$,   (3.3)

and, for $\xi_{n1} = w_n'(Y_1^{(n)} - \mu_n)$,

$\{\xi_{n1}^2 : n \ge 1\}$ is uniformly integrable.   (3.4)

Assumption (3.3) is mild and rules out cases where the variance of the projection vanishes asymptotically. The uniform integrability required in Equation (3.4) is a crucial technical condition to ensure that the weak law of large numbers and the CLT apply. In many cases the condition can be formulated in terms of the coordinates $Y_{1j}^{(n)}$, $1 \le j \le p_n$. For that purpose, suppose additionally that the norms $\|w_n\|_{\ell_1}$ are bounded by a constant $C < \infty$. Then, using the simple fact that $(w_n'x)^2 \le \|w_n\|_{\ell_1} \sum_{j=1}^{p_n} |w_{nj}|\, x_j^2$ for $x \in \mathbb{R}^{p_n}$, condition (3.4) holds, if the coordinates satisfy

$\{(Y_{1j}^{(n)} - \mu_{nj})^2 : 1 \le j \le p_n,\ n \ge 1\}$ is uniformly integrable.   (3.5)

A simple sufficient condition for Equation (3.5) is the moment condition

$\sup_{n \ge 1} \max_{1 \le j \le p_n} E\left|Y_{1j}^{(n)} - \mu_{nj}\right|^{2+\delta} < \infty$   (3.6)

for some $\delta > 0$.
We need to verify Assumptions (A) and (E). The verification of the RCLT is somewhat involved, because the projection statistic of interest is a weighted sum with weights depending on the sample size. Denote $\sigma_n^2 = w_n'\Sigma_n w_n$.
Theorem 3.1.

Suppose that Equations (3.3) and (3.4) hold. If $\tau_a$, $a > 0$, is a family of integer-valued random variables with $\tau_a/a \to 1$ in probability, then the statistic $T_{\tau_a} = w_{\tau_a}'\bar Y_{\tau_a}$ satisfies the RCLT; that is,

$\sqrt{\tau_a}\,(T_{\tau_a} - w_{\tau_a}'\mu_{\tau_a})/\sigma_{\tau_a} \to N(0,1)$ in distribution,

as $a \to \infty$.
Define the estimator

$\hat\sigma_n^2 = \frac{1}{n}\sum_{i=1}^n \left(w_n'Y_i^{(n)} - w_n'\bar Y_n\right)^2$.   (3.7)

The following result verifies that Assumption (E) holds for this estimator under weak technical conditions.
Theorem 3.2.

Suppose that Equations (3.3) and (3.4) hold. Then

i. $\hat\sigma_n^2 - \sigma_n^2 \to 0$ in probability, as $n \to \infty$, and

ii. $E(\hat\sigma_n^2) - \sigma_n^2 \to 0$, as $n \to \infty$.
The above two theorems imply that the proposed two-stage procedure is asymptotically consistent as well as first-order and second-order efficient.
3.2. Proofs
3.2.1. Preliminaries on the RCLT. According to the general theoretical results of the previous section, it suffices to establish Assumption (A), the validity of the RCLT, and Assumption (E) for the statistic of interest; that is, the projected data. As a preparation, let us briefly review Anscombe's RCLT (see, e.g., Ghosh, Mukhopadhyay, and Sen 1997) and sufficient conditions. Consider a sequence $\xi_1, \xi_2, \ldots$ of i.i.d. random variables with mean $\mu$ and finite variance $\sigma^2 > 0$, and a family $\tau_a$, $a > 0$, of integer-valued random variables, often but not necessarily stopping times, with

$\tau_a/a \to c$ in probability, as $a \to \infty$,

for a finite constant $c > 0$. The RCLT asserts that

$\sqrt{\tau_a}\,(\bar\xi_{\tau_a} - \mu)/\sigma \to N(0,1)$ in distribution,

as $a \to \infty$, where $\bar\xi_n = n^{-1}\sum_{i=1}^n \xi_i$ for $n \ge 1$. This means that the sample size $n$ can be replaced by $\tau_a$. The basic idea why this holds is as follows (cf. Ghosh, Mukhopadhyay, and Sen 1997): The approximation $\tau_a \approx ca$ suggests that $S_{\tau_a} = \sum_{i=1}^{\tau_a}(\xi_i - \mu)$ should have the same limiting distribution as $S_{n_a}$, where $n_a = \lfloor ca \rfloor$. The CLT gives

$S_{n_a}/(\sigma\sqrt{n_a}) \to N(0,1)$,

as $a \to \infty$, in distribution. Now, for any $\varepsilon > 0$ and $\delta > 0$,

$P(|S_{\tau_a} - S_{n_a}| > \varepsilon\sigma\sqrt{n_a}) \le P(|\tau_a - n_a| > \delta n_a) + P\left(\max_{n : |n - n_a| \le \delta n_a} |S_n - S_{n_a}| > \varepsilon\sigma\sqrt{n_a}\right)$,   (3.8)

because in the event $\{|\tau_a - n_a| \le \delta n_a\}$ we have $|S_{\tau_a} - S_{n_a}| \le \max_{n : |n - n_a| \le \delta n_a} |S_n - S_{n_a}|$. The second term on the right-hand side of Equation (3.8) can be made arbitrarily small if the sequence $\{S_n\}$ is uniformly continuous in probability (u.c.i.p.). In general, a sequence $\{S_n\}$ is called u.c.i.p. if for any $\varepsilon > 0$ there exists some $\delta > 0$ so that for large enough $n$,

$P\left(\max_{0 \le k \le \delta n} |S_{n+k} - S_n| > \varepsilon\sqrt{n}\right) < \varepsilon$.

The u.c.i.p. property of $\{S_n\}$ can be shown using Kolmogorov's maximal inequality, because $S_n$ is a sum of centered i.i.d. random variables.
3.2.2. Proofs
Let us first show the sufficiency of Equation (3.6) for Equation (3.5). Observe that

$E\left|(Y_{1j}^{(n)} - \mu_{nj})^2\right|^{1+\delta/2} = E\left|Y_{1j}^{(n)} - \mu_{nj}\right|^{2+\delta}$,

which is uniformly bounded under Equation (3.6), and a family of random variables with uniformly bounded moments of order $1 + \delta/2 > 1$ is uniformly integrable.

As a preparation for the following proofs, observe that we may choose $c$ large enough to ensure that

$\sup_{n \ge 1} E\left[\xi_{n1}^2\, 1(\xi_{n1}^2 > c)\right] \le 1$   (3.9)

holds, where $\xi_{n1} = w_n'(Y_1^{(n)} - \mu_n)$. Therefore, the sequence of second moments of $\xi_{n1}$ is bounded. This implies

$\sup_{n \ge 1} E(\xi_{n1}^2) < \infty$,   (3.10)

$\sup_{n \ge 1} E|\xi_{n1}| < \infty$.   (3.11)
For technical reasons, we first show Theorem 3.2.
Proof of Theorem 3.2.
Observe that the random variables $\eta_{ni} = \xi_{ni}^2 = (w_n'(Y_i^{(n)} - \mu_n))^2$, $1 \le i \le n$, are independent and nonnegative with mean

$E(\eta_{n1}) = w_n'\Sigma_n w_n = \sigma_n^2$.

Hence, by the strong law of large numbers for arrays due to Gut (1992), under the uniform integrability condition (e) given there, we obtain

$\frac{1}{n}\sum_{i=1}^n \eta_{ni} - \sigma_n^2 \to 0$ in probability,   (3.12)

as $n \to \infty$. Similarly, by Equation (3.4), we also obtain the convergence of the sample moment

$\frac{1}{n}\sum_{i=1}^n \xi_{ni} \to 0$ in probability,

as $n \to \infty$. Further, $w_n'(\bar Y_n - \mu_n) = n^{-1}\sum_{i=1}^n \xi_{ni}$, such that

$\left(w_n'(\bar Y_n - \mu_n)\right)^2 \to 0$ in probability,   (3.13)

as $n \to \infty$, by Equation (3.10). Using the decomposition

$\hat\sigma_n^2 = \frac{1}{n}\sum_{i=1}^n \xi_{ni}^2 - \left(w_n'(\bar Y_n - \mu_n)\right)^2$

and rearranging terms, (3.12) and (3.13) imply

$\hat\sigma_n^2 - \sigma_n^2 \to 0$ in probability,

as $n \to \infty$, which verifies (i); taking expectations in the decomposition yields (ii). □
Proof of Theorem 3.1.
Observe that the standardized version of the estimator can be written as a row-wise sum of random variables $\xi_{ni}$ that are i.i.d. within each row $n$. Let us verify that condition Equation (3.4) provides the Lindeberg condition for this array. First observe that, because the $\xi_{ni}$ are row-wise identically distributed, the Lindeberg condition collapses to a condition on a single summand, which can be rewritten as a product of two factors. The first factor is $O(1)$ by assumption Equation (3.3). Therefore, it suffices to show that the second factor converges to 0. This factor can be decomposed into three terms. The first term converges to 0 by Equation (3.4), in view of Equation (3.11). Similarly, the second term converges to 0, again by Equation (3.4). Lastly, the third term converges to 0 by the Cauchy-Schwarz inequality together with Equations (3.10) and (3.11). Hence, by virtue of the CLT for row-wise i.i.d. arrays under the Lindeberg condition (see, e.g., Durrett 2019), the standardized sum converges in distribution to a standard normal distribution, and by Equation (3.3) we may replace the normalizing quantities by their limits, yielding the asymptotic normality of the estimator. To verify that the RCLT holds, we need to check the u.c.i.p. condition; see the discussion in the next section for more details. But because the Kolmogorov maximal inequality also applies to the rows of the row-wise i.i.d. array, the u.c.i.p. property follows as well. Therefore, the proof of the RCLT can be completed as in Ghosh, Mukhopadhyay, and Sen (1997, theorem 2.7.2). □
4. COMMON MEAN ESTIMATION
4.1. A Review of Common Mean Estimation
We study the construction of a fixed-width confidence interval for the common mean of two samples based on two-stage sequential estimation with equal sample sizes. For motivation consider the following example: A good is produced using two machines operating at the same speed but with possibly different accuracies. Both machines have to be used to satisfy demand and interest focuses on estimation of the common mean and providing a fixed-width confidence interval for it. The same situation arises when combining data from two laboratories where usually one laboratory has better equipment or expertise than the other lab and therefore provides the measurements with smaller uncertainty, such that the assumption of ordered variances is justified. In such settings, one should base inference on two random samples with equal sample sizes, and the goal is to determine the minimal number of observations required to estimate the common mean with preassigned accuracy.
Let us briefly review some related facts and results about common mean estimation. The literature focuses on parametric settings, mainly the case of two independent Gaussian samples. It is nevertheless worth recalling the estimators studied under normality. Indeed, our aim is to study a class of common mean estimators that covers many of the estimators proposed in the literature as special cases.
It is well known that for Gaussian samples the minimum variance unbiased estimator in the case of known variances is a convex combination of the sample means with weights depending on the unknown variances $\sigma_1^2$ and $\sigma_2^2$. Hence, it is natural to study weighted means of the sample means where the weights depend on the sample variances, as we shall do here, or equivalently on their unbiased versions, denoted by $V_1$ and $V_2$ throughout the article. The canonical approach is to estimate the optimal weights by substituting the unknown variances by their unbiased estimators. First proposed and studied by Graybill and Deal (1959), this leads to the Graybill-Deal (GD) estimator
$$\hat\mu_{GD} = \frac{\bar X_1 / V_1 + \bar X_2 / V_2}{1/V_1 + 1/V_2}$$
(for the equal sample sizes considered here), which does not assume an order constraint. Observe that the GD estimator is a convex combination of the sample means with random weights given by smooth functions of the unbiased variance estimators. When there is an order constraint on the variances, $\sigma_1^2 \le \sigma_2^2$, the GD estimator can be improved, and several proposals have been made and investigated; see the discussion in Steland and Chang (2019). For example, Nair (1982) showed that the GD estimator can be improved by using random convex weights whose definition (as smooth functions of the sample variances) depends on the ordering of the variance estimates: If the estimates violate the order constraint, Nair's proposal switches to a different weighting formula. This estimator stochastically dominates the GD estimator, as shown by Elfessi and Pal (1992). A further proposal by Elfessi and Pal (1992) is to switch to yet another formula instead. As shown in Chang, Oono, and Shinozaki (2012), such an estimator can be further improved in terms of stochastic dominance by further modifying the weights; see also Chang and Shinozaki (2015) for a study using a different criterion.
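As a concrete illustration, the GD estimator for two samples of equal size can be computed directly from the sample means and the unbiased variance estimators. This is a minimal sketch; the function name and the simulated data are our own choices, not from the article.

```python
import numpy as np

def graybill_deal(x1, x2):
    # GD estimate: convex combination of the sample means with random
    # weights proportional to the inverse unbiased variance estimators
    # (equal sample sizes assumed, so the sizes cancel in the weights).
    v1, v2 = x1.var(ddof=1), x2.var(ddof=1)
    w = (1.0 / v1) / (1.0 / v1 + 1.0 / v2)
    return w * x1.mean() + (1.0 - w) * x2.mean()

rng = np.random.default_rng(0)
x1 = rng.normal(10.0, 1.0, size=50)  # more precise source
x2 = rng.normal(10.0, 3.0, size=50)  # less precise source
est = graybill_deal(x1, x2)
print(est)  # close to the common mean 10, leaning on the precise sample
```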
The above findings motivate us to study the general class of common mean estimators given by
$$\hat\mu_N = \gamma_N \bar X_1 + (1 - \gamma_N) \bar X_2, \tag{4.1}$$
where the weight $\gamma_N$ is random and of the form
$$\gamma_N = w_1(V_1, V_2)\, \mathbf{1}(V_1 \le V_2) + w_2(V_1, V_2)\, \mathbf{1}(V_1 > V_2). \tag{4.2}$$
Here the functions $w_1$ and $w_2$ are three times continuously differentiable with bounded derivatives.
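A member of this class can be sketched as follows. The two weight functions below are illustrative stand-ins chosen by us (the GD weight when the estimates respect the order constraint, and a variant pulled toward 1/2 otherwise); they are not Nair's or Elfessi and Pal's actual formulas.

```python
import numpy as np

def weight_gamma(v1, v2):
    # Weight of the form w1(V1,V2)*1(V1 <= V2) + w2(V1,V2)*1(V1 > V2),
    # switching between two smooth functions of the variance estimates.
    w_gd = (1.0 / v1) / (1.0 / v1 + 1.0 / v2)
    if v1 <= v2:
        return w_gd              # order constraint respected: GD weight
    return 0.5 * (w_gd + 0.5)    # illustrative smooth alternative

def common_mean(x1, x2):
    g = weight_gamma(x1.var(ddof=1), x2.var(ddof=1))
    return g * x1.mean() + (1.0 - g) * x2.mean()

rng = np.random.default_rng(2)
x1 = rng.normal(10.0, 1.0, size=40)
x2 = rng.normal(10.0, 2.0, size=40)
est = common_mean(x1, x2)
```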
Although the existing literature suggests using such convex combinations with random weights due to their performance in terms of variance, squared error loss, or stochastic dominance, the established methodology for using these estimators for inference is rather limited and mainly confined to Gaussian samples. Even estimation of the variance of these estimators is nontrivial and not well studied.
In the sequel, we relax the restrictive condition of Gaussian measurements and consider estimators of the above class for independent samples following a distribution with finite 12th moment. Under this condition, Steland and Chang (2019) have shown that jackknife variance estimators are consistent and asymptotically normal for common mean estimators from the class Equation (4.1). For the special case of equal sample sizes as studied here, a simplified jackknifing scheme was proposed. In particular, it follows from Steland and Chang (2019) that the common mean estimator is asymptotically normal, with an asymptotic variance that can be consistently estimated by jackknifing. When combined with the validity of the RCLT established in the next section, that result allows the construction of a fixed-width confidence interval for the common mean using a two-stage sampling approach, according to the results of the previous section. Therefore, we establish the RCLT for the above class of common mean estimators.
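The delete-one jackknife idea behind such variance estimators can be sketched as follows. Deleting the jth observation from both samples at once exploits the equal sample sizes, but this simplified scheme and the helper `gd` are our own illustration, not necessarily the exact estimator of Steland and Chang (2019).

```python
import numpy as np

def gd(x1, x2):
    # Graybill-Deal common mean estimate (equal sample sizes)
    v1, v2 = x1.var(ddof=1), x2.var(ddof=1)
    w = (1.0 / v1) / (1.0 / v1 + 1.0 / v2)
    return w * x1.mean() + (1.0 - w) * x2.mean()

def jackknife_variance(estimator, x1, x2):
    # Delete-one jackknife: remove the jth observation from both samples
    # simultaneously and aggregate the leave-one-out replicates.
    n = len(x1)
    idx = np.arange(n)
    theta = np.array(
        [estimator(x1[idx != j], x2[idx != j]) for j in range(n)]
    )
    return (n - 1) / n * np.sum((theta - theta.mean()) ** 2)

rng = np.random.default_rng(0)
x1 = rng.normal(10.0, 1.0, size=50)
x2 = rng.normal(10.0, 3.0, size=50)
var_hat = jackknife_variance(gd, x1, x2)  # estimated variance of gd(x1, x2)
```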
4.2. The Random Central Limit Theorem for Common Mean Estimators
For the class of common mean estimators under investigation, the u.c.i.p. property cannot be shown by a simple application of Kolmogorov's inequality. The situation is more involved and requires special treatment, because the weights are random and depend on the sample size.
The following three main results on common mean estimators with random weights assert that the RCLT holds. It turns out that the RCLT is directly related to the tightness of the weighting sequence. This is formulated in the first theorem: For arbitrary random weights $\gamma_N$ that are bounded and u.c.i.p., the RCLT holds.
Theorem 4.1.
Let $X_{ij}$, $1 \le j \le N$, $i = 1, 2$, be i.i.d. samples of random variables with finite second moment. Suppose that $\gamma_N$, $N \ge 1$, is a sequence of bounded random weights satisfying the u.c.i.p. condition. Let $\tau_a$, $a > 0$, be a sequence of integer-valued random variables with $\tau_a / a \to c$ in probability for a finite constant $c > 0$, as $a \to \infty$. If the common mean estimator satisfies the CLT, then it also satisfies the RCLT; that is, its standardized version evaluated at $\tau_a$ is asymptotically standard normal as $a \to \infty$.
The next result shows that the weights $\gamma_N$ considered by the class of common mean estimators Equation (4.1) satisfy the u.c.i.p. property.
Theorem 4.2.
The weights $\gamma_N$ in Equation (4.2) of the common mean estimator have the u.c.i.p. property under the smoothness assumptions imposed on the weight functions.
Consequently, we arrive at the following RCLT for the class of common mean estimators of interest, which verifies Assumption (A) for them.
Theorem 4.3.
Let $X_{ij}$, $1 \le j \le N$, $i = 1, 2$, be i.i.d. samples of random variables with finite 12th moment. Let $\tau_a$, $a > 0$, be a family of integer-valued random variables with $\tau_a / a \to c$ in probability for a finite constant $c > 0$, as $a \to \infty$. Then the common mean estimator evaluated at $\tau_a$ satisfies the RCLT; that is, its standardized version is asymptotically standard normal as $a \to \infty$, such that Assumption (A) holds true.
4.3. Proofs
Proof of Theorem 4.1.
We have the representation of the common mean estimator as a convex combination of the two sample means with weight $\gamma_N$. We may assume $\mu = 0$ and that the weights are bounded by 1. The increments of the sequence of standardized versions of the estimator can then be decomposed into four terms. The first and the third term can be made arbitrarily small in probability by the boundedness of the weights together with the u.c.i.p. property of the standardized sample means. The remaining two terms require the u.c.i.p. property of the weights and are treated as follows: Because the standardized sample means are bounded in probability, we may find a constant bounding them with high probability and then choose $\delta > 0$ so small that the u.c.i.p. property of the weights renders the second term arbitrarily small in probability. The remaining fourth term is treated analogously. It follows that the sequence of standardized estimates is u.c.i.p. Because the CLT holds by assumption, the RCLT follows; cf. Equation (3.8). □
Proof of Theorem 4.2.
In what follows we need to make the sample sizes explicit in the notation for the estimators and therefore write $\bar X_i^{(N)}$ and $V_i^{(N)}$ for $i = 1, 2$. Of course, the standardized sample means are u.c.i.p., and similarly the variance estimators are u.c.i.p. Next, observe that the increments of the weights can be controlled as follows: We may find a constant bounding the relevant quantities with high probability and then choose $\delta > 0$ such that the increments of both variance estimators are small with high probability; this verifies the u.c.i.p. property for the pair of variance estimators. For a function $f$ of several variables, write $\partial_j f$ for the partial derivative with respect to the $j$th argument. On the event that the variance estimates respect the order constraint, a Lipschitz constant of $\gamma$ with respect to the $j$th variable is given by a bound on the corresponding partial derivative of the first weight function, and on the complementary event by the corresponding bound for the second weight function. Therefore, a Lipschitz constant valid on the union of both events is given by the maximum of these bounds, and we obtain a Lipschitz property of the weights in terms of the variance estimators. Consequently, the increments of $\gamma_N$ are bounded by a constant times the increments of $V_1^{(N)}$ and $V_2^{(N)}$, and from this fact the u.c.i.p. property follows easily, because we may find $\delta > 0$ such that each term is smaller than any prescribed $\varepsilon > 0$ in probability. □
Proof of Theorem 4.3.
The CLT for the common mean estimator has been shown under the stated assumptions in Steland and Chang (2019). Therefore, the result follows from Theorems 4.1 and 4.2. □
5. SIMULATIONS AND DATA EXAMPLE
5.1. One-Sample Setting
The aim of the simulations is to investigate the accuracy of the proposed two-stage procedure when applied to nonnormal data. The accuracy of related fixed-width confidence intervals for a class of jump-preserving estimators has been examined in Steland (2017). Here, we are especially interested in investigating whether the proposed high-confidence asymptotic framework works.
For simplicity, we used the arithmetic mean to estimate the mean $\mu$. Independent and identically distributed data were simulated with $\mu = 10$ and zero mean error terms distributed according to a standard normal distribution (model 1), a t(5)-distribution (model 2) showing heavier tails, and a short-tailed distribution (model 3). The confidence level was chosen as 90% and 99%. Further, several values of the accuracy parameter d were examined. The first-stage sample size was calculated using the three-observations rule and a minimal sample size equal to 15 or 30. To calculate the final sample size, the asymptotic variance of the estimator was estimated from the first-stage sample.
Table 1 shows the results. It can be seen that the accuracy of the procedure is very good, both in terms of the coverage probability and in terms of over- or undershooting. The coverage probability is only slightly smaller than the nominal value. This effect is more pronounced for heavier tails, whereas shorter tails somewhat compensate for it. The over- or undershooting is quite moderate, even for a minimal sample size of 15. When the minimal sample size is 30 instead of 15, the over- or undershooting is further reduced. Fixing a large value for d and comparing the results for the confidence levels 90% and 99%, one can observe that the high-confidence asymptotics works very well for the cases under investigation: A high confidence level ensures convincing coverage even for large, fixed d.
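The simulated procedure can be sketched as follows. The final sample size $\hat N = \max(N_0, \lceil z_{1-\alpha/2}^2 \hat\sigma^2 / d^2 \rceil)$ is the standard Stein-type plug-in rule consistent with the description; the fixed first-stage size and the concrete values of d and α below are illustrative assumptions, and the three-observations rule for choosing $N_0$ is not reproduced here.

```python
import numpy as np
from statistics import NormalDist

def two_stage_ci(draw, d, alpha, n0):
    # Stage 1: estimate the variance from n0 observations.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    x = draw(n0)
    sigma2 = x.var(ddof=1)
    # Final sample size: first-stage size or the plug-in requirement,
    # whichever is larger.
    n_final = max(n0, int(np.ceil(z ** 2 * sigma2 / d ** 2)))
    # Stage 2: draw the remaining observations and report xbar +/- d.
    if n_final > n0:
        x = np.concatenate([x, draw(n_final - n0)])
    return x.mean() - d, x.mean() + d, n_final

# Monte Carlo coverage check for t(5) errors (cf. model 2), mu = 10
rng = np.random.default_rng(7)
mu, reps = 10.0, 2000
hits = sum(
    lo <= mu <= hi
    for lo, hi, _ in (
        two_stage_ci(lambda m: mu + rng.standard_t(5, size=m),
                     d=0.25, alpha=0.01, n0=15)
        for _ in range(reps)
    )
)
coverage = hits / reps  # close to, typically slightly below, nominal 0.99
```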
Table 1. One-sample setting: Fixed-width interval for the mean: Simulated coverage probabilities (p), expected sample sizes, and over- or undershooting, for the three-observations first-stage rule and minimal sample sizes of 15 and 30.
5.2. Common Mean Estimation
To investigate the statistical accuracy of the proposed two-stage procedure to construct a fixed-width confidence interval for the common mean, a simulation study was conducted. Data for the $i$th sample were simulated with a common mean and with noise terms drawn from a baseline distribution (model 1), a heavier-tailed distribution (model 2) to study the effect of heavier tails, and a short-tailed distribution (model 3). Several confidence levels were examined, and the precision parameter d was chosen from values including 0.05.
We used at least 10 observations for each sample; that is, 20 in total. From the first-stage sample of size $N_0$, obtained using the three-observations rule, the asymptotic variance of the GD common mean estimator was estimated using the canonical estimator with the GD weights, in order to calculate the final sample size. The results are provided in Table 2. One can observe that the approach works well, although in general the true coverage probabilities are somewhat lower than nominal.
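Combining the GD estimator with a plug-in variance estimate yields the following sketch of the two-stage interval for the common mean. The plug-in form $\hat\gamma^2 V_1 + (1 - \hat\gamma)^2 V_2$ for the per-pair asymptotic variance and all numerical settings are illustrative assumptions consistent with, but not taken verbatim from, the description above.

```python
import numpy as np
from statistics import NormalDist

def gd_with_weight(x1, x2):
    # Graybill-Deal estimate and its random weight (equal sample sizes)
    v1, v2 = x1.var(ddof=1), x2.var(ddof=1)
    w = (1.0 / v1) / (1.0 / v1 + 1.0 / v2)
    return w * x1.mean() + (1.0 - w) * x2.mean(), w

def two_stage_common_mean_ci(draw1, draw2, d, alpha, n0):
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    x1, x2 = draw1(n0), draw2(n0)
    _, w = gd_with_weight(x1, x2)
    # Plug-in estimate of the per-pair asymptotic variance of the
    # GD estimator (an assumed canonical form).
    v = w ** 2 * x1.var(ddof=1) + (1.0 - w) ** 2 * x2.var(ddof=1)
    n = max(n0, int(np.ceil(z ** 2 * v / d ** 2)))
    if n > n0:  # second stage: extend both samples to the final size
        x1 = np.concatenate([x1, draw1(n - n0)])
        x2 = np.concatenate([x2, draw2(n - n0)])
    mu_hat, _ = gd_with_weight(x1, x2)
    return mu_hat - d, mu_hat + d, n

rng = np.random.default_rng(3)
lo, hi, n = two_stage_common_mean_ci(
    lambda m: rng.normal(10.0, 1.0, m),  # machine with smaller variance
    lambda m: rng.normal(10.0, 2.0, m),  # machine with larger variance
    d=0.1, alpha=0.05, n0=10)
```

By construction, the returned interval has fixed width 2d regardless of the realized final sample size.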
5.3. Data Example
In chip manufacturing, the width of the cut-out chips is a critical quantity, and the machines, which operate with different accuracies, have to be calibrated well in order to meet the specifications. Consequently, quality samples taken from different machines have a common mean but different variances. For quality control purposes, it is of great interest to be in a position to obtain fixed-width confidence intervals in order to report the chip’s width at a specified uncertainty level.
To design, analyze, and assess a fixed-width confidence interval, we analyze data from two cutting machines at our disposal; we refer to Steland and Chang (2019) for more information on that data set. These measurements are nonnormal, as confirmed by the Shapiro-Wilk test, and the sample autocorrelation functions are in good agreement with the i.i.d. assumption. For the problem at hand, a precision of 0.01 (i.e., $d = 0.01$) for the mean chip width was selected at a high confidence level. The precision parameter d is roughly equal to one eighth of the measurement's standard deviation.
When calculating the classical first-stage sample size, one gets an inappropriate value, because the standard deviation is very low but not taken into account by this formula. Instead, the proposed three-observations rule leads to a first-stage sample size of 23, and using the first 23 observations from both machines yields the final sample size. The resulting fixed-width confidence interval is then computed with the GD common mean estimator. If the three-observations rule is not feasible because there is absolutely no exploitable information, one may use the first-stage rule discussed in Remark 2.2.
DISCLOSURE
The authors have no conflicts of interest to report.
Table 2. Fixed-width interval for the common mean: Simulated coverage probabilities (p), expected sample sizes, and oversampling, o, for the three-observations first-stage rule and a minimal sample size of 10 per sample.
ACKNOWLEDGMENTS
A large part of the article was prepared during visits of A. Steland at Mejiro University, Tokyo. Both authors thank Nitis Mukhopadhyay for discussion, especially on two-stage procedures, at the International Symposium on Statistical Theory and Methodology for Large Complex Data 2018 held at Tsukuba University and when he visited the Institute of Statistics at RWTH Aachen University. The authors gratefully acknowledge the support of Takenori Takahashi, Mejiro University and Keio University Graduate School, and Akira Ogawa, Mejiro University, for providing the chip manufacturing data.
REFERENCES
- Aoshima, M., and K. Yata. 2011. “Two-Stage Procedures for High-Dimensional Data.” Sequential Analysis 30 (4):356–99.
- Chang, Y. T., Y. Oono, and N. Shinozaki. 2012. “Improved Estimators for the Common Mean and Ordered Means of Two Normal Distributions with Ordered Variances.” Journal of Statistical Planning and Inference 142 (9):2619–28.
- Chang, Y. T., and N. Shinozaki. 2015. “Estimation of Two Ordered Normal Means under Modified Pitman Nearness Criterion.” Annals of the Institute of Statistical Mathematics 67 (5):863–83.
- Chow, Y. S., and H. Robbins. 1965. “On the Asymptotic Theory of Fixed-Width Sequential Confidence Intervals for the Mean.” The Annals of Mathematical Statistics 36 (2):457–62.
- Durrett, R. 2019. "Probability: Theory and Examples." Cambridge Series in Statistical and Probabilistic Mathematics, vol 49. Cambridge, UK: Cambridge University Press.
- Elfessi, A., and N. Pal. 1992. "A Note on the Common Mean of Two Normal Populations with Order Restrictions in Location-Scale Families." Communications in Statistics - Theory and Methods 21 (11):3177–84.
- Ghosh, M., N. Mukhopadhyay, and P. K. Sen. 1997. “Sequential Estimation.” Wiley Series in Probability and Statistics: Probability and Statistics, New York: John Wiley & Sons.
- Graybill, F. A., and R. B. Deal. 1959. “Combining Unbiased Estimators.” Biometrics 15 (4):543–50.
- Gut, A. 1992. “The Weak Law of Large Numbers for Arrays.” Statistics and Probability Letters 14 (1):49–52.
- Mukhopadhyay, N. 1980. “A Consistent and Asymptotically Efficient Two-Stage Procedure to Construct Fixed Width Confidence Intervals for the Mean.” Metrika 27 (1):281–4.
- Mukhopadhyay, N., and W. Duggan. 1999. “On a Two-Stage Procedure Having Second-Order Properties with Applications.” Annals of the Institute of Statistical Mathematics 51 (4):621–36.
- Mukhopadhyay, N., and B. M. Silva. 2009. Sequential Methods and Their Applications. Boca Raton, FL: Taylor and Francis.
- Nair, K. 1982. “An Estimator of the Common Mean of Two Normal Populations.” Journal of Statistical Planning and Inference 6 (2):119–22.
- Shao, J. 1993. “Differentiability of Statistical Functionals and Consistency of the Jackknife.” The Annals of Statistics 21 (1):61–75.
- Shao, J., and C. F. Wu. 1989. “A General Theory for Jackknife Variance Estimation.” The Annals of Statistics 17 (3):1176–97.
- Stein, C. 1945. “A Two-Sample Test for a Linear Hypothesis Whose Power is Independent of the Variance.” The Annals of Mathematical Statistics 16 (3):243–58.
- Steland, A. 2015. “Vertically Weighted Averages in Hilbert Spaces and Applications to Imaging: Fixed Sample Asymptotics and Efficient Sequential Two-Stage Estimation.” Sequential Analysis 34 (3):295–323.
- Steland, A. 2017. “On the Accuracy of Fixed Sample and Fixed Width Confidence Intervals Based on the Vertically Weighted Average.” Journal of Statistical Theory and Practice 11 (3):375–92.
- Steland, A., and Y. Chang. 2019. “Jackknife Variance Estimation for General Two-Sample Statistics and Applications to Common Mean Estimators under Ordered Variances.” Japanese Journal of Statistics and Data Science 2 (1):173–217.