Abstract
Small area estimation techniques have received considerable attention in recent decades due to their important applications in survey studies. In this article, mixed linear models and reduced rank regression analysis are jointly used for small area estimation. Estimators of the parameters are presented, as well as predictors of the random effects and of unobserved area measurements.
1. Introduction
Small area estimation is an active research area in statistics with many applications. Today there exist several approaches to deal with small area problems. For recent contributions to the topic, see Chandra et al. (2018) for discrete data and Baldermann (2018) for continuous data. In both articles many works of other authors are presented and discussed. For those who want a solid introduction to the subject, the book by Rao and Molina (2015) can be recommended. A common basis for small area estimation problems is that a survey study (finite population case) has been conducted and, based on the survey, one intends to extract information concerning small domains by adding specific local information which was not accounted for in the comprehensive survey.
In this article a case is considered where the units of a survey sample are investigated several times, leading to dependent observations. It is assumed that observed background information exists, which is implemented via a linear model, giving a model which includes variance components. Moreover, it is assumed that latent processes exist, which often show up when different types of systems are studied, for example when weather impact is studied. In such a case a huge number of variables is often studied but we do not know the effects of the individual variables; for example, we cannot directly relate temperature to plant growth. This leads us to exploit reduced rank regression models. Thus, the model which will be discussed in this article includes time-dependent fixed effects, i.e. structured mean responses, random parameters and rank restrictions on parameters, which to our knowledge is a new type of model. In Section 2 more motivating details are given, including basic references, and in Section 3 the assumptions are summarized together with all technical details. Briefly, in Section 4 the estimation of all unknown parameters is considered and finally two prediction results are presented in Section 5.
2. Background
In this section, via an example, we will introduce our way of thinking about extracting information from a survey sample. In particular, a number of different sources of uncertainty are presented which will then be included in a statistical model. It turns out that the model will comprise three different types of uncertainty. Section 3 includes all details of the model and in the subsequent section likelihood-based estimators are derived. In the last section these estimators are utilized when finding predictors.
Example 2.1.
Let there be a national survey which collects data about some specific production from a certain type of company. The survey is followed up once per year and is repeated for, say, five years.
Altogether there are 20 regions and each region consists of 10 subregions. The survey sample (units) consists of four companies from each subregion. Thus, in the survey, in total 800 companies are followed for five years. With the models presented in this article the aim is to obtain information about the companies in the subregions. Note that in reality the companies should be classified into strata and then at least one company should be sampled from each stratum and each subregion. To simplify the presentation we will only have one stratum. However, in each subregion we assume that there is a lot of background information which we would like to take into account when predicting production in subregions or predicting production of unobserved units.
Small area models, also called small domain models, can be classified into two categories. One category is the area-level models category (Fay and Herriot 1979) and the other is the unit-level models category (Battese et al. 1988). Both model categories are discussed in detail in the well-written book by Rao and Molina (2015). In this article we are thinking of area levels, but if unit-level information were available the proposed model could also be used.
Example 2.1 indicates the following use of notation. Let
be the direct estimates from the survey for the jth subarea within the ith area. The estimates are based on the survey design where units have been sampled on the subarea level. Note that we will not discuss the underlying survey in this article, only use the outcome of the survey, i.e. the direct estimators of the subareas.
Statistics is partly about quantifying uncertainty. Often this is carried out by assuming that an estimate is a representation of an estimator which follows some distribution. However, we also have cases where uncertainty is not described via distributions. Examples are rounding errors, specific types of truncations, identification of influential observations, outlier detection, model comparisons, the number of observations in a sample, etc. One choice of distribution for the estimator would be the one generated by the survey design, i.e. how many and under what model the units have been sampled would determine the distribution. However, for our purposes it is difficult to work with the probability space generated by the survey design. Instead, we now think of the direct estimates $\hat{\theta}_{ij}$ as misspecified quantities, since the survey sample distribution does not take into account those covariables we are going to use in the adjustment of $\hat{\theta}_{ij}$. The covariables can be used to correct for bias as well as to diminish the uncertainty in the estimate. For example, returning to our example, if the survey design does not take into account the size of companies, small companies can be overrepresented in the sample and then $\theta_{ij}$ is wrongly estimated.
In order to derive an applicable model, let us start with
$$y_{ij} = \theta_{ij} + e_{ij}, \qquad (2.1)$$
where for notational convenience $\hat{\theta}_{ij}$ has been replaced by $y_{ij}$. The joint distribution for $e = (e_{ij})$ is specified via
$$e \sim N_n(0, V), \qquad (2.2)$$
where $e$ is a row vector and $V$ is supposed to be known and determined by the survey design. Often $V$ is solely a function of the number of sampled units. Moreover, $e$ is defined on another probability space than the one generated by the survey design. However, the uncertainty in the direct estimates, due to the survey, is incorporated in the dispersion of $e$.
Furthermore, one basic assumption in this article is that some of the $\theta_{ij}$ will be the same, i.e. the survey estimators are the same for clusters of subregions. Such an assumption is often realistic but has to be checked in practice. Under this assumption, instead of (2.1), the model can be written in matrix notation as
$$y = \theta C + e, \qquad (2.3)$$
where $y = (y_{ij})$ and $\theta: 1 \times k$ consists of unknown parameters, and $C: k \times n$ is a usual design matrix describing the relations among the elements $\theta_{ij}$. Usually C consists of blocks of one-vectors but in principle it can be arbitrary as long as it fits together with the covariate structures which will be presented later. For example, with three areas and four subareas in each area, C equals
$$C = \begin{pmatrix} \mathbf{1}_4' & \mathbf{0}' & \mathbf{0}' \\ \mathbf{0}' & \mathbf{1}_4' & \mathbf{0}' \\ \mathbf{0}' & \mathbf{0}' & \mathbf{1}_4' \end{pmatrix},$$
where $\mathbf{1}_4$ denotes a vector of four ones.
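As an illustration (a sketch of ours, not code from the article), the block-of-ones structure of such a C can be generated with a Kronecker product; the dimensions (three areas with four subareas each) follow the example above:

```python
import numpy as np

# Sketch: C (k x n) linking n = 12 subareas to k = 3 area-level parameters.
# Each row is a block of four ones, one block per area, as in the example.
k, m = 3, 4                              # areas, subareas per area
C = np.kron(np.eye(k), np.ones((1, m)))  # 3 x 12 block design matrix
```

Each row sums to the number of subareas in the corresponding area, which is what makes C a design matrix of one-vector blocks.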
The error in Equation (2.3) is defined by Equation (2.2). The model means that a probability space is only connected to the modeling of $\hat{\theta}_{ij}$, i.e. the set of estimated $\theta_{ij}$, but we include in the model extra variation due to the survey design.
In the introductory example it was stated that the survey was repeated five times. When survey studies are repeated it will, for this article, be assumed that if there exists a linear trend it is of the same form for all subareas. In this case there will be a matrix Y where each row follows Equation (2.3), apart from an unknown scalar multiplier. Thus, Equation (2.3) implies
$$Y = ABC + E, \qquad (2.4)$$
where A is a design matrix which models the p repeated surveys (within-subarea design matrix), B is a matrix of unknown parameters and $E \sim N_{p,n}(0, \Sigma, V)$, i.e. E is matrix normally distributed with the separable dispersion structure $D[\mathrm{vec}\,E] = V \otimes \Sigma$, where ⊗ stands for the Kronecker product and $\Sigma$ is an unknown positive definite matrix. The parameter $\Sigma$ is used to model the dependency due to the repeated measurements, which exists because the observed units of the design have been repeatedly measured. No particular structure in $\Sigma$ is assumed, since that would only make sense if we had precise knowledge about the mean structure, which is not the case in this article. For example, it does not make sense to assume an autocorrelated structure if the mean has not been established. The model in Equation (2.4) is often called the Growth Curve model, GMANOVA and, recently, bilinear regression model (see von Rosen 2018, for basic results and references).
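To make the separable dispersion concrete, the following sketch (our illustration, with hypothetical Σ and V) draws a matrix normally distributed error E with Cov(vec E) = V ⊗ Σ by pre- and post-multiplying an i.i.d. standard normal matrix with square roots of Σ and V:

```python
import numpy as np

def matnorm_sample(Sigma, V, rng):
    """Draw E ~ N_{p,n}(0, Sigma, V), i.e. Cov(vec E) = V (x) Sigma,
    via E = Sigma^{1/2} X V^{1/2} with X having i.i.d. N(0,1) entries."""
    Ls = np.linalg.cholesky(Sigma)   # one choice of square root of Sigma
    Lv = np.linalg.cholesky(V)
    X = rng.standard_normal((Sigma.shape[0], V.shape[0]))
    return Ls @ X @ Lv.T

rng = np.random.default_rng(0)
p, n = 5, 12
Sigma = 0.7 * np.eye(p) + 0.3 * np.ones((p, p))  # hypothetical within-unit covariance
V = np.diag(np.linspace(0.5, 1.5, n))            # hypothetical known survey dispersion
E = matnorm_sample(Sigma, V, rng)
```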
As an example of A we can have
$$A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \\ 1 & 5 & 25 \end{pmatrix},$$
meaning that there is a second-order trend which is observed over five years.
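As a sketch (ours, not from the article), such a within-subarea design matrix for a second-order trend over p = 5 years has columns 1, t and t²:

```python
import numpy as np

# Sketch: A (p x q) with p = 5 yearly measurements and q = 3 trend parameters.
t = np.arange(1, 6)                          # years 1,...,5
A = np.vstack([np.ones_like(t), t, t**2]).T  # columns: intercept, linear, quadratic
```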
When expressing $\theta_{ij}$ as a linear function of an unknown parameter vector, as was done in Equation (2.3), we usually cannot say that the model is completely true for all subareas. Instead we believe that there is some random variation around a true imaginary model, i.e. we introduce a variance component. Thus, instead of Equation (2.4), the following model is set up:
$$Y = ABC + UZ + E, \qquad (2.5)$$
where $U \sim N_{p,l}(0, \Sigma_u, I_l)$ is independent of E, and Z is related to C because, in the model, usually via C, the observations are linked to the subareas and Z helps to describe the random variation (errors) in the subareas. Therefore, formally, $\mathcal{C}(Z') \subseteq \mathcal{C}(C')$, where $\mathcal{C}(\cdot)$ denotes the column vector space. Additionally, we will standardize these random effects by assuming $ZVZ' = I_l$, meaning that the model has been adjusted for a different number of subareas within an area, taking into account the uncertainty in the survey estimates. This has two important implications. One is that it simplifies the mathematical treatment of the model and the second is that it becomes easier to compare predicted random effects.
Now observed covariables will be introduced in the model. There exist different types of covariables but we are not going to distinguish between them. For example, one type of covariables are those which are accounted for in the survey design (e.g. variables defining different strata), another type can be covariables observed in registries, and a third type can be baseline data which were available when a survey study with repeated measurements on units was initiated. The effect of the covariates will be modeled by a term $\Theta X$, where $\Theta$ stands for the unknown effect and $X$ collects the observed covariables, leading to the model
$$Y = ABC + \Theta X + UZ + E. \qquad (2.6)$$
Moreover, suppose that there are a number of latent variables. These can be large clusters of variables, e.g. weather variables, socio-economic variables, chemical characteristics of soil and water, etc. A typical phenomenon of these clusters is that one knows that there should be an effect on the survey estimates but it is difficult to express this via some functional relationship. Instead we will model the relationship via rank restrictions on parameter matrices and will consider the following model:
$$Y = ABC + \Theta X + \Psi F + UZ + E, \qquad (2.7)$$
where $F: t \times n$ is known and $\Psi: p \times t$ is unknown of rank $r(\Psi) = r$. The idea is that all clustered variables appear in F and these variables are governed by r unobserved latent variables (processes) which are incorporated in the model via the rank restrictions on $\Psi$. It is important to note that p has to be larger than r, meaning that the number of underlying latent variables which can be modeled depends on how many times the survey has been repeated. Thus, we really see an advantage of performing repeated surveys when latent variables are supposed to exist, which indeed we believe is a common phenomenon.
3. Detailed model specification
Let Y consist of the direct estimates of the subareas where each column corresponds to a subarea including p estimators from the p repeated measurements. When performing likelihood inference the direct estimates are used but in the presentation we will not distinguish between estimators and estimates.
Definition 3.1.
Let
$$Y = ABC + \Theta X + \Psi F + UZ + E,$$
where $A: p \times q$, $C: k \times n$, $X: s \times n$ and $F: t \times n$ are all known matrices, $B$, $\Theta$ and $\Psi$, with $r(\Psi) = r$, are unknown parameter matrices, $Z: l \times n$, $U \sim N_{p,l}(0, \Sigma_u, I_l)$ and $E \sim N_{p,n}(0, \Sigma, V)$, where U and E are independently distributed. The parameters $\Sigma_u$ and $\Sigma$ are supposed to be unknown and positive definite, whereas V is known and determined from the survey.
When in Definition 3.1 $\Theta = 0$, $\Psi = 0$ and $U = 0$, then we have the classical growth curve model (GMANOVA) (see Potthoff and Roy 1964; von Rosen 2018). When $\Psi = 0$ and $U = 0$, then the growth curve model with covariate information appears, which sometimes is identified as a mixture of the GMANOVA and MANOVA models. For references to this model, as well as more general models, see Chinchilli and Elswick (1985), Verbyla and Venables (1988), von Rosen (1989), and Bai and Shi (2007). If $\Theta = 0$ and $\Psi = 0$, we say that we have the growth curve model with random effects (see Ip et al. 2007), whereas if only $\Psi = 0$ holds, see Yokoyama and Fujikoshi (1992) and Yokoyama (1995), where similar models are considered and where references to earlier works can be found. However, usually in these works one puts structures on the covariance matrices, which will not be discussed in this article.
4. Estimation of parameters
Let Q be a matrix of basis vectors such that where
is a symmetric square root of V in
Further, let
be any matrix of full rank satisfying
We assume
where
v > l.
A one-to-one transformation of the model in Definition 3.1, using Q, yields
(4.8)
(4.9)
The matrix of size v × v is idempotent since
and
and therefore
where
v × v is a known orthogonal matrix. The identity in Equation (4.8) is post-multiplied by
leading to the model
(4.10)
However, since
and this leads to the model in Equation (4.10) being split into two models. Let
Then we have three models which will be used when finding estimators:
(4.11)
(4.12)
(4.13)
The idea is to utilize Equations (4.12) and (4.13) when estimating B,
and
To estimate
these estimators are inserted in Equation (4.11). Suppose that the estimators
and
have been obtained, and let
Under the assumption of no randomness in
and
Thus, if
, implying
Here it is assumed that the difference is positive definite. If $\widehat{\Sigma}_u$ is not positive definite, which occurs with some positive probability, this indicates that the data do not support the model presented in Definition 3.1, or that the estimation procedure has some deficiencies. It can be noted that the problem of negative variance components in mixed linear models has a long history; an early reference is Nelder (1954). A more recent contribution to the discussion of the problem is provided by Molenberghs and Verbeke (2011). If we were to start with the model in Equation (2.5) and only consider
we could avoid assuming
to be positive definite. However, in our model derivation, via
has to be positive definite. Thus, if obtaining a non-positive definite
we have to either reformulate the model or change the estimator. One way to modify
is to work with the eigenvalues of
If there are a few “small” (in absolute value) non-positive eigenvalues, these can be replaced by small positive quantities. This should of course be done with caution and common sense but, as far as we know, there is no general recipe for how to handle a non-positive definite
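The eigenvalue repair mentioned above can be sketched as follows (an ad hoc illustration of ours, not a prescription from the article): compute the spectral decomposition and replace non-positive eigenvalues by a small positive floor.

```python
import numpy as np

def repair_pd(S, floor=1e-6):
    """Replace non-positive eigenvalues of a symmetric matrix S by a small
    positive floor; an ad hoc repair for a covariance estimate that is not
    positive definite. To be used with caution, as noted in the text."""
    w, P = np.linalg.eigh(S)
    return P @ np.diag(np.where(w <= 0, floor, w)) @ P.T

# Symmetric matrix with one small negative eigenvalue (approx. -0.045).
S = np.array([[1.0, 0.6],
              [0.6, 0.3]])
S_pd = repair_pd(S)
```

Only the offending eigenvalue is altered, so the dominant part of the estimated covariance structure is preserved.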
To estimate
and
Equations (4.12) and (4.13) are merged:
(4.14)
where
Put
Equation (4.14) is identical to
(4.15)
which is an extended Growth Curve model (GMANOVA + MANOVA) with a reduced rank regression component. Furthermore, the likelihood function which corresponds to the model in Equation (4.15) equals
(4.16)
where we have used the convention that instead of
it is written
for any arbitrary matrix expression H. Our first observation is that the likelihood in Equation (4.16) is smaller than or equal to
(4.17)
with equality if and only if
which, under some full rank conditions on
determines
as a function of B and
Thus, if B and
can be estimated the covariate effect described by
can be estimated. The density in EquationEquation (4.17)
(4.17)
(4.17) corresponds to the model
(4.18)
which is a Growth Curve model with latent variables (a reduced rank effect). The model has been considered by Reinsel and Velu (1998) and von Rosen and von Rosen (2017), where also other references can be found. Now it is briefly described how B and
can be estimated. For notational convenience the model in Equation (4.18) is written
(4.19)
Note that because
which is essential for being able to obtain explicit estimators.
Let
and then the likelihood function for the model in Equation (4.19) equals
(4.20)
The upper bound of the likelihood function in Equation (4.20) is well known (see Srivastava and Khatri 1979, Theorem 1.10.4):
(4.21)
and in Equation (4.21) equality holds if and only if
Hence, if B and
are estimated then
is also estimated. Maximizing the log-likelihood function is equivalent to minimizing the determinant
which will now be carried out.
It is worth noting that, due to the rank restriction, the matrix
can be factored into two terms, i.e.
where
and
are of size p × r and size r × t, respectively. Depending on the knowledge about the model
and
can be interpreted but there are also cases where the factorization has no clear interpretation.
In the following, let
with
denoting an arbitrary generalized inverse.
Following the approach presented in Kollo and von Rosen (2005, Chapter 4), we have
(4.22)
where
(4.23)
(4.24)
and equality in Equation (4.22) holds if and only if
Throughout the article it is supposed that
is positive definite, which holds with probability 1. Thus, B as a function of
can be obtained. Moreover,
(4.25)
where
(4.26)
and the equality in Equation (4.25) holds if and only if
(4.27)
The linear system of equations in Equation (4.27) is consistent and hence
can always be estimated as a function of
(4.28)
under the assumption that
and
The only parameter on the right-hand side of Equation (4.25) which is left to estimate is
The right-hand side of Equation (4.25) can be factored as
where
with
is such that
is semi-orthogonal of rank r and U is a positive definite matrix. The square root in M is supposed to be symmetric. Due to the Poincaré separation theorem (see Rao 1979)
where
are the eigenvalues of U which do not depend on any unknown parameter. The lower bound is then attained if M consists of the corresponding eigenvectors
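A numerical check of this bound (our illustration, with a hypothetical positive definite U): by the Poincaré separation theorem, the determinant of M′UM over semi-orthogonal p × r matrices M is bounded below by the product of the r smallest eigenvalues of U, and the bound is attained when M consists of the corresponding eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(2)
p, r = 6, 2
X = rng.standard_normal((p, p))
U = X @ X.T + p * np.eye(p)      # hypothetical positive definite matrix

w, P = np.linalg.eigh(U)         # eigenvalues in ascending order
M_opt = P[:, :r]                 # eigenvectors of the r smallest eigenvalues
d_opt = np.linalg.det(M_opt.T @ U @ M_opt)

# Any other semi-orthogonal M gives a determinant at least as large.
M_rand, _ = np.linalg.qr(rng.standard_normal((p, r)))
d_rand = np.linalg.det(M_rand.T @ U @ M_rand)
```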
Let
Therefore, a maximum likelihood estimator of
is found if we can find
satisfying the following equality:
(4.29)
Since the following choice of
is appropriate:
(4.30)
There remains an unresolved problem of presenting all solutions to Equation (4.29), but this is not considered here.
Together with Equation (4.28) this yields that
The obtained results will now be stated in the following theorem.
Theorem 4.1.
Let the direct estimates of a survey follow the model given in Definition 3.1. All matrices in the statements have been defined earlier in this section.
If
If additionally
If
If additionally
and
If additionally
where
Suppose
is positive definite. Then
5. Prediction
Small area estimation is very often about prediction of unobserved units (subareas). For our model, due to the normality assumptions, it is relatively straightforward to construct predictors, in particular if the explicit estimators of the previous section are utilized. The strategy will be to first predict U based on the observed data, and thereafter predict the unobserved units.
For prediction of U the conditional mean is crucial and thus the joint distribution of U and Y will be derived. According to Definition 3.1, U and E are independently distributed
is normally distributed, i.e.
where
Thus,
and the next proposition can be stated.
Proposition 5.1.
Let the model be given in Definition 3.1 and use the notation introduced earlier in Section 4.
The predicted value
is given by
where
and
are presented in Theorem 4.1 and Y consists of subarea estimates.
Let
consist of the unobserved direct estimates, corresponding to the non-sampled units. The corresponding covariables are assumed to be available and are denoted
and
Then
where
is given in (i).
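The conditional-mean predictor underlying Proposition 5.1 can be sketched generically (our illustration, with hypothetical small dimensions): for jointly normal U and Y, E[U | Y = y] = μ_U + Σ_{UY} Σ_{YY}⁻¹ (y − μ_Y).

```python
import numpy as np

def predict_u(y, mu_y, mu_u, S_uy, S_yy):
    """Conditional mean of U given Y = y for a jointly normal vector:
    E[U | Y = y] = mu_u + S_uy @ inv(S_yy) @ (y - mu_y)."""
    return mu_u + S_uy @ np.linalg.solve(S_yy, y - mu_y)

# Toy numbers (all hypothetical), vectorized as in vec(U), vec(Y).
mu_u, mu_y = np.zeros(2), np.zeros(3)
S_uy = np.array([[0.5, 0.0, 0.0],
                 [0.0, 0.5, 0.0]])   # Cov(U, Y)
S_yy = np.eye(3)                     # Var(Y)
y = np.array([1.0, 2.0, 3.0])
u_hat = predict_u(y, mu_y, mu_u, S_uy, S_yy)   # -> [0.5, 1.0]
```

In practice the unknown covariance matrices are replaced by the estimators of Theorem 4.1, which is what makes the resulting predictors empirical.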
References
- Bai, P., and L. Shi. 2007. Exact distributions of MLEs of regression coefficients in GMANOVA-MANOVA model. Journal of Multivariate Analysis 98 (9):1840–52. doi:10.1016/j.jmva.2007.06.001.
- Battese, G. E., R. M. Harter, and W. A. Fuller. 1988. An error-components model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association 83 (401):28–36. doi:10.1080/01621459.1988.10478561.
- Chinchilli, V., and R. Elswick. 1985. A mixture of the MANOVA and GMANOVA models. Communications in Statistics - Theory and Methods 14 (12):3075–89. doi:10.1080/03610928508829096.
- Fay, R. E., and R. A. Herriot. 1979. Estimates of income for small places: an application of James-Stein procedures to census data. Journal of the American Statistical Association 74 (366a):269–77. doi:10.2307/2286322.
- Ip, W.-C., M.-X. Wu, S.-G. Wang, and H. Wong. 2007. Estimation for parameters of interest in random effects growth curve models. Journal of Multivariate Analysis 98 (2):317–27. doi:10.1016/j.jmva.2006.04.001.
- Kollo, T., and D. von Rosen. 2005. Advanced Multivariate Statistics with Matrices. Dordrecht: Springer.
- Molenberghs, G., and G. Verbeke. 2011. A note on a hierarchical interpretation for negative variance components. Statistical Modelling: An International Journal 11 (5):389–408. doi:10.1177/1471082X1001100501.
- Nelder, J. A. 1954. The interpretation of negative components of variance. Biometrika 41 (3–4):544–8. doi:10.2307/2332734.
- Potthoff, R. F., and S. N. Roy. 1964. A generalized multivariate analysis of variance model useful especially for growth curve problems. Biometrika 51 (3–4):313–26. doi:10.2307/2334137.
- Rao, C. R. 1979. Separation theorems for singular values of matrices and their applications in multivariate analysis. Journal of Multivariate Analysis 9 (3):362–77. doi:10.1016/0047-259X(79)90094-0.
- Rao, J. N. K., and I. Molina. 2015. Small Area Estimation. 2nd ed. Hoboken, NJ: Wiley.
- Reinsel, G. C., and R. P. Velu. 1998. Multivariate Reduced-Rank Regression: Theory and Applications. Lecture Notes in Statistics 136. New York: Springer-Verlag.
- von Rosen, D. 1989. Maximum likelihood estimators in multivariate linear normal models. Journal of Multivariate Analysis 31 (2):187–200. doi:10.1016/0047-259X(89)90061-4.
- von Rosen, D. 2018. Bilinear Regression Analysis: An Introduction. Lecture Notes in Statistics 220. New York: Springer-Verlag.
- von Rosen, T., and D. von Rosen. 2017. On estimation in some reduced rank extended growth curve models. Mathematical Methods of Statistics 26 (4):299–310. doi:10.3103/S1066530717040044.
- Srivastava, M. S., and C. G. Khatri. 1979. An Introduction to Multivariate Statistics. New York: North-Holland.
- Verbyla, A. P., and W. N. Venables. 1988. An extension of the growth curve model. Biometrika 75 (1):129–38. doi:10.1093/biomet/75.1.129.
- Yokoyama, T. 1995. Statistical inference on some mixed MANOVA - GMANOVA models with random effects. Hiroshima Mathematical Journal 25 (3):441–74. doi:10.32917/hmj/1206127624.
- Yokoyama, T., and Y. Fujikoshi. 1992. Tests for random-effects covariance structures in the growth curve model with covariates. Hiroshima Mathematical Journal 22 (1):195–202. doi:10.32917/hmj/1206128618.