Statistical Theory and Related Fields Volume 3, 2019 - Issue 2


Small area estimation with subgroup analysis

Xin Wang, Department of Statistics, Miami University, Oxford, OH, USA (corresponding author)

Zhengyuan Zhu, Department of Statistics, Iowa State University, Ames, IA, USA

Pages 129-135 | Received 31 Dec 2018, Accepted 20 Aug 2019, Published online: 31 Aug 2019

https://doi.org/10.1080/24754269.2019.1659097

ABSTRACT

In this article, a new unit level model based on a pairwise penalised regression approach is proposed for problems in small area estimation (SAE). Instead of assuming common regression coefficients for all small domains in the traditional model, the new estimator is based on a subgroup regression model which allows different regression coefficients in different groups. The alternating direction method of multipliers (ADMM) algorithm is used to find subgroups with different regression coefficients. We also consider pairwise spatial weights for spatial areal data. In the simulation study, we compare the performances of the new estimator with the traditional small area estimator. We also apply the new estimator to urban area estimation using data from the National Resources Inventory survey in Iowa.

KEYWORDS:

Linear mixed models
penalty regressions
small area estimation
spatial areal data
subgroup analysis

1. Introduction

Small area estimation (SAE) is an important problem in survey sampling when the sample sizes are not large enough to provide reliable estimates in small domains or areas. See Rao and Molina (2015) and Pfeffermann (2013) for overviews and recent developments in SAE. One of the model-based approaches for SAE is the unit level model, which was first proposed by Battese, Harter, and Fuller (1988). Unit level models are specified for the individual elements of the population and require the availability of unit level auxiliary information.

Traditional unit level models typically assume a linear relationship between the variable of interest and the auxiliary information, and all the areas share the same regression coefficients to borrow information. Random effects are also considered for each small area. However, different relationships can exist in different areas. That is, subgroups could exist for different areas such that areas in one group have the same regression coefficient and areas in different groups have different regression coefficients.

In the linear regression setting, Ma and Huang (2017) and Ma, Huang, and Zhang (2016) developed a method to obtain homogeneous groups based on regression coefficients through the alternating direction method of multipliers (ADMM) algorithm (Boyd, Parikh, Chu, Peleato, & Eckstein, 2011). In the algorithm, they used pairwise concave penalties based on the smoothly clipped absolute deviation (SCAD) penalty (Fan & Li, 2001) and the minimax concave penalty (MCP) (Zhang, 2010). Wang, Zhu, and Zhang (2019) extended the problem to a regression setting with repeated measures. They also considered spatial weights in the pairwise penalties and showed that spatial weights perform better than equal weights. However, their model cannot be applied to SAE problems directly, since random effects are not considered.

In this article, we propose a new SAE estimator that allows different regression coefficients in different subgroups under a linear mixed model framework at the unit level. The ADMM algorithm is applied and the variance parameters are also estimated in the algorithm. As in Wang et al. (2019), we use spatial pairwise weights in the pairwise penalties based on the SCAD penalty. In this algorithm, the number of groups and the group structure are also determined.

The article is organised as follows. In Section 2, we introduce the unit level model with areal regression coefficients and the algorithm to find subgroups. In Section 3, we conduct several simulation studies to compare the performance of the proposed estimator with the traditional estimators. In Section 4, we apply the proposed method to a real data set. Finally, Section 5 contains conclusions and discussion.

2. The model and the algorithm

In this section, the unit level model with area level regression coefficients and the corresponding algorithm to estimate parameters are introduced.

2.1. The unit level model

Suppose there are M areas, where area i has known population size $N_{i}$ and sample size $n_{i}$, for $i = 1, \dots, M$. Let $y_{ih}$ be the observation of unit h in area i for $h = 1, \dots, n_{i}$, $i = 1, \dots, M$. Let $x_{ih}$ be the p-dimensional auxiliary information vector, whose area population mean ${\bar{X}}_{i} = N_{i}^{-1} \sum_{h = 1}^{N_{i}} x_{ih}$ is known. In the traditional unit level model, that is, the Battese–Harter–Fuller (BHF) model (Battese et al., 1988), different areas share the same regression coefficients:

(1) $y_{ih} = x_{ih}^{T} β + v_{i} + ϵ_{ih},$

where $β$ is the unknown regression coefficient vector, the $v_{i}$'s are i.i.d. areal random effects with mean zero and variance $σ_{v}^{2}$, and the $ϵ_{ih}$'s are i.i.d. random errors with mean zero and variance $σ_{ϵ}^{2}$. Let $A_{i}$ be the set of observed units and $C_{i}$ be the set of unobserved units in area i. The predictor for the finite population mean ${\bar{Y}}_{i} = N_{i}^{-1} \sum_{h = 1}^{N_{i}} y_{ih}$ in area i under model (1), as given in Battese et al. (1988) and implemented in the sae package (Molina & Marhuenda, 2015), is

(2) ${\hat{\bar{Y}}}_{i}^{BHF} = \frac{1}{N_{i}} \left( \sum_{h \in A_{i}} y_{ih} + \sum_{h \in C_{i}} (x_{ih}^{T} \hat{β} + {\hat{v}}_{i}) \right),$

where $\hat{β}$ is the estimate of $β$ and ${\hat{v}}_{i}$ is the empirical best linear unbiased prediction of $v_{i}$. In the simulation study, we use the R package sae (Molina & Marhuenda, 2015) to obtain the predictions.

Instead of assuming all the areas have the same regression coefficients $β$, we assume that there are K mutually exclusive subgroups $G = \{G_{1}, \dots, G_{K}\}$, which form a partition of the areas $\{1, 2, \dots, M\}$. First we assume that each area has its own regression coefficient,

(3) $y_{ih} = x_{ih}^{T} β_{i} + v_{i} + ϵ_{ih},$

where $β_{i}$ is the unknown regression coefficient vector for area i. Let $y_{i} = (y_{i1}, \dots, y_{in_{i}})^{T}$, $x_{i} = (x_{i1}, \dots, x_{in_{i}})^{T}$ and $β = (β_{1}^{T}, \dots, β_{M}^{T})^{T}$. The weighted log likelihood function is

(4) $l(β, σ_{v}^{2}, σ_{ϵ}^{2}) = - \frac{1}{2} \sum_{i = 1}^{M} \frac{1}{n_{i}} \log |Σ_{i}| - \frac{1}{2} \sum_{i = 1}^{M} \frac{1}{n_{i}} (y_{i} - x_{i}^{T} β_{i})^{T} Σ_{i}^{-1} (y_{i} - x_{i}^{T} β_{i}),$

where $Σ_{i}$ is the covariance matrix implied by the random effect structure, which has the form

$Σ_{i} = 1_{n_{i}} 1_{n_{i}}^{T} σ_{v}^{2} + I_{n_{i}} σ_{ϵ}^{2}$

with inverse

$Σ_{i}^{-1} = (1_{n_{i}} 1_{n_{i}}^{T} σ_{v}^{2} + I_{n_{i}} σ_{ϵ}^{2})^{-1} = \frac{1}{σ_{ϵ}^{2}} \left( I_{n_{i}} - 1_{n_{i}} 1_{n_{i}}^{T} \frac{σ_{v}^{2}}{σ_{ϵ}^{2} + n_{i} σ_{v}^{2}} \right),$

where $1_{n_{i}}$ is an $n_{i} \times 1$ vector of ones and $I_{n_{i}}$ is the $n_{i} \times n_{i}$ identity matrix.
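The closed form of $Σ_{i}^{-1}$ can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation; the function names are hypothetical, and the log-determinant identity $\log|Σ_{i}| = (n_{i} - 1) \log σ_{ϵ}^{2} + \log(σ_{ϵ}^{2} + n_{i} σ_{v}^{2})$ is a standard property of this compound-symmetric matrix rather than a formula stated in the article.

```python
import math

def sigma(n, s2v, s2e):
    """Sigma_i = 1 1^T s2v + I s2e for an area with n sampled units."""
    return [[s2v + (s2e if r == c else 0.0) for c in range(n)] for r in range(n)]

def sigma_inv(n, s2v, s2e):
    """Closed-form inverse: (I - 1 1^T s2v / (s2e + n s2v)) / s2e."""
    shrink = s2v / (s2e + n * s2v)
    return [[((1.0 if r == c else 0.0) - shrink) / s2e for c in range(n)]
            for r in range(n)]

def matmul(a, b):
    """Dense matrix product for small list-of-lists matrices."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def log_det_sigma(n, s2v, s2e):
    """log|Sigma_i| from its eigenvalues: s2e (n - 1 times) and s2e + n s2v."""
    return (n - 1) * math.log(s2e) + math.log(s2e + n * s2v)

# Sigma times its closed-form inverse should be the identity (up to rounding).
product = matmul(sigma(4, 1.0, 0.25), sigma_inv(4, 1.0, 0.25))
```

Because both quantities are available in closed form, the weighted log likelihood in (4) can be evaluated without a numerical matrix inversion in each area.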

If area i and area j are in the same group, then $β_{i} = β_{j}$. In order to find the estimated partition $\hat{G} = \{{\hat{G}}_{1}, \dots, {\hat{G}}_{\hat{K}}\}$ with the estimated number of groups $\hat{K}$, the following objective function is considered:

(5) $Q(β, σ_{v}^{2}, σ_{ϵ}^{2}; λ, ψ) = \frac{1}{2} \sum_{i = 1}^{M} \frac{1}{n_{i}} \log |Σ_{i}| + \frac{1}{2} \sum_{i = 1}^{M} \frac{1}{n_{i}} (y_{i} - x_{i}^{T} β_{i})^{T} Σ_{i}^{-1} (y_{i} - x_{i}^{T} β_{i}) + \sum_{1 \leq i < j \leq M} p_{γ}(\|β_{i} - β_{j}\|, c_{ij} λ),$

where $\| \cdot \|$ denotes the Euclidean norm and $p_{γ}(\cdot, λ)$ is a penalty function with a fixed value γ and a tuning parameter $λ \geq 0$. In the penalty function, the pairwise weight $c_{ij}$ is associated with area i and area j. In this paper, we use the SCAD penalty. In the context of spatial SAE, we define $c_{ij}$ as

(6) $c_{ij} = \exp(ψ (1 - a_{ij})),$

where ψ is a tuning parameter and $a_{ij}$ is the neighbour order between area i and area j. As shown in Wang et al. (2019), pairwise spatial weights can improve performance for spatial areal data.
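To make the penalty term concrete, here is a small Python sketch of the SCAD penalty in the form given by Fan and Li (2001) together with the weight in (6). The function names are hypothetical, and the default `gamma=3.7` is the value commonly suggested for SCAD, an assumption here since the article only says that γ is fixed.

```python
import math

def scad_penalty(t, lam, gamma=3.7):
    """SCAD penalty p_gamma(t, lam) for t >= 0 (Fan and Li, 2001)."""
    if t <= lam:
        return lam * t                      # linear (lasso-like) part
    if t <= gamma * lam:
        # quadratic part joining the linear and constant pieces
        return (2 * gamma * lam * t - t * t - lam * lam) / (2 * (gamma - 1))
    return lam * lam * (gamma + 1) / 2      # constant: large gaps are not shrunk further

def spatial_weight(neighbour_order, psi):
    """Pairwise weight c_ij = exp(psi * (1 - a_ij)) from equation (6)."""
    return math.exp(psi * (1 - neighbour_order))

# Penalty contribution of one pair of areas with ||beta_i - beta_j|| = 0.8,
# second-order neighbours, and illustrative values lam = 0.5, psi = 0.5.
pair_term = scad_penalty(0.8, spatial_weight(2, 0.5) * 0.5)
```

First-order neighbours ($a_{ij} = 1$) always receive weight 1. If ψ is positive (an assumption about its sign), pairs of higher neighbour order receive smaller weights and are therefore penalised less for having different coefficients.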

2.2. The ADMM algorithm

For given λ and ψ, the solution of (5) is

(7) $(\hat{β}, {\hat{σ}}_{v}^{2}, {\hat{σ}}_{ϵ}^{2}) = \underset{β \in R^{Mp},\, σ_{v}^{2} \in R_{+},\, σ_{ϵ}^{2} \in R_{+}}{\arg\min} \; Q(β, σ_{v}^{2}, σ_{ϵ}^{2}; λ, ψ).$

The ADMM algorithm is applied to solve (7). Introducing $δ_{ij} = β_{i} - β_{j}$, the problem is equivalent to minimising

$L_{0}(β, σ_{v}^{2}, σ_{ϵ}^{2}, δ) = \frac{1}{2} \sum_{i = 1}^{M} \frac{1}{n_{i}} \log |Σ_{i}| + \frac{1}{2} \sum_{i = 1}^{M} \frac{1}{n_{i}} (y_{i} - x_{i}^{T} β_{i})^{T} Σ_{i}^{-1} (y_{i} - x_{i}^{T} β_{i}) + \sum_{1 \leq i < j \leq M} p_{γ}(\|δ_{ij}\|, c_{ij} λ)$

subject to $β_{i} - β_{j} - δ_{ij} = 0$, where $δ = (δ_{ij}^{T}, i < j)^{T}$. The augmented Lagrangian is

$L(β, σ_{v}^{2}, σ_{ϵ}^{2}, δ, v) = L_{0}(β, σ_{v}^{2}, σ_{ϵ}^{2}, δ) + \sum_{i < j} ⟨v_{ij}, β_{i} - β_{j} - δ_{ij}⟩ + \frac{ϑ}{2} \sum_{i < j} \|β_{i} - β_{j} - δ_{ij}\|^{2},$

where $v = (v_{ij}^{T}, i < j)^{T}$ are Lagrange multipliers and ϑ is the penalty parameter. Let $τ = (σ_{v}^{2}, σ_{ϵ}^{2})$. Given $τ^{m}$, $δ^{m}$ and $v^{m}$ at iteration m, the quantities $β$, $τ$, $δ$ and $v$ are updated as follows:

$\begin{aligned} β^{m + 1} & = \arg\min_{β} L(β, τ^{m}, δ^{m}, v^{m}), \\ τ^{m + 1} & = τ^{m} + [I(τ^{m})]^{-1} s(β^{m + 1}, τ^{m}), \\ δ^{m + 1} & = \arg\min_{δ} L(β^{m + 1}, τ^{m + 1}, δ, v^{m}), \\ v_{ij}^{m + 1} & = v_{ij}^{m} + ϑ (β_{i}^{m + 1} - β_{j}^{m + 1} - δ_{ij}^{m + 1}). \end{aligned}$

Let $y = (y_{1}^{T}, \dots, y_{M}^{T})^{T}$, $X = \operatorname{diag}(x_{1}, x_{2}, \dots, x_{M})$ and $Ω = \operatorname{diag}(n_{1}^{-1} Σ_{1}^{-1}, \dots, n_{M}^{-1} Σ_{M}^{-1})$, and let $Ω^{m}$ denote $Ω$ evaluated at $τ^{m}$. The update of $β$ is

$β^{m + 1} = (X^{T} Ω^{m} X + ϑ A^{T} A)^{-1} \left( X^{T} Ω^{m} y + ϑ \operatorname{vec}((Δ^{m} - ϑ^{-1} Υ^{m}) D) \right),$

where $A = D \otimes I_{p}$, ⊗ is the Kronecker product, $D$ is the $M(M - 1)/2 \times M$ matrix whose rows are $(e_{i} - e_{j})^{T}$ for $i < j$, with $e_{i}$ an $M \times 1$ vector whose ith element is 1 and whose other elements are 0, $Δ^{m} = (δ_{ij}^{m}, i < j)_{p \times M(M - 1)/2}$ and $Υ^{m} = (v_{ij}^{m}, i < j)_{p \times M(M - 1)/2}$. When updating $τ$, $I(τ)$ is the expected second-order derivative matrix of $-l$ in (4) and

$s(β^{m + 1}, τ^{m}) = \left. \left( \frac{\partial l}{\partial σ_{v}^{2}}, \frac{\partial l}{\partial σ_{ϵ}^{2}} \right)^{T} \right|_{β = β^{m + 1}, τ = τ^{m}}.$

The details of $s(\cdot, \cdot)$ and $I$ are in the appendix. In this step, $τ$ can be updated several times within one iteration.
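The role of $D$ and $A = D \otimes I_{p}$ is easier to see on a toy example. The sketch below, in plain Python with hypothetical helper names, builds both matrices explicitly (dense lists, which would only be practical for small M) and shows that $Aβ$ stacks the pairwise differences $β_{i} - β_{j}$ for $i < j$. It also includes the multiplier update from the last line of the iteration.

```python
from itertools import combinations

def difference_matrix(M):
    """Rows are (e_i - e_j)^T for all pairs i < j, so D is M(M-1)/2 by M."""
    rows = []
    for i, j in combinations(range(M), 2):
        row = [0] * M
        row[i], row[j] = 1, -1
        rows.append(row)
    return rows

def kron_identity(D, p):
    """A = D kron I_p as a dense list of lists."""
    return [[d * (1 if a == b else 0) for d in row for b in range(p)]
            for row in D for a in range(p)]

def matvec(A, x):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def update_multipliers(v, A_beta, delta, theta):
    """v^{m+1} = v^m + theta * (A beta^{m+1} - delta^{m+1}), stacked over pairs."""
    return [vk + theta * (ab - dk) for vk, ab, dk in zip(v, A_beta, delta)]

# Three areas with p = 2 coefficients each, stacked as (beta_1, beta_2, beta_3).
beta = [1, 2, 3, 5, 10, 20]
A = kron_identity(difference_matrix(3), 2)
pairwise_diffs = matvec(A, beta)  # (beta_1 - beta_2, beta_1 - beta_3, beta_2 - beta_3)
```

For the 99 areas used later in the article, $D$ has 4851 rows, so a practical implementation would likely exploit its sparsity rather than store $A$ densely.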

The update of $δ_{ij}$ is based on the closed-form solution for the SCAD penalty. Let $ς_{ij}^{m} = (β_{i}^{m + 1} - β_{j}^{m + 1}) + ϑ^{-1} v_{ij}^{m}$; then the solution is

$δ_{ij}^{m + 1} = \begin{cases} S(ς_{ij}^{m}, λ c_{ij} / ϑ) & \text{if } \|ς_{ij}^{m}\| \leq λ c_{ij} + λ c_{ij} / ϑ, \\ \dfrac{S(ς_{ij}^{m}, γ λ c_{ij} / ((γ - 1) ϑ))}{1 - 1 / ((γ - 1) ϑ)} & \text{if } λ c_{ij} + λ c_{ij} / ϑ < \|ς_{ij}^{m}\| \leq γ λ c_{ij}, \\ ς_{ij}^{m} & \text{if } \|ς_{ij}^{m}\| > γ λ c_{ij}, \end{cases}$

where $γ > c_{ij} + c_{ij} / ϑ$, $S(w, t) = (1 - t / \|w\|)_{+} w$, and $(t)_{+} = t$ if $t > 0$ and 0 otherwise.
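The three-case update can be written as a short function. This is a sketch in plain Python with hypothetical names, following the piecewise formula above term by term; the convention $S(0, t) = 0$ for a zero vector is an added assumption to avoid division by zero.

```python
import math

def group_soft_threshold(w, t):
    """S(w, t) = (1 - t / ||w||)_+ w, with S(0, t) taken to be 0."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm == 0.0:
        return [0.0] * len(w)
    scale = max(0.0, 1.0 - t / norm)
    return [scale * x for x in w]

def scad_delta_update(zeta, lam, c_ij, theta, gamma):
    """Closed-form SCAD update of delta_ij given zeta_ij (one pair of areas)."""
    norm = math.sqrt(sum(x * x for x in zeta))
    lam_ij = lam * c_ij
    if norm <= lam_ij + lam_ij / theta:
        return group_soft_threshold(zeta, lam_ij / theta)
    if norm <= gamma * lam_ij:
        shrunk = group_soft_threshold(zeta, gamma * lam_ij / ((gamma - 1) * theta))
        return [x / (1 - 1 / ((gamma - 1) * theta)) for x in shrunk]
    return list(zeta)  # large differences are left unshrunk

# Illustrative values: lam = 1, c_ij = 1, theta = 1, gamma = 3.
fused = scad_delta_update([0.3, 0.4], 1.0, 1.0, 1.0, 3.0)      # small gap is set to zero
partial = scad_delta_update([1.5, 2.0], 1.0, 1.0, 1.0, 3.0)    # middle case, shrunk
untouched = scad_delta_update([3.0, 4.0], 1.0, 1.0, 1.0, 3.0)  # large gap is kept
```

A pair with $δ_{ij}^{m + 1} = 0$ is a pair whose coefficients are being fused, which is how the penalty leads to an estimated group structure.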

Remark 2.1

The convergence criterion follows Boyd et al. (2011). The primal residual and dual residual are defined as $r^{m + 1} = A β^{m + 1} - δ^{m + 1}$ and $s^{m + 1} = ϑ A^{T} (δ^{m + 1} - δ^{m})$. The stopping criterion is

$\|r^{m}\|_{2} \leq ϵ^{\mathrm{pri}}, \quad \|s^{m}\|_{2} \leq ϵ^{\mathrm{dual}},$

where

$\begin{aligned} ϵ^{\mathrm{pri}} & = \sqrt{\frac{M(M - 1)}{2} p} \; ϵ^{\mathrm{abs}} + ϵ^{\mathrm{rel}} \max \{\|A β^{m}\|, \|δ^{m}\|\}, \\ ϵ^{\mathrm{dual}} & = \sqrt{M p} \; ϵ^{\mathrm{abs}} + ϵ^{\mathrm{rel}} \|A^{T} v^{m}\|, \end{aligned}$

$ϵ^{\mathrm{abs}}$ is an absolute tolerance and $ϵ^{\mathrm{rel}}$ is a relative tolerance. In the simulation study and the application, we use $ϵ^{\mathrm{abs}} = 10^{-4}$ and $ϵ^{\mathrm{rel}} = 10^{-2}$.
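As a sketch of how this stopping rule might be coded (plain Python, hypothetical function and argument names), the check below takes the residuals and the vectors $Aβ^{m}$, $δ^{m}$ and $A^{T} v^{m}$ as already computed inputs, and uses the tolerances quoted above as defaults.

```python
import math

def l2_norm(v):
    """Euclidean norm of a flat list."""
    return math.sqrt(sum(x * x for x in v))

def admm_converged(r, s, A_beta, delta, At_v, M, p, eps_abs=1e-4, eps_rel=1e-2):
    """True when both the primal residual r and the dual residual s are small."""
    n_pairs = M * (M - 1) / 2
    eps_pri = math.sqrt(n_pairs * p) * eps_abs + eps_rel * max(l2_norm(A_beta), l2_norm(delta))
    eps_dual = math.sqrt(M * p) * eps_abs + eps_rel * l2_norm(At_v)
    return l2_norm(r) <= eps_pri and l2_norm(s) <= eps_dual

# Toy check with M = 3 areas and p = 2 coefficients (three pairs, so length 6).
zeros = [0.0] * 6
done = admm_converged(zeros, zeros, zeros, zeros, zeros, M=3, p=2)
not_done = admm_converged([1.0] + [0.0] * 5, zeros, zeros, zeros, zeros, M=3, p=2)
```

The relative terms scale the tolerance with the size of the iterates, so the same defaults can be used for problems whose coefficients differ in magnitude.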

2.3. The proposed small area estimator

As in Zhu, Zou, Liang, and Zhu (2016), two small area estimators can be defined. Let ${\bar{y}}_{i} = n_{i}^{-1} \sum_{h = 1}^{n_{i}} y_{ih}$ and ${\bar{x}}_{i} = n_{i}^{-1} \sum_{h = 1}^{n_{i}} x_{ih}$ be the sample means of the variable of interest and of the auxiliary information, respectively. The first estimator is based on the predictions of the random effects and is defined as

(8) ${\hat{\bar{Y}}}_{i}^{(1)} = {\bar{X}}_{i}^{T} {\hat{β}}_{i} + {\hat{v}}_{i},$

where ${\hat{β}}_{i}$ is the estimate of $β_{i}$ from the proposed algorithm, ${\hat{v}}_{i} = {\hat{γ}}_{i} ({\bar{y}}_{i} - {\bar{x}}_{i}^{T} {\hat{β}}_{i})$, and ${\hat{γ}}_{i} = {\hat{σ}}_{v}^{2} / ({\hat{σ}}_{v}^{2} + {\hat{σ}}_{ϵ}^{2} / n_{i})$.
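A small numerical sketch of estimator (8) may help. It is written in plain Python with hypothetical function names, and the inputs are made-up values for a single area rather than anything from the article's data.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(u * w for u, w in zip(a, b))

def shrinkage_factor(s2v, s2e, n):
    """gamma_i = s2v / (s2v + s2e / n); approaches 1 as n grows."""
    return s2v / (s2v + s2e / n)

def estimator_one(X_bar, x_bar, y_bar, beta_hat, s2v, s2e, n):
    """Estimator (8): synthetic part X_bar^T beta_i plus predicted area effect."""
    v_hat = shrinkage_factor(s2v, s2e, n) * (y_bar - dot(x_bar, beta_hat))
    return dot(X_bar, beta_hat) + v_hat

# One area with n = 4 sampled units, an intercept and one covariate.
estimate = estimator_one(X_bar=[1, 1.0], x_bar=[1, 1.2], y_bar=2.0,
                         beta_hat=[0.5, 1.0], s2v=1.0, s2e=1.0, n=4)
```

Here the shrinkage factor is 0.8, so 80% of the gap between the sample mean and the regression prediction at the sample mean is attributed to the area effect.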

In the second estimator, the unobserved values in each area are predicted based on the model, which gives

(9) ${\hat{\bar{Y}}}_{i}^{(2)} = \frac{1}{N_{i}} \left( \sum_{h \in A_{i}} y_{ih} + \sum_{h \in C_{i}} {\hat{y}}_{ih} \right) = f_{i} {\bar{y}}_{i} + ({\bar{X}}_{i} - f_{i} {\bar{x}}_{i})^{T} {\hat{β}}_{i} + (1 - f_{i}) {\hat{v}}_{i},$

where ${\hat{y}}_{ih} = x_{ih}^{T} {\hat{β}}_{i} + {\hat{v}}_{i}$ and $f_{i} = n_{i} / N_{i}$. If $f_{i}$ is small, then the predictor in (9) is nearly identical to the predictor in (8).
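The two expressions in (9) can be checked against each other on a toy area. The sketch below uses plain Python with hypothetical names and made-up numbers; the first function needs the covariates of every unobserved unit, while the second needs only means and the sampling fraction.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(u * w for u, w in zip(a, b))

def estimator_two_direct(y_sample, x_unobserved, N, beta_hat, v_hat):
    """First form of (9): observed y plus model predictions for unobserved units."""
    preds = [dot(x, beta_hat) + v_hat for x in x_unobserved]
    return (sum(y_sample) + sum(preds)) / N

def estimator_two_means(y_bar, x_bar, X_bar, f, beta_hat, v_hat):
    """Second form of (9), using only means and f = n / N."""
    adjusted = [X - f * x for X, x in zip(X_bar, x_bar)]
    return f * y_bar + dot(adjusted, beta_hat) + (1 - f) * v_hat

# Toy area: N = 5 units, n = 2 sampled, covariate vector (1, x).
beta_hat, v_hat = [0.5, 1.0], 0.1
direct = estimator_two_direct([2.0, 3.0], [[1, 3.0], [1, 4.0], [1, 5.0]],
                              5, beta_hat, v_hat)
by_means = estimator_two_means(2.5, [1, 1.5], [1, 3.0], 0.4, beta_hat, v_hat)
```

Both calls give the same value. The second form is the convenient one in practice, since unit level covariates for unobserved units are often unavailable while the population mean ${\bar{X}}_{i}$ is assumed known.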

3. Simulation study

The simulation setup is designed based on the features of the National Resources Inventory (NRI) survey, which monitors the status and trends of natural resource characteristics. One of these characteristics is the area of each land use, such as cropland, pastureland and urban land (Nusser & Goebel, 1997). Each state is divided into ‘segments’ with a size of 160 acres. From 1982 to 1997, the full NRI sample was observed at 5-year intervals (1982, 1987, 1992 and 1997) with 300,000 segments. In 2000, the NRI transitioned to an annual sample design with about 70,000 segments.

For the simulation, we construct an artificial population composed of 300,000 segments in 99 counties. The number of counties in the simulated population is the same as the number of counties in Iowa. We treat counties as areas and segments as unit level observations. In the population for the simulation, the number of segments in each county for the 99 counties is between 2210 and 5412. These numbers are the population sizes of segments in counties used in the simulation study. This simulated population maintains features of the NRI data for Iowa. There are around 6000 segments selected in the full sample and around 1500 segments selected in the annual sample in the original NRI design. In the annual sample, fewer segments are sampled, so the accuracy of the estimates is reduced. Thus auxiliary information should be considered to improve the estimator.

We compare the performances of the proposed estimators to the BHF estimator based on 100 simulations. Tuning parameters are selected based on the following modified BIC (Wang, Li, & Tsai, 2007):

(10) $\mathrm{BIC} = -2 l + C_{M} \log(M) (\hat{K} p),$

where l is defined in (4) and $C_{M}$ is a positive number which can depend on M. Here we use $C_{M} = c_{0} \log(\log(M p + 2))$ with $c_{0} = 0.2$ as in Wang et al. (2019).
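The modified BIC in (10) is simple to compute once the log likelihood and the estimated number of groups are available. The following plain Python sketch uses a hypothetical function name and an illustrative log likelihood value, not a result from the article.

```python
import math

def modified_bic(loglik, M, p, K_hat, c0=0.2):
    """Modified BIC (10) with C_M = c0 * log(log(M * p + 2))."""
    C_M = c0 * math.log(math.log(M * p + 2))
    return -2 * loglik + C_M * math.log(M) * (K_hat * p)

# 99 areas, p = 2 coefficients, three estimated groups, illustrative loglik.
bic_three_groups = modified_bic(loglik=-10.0, M=99, p=2, K_hat=3)
bic_four_groups = modified_bic(loglik=-10.0, M=99, p=2, K_hat=4)
```

In a tuning loop one would fit the model over a grid of (λ, ψ) values and keep the pair with the smallest BIC; at equal fit, the criterion prefers the partition with fewer groups.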

In the simulation study, simple random sampling is used in each county to select segments. As mentioned before, the population size of segments in each county is between 2210 and 5412. Two sampling rates are considered in each area, 1% and 0.5%. When the sampling rate is 1%, there are 3067 selected segments in the whole state and the number of segments in each county is between 22 and 54. When the sampling rate is 0.5%, there are 1537 selected segments in the whole state and the number of segments in each county ranges from 11 to 27. The covariate vector is $\mathbf{x}_{ih} = (1, x_{ih})^{T}$, where the scalar $x_{ih}$'s are simulated from a normal distribution with mean 1 and standard deviation 1, and the $v_{i}$'s are simulated from a standard normal distribution, that is, $σ_{v}^{2} = 1$. The assumed group structure in Iowa, with three groups, is shown in Figure 1. The three groups are aggregated based on the districts available on https://www.nass.usda.gov/Charts_and_Maps/Crops_County/boundary_maps/indexpdf.php

Figure 1. Group information.

We consider three different sets of parameters.

Case I: $β_{i} = (0.5, 0.5)^{T}$ if $i \in G_{1}$ , $β_{i} = (2, 2)^{T}$ if $i \in G_{2}$ and $β_{i} = (3.5, 3.5)^{T}$ if $i \in G_{3}$ .
Case II: $β_{i} = (0.5, 0.5)^{T}$ if $i \in G_{1}$ , $β_{i} = (1.5, 1.5)^{T}$ if $i \in G_{2}$ and $β_{i} = (2.5, 2.5)^{T}$ if $i \in G_{3}$ .
Case III: $β_{i} = (0.5, 0.5)^{T}$ if $i \in G_{1}$ , $β_{i} = (1, 1)^{T}$ if $i \in G_{2}$ and $β_{i} = (1.5, 1.5)^{T}$ if $i \in G_{3}$ .

For each set of parameters, $σ_{ϵ} = 0.5, 1, 2$ are considered and $ϵ_{ih} \overset{\mathrm{iid}}{\sim} N(0, σ_{ϵ}^{2})$. For the proposed estimator, we consider both the equal weight ($c_{ij} = 1$) and the spatial weight selected based on the modified BIC. Different estimators are compared by

$\mathrm{RMSE}({\hat{\bar{Y}}}_{i}^{E}) = \sqrt{\frac{1}{B} \sum_{b = 1}^{B} ({\hat{\bar{Y}}}_{i(b)}^{E} - {\bar{Y}}_{i(b)})^{2}},$

where ${\hat{\bar{Y}}}_{i(b)}^{E}$ is the estimated population mean in area i and ${\bar{Y}}_{i(b)}$ is the population mean in the bth simulation, ‘E’ is the index of estimators, which can be 1 or 2, and B=100. All the simulations were run on the Owens cluster of the Ohio Supercomputer Center (Ohio Supercomputer Center, 2016).
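For completeness, the per-area RMSE above amounts to the following computation over the B replicates. This is a plain Python sketch with a hypothetical function name and toy numbers.

```python
import math

def rmse(estimates, truths):
    """Root mean squared error of one area's estimates over B replicates."""
    B = len(estimates)
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truths)) / B)

# Two replicates for one area: the errors are +1 and -1, so the RMSE is 1.
area_rmse = rmse([1.0, 1.0], [0.0, 2.0])
```

Note that the population mean ${\bar{Y}}_{i(b)}$ carries the replicate index b, so the target is paired with the estimate from the same replicate.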

Figures 2, 3 and 4 show the results for the three sets of parameters when the sampling rate is 1% for the 99 areas. ‘direct’ represents the direct estimator, which is the sample mean under simple random sampling. ‘BHF’ is the estimator under model (1), calculated using the sae package. Under the two different weights, we consider the two small area estimators described in Section 2.3. ‘equal1’ and ‘sp1’ represent the estimator in (8) with equal weights and spatial weights, respectively. ‘equal2’ and ‘sp2’ represent the estimator in (9) with equal weights and spatial weights, respectively.

Figure 2. RMSE under Case I.

Figure 3. RMSE under Case II.

Figure 4. RMSE under Case III.

When $σ_{ϵ}$ is large and the group difference is small, the proposed estimator performs similarly to the BHF estimator. As $σ_{ϵ}$ becomes smaller, the proposed estimator increasingly outperforms the classical BHF estimator. In addition, the estimator with spatial weights performs better than the estimator with equal weights. Because the sampling rate, and hence $f_{i}$, is small, there is little difference between the two estimators in (8) and (9).

Figures 5, 6 and 7 show the results when the sampling rate is 0.5%. When the sample sizes become smaller and $σ_{ϵ}$ is large, the proposed estimator with equal weights can be worse than the BHF estimator, but the estimator based on spatial weights is still comparable. As in the case with sampling rate 1%, the proposed estimator performs better when the group difference is large or the value of $σ_{ϵ}$ is small.

Figure 5. RMSE under Case I.

Figure 6. RMSE under Case II.

Figure 7. RMSE under Case III.

4. Real data analysis

In this section, we apply the proposed method to the NRI Iowa urban data in 2015. The auxiliary information used is based on the Landsat data (Li et al., 2018). The Landsat data are matched to the segment level data in the NRI based on the segments' locations. The number of segments per county ranges from 7 to 56. We try different starting values and select the best one with spatial weights. After finding the estimated group structure, we refit the model with the group structure treated as known, estimate the regression coefficients in all groups, and then obtain the estimates in each county. Figure 8 shows the estimated group map. Figure 9 shows the estimated population mean of urban in each county, compared with the sample mean and the BHF estimator. The proposed estimates are close to the estimates based on BHF, but with larger variation among counties because more than one group is used in the estimates.

Figure 8. Estimated group structure.

Figure 9. Estimated population mean of urban in each county.

5. Summary and conclusion

In this article, we propose a new unit level small area estimator based on a penalised regression approach. In the new estimator, we can find subgroups of areas and also borrow information from both auxiliary information and areas. Besides that, spatial information is also used in the algorithm. We use simulation studies to compare the performance of the new estimator to traditional estimators under several simulation settings, which show that the proposed estimator can improve the estimates.

Variance estimation is also important in survey sampling. One direction for future work is to develop a variance estimator for the proposed estimator. Another is to find subgroups for the regression coefficients and the random effects jointly.

Disclosure statement

No potential conflict of interest was reported by the authors.

Additional information

Funding

This research was supported in part by the Natural Resources Conservation Service of the U.S. Department of Agriculture.

Notes on contributors

Xin Wang

Xin Wang is currently an Assistant Professor in the Department of Statistics at Miami University. Her research interests are spatial data analysis, Bayesian statistics, clustering, convergence rates of MCMC algorithms and survey sampling.

Zhengyuan Zhu

Zhengyuan Zhu is currently a Professor in the Department of Statistics at Iowa State University and director of the Center for Survey Statistics & Methodology. His research interests include spatial statistics, survey statistics, time series analysis, and multivariate analysis.

References

Battese, G. E., Harter, R. M., & Fuller, W. A. (1988). An error-components model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association, 83(401), 28–36.
Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122.
Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.
Li, X., Zhou, Y., Zhu, Z., Liang, L., Yu, B., & Cao, W. (2018). Mapping annual urban dynamics (1985–2015) using time series of Landsat data. Remote Sensing of Environment, 216, 674–683.
Ma, S., & Huang, J. (2017). A concave pairwise fusion approach to subgroup analysis. Journal of the American Statistical Association, 112(517), 410–423.
Ma, S., Huang, J., & Zhang, Z. (2016). Exploration of heterogeneous treatment effects via concave fusion. arXiv preprint arXiv:1607.03717.
Molina, I., & Marhuenda, Y. (2015). sae: An R package for small area estimation. The R Journal, 7(1), 81–98.
Nusser, S. M., & Goebel, J. J. (1997). The national resources inventory: A long-term multi-resource monitoring programme. Environmental and Ecological Statistics, 4(3), 181–204.
Ohio Supercomputer Center (2016). Owens supercomputer. http://osc.edu/ark:/19495/hpc6h5b1.
Pfeffermann, D. (2013). New important developments in small area estimation. Statistical Science, 28(1), 40–68.
Rao, J. N., & Molina, I. (2015). Small area estimation. Hoboken, New Jersey: John Wiley & Sons.
Wang, H., Li, R., & Tsai, C.-L. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94(3), 553–568.
Wang, X., Zhu, Z., & Zhang, H. H. (2019). Spatial automatic subgroup analysis for areal data with repeated measures. arXiv preprint arXiv:1906.01853.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2), 894–942.
Zhu, R., Zou, G., Liang, H., & Zhu, L. (2016). Penalized weighted least squares to small area estimation. Scandinavian Journal of Statistics, 43(3), 736–756.

Appendix

In this appendix, details of the partial derivatives are provided. The elements of the score are

$\begin{aligned} \frac{\partial l}{\partial σ_{v}^{2}} & = - \frac{1}{2} \sum_{i = 1}^{M} \frac{1}{n_{i}} \operatorname{tr}\left( Σ_{i}^{-1} \frac{\partial Σ_{i}}{\partial σ_{v}^{2}} \right) - \frac{1}{2} \sum_{i = 1}^{M} \frac{1}{n_{i}} (y_{i} - x_{i}^{T} β_{i})^{T} \frac{\partial Σ_{i}^{-1}}{\partial σ_{v}^{2}} (y_{i} - x_{i}^{T} β_{i}), \\ \frac{\partial l}{\partial σ_{ϵ}^{2}} & = - \frac{1}{2} \sum_{i = 1}^{M} \frac{1}{n_{i}} \operatorname{tr}\left( Σ_{i}^{-1} \frac{\partial Σ_{i}}{\partial σ_{ϵ}^{2}} \right) - \frac{1}{2} \sum_{i = 1}^{M} \frac{1}{n_{i}} (y_{i} - x_{i}^{T} β_{i})^{T} \frac{\partial Σ_{i}^{-1}}{\partial σ_{ϵ}^{2}} (y_{i} - x_{i}^{T} β_{i}), \end{aligned}$

where

$\begin{aligned} \frac{\partial Σ_{i}}{\partial σ_{v}^{2}} & = 1_{n_{i}} 1_{n_{i}}^{T}, \qquad \frac{\partial Σ_{i}}{\partial σ_{ϵ}^{2}} = I_{n_{i}}, \\ \frac{\partial Σ_{i}^{-1}}{\partial σ_{v}^{2}} & = - \frac{1}{(σ_{ϵ}^{2} + n_{i} σ_{v}^{2})^{2}} 1_{n_{i}} 1_{n_{i}}^{T}, \\ \frac{\partial Σ_{i}^{-1}}{\partial σ_{ϵ}^{2}} & = \frac{1}{(σ_{ϵ}^{2})^{2}} \left[ \frac{σ_{v}^{2} (2 σ_{ϵ}^{2} + n_{i} σ_{v}^{2})}{(σ_{ϵ}^{2} + n_{i} σ_{v}^{2})^{2}} 1_{n_{i}} 1_{n_{i}}^{T} - I_{n_{i}} \right]. \end{aligned}$

The matrix $I$ can be written as

$I = \begin{pmatrix} I_{11} & I_{12} \\ I_{21} & I_{22} \end{pmatrix},$

where

$\begin{aligned} I_{11} & = \frac{1}{2} \sum_{i = 1}^{M} \frac{1}{n_{i}} \operatorname{tr}\left( Σ_{i}^{-1} \frac{\partial Σ_{i}}{\partial σ_{v}^{2}} Σ_{i}^{-1} \frac{\partial Σ_{i}}{\partial σ_{v}^{2}} \right) = \frac{1}{2} \sum_{i = 1}^{M} \frac{n_{i}}{(σ_{ϵ}^{2} + n_{i} σ_{v}^{2})^{2}}, \\ I_{12} & = I_{21} = \frac{1}{2} \sum_{i = 1}^{M} \frac{1}{n_{i}} \operatorname{tr}\left( Σ_{i}^{-1} \frac{\partial Σ_{i}}{\partial σ_{v}^{2}} Σ_{i}^{-1} \frac{\partial Σ_{i}}{\partial σ_{ϵ}^{2}} \right) = \frac{1}{2} \sum_{i = 1}^{M} \frac{1}{(σ_{ϵ}^{2} + n_{i} σ_{v}^{2})^{2}}, \\ I_{22} & = \frac{1}{2} \sum_{i = 1}^{M} \frac{1}{n_{i}} \operatorname{tr}\left( Σ_{i}^{-1} \frac{\partial Σ_{i}}{\partial σ_{ϵ}^{2}} Σ_{i}^{-1} \frac{\partial Σ_{i}}{\partial σ_{ϵ}^{2}} \right) = \frac{1}{2 (σ_{ϵ}^{2})^{2}} \sum_{i = 1}^{M} \left[ 1 - \frac{σ_{v}^{2} (2 σ_{ϵ}^{2} + n_{i} σ_{v}^{2})}{(σ_{ϵ}^{2} + n_{i} σ_{v}^{2})^{2}} \right]. \end{aligned}$
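The closed forms of $I_{11}$, $I_{12}$ and $I_{22}$ depend only on the area sample sizes and the current variance parameters, so one scoring update of $τ$ reduces to solving a 2 × 2 system. The sketch below is plain Python with hypothetical function names; the score vector is taken as an input, and no safeguard keeping the variances positive is included, which a practical implementation would probably need.

```python
def information_matrix(n_sizes, s2v, s2e):
    """Closed-form entries (I11, I12, I22) summed over areas with sizes n_sizes."""
    I11 = 0.5 * sum(n / (s2e + n * s2v) ** 2 for n in n_sizes)
    I12 = 0.5 * sum(1 / (s2e + n * s2v) ** 2 for n in n_sizes)
    I22 = 0.5 / s2e ** 2 * sum(
        1 - s2v * (2 * s2e + n * s2v) / (s2e + n * s2v) ** 2 for n in n_sizes
    )
    return I11, I12, I22

def scoring_step(tau, score, info):
    """One update tau + I^{-1} s, with the 2 x 2 inverse written out explicitly."""
    I11, I12, I22 = info
    det = I11 * I22 - I12 * I12
    step_v = (I22 * score[0] - I12 * score[1]) / det
    step_e = (I11 * score[1] - I12 * score[0]) / det
    return tau[0] + step_v, tau[1] + step_e

# One area with n = 2 and both variances equal to 1.
info = information_matrix([2], 1.0, 1.0)
tau_new = scoring_step((1.0, 1.0), (0.0, 0.0), info)  # zero score leaves tau unchanged
```

For the single area above, the entries are 1/9, 1/18 and 5/18, which agree with evaluating the trace expressions directly for the 2 × 2 matrix $Σ_{i}$.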
