Measuring Housing Vitality from Multi-Source Big Data and Machine Learning: Journal of the American Statistical Association: Vol 117, No 539

Abstract

Measuring timely, high-resolution socioeconomic outcomes is critical for policymaking and evaluation, but such measurements are hard to obtain reliably. With the help of machine learning and cheaply available data such as social media and nightlight, it is now possible to predict such indices at fine granularity. This article demonstrates an adaptive way to measure the time trend and spatial distribution of housing vitality (number of occupied houses) with the help of multiple easily accessible datasets: energy, nightlight, and land-use data. We first identified the high-frequency housing occupancy status from energy consumption data and then matched it with the monthly nightlight data. We then introduced the Factor-Augmented Regularized Model for prediction (FarmPredict) to deal with the dependence and collinearity among predictors by effectively lifting the prediction space, an approach that is suitable for most machine learning algorithms. The heterogeneity issue in big data analysis is mitigated through the land-use data. FarmPredict allows us to extend the regional results to the city level, with a 76% out-of-sample explanation of the spatial and temporal variation in house usage. Since energy is indispensable for life, our method is highly transferable, requiring only publicly accessible data. Our article provides an alternative approach, based on statistical machine learning, to predict socioeconomic outcomes without relying on existing census and survey data. Supplementary materials for this article are available online.

Supplementary Materials

The supplemental materials contain additional details on the distribution of multiple data sets, estimation of regional housing active status, statistical machine learning techniques, and further results on model comparisons. They further augment the data, methods, and results presented in the main article.

Acknowledgments

The authors gratefully acknowledge excellent research assistance by Xinyu Su.

Data Availability Statement

The energy data is confidential and restricted by nondisclosure agreements. Therefore, we provide a facsimile dataset containing 1% of the information in the original dataset. The nightlight and land-use data are public. For data availability and replication needs, please contact the corresponding authors.

Disclosure Statement

The authors report there are no competing interests to declare.

Notes

1 In our paper, the housing price index (HPI) is the conditioned house selling price estimated by the model in Fang et al. (2016).

2 For example, “nightlight” data, one of the most popular datasets, has been widely applied to estimate economic statistics such as poverty (Jean et al. 2016), economic growth (Chen and Nordhaus 2011), and population (Wardrop et al. 2018). Other datasets, including imagery, mobile phone, social media, and commercial online data, have also been used to predict the unemployment rate (Toole et al. 2015), political attitudes (Gebru et al. 2017), personal attributes (Kosinski, Stillwell, and Graepel 2013), and economic activities (Glaeser, Kim, and Luca 2017).

3 There are two reasons to use monthly estimates instead of higher-frequency daily ones. First, the high-frequency daily indices could be very noisy. For instance, a house could be vacant simply because the user is out of town for work or holiday. So we transformed the daily “vitality” into monthly indices in this article. See Section 3 for details. Second, the nightlight data is aggregated monthly. To estimate the vitality indices for all of Shanghai, we therefore summarized the daily results into monthly ones.

4 The reasons for using energy data are twofold. First, as a necessary resource for living, energy can reflect the housing occupancy status reliably. Second, energy consumption data is collected by local energy departments for billing purposes, making it much easier to obtain.

5 More precisely, it covers a square of 430 m × 430 m on the earth. We will use a 430-meter square for short.

6 Again, this refers to a square of 30 m × 30 m.

7 All status variables are binary, with 0 for vacant and 1 for occupied.

8 The inertia index is the total within-cluster sum of squares, that is, the sum of squared distances from each observation to its corresponding cluster center. For DBI and S-Dbw, see Halkidi and Vazirgiannis (2001) and Singh et al. (2020) for details.
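As a minimal illustration of the inertia index described in note 8, the sketch below sums the squared Euclidean distances from each observation to its assigned cluster center. The function name `inertia` and the toy points, centers, and labels are hypothetical and are not taken from the article.

```python
def inertia(points, centers, labels):
    """Total within-cluster sum of squares.

    points  : list of observations (tuples of floats)
    centers : list of cluster centers (tuples of floats)
    labels  : labels[i] is the index of the center assigned to points[i]
    """
    total = 0.0
    for x, k in zip(points, labels):
        c = centers[k]
        # squared Euclidean distance from the observation to its center
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return total


# Hypothetical toy data: two clusters in the plane.
toy_points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
toy_centers = [(1.0, 0.0), (10.0, 10.0)]
toy_labels = [0, 0, 1]
toy_inertia = inertia(toy_points, toy_centers, toy_labels)
```

A lower inertia for a given number of clusters indicates tighter clusters, which is why it is commonly used to compare clustering solutions.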

9 The GMM is fitted to the logarithmic energy data, while this figure presents the distribution of the original data in each cluster, which follows a log-normal distribution.

10 In an indoor interview, respondents answered several questions about their energy consumption, covering socioeconomic attributes, appliances, and other related information. A detailed data description can be found in Li, Pizer, and Wu (2018).

11 We obtained information about the composition of electric appliances from the survey and then used data on the best-selling models at JD.com, one of the biggest online stores in China.

12 Fan et al. (2020) suggest taking C = 1. It is well known that the largest empirical eigenvalues are biased upward. The correction is as follows (Bai and Ding 2012): let $\hat{\lambda}_{j}$ be the empirical eigenvalues and $p$ be the dimension. For a given $j$, define
$$m_{n,j}(z) = (p-j)^{-1} \Big[ \sum_{\ell=j+1}^{p} (\hat{\lambda}_{\ell} - z)^{-1} + \big( (3\hat{\lambda}_{j} + \hat{\lambda}_{j+1})/4 - z \big)^{-1} \Big],$$
$$\underline{m}_{n,j}(z) = -(1 - \rho_{j,n-1}) z^{-1} + \rho_{j,n-1}\, m_{n,j}(z),$$
with $\rho_{j,n-1} = (p-j)/(n-1)$. The corrected eigenvalue of $\hat{\lambda}_{j}$ is defined as $\hat{\lambda}_{j}^{C} = -1 / \underline{m}_{n,j}(\hat{\lambda}_{j})$.
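The correction in note 12 can be evaluated directly from a sorted list of empirical eigenvalues. The sketch below is only an illustration of that formula: the function name `corrected_eigenvalue`, the 1-based index convention for `j`, and the toy eigenvalues are assumptions made here, and the code assumes that the j-th eigenvalue is distinct from all smaller ones so that no denominator vanishes.

```python
def corrected_eigenvalue(eigvals, j, n):
    """Bias-corrected j-th eigenvalue following the formula in note 12.

    eigvals : empirical eigenvalues sorted in decreasing order
    j       : 1-based index of the eigenvalue to correct (requires j < p)
    n       : sample size
    """
    p = len(eigvals)
    lam_j = eigvals[j - 1]
    lam_next = eigvals[j]          # lambda_{j+1}
    z = lam_j                      # the formula is evaluated at z = lambda_j

    # sum over l = j+1, ..., p of (lambda_l - z)^{-1}
    tail = sum(1.0 / (lam - z) for lam in eigvals[j:])
    # extra term ((3*lambda_j + lambda_{j+1})/4 - z)^{-1}
    extra = 1.0 / ((3.0 * lam_j + lam_next) / 4.0 - z)
    m = (tail + extra) / (p - j)

    rho = (p - j) / (n - 1)
    m_under = -(1.0 - rho) / z + rho * m
    return -1.0 / m_under


# Hypothetical toy spectrum: one large eigenvalue followed by smaller ones.
toy_eigvals = [10.0, 4.0, 1.0, 1.0, 1.0]
toy_corrected = corrected_eigenvalue(toy_eigvals, j=1, n=50)
```

For positive eigenvalues and $\rho_{j,n-1} < 1$, the corrected value is smaller than the empirical one, consistent with the upward bias the note describes.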

13 The computation can be done expeditiously, since ${\hat{B}}^{T} \hat{B}$ is a diagonal matrix, with the diagonal elements being the k largest eigenvalues of the matrix $X X^{T} / n$ .
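The diagonal structure in note 13 can be checked numerically. The sketch below assumes a p × n data matrix `X` and a loading estimate built as the top-k eigenvectors of $X X^{T}/n$ scaled by the square roots of their eigenvalues; this construction, the least-squares factor step, and the simulated data are assumptions for illustration and may differ from the article's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k = 6, 200, 2                      # hypothetical dimensions
X = rng.standard_normal((p, n))          # simulated p x n data matrix

S = X @ X.T / n                          # the matrix X X^T / n from the note
eigvals, eigvecs = np.linalg.eigh(S)     # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest
lam = eigvals[order]

# Assumed loading estimate: eigenvectors scaled by sqrt(eigenvalue).
B_hat = eigvecs[:, order] * np.sqrt(lam)

# B_hat^T B_hat is diagonal with the k largest eigenvalues on the diagonal.
gram = B_hat.T @ B_hat

# Because gram is diagonal, the least-squares solve (B^T B)^{-1} B^T X
# reduces to dividing each row of B^T X by the matching eigenvalue.
F_hat = (B_hat.T @ X) / lam[:, None]
```

Avoiding a general matrix inverse in this step is what makes the computation cheap when k is small relative to p.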

14 The full R² is very close to the R² of leave-one-out tests, except that each time we leave out 10% of the data instead of one observation.

15 We also estimated the slope parameter between the price index and the standardized number of occupied houses. The OLS and fixed-effect regression models yield significant slope estimates of 0.195 (0.083) and 0.220 (0.084), respectively.

Additional information

Funding

This study was supported by the Chinese National Natural Science Foundation (No. 71991471), the National Science Fund for Distinguished Young Scholars (No. 71925010), the Shanghai Pujiang Scholar Project (21PJC010), the China Postdoctoral Science Foundation funded project (No. 2019M650076, No. 2020T130107), the Paul and Marcia Wythes Center on Contemporary China at Princeton University, and the Science Project of the State Grid Shanghai Municipal Electric Power Company.
