Predicting owner-occupied housing values using machine learning: an empirical investigation of California census tracts data: Journal of Property Research: Vol 38, No 4

ABSTRACT

In this paper, we introduce machine-learning (ML) methods to evaluate one of the key concepts of real estate analysis – the prediction of housing prices in the presence of a large number of covariates. We use several supervised ML tools that are based on regularisation methods – notably Ridge, LASSO, and Elastic Net regressions – and discuss their relative performance in comparison to conventional OLS-based methods. Our empirical results show that the supervised ML methods provide a comprehensive description of the determinants of owner-occupied housing values in the census tracts of California. We find that, compared to the familiar worlds of OLS and WLS, the Ridge, LASSO, and Elastic Net regressions provide relatively better out-of-sample predictions. Among the benefits of shrinkage-based ML methods are their ability to resolve such issues as variable selection and overfitting.

Disclosure statement

No potential conflict of interest was reported by the author.

Notes

1. According to the guidelines of IAAO (Citation2013, p. 5), mass appraisal requires complete and accurate data, effective valuation models, and proper resource management.

2. An incomplete list of works in this area includes Borst and McCluskey (Citation2008, Citation2011), Bourassa et al. (Citation2007), Dubin (Citation1998), Goodman and Thibodeau (Citation2003, Citation2007), Hausler, Ruscheinsky and Lang (Citation2018), Lin and Mohan (Citation2011), Páez (Citation2005), Perez-Rave et al. (Citation2019), Worzala et al. (Citation1995), and Xu (Citation2008).

3. Traditionally, hedonic regression models used for mass appraisal have employed a nonlinear MRA framework to explain the variability of housing prices at the disaggregated level.

4. The advent of cloud computing and the increasing availability of ML codes in R and Python languages have also created a favourable environment for large exploratory data analysis that is suitable for real estate analysis.

5. For example, Pavlov (Citation2000) suggests that because of model misspecifications and the influence of omitted variables, the implicit prices of housing attributes can be misleading.

6. Some recent work examines the time-series forecastability of housing prices under a data-rich environment. For example, Bork and Møller (Citation2018) discuss how to reduce the dimension of a large set of predictor variables by using principal component analysis, partial least squares (PLS), and sparse PLS methods.

7. Goodman and Thibodeau (Citation2003) utilise 28,561 single-family transactions from Dallas County and evaluate their predictive accuracy using three alternative housing submarket constructions: zip codes, census tracts, and a hierarchical model. In contrast, Goodman and Thibodeau (Citation2007) analyse 44,000 sales of single-family properties and examine two alternative procedures for delineating housing submarkets within the Dallas metropolitan-area market. The first procedure combines spatially adjacent census block groups and the second procedure allows spatial discontinuities.

8. Other related works that discuss various spatial statistical methods include Case et al. (Citation2004), who use a large sample of 50,000 transactions from Fairfax County, Virginia. Unlike the work of Bourassa et al. (Citation2007), Case et al. (Citation2004) explore out-of-sample predictive accuracy by using only one split of the data. There are complex methodological issues associated with penalised regressions in the presence of spatial dependence, and thus a discussion of various spatial regression models using regularisation methods is beyond the current scope.

9. For example, the total number of census tracts increases from 5,732 in 1980 to 5,858 in 1990, and further to 7,049 in 2000. For many reasons, the number of census tracts during the pre-2000 years was considerably lower. In 1970 and 1980 the entire state of California had not yet been tracted. In 1990, in addition to the respective tracts from 1970 and 1980, data from enumeration districts (EDs) and census county divisions (CCDs) have also been incorporated.

10. However, as mentioned in the general Census guidelines, the converted 2000 tract data cannot be considered official U.S. Census Bureau data or California Department of Finance data. Even within 7,904 census tracts for which we have non-missing observations in 2010, 18 census tracts have $0 median values and 4 census tracts have $9,999 median values for the converted 2000 series.

11. Note that the matrix of prediction variables $X$ may include an intercept term. In practice, if we centre $X$ and $y$ before computing the regression, the intercept becomes zero.

12. The specifications we use to capture geographical dependence are comparable to those of Case et al. (Citation2004), Dubin (Citation1998), and Xu (Citation2008).

13. Apart from the predictive accuracy issue, another shortcoming of both OLS and WLS is the lack of interpretive ability. In a high-dimensional regression setting, where the number of predictors is large, we may want to implement a fitting procedure with a subset of important predictors. Interestingly, the above-mentioned issues have little to do with classical regression model assumptions.

14. In essence, the Ridge regression is a continuous shrinkage method that retains all the covariates but penalises large coefficients through the L2-norm. Unlike the Ridge regression, which keeps all the predictors in the presumed model, in the LASSO regression, the presence of multicollinearity results in the dropping of certain predictors while retaining others.

15. The first part of the loss function in Ridge regression is the same as the RSS of OLS. The second part of the loss function involves the regularisation of parameters because it penalises larger coefficient values. It is noticeable that as $\lambda \to 0$, $\hat{\beta}_{\text{Ridge}} \to \hat{\beta}_{\text{OLS}}$, and as $\lambda \to \infty$, $\hat{\beta}_{\text{Ridge}} \to 0$.

16. As mentioned by Xu (Citation2008), incorporating absolute location using spatial coordinates in conjunction with the polynomial expansion approach can capture heterogeneity in housing attribute prices.

17. It is important to note that while the WLS regressions correct for issues such as cross-sectional heteroskedasticity, one has to be careful about interpreting the high value of $\bar{R}^{2}$ associated with such regressions, which may not guarantee the success of the underlying model’s out-of-sample predictive capacity. We highlight this issue in a future subsection.
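
The limiting behaviour described in Note 15 can be checked numerically. The following is a minimal sketch, not the paper's code: it assumes the Ridge loss is RSS plus $\lambda$ times the sum of squared coefficients, uses the resulting closed form $(X'X + \lambda I)^{-1} X'y$, and runs on small synthetic data with two nearly collinear predictors. The data are centred first, so no intercept is needed (Note 11). The helper names `solve_linear`, `ridge`, and `centre`, and all data values, are hypothetical and chosen only for illustration.

```python
import random


def solve_linear(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ridge(X, y, lam):
    """Minimise RSS + lam * sum(beta_j ** 2) via (X'X + lam * I) beta = X'y."""
    p = len(X[0])
    xtx = [[sum(row[j] * row[k] for row in X) + (lam if j == k else 0.0)
            for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    return solve_linear(xtx, xty)


def centre(values):
    """Subtract the sample mean so that no intercept term is required."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Synthetic data (an assumption for illustration, not the census-tract data).
random.seed(0)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.9 * a + 0.1 * random.gauss(0, 1) for a in x1]  # nearly collinear with x1
y = [2.0 * a - 1.0 * b + random.gauss(0, 0.5) for a, b in zip(x1, x2)]

x1, x2, y = centre(x1), centre(x2), centre(y)
X = [list(row) for row in zip(x1, x2)]

# lam = 0 reproduces OLS; larger lam shrinks the coefficient vector towards zero.
lams = [0.0, 1.0, 100.0, 1e8]
betas = [ridge(X, y, lam) for lam in lams]
norms = [sum(b * b for b in beta) ** 0.5 for beta in betas]

for lam, beta, norm in zip(lams, betas, norms):
    print(f"lambda={lam:>12g}  beta={[round(b, 4) for b in beta]}  L2 norm={norm:.6f}")
```

With `lam = 0` the function returns the OLS solution, and the L2 norm of the estimate falls as `lam` grows, which is the continuous shrinkage that Note 14 attributes to Ridge. Library implementations may scale the penalty differently, so a given `lam` here need not correspond to the same tuning value elsewhere.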

Additional information

Notes on contributors

Prodosh E. Simlai

Dr. Prodosh E. Simlai is a Professor of Economics at the Department of Economics & Finance, Nistler College of Business and Public Administration, University of North Dakota, USA. Dr. Simlai received his M.S. in Finance and Ph.D. in Economics from the University of Illinois at Urbana-Champaign, USA. His research interests include financial markets, real estate, and applied econometrics. Dr. Simlai’s work has appeared in both leading general interest and field journals, including the Accounting Research Journal, Business Economics, Finance Research Letters, International Review of Financial Analysis, Journal of Asset Management, Journal of Derivatives and Hedge Funds, Journal of Real Estate Finance and Economics, Quarterly Review of Economics and Finance, Studies in Economics and Finance, and Research in Finance, among others.
