Bayesian Modelling of a Standard House Configuration Model to Analyze Housing Feature Impacts in Newly Developed Suburbs without Historical Sales

Abstract

There is a recent trend of entire new suburbs being designed and built to solve the housing crisis all around the world. The aim of this study is to anticipate the value of housing features in newly developed suburbs using a Bayesian approach. We present the Standard House Configuration Model, where housing feature impacts are analyzed relative to the configuration of a standard house for easy interpretation. The benefit of using a Bayesian approach is that we describe housing feature impacts with highest density intervals, which more closely resemble the intuitive understanding of probability intervals than statistical confidence intervals. Our case study on newly developed suburbs in Auckland, New Zealand, demonstrates that the posterior distributions from our model effectively capture the complex relationship between housing features and sale price (R² value of 93%). The proposed model is cross-validated on four recently developed suburbs in Auckland. For comparable suburbs, our model is able to make reasonably accurate price predictions without using any historical sale records from the target suburb. This indicates that the insights into housing feature impacts are applicable to other new suburbs still in the planning stage and, therefore, have the potential to support future suburb developments.

Introduction

Housing shortage is a growing crisis all around the world. In addition to the supply and demand issue, new housing developments are often priced for upper-class buyers and overseas investors, and therefore do not resolve the crisis of housing shortage for middle-class buyers (Kusisto & Grant, 2019). Countries are looking for new housing solutions to satisfy the demands of middle-class families. One new solution is the coordinated development of entirely new suburbs that support a local community. Developing an entirely new suburb is time-consuming because of challenges such as planning, land allocation, and funding (Ambrose, 2019), and it is difficult to predict the expected price and housing demands of a new suburb far into the future. For example, the Old Oak development in West London aims to create a new London suburb that provides 25,500 homes and 65,000 jobs, but the entire project will take an estimated 20 to 30 years to complete (Braidwood, 2017). Demands for different housing features are quickly changing as a new generation of middle-class families emerges. It is becoming increasingly difficult to accurately identify specific housing requirements that create sustainable suburbs for a local community of new middle-class families (Ambrose, 2019). If we are able to anticipate the implicit value of housing features before a suburb has been built, we will be able to assist housing developers with their decision-making by making price predictions for new suburbs with no historical sales.

House price prediction using historical sales data from the same suburb has been thoroughly explored by past research (Bourassa et al., 2007; Limsombunchai, 2004), but there is a distinct lack of studies that focus on modelling entirely new suburbs with no historical sales. Table 1 provides an overview of the modelling approaches from a selection of past studies that employ machine learning models to make price predictions, and include at least one housing feature as an explanatory variable. Common methods for modelling house prices include linear hedonic models (Wen et al., 2004), semi-parametric regression models (Bao & Wan, 2004), spatial autoregressive models (Pace et al., 2000), random forest models (Yoo et al., 2012), and artificial neural networks (Limsombunchai, 2004). Each study presented in Table 1 is representative of a common modelling approach.

Table 1. Overview of a small sample of past studies representing the most common modelling approaches for house price prediction.

According to the hedonic price theory, each housing feature is associated with an implicit price, and the market price of a property is the sum of all implicit prices of the housing features (Wen et al., 2004). The main advantage of traditional linear hedonic models is that the impacts of individual housing features on sale price are easy to interpret, but these models are less adaptive to non-linear relationships (Bao & Wan, 2004; Wen et al., 2004). Capturing the non-linear relationships between housing features and sale value is a critical aspect of this research.

Semi-parametric regressions and spatial autoregressive models are not suitable for analyzing entirely new suburbs because historical sales from the same neighbourhood are unavailable (Liao & Wang, 2012). Semi-parametric regressions generally consist of a parametric component that captures the effects of housing features, while the non-parametric component captures spatial variation. For example, Bao and Wan (2004) use spline smoothing to allow feature coefficients to vary smoothly in space (i.e. to allow coefficients to be different at different property locations). Spatial variation of house prices often exhibits positive autocorrelation, meaning that houses which are geographically close together are more likely to have similar prices (Liao & Wang, 2012). Spatial autoregressive models incorporate a contiguity matrix that explicitly models the error structure to allow for spatial autocorrelation (Bourassa et al., 2007; Brunsdon et al., 1999). Pace et al. (2000) conducted a case study on property sales in Baton Rouge, Louisiana, USA, from 1984 to 1992 to demonstrate the effectiveness of autoregressive models for capturing spatial variations. Limitations of the spatial autoregressive models include the assumption of a linear relationship, and performance dependency on the choice of the predefined contiguity matrix (Bency et al., 2017).

Spatial variation of 4880 property sales in Auckland, New Zealand, in 1996 has been analyzed by Bourassa et al. (2007) using two spatial regression models. Bourassa et al. (2007) compare the performance of a statistical model that explicitly models error structure, to the performance of a linear model that incorporates geographical housing submarkets as explanatory variables. The geographical housing submarkets used as explanatory variables are predefined by official property valuers, and these should theoretically form relatively homogeneous subgroups. Explicitly modelling for error structure is a relatively flexible approach that allows variations between individual houses and is especially useful for heterogeneous housing markets. For the Auckland case study, the linear model with geographical housing submarkets as variables outperforms the statistical model that explicitly models error structure. This implies that the geographical submarkets defined by valuers adequately capture the spatial variation of Auckland house prices, and complex statistical methods are not always required.

Non-parametric algorithms such as random forests and artificial neural networks often produce good forecasting results, but the impacts of individual housing features are more difficult to interpret (Biau & Scornet, 2016; Liu et al., 2018). Yoo et al. (2012) use a random forest not only for house price prediction, but also to select relevant variables for individual neighbourhoods. Different sets of variables are selected for each neighbourhood, so defining the correct neighbourhood configuration is a critical step. Limsombunchai (2004) demonstrates the superior predictive ability of an artificial neural network compared to a linear hedonic price model using a case study on 200 houses in Christchurch, New Zealand. The study does not model house price changes over time. The actual sale prices are unavailable, so the estimated values are used as the target variable instead. Another limitation of Limsombunchai’s study is that the predictive ability of a neural network is dependent on the chosen number of hidden layers and the number of nodes in each layer, but automatic optimization of these model parameters is not supported. Limsombunchai acknowledges that neural networks are not guaranteed to outperform linear models without a process of trial and error. The ‘black box’ nature of neural networks means that the contribution of each housing feature to the final sale price cannot be easily derived from the model (Liu et al., 2018). This is a major issue for developers who wish to understand buyer preferences for housing features in a new suburb.

None of the past studies specifically targets new suburbs before or shortly after being built (see Table 2 for analysis criteria), and very few papers have analyzed the housing feature impacts on sale prices in Auckland suburbs. Filippova and Rehm (2011) conducted studies on many social and environmental amenity values (e.g. school zones, proximity to cell phone towers) in the Auckland region with hedonic price models. The study by Rehm and Filippova (2008) on the impact of geographically defined school zones on house prices in Auckland uses standard house price as a reference to express the premiums and discounts of the case study suburbs in a way that is easy to interpret.

Table 2. Analysis criteria for a selection of past research that employs machine learning models to make price predictions or analyze the Auckland housing market.

This paper focuses on analyzing newly developed suburbs in Auckland to understand housing feature impacts on sale price in these new suburbs using a Bayesian approach. Past studies have already demonstrated that Bayesian models are very effective for estimating house prices and the uncertainty related to the model parameters. For example, Clapp et al. (2002) propose capturing spatial variation of house prices using a semi-parametric model with local polynomial smoothing combined with Bayesian estimation. The main advantage of this model compared to other semi-parametric regressions is that Bayesian estimation provides inference in the form of posterior distributions for all model parameters instead of point estimates, and uncertainty can therefore be quantified. Bayesian estimation combined with spatial regression is also explored by Wheeler et al. (2014). They apply a Bayesian model with spatially varying coefficients to houses sold in Toronto, Canada, and the proposed Bayesian model outperforms both the traditional linear models and the geographically weighted regression (GWR) model. The Bayesian model with spatially varying coefficients not only provides better predictions than the GWR and the linear regression, but also provides complete inference on model parameters and predictions. Limitations of the Bayesian approach include the high computational cost for most numerical simulation methods. In the context of our research, a Bayesian approach allows us to quantify the uncertainty associated with individual housing feature impacts on sale price. We will demonstrate the application of the Bayesian approach outlined by Bishop (2006), which relies on analytic solutions instead of numerical simulations, and therefore avoids the high computational costs generally associated with numerical simulation methods.

A special focus is the formulation of the Bayesian model, such that developers can easily interpret the results and integrate them into their planning. Defining a reliable point of reference for our model is critical for conveying the results to non-experts in machine learning, such as housing developers (Manderbacka et al., 2003; Sargent-Cox et al., 2010). Our chosen point of reference is the price of a standard house in the new suburbs over time, similar to Rehm and Filippova's (2008) method of using a reference standard house to compare price premiums of individual suburbs. Rehm and Filippova's (2008) model includes housing features such as floor area and site area, but the effects are not analyzed relative to the standard house.

We expand on previous research to include features such as number of bedrooms, bathrooms, and garages, which have not been a focus in past research on Auckland housing (Fernandez, 2019). The literature review conducted by Fernandez (2019) on the application of hedonic price models to the New Zealand housing market also identifies the scope for matching housing features to sale prices and the application of Bayesian approaches. Our model also expands on Rehm and Filippova's (2008) method of estimating a reference standard house price for every two-year period by modelling the monthly standard house price using a moving average, thereby recognizing that price change over time is continuous. Modelling housing feature impacts relative to a standard house will create a solution to our regression problem that is easy to interpret and will be beneficial to developers. The main objectives of this paper are to:

Separate the effects of price change over time due to market dynamics from the effects of individual housing features.
Understand the non-linear impact of individual housing features on sale price relative to the configuration of a reference standard house.
Apply Bayesian modelling to newly developed suburbs in Auckland as a case study.
Validate the robustness of our proposed model by making out-of-sample price predictions for new suburbs without historical sales.

The remainder of this paper is structured as follows. Sec. 2 describes in detail the proposed Bayesian approach of modelling housing feature impacts relative to the configuration of a standard house. Sec. 3 shows the results from applying the proposed model to property sales in four newly developed suburbs in Auckland, New Zealand: Fairview Heights, Hobsonville, Oteha, and Stonefields. Conclusions about the deduced housing feature impacts in this study and the directions of future applications are given in Sec. 4.

Modelling Housing Feature Impacts in New Suburbs

This section describes the proposed approach to model house sale values in new suburbs using housing examples from Auckland, New Zealand. Our modelling approach is designed with specific emphasis on understanding the impacts of individual housing features that can be controlled by developers (e.g. number of bedrooms, number of bathrooms, etc.).

The rapid increase in Auckland house prices over the years is highly non-linear, as shown by the annual median price plotted in Figure 1. The monthly median sale price of houses in Auckland is calculated with the housing data provided by CoreLogic for research purposes (CoreLogic, 2020). Due to economic uncertainties, it is very difficult to forecast future house price trends over a long time horizon. This study separates the effects of price change over time from the effects of individual housing features by decomposing the problem into two components:

Price of a standard house over time.
Impacts of individual housing features on sale price relative to the standard house.

Figure 1. Median sale price of houses in Auckland from 1990 to 2018, calculated from the housing data provided by CoreLogic (2020). The increase in price after 2010 is rapid and non-linear.

In the context of our research, a standard house is a manually defined reference point for controlling the effects of market dynamics and individual housing features. The features of a standard house are defined by average statistics, such as the median value of house features in the studied suburbs. The first component estimates the price of a standard house over time to capture the effects of market dynamics. Estimating the price of a standard house over time means that the non-linear trend of price increase over time can be incorporated into the same model as other housing features. The second component models how individual housing features impact sale price relative to the estimated price of a standard house from the first component. The estimated housing feature impacts from our model will be easy to interpret for housing developers because they are presented relative to a clear point of reference. A flow chart showing an overview of our proposed model’s structure and process is presented in Figure 2.

Figure 2. Flow chart showing a summary of the process for building a Standard House Configuration Model, given a set of housing data and the associated housing features. The standard house is a manually defined reference point, and all feature impacts are modelled relative to the standard house configuration. The discrete housing features include both ordinal variables, which refer to data that has a clear ordering of the categories (e.g. number of bedrooms), and nominal variables, which refer to data that has no clear ordering (e.g. house type).

Features of a standard house incorporated into our model are floor area $\bar{A}$, land value $\bar{L}$, number of bedrooms $\bar{B}$, number of bathrooms $\bar{C}$, number of garages $\bar{G}$, a Boolean value $\bar{F}$ that indicates whether a house has at least one free-standing garage, and house type $\bar{H}$. Our standard house is described by a tuple of constants defining its configuration:

$(\bar{A}, \bar{L}, \bar{B}, \bar{C}, \bar{G}, \bar{F}, \bar{H})$ (1)

For every sale record with index i, we record the sale price P_i, floor area A_i, estimated land value L_i, number of bedrooms B_i, number of bathrooms C_i, number of garages G_i, a Boolean value F_i that indicates whether a house has at least one free-standing garage, house type H_i, suburb S_i, and sale month M_i. A sale record with index i is described by a tuple of values:

$(P_{i}, A_{i}, L_{i}, B_{i}, C_{i}, G_{i}, F_{i}, H_{i}, S_{i}, M_{i})$ (2)

We call our model the Standard House Configuration Model (SHCM). We model the sale price P_i relative to the configuration ($\bar{A}$, $\bar{L}$, $\bar{B}$, $\bar{C}$, $\bar{G}$, $\bar{F}$, $\bar{H}$) of a standard house with the following formula:

$\begin{aligned} \log(P_{i}) = {} & w_{0} + w_{A} (A_{i} - \bar{A}) + w_{L} (\log(L_{i}) - \log(\bar{L})) + \sum_{b \in B \setminus \{\bar{B}\}} w_{b}^{B} 1_{B_{i} = b} \\ & + \sum_{c \in C \setminus \{\bar{C}\}} w_{c}^{C} 1_{C_{i} = c} + \sum_{g \in G \setminus \{\bar{G}\}} w_{g}^{G} 1_{G_{i} = g} \\ & + \sum_{f \in F \setminus \{\bar{F}\}} w_{f}^{F} 1_{F_{i} = f} + \sum_{h \in H \setminus \{\bar{H}\}} w_{h}^{H} 1_{H_{i} = h} \\ & + \sum_{s \in S} w_{s}^{S} 1_{S_{i} = s} \\ & + \frac{1}{N(M_{i})} \sum_{m \in M} w_{m}^{M} 1_{(M_{i} - 5 \leq m \leq M_{i} + 6)} + \varepsilon_{i} \end{aligned}$ (3a)

where

$N(M_{i}) = \sum_{m \in M} 1_{(M_{i} - 5 \leq m \leq M_{i} + 6)}$ (3b)

and

$\varepsilon_{i} \sim N(0, \sigma^{2})$ . (3c)

All variables and sets of the proposed model are described in Table 3, and ε_i is the residual. We take the natural log of sale price P_i and land value L_i to satisfy the assumptions of a linear hedonic regression model. Log-transformed sale price log(P_i) is the target variable. The rest of the variables described in Table 3 are explanatory variables. Since we applied the natural log transformation on sale price, estimated impacts of housing features are calculated as multiplicative percentage changes after back-transforming fitted coefficients $w_{0}$, $w_{A}$, $w_{L}$, $w_{1}^{B}$, …, $w_{5+}^{B}$, $w_{1}^{C}$, …, $w_{3+}^{C}$, $w_{1}^{G}$, …, $w_{2+}^{G}$, $w_{0}^{F}$, $w_{1}^{F}$, $w_{Bungalow}^{H}$, …, and $w_{Unit}^{H}$ by taking the exponential function. Since land value L_i is also log-transformed in Equation 3a, the corresponding coefficient $w_{L}$ can be interpreted as the percentage change in sale price P_i for every 1% increase in land value. The impacts of floor area A_i and land value log(L_i) are modelled in terms of the difference to their respective configurations $\bar{A}$ and log($\bar{L}$) of the standard house (see Equation 1). Each level of the discrete housing features, such as number of bedrooms B_i, number of bathrooms C_i, and number of garages G_i, is encoded into a binary variable (represented by the indicator function $1$) so that our model is able to capture the non-linear relationship between housing features and sale value relative to the configuration of the standard house. For example, $1_{B_{i} = 1}$ = 1 if the property with sale record index i has one bedroom, and $1_{B_{i} = 1}$ = 0 otherwise. Most garages are under the main roof of the house, but some houses also have free-standing garages.
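To make the back-transformation described above concrete, the following minimal Python sketch converts log-scale coefficients into multiplicative price effects. The function names and all numeric inputs are hypothetical and chosen for illustration only; they are not fitted values from this study.

```python
import math

def pct_change_from_log_coef(w: float) -> float:
    """Percentage change in sale price implied by the coefficient of a
    binary (indicator) variable in a log-price model: (exp(w) - 1) * 100."""
    return (math.exp(w) - 1.0) * 100.0

def price_ratio_from_land_ratio(w_L: float, land_ratio: float) -> float:
    """Multiplicative change in sale price when land value is multiplied by
    land_ratio, given the coefficient w_L on log land value."""
    return land_ratio ** w_L

# Hypothetical coefficient: an indicator with w = log(1.10) implies a 10% premium.
premium = pct_change_from_log_coef(math.log(1.10))

# Hypothetical elasticity: with w_L = 0.5, quadrupling land value doubles the price.
ratio = price_ratio_from_land_ratio(0.5, 4.0)
```

For small coefficients the percentage change is close to 100·w, which is why $w_{L}$ can be read approximately as the percentage price change per 1% change in land value.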

Table 3. Description of variables where i is the sale record index.

The impact on sale price of having at least one free-standing garage is captured by the term $1_{F_{i} = f}$. House types H_i are also encoded into binary variables, and the back-transformed coefficients of each encoded house type $w_{h}^{H}$ provide a percentage price change for the value of each house type. By excluding the encoded terms of $\bar{B}$, $\bar{C}$, $\bar{G}$, $\bar{F}$, and $\bar{H}$ from our model, the coefficients of these configurations are essentially fixed to zero so that all other feature impacts are estimated relative to the standard house. For example, $w_{b}^{B}$ = 0 when b = $\bar{B}$, and $w_{c}^{C}$ = 0 when c = $\bar{C}$. Based on a previous study on Auckland housing by Bourassa et al. (2007), using homogeneous geographic submarkets as explanatory variables is sufficient for capturing spatial variation without more complex statistical methods. This study encodes suburbs S_i into binary variables to represent geographic submarkets. The back-transformed coefficients of each encoded suburb $w_{s}^{S}$ provide a percentage price change for house values in each suburb.
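The exclusion of the standard house level from the encoded terms can be sketched as follows. This is a minimal illustration using NumPy, not the authors' implementation; the helper name, the bedroom levels, and the choice of three bedrooms as the standard level are assumptions made only for the example.

```python
import numpy as np

def encode_relative_to_standard(values, levels, standard):
    """Build one indicator column per level EXCEPT the standard level.
    Omitting the standard level fixes its coefficient at zero, so every
    remaining coefficient is an effect relative to the standard house."""
    kept = [lv for lv in levels if lv != standard]
    X = np.array([[1.0 if v == lv else 0.0 for lv in kept] for v in values])
    return X, kept

# Hypothetical bedroom counts for five sales; "3" is assumed to be the standard.
bedrooms = ["1", "3", "4", "5+", "3"]
X_bed, bed_cols = encode_relative_to_standard(
    bedrooms, levels=["1", "2", "3", "4", "5+"], standard="3"
)
```

A sale that matches the standard level produces an all-zero row, so its predicted log price receives no adjustment from that feature.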

Sale months extracted from the provided sale dates are indexed as integers. For example, in our Auckland housing case study, January 2001 to December 2018 are indexed from 1 to 216 chronologically. Each unique sale month index m is then encoded into a binary variable similar to the discrete housing features. All model coefficients of encoded sale month indices $w_{1}^{M}, \dots, w_{216}^{M}$ are averaged across a moving time window so that the estimated price of a standard house to be used as a reference value changes gradually. Our moving time window is defined by the five months before and six months after the time of sale M_i, where N(M_i) gives the number of months in our case study that occur in this time window (see Equation 3b). This method is not suitable for forecasting future house prices because we are using information from the future six months after the time of sale. This restriction is suitable for the presented study because the focus is on analyzing housing feature impacts in new suburbs, and not on future price predictions. The benefit of separating housing feature impacts from the volatility of market dynamics is that our analysis has the potential to be combined with future price forecasts made by local domain experts such as developers or real estate agents.
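The moving time window in Equations 3a and 3b can be illustrated with a short sketch that builds, for one sale, the row of weights applied to the sale month coefficients. This is a minimal example assuming NumPy and month indices 1 to 216 as in the case study; the function name and its parameters are hypothetical.

```python
import numpy as np

def month_window_row(sale_month: int, n_months: int,
                     before: int = 5, after: int = 6) -> np.ndarray:
    """Weights on the sale-month coefficients w^M for a single sale:
    1 / N(M_i) for months inside [M_i - before, M_i + after], else 0.
    N(M_i) counts only months that exist in the data (1..n_months),
    so windows near the start or end of the series are shorter."""
    months = np.arange(1, n_months + 1)
    in_window = (months >= sale_month - before) & (months <= sale_month + after)
    return in_window / in_window.sum()

row_mid = month_window_row(100, 216)    # full 12-month window
row_start = month_window_row(1, 216)    # truncated window: months 1 to 7
```

Because each row sums to one, the dot product of the row with the month coefficients is the moving average of the monthly standard house price around the sale date.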

The proposed model formula for the SHCM is an extension of traditional valuation methods based on the hedonic price theory (Wen et al., 2004). The distinctiveness of our modelling approach, compared to traditional hedonic regressions, includes the encoded housing features for capturing non-linear relationships, the moving average of the monthly standard house price to account for market dynamics, and the expression of housing feature impacts relative to the configuration of a standard house for easy interpretation.

The model formula in Equation 3a can be written in the form $y = Xw + \varepsilon$, where y is a vector of our target variable, w is a vector of our coefficients, X is our feature matrix containing the values of our explanatory variables from the input data, and ε is the vector of residuals. Each element y_i of our target vector y is defined as:

$y_{i} = \log(P_{i})$ (4)

for all sales, and

$w = (w_{0}, w_{A}, w_{L}, w_{1}^{B}, \dots, w_{5+}^{B}, w_{1}^{C}, \dots, w_{3+}^{C}, w_{1}^{G}, \dots, w_{2+}^{G}, w_{0}^{F}, w_{1}^{F}, w_{Bungalow}^{H}, \dots, w_{Unit}^{H}, w_{FairviewHeights}^{S}, \dots, w_{Stonefields}^{S}, w_{1}^{M}, \dots, w_{216}^{M})$ . (5)

We adopt the Bayesian linear regression approach outlined by Bishop (2006). The Bayesian linear regression model assumes a zero-mean isotropic Gaussian prior for all parameters, and uses Bayesian inference to calculate posterior distributions of model coefficients w. The posterior distribution p(w|y) of model coefficients w is defined by mean m and covariance matrix S. The formula of the posterior distribution is as follows, where β is the noise precision parameter, and α is the regularization term:

$p(w \mid y) = N(w \mid m, S)$ (6a)

where

$m = \beta S X^{T} y$ (6b)

and

$S^{-1} = \alpha I + \beta X^{T} X$ . (6c)
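The analytic posterior in Equations 6a to 6c can be computed directly with a few matrix operations. The sketch below is a minimal NumPy illustration on synthetic data, not the code used in this study; the function name, the synthetic coefficients, and the values of α and β are assumptions chosen only to show the calculation.

```python
import numpy as np

def gaussian_posterior(X, y, alpha, beta):
    """Posterior mean m and covariance S of the coefficients w:
    S^{-1} = alpha * I + beta * X^T X   (Equation 6c)
    m      = beta * S X^T y             (Equation 6b)"""
    n_features = X.shape[1]
    S_inv = alpha * np.eye(n_features) + beta * (X.T @ X)
    S = np.linalg.inv(S_inv)
    m = beta * (S @ X.T @ y)
    return m, S

# Synthetic example: 200 observations, 3 features, noise sd 0.1 (precision 100).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
w_true = np.array([0.5, -0.2, 0.1])
y_demo = X_demo @ w_true + rng.normal(scale=0.1, size=200)

m_demo, S_demo = gaussian_posterior(X_demo, y_demo, alpha=1.0, beta=100.0)
posterior_sd = np.sqrt(np.diag(S_demo))  # per-coefficient uncertainty
```

No sampling is involved, which reflects the low computational cost of the analytic approach; for very wide feature matrices, solving the linear system with `np.linalg.solve` instead of forming the inverse would be a reasonable alternative.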

The Bayesian linear regression model in Equation 6a includes a sum-of-squares regularization term governed by the regularization parameter α (see Equation 6c) in the error function. Ridge regression also uses a sum-of-squares regularization term, so the means of posterior distributions for our Bayesian model coefficients are consistent with the fitted coefficients from a Ridge regression model. We select the optimal regularization parameter α by cross-validation.
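The consistency with Ridge regression can be seen by substituting Equation 6c into Equation 6b. As a short worked step, with the symbol λ introduced here purely for illustration and assuming α > 0 and β > 0:

$m = \beta (\alpha I + \beta X^{T} X)^{-1} X^{T} y = (X^{T} X + \lambda I)^{-1} X^{T} y, \quad \lambda = \alpha / \beta$

The right-hand side is the closed-form Ridge solution with penalty λ, so the posterior mean coincides with the Ridge estimate whenever the Ridge penalty equals the ratio of the prior precision to the noise precision.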

From the posterior distributions p(w|y) of the housing feature coefficients, we are able to extract highest density intervals (HDI) of housing feature impacts on house sale values. Highest density intervals are much easier to interpret for non-experts such as housing developers than statistical confidence intervals, because they model the intuitive understanding of uncertainty in terms of probabilities. A 95% highest density interval means that the probability of the true value lying within the highest density interval is 95% (Kruschke & Liddell, 2018).
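For a Gaussian posterior, the highest density interval of a single coefficient is symmetric around the posterior mean, so it can be obtained from the mean and variance alone. The following sketch uses only the Python standard library; the function name and the example mean and variance are hypothetical and do not come from the fitted model.

```python
import math
from statistics import NormalDist

def gaussian_hdi(mean: float, var: float, mass: float = 0.95):
    """Highest density interval of a Gaussian with the given mean and
    variance. For a symmetric unimodal density the HDI equals the
    central interval mean +/- z * sd."""
    z = NormalDist().inv_cdf(0.5 + mass / 2.0)
    half_width = z * math.sqrt(var)
    return mean - half_width, mean + half_width

# Hypothetical posterior for one log-scale coefficient.
lo, hi = gaussian_hdi(mean=0.08, var=0.0004)

# Exponentiating the endpoints gives an interval for the multiplicative
# impact that carries the same 95% probability (on that scale it is a
# credible interval, not necessarily the highest density one).
pct_lo = (math.exp(lo) - 1.0) * 100.0
pct_hi = (math.exp(hi) - 1.0) * 100.0
```

In the matrix notation of Equation 6a, the mean and variance for coefficient j would be the j-th element of m and the j-th diagonal element of S.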

Regularization is required in order to prevent overfitting, especially in housing data with high dimensionality (Goeman et al., 2012). Uncertainty intervals are not defined for regularized ordinary least squares regressions. There are bootstrapping methods to calculate confidence intervals for regularized regression models, but the general consensus is that such uncertainty intervals are not very meaningful because of the bias introduced by the regularization (Goeman et al., 2012). Reporting confidence intervals calculated from bootstrapping a regularized model could give the misleading impression of high precision, while not taking the model bias into account. However, uncertainty intervals are crucial for the interpretability of our modelling results, and so by using a Bayesian approach instead of a frequentist model, we can use posterior distributions to calculate reliable uncertainty intervals. This proposed modelling approach is used to analyze house sale values in four new Auckland suburbs in the following section.

Case Study: Recently Developed Suburbs in Auckland

Auckland has an estimated shortage of 34,000 homes that accumulated from 2013 to 2018 as the local population continued to increase (Ninness, 2018). To meet the rising housing demands, Auckland Council has presented a plan to develop outer suburbs such as Fairview Heights and Hobsonville and expand the current city into ‘Greater Auckland’ (Auckland Council, 2020). Variations of housing features and house sale prices are very specific to each housing market, but very little research has targeted newly developed suburbs. The housing crisis in Auckland and the plan for rapid expansion mean that modelling sales in newly developed Auckland suburbs will be increasingly relevant to future developers.

Data Description

CoreLogic is a property data and analytic service provider (Fleming & Humphries, 2013). They provided New Zealand housing data to the University of Auckland library database for research purposes (CoreLogic, 2020). Two housing data sets on New Zealand properties are used for this project. The first data set is a compilation of all sales records in New Zealand from 1990 to 2018. This housing sales data set has a total of 2,992,518 observations and 56 data fields, including a property ID and a sale ID. A subset of relevant data fields in the sales data set is described in Table 4. The second data set contains some detailed information on housing features and property location, and relevant data fields from this data set are described in Table 5. The 1,567,224 observations in this second data set are only identified by the property ID, and not the sale ID. Close examination of the second data set reveals that the housing features in this data set only represent the state of the property at the time the data is compiled, and not the state of the property at the time of sale. This will be taken into account when preparing training and testing data sets.

Table 4. Description of a sample of relevant data fields from the housing sales data set for New Zealand from 1990 to 2018 provided by CoreLogic (2020).

Table 5. Description of relevant data fields from the housing features data set for New Zealand provided by CoreLogic (2020).

Data Exploration

It is very time-consuming to manually examine all data features in the CoreLogic housing data sets. We have written two Python programs to generate summary statistics on numeric features and categorical features respectively, to gain a basic understanding of the data available. Both programs have been uploaded to the University of Auckland’s data repository ‘Figshare’ to be publicly accessible (Lin, 2023a, 2023b). Future researchers who decide to use the CoreLogic housing data, or other housing data sets with a similar format, can also apply our programs to obtain a quick overview of all data features.

The first Python program automatically generates a summary report for each numeric feature in the CoreLogic database (Lin, 2023b). Users must manually set up the correct directory path to the CoreLogic housing data sets before applying our program. The generated summary report includes information on:

the number of unique values,
the number of missing values and the percentage of missing values,
summary statistics such as the mean, median, maximum and minimum,
highest and lowest occurrences,
Pearson’s correlation coefficient and p-value with respect to sale price,
plot of the feature distribution,
scatter plot of the numeric feature against sale price,
the summary statistics and plots listed above after removing outliers by a z-score threshold.

The second Python program automatically generates a summary report for each categorical feature in the CoreLogic database (Lin, Citation2023a). The summary report includes information on:

the number of unique categories,
the number of missing values and the percentage of missing values,
highest and lowest occurrences,
bar plot of the occurrence frequencies of the top 10 most frequent categories,
box plot of the categorical feature against sale price after removing outliers by a z-score threshold of 5.
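
A comparable sketch for the categorical report is given here, again as an illustration rather than the published program (Lin, Citation2023a). The function name `summarize_categorical` and the example house types other than ‘Bungalow’ are hypothetical, and the plots are omitted.

```python
from collections import Counter

def summarize_categorical(categories, top_n=10):
    """Summarize one categorical feature; None marks a missing value."""
    present = [c for c in categories if c is not None]
    counts = Counter(present)
    n_missing = len(categories) - len(present)
    ranked = counts.most_common()          # sorted from most to least frequent
    return {
        "n_unique": len(counts),
        "n_missing": n_missing,
        "pct_missing": 100.0 * n_missing / len(categories),
        "most_frequent": ranked[0],
        "least_frequent": ranked[-1],
        "top_categories": ranked[:top_n],  # input for the bar plot
    }

# Hypothetical house types for six sales plus one missing entry.
house_types = ["Townhouse", "Bungalow", "Townhouse", None,
               "Villa", "Townhouse", "Bungalow"]
cat_report = summarize_categorical(house_types)
print(cat_report)
```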

Data Processing

The housing data compiled by CoreLogic is not free of errors, so the data is inspected and cleaned carefully to prepare reliable training and testing sets. This section describes the process of transforming raw housing data provided by CoreLogic into datasets that could be used for modelling sales in newly developed Auckland suburbs. Data processing specific to each housing case study is always required, and this is an overview of the general steps that are applied to the case study on newly developed Auckland suburbs:

An inner join on the sales data set and the housing features data set by the property ID is performed. Both sales records and housing features are critical information for the model, so the two housing data sets are merged by the property ID ‘QPID’. The merged data set consists of 2,913,874 observations and all columns from both data sets. As previously mentioned in Sec. 2, the housing features do not necessarily describe the property at the time of sale. This should not be an issue for analyzing newly built suburbs because the housing features data is compiled recently, but it is critical to distinguish between property sales before and after a house is built (see step 4).
Approximately 0.0097% of the sales data have a sale price of zero for unknown reasons. Only observations with a non-zero sale price are selected for analysis.
This study focuses on the analysis of newly built suburbs, so land sales without a building are not selected for analysis. The exact year in which a house was built is not available in the data (i.e. only the estimated year built is available), so it is difficult to identify purely land sales from the sale date. Instead, only properties with a non-zero improvement value are selected for analysis. The improvement value is the estimated building value from the previous official valuation before sale, and selecting properties with non-zero improvement values should theoretically eliminate all land sales without buildings.
We only selected sales below two million New Zealand dollars for analysis. Sales above two million dollars are outliers in these newly developed suburbs. We consulted with housing developers who confirmed that the high-end market behaves differently to sales in the housing market for middle-class families we are targeting. Only a very small sample of sales data are removed. In the entire New Zealand housing data sets, there are only 11,456 sales above two million dollars. This is approximately 0.39% of the housing sales data. Removing sales from the high-end market only removes a small portion of the housing data and reduces the bias caused by extremely high sale values.
Units are not always consistent for all observations. For example, land area of a few properties is described in square metres instead of hectares, as specified in the data description. These anomalies can be identified from manual inspection of outlier values, and the correct value after unit conversion can be validated against the public property data from the Auckland Council website.
All units are converted to SI units for consistency and easy interpretation. For example, the unit of land area is converted from hectares to square metres.
Estimated land values and estimated building values should theoretically be extracted from the previous official valuation before sale, but close inspection reveals that the valuation date is after the sale date for a few observations. Only observations where the valuation date predates the sale date are selected for analysis.
Property sales where the floor area is either zero or missing are dropped from the data set since this is crucial information when modelling the value of a house.
Properties with house types marked as ‘Apartments’ are removed from our data set because modelling apartment sale values is outside the scope of this research.
For all numeric variables in the data set, we filled in missing values with the median value of the variable. Properties with zero land area are also filled with the median. All apartments are removed, so properties with zero land area could be caused by issues with subdivisions, and we are therefore treating these as missing values.
We only selected the first sale of each property, identified by sale dates and property ID ‘QPID’, since that is the sale that housing developers are interested in when they are planning for an entirely new suburb.
The objective of this study is to model sales in new suburbs, so columns that are unlikely to be available for houses in new suburbs before or shortly after they are built are removed (e.g. government-estimated building value).

The data processing steps listed above are applied to the entire CoreLogic data sets before the identification of our case study suburbs. The next section includes further processing steps after identifying our case study suburbs and examining the relevant housing data.
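
Several of the filtering steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: apart from ‘QPID’ and the ‘Apartments’ label, every field name (`sale_price`, `improvement_value`, `floor_area`, `house_type`, `land_area_ha`, `sale_date`) and every record is hypothetical, and steps that need manual inspection (unit anomalies, valuation dates) are not covered.

```python
from statistics import median

HECTARE_TO_M2 = 10_000

def clean_sales(records, price_cap=2_000_000):
    """Apply a subset of the cleaning steps to a list of sale records (dicts)."""
    kept = []
    for rec in records:
        if not 0 < rec["sale_price"] < price_cap:   # zero-price and high-end sales
            continue
        if rec["improvement_value"] <= 0:           # land sales without a building
            continue
        if not rec["floor_area"]:                   # zero or missing floor area
            continue
        if rec["house_type"] == "Apartments":       # outside the scope of the study
            continue
        rec = dict(rec)
        hectares = rec.pop("land_area_ha")
        # Convert to square metres; zero land area is treated as missing.
        rec["land_area_m2"] = hectares * HECTARE_TO_M2 if hectares else None
        kept.append(rec)

    # Fill missing land area with the median of the known values.
    fill = median(r["land_area_m2"] for r in kept if r["land_area_m2"] is not None)
    for rec in kept:
        if rec["land_area_m2"] is None:
            rec["land_area_m2"] = fill

    # Keep only the first sale of each property (ISO dates sort chronologically).
    first = {}
    for rec in sorted(kept, key=lambda r: r["sale_date"]):
        first.setdefault(rec["QPID"], rec)
    return list(first.values())

def sale(qpid, date, price, improvement=200_000, floor=150,
         kind="Residence", land_ha=0.05):
    """Build one hypothetical sale record."""
    return {"QPID": qpid, "sale_date": date, "sale_price": price,
            "improvement_value": improvement, "floor_area": floor,
            "house_type": kind, "land_area_ha": land_ha}

sales = [
    sale(1, "2005-03-01", 450_000),
    sale(1, "2010-07-15", 620_000),                 # resale of the same property
    sale(2, "2006-01-20", 0),                       # zero sale price
    sale(3, "2007-09-09", 380_000, improvement=0),  # land-only sale
    sale(4, "2008-02-02", 510_000, land_ha=0),      # zero land area, median filled
    sale(5, "2009-05-05", 2_400_000),               # above the price cap
]
cleaned = clean_sales(sales)
print([(r["QPID"], r["sale_date"], r["land_area_m2"]) for r in cleaned])
```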

Case Study Description

We conduct a case study on modelling house sales in four suburbs, which were recently developed in Auckland, New Zealand. These suburbs are Fairview Heights, Hobsonville, Oteha, and Stonefields. These suburbs are selected as case studies based on consultation with housing developers, who confirmed that these suburbs are designed with young, middle-class families as the main target buyers. The similarity in target household attributes should lead to similar household demands. The locations of the four suburbs, along with Auckland Central, are all shown on the map in Figure 3. Fairview Heights, Hobsonville, and Oteha are newly developed suburbs north of Central Auckland, while Stonefields is southeast of Central Auckland. Auckland is a polycentric (i.e. a city that consists of multiple urban sub-centres), coastal city with a complex land price pattern, where land price decreases non-linearly from Auckland Central to the urban periphery (Grimes & Liang, Citation2009). The effects of the locations of the four suburbs, with respect to their distance to Auckland Central, are incorporated into our model by the estimated land value L_i provided by Auckland Council. The estimated land value in the CoreLogic housing data set is a record of the most recent valuation conducted by Auckland Council before the time of sale. This is public information that is available to housing developers still in the planning stage of suburb development.

Figure 3. Map of Greater Auckland, showing locations of Auckland Central, Fairview Heights, Hobsonville, Oteha, and Stonefields. Both Fairview Heights and Oteha are on the outer edges of the city.

All four suburbs were built after the year 2000. We carefully examined data from each of the four case study suburbs individually to make sure that all sales selected for analysis are part of the new suburb development, and filtered out sales of properties that were already in the area but are not part of the new developments. Our selected case study suburbs are all designed for young, middle-class families, so very few properties in these suburbs cater to the high-end market. To reduce the bias that would be caused by the small portion of expensive properties from the high-end market, we calculated z-scores for property sale price and estimated land value. Observations with z-scores above five are filtered from our case study, and this step only removes seven sale records. Properties with land area above 2000 square metres are not selected for analysis because they are also outliers in these newly developed suburbs, and this only removes 0.75% of our sample data from the four suburbs.

The number of sales selected from each of the four suburbs is listed in Table 6. Sale prices in the four suburbs are plotted in Figure 4. Hobsonville and Stonefields are shown to have the highest median prices out of the four suburbs. Comparing prices directly between suburbs without a fixed reference can be misleading because the time of sale of each property is not taken into account, and these four suburbs were not all developed at the same time. The number of sale samples representing each housing feature is listed in Table 7. The housing features are grouped so that there are at least 10 sale samples representing each feature. For example, house types ‘Bungalow’ and ‘Post-war Bungalow’ are grouped together as ‘Bungalow’. We selected a total of 3188 sale records from these four suburbs using the New Zealand housing data provided by CoreLogic. All of our models are built using Python, but other software such as Stata can also be used to build Bayesian models (Thompson, Citation2014).

Figure 4. Boxplot of sale prices in Fairview Heights, Hobsonville, Oteha, and Stonefields. Hobsonville and Stonefields have the highest median prices out of the four suburbs.

Table 6. The number of sales selected for our case study in each suburb.


Table 7. The number of sale samples representing each housing feature.


Impact of Housing Features on Sale Price

This section analyzes the impacts of individual housing features on sale price based on the posterior distributions of coefficients from our fitted model. Our Bayesian regression model has an optimized regularization parameter of 0.07, and an R² value of 93% (see Equation (3a) for the full model formula). The estimated percentage changes for the standard house price in each suburb, calculated by back-transforming coefficients $w_{s}^{S}$ for s ∈ {Fairview Heights, Hobsonville, Oteha, Stonefields}, are listed in Table 8. Fairview Heights has the lowest percentage price change at -4.96%, while Stonefields has the highest at 11.09%. Similarly, the estimated percentage changes for the different house types, calculated by back-transforming coefficients $w_{h}^{H}$, are listed in Table 9.
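
Because the model is fitted on log(P_i), a coefficient w on an indicator variable corresponds to a multiplicative price factor of exp(w). The sketch below assumes that the back-transformation is the usual 100 × (exp(w) − 1) for log-linear models; the function names are hypothetical, and the two coefficients are derived from the percentages quoted above rather than taken from the fitted model.

```python
import math

def pct_change_from_log_coef(w):
    """Percentage price change implied by a coefficient on log(price)."""
    return 100.0 * (math.exp(w) - 1.0)

def log_coef_from_pct_change(pct):
    """Inverse mapping, useful for sanity checks."""
    return math.log(1.0 + pct / 100.0)

# Log-scale coefficients implied by the suburb premiums reported in the text.
w_stonefields = log_coef_from_pct_change(11.09)
w_fairview = log_coef_from_pct_change(-4.96)
print(round(w_stonefields, 4), round(w_fairview, 4))
print(round(pct_change_from_log_coef(w_stonefields), 2))
```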

Table 8. The estimated percentage change of standard house price for each suburb.


Table 9. The estimated percentage change for each house type.


The features of a standard house ($\bar{A}$, $\bar{L}$, $\bar{B}$, $\bar{C}$, $\bar{G}$, $\bar{F}$, $\bar{H}$) in the four suburbs are defined in Table 10, and all house price impacts are analyzed relative to the configuration of this standard house. For example, the standard house has three bedrooms and two bathrooms (see Table 10), so coefficients $w_{3}^{B}$ and $w_{2}^{C}$ are both zero. When the configuration of a standard house ($\bar{A}$, $\bar{L}$, $\bar{B}$, $\bar{C}$, $\bar{G}$, $\bar{F}$, $\bar{H}$) is substituted into Equation (3a), all the terms corresponding to housing features are eliminated. The estimated price of a standard house ${\hat{P}}_{S, M}$ located in suburb $S$ and sold in month $M$, where $S$ and $M$ are members of the suburb set and the sale month set of Equation (3a), can therefore be calculated by the formula:

$${\hat{P}}_{S, M} = \exp \left( w_{0} + \sum_{s \in S} w_{s}^{S} 1_{S = s} + \frac{1}{N (M)} \sum_{m \in M} w_{m}^{M} 1_{(M - 5 \leq m \leq M + 6)} \right) . \tag{7}$$
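
The standard house price formula can be sketched as follows. The window from M − 5 to M + 6 and the normalizing count N(M) follow the formula; the function names, the clipping of the window at the first and last month, and the numerical values are assumptions for illustration only.

```python
import numpy as np

def month_window_mean(w_month, M, n_months=216):
    """Average of the month coefficients over months M-5..M+6, clipped to 1..n_months."""
    lo, hi = max(1, M - 5), min(n_months, M + 6)
    window = w_month[lo - 1:hi]          # w_month[0] holds the coefficient of month 1
    return window.sum() / window.size    # window.size plays the role of N(M)

def standard_house_price(w0, w_suburb, w_month, M):
    """exp(intercept + suburb coefficient + windowed month effect)."""
    effect = month_window_mean(w_month, M, n_months=len(w_month))
    return float(np.exp(w0 + w_suburb + effect))

# Hypothetical coefficients: a 400,000 baseline and a steady upward time trend.
w0 = np.log(400_000.0)
w_month = np.linspace(0.0, 1.0, 216)
for M in (1, 108, 216):
    print(M, round(standard_house_price(w0, 0.0, w_month, M)))
```

Near the start and end of the series the window contains fewer than twelve months, so N(M) shrinks accordingly (seven months for M = 1 and six months for M = 216 in this sketch).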

Table 10. Housing features of a standard house in Fairview Heights, Hobsonville, Oteha, and Stonefields.


Sale months from the four suburbs are indexed from 1 to 216, where M = 1 corresponds to January 2001, and M = 216 corresponds to December 2018. Some of the months in this period have no sales (see Figure 5b for a plot of monthly sale volume), but coefficients $w_{m}^{M}$ are still fitted to produce a smooth ${\hat{P}}_{S, M}$ over time. The estimated price of our standard house ${\hat{P}}_{S, M}$ in Fairview Heights, calculated using Equation (7) for sale month indices M ∈ {1,2,3,…,216} and S = Fairview Heights, is shown in Figure 5a as an example. This increasing trend in the estimated price of a standard house separates the effects of price change over time from the effects of individual housing features. Even though the estimated price of our standard house ${\hat{P}}_{S, M}$ is calculated using a moving average, we still observe discrepancies in periods with rapid changes in sale volume and outlier prices.

Figure 5. Panel a) shows the equivalent standard house prices ${\hat{P}}_{i}^{ESH}$ in Fairview Heights after converting each property into a standard house by adjusting for housing feature impacts using Equation (8). The blue line represents the estimated standard house price ${\hat{P}}_{S, M}$ in Fairview Heights, calculated using Equation (7) and the means of posterior distributions for model coefficients. The two dotted green lines represent the 90% HDI of the standard house price in Fairview Heights. The shaded green area represents the range of two standard deviations above and below the mean, which should include approximately 95% of the equivalent standard house prices. Panel b) shows the monthly sale volume from January 2001 to December 2018.


To verify that our estimated standard house price ${\hat{P}}_{S, M}$ captures the trend of price change over time in the four suburbs, we convert each property into a standard house by adjusting for housing feature impacts. For example, if a property has five bedrooms, then we adjust for the percentage price change of having five bedrooms instead of a standard three-bedroom house. We call the resulting value the equivalent standard house price ${\hat{P}}_{i}^{ESH}$ because it represents the hypothetical price of sale i if the property had been a standard house. The equivalent standard house price also allows us to study the residuals ε_i over time since the effects of all other housing features are eliminated. We define the equivalent standard house price ${\hat{P}}_{i}^{ESH}$ by:

$${\hat{P}}_{i}^{ESH} = \frac{P_{i}}{\exp \left( w_{A} (A_{i} - \bar{A}) + w_{L} (\log (L_{i}) - \log (\bar{L})) + w_{B_{i}}^{B} + w_{C_{i}}^{C} + w_{G_{i}}^{G} + w_{F_{i}}^{F} + w_{H_{i}}^{H} \right)} , \tag{8}$$

where the coefficient of any feature that takes its standard value is zero. The equivalent standard house prices for Fairview Heights are plotted in Figure 5a. As expected, most of the equivalent standard house prices are within the 90% highest density interval from our Bayesian model. Overall, 96% of the equivalent standard house prices from all four suburbs are within the 90% highest density interval of their respective standard house price estimations. This observation indicates that our estimations of the highest density intervals are too conservative, and approximating the prediction error of the log-transformed sale prices as a normal distribution can be improved by choosing a more appropriate distribution function in future research.
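
The conversion to an equivalent standard house price can be sketched as below, assuming that the housing feature effects are additive on the log-price scale as in the model of Equation (3a), so that they are removed by dividing the sale price by the exponential of their sum. The function name, the dictionary layout, and all coefficient and standard-house values are hypothetical.

```python
import math

def equivalent_standard_price(price, floor_area, land_value, coefs, std, categories):
    """Divide out the feature effects so the sale is priced as a standard house."""
    log_effect = (
        coefs["area"] * (floor_area - std["area"])
        + coefs["land"] * (math.log(land_value) - math.log(std["land"]))
    )
    # Categorical effects: a level without a coefficient is the standard level (zero).
    for feature, level in categories.items():
        log_effect += coefs[feature].get(level, 0.0)
    return price / math.exp(log_effect)

# Hypothetical standard house and coefficients (not the fitted values).
std = {"area": 150.0, "land": 400_000.0}
coefs = {
    "area": 0.0022,                      # per m^2 of floor area
    "land": 0.21,                        # elasticity with respect to land value
    "bedrooms": {4: math.log(1.05)},     # four bedrooms: +5% relative to three
}

# A four-bedroom house that is otherwise standard, sold for 525,000.
esh = equivalent_standard_price(525_000.0, 150.0, 400_000.0, coefs, std, {"bedrooms": 4})
print(round(esh))
```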

The 95% highest density intervals of percentage change in price for individual housing features, compared to a standard house, can be calculated from the posterior distributions of housing features. The mean percentage changes in sale price, and their respective 95% highest density intervals, are listed in Figure 6. The calculated percentage changes show that the price changes across increasing numbers of bedrooms, bathrooms, and garages are not linear. For every 10 m² increase in floor area relative to the standard house, price is estimated to increase by 2.10% to 2.44%. For every 1% increase in estimated land value relative to the standard house, price is estimated to increase by 0.18% to 0.24%. The posterior distributions of the percentage price change for floor area and land value are plotted in Figure 7. Another benefit of our Bayesian approach compared to traditional ordinary least squares (OLS) models is that the feature impacts are expressed as posterior distributions, and any probability interval can be directly extracted from the posterior distributions.
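
To illustrate how an interval can be read off a posterior distribution, the sketch below computes a highest density interval as the narrowest window that contains the requested share of posterior draws. The helper `hdi` is hand-written for this example, and the posterior draws are simulated from a normal distribution with made-up parameters; they are not draws from the fitted model.

```python
import numpy as np

def hdi(samples, prob=0.95):
    """Narrowest interval that contains a share `prob` of the samples."""
    x = np.sort(np.asarray(samples, dtype=float))
    n_in = int(np.ceil(prob * x.size))               # draws inside the interval
    widths = x[n_in - 1:] - x[: x.size - n_in + 1]   # width of every candidate window
    start = int(np.argmin(widths))
    return float(x[start]), float(x[start + n_in - 1])

rng = np.random.default_rng(0)
# Simulated draws of the floor-area coefficient (per m^2, log-price scale).
w_area = rng.normal(loc=0.00225, scale=0.00009, size=20_000)
# Back-transform to the percentage price change for a 10 m^2 increase.
pct_per_10m2 = 100.0 * (np.exp(10.0 * w_area) - 1.0)
low, high = hdi(pct_per_10m2, prob=0.95)
print(f"95% HDI: {low:.2f}% to {high:.2f}% per 10 m^2")
```

Changing `prob` gives any other interval from the same draws, which is the flexibility referred to above.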

Figure 6. Mean price change and 95% highest density interval from individual housing features compared to the estimated price of a standard house. All percentage changes are calculated from the coefficient posterior distributions of Equation (3a). The column with ‘5+’ bedrooms represents five bedrooms or above, and bathrooms and garages follow the same notation.


Figure 7. Posterior distributions of the percentage price change for floor area A_i and land value L_i.

A house with only one bedroom is estimated to have a price approximately 16.71% to 25.66% lower than the price of a standard house with three bedrooms, while a house with two bedrooms is only 5.30% to 9.09% lower in price. A house with four bedrooms is estimated to be 3.69% to 6.54% higher in price than the standard house, but the increase in price for each added bedroom plateaus above four bedrooms. This indicates that most buyers are not satisfied with single-bedroom houses, but five bedrooms or above can become excessive. The posterior distributions of the percentage price change for number of bedrooms are plotted in Figure 8.

Figure 8. Posterior distributions of the percentage price change for number of bedrooms. Posterior distribution for B_i = 3 is excluded as the standard number of bedrooms $\bar{B} .$

A house with only one bathroom reduces sale price by 1.11% to 3.50% compared to a standard house with two bathrooms, while a house with three bathrooms or above leads to a 0.58% to 3.17% increase (see Figure 9 for the posterior distributions of the percentage price change for number of bathrooms). The largest price change associated with the number of garages is for a house with two garages or above, which leads to a price increase of 1.51% to 4.49% compared to the standard house. The posterior distributions of the percentage price change for number of garages are plotted in Figure 10. Lastly, houses with at least one free-standing garage are estimated to have prices 4.33% to 7.87% higher than those that do not. Our fitted model provides insights into the complex relationship between house prices and housing features in a new suburb. The next section validates that newly developed suburbs with young, middle-class families as target buyers have similar household demands that are captured by our model.

Figure 9. Posterior distributions of the percentage price change for number of bathrooms. Posterior distribution for C_i = 2 is excluded as the standard number of bathrooms $\bar{C} .$

Figure 10. Posterior distributions of the percentage price change for number of garages. Posterior distribution for G_i = 1 is excluded as the standard number of garages $\bar{G} .$

Validate Model Performance

We apply k-fold cross-validation with k = 5 on the Auckland case study to demonstrate the ability of our model to make out-of-sample price predictions. This means we divide our housing data into five groups, hold out one group as the testing set, train on the remaining four groups, and repeat this step until every group has served as the testing set. We use the means of coefficient posterior distributions from our model to make these out-of-sample price predictions so the results are consistent with a ridge regression model. The R² values, root mean square errors, and mean absolute errors from applying 5-fold cross-validation are listed in Table 11. The R² values for all five folds are above 90%, and the average mean absolute error across the five folds is $56,792. The high R² values and low mean absolute errors from 5-fold cross-validation indicate that our model makes accurate out-of-sample price predictions for property sales in newly developed Auckland suburbs.
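
The procedure can be sketched with a closed-form ridge regression standing in for the posterior means. Everything in this sketch is an assumption made for illustration: the data are synthetic, the helper names are hypothetical, the errors are measured on the scale of the target rather than in dollars, and treating the regularization parameter of 0.07 as the ridge penalty is only one possible reading of the model set-up.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution with an unpenalized intercept."""
    Xb = np.column_stack([np.ones(len(X)), X])
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)

def ridge_predict(w, X):
    return w[0] + X @ w[1:]

def k_fold_scores(X, y, k=5, alpha=0.07, seed=0):
    """Return (R^2, RMSE, MAE) for each held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        w = ridge_fit(X[train], y[train], alpha)
        err = y[test] - ridge_predict(w, X[test])
        r2 = 1.0 - np.sum(err**2) / np.sum((y[test] - y[test].mean()) ** 2)
        scores.append((float(r2), float(np.sqrt(np.mean(err**2))),
                       float(np.mean(np.abs(err)))))
    return scores

# Synthetic stand-in for log prices driven by three features.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 13.0 + X @ np.array([0.2, -0.1, 0.05]) + rng.normal(scale=0.02, size=200)
scores = k_fold_scores(X, y)
for r2, rmse, mae in scores:
    print(f"R^2={r2:.3f}  RMSE={rmse:.4f}  MAE={mae:.4f}")
```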

Table 11. R² values, root mean square errors, and mean absolute errors from 5-fold cross-validation.


To demonstrate the advantages of our proposed modelling approach over traditional valuation methods, we compare the predictive ability of our model to an ordinary least squares (OLS) model without encoded housing features. The baseline OLS model to be used for comparison has the following model formula:

$$\begin{aligned} \log (P_{i}) = {} & w_{0} + w_{A} A_{i} + w_{L} \log (L_{i}) + w_{B} B_{i} + w_{C} C_{i} + w_{G} G_{i} + w_{F} F_{i} + w_{H} H_{i} \\ & + \sum_{s \in S} w_{s}^{S} 1_{S_{i} = s} + \frac{1}{N (M_{i})} \sum_{m \in M} w_{m}^{M} 1_{(M_{i} - 5 \leq m \leq M_{i} + 6)} + \varepsilon_{i} \end{aligned} \tag{9a}$$

where

$$N (M_{i}) = \sum_{m \in M} 1_{(M_{i} - 5 \leq m \leq M_{i} + 6)} \tag{9b}$$

and

$$\varepsilon_{i} \sim N (0, \sigma^{2}) . \tag{9c}$$

Instead of encoding housing features, the OLS model assumes a linear relationship between housing features (such as the number of bedrooms) and the target variable $log (P_{i}) .$ Another difference from our proposed SHCM is that the OLS model includes no regularization, meaning that the model is more prone to overfitting. The housing features in the OLS model are not modelled relative to the configuration of a standard house, and the coefficients are therefore less intuitive for housing developers to interpret. For example, the intercept of the SHCM represents the fitted price of our defined standard house with no fixed sale date. For the OLS model, the intercept represents the price of a house where all housing feature values are zero (i.e. zero floor area, no bedrooms, etc.). Such a house does not exist, and the intercept from the OLS model therefore has no meaningful interpretation. All feature impacts in the OLS model will also be expressed relative to this hypothetical house where all feature values are zero.

We included encoded suburbs in the OLS model to account for the bias of the individual suburb price premiums. The OLS model also includes encoded sale month indices for averaging house prices across a moving time window. This allows our OLS model to incorporate the non-linear impact of market dynamics over time. While this is not a typical approach for modelling price changes over time in traditional valuation methods, 18 years is a long time horizon for modelling sale prices. For the Auckland housing market, it is unrealistic to assume a linear price change over time. Since we have incorporated the non-linear impact of market dynamics in both models, any difference between the models’ predictive ability is based on our use of encoded housing features to capture non-linear relationships. One of our main project objectives is to model housing feature impacts in newly developed suburbs, so showing the advantages of our housing feature formulation is a crucial aspect of our research.

We repeated 5-fold cross-validation ten times, shuffling the data before each iteration, to generate a distribution of root mean square errors composed of 50 values from each of the two models. Figure 11 shows box plots comparing the distributions of the root mean square errors from the OLS model and the SHCM. We can see from Figure 11 that the median value of the root mean square error distribution from the SHCM is lower than the median value from the OLS model, but more evidence is required to conclude whether there is a difference between the predictive abilities of the two models. The traditional method for comparing two such distributions is a paired t-test. Applying a paired t-test to the two distributions produced a p-value of $9.9 \times 10^{-20}$, which is far below the standard significance level of 0.05. This result provides strong evidence that the mean difference between the root mean square errors produced by the two models is statistically significant.
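
The statistic behind a paired t-test can be sketched with the standard library. The five pairs of root mean square errors below are invented for illustration and are not the 50 values from the study; the function name is hypothetical. The p-value would then follow from a t distribution with n − 1 degrees of freedom, for example via `scipy.stats.ttest_rel`.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Mean paired difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Invented RMSE values from matching cross-validation folds.
rmse_ols = [96_000, 98_500, 94_200, 97_100, 95_300]
rmse_shcm = [92_100, 93_900, 90_800, 92_600, 91_500]
t_stat, dof = paired_t_statistic(rmse_ols, rmse_shcm)
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")
```

Pairing matters here because both models are evaluated on the same folds, so fold-to-fold variation cancels in the differences.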

Figure 11. Box plots comparing the distributions of the root mean square errors generated by the OLS model and the SHCM. The median value from the SHCM is lower than the median value from the OLS model.

To obtain a distribution on the difference, this study also adopts Bayesian inference to compare the two distributions using the Python package ‘PyMC3’, a probabilistic programming library for building Bayesian models using numerical methods (The PyMC Development Team, Citation2022). The advantage of Bayesian inference is that it provides information on how different the two distributions are, instead of testing for statistical evidence against the null hypothesis (Kruschke, Citation2013; The PyMC Development Team, Citation2018). We applied the same model set-up and prior distributions as the tutorial on Bayesian estimation for comparing two groups from the official PyMC3 website (The PyMC Development Team, Citation2018). Our only change to the set-up is that the prior distributions of the standard deviations are set to Uniform(5000, 8000) to match the empirical standard deviations of our two distributions. The posterior distributions for the difference of means and difference of standard deviations are shown in Figure 12. We use zero as a reference value to discern whether there is a credible difference between the two model performances. For the difference of means, 99.7% of the posterior distribution is above zero, and this is strong evidence that the means of the two root mean square error distributions are credibly different. There is a 94% probability that the difference in means between the root mean square errors from the two models is between 1514 and 6955. On average, we estimate that the mean of the root mean square errors decreases by 4116, or 4.3%, when moving from the OLS model to the SHCM. For the difference of standard deviations, 29% of the posterior distribution is below zero, while 71% is above zero. We therefore do not observe strong evidence of a credible difference between the standard deviations of the two root mean square error distributions, so there is no clear evidence that the root mean square errors generated by the two models have different variability. From comparing the root mean square errors generated by the SHCM and the OLS model on repeated k-fold cross-validation, we have very strong evidence that our proposed SHCM has improved predictive ability compared to traditional valuation methods. The improvement in predictive ability is based on the formulation of our housing features for capturing non-linear relationships.

Figure 12. Posterior distributions for the difference of means and difference of standard deviations when the root mean square error distributions generated by the OLS model and the SHCM are compared using Bayesian inference. For the difference of means, 99.7% of the posterior distribution is above zero, indicating strong evidence of a credible difference between the two compared distributions.

Cross-Validate the Model on Each Suburb

To make out-of-sample price predictions for entire suburbs without using any historical sales from the target suburb, we apply leave-one-group-out cross-validation where the groups are defined by the suburbs. For example, we remove all sales data on Fairview Heights properties from the training data set, and train our model on the remaining three suburbs. The resulting model is applied to Fairview Heights to make price predictions for the sales data we removed. This step is repeated on all four suburbs to validate that these newly developed suburbs all have similar housing feature impacts that can be captured by our model. Since we are making out-of-sample predictions for an entire suburb, the fitted model does not contain information on the suburb price premium of the target suburb. With no information on the target suburb, we assume the target suburb has a suburb-specific price premium of 0%.
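
The splitting scheme can be sketched as follows. The generator name and the six example sales are hypothetical; only the suburb names come from the case study.

```python
import numpy as np

def leave_one_suburb_out(suburbs):
    """Yield (held_out_suburb, train_idx, test_idx) for each distinct suburb."""
    suburbs = np.asarray(suburbs)
    for name in np.unique(suburbs):
        test = np.flatnonzero(suburbs == name)    # all sales in the held-out suburb
        train = np.flatnonzero(suburbs != name)   # sales in the remaining suburbs
        yield str(name), train, test

# Suburb label of six hypothetical sales.
suburb_of_sale = ["Fairview Heights", "Hobsonville", "Oteha",
                  "Stonefields", "Hobsonville", "Oteha"]
splits = list(leave_one_suburb_out(suburb_of_sale))
for name, train, test in splits:
    # The held-out suburb has no fitted premium, so its suburb term is
    # taken as zero (a 0% premium) when predicting the test sales.
    print(name, "train:", train.tolist(), "test:", test.tolist())
```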

The resulting R² values, root mean square errors, and mean absolute errors from cross-validating on each suburb are listed in Table 12. Out-of-sample price predictions on Fairview Heights, Hobsonville, and Oteha produce R² values of 86%, 75%, and 89%, respectively, indicating very strong positive relationships. The mean absolute errors of the three suburbs are all below $100,000. The predicted prices are plotted against the actual sale prices in Figure 13. Our model is able to make out-of-sample price predictions on the three suburbs with reasonable accuracy because the standard house prices in the three suburbs are all reasonably similar. This is evidence that the housing feature impacts derived from our model have the potential to provide insights for other new suburbs in Auckland still in the planning stage of development.

Figure 13. Scatter plot of predicted price against actual sale price for Fairview Heights, Hobsonville, Oteha, and Stonefields. Stonefields properties in panel d) are systematically underestimated. Price predictions are reasonably accurate for the other three suburbs.

Table 12. R² values, root mean square errors, and mean absolute errors from out-of-sample predictions by leave-one-group-out cross-validation on each suburb.

The R² value for prediction on Stonefields is a disappointing 37%, indicating a moderate positive relationship, because house prices in Stonefields are systematically underestimated (see Figure 13d). Our in-sample results estimate a price premium of 11.09% for properties in Stonefields, relative to a standard house trained on all four suburbs. The actual standard house price in Stonefields is significantly higher than the standard house price in the other three suburbs. Our assumption of a 0% suburb price premium is therefore unrealistic in this scenario, which explains the low R² value.

The inclusion of land value L_i in our model should capture the effects of suburb locations, but suburb location and distance to Auckland Central are not the sole determining factors of suburb-specific price premiums. Suburb-specific price premiums also depend on factors such as the quality of neighbourhood facilities that are not captured by our model. The focus of this study is on housing feature impacts, and not on estimating the price premiums of individual suburbs. We would have to rely on experts to estimate the appropriate suburb price premium of a new target suburb if the target suburb has a standard house price that differs significantly from the standard house price fitted on the training suburbs.

To simulate a best-case scenario where a housing expert is able to provide an accurate estimate of the Stonefields price premium, we fit, in a separate linear regression model, the price premium that optimizes our current price predictions on Stonefields. This demonstrates an upper bound on the achievable performance, since estimating an accurate suburb price premium is very difficult. The fitted suburb price premium is 15.84%, and increasing our price predictions on Stonefields properties by 15.84% increases the R² value to 80%. This fitted Stonefields price premium of 15.84% differs from the in-sample suburb premium of 11.09% because the reference standard house changes when the training data is different. The fitted standard house price is only a reference point and does not affect the predictive ability of our model. The increased R² value of 80% is evidence that the housing feature impacts captured by our model are still applicable to Stonefields even if the out-of-sample predictions contain clear systematic errors due to the assumption of a 0% suburb premium. We conclude that our model has the potential to be used for predicting house prices of future new suburbs if housing experts are able to provide a reasonable estimation of the standard house price in that new suburb.
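One way to picture the premium fit is as a one-parameter regression through the origin, where actual prices are modelled as (1 + premium) times the uncorrected predictions. The source does not specify the exact form of the separate linear regression, so this formulation, the `fit_premium` helper, and the toy numbers below are assumptions for illustration only.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def fit_premium(actual, predicted):
    """Least-squares scale for actual ~ scale * predicted, returned as scale - 1."""
    scale = sum(a * p for a, p in zip(actual, predicted)) / sum(p * p for p in predicted)
    return scale - 1


# Hypothetical prices in thousands of dollars: every prediction is too low.
predicted = [800, 900, 1000, 1100]
actual = [930, 1030, 1150, 1270]

premium = fit_premium(actual, predicted)
adjusted = [p * (1 + premium) for p in predicted]

r2_before = r_squared(actual, predicted)  # penalised by the systematic bias
r2_after = r_squared(actual, adjusted)    # bias removed by the fitted premium
```

In this toy case the uncorrected predictions order the houses correctly but sit uniformly below the actual prices, so a single multiplicative correction recovers most of the fit. The same mechanism would explain how one premium can lift the Stonefields R² without any change to the estimated feature impacts.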

Conclusion

This study analyzes the impacts of individual housing features on sale price for houses in new suburbs without historical sales using a Bayesian Standard House Configuration Model. Our modelling approach estimates the price of a standard house over time to separate the effects of price change over time from the effects of individual housing features. All impacts on sale price are described by highest density intervals from posterior distributions to produce results that are easy to interpret for non-experts. The case study on Fairview Heights, Hobsonville, Oteha, and Stonefields in Auckland, New Zealand, demonstrates that our proposed method effectively captures both the non-linear effects of individual housing features and price change over time. Single-bedroom houses are estimated to be 16.71% to 25.66% lower in price than a standard house with three bedrooms, while houses with five bedrooms or more lead to around 3.33% to 8.00% increase in price.

Our literature review shows that none of the past studies has attempted to make price predictions for an entire suburb before it has been built. The main obstacle is that there are no historical sales from the target suburb we can train the model on. From our Auckland housing case study, we have shown that our model is able to produce out-of-sample R² values above 75% for the suburbs of Fairview Heights, Hobsonville, and Oteha by training on other newly developed suburbs with similar target buyers without using any sale records from the target suburb. Our price predictions for the suburb of Stonefields contain systematic errors because the suburb-specific price premium of Stonefields is significantly higher than the other newly developed suburbs in our training data set. Adjusting our price predictions on Stonefields by the suburb-specific price premium improves the out-of-sample R² value to 80%. This indicates that our model is able to make reasonably accurate price predictions for comparable suburbs. For target suburbs with suburb-specific price premiums that differ significantly from suburbs in the training set, we would have to rely on experts to estimate an appropriate price premium to eliminate the systematic errors in our predictions. Overall, we have demonstrated the potential of our model to be applied to other new suburbs not in the current case study. Our findings have the potential to assist with the planning phase of suburb development by identifying buyer preferences for new suburbs.

This study has demonstrated the capabilities of probabilistic modelling in terms of its application to housing data. Choosing an appropriate approximation distribution for the model parameters of each housing case study is a key aspect of building a reliable probabilistic model. For our Auckland housing case study, we chose a normal approximation for the model parameters, but this is still not a perfect approximation. The advantage of a normal approximation is that we can calculate an analytic solution to our problem without any simulation methods. Analytic solutions are computationally inexpensive, and we are guaranteed a solution with no convergence issues. For future research involving other housing case studies, the approximation of prediction errors as a normal distribution could be improved by choosing more appropriate distribution functions using existing statistical tools such as PyMC3 or Stata. The PyMC3 Python package, which is employed briefly in this paper for the comparison of two error distributions, offers distribution functions such as the Student’s t-distribution and the skew-normal distribution that could be a more appropriate approximation for other housing case studies. PyMC3 employs numerical methods that estimate model parameters by simulating random sampling from probability distributions (Carlo, Citation2004). The drawback of numerical simulations is that they are computationally expensive, and the solution might not converge due to potential issues such as variable collinearity or inappropriate distribution approximations.
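To illustrate why a normal approximation permits an analytic solution, the sketch below applies the textbook conjugate update for a single parameter with a normal prior and normally distributed observations of known noise level, then reads off a highest density interval in closed form. It is a deliberately simplified, one-parameter stand-in and not the paper's full model; the prior, the noise standard deviation, and the observed percentage impacts are hypothetical.

```python
from statistics import NormalDist


def normal_posterior(prior_mean, prior_sd, observations, noise_sd):
    """Closed-form posterior for a mean with a normal prior and known noise sd."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = len(observations) / noise_sd ** 2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + sum(observations) / noise_sd ** 2)
    return NormalDist(post_mean, post_var ** 0.5)


def highest_density_interval(dist, mass=0.95):
    """For a normal distribution the HDI coincides with the equal-tailed interval."""
    tail = (1 - mass) / 2
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)


# Hypothetical noisy observations of a feature's price impact, in percent.
observed_impacts = [-20, -22, -21, -19, -23]

posterior = normal_posterior(prior_mean=0.0, prior_sd=10.0,
                             observations=observed_impacts, noise_sd=5.0)
low, high = highest_density_interval(posterior, mass=0.95)
```

No sampling is involved, so there is nothing to converge. The trade-off noted in the text is that heavier-tailed or skewed alternatives, such as a Student's t or skew-normal distribution, generally lose this closed form and require simulation.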

Our proposed approach of modelling feature impacts relative to the configuration of a standard house can be expanded in the future to include more housing features. One of the main advantages of the Standard House Configuration Model is that the results on housing feature impacts are easy to interpret because there is a clear reference point for each analyzed feature. The Standard House Configuration Model can be expanded to include any other features that are of interest to future users who wish to apply the model, as long as the configuration of the standard house to be used as the reference point is also expanded to include the added features. The concept of the Standard House Configuration Model can be used to analyze not only housing feature impacts but also the value of neighbourhood facilities that are of interest to housing developers. The main obstacle to future expansion of our model is the limited data sample on newly developed suburbs. There must be sufficient data representation of each facility or additional housing feature to present reliable results on their implicit values. It is expected that our proposed modelling approach and findings also translate to the global housing market.

Disclosure Statement

The authors report there are no competing interests to declare.

Additional information

Funding

This research is supported by the Charles Ma Engineering Fund.

References

Ambrose, J. (2019). The ongoing housing shortage. Property Journal, Jul/Aug:23–23. Available: https://www.proquest.com/docview/2281055924.
Auckland Council. (2020). What will Auckland look like in the future? Available: https://www.aucklandcouncil.govt.nz/plans-projects-policies-reportsbylaws/our-plans-strategies/auckland-plan/development-strategy/futureauckland/Pages/what-auckland-look-like-future.aspx.
Bao, H. X., & Wan, A. T. (2004). On the use of spline smoothing in estimating hedonic housing price models: Empirical evidence using Hong Kong data. Real Estate Economics, 32(3), 487–507. https://doi.org/10.1111/j.1080-8620.2004.00100.x
Bency, A. J., Rallapalli, S., Ganti, R. K., Srivatsa, M., & Manjunath, B. S. (2017). Beyond spatial auto-regressive models: Predicting housing prices with satellite imagery [Paper presentation]. 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 320–329. https://doi.org/10.1109/WACV.2017.42
Biau, G., & Scornet, E. (2016). A random forest guided tour. TEST, 25(2), 197–227. https://doi.org/10.1007/s11749-016-0481-7
Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
Bourassa, S. C., Cantoni, E., & Hoesli, M. (2007). Spatial dependence, housing submarkets, and house price prediction. The Journal of Real Estate Finance and Economics, 35, 143–160. https://doi.org/10.1007/s11146-007-9036-8
Braidwood, E. (2017). Liz Peace: ‘Creating a new London suburb’. Architects’ Journal. Available: https://www.architectsjournal.co.uk/news/old-oak-commons-liz-peacewere-creating-a-new-london-suburb.
Brunsdon, C., Fotheringham, A. S., & Charlton, M. (1999). Some notes on parametric significance tests for geographically weighted regression. Journal of Regional Science, 39(3), 497–524. https://doi.org/10.1111/0022-4146.00146
Carlo, C. M. (2004). Markov chain Monte Carlo and Gibbs sampling. Lecture Notes for EEB, 581(540), 3.
Clapp, J. M., Kim, H.-J., & Gelfand, A. E. (2002). Predicting Spatial Patterns of House Prices Using LPR and Bayesian Smoothing. Real Estate Economics, 30(4), 505–532. https://doi.org/10.1111/1540-6229.00048
CoreLogic. (2020). Residential property sales statistics. Data files for New Zealand house sales. Updated 29 June 2020. University of Auckland. Accessed: 06.02.2020.
Fernandez, M. A. (2019). A review of applications of hedonic pricing models in the New Zealand housing market. Available: https://knowledgeauckland.org.nz/publications/a-review-of-applications-ofhedonic-pricing-models-in-the-new-zealand-housing-market/.
Filippova, O., & Rehm, M. (2011). The impact of proximity to cell phone towers on residential property values. International Journal of Housing Markets and Analysis, 4(3), 244–267. https://doi.org/10.1108/17538271111153022
Fleming, M., & Humphries, S. (2013). Home price indices: Appreciating the differences. The Urban Institute. Available: https://www.urban.org/sites/default/files/2015/02/10/home-price-indicesspeaker-biographies.pdf.
Goeman, J., Meijer, R., & Chaturvedi, N. (2012). L1 and L2 penalized regression models. cran. r-project. or. Available: https://cran.rproject.org/web/packages/penalized/vignettes/penalized.pdf.
Grimes, A., & Liang, Y. (2009). Spatial determinants of land prices: Does Auckland’s metropolitan urban limit have an effect? Applied Spatial Analysis and Policy, 2, 23–45. https://doi.org/10.1007/s12061-008-9010-8
Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology. General, 142(2), 573. https://doi.org/10.1037/a0029146
Kruschke, J. K., & Liddell, T. M. (2018). Bayesian data analysis for newcomers. Psychonomic Bulletin & Review, 25(1), 155–177. https://doi.org/10.3758/s13423-017-1272-1
Kusisto, L., & Grant, P. (2019). Affordable housing crisis spreads throughout world; shortages persist despite millions of dollars invested and hundreds of thousands of units built. The Wall Street Journal. Available: https://www.wsj.com/articles/affordable-housing-crisis-spreadsthroughout-world-11554210003.
Liao, W.-C., & Wang, X. (2012). Hedonic house prices and spatial quantile regression. Journal of Housing Economics, 21(1), 16–27. https://doi.org/10.1016/j.jhe.2011.11.001
Limsombunchai, V. (2004). House price prediction: Hedonic price model vs. artificial neural network [Paper presentation]. New Zealand Agricultural and Resource Economics Society Conference, pages 25–26.
Lin, C. (2023a). Explore categorical numeric features in corelogic housing data. https://doi.org/10.17608/k6.auckland.24321193
Lin, C. (2023b). Explore individual numeric features in corelogic housing data. https://doi.org/10.17608/k6.auckland.24319750.
Liu, Z., Yan, S., Cao, J., Jin, T., Tang, J., Yang, J., & Wang, Q. (2018). A Bayesian approach to residential property valuation based on built environment and house characteristics [Paper presentation]. 2018 IEEE International Conference on Big Data (Big Data), pages 1455–1464. https://doi.org/10.1109/BigData.2018.8622422
Manderbacka, K., Kåreholt, I., Martikainen, P., & Lundberg, O. (2003). The effect of point of reference on the association between self-rated health and mortality. Social Science & Medicine (1982), 56(7), 1447–1452. https://doi.org/10.1016/s0277-9536(02)00141-7
Ninness, G. (2018). New figures show Auckland’s housing shortage is still getting worse but should start to decline in the next one to two years. interest.co.nz. Available: https://www.interest.co.nz/property/97023/new-figuresshow-aucklands-housing-shortage-still-getting-worse-should-start-decline.
Pace, R., Barry, R., Gilley, O. W., & Sirmans, C. F. (2000). A method for spatial–temporal forecasting with an application to real estate prices. International Journal of Forecasting, 16(2), 229–246. https://doi.org/10.1016/S0169-2070(99)00047-3
Rehm, M., & Filippova, O. (2008). The impact of geographically defined school zones on house prices in New Zealand. International Journal of Housing Markets and Analysis, 1(4), 313–336. https://doi.org/10.1108/17538270810908623
Sargent-Cox, K. A., Anstey, K. J., & Luszcz, M. A. (2010). Patterns of longitudinal change in older adults’ self-rated health: The effect of the point of reference. Health Psychology: Official Journal of the Division of Health Psychology, American Psychological Association, 29(2), 143–152. https://doi.org/10.1037/a0017652
The PyMC Development Team. (2018). Bayesian estimation supersedes the t-test. Available: https://www.pymc.io/projects/docs/en/v3.11.4/pymc-examples/examples/case_studies/BEST.html
The PyMC Development Team. (2022). Pymc. Available: https://www.pymc.io/projects/docs/en/v3.11.4/pymc-examples/examples/case_studies/BEST.html
Thompson, J. (2014). Bayesian analysis with Stata. Stata Press, College Station, TX.
Wen, H. Z., Lu, J. F., & Lin, L. (2004). An improved method of real estate evaluation based on Hedonic price model [Paper presentation]. 2004 IEEE International Engineering Management Conference (IEEE Cat. No.04CH37574), volume 3, pages 1329–1332. https://doi.org/10.1109/IEMC.2004.1408910
Wheeler, D. C., Páez, A., Spinney, J., & Waller, L. A. (2014). A Bayesian approach to hedonic price analysis. Papers in Regional Science, 93(3), 663–683. https://doi.org/10.1111/pirs.12003
Yoo, S., Im, J., & Wagner, J. E. (2012). Variable selection for hedonic model using machine learning approaches: A case study in Onondaga County, NY. Landscape and Urban Planning, 107(3), 293–306. https://doi.org/10.1016/j.landurbplan.2012.06.009

Table 1. Overview of a small sample of past studies representing the most common modelling approaches for house price prediction.

Table 2. Analysis criteria for a selection of past research that employs machine learning models to make price predictions or analyze the Auckland housing market.

Modelling Housing Feature Impacts in New Suburbs

Table 3. Description of variables where i is the sale record index.