ABSTRACT
Machine learning (ML) is becoming one of the most anticipated methods for predicting consumer demand. However, it remains uncertain how ML methods perform relative to traditional econometric methods at different dataset scales. This study estimates and compares the out-of-sample predictive accuracy of household budget share for organic fresh produce using two parametric models and six ML methods under regular and large sample sizes. Results show that ML methods, particularly Logistic Elastic Net, perform better than econometric models under the regular sample size. In contrast, with big data, econometric models reach the same accuracy level as ML methods, whereas random forest shows a possible overfitting problem. This study illustrates the competence of ML methods in demand prediction, but choosing the optimal method requires considering product specifics, sample sizes, and observable features.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Nielsen disclaimer
Researcher(s)' own analyses calculated (or derived) based in part on data from Nielsen Consumer LLC and marketing databases provided through the NielsenIQ Datasets at the Kilts Center for Marketing Data Center at The University of Chicago Booth School of Business. The conclusions drawn from the NielsenIQ data are those of the researcher(s) and do not reflect the views of NielsenIQ. NielsenIQ is not responsible for, had no role in, and was not involved in analysing and preparing the results reported herein.
Notes
1 Varian (2014) provided an overview of popular ML methods. For empirical applications, Athey (2019) summarized representative uses of ML in the economic literature, from solving causal inference problems to predicting policy effects.
2 The predictive accuracy rankings are identical in the validation set and the test set. For the validation set, unlike Bajari et al. (2015), we do not calculate the percentage weights from the coefficients of a regression model that combines the predicted estimates of all methods, because coefficient-based weights would be subject to a severe collinearity problem that leads to overweighting methods with small deviance and underweighting methods with large deviance.
3 takes the value of any real number. We can then apply different methods to such an outcome.
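The collinearity concern raised in note 2 can be illustrated with a minimal Python sketch on synthetic data. Nothing here comes from the study: the outcome, the three hypothetical methods, and their noise levels are all assumptions, and plain unconstrained least squares stands in for the combination regression of Bajari et al. (2015), whose exact specification is not reproduced. The sketch regresses the outcome on the predictions of all methods and reports how the resulting coefficients compare with each method's own error.

```python
import numpy as np

# Synthetic illustration only; none of these numbers come from the study.
rng = np.random.default_rng(0)
n = 500
y = rng.normal(size=n)  # hypothetical validation-set outcome

# Hypothetical predictions from three methods: the true outcome plus
# method-specific noise, so the three prediction vectors are highly correlated.
noise_sd = [0.05, 0.10, 0.50]
preds = np.column_stack([y + rng.normal(scale=s, size=n) for s in noise_sd])

# Regression-based combination (assumed form): least-squares regression of
# the outcome on all methods' predictions, without an intercept.
weights, *_ = np.linalg.lstsq(preds, y, rcond=None)

# Diagnostics: per-method RMSE, pairwise correlation of predictions,
# and the condition number of the prediction matrix.
rmse = np.sqrt(((preds - y[:, None]) ** 2).mean(axis=0))
corr = np.corrcoef(preds, rowvar=False)
cond = np.linalg.cond(preds)

print("RMSE by method:      ", np.round(rmse, 3))
print("Regression weights:  ", np.round(weights, 3))
print("Corr(method 1, 2):   ", round(float(corr[0, 1]), 3))
print("Condition number:    ", round(float(cond), 1))
```

In this setup the regression coefficients concentrate on the method with the smallest error far more than the gap in RMSE alone would suggest, which is in line with the overweighting and underweighting pattern the note describes.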