Economic Instruction

Integrating data science into an econometrics course with a Kaggle competition

Abstract

As vast amounts of data have become available in business in recent years, the demand for data scientists has been rising. The author of this article provides a tutorial on how one entry-level machine learning competition from Kaggle, an online community for data scientists, can be integrated into an undergraduate econometrics course as an engaging activity using only linear regression. Other techniques in this tutorial include log-linear and quadratic models and interactions of explanatory variables, which are common functional forms in econometrics. The competition allows students to use real-world data, build a predictive model, submit their predictions online to be evaluated instantaneously for accuracy, and keep improving their model. R and Python code is provided so that readers can replicate the exercise.
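As a minimal sketch of the functional forms named in the abstract, the Python snippet below fits a log-linear regression with a quadratic term and an interaction by ordinary least squares. The data are synthetic, and the variable names (`temp`, `workday`), the coefficient values, and the use of NumPy are assumptions made for illustration; they are not taken from the competition data or from the author's code.

```python
import numpy as np

# Synthetic data: every name and number here is hypothetical.
rng = np.random.default_rng(0)
n = 500
temp = rng.uniform(0, 35, n)       # a continuous explanatory variable
workday = rng.integers(0, 2, n)    # a 0/1 dummy variable

# Design matrix: intercept, linear term, quadratic term, dummy, interaction.
X = np.column_stack([np.ones(n), temp, temp**2, workday, temp * workday])

# Generate a positive outcome whose logarithm is linear in the columns of X.
true_beta = np.array([1.0, 0.20, -0.004, 0.5, -0.01])
y = np.exp(X @ true_beta + rng.normal(0, 0.1, n))

# Log-linear model: regress log(y) on X by ordinary least squares.
beta_hat, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

# Predictions are transformed back from the log scale to the original scale.
y_pred = np.exp(X @ beta_hat)

# With a negative quadratic coefficient, log(y) peaks at -b1 / (2 * b2)
# for the baseline group (workday == 0).
turning_point = -beta_hat[1] / (2 * beta_hat[2])

print(np.round(beta_hat, 4))
print(round(float(turning_point), 1))
```

The interaction column lets the slope on `temp` differ between the two groups defined by `workday`, while the quadratic column lets the effect of `temp` rise and then fall.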

Acknowledgments

The author thanks three anonymous referees for useful comments.

Notes

1 Indeed, there is a course called “Econometric Data Science” taught by Joshua Angrist at MIT. The description of the course is as follows: “Econometric Data Science develops the knowledge and skills needed to understand empirical economic research and to plan and execute empirical projects. Topics include randomized trials, regression, instrumental variables, differences-in-differences, regression-discontinuity designs, and simultaneous equations models” (https://economics.mit.edu/files/20147). As the description suggests, the focus of the course appears to be causal inference rather than prediction, the primary concern of data science.

2 In this article, machine learning refers to “supervised” machine learning, in which the goal is to generate formulas based on input and output values. In “unsupervised” machine learning, only input data are used, and observations are grouped according to associations among the input values.

3 In machine learning, explanatory or independent variables are usually called “predictors” or “features.”

4 We do not know what products Target used to assign each shopper a “pregnancy prediction” score. Prenatal vitamins may not be one of them.

5 Of course, Stata can also be used, as long as the submission files are CSV (comma-separated values) files in the correct format.

6 For example, Dvorak et al. (2019) discuss using R Markdown in teaching empirical economics to undergraduates. Kuroki (2021) and Jenkins (2022) show how to use Python in teaching intermediate microeconomics and macroeconomics, respectively, to undergraduates.

7 Even though “year,” “month,” and “hour” look like numbers, these variables are not numerical: they are extracted as character strings using substring() and are therefore treated as categorical variables in the regression.

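8 Interacting more than three variables is not recommended, as it requires substantial computing power: the number of interaction terms grows multiplicatively with the number of categories in each variable.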

9 For example, regression trees and random forests are machine learning methods that search for interactions automatically.

10 I thank a reviewer for this suggestion. Incidentally, Kaggle may be an excellent resource for students who are required to write an empirical paper in their econometrics course. Choosing a topic is difficult for many students because they must find data first, even though most have little or no experience in gathering data. In addition to competitions, Kaggle hosted more than 160,000 public datasets deposited by users as of September 2022. Thus, Kaggle offers a one-stop shop for data on a massive number of topics, and students, once they are familiar with the Kaggle Web site, will find it easy to download data and come up with interesting topics for their empirical projects.
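Notes 5 and 7 can be illustrated with a short Python sketch that uses only the standard library. It assumes timestamps stored as text in a `YYYY-MM-DD HH:MM:SS` layout, slices out year, month, and hour as strings (mirroring what substring() does), builds dummy columns from them, and writes predictions to a CSV file. The column names `datetime` and `count`, the sample timestamps, and the prediction values are hypothetical placeholders; the required submission format should be checked against the competition page.

```python
import csv
import io

# Hypothetical timestamps, assumed to be text in "YYYY-MM-DD HH:MM:SS" layout.
timestamps = ["2011-01-20 00:00:00", "2011-01-20 13:00:00", "2012-07-25 08:00:00"]


def time_parts(ts):
    """Slice year, month, and hour out of a timestamp; all stay strings."""
    return {"year": ts[0:4], "month": ts[5:7], "hour": ts[11:13]}


def dummies(values):
    """Build one 0/1 column per level, omitting the first as the reference."""
    levels = sorted(set(values))
    return {level: [int(v == level) for v in values] for level in levels[1:]}


def write_submission(ids, predictions, handle):
    """Write an id column and a prediction column as CSV (hypothetical header)."""
    writer = csv.writer(handle)
    writer.writerow(["datetime", "count"])
    for row_id, prediction in zip(ids, predictions):
        writer.writerow([row_id, prediction])


parts = [time_parts(ts) for ts in timestamps]

# "hour" is a string such as "13", so it enters a regression as a set of
# dummy variables rather than as a single numeric regressor.
hour_dummies = dummies([p["hour"] for p in parts])

# Placeholder predictions, made up here; in practice they come from the model.
predictions = [12.5, 140.0, 310.2]

buffer = io.StringIO()  # stands in for a file opened with newline=""
write_submission(timestamps, predictions, buffer)
print(buffer.getvalue())
```

Keeping the extracted parts as strings is what makes the regression treat each hour as its own category; converting them to integers would instead impose a single linear slope across hours.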
