Original Articles

Using Lasso for Predictor Selection and to Assuage Overfitting: A Method Long Overlooked in Behavioral Sciences

Pages 471-484 | Published online: 13 Oct 2015
 

Abstract

Ordinary least squares and stepwise selection are widespread in behavioral science research; however, these methods are well known to encounter overfitting problems such that R² and regression coefficients may be inflated while standard errors and p values may be deflated, ultimately reducing both the parsimony of the model and the generalizability of conclusions. More optimal methods for selecting predictors and estimating regression coefficients, such as regularization methods (e.g., Lasso), have existed for decades, are widely implemented in other disciplines, and are available in mainstream software, yet these methods are essentially invisible in the behavioral science literature while the use of suboptimal methods continues to proliferate. This paper discusses potential issues with standard statistical models, provides an introduction to regularization with specific details on both Lasso and its related predecessor ridge regression, provides an example analysis and code for running a Lasso analysis in R and SAS, and discusses limitations and related methods.

Notes

$\lVert Y - \hat{Y} \rVert_2$ is the ℓ2 norm of the difference between the observed values $Y$ and the predicted values $\hat{Y}$. In linear algebra, the ℓ2 norm (a.k.a. the Euclidean norm) is the square root of the sum of the squared elements of its argument. That is, $\lVert v \rVert_2 = \sqrt{\sum_i v_i^2}$. Squaring the ℓ2 norm is simply a notationally convenient way to write the sum of squared residuals in matrix form.
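The equivalence between the squared ℓ2 norm and the sum of squared residuals can be checked numerically. The following is a minimal sketch using only the Python standard library; the observed and predicted values are hypothetical numbers chosen for illustration and do not come from the article.

```python
import math

def l2_norm(v):
    """Euclidean (l2) norm: square root of the sum of squared elements."""
    return math.sqrt(sum(vi * vi for vi in v))

# Hypothetical observed and predicted values (illustration only).
y = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.5, 6.0, 9.5]

# Residuals are the element-wise differences between observed and predicted.
residuals = [obs - pred for obs, pred in zip(y, y_hat)]

# Sum of squared residuals computed directly.
sum_sq_resid = sum(r * r for r in residuals)

# The same quantity written as the squared l2 norm of the residual vector.
squared_norm = l2_norm(residuals) ** 2

# The two agree up to floating-point rounding.
print(math.isclose(sum_sq_resid, squared_norm))
```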

OLS estimates are unbiased provided that the model assumptions hold, such as homoscedasticity and normality of the residuals and proper specification of the model (i.e., including the appropriate predictors with the appropriate functional form).

In physics and numerical mathematics where regularization originated, regression model fitting can be considered an inverse problem because the data are collected first and parameters are estimated afterwards based on the collected data. Furthermore, model fitting is often an ill-posed problem because the parameters are sensitive to small fluctuations in the data (i.e., overfitting) or problems may have multiple or zero solutions (Hadamard, 1902). Regularization introduces additional information into the problem (through a penalty term) that addresses the ill-posed nature of model fitting by reducing the sensitivity to small data fluctuations (Ambartsumian, 1929).

Regularization is not typically applied to the intercept because the intercept is not typically considered to be affected by overfitting in a similar manner as regression coefficients.
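To make the role of the unpenalized intercept concrete, here is a minimal sketch of a Lasso fit with a single predictor, where the solution has a closed form via soft-thresholding. The objective scaling (1/(2n) times the residual sum of squares plus λ times the absolute slope), the function names, and the data are all assumptions made for illustration; they are not taken from the article's R or SAS code.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning exactly 0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_one_predictor(x, y, lam):
    """Minimize (1/(2n)) * sum((y - b0 - b1*x)^2) + lam * |b1|.

    Only the slope b1 appears in the penalty; the intercept b0 is free.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Centering removes the intercept from the slope calculation.
    xc = [xi - x_bar for xi in x]
    yc = [yi - y_bar for yi in y]
    cov_xy = sum(a * b for a, b in zip(xc, yc)) / n
    var_x = sum(a * a for a in xc) / n
    b1 = soft_threshold(cov_xy, lam) / var_x
    # The intercept is recovered afterwards and is never shrunk directly.
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data (illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [12.1, 13.9, 16.2, 17.8, 20.0]

for lam in (0.0, 1.0, 5.0):
    b0, b1 = lasso_one_predictor(x, y, lam)
    print(f"lambda={lam}: intercept={b0:.3f}, slope={b1:.3f}")
```

In this sketch, a penalty large enough to shrink the slope to zero leaves the intercept at the mean of y rather than pulling it toward zero, which is the behavior the note describes.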

Note that this is the average mean squared error across cross-validation samples for predicting the kth fold at a particular value of λ, not a single fit to the complete data. Otherwise, the OLS λ = 0 solution would have the lowest mean squared error.
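The distinction between cross-validated error and error on the complete data can be sketched as follows. This is a simplified illustration, assuming a single-predictor Lasso with a closed-form solution, a hypothetical data set, and a fold assignment by index; the function names are made up for this example and are not the article's code.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning exactly 0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def fit_lasso_1d(x, y, lam):
    """Single-predictor Lasso: (1/(2n)) * RSS + lam * |slope|, free intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / n
    var_x = sum((a - x_bar) ** 2 for a in x) / n
    b1 = soft_threshold(cov_xy, lam) / var_x
    return y_bar - b1 * x_bar, b1

def mse(x, y, b0, b1):
    """Mean squared prediction error of the line b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / len(x)

def cv_mse(x, y, lam, k):
    """Average held-out MSE across k folds at one value of lam."""
    n = len(x)
    fold_errors = []
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        held_out = [i for i in range(n) if i % k == fold]
        # Fit on the training folds only, then predict the held-out fold.
        b0, b1 = fit_lasso_1d([x[i] for i in train], [y[i] for i in train], lam)
        fold_errors.append(
            mse([x[i] for i in held_out], [y[i] for i in held_out], b0, b1)
        )
    return sum(fold_errors) / k

# Hypothetical data (illustration only).
x = [float(i) for i in range(1, 13)]
y = [2.3, 1.1, 3.8, 2.9, 4.6, 3.2, 5.9, 4.4, 6.1, 7.5, 5.8, 8.0]
lambdas = [0.0, 0.1, 0.5, 1.0, 2.0]

# Error when fitting and evaluating on the complete data.
train_mse = []
for lam in lambdas:
    b0, b1 = fit_lasso_1d(x, y, lam)
    train_mse.append(mse(x, y, b0, b1))

# Error when each fold is predicted by a model that never saw it.
cv_errors = [cv_mse(x, y, lam, k=4) for lam in lambdas]

for lam, tr, cv in zip(lambdas, train_mse, cv_errors):
    print(f"lambda={lam}: full-data MSE={tr:.4f}, CV MSE={cv:.4f}")
```

On the complete data, the λ = 0 fit always has the lowest mean squared error because OLS minimizes that quantity by construction. The cross-validated error carries no such guarantee, which is why it is the quantity used to choose λ.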

CalWorks is a program in California that provides resources to very low-income families. To qualify, California residents must be responsible for a child under 19 years old, have a very low income, and either work for very low wages, be unemployed, or be on the verge of unemployment.

Recall that stepwise p values do not address the adaptive nature of the model and are often quite untrustworthy.

Of course, real-world data never perfectly conform to the model assumptions and, as a result, the residuals from the estimated model often exhibit a small degree of correlation (e.g., trivial nestedness of the data, minor model misspecification due to the unavailability of a predictor in the data). Minor departures from these assumptions do not impact the estimation or interpretability of the model (Cohen et al., 2003; Lomax & Hahs-Vaughn, 2013).
