Research Article

LASSO–penalized clusterwise linear regression modelling: a two–step approach

Pages 3235-3258 | Received 07 Sep 2022, Accepted 24 May 2023, Published online: 04 Jun 2023
 

Abstract

In clusterwise regression analysis, the goal is to predict a response variable based on a set of explanatory variables, each with cluster-specific effects. In many real-life problems, the number of candidate predictors is large, with perhaps only a few of them contributing meaningfully to the prediction. A well-known method for variable selection is the LASSO, with calibration done by minimizing the Bayesian Information Criterion (BIC). However, existing LASSO-penalized estimators are problematic for several reasons. First, only certain types of penalties are considered. Second, the computations may sometimes involve approximate schemes. Third, variable selection is usually time consuming, due to a complex calibration of the penalty term, possibly requiring multiple evaluations of an estimator for each plausible value of the tuning parameter(s). We introduce a two-step approach to fill these gaps. In step 1 (Fit step), we fit LASSO clusterwise linear regressions with some pre-specified level of penalization. In step 2 (Selection step), we perform covariate selection locally, i.e. on the weighted data, with weights corresponding to the posterior probabilities from the previous step. This is done by using a generalization of the Least Angle Regression (LARS) algorithm, which permits covariate selection with a single evaluation of the estimator. In addition, both the Fit and Selection steps rely on an Expectation Maximization (EM) algorithm, fully in closed form, designed with a very general version of the LASSO penalty. The advantages of our proposal, in terms of reduced computation time and accuracy of model estimation and selection, are shown by means of a simulation study and illustrated with a real data application.
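The two-step idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made for illustration: all function names (`weighted_lasso`, `posteriors`, `fit_step`, `selection_step`) are hypothetical; the errors are taken to be Gaussian with no intercept; a plain L1 penalty stands in for the more general penalty mentioned above; coordinate descent replaces the closed-form EM updates; and a small penalty grid scored by a simplified weighted BIC replaces the generalized LARS algorithm, so the single-evaluation property of the paper's Selection step is not reproduced here.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: closed-form solution of the 1-D LASSO problem."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def weighted_lasso(X, y, w, lam, n_sweeps=200):
    """Coordinate descent for 0.5 * sum_i w_i (y_i - x_i'b)^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    r = y.astype(float).copy()                 # current residual y - X b
    scale = (w[:, None] * X ** 2).sum(axis=0)  # weighted squared column norms
    for _ in range(n_sweeps):
        for j in range(p):
            if scale[j] == 0.0:
                continue
            r += X[:, j] * b[j]                # remove coordinate j from the fit
            rho = np.sum(w * X[:, j] * r)
            b[j] = soft_threshold(rho, lam) / scale[j]
            r -= X[:, j] * b[j]                # put the updated coordinate back
    return b

def posteriors(X, y, pis, betas, sigmas):
    """E-step: posterior cluster membership probabilities (Gaussian errors assumed)."""
    resid = y[:, None] - X @ betas.T           # shape (n, K)
    logd = np.log(pis) - np.log(sigmas) - 0.5 * (resid / sigmas) ** 2
    logd -= logd.max(axis=1, keepdims=True)    # stabilise before exponentiating
    z = np.exp(logd)
    return z / z.sum(axis=1, keepdims=True)

def fit_step(X, y, K, lam, n_iter=50, seed=0):
    """Step 1 (Fit): EM with a fixed, pre-specified penalty level lam."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    z = rng.dirichlet(np.ones(K), size=n)      # random initial posteriors
    betas, sigmas = np.zeros((K, p)), np.ones(K)
    for _ in range(n_iter):
        pis = np.clip(z.mean(axis=0), 1e-10, None)
        for k in range(K):
            w = z[:, k]
            betas[k] = weighted_lasso(X, y, w, lam, n_sweeps=20)
            rss = np.sum(w * (y - X @ betas[k]) ** 2)
            # Floor sigma to avoid degenerate (zero-variance) clusters.
            sigmas[k] = max(np.sqrt(rss / max(w.sum(), 1e-10)), 1e-3)
        z = posteriors(X, y, pis, betas, sigmas)
    return z, pis, betas, sigmas

def selection_step(X, y, z, lam_grid):
    """Step 2 (Selection): per-cluster selection on posterior-weighted data.

    Assumption: a penalty grid scored by a simplified weighted BIC is used
    here in place of the generalized LARS path described in the abstract.
    """
    supports = []
    for k in range(z.shape[1]):
        w = z[:, k]
        n_k = max(w.sum(), 1e-12)              # effective cluster size
        best_bic, best_b = np.inf, np.zeros(X.shape[1])
        for lam in lam_grid:
            b = weighted_lasso(X, y, w, lam)
            rss = max(np.sum(w * (y - X @ b) ** 2), 1e-12)
            bic = n_k * np.log(rss / n_k) + np.count_nonzero(b) * np.log(n_k)
            if bic < best_bic:
                best_bic, best_b = bic, b
        supports.append(np.flatnonzero(best_b))
    return supports
```

A call such as `z, pis, betas, sigmas = fit_step(X, y, K=2, lam=1.0)` followed by `selection_step(X, y, z, [10.0, 1.0, 0.1])` would return one array of selected covariate indices per cluster. The point of the sketch is the division of labour: the posteriors are estimated once at a fixed penalty level, and the penalty calibration then runs on the weighted data without re-running the full EM for every candidate value.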

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

2 Covariate acronyms are AVG – Batting Average, OBP – On Base Percentage, R – Runs, H – Hits, 2B – Doubles, 3B – Triples, HR – Home Runs, RBI – Runs Batted In, BB – Walks, SO – Strikeouts, SB – Stolen Bases, ERS – Errors, FAE – Free Agency Eligibility, FA – Free Agent in 1991/92, AE – Arbitration Eligibility, ARB – Arbitration in 1991/92.
