ABSTRACT
We study the nonparametric least squares estimator (LSE) of a multivariate convex regression function. The LSE, given as the solution to a quadratic program with $O(n^2)$ linear constraints ($n$ being the sample size), is difficult to compute for large problems. Exploiting problem-specific structure, we propose a scalable algorithmic framework based on the augmented Lagrangian method to compute the LSE. We develop a novel approach to obtain smooth convex approximations to the fitted (piecewise affine) convex LSE and provide formal bounds on the quality of approximation. When the number of samples is not too large compared to the dimension of the predictor, we propose a regularization scheme, Lipschitz convex regression, in which we constrain the norm of the subgradients, and we study the rates of convergence of the resulting LSE. Our algorithmic framework is simple and flexible and can be easily adapted to handle variants such as the estimation of a nondecreasing/nonincreasing convex/concave function, with or without a Lipschitz bound. We perform numerical studies illustrating the scalability of the proposed algorithm; on some instances our proposal leads to more than a 10,000-fold improvement in runtime when compared to off-the-shelf interior point solvers for problems with $n = 500$.
Supplementary Materials
The supplementary file gives the proofs of some of the results stated in the article, describes some of the algorithms in more detail, and provides additional computational studies.
Acknowledgments
The authors are grateful to the anonymous reviewers and the associate editor for their comments and helpful suggestions.
Notes
2 To see this, observe that any solution of Problem (2) can be extended to a convex function by the interpolation rule (3); the function thus defined is convex in $\Re^d$ and attains the same loss as the optimal objective value of Problem (2). On the other hand, any solution of Problem (1) is feasible for Problem (2).
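The extension argument in Note 2 can be illustrated numerically. The sketch below is not taken from the article: it assumes that Problem (2) uses the usual pairwise subgradient constraints theta_j + <xi_j, X_i - X_j> <= theta_i and that rule (3) is the standard max-affine interpolation, neither of which is reproduced in this section. The names `max_affine`, `is_feasible`, `theta`, and `xi` are hypothetical.

```python
import numpy as np

def max_affine(x, X, theta, xi):
    """Evaluate phi(x) = max_j { theta[j] + <xi[j], x - X[j]> } (assumed form of rule (3))."""
    x = np.asarray(x, dtype=float)
    return np.max(theta + np.einsum("jd,jd->j", xi, x - X))

def is_feasible(X, theta, xi, tol=1e-9):
    """Check theta[j] + <xi[j], X[i] - X[j]> <= theta[i] for all pairs (i, j)
    (assumed form of the O(n^2) constraints of Problem (2))."""
    diffs = X[:, None, :] - X[None, :, :]          # diffs[i, j] = X[i] - X[j]
    lower = theta[None, :] + np.einsum("ijd,jd->ij", diffs, xi)
    return bool(np.all(lower <= theta[:, None] + tol))

# Toy feasible point: values and gradients of the convex function f(x) = ||x||^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
theta = np.sum(X**2, axis=1)
xi = 2.0 * X

assert is_feasible(X, theta, xi)
# The max-affine extension reproduces theta at the sample points,
# so it has the same least squares loss as the feasible point itself.
fitted = np.array([max_affine(x, X, theta, xi) for x in X])
assert np.allclose(fitted, theta)
```

Because a pointwise maximum of affine functions is convex, any feasible point yields a convex function on the whole space with unchanged fitted values, which is the content of the note.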
3 We initialize at the least squares solution and set the other variables to zero.
4 We note that we use a prox function here as a regularizer; this usage is different from that of a proximal mapping (see, e.g., Parikh and Boyd 2014), which denotes a minimizer in the convex optimization literature.
5 $\nabla_1 \gamma(z; \tau)$ refers to the partial derivative of $\gamma(z; \tau)$ with respect to $z$.
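6 Observe that $\max \{\sum_{i=1}^{m} \alpha_i w_i \mid w \in \Delta_m\} = \alpha_{(m)}$, the largest among the $\alpha_i$, $i = 1, \ldots, m$. An optimal solution of this linear program is given by $w_{i^*} = 1$, where $i^* \in \arg\max_i \alpha_i$, and $w_i = 0$ otherwise.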
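The claim in Note 6 can be checked numerically. The following is a minimal sketch, assuming that the simplex in question is the unit simplex (nonnegative weights summing to one); the helper `simplex_lp_vertex` is a hypothetical name introduced only for illustration.

```python
import numpy as np

def simplex_lp_vertex(alpha):
    """Return a maximizer of <alpha, w> over the unit simplex:
    all weight on one index attaining the largest alpha."""
    alpha = np.asarray(alpha, dtype=float)
    w = np.zeros_like(alpha)
    w[np.argmax(alpha)] = 1.0
    return w

alpha = np.array([0.3, -1.2, 2.5, 0.7])
w_star = simplex_lp_vertex(alpha)

# The vertex attains the largest coefficient as its objective value.
assert np.isclose(alpha @ w_star, alpha.max())

# No randomly drawn point of the simplex achieves a larger objective.
rng = np.random.default_rng(1)
W = rng.dirichlet(np.ones(alpha.size), size=1000)
assert np.all(W @ alpha <= alpha.max() + 1e-12)
```

When several coefficients tie for the maximum, any convex combination of the corresponding vertices is also optimal; the sketch simply returns one of them.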
7 For a simple proof of this fact, note that for any $h \in \Re^m$ we have $\langle \nabla^2 \rho(w) h, h \rangle = \sum_i h_i^2 / w_i$. By the Cauchy–Schwarz inequality, it follows that $(\sum_i h_i^2 / w_i)(\sum_i w_i) \geq (\sum_i |h_i|)^2$, which implies strong convexity of the entropy prox function $\rho(\cdot)$ with respect to the $\ell_1$-norm.
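The step from the Cauchy–Schwarz inequality to strong convexity in Note 7 can be written out as follows. This is a sketch under the assumption that $w$ lies in the relative interior of the unit simplex, so that every $w_i > 0$ and $\sum_i w_i = 1$; the resulting modulus of 1 is what the displayed inequality gives under that assumption.

```latex
\begin{align*}
\|h\|_1 = \sum_i |h_i|
  &= \sum_i \frac{|h_i|}{\sqrt{w_i}} \cdot \sqrt{w_i}
   \le \Big( \sum_i \frac{h_i^2}{w_i} \Big)^{1/2} \Big( \sum_i w_i \Big)^{1/2}, \\
\langle \nabla^2 \rho(w) h, h \rangle
  = \sum_i \frac{h_i^2}{w_i}
  &\ge \frac{\big( \sum_i |h_i| \big)^2}{\sum_i w_i}
   = \|h\|_1^2
  \qquad \text{since } \sum_i w_i = 1 .
\end{align*}
```

A Hessian quadratic form bounded below by $\|h\|_1^2$ at every such $w$ is the second-order characterization of strong convexity with respect to the $\ell_1$-norm.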
8 We used the authors’ software for (a), (b), and our own implementation of (c) following the article.