Applied Artificial Intelligence

An International Journal

Volume 34, 2020 - Issue 5

Extreme Learning Regression for nu Regularization

Xiao-Jian Ding, College of Information Engineering, Nanjing University of Finance and Economics, Nanjing, China
https://orcid.org/0000-0002-5276-7727

Fan Yang, College of Information Engineering, Nanjing University of Finance and Economics, Nanjing, China. Correspondence: [email protected]
https://orcid.org/0000-0001-6861-9596

Jian Liu, College of Information Engineering, Nanjing University of Finance and Economics, Nanjing, China

Jie Cao, College of Information Engineering, Nanjing University of Finance and Economics, Nanjing, China

Pages 378-395 | Published online: 07 Feb 2020

https://doi.org/10.1080/08839514.2020.1723863

ABSTRACT

Extreme learning machine for regression (ELR), though efficient, is not preferred in time-limited applications because its model selection is time-consuming. To overcome this problem, we reformulate ELR with a new regularization parameter nu (nu-ELR), inspired by Schölkopf et al. The parameter nu is bounded between 0 and 1 and is easier to interpret than C. In this paper, we propose using the active set algorithm to solve the quadratic programming problem of nu-ELR. Experimental results on real regression problems show that nu-ELR performs better than ELM, ELR, and nu-SVR, and is computationally efficient compared with other iterative learning models. Additionally, the model selection time of nu-ELR can be significantly shortened.

Introduction

Recently, researchers in artificial intelligence have paid increasing attention to single hidden layer feedforward neural networks (SLFNs), such as RBF networks, SVM (considered a special type of SLFN), polynomial networks, Fourier series, and wavelets, owing to their strong performance (Bu et al. 2019; Cortes and Vapnik 1995; Park and Sandberg 1991; Shin and Ghosh 1995; Zhang and Benveniste 1992). Extreme Learning Machines (ELMs), first introduced by Huang and his group in the mid-2000s, are among the most popular SLFNs (Huang, Chen, and Siew 2006; Huang et al. 2006; Huang, Zhu, and Siew 2006). Different from previous works, ELM provides theoretical foundations for feedforward neural networks with random hidden nodes. It can handle classification, regression, clustering, representational learning, and many other learning tasks (Ding, Zhang, and Zhang et al. 2017; He, Xin, and Du et al. 2014; Lauren, Qu, and Yang et al. 2018). In this paper, we focus on the regression learning task.

The law of large numbers suggests an interesting characteristic of a trained ELM: for large training sets, a minimum empirical error ensures a minimum testing error with high probability. In theory, this happens with probability one for infinitely many training samples. However, because samples are limited in the real world, ELM may learn a function that perfectly separates the training samples but does not generalize to unseen data. To address this issue, an optimization extreme learning machine for binary classification (ELC) was proposed (Ding and Chang 2014; Huang, Ding, and Zhou 2010; Huang et al. 2012). ELC implements Bartlett’s theory (Bartlett 1998), according to which the smaller the norm of the output weights, the better the generalization performance the system tends to have. Compared with ELM, minimizing the norm of the output weights enables ELC to achieve better generalization performance. The optimization idea was then generalized to multi-class classification and regression learning tasks. Empirical studies on real benchmark problems have shown that, compared with classical learning algorithms (such as SVR and ELM), optimization ELM for regression (ELR) tends to provide better generalization performance at low computational cost. From the model selection point of view, ELR builds the training model without frequent parameter tuning.

ELR has two kinds of parameters that need to be tuned: the kernel parameter L and the penalty parameters. Several papers suggest that the ELM-style optimization model generally maintains good generalization ability when L is large (Frénay and Verleysen 2010; Huang, Ding, and Zhou 2010; Huang et al. 2012). In fact, one can set a proper L (e.g. 10³) before seeing the training samples. ELR uses the parameters C and ϵ to penalize samples that are not correctly predicted. However, C ranges from 0 to infinity, which makes its best value hard to estimate. For the SVM formulation, a new version of SVM for regression was developed in which the penalty parameter ϵ was replaced by an alternative parameter nu (Chang and Lin 2002; Schölkopf et al. 2000). Parameter nu lies between 0 and 1 and acts as a lower bound on the fraction of samples that are support vectors and an upper bound on the fraction of samples that lie on the wrong side of the hyperplane. Because of this intuitive meaning, nu is easier to tune than C or ϵ, and it has been successfully applied to the ELC formulation (Ding et al. 2017).

In this paper, we extend the ELR formulation to a new formulation with the parameter nu, named nu-ELR, to address the problem mentioned above. In nu-ELR, the parameter ϵ becomes a variable of the new formulation and is estimated automatically for the user. Compared with the parameter C of ELR, which ranges from 0 to infinity, the parameter nu lies in the much smaller range from 0 to 1 and can be tested on a linear scale.

This paper is organized as follows. Section 2 introduces the fundamentals of ELR and nu-SVR. Section 3 presents the optimization problem of nu-ELR and derives its dual problem. In Section 4, we propose an active set algorithm for solving the dual problem of nu-ELR. Section 5 compares nu-ELR with other state-of-the-art regressors on real benchmark datasets. Section 6 concludes the paper.

Related Works

In this section, the fundamentals of ELR and nu-SVR are reviewed.

ELR

Consider a regression problem with training samples ${\{(x_{i}, t_{i})\}}_{i = 1}^{N}$, where $x_{i} \in R^{d}$ is the input pattern and $t_{i} \in R$ is the corresponding target. ELM minimizes the training error as well as the norm of the output weights:

$$\text{Minimize: } \sum_{i = 1}^{N} |β \cdot h (x_{i}) - t_{i}| \quad \text{and} \quad ∥β∥ \tag{1}$$

where $β$ is the vector of the output weights between the hidden layer of L nodes and the output node, and $h (x_{i})$ is the output (row) vector of the hidden layer with respect to the input $x_{i}$. The function $h (x)$ maps the data $x$ from the input space to the L-dimensional ELM feature space.

Note that the norm term $∥β∥$ can be replaced by one half the norm squared, $\frac{1}{2} {∥β∥}^{2}$. Here, we use the $ε$-insensitive loss function:

$$|β \cdot h (x_{i}) - t_{i}|_{ε} = \max \{0, |β \cdot h (x_{i}) - t_{i}| - ε\} \tag{2}$$

where $ε \geq 0$ is the width of the $ε$-insensitive tube. Using the $ε$-insensitive loss function, only the training points outside the $ε$-tube contribute to the loss, whereas the training points closest to the actual regression have zero loss. According to ELM learning theory (Huang, Chen, and Siew 2006), ELM can approximate any continuous target function, so that any set of distinct training points lies inside the tube. However, some testing points may lie outside the tube for noisy problems. In this case, potential violations are represented using positive slack variables $ξ_{i}$ and $ξ_{i}^{*}$:

$$- ε - ξ_{i} \leq t_{i} - β \cdot h (x_{i}) \leq ε + ξ_{i}^{*}, \quad \forall i \tag{3}$$
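The loss (2) and the slack variables in (3) can be illustrated numerically. The following is a minimal sketch, not the authors' implementation; the function names and the sample values are hypothetical and chosen only for illustration.

```python
def eps_insensitive_loss(prediction, target, eps):
    """Loss (2): zero inside the eps-tube, linear outside it."""
    return max(0.0, abs(prediction - target) - eps)


def minimal_slacks(prediction, target, eps):
    """Smallest non-negative (xi, xi_star) that satisfy constraint (3)."""
    xi = max(0.0, prediction - target - eps)       # prediction above the tube
    xi_star = max(0.0, target - prediction - eps)  # prediction below the tube
    return xi, xi_star


# Hypothetical predictions beta . h(x_i) and targets t_i.
predictions = [1.0, 2.5, 0.25]
targets = [1.25, 1.5, 1.0]
eps = 0.5

losses = [eps_insensitive_loss(f, t, eps) for f, t in zip(predictions, targets)]
slacks = [minimal_slacks(f, t, eps) for f, t in zip(predictions, targets)]
```

For each sample, the loss equals the sum of the two minimal slacks, which is why the slack variables can stand in for the loss in the optimization problem.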

ELR attempts to strike a balance between minimization of training error and the penalization term.

$$\begin{aligned} \text{Minimize: } & \frac{1}{2} {∥β∥}^{2} + C \sum_{i = 1}^{N} (ξ_{i} + ξ_{i}^{*}) \\ \text{Subject to: } & - ε - ξ_{i} \leq t_{i} - β \cdot h (x_{i}) \leq ε + ξ_{i}^{*}, \\ & ξ_{i}, ξ_{i}^{*} \geq 0, \ i = 1, \dots, N, \ ε \geq 0 \end{aligned} \tag{4}$$

The parameter C controls the trade-off between the norms of weights and the training error.

nu-SVR

The primal problem of nu-SVR, as given in Chang and Lin (2002), is as follows:

$$\begin{aligned} \text{Minimize: } & \frac{1}{2} {∥w∥}^{2} + C \left(ν ε + \frac{1}{N} \sum_{i = 1}^{N} (ξ_{i} + ξ_{i}^{*})\right) \\ \text{Subject to: } & (w \cdot φ (x_{i}) + b) - t_{i} \leq ε + ξ_{i}, \\ & t_{i} - (w \cdot φ (x_{i}) + b) \leq ε + ξ_{i}^{*}, \\ & ξ_{i}, ξ_{i}^{*} \geq 0, \ i = 1, \dots, N, \ ε \geq 0 \end{aligned} \tag{5}$$

where $ξ, ξ^{*} \in R^{N}$ and $ε, b \in R$. The regression hyperplane of nu-SVR is $w \cdot φ (x_{i}) + b$ if $ξ = 0$. Here ν is the user-specified parameter between 0 and 1, and the training data $x_{i}$ are mapped into a feature space through a mapping $φ (x)$. The Wolfe dual formulation of this problem is

$$\begin{aligned} \text{Minimize: } & \frac{1}{2} \sum_{i = 1}^{N} \sum_{j = 1}^{N} (α_{i}^{*} - α_{i}) (α_{j}^{*} - α_{j}) K (x_{i}, x_{j}) - \sum_{i = 1}^{N} (α_{i}^{*} - α_{i}) t_{i} \\ \text{Subject to: } & \sum_{i = 1}^{N} (α_{i}^{*} - α_{i}) = 0, \quad 0 \leq α_{i}^{(*)} \leq \frac{C}{N}, \ i = 1, \dots, N, \\ & \sum_{i = 1}^{N} (α_{i}^{*} + α_{i}) \leq C ν \end{aligned} \tag{6}$$

where $α_{i}, α_{i}^{*} \geq 0$ are Lagrange multipliers and $K (x_{i}, x_{j}) = φ (x_{i}) \cdot φ (x_{j})$ is the kernel of the implicit mapping. Schölkopf et al. (2000) showed that ν is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors, and that both of these quantities approach ν asymptotically. Chang and Lin (2002) suggested that for any given $ν$, at least one optimal solution of (6) satisfies the equation $e^{T} (α + α^{*}) = C ν$, where $e = {[1, \dots, 1]}^{T} \in R^{N}$. Thus, the inequality constraint of (6) can be replaced by the equality constraint $\sum_{i = 1}^{N} (α_{i}^{*} + α_{i}) = C ν$.

Optimization Problem of nu-ELR

The original ELM formulation for regression (ELR) uses the parameter $C \in [0, \infty)$ to penalize points that are not correctly predicted. Parameter C is difficult to choose correctly, and one has to resort to cross-validation or direct experimentation to find a suitable value. In this section, we present a new formulation, nu-ELR, whose parameter C is replaced by parameter ν.

Optimization Formulation

Similar to ELR’s formulation, $ε$ is used as the width of the $ε$ -insensitive tube, which is slacked by variables $ξ_{i}^{(*)}$ . In the objective function, $ε$ is penalized by constant ν, and both variables $ε$ and $ξ_{i}^{(*)}$ are traded off against model complexity via a constant parameter C. Thus, the optimization formulation of nu-ELR can be shown as

$$\begin{aligned} \text{Minimize: } & L_{p} = \frac{1}{2} {∥β∥}^{2} + C \left(ν ε + \frac{1}{N} \sum_{i = 1}^{N} (ξ_{i} + ξ_{i}^{*})\right) \\ \text{Subject to: } & β \cdot h (x_{i}) - t_{i} \leq ε + ξ_{i}, \\ & t_{i} - β \cdot h (x_{i}) \leq ε + ξ_{i}^{*}, \\ & ξ_{i}, ξ_{i}^{*} \geq 0, \ i = 1, \dots, N, \ ε \geq 0 \end{aligned} \tag{7}$$

It should be noted that there are two major differences between the nu-ELR and nu-SVR formulations:

The mapping of nu-SVR is implicit, in contrast to the explicit mapping of nu-ELR. Thus, $φ (x_{i})$ in (5) is usually unknown and cannot be computed directly.
All kernel parameters in nu-SVR need to be tuned manually, whereas all kernel parameters of nu-ELR are chosen randomly.

To handle the constraints of (7), we introduce the Lagrangian:

$$\begin{aligned} L_{1} (β, ε, ξ^{(*)}, α^{(*)}, δ, η^{(*)}) = {} & \frac{1}{2} {∥β∥}^{2} + C ν ε + \frac{C}{N} \sum_{i = 1}^{N} (ξ_{i} + ξ_{i}^{*}) - δ ε - \sum_{i = 1}^{N} (η_{i} ξ_{i} + η_{i}^{*} ξ_{i}^{*}) \\ & - \sum_{i = 1}^{N} α_{i} (ε + ξ_{i} + t_{i} - β \cdot h (x_{i})) \\ & - \sum_{i = 1}^{N} α_{i}^{*} (ε + ξ_{i}^{*} - t_{i} + β \cdot h (x_{i})) \end{aligned} \tag{8}$$

where multipliers $α^{(*)}, η^{(*)}, δ \geq 0$ . This function has to be minimized with respect to variables $(β, ε, ξ^{(*)})$ of the primal problem and maximized with respect to dual variables $(α^{(*)}, δ, η^{(*)})$ . Setting the gradients of this Lagrangian with respect to $(β, ε, ξ^{(*)})$ equal to 0 gives the following KKT optimality conditions:

$$\begin{cases} \dfrac{\partial L_{1}}{\partial β} = 0 \Rightarrow β = \sum_{i = 1}^{N} (α_{i}^{*} - α_{i}) h (x_{i}) \\ \dfrac{\partial L_{1}}{\partial ε} = 0 \Rightarrow C ν - \sum_{i = 1}^{N} (α_{i}^{*} + α_{i}) - δ = 0 \\ \dfrac{\partial L_{1}}{\partial ξ_{i}^{(*)}} = 0 \Rightarrow \dfrac{C}{N} - α_{i}^{(*)} - η_{i}^{(*)} = 0 \end{cases} \tag{9}$$

Substituting the three equations of (9) into the Lagrangian $L_{1}$ leaves us with the following quadratic optimization problem:

$$\begin{aligned} \text{Minimize: } & L_{D} = \frac{1}{2} \sum_{i = 1}^{N} \sum_{j = 1}^{N} (α_{i}^{*} - α_{i}) (α_{j}^{*} - α_{j}) K_{ELM} (x_{i}, x_{j}) - \sum_{i = 1}^{N} (α_{i}^{*} - α_{i}) t_{i} \\ \text{Subject to: } & 0 \leq α_{i}^{(*)} \leq \frac{C}{N}, \quad \sum_{i = 1}^{N} (α_{i}^{*} + α_{i}) \leq C ν \end{aligned} \tag{10}$$

where $K_{E L M} (x_{i}, x_{j}) = h (x_{i}) \cdot h (x_{j})$ is the ELM kernel. Similar to nu-SVR’s formulation, the inequality constraint $\sum_{i = 1}^{N} (α_{i}^{*} + α_{i}) \leq C ν$ can be converted to the equality constraint $\sum_{i = 1}^{N} (α_{i}^{*} + α_{i}) = C ν$.
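The substitution step can be spelled out as follows. This is a sketch of the intermediate algebra, assuming that the stationarity condition in β is taken with respect to the hidden-layer outputs, i.e. that β is the sum of the terms (α_i^* - α_i) h(x_i); under that assumption the Lagrangian reduces to the negative of the dual objective in (10).

```latex
% The terms multiplying \varepsilon and \xi_i^{(*)} in L_1 vanish by the
% last two conditions of (9). With \beta = \sum_i (\alpha_i^* - \alpha_i) h(x_i):
\begin{aligned}
L_1 &= \tfrac{1}{2}\|\beta\|^2
      - \beta \cdot \sum_{i=1}^{N} (\alpha_i^* - \alpha_i)\, h(x_i)
      + \sum_{i=1}^{N} (\alpha_i^* - \alpha_i)\, t_i \\
    &= -\tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N}
        (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)\, h(x_i) \cdot h(x_j)
      + \sum_{i=1}^{N} (\alpha_i^* - \alpha_i)\, t_i
     = -L_D .
\end{aligned}
% Maximizing L_1 over the multipliers is therefore minimizing L_D.
% \eta_i^{(*)} \ge 0 gives \alpha_i^{(*)} \le C/N, and
% \delta \ge 0 gives \sum_i (\alpha_i^* + \alpha_i) \le C\nu.
```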

The resulting decision function can be written as

$$f (x) = \sum_{i = 1}^{N} (α_{i}^{*} - α_{i}) K_{ELM} (x_{i}, x) \tag{11}$$

From the dual formulation point of view, nu-SVR has to satisfy one more constraint, $\sum_{i = 1}^{N} (α_{i}^{*} - α_{i}) = 0$, than nu-ELR. As a result, nu-SVR tends to find a solution that is sub-optimal relative to nu-ELR’s solution.
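The equivalence between the primal prediction β · h(x) and the kernel form of the decision function can be checked on a toy example. This is a minimal sketch under stated assumptions: the hidden-layer outputs and multipliers below are hypothetical numbers, and β is taken as the sum of (α_i^* - α_i) h(x_i), the sign convention that follows from the Lagrangian (8).

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(p * q for p, q in zip(u, v))


# Hypothetical hidden-layer outputs h(x_i) for three training samples (L = 2).
H_train = [[0.5, 0.25], [0.75, 0.5], [0.25, 1.0]]
h_new = [0.5, 0.5]                    # h(x) for an unseen input x
alpha = [0.0, 0.5, 0.25]              # hypothetical multipliers alpha_i
alpha_star = [0.25, 0.0, 0.25]        # hypothetical multipliers alpha_i^*

# Coefficients alpha_i^* - alpha_i shared by both forms.
coef = [a_s - a for a, a_s in zip(alpha, alpha_star)]

# Primal form: beta = sum_i coef_i * h(x_i), then f(x) = beta . h(x).
beta = [sum(c * h[k] for c, h in zip(coef, H_train)) for k in range(len(h_new))]
f_primal = dot(beta, h_new)

# Kernel form: f(x) = sum_i coef_i * K_ELM(x_i, x), with K_ELM = h(x_i) . h(x).
f_dual = sum(c * dot(h, h_new) for c, h in zip(coef, H_train))
```

Because the ELM mapping is explicit, either form can be used at prediction time; the primal form costs O(L) per sample once β is stored.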

Karush-Kuhn-Tucker Conditions of nu-ELR

From the KKT optimality conditions (Fletcher 1981), the primal and dual optimal solutions satisfy the following conditions.

Primal feasibility

$$β \cdot h (x_{i}) - t_{i} \leq ε + ξ_{i}, \quad t_{i} - β \cdot h (x_{i}) \leq ε + ξ_{i}^{*}, \quad ξ_{i}^{(*)} \geq 0, \quad ε \geq 0, \quad \forall i \tag{12}$$

Dual feasibility

$$α_{i} \geq 0, \quad α_{i}^{*} \geq 0, \quad η_{i} \geq 0, \quad η_{i}^{*} \geq 0, \quad δ \geq 0, \quad \forall i \tag{13}$$

Complementary slackness

$$α_{i} (ε + ξ_{i} + t_{i} - β \cdot h (x_{i})) = 0, \quad α_{i}^{*} (ε + ξ_{i}^{*} - t_{i} + β \cdot h (x_{i})) = 0, \quad \forall i \tag{14}$$

$$δ ε = 0, \quad η_{i} ξ_{i} = 0, \quad η_{i}^{*} ξ_{i}^{*} = 0, \quad \forall i \tag{15}$$

By substituting (9) into (15), we have

$$\left(\frac{C}{N} - α_{i}\right) ξ_{i} = 0, \quad \left(\frac{C}{N} - α_{i}^{*}\right) ξ_{i}^{*} = 0 \tag{16}$$

From Equations (9) and (11), we have $f_{i} = β \cdot h (x_{i})$, where $f_{i}$ is the predicted value of nu-ELR for the sample $(x_{i}, t_{i})$. If $α_{i} = 0$, from (16) we have $ξ_{i} = 0$. By the primal feasibility condition (12), we have $f_{i} - t_{i} \leq ε$. If $α_{i} = \frac{C}{N}$, from (14) we have $ε + ξ_{i} + t_{i} - β \cdot h (x_{i}) = 0$. Since $ξ_{i} \geq 0$ by the primal feasibility condition (12), we further have $f_{i} - t_{i} \geq ε$. If $0 < α_{i} < \frac{C}{N}$, from (16) we have $ξ_{i} = 0$. By the complementary slackness condition (14) we have $ε + ξ_{i} + t_{i} - β \cdot h (x_{i}) = 0$, that is, $f_{i} - t_{i} = ε$. The KKT conditions for $α_{i}^{*}$ can be derived likewise. In the light of the above, we have the following conditions:

$$\begin{cases} α_{i}^{(*)} = 0 \Leftrightarrow f_{i} - t_{i} \leq ε \\ 0 < α_{i}^{(*)} < \frac{C}{N} \Leftrightarrow f_{i} - t_{i} = ε \\ α_{i}^{(*)} = \frac{C}{N} \Leftrightarrow f_{i} - t_{i} \geq ε \end{cases} \tag{17}$$
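Conditions (17) translate directly into a per-sample violation test of the kind an active set method needs. The sketch below is illustrative only: the function name, the tolerance `tol`, and the convention of passing t_i - f_i as the residual for the starred multipliers (by symmetry with the derivation above) are assumptions.

```python
def kkt_violated(alpha_i, residual, eps, upper, tol=1e-6):
    """Return True if one multiplier violates conditions (17).

    residual is f_i - t_i for alpha_i (assumed t_i - f_i for alpha_i^*),
    and upper is the box bound C/N.
    """
    if alpha_i <= tol:
        # Multiplier at zero: the sample must lie inside the tube.
        return residual > eps + tol
    if alpha_i >= upper - tol:
        # Multiplier at the upper bound: the sample must lie on or outside the tube.
        return residual < eps - tol
    # Free multiplier: the sample must lie exactly on the tube boundary.
    return abs(residual - eps) > tol


checks = [
    kkt_violated(0.0, 0.1, eps=0.5, upper=1.0),  # inside tube, alpha = 0
    kkt_violated(0.0, 0.9, eps=0.5, upper=1.0),  # outside tube, alpha = 0
    kkt_violated(1.0, 0.9, eps=0.5, upper=1.0),  # outside tube, alpha at bound
    kkt_violated(1.0, 0.1, eps=0.5, upper=1.0),  # inside tube, alpha at bound
    kkt_violated(0.4, 0.5, eps=0.5, upper=1.0),  # on boundary, free alpha
    kkt_violated(0.4, 0.7, eps=0.5, upper=1.0),  # off boundary, free alpha
]
```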

Solving Algorithm of nu-ELR

It is well known that an active set algorithm is an effective method for solving quadratic programming problems with inequality constraints. As far as we know, it has been successfully applied to many ELM style quadratic programming optimization problems, such as ELM for classification, ELM for regression, SVM, etc. In the method, a subset of the variables is fixed at their bounds and the objective function is minimized with respect to the remaining variables. After a number of iterations, the correct active set is identified and the objective function converges to a stationary point.

The objective function of (10) can be rewritten as

$$\begin{aligned} L_{D} (α^{(*)}) & = \frac{1}{2} \sum_{i, j = 1}^{N} (α_{i} α_{j} - α_{i} α_{j}^{*} - α_{i}^{*} α_{j} + α_{i}^{*} α_{j}^{*}) K_{ELM} (x_{i}, x_{j}) - \sum_{i = 1}^{N} (α_{i}^{*} - α_{i}) t_{i} \\ & = \frac{1}{2} \begin{bmatrix} α \\ α^{*} \end{bmatrix}^{T} \begin{bmatrix} K_{ELM} & - K_{ELM}^{T} \\ - K_{ELM} & K_{ELM} \end{bmatrix} \begin{bmatrix} α \\ α^{*} \end{bmatrix} - \begin{bmatrix} - T \\ T \end{bmatrix}^{T} \begin{bmatrix} α \\ α^{*} \end{bmatrix} \end{aligned} \tag{18}$$

Thus, formulation (10) can be equivalently written as

$$\begin{aligned} \text{Minimize: } & L_{D} (\bar{α}) = \frac{1}{2} {\bar{α}}^{T} {\bar{K}}_{ELM} \bar{α} - {\bar{T}}^{T} \bar{α} \\ \text{Subject to: } & 0 \leq α_{i}^{(*)} \leq \frac{C}{N}, \quad {\bar{e}}^{T} \bar{α} = C ν \end{aligned} \tag{19}$$

where $\bar{α} = [α; α^{*}]$, ${\bar{K}}_{ELM} = \begin{bmatrix} K_{ELM} & - K_{ELM}^{T} \\ - K_{ELM} & K_{ELM} \end{bmatrix}$, $\bar{T} = [- T; T]$ with $T = {[t_{1}, \dots, t_{N}]}^{T}$ the vector of targets, $\bar{e} = [e; e]$, and $e = {[1, \dots, 1]}^{T} \in R^{N}$. For simplicity of computation, we write $C$ for $\frac{C}{N}$ from here on, so the constraint ${\bar{e}}^{T} \bar{α} = C ν$ can be rewritten as ${\bar{e}}^{T} \bar{α} = C N ν$. We begin the active set algorithm with some notation. For a point $\bar{α}$ in the feasible region, we define $S_{0} = \{i | {\bar{α}}_{i} = 0\}$, $S_{C} = \{i | {\bar{α}}_{i} = C\}$, and $S_{work} = \{i | {\bar{α}}_{i} \in (0, C)\}$. The vectors ${\bar{α}}_{0}$, ${\bar{α}}_{C}$, and ${\bar{α}}_{work}$ are defined according to these sets: ${\bar{α}}_{0}$ collects the elements of $\bar{α}$ whose indices belong to $S_{0}$, and ${\bar{α}}_{C}$ and ${\bar{α}}_{work}$ collect the elements whose indices belong to $S_{C}$ and $S_{work}$, respectively. Likewise, we define ${\bar{T}}_{0}$, ${\bar{T}}_{w}$, and ${\bar{T}}_{C}$, where $\bar{T} = {\bar{T}}_{0} \cup {\bar{T}}_{w} \cup {\bar{T}}_{C}$. Corresponding to the index sets $S_{0}$, $S_{C}$, and $S_{work}$, we partition and rearrange the matrix ${\bar{K}}_{ELM}$ as follows:

$${\bar{K}}_{ELM} = \begin{bmatrix} {\bar{K}}_{00} & {\bar{K}}_{0 w} & {\bar{K}}_{0 C} \\ {\bar{K}}_{w 0} & {\bar{K}}_{w w} & {\bar{K}}_{w C} \\ {\bar{K}}_{C 0} & {\bar{K}}_{C w} & {\bar{K}}_{C C} \end{bmatrix}$$
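The index sets and the corresponding blocks of the stacked kernel matrix can be extracted as in the following sketch. The helper names, the tolerance, and the small symmetric matrix are hypothetical and serve only to show the bookkeeping.

```python
def partition_indices(alpha_bar, upper, tol=1e-9):
    """Split indices into S_0 (at zero), S_C (at the bound) and S_work (free)."""
    s_zero, s_upper, s_work = [], [], []
    for i, a in enumerate(alpha_bar):
        if a <= tol:
            s_zero.append(i)
        elif a >= upper - tol:
            s_upper.append(i)
        else:
            s_work.append(i)
    return s_zero, s_upper, s_work


def submatrix(matrix, rows, cols):
    """Block of a nested-list matrix with the given row and column indices."""
    return [[matrix[r][c] for c in cols] for r in rows]


# Hypothetical stacked multipliers and a small symmetric stand-in for the
# stacked kernel matrix.
alpha_bar = [0.0, 0.3, 1.0, 0.7]
K_bar = [[4, -1, 0, 2],
         [-1, 3, 1, 0],
         [0, 1, 5, -2],
         [2, 0, -2, 6]]

s_zero, s_upper, s_work = partition_indices(alpha_bar, upper=1.0)
K_ww = submatrix(K_bar, s_work, s_work)   # rows and columns in S_work
K_wC = submatrix(K_bar, s_work, s_upper)  # rows in S_work, columns in S_C
```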

Thus, the objective function of (19) is equal to $\frac{1}{2} {\bar{α}}_{work}^{T} {\bar{K}}_{w w} {\bar{α}}_{work} + {\bar{α}}_{C}^{T} {\bar{K}}_{w C} {\bar{α}}_{work} + \frac{1}{2} {\bar{α}}_{C}^{T} {\bar{K}}_{C C} {\bar{α}}_{C} - {\bar{T}}_{w}^{T} {\bar{α}}_{work} - {\bar{T}}_{C}^{T} {\bar{α}}_{C}$ . At each iteration, ${\overline{α}}_{C}$ is fixed and the formulation (19) can be equivalently written as

$$\begin{aligned} \text{Minimize: } & L_{1} ({\bar{α}}_{work}) = \frac{1}{2} {\bar{α}}_{work}^{T} {\bar{K}}_{w w} {\bar{α}}_{work} + {\bar{α}}_{C}^{T} {\bar{K}}_{w C} {\bar{α}}_{work} - {\bar{T}}_{w}^{T} {\bar{α}}_{work} \\ \text{Subject to: } & \sum {\bar{α}}_{work} + \sum {\bar{α}}_{C} = C N ν \end{aligned} \tag{20}$$

Then, formulation (20) can be further rewritten as

$$\begin{aligned} \text{Minimize: } & L_{1} ({\bar{α}}_{work}) = \frac{1}{2} {\bar{α}}_{work}^{T} {\bar{K}}_{w w} {\bar{α}}_{work} + ρ^{T} {\bar{α}}_{work} \\ \text{Subject to: } & A {\bar{α}}_{work} = τ \end{aligned} \tag{21}$$

where $ρ = {\bar{K}}_{w C}^{T} {\bar{α}}_{C} - {\bar{T}}_{w}$, $τ = C N ν - \sum {\bar{α}}_{C}$, $A = [1, \dots, 1] \in R^{1 \times n}$, and $n$ is the number of elements in the vector ${\bar{α}}_{work}$.

The Lagrangian for problem (21) is

$$L_{2} ({\bar{α}}_{work}, λ) = \frac{1}{2} {\bar{α}}_{work}^{T} {\bar{K}}_{w w} {\bar{α}}_{work} + ρ^{T} {\bar{α}}_{work} + λ (A {\bar{α}}_{work} - τ)$$

Setting the partial derivatives of the Lagrangian to zero leads to the following simple linear system.

$$\begin{bmatrix} {\bar{K}}_{w w} & A^{T} \\ A & 0 \end{bmatrix} \begin{bmatrix} {\bar{α}}_{work} \\ λ \end{bmatrix} = \begin{bmatrix} - ρ \\ τ \end{bmatrix} \tag{22}$$

So, the optimal solution ${\bar{α}}_{work}$ with the corresponding $λ$ can be solved by (22).
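The linear system (22) can be assembled and solved with a general dense solver. The sketch below uses NumPy and hypothetical values for the working-set kernel block, ρ, and τ; it is not the authors' MATLAB code, and it omits the handling of multipliers that leave the box [0, C], which a complete active set method must add.

```python
import numpy as np


def solve_working_set(K_ww, rho, tau):
    """Solve the KKT system (22) for the working-set multipliers and lambda."""
    n = K_ww.shape[0]
    A = np.ones((1, n))                       # row vector of ones
    kkt = np.block([[K_ww, A.T],
                    [A, np.zeros((1, 1))]])   # (n + 1) x (n + 1) KKT matrix
    rhs = np.concatenate([-rho, [tau]])
    solution = np.linalg.solve(kkt, rhs)
    return solution[:n], solution[n]


# Hypothetical working-set data.
K_ww = np.array([[2.0, 0.5],
                 [0.5, 1.0]])
rho = np.array([-1.0, 0.5])
tau = 1.0

alpha_work, lam = solve_working_set(K_ww, rho, tau)
```

For this toy input the equality constraint holds (the multipliers sum to τ) and the stationarity condition is satisfied; `np.linalg.solve` raises an error if the KKT matrix is singular, so a practical implementation may prefer a least-squares or regularized solve.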

Based on the above derivation, the proposed active set algorithm can be summarized by two loops. The first loop iterates over all samples violating the KKT conditions (17), beginning with the samples that are not on a bound. The iteration keeps alternating between passes over the entire training set and passes over the non-bound samples. If the optimality conditions are satisfied for all samples, the algorithm stops with the solution; otherwise, the second loop begins. The second loop is a series of iterative steps that solve formulation (21). As ${\bar{K}}_{w w}$ is positive semidefinite, problem (21) is convex, the strict decrease of the objective function holds, and a global minimum of (21) can be obtained. A theoretical convergence proof for a similar formulation was given in Ding and Chang (2014).

Numerical Experiments and Comparison of Results

In this section, the performance of nu-ELR is investigated by comparing it numerically not only with ELR but also with two other well-accepted learning models: ELM and nu-SVR. All the experiments have been conducted with a MATLAB implementation on a laptop with a 4-core i7-7700HQ CPU @ 2.8 GHz and 8 GB RAM. We have evaluated the four learning models on 26 well-known real-world benchmark data sets from the UCI Machine Learning Repository (Blake and Merz 2013) and Statlib (Mike 2005). All the inputs of the data sets have been normalized to the range [0,1], while the outputs are kept unchanged. To find the average performance, 50 trials are conducted for each dataset with every learning model. The training and testing samples of the corresponding data sets are reshuffled at each trial of simulation.
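Scaling each input attribute to [0, 1] can be sketched as follows. The helper name and the treatment of constant columns are assumptions; the article does not say how such columns were handled.

```python
def min_max_normalize(column):
    """Scale one attribute column linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant column carries no information and maps to 0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


scaled = min_max_normalize([2.0, 4.0, 6.0])
constant = min_max_normalize([3.0, 3.0])
```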

In these data sets, some features are in nominal format: they only identify objects and cannot be manipulated as numbers. Such features must be converted into numeric attributes, which are quantitative because they are measurable quantities represented by integer or real values. For example, in the ‘Abalone’ dataset, the ‘sex’ attribute has three states, M, F, and I, which are represented by 1, 2, and 3. In the ‘Cloud’ dataset, the ‘season’ attribute has four states: AUTUMN, WINTER, SPRING, and SUMMER. We simply use ‘1, 0, 0, 0’ to represent AUTUMN, and ‘0, 1, 0, 0’, ‘0, 0, 1, 0’, and ‘0, 0, 0, 1’ to represent the other three states. After attribute preprocessing, the number of attributes increases from 6 to 9 for the ‘Cloud’ dataset, from 7 to 36 for the ‘Machine_cpu’ dataset, and from 4 to 19 for the ‘Servo’ dataset.
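The indicator encoding of the ‘season’ attribute can be sketched as below. The helper name is hypothetical, and the assignment of the last three codes to WINTER, SPRING, and SUMMER in that order is an assumption, since the text only fixes the code for AUTUMN.

```python
def one_hot(value, categories):
    """Encode a nominal value as a 0/1 indicator list over the categories."""
    return [1 if value == category else 0 for category in categories]


SEASONS = ["AUTUMN", "WINTER", "SPRING", "SUMMER"]
encoded = {season: one_hot(season, SEASONS) for season in SEASONS}

# Replacing one nominal attribute by four indicators adds three attributes,
# which matches the reported growth from 6 to 9 for the 'Cloud' dataset.
attributes_after = 6 - 1 + len(SEASONS)
```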

Selection of Parameters

The popular Gaussian kernel function $K (x_{i}, x_{j}) = exp (- γ {∥x_{i} - x_{j}∥}^{2})$ is used in nu-SVR. To train nu-ELR and ELR, the Sigmoid type of ELM kernel is used: $K_{E L M} (x, x_{s}) = [G (a_{1}, b_{1}, x), . . ., G (a_{L}, b_{L}, x)]^{T}$ $\cdot [G (a_{1}, b_{1}, x_{s}), . . ., G (a_{L}, b_{L}, x_{s})]^{T}$, where $G (a, b, x) = 1 / (1 + exp (- (a \cdot x + b)))$. In addition, for the Sigmoid activation function of the ELM kernel, the input weights and biases are randomly generated from (−1, 1)^N×(0, 1) based on the uniform probability distribution.
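A Sigmoid ELM kernel matrix can be computed as in the following sketch. It uses NumPy rather than the authors' MATLAB code, and it assumes that each weight vector a has the same dimension d as the inputs; the function name, the seed, and the toy data are hypothetical.

```python
import numpy as np


def sigmoid_elm_kernel(X, Z, A, b):
    """ELM kernel matrix with entries h(x) . h(z) for Sigmoid hidden nodes.

    X is (n, d), Z is (m, d), A is (L, d) input weights, b is (L,) biases.
    """
    H_x = 1.0 / (1.0 + np.exp(-(X @ A.T + b)))   # hidden-layer outputs for X
    H_z = 1.0 / (1.0 + np.exp(-(Z @ A.T + b)))   # hidden-layer outputs for Z
    return H_x @ H_z.T


rng = np.random.default_rng(0)
d, L = 3, 1000                                   # input dimension, hidden nodes
A = rng.uniform(-1.0, 1.0, size=(L, d))          # weights drawn from (-1, 1)
b = rng.uniform(0.0, 1.0, size=L)                # biases drawn from (0, 1)
X = rng.uniform(0.0, 1.0, size=(5, d))           # inputs normalized to [0, 1]

K = sigmoid_elm_kernel(X, X, A, b)
```

Because the matrix is a Gram matrix of explicit feature vectors, it is symmetric and positive semidefinite by construction.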

In order to achieve good generalization performance, we use grid search to determine the kernel parameter $γ$ and ν for nu-SVR. Similar to Ghanty, Paul, and Pal (2009), the parameters ν and $γ$ of nu-SVR are tuned on a grid of {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1} $\times$ {0.001, 0.01, 0.1, 0.2, 0.4, 0.8, 1, 2, 5, 10, 20, 50, 100, 1000, 10000}. According to Ding and Chang (2014), Ding et al. (2017), Huang, Ding, and Zhou (2010), and Huang et al. (2012), ELR tends to achieve better generalization performance when the kernel parameter L is large enough. For ELR, L is set to 1000, and the parameter C is tuned on {0.001, 0.01, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 1000, 10000}. For nu-ELR, L is set to 1000, and the parameters ν and C are tuned on a grid of {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1} $\times$ {0.001, 0.01, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 1000, 10000}. For ELM, there is only one parameter that needs to be determined, the number of hidden nodes L. As the generalization performance of ELM is not sensitive to the number of hidden nodes L, parameter L is tuned on {5, 10, 20, 30, 50}.
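The grid for nu-ELR can be enumerated as in the sketch below. The grids are copied from the text; the search function and the toy validation error are hypothetical stand-ins for training and evaluating a model at each grid point.

```python
from itertools import product

NU_GRID = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
C_GRID = [0.001, 0.01, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 1000, 10000]


def grid_search(evaluate, nu_grid=NU_GRID, c_grid=C_GRID):
    """Return the (nu, C) pair with the lowest value of evaluate(nu, C)."""
    return min(product(nu_grid, c_grid), key=lambda pair: evaluate(*pair))


def toy_error(nu, c):
    """Hypothetical validation error, used only to exercise the search."""
    return (nu - 0.5) ** 2 + (c - 10) ** 2


best_nu, best_c = grid_search(toy_error)
n_candidates = len(NU_GRID) * len(C_GRID)
```

The nu-ELR grid therefore contains 150 candidate pairs, against 15 candidates for the single parameter C of ELR.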

The generalization performance of the four models (ELM, ELR, nu-SVR, and nu-ELR) on the ‘Sensory’ dataset for different combinations of model parameters is presented in Figure 1. It is clear that the best generalization performance of nu-SVR depends heavily on the combination (ν, $γ$) and is usually achieved in a narrow range of such combinations. In contrast, the generalization performance of nu-ELR is less sensitive to the combination (ν, C), especially in the narrower range (0.1–1, 0.1–100).

Figure 1. Generalization performance of four learning models on sensory dataset for different combinations of parameters. (a) ELM, (b) ELR, (c) nu-SVR, (d) nu-ELR.

To analyze this phenomenon, four more data sets (Baskball, Autoprice, Abalone, and Lowbwt) are selected to run different combinations of parameters, as shown in Figure 2. For all four data sets, we can confirm that nu-ELR behaves more smoothly on the local parameter combinations (ν, C) in (0.1–1, 0.1–100) than over the whole grid. For (d) of , if we fix the regularization parameter C at some value and vary the parameter ν over a large range, we find that the generalization performance of nu-ELR is less sensitive to the variation of parameter ν.

Figure 2. Generalization performance of nu-ELR on four data sets. (a) Baskball, (b) Autoprice, (c) Abalone, (d) Lowbwt.

Statlib Data Sets

In this section, the performance of the four learning models is tested on Statlib data sets, which comprise 15 benchmark data sets, as listed in Table 1.

Table 1. Statlib data sets.


In our experiments, for each evaluated learning model, we use one training/testing partition for parameter selection. Once parameter selection is completed, the selected parameter combination is used for all independent trials, with the training and testing samples randomly selected for each trial. First, we report the model selection results for all four learning models; the selected combinations of parameters are listed in Table 2.

Table 2. Parameters setting on Statlib data sets.


For all 15 data sets, 50 trials have been conducted for each dataset. Experimental results include the root-mean-square error (RMSE) and the training time (s), as shown in Tables 3 and 4. For 11 out of 15 data sets, nu-ELR gives the best RMSE. We also see that nu-ELR achieves the second-best performance on the four other data sets when compared with the three other learning models. Clearly, nu-ELR performs better than ELR on all 15 data sets. Another observation is worth noting: on the Balloon, Mbagrade, and Space-ga data sets, nu-SVR performs significantly worse than the other learning models. A possible explanation is that the generalization performance of nu-SVR is very sensitive to the model parameters. After parameter selection, even if nu-SVR performs well on one training/testing partition, it may perform very badly on other partitions.
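The reported metric, RMSE averaged over independent trials, can be computed as in this sketch. The function name and the two toy trials are hypothetical; the article uses 50 trials per dataset.

```python
import math
from statistics import mean


def rmse(predictions, targets):
    """Root-mean-square error between predictions and targets."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Two hypothetical trials, each with its own test partition.
trial_rmse = [
    rmse([0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0]),
    rmse([1.0, 3.0], [1.0, 3.0]),
]
average_rmse = mean(trial_rmse)
```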

Table 3. Performance comparison of four learning models (RMSE) on Statlib data sets.


Table 4. Training time comparison of four learning models (s) on Statlib data sets.


The training time experiment (see Table 4) shows that the advantage of ELM is quite obvious, as no iteration is needed in the ELM training process. The training times of the three other learning models are similar to each other, except for nu-SVR on some data sets (Balloon, Mbagrade, Space-ga). This phenomenon originates from the fact that nu-SVR can hardly find the global solution within the limited number of iterations.

UCI Data Sets

In this section, the performance of the four learning models is tested on UCI data sets, which comprise 11 benchmark data sets, as listed in Table 5. Table 6 shows the model selection results.

Table 5. UCI data sets.


Table 6. Parameters setting on UCI data sets.


From Table 7, out of 11 data sets, nu-ELR performs the best on 7 data sets, while nu-SVR performs the best on the 4 other data sets. For the Machine_cpu data set, we see that the original nu-SVR is significantly better than the other learning models in terms of RMSE. The most likely explanation for the superior performance of nu-SVR is that this data set is not sensitive to different training/testing partitions. For the Mpg dataset, it is worth noting that nu-SVR performs very badly compared with the three other learning models, for the reason discussed above. Table 8 gives the training time comparison. An interesting point of comparison is the Mpg data set: although nu-SVR performs very badly, it needs less training time than ELR and nu-ELR.

Table 7. Performance comparison of four learning models (RMSE) on UCI data sets.


Table 8. Training time comparison of four learning models (s) on UCI data sets.


A Performance Evaluation of Kernel Methods

The kernel $K_{E L M} (x_{i}, x)$ plays an important role in determining the characteristics of the nu-ELR learning model, and choosing different kernels may result in different performance. In this section, another experiment is set up to see how the kernel affects the performance of nu-ELR on different data sets. Four hidden-node functions commonly used in the literature for the ELM kernel $K_{E L M} (x, x_{s}) = [G (a_{1}, b_{1}, x), . . ., G (a_{L}, b_{L}, x)]^{T} \cdot [G (a_{1}, b_{1}, x_{s}), . . ., G (a_{L}, b_{L}, x_{s})]^{T}$ are adopted in this experiment:

Sigmoid function:

$G (a, b, x) = 1 / (1 + \exp (- (a \cdot x + b)))$

Sin function:

$G (a, b, x) = \sin (a \cdot x + b)$

Hard-limit function:

$G (a, b, x) = \mathrm{hardlimit} (a \cdot x + b)$

Exponential function:

$G (a, b, x) = \exp (- (a \cdot x + b))$

All the vectors $a$ and variables $b$ in these kernel functions are set the same, and are randomly generated from (−1, 1)^N×(0, 1) based on the uniform probability distribution. The test-set RMSE of the different nu-ELR kernels on the Statlib and UCI data sets is shown in Tables 9 and 10.
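The four hidden-node functions can be written as in the sketch below. The dictionary and helper names are hypothetical, and the hard-limit function is assumed to return 1 for non-negative arguments and 0 otherwise, which the article does not spell out.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sine(z):
    return math.sin(z)


def hard_limit(z):
    # Assumption: step function with value 1 at and above zero.
    return 1.0 if z >= 0 else 0.0


def exponential(z):
    return math.exp(-z)


ACTIVATIONS = {
    "sigmoid": sigmoid,
    "sin": sine,
    "hardlimit": hard_limit,
    "exponential": exponential,
}


def hidden_node(name, a, b, x):
    """Output G(a, b, x) of one hidden node with the named function."""
    z = sum(a_k * x_k for a_k, x_k in zip(a, x)) + b
    return ACTIVATIONS[name](z)


# The same (a, b) is shared by all four functions, as in the experiment.
outputs = {name: hidden_node(name, [1.0, -1.0], 0.0, [0.5, 0.5])
           for name in ACTIVATIONS}
```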

Table 9. Comparison results of four kernel functions on Statlib data sets.


Table 10. Comparison results of four kernel functions on UCI data sets.


From the two tables, we see that the Sigmoid kernel shows the best performance on most of the data sets. The Sin and Hard-limit kernels show similar performance on more than half of the data sets. The Hard-limit kernel is the second most accurate in our experiments, but is clearly less accurate than the Sigmoid kernel. As the computational cost of each kernel is very similar, the comparison of training time is not shown for this experiment.

Conclusions

By a simple reformulation of ELR with the parameter nu, this work proposes a novel ELR formulation as an inequality-constrained minimization problem. Its key advantage is that the new parameter nu needs to be searched only within the range [0, 1]. The minimization problem is solved using the active set algorithm. Experimental results on 26 regression problems from two data repositories demonstrate that nu-ELR achieves the best performance on most of the problems compared with the ELM, ELR, and nu-SVR learning models. The experiments also compare the RMSE of different nu-ELR kernels: the results show that some kernels are better than others, and that certain kernels are better suited to certain types of problems. In future work, the proposed approach will be extended to other kernel-based learning methods.

Additional information

Funding

This work was supported by the Key Program of the National Natural Science Foundation of China (91646204) and the Natural Science Foundation of Jiangsu (Grant No. BK20160148).

References

Bartlett, P. L. 1998. The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Transactions on Information Theory 44 (2):525–36. doi:10.1109/18.661502.
Blake, C. K., and C. J. Merz. 2013. UCI repository of machine learning databases. http://archive.ics.uci.edu/ml/.
Bu, Z., J. Li, C. Zhang, J. Cao, A. Li, and Y. Shi. 2019. Graph K-means based on leader identification, dynamic game, and opinion dynamics. IEEE Transactions on Knowledge and Data Engineering. doi:10.1109/TKDE.2019.2903712.
Chang, C. C., and C. J. Lin. 2002. Training ν-support vector regression: Theory and algorithms. Neural Computation 14 (8):1959–77. doi:10.1162/089976602760128081.
Cortes, C., and V. Vapnik. 1995. Support vector networks. Machine Learning 20:273–97. doi:10.1007/BF00994018.
Ding, S., N. Zhang, J. Zhang, X. Xu, and Z. Shi. 2017. Unsupervised extreme learning machine with representational features. International Journal of Machine Learning and Cybernetics 8 (2):587–95. doi:10.1007/s13042-015-0351-8.
Ding, X., and B. Chang. 2014. Active set strategy of optimized extreme learning machine. Chinese Science Bulletin 59 (31):4152–60. doi:10.1007/s11434-014-0512-2.
Ding, X. J., Y. Lan, Z. F. Zhang, and X. Xu. 2017. Optimization extreme learning machine with ν regularization. Neurocomputing 261:11–19. doi:10.1016/j.neucom.2016.05.114.
Fletcher, R. 1981. Practical Methods of Optimization: Volume 2 Constrained Optimization. New York: Wiley.
Frénay, B., and M. Verleysen. 2010. Using SVMs with randomized feature spaces: An extreme learning approach. In Proceedings of the 18th European Symposium on Artificial Neural Networks (ESANN), Bruges, Belgium, 28–30 April, 2010, 315–20.
Ghanty, P., S. Paul, and N. R. Pal. 2009. NEUROSVM: An architecture to reduce the effect of the choice of kernel on the performance of SVM. Journal of Machine Learning Research 10:591–622.
He, Q., J. Xin, C. Y. Du, F. Zhuang, and Z. Shi. 2014. Clustering in extreme learning machine feature space. Neurocomputing 128:88–95. doi:10.1016/j.neucom.2012.12.063.
Huang, G.-B., L. Chen, and C.-K. Siew. 2006. Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Transactions on Neural Networks 17 (4):879–92. doi:10.1109/TNN.2006.875977.
Huang, G.-B., X. Ding, and H. Zhou. 2010. Optimization method based extreme learning machine for classification. Neurocomputing 74:155–63. doi:10.1016/j.neucom.2010.02.019.
Huang, G.-B., H. Zhou, X. Ding, and R. Zhang. 2012. Extreme learning machine for regression and multiclass classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 42 (2):513–29. doi:10.1109/TSMCB.2011.2168604.
Huang, G.-B., Q.-Y. Zhu, K.-Z. Mao, C.-K. Siew, P. Saratchandran, and N. Sundararajan. 2006. Can threshold networks be trained directly? IEEE Transactions on Circuits and Systems-II: Express Briefs 53 (3):187–91. doi:10.1109/TCSII.2005.857540.
Huang, G.-B., Q.-Y. Zhu, and C.-K. Siew. 2006. Extreme learning machine: Theory and applications. Neurocomputing 70:489–501. doi:10.1016/j.neucom.2005.12.126.
Lauren, P., G. Qu, J. Yang, P. Watta, G.-B. Huang, and A. Lendasse. 2018. Generating word embeddings from an extreme learning machine for sentiment analysis and sequence labeling tasks. Cognitive Computation 10 (4):625–38. doi:10.1007/s12559-018-9548-y.
Mike, M. 2005. Statistical datasets. http://lib.stat.cmu.edu/datasets/.
Park, J., and I. W. Sandberg. 1991. Universal approximation using radial basis-function networks. Neural Computation 3:246–57. doi:10.1162/neco.1991.3.2.246.
Schölkopf, B., A. Smola, R. C. Williamson, and P. L. Bartlett. 2000. New support vector algorithms. Neural Computation 12:1207–45. doi:10.1162/089976600300015565.
Shin, Y., and J. Ghosh. 1995. Ridge polynomial networks. IEEE Transactions on Neural Networks 6 (3):610–22. doi:10.1109/72.377967.
Zhang, Q., and A. Benveniste. 1992. Wavelet networks. IEEE Transactions on Neural Networks 3 (6):889–98. doi:10.1109/72.165591.
