Research Article

Feature Selection for Supervised Learning and Compression

Article: 2034293 | Received 27 May 2021, Accepted 18 Jan 2022, Published online: 06 Mar 2022

ABSTRACT

Supervised feature selection aims to find the signals that best predict a target variable. Typical approaches use measures of correlation or similarity, as seen in filter methods, or predictive power in learned models, as seen in wrapper methods. In both approaches, the selected features often have high entropies and are not suitable for compression. This is a particular drawback in the automotive domain where fast communication and archival of vehicle telemetry data is increasingly important, especially with technologies such as V2V and V2X (vehicle-to-vehicle and vehicle-to-everything communication). This paper aims to select features with good predictive performances and good compression by introducing a compressibility factor into several existing feature selection approaches. Where appropriate, performance guarantees are provided for greedy searches based on monotonicity and submodularity. Using the language of entropy, the relationship between relevance, redundancy, and compressibility is discussed from the perspective of signal selection. The approaches are then demonstrated in selecting features from a vehicle Controller Area Network for use in SVMs in a regression task, namely predicting fuel consumption, and a classification task, namely identifying Points of Interest. We show that while predictive performance is slightly lower when compression is considered, the compressibility of the selected features is significantly improved.

Introduction

Cities are becoming increasingly smart, with more sensors producing larger amounts of data. Sensors are deployed in all aspects of city infrastructure, and in particular, they are necessary for intelligent transportation systems. Collecting, storing, and analyzing the data from these sensors presents a huge challenge for researchers. Vehicles increasingly communicate with other vehicles (V2V) and traffic management systems (V2X) to enhance both efficiency and safety (Harding et al. Citation2014; Weiß Citation2011). For example, it is possible for one vehicle to broadcast that its brakes have been applied to warn others that they are slowing, reducing the reliance on a human driver seeing the brake lights. Modern roadside infrastructure is also laden with sensors that measure traffic flow, pollution, and weather, among other things, and use transmitters to provide useful information for vehicles and their drivers (Guerrero-Ibáñez, Zeadally, and Contreras-Castillo Citation2018). For example, traffic flow beginning to slow in one section of a highway may be eased by lowering the speed limit temporarily in another, reducing the level of traffic at the bottleneck. These examples all require large amounts of telemetry data to be communicated over networks. Existing communications infrastructure cannot satisfy the required bandwidth, necessitating new communication protocols and novel approaches to data analysis.

Supervised machine learning aims to build models that map inputs to output predictions for a target variable (Witten et al. Citation2016). For example, an automotive manufacturer may wish to learn a model that estimates whether or not a vehicle is likely to have a fault in the near future. Such a model may take input data from on-board vehicular sensors but also require data from many vehicles to learn a model with sufficient accuracy. Similarly, a highways agency may wish to locate pot-holes or predict traffic jams using road sensors and vehicle telemetry data. The number of possible input features in such domains is large, and computationally, it is typically impractical to consider them all. Moreover, many features may be irrelevant to the target variable or redundant with respect to other features. Both irrelevance and redundancy can increase model complexity or cause incorrect mappings to be learned, which in turn leads to lower predictive performance. To overcome this, supervised feature selection can be used to find a subset of the features that are most related to the target variable (i.e., high relevance) but least related to each other (i.e., low redundancy) (Ding and Peng Citation2005; Peng, Long, and Ding Citation2005) or have the best performance when combined in a model (Aboudi and Benhlima Citation2016; Guyon and Elisseeff Citation2003). However, feature selection approaches generally have a bias for features that carry more information about a target variable, as these typically are the most relevant. This high information content often coincides with high entropies and poor compression, meaning that these signals are likely to be expensive to communicate and difficult to store in data centers.

This paper considers compression as a cost in the feature selection process using ideas from cost-based feature selection (Bolón-Canedo et al. Citation2014) and aims to choose features with good predictive performance as well as high compression. In particular, the paper:

  • introduces a compressibility factor into a variety of filter and wrapper approaches to feature selection,

  • performs an entropy analysis of introducing the compressibility factor into feature selection schemes based on mutual information (MI), and

  • assesses the selection of compressible features from vehicle telemetry data in a regression and a classification task.

Specifically, a compressibility factor is introduced into the maximum relevance (MR), minimal redundancy maximal relevance (MRMR) (Peng, Long, and Ding Citation2005), and goodness of fit (GOF) (Das and Kempe Citation2011) filter approaches, and the wrapper (Kohavi and John Citation1997) approach using linear regression (LRr), logistic regression (LCc), and support vector machines (SVM). Relevant performance guarantees for greedy searches, based on monotonicity, submodularity, or weak submodularity, are provided for each of the filter methods (Nemhauser, Wolsey, and Fisher Citation1978). Although such theoretical guarantees do not extend to wrapper approaches, related assurances can be derived from those of model performance measures (May, Dandy, and Maier Citation2011). As such, wrapper methods typically achieve good performances when a suitable evaluation is used (Kohavi and John Citation1997). The proposed methods are then demonstrated in selecting features to predict instantaneous fuel consumption (FC) and to detect points of interest (PoIs) (Van Hinsbergh et al. Citation2018) using the telemetry signals available via a vehicle controller area network (CAN).

The remainder of this paper is structured as follows. Section 2 discusses related work in feature selection and compression. Section 3 details the performance measures and selection algorithms, along with any relevant performance guarantees. This is followed by an entropy analysis of MR, MRMR, and compressibility in Section 4. The selection methods are then demonstrated on collected vehicle telemetry data in Section 5. Finally, Section 6 concludes this paper and outlines future work.

Related Work

Feature selection aims to find a subset of all possible features to reduce their number while still sufficiently describing the data with respect to a particular task (Chandrashekar and Sahin Citation2014). Unsupervised feature selection generally aims to find the features that best differentiate the samples (Jennifer and Brodley Citation2004; Mitra, Murthy, and Sankar Citation2002). Typically, this relies on transforming the feature space, assessing clustering properties, or other discriminatory properties measured using heuristics (Alelyani, Tang, and Liu Citation2013).

In supervised learning, which is the focus of this paper, the three main approaches to feature selection are embedded, wrapper, and filter methods (Guyon and Elisseeff Citation2003; Jain and Singh Citation2018). Embedded methods perform feature selection as part of the learning algorithm (Lal et al. Citation2006). For example, in decision tree induction, the choice of variable on which to split nodes can be seen as feature selection. The nodes used, and their associated features, are then the selected features. Trees in random forests may also be removed if their estimated performance is poor, which may mean poor-performing features are never used in the learned model.

Wrapper methods assess feature subsets via the performances of models that use them as inputs (Kohavi and John Citation1997). This is done by comparing the performances of models learned using the same process and training samples but with different sets of input features. Wrapper methods are typically computationally expensive, as they require a machine learning algorithm to build a model for each feature subset evaluation. Filter methods, on the other hand, generally assess the performances of feature subsets using heuristics that are typically less computationally expensive. Some commonly used heuristics include similarity measures such as Pearson’s Correlation Coefficient (PCC) or MI, which can be used to estimate both relevance and redundancy of a feature set (Jain and Singh Citation2018; Witten et al. Citation2016).

In general, selecting up to l features from a set of features, X, can be represented as an optimization problem (Kumar and Minz Citation2014),

$$\underset{S \subseteq X,\ |S| \le l,\ l \le |X|}{\arg\max}\ P(S, y), \tag{1}$$

where P(S, y) estimates the performance of a subset of these features, S, with respect to predicting the target variable, y. Ideally, all possible subsets would be evaluated to find the feature set that provides the highest performance, but the number of possible feature subsets of size $l$ alone is $\binom{|X|}{l}$, and this exhaustive search is infeasible. In practice, therefore, a more efficient combinatorial search algorithm is applied.

Possibly the most common search strategy is the forward greedy search (Kumar and Minz Citation2014), which selects the feature, $s_k \in X \setminus S_k$, that satisfies a condition, C,

$$C = \underset{s_k \in X \setminus S_k}{\arg\max}\ \left[ P(S_k \cup \{s_k\}, y) - P(S_k, y) \right], \tag{2}$$

in each of $k = 1, 2, \ldots, l$ iterations. The search begins with no selected features, $S_0 = \emptyset$, and so initially the feature with the highest individual performance is selected. It is typical that during this first iteration, $P(\emptyset, y) = 0$. In the subsequent $l - 1$ iterations, $P(S_k, y)$ is the performance of the already selected features, and the feature that increases the performance most is selected. The search stops when a stopping criterion is met, such as when a given number of features has been selected or if the performance score decreases after selecting a new feature.
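
To make the procedure concrete, the following minimal Python sketch implements this forward greedy search. It is our illustration rather than the paper's implementation: the performance function perf stands in for any of the measures P(S, y) discussed below, and the non-positive-gain stopping rule is one of the stopping criteria mentioned above.

```python
def forward_greedy(features, y, perf, l):
    """Select up to l features by the forward greedy search of Equation 2.

    features: dict mapping feature name -> data column.
    perf: set function estimating P(S, y); P(empty set, y) is taken as 0.
    """
    selected, score = {}, 0.0  # start with no features, P(empty set, y) = 0
    for _ in range(l):
        best_gain, best_name = 0.0, None
        for name, column in features.items():
            if name in selected:
                continue
            # Marginal gain of adding this feature to the current set.
            gain = perf({**selected, name: column}, y) - score
            if best_name is None or gain > best_gain:
                best_gain, best_name = gain, name
        if best_name is None or best_gain <= 0:
            break  # stop once no remaining feature improves P(S, y)
        selected[best_name] = features[best_name]
        score += best_gain
    return selected
```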

If the performance measure is both monotone (i.e., it does not decrease as $|S|$ grows),

$$P(S, y) \le P(S', y), \tag{3}$$

for all $S \subseteq S' \subseteq X$, and submodular (i.e., it has diminishing returns),

$$P(S \cup \{x\}, y) - P(S, y) \ge P(S' \cup \{x\}, y) - P(S', y), \tag{4}$$

for all $x \in X \setminus S'$, then the solution of the forward greedy search is within $1 - e^{-1}$ (≈63%) of the optimal. The consequences of submodularity are also used in the lazy greedy search algorithm (as described in Section 3.6), which maintains a sorted list of upper bounds on the marginal gains to reduce the number of evaluations of P(S, y) (Minoux Citation1978). If the performance measure is submodular but non-monotone, alternative greedy algorithms with different performance guarantees can be used (Buchbinder et al. Citation2015; Feige, Mirrokni, and Vondrak Citation2011), and the submodularity ratio (Das and Kempe Citation2018) can be used for set functions that are monotone but not submodular.

Two advances to the greedy algorithm include the distributed greedy and stochastic greedy searches (Khanna et al. Citation2017; Mirzasoleiman et al. Citation2015). In each of these, multiple searches are applied to different subsets of the full feature set. In distributed greedy search, the subsets are generated by sampling the features without replacement, whereas in stochastic greedy search, the sampling is performed with replacement. The results of the multiple searches are combined into an interim result set, on which another greedy search is applied to find the final result. Alternatively, the feature subsets can be combined by taking their intersection (Nagpal and Singh Citation2018).

Other optimization approaches include particle swarm optimization (PSO), which iteratively updates candidate vectors in a population based on both local and global information (Kennedy and Eberhart Citation1995). For feature selection using binary PSO, each candidate vector is a probabilistic representation of the features being selected, whose elements are collapsed onto either 0 or 1 for feature set evaluation using a random threshold (Chuang et al. Citation2008). Similarly, brain storm optimization iteratively clusters candidates, generates new ones, and selects those with the best performance for the next iteration (Papa et al. Citation2018). Differential evolution approaches also iteratively create new populations through a process of mutation, crossover, and selection (Zhang et al. Citation2020).

Feature selection is often characterized as a compression method, as it reduces the data that must be processed by machine learning algorithms and models. This has several advantages, including reducing model complexity and improving performance, but selected features often have high variances and entropies that typically coincide with good predictive performance (Guyon and Elisseeff Citation2003). This bias toward high entropy features limits the data compression that can be achieved with those that are selected. It may be the case, particularly with the high redundancies found with vehicle telemetry and city data, that some features with lower entropies and better compression may provide comparable predictive performances (Taylor, Griffiths, and Mouzakitis Citation2018).

To achieve both these competing aims, namely selecting high quality features that have good compression, this paper adopts ideas from cost-based feature selection. A simple approach is to introduce a cost factor, C(S), into the optimization problem (Equation 1) as follows (Bolón-Canedo et al. Citation2014):

$$\underset{S \subseteq X,\ |S| \le l,\ l \le |X|}{\arg\max}\ P(S, y) - \omega\,C(S), \tag{5}$$

where ω is a cost weighting parameter. A similar approach is to sum the costs of individual features to estimate the cost of the feature set (Jiang, Kong, and Li Citation2021). Costs of features can arise from observing or measuring them (e.g., in using expensive medical machinery), or computational costs (e.g., in extracting features from large images).

Other approaches include searching for a Pareto-optimal solution, where no criterion can improve without worsening another (e.g., Hancer, Xue, and Zhang Citation2018; Zhang et al. Citation2019, Citation2020). Typically, such multiobjective optimization approaches focus on the competing aims of minimizing the number of selected features versus maximizing predictive performance, but Hancer et al. (Hancer, Xue, and Zhang Citation2018) demonstrated the approach for multiple feature performance measures (including MI relevance and redundancy). While such approaches may be applicable to the objectives of predictive performance and compression, they do not enable an investigation of the trade-off between them. As such, this paper employs a cost-weighting parameter that enables full control over the relative importance of compression, measured using a compressibility factor outlined in Section 3.1.

Data compression aims to remove redundancy, thus making the data representation smaller, in such a way that it can be restored (Salomon and Motta Citation2010). Characteristics of city and vehicle telemetry data, including temporal consistency, noise, and signal redundancy, all support good compression. While signal redundancy is removed in performing feature selection, noise and temporal consistency are not removed and must be considered using other kinds of data compression. There are two broad categories of data compression, namely lossless and lossy.

Lossless compression aims to compress the data in such a way that the uncompressed version is indistinguishable from the original (Salomon and Motta Citation2010). Typically, lossless compression inspects the frequencies of symbols and looks for repeating symbols or sequences of symbols in the data stream. Perhaps the simplest method of compression is run-length encoding, in which symbols are encoded along with their number of consecutive repetitions. For example, the string 'AAAABBA' can be encoded as 'A4B2A1.' Two other notable lossless compression algorithms are LZ77 dictionary encoding (Ziv and Lempel Citation1978) and Huffman coding (Huffman Citation1952). LZ77 uses a sliding window and searches for repeating sequences, each of which is encoded as its length and the location of its first occurrence in the window. Huffman coding produces a variable length prefix-code defining the path to the encoded symbol in a Huffman tree. Symbols that occur with higher frequencies are located closer to the root node in the tree and thus have shorter Huffman codes. Taken together, LZ77 and Huffman encoding make up the DEFLATE compression algorithm (Deutsch Citation1996), which is the basis of the ZIP file format.
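
As a concrete illustration of run-length encoding, the following minimal Python sketch reproduces the 'AAAABBA' → 'A4B2A1' example above. It is an illustration of the idea only, not a production codec; real implementations also handle the bit-level packing of counts.

```python
from itertools import groupby

def rle_encode(s: str) -> str:
    """Run-length encode a string: each symbol is followed by its repeat count."""
    return "".join(f"{ch}{len(list(run))}" for ch, run in groupby(s))

assert rle_encode("AAAABBA") == "A4B2A1"
```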

Whereas lossless compression guarantees that the decompressed stream is the same as the original, lossy compression relaxes this constraint and aims only to minimize information loss. In particular, lossy compression aims to keep information where it is important and lose information where its loss will not be noticed. In MP3 audio compression, for example, the high frequencies above the human hearing range are removed. For vehicle telemetry data, similar components of the signals can be removed if they are not useful to further analysis. Some signals such as vehicle speed, for example, contain noise that may even be detrimental to analyses. Approaches to separating the informative and noise components of signals include the discrete wavelet transform (DWT) (Addison Citation2017), the discrete Fourier transform (DFT) (Bailey and Swarztrauber Citation1994), and the discrete cosine transform (DCT) (Ahmed, Natarajan, and Rao Citation1974).

The DWT operates by extracting two signals that are each half the length of the original (Addison Citation2017). The first represents an approximation of the original signal and is referred to as the low frequency (LF) component. The second is the high frequency (HF) component and is a representation of the detail in the original signal. The LF and HF components are produced by a convolution of the original signal with wavelet kernels, followed by a down-sampling by a factor of 2. Typically, the HF component contains many small values and can be considered the noise component of the original signal. By quantizing this component and applying a threshold, many of these small values become zero and can be encoded very efficiently using lossless compression methods such as run-length encoding. This quantization introduces errors into the signal when it is reconstructed, but the error is minimized because only the HF component (detail) is affected. This process can be applied recursively to the LF component, further increasing the potential for lossless compression of the coefficients.
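
A minimal sketch of this DWT-based scheme is shown below, assuming the PyWavelets (pywt) package; the wavelet, level, threshold, and quantization step are illustrative parameters (Section 5 uses two db5 layers), and a real codec would also need to store the array shapes required for reconstruction.

```python
import zlib
import numpy as np
import pywt

def dwt_compress(signal, wavelet="db5", level=2, threshold=0.5, q_step=1 / 256):
    """Lossy DWT compression: decompose, zero and quantize small detail (HF)
    coefficients, then losslessly compress the result with DEFLATE (zlib)."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)  # [LF, HF_level, ..., HF_1]
    approx, details = coeffs[0], coeffs[1:]
    # Thresholding makes most detail coefficients exactly zero, creating long
    # runs that the lossless stage encodes efficiently.
    q_details = [
        np.round(np.where(np.abs(d) < threshold, 0.0, d) / q_step).astype(np.int32)
        for d in details
    ]
    payload = approx.astype(np.float32).tobytes()
    payload += b"".join(d.tobytes() for d in q_details)
    return zlib.compress(payload, 6)
```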

This paper focuses primarily on lossless compression, but the methods are also applicable to lossy compression approaches such as DWT.

Feature Selection

To consider the compression of features as a cost during their selection (Bolón-Canedo et al. Citation2014), a compressibility factor is introduced into the feature selection process alongside the performance metric, P(S, y), of a feature set, S. This compressibility factor is detailed in Section 3.1, followed by its introduction to the filter methods in Sections 3.2–3.4 and the wrapper method in Section 3.5. These feature selection approaches were chosen because of their simplicity and existing adoption in the literature and practice. Their range also demonstrates that compression can be considered generally in feature selection. Finally, Section 3.6 describes the forward greedy and lazy forward greedy searches used to find the subset of features that maximizes the performance measure, P(S, y).

Compressibility Factor

The compressibility factor, Com(S), is defined as the mean compressibility over a set of features, S,

$$\mathrm{Com}(S) = \frac{1}{|S|} \sum_{x_i \in S} \beta(x_i), \tag{6}$$

where β(xi) is the compressibility of the individual feature, xi. Compressibility of an individual feature can typically be assessed using entropy, but this is difficult to compute for vehicle telemetry signals, which are not independent and identically distributed. We therefore adopt a more direct approach and measure the space savings achieved by a compression algorithm on training samples,

$$\beta(x) = 1 - \frac{B_{comp}(x)}{B_{orig}(x)}, \tag{7}$$

where $B_{comp}$ is the number of bits in the compressed data and $B_{orig}$ is the number of bits in the original. This measure is larger for variables that are more easily compressed than for those that do not compress well, and typically has a range of [0, 1]. Therefore, this compressibility measure should be maximized to minimize the impact of poor feature compression.

Due to the pigeon-hole principle, a lossless compression algorithm may increase the number of bits in the data in some circumstances, leading to β(x) taking a negative value. To mitigate this, a compression algorithm can output the original data if its size would be increased, prepending a single bit to signify whether or not this is the case. If the bit has a value of 1, all subsequent data is equal to the original (i.e., it was not compressed); otherwise, the subsequent data was compressed by the algorithm and should be decompressed. Therefore, in general, β(x) has a range of $[\epsilon, 1]$, where $\epsilon = -\frac{1}{B_{orig}(x)}$.

In the simulations performed for this paper, the compressibility is measured using DEFLATE for all features, which in the worst case increases the required storage by at most three bits per block. Different features are more suited to different compression methods, however, so it may be beneficial to employ a compression strategy targeted toward the particular features being considered. For example, DEFLATE may be applied to some features and DWT to others, or groups of features may be considered together in a more comprehensive compression strategy. These approaches incur higher computational costs during feature selection, however, compared to applying the same compression method to each feature individually. We therefore leave the development of a more sophisticated estimate for the impact of poor compression to future work.
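
For illustration, the compressibility factor of Equation 7 can be computed as below, using Python's zlib (a DEFLATE implementation) and including the one-bit escape described above; the function name is ours. Com(S) is then simply the mean of these values over the features in S.

```python
import zlib

def compressibility(x: bytes) -> float:
    """Space saving beta(x) = 1 - B_comp / B_orig (Equation 7), using DEFLATE
    via zlib at level 6. The one-bit escape caps expansion: if DEFLATE would
    grow the data, the original is stored instead, plus a single flag bit."""
    b_orig = 8 * len(x)
    b_comp = min(8 * len(zlib.compress(x, 6)), b_orig) + 1  # +1 flag bit
    return 1.0 - b_comp / b_orig
```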

Maximum Relevance (MR)

To assess the relevance of a feature set, the individual feature relevancies can be aggregated,

$$P_{MR}(S, y) = \frac{1}{|S|} \sum_{x_i \in S} I(x_i, y), \tag{8}$$

where $I(x_i, y)$ is the MI between feature, $x_i$, and the target, y. Unfortunately, the mean is not monotone and is not submodular in general. To overcome this, MR is redefined as a summation, which is both monotone and modular,

$$P_{\Sigma MR}(S, y) = \sum_{x_i \in S} I(x_i, y). \tag{9}$$

In a forward greedy search, this is equivalent to selecting the feature, sk, that satisfies,

$$C_{MR} = \underset{s_k \in X \setminus S}{\arg\max}\ \left[ \sum_{x_i \in S \cup \{s_k\}} I(x_i, y) - \sum_{x_i \in S} I(x_i, y) \right] \tag{10}$$
$$= \underset{s_k \in X \setminus S}{\arg\max}\ I(s_k, y), \tag{11}$$

in each iteration.

Theorem 1. $P_{\Sigma MR}(S, y)$ is monotone and modular.

Proof 1. $I(x, y) \ge 0$, and any summation of a set of non-negative numbers is monotone and modular.

Theorem 2. The final set, S, found using a stable forward greedy search is the same when using either the mean of the individual signal performances, $P_\mu(S, y) = P_{MR}(S, y)$, or their summation, $P_\Sigma(S, y) = P_{\Sigma MR}(S, y)$.

Proof 2. In the kth iteration, $k = |S_{k-1}| + 1$, and $P_\Sigma(S_{k-1} \cup \{s_k\}, y) = k\,P_\mu(S_{k-1} \cup \{s_k\}, y)$. Given $S_{k-1}$, let $\hat{s}_k$ be the feature selected using $P_\mu(\cdot)$, and let $\bar{s}_k \ne \hat{s}_k$ be that selected using $P_\Sigma(\cdot)$. Using $P_\mu(\cdot)$,

$$P_\mu(S \cup \{\hat{s}_k\}, y) - P_\mu(S, y) \ge P_\mu(S \cup \{\bar{s}_k\}, y) - P_\mu(S, y)$$
$$\Rightarrow P_\mu(S \cup \{\hat{s}_k\}, y) \ge P_\mu(S \cup \{\bar{s}_k\}, y)$$
$$\Rightarrow k\,P_\mu(S \cup \{\hat{s}_k\}, y) \ge k\,P_\mu(S \cup \{\bar{s}_k\}, y)$$
$$\Rightarrow P_\Sigma(S \cup \{\hat{s}_k\}, y) \ge P_\Sigma(S \cup \{\bar{s}_k\}, y).$$

Using $P_\Sigma(\cdot)$,

$$P_\Sigma(S \cup \{\bar{s}_k\}, y) - P_\Sigma(S, y) \ge P_\Sigma(S \cup \{\hat{s}_k\}, y) - P_\Sigma(S, y)$$
$$\Rightarrow P_\Sigma(S \cup \{\bar{s}_k\}, y) \ge P_\Sigma(S \cup \{\hat{s}_k\}, y),$$

and so both $P_\mu(S \cup \{\hat{s}_k\}, y) = P_\mu(S \cup \{\bar{s}_k\}, y)$ and $P_\Sigma(S \cup \{\hat{s}_k\}, y) = P_\Sigma(S \cup \{\bar{s}_k\}, y)$. Finally, because the search is stable (i.e., tie breaks are decided deterministically), $\hat{s}_k = \bar{s}_k$. Therefore, in each iteration of a forward greedy search, the same maximizing feature, $s_k \in X \setminus S$, is selected using either $P_\mu(S, y)$ or $P_\Sigma(S, y)$.

Corollary 1. Using a stable forward greedy search to maximize the summation, $P_\Sigma(S, y)$, or the mean, $P_\mu(S, y)$, yields the same optimal result for MR.

To maximize the compression of selected features, the compressibility factor must also be maximized. The combination of MR and the compressibility factor (MR+C) can be defined as,

$$P_{MR+C}(S, y) = P_{MR}(S, y) + \omega_{com} \times \mathrm{Com}(S), \tag{12}$$

where a weighting parameter, $\omega_{com}$, allows the level to which compression is considered during the selection process to be varied. The smaller the value of $\omega_{com}$, the less compression is considered. In particular, a value of $\omega_{com} = 0$ means that MR+C is equivalent to MR selection. As with MR, MR+C can be written as a monotone and modular summation,

$$P_{\Sigma MR+C}(S, y) = \sum_{x_i \in S} \left[ I(x_i, y) + \omega_{com}\,\beta(x_i) \right], \tag{13}$$

which results in the same optimal final set as $P_{MR+C}(\cdot)$ when using a stable forward greedy search (see Proof 2). Such a forward greedy search selects the feature, $s_k$, that maximizes,

$$C_{MR+C} = \underset{s_k \in X \setminus S}{\arg\max}\ I(s_k, y) + \omega_{com}\,\beta(s_k), \tag{14}$$

in each iteration.
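
Because the MR+C criterion is modular, selection reduces to ranking features by their individual scores. The sketch below illustrates this, assuming scikit-learn's mutual_info_regression as the MI estimator (the paper itself estimates MI from discretized signals, as described in Section 5); the function name is ours.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mr_c_ranking(X, y, beta, w_com=0.1):
    """Rank features by the MR+C criterion I(x_i, y) + w_com * beta(x_i).

    X: (n_samples, n_features) array; beta: per-feature compressibility scores.
    A full ranking suffices here because the criterion is modular: taking the
    top l features is equivalent to running the forward greedy search.
    """
    mi = mutual_info_regression(X, y)      # estimated relevance I(x_i, y)
    scores = mi + w_com * np.asarray(beta)
    return np.argsort(scores)[::-1]        # feature indices, best first
```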

Minimum Redundancy Maximum Relevance (MRMR)

The redundancy of a feature set can be assessed as the mean of all pairwise feature similarities, measured using MI,

$$\mathrm{Red}(S) = \frac{1}{|S|^2} \sum_{x_i \in S} \sum_{x_j \in S} I(x_i, x_j), \tag{15}$$

which, in MRMR, is to be minimized in the performance function, for example,

$$P_{MRMR}(S, y) = P_{MR}(S, y) - \mathrm{Red}(S). \tag{16}$$

As with MR, MRMR can be written as a summation,

$$P_{\Sigma MRMR}(S, y) = \sum_{x_i \in S} \left[ I(x_i, y) - \frac{1}{|S|} \sum_{x_j \in S} I(x_i, x_j) \right], \tag{17}$$

which is equivalent to selecting the feature, sk, which satisfies,

$$C_{MRMR} = \underset{s_k \in X \setminus S}{\arg\max}\ I(s_k, y) - \frac{1}{|S|} \sum_{x_j \in S} I(s_k, x_j), \tag{18}$$

in each iteration of a stable forward greedy search.

In this form, MRMR is neither submodular nor monotonic. Having said this, the MRMR criterion (Equation 18) is equivalent to the max-dependency criterion when one feature is selected at a time (Peng, Long, and Ding Citation2005), as in forward greedy search. While MRMR aims to minimize redundancy between features and maximize their relevance to a target variable, it does not consider the joint probability distribution of the variables. Max-dependency aims to directly maximize the joint dependency between the features and the target variable (Peng, Long, and Ding Citation2005). Direct analysis of the joint probability distributions is computationally expensive, however, so MRMR is a useful alternative when using forward greedy search. Furthermore, max-dependency is easily shown to be monotone (i.e., adding an input variable does not decrease the MI dependency with the target). Therefore, MRMR results in the same selected features as max-dependency, which provides an approximation guarantee of $1 - e^{-1}$.

Finally, the compression factor can again be introduced as,

$$P_{MRMR+C}(S, y) = \sum_{x_i \in S} \left[ I(x_i, y) - \frac{1}{|S|} \sum_{x_j \in S} I(x_i, x_j) + \omega_{com}\,\beta(x_i) \right], \tag{19}$$

which is equivalent to selecting the feature, sk, that satisfies,

$$C_{MRMR+C} = \underset{s_k \in X \setminus S}{\arg\max}\ I(s_k, y) - \frac{1}{|S|} \sum_{x_j \in S} I(s_k, x_j) + \omega_{com}\,\beta(s_k), \tag{20}$$

in each iteration of a stable forward greedy search.
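
A sketch of this MRMR+C forward greedy selection is given below, again assuming scikit-learn's MI estimator on continuous data rather than the paper's discretization; the pairwise MI matrix is computed once up front for clarity, at the cost of memory.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mrmr_c_select(X, y, beta, l, w_com=0.1):
    """Forward greedy MRMR+C selection (Equation 20)."""
    n = X.shape[1]
    relevance = mutual_info_regression(X, y)  # I(x_i, y) for each feature
    # Pairwise feature MI, I(x_i, x_j), estimated once up front (n x n).
    pair_mi = np.array([mutual_info_regression(X, X[:, j]) for j in range(n)])
    selected = []
    for _ in range(l):
        remaining = [i for i in range(n) if i not in selected]
        def score(i):
            redundancy = pair_mi[i, selected].mean() if selected else 0.0
            return relevance[i] - redundancy + w_com * beta[i]
        selected.append(max(remaining, key=score))
    return selected
```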

Goodness of Fit (GOF)

If two features are poor individual predictors (i.e., they each have low relevance to the target) but can be combined in a model to have good performance, they should be selected. MR and MRMR are both unable to capture these kinds of feature interactions (although adaptations using conditional MI exist (Herman et al. Citation2013)). Alternatively, such feature interactions are captured by assessing the GOF, G, of features in a model,

$$M(S) = L_M(S, y), \tag{21}$$

produced by fitting the features S to target y using the learning algorithm LM. Goodness-of-fit measures how closely the model maps the input data, S, to the target variable, y. In regression tasks, a linear regression model is often used alongside the R2 goodness-of-fit measure. In general, any model and GOF measure can be used. For example, in a classification task, a Naïve Bayes or SVM classifier can be used alongside the accuracy or area under the receiver operator characteristic curve (AUC) as a measure of fit (Witten et al. Citation2016). The performance of a feature set using GOF then has the form,

$$P_{M,G}(S, y) = G(y, M(S)) \tag{22}$$
$$= G(y, L_M(S, y)). \tag{23}$$

If the measure for GOF is a non-negative monotone set function, an approximation guarantee is provided via the submodularity ratio,

$$\gamma_{S,l}(P) = \min_{L \subseteq S,\; M : |M| \le l,\; M \cap L = \emptyset} \frac{\sum_{x \in M} \left[ P(L \cup \{x\}, y) - P(L, y) \right]}{P(L \cup M, y) - P(L, y)}, \tag{24}$$

which is the minimum ratio of the increase in performance when adding the individual elements of one subset, $x \in M$, to another, L, one at a time, compared to adding all elements of M to L at once (Das and Kempe Citation2011, Citation2018). Specifically, with linear regression and $R^2$, forward greedy stepwise selection provides a lower bound approximation of $1 - e^{-\gamma_{S,l}(P)}$. A special case of this approximation is when the GOF is measured using a submodular function, in which case $\gamma_{S,l} = 1$ and the lower bound approximation is $1 - e^{-1}$.

For the regression task investigated, a linear regression model, $M_{LRr}$, is used, of the form

$$M_{LRr}(X) = X\beta + \epsilon, \tag{25}$$

where $\beta$ is the vector of linear coefficients, fit by minimizing the squared error, and $\epsilon$ is the error term. GOF is measured using $R^2$,

$$G_{R^2}(y, M(X)) = 1 - \frac{E\left[(y - M(X))^2\right]}{E\left[(y - E[y])^2\right]}, \tag{26}$$

where $E[\cdot]$ is the expectation, or mean, of the variable. In this GOF selection for regression (GOFr), therefore, the performance measure is

$$P_{M_{LRr}, G_{R^2}}(S, y) = G_{R^2}(y, M_{LRr}(S)). \tag{27}$$

For the classification task, binomial logistic models (Cox Citation1958) are used,

$$M_{LCc}(X) = \left(1 + e^{-X\beta}\right)^{-1}, \tag{28}$$

which output the likelihood of the target variable having a value of 1. Using this likelihood, the AUC, GAUC, is calculated as a measure for goodness of fit. Thus, goodness of fit for classification (GOFc) uses the performance measure

$$P_{M_{LCc}, G_{AUC}}(S, y) = G_{AUC}(y, M_{LCc}(S)). \tag{29}$$

While this suffices for the binary classification task used in the evaluation for this paper, performing stepwise selection with multi-class target variables requires a multi-class learning algorithm, such as a decision tree or naïve Bayes. Another possible approach is to transform the target into multiple binary variables and learn multiple binary classification models to be combined in an ensemble (Fürnkranz Citation2002).

To consider compression in both the regression and classification tasks, the compressibility factor is again added to the performance metrics. The GOFr and compressibility (GOFr+C) performance metric is then

$$P_{M_{LRr}, G_{R^2}+C}(S, y) = G_{R^2}(y, M_{LRr}(S)) + \omega_{com} \times \mathrm{Com}(S), \tag{30}$$

and for GOFc and compressibility (GOFc+C), it is

$$P_{M_{LCc}, G_{AUC}+C}(S, y) = G_{AUC}(y, M_{LCc}(S)) + \omega_{com} \times \mathrm{Com}(S). \tag{31}$$
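
As an illustration of the GOFr+C measure (Equation 30), the sketch below fits a linear regression to the candidate features and adds the weighted mean compressibility, using scikit-learn; the function name is ours, and the classification variant (Equation 31) would swap in a logistic model and roc_auc_score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def gof_r_plus_c(X_sub, y, beta_sub, w_com=0.1):
    """GOFr+C (Equation 30): R^2 goodness of fit of a linear regression on the
    candidate features, plus the weighted mean compressibility of those features."""
    model = LinearRegression().fit(X_sub, y)
    gof = r2_score(y, model.predict(X_sub))  # fit assessed on the training data itself
    return gof + w_com * np.mean(beta_sub)   # Com(S) is the mean of the beta values
```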

Wrapper (W)

The wrapper approach to feature selection is similar to GOF, but goodness of fit is replaced by model performance, measured on data not used when fitting the models (Aboudi and Benhlima Citation2016; Guyon and Elisseeff Citation2003). As feature selection is performed using the training data, the training data is further split into a selection-training portion, $(X_{tr}, y_{tr})$, used to learn models with different feature sets, and a selection-testing portion, $(X_{te}, y_{te})$, used to assess their performance. It is typical for this selection-testing portion to be referred to as hold-out or validation data. A selection-training portion that is too small may lead to poorly fit models that are not indicative of the performance of the feature set, while one that is too large will leave too few samples in the selection-testing portion for a reliable performance estimate. Alternatively, the training data can be split into several folds for a k-fold cross-validation of each feature set, but of course this increases the computational requirements of the already computationally expensive wrapper approach. In this paper, the first half of the training samples temporally make up the selection-training portion, and the remainder make up the selection-testing portion.

Any learning algorithm, L(S, y), can be used in a wrapper, although usually it is the same model that is to be used on the final set of features. For example, if features are being selected for use in an SVM, then an SVM is the most appropriate method for use within the wrapper during the selection process. A large number of models are typically learned during wrapper selection, and some learning algorithms require a large amount of computation to learn each model. Therefore, the computational requirements of a learning algorithm may also be a consideration when adopting the wrapper selection approach, in particular with large numbers of features from which to select. In some cases, a two-stage feature selection process may be performed that uses a filter to reduce the number of features significantly before applying a wrapper to reduce the feature set further (Peng, Long, and Ding Citation2005).

To assess the performance of a set of features, S, a model, M, must first be learned using the selection-training portion, $(S_{tr}, y_{tr})$,

$$M = L_M(S_{tr}, y_{tr}). \tag{32}$$

This model is then applied to the selection-testing portion, $(S_{te}, y_{te})$, to produce a set of outputs,

$$\hat{y}_{te} = M(S_{te}), \tag{33}$$

which are then compared to the observed values, $y_{te}$, to estimate the performance using a performance measure,

$$P_{M,P}(S, y) = P(M(S_{te}), y_{te}). \tag{34}$$

Alternatively, this can be written by defining the model outputs using a partial function of the learning algorithm,

$$P_{M,P}(S, y) = P\left(L_M(S_{tr}, y_{tr})(S_{te}), y_{te}\right). \tag{35}$$

Again, compression can be considered by adding the compressibility factor to the performance measure,

$$P_{M,P+C}(S, y) = P(M(S_{te}), y_{te}) + \omega_{com} \times \mathrm{Com}(S) \tag{36}$$
$$= P\left(L_M(S_{tr}, y_{tr})(S_{te}), y_{te}\right) + \omega_{com} \times \mathrm{Com}(S). \tag{37}$$
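
A sketch of this wrapper-with-compression evaluation is given below for the WSVMr+C case (cf. Equation 39), using scikit-learn's SVR with default hyperparameters and the temporal half split described above; it is an illustration of the scheme under those assumptions, not the paper's exact implementation.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import r2_score

def wrapper_svm_r_plus_c(X_sub, y, beta_sub, w_com=0.1):
    """WSVMr+C: train an SVM regressor on the first (temporal) half of the
    training data, score R^2 on the second half, then add the weighted
    mean compressibility of the candidate features."""
    mid = len(y) // 2
    model = SVR().fit(X_sub[:mid], y[:mid])               # selection-training portion
    perf = r2_score(y[mid:], model.predict(X_sub[mid:]))  # selection-testing portion
    return perf + w_com * np.mean(beta_sub)
```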

The submodularity of the wrapper approach cannot be guaranteed, in general, because the model is learned and evaluated using different sets of data. If the training data is independent and identically distributed (IID) and sampled in a stratified manner, a performance guarantee similar to that provided by the submodularity ratio can be achieved when using linear regression. Time series data, however, is not IID, and so no such guarantees can be provided for the data used in this paper.

In our evaluation, $R^2$ is used as the model performance measure for the regression task, with both linear regression (i.e., $P_{M_{LRr}, P_{R^2}}$), referred to as WLRr, and SVM regression (i.e., $P_{M_{SVMr}, P_{R^2}}$), referred to as WSVMr. For the classification task, AUC is used along with logistic regression (i.e., $P_{M_{LCc}, P_{AUC}}$), referred to as WLCc, and SVM classification (i.e., $P_{M_{SVMc}, P_{AUC}}$), referred to as WSVMc. In all cases, compression is included by adding the compression factor; therefore, for linear regression and $R^2$ with compression (WLRr+C), the performance measure is

$$P_{M_{LRr}, P_{R^2}+C}(S, y) = P_{R^2}(M_{LRr}(S_{te}), y_{te}) + \omega_{com} \times \mathrm{Com}(S), \tag{38}$$

and for SVM regression with compression (WSVMr+C), it is defined as

$$P_{M_{SVMr}, P_{R^2}+C}(S, y) = P_{R^2}(M_{SVMr}(S_{te}), y_{te}) + \omega_{com} \times \mathrm{Com}(S). \tag{39}$$

The performance measure for logistic regression with compression (WLCc+C) is defined as

$$P_{M_{LCc}, P_{AUC}+C}(S, y) = P_{AUC}(M_{LCc}(S_{te}), y_{te}) + \omega_{com} \times \mathrm{Com}(S), \tag{40}$$

and finally, for SVM classification with compression (WSVMc+C), it is defined as

$$P_{M_{SVMc}, P_{AUC}+C}(S, y) = P_{AUC}(M_{SVMc}(S_{te}), y_{te}) + \omega_{com} \times \mathrm{Com}(S). \tag{41}$$

Search

In feature selection, the aim is to find the subset of features that maximizes the performance metric, P(S, y). Naively, this is done by evaluating the performance metric for all possible subsets of features, but in reality, there are too many subsets for this to be feasible. For instance, there are $\binom{|X|}{k} = \frac{|X|!}{k!\,(|X| - k)!}$ different subsets of size k when selecting from $|X|$ features. The forward greedy search is outlined in Algorithm 1 and iteratively selects the feature that maximizes the marginal gain in performance until k features have been selected. In each iteration, P(S, y) is computed for each unselected feature, meaning that $\frac{k}{2}(2|X| - k + 1)$ feature set evaluations are performed when selecting k features from a set of $|X|$ features. While this is usually computationally feasible for filter approaches, the feature set evaluation in the wrapper approach is significantly more expensive, and so a more efficient search is required.

Algorithm 2 outlines the lazy forward greedy search algorithm (Minoux Citation1978). To avoid computing P(S,y) for every unselected feature in each iteration, lazy forward greedy search prioritizes the features with good performance in previous iterations. Specifically, the feature, s, with maximum observed marginal gain, Δs, is tentatively selected and evaluated to recompute and update its marginal gain,

$$\Delta_s = P(S \cup \{s\}, y) - P(S, y). \tag{42}$$

It is possible that this new marginal gain is no longer the maximum observed. In this case, the next feature, which now maximizes the marginal gain, is tentatively selected and evaluated. This process is repeated until a feature is tentatively selected for a second consecutive time, at which point it is selected and added to S. During the first iteration, when $S = \emptyset$, the algorithm requires that $P(\emptyset, y) = 0$. In subsequent iterations, P(S, y) is the performance of the already selected features.
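
The sketch below gives one standard heap-based formulation of this lazy forward greedy search; it is our illustration rather than the paper's implementation. For submodular P(S, y) it returns the same set as forward greedy, and otherwise it is a heuristic, as the results in Section 5 illustrate.

```python
import heapq

def lazy_forward_greedy(features, y, perf, l):
    """Lazy forward greedy search: keep (stale) marginal gains in a max-heap
    and only re-evaluate the current front-runner (Minoux-style)."""
    selected, score = {}, 0.0                            # P(empty set, y) = 0
    heap = [(-float("inf"), name) for name in features]  # negated gains (min-heap)
    heapq.heapify(heap)
    for _ in range(min(l, len(features))):
        while True:
            _, name = heapq.heappop(heap)
            # Recompute the stale marginal gain for the front-runner.
            gain = perf({**selected, name: features[name]}, y) - score
            if not heap or gain >= -heap[0][0]:
                # Fresh gain still beats every stale bound: select this feature.
                selected[name] = features[name]
                score += gain
                break
            heapq.heappush(heap, (-gain, name))          # reinsert with updated gain
    return selected
```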

Entropy Analysis

MI is typically defined as

$$I(x, y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}, \tag{43}$$

but is equivalently defined using entropies,

$$I(x, y) = H(y) + H(x) - H(x, y) \tag{44}$$
$$= H(x) - H(x|y) \tag{45}$$
$$= H(y) - H(y|x) \tag{46}$$
$$= H(x, y) - H(x|y) - H(y|x), \tag{47}$$

where H(x) and H(y) are the marginal entropies of x and y, H(x, y) is their joint entropy, and H(x|y) and H(y|x) are conditional entropies. The marginal entropy, H(x), also provides a lower bound on the number of bits required to represent some data, x, i.e., the best compression achievable. Although compression algorithms such as DEFLATE often require a larger number of bits than this bound in practice, H(x) is still a useful theoretical tool for assessing compressibility. In particular, H(x) is lower when x can be compressed into fewer bits and higher when x is difficult to compress, and thus it has similar properties to the compression ratio. In this section, we use entropies to analyze the relationships between relevance, redundancy, and compression in MR, MR+C, MRMR, and MRMR+C.
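
As a small worked illustration of this connection, the empirical marginal entropy of a discrete signal can be computed as follows. This helper is ours and assumes IID samples, which, as noted in Section 3.1, does not hold for telemetry signals; hence the paper's direct measurement of space savings in Equation 7.

```python
import numpy as np

def marginal_entropy(x):
    """Empirical marginal entropy H(x) in bits: -sum_v p(v) * log2 p(v)."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# A low-entropy signal admits better compression than a high-entropy one:
# marginal_entropy([0, 0, 0, 1]) ~= 0.811 bits; marginal_entropy([0, 1, 2, 3]) == 2.0 bits.
```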

MR selects the feature that satisfies the condition, CMR, in each iteration, which is defined using entropy as

$$C_{MR} = \underset{s_k \in X \setminus S}{\arg\max}\ H(y) - H(y|s_k). \tag{48}$$

Similarly, CMR+C is defined as

$$C_{MR+C} = \underset{s_k \in X \setminus S}{\arg\max}\ H(y) - H(y|s_k) - \omega_{com}\,H(s_k), \tag{49}$$

CMRMR as

$$C_{MRMR} = \underset{s_k \in X \setminus S}{\arg\max}\ H(y) - H(y|s_k) - H(s_k) + \frac{1}{|S|} \sum_{x \in S} H(s_k|x), \tag{50}$$

and CMRMR+C as

$$C_{MRMR+C} = \underset{s_k \in X \setminus S}{\arg\max}\ H(y) - H(y|s_k) - (1 + \omega_{com})\,H(s_k) + \frac{1}{|S|} \sum_{x \in S} H(s_k|x). \tag{51}$$

Each of these criteria is a linear sum over the same four entropy terms, of which some are maximized (i.e., they appear with positive coefficients) and some are minimized (i.e., they appear with negative coefficients). Table 1 shows the coefficients of these terms in each of the four information-based feature selection methods. While H(y) is always maximized with a coefficient of 1, it can be ignored, as it is constant throughout the selection process because the target, y, is fixed.

Table 1. The quantities of entropy that are maximized in each of the entropy-based criteria.

In all the equations, relevance is maximized, which in each equation is represented by the term $-H(y|s_k)$ (i.e., the conditional entropy of y given $s_k$ is minimized). Intuitively, this is minimizing the unknown quantity of y given the information in $s_k$, represented as the intersections of $s_1$ and $s_2$ with y in the Venn diagram in Figure 1(a). In this example, $s_1$ has a larger intersection with y than $s_2$ (i.e., more of y is known given $s_1$ than given $s_2$).

Figure 1. Entropy Venn diagrams.

In MRMR and MRMR+C, redundancy is minimized using the term $\frac{1}{|S|} \sum_{x \in S} H(s_k|x)$ (i.e., the mean conditional entropy of $s_k$ given the already selected features is maximized). This term aims to minimize the information overlap, or redundancy, of the selected features. Figure 1(b) shows three variables, $s_1$, $s_2$, and $s_3$, each with the same size of intersection (relevance) with the target variable, y. After selecting $s_1$, $s_2$ is preferred over $s_3$ as its intersection (redundancy) with the already selected features, $S = \{s_1\}$, is smaller.

Finally, in MR+C and MRMR+C, compressibility is maximized by the term $-\omega_{com}\,H(s_k)$ (i.e., a portion of the marginal entropy of $s_k$ is minimized). This effectively minimizes the amount of information a feature carries. The Venn diagram in Figure 1(c) shows two variables, $s_1$ and $s_2$, with equal overlaps with the target variable (relevance) and no redundancy. The smaller set, $s_1$, representing a lower entropy, is more compressible than the larger set, $s_2$, representing a higher entropy, and so in this case $s_1$ is preferred over $s_2$.

The effect of $\omega_{com}$ is demonstrated using simulated data in Figure 2, for which 100 samples were generated, with the first 50 samples having a target value of 1 and the remaining 50 samples a target value of 0. Nine features were then generated, each with fifty zero values and fifty non-zero values, to have different relevancies, redundancies, and entropies. In the first set of three features, the non-zero values were at samples 5 to 55; in the second set of three features, the non-zero values were at samples 10 to 60; and in the final set of three features, they were at samples 15 to 65. This ensures that the features have different relevancies to the target and different redundancies to one another. In each set, the non-zero values of the first feature were either 1 or 2, in the second feature they were either 1, 2, or 3, and in the third feature they were either 1, 2, 3, or 4, each chosen uniformly at random. In general, the more values a feature takes, the more entropy it has.

Figure 2. Demonstration of (a) MR vs MR+C and (b) MRMR vs MRMR+C.

Figure 2(a) shows the mean MR and MR+C scores (with $\omega_{com} = 0.1$) over ten generated datasets. The features with the most non-zero values when y = 1 (features 1, 2, and 3) had the highest relevancies (measured using MR), whereas those with the fewest non-zero values when y = 1 (features 7, 8, and 9) had the lowest relevancies. Features with the same set of non-zero samples but that took more values also had higher MR (e.g., feature 3 had a higher relevancy than feature 1), even though they contain the same information about the target variable, y. With MR+C, features that took more values but had the same set of non-zero samples were considered of lower importance, since the entropy of the features is also considered.

Figure 2(b) shows the same results for MRMR and MRMR+C (with $\omega_{com} = 0.1$). Redundancy for each feature was computed as its total MI with the other eight features. Here, MRMR also prefers features with fewer values, as they have less chance of having values in common. This is not true in general, however, and $\omega_{com}$ still had an effect in reducing the MRMR+C scores of the features that took more values. In fact, a feature with more non-zero samples when y = 1 and higher relevance (feature 6) had a slightly lower score than a feature with fewer non-zero samples (feature 7) once compression was considered. This means that a feature with lower predictive power may be selected over one that provides better predictions, to the benefit of compression.

Results

In this section, the results for the feature selection approaches described in Section 3 are provided. The results were obtained using the Location Extraction Dataset (LED) (Van Hinsbergh et al. Citation2018). The LED consists of over 1900 vehicle telemetry signals collected over 72 journeys in an urban environment while performing various pick-up and drop-off scenarios. Data was sampled at 10 Hz; the mean journey length was 19.7 minutes, with a standard deviation of 8.2 minutes and a range of 29.1 minutes. All signals with names containing the strings 'time' and 'minutes' were removed prior to any feature selection or model learning, as they were found to be detrimental to the results due to each sample having a unique value.

One regression task, namely estimating the instantaneous Fuel Consumption (FC), and one classification task, namely detecting Points of Interest (PoI), were constructed using the LED. For FC, all signals containing the string 'fuel' or 'torq' were removed from the data, because these signals can be used to achieve high performances in simple models using one feature. In both tasks, training data was made up of the samples from ten journeys sampled randomly without replacement, and testing data was made up of ten different journeys, again sampled randomly without replacement. The data was split temporally into equally sized blocks of samples, from which the mean (if continuous) or modal (if categorical) value was extracted. The block size for the FC task was 10 samples (1 second), and for the PoI task a block size of 50 samples (5 seconds) was used. These block sizes were chosen as they provided the most interesting trade-off between predictive performance and feature compression, and provided training and testing datasets with a mean number of samples of about 12000 for the FC task and 5200 for the PoI task.

Signals were selected from the training data using incremental search, maximizing the performance measures listed in Table 2. MR+C and MRMR+C are applied to both the regression (FC) and classification (PoI) tasks, whereas GOFr+C, WLRr+C, and WSVMr+C are used for the regression task only, and GOFc+C, WLCc+C, and WSVMc+C are used for the classification task only. In both tasks, each performance measure was applied in both forward greedy and lazy forward greedy searches, and the number of feature set evaluations (applications of the performance measure) was recorded.

Table 2. Feature selection performance measures investigated in the evaluation.

For the MR+C and MRMR+C methods, the variables were discretized in order to estimate the integrals when computing MI (Equation 43). Specifically, the features in both tasks were discretized by learning a decision tree for each input feature using Gini's impurity as a split criterion (Breiman Citation2017), and taking the cut points of nodes in the tree as discretization intervals. To obtain a discrete target variable in the regression task, Matlab's histcounts function was used to find suitable discretization bins. Following feature selection, the original continuous features and target variables were used in model learning and evaluation.

The selected features were used to train an SVM that predicted the target variable. Predictions were then made for each sample in the testing data, which were compared to their observed values to assess the model performances. The R2 scores are presented for the regression task, and AUC is used for the classification task. Mean Absolute Error (MAE) and F1 score results are available in the supplementary material. Finally, the overall compression was measured on the testing data for the selected features, using DEFLATE and DWT. The results presented are the mean performances over ten train-test cycles, and error bars show the standard errors.

Predictive Performance

Figure 3 shows the maximum R2 performances for the regression task when selecting up to (a) 5, (b) 10, and (c) 15 features with different values of the compressibility factor weight, ωcom. With up to five features, MR+C and MRMR+C had poor performances, with R2 < 0.5, but performed significantly better when selecting up to ten or fifteen features. In fact, with up to ten or fifteen features, MR+C had the best performances of any of the selection methods, by a significant margin, for any value of ωcom. The R2 performances of GOFr+C, WLRr+C, and WSVMr+C either did not change or increased only marginally when selecting up to ten or fifteen features compared to five features.

Figure 3. R2 performances for the FC task.

In almost all cases, giving greater consideration to compression (i.e., larger values of ωcom) led to the selection of features with worse performances. One exception was MR+C, where R2 performances remained the same with different values of ωcom. MRMR+C, GOFr+C, and WLRr+C, on the other hand, were affected significantly by the value of ωcom, with R2 performances decreasing by up to 0.3 when comparing ωcom = 0 and ωcom = 1 for the same number of features. WSVMr+C was less affected by increasing ωcom, with the maximum achieved R2 performance only decreasing slightly between ωcom = 0 and ωcom = 1. Interestingly, each selection method was affected over a different range of ωcom. For instance, WLRr+C decreased in performance with low values of ωcom (≤ 0.2) and then was unaffected by increasing ωcom further, whereas performance was maintained by MRMR+C and GOFr+C until ωcom ≈ 0.6. This may be a result of the different ranges and distributions of the feature set performance measures, indicating that the effect of ωcom is dependent on the performance measure being used.

The maximum AUC performances for the classification task, PoI, are shown in Figure 4 when selecting up to (a) 5, (b) 10, and (c) 15 features with different values of ωcom. As with the regression task, the performances when predicting PoIs improved with the number of features selected for MR+C and MRMR+C. For GOFc+C, WLCc+C, and WSVMc+C, however, the AUC performances did not increase when selecting more than five features. The highest AUC performances were attained by GOFc+C, WLCc+C, and WSVMc+C, followed by MRMR+C.

Figure 4. AUC performances for the PoI task.

Unlike in the regression task, the consideration of compression actually improved the performance of the features selected for the PoI task in some cases. In particular, the AUC performance increased significantly when selecting up to 5 features with MRMR+C and ωcom = 0.1 compared to ωcom = 0. Likewise, AUC increased with larger values of ωcom (≥ 0.3) for MR+C, which had the worst performance of all selection methods investigated. This is indicative of overfitting due to their use of MI, which is prone to selecting high entropy features at the cost of their generalizability to new data (Taylor, Griffiths, and Bhalerao Citation2015). The consideration of compression likely reduces this bias in MI, leading to the selection of features that are less specific to the training data.

The GOFc+C and WLCc+C selection methods, which do not rely on MI, were less affected by the value of ωcom, but still improved with increased values of ωcom. The AUC performance of WLCc+C increased from 0.875 (with ωcom = 0) to 0.939 (with ωcom ≥ 0.2), and for GOFc+C it increased from 0.922 when ωcom = 0 to 0.953 when ωcom = 0.1. For WSVMc+C, however, the AUC performances decreased slightly with higher values of ωcom.

Compression Performance

Figure 5 (FC task) and Figure 6 (PoI task) show the number of bytes required to represent the best performing features when selecting up to (a) 5, (b) 10, and (c) 15 features, using the DEFLATE compression algorithm with a compression level of 6. For the FC task, the number of bytes is shown for the feature set that achieved the highest R2 in each case, and for the PoI task the feature set with the highest AUC was used. In both tasks, more features required more bytes to represent them. In many cases, however, the maximum predictive performances were achieved by fewer than five or ten features, so the number of bytes remained the same when selecting up to ten and up to fifteen features. In particular, when ωcom ≥ 0.3, the number of bytes required to represent features selected using WLCc+C was the same when selecting up to five, ten, or fifteen features, as were their performances.

Figure 5. DEFLATE bytes performances for the FC task.

Figure 6. DEFLATE bytes performances for the PoI task.

In the PoI task, the number of bytes required by MR+C decreased as ωcom increased. Interestingly, this was predominantly when 0 ≤ ωcom ≤ 0.5, which is where the predictive performance also improved (Figure 4). This is further evidence that MR+C was overfitting to the training data in the PoI task due to selecting high entropy features.

Although MR+C achieved the best performances in the FC task (Figure 3), the features it selected had the worst compression in either task, by a significant margin. For example, selecting ten features using MR+C when ωcom = 0 required 341.7 KB when using DEFLATE, whereas selecting ten features using WSVMr+C required only 82.4 KB. Similarly, for the PoI task, MR+C with ωcom = 0 required 87.5 KB to represent the top ranked ten features, whereas WSVMc+C required only 28.9 KB. Features selected by MRMR+C also required significantly more bytes than those selected by the goodness of fit and wrapper approaches. Furthermore, the number of bytes required to represent features selected by MR+C and MRMR+C for the FC task was not affected significantly by the value of ωcom, whereas the number of bytes required decreased as ωcom increased for all other selection methods (GOFr+C, WLRr+C, and WSVMr+C). In particular, WSVMr+C, which had the second best performances for the FC task, required significantly fewer bytes when ωcom = 1 than when ωcom = 0, with only a marginal decrease in predictive performance. The decrease in bytes was largest for WLRr+C and WSVMr+C between ωcom = 0 and ωcom ≈ 0.4, and when ωcom ≥ 0.5 the number of bytes did not change significantly.

As in the FC task, the number of bytes required by features selected using GOFc+C and WLCc+C decreased as ωcom increased when ωcom ≤ 0.3, and with ωcom > 0.3 the number of bytes required did not change significantly. The number of bytes required by features selected using WSVMc+C continued to decrease with increased values of ωcom in all cases investigated, becoming similar to that required by GOFc+C.

The number of bytes required to represent the selected features using the DWT is shown in Figure 7 for the FC task, and in Figure 8 for PoI. The DWT had two layers, each using the db5 wavelet, with a quantization level of 8 and a threshold of 0.5. The resulting coefficients were then compressed using DEFLATE with a compression level of 6. The results in both FC and PoI reflect those found when compressing the features using DEFLATE only, but DWT achieved better compression in all cases. In both tasks, more bytes were required when selecting more features. In the FC task, MR+C was again unaffected by the value of ωcom, and GOFr+C and WSVMr+C had better compression with higher values of ωcom. In the PoI task, the number of bytes required by features selected using GOFc+C decreased as ωcom increased for small values of ωcom. In this case, however, the number of bytes required by features selected by MR+C increased with ωcom ≤ 0.3, and only decreased again with ωcom ≥ 0.5.

Figure 7. DWT bytes performances for the FC task.

Figure 8. DWT bytes performances for the PoI task.

Lazy Evaluations

Table 3 presents the maximum performances (by R2) of the selection methods when applied in (a) forward greedy and (b) lazy forward greedy searches to select features for the FC task. The performances for MR+C are not displayed in these tables, as it selects the same set of features when using either forward greedy or lazy forward greedy search. Features selected using the filter methods, MRMR+C and GOFr+C, tended to have worse performances when lazy forward greedy search was employed than when using forward greedy search. Although for some values of ωcom the R2 performances for GOFr+C with lazy forward greedy were higher than with forward greedy, the MAEs were also higher. For MRMR+C, the decrease in R2 performance and increase in MAE from forward greedy to lazy forward greedy were significant. As a result, even though fewer bytes were required, lazy forward greedy is not appropriate when using MRMR+C to select features for the FC task.

Table 3. Best performances (by R2) in the FC task selecting up to 15 features using (a) forward greedy and (b) lazy forward greedy search.

Whereas the performances of GOFr+C when using lazy forward greedy tended to be lower than for forward greedy, the features selected required fewer bytes to achieve the maximum R2 score. When ωcom = 0.4, for example, 32.3 KB were required to achieve an R2 of 0.722 with forward greedy, and 26.7 KB to achieve 0.714 with lazy forward greedy. This particular difference was predominantly due to forward greedy selecting six features whereas lazy forward greedy selected five, but to select five features with ωcom = 0.4, forward greedy still required 30.7 KB, and also achieved a lower R2 performance of 0.704. This is likely a result of extra redundancy in the features selected using lazy forward greedy, which ranks features based on their performance in previous rounds to expedite the search, rather than assessing all features in each round as in forward greedy. Because features that are promising in one round are prioritized in the next, features selected in consecutive rounds may be good in similar ways and thus redundant, making their combination more easily compressible. While this redundancy is accounted for in performance metrics such as PMRMR and PMRMR+C, it is more challenging to control under the reduced lazy search.

The performances of the wrapper methods, WLRr+C and WSVMr+C, were again generally worse when lazy forward greedy was used, especially with small values of ωcom, and in these cases the selected features also typically required fewer bytes. With larger values of ωcom≥0.2, however, WLRr+C had better performances and required more bytes with lazy forward greedy than with forward greedy. This was likely because the peak performance with forward greedy was achieved with only one feature in these cases, a clear sign of overfitting during feature selection. Features selected by WSVMr+C also required more bytes when lazy forward greedy was used than when using forward greedy, and achieved marginally worse performances.

The maximum performances (by AUC) for the PoI task are presented in Table 4 for (a) forward greedy search and (b) lazy forward greedy search. Again, fewer bytes were typically required to represent features selected using lazy forward greedy than when using forward greedy. In most cases, lazy forward greedy also achieved lower AUC and F1 performances. For example, MRMR+C had worse performance in both AUC and F1 for all values of ωcom, and GOFr+C and WLRr+C had lower AUCs. However, GOFr+C and WLRr+C did have slightly higher F1 performances in some cases. WSVMc+C, on the other hand, improved in both AUC and F1 performances while also requiring fewer bytes.

Table 4. Best performances (by AUC) in the PoI task selecting up to 15 features using (a) forward greedy and (b) lazy forward greedy search.

In general, lazy forward greedy selected worse features than forward greedy, but its advantage lies in computational efficiency while still selecting adequate features. Figures 9 and 10 show the number of feature set evaluations performed when using lazy forward greedy with each of the selection methods when selecting (a) 5, (b) 10, and (c) 15 features. For all selection methods except MR+C, the number of feature set evaluations increases with ωcom. The simplest way to implement MR+C is to evaluate each feature individually and produce a single ranking, from which the top features are selected; a sketch of this is given below. In both the FC and PoI tasks, the methods using linear regression had the most feature set evaluations, in particular for larger values of ωcom. While MRMR+C had the fewest feature set evaluations, WSVMr+C had a similar number and WSVMc+C only required around twice as many.
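Under the assumption that MR+C blends relevance and compressibility linearly with weight ωcom (the measures defined earlier in the paper may combine them differently), the single-ranking implementation could look like the following; the mutual information estimator and the DEFLATE-ratio compressibility proxy are illustrative choices, not the authors' exact ones.

```python
import zlib
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mr_c_rank(X, y, k, w_com=0.3):
    """Score each feature once, then take the top k; no set evaluations needed."""
    # MR term: mutual information of each feature with the target
    # (use mutual_info_classif for a classification task such as PoI).
    relevance = mutual_info_regression(X, y)
    raw = np.array([len(col.tobytes()) for col in X.T], dtype=float)
    packed = np.array([len(zlib.compress(col.tobytes())) for col in X.T])
    compressibility = 1.0 - packed / raw      # higher = compresses better
    # NOTE: MI (nats) and compressibility ([0, 1]) are on different scales;
    # in practice both terms would be normalized before blending.
    scores = (1 - w_com) * relevance + w_com * compressibility
    return np.argsort(scores)[::-1][:k]       # indices of the top-k features
```

Because each feature is scored exactly once, the number of evaluations is fixed at the number of candidates, independent of ωcom, which matches the flat curves for MR+C in the figures.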

Figure 9. Number of feature set evaluations for the FC task.

Figure 10. Number of feature set evaluations for the PoI task.

Summary

In summary, the following conclusions can be drawn from these results.

  • Considering compression led to the selection of features with slightly worse predictive performances in general, but which required fewer bytes when compressed.

  • In the FC task, MR+C had the best performances, but needed 10 to 15 features to achieve them and required the most bytes to represent them.

  • The next best performances in the FC task were achieved by GOFr+C and WSVMr+C, which also required the fewest bytes.

  • In the PoI task, GOFc+C and WSVMc+C achieved the best performances and also required the fewest bytes.

  • Using forward greedy generally provided better performances than lazy forward greedy, but in some cases the selected features required more bytes when compressed.

When considering compression of selected features, there is a trade-off against performance. In some cases, the lower predictive performance of a feature set may be acceptable when considering the significant improvement in compression. For example, in the FC task when selecting features with GOFr+C, it is possible to reduce the compressed size of the features from 96KB to 32KB while only reducing the R2 performance by 0.035 and increasing the MAE by 0.002. In some applications, the 3× improvement in compression may warrant the small increase in error.

MR+C required the selection of more than five features for good predictive performances, and the selected features had the poorest compression (i.e., they required the most bytes) of all the methods. The features selected using MR+C for the FC regression task had higher R2 performances than those selected by any other method when more than five features were selected. The compression performance of the features selected by MR+C was also unaffected by ωcom in this task. For the PoI classification task, the features selected by MR+C had good AUC performances when compression was considered with small (but non-zero) values of ωcom, but poor performances with larger values. MR+C also required the fewest feature evaluations of any method investigated when using a lazy search. In regression tasks where compression is not important, therefore, MR+C may be most appropriate, especially when the number of feature set evaluations is to be minimized. However, this is unlikely to be the case in general, as it ignores feature redundancy (Guyon and Elisseeff Citation2003).

MRMR+C required more features for good performance than most other methods, only achieving comparable performance when 15 features were selected. The selected features also had poorer compression, requiring almost 10× more bytes than all methods other than MR+C when compression was considered during selection. When using lazy evaluations, however, MRMR+C required fewer than half the number of feature set evaluations of all methods except MR+C.

The linear regression wrapper had poor performance in the FC regression task for all values of ωcom. In the PoI classification task, the AUC performances achieved by features selected using WLRr improved when compression was considered during selection. For ωcom>0.2, AUC performance was unaffected, and the selected features required the fewest bytes of all methods when compressed.

The two remaining approaches, based on GOF and WSVM, had similar results in general, differing only in the FC regression task for larger values of ωcom. For ωcom>0.6, the R2 performances of GOFr+C decreased significantly as ωcom increased, whereas those of WSVMr+C were much less affected. The features selected by these methods had the highest predictive performances of any method investigated (except MR+C in FC), and the effect of small values of ωcom was minor. Their compression performances were also similar, and both methods were outperformed only by WLRr+C in the FC regression task when assessed using DEFLATE (they were similar when using DWT). Under lazy evaluation, however, both methods required the largest numbers of feature set evaluations.

In general, our findings suggest that the GOF and WSVM based methods are best for selecting feature subsets with both high predictive performance and good compression. Wrapper methods are computationally very complex, however, and even with lazy search the number of feature set evaluations may be too high when using GOFr+C or WSVMr+C for regression tasks, or GOFc+C or WSVMc+C for classification tasks. Therefore, it may be preferable to employ a filter method such as MRMR+C, which selected feature sets with similar predictive performances but required more bytes when compressed.

Conclusion

This paper investigated the compressibility of features selected using three filter methods and one wrapper method. We have introduced a compressibility factor to feature set performance measures in order to improve this compression. Performance guarantees based on monotonicity and submodularity were given for the PMR and PMR+C performance measures, after showing their equivalence with a sum of non-negative numbers. The submodularity ratio can be used to provide performance guarantees for GOFr, GOFr+C, GOFc, and GOFc+C. The PMRMR and PMRMR+C performance measures are not submodular, but are equivalent to max dependency (Peng, Long, and Ding Citation2005) when used in forward greedy search. Similarly, under assumptions of a strong evaluation procedure, the wrapper methods are known to provide good performances when used in greedy searches (Kohavi and John Citation1997).
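For completeness, the classical bounds underlying these guarantees can be stated as follows; the first is due to Nemhauser, Wolsey, and Fisher (Citation1978) for monotone submodular objectives, and the second, in terms of the submodularity ratio γ, to Das and Kempe (Citation2011).

```latex
% Greedy guarantee for a monotone submodular F with F(\emptyset) = 0:
F(S_{\mathrm{greedy}}) \;\geq\; \Bigl(1 - \tfrac{1}{e}\Bigr) \max_{|S| \leq k} F(S).
% With submodularity ratio \gamma \in [0, 1] (weak submodularity), this relaxes to:
F(S_{\mathrm{greedy}}) \;\geq\; \bigl(1 - e^{-\gamma}\bigr) \max_{|S| \leq k} F(S).
```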

An entropy analysis was then provided for criteria defined by the performance measures based on MI, namely CMR, CMR+C, CMRMR, and CMRMR+C. This analysis provided a useful perspective, identifying the entropy components that are minimized and those that are maximized in each of the performance measures. Since MI, used to assess relevance and redundancy in MR and MRMR, and compressibility can both be defined using entropy, the analysis also provides insight into the impact of compression in the selection process.
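The identities at the heart of this analysis are standard: relevance is mutual information, and entropy lower-bounds the expected length of any lossless code (Shannon's source coding theorem), which is precisely what ties relevance to compressibility.

```latex
I(X_i; Y) \;=\; H(X_i) - H(X_i \mid Y),
\qquad
\mathbb{E}\bigl[\ell(X_i)\bigr] \;\geq\; H(X_i).
```

Since I(X_i; Y) ≤ H(X_i), a criterion that rewards high relevance implicitly favors high-entropy signals, exactly those that resist compression, unless compressibility is rewarded explicitly as in the +C measures.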

Finally, the feature selection performance measures were demonstrated in selecting features using forward greedy and lazy forward greedy search for one regression task, FC, and one classification task, PoI. We found that considering compression (i.e., ωcom>0) led to lower performances of the selected features but improved compression. In the regression task, MR had the best performances but selected features with significantly worse compression. Forward greedy search also had better performances than lazy forward greedy search in general, although in some cases the selected features required more bytes when compressed. Overall, the wrapper approaches had the best performances while selecting compressible features. When wrapper approaches are not feasible due to computational limitations, MRMR+C provides a suitable alternative, although the features it selected generally had poorer compression.

As future work, different compressibility factors should be investigated. First, since different features compress differently under different algorithms, a compressibility measure that accounts for this is likely to be more accurate, especially if specialized compression approaches are applied to different features. With a more comprehensive and accurate measure of feature set compression, cost-sensitive searches for a feature set that satisfies a budget on the maximum number of bytes could then be applied (Leskovec et al. Citation2007); a sketch of such a search is given below. Alternatively, a multi-objective optimization search could be employed, such as one that searches for a Pareto-optimal solution (e.g., Hancer, Xue, and Zhang Citation2018; Zhang et al. Citation2020).
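As an illustration of the budgeted search suggested above, the following sketch applies the cost-benefit rule of Leskovec et al. (Citation2007): greedily add the feature with the best marginal gain per compressed byte until the budget is exhausted. The score and nbytes functions are assumed black boxes (e.g., a wrapper score and a per-feature DEFLATE size estimate), and nbytes here ignores cross-feature compression; Leskovec et al. additionally run a unit-cost greedy pass and keep the better of the two solutions to obtain a guarantee.

```python
def budgeted_greedy(score, nbytes, candidates, budget):
    """Greedy selection under a total compressed-size budget (in bytes)."""
    selected, spent = [], 0
    remaining = set(candidates)
    while remaining:
        base = score(selected)
        affordable = [f for f in remaining if spent + nbytes(f) <= budget]
        if not affordable:
            break
        # Cost-benefit rule: marginal gain per byte of compressed storage.
        best = max(affordable,
                   key=lambda f: (score(selected + [f]) - base) / nbytes(f))
        selected.append(best)
        spent += nbytes(best)
        remaining.discard(best)
    return selected
```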

Acknowledgments

This work was supported by Jaguar Land Rover and the UK-EPSRC grant EP/N012380/1 as part of the jointly funded Towards Autonomy: Smart and Connected Control (TASCC) Programme.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the Engineering and Physical Sciences Research Council [EP/N012380/1].

References

  • Aboudi, N., and L. Benhlima. 2016. Review on wrapper feature selection approaches. In 2016 International Conference on Engineering & MIS (ICEMIS), Agadir, Morocco, September.
  • Addison, P. S. 2017. The illustrated wavelet transform handbook: Introductory theory and applications in science, engineering, medicine and finance. Boca Raton: CRC Press.
  • Ahmed, N., T. Natarajan, and K. R. Rao. 1974. Discrete Cosine Transform. IEEE Transactions on Computers C-23 1 (January):90–93. doi:10.1109/T-C.1974.223784.
  • Alelyani, S., J. Tang, and H. Liu. 2013. Feature selection for clustering: A review. In Data Clustering, ed. C. C. Aggarwal and C. K. Reddy, 29–60. New York: Chapman & Hall/CRC.
  • Bailey, D., and P. Swarztrauber. 1994. A fast method for the numerical evaluation of continuous Fourier and Laplace transforms. SIAM Journal on Scientific Computing 15 (5):1105–10. doi:10.1137/0915067.
  • Bolón-Canedo, V., I. Porto-Díaz, N. Sánchez-Maroño, and A. Alonso-Betanzos. 2014. A framework for cost-based feature selection. Pattern Recognition 47 (7):2481–89. doi:10.1016/j.patcog.2014.01.008.
  • Breiman, L. 2017. Classification and regression trees. New York: Routledge.
  • Buchbinder, N., M. Feldman, J. Seffi, and R. Schwartz. 2015. A tight linear time (1/2)-approximation for unconstrained submodular maximization. SIAM Journal on Computing 44 (5):1384–402. doi:10.1137/130929205.
  • Chandrashekar, G., and F. Sahin. 2014. A survey on feature selection methods. Computers & Electrical Engineering 40 (1):16–28. doi:10.1016/j.compeleceng.2013.11.024.
  • Chuang, L.-Y., H.-W. Chang, T. Chung-Jui, and C.-H. Yang. 2008. Improved binary PSO for feature selection using gene expression data. Computational Biology and Chemistry 32 (1):29–38. doi:10.1016/j.compbiolchem.2007.09.005.
  • Cox, D. 1958. The regression analysis of binary sequences. Journal of the Royal Statistical Society: Series B (Methodological) 20 (2):215–32.
  • Das, A., and D. Kempe. 2011. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. In Proceedings of the 28th International Conference on Machine Learning, Bellevue, Washington, USA, 1057–64. Omnipress.
  • Das, A., and D. Kempe. 2018. Approximate submodularity and its applications: Subset selection, sparse approximation and dictionary selection. The Journal of Machine Learning Research 19 (1):74–107.
  • Deutsch, P. 1996. DEFLATE Compressed Data Format Specification version 1.3. RFC 1951. Technical report. Internet Engineering Task Force, May.
  • Ding, C., and H. Peng. 2005. Minimum redundancy feature selection from microarray gene expression data. Journal of Bioinformatics and Computational Biology 3 (2):185–205. doi:10.1142/S0219720005001004.
  • Feige, U., V. S. Mirrokni, and J. Vondrak. 2011. Maximizing non-monotone submodular functions. SIAM Journal on Computing 40 (4):1133–53. doi:10.1137/090779346.
  • Fürnkranz, J. 2002. Round robin classification. Journal of Machine Learning Research 2 (Mar):721–47.
  • Guerrero-Ibáñez, J., S. Zeadally, and J. Contreras-Castillo. 2018. Sensor technologies for intelligent transportation systems. Sensors 18 (4):1212. doi:10.3390/s18041212.
  • Guyon, I., and A. Elisseeff. 2003. An introduction to variable and feature selection. Journal of Machine Learning Research 3 (Mar):1157–82.
  • Hancer, E., B. Xue, and M. Zhang. 2018. Differential evolution for filter feature selection based on information theory and feature ranking. Knowledge-Based Systems 140:103–19. doi:10.1016/j.knosys.2017.10.028.
  • Harding, J., G. Powell, R. Yoon, J. Fikentscher, C. Doyle, D. Sade, M. Lukuc, J. Simons, and J. Wang. 2014. Vehicle-to-vehicle communications: Readiness of V2V technology for application. Report DOT HS 812 014. Washington, DC: National Highway Traffic Safety Administration.
  • Herman, G., B. Zhang, Y. Wang, Y. Getian, and F. Chen. 2013. Mutual information- based method for selecting informative feature sets. Pattern Recognition 46 (12):3315–27. doi:10.1016/j.patcog.2013.04.021.
  • Van Hinsbergh, J., N. Griffiths, P. Taylor, A. Thomason, Z. Xu, and A. Mouzakitis. 2018. Vehicle point of interest detection using in-car data. In Proceedings of the 2nd ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery, 1–4. GeoAI'18. Seattle, WA, USA: ACM.
  • Huffman, D. A. 1952. A method for the construction of minimum-redundancy codes. Proceedings of the IRE 40 (9):1098–101. doi:10.1109/JRPROC.1952.273898.
  • Jain, D., and V. Singh. 2018. Feature selection and classification systems for chronic disease prediction: A review. Egyptian Informatics Journal 19 (3):179–89. doi:10.1016/j.eij.2018.03.002.
  • Dy, J., and C. Brodley. 2004. Feature selection for unsupervised learning. Journal of Machine Learning Research 5 (Aug):845–89.
  • Jiang, L., G. Kong, and C. Li. 2021. Wrapper Framework for Test-Cost-Sensitive Feature Selection. IEEE Transactions on Systems, Man, and Cybernetics: Systems 51 (3):1747–56.
  • Kennedy, J., and R. Eberhart. 1995. Particle swarm optimization. In Proceedings of ICNN'95 - International Conference on Neural Networks, Perth, WA, Australia, 4:1942–48.
  • Khanna, R., E. Elenberg, A. Dimakis, S. Negahban, and J. Ghosh. 2017. Scalable greedy feature selection via weak submodularity. Artificial Intelligence and Statistics, 54, 1560–68.
  • Kohavi, R., and G. John. 1997. Wrappers for feature subset selection. Artificial Intelligence 97 (1–2):273–324. doi:10.1016/S0004-3702(97)00043-X.
  • Kumar, V., and S. Minz. 2014. Feature selection: A literature review. Smart Computing Review 4 (3):211–29.
  • Lal, T., O. Chapelle, J. Weston, and A. Elisseeff. 2006. Embedded methods. In Isabelle Guyon, Masoud Nikravesh, Steve Gunn, Lotfi A. Zadeh editors, Feature extraction, 137–65. Berlin Heidelberg: Springer.
  • Leskovec, J., A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. 2007. Cost-effective outbreak detection in networks. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, USA, 420–29. ACM.
  • May, R., G. Dandy, and H. Maier. 2011. Review of input variable selection methods for artificial neural networks, vol. 10, 16004. Rijeka: InTech Europe.
  • Minoux, M. 1978. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization Techniques, ed. J. Stoer, 234–43. Berlin, Heidelberg: Springer.
  • Mirzasoleiman, B., A. Badanidiyuru, A. Karbasi, J. Vondrák, and A. Krause. 2015. Lazier than lazy greedy. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, Austin, Texas, 1812–18. AAAI'15. AAAI Press.
  • Mitra, P., C. A. Murthy, and K. P. Sankar. 2002. Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (3):301–12. doi:10.1109/34.990133.
  • Nagpal, A., and V. Singh. 2018. A feature selection algorithm based on qualitative mutual information for cancer microarray data. Procedia Computer Science 132:244–52. doi:10.1016/j.procs.2018.05.195.
  • Nemhauser, G., L. Wolsey, and M. Fisher. 1978. An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming 14 (1):265–94. doi:10.1007/BF01588971.
  • Papa, J. P., G. H. Rosa, A. N. de Souza, and L. C. S. Afonso. 2018. Feature selection through binary brain storm optimization. Computers & Electrical Engineering 72:468–81. doi:10.1016/j.compeleceng.2018.10.013.
  • Peng, H., F. Long, and C. Ding. 2005. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (8):1226–38. doi:10.1109/TPAMI.2005.159.
  • Salomon, D., and G. Motta. 2010. Handbook of data compression. London: Springer Science & Business Media.
  • Taylor, P., N. Griffiths, and A. Bhalerao. 2015. Redundant feature selection using permutation methods. Automatic Machine Learning Workshop, Lille, France, 1–8.
  • Taylor, P., N. Griffiths, and A. Mouzakitis. 2018. Selection of compressible signals from telemetry data. Mining Urban Data Workshop, London.
  • Weiß, C. 2011. V2X communication in Europe – From research projects towards standardization and field testing of vehicle communication technology. Computer Networks 55 (14):3103–19. doi:10.1016/j.comnet.2011.03.016.
  • Witten, I. H., E. Frank, M. A. Hall, and C. J. Pal. 2016. Data mining: Practical machine learning tools and techniques. Burlington, MA: Morgan Kaufmann.
  • Zhang, Y., S. Cheng, Y. Shi, D.-W. Gong, and X. Zhao. 2019. Cost-sensitive feature selection using two-archive multi-objective artificial bee colony algorithm. Expert Systems with Applications 137:46–58. doi:10.1016/j.eswa.2019.06.044.
  • Zhang, Y., D.-W. Gong, X.-Z. Gao, T. Tian, and X.-Y. Sun. 2020. Binary differential evolution with self-learning for multi-objective feature selection. Information Sciences 507:67–85. doi:10.1016/j.ins.2019.08.040.
  • Ziv, J., and A. Lempel. 1978. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory 24 (5):530–36. doi:10.1109/TIT.1978.1055934.