SPlit: An Optimal Method for Data Splitting

Pages 166-176 | Received 10 Dec 2020, Accepted 06 Apr 2021, Published online: 01 Jun 2021

Abstract

In this article, we propose an optimal method referred to as SPlit for splitting a dataset into training and testing sets. SPlit is based on the method of support points (SP), which was initially developed for finding the optimal representative points of a continuous distribution. We adapt SP for subsampling from a dataset using a sequential nearest neighbor algorithm. We also extend SP to deal with categorical variables so that SPlit can be applied to both regression and classification problems. The implementation of SPlit on real datasets shows substantial improvement in the worst-case testing performance for several modeling methods compared to the commonly used random splitting procedure.

1 Introduction

For developing statistical and machine learning models, it is common to split the dataset into two parts: training and testing (Stone 1974; Hastie, Tibshirani, and Friedman 2009). The training part is used for fitting the model, that is, to estimate the unknown parameters in the model. The model is then evaluated for its accuracy using the testing dataset. The reason for doing this is that if we were to use the entire dataset for fitting, the model would overfit the data, which can lead to poor predictions in future scenarios. Therefore, holding out a portion of the dataset and testing the model for its performance before deploying it in the field can protect against unexpected issues that can arise due to overfitting.

In this article, we consider only datasets where each row is independent of other rows, that is, we will exclude cases such as time series data. The simplest and probably the most common strategy to split such a dataset is to randomly sample a fraction of the dataset. For example, 80% of the rows of the dataset can be randomly chosen for training and the remaining 20% can be used for testing. The aim of this article is to propose an optimal strategy to split the dataset.
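For concreteness, a random 80-20 split can be coded in a couple of lines of R; the sketch below assumes the dataset is stored in a data frame named df (an illustrative name).

```r
# Random 80-20 split: sample 80% of the row indices for training,
# use the remainder for testing.
set.seed(2020)                                    # for reproducibility
train_idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
train <- df[train_idx, ]
test  <- df[-train_idx, ]
```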

Snee (1977) appears to have been the first to carefully investigate several data splitting strategies. He recommended DUPLEX as the best strategy, which was originally developed by Kennard as an improvement to another popular strategy, CADEX (Kennard and Stone 1969). Over time, many other methods have been proposed in the literature for data splitting; see, for example, the survey in Reitermanová (2010) and the comparative study in Xu and Goodacre (2018). Some of these methods will be discussed in the next section after proposing a mathematical formulation of the problem.

It is also common to hold out a portion of the training set for validation. The validation set can be used for fine-tuning the model performance such as for choosing hyper-parameters or regularization parameters in the model. In fact, the training set can be divided into multiple sets and the model can be trained using cross-validation. Our proposed method for optimally splitting the dataset into training and testing can also be used for these purposes by applying the method repeatedly on the training set.

The article is organized as follows. In Section 2, we provide a mathematical formulation of the problem and propose an optimal splitting method called SPlit based on a technique for finding optimal representative points of a distribution known as support points (SP; Mak and Joseph 2018b). SP are defined only for continuous variables. Therefore, we extend the SP methodology to deal with categorical variables in Section 3 so that SPlit can be applied to both regression and classification problems. We apply SPlit on several real datasets in Section 4 and compare its performance with random subsampling. Some concluding remarks are given in Section 5.

2 Methodology

Let $X = (X_1, \ldots, X_p)$ be the $p$ input variables (or features) and $Y$ the output variable. Let $\mathcal{D} = \{(X_i, Y_i)\}_{i=1}^{N}$ be the dataset in hand. Our aim is to divide $\mathcal{D}$ into two mutually exclusive sets, $\mathcal{D}^{train}$ and $\mathcal{D}^{test}$, where the training set $\mathcal{D}^{train}$ contains $N^{train}$ points and the testing set $\mathcal{D}^{test}$ contains $N^{test}$ points, with $N^{train} + N^{test} = N$.

2.1 Mathematical Formulation

Suppose the rows of the dataset are independent realizations from a distribution $F(X, Y)$:
$$(X_i, Y_i) \overset{\text{iid}}{\sim} F, \quad i = 1, \ldots, N. \tag{1}$$

Let g(X;θ) be the prediction model that we would like to fit to the data, where θ is a set of unknown parameters in the model. The unknown parameters θ will be estimated by minimizing a loss function L(Y,g(X;θ)). Typical loss functions include squared or absolute error loss. More generally, the negative of the log-likelihood can be used as a loss function.

In postulating a prediction model g(X;θ), our hope is that it will be close to the true model E(Y|X) for some value of θ. However, the postulated model could be wrong and there may not exist a true value for θ. Thus, it makes sense to try out different possible models on a training set and check their performance on the testing set so that we can identify the model that is closer to the truth.

The unknown parameters $\theta$ can be estimated from the training set as follows:
$$\hat{\theta} = \operatorname*{Argmin}_{\theta}\; \frac{1}{N^{train}} \sum_{i=1}^{N^{train}} L\big(Y_i^{train}, g(X_i^{train}; \theta)\big), \tag{2}$$
which is a valid estimator provided that
$$(X_i^{train}, Y_i^{train}) \sim F, \quad i = 1, \ldots, N^{train}. \tag{3}$$

How should we split the dataset to obtain training and testing sets? We propose that the dataset should be split in such a way that the testing set gives an unbiased and efficient evaluation of the model’s performance fitted using the training set.

To quantify the model's performance, define the generalization error as in Hastie, Tibshirani, and Friedman (2009, chap. 7) by
$$\mathcal{E} = E_{X,Y}\big\{L\big(Y, g(X; \hat{\theta})\big) \mid \mathcal{D}^{train}\big\}, \tag{4}$$
where the expectation is taken with respect to a realization $(X, Y)$ from $F$. Note that we do not include the randomness in $\hat{\theta}$ induced by $\mathcal{D}^{train}$ for computing the expectation.

We can estimate $\mathcal{E}$ if we have a sample of observations from $F$ that is independent of the training set. We can use the testing set for this purpose. Thus, an estimate of $\mathcal{E}$ can be obtained as
$$\hat{\mathcal{E}} = \frac{1}{N^{test}} \sum_{i=1}^{N^{test}} L\big(Y_i^{test}, g(X_i^{test}; \hat{\theta})\big), \tag{5}$$
which works if
$$(X_i^{test}, Y_i^{test}) \sim F, \quad i = 1, \ldots, N^{test}. \tag{6}$$

A simple way to ensure condition (6) is to randomly sample $N^{test}$ points from $\mathcal{D}$. Then, Equation (5) can be viewed as the Monte Carlo (MC) estimate of $\mathcal{E}$. The question we are trying to answer is whether there is a better way to sample from $\mathcal{D}$ so that we can get a more efficient estimate of $\mathcal{E}$. The answer to this question is affirmative. We can use quasi-Monte Carlo (QMC) methods to improve the estimation of $\mathcal{E}$. It is well known that the error of MC estimates decreases at the rate $O(1/\sqrt{N^{test}})$, whereas, when sampling from uniform distributions, the QMC error rate can be shown to be almost $O(1/N^{test})$ (Niederreiter 1992). This is a substantial improvement in the error rate. However, most QMC methods focus on uniform distributions (Owen 2013). Recently, Mak and Joseph (2018b) developed a method known as SP to obtain a QMC sample from general distributions. Although their theoretical results guarantee a convergence rate faster than MC by only a $\log N^{test}$ factor, much faster convergence rates are observed in practical implementations. This leads us to the proposed method SPlit (which stands for SP-based splitting). We will discuss the method of SP and SPlit in detail after reviewing the existing data splitting methods in the next section.

2.2 Review of Data Splitting Methods

Interestingly, the original motivation behind CADEX (Kennard and Stone 1969) and DUPLEX (Snee 1977) was to create two sets with similar statistical properties, which agrees with the distributional condition mentioned in Equation (6). However, these algorithms cannot achieve this objective. For example, consider two-dimensional data generated using $(X_{1i}, X_{2i}) \overset{\text{iid}}{\sim} N_2(\mathbf{0}, \Sigma)$ for $i = 1, \ldots, N$, where $\mathbf{0} = (0, 0)$ and $\Sigma_{jk} = 0.5^{|j-k|}$. We will omit the response here because these algorithms do not use it. Let $N = 1000$ and $N^{test} = 100$. The CADEX and DUPLEX testing sets obtained using the R package prospectr (Stevens and Ramirez-Lopez 2020) are shown in the top-left and top-middle panels of Figure 1. We can see that both the CADEX and DUPLEX testing points are too spread out and therefore, their distributions do not match the distribution of the data. This can be seen more clearly in the marginal distributions shown in the bottom panels. For comparison, the testing set generated using the proposed SPlit method is shown in the top-right panel. We can see that its distribution matches quite well with the distribution of the full data, as desired.
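For reference, a sketch of how the comparison data and the CADEX and DUPLEX testing sets could be generated is given below. It assumes that prospectr::kenStone() and prospectr::duplex() accept the data matrix and the number of points to select, and return the selected row indices in the $model component; the argument names and output structure should be verified against the prospectr documentation.

```r
library(MASS)        # mvrnorm() for the bivariate normal sample
library(prospectr)   # kenStone() (CADEX) and duplex()

set.seed(1)
N <- 1000; Ntest <- 100
Sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2)           # Sigma_jk = 0.5^|j - k|
X <- mvrnorm(N, mu = c(0, 0), Sigma = Sigma)

cadex_idx  <- kenStone(X, k = Ntest)$model         # CADEX testing set (assumed output slot)
duplex_idx <- duplex(X, k = Ntest)$model           # DUPLEX testing set (assumed output slot)
```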

Fig. 1 Comparison of CADEX and DUPLEX testing sets with SPlit testing set. Hundred testing points (red circles) are chosen from 1000 data points (black crosses). In the lower panels, the marginal densities of the testing sets are plotted over the histogram of the full data.

The sample set partitioning based on joint X–Y distances (SPXY) algorithm (Galvão et al. 2005) is a modification of the CADEX algorithm which incorporates distances computed from the response values. Although incorporating $Y$ was in the right direction of Equation (6), the algorithm suffers from the same issues as the CADEX and DUPLEX algorithms. Bowden, Maier, and Dandy (2002) proposed a data splitting method which uses global optimization techniques to match the mean and standard deviation of the testing set and the full data. This is again in the right direction of Equation (6); however, matching the first two moments does not ensure distributional matching. May, Maier, and Dandy (2010) proposed a further improvement to the foregoing methodology using clustering-based stratified sampling. Although this provides an improvement, it is well known that clustering distorts the original distribution (Zador 1982) and hence cannot satisfy Equation (6). In summary, none of the existing data splitting methods except random subsampling can ensure the distributional condition given in either Equation (6) or Equation (3).

Before proceeding further, we would like to mention another possible approach to the problem. One could think about splitting the dataset in such a way that the training set gives the best possible estimation of the model under a given loss function. For example, it is well known that the best estimate of a linear regression model with linear effects of the predictors under the least-squares criterion can be obtained by choosing the extreme values of the data in the predictor space (Wang, Yang, and Stufken 2019). Although such a choice can minimize the variance of the parameter estimates, the model may not perform well on the testing set if the original dataset is not generated from such a linear model. In our opinion, the testing set should provide a set of samples for an unbiased evaluation of the model performance and detect possible model deviations. For example, if we detect that quadratic terms are needed in the linear regression model, then having a training dataset with only extreme values of the predictors is not going to be useful. Therefore, the splitting method should be independent of the modeling choice and the loss function. Random subsampling achieves this aim, and we will show that SP will achieve it even better!

Although SP have been used in the past for subsampling from big data (Mak and Joseph 2018a), that was done for the purpose of saving storage space and time when fitting computationally expensive models with limited resources. In contrast, data reduction is not the objective in our problem. After assessing the model's performance using the testing set, the model will be re-estimated using the full data before deploying it for future predictions.

2.3 Support Points

Let $Z \sim F$. The energy distance between $F$ and the empirical distribution of a set of points $z_1, \ldots, z_n$ is defined as (Székely and Rizzo 2013)
$$ED = \frac{2}{n}\sum_{i=1}^{n} E\|z_i - Z\|_2 - \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\|z_i - z_j\|_2 - E\|Z - Z'\|_2, \tag{7}$$
where $Z, Z' \overset{\text{iid}}{\sim} F$, $\|\cdot\|_2$ is the Euclidean distance, and the expectations are taken with respect to $F$. Note that for the Euclidean distance to make sense, all the variables are standardized to have zero mean and unit variance. The energy distance will be small if the empirical distribution of $z_1, \ldots, z_n$ is close to $F$. Therefore, Mak and Joseph (2018b) defined the SP of $F$ as the minimizer of the energy distance:
$$\{z_i^*\}_{i=1}^{n} = \operatorname*{Argmin}_{z_1, \ldots, z_n} ED = \operatorname*{Argmin}_{z_1, \ldots, z_n}\left\{\frac{2}{n}\sum_{i=1}^{n} E\|z_i - Z\|_2 - \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\|z_i - z_j\|_2\right\}. \tag{8}$$

They can be viewed as the representative points of the distribution $F$, that is, the best set of $n$ points to represent $F$ according to the energy distance criterion. Mak and Joseph (2018b) showed that SP converge in distribution to $F$ and therefore, they can be viewed as a QMC sample from $F$. This property makes SP different from other representative points of a distribution such as MSE-rep points (Fang and Wang 1994) or principal points (Flury 1990), which do not possess distributional convergence (Zador 1982).

In our problem, we do not have $F$. Instead, we only have the dataset $\mathcal{D}$, which is a set of independent realizations from $F$. Therefore, to compute the SP, we can replace the expectation in Equation (8) with a Monte Carlo average computed over $\mathcal{D}$:
$$\{z_i^*\}_{i=1}^{n} = \operatorname*{Argmin}_{z_1, \ldots, z_n}\left\{\frac{2}{nN}\sum_{i=1}^{n}\sum_{j=1}^{N}\|z_i - Z_j\|_2 - \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\|z_i - z_j\|_2\right\}, \tag{9}$$
where $Z_j$ denotes the $j$th row of $\mathcal{D}$.
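For illustration, the objective in Equation (9) can be evaluated directly in base R; the sketch below computes it for a candidate point set z (an n x d matrix) and the standardized data D (an N x d matrix). This is only the criterion itself, not the difference-of-convex optimizer used to minimize it.

```r
# Sample-based energy-distance objective of Equation (9).
energy_objective <- function(z, D) {
  n <- nrow(z); N <- nrow(D)
  # ||z_i - D_j||_2 for all pairs, via ||a - b||^2 = |a|^2 + |b|^2 - 2 a'b
  cross <- sqrt(pmax(outer(rowSums(z^2), rowSums(D^2), "+") - 2 * z %*% t(D), 0))
  # pairwise distances among the candidate points (each unordered pair counted twice)
  within <- as.matrix(dist(z))
  2 * sum(cross) / (n * N) - sum(within) / n^2
}
```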

This simple extension to real data settings is another advantage of SP, which cannot be done for other representative points such as minimum energy design (Joseph et al. 2015) and Stein points (Chen et al. 2018), even though they possess distributional convergence.

At first sight, the optimization in Equation (9) appears to be a very hard problem. The objective function is nonlinear and nonconvex. Moreover, the number of variables in the optimization is $n(p+1)$, which can be extremely high even for small datasets. However, the objective function has a nice feature: it is a difference of two convex functions. By exploiting this feature, Mak and Joseph (2018b) developed an efficient algorithm based on difference-of-convex programming techniques, which can be used for quickly finding the SP. Although a global optimum is not guaranteed, an approximate solution can be obtained in a reasonable amount of time. Their algorithm is implemented in the R package support (Mak 2019). Although it is possible to create representative points using other goodness-of-fit test statistics (Hickernell 1999) and kernel functions (Chen, Welling, and Smola 2010) instead of the energy distance criterion, they do not seem to possess the computational advantage and robustness of SP.

2.4 SPlit

We can use the SP obtained from Equation (9) as the testing set with $n = N^{test}$, and the remaining data can be used as the training set. Alternatively, we can use SP to obtain the training set with $n = N^{train}$ and then use the remaining data as the testing set. However, the computational complexity of the algorithm used for generating SP is $O(n^2(p+1))$. Since $N^{test}$ is usually smaller than $N^{train}$, it will be faster to generate the testing set using SP than the training set. In general, we let $n = \min\{N^{test}, N^{train}\} = \min\{N\gamma, N(1-\gamma)\}$, where $\gamma = N^{test}/N$ is the splitting ratio.

As discussed earlier, the testing set generated using SP is expected to work better than a random subsample from $\mathcal{D}$. However, there is one drawback. Support points need not be a subsample of the original dataset. This is because the optimization in Equation (9) is done over a continuous space and therefore, the optimal solution need not be part of $\mathcal{D}$. To get a subsample, we actually need to solve the following discrete optimization problem:
$$\{z_i^*\}_{i=1}^{n} = \operatorname*{Argmin}_{z_1, \ldots, z_n \in \mathcal{D}}\left\{\frac{2}{nN}\sum_{i=1}^{n}\sum_{j=1}^{N}\|z_i - Z_j\|_2 - \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\|z_i - z_j\|_2\right\}. \tag{10}$$

Our initial attempts to solve this problem using state-of-the-art integer programming techniques showed that they are accurate, but too slow in finding the optimal solution. Computational speed is crucial for our method to succeed because otherwise it will not be attractive against the computationally cheap alternative of random subsampling. Therefore, here we propose an approximate but efficient algorithm to subsample from D.

Algorithm 1 SPlit: Splitting a dataset D with splitting ratio γ [R package: SPlit]

1: Input $\mathcal{D} \in \mathbb{R}^{N \times (p+1)}$ and $\gamma = N^{test}/N$
2: Standardize the columns of $\mathcal{D}$
3: $n \leftarrow \min\{N\gamma, N(1-\gamma)\}$
4: Compute $\{z_i^*\}_{i=1}^{n}$ using (9)
5: $\mathcal{D}_1 \leftarrow \{\}$
6: for $i \in \{1, \ldots, n\}$ do
7:  $\hat{u} \leftarrow \operatorname{argmin}_{u \in \mathcal{D}} \|u - z_i^*\|_2$
8:  $\mathcal{D}_1 \leftarrow \mathcal{D}_1 \cup \{\hat{u}\}$
9:  $\mathcal{D} \leftarrow \mathcal{D} \setminus \{\hat{u}\}$
10: end for
11: $\mathcal{D}_2 \leftarrow \mathcal{D}$
12: return $\mathcal{D}_1, \mathcal{D}_2$

We will first find the SP in the continuous space as in Equation (9), which is very fast. We will then choose the points in $\mathcal{D}$ closest to $\{z_i^*\}_{i=1}^{n}$ according to the Euclidean distance. This can be done efficiently even for big datasets using KD-tree based nearest neighbor algorithms. However, a naive nearest neighbor assignment can produce duplicates, in which case the testing set would contain fewer than $n$ distinct points and the remaining data would contain more than $N - n$ points. Moreover, keeping the selected points separated increases the second term in Equation (10) and thus potentially improves the energy distance criterion. Both can be achieved by doing the nearest neighbor assignment sequentially, removing each selected point from the dataset before the next assignment. Our method is summarized in Algorithm 1 and is implemented in the R package SPlit. A critical step in this algorithm is to update the KD-tree efficiently when a point is removed from the dataset. We use nanoflann, a C++ header-only library (Blanco and Rai 2014), which allows for lazy deletion of a data point from the KD-tree without having to rebuild the tree every time a point is removed from the dataset.
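A plain-R sketch of the sequential nearest-neighbor step (steps 5-11 of Algorithm 1) is shown below; it uses a simple loop instead of a KD-tree, so it only illustrates the logic for moderate N. The SPlit package carries out the whole pipeline internally; it is assumed here to export a function SPlit(data, splitRatio) returning the testing-set row indices, which should be checked against the package documentation.

```r
# Sequential nearest-neighbor assignment: for each support point, take the closest
# still-unassigned row of the standardized data and remove it from the pool.
sequential_nn <- function(sp, D) {
  remaining <- seq_len(nrow(D))
  test_idx  <- integer(nrow(sp))
  for (i in seq_len(nrow(sp))) {
    d2 <- rowSums(sweep(D[remaining, , drop = FALSE], 2, sp[i, ])^2)  # squared distances
    pick <- remaining[which.min(d2)]
    test_idx[i] <- pick
    remaining <- setdiff(remaining, pick)   # remove so no row is chosen twice
  }
  test_idx
}
```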

2.5 Visualization

Consider a simple example for visualization purposes. Suppose we generate $N = 100$ points as follows: $X_i \overset{\text{iid}}{\sim} N(0, 1)$ and $Y_i \mid X_i \sim N(X_i^2, 1)$ for $i = 1, \ldots, N$. Both the $X$ and $Y$ values are standardized to have zero mean and unit variance. Figure 2 shows the optimal testing set obtained using SPlit and a random testing set obtained using random subsampling without replacement. We can see that the points in the SPlit testing set are well spread out throughout the region and provide a much better point set to evaluate the model performance than the random testing set.

Fig. 2 The circles are the testing set obtained using random (left) and SPlit (right) subsampling.

Most statistical and machine learning models have some hyper-parameters or regularization parameters, which are commonly estimated from the training set by holding out a validation set, say of size $N^{valid}$. One simple approach to create an optimal validation set is to apply the SPlit algorithm on the training set. However, it may happen that such a set is close to the points in the testing set, which is not good as it may lead to a biased testing performance. We want the validation points to stay away from the testing points so that the testing performance is not influenced by the model estimation/validation step. This can be achieved as follows. Let $\{z_1, \ldots, z_{N^{test}}\}$ be the testing set and $\{z_{N^{test}+1}, \ldots, z_n\}$ the validation set, where $n = N^{test} + N^{valid}$. Then, the optimal validation points can be obtained as
$$\{z_i^*\}_{i=N^{test}+1}^{n} = \operatorname*{Argmin}_{z_{N^{test}+1}, \ldots, z_n \in \mathcal{D}}\left\{\frac{2}{nN}\sum_{i=1}^{n}\sum_{j=1}^{N}\|z_i - Z_j\|_2 - \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\|z_i - z_j\|_2\right\}, \tag{11}$$
where the optimization takes place only over the validation points, with the testing points fixed at $\{z_1^*, \ldots, z_{N^{test}}^*\}$. Because of the second term in the energy distance criterion, the validation points will move away from the testing points.

Figure 3 shows 20 points selected out of the 80 training points using Equation (11). A random subsample is also shown in the same figure for comparison. Clearly, the optimal validation set created using our method is a much better representative set of the original dataset and therefore, it can do a much better job in tuning the hyper-parameters or regularization parameters than a random validation set. In fact, we can sequentially divide the training set into K sets by repeated application of this method and use them for K-fold cross-validation. Owing to the importance of this problem, we leave this topic for future research.

Fig. 3 The squares are the validation set obtained using random (left) and SPlit (right) subsampling from the training set. The testing set is shown as circles.

2.6 Simulations

Since SP create a dependent set, one may wonder if the testing set and training set are related and if the dependence will create a bias in the estimation of the generalization error in Equation (5). We will perform some simulations to check this. Consider again the data-generating model discussed in the previous section:
$$Y_i = X_i^2 + \epsilon_i, \tag{12}$$
where $\epsilon_i \overset{\text{iid}}{\sim} N(0, 1)$ and $X_i \overset{\text{iid}}{\sim} N(0, 1)$, for $i = 1, \ldots, N$. Let $N = 1000$. Suppose we fit the following $r$th degree polynomial model to the data: $Y_i = g(X_i; \theta) + \epsilon_i$, where $g(X; \theta) = \theta_0 + \theta_1 X + \theta_2 X^2 + \cdots + \theta_r X^r$ and $\epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$. The unknown parameters $\theta = (\theta_0, \theta_1, \ldots, \theta_r)$ can be estimated from the training set using least squares. The generalization error can then be computed as follows:
$$\mathcal{E} = E_{X,Y}\big[\{Y - g(X; \hat{\theta})\}^2 \mid \mathcal{D}^{train}\big] = E_{X,Y}\big[\{Y - g(X; \hat{\theta})\}^2\big] \quad \text{(by independence)}$$
$$= E_X\big(E_{Y|X}\big[\{Y - g(X; \hat{\theta})\}^2 \mid X\big]\big) = E_X\big(\{X^2 - g(X; \hat{\theta})\}^2\big) + 1.$$
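The sketch below runs one replication of this simulation, using a random split for illustration (an SPlit testing set could be substituted); the exact generalization error is approximated with a large fresh sample of X, which is an implementation choice rather than part of the article.

```r
set.seed(1)
N <- 1000; r <- 2
dat <- data.frame(x = rnorm(N))
dat$y <- dat$x^2 + rnorm(N)                                    # model (12)

test_idx <- sample(N, 0.1 * N)                                 # 10% testing split
fit <- lm(y ~ poly(x, r, raw = TRUE), data = dat[-test_idx, ]) # r-th degree polynomial fit

E_hat <- mean((dat$y[test_idx] - predict(fit, dat[test_idx, ]))^2)  # Equation (5)
x_new <- data.frame(x = rnorm(1e6))                                 # fresh draws of X
E     <- mean((x_new$x^2 - predict(fit, x_new))^2) + 1              # E_X({X^2 - g}^2) + 1
E_hat - E                                                           # estimation error
```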

We can divide a given dataset into training and testing sets using various data splitting methods and estimate the generalization error using Equation (5). Thus, we can compute the estimation error of a data splitting method as $\hat{\mathcal{E}} - \mathcal{E}$. For comparison, we use SPlit, random subsampling, CADEX, and DUPLEX. This procedure is repeated 100 times by generating testing sets with splitting ratios of 10% and 50%. Owing to their deterministic nature, CADEX and DUPLEX produce the same testing set each time. On the other hand, some variability is observed in the testing sets produced by SPlit, which is mainly due to the random initialization and convergence to local optima of the SP algorithm. Figure 4 shows the estimation errors over different values of $r$.

Fig. 4 The error in the estimation of generalization error with splitting ratios of 10% (left) and 50% (right) for different data splitting methods over 100 iterations.

We can see that the bias in the estimation of generalization error using SPlit is small compared to the other data splitting methods. This confirms the validity of the proposed method.

3 Categorical Variables

The energy distance in Equation (7) is defined only for continuous variables because its definition involves Euclidean distances, and therefore SP can only be found for datasets with continuous variables. However, a dataset can have categorical predictors and/or responses. Therefore, it is important to extend the SP methodology to deal with categorical variables in order to implement SPlit.

For simplicity of notation, let us consider the case of only a single categorical variable, which could be a predictor or a response. It is easy to extend our methodology to multiple categorical variables, which will be explained later. Let $m$ be the number of levels of the categorical variable and $N_i$ the corresponding number of points in the dataset at the $i$th level, $i = 1, \ldots, m$.

The most naive approach to deal with a categorical variable is to simply ignore it and find the $n$ SP from the dataset of $N$ points using only the continuous variables, as in Equation (9). For illustration, consider the same example used in Section 2.5, except that we generate the data as follows. Let $X_{1i} \overset{\text{iid}}{\sim} N(0, 1)$ and $X_{2i} \mid X_{1i} \sim N(X_{1i}^2, 1)$ for $i = 1, \ldots, N$ with $N = 100$. They are scaled to have zero mean and unit variance. Consider a nominal categorical response variable with three levels: Red, Green, and Blue. The points are classified as Red with probability $\Phi(6X_1 - X_2 - 5)$ and Blue with probability $\Phi(-6X_1 - X_2 - 5)$, where $\Phi(\cdot)$ is the standard normal distribution function. The remaining points are classified as Green. The data are shown in Figure 5.

Fig. 5 The circles are the testing set obtained using naive method (left), stratified proportional method (center), and the proposed coding-based method (right). The three categorical levels are shown as red triangles, green pluses, and blue crosses.

Suppose our aim is to generate a testing set of 20 points. The result of applying the naive method is shown in the left panel of Figure 5. In this dataset, there are $N_1 = 21$ Red, $N_2 = 59$ Green, and $N_3 = 20$ Blue points. A testing set gives a good representation of the categorical variable if
$$\frac{n_i}{n} \approx \frac{N_i}{N} \quad \text{for all } i = 1, \ldots, m, \tag{13}$$
where $n_i$ is the number of testing samples for the $i$th level. Thus, we should have approximately four Red, 12 Green, and four Blue points in the testing set. However, the naive method gives only two Red and three Blue, which is quite disproportionate to the number of Red and Blue points in the dataset. This is expected because the naive method does not use the information in the categorical response for splitting.

Another possible approach that can ensure the proportional sampling in Equation (13) is to first choose $n_i \approx (N_i/N)\, n$ and then find $n_i$ SP from the portion of the dataset containing only the $i$th level of the categorical variable, ignoring the remaining part of the dataset. The result of this stratified proportional method is shown in the middle panel of Figure 5. We can see that some points are too close and almost overlapping. Thus, although the stratified proportional sampling approach ensures perfect balancing of the categorical levels, the SP in the continuous space may not be representative.

Yet another approach to deal with categorical variables is to first convert them into numerical variables and then use the methodology that we developed earlier for continuous variables. The first step is to represent the $m$ levels of the categorical variable using $m - 1$ dummy variables, assuming that the model has a parameter to represent the mean of the data. There are many choices for creating the dummy variables, such as treatment coding, Helmert coding, sum coding, orthogonal polynomial coding, etc. (Faraway 2015, chap. 14). Here we use Helmert coding, as we have observed better numerical stability with it in the SP algorithm.

Consider the example again. Using Helmert coding, the Red, Green, and Blue levels are represented using two dummy variables, $d_1 = (-1, 1, 0)$ and $d_2 = (-1, -1, 2)$. Now the SPlit algorithm can be applied after standardizing the four columns of the augmented dataset. The result is shown in the right panel of Figure 5. We can see that the categorical levels are well balanced and the points are well spread out in the continuous space. Thus, this coding approach seems to work very well. Moreover, it can be easily adapted for ordinal categorical variables through scoring (Wu and Hamada 2011, p. 647). Furthermore, it can be used for multiple categorical variables by coding each variable separately. Thus, this approach appears to be very general and simple to implement, and therefore we adopt it in the SPlit algorithm to handle categorical variables.
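The coding step itself is standard in R. The sketch below shows the Helmert contrasts for a three-level factor, the augmented data matrix, and its standardization; the commented SPlit() call at the end is an assumed interface for the SPlit package (data frame plus a split ratio, returning testing-set indices) and should be checked against the package documentation.

```r
# Helmert coding of a 3-level categorical variable and augmentation of the data.
y <- factor(c("Red", "Green", "Blue", "Green", "Red"),
            levels = c("Red", "Green", "Blue"))               # toy response
contr.helmert(3)                                              # the two contrast columns

dummies <- model.matrix(~ y, contrasts.arg = list(y = "contr.helmert"))[, -1]
X_aug   <- cbind(X1 = rnorm(5), X2 = rnorm(5), dummies)       # toy continuous predictors
X_std   <- scale(X_aug)                                       # zero mean, unit variance

# library(SPlit)
# test_idx <- SPlit(as.data.frame(X_std), splitRatio = 0.2)   # assumed interface
```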

4 Examples

In this section, we will compare SPlit with random subsampling on real datasets for both regression and classification problems.

4.1 Regression

Consider the concrete compressive strength dataset from Yeh (1998), which can be obtained from the UCI Machine Learning Repository (Dua and Graff 2017). This dataset has eight continuous predictors pertaining to the concrete's ingredients and age. The response is the concrete's compressive strength. We will make an 80-20 split of this dataset, which has 1,030 rows. Thus $N^{train} = 824$ and $N^{test} = 206$. The split is done using both SPlit and random subsampling. All nine variables are normalized to mean 0 and standard deviation 1 before splitting.

A good splitting procedure should work well for all possible modeling choices. Therefore, to check the robustness against different modeling choices, we choose a linear regression model with linear main effects estimated using LASSO (Tibshirani 1996) and a nonlinear, nonparametric regression model estimated using random forest (Breiman 2001). Both models are fitted on the training set using the default settings of the R packages glmnet (Friedman, Hastie, and Tibshirani 2010) and randomForest (Liaw and Wiener 2002). Then we compute the root mean squared prediction error (RMSE) on the testing set to evaluate the models' prediction performance. We repeat this procedure 500 times, where the same split is used for fitting both LASSO and random forest.
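One replication of this comparison could be coded roughly as follows. Two caveats: cv.glmnet() with lambda.min is only one plausible reading of the glmnet "default settings", and the SPlit() call is an assumed interface for the SPlit package (returning testing-set row indices).

```r
library(glmnet)
library(randomForest)

# dat: standardized data with the response in the last column
run_once <- function(dat, test_idx) {
  x <- as.matrix(dat[, -ncol(dat)]); y <- dat[, ncol(dat)]
  x_tr <- x[-test_idx, ]; y_tr <- y[-test_idx]
  x_te <- x[ test_idx, ]; y_te <- y[ test_idx]

  lasso <- cv.glmnet(x_tr, y_tr)                    # LASSO with cross-validated lambda
  rf    <- randomForest(x_tr, y_tr)                 # random forest with default settings

  c(lasso = sqrt(mean((y_te - predict(lasso, x_te, s = "lambda.min"))^2)),
    rf    = sqrt(mean((y_te - predict(rf, x_te))^2)))          # testing RMSE
}

# test_random <- sample(nrow(dat), 0.2 * nrow(dat))            # random subsampling
# test_split  <- SPlit(dat, splitRatio = 0.2)                  # assumed SPlit interface
# run_once(dat, test_random); run_once(dat, test_split)
```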

The testing RMSE values for the 500 simulations are shown in Figure 6. We can see that on average the testing RMSE is lower for SPlit compared to random subsampling. This improvement is much larger for random forest than for LASSO. We also note a significant improvement in the worst-case performance of SPlit over random subsampling. Furthermore, the variability in the testing RMSE is much smaller for SPlit compared to random subsampling, and therefore a more consistent conclusion can be drawn using SPlit. Thus, the simulation clearly shows that SPlit produces testing and training sets that are much better for model fitting and evaluation. Computation of SPlit for this dataset took on average 1.6 seconds on a computer with a 6-core 2.6 GHz Intel processor, which is a negligibly small price to pay for the improved performance over random subsampling.

Fig. 6 Distribution of RMSE over 500 random and SPlit subsampling splits for the concrete compressive strength dataset.

We repeated the simulation with several other datasets, namely Abalone (Nash et al. 1994), Airfoil self-noise (Brooks, Pope, and Marcolini 1989), Meat spectroscopy (Thodberg 1993), and Philadelphia birthweights (Elo, Rodriguez, and Lee 2001). The Abalone and Airfoil self-noise datasets can be obtained from the UCI Machine Learning Repository (Dua and Graff 2017), while Meat spectroscopy and Philadelphia birthweights can be obtained from the faraway package in R (Faraway 2015). The details of these datasets are summarized in Table 1. Figure 7 shows the testing RMSE values for both LASSO and random forest. We see similar trends as before on all the datasets: SPlit gives a better testing performance on average than random subsampling and a substantial improvement in the worst-case testing performance.

Fig. 7 Distribution of RMSE over 500 random and SPlit subsampling splits for the datasets described in Table 1.

Table 1 Description of datasets considered for regression.

4.2 Classification

For checking the performance of SPlit on classification problems, consider the famous Iris dataset (Fisher 1936). The Iris dataset has four continuous predictors (sepal length, sepal width, petal length, and petal width) and a categorical response with three levels representing the three types of Iris flowers (setosa, versicolor, and virginica). There are 150 rows in total, with 50 rows for each flower type. Following the discussion in Section 3, the flower type is converted into two continuous dummy variables using Helmert coding. Thus, the resulting dataset has six continuous columns. For modeling we will use multinomial logistic regression and random forest. The classification performance will be assessed using the residual deviance ($D$), defined as
$$D := 2\sum_{i=1}^{I}\sum_{j=1}^{J} y_{ij} \ln\!\left(\frac{y_{ij}}{\hat{p}_{ij}}\right), \tag{14}$$
where $I$ is the number of rows, $J$ the number of classes, $y_{ij} \in \{0, 1\}$ is 1 if row $i$ corresponds to class $j$ and 0 otherwise, and $\hat{p}_{ij}$ is the probability that row $i$ belongs to class $j$ as predicted by the model. Note that $0 \log 0$ is taken as 0 by definition.
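A small helper implementing Equation (14), with the 0 log 0 convention handled explicitly, could look like the following (the function name and arguments are illustrative).

```r
# Residual deviance for an I x J indicator matrix y (entries 0/1) and an I x J
# matrix of predicted class probabilities p_hat. A predicted probability of 0 for
# an observed class yields an infinite deviance, as discussed later in this section.
residual_deviance <- function(y, p_hat) {
  term <- y * log(y / p_hat)   # NaN where y = 0, because of 0 * log 0
  term[y == 0] <- 0            # 0 log 0 := 0 by definition
  2 * sum(term)
}
```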

Figure 8 shows a testing set selected by SPlit. We can see that the testing points are well balanced among the three classes and well spread out in the space of the four continuous predictors. We fit multinomial logistic regression and random forest on the training set and then compute the residual deviance on the testing set. This is repeated 500 times. Figure 9 shows the deviance results for SPlit, random, and stratified proportional random subsampling. We can see that again SPlit gives significantly better average and worst-case performance compared to both random and stratified proportional random subsampling.

Fig. 8 Visualizing a SPlit subsampling testing set (circles) for the Iris dataset.

Fig. 9 Distribution of residual deviance (D) over 500 random, stratified proportional random, and SPlit subsampling splits for the Iris dataset.

The foregoing study is repeated for four other datasets: banknote authentication, breast cancer (diagnostic, Wisconsin) (Street, Wolberg, and Mangasarian 1993), cardiotocography (Ayres-de Campos et al. 2000), and glass identification (Evett and Spiehler 1989), all of which can be obtained from the UCI Machine Learning Repository (Dua and Graff 2017). The details of these datasets are summarized in Table 2 and the results on the residual deviance are shown in Figure 10. It is possible to encounter an infinite deviance while calculating Equation (14); for the purpose of plotting, such a value is replaced with the maximum finite deviance obtained from the remainder of the 500 simulations. We can see that SPlit gives a better performance than both random and stratified proportional random subsampling in all the cases. The improvement realized varies over the datasets and modeling methods, but SPlit has a clear advantage over both random and stratified proportional random subsampling.
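For reference, the plotting convention just described amounts to a one-line replacement, assuming dvals is the vector of 500 deviance values for a given method and model (an illustrative name).

```r
dvals[!is.finite(dvals)] <- max(dvals[is.finite(dvals)])   # replace infinite deviances for plotting
```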

Fig. 10 Distribution of residual deviance (D) over 500 random, stratified proportional random, and SPlit subsampling splits for the datasets described in Table 2.

Table 2 Description of datasets considered for classification.

5 Conclusions

Random subsampling is probably the most widely used method for splitting a dataset into testing and training sets. In this article, we have proposed a new method called SPlit for optimally splitting the dataset. It is done by first finding the SP of the dataset and then using an efficient nearest neighbor algorithm to choose the subsample. These points are used as the testing set and the remaining points as the training set. The SP give the best possible representation of the dataset (according to the energy distance criterion) and therefore, SPlit is expected to produce a testing set that is best for evaluating the performance of a model fitted on the training set. The ability of SP to match the distribution of the full data is one of its big advantages over other deterministic data splitting methods such as CADEX and DUPLEX. We have also extended the method of SP to deal with categorical variables. Thus, SPlit can be applied to both regression and classification problems. We have also briefly discussed how a sequential application of the SP can be used to generate validation and cross-validation sets, but further development on this topic is left for future research.

We have applied SPlit on several datasets for both regression and classification using different choices of modeling methods and found that SPlit improves the average testing performance in almost all the cases with substantial improvement in the worst-case predictions. The variability in the testing performance metric using SPlit is found to be much smaller than that of random subsampling, which shows that the results and the findings of a statistical study would be much more reproducible if we were to use SPlit.


Acknowledgments

The authors thank Professor Dan Apley for handling this article as the editor and for his valuable comments and suggestions.

Supplementary materials

code and data.pdf: This file contains R code to reproduce the results in the figures, as well as links to all the datasets used in Section 4.

Additional information

Funding

This research is supported by a U.S. National Science Foundation grant CBET-1921873.

References