
A Novel Minimization Approximation Cost Classification Method to Minimize Misclassification Rate for Dichotomous and Homogeneous Classes

Article: 2021627 | Received 22 Feb 2021, Accepted 13 Dec 2021, Published online: 18 Jan 2022

ABSTRACT

Dependence of the linear discriminant analysis on location and scale weakens its class prediction performance in the presence of homogeneous covariance matrices for the candidate classes. Further, outlying samples cause the method to suffer higher rates of misclassification. In this study, we propose the minimization approximation cost classification (MACC) method, which accounts for a specific cost function $C(\theta)$. A theoretical derivation is given to find an optimal linear hyperplane $\theta$ that yields maximum separation between the dichotomous groups. Real-life data and simulations were used to validate the method against the standard classifiers. Results show that the proposed method is more efficient and outperforms the standard methods when the data are crowded at the class boundaries.

1. Introduction

The idea of optimizing a cost function has been of interest since the second half of the twentieth century (Gibra, 1967). For instance, engineers in a factory may need to control and optimize the total production cost of goods associated with high quantity and quality in a short time (Zavvar Sabegh et al., 2016). Minimizing cost is important and has several applications. For example, in the health sector, the risk of misclassifying a person infected with a very contagious disease such as COVID-19 or influenza can be disastrous, as many more people could become infected. Researchers in health science may need to minimize the cost of misclassifying patients, especially when allocating them to wards, so as to minimize undesired outcomes.

Models such as the multilevel logistic model have been used in medicine and engineering to predict class membership (Dey & Raheem, 2016). However, classification when data are crowded around the separable hyperplane remains a major statistical research problem. The popularly used standard linear discriminant analysis, as well as the quadratic discriminant analysis, is often characterized by high misclassification rates (Young & Raudys, 2004). The dependence of these methods on location and covariance weakens their class prediction performance under the assumption of homogeneity of the covariance matrix.

Besides, the presence of outlying samples may also render these methods prone to high misclassification rates. Therefore, the major contribution of our study is the development of a suitable cost function that can be used in the classification problem so as to minimize misclassification rates.

1.1. Defining the classification problem

The multivariate classification problem involves assigning the features $X_i$ in $\mathbb{R}^p$ space to one of the group memberships $y_i \in \mathbb{R}$. The general form of the linear classification function for binary outcomes is $y_i = \operatorname{sign}(f(X_i))$, where $y_i \in \{-1, 1\}$ and $f(X_i) = \theta^T X_i$, $i = 1, 2, 3, \ldots, N$. The linear discriminant analysis (LDA), sometimes called Fisher's approach, is the most basic linear classifier. As indicated in other studies, this method does not require the normality assumption to be satisfied (Liong & Foo, 2013; Tillmanns & Krafft, 2017). Its main assumption is homogeneity of the group covariance matrices, as is the case for the two-group classification (Puntanen, 2013). The general idea of the LDA is to construct a linear hyperplane that separates the two groups as much as possible. Suppose we have a random variable $X_i\ (i = 1, 2, \ldots, N)$ from one of the two groups $y_i \in \{-1, 1\}$, with $X_1 \sim \phi(\mu_1, \Sigma_1)$ and $X_2 \sim \phi(\mu_2, \Sigma_2)$, where $\phi$ is any multivariate distribution, not necessarily the normal distribution. We wish to classify each data vector $X_i$ of size $p \times 1$ to the binary group membership $\{-1, +1\}$, where the number of groups is $k = 2$. The overall covariance matrix $\Sigma_X$ satisfies $\Sigma_X = \Sigma_B + \Sigma_W$, where:

$$\Sigma_X = \sum_{j=1}^{2}\sum_{i=1}^{N}(X_{ij}-\mu)(X_{ij}-\mu)^T, \qquad \Sigma_B = \sum_{j=1}^{2} N_j\,(\mu_j-\mu)(\mu_j-\mu)^T$$

and

$$\Sigma_W = \sum_{j=1}^{2}\sum_{i=1}^{N_j}(X_{ij}-\mu_j)(X_{ij}-\mu_j)^T$$

where $\Sigma_B$ and $\Sigma_W$ are the between-class and within-class covariance matrices, $X_{ij}$ is the $i$th data vector in the $j$th class, $\mu_j$ is the true mean vector for the $j$th class, and $\mu$ is the overall true mean vector.

The theoretical mechanism for finding the optimal linear separable hyperplane is to estimate the parameter $\theta$ that maximizes the data variation between the classes and minimizes the variation within each class. In other words, it is equivalent to maximizing the standardized squared distance between the class centroids:

$$\frac{\theta^T\Sigma_B\theta}{\theta^T\Sigma_W\theta} \equiv \frac{(\bar z_1-\bar z_2)^2}{\sigma_z^2} = \frac{\big(\theta^T(\mu_1-\mu_2)\big)^2}{\theta^T\Sigma_p\theta} \tag{1}$$

where $\bar z_j$ is the transformed mean of the data vectors belonging to the $j$th $(j = 1, 2)$ class. The parameter for the linear hyperplane is $\theta$, while $\sigma_z^2$ is the variance of the transformed values $z_i = \theta^T X_i$, $i = 1, \ldots, N$, and $\Sigma_p$ is the pooled covariance matrix of $X$.

Consequently, the middle part of the expression in Equation (1) can be shown to be equivalent to the right-hand part by assuming that the parent populations have different population means but equal variances. Thereby, one unbiased estimator of the population variance $\sigma_z^2$ is the combined variance $\frac{(n_1-1)s_{z_1}^2+(n_2-1)s_{z_2}^2}{n_1+n_2-2}$, where $s_{z_1}^2, s_{z_2}^2$ are the sample variances of the transformed values of class 1 and class 2, respectively.
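As a small numerical illustration (with made-up transformed scores, not data from this study), the combined variance can be computed as follows:

```r
# Minimal sketch: pooled (combined) variance of the transformed scores.
# z1, z2 are illustrative placeholders, not data from the paper.
z1 <- c(1.2, 0.8, 1.5, 1.1)   # transformed values z_i for class 1
z2 <- c(-0.9, -1.3, -0.7)     # transformed values z_i for class 2
n1 <- length(z1); n2 <- length(z2)
s2_pooled <- ((n1 - 1) * var(z1) + (n2 - 1) * var(z2)) / (n1 + n2 - 2)
s2_pooled
```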

The idea of Fisher's approach in binary classification is to find a vector $\theta_{p\times 1}$ that maximizes the standardized squared distance between the two group centroids. The algebraic representation of this idea is the following maximization problem:

$$\max_{\theta}\ \frac{(\bar z_1-\bar z_2)^2}{\sigma_z^2} \quad \text{subject to: } \bar z_j = \theta^T\mu_j$$

where $\sigma_z$ is the standard deviation of all $N$ data vectors. With sufficient samples $n = n_1 + n_2$ from the two population groups ($N_1 + N_2 = N$), it can be assumed that our populations are normally distributed. Hence, the maximum likelihood estimators of the overall mean vector $\mu$ and of $\Sigma$, namely $\bar X$ and $S$ respectively, can be used. Using these estimators, we can show the following:

$$\frac{(\bar z_1-\bar z_2)^2}{s_z^2} = \frac{\big(\theta^T(\bar X_1-\bar X_2)\big)^2}{\sum_{i=1}^{n}(z_i-\bar z)^2} < \frac{\theta^T S_B\theta}{\theta^T S_W\theta}; \quad \text{since } S_X = S_B + S_W$$

The right-hand side of the last inequality is the Fisher-Rao criterion, where $S_B$ is the between-class sample covariance matrix, $S_X$ is the total sample covariance matrix and $S_W$ is the within-class sample covariance matrix; all these estimates are maximum likelihood estimators. Hence, maximizing the standardized squared distance between groups involves minimizing the within-group sample covariance matrices.

The lemma by Johnson and Wichern (see Puntanen, 2013), which uses the extended Cauchy-Schwarz inequality for optimization, was adopted in our search for an optimal estimate of $\theta$ in Equation (1).

Lemma 1.1. Let $B_{p\times p}$ be a symmetric positive definite matrix and $d_{p\times 1}$ a given vector. Then for any arbitrary nonzero vector $x_{p\times 1}$,

$$\max_{x\neq 0}\ \frac{(x^T d)^2}{x^T B x} = d^T B^{-1} d \tag{2}$$

attained at $x = cB^{-1}d$ for any scalar $c \neq 0$.

After matching the vector $x$ with the right-hand side of Equation (1), we found that $B = S_p$ and $d = \bar x_1 - \bar x_2$. Taking a normalized vector $x$ gives $c = 1$.
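As a concrete illustration of the lemma applied to Fisher's problem, the following minimal R sketch (on simulated placeholder data, not the paper's datasets) computes the resulting estimate $\hat\theta = S_p^{-1}(\bar x_1 - \bar x_2)$:

```r
# Sketch: Fisher's discriminant direction from Lemma 1.1 with B = S_p,
# d = xbar1 - xbar2 and c = 1. Simulated placeholder data.
set.seed(1)
X1 <- matrix(rnorm(40 * 3, mean = 1), ncol = 3)   # group 1, p = 3
X2 <- matrix(rnorm(50 * 3, mean = 0), ncol = 3)   # group 2
n1 <- nrow(X1); n2 <- nrow(X2)
Sp <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)  # pooled S_p
d  <- colMeans(X1) - colMeans(X2)
theta <- solve(Sp, d)   # theta-hat = Sp^{-1} (xbar1 - xbar2)
theta
```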

The new version of this estimated hyperplane results from an iterative method that tries to minimize the within-group covariance $S_W$. We used the cost function $C(\theta)$ to minimize the distances of the data points from their corresponding centroids and consequently minimized the denominator containing $S_W$.

1.2. A motivating example

The search for another method that minimizes misclassification between groups has been motivated by a number of studies (Croux & Joossens, 2005; Shen et al., 2011; Velilla & Hernández, 2005; Zhang, 2004). Further motivation came from our exploratory analysis of simulation results from different data distributions, which revealed the effect of dispersion on the MCR. It was observed that as more data concentrate around the boundary (the separable hyperplane), their separation becomes very difficult, as seen in Figure 1. There are many data points close to the linear separable hyperplane, such as points 10, 17, 34 and 56 from the squares group, which are highly likely to be misclassified. In other words, the risk of losing information in estimating the optimal hyperplane, where $R(x_i) = E[L(x_i)]$ (Mengyi et al., 2012), is expected to be higher for the points around the hyperplane than for the other points in the same group.

Figure 1. Distribution of simulated binary-class data to demonstrate the misclassification problem.


Figure 2. Relationship between p-value separation and misclassification.


Moreover, among the circular points, there are also data points such as 113, 172 and 174 which are highly likely to be misclassified. It is logical, therefore, to regard fixing the same misclassification cost for all data points as unfair. This leads to our idea that introducing a suitable cost function using the MM principle (Mairal, 2013; Shen et al., 2011; Wang & Zou, 2018) would vary the cost for each data value according to how far that data point lies from the class mean vector. Hence, we will refer to the misclassification rate of the new method as the minimal cost classification rate (MCCR), while the new method will be referred to as the minimization approximation cost classification (MACC) method.

Thus, the aim of our study was to develop an optimal separable linear hyperplane using the MM principle with a cost function (discussed in Section 2) that minimizes the misclassification rate (MCR). In the next section we also show how the algorithm for obtaining the updated separable hyperplane from the current one was derived. In Section 3, we validate the proposed method by simulating datasets and comparing the methods in terms of misclassification rate. Further, real-life datasets were used to assess the proposed method using various train-test methods: SLDA, BSM, LOOCV and KFCV. All these train-test techniques are discussed in detail in Section 4. Ultimately, simulated datasets were used to explore the asymptotic behaviour of the proposed method and to compare its MCCR with the MCR of the SLDA.

2. Methodology

2.1. Developing the MACC based on the loss function

To achieve the study objectives, we applied a loss function to map values of one or more observed variables onto a real number representing some "cost" associated with the training item in the data (Shen et al., 2011). The total information lost can be represented by the cost function. In fact, the history of minimizing the MCR by using loss functions has motivated many researchers, for example, those who worked to obtain optimum estimators of the precision matrix $\Sigma^{-1}$ under a quadratic loss function (Mengyi et al., 2012). The cost may be taken as the average of the losses. We explored a quadratic loss function, $(x_i - \hat\mu_i)^2$, where $\hat\mu_i$ is the expected value of $x_i$. It measures how much information is lost between the observed value and its predicted value for each data item. A specific form of this cost function is the mean square error (MSE):

$$\mathrm{MSE} = \frac{\sum_{i=1}^{n}(x_i-\hat\mu_i)^2}{n}$$

where $\hat\mu_i$ is the corresponding expected value of $x_i$. In this study we used the quadratic loss function. Therefore, for the linear discriminant analysis (LDA), our cost function is:

$$C_j(z_i) = \sum_{i=1}^{n}(z_{ij}-\bar z_j)^2, \qquad j = 1, 2, \ldots, K \tag{3}$$

$z_{ij}$: the transformed value of the data vector $X_i$ in the $j$th group.

$\bar z_j$: the transformed value of the mean vector of the $j$th group.

We chose the quadratic cost function because it is easy to show that the total cost for both groups, in terms of the hyperplane $\theta$, can be written as $C(\theta) = \theta^T(\Sigma_1+\Sigma_2)\theta$. Therefore, minimizing the total cost requires minimizing the within-class variance for both groups by choosing the optimal value of $\theta$. This mechanism is similar to the approach of the Fisher-Rao criterion, which attempts to project the data points towards the centres of the groups, especially when many data points are concentrated around the marginal boundaries (Ahn & Marron, 2010). In some studies this process is called data piling: the high-dimensional data ($p > n$) are projected into a low dimension, maximizing the marginal distance between groups and projecting the data values concentrated on the boundaries towards the group centres. Our proposed method, on the other hand, varies the cost to be minimized based on the location of each data point relative to its class centre. The main difference is that data piling uses a kernel trick, whereas our method uses the cost function. Therefore, our method works in parallel with the classifiers, making it easy to validate against the classical methods of classification.
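For instance, the total cost of a candidate hyperplane can be evaluated as in the following sketch, where sample covariance matrices stand in for $\Sigma_1$ and $\Sigma_2$ and the data are simulated placeholders:

```r
# Sketch: total quadratic cost C(theta) = theta^T (S1 + S2) theta,
# using sample covariances in place of Sigma_1 and Sigma_2.
set.seed(2)
X1 <- matrix(rnorm(60 * 2, mean = 1), ncol = 2)
X2 <- matrix(rnorm(60 * 2, mean = -1), ncol = 2)
cost <- function(theta, S1, S2) as.numeric(t(theta) %*% (S1 + S2) %*% theta)
theta0 <- c(1, -1)           # an illustrative candidate hyperplane
cost(theta0, cov(X1), cov(X2))
```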

2.2. Overview of the Proposed MACC Method

We applied the majorization-minimization (MM) principle to find an expression for the iteratively updated separable hyperplane $\theta$ in terms of the current solution $\tilde\theta$, such that $\theta = f(\tilde\theta)$. After some iterations, the updated $\theta$ reaches an optimum at which the total cost $C(\theta)$ is minimal. One difficulty is obtaining a single closed form for the optimum $\theta$ from direct differentiation of the cost function $C(\theta)$: direct differentiation does not always lead to a closed form of $\theta$, nor does it always produce the desired solution. Indeed, it can be shown that expressing the cost function (3) in terms of $\theta$ and differentiating partially yields $\theta = 0$, which is logically impossible. Generally, the MM principle operates in two steps. The first step searches for a majorization function $D(\theta|\tilde\theta)$ such that $C(\theta) \le D(\theta|\theta_k)$ for any $\theta \neq \theta_k$. The second step differentiates $D(\theta|\theta_k)$ with respect to $\theta$, sets the derivative to 0, and iteratively finds an expression relating $\theta$ and $\tilde\theta$ (Lange & Wu, 2008; Wang & Zou, 2018). In fact, based on Fisher's approach (Shin, 2008), $\theta$ takes the form $\theta = (\beta_1, \beta_2, \ldots, \beta_p)^T$ for $p$ predictors. Finding the convex supremum majorization function $D(\theta|\tilde\theta)$ for any $\theta \neq \tilde\theta$ is quite difficult; since we instead use a quadratic convex approximation of the cost function (3), the principle as applied here may be called an approximation-minimization principle. Algebraically, for any $\theta \neq \tilde\theta$, the function satisfies:

$$f(h(\theta)) \le f(h(\tilde\theta)) + f'(h(\tilde\theta))\big(h(\theta)-h(\tilde\theta)\big) + \frac{f''(h(\tilde\theta))}{2!}\big(h(\theta)-h(\tilde\theta)\big)^2 \tag{4}$$

where $f(\cdot)$ is the cost function and $h(\theta)$ is the corresponding value of the linear classification function with unknown parameter $\theta$. Note that the right-hand side of this inequality is the second-order Taylor series approximation (Wu et al., 2019). In addition, $\tilde\theta = (\tilde\beta_1, \ldots, \tilde\beta_p)^T_{p\times 1}$ is the current solution and $\theta = (\beta_1, \beta_2, \ldots, \beta_p)^T$ is the updated solution (Mairal, 2013; Wang & Zou, 2018).
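As a consistency check (our own worked instance, not an additional result from the paper), take the quadratic loss of Equation (3) for a single transformed point, $f(u) = (u - \bar z_j)^2$ with $u = h(\theta) = \theta^T x_i$. Then $f'(u) = 2(u - \bar z_j)$ and $f''(u) = 2$, so the right-hand side of Equation (4) becomes

$$f(h(\theta)) = (\tilde\theta^T x_i - \bar z_j)^2 + 2(\tilde\theta^T x_i - \bar z_j)(\theta^T x_i - \tilde\theta^T x_i) + (\theta^T x_i - \tilde\theta^T x_i)^2,$$

with equality rather than inequality, because a quadratic loss has a constant second derivative and its second-order Taylor expansion is exact. This is the form expanded term by term in the derivation below.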

2.3. Deriving the minimization approximation cost classification (MACC)

Given the data matrix $X_{n\times p}$ with $i$th row $x_i^T$, let $\tilde b$ be an $n \times 1$ vector with $i$th element $f'(\tilde\theta^T x_i)$, let $\tilde\theta$ be the current solution and $\theta$ the updated solution; then $y_i = \operatorname{sign}(\theta^T x_i)$ is the linear classification function.

$$
\begin{aligned}
C(\theta) = \sum_{i=1}^{n} f(\theta^T x_i) \le D(\theta|\tilde\theta)
&= \sum_{i=1}^{n} f(\tilde\theta^T x_i) + \sum_{i=1}^{n} f'(\tilde\theta^T x_i)\,(\theta^T x_i - \tilde\theta^T x_i) + \sum_{i=1}^{n} \frac{f''(\tilde\theta^T x_i)}{2}\,(\theta^T x_i - \tilde\theta^T x_i)^2 \\
&= \sum_{i=1}^{n} f(\tilde\theta^T x_i) + \sum_{i=1}^{n} f'(\tilde\theta^T x_i)\,\theta^T x_i - \sum_{i=1}^{n} f'(\tilde\theta^T x_i)\,\tilde\theta^T x_i + \frac{1}{2}\sum_{i=1}^{n} f''(\tilde\theta^T x_i)\,(\theta^T x_i - \tilde\theta^T x_i)^2
\end{aligned}
$$

Since $f''(\theta^T x_i) = 2$ for any $x_i$ and $\theta$, then

$$
\begin{aligned}
D(\theta|\tilde\theta) &= \sum_{i=1}^{n} f(\tilde\theta^T x_i) + \sum_{i=1}^{n} \tilde b_i\,\theta^T x_i - \sum_{i=1}^{n} \tilde b_i\,\tilde\theta^T x_i + \sum_{i=1}^{n} (\theta^T x_i)^2 + \sum_{i=1}^{n} (\tilde\theta^T x_i)^2 - 2\sum_{i=1}^{n} \theta^T x_i x_i^T \tilde\theta \\
&= \sum_{i=1}^{n} f(\tilde\theta^T x_i) + \theta^T X^T \tilde b - \tilde\theta^T X^T \tilde b + \sum_{i=1}^{n} (\theta^T x_i)^2 + \sum_{i=1}^{n} (\tilde\theta^T x_i)^2 - 2\,\theta^T \Big(\sum_{i=1}^{n} x_i x_i^T\Big) \tilde\theta
\end{aligned}
$$

Now, in order to find the iterative equation for $\theta$ in terms of $\tilde\theta$, we differentiate the majorization (approximation) function $D(\theta|\tilde\theta)$ with respect to $\theta^T$ and set it equal to $0_{p\times 1}$, as follows:

$$\frac{\partial D}{\partial \theta^T} = 0 + \tilde b^T X - 0 + 2\,\theta^T \sum_{i=1}^{n}(x_i x_i^T) + 0 - 2\,\tilde\theta^T \Big(\sum_{i=1}^{n} x_i x_i^T\Big)^T = 0 \tag{5}$$

Solving Equation (5) for $\theta^T$ gives Equation (6):

$$\theta^T = \tilde\theta^T - \frac{1}{2}\,\tilde b^T X \Big(\sum_{i=1}^{n} x_i x_i^T\Big)^{-1} \tag{6}$$

Equation (6) can be iterated a number of times to obtain an updated solution $\theta$ (the hyperplane) until a desired minimum misclassification rate is reached. A threshold $\epsilon > 0$ is the desired minimum misclassification rate, which can be set by the researcher; convergence can also be defined through $\|\theta - \tilde\theta\|^2$, where $\theta$ is the updated solution and $\tilde\theta$ the previous solution. Moreover, if the issue of overfitting arises, it can be addressed through cross-validation. In Algorithm 1 we show how the estimated value of $\theta$ can be iteratively determined such that the misclassification rate (MCR) does not exceed $\epsilon$. Note that the proposed method performs well under the assumptions of homogeneity of the groups and nonsingularity of the matrix $\sum_{i=1}^{n} x_i x_i^T$. Also, using Taylor's approximation in the majorization function, and putting the parameter $\theta$ in the explicit form of Fisher's approach, $\theta^T = (\mu_1 - \mu_2)^T A$, yields one property of the majorization function $D(\theta|\tilde\theta)$ based on its partial derivatives with respect to the two mean vectors.

Further, letting $\theta^T = (\mu_1 - \mu_2)^T A$ in our majorization function $D(\theta|\tilde\theta)$ and taking partial derivatives with respect to $\mu_1$ and $\mu_2$ gives the expressions in Equations (7) and (8), respectively:

$$\frac{\partial D}{\partial \mu_1} = 0 + Z^T Y A - 0 + 2\sum_{i=1}^{n}(\mu_1-\mu_2)^T A y_i (A y_i)^T + 0 - 2\,(\mu_1-\mu_2)^T A \sum_{i=1}^{n}(y_i y_i^T) A^T = 0 \tag{7}$$
$$\frac{\partial D}{\partial \mu_2} = 0 - Z^T Y A + 0 - 2\sum_{i=1}^{n}(\mu_1-\mu_2)^T A y_i (A y_i)^T - 0 + 2\,(\mu_1-\mu_2)^T A \sum_{i=1}^{n}(y_i y_i^T) A^T = 0 \tag{8}$$

consequently:

$$\frac{\partial D}{\partial \mu_1} + \frac{\partial D}{\partial \mu_2} = 0 \tag{9}$$

This last partial differential equation implies that minimizing the majorization function $D(\theta)$, and consequently the cost function $C(\theta)$, leads to the minimum misclassification rate (MCR). Note that the rate of change of $D(\theta)$ with respect to $\mu_1$ should be approximately the same as the rate of change of $D(\theta)$ with respect to $\mu_2$, but in the opposite direction, while preserving homogeneity within groups.

2.4. A Pseudocode for the updated MACC hyperplane θ

To illustrate the application of the proposed minimal cost classification rate (MCCR), the pseudocode in Algorithm 1 describes the procedure for updating the hyperplane $\theta$. It is necessary to set the desired misclassification rate $\epsilon$, the sample size $n$ and the number of iterations, iter, which represents the maximum number of iterations allowed for updating the hyperplane $\theta$. The parameters $\alpha$ and $\beta$ are then set to be positive so as to control the variances in the covariance matrices $\alpha I$ and $\beta I$, where $I$ is an identity matrix. We then simulate $n/2$ samples for each group and estimate the covariance matrix and population mean $(\Sigma, \mu)$ of both groups from these samples. If a real dataset is available, the simulation part may not be required. Afterwards, we suggest conducting a test for homogeneity between the groups, $H_0: \Sigma_1 = \Sigma_2$, as well as for their separation, $H_0: \mu_1 = \mu_2$, to ascertain that the meaningful classification and separation required for using this method hold. Finally, we update $\theta$ at each iterative step $\tau$ by using Equation (6) to find the minimum misclassification rate (MCCR) that corresponds to the optimum $\theta$.

Algorithm 1: The pseudocode to implement the minimal cost function based on LDA

Data: file.txt
Result: calculate the MCCR
Test $\mu_1 = \mu_2$ and $\Sigma_1 = \Sigma_2$;
initialization $\theta$, iter, $\epsilon > 0$, $\tau = 1$;
while $\tau <$ iter do
    Calculate $z_i = \theta^T x_i$;
    Find the lost information of $z_i$ using the quadratic cost;
    Update $\theta$ using Equation (6);
    Calculate MCR;
    if MCR $> \epsilon$ then
        $\tau \leftarrow \tau + 1$;
        current $\theta$ is updated;
    else
        MCCR $\leftarrow$ MCR;
        Exit;
    end
end
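To make Algorithm 1 concrete, the following R sketch is a minimal rendering of the loop on simulated data; the settings for alpha, beta, iter and epsilon are illustrative choices, not values prescribed by the paper:

```r
# Minimal R sketch of Algorithm 1 (simulated data; alpha, beta, iter and
# epsilon are illustrative, not values prescribed by the paper).
set.seed(3)
p <- 3; n <- 200
alpha <- 1; beta <- 1.2                    # group variances (alpha*I, beta*I)
mu1 <- rep(1, p); mu2 <- rep(-1, p)
X1 <- matrix(rnorm(n/2 * p, sd = sqrt(alpha)), ncol = p) +
      matrix(mu1, n/2, p, byrow = TRUE)
X2 <- matrix(rnorm(n/2 * p, sd = sqrt(beta)), ncol = p) +
      matrix(mu2, n/2, p, byrow = TRUE)
X <- rbind(X1, X2); y <- rep(c(1, -1), each = n/2)

M <- crossprod(X)                          # sum_i x_i x_i^T
theta <- rep(1, p)                         # initial hyperplane
eps <- 0.05; iter <- 100; mcr <- 1

for (tau in seq_len(iter)) {
  z    <- drop(X %*% theta)                # z_i = theta^T x_i
  zbar <- ifelse(y == 1, mean(z[y == 1]), mean(z[y == -1]))
  b    <- 2 * (z - zbar)                   # b_i = f'(theta^T x_i), quadratic loss
  theta <- theta - 0.5 * drop(solve(M, crossprod(X, b)))  # Equation (6)
  mcr  <- mean(sign(drop(X %*% theta)) != y)
  if (mcr <= eps) break                    # MCCR reached
}
c(MCCR = mcr, iterations = tau)
```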

3. Validation of the proposed MACC method by monitoring the MCCR

The efficiency of our method is validated by comparing its misclassification rate, the MCCR, against that of four different classification approaches: the standard linear discriminant analysis (SLDA), the bootstrap sampling method (BSM), leave-one-out cross-validation (LOOCV) and k-fold cross-validation (KFCV). We compare them by assessing their performance based on their resulting misclassification rates (MCR). A brief description of each method follows:

(1) The SLDA is Fisher's approach to classification (Puntanen, 2013; Shin, 2008).

(2) In the BSM, samples of specified sizes are selected randomly from the dataset; the linear model is fitted and used to predict the group memberships of the remaining unselected samples, from which the MCR is computed. This process is repeated many times, and finally the average MCR is calculated (Shao, 1993).

(3) The LOOCV splits the dataset into two parts: the "train" frame, which contains all samples except one subject, and the "test" frame. The train set is used to fit the linear classifier, which then predicts the group membership of the left-out subject. This process continues until every subject in the data frame has been left out once, giving the final MCR (Xu & Goodacre, 2018).

(4) Under the KFCV, the data are divided into k parts containing relatively equal numbers of subjects. It is an extension of the LOOCV, but the test set contains more than one subject. Each fold is treated in turn as the test frame while the others are used to fit the model; at the end, the k MCRs are averaged (Xu & Goodacre, 2018). A sketch of this procedure is given after this list.

(5) The MCCR is the misclassification rate calculated from the newly proposed method, which uses the cost function based on the MM principle.
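To indicate how such comparisons can be set up in practice, the following R sketch estimates the MCR of a linear classifier under the KFCV scheme of item (4); it uses MASS::lda on simulated placeholder data and is not the code used for the paper's results:

```r
# Sketch: k-fold cross-validated misclassification rate (KFCV) for LDA.
# Simulated placeholder data; MASS::lda stands in for the fitted classifier.
library(MASS)
set.seed(4)
n <- 150; p <- 4
X <- rbind(matrix(rnorm(n/2 * p, mean = 0.8), ncol = p),
           matrix(rnorm(n/2 * p, mean = -0.8), ncol = p))
y <- factor(rep(c("A", "B"), each = n/2))
k <- 5
fold <- sample(rep(1:k, length.out = n))   # random fold assignment
mcr_fold <- sapply(1:k, function(f) {
  fit  <- lda(X[fold != f, ], grouping = y[fold != f])  # fit on k-1 folds
  pred <- predict(fit, X[fold == f, ])$class            # predict held-out fold
  mean(pred != y[fold == f])
})
mean(mcr_fold)                             # averaged MCR over the k folds
```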

3.1. Validation of the MACC method using simulation study

Datasets of different sizes $N$, with $N/2$ data values in each group and seven predictors ($p = 7$), were generated from two multivariate normal distributions with known parameters $\mu_1, \mu_2, \Sigma_1$ and $\Sigma_2$. In each iteration, the covariance matrices were tested for homogeneity using the Box-M test, so as to check the validity of linear discrimination. Moreover, the hypothesis of equal population mean vectors, $H_0: \mu_1 = \mu_2$, was also tested in order to check for sufficient separation between the groups, as required for meaningful classification. This simulation process was conducted with a different seed for each dataset. Table 1 presents the results of these calculations; note that each calculated MCR is based on an average of 100 iterations for each dataset.
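For readers wishing to reproduce the two preliminary checks, the sketch below applies them to simulated placeholder data; it assumes the boxM function from the biotools package and the HotellingsT2 function from the ICSNP package are available:

```r
# Sketch: preliminary tests before linear classification.
# Assumes the biotools and ICSNP packages are installed.
library(biotools)   # boxM: Box's M test for homogeneity of covariances
library(ICSNP)      # HotellingsT2: test of equal mean vectors
set.seed(5)
p <- 7; n <- 300
X1 <- matrix(rnorm(n/2 * p, mean = 0.5), ncol = p)
X2 <- matrix(rnorm(n/2 * p, mean = 0.0), ncol = p)
X  <- rbind(X1, X2)
grp <- factor(rep(1:2, each = n/2))
boxM(X, grp)             # H0: Sigma_1 = Sigma_2 (homogeneity)
HotellingsT2(X1, X2)     # H0: mu_1 = mu_2 (separation)
```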

Table 1. Comparison of MCR and MCCR based on simulated data

It can be seen from Table 1 that in most cases there are small differences in misclassification between the standard classical LDA and our proposed MACC method. More specifically, when separation between the groups becomes more difficult, as indicated by an increase in the p-value, the MACC method performs more efficiently than the standard linear discriminant analysis (LDA).

3.2. Validation of the MACC method using data from real life studies

In this section, we present validation analyses of the MACC method based on five real-life datasets. These may not necessarily meet the assumptions of the new method (MACC), but they are useful for exploring its performance. The first dataset includes 12 predictors and a sample size of n = 872 students. We performed logistic regression to select the 5 most significant predictors, which were used in the discrimination at the next stage. The group memberships for the discriminant function were on-time or late graduation. Before conducting the standard linear classification, the equality of the two mean vectors was tested using Hotelling's $T^2$ test. A significant difference resulted, implying that it was possible to separate the two groups by classification methods. In addition, we tested the equality of the covariance matrices of the two groups, $H_0: \Sigma_1 = \Sigma_2$, using the Box-M test statistic; small p-values ($p < 0.05$) indicated that linear classification was not the appropriate method. Despite this indication, the SLDA was conducted and resulted in an MCR of 16.5%. After performing 100 iterations using Equation (6), the MACC method's minimal cost classification rate (MCCR) was 16.5%, the same misclassification rate as that obtained by the standard linear classifier.

The second dataset analysed was students-performance, with a sample size of 604 students. The purpose of this dataset was to predict group membership, success ($n_1 = 226$) or fail ($n_2 = 378$), using five predictors. The Box-M test gave a high p-value ($p = 0.818$), indicating that homogeneity of the data values in the two groups was acceptable. On the other hand, the p-value for testing equality of the mean vectors was small ($p = 0.0024$), indicating a significant difference between $\mu_1$ and $\mu_2$ and hence possible separation between the two groups. The SLDA and MACC methods were conducted and yielded very close misclassification rates: an MCR of 38.2% and an MCCR of 38.7%.

The third dataset was collected from the Department of Psychology at the Sultan Qaboos University Hospital (SQUH). It contained information about eighty patients, including five features: age of patient ($x_1$), gender ($x_2$), primary weight of patient ($x_3$), age group ($x_4$) and drug group ($x_5$). The response for this model had two levels: overweight, if the weight increased by eight or more kilograms after taking the drug; otherwise, no significant difference. Equality of the covariance matrices was tested ($p = 0.121$), allowing the use of linear classification. Further, the possibility of separation between the groups was tested ($p = 0.3045$), reflecting the difficulty of separating these data values since the group centres lie in approximately the same location. The standard linear discriminant method yielded an MCR of 41.2%. By contrast, the MACC method gave, on average over 100 iterations, a minimal cost classification rate (MCCR) of 38.6%, which reflects a considerable improvement from using the proposed MACC method, particularly for this dataset.

Table 2 illustrates the results of the analysis of these three datasets plus two other datasets, Bullying and Purchased, which were collected using questionnaires administered as mini-projects among students of the College of Nursing and the College of Economics and Political Science, respectively. Because of the marginally significant difference between the centroids and the extremely significant difference between the covariances of the dichotomous groups, the performance of the proposed MACC method was poor on the Bullying dataset.

Table 2. Misclassification rates and minimal cost classification rates of real life data

Furthermore, we validated the efficiency of our proposed MACC method using the train-test approach (Xu & Goodacre, 2018). Findings from these analyses continue to confirm the superiority of the proposed MACC method over the standard LDA or QDA, particularly when data points are crowded at the boundaries with no significant difference in covariances and no significant separation between the groups. We provide a detailed discussion in the next section.

4. Discussion

The aim of our study was to propose a new method to improve the classification performance for data that are often clustered around the linear separable hyperplane. Referring to Table 2, it is clear that the performance of the proposed method varies from one dataset to another; it depends on the degree of overlap between the groups as well as the significance of differences in their homogeneity (Calabrese, 2014; Naranjo et al., 2019). For instance, on the first validation dataset, gradstudents, the classification performance of both methods (SLDA and MACC) was approximately the same; given the low overlap between groups and the significant difference in homogeneity between groups, this result is reasonable. Moreover, for the second dataset, studperform, the performance of the SLDA was relatively better than the MACC's, since marginally significant mean vectors still indicate that not much overlap existed between the groups; consequently, no crowding of data was expected along the boundaries, and we expect little to no contribution of the cost function to minimizing the misclassification rate. On the other hand, applying the proposed MACC method to the drug data gave a clear improvement in the MCCR (38.6%) compared with the MCR (41.2%). This was due to the overlap between the groups together with equivalent group covariances, which signalled a greater role for the quadratic cost function in influencing the data points that contribute to estimating the hyperplane $\theta$. Further, the poor performance of the MCCR on the fourth dataset, Bullying, can be explained by the same combination of marginally significant separation and differences in homogeneity.

It has been shown in other studies that splitting datasets into two parts, the train and test data, can improve the performance of classification methods (Shao, 1993; Xu & Goodacre, 2018). In this section we discuss its effect on the MCR. The splitting methods considered included the bootstrap splitting method (BSM), the k-fold splitting method KFSM (k = 5, 10) and leave-one-out cross-validation (LOOCV). Moreover, we tested some of them using the Chi-square goodness-of-fit test statistic and compared their performance by taking the mean MCR over a number of iterations. Comparing their performance was important in drawing some key conclusions.

We utilized the real-life datasets as demonstrated in Table 3. We developed R functions to fit five linear discriminant functions using the three splitting methods for each real dataset. This process was repeated 100 times, and the final p-value (the average of 100 p-values) corresponding to each fitted model is presented in Table 3. Although some of these models gave good classification performance, most of them did not fit the data well, meaning that the hypothesized goodness of fit was rejected. Further, we applied the proposed MACC method to the five real datasets using the train-test approach with 100 repetitions and calculated the average MCCR, which yielded greater classification efficiency than the classical LDA. Thus, we concluded that using different splitting methods improves neither the MCR nor the goodness of fit. Besides, the train-test splitting method (train = 60%, test = 40%) was relatively the most appropriate choice for the MACC to address the overfitting issue.

Table 3. Comparison of the three classification splitting methods based on MCR

Furthermore, we compared the effect of increasing crowdedness of data points around the boundaries on misclassification using both methods. To verify this, we simulated 20 distinct datasets with two classes from multivariate normal distributions with equal covariance matrices and increased the centroid separation in each dataset. The resulting relationship is presented in Figure 2. It can be seen that as the separation between groups decreases (the p-value increases), the MCCR decreases. On the other hand, the trend of the blue dots shows that decreasing the separation (more overlap and a larger p-value) yields poorer (higher) misclassification rates for the standard LDA method.

The main challenge for any classification problem is the existence of overlap between groups, especially where there is no clear separation, which often results in poor classifier performance (Naranjo et al., 2019; Pires et al., 2020). This phenomenon occurs when the centroids of the two groups are too close to each other, identified by a very large p-value of Hotelling's test. For this reason, we suggest testing the separation of the groups and their homogeneity before using the proposed MACC method.

5. Conclusion

Our study sought to develop a method, based on the quadratic cost function and the majorization-minimization principle, to improve the classification of data that are concentrated at the boundaries and mixed into the other group. The proposed MACC method has been validated against the standard methods through simulations and real-life data. The findings show that the proposed method gives a minimal misclassification rate compared to the standard classification methods. The method outperforms the linear discriminant analysis for more homogeneous groups, when data are crowded at the boundaries.

To address overfitting, we illustrated numerically that using distinct splitting methods, such as the bootstrapping and k-fold algorithms, did not improve the classification performance of the SLDA. However, reduced misclassification rates were realised with the proposed method. Therefore, we recommend using the proposed MACC method to perform classification when group homogeneity is under threat.

Public interest statement

There are many real-life applications in which classification is difficult due to similarities between the prior classes. Failure to classify correctly can be dangerous, and its cost prohibitive. The misclassification cost could take the form of financial loss, the death of a misdiagnosed patient, or simply sending a student abroad to study a major that is incompatible with his abilities. One application is to correctly classify a patient as having either influenza or COVID-19 based on their signs and symptoms. To overcome this problem, our study introduces a suitable classification method that provides a minimal cost compared to current classifiers.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

The authors received no direct funding for this research.

Notes on contributors

Mubarak Al-Shukeili

Mubarak Al-Shukeili holds an MSc in Statistics and is currently a final-year PhD student. His research investigates methods that can minimize misclassification rates. He is also interested in research in medical science, mathematical modelling and computational statistics.

Ronald Wesonga


Ronald Wesonga holds a PhD in Statistics; he is a professional statistician with vast knowledge, skills and experience gained over the years through collaborative networks with other professionals across the world. As a university professor, he has published widely in high-impact journals, inspired many students, groomed junior staff and is currently enthusiastic about estimation error minimization as well as creating deeper understanding and new knowledge in data, computing and statistics.

References