SHORT COMMUNICATIONS

Some results of classification problem by Bayesian method and application in credit operation

Pages 150-157 | Received 30 Nov 2017, Accepted 22 Sep 2018, Published online: 03 Oct 2018

ABSTRACT

This study presents some results on classification by the Bayesian method. Upper and lower bounds of the Bayes error are given, together with its determination in the one-dimensional and multi-dimensional cases. Based on proposals for estimating the probability density functions, calculating the Bayes error and determining the prior probabilities, we establish an algorithm to evaluate the ability of customers to pay their debts at banks. This algorithm is implemented as a Matlab procedure that can be applied well to real data. The proposed algorithm is tested in a real application at a bank in Viet Nam and obtains the best results in comparison with existing approaches.

1. Introduction

The classification problem is one of the main subdomains of discriminant analysis and is related to many fields. Classification assigns an element to the appropriate population based on its observed variables. It is an important branch of multivariate statistics and has been applied in many different fields, such as medicine and economics. Recently, this problem has attracted the interest of many statisticians, in both theory and application (Miller, Inkret, & Little, 2001; Nguyen-Trang & Vo-Van, 2017; Pham-Gia, Nhat, & Phong, 2015; Tai, Thao, & Ha, 2016). There are many methods for classification, such as the Fisher method, logistic regression, the Bayesian method and machine learning algorithms (Naive Bayes (NB), support vector machine (SVM), k-nearest neighbour, etc.). The Fisher method requires the assumption that the covariance matrices of the groups are equal, which is its drawback in real applications (Fisher, 1936; Marta, 2001). When constructing a logistic regression model, we must impose conditions on the data that are difficult to satisfy in reality (James, 2001; Jan, Cheng, & Shih, 2010), so it is not suitable for many applications. According to the literature (Altman, 1991; Hastie & Tibshirani, 1996), machine learning algorithms have the following major disadvantages: (i) Error diagnosis and correction: one notable limitation of machine learning is its susceptibility to errors; when errors do occur, diagnosing and correcting them can be difficult because it requires going through the underlying complexities of the algorithms and associated processes. (ii) Time constraints in learning: a machine learning system cannot make accurate predictions immediately; it learns from historical data, and the bigger the data and the longer the system is exposed to it, the better it performs. (iii) Problems with verification: machine learning deals with statistical truths rather than literal truths; in situations not covered by the historical data, it is difficult to prove with complete certainty that the predictions made by a machine learning system are suitable in all scenarios. (iv) Limitations of predictions: unlike humans, computers are not good storytellers; machine learning systems know more than they can tell, so they cannot always provide rational reasons for a particular prediction or decision. The Bayesian method classifies based on the distribution of the data, the prior probabilities and the relation between the classified element and the groups. It does not require as much historical data as machine learning algorithms because it uses prior probabilities in classifying. It also does not need the normality condition for the data and can classify two or more populations. As a result, it has many advantages in classification (Tai, 2017).

Given $k$ populations $w_1, w_2, \ldots, w_k$, let $f_i(x)$ and $q_i$ be the probability density function (pdf) and the prior probability of $w_i$, respectively. Pham-Gia, Turkkan, and Tai (2008) used the maximum function of the pdfs, $g_{\max}(x) = \max\{q_1 f_1(x), \ldots, q_k f_k(x)\}$, as a tool to study the Bayesian method and gave important results. The classification principle and the Bayes error were established based on this function. The upper and lower bounds of the Bayes error and its relationship with the distance between the pdfs, as well as with the overlap coefficient of the pdfs, were also built. The function $g_{\max}$ plays a very important role in the classification problem by the Bayesian method, so the authors have continued to study it. Using the Matlab software, Pham-Gia et al. (2015) computed $g_{\max}$ for two bivariate normal pdfs. With given specific parameters of the two densities, this method can determine the regions of $\mathbb{R}^2$ on which each density attains the maximum, and their boundaries (straight lines, ellipses, parabolas or hyperbolas). However, it cannot be performed for non-normal distributions. In a similar direction, Tai (2017) proposed the $L^1$ distance of the $g_{\max}$ functions and established its relationship with the Bayes error; this distance was also used to calculate the Bayes error and to classify new elements. However, we see that the quantities relevant to the Bayesian approach have not yet been surveyed completely.

The Bayesian method has many advantages; however, to our knowledge, it is applied in practice less often than the others. We can find many applications in banking and medicine using the Fisher method, the logistic method and machine learning algorithms (Altman, 1991; Christopher, 2006; Cristianini & Shawe, 2000; Jan et al., 2010; Marta, 2001). Recently, statistical software has been able to handle large, multivariate data effectively and quickly for the above methods, while the Bayesian method does not have this advantage. The cause of this problem is the ambiguity in determining the prior probabilities, the difficulty of estimating the pdfs and the complexity of calculating the Bayes error. Although all these issues have been discussed by many authors, optimal methods have not yet been found (Tai, 2017). In this article, we propose specific approaches to all of the above-mentioned problems. From these results, we establish a complete algorithm for evaluating the ability of customers to pay their debts at banks from their information. The proposed algorithm is applied to customers of Vietcom bank in Viet Nam. This application shows advantages in comparison with existing approaches. The proposed algorithm can also be applied in other domains.

The rest of the article is structured as follows. Section 2 presents the classification principle and the Bayes error. Determining the Bayes error and some of its results, and finding the maximum function $g_{\max}$ in the one-dimensional and multi-dimensional cases to calculate the Bayes error and to classify a new element, are also performed in this section. Section 3 proposes an algorithm to evaluate the ability of customers to pay debts at banks and resolves the computational issues arising in the practical application of this algorithm. This section also compares the proposed algorithm with existing ones through several numerical examples. Section 4 applies the proposed algorithm to real data at a bank in Viet Nam. The final section concludes the paper.

2. Classification principle and Bayes error

2.1. Classification principle

Given $k$ populations $w_1, w_2, \ldots, w_k$, let $q_i$ and $f_i(x)$ be the prior probability and the pdf of $w_i$, respectively. According to Pham-Gia et al. (2008), an observation $x_0$ is assigned to the population $w_i$ if
$$q_i f_i(x_0) = \max\{q_1 f_1(x_0), q_2 f_2(x_0), \ldots, q_k f_k(x_0)\}, \tag{1}$$
where $\sum_{i=1}^{k} q_i = 1$.

The misclassification probability of this method is called the Bayes error. It is given by (2):
$$Pe = \sum_{i=1}^{k} q_i \int_{\mathbb{R}^n \setminus R_i} f_i(x)\,dx, \tag{2}$$
where $R_i = \{x \in \mathbb{R}^n : q_i f_i(x) \ge q_j f_j(x) \text{ for all } j \ne i\}$ is the region of elements assigned to $w_i$. From (2), we can prove the following result:
$$Pe = 1 - \sum_{i=1}^{k} q_i \int_{R_i} f_i(x)\,dx. \tag{3}$$
The correct classification probability is determined by (4):
$$1 - Pe = \sum_{i=1}^{k} q_i \int_{R_i} f_i(x)\,dx. \tag{4}$$
A small numerical sketch of rule (1) and the error (3) is given below.
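As a quick illustration of rule (1) and formulas (2)-(4), the following minimal Python sketch classifies a point and approximates the Bayes error for two univariate normal populations; the parameters, priors and integration range are illustrative assumptions, not values from the paper.

```python
# A minimal numerical sketch of rule (1) and the Bayes error, for two assumed
# univariate normal populations with uniform priors (not data from the paper).
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

q = [0.5, 0.5]                               # prior probabilities q_1, q_2
f = [norm(0, 1).pdf, norm(2, 1).pdf]         # densities f_1, f_2

def classify(x0):
    """Rule (1): assign x0 to the group i maximising q_i * f_i(x0)."""
    return int(np.argmax([qi * fi(x0) for qi, fi in zip(q, f)])) + 1

def bayes_error():
    """Pe = 1 - integral of g_max(x), with g_max(x) = max_i q_i f_i(x)."""
    g_max = lambda x: max(qi * fi(x) for qi, fi in zip(q, f))
    correct, _ = quad(g_max, -8, 10, limit=200)  # range covers both densities
    return 1.0 - correct

print(classify(0.7), bayes_error())          # group 1, Pe close to 0.1587
```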

2.2. Determining Bayes error

Theorem 2.1

Let $f_1, f_2, \ldots, f_k$ be $k$ pdfs defined on $\mathbb{R}^n$ and let
$$g_{\max}(x) = \max\{q_1 f_1(x), q_2 f_2(x), \ldots, q_k f_k(x)\}. \tag{5}$$
The Bayes error is determined by
$$Pe = 1 - \int_{\mathbb{R}^n} g_{\max}(x)\,dx. \tag{6}$$

Proof.

See Appendix 1.

2.3. Bounds of Bayes error

Theorem 2.2

Let $f_1, f_2, \ldots, f_k$ be $k$ pdfs defined on $\mathbb{R}^n$. We have the following bounds of the Bayes error, as well as its relationships with other measures:

  1. (7)

  2. (8)

  3. (9)

  4. (10)

where $\rho$ is the affinity of Toussaint (1972).

Proof.

See Appendix 2.

From (7), with a suitable choice of the parameters, we obtain the relationship between the Bayes error and the affinity of Matusita (1967). In particular, when $k = 2$, we have the relationship between the Bayes error and the Hellinger distance.

In addition, we also have relations between the Bayes error and the overlap coefficient as well as the $L^1$ distance of the $g_{\max}$ functions (see Tai, 2017). For the special case $k = 2$, Pham-Gia et al. (2008) established expressions relating the Bayes error to the $L^1$ distance of $f_1$ and $f_2$ and of $q_1 f_1$ and $q_2 f_2$, sketched below.
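For the two-population case, the relation reported in Pham-Gia et al. (2008) can be recovered in one line using the identity $\max(a, b) = (a + b + |a - b|)/2$:
$$\int_{\mathbb{R}^n} g_{\max}(x)\,dx = \frac{1}{2}\int_{\mathbb{R}^n}\big(q_1 f_1 + q_2 f_2\big)\,dx + \frac{1}{2}\int_{\mathbb{R}^n}\big|q_1 f_1 - q_2 f_2\big|\,dx = \frac{1}{2}\Big(1 + \|q_1 f_1 - q_2 f_2\|_1\Big),$$
so that $Pe_{1,2} = 1 - \int_{\mathbb{R}^n} g_{\max}(x)\,dx = \frac{1}{2}\big(1 - \|q_1 f_1 - q_2 f_2\|_1\big)$.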

2.4. Maximum function

To classify a new element by (1) and to determine the Bayes error by (3), we must find the maximum function $g_{\max}$. Some authors, such as Pham-Gia et al. (2015) and Tai (2017), have surveyed the relationships between $g_{\max}$ and related quantities of the classification problem. Specific expressions for $g_{\max}$ in some special cases have been found. However, a general expression for all cases is a complex problem that has not yet been solved.

Given $k$ pdfs $f_1, f_2, \ldots, f_k$ with prior probabilities $q_1, q_2, \ldots, q_k$, let $g_i(x) = q_i f_i(x)$ for $i = 1, 2, \ldots, k$. The maximum function $g_{\max}(x) = \max\{g_1(x), \ldots, g_k(x)\}$ is determined in the following two cases:

(i) For one-dimension

In this case, we can find $g_{\max}$ by the following algorithm:

  • Step 1. Solve the equations $g_i(x) = g_j(x)$, $1 \le i < j \le k$, to find all roots.

  • Step 2. With each root $x_0$ of the equation $g_i(x) = g_j(x)$, compare the value $g_i(x_0)$ with all the values $g_l(x_0)$, $l \ne i, j$. If there exists $l$ such that $g_l(x_0) > g_i(x_0)$, then we delete $x_0$; we keep it otherwise. Arranging the kept roots in order from small to large, we have the set $\{x_1, x_2, \ldots, x_m\}$.

  • Step 3. Given $x \in \mathbb{R}$, $g_{\max}(x)$ is determined by the following principles:

    • If $x < x_1$ then $g_{\max}(x) = g_i(x)$ for the index $i$ satisfying $g_i(x_1 - c_0) = \max_j g_j(x_1 - c_0)$;

    • If $x_t \le x < x_{t+1}$, $t = 1, \ldots, m-1$, then $g_{\max}(x) = g_i(x)$ for the index $i$ satisfying $g_i\big((x_t + x_{t+1})/2\big) = \max_j g_j\big((x_t + x_{t+1})/2\big)$;

    • If $x \ge x_m$ then $g_{\max}(x) = g_i(x)$ for the index $i$ satisfying $g_i(x_m + c_m) = \max_j g_j(x_m + c_m)$.

In the above algorithm, $c_0$ and $c_m$ are positive constants chosen so that $x_1 - c_0$ and $x_m + c_m$ do not cross any other root.

From this algorithm, we have written a Matlab procedure to find $g_{\max}$. When $g_{\max}$ is determined, we can easily calculate the Bayes error by (3), as well as classify a new element by (1). A Python sketch of this algorithm is given below.
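The following sketch mirrors the three steps above (the authors' implementation is a Matlab procedure, which is not reproduced here); the grid used to bracket the roots, the tolerance, the constants and the normal components are assumptions.

```python
# A minimal sketch of the one-dimensional algorithm for g_max; at least one
# kept root is assumed to exist for the example mixture below.
import numpy as np
from itertools import combinations
from scipy.optimize import brentq
from scipy.stats import norm

def kept_roots(gs, grid):
    """Steps 1-2: roots of g_i = g_j, kept only if no other g_l dominates there."""
    roots = []
    for i, j in combinations(range(len(gs)), 2):
        diff = lambda x: gs[i](x) - gs[j](x)
        for a, b in zip(grid[:-1], grid[1:]):         # bracket sign changes on a grid
            if diff(a) * diff(b) < 0:
                r = brentq(diff, a, b)
                if all(g(r) <= gs[i](r) + 1e-12 for g in gs):
                    roots.append(r)                   # Step 2: keep undominated roots
    return sorted(roots)

def g_max(gs, x, roots, c=1e-3):
    """Step 3: between consecutive kept roots, g_max coincides with one g_i."""
    edges = [-np.inf] + roots + [np.inf]
    probes = ([roots[0] - c]                          # point below the smallest root
              + [(a + b) / 2 for a, b in zip(roots[:-1], roots[1:])]
              + [roots[-1] + c])                      # point above the largest root
    for lo, hi, p in zip(edges[:-1], edges[1:], probes):
        if lo <= x < hi:
            i = int(np.argmax([g(p) for g in gs]))    # winning component on this interval
            return gs[i](x)

qs = [0.3, 0.3, 0.4]
fs = [norm(0, 1).pdf, norm(2, 1).pdf, norm(4, 1.5).pdf]
gs = [lambda x, q=q, f=f: q * f(x) for q, f in zip(qs, fs)]
roots = kept_roots(gs, np.linspace(-5, 10, 400))
print(roots, g_max(gs, 1.0, roots))
```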

(ii) For multi-dimensions

In the case of multi-dimensions, it is very complicated to obtain a closed expression for $g_{\max}$. The difficulty comes from the various forms of the intersection curves between the surfaces of the pdfs. This problem has attracted the interest of the authors in Ghosh (2006), Pham-Gia et al. (2008), Pham-Gia et al. (2015) and Tai (2017). The authors in Pham-Gia et al. (2015) attempted to find the function $g_{\max}$; however, it was established only for some cases of the bivariate normal distribution.

Here, we do not seek the expression of $g_{\max}$. Instead, we compute the Bayes error by integrating $g_{\max}$ with the quasi Monte-Carlo method. An algorithm for this computation has been constructed, and a corresponding Matlab procedure has also been established. A Python sketch of the idea follows.
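A minimal sketch of the quasi Monte-Carlo idea, assuming a bounding box large enough to capture essentially all of the mass of the densities; the Halton sequence, box limits and Gaussian populations are illustrative choices, not the authors' procedure.

```python
# Approximating the Bayes error by quasi Monte-Carlo integration of g_max
# over an assumed bounding box.
import numpy as np
from scipy.stats import qmc, multivariate_normal

def bayes_error_qmc(pdfs, priors, lower, upper, n_points=2**14):
    """Approximate Pe = 1 - integral of max_i q_i f_i(x) over the box."""
    sampler = qmc.Halton(d=len(lower), scramble=True)
    u = sampler.random(n_points)                      # low-discrepancy points in [0,1)^d
    x = qmc.scale(u, lower, upper)                    # map them into the bounding box
    g = np.max([q * f(x) for f, q in zip(pdfs, priors)], axis=0)
    volume = np.prod(np.asarray(upper) - np.asarray(lower))
    return 1.0 - volume * g.mean()                    # QMC estimate of 1 - integral of g_max

# Example with two bivariate normal populations and uniform priors.
f1 = multivariate_normal(mean=[0, 0], cov=np.eye(2)).pdf
f2 = multivariate_normal(mean=[2, 2], cov=np.eye(2)).pdf
print(bayes_error_qmc([f1, f2], [0.5, 0.5], lower=[-6, -6], upper=[8, 8]))
```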

3. The proposed algorithm in evaluating ability of customers to pay debts

3.1. The proposed algorithm

Based on the Bayesian method, we propose an algorithm to evaluate the ability of customers to repay bank debt. In bank credit operations, determining the repayment ability of customers is really important. If lending is too easy, the bank may face a bad debt problem; in contrast, if it is too strict, the bank will miss good business. Therefore, this problem interests many statisticians and managers.

Given $N$ customers divided into $k$ groups $w_1, w_2, \ldots, w_k$, each customer is described by $n$ variables. Let $X$ be the data set of all customers and let $x_0$ be a new customer that we need to classify. We propose an algorithm to classify $x_0$ (PAC) as follows:

  • Step 1: Determine the variables that have a statistically significant influence on the ability of customers to repay bank debt.

  • Step 2: Find the prototype element of each group by (11): $m_i = \sum_{j} u_{ij} x_j \big/ \sum_{j} u_{ij}$ (11), where $u_{ij}$ is the probability of the $j$th element being assigned to $w_i$ and $x_j$ is the coordinate of the $j$th element.

  • Step 3: Establish the initial partition matrix $U = [u_{ij}]_{k \times (N+1)}$, where the first $N$ columns are extracted from the known training data, with $u_{ij} = 1$ if the $j$th element belongs to $w_i$ and $u_{ij} = 0$ otherwise. The $(N+1)$th column contains the initial prior probabilities of $x_0$; we can choose them by the uniform distribution.

  • Step 4: Update the new partition matrix by the following principle (12): $u_{i(N+1)} = 1$ if $d(x_0, m_i) \le d(x_0, m_j)$ for all $j = 1, \ldots, k$, and $u_{i(N+1)} = 0$ otherwise ($d(x_0, m_i)$ is the distance from $x_0$ to the prototype $m_i$).

  • Step 5: Compute $\|U^{(t+1)} - U^{(t)}\|$, the difference between two consecutive partition matrices.

    Repeat Step 2, Step 3 and Step 4 until $\|U^{(t+1)} - U^{(t)}\| < \epsilon$.

  • Step 6: Estimate the pdf $f_i$ of each group $w_i$ and compute $g_{\max}(x_0) = \max\{q_1 f_1(x_0), \ldots, q_k f_k(x_0)\}$, where $q_i$ is the prior probability of $w_i$ obtained from Phase 1.

  • Step 7: If $q_i f_i(x_0) = g_{\max}(x_0)$, then $x_0$ is assigned to $w_i$, with the Bayes error determined by (3).

The proposed algorithm has two phases. Phase 1 determines the prior probabilities (Step 1 to Step 5) and Phase 2 classifies a new element with a specific Bayes error (Step 6 and Step 7). Phase 1 is an important contribution of the proposed algorithm and is established based on the fuzzy relation between the classified element and the populations. This phase finishes when the prior probabilities of two consecutive iterations are almost the same; thus, the number of iterations depends on each data set. The complexity of Phase 2 lies in computing the Bayes error. For one dimension, we first find $g_{\max}$ by the algorithm proposed in Subsection 2.4 and then compute the Bayes error by (3). For multi-dimensions, the Bayes error is approximated by the quasi Monte-Carlo method. A sketch of Phase 1 is given below.
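The following Python sketch shows Phase 1 under the crisp nearest-prototype reading of Equation (12) given above; the Euclidean distance (standing in for the $L^1$ distance of Subsection 3.2), the stopping rule and the toy data are assumptions.

```python
# A minimal sketch of Phase 1 (Steps 2-5) of the PAC algorithm. Crisp training
# columns stay fixed; only the column of the new element x0 is updated.
import numpy as np

def pac_priors(X, labels, x0, k, eps=1e-4, max_iter=100):
    N = len(X)
    U = np.zeros((k, N + 1))
    U[labels, np.arange(N)] = 1.0                # Step 3: crisp training columns
    U[:, -1] = 1.0 / k                           # initial uniform priors for x0
    data = np.vstack([X, x0])
    for _ in range(max_iter):
        # Step 2: probability-weighted prototypes, Equation (11)
        protos = (U @ data) / U.sum(axis=1, keepdims=True)
        d = np.linalg.norm(protos - x0, axis=1)  # distances from x0 to prototypes
        new_col = np.zeros(k)
        new_col[np.argmin(d)] = 1.0              # Step 4: crisp reading of Equation (12)
        done = np.abs(new_col - U[:, -1]).max() < eps   # Step 5: stability check
        U[:, -1] = new_col
        if done:
            break
    return U[:, -1]                              # prior probabilities for x0

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
print(pac_priors(X, np.array([0, 0, 1, 1]), np.array([1.2, 2.1]), k=2))
```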

3.2. Some other related problems of the proposed algorithm

In the above algorithm, we need to pay attention to the following problems:

  1. ϵ is a really small positive number chosen arbitrarily. The smaller it is, the more iterations and computation time are required. In this article, we choose ϵ = 0.0001.

  2. $d(x_0, m_i)$ is the distance from the object $x_0$ to the prototype $m_i$. Many distances between two elements are summarised in Webb (2002). In this paper, we use the $L^1$ distance (see Pham-Gia et al., 2008) for applications.

  3. To determine the variables having statistical significance in Step 1, we use the logistic regression model.

  4. Normally, in the case of no information, we choose the prior probabilities by the uniform distribution. Based on the training set, the prior probabilities are often estimated by the Laplace method, $q_i = (N_i + 1)/(N + k)$, and by the sample ratio, $q_i = N_i/N$, where $N_i$ and $N$ are the numbers of elements in the $i$th group and in the training set, respectively, and $k$ is the number of groups. These approaches have been studied and applied by many authors (McLachlan & Basford, 1998; Nguyen-Trang & Vo-Van, 2017; Tai, 2017; Tai et al., 2016). The first five steps of the proposed algorithm (Step 1 to Step 5) determine the prior probabilities of the groups. If the algorithm stops at the fifth step, we obtain the matrix $U$ of size $k \times (N+1)$, in which the last column gives the prior probabilities. Thus, in this algorithm, we have combined the sample data set and the classified element to determine the prior probabilities. Hence, they contain more information than the sample-ratio and Laplace methods, which depend only on the training data. These prior probabilities, which consider the relations between the classified object and all of the populations, may be more suitable than the traditional methods based only on the training set.

  5. There are many parametric and nonparametric methods to estimate the pdfs in Step 6. In the examples and applications of this article, we use the kernel function method, a popular method in practice (Inman & Bradley, 1989; Nguyentrang & Vovan, 2017; Tai, 2017; Tai & Pham-Gia, 2010; Tai et al., 2016). A short sketch of these estimates is given after this list.
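A minimal sketch of the estimates in items 4 and 5; the add-one Laplace form is an assumption (the paper's exact formula was lost in extraction), and scipy's gaussian_kde stands in for the authors' kernel method.

```python
# Prior estimates from the training set, plus a kernel density estimate (Step 6).
import numpy as np
from scipy.stats import gaussian_kde

def ratio_prior(N_i, N):
    """Sample ratio: q_i = N_i / N."""
    return N_i / N

def laplace_prior(N_i, N, k):
    """Laplace estimate (add-one form assumed): q_i = (N_i + 1) / (N + k)."""
    return (N_i + 1) / (N + k)

# Estimate a group's pdf from its training rows (3 variables here, toy data).
group = np.random.default_rng(0).normal(size=(50, 3))
f_hat = gaussian_kde(group.T)                 # scipy expects shape (dims, n_samples)
print(ratio_prior(50, 150), laplace_prior(50, 150, 2), f_hat(np.zeros((3, 1))))
```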

3.3. Numerical examples for comparison

In this section, three well-known data sets, Pima, Breast Tissue and User, are used to test the performance of the proposed method. The Pima data was originally donated by Vincent Sigillito, Applied Physics Laboratory, Johns Hopkins University, and was constructed by constrained selection from a larger database held by the National Institute of Diabetes and Digestive and Kidney Diseases. All patients represented in this data are females at least 21 years old of Pima Indian heritage living near Phoenix, Arizona, USA. The problem posed here is to predict whether a patient would test positive for diabetes according to World Health Organization criteria (i.e. the patient's 2-hour post-load plasma glucose is at least 200 mg/dl) given a number of physiological measurements and medical test results. The attributes include the number of times pregnant, plasma glucose concentration in an oral glucose tolerance test, diastolic blood pressure (mm Hg), triceps skin fold thickness (mm), 2-hour serum insulin (mu U/ml), body mass index (kg/m²), diabetes pedigree function and age (years). This is a two-class problem, with class value 1 interpreted as 'tested positive for diabetes'. There are 500 examples of Class 1 and 268 of Class 2. Breast Tissue is a data set of electrical impedance measurements of freshly excised tissue samples from the breast. It includes nine features: I0 (impedivity, in ohms, at zero frequency), phase angle at 500 kHz, high-frequency slope of the phase angle, DA (impedance distance between spectral ends), area under the spectrum, area normalised by DA, maximum of the spectrum, distance between I0 and the real part of the maximum frequency point, and length of the spectral curve. All observations in this data set are divided into four classes: car (carcinoma), con (connective), adi (adipose) and the merged class of fad (fibro-adenoma), mas (mastopathy) and gla (glandular). The last data set is a real one about students' knowledge status regarding the subject of Electrical DC Machines. All considered data sets are collected from www.is.umk.pl/projects/datasets.html. A summary of the three data sets is presented in Table 1.

Table 1. Summary of three benchmark data sets.

For each data set, we conducted the experiment 10 times, each time randomly using 30% of the objects as the test set. In addition, the results of the proposed algorithm are compared with the Fisher method, the logistic method and some machine learning algorithms such as NB, SVM, radial basis function support vector machine (RBFSVM), linear discriminant analysis (LDA) and k-nearest neighbour with k=1 (1-NN) and k=3 (3-NN). Since the considered data sets differ in their features, numbers of dimensions and numbers of groups, this comparison is very meaningful for evaluating the advantages of the proposed algorithm. A sketch of this evaluation loop is given below.
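A minimal sketch of this evaluation protocol; the 1-NN stand-in classifier and the toy data are placeholders for any of the compared methods, and scikit-learn is used only for the random splitting.

```python
# Mean empirical error over repeated random 70/30 splits.
import numpy as np
from sklearn.model_selection import train_test_split

def nearest_neighbour(X_tr, y_tr, x):
    """Stand-in classifier (1-NN); any of the compared methods could replace it."""
    return y_tr[np.argmin(np.linalg.norm(X_tr - x, axis=1))]

def empirical_error(classify, X, y, n_runs=10, test_size=0.3, seed=0):
    errors = []
    for run in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed + run)
        y_hat = np.array([classify(X_tr, y_tr, x) for x in X_te])
        errors.append(np.mean(y_hat != y_te))      # misclassification rate of this run
    return float(np.mean(errors))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(2, 1, (60, 2))])
y = np.repeat([0, 1], 60)
print(empirical_error(nearest_neighbour, X, y))
```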

As presented in Table 2, the proposed algorithm provides the best results for all three data sets.

Table 2. The empirical error of the proposed method and others.

4. Application in credit operation

In this section, we classify customers at Vietcom bank in Viet Nam to illustrate the application of the proposed algorithm. In this article, the Bayesian methods with prior probabilities calculated by the uniform distribution, the sample ratio, the Laplace method and the proposed algorithm are called BayesU, BayesR, BayesL and BayesC, respectively.

The considered customers in this application are companies in Can Tho city (CTC), Viet Nam. We collected a data set on 214 enterprises operating in key sectors such as agriculture, industry and commerce, including 143 cases of good debt (G) and 71 cases of bad debt (B). The data were provided by responsible organisations of CTC and studied in Tai (2017). Each company is evaluated on 13 independent variables chosen by expert opinion. The specific variables are given in Table 3.

Table 3. The surveyed independent variables.

In this application, we randomly use 70% of the data (100 elements of group G and 50 elements of group B) as the training set to determine the significant variables, to estimate the pdfs and to find a suitable model for the classification problem by the Bayesian method. The remaining 30% of the data is used as the validation set (43 elements of group G and 21 elements of group B). The results of the Bayesian method are also compared with the others.

To assess the effect of the independent variables on the solvency of the companies, we built a logistic regression model with the independent variables $X_i$, $i = 1, \ldots, 13$, where the dependent variable is the probability of repaying bank debt of the companies. The analytical results are summarised in Table 4; a sketch of this screening step is given after the table.

Table 4. The results of logistic regression model.
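A minimal sketch of the Step-1 screening described above; the data frame, column names and the statsmodels-based fit are assumptions about how the screening could be reproduced, with illustrative synthetic data in place of the bank data.

```python
# Fit a logistic regression and keep regressors significant at the alpha level.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def significant_variables(df, response="repay", alpha=0.10):
    X = sm.add_constant(df.drop(columns=[response]))   # add intercept term
    model = sm.Logit(df[response], X).fit(disp=0)      # fit the logistic regression
    pvals = model.pvalues.drop("const")
    return list(pvals[pvals < alpha].index)            # significant variable names

# Illustrative data: only X1 truly drives the response here.
rng = np.random.default_rng(0)
df = pd.DataFrame({f"X{i}": rng.normal(size=200) for i in range(1, 5)})
df["repay"] = (df["X1"] + rng.normal(size=200) > 0).astype(int)
print(significant_variables(df))   # expected to contain 'X1'
```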

Table 4 shows that only three variables, X1, X4 and X7, have statistical significance at the 10% level, so we use these three variables for classification. Performing BayesU, BayesR, BayesL and BayesC, we obtain the results given in Table 5.

Table 5. The correct probability (%) in classifying RBD from the training set.

Table 5 shows that the correct probability of BayesC with three variables is the largest. Comparing this result with those of some existing methods, we obtain Table 6.

Table 6. The correct probability (%) for optimal models on the training set.

Table 6 shows that BayesC also gives the highest result in comparison with the existing methods for 1, 2 and 3 variables. Using the best model for each method from Table 6 to classify the test set (64 elements), we obtain Table 7.

Table 7. Comparison of the correct probability (%) on the test set.

Once again, with the test data, BayesC gives the best result in Table 7.

5. Conclusion

This article has comprehensively considered the classification problem by the Bayesian method. The Bayes error in one dimension and multiple dimensions is surveyed in theory and application. The relationship between the Bayes error and the affinity of Toussaint is also established. Surveying the function $g_{\max}$ not only adds a tool for finding the Bayes error and classifying new elements but also provides a visual illustration of the classification problem. Based on the determination of the prior probabilities, the classification principle and the computation of the Bayes error, we have proposed a new algorithm to evaluate the ability of customers to pay their debts at banks. This algorithm is implemented as a Matlab procedure that can be applied well to real data. The proposed algorithm is compared with existing algorithms on several benchmark data sets, which show that it has more advantages than the existing approaches. We have also applied the proposed algorithm to customers of Vietcom bank in Viet Nam. This example shows the potential of the researched problem in real applications. In future work, we will continue to use it to study other problems.

Disclosure statement

No potential conflict of interest was reported by the author.

Additional information

Notes on contributors

Tai Vovan

Tai Vovan received the Ph.D. degree in probability theory and mathematical statistics in 2011. He has worked at Can Tho University, Viet Nam, since 1997. His research interests include statistical pattern recognition (classification and cluster analysis), fuzzy time series and their applications in data mining. He has published over 20 papers on these subjects.

References

  • Altman, D. G. (1991). Statistics in medical journals: Developments in the 1980s. Statistics in Medicine, 10, 1897–1913. doi: 10.1002/sim.4780101206
  • Christopher, M. B. (2006). Pattern recognition and machine learning. New York, NY: Springer.
  • Cristianini, N., & Shawe, T. J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press.
  • Fisher, R. A. (1936). The statistical utilization of multiple measurements. Annals of Eugenics, 7, 376–386.
  • Ghosh, A. K. (2006). Classification using kernel density estimates. Technometrics, 48, 120–132. doi: 10.1198/004017005000000391
  • Hastie, T., & Tibshirani, R. (1996). Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6), 607–616. doi: 10.1109/34.506411
  • Inman, H. F., & Bradley, E. L. (1989). The overlapping coefficient as a measure of agreement between probability distributions and point estimation of the overlap of two normal densities. Communications in Statistics - Theory and Methods, 18, 3851–3874. doi: 10.1080/03610928908830127
  • James, J. (2001). Interaction effects in logistic regression. London: Sage.
  • Jan, Y. K., Cheng, C. W, & Shih, Y. H. (2010). Application of logistic regression analysis of home mortgage loan prepayment and default risk. ICIC Express Letters, 2, 325–331.
  • Marta, E. (2001). Application of Fisher's method to materials that only release water at high temperatures. Portugaliae Electrochimica Acta, 15, 301–311.
  • Matusita, K. (1967). On the notion of affinity of several distributions and some of its applications. Annals of the Institute of Statistical Mathematics, 19(1), 181–192. doi: 10.1007/BF02911675
  • McLachlan, G. J., & Basford, K. E. (1998). Mixture models: Inference and applications to clustering. New York, NY: Marcel Dekker.
  • Miller, G., Inkret, W. C., Little, T. T., Martz, H. F., & Schillaci, M. E. (2001). Bayesian prior probability distributions for internal dosimetry. Radiation Protection Dosimetry, 94, 347–352. doi: 10.1093/oxfordjournals.rpd.a006509
  • Nguyentrang, T., & Vovan, T. (2017). Fuzzy clustering of probability density functions. Journal of Applied Statistics, 44(4), 583–601. doi: 10.1080/02664763.2016.1177502
  • Nguyen-Trang, T., & Vo-Van, T. (2017). A new approach for determining the prior probabilities in the classification problem by Bayesian method. Advances in Data Analysis and Classification, 11, 629–643. doi: 10.1007/s11634-016-0253-y
  • Pham-Gia, T., Nhat, N. D., & Phong, N. V. (2015). Statistical classification using the maximum function. Open Journal of Applied Statistics, 5(7), 665–679. doi: 10.4236/ojs.2015.57068
  • Pham–Gia, T., Turkkan, N., & Tai, V. V. (2008). Statistical discrimination analysis using the maximum function. Communications in Statistics Simulation and Computation, 37, 320–336. doi: 10.1080/03610910701790475
  • Tai, V. V. (2017). L¹-distance and classification problem by Bayesian method. Journal of Applied Statistics, 44(3), 385–401.
  • Tai, V. V., & Pham-Gia, T. (2010). Clustering probability distributions. Journal of Applied Statistics, 37(11), 1891–1910. doi: 10.1080/02664760903186049
  • Tai, V. V., Thao, N. T., & Ha, C. N. (2016). The prior probability in classifying two populations by Bayesian method. Applied Mathematics Engineering and Reliability, 6, 35–40. doi: 10.1201/b21348-7
  • Toussaint, G. T. (1972). Some inequalities between distance measures for feature evaluation. IEEE Transactions on Computers, C-21, 409–410. doi: 10.1109/TC.1972.5008991
  • Webb, A. (2002). Statistical pattern recognition. London: John Wiley & Sons.

Appendix 1. Proof of Theorem 2.1

To obtain (6), we need to prove the following two results: $\bigcup_{i=1}^{k} R_i = \mathbb{R}^n$ and $\sum_{i=1}^{k} q_i \int_{R_i} f_i(x)\,dx = \int_{\mathbb{R}^n} g_{\max}(x)\,dx$. Let $x \in \mathbb{R}^n$; then $g_{\max}(x) = q_i f_i(x)$ for some $i$. From (5), we obtain $q_i f_i(x) \ge q_j f_j(x)$ for all $j$; therefore, $x \in R_i$, so $\mathbb{R}^n \subseteq \bigcup_{i=1}^{k} R_i$. On the other hand, by De Morgan's laws, the complement of $\bigcup_{i=1}^{k} R_i$ is $\bigcap_{i=1}^{k} (\mathbb{R}^n \setminus R_i)$, which is empty. Similarly, two regions $R_i$ and $R_j$ with $i \ne j$ can only intersect where $q_i f_i(x) = q_j f_j(x)$, so the overlaps do not affect the integrals. In addition, from (5), $g_{\max}(x) = q_i f_i(x)$ on each $R_i$, so we can directly find that
$$\sum_{i=1}^{k} q_i \int_{R_i} f_i(x)\,dx = \int_{\mathbb{R}^n} g_{\max}(x)\,dx. \tag{A1}$$
Combining (3) and (A1), we obtain (6).

Appendix 2. Proof of Theorem 2.2

(i) For each $x \in \mathbb{R}^n$, we have a pointwise inequality bounding $g_{\max}(x)$ from below in terms of the $q_i f_i(x)$; therefore, (A2). On the other hand, since $g_{\max}(x)$ is one of the $k$ terms $q_i f_i(x)$, we have $g_{\max}(x) \le \sum_{i=1}^{k} q_i f_i(x)$, so (A3). Combining (A2) and (A3), and noting that the sum includes $k$ terms, we obtain two-sided bounds for $g_{\max}$. Integrating the resulting relation, we obtain (A4). Using (3) for (A4), we have (7). (ii) Starting from the corresponding pointwise inequality and integrating it, we obtain (8). (iii) and (iv) The proofs of (9) and (10) can be seen in Pham-Gia et al. (2008).
