On the analytical properties of category encodings in logistic regression

Pages 1870-1887 | Received 11 Dec 2020, Accepted 01 Jun 2021, Published online: 21 Jun 2021

References

  • Agresti, A. 2018. An introduction to categorical data analysis. 3rd ed. New York: Wiley.
  • Albert, A., and J. A. Anderson. 1984. On the existence of maximum likelihood estimates in logistic regression models. Biometrika 71 (1):1–10. doi:10.1093/biomet/71.1.1.
  • Alkharusi, H. 2012. Categorical variables in regression analysis: A comparison of dummy and effect coding. International Journal of Education 4 (2):202–10. doi:10.5296/ije.v4i2.1962.
  • Berry, K. J., P. W. Mielke, Jr., and H. K. Iyer. 1998. Factorial designs and dummy coding. Perceptual and Motor Skills 87 (3):919–27. doi:10.2466/pms.1998.87.3.919.
  • Bolton, C. 2009. Logistic regression and its application in credit scoring. Doctoral diss., University of Pretoria.
  • Cerda, P., G. Varoquaux, and B. Kégl. 2018. Similarity encoding for learning with dirty categorical variables. Machine Learning 107 (8–10):1477–94. doi:10.1007/s10994-018-5724-2.
  • Cohen, J., P. Cohen, S. G. West, and L. S. Aiken. 2013. Applied multiple regression/correlation analysis for the behavioral sciences. 3rd ed. London: Routledge.
  • Davis, M. J. 2010. Contrast coding in multiple regression analysis: Strengths, weaknesses, and utility of popular coding structures. Journal of Data Science 8 (1):61–73.
  • Guo, C., and F. Berkhahn. 2016. Entity embeddings of categorical variables. arXiv:1604.06737. https://arxiv.org/abs/1604.06737
  • Gupta, H., and V. Asha. 2020. Impact of encoding of high cardinality categorical data to solve prediction problems. Journal of Computational and Theoretical Nanoscience 17 (9):4197–201. doi:10.1166/jctn.2020.9044.
  • Hosmer, D. W., and S. Lemeshow. 2013. Applied logistic regression. 3rd ed. New York: John Wiley & Sons, Inc.
  • Johannemann, J., V. Hadad, S. Athey, and S. Wager. 2019. Sufficient representations for categorical variables. arXiv:1908.09874. https://arxiv.org/abs/1908.09874
  • Jordan, M. I., and T. Mitchell. 2015. Machine learning: Trends, perspectives, and prospects. Science 349 (6245):255–60. doi:10.1126/science.aaa8415.
  • Kerlinger, F. N., and E. J. Pedhazur. 1973. Multiple regression in behavioral research. New York: Holt, Rinehart, and Winston.
  • Lichman, M. 2013. UCI machine learning repository. Irvine, CA: University of California, School of Information and Computer Science. http://archive.ics.uci.edu/ml
  • Ma, Y., and Z. Zhang. 2020. Travel mode choice prediction using deep neural networks with entity embeddings. IEEE Access 8:64959–70. doi:10.1109/ACCESS.2020.2985542.
  • Micci-Barreca, D. 2001. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter 3 (1):27–32. doi:10.1145/507533.507538.
  • Myers, J. L., A. Well, and R. F. Lorch. 2010. Research design and statistical analysis. London: Routledge.
  • O'Grady, K. E., and D. R. Medoff. 1988. Categorical variables in multiple regression: Some cautions. Multivariate Behavioral Research 23 (2):243–60. doi:10.1207/s15327906mbr2302_7.
  • Potdar, K., T. S. Pardawala, and C. D. Pai. 2017. A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications 175 (4):7–9. doi:10.5120/ijca2017915495.
  • Refaat, M. 2011. Credit risk scorecards: Development and implementation using SAS. Raleigh, NC: LULU.COM.
  • Rhemtulla, M., P. É. Brosseau-Liard, and V. Savalei. 2012. When can categorical variables be treated as continuous? A comparison of robust continuous and categorical SEM estimation methods under suboptimal conditions. Psychological Methods 17:354–73. doi:10.1037/a0029315.
  • Siddiqi, N. 2006. Credit risk scorecards: Developing and implementing intelligent credit scoring. Hoboken, NJ: John Wiley & Sons, Inc.
  • Uyar, A., A. Bener, H. N. Ciray, and M. Bahceci. 2009. A frequency based encoding technique for transformation of categorical variables in mixed IVF dataset. In 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 6214–17.
  • von Eye, A., and C. C. Clogg, eds. 1996. Categorical variables in developmental research: Methods of analysis. New York: Academic Press.
  • Zeng, G. 2013. Metric divergence measures and information value in credit scoring. Journal of Mathematics 2013:1–10. doi:10.1155/2013/848271.
  • Zeng, G. 2015. A unified definition of mutual information with applications in machine learning. Mathematical Problems in Engineering 2015:1–12. doi:10.1155/2015/201874.
  • Zeng, G. 2017a. A comparison study of computational methods of Kolmogorov–Smirnov statistic in credit scoring. Communications in Statistics - Simulation and Computation 46 (10):7744–60. doi:10.1080/03610918.2016.1249883.
  • Zeng, G. 2017b. Invariant properties of logistic regression model in credit scoring under monotonic transformations. Communications in Statistics - Theory and Methods 46 (17):8791–807. doi:10.1080/03610926.2016.1193200.
  • Zeng, G. 2017c. On the existence of maximum likelihood estimates for weighted logistic regression. Communications in Statistics - Theory and Methods 46 (22):11194–203. doi:10.1080/03610926.2016.1260743.
  • Zeng, G. 2020a. On the confusion matrix in credit scoring and its analytical properties. Communications in Statistics - Theory and Methods 49 (9):2080–93. doi:10.1080/03610926.2019.1568485.
  • Zeng, G. 2020b. A graphic and tabular variable deduction method in logistic regression. Communications in Statistics - Theory and Methods. Advance online publication. doi:10.1080/03610926.2020.1839499.
  • Zeng, G., and E. Zeng. 2019a. On the three-way equivalence of AUC in credit scoring with tied scores. Communications in Statistics - Theory and Methods 48 (7):1635–50. doi:10.1080/03610926.2018.1435814.
  • Zeng, G., and E. Zeng. 2019b. On the relationship between multicollinearity and separation in logistic regression. Communications in Statistics - Simulation and Computation. Advance online publication. doi:10.1080/03610918.2019.1589511.
  • Zhang, W., T. Du, and J. Wang. 2016. Deep learning over multi-field categorical data. In Advances in Information Retrieval. ECIR 2016. Lecture Notes in Computer Science, vol. 9626, eds. N. Ferro, et al. 45–57. Cham: Springer. https://doi.org/10.1007/978-3-319-30671-1_4