Research Article

Improving Imbalanced Machine Learning with Neighborhood-Informed Synthetic Sample Placement

Murtaza Nasir, Ali Dag, Serhat Simsek, Anton Ivanov & Asil Oztekin
Pages 1116-1145 | Published online: 11 Dec 2022
 

ABSTRACT

Machine learning is widely used in information systems design. Yet, training algorithms on imbalanced datasets may severely affect performance on unseen data. For example, in some cases in healthcare, fintech, or cybersecurity contexts, certain subclasses are difficult to learn because they are underrepresented in training data. Our study offers a flexible and efficient solution based on a new synthetic average neighborhood sampling algorithm (SANSA), which, in contrast to other solutions, introduces a novel “placement” parameter that can be tuned to adapt to each dataset’s unique manifestation of the imbalance. This package can be downloaded for R (Footnote 1). We tested SANSA against seven existing sampling methods used in conjunction with the four most frequently used machine learning models trained on 14 benchmark datasets. Our results provide suggestive evidence that SANSA offers a feasible solution to the imbalance problem for most datasets. Our findings provide practical recommendations for how SANSA can be effectively implemented while reducing the complexity level of an imbalanced learning pipeline.

Acknowledgment

Earlier versions of this research, which first introduced our novel SANSA algorithm, received several awards: the international-level Decision Sciences Institute (DSI) Regional Best Paper Award at the DSI 2020 Annual Conference (across all six DSI chapters globally), Finalist for the Best Paper Award in Data Mining at the INFORMS 2020 Annual Conference, the Best Contribution to Theory Paper Award at the NEDSI 2020 Annual Conference, and the Best Overall Conference Paper Award at the NEDSI 2020 Annual Conference. We are grateful to all judges and award committee members for evaluating our work and considering it worthy of these awards. We are indebted to the audiences of those conferences for their invaluable feedback, which helped improve this work. We also thank the three anonymous referees of the JMIS review panel for comments and suggestions that significantly improved our work. Last but not least, we thank Editor-in-Chief Dr. Vladimir Zwass for his very timely management of our submission and his constructive comments throughout the submission process.

Supplementary information

Supplemental data for this article can be accessed online at https://doi.org/10.1080/07421222.2022.2127453

Disclosure Statement

No potential conflict of interest was reported by the authors.

Notes

2. Following other studies, we use Euclidean distance, which can be replaced, if needed, with other distance metrics based on a different geometry, within SANSA or any of the other ML algorithms.

5. Anirban Datta, Personal Loan Modeling - https://www.kaggle.com/teertha/personal-loan-modeling
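
As an illustration of Note 2, the following is a minimal Python sketch of a neighbor search in which the distance metric is a swappable argument. It is not taken from the SANSA package and does not reproduce the SANSA algorithm; the function names (`euclidean`, `manhattan`, `nearest_neighbors`) and the sample points are hypothetical and chosen only to show that changing the metric can change which neighbors are selected.

```python
import math
from typing import Callable, List, Sequence

Point = Sequence[float]
Metric = Callable[[Point, Point], float]


def euclidean(a: Point, b: Point) -> float:
    """Straight-line (L2) distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Point, b: Point) -> float:
    """City-block (L1) distance, one possible alternative metric."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_neighbors(query: Point, points: List[Point], k: int,
                      metric: Metric = euclidean) -> List[int]:
    """Return the indices of the k points closest to `query` under `metric`."""
    order = sorted(range(len(points)), key=lambda i: metric(query, points[i]))
    return order[:k]


# Hypothetical minority-class samples, for illustration only.
samples = [(3.0, 0.0), (2.0, 2.0), (5.0, 5.0)]
query = (0.0, 0.0)

nearest_l2 = nearest_neighbors(query, samples, k=1)                    # Euclidean
nearest_l1 = nearest_neighbors(query, samples, k=1, metric=manhattan)  # Manhattan
print(nearest_l2, nearest_l1)
```

In this toy example the nearest neighbor of the query differs between the two metrics, which is why the choice of geometry can matter for any neighborhood-based sampling method.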

Additional information

Notes on contributors

Murtaza Nasir

Murtaza Nasir is a Ph.D. candidate in Management Science at the Manning School of Business, University of Massachusetts Lowell, and an incoming Assistant Professor in the Finance, Real Estate, and Decision Sciences Department at the W. Frank Barton School of Business, Wichita State University. His research interests are in machine learning and data mining and span both theory and application. He has worked on healthcare, finance, and operational predictive and prescriptive analytics, in addition to pedagogical and theoretical work in business analytics and machine learning. He has received Best Paper awards at DSI 2020, NEDSI 2020, and NEDSI 2021 and was a finalist at INFORMS 2020.

Ali Dag

Ali Dag is an Associate Professor of Analytics in the Business Intelligence & Analytics Department at the Heider College of Business, Creighton University. He received his Ph.D. from Auburn University. His research interests include business and data analytics, operations research, operations management, and text mining. Dr. Dag serves as an associate editor of Journal of Business Analytics, Journal of Modeling in Management, and AI in Business. His research has been published in journals such as Journal of Management Information Systems, Decision Support Systems, OMEGA: The International Journal of Management Science, Annals of Operations Research, Journal of Business Research, and Information Systems Frontiers, among others. He has received Best Paper awards at DSI 2020 and NEDSI 2020 and was a finalist at INFORMS 2020.

Serhat Simsek

Serhat Simsek is an Assistant Professor in the Department of Information Management & Business Analytics at the Feliciano School of Business, Montclair State University. He earned his Ph.D. in Statistics from Auburn University. His research interests span information systems, healthcare analytics, and machine learning. His work has appeared in journals such as Journal of Management Information Systems, OMEGA, Decision Support Systems, Annals of Operations Research, and several others. His research has received awards from leading conferences including INFORMS and DSI.

Anton Ivanov

Anton Ivanov is an Assistant Professor of Information Systems in the Gies College of Business, University of Illinois at Urbana-Champaign. He received his Ph.D. in Management Information Systems from the State University of New York at Buffalo. Dr. Ivanov’s research stands at the intersection of information systems, social media, and healthcare analytics, with an emphasis on user-generated content.

Asil Oztekin

Asil Oztekin (*corresponding author) is an Associate Professor of Analytics & Operations Management in the Manning School of Business at the University of Massachusetts Lowell. He earned his Ph.D. from Oklahoma State University. Dr. Oztekin’s research interests relate to data science, data mining, predictive analytics, decision analytics, and decision support systems, with applications in healthcare analytics, marketing analytics, and text mining. He has published over 50 peer-reviewed articles in leading journals and conference proceedings, including Journal of Management Information Systems, European Journal of Operational Research, Decision Support Systems, International Journal of Production Research, OMEGA, Information Systems Frontiers, and Annals of Operations Research, among others. He serves as senior editor, associate editor, or editorial review board member for Journal of the Association for Information Systems, Decision Sciences, European Journal of Operational Research, Decision Support Systems, Journal of Business Research, and others. His research has received several awards from various venues, such as the DSI Annual Conference, the Northeast Decision Sciences Institute, and INFORMS.
