Abstract
Natural Language Processing (NLP) techniques are used to extract information from Electronic Health Records (EHRs) in order to identify patients with unique clinical characteristics and to define phenotypes. Classifying imbalanced datasets is also a major concern in medical diagnosis. We built an improved framework that automates the multi-class classification of imbalanced medical transcriptions [Citation1] into 40 medical specialties by creating a set of important phenotypes/features. We implemented and tested five machine learning models, of which the Random Forest classifier performed best, achieving an F1 score of 0.99 (precision 0.99, recall 0.99) and a ROC AUC score of 0.99 on test data.
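The precision, recall, and F1 figures quoted above are per-class quantities that must be averaged across classes in a multi-class setting, and on imbalanced data the choice of average matters. The abstract does not state which averaging was used, so the following is only an illustrative sketch, assuming macro and support-weighted averages; the function names and the toy specialty labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = {}
    for c in labels:
        predicted = tp[c] + fp[c]
        actual = tp[c] + fn[c]
        precision = tp[c] / predicted if predicted else 0.0
        recall = tp[c] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_prf(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by the number of true samples per class."""
    scores = per_class_prf(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c][2] * support[c] for c in scores) / len(y_true)


# Hypothetical toy data: 8 majority-class and 2 minority-class notes,
# with one minority note misclassified as the majority class.
y_true = ["Surgery"] * 8 + ["Neurology"] * 2
y_pred = ["Surgery"] * 8 + ["Surgery", "Neurology"]

print(f"macro F1:    {macro_f1(y_true, y_pred):.3f}")
print(f"weighted F1: {weighted_f1(y_true, y_pred):.3f}")
```

On this toy example the macro average (about 0.80) is noticeably lower than the support-weighted average (about 0.89), because the single error falls on the minority class. This is why a macro-averaged score is often the more informative summary for imbalanced data.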
Disclosure statement
No potential conflict of interest was reported by the author(s).
Additional information
Notes on contributors
Priti Bhardwaj
Priti Bhardwaj is a research scholar in the Department of Information Technology, Indira Gandhi Delhi Technical University for Women, New Delhi. Her research interests include machine learning, healthcare analytics, and natural language processing. E-mail: [email protected]
Niyati Baliyan
Niyati Baliyan is an assistant professor in the Department of Computer Engineering, National Institute of Technology Kurukshetra, Haryana. She received her Doctor of Philosophy degree from the computer science department of the Indian Institute of Technology (IIT) Roorkee, India. Her research interests include knowledge engineering, machine learning, healthcare analytics, recommender systems, information security, and natural language processing. Corresponding author. E-mail: [email protected]