Article ID: | iaor20132783 |
Volume: | 16 |
Issue: | 2 |
Start Page Number: | 119 |
End Page Number: | 128 |
Publication Date: | Jun 2013 |
Journal: | Health Care Management Science |
Authors: | Roumani Yazan, May Jerrold, Strum David, Vargas Luis |
Keywords: | data mining |
Highly imbalanced data sets are those where the class of interest is rare. In this paper, we compare the performance of five common data mining methods: logistic regression, discriminant analysis, Classification and Regression Tree (CART) models, C5, and Support Vector Machines (SVM), in predicting the discharge status (alive or deceased, with ‘deceased’ being the class of interest) of patients from an Intensive Care Unit (ICU). Across a range of misclassification cost ratio (MCR) values, and with specificity, recall, precision, the F‐measure, and confusion entropy (CEN) as the criteria for evaluating each method’s performance, C5 and SVM performed better than the other methods. At an MCR of 100, C5 had the highest recall, and SVM had the highest specificity and the lowest CEN. We also compared the five methods using Hand’s measure, under which logistic regression performed best. This article makes several contributions. We show that using MCR when analyzing imbalanced medical data significantly improves the methods’ classification performance. We also found that the F‐measure and precision did not improve as the MCR was increased.
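The evaluation criteria named in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that do not come from the paper: the MCR is taken to be the cost of a false negative divided by the cost of a false positive, it is applied by thresholding a predicted probability at 1 / (1 + MCR) (a standard cost-sensitive decision rule, not necessarily how the authors configured their models), and the labels and probabilities are invented toy values. The function names `confusion_counts`, `metrics`, and `cost_threshold` are hypothetical. CEN and Hand's measure are not covered.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN, treating `positive` as the class of interest."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Recall, specificity, precision, and F-measure (0.0 when undefined)."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
    }


def cost_threshold(mcr):
    """Probability threshold that minimizes expected cost.

    Assumes mcr = cost(false negative) / cost(false positive).
    """
    return 1.0 / (1.0 + mcr)


# Toy data (invented): 1 = deceased (rare class of interest), 0 = alive.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
probs = [0.40, 0.05, 0.12, 0.02, 0.30, 0.01, 0.03, 0.20, 0.04, 0.15]

for mcr in (1, 9):
    thr = cost_threshold(mcr)
    y_pred = [1 if p >= thr else 0 for p in probs]
    print(mcr, thr, metrics(*confusion_counts(y_true, y_pred)))
```

On this toy data, raising the assumed MCR from 1 to 9 lowers the threshold from 0.5 to 0.1, so recall rises while specificity falls. That shows only the general trade-off and does not reproduce the paper's results.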