Data Resampling Approach to Handle the Imbalanced Class Problem
Abstract
The imbalanced class problem in machine learning arises when the number of instances in the minority class differs significantly from the number in the majority class. An imbalanced class ratio biases the classifier toward the majority class, so that it tends to ignore the minority class and misclassify its instances. To tackle this problem, we use a data resampling approach with six popular resampling techniques: (i) random oversampling (ROS), (ii) random undersampling (RUS), (iii) synthetic minority oversampling technique (SMOTE), (iv) adaptive synthetic sampling (ADASYN), (v) SMOTETomek, and (vi) SMOTEENN, to balance the class ratio of 15 datasets. Each balanced dataset is then classified using a random forest classifier, and performance is measured with the geometric mean (G-Mean). To compare the six resampling techniques, the G-Mean values were tested using Friedman's nonparametric statistical test; if the null hypothesis was rejected, the analysis continued with Nemenyi's post hoc test. Based on mean ranks (lower is better), the best resampling technique is SMOTEENN (1.700), followed by ADASYN (2.767), RUS (3.333), SMOTETomek (3.867), SMOTE (4.000), and ROS (5.333).
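The evaluation steps named in the abstract can be illustrated with a minimal sketch. It assumes a binary problem, uses only the Python standard library, and covers just the two simplest resamplers (ROS and RUS), the G-Mean metric, and the mean-rank and Friedman chi-square computations. The function names are hypothetical, the Friedman formula is the textbook form without tie correction, and none of this is taken from the authors' code, whose tooling the abstract does not state.

```python
import math
import random
from collections import Counter


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity (binary case assumed)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


def random_oversample(X, y, seed=0):
    """ROS: duplicate random instances until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, t in zip(X, y) if t == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """RUS: keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, t in zip(X, y) if t == label]
        for x in rng.sample(pool, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


def mean_ranks(scores):
    """scores[i][j] is the score of method j on dataset i (higher is better).

    Rank 1 is the best method on a dataset; tied scores share the average rank.
    """
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared_rank = (i + j) / 2 + 1
            for m in range(i, j + 1):
                totals[order[m]] += shared_rank
            i = j + 1
    return [t / len(scores) for t in totals]


def friedman_statistic(mean_rank_values, n_datasets):
    """Friedman chi-square from mean ranks (no tie correction)."""
    k = len(mean_rank_values)
    sum_sq = sum(r * r for r in mean_rank_values)
    return 12 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)


# Mean ranks reported in the abstract, in the order
# SMOTEENN, ADASYN, RUS, SMOTETomek, SMOTE, ROS, over 15 datasets.
reported = [1.700, 2.767, 3.333, 3.867, 4.000, 5.333]
chi2 = friedman_statistic(reported, n_datasets=15)
print(f"Friedman chi-square (k=6, N=15): {chi2:.3f}")
```

The statistic would be compared against a chi-square distribution with k - 1 = 5 degrees of freedom. SMOTE, ADASYN, SMOTETomek, and SMOTEENN generate or clean instances using nearest-neighbor information and are left out of this sketch for brevity.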
Copyright (c) 2022 Yosua Alberth Sir, Agus H H Soepranoto
This work is licensed under a Creative Commons Attribution 4.0 International License.