Data Resampling Approach to Handle the Imbalanced Class Problem

  • Yosua Alberth Sir(1*)
    Universitas Nusa Cendana
  • Agus H H Soepranoto(2)
    Universitas Nusa Cendana
  • (*) Corresponding Author
Keywords: data resampling, imbalanced class, Friedman test, post hoc Nemenyi test

Abstract

In machine learning, the imbalanced class problem arises when the number of instances in the minority class differs significantly from the number in the majority class. An imbalanced class ratio biases the classifier's decisions: it tends to favor the majority class and ignore the minority class. To tackle this problem, we use a data resampling approach with six popular resampling techniques: (i) random oversampling (ROS), (ii) random undersampling (RUS), (iii) synthetic minority oversampling technique (SMOTE), (iv) adaptive synthetic sampling (ADASYN), (v) SMOTETomek, and (vi) SMOTEENN, applied to balance the class ratio of 15 datasets. Each balanced dataset is then classified using a random forest classifier. The performance metric is the geometric mean (G-Mean). To compare the six resampling techniques, the G-Mean values were tested using Friedman's nonparametric statistical test and, if the null hypothesis was rejected, followed by Nemenyi's post hoc test. Based on mean rank values (lower is better), the techniques rank from best to worst as follows: SMOTEENN (1.700), ADASYN (2.767), RUS (3.333), SMOTETomek (3.867), SMOTE (4.000), ROS (5.333).
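The evaluation steps described in the abstract can be sketched in plain Python. This is only an illustrative sketch, not the authors' code. It assumes binary labels for G-Mean (the square root of sensitivity times specificity) and uses the Friedman statistic and Nemenyi critical difference as given by Demšar (2006), without tie correction. The function names and the critical value `q_alpha = 2.850` (for six techniques at alpha = 0.05) are assumptions introduced here. The mean ranks are the rounded values reported in the abstract, so the resulting statistic is approximate and need not match the value in the paper.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity (binary case assumed)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def mean_ranks(scores):
    """Average rank of each technique over all datasets.

    scores: one row per dataset, one column per technique.
    Rank 1 goes to the highest score; tied scores share the average rank.
    """
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        i = 0
        while i < k:
            j = i
            # Extend the group while the next score ties with the current one.
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared_rank = (i + j) / 2 + 1
            for m in range(i, j + 1):
                totals[order[m]] += shared_rank
            i = j + 1
    return [total / len(scores) for total in totals]


def friedman_chi_square(avg_ranks, n_datasets):
    """Friedman statistic computed from mean ranks (no tie correction)."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)


def nemenyi_critical_difference(k, n_datasets, q_alpha):
    """Two techniques differ significantly if their mean ranks differ by more than this."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))


# Mean ranks as reported in the abstract (rounded), over 15 datasets.
reported = {
    "SMOTEENN": 1.700, "ADASYN": 2.767, "RUS": 3.333,
    "SMOTETomek": 3.867, "SMOTE": 4.000, "ROS": 5.333,
}
chi2 = friedman_chi_square(list(reported.values()), n_datasets=15)
cd = nemenyi_critical_difference(k=len(reported), n_datasets=15, q_alpha=2.850)
print(f"Friedman chi-square (approx.): {chi2:.2f}")
print(f"Nemenyi critical difference:   {cd:.3f}")
```

Under these assumptions, a pair of techniques would be considered significantly different when the gap between their mean ranks exceeds the printed critical difference. In practice the resampling itself and the random forest would come from libraries such as imbalanced-learn and scikit-learn, which are deliberately left out of this sketch.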

Published
2022-03-18
How to Cite
[1]
Y. Sir and A. Soepranoto, “Data Resampling Approach to Handle the Imbalanced Class Problem”, jicon, vol. 10, no. 1, pp. 31-38, Mar. 2022.
Section
Articles
