E-MAIL SPAM CLASSIFICATION USING THE TRANSFORMED COMPLEMENT NAÏVE BAYES (TCNB) METHOD

Hanna Florenci Tapikap; Bertha Selviana Djahi; Tiwuk Widiastuti

doi:10.35508/jicon.v7i1.878

Hanna Florenci Tapikap⁽¹⁾
Ilkom Undana
Bertha Selviana Djahi⁽²*⁾
Ilkom Undana
Tiwuk Widiastuti⁽³⁾

Keywords: Text Classification, Naïve Bayes, Transformed Complement Naïve Bayes (TCNB), Spam, Legitimate, K-Fold Cross Validation

Abstract

Classification is one way to organize text so that documents with similar content are grouped into the same category. One well-known text classification method is Naïve Bayes. It is computationally efficient and gives good predictions, but it performs poorly on unbalanced datasets. The Transformed Complement Naïve Bayes (TCNB) method is a modification of Naïve Bayes designed to overcome this weakness. In this research, TCNB was applied to an unbalanced e-mail dataset consisting of 481 spam e-mails and 2,412 legitimate e-mails (2,893 e-mails in total). Classification was carried out with and without cross-validation. With cross-validation, k ranged from 2 to 10. Without cross-validation, the data were split into 80% training data and 20% testing data. With cross-validation, TCNB reached its best accuracy, 93.917%, at k=10; without cross-validation, its best accuracy was 92.760%. It can therefore be concluded that TCNB can handle unbalanced datasets with good prediction accuracy.
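The abstract does not give the TCNB formulas, so the following is only a minimal sketch of how such a classifier and a k-fold evaluation might look, not the authors' implementation. It assumes the commonly published formulation of the method: a log term-frequency transform, an inverse-document-frequency transform, length normalization, parameter estimates taken from the complement of each class, and normalized log weights. The function names, the smoothing value `alpha=1.0`, the unstratified interleaved folds, and the toy e-mails are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def transform(docs):
    """Apply log-TF, IDF, and length normalization to tokenized documents."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency per word
    out = []
    for d in docs:
        vec = {w: math.log(c + 1) * math.log(n / df[w])
               for w, c in Counter(d).items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        out.append({w: v / norm for w, v in vec.items()})
    return out


def train_tcnb(docs, labels, alpha=1.0):
    """Estimate normalized log weights from the complement of each class."""
    vecs = transform(docs)
    vocab = sorted({w for d in docs for w in d})
    weights = {}
    for c in sorted(set(labels)):
        comp = defaultdict(float)  # word mass in all documents NOT in class c
        for vec, y in zip(vecs, labels):
            if y != c:
                for w, v in vec.items():
                    comp[w] += v
        total = sum(comp.values()) + alpha * len(vocab)
        logw = {w: math.log((comp[w] + alpha) / total) for w in vocab}
        scale = sum(abs(v) for v in logw.values()) or 1.0
        weights[c] = {w: v / scale for w, v in logw.items()}
    return weights


def predict(weights, doc):
    """Pick the class whose complement fits the document worst (lowest score)."""
    counts = Counter(doc)
    scores = {c: sum(n * w.get(t, 0.0) for t, n in counts.items())
              for c, w in weights.items()}
    return min(scores, key=scores.get)


def k_fold_accuracy(docs, labels, k):
    """Plain (unstratified) k-fold cross-validation with interleaved folds."""
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, len(docs), k))
        train = [(d, y) for i, (d, y) in enumerate(zip(docs, labels))
                 if i not in test_idx]
        model_k = train_tcnb([d for d, _ in train], [y for _, y in train])
        correct += sum(predict(model_k, docs[i]) == labels[i] for i in test_idx)
    return correct / len(docs)


# Hypothetical toy data: two spam e-mails, three legitimate ones (unbalanced).
DOCS = [
    ["win", "money", "now"],
    ["free", "money", "win"],
    ["meeting", "at", "noon"],
    ["project", "meeting", "notes"],
    ["lunch", "at", "noon"],
]
LABELS = ["spam", "spam", "legit", "legit", "legit"]

model = train_tcnb(DOCS, LABELS)
print(predict(model, ["free", "money"]))
print(k_fold_accuracy(DOCS, LABELS, 5))
```

Because each class is scored against the documents outside that class, the minority spam class is estimated from the larger legitimate set. This is the usual argument for why a complement-based variant copes better with unbalanced data. A study of the kind described above would repeat `k_fold_accuracy` for k=2 through k=10 on the full dataset and compare the result with a single 80%/20% split.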
