IMPLEMENTATION OF EMAIL SPAM DETECTION USING NAÏVE BAYES ALGORITHM AND DECISION TREE J48 TEXT MINING METHOD
Abstract
Email is a popular digital communication medium because sending messages by email is easy. Unfortunately, most email messages are spam. Spam is a message the recipient does not want, usually because it contains advertising or fraudulent content; ham is a message the recipient does want. One way to sort these messages is to classify each email as spam or ham. Naïve Bayes and the J48 decision tree are two algorithms that can be used for this classification. This study therefore compares the effectiveness of the Naïve Bayes algorithm and the J48 decision tree in filtering spam emails. The method used is text mining. English-language email text is preprocessed before being classified with Naïve Bayes and the J48 decision tree. Preprocessing comprises tokenization, stop-word removal, stemming, and attribute selection. The Naïve Bayes algorithm is a classification algorithm based on Bayesian decision theory, while the J48 decision tree algorithm is a development of the ID3 decision tree algorithm. The results show that the J48 decision tree achieves higher accuracy than Naïve Bayes: 93.117% versus 88.5284%. The study concludes that, judged by accuracy, the J48 decision tree is superior to Naïve Bayes for filtering spam emails.
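The preprocessing and Naïve Bayes steps named in the abstract can be illustrated with a minimal Python sketch. It is not the study's implementation: the stop-word list, the suffix-stripping stemmer, the training messages, and all function and class names are hypothetical, chosen only to make the pipeline concrete. Attribute selection and the J48 decision tree (Weka's implementation of C4.5) are omitted.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption; the study's list is not given).
STOP_WORDS = {"a", "an", "and", "are", "at", "for", "in", "is",
              "of", "the", "to", "you", "your"}


def crude_stem(token):
    """Toy suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize on letters, drop stop words, then stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        for doc, label in zip(documents, labels):
            self.word_counts[label].update(preprocess(doc))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, document):
        tokens = [t for t in preprocess(document) if t in self.vocabulary]
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            # log P(class) + sum of log P(token | class)
            score = math.log(doc_count / total_docs)
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training messages, for illustration only.
train_texts = [
    "Win a free prize now, claim your reward",
    "Cheap loans approved, click to win money",
    "Meeting moved to Monday, agenda attached",
    "Please review the attached project report",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("Claim your free money prize"))        # spam
print(model.predict("Project meeting agenda for Monday"))  # ham
```

Scores are accumulated as log-probabilities so that long messages do not underflow, and add-one smoothing keeps a word that was never seen in one class from forcing that class's probability to zero.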
Copyright (c) 2021 Rizka Safitri Lutfiyani, Niken Retnowati
This work is licensed under a Creative Commons Attribution 4.0 International License.