IMPLEMENTATION OF EMAIL SPAM DETECTION USING NAÏVE BAYES ALGORITHM AND DECISION TREE J48 TEXT MINING METHOD

  • Rizka Safitri Lutfiyani(1*)
    Universitas Widya Dharma
  • Niken Retnowati(2)
    Universitas Widya Dharma Klaten
  • (*) Corresponding Author
Keywords: Text mining, Decision tree, Naïve Bayes

Abstract

Email is a popular digital communication medium because sending messages by email is easy. Unfortunately, most email messages are spam. Spam is a message that the recipient does not want, usually because it contains advertising or fraudulent content. Ham is a message that the recipient wants. One way to sort these messages is to classify each email as spam or ham. Naïve Bayes and the J48 decision tree are algorithms that can be used for this classification. Therefore, this study aims to compare the effectiveness of the Naïve Bayes algorithm and the J48 decision tree in sorting spam emails. The method used is text mining. Data consisting of English-language email message text are preprocessed before being classified with Naïve Bayes and the J48 decision tree. The preprocessing stage includes tokenization, stop-word removal, stemming, and attribute selection. The preprocessed email text is then classified using the Naïve Bayes algorithm and the J48 decision tree. The Naïve Bayes algorithm is a classification algorithm based on Bayesian decision theory, while the J48 decision tree algorithm is a development of the ID3 decision tree algorithm. The results show that the J48 decision tree achieves higher accuracy than Naïve Bayes: J48 reaches an accuracy of 93.117%, while Naïve Bayes reaches 88.5284%. The study concludes that, judged by accuracy, the J48 decision tree is superior to Naïve Bayes for sorting spam emails.
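The pipeline described in the abstract (tokenization, stop-word removal, stemming, then classification) can be illustrated with a minimal sketch. This is not the authors' implementation: the stop-word list, the suffix-stripping `stem` function, the four training messages, and all function names are hypothetical and chosen only for illustration. Attribute selection and the J48 decision tree are omitted, and the classifier shown is a plain multinomial Naïve Bayes with add-one smoothing, which is an assumption about the variant.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop-word list; a real study would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and",
              "you", "your", "for", "at", "in", "on"}

def stem(token):
    """Crude suffix stripping, standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize into lowercase words, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

def train(messages):
    """Collect per-class message counts and word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in messages:
        class_counts[label] += 1
        for token in preprocess(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return class_counts, word_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for token in preprocess(text):
            if token in vocabulary:  # words unseen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training data, not taken from the study's dataset.
TRAINING = [
    ("Win a free prize now, claim your money", "spam"),
    ("Cheap loans, free money offer", "spam"),
    ("Meeting agenda for the project review", "ham"),
    ("Lunch at noon? Project notes attached", "ham"),
]
MODEL = train(TRAINING)
```

Calling `classify("Claim your free money", MODEL)` on this toy model selects the class whose smoothed word likelihoods, combined with the class prior, give the larger log score. Working in log space avoids numeric underflow when many word probabilities are multiplied.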

Published
2021-10-30
How to Cite
[1]
R. Lutfiyani and N. Retnowati, “IMPLEMENTATION OF EMAIL SPAM DETECTION USING NAÏVE BAYES ALGORITHM AND DECISION TREE J48 TEXT MINING METHOD”, jicon, vol. 9, no. 2, pp. 244-252, Oct. 2021.
Section
Articles
