CONSTRUCTING A DATASET FOR INFECTIOUS DISEASE PREDICTION AND SPATIAL CLUSTER ANALYSIS

  • Husni Iskandar Pohan(1*)
    Binus University
  • (*) Corresponding Author
Keywords: Covid, Dengue, Varicella, Dataset, SHAP, LIME, Cluster

Abstract

This study presents a structured methodology for constructing a custom dataset derived from patient visit records collected over a three-year period (January 1, 2019 – December 31, 2021) at a healthcare facility in Bandung Regency, Indonesia. The raw medical records were systematically transformed into a machine learning–ready dataset, involving feature extraction, labeling, and geospatial enrichment. Key transformations included the removal of personally identifiable information, the standardization of clinical symptoms into structured variables, and the assignment of diagnostic and referral labels in accordance with ICD-10 classification standards.
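The transformations listed above can be illustrated with a minimal sketch. It is not the study's actual pipeline: the field names (`name`, `complaint`, `diagnosis`, `referred`), the symptom vocabulary, and the keyword-matching rule are all hypothetical, and the ICD-10 codes shown (A90 for dengue fever, B01 for varicella, U07.1 for COVID-19) are standard category codes chosen for illustration, since the abstract does not list the codes the dataset uses.

```python
# Illustrative only: field names and vocabulary are assumptions, not the paper's schema.
PII_FIELDS = {"name", "national_id", "phone", "address"}

# Hypothetical controlled vocabulary for turning free-text complaints into flags.
SYMPTOM_VOCAB = ("fever", "rash", "cough", "headache")

# Standard ICD-10 category codes; the dataset's exact code assignments are not stated.
ICD10_BY_DIAGNOSIS = {
    "dengue": "A90",
    "varicella": "B01",
    "covid-19": "U07.1",
}


def to_ml_row(record):
    """Convert one raw visit record (a dict) into a de-identified feature row."""
    # 1. Remove personally identifiable information.
    row = {key: value for key, value in record.items() if key not in PII_FIELDS}

    # 2. Standardize the free-text complaint into binary symptom variables.
    complaint = row.pop("complaint", "").lower()
    for symptom in SYMPTOM_VOCAB:
        row[f"symptom_{symptom}"] = int(symptom in complaint)

    # 3. Assign the diagnostic label (None if the diagnosis is not in the mapping).
    diagnosis = row.pop("diagnosis", "").strip().lower()
    row["icd10"] = ICD10_BY_DIAGNOSIS.get(diagnosis)

    # 4. Encode the referral label as 0/1.
    row["referred"] = int(bool(row.pop("referred", False)))
    return row


example = to_ml_row({
    "name": "REDACTED",
    "complaint": "Fever and rash for 3 days",
    "diagnosis": "Varicella",
    "referred": False,
    "latitude": -7.0,
    "longitude": 107.5,
})
```

Simple substring matching is used here only to keep the sketch short; real clinical text would typically need negation handling and synonym mapping.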

Additionally, the dataset was enriched with spatial coordinates (longitude and latitude) to enable geospatial analyses such as transmission radius estimation, proximity clustering, and identification of regional case densities. This structure supports both supervised and unsupervised learning methods, including classification, referral prediction, and spatial cluster detection.
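As a hedged illustration of how latitude and longitude support proximity clustering, the sketch below computes great-circle distances with the haversine formula and groups cases that are linked by chains of neighbours within a chosen radius. The grouping rule (single-linkage via union-find), the function names, and the sample coordinates are assumptions made for this example; the abstract does not specify which clustering algorithm the study uses.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def proximity_clusters(points, radius_km):
    """Group point indices so that any two points closer than radius_km share a cluster.

    points is a list of (latitude, longitude) tuples. Returns a sorted list of
    index lists. Pairwise comparison is O(n^2), which is fine for a sketch.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if haversine_km(*points[i], *points[j]) <= radius_km:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical case locations (not taken from the dataset).
cases = [(-7.000, 107.500), (-7.001, 107.501), (-7.100, 107.600)]
clusters = proximity_clusters(cases, radius_km=0.5)
```

With these sample points, the first two cases lie roughly 0.16 km apart and fall into one cluster, while the third is several kilometres away and forms its own.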

The resulting dataset has been used in several experiments: disease classification, referral status prediction, feature importance interpretation using SHAP and LIME, geospatial clustering, and synthetic data generation to address privacy constraints and limited data availability. The methodology outlined in this study is intended to support future research in healthcare analytics and to contribute to the development of decision support systems and public health policy planning tools.

Published
2025-08-12
How to Cite
[1]
H. Pohan, “CONSTRUCTING A DATASET FOR INFECTIOUS DISEASE PREDICTION AND SPATIAL CLUSTER ANALYSIS”, jicon, vol. 13, no. 2, pp. 60-67, Aug. 2025.
