CONSTRUCTING A DATASET FOR INFECTIOUS DISEASE PREDICTION AND SPATIAL CLUSTER ANALYSIS
Abstract
This study presents a structured methodology for constructing a custom dataset derived from patient visit records collected over a three-year period (January 1, 2019 – December 31, 2021) at a healthcare facility in Bandung Regency, Indonesia. The raw medical records were systematically transformed into a machine learning–ready dataset, involving feature extraction, labeling, and geospatial enrichment. Key transformations included the removal of personally identifiable information, the standardization of clinical symptoms into structured variables, and the assignment of diagnostic and referral labels in accordance with ICD-10 classification standards.
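The transformations named above (removing identifying fields, turning symptom text into structured variables, and attaching ICD-10 labels) can be illustrated with a minimal sketch. The abstract does not give the actual schema, so every field name, the symptom vocabulary, and the diagnosis-to-code mapping below are assumptions made for illustration only, not the study's implementation.

```python
# Hypothetical field names: the real record schema is not given in the abstract.
PII_FIELDS = {"name", "national_id", "address", "phone"}

# Assumed symptom vocabulary, for illustration only.
SYMPTOMS = ("fever", "rash", "cough", "headache")

# Illustrative mapping from diagnosis text to ICD-10 codes.
ICD10 = {
    "dengue hemorrhagic fever": "A91",
    "varicella": "B01",
    "covid-19": "U07.1",
}


def to_ml_row(record):
    """Turn one raw visit record (a dict) into a de-identified feature row."""
    # 1. Drop personally identifiable fields.
    row = {k: v for k, v in record.items() if k not in PII_FIELDS}

    # 2. Convert free-text symptom notes into one binary column per symptom.
    #    Plain substring matching is naive (it ignores negation such as
    #    "no fever"); a real pipeline would need a more careful parser.
    notes = row.pop("symptom_notes", "").lower()
    for symptom in SYMPTOMS:
        row["sym_" + symptom] = int(symptom in notes)

    # 3. Attach an ICD-10 diagnostic label (None if the diagnosis is unmapped)
    #    and a binary referral label.
    diagnosis = row.pop("diagnosis", "").strip().lower()
    row["icd10"] = ICD10.get(diagnosis)
    row["referred"] = int(bool(row.pop("referred", False)))
    return row
```

Called on a dict that holds a patient name, symptom notes, and a diagnosis, `to_ml_row` returns a row without the name, with `sym_*` indicator columns, an `icd10` code, and a 0/1 `referred` flag.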
Additionally, each record was enriched with spatial coordinates (longitude and latitude) to enable geospatial analyses such as transmission radius estimation, proximity clustering, and identification of regional case densities. This structure supports both supervised and unsupervised learning methods, including classification, referral prediction, and spatial cluster detection.
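To show how coordinates support proximity clustering, the sketch below computes great-circle distances with the haversine formula and groups cases that are linked by chains of neighbours within a chosen radius. This is a simple connected-components grouping chosen for illustration; the abstract does not state which clustering algorithm or radius the study used, and the function names and sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def radius_clusters(points, radius_km):
    """Label points so that any two within radius_km share a cluster.

    Uses union-find over all pairs, which is O(n^2) and only suitable
    for small illustrative inputs.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if haversine_km(*points[i], *points[j]) <= radius_km:
                parent[find(i)] = find(j)

    # Relabel cluster roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(points))]


# Hypothetical case locations given as (latitude, longitude).
cases = [(-7.0000, 107.5000), (-7.0010, 107.5010), (-7.1000, 107.6000)]
cluster_ids = radius_clusters(cases, radius_km=0.5)
```

The first two hypothetical cases lie roughly 150 m apart and fall into one cluster, while the third is several kilometres away and forms its own.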
The resulting dataset has been used in several experiments: disease classification, referral status prediction, feature importance interpretation using SHAP and LIME, geospatial clustering, and synthetic data generation to mitigate privacy constraints and limited data availability. The methodology outlined in this study is expected to support future research in healthcare analytics and to contribute to the development of decision support systems and public health policy planning tools.
Copyright (c) 2025 Husni Iskandar Pohan

This work is licensed under a Creative Commons Attribution 4.0 International License.