Comparative Analysis of Data Augmentation Methods for Enhancing the Performance of Churn Prediction Models
Abstract
Customer churn is a critical challenge for subscription-based businesses, often exacerbated by imbalanced datasets that hinder predictive accuracy. This study evaluates three oversampling techniques that generate synthetic minority-class samples to balance datasets: SMOTE, K-means SMOTE, and ADASYN. The objective is to assess the impact of these techniques on the performance of machine learning (ML) classifiers, including gradient boosting (GB), random forest (RF), naive Bayes (NB), and support vector machines (SVM). Findings reveal that K-means SMOTE is the most effective at improving model performance, while GB consistently outperforms the other classifiers in churn prediction. These results offer valuable insight into optimizing data balancing and predictive modeling, providing a robust framework for strengthening customer retention strategies.
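The oversampling techniques compared in the abstract all share the same core idea: new minority-class points are synthesized by interpolating between a real minority sample and one of its nearest minority neighbors. The following is a minimal sketch of that interpolation step, not the study's actual implementation (which would typically use a library such as imbalanced-learn); the function name `smote_sketch` and its parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=None):
    """Generate n_new synthetic minority samples via SMOTE-style
    interpolation (illustrative sketch, not the paper's code).

    X_min : array of shape (n_minority, n_features), minority-class rows.
    """
    rng = np.random.default_rng(seed)
    # k+1 neighbors because each point's nearest neighbor is itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))              # random minority sample
        j = idx[i, rng.integers(1, k + 1)]        # random neighbor, skip self
        gap = rng.random()                        # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)

# Toy usage: oversample a 20-sample minority class up to parity with 50 majority rows
rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 4))
X_new = smote_sketch(X_min, n_new=30, k=5, seed=1)
X_balanced_minority = np.vstack([X_min, X_new])   # now 50 minority rows
```

K-means SMOTE extends this by first clustering the minority class (via k-means) and interpolating only within clusters, which avoids generating samples in majority-dominated regions; ADASYN instead weights `n_new` per sample toward harder-to-learn points near the class boundary.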