Development of Two Methods for Estimating High-Dimensional Data in the Case of Multicollinearity and Outliers

Ahmed A. El-Sheikh, Mohamed C. Ali, Mohamed R. Abonazel

Abstract

High-dimensional problems involve datasets or models with a large number of variables or parameters and arise across domains such as statistics, machine learning, optimization, physics, and engineering. Challenges in these settings include computational complexity, data sparsity, overfitting, and the curse of dimensionality. This study introduces two techniques that combine the random forest machine learning method with the least absolute shrinkage and selection operator (LASSO) and with the elastic net, two penalized regression methods designed for high-dimensional problems. We compare the performance of these hybrid methods against traditional statistical approaches and standalone machine learning methods. The assessment uses goodness-of-fit measures and covers both a Monte Carlo simulation and a real-world application. The findings show that the proposed strategies outperform the conventional approaches on high-dimensional problems.
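To make the hybrid idea concrete, the following R sketch illustrates one natural way to combine the two stages: LASSO (via the glmnet package) screens the predictors, and a random forest (via the randomForest package) is then fit on the selected subset. This is a minimal illustration under assumed simulated data, not the authors' exact algorithm; the variable names and data-generating setup are hypothetical.

    # Minimal sketch: LASSO screening followed by a random forest on the
    # selected predictors (illustrative, not the paper's exact method).
    library(glmnet)
    library(randomForest)

    set.seed(123)
    n <- 100; p <- 200                      # high-dimensional: p > n
    X <- matrix(rnorm(n * p), n, p)
    beta <- c(rep(2, 5), rep(0, p - 5))     # only 5 truly active predictors
    y <- X %*% beta + rnorm(n)

    # Stage 1: LASSO (alpha = 1) with cross-validated lambda for screening.
    cv_fit <- cv.glmnet(X, y, alpha = 1)
    coefs <- as.numeric(coef(cv_fit, s = "lambda.min"))[-1]  # drop intercept
    selected <- which(coefs != 0)           # indices of retained predictors

    # Stage 2: random forest fit only on the LASSO-selected predictors.
    rf_fit <- randomForest(x = X[, selected, drop = FALSE], y = as.numeric(y))
    print(rf_fit)

For the elastic-net variant, the screening stage would instead call cv.glmnet with an alpha value strictly between 0 and 1 (for example, alpha = 0.5), which mixes the LASSO and ridge penalties and is better behaved under multicollinearity.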
