Building Multiclass Classification Model of Logistic Regression and Decision Tree Using the Chi-Square Test for Variable Selection Method

Waego H. Nugroho, Samingun Handoyo, Yusnita J. Akri, Agus D. Sulistyono

Abstract

The growth and development of children under five (toddlers) affect their health conditions. The main factors influencing toddlers' health conditions differ from region to region. Toddler status is generally categorized into two classes, normal and abnormal, so the health condition of toddlers is often recorded as multiple response variables. Combining two binary response variables forms a multiclass response variable, which requires different model development techniques and performance measures. This study aims to determine the main factors that affect toddlers' health conditions in Malang, Indonesia, to build multiclass logistic regression and decision tree classification models, and to measure the models' performance. The Chi-square test was used to select the predictor features that serve as inputs to the multiclass logistic regression and decision tree models. Feature selection identified four main factors influencing toddlers' health conditions in Malang: the mother's history of diabetes before pregnancy, the father's blood pressure, psychological condition, and drinking water quality. The decision tree model performs better than the logistic regression model on the various performance measures used.
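The two preprocessing ideas named in the abstract, combining two binary responses into one multiclass response and screening a predictor with the Chi-square test, can be illustrated with a minimal sketch. The encoding 2*a + b, the function names, and the toy data are assumptions made for illustration only; they are not taken from the study.

```python
from collections import Counter


def combine_binary_responses(y1, y2):
    """Map two binary (0/1) responses to one class label in {0, 1, 2, 3}.

    The encoding 2*a + b is an illustrative assumption, not the study's.
    """
    return [2 * a + b for a, b in zip(y1, y2)]


def chi_square_statistic(feature, label):
    """Pearson Chi-square statistic and degrees of freedom for two
    categorical sequences of equal length."""
    n = len(feature)
    observed = Counter(zip(feature, label))
    row_totals = Counter(feature)
    col_totals = Counter(label)
    stat = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            # Expected count under independence of feature and label.
            expected = r_count * c_count / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical toy data: two binary statuses and one binary predictor.
status_a = [0, 0, 1, 1, 0, 1, 0, 1]
status_b = [0, 1, 0, 1, 0, 1, 1, 0]
predictor = [0, 0, 1, 1, 0, 1, 0, 1]

label = combine_binary_responses(status_a, status_b)
stat, dof = chi_square_statistic(predictor, label)

# 7.815 is the 5% critical value of the Chi-square distribution with 3 degrees of freedom.
selected = stat > 7.815
print(label, stat, dof, selected)
```

In this toy example the predictor is strongly associated with the combined label, so it would be kept at the 5% level. A real analysis would apply the same test to every candidate predictor and pass only the retained ones to the classifiers.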

 


Keywords: Chi-square test, decision tree, logistic regression, multiclass classification, variable selection.

 

https://doi.org/10.55463/issn.1674-2974.49.4.17

 

 




