Building Multiclass Classification Model of Logistic Regression and Decision Tree Using the Chi-Square Test for Variable Selection Method

Waego H. Nugroho, Samingun Handoyo, Yusnita J. Akri, Agus D. Sulistyono


The growth and development of children under five (toddlers) affect their health conditions. Each region has its own set of main factors influencing toddlers' health. Each toddler status indicator is generally categorized into two classes, normal and abnormal, so toddler status often takes the form of multiple binary response variables. Combining two binary response variables yields a single multiclass response variable, which requires different model-building techniques and performance measures than the binary case. This study aims to determine the main factors that affect toddlers' health conditions in Malang, Indonesia, to build multiclass logistic regression and decision tree classification models, and to measure the models' performance. The Chi-square test was used to select the predictor features that serve as inputs to the multiclass logistic regression and decision tree models. Feature selection identified four main factors that influence toddlers' health status in Malang: the mother's history of diabetes before pregnancy, the father's blood pressure, psychological condition, and drinking water quality. The decision tree model performed better than the logistic regression model across the performance measures used.
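As an illustration of the selection step described above, the following minimal sketch combines two binary responses into one four-class label and keeps the predictors whose Pearson Chi-square statistic exceeds the 5% critical value. It is not the authors' implementation. The encoding `2 * a + b`, the function names, the feature names, and the synthetic data are all assumptions made for this example, and the critical value 7.815 applies only to a two-level predictor tested against a four-class response (3 degrees of freedom).

```python
from collections import Counter

# Critical value of the chi-square distribution at alpha = 0.05 for df = 3,
# i.e. a 2-level predictor against a 4-class response: (2 - 1) * (4 - 1).
CHI2_CRIT_DF3 = 7.815


def combine_responses(y1, y2):
    """Merge two binary responses (0/1) into one label in {0, 1, 2, 3}.

    The encoding 2 * a + b is an assumption made for this sketch.
    """
    return [2 * a + b for a, b in zip(y1, y2)]


def chi_square_statistic(x, y):
    """Pearson chi-square statistic for the contingency table of x by y."""
    n = len(x)
    observed = Counter(zip(x, y))   # cell counts; missing cells count as 0
    row_totals = Counter(x)
    col_totals = Counter(y)
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat


def select_features(features, y, critical_value=CHI2_CRIT_DF3):
    """Return the names of predictors whose statistic exceeds the critical value."""
    return [name for name, x in features.items()
            if chi_square_statistic(x, y) > critical_value]


# Synthetic, deterministic data (hypothetical, not the Malang data set).
n = 400
y1 = [i % 2 for i in range(n)]           # first binary status
y2 = [(i // 2) % 2 for i in range(n)]    # second binary status
y = combine_responses(y1, y2)            # four-class response

features = {
    "informative": list(y1),                        # tied to the response
    "noise": [(i // 4) % 2 for i in range(n)],      # independent of the response
}

print(select_features(features, y))  # -> ['informative']
```

The selected predictors would then be passed to the multiclass logistic regression and decision tree models. Predictors with more than two levels need a critical value (or p-value) matched to their own degrees of freedom.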


Keywords: Chi-square test, decision tree, logistic regression, multiclass classification, variable selection.







