IAOS-ISI 2024, Mexico City

Classification using probabilistic models and other methods for mixed data

Conference

IAOS-ISI 2024, Mexico City

Format: CPS Abstract

Abstract

Classification of labelled observations into two or more well-defined classes on the basis of a set of measurements is a long-standing problem in Statistics, and classification methodology continues to be developed within the area of Statistical Learning. The problem has also long been studied in Artificial Intelligence, where Deep learning methods have recently become more computationally feasible and have achieved important results on classification and prediction problems. Probabilistic models for Deep learning are also potentially useful and are the subject of much ongoing research. These models combine probability distributions with graphs and are known as graphical models: Bayesian networks when the graphs are directed and Markov networks when they are undirected.

When all measurements (features or predictors) can be represented as continuous variables, most known methods apply. When the variables are both categorical and continuous, however, the predictive performance of these methods may diminish, with higher error rates and less ability to generalize.

In this work we present comparative results of some classification methods for the case of two populations and a set of binary and continuous variables. Methods used include classical statistical methods like discriminant analysis and logistic regression, as well as discriminant analysis based on probabilistic graphical models; algorithmic methods like support vector machines and random forests; and Deep neural networks. The performance of these methods is compared in terms of classification error rates on both simulated and real data.
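A comparison of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's actual setup: the simulated data, the function names `simulate_mixed_data` and `error_rates`, the scikit-learn estimators and all hyperparameters are hypothetical choices made for the example, and the graphical-model-based discriminant analysis is not included.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def simulate_mixed_data(n=400, seed=0):
    """Hypothetical two-population data: 2 binary and 2 continuous predictors."""
    rng = np.random.default_rng(seed)
    binary = rng.integers(0, 2, size=(n, 2))
    continuous = rng.normal(size=(n, 2))
    # Assumption: the class depends on a binary-by-continuous interaction,
    # so a purely additive (linear) rule cannot capture it fully.
    signal = (2 * binary[:, 0] - 1) * continuous[:, 0] + 0.5 * continuous[:, 1]
    y = (signal + rng.normal(scale=0.5, size=n) > 0).astype(int)
    X = np.column_stack([binary, continuous]).astype(float)
    return X, y


def error_rates(X, y, cv=5):
    """Cross-validated misclassification rate (1 - accuracy) per method."""
    models = {
        "linear discriminant analysis": LinearDiscriminantAnalysis(),
        "logistic regression": make_pipeline(StandardScaler(), LogisticRegression()),
        "support vector machine": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
        "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
        "neural network": make_pipeline(
            StandardScaler(),
            MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=1000, random_state=0),
        ),
    }
    return {
        name: 1.0 - cross_val_score(model, X, y, cv=cv).mean()
        for name, model in models.items()
    }


if __name__ == "__main__":
    X, y = simulate_mixed_data()
    for name, rate in error_rates(X, y).items():
        print(f"{name}: {rate:.3f}")
```

The error rate here is estimated by 5-fold cross-validation; a held-out test set or repeated simulation would be equally reasonable choices.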

The error rates showed that i) additive models or linear methods performed poorly; ii) models with interactions and variable selection, or non-linear methods, performed better than the linear ones; and iii) Deep neural networks performed comparably to the non-linear methods but did not achieve the best performance, which might be due to the small-to-medium size of the tabular data.