Tree-based statistical learning techniques and explanatory tools
Conference
64th ISI World Statistics Congress
Format: IPS Abstract
Keywords: classification, functional, learning
Session: IPS 421 - Data Science in Statistics: methodological and applied issues
Thursday 20 July 10 a.m. - noon (Canada/Eastern)
Abstract
Random Forests (RF) are among the most popular machine learning tools for supervised classification when the numbers of observations and variables are too large to predict the a priori classes directly. As is well known in statistics and data analysis, a good discrimination process requires a selection of predictors to avoid the curse of dimensionality.
RF techniques address this problem through a random selection of features and observations. They grow many decision trees and classify elements into the a priori classes by a 'majority voting' mechanism, which establishes an assignment rule to predict the class membership of an element. The features that are selected and that contribute most to the construction of the trees are ranked as the most discriminating. Furthermore, the random choice of predictors mitigates redundancy in the classification process and yields the very high performance of the RF classifier, measured in terms of accuracy. In addition, resampling the observations used to build the different trees contributes to the robustness of the method.
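The three ingredients just described — bootstrap resampling of observations, a random subset of candidate features per tree, and majority voting — can be illustrated with a minimal pure-NumPy sketch. This is an illustrative toy (depth-1 "stump" trees, two classes, hypothetical function names), not the authors' implementation or a production Random Forest.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_stump(X, y, feat):
    # Depth-1 tree: best threshold on one feature by 0/1 training error.
    best = None
    for t in np.unique(X[:, feat]):
        pred = (X[:, feat] > t).astype(int)
        err = np.mean(pred != y)
        sign = 1
        if err > 0.5:                      # allow the flipped decision rule
            pred, err, sign = 1 - pred, 1 - err, -1
        if best is None or err < best[1]:
            best = (t, err, sign)
    return feat, best[0], best[2]

def stump_predict(stump, X):
    feat, t, sign = stump
    raw = (X[:, feat] > t).astype(int)
    return raw if sign == 1 else 1 - raw

def random_forest_sketch(X, y, n_trees=25, mtry=2):
    n, p = X.shape
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, size=n)            # bootstrap resample
        Xb, yb = X[rows], y[rows]
        feats = rng.choice(p, size=mtry, replace=False)  # random feature subset
        stumps = [train_stump(Xb, yb, f) for f in feats]
        errs = [np.mean(stump_predict(s, Xb) != yb) for s in stumps]
        forest.append(stumps[int(np.argmin(errs))])  # keep the best candidate
    return forest

def forest_predict(forest, X):
    votes = np.array([stump_predict(s, X) for s in forest])
    return (votes.mean(axis=0) > 0.5).astype(int)    # majority voting

# Toy data: only feature 0 separates the classes; features 1-2 are noise.
X = rng.normal(0, 1, (60, 3))
X[:30, 0] -= 2.0                                     # class 0 centred at -2
X[30:, 0] += 2.0                                     # class 1 centred at +2
y = np.array([0] * 30 + [1] * 30)

forest = random_forest_sketch(X, y)
acc = np.mean(forest_predict(forest, X) == y)
```

Because trees built on the noise features vote roughly at random while trees built on feature 0 vote consistently, the majority vote recovers the discriminating feature — the intuition behind ranking features by their contribution to the trees.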
Boosting methods further improve the performance of tree-based techniques in supervised classification by giving greater weight to the best intermediate solutions.
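As one concrete instance of this weighting idea, AdaBoost assigns each intermediate tree a weight proportional to its accuracy and upweights the observations it misclassifies. The following is a compact, self-contained NumPy sketch with stump trees and labels in {-1, +1}; the data and names are illustrative, not taken from the contribution itself.

```python
import numpy as np

def adaboost_sketch(X, y, n_rounds=20):
    """AdaBoost with depth-1 trees (stumps); y must take values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                     # observation weights
    ensemble = []                               # (alpha, feature, threshold, sign)
    for _ in range(n_rounds):
        best = None
        for f in range(X.shape[1]):             # exhaustive weighted stump search
            for t in np.unique(X[:, f]):
                for s in (1, -1):
                    pred = s * np.where(X[:, f] > t, 1, -1)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, f, t, s)
        err, f, t, s = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this intermediate solution
        pred = s * np.where(X[:, f] > t, 1, -1)
        w *= np.exp(-alpha * y * pred)          # upweight misclassified observations
        w /= w.sum()
        ensemble.append((alpha, f, t, s))
    return ensemble

def boost_predict(ensemble, X):
    score = sum(a * s * np.where(X[:, f] > t, 1, -1) for a, f, t, s in ensemble)
    return np.sign(score)

# Toy data: a diagonal boundary no single axis-aligned stump can match.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

ensemble = adaboost_sketch(X, y)
acc = np.mean(boost_predict(ensemble, X) == y)
```

Each individual stump fits the diagonal boundary poorly, but the alpha-weighted combination of reweighted rounds approximates it closely — the sense in which boosting "weighs the best intermediate solutions more".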
Tree-based techniques applied to functional data in supervised classification remain a little-explored and underdeveloped field. The proposed contribution focuses on functional classifiers and on explanatory tools to improve their performance. However, the strong automatism of the classification process is a challenge for these widely established techniques: there is no clear description of the decision process that predicts the class membership of an object from its characteristics. In particular, the assignment is produced by a large number of decision trees and by different sets of descriptors of the a priori classes. An interesting contribution is to provide new aids which, in addition to prediction accuracy, make it possible to recognize the predictors that contribute most to the separation of the a priori groups. This combines an embedding procedure that seeks multiple solutions with a final compromise.
Moreover, taking the characteristics of the functional data into account allows the detection of sub-groups within the a priori classes, which can improve interpretation and prediction.
Finally, describing the separation curves of the classes, rather than simple split values, allows us to interpret the similarity of the functional characteristics of the curves belonging to the different a priori groups.
Aids to the interpretation of tree-based functional classifiers are still an open frontier.
Some contributions concern the choice of the best transformation of the functional data to capture the differences, in terms of slopes or rates of change, between the classes to be predicted.
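A simple example of such a transformation, assuming the curves are discretized on a common grid, is to replace each curve by its first derivative (here via `np.gradient`): two groups with overlapping levels but different slopes become clearly separated in the derivative domain. This is a generic illustration of the idea, not the specific transformation proposed in the contribution.

```python
import numpy as np

def derivative_features(curves, grid):
    """Transform discretized curves into first-derivative curves.

    curves : (n_curves, n_points) array of values on a common grid
    grid   : (n_points,) array of evaluation points

    Classes that differ in slope or rate of change, rather than in
    level, become separable in this transformed space.
    """
    return np.gradient(curves, grid, axis=1)

# Toy example: two groups of linear curves with different slopes.
rng = np.random.default_rng(2)
grid = np.linspace(0.0, 1.0, 50)
group_a = rng.normal(1.0, 0.1, (10, 1)) * grid   # slopes near 1
group_b = rng.normal(2.0, 0.1, (10, 1)) * grid   # slopes near 2

d_a = derivative_features(group_a, grid)
d_b = derivative_features(group_b, grid)
```

The derivative curves of the two groups concentrate around the constants 1 and 2 respectively, so even a single split on the mean derivative would discriminate the classes.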
Applications to real data in the medical and environmental fields have corroborated the proposals.