Deep Learning on Administrative Tabular Data: A Comparative Study
Conference
64th ISI World Statistics Congress
Format: IPS Paper
Keywords: administrative data, deep-learning, machine learning
Session: IPS 245 - The present and future of access to granular administrative data
Wednesday 19 July 2 p.m. - 3:40 p.m. (Canada/Eastern)
Abstract
Deep learning has traditionally been applied to large unstructured data such as video, audio, images, and text. Early deep learning research paid little attention to tabular data, as its fixed structure was seen to negate the ability of deep learning techniques to elicit useful representations. Recent advances in tabular deep learning have applied self-attention and the transformer architecture, as well as transfer learning, and these have improved the performance of deep learning on tabular data. Some of these approaches have improved on the results attained using traditional machine learning models such as gradient boosted trees for classification and regression tasks.
While these results are promising, the datasets used in these works were typically those used for benchmarking machine learning algorithms. This raises the question of how well the results would extend to administrative tabular data. To answer this question, we will curate a selection of administrative tabular datasets from open-sourced Malaysian administrative data. The curated datasets will be used to perform classification tasks using both traditional machine learning approaches and deep learning approaches. Classification is a machine learning technique which has been used to support policy decisions. We will use the results of the classification tasks as a measure of feature representation.
Feature representation is an important measure, as feature selection of administrative data by subject-matter experts is a manual and laborious task. It is believed that automatic feature elicitation via deep learning approaches will reduce this dependency. The experiments conducted in this study show that deep learning tabular algorithms can achieve results comparable with optimised traditional machine learning when applied to open administrative data, without the need for extensive feature selection and feature engineering.
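
The comparison described in this abstract can be illustrated with a minimal sketch, which is not the study's actual pipeline. It assumes scikit-learn, a synthetic dataset standing in for the administrative data, and a small multilayer perceptron standing in for the deep learning side; the transformer-based tabular models referred to above are not reproduced here. The function name `compare_classifiers`, the split ratio, and the choice of accuracy and macro-F1 as metrics are all illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_classifiers(X, y, seed=0):
    """Fit a tree ensemble and a neural network on one split and score both.

    Hypothetical helper: the metrics and split ratio are assumptions,
    not taken from the study.
    """
    # Hold out the same stratified test set for both models.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    models = {
        # Traditional baseline: gradient boosted trees.
        "gradient_boosted_trees": GradientBoostingClassifier(random_state=seed),
        # Stand-in for a deep tabular model: a plain MLP on scaled features.
        "mlp": make_pipeline(
            StandardScaler(),
            MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=seed),
        ),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        scores[name] = {
            "accuracy": accuracy_score(y_test, pred),
            "macro_f1": f1_score(y_test, pred, average="macro"),
        }
    return scores


# Synthetic placeholder data; a real run would load a curated administrative table.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8, random_state=0)
results = compare_classifiers(X, y)
for name, s in results.items():
    print(f"{name}: accuracy={s['accuracy']:.3f}, macro_f1={s['macro_f1']:.3f}")
```

Both models receive the raw feature columns with no manual feature selection, which mirrors the abstract's question of whether a learned representation can match a tuned traditional model without expert-driven feature engineering.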