64th ISI World Statistics Congress

Stacking machine-learning models for anomaly detection: comparing AnaCredit to other banking datasets

Author

Andrea Del Monaco

Co-author

  • Pasquale Maddaloni
  • Davide Nicola Continanza
  • Daniele Figoli
  • Marco di Lucido

Conference

64th ISI World Statistics Congress

Format: IPS Abstract

Keywords: data science, data-quality-management, granular, machine learning, quality-management

Session: IPS 243 - Data science in official statistical production: insights from central banks

Monday 17 July 4 p.m. - 5:25 p.m. (Canada/Eastern)

Abstract

In recent years, central banks have started to collect highly granular information to gain a thorough understanding of economic developments, such as the banking system's disbursement of credit to the economy. This need became paramount when the financial crisis in 2007/2008 and the sovereign debt crisis in 2009/2010 showed that aggregated information was not adequate to capture the dynamics of economic and financial phenomena. However, the collection of granular information has made data quality management very challenging. Central banks now apply machine learning and big data analytics to tackle such problems. Following this trend, we investigate how to leverage machine learning models for the quality management of a granular dataset when the same information is already available at an aggregated level in a benchmark dataset that is assumed to meet higher quality standards. The idea is to carry out systematic cross-checks between the granular dataset and the benchmark dataset by combining supervised and unsupervised methods via a stacking algorithm in a weakly supervised fashion. The pipeline is as follows. First, we aggregate the granular dataset into a testing dataset so that every observation in it matches the corresponding information available in the benchmark dataset. Second, we consider a robust regression as a statistical supervised model and two different autoencoder architectures as unsupervised deep learning models; these algorithms are trained on the testing dataset, taking the benchmark data as the ground truth. The output of each model is a normalized outlierness score. Finally, the predicted scores are fed to the meta-learner of the stacking model. However, the meta-learner needs to know which observations in the testing dataset are labelled as outliers.
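The stacking step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the raw scores, the model names (`robust_regression`, `autoencoder_a`, `autoencoder_b`), the min-max normalization, and the logistic-regression meta-learner are all assumptions chosen for illustration, since the abstract does not specify them. Each base model's raw score is rescaled to [0, 1], the scores are stacked into one row per observation, and a meta-learner is fitted on the labelled rows.

```python
import math

def min_max(values):
    """Rescale raw outlier scores to [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_meta_learner(rows, labels, lr=0.5, epochs=2000):
    """Fit a logistic-regression meta-learner by batch gradient descent."""
    n, n_features = len(rows), len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * x for w, x in zip(weights, row))) - y
            for j, x in enumerate(row):
                grad_w[j] += err * x
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict_proba(weights, bias, row):
    """Probability that an observation is an outlier, given its stacked scores."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))

# Hypothetical raw scores from three base models, on different scales.
raw_scores = {
    "robust_regression": [0.2, 0.5, 0.9, 0.4, 7.5, 6.8, 8.1, 1.3],
    "autoencoder_a": [0.01, 0.02, 0.03, 0.05, 0.40, 0.55, 0.48, 0.04],
    "autoencoder_b": [1.0, 1.4, 1.2, 2.0, 9.0, 8.2, 7.1, 1.8],
}
# Hypothetical labels: 1 = outlier, 0 = regular observation.
labels = [0, 0, 0, 0, 1, 1, 1, 0]

# Normalize each model's scores, then stack them into one row per observation.
normalized = [min_max(column) for column in raw_scores.values()]
rows = [list(r) for r in zip(*normalized)]

weights, bias = fit_meta_learner(rows, labels)

# Score two new observations whose base-model scores are already normalized.
p_outlier = predict_proba(weights, bias, [0.9, 0.9, 0.9])
p_inlier = predict_proba(weights, bias, [0.05, 0.05, 0.05])
print(round(p_outlier, 3), round(p_inlier, 3))
```

The sketch makes one point concrete: the meta-learner cannot be fitted without the `labels` vector, which is why the labelling problem matters.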
Since the testing dataset is large and it is not feasible to involve the banking system in labelling all of it, some observations are subsampled in a stratified manner according to the Neyman optimal criterion and then submitted to the banking system to obtain proper labels; other observations are instead labelled on the basis of our domain knowledge. To obtain a fully labelled testing dataset, we consider two weakly supervised approaches. The first performs a Monte Carlo simulation to impute the missing labels by drawing values from the sample distribution of the feedback received from the banking system. The second relies on semi-supervised learning algorithms. The proposed methodology is applied to check the quality of the granular yet relatively young AnaCredit dataset by comparing it, separately, with the mature yet aggregated Balance Sheet Items statistics and with the supervisory Financial Reporting survey. In both cases, our pipeline yields a significantly higher F1-score than any of the combined algorithms achieves alone. Moreover, the suggested framework is quite flexible, and it reduces the burden on the banking system, whose involvement in the pipeline is requested only for a very limited number of observations among all those reported.
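The two labelling steps can also be sketched, again under stated assumptions rather than as the authors' code. Neyman allocation assigns each stratum a sample size proportional to N_h * S_h, where N_h is the stratum size and S_h the within-stratum standard deviation. The Monte Carlo step below fills each missing label by sampling with replacement from the labels already received. The stratum sizes, standard deviations, labels, and number of simulations are hypothetical.

```python
import random

def neyman_allocation(stratum_sizes, stratum_stds, total_sample):
    """Allocate total_sample across strata in proportion to N_h * S_h."""
    weights = [n * s for n, s in zip(stratum_sizes, stratum_stds)]
    total = sum(weights)
    if total == 0:
        raise ValueError("all strata have zero variability")
    # Rounding means the allocations may not sum exactly to total_sample;
    # each allocation is capped at the size of its stratum.
    return [
        min(size, round(total_sample * w / total))
        for size, w in zip(stratum_sizes, weights)
    ]

def impute_labels(labels, rng):
    """Fill missing labels (None) with draws from the observed labels."""
    observed = [y for y in labels if y is not None]
    if not observed:
        raise ValueError("no observed labels to draw from")
    return [y if y is not None else rng.choice(observed) for y in labels]

# Hypothetical strata: sizes N_h and standard deviations S_h.
allocation = neyman_allocation([5000, 3000, 2000], [0.1, 0.3, 0.6], 260)
print(allocation)

# Hypothetical feedback: 1 = confirmed outlier, 0 = regular, None = unlabelled.
labels = [1, 0, None, 0, None, 1, 0, None, 0, 0]
rng = random.Random(0)  # fixed seed so the simulation is reproducible

# Each simulation yields one fully labelled copy of the testing dataset.
simulations = [impute_labels(labels, rng) for _ in range(100)]
```

Repeating the imputation gives a distribution of fully labelled datasets, so any downstream metric can be averaged over simulations instead of relying on a single draw.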