A Scalable, Explainable Machine Learning Approach for Granular-Level Credit Dataset’s Quality Assurance
Conference
65th ISI World Statistics Congress 2025
Format: CPS Abstract - WSC 2025
Keywords: data, financial, statistical, big data, data science, cloud platform, machine learning
Session: CPS 12 - Financial Modelling and Volatility
Tuesday 7 October 4 p.m. - 5 p.m. (Europe/Amsterdam)
Tuesday 7 October 5:10 p.m. - 6:10 p.m. (Europe/Amsterdam)
Abstract
To enhance ongoing supervision, the Bank of Thailand initiated the Regulatory Data Transformation (RDT) project, which establishes a new granular-level regulatory reporting standard for credit data. Data quality assurance is crucial for supplying policymakers with reliable information. Validation rules, such as format and range validation and consistency checks across related reports, are created manually and applied to ensure that basic requirements are met. These rules, however, fail to capture unknown, sophisticated errors in multivariate relationships. We therefore evaluate the efficiency and interpretability of state-of-the-art outlier detection models, including LODA (Lightweight On-line Detector of Anomalies), Isolation Forest with DIFFI (Depth-based Isolation Forest Feature Importance) and AWS (Assist-Based Weighting Scheme), and ECOD (Empirical-Cumulative-distribution-based Outlier Detection). We place strong emphasis on model explainability because potential errors must be reviewed and communicated back to data providers. Scalability is also essential because RDT reporting involves detailed fields and over 25 million records per month. Each method is applied to a sample dataset of 5 million records with labeled outliers, and model performance is assessed via Mean Reciprocal Rank (MRR), Mean Average Precision (MAP), and Top-k Accuracy. The best-performing model is implemented on Spark and deployed to production on the full dataset. To complement these outlier detection methods, we also propose heuristics for the automatic discovery of readily interpretable multivariate validation rules which, upon verification by our financial auditors, can be incorporated to achieve a more comprehensive test suite for error checking during data acquisition.
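The three ranking metrics named in the abstract (MRR, MAP, and Top-k Accuracy) can be illustrated with a minimal Python sketch. The abstract does not say how the ranked lists are built, so the sketch assumes each evaluation case is a ranked list of candidates (for example, fields ordered by a model's importance score) paired with a set of labeled true items. The function names and the field names in the example are hypothetical and do not come from the authors' implementation.

```python
from typing import Hashable, List, Sequence, Set, Tuple

# One evaluation case: (ranked candidates, set of labeled true items).
# This pairing is an assumption made for illustration only.
Query = Tuple[Sequence[Hashable], Set[Hashable]]


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Return 1/position of the first relevant item, or 0.0 if none is found."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def average_precision(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Average the precision values taken at each relevant item's position."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant)


def _mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def mean_reciprocal_rank(queries: Sequence[Query]) -> float:
    return _mean([reciprocal_rank(r, rel) for r, rel in queries])


def mean_average_precision(queries: Sequence[Query]) -> float:
    return _mean([average_precision(r, rel) for r, rel in queries])


def top_k_accuracy(queries: Sequence[Query], k: int) -> float:
    """Fraction of cases with at least one relevant item in the first k ranks."""
    return _mean(
        [1.0 if any(item in rel for item in r[:k]) else 0.0 for r, rel in queries]
    )


# Hypothetical example: two flagged records, each with fields ranked by importance.
example_queries: List[Query] = [
    (["interest_rate", "loan_amount", "tenor", "collateral_value"],
     {"loan_amount", "collateral_value"}),
    (["loan_amount", "tenor", "interest_rate"], {"loan_amount"}),
]

if __name__ == "__main__":
    print("MRR:", mean_reciprocal_rank(example_queries))
    print("MAP:", mean_average_precision(example_queries))
    print("Top-1 accuracy:", top_k_accuracy(example_queries, k=1))
```

In this toy example the first case finds its first true field at rank 2 and the second at rank 1, so MRR is 0.75. Other definitions of Top-k Accuracy exist (for instance, requiring all true items within the first k ranks), and the variant used in the study is not stated in the abstract.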