64th ISI World Statistics Congress

Using Active Learning to Improve Quality of Machine Learning Models for the Canadian Census

Author

Xiaonan Da

Co-author

  • Kevin Earl

Conference

64th ISI World Statistics Congress

Format: CPS Abstract

Keywords: machine learning, nlp

Session: CPS 29 - Census II

Monday 17 July 5:30 p.m. - 6:30 p.m. (Canada/Eastern)

Abstract

In 2021, Statistics Canada used a natural language processing algorithm to code a significant portion of its Census data. Like any supervised machine learning algorithm, this required a large amount of labelled data. However, some variables underwent significant changes to their approved label sets between Census cycles, which run every five years at Statistics Canada. As a result, only a small portion of the data labelled for the previous Census still carried valid labels, leaving a large amount of effectively unlabelled data that was no longer suitable for model training.

Prior to production, we faced a decision: train a model using only the records that remained labelled, or relabel the now-unlabelled data to fit the new set of approved labels. Relabelling is extremely labour-intensive and cost-prohibitive, but a model trained only on the small portion of data with valid labels would ultimately perform poorly compared with a model trained on all the data. Due to time and resource constraints, the simpler model using less data was implemented for most of our variables in 2021.

For the next Census in 2026, we aim to find a better solution that increases model quality while respecting the cost of relabelling data. To this end, we investigate active learning, an iterative method that begins with a simple model and intelligently selects records to label based on that model's results. The selected records are labelled and added to the training data, the model is retrained, and the process repeats. The goal is to select the records whose relabelling achieves the greatest increase in quality per record labelled. We consider different selection methods, as well as different numbers of records selected per iteration, for two of our Census variables. This presentation will discuss the situations in which Statistics Canada stands to gain from active learning, the methods evaluated, and our results.
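
The iterative loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration only, not the method evaluated by the authors: the abstract does not say which selection strategies or models were used, so the sketch assumes margin-based uncertainty sampling (one common strategy) and a toy nearest-centroid classifier on numeric features. All names here (`fit_centroids`, `margin`, `active_learning`, `true_label`) and the synthetic data are hypothetical, and the `oracle` function stands in for a human coder assigning labels.

```python
def fit_centroids(X, y):
    """Toy stand-in model: one mean feature vector (centroid) per label."""
    sums, counts = {}, {}
    for features, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def predict(centroids, features):
    """Assign the label of the nearest centroid."""
    return min(centroids, key=lambda label: sq_dist(centroids[label], features))


def margin(centroids, features):
    """Gap between the two nearest centroids; a small gap means the model
    is uncertain about this record. Assumes at least two labels are known."""
    dists = sorted(sq_dist(c, features) for c in centroids.values())
    return dists[1] - dists[0]


def active_learning(X_seed, y_seed, X_pool, oracle, batch_size, n_iterations):
    """Pool-based loop: fit, pick the most uncertain records, label, repeat."""
    X_lab, y_lab, pool = list(X_seed), list(y_seed), list(X_pool)
    for _ in range(n_iterations):
        if not pool:
            break
        model = fit_centroids(X_lab, y_lab)
        # Most uncertain records (smallest margin) come first.
        pool.sort(key=lambda features: margin(model, features))
        batch, pool = pool[:batch_size], pool[batch_size:]
        X_lab.extend(batch)
        y_lab.extend(oracle(features) for features in batch)  # "human" labelling
    return fit_centroids(X_lab, y_lab), X_lab, y_lab


def true_label(features):
    """Hypothetical oracle for synthetic data: label depends on one threshold."""
    return "A" if features[0] < 0.5 else "B"


# Synthetic example: two labelled seed records and a pool of 19 unlabelled ones.
seed_X = [(0.0,), (1.0,)]
seed_y = ["A", "B"]
pool_X = [(i / 20,) for i in range(1, 20)]

model, X_lab, y_lab = active_learning(
    seed_X, seed_y, pool_X, true_label, batch_size=3, n_iterations=4
)
print(len(X_lab), "labelled records; first batch queried:", sorted(X_lab[2:5]))
```

In this sketch, `batch_size` corresponds to the number of records selected per iteration, and swapping `margin` for another scoring function changes the selection method, which are the two design choices the abstract says were compared.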