Identifying Online Platforms: Model development, validation and type-I error reduction
Conference
65th ISI World Statistics Congress 2025
Format: IPS Abstract - WSC 2025
Keywords: big data, classification, webscraping
Session: IPS 799 - Real-World Machine Learning Applications in Official Statistics
Thursday 9 October 10:50 a.m. - 12:30 p.m. (Europe/Amsterdam)
Abstract
A machine learning-based classification model was developed to identify online platform organizations from the text on their websites. The model was applied to identify all potential online platform organizations in the Dutch Business Register. The external validity of the model-based findings was verified through a survey conducted among a sample of the organizations identified as potential platforms. The survey responses confirmed the validity of the model but also revealed a substantial number of type-I errors (false positives). Based on these findings, the classification approach was adjusted to reduce the number of false positives as much as possible while retaining its high accuracy and recall. This was achieved by using calibrated probabilities and ensembles.
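The abstract does not describe how calibrated probabilities and ensembles were combined, so the following is only a minimal sketch of one common pattern, not the authors' method: average the (assumed already calibrated) probabilities of several ensemble members and accept an organization as a platform only above a stricter threshold. The function names, the thresholds, and the toy data are all hypothetical and purely illustrative.

```python
from statistics import mean


def ensemble_probability(member_probs):
    """Average the probabilities produced by the ensemble members.

    Assumption: each member's output is already calibrated, i.e. a score
    of 0.8 means roughly 80% of such cases are true platforms.
    """
    return mean(member_probs)


def classify(member_probs, threshold=0.5):
    """Label an organization as a platform if the averaged probability
    reaches the (hypothetical) decision threshold."""
    return ensemble_probability(member_probs) >= threshold


def confusion_counts(rows, threshold):
    """Return (tp, fp, fn, tn) for rows of (member_probs, is_platform)."""
    tp = fp = fn = tn = 0
    for member_probs, is_platform in rows:
        predicted = classify(member_probs, threshold)
        if predicted and is_platform:
            tp += 1
        elif predicted and not is_platform:
            fp += 1
        elif not predicted and is_platform:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Made-up toy data: three ensemble members per organization, plus the
# true label (e.g. as it might be established by a survey response).
TOY_ROWS = [
    ([0.95, 0.90, 0.92], True),
    ([0.85, 0.80, 0.90], True),
    ([0.70, 0.75, 0.80], True),
    ([0.65, 0.40, 0.60], False),
    ([0.55, 0.60, 0.50], False),
    ([0.10, 0.20, 0.15], False),
]

if __name__ == "__main__":
    for threshold in (0.5, 0.7):
        tp, fp, fn, tn = confusion_counts(TOY_ROWS, threshold)
        print(f"threshold={threshold}: tp={tp} fp={fp} fn={fn} tn={tn}")
```

On this toy data, raising the threshold from 0.5 to 0.7 removes both false positives without losing a true positive. Real data would normally show a trade-off between the two, which is why the threshold would be chosen on a held-out validation set.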