ESTIMATION OF PROPORTIONS IN SMALL AREA ESTIMATION: MACHINE LEARNING APPROACH
Conference
65th ISI World Statistics Congress 2025
Format: CPS Abstract - WSC 2025
Keywords: boosting, estimation, logistic model, machine learning, mixed-effects, mixed-models, probability sampling, random forest, small area estimation, survey-methodology, tree-structure models
Session: CPS 13 - Small Area Estimation for Policy and Socio-Economic Modelling
Tuesday 7 October 4 p.m. - 5 p.m. (Europe/Amsterdam)
Abstract
Sample surveys have traditionally been recognized as a cost-effective means of obtaining estimates of different parameters, not only for the total population of interest but also for various subpopulations (domains). In many domains, however, the sample is too small (or even empty) to support direct estimates of adequate precision, so those estimates cannot be published. Small area estimation comprises a range of methods that use auxiliary information available for the whole population to estimate the parameters in such domains (small areas). One possibility is to fit a linear mixed model or, when estimating a population total, a generalised linear mixed model, and use it to predict the variable of interest for the non-sampled units; an estimate for every domain is then obtained by combining sampled and non-sampled units.
However, traditional models rest on assumptions: for instance, the relationship between the auxiliary variables and the variable of interest must be linear, and the associated prediction errors must follow a particular probability distribution. Multicollinearity and outliers can also cause problems in some cases. We therefore propose in this paper a strategy that replaces the traditional generalised linear mixed model with a more flexible one. In particular, we study machine learning regression methods with mixed effects for estimating proportions in small areas; these methods do not rely on the assumptions above and gain robustness to outliers and in variable selection. Some approaches have already been proposed in the literature for small area estimation of proportions. The idea is to substitute the linear model with a machine learning regression method while following the same stages that traditional small area estimation methods use to estimate the parameter and its precision.
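As an illustration of how sampled and non-sampled units can be combined into a domain estimate, the following minimal sketch computes a predicted proportion per domain from observed binary outcomes and model-predicted probabilities. It is only a sketch under stated assumptions: the function name, the argument names, and the integer domain coding (0, 1, ..., D-1) are hypothetical, and the predicted probabilities are taken as given from whichever fitted model is used (logistic mixed model, mixed-effects random forest, or mixed-effects boosting).

```python
import numpy as np

def domain_proportions(domain_s, y_s, domain_r, p_hat_r, n_domains):
    """Predicted proportion per domain (hypothetical helper, not from the paper).

    domain_s : domain index of each sampled unit
    y_s      : observed binary outcome (0/1) of each sampled unit
    domain_r : domain index of each non-sampled unit
    p_hat_r  : predicted probability for each non-sampled unit
    """
    domain_s = np.asarray(domain_s, dtype=int)
    domain_r = np.asarray(domain_r, dtype=int)
    y_s = np.asarray(y_s, dtype=float)
    p_hat_r = np.asarray(p_hat_r, dtype=float)

    # Sum of observed outcomes in each domain (zero for domains with no sample).
    observed = np.bincount(domain_s, weights=y_s, minlength=n_domains)
    # Sum of predicted probabilities over the non-sampled units of each domain.
    predicted = np.bincount(domain_r, weights=p_hat_r, minlength=n_domains)
    # Domain population size: sampled plus non-sampled units.
    size = (np.bincount(domain_s, minlength=n_domains)
            + np.bincount(domain_r, minlength=n_domains))
    if np.any(size == 0):
        raise ValueError("every domain must contain at least one unit")
    return (observed + predicted) / size

# Toy data: domain 2 has no sampled units, so its estimate is purely model-based.
props = domain_proportions(
    domain_s=[0, 0, 1], y_s=[1, 0, 1],
    domain_r=[0, 1, 1, 2, 2], p_hat_r=[0.5, 0.25, 0.75, 0.2, 0.4],
    n_domains=3,
)
```

In the toy call, the third domain has no sampled units, which mirrors the "even empty" case mentioned in the abstract: its estimate rests entirely on the predictions.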
We present a simulation exercise, under both model-based and design-based inference, that compares logistic mixed models, mixed-effects random forests, and mixed-effects tree boosting in terms of mean squared error, bias, and computation time. We also show a real application to the evaluation of the National Program for the Substitution of Illicit Crops in Colombia, in which these methods are used to estimate the proportion of families that have suffered forced eradication in the rural areas of the country.
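The bias and mean squared error used to compare methods in a simulation of this kind can be computed per domain over the replicates, as in the minimal sketch below. The function name and array layout (replicates in rows, domains in columns) are assumptions made for illustration, not the authors' code. The true proportions may be fixed across replicates, as in a design-based setting, or vary by replicate, as in a model-based setting.

```python
import numpy as np

def bias_and_mse(estimates, truth):
    """Per-domain empirical bias and MSE over simulation replicates.

    estimates : array of shape (R, D), one row per replicate
    truth     : array of shape (D,) if the true proportions are fixed,
                or (R, D) if they change from replicate to replicate
    """
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    errors = estimates - truth          # broadcasting handles both truth shapes
    bias = errors.mean(axis=0)          # average error per domain
    mse = (errors ** 2).mean(axis=0)    # average squared error per domain
    return bias, mse

# Toy example: two replicates, two domains, fixed true proportions.
bias, mse = bias_and_mse([[0.5, 0.2], [0.7, 0.4]], [0.5, 0.3])
```

Computation time, the third criterion mentioned above, would be recorded separately around each model fit, for example with time.perf_counter.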