TIES 2024

Regularized Dirichlet-Tree Regression for Modelling Benthic Macroinvertebrate Counts in Canada’s Oil Sands Region

Conference

TIES 2024

Format: CPS Abstract - TIES 2024

Keywords: industrial water, multivariate statistics, optimization

Abstract

The Athabasca Oil Sands region in Alberta, Canada holds the world’s largest bitumen deposit, which has led to increased industrial activity in the area. Environmentalists have raised concerns about potential changes in regional environmental attributes, such as water chemistry, that might affect local wildlife. Benthic macroinvertebrates in the region serve as important indicators of ecosystem health because of their sensitivity to pollutants. Identifying the water quality attributes that influence these biological communities can enhance adaptive monitoring and research efforts. Benthic macroinvertebrate counts at a given taxonomic rank can be modelled with Dirichlet-multinomial (DM) regression, which accommodates multinomial overdispersion. However, DM regression does not account for the evolutionary relationships among benthic macroinvertebrates encoded in the phylogenetic tree, and the taxonomic rank at which the taxa counts are examined is often chosen arbitrarily. The Dirichlet-tree multinomial (DTM) model addresses these limitations by considering counts across different taxonomic ranks within the phylogenetic tree. Estimating the regression coefficients in DTM regression is complex because the likelihood falls outside the exponential family. This challenge is compounded by the high dimensionality of DTM regression, where the number of coefficients is the product of the number of covariates and the number of nodes in the phylogenetic tree, making variable selection necessary. Here, we examine the association between key water quality variables and benthic macroinvertebrate counts via regularized DTM regression with the sparse group lasso (SGL) penalty. The SGL penalty identifies covariates that are important across the entire benthic composition and highlights specific covariate-taxon associations. To address the challenges in estimating the model parameters, we propose a novel optimization framework based on the majorization-minimization (MM) algorithm. We show that the MM algorithm applied to regularized DTM regression leads to a series of iteratively re-weighted Poisson ridge regressions. We demonstrate the performance of our method through simulations.
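
For readers unfamiliar with the model, the following is a minimal Python sketch of the objective described above, not the authors' implementation. It assumes the standard factorization of the DTM likelihood into one DM term per internal node of the tree, a log link from covariates to the Dirichlet parameters, and an SGL penalty that groups coefficients by covariate with unweighted groups. All names (`dm_loglik`, `dtm_loglik`, `sgl_penalty`, `children`, `beta`, `lam`, `tau`) are hypothetical, and the MM updates themselves are not shown.

```python
import math


def dm_loglik(counts, alpha):
    """Dirichlet-multinomial log-likelihood for one vector of counts."""
    n = sum(counts)
    a = sum(alpha)
    # Multinomial coefficient: log(n! / prod_k y_k!)
    ll = math.lgamma(n + 1) - sum(math.lgamma(y + 1) for y in counts)
    # Ratio of gamma functions from integrating out the Dirichlet weights
    ll += math.lgamma(a) - math.lgamma(n + a)
    ll += sum(math.lgamma(y + al) - math.lgamma(al)
              for y, al in zip(counts, alpha))
    return ll


def subtree_count(node, children, leaf_counts):
    """Total count below a node (the node's own count if it is a leaf)."""
    if node not in children:
        return leaf_counts[node]
    return sum(subtree_count(c, children, leaf_counts) for c in children[node])


def dtm_loglik(children, leaf_counts, x, beta):
    """DTM log-likelihood for one sample: a sum of DM terms, one per internal node.

    children: dict mapping each internal node to a list of its child nodes
    beta: dict mapping each (parent, child) edge to a coefficient list
    Assumption: log link, alpha_edge = exp(x . beta_edge).
    """
    total = 0.0
    for parent, kids in children.items():
        counts = [subtree_count(k, children, leaf_counts) for k in kids]
        alpha = [math.exp(sum(xj * bj for xj, bj in zip(x, beta[(parent, k)])))
                 for k in kids]
        total += dm_loglik(counts, alpha)
    return total


def sgl_penalty(beta, lam, tau):
    """Sparse group lasso penalty with one group per covariate (assumed grouping).

    tau = 1 gives the lasso; tau = 0 gives the group lasso.
    """
    n_covariates = len(next(iter(beta.values())))
    l1 = sum(abs(b) for coefs in beta.values() for b in coefs)
    group_l2 = sum(
        math.sqrt(sum(coefs[j] ** 2 for coefs in beta.values()))
        for j in range(n_covariates)
    )
    return lam * ((1 - tau) * group_l2 + tau * l1)
```

Under these assumptions, the penalized objective for a data set would be the negative sum of `dtm_loglik` over samples plus `sgl_penalty`. The group term can zero out a covariate across every branch of the tree, while the lasso term can zero out individual covariate-branch coefficients.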