Topical Hidden Genome: Discovering Latent Cancer Mutational Topics Using a Bayesian Multilevel Context-Learning Approach
Conference
65th ISI World Statistics Congress 2025
Format: IPS Abstract - WSC 2025
Keywords: bayesian multilevel, genomics, mcmc, nlp
Wednesday 8 October 10:50 a.m. - 12:30 p.m. (Europe/Amsterdam)
Abstract
Statistical inference on the cancer-site specificities of collective ultra-rare whole-genome somatic mutations is an open problem. Traditional statistical methods cannot handle whole-genome mutation data because of their ultra-high dimensionality and extreme sparsity: for example, more than 30 million unique variants are observed in the dataset of ~1700 whole-genome tumors considered herein, of which more than 99% are encountered only once. To harness the information in these rare variants, we propose a multilevel meta-feature regression model that extracts the critical information from the mutation contexts of rare variants in a way that also permits us to extract diagnostic information from any previously unobserved variants in a new tumor sample. Our framework further leverages topic models from the field of computational linguistics to induce an interpretable dimension reduction of the mutation contexts. The proposed model is implemented using an efficient MCMC algorithm that permits rigorous full Bayesian inference at a scale that is orders of magnitude beyond the capability of out-of-the-box high-dimensional multi-class regression methods and software. We apply our model to the Pan Cancer Analysis of Whole Genomes (PCAWG) dataset, and our results reveal interesting novel insights.
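To illustrate why a meta-feature formulation can score variants that never appeared in training, the following is a minimal sketch and not the authors' model or code. It assumes that each variant's effect on the cancer-site log-odds is a linear function of a small vector of context meta-features (for instance, topic memberships), and it uses fixed point values where the full approach would use posterior draws from MCMC. All names (`variant_effects`, `site_probabilities`, `omega`, `intercept`) and all numbers are hypothetical.

```python
import numpy as np


def variant_effects(meta_features, omega):
    """Map variant meta-features (V x M) to per-site effects (V x K).

    Assumption: a variant's effect is linear in its meta-features, so the
    number of free parameters is M x K rather than V x K.
    """
    return meta_features @ omega


def site_probabilities(counts, meta_features, omega, intercept):
    """Softmax cancer-site probabilities for a single tumor.

    counts: length-V vector of variant counts (or 0/1 indicators).
    """
    beta = variant_effects(meta_features, omega)  # V x K
    logits = intercept + counts @ beta            # length K
    logits = logits - logits.max()                # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


# Hypothetical coefficients: M = 3 meta-features, K = 2 cancer sites.
omega = np.array([[2.0, 0.0],
                  [0.0, 2.0],
                  [0.0, 0.0]])
intercept = np.zeros(2)

# A variant never seen in training still has meta-features derived from
# its mutation context, so it contributes to the prediction.
new_variant_meta = np.array([[1.0, 0.0, 0.0]])
new_tumor_counts = np.array([1.0])

probs = site_probabilities(new_tumor_counts, new_variant_meta, omega, intercept)
print(probs)
```

In this toy setup the unseen variant loads entirely on the first meta-feature, so the first site receives the higher probability. The point of the sketch is only that the variant's identity never enters the computation; only its context-derived meta-features do.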