Making sense of Data from Multiple Sources

Instructors: Dr. Thierry Chekouo & Dr. Sandra Safo

03 October 2025


About this short course

Statistical and machine learning methods for integrating multimodal data (e.g., omics, imaging, electronic health records) have garnered significant attention in recent literature. Yet analyzing these different but complementary data to obtain meaningful insight remains challenging, partly because of the different technologies that generate these data types, their dimensionality, and their heterogeneity. In this course, we will present some of the challenges in analyzing multimodal data. We will cover techniques in data integration, including early, intermediate, and late data integration methods, and discuss the limitations of each approach. We will cover methods that model associations across data types via multivariate correlation analyses, methods that model shared relationships, and methods that model both shared and data-specific associations. We will cover data integration methods for regression, classification, clustering, and data reconstruction. Both explainable linear and nonlinear statistical and machine learning methods will be covered, from both frequentist and Bayesian perspectives. The techniques presented in the course are widely applicable in fields such as medicine, neuroscience, public health, and economics, and are increasingly relevant due to the use of technologies that allow for the generation and preprocessing of multimodal data. We will demonstrate all methods presented using available software in R and Python by analyzing datasets related to two or more complex diseases. This course will be useful for individuals with limited or no knowledge of data integration as well as those with expertise in data integration who would like to know more about the state-of-the-art techniques in this field.
At the end of this course, participants will be able to: i) identify challenges and opportunities for data integration; ii) identify some methods for data integration with a focus on regression, classification, clustering, and data reconstruction; iii) implement some data integration methods on real data. Presentation slides, recommended text, and R vignettes will be provided.

Instructors' biographies:

Dr. Sandra Safo

Dr. Sandra Safo is an Associate Professor of Biostatistics in the Division of Biostatistics and Health Data Science and a member of the Graduate Faculty of Data Science in the College of Science and Engineering at the University of Minnesota, USA. Her research focuses on developing robust, explainable, and usable statistical and machine learning methods and algorithms for high-dimensional data to advance clinical translational research and precision medicine. She holds a PhD in Statistics from the University of Georgia and has over 10 years of experience in the field of data integration. Her group has developed several methods, software packages, and web applications for a seamless data integration process. She is a standing member of the NIH Analytics and Statistics for Population Research Panel A Study Section, a recipient of the University of Minnesota McKnight Land-Grant Professorship award, a recipient of the Committee of Presidents of Statistical Societies (COPSS) Emerging Leader Award, an Elected Member of the International Statistical Institute, and an Associate Editor for the Journal of Computational and Graphical Statistics. Her methods work is (or has been) supported by multiple PI grants from the NIH and the University of Minnesota.

Dr. Thierry Chekouo

Dr. Thierry Chekouo is an Assistant Professor of Biostatistics in the Division of Biostatistics and Health Data Science at the University of Minnesota, School of Public Health, USA. His research interests are in developing and applying new and advanced statistical learning frameworks for analyzing datasets characterized by high dimensionality and complex structures, such as high-throughput genomic, genotype, proteomic, and imaging data. A special focus is on developing integrative Bayesian models that combine different sources of data for biomarker discovery and clinical prediction. He holds a PhD in Statistics from the University of Montreal and has over 10 years of experience in analyzing high-dimensional data for multi-view integration. He is a standing member of the Evaluation Group of Mathematics and Statistics for the Canadian Discovery Grant applications, and he has served as an ad hoc reviewer for the NIH and for Canadian funding agencies (e.g., CIHR). He is also a regular reviewer for several prominent statistics journals.

For whom is this course intended?

This course is intended for students, faculty, and practitioners, from beginners to those with some experience, who want to learn more about data integration and the state-of-the-art techniques in this field. Knowledge of R or Python would be helpful but is not required.