Research
The central theme of our lab is developing and applying informatics, computing and data science methods for discovering actionable knowledge from complex biomedical and health data (e.g., genetics, omics, imaging, biomarker, outcome, EHR, health care). The goal is threefold:
- advance informatics, computing and data science by producing novel algorithms for analyzing large-scale heterogeneous data sets;
- provide important new insights into the phenotypic characteristics and genetic and molecular mechanisms of normal and/or disordered biological structures and functions to impact the development of new diagnostic, therapeutic and preventive approaches; and
- improve health and health care by contributing to collaborative, multidisciplinary research that influences policy and practice.
Our research interests include machine learning, medical image computing, biomedical and health informatics, trustworthy AI, NLP/LLMs, network science, imaging genomics, multi-omics and systems biology, Alzheimer’s disease, health disparity, and big data science in biomedicine. Our major focus is to develop and apply advanced artificial intelligence (AI) and machine learning (ML) strategies for analyzing big biobank and health data to advance the study of Alzheimer’s disease (AD) and AD-related dementia (ADRD). Recently, we have started to work on additional topics such as natural language processing, large language models and trustworthy AI. The following are some of our past and ongoing research activities.
Artificial Intelligence Strategies for Alzheimer's Disease Research
This project is supported by an NIA U01 award (U01 AG066833). We propose a comprehensive biomedical computing and health informatics research project to develop and apply cutting-edge AI algorithms and biomedical software for the analysis of large-scale AD data. At the heart of this proposed informatics program is the PennAI method and software for automating machine learning through an AI algorithm that can learn from prior analyses. This approach takes the guesswork out of picking the right machine learning algorithms and parameter settings, thus making this computing technology accessible to everyone. Specifically, we will develop three novel informatics methods to tailor PennAI to the analysis of AD data.
Penn Artificial Intelligence and Technology Collaboratory for Healthy Aging: Tech Core
The overarching goal of the Penn Artificial Intelligence and Technology (PennAITech) Collaboratory (NIA P30 AG073105) is to identify, develop, evaluate, commercialize, and disseminate innovative technology for monitoring aging adults and those with Alzheimer’s Disease (AD) and Alzheimer’s Disease Related Dementias (ADRD) in their home environment, along with the artificial intelligence (AI) methods and software for analyzing the data generated by those technologies. I am leading the Technology Identification and Training Core (Core C), which aims to advance the goal of the PennAITech Collaboratory by identifying new AI algorithms and technology based on the needs of aging Americans and developing training opportunities to facilitate their adoption and use in the pilot funding programs.
Translational Bioinformatics Approaches to Advance Drug Repositioning for Alzheimer's Disease
This project is supported by an NIA R01 award (R01 AG071470). Our overarching goal is to develop machine learning and deep learning approaches as well as informatics tools and pipelines that leverage big data in relevant biomedical domains. These big data include large-scale multidimensional genetic, molecular, biomarker and outcome data from landmark AD studies, functional interaction data among drugs, targets and diseases, pharmacologic perturbation data, electronic health record data, and market scan data. Our computational research is aimed at developing innovative translational big data analytic methods to systematically integrate AD biomarker research and systems medicine study, and facilitate the identification of novel promising targets and drugs for repositioning against AD or AD-related dementia.
Ultrascale Machine Learning to Empower Discovery in Alzheimer’s Disease Biobanks
This project is supported by an NIA U01 award (U01 AG068057). The Artificial Intelligence for Alzheimer’s Disease Initiative (AI4AD) is a coordinated national initiative to develop transformative AI approaches for high-throughput analysis of next-generation sequencing (NGS) and related AD biomarker and cognitive data. Biomarker data related to AD are being collected at “ultra-scale” and are likely to unlock numerous opportunities for AD treatment, yet the rapid collection of such data far exceeds our current capacity to analyze them. Our collective effort in this proposal will sieve extensive genomic, biomarker, and cognitive data to extract and prioritize the features that are essential to address fundamental barriers to AD prevention and drug discovery.
Informatics Algorithms for Genomic Analysis of Brain Imaging Data
This project is supported by an NLM R01 award (R01 LM013463). Brain imaging genetics studies the relationship between genetic variations and brain imaging phenotypes, and offers enormous potential to reveal the genetic underpinning of the neurobiological system that can impact the development of diagnostic, therapeutic and preventative approaches for complex brain disorders. This project seeks to develop innovative informatics methods and tools for integrative analysis of imaging, genetics and transcriptomics data to identify brain imaging genetic associations with evidence manifested in the human brain transcriptome. Using ADNI and related cohorts as test beds, these methods and tools will be shown to have considerable potential for understanding the molecular mechanism of Alzheimer’s disease, and are expected to impact neurological and psychiatric research in general and benefit public health outcomes.
Machine Learning Framework for Multi-Site Collaborative Brain Big Data Mining
This project is supported by an NSF award (IIS1837964). Recent advances in multimodal brain imaging and high throughput genotyping and sequencing techniques provide exciting new opportunities to ultimately improve our understanding of brain structure and neural dynamics, their genetic architecture, and their influences on cognition and behavior. However, data privacy and security issues have inhibited data sharing across institutes. Emerging multi-site collaborative data analysis can address these issues and facilitate data and computing resource sharing. This project seeks to harness the opportunities of designing new efficient asynchronous distributed machine learning algorithms with rigorous theoretical foundations for multi-site collaborative brain big data mining, creating large-scale computational strategies and effective software tools to reveal sophisticated relationships among heterogeneous brain data.
Integrative Bioinformatics Approaches to Human Brain Genomics and Connectomics
This project is supported by an NIBIB R01 award (R01 EB022574). Integrating human connectomics and brain imaging genomics offers enormous potential, allowing us to apply systems biology approaches to the brain to better understand the interplay between genes, brain connectivity, and phenotypic outcomes (e.g., cognition, behavior, disorder). In this project, we seek to develop novel bioinformatics methods and tools for integrative study of human connectomics and brain imaging genomics. These methods and tools can be applied to: (1) study normal brain functions to impact biomedical research in general, and (2) study brain disorders to improve public health outcomes by facilitating diagnostic and therapeutic progress. We are using the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and Human Connectome Project (HCP) cohorts as test beds to develop methods and tools with potential for a better understanding of the interplay between genes, brain connectivity and function.
Mining Drug Interaction Induced Adverse Effects (ADEs) from Health Record Databases
This project is supported by NSF awards (IIS 1827472 and IIS 1622526). Recent advances in large-scale electronic health record database techniques provide exciting new opportunities for the study of drug safety. Drug-drug interactions (DDIs), a major cause of adverse drug events (ADEs), are a serious global health concern and a severe detriment to public health. The scale of DDIs involving three or more drugs (also called high-order DDIs) poses a prohibitive challenge for molecular pharmacology and clinical research, which motivates alternative strategies such as mining health record data. This project aims to develop large-scale computational strategies and effective software tools for mining high-order DDI effects from health record databases, in order to yield novel discoveries in drug safety, and ultimately to benefit national health and well-being.
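To make the idea of mining high-order DDI effects concrete, the following is a minimal sketch, not the project's actual method. It assumes each health record has been reduced to a set of co-prescribed drugs plus a flag for whether an ADE occurred, and it compares the ADE rate among records exposed to a given drug triple against the rate among all other records. The function name, the record layout, and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def triple_ade_rates(records, min_support=2):
    """Screen three-drug combinations for elevated ADE rates.

    records: iterable of (set_of_drug_names, had_ade) pairs (hypothetical layout).
    Returns {triple: (n_exposed, ade_rate_exposed, relative_risk)} for every
    triple that appears in at least `min_support` records.
    """
    records = list(records)
    exposed = Counter()  # records containing each triple
    events = Counter()   # records containing each triple that also had an ADE
    for drugs, had_ade in records:
        for triple in combinations(sorted(drugs), 3):
            exposed[triple] += 1
            if had_ade:
                events[triple] += 1

    total = len(records)
    total_events = sum(1 for _, had_ade in records if had_ade)

    results = {}
    for triple, n in exposed.items():
        if n < min_support:
            continue  # too few exposed records to say anything
        rate_exposed = events[triple] / n
        n_unexposed = total - n
        if n_unexposed == 0:
            relative_risk = float("nan")  # no comparison group
        else:
            rate_unexposed = (total_events - events[triple]) / n_unexposed
            if rate_unexposed > 0:
                relative_risk = rate_exposed / rate_unexposed
            else:
                relative_risk = float("inf")
        results[triple] = (n, rate_exposed, relative_risk)
    return results


# Made-up toy records, for illustration only (not real patient data).
toy_records = [
    ({"A", "B", "C"}, True),
    ({"A", "B", "C"}, True),
    ({"A", "B", "C", "D"}, False),
    ({"A", "B"}, False),
    ({"C", "D"}, False),
    ({"A", "D"}, True),
    ({"B", "C", "D"}, False),
]
results = triple_ade_rates(toy_records, min_support=2)
for triple, (n, rate, rr) in sorted(results.items()):
    print(triple, n, round(rate, 3), round(rr, 3))
```

The number of candidate triples grows combinatorially with the number of drugs per record, which is why a support threshold is applied here. A real analysis would also need to handle confounding and multiple testing, which this sketch ignores.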
Bioinformatics Strategies for Multidimensional Brain Imaging Genetics
This project is supported by an NLM R01 award (R01 LM011360). We have been working on producing novel bioinformatics algorithms and tools for comprehensive joint analysis of large-scale heterogeneous imaging genomics data, using the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database as a test bed. We have published a variety of novel machine learning models for effective mining of complex imaging genomic associations, including structured sparse regression models, structured sparse canonical correlation analysis (SCCA) models, and gene-gene and gene-environment interaction models. We have also developed a novel imaging genetic enrichment analysis (IGEA) framework for identifying high-level associations between gene sets and brain circuitries, and a novel network-based machine learning framework to identify phenotype-relevant functional modules from tissue-specific biological networks. We are working on developing novel machine learning and bioinformatics strategies for integrating brain genomics, transcriptomics and anatomics.
Genetic and Multi-Omic Analysis of Quantitative Phenotypes in AD
This focus aims to investigate the role of genetic variation in disordered brain function using neuroimaging and biomarkers as phenotypes. Besides the method development work described above, we also employ state-of-the-art methods to perform genetic analysis of quantitative phenotypes in AD. ADNI (U01 AG024904) is a landmark study in AD, and Dr. Shen served as a Co-Leader of its Genetics Core between 2009 and 2017. Using data from ADNI and local cohorts, we have completed a series of candidate gene and genome-wide association studies (GWAS) of structural and molecular neuroimaging data and other biomarker data (e.g., cerebrospinal fluid, plasma proteomics, cognition) in mild cognitive impairment (MCI) and AD. These studies yielded many interesting genetic findings in relation to quantitative phenotypes. Given the broadened landscape of the ADNI multi-omic domain (e.g., including data from genome, epigenome, transcriptome, proteome, and metabolome), we are interested in expanding the scope of our imaging omics study from the genomic domain to the multi-omic domain.
Multidimensional Data Mining and Biomarker Discovery
This topic aims to identify biomarkers from multidimensional data sets, including multimodal imaging data, high-throughput omics data, and fluid biomarker data, for predicting cognitive and diagnostic outcomes. This work was partially supported by a completed NSF project (IIS-1117335), where we proposed and applied a series of sparse machine learning methods to the ADNI cohort for mining multidimensional imaging, omics and fluid biomarker data and discovering disease-sensitive and/or cognition-relevant biomarkers. These approaches include machine learning models for sparse Bayesian classification, structured sparse multi-task regression, sparse learning for joint classification and regression, multi-modal multi-task learning, and multi-task longitudinal learning. Given the scale and complexity of the multidimensional imaging, omics and biomarker data, we are interested in refining our models for multidimensional data integration and longitudinal learning, as well as addressing big data analytic challenges.
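As a generic illustration of how structured sparsity can select biomarkers shared across several outcomes, the sketch below shows one common formulation, the L2,1-norm penalty, and its row-wise shrinkage step (the proximal operator). It is not taken from the lab's published models. The weight matrix layout (rows are candidate features, columns are prediction tasks) and all names are assumptions made for this example.

```python
import math


def l21_norm(W):
    """L2,1 norm: the sum over rows (features) of each row's Euclidean norm."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in W)


def prox_l21(W, threshold):
    """Proximal operator of threshold * L2,1 norm (row-wise group shrinkage).

    Each row is scaled by max(0, 1 - threshold / ||row||), so a feature whose
    weights are small across all tasks is set to exactly zero for every task.
    """
    shrunk = []
    for row in W:
        norm = math.sqrt(sum(w * w for w in row))
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        shrunk.append([scale * w for w in row])
    return shrunk


# Hypothetical weights: 3 candidate features (rows) x 2 tasks (columns).
W = [
    [3.0, 4.0],  # strong feature: shrunk but kept for both tasks
    [0.3, 0.4],  # weak feature: removed from both tasks together
    [0.0, 0.0],  # already zero
]
W_shrunk = prox_l21(W, threshold=1.0)
selected = [i for i, row in enumerate(W_shrunk) if any(w != 0.0 for w in row)]
print(W_shrunk)
print("selected feature rows:", selected)
```

Because the shrinkage acts on whole rows, a feature is either kept for all tasks or dropped for all tasks, which is what makes the selected features interpretable as shared biomarkers. In a full solver this step would alternate with gradient steps on the regression loss.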
Biomedical Image Computing
This focus aims to develop and apply image and shape computing methods for analyzing MRI, PET, CT and other 3D imaging data. We have made a variety of contributions to the enhancement of the spherical harmonic (SPHARM) shape modeling technique by addressing its fundamental challenges, including generalization, scalability, and flexibility. Supported by an NIBIB project (R03 EB008674), we developed and released SPHARM-MAT, a SPHARM-based software toolkit for brain imaging. We have applied SPHARM to various biomedical applications, including hippocampal atrophy in brain disorders, cortical analysis in autism, thalamic atrophy in multiple sclerosis, cardiac motion analysis, and evolutionary biology. Besides SPHARM, we have also developed image processing methods for studying craniofacial dysmorphology in fetal alcohol spectrum disorder and spatiotemporal modeling of lung nodules. We are interested in developing novel methods for morphometric analysis of hippocampal subfields as well as image processing and machine learning methods for diagnosing dental hard-tissue conditions.