Research Topics

The central theme of our lab is developing and applying informatics, computing and data science methods to discover actionable knowledge from complex biomedical and health data (e.g., genetics, omics, imaging, biomarker, outcome, EHR, health care). The goal is threefold:

advance informatics, computing and data science by producing novel algorithms for analyzing large-scale heterogeneous data sets;
provide important new insights into the phenotypic characteristics and genetic and molecular mechanisms of normal and/or disordered biological structures and functions to impact the development of new diagnostic, therapeutic and preventive approaches; and
improve health and health care by contributing to collaborative, multidisciplinary research that influences policy and practice.

Our research spans machine learning, medical image computing, biomedical and health informatics, trustworthy AI, NLP/LLMs, network science, imaging genomics, and multi-omics and systems biology, with applications to Alzheimer’s disease and related dementias (ADRD) and other complex disorders. Our primary focus is on developing and applying advanced AI strategies to analyze large-scale biobank and health datasets, with the goal of advancing the understanding, early detection, treatment, prevention, and overall healthcare of complex disorders. We also explore emerging frontiers such as generative AI, agentic AI, and trustworthy multimodal AI to push the boundaries of biomedical research. Below are highlights of our past and ongoing research activities.

Ultrascale Machine Learning to Empower Discovery in Alzheimer’s Disease Biobanks (Phase II)

This U01 renewal project (U01 AG068057), Artificial Intelligence for Alzheimer’s Disease (AI4AD, Twitter), responds to the evolving ADRD (Alzheimer’s Disease and Related Dementias) research landscape, where patient subtyping is crucial for enhancing drug trials and for precision medicine (assigning optimal treatments to patients). By creating innovative Artificial Intelligence and Machine Learning (AIML) approaches, we address four key NIA mandates: (1) molecular subtyping for precision medicine, (2) improving ADRD clinical trial design, (3) adapting AI models for consistent performance across cohorts for genetic target and treatment selection, and (4) genome-guided drug repositioning.

Multimodal Digital Phenotyping and Machine Learning from Mobile Health Data

This focus develops computational methods, spanning image processing, time‑series modeling, and predictive analytics, to convert smartphone images/videos and wearable sensor streams into actionable digital biomarkers. Through an academia–industry partnership, we build algorithms that transform phone‑based leg videos into 3D volumetric edema assessments and train wearable‑based models to predict defecation counts in inflammatory bowel disease (IBD). Supported by a MassAITC pilot (NIA P30 AG073107), we also use data from heart‑failure monitoring socks to model edema and fatigue linked to exacerbations. In collaboration with ASPE, we integrate sleep, activity, genetic, and phenotypic data using machine learning approaches to advance ASD prediction, subtyping, and biomarker discovery.

AI for Scalable Epidemiology Using Patient-Reported Outcomes

This project, supported by an NLM R01 award (R01 LM014731), addresses the widespread challenge of medication non-adherence, a major contributor to preventable deaths and substantial healthcare costs. Building on a decade of prior work, our collaborative team will develop AI and natural language processing methods to analyze patient-reported data from online sources and spontaneous reporting systems, uncovering factors that influence adherence and tolerability. Through disease-specific case studies and systematic reviews, we aim to integrate the patient voice into research in a scalable, reproducible way, informing more effective adherence interventions and advancing population health.

Personal Risk Factor–Enhanced ML Models for Early ADRD Prediction

This project is supported by an NSF SCH award (NSF 2500343). We propose to develop a computational platform that combines novel machine learning (ML) and natural language processing (NLP) methods to automatically extract personal risk factors from clinical narratives and use them for accurate, early prediction of Alzheimer’s Disease and Related Dementias (ADRD). By examining how personal and clinical factors interact in disease progression, the project advances both ML methodology and ADRD research, with the potential to transform early detection and management of complex neurological disorders.

Data Mining, Machine Learning and LLM from EHR, Survey and Other Health Databases

This focus aims to develop computational strategies, guided by knowledge graphs as needed, to extract actionable insights from complex healthcare data sources such as EHRs, surveys, audio transcripts, medical literature, claims, and wearable device data. Our goal is to uncover findings that advance learning health research and improve national health and well-being. For example, supported by a PennAITech supplement (NIA P30 AG073105), we leverage large language models fine-tuned with high-quality caregiver interview data to deliver expert-level AI services for the aging care industry. Supported by a PennAITech pilot (NIA P30 AG073105), we apply machine learning to clinician–caregiver interactions to predict depression and burden in AD/ADRD caregivers.

Artificial Intelligence Strategies for Alzheimer's Disease Research

This project is supported by an NIA U01 award (U01 AG066833). We propose a comprehensive biomedical computing and health informatics research project to develop and apply cutting-edge AI algorithms and biomedical software for the analysis of large-scale AD data. At the heart of this proposed informatics program is the PennAI method and software for automating machine learning through an AI algorithm that can learn from prior analyses. This approach takes the guesswork out of picking the right machine learning algorithms and parameter settings, thus making this computing technology accessible to everyone. Specifically, we will develop three novel informatics methods to tailor PennAI to the analysis of AD data.

Penn Artificial Intelligence and Technology Collaboratory for Healthy Aging: Tech Core

The overarching goal of the Penn Artificial Intelligence and Technology (PennAITech, LinkedIn) Collaboratory (NIA P30 AG073105) is to identify, develop, evaluate, commercialize, and disseminate innovative technology for monitoring aging adults and those with Alzheimer’s Disease (AD) and Alzheimer’s Disease Related Dementias (ADRD) in their home environment and the artificial intelligence (AI) methods and software for analyzing data generated by those technologies. I am leading the Technology Identification and Training Core (Core C), which aims to advance the goal of the PennAITech Collaboratory by identifying new AI algorithms and technology based on the needs of aging Americans and developing training opportunities to facilitate their adoption and use in the pilot funding programs.

Translational Bioinformatics Approaches to Advance Drug Repositioning for Alzheimer's Disease

This project is supported by an NIA R01 award (R01 AG071470). Our overarching goal is to develop machine learning and deep learning approaches as well as informatics tools and pipelines that leverage big data in relevant biomedical domains. These big data include large-scale multidimensional genetic, molecular, biomarker and outcome data from landmark AD studies, functional interaction data among drugs, targets and diseases, pharmacologic perturbation data, electronic health record data, and market scan data. Our computational research is aimed at developing innovative translational big data analytic methods to systematically integrate AD biomarker research and systems medicine study, and facilitate the identification of novel promising targets and drugs for repositioning against AD or AD-related dementia.

Ultrascale Machine Learning to Empower Discovery in Alzheimer’s Disease Biobanks

This project is supported by an NIA U01 award (U01 AG068057). The Artificial Intelligence for Alzheimer’s Disease Initiative (AI4AD, Twitter) is a coordinated national initiative to develop transformative AI approaches for high-throughput analysis of next-generation sequencing (NGS) and related AD biomarker and cognitive data. Biomarker data related to AD are being collected at “ultra-scale” and are likely to unlock numerous opportunities for AD treatment, yet the rapid collection of such data far exceeds our current capacity to analyze them. Our collective effort in this proposal will sieve extensive genomic, biomarker, and cognitive data to extract and prioritize the features that are essential to address fundamental barriers to AD prevention and drug discovery.

Informatics Algorithms for Genomic Analysis of Brain Imaging Data

This project is supported by an NLM R01 award (R01 LM013463). Brain imaging genetics studies the relationship between genetic variations and brain imaging phenotypes, and offers enormous potential to reveal the genetic underpinning of the neurobiological system that can impact the development of diagnostic, therapeutic and preventive approaches for complex brain disorders. This project seeks to develop innovative informatics methods and tools for integrative analysis of imaging, genetics and transcriptomics data to identify brain imaging genetic associations with evidence manifested in the human brain transcriptome. Using ADNI and related cohorts as test beds, we aim to show that these methods and tools have considerable potential for understanding the molecular mechanism of Alzheimer’s disease. They are also expected to impact neurological and psychiatric research in general and to benefit public health outcomes.

Machine Learning Framework for Multi-Site Collaborative Brain Big Data Mining

This project is supported by an NSF award (IIS 1837964). Recent advances in multimodal brain imaging and high-throughput genotyping and sequencing techniques provide exciting new opportunities to ultimately improve our understanding of brain structure and neural dynamics, their genetic architecture, and their influences on cognition and behavior. However, data privacy and security issues have inhibited data sharing across institutions. Emerging multi-site collaborative data analysis can address these issues and facilitate data and computing resource sharing. This project seeks to harness these opportunities by designing new efficient asynchronous distributed machine learning algorithms with rigorous theoretical foundations for multi-site collaborative brain big data mining, creating large-scale computational strategies and effective software tools to reveal sophisticated relationships among heterogeneous brain data.
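To make the multi-site idea concrete, here is a minimal sketch of collaborative model fitting in which each site shares only a gradient and a sample count, never its raw records. It is an illustration under strong simplifying assumptions, not the project's algorithm: it is synchronous rather than asynchronous, it fits a one-parameter linear model, and the function names and toy data are hypothetical.

```python
def local_gradient(w, xs, ys):
    """Gradient of the mean squared error for y ~ w * x on one site's data."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def fit_across_sites(sites, lr=0.05, rounds=200):
    """Synchronous distributed gradient descent over a list of (xs, ys) sites."""
    total = sum(len(xs) for xs, _ in sites)
    w = 0.0
    for _ in range(rounds):
        # Each site contributes its local gradient weighted by its sample count;
        # the coordinator never sees individual-level data.
        grad = sum(len(xs) * local_gradient(w, xs, ys) for xs, ys in sites) / total
        w -= lr * grad
    return w


# Toy data (hypothetical): three sites whose records all follow y = 2 * x.
sites = [
    ([1.0, 2.0], [2.0, 4.0]),
    ([3.0], [6.0]),
    ([0.5, 1.5, 2.5], [1.0, 3.0, 5.0]),
]
w = fit_across_sites(sites)
```

Weighting each site's gradient by its sample count makes the aggregated update equal to the gradient on the pooled data, so this toy fit matches what centralized training would produce.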

Integrative Bioinformatics Approaches to Human Brain Genomics and Connectomics

This project is supported by an NIBIB R01 award (R01 EB022574). Integrating human connectomics and brain imaging genomics offers enormous potential, allowing us to apply systems biology approaches to the brain to better understand the interplay between genes, brain connectivity, and phenotypic outcomes (e.g., cognition, behavior, disorder). In this project, we seek to develop novel bioinformatics methods and tools for integrative study of human connectomics and brain imaging genomics. These methods and tools can be applied to: (1) study normal brain functions to impact biomedical research in general, and (2) study brain disorders to improve public health outcomes by facilitating diagnostic and therapeutic progress. We are using the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and Human Connectome Project (HCP) cohorts as test beds to develop methods and tools with potential for a better understanding of the interplay between genes, brain connectivity and function.

Mining Drug Interaction Induced Adverse Effects (ADEs) from Health Record Databases

This project is supported by NSF awards (IIS 1827472 and IIS 1622526). Recent advances in large-scale electronic health record database techniques provide exciting new opportunities for the study of drug safety. Drug-drug interactions (DDIs), a major cause of adverse drug events (ADEs), are a serious global health concern and a severe detriment to public health. The number of possible DDIs involving three or more drugs (also called high-order DDIs) poses a prohibitive challenge for molecular pharmacology and clinical research, which motivates alternative strategies such as mining health record data. This project aims to develop large-scale computational strategies and effective software tools for mining high-order DDI effects from health record databases, in order to yield novel discoveries in drug safety, and ultimately to benefit national health and well-being.
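As a rough illustration of what mining high-order DDI signals from health records can involve, the sketch below counts how often each three-drug combination co-occurs and compares the ADE rate with and without the full combination. It is a toy example, not the project's method: the record format, function names, and data are hypothetical, and a real analysis would need confounder adjustment, multiple-testing control, and far more scalable counting.

```python
from collections import Counter
from itertools import combinations


def combo_ade_rates(records, order=3, min_support=2):
    """Return {drug_combo: (support, ade_rate)} for combinations of a given order.

    records: list of (set_of_drugs, had_ade) pairs, a simplified stand-in
    for patient-level exposure windows in a health record database.
    """
    exposed = Counter()  # records containing the combination
    events = Counter()   # of those, records with an adverse event
    for drugs, had_ade in records:
        for combo in combinations(sorted(drugs), order):
            exposed[combo] += 1
            if had_ade:
                events[combo] += 1
    return {
        combo: (n, events[combo] / n)
        for combo, n in exposed.items()
        if n >= min_support
    }


def relative_risk(records, combo):
    """ADE rate with the full combination divided by the rate without it."""
    combo = set(combo)
    with_combo = [ade for drugs, ade in records if combo <= drugs]
    without = [ade for drugs, ade in records if not combo <= drugs]
    if not with_combo or not without or sum(without) == 0:
        return None  # ratio is undefined on this sample
    return (sum(with_combo) / len(with_combo)) / (sum(without) / len(without))


# Toy records (hypothetical): drug sets and whether an ADE was observed.
records = [
    ({"A", "B", "C"}, True),
    ({"A", "B", "C"}, True),
    ({"A", "B", "C", "D"}, False),
    ({"A", "B"}, False),
    ({"B", "C"}, False),
    ({"A", "D"}, True),
    ({"C", "D"}, False),
    ({"A"}, False),
]
rates = combo_ade_rates(records)
rr = relative_risk(records, ("A", "B", "C"))
```

The number of candidate combinations grows combinatorially with the order, which is why a minimum-support filter (here `min_support`) or a similar pruning step is usually needed before any statistics are computed.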

Bioinformatics Strategies for Multidimensional Brain Imaging Genetics

This project is supported by an NLM R01 award (R01 LM011360). We have been working on producing novel bioinformatics algorithms and tools for comprehensive joint analysis of large-scale heterogeneous imaging genomics data, using the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database as a test bed. We have published a variety of novel machine learning models for effective mining of complex imaging genomic associations, including structured sparse regression models, structured sparse canonical correlation analysis (SCCA) models, and gene-gene and gene-environment interaction models. We have also developed a novel imaging genetic enrichment analysis (IGEA) framework for identifying high-level associations between gene sets and brain circuitries, and a novel network-based machine learning framework to identify phenotype-relevant functional modules from tissue-specific biological networks. We are working on developing novel machine learning and bioinformatics strategies for integrating brain genomics, transcriptomics and anatomics.

Genetic and Multi-Omic Analysis of Quantitative Phenotypes in AD

This focus aims to investigate the role of genetic variation in disordered brain function using neuroimaging and biomarkers as phenotypes. Besides the method development work described above, we also employ state-of-the-art methods to perform genetic analysis of quantitative phenotypes in AD. ADNI (U01 AG024904) is a landmark study in AD, and Dr. Shen served as a Co-Leader of its Genetics Core between 2009 and 2017. Using data from ADNI and local cohorts, we have completed a series of candidate gene and genome-wide association studies (GWAS) of structural and molecular neuroimaging data and other biomarker data (e.g., cerebrospinal fluid, plasma proteomics, cognition) in mild cognitive impairment (MCI) and AD. These studies yielded many interesting genetic findings in relation to quantitative phenotypes. Given the broadened landscape of ADNI multi-omic data (e.g., data from the genome, epigenome, transcriptome, proteome, and metabolome), we are interested in expanding the scope of our imaging omics studies from the genomic domain to the multi-omic domain.
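For readers unfamiliar with association testing on quantitative phenotypes, the core step can be sketched as a per-variant regression of the phenotype on allele dosage. The example below is a simplified illustration with hypothetical names and toy data, not the pipeline used in these studies; real GWAS additionally adjust for covariates and population structure, compute p-values, and correct for multiple testing.

```python
def snp_effect(dosages, phenotype):
    """Least-squares slope and r^2 of a phenotype on allele dosage (0, 1, or 2)."""
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_y = sum(phenotype) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    syy = sum((y - mean_y) ** 2 for y in phenotype)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(dosages, phenotype))
    if sxx == 0 or syy == 0:
        return None  # monomorphic variant or constant phenotype
    return sxy / sxx, sxy ** 2 / (sxx * syy)


def scan(genotypes, phenotype):
    """Test every variant and rank by variance explained (largest first)."""
    results = {}
    for snp, dosages in genotypes.items():
        fit = snp_effect(dosages, phenotype)
        if fit is not None:
            results[snp] = fit
    return sorted(results.items(), key=lambda item: item[1][1], reverse=True)


# Toy data (hypothetical): an imaging-derived measure for six participants.
phenotype = [3.0, 2.5, 2.0, 3.1, 2.4, 1.9]
genotypes = {
    "snp_a": [0, 1, 2, 0, 1, 2],
    "snp_b": [0, 0, 1, 1, 2, 2],
    "snp_c": [0, 0, 0, 0, 0, 0],  # monomorphic, so it is skipped
}
ranked = scan(genotypes, phenotype)
```

Here `ranked` lists each testable variant with its estimated effect per allele copy and the fraction of phenotypic variance it explains in this toy sample.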

Multidimensional Data Mining and Biomarker Discovery

This topic aims to identify biomarkers from multidimensional data sets, including multimodal imaging data, high-throughput omics data, and fluid biomarker data, for predicting cognitive and diagnostic outcomes. This work was partially supported by a completed NSF project (IIS-1117335), where we proposed and applied a series of sparse machine learning methods to the ADNI cohort for mining multidimensional imaging, omics and fluid biomarker data and discovering disease-sensitive and/or cognition-relevant biomarkers. These approaches include machine learning models for sparse Bayesian classification, structured sparse multi-task regression, sparse learning for joint classification and regression, multi-modal multi-task learning, and multi-task longitudinal learning. Given the scale and complexity of the multidimensional imaging, omics and biomarker data, we are interested in refining our models for multidimensional data integration and longitudinal learning, and in addressing the associated big data analytic challenges.

Biomedical Image Computing

This focus aims to develop and apply image and shape computing methods for analyzing MRI, PET, CT and other 3D imaging data. We have made a variety of contributions to the enhancement of the spherical harmonic (SPHARM) shape modeling technique by addressing its fundamental challenges, including generalization, scalability, and flexibility. Supported by an NIBIB project (R03 EB008674), we developed and released SPHARM-MAT, a SPHARM-based software toolkit for brain imaging. We have applied SPHARM to various biomedical applications, including hippocampal atrophy in brain disorders, cortical analysis in autism, thalamic atrophy in multiple sclerosis, cardiac motion analysis, and evolutionary biology. Besides SPHARM, we have also developed image processing methods for studying craniofacial dysmorphology in fetal alcohol spectrum disorder and spatiotemporal modeling of lung nodules. We are interested in developing novel methods for morphometric analysis of hippocampal subfields as well as image processing and machine learning methods for diagnosing dental hard-tissue conditions.