Perelman School of Medicine at the University of Pennsylvania

Wang Lab


Our lab focuses on Alzheimer's disease and other neurodegenerative disorders. Ongoing projects fall into two main directions: genetics and genomics of Alzheimer's disease and other neurodegenerative disorders, and informatics and algorithm development for genome-scale experiments.


NIA Genetics of Alzheimer's Disease Data Storage Site (NIAGADS)



NIAGADS, the NIA Genetics of Alzheimer's Disease Data Storage Site, is a genetics data repository set up by the National Institute on Aging (NIA) to facilitate access by qualified investigators to genotypic data for Alzheimer's disease (AD). We propose to expand NIAGADS into a one-stop data warehouse and portal for AD genetics that reflects the latest advances in genetics research, high-throughput genotyping, and next-generation sequencing technologies. The completed repository will include a genomics database for AD genetics; enhancements for next-generation sequencing data, secondary data, and analysis results; and accessible workflows and secondary data from various AD genetics projects.


Consortium for Alzheimer's Sequence Analysis (CASA)

The Consortium for Alzheimer's Sequence Analysis (CASA) is a $12.5M collaborative research project funded by NIA to analyze whole-exome and whole-genome sequence data generated from subjects with Alzheimer's disease (AD) and elderly normal controls. These data were generated by the National Human Genome Research Institute (NHGRI) Large-Scale Sequencing Program. The goal of the planned analyses is to identify genes with alleles that protect against or increase susceptibility to AD.

CASA consists of three projects that drive the main scientific investigation efforts:

  • Project 1 will evaluate variants detected in the sequence data for association with AD to identify protective and susceptibility alleles.
  • Project 2 will evaluate sequence data from multiplex AD families to identify variants associated with AD risk and protection, and evaluate variant co-segregation with AD.
  • Project 3 will focus on structural variants (insertion-deletions, copy number variants, and chromosomal rearrangements).

CASA also has four cores (Administrative; Statistics and Innovation; Data Management and Information Transfer; In Silico Functional Genomics) supporting the three projects and collaborations with other teams funded by NIA to analyze ADSP data.

Our lab leads the Data Management and Information Transfer Core. We are also part of Project 3, with a focus on structural variant calling using whole-genome and whole-exome sequencing data.


Genome Center of Alzheimer’s Disease (GCAD)

The NIA Coordinating Genome Center of Alzheimer’s Disease (GCAD) is a five-year, $10.8M U54 Cooperative Agreement/Specialized Center funded by NIA (U54-AG052427) to facilitate AD gene discovery by coordinating the analysis of all AD-relevant data. GCAD will assemble all data generated by the Alzheimer’s Disease Sequencing Project (ADSP) in both the Discovery Phase and the Follow-Up Phase, as well as all data from non-ADSP sources.

GCAD consists of three cores:

  • Administrative Core (Core A);
  • Data Management, Harmonization, and Information Transfer Core (Core B);
  • Biostatistics and Data Analysis Core (Core C).

GCAD will:

  • Create and support a collaborative network of all GCAD, ADSP, RFA AG16002, and other AD genetics investigators; 
  • Harmonize all genetic and phenotype data and fully annotate all variants;
  • Design all harmonization and annotation protocols, and plan and implement analysis for all data; 
  • Broadly distribute primary data, harmonized and annotated analysis-ready files, and analysis results, including depositing appropriate data into qualified-access databases [the National Institute on Aging Genetics of Alzheimer’s Disease Data Storage Site (NIAGADS) and the database of Genotypes and Phenotypes (dbGaP)].


Single Cell Analysis Program-Transcriptome Project (SCAP-T)

The Single Cell Analysis Program-Transcriptome Project (SCAP-T) initiative is part of the Single Cell Analysis Program (SCAP) and is funded through the NIH Common Fund.

The goal of this project is to evaluate cellular heterogeneity by profiling the transcriptomes of thousands of single cells from human brain and heart. The data are being generated by three independent studies conducted at the following institutions:

  • University of Pennsylvania
  • University of Southern California
  • University of California San Diego

Our group leads data coordination for these human single-cell transcriptomic data: tracking phenotype information, carrying out NGS analysis, performing QA/QC, and managing submission of the data to NIH dbGaP/SRA. We have also developed the SCAP-T Data Portal, which provides a user-friendly interface for the community to browse all the phenotype data and NGS analysis results and to download the SRA files for cells of interest. For more information, please visit the SCAP-T website.


Alzheimer's Disease Sequencing Project (ADSP)

The Alzheimer’s Disease Sequencing Project (ADSP) was established in February 2012 as a Presidential Initiative to fight Alzheimer’s Disease (AD). Developed jointly by the National Institute on Aging (NIA) and the National Human Genome Research Institute (NHGRI), the specific aims of the ADSP are to: 1) identify protective genomic variants in older adults at risk for AD, 2) identify new risk variants among AD cases, and 3) examine these factors in multi-ethnic populations to identify therapeutic targets for disease prevention.

NIAGADS team members working within the Wang Lab, along with the ADSP Data Flow Work Group, are charged with supporting ADSP discovery phase data production, sharing and management, and facilitating data access by the general research community.



Alzheimer's Disease Genetics Consortium (ADGC)

New technologies for reliable and low-cost genome-wide genotyping and sequencing have led to many exciting genetic findings for human diseases in the past decade. In these studies, a critical element for success is a sample size large enough to provide adequate power to detect genes with modest effect sizes at genome-wide significance. The Alzheimer’s Disease Genetics Consortium was formed in 2009 to collaboratively use the collective resources of the AD research community to identify AD genes. The necessary clinical, neuropathologic, molecular, and statistical expertise exists within the AD research community. Much of the needed phenotype data and DNA samples also exist, gathered by the (now 32) NIA-designated AD research/core centers across the nation.

The goal of ADGC is to deconstruct the complete genetic architecture of Alzheimer's disease (AD), and to determine how all inherited factors contribute to the AD phenotype. To this end we will identify, annotate, replicate, and validate all DNA variants that increase risk or protect against AD, determine what genes are connected to these variants, and evaluate the contribution of each to total AD risk. 

The Year 6-10 objectives are to: (1) expand AD genetics cohorts for non-Caucasian populations including African Americans, Latinos, and Asians; (2) identify additional AD rare-variant genes using gene-based analyses; (3) perform whole-exome sequencing and targeted sequencing on African American and Latino subjects to generalize findings made in Caucasians, to refine gene localization, to identify novel variants, and to identify novel genes found only in other ethnic groups; and (4) assemble and harmonize phenotypes available in multiple cohorts to identify subtypes of AD and genes linked to variants associated with those subtypes.

Our lab is responsible for all IT and data operations within ADGC, and for sharing published data through the collaboration between ADGC and NIAGADS. We also contribute to the analysis and interpretation of data.


RNA-Seq Bioinformatics

Recent evidence has shown that non-coding RNAs are ubiquitous in the cell and that their functions and structure vary to a greater extent than previously imagined. Multiple new RNA classes have been implicated in many diseases, and understanding how these RNAs work is a critical need. While exciting discoveries are accumulating, our functional knowledge of these new RNAs remains limited.

Our lab is interested in developing new computational methods and novel RNA sequencing protocols that are tightly integrated and can economically study novel functional non-coding RNAs including their structures, functions, and other important characteristics such as editing/modifications, and tissue specificity at a genomic scale.  The following are some of the tools our lab has developed:

  1. SAVoR

SAVoR is an easy-to-use web application for visualizing RNA-seq data and other genomic annotations in the context of RNA secondary structures.

  2. CoRAL

CoRAL is a machine learning package that predicts the precursor class of small RNAs present in a high-throughput RNA-sequencing dataset. In addition to classification, it reports which features are most important for discriminating between different populations of small non-coding RNAs. Complete instructions and documentation are available online.

  3. HAMR

HAMR (High-throughput Annotation of Modified Ribonucleotides) is a web application that allows you to detect and classify modified nucleotides in RNA-seq data. HAMR scans RNA-sequencing data for sites showing potential signatures of nucleotide modification. Users can input particular genomic regions of interest (BED file format) and HAMR will output a table containing the list of sites with nucleotide patterns that deviate from expectation at a statistically significant rate.

  4. DASHR

The DASHR database provides information about small non-coding RNAs (sncRNAs) and their expression in different human tissues and cell types. The content of this database derives from curation, annotation, and computational analysis of small RNA sequencing datasets from multiple sources. Currently, the database contains information about more than 46,000 sncRNAs in 42 normal human tissues and cell types from over 30 independent studies.
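As a rough illustration of the HAMR description above (regions supplied in BED format, sites reported when their nucleotide pattern deviates from expectation at a statistically significant rate), the sketch below parses BED regions and applies a simple one-sided binomial test per site. This is not HAMR's actual statistical model, which is not described here; the error rate, the significance threshold, the per-site counts, and all function names are assumptions made for the example.

```python
from math import comb

def parse_bed_line(line: str) -> tuple[str, int, int]:
    """Parse the three required BED columns: chrom, start (0-based), end (exclusive)."""
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    return chrom, int(start), int(end)

def binomial_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def in_regions(chrom: str, pos: int, regions: list[tuple[str, int, int]]) -> bool:
    """True if a 0-based position falls inside any half-open BED interval."""
    return any(c == chrom and s <= pos < e for c, s, e in regions)

# Hypothetical inputs: two BED regions and three sites given as
# (chrom, position, reads covering the site, reads with a non-reference base).
regions = [parse_bed_line(l) for l in ["chr1\t1000\t1050", "chr2\t500\t620"]]
sites = [("chr1", 1010, 40, 14), ("chr1", 1030, 40, 1), ("chr2", 700, 40, 20)]

ERROR_RATE = 0.01  # assumed background sequencing error rate
ALPHA = 0.01       # assumed significance threshold

flagged = [
    (chrom, pos)
    for chrom, pos, depth, mismatches in sites
    if in_regions(chrom, pos, regions)
    and binomial_tail(mismatches, depth, ERROR_RATE) < ALPHA
]
print(flagged)  # [('chr1', 1010)]
```

Only the first site is reported: the second has too few mismatches to stand out from the assumed error rate, and the third lies outside the supplied regions. A real analysis would also correct for multiple testing across sites.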



DNA-Seq and Genome Regulation Bioinformatics

Genome resequencing poses immense challenges for data management, analysis, and interpretation. First, the sheer amount of information makes managing, storing, and sharing data expensive and cumbersome. For example, whole-exome sequencing of 10,000 individuals generates more than 100 TB of raw data, while raw whole-genome sequencing data will reach 1-2 PB or more in size and take 5 billion (5 × 10^9) core-hours to process. Second, DNA-Seq data must be processed by dedicated computer programs/workflows and undergo careful quality checks before genotypes can be used for further analysis.
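To put these cohort-level figures in per-sample terms, the following back-of-the-envelope calculation divides the totals by the number of individuals. It assumes decimal units (1 TB = 10^12 bytes) and that the whole-genome figure refers to the same 10,000-individual cohort as the whole-exome figure; neither assumption is stated in the text.

```python
# Per-sample raw data volume implied by the cohort totals quoted above.
# Assumptions: decimal units, and a 10,000-individual cohort for both WES and WGS.
GB = 10**9
TB = 10**12
PB = 10**15

def per_sample_gb(total_bytes: int, n_samples: int) -> float:
    """Average raw data per sample, in gigabytes."""
    return total_bytes / n_samples / GB

n_individuals = 10_000
wes_gb = per_sample_gb(100 * TB, n_individuals)
wgs_low_gb = per_sample_gb(1 * PB, n_individuals)
wgs_high_gb = per_sample_gb(2 * PB, n_individuals)

print(f"WES: about {wes_gb:.0f} GB per individual")
print(f"WGS: about {wgs_low_gb:.0f}-{wgs_high_gb:.0f} GB per individual")
```

Under these assumptions an exome averages about 10 GB of raw data and a genome about 100-200 GB, which is why storage and transfer costs grow so quickly with cohort size.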

Our lab has extensive experience processing whole-exome and whole-genome DNA-Seq experiments. We are one of the five sites that contributed to a large-scale collaborative study published in 2012 [Neale et al. Nature], which used whole-exome sequencing to find de novo mutations in individuals with autism. Our lab is the Data Coordinating Center for the Alzheimer’s Disease Sequencing Project and has coordinated data production and release with the three large-scale sequencing and analysis centers (LSACs) at Broad Institute, Baylor Human Genome Center, and Washington University Human Genome Institute. We have developed a pipeline for processing WGS/WES data (DRAW; DNA Resequencing Analysis Workflow) that can be run in cloud computing environments such as Amazon Web Services (AWS).

Once candidate loci are identified, the next challenge is to associate them with potential functional consequences, such as changes in protein structure, mRNA splicing, or gene expression regulation, by integrating functional genomics data such as histone modification, protein-DNA interaction, gene expression QTL, DNase hypersensitivity, enhancer activity, and chromatin conformation.


HIPPIE (High-Throughput Identification Pipeline for Promoter Interacting Enhancer elements) is a workflow for analyzing batches of Hi-C paired-end reads in compressed FASTQ format (.fastq.gz) and predicting enhancer–target gene interactions. HIPPIE streamlines the entire processing phase, including read mapping, quality control, and enhancer–target gene prediction, as well as characterization of the interactions.
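For readers unfamiliar with the input format, here is a minimal sketch of reading paired-end reads from gzip-compressed FASTQ files and checking that the two mate files are in sync before any mapping is attempted. It is not taken from HIPPIE; the function names and the "/1" and "/2" mate-suffix convention are assumptions made for illustration.

```python
import gzip
from itertools import zip_longest

def read_ids(path):
    """Yield the read identifier from each 4-line FASTQ record in a .fastq.gz file."""
    with gzip.open(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0:
                # Header lines look like "@name optional-description".
                yield line[1:].split()[0]

def strip_mate_suffix(read_id):
    """Drop a trailing "/1" or "/2" so that mates share the same identifier."""
    if read_id.endswith(("/1", "/2")):
        return read_id[:-2]
    return read_id

def count_synced_pairs(r1_path, r2_path):
    """Count read pairs, raising ValueError if the mate files disagree."""
    n_pairs = 0
    for id1, id2 in zip_longest(read_ids(r1_path), read_ids(r2_path)):
        if id1 is None or id2 is None:
            raise ValueError("mate files contain different numbers of reads")
        if strip_mate_suffix(id1) != strip_mate_suffix(id2):
            raise ValueError(f"mate mismatch at pair {n_pairs + 1}: {id1} vs {id2}")
        n_pairs += 1
    return n_pairs
```

Calling `count_synced_pairs("sample_R1.fastq.gz", "sample_R2.fastq.gz")` (hypothetical file names) returns the number of read pairs, or raises an error at the first pair whose mates do not match. Streaming with `gzip.open` avoids decompressing large files to disk.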



DRAW (DNA Resequencing Analysis Workflow) automates the workflow of processing raw sequence reads including quality control, read alignment, and variant calling on High-Performance Computing (HPC) facilities such as Amazon Elastic Compute Cloud (EC2). SneakPeek provides an effective interface for reviewing dozens of quality metrics reported by DRAW, so users can assess the quality of data and diagnose problems in their sequencing procedures. Both DRAW and SneakPeek are freely available under the MIT license, and are available as Amazon Machine Images to be used directly on Amazon Cloud with minimal installation.

DRAW automates the entire process of mapping sequence reads, performing quality control, and calling variants. We developed DRAW following the Best Practices for variant detection with the Genome Analysis Toolkit (GATK). DRAW accepts both single-end and paired-end reads in FASTQ format from a variety of DNA-seq experiments, including whole-genome sequencing, whole-exome sequencing, and targeted capture sequencing.

SneakPeek is a web-based diagnostic tool for reviewing quality metrics generated by our DNA Resequencing Analysis Workflow (DRAW). It combines multiple web technologies to provide a seamless interface that lets users access their assigned projects, generate charts, compare metrics, and export data, all on the fly. SneakPeek also gives users many viewing options when creating a data grid, from selecting any number of flowcells in any order to transposing the entire grid. We are always working to improve SneakPeek with new features and modules, and welcome suggestions for future releases. You can access SneakPeek using the 'DemoUser' login with the password 'demopassword'.