Yi-An Ko (Susztak Lab) From Risk Variants to Genes: Post-GWAS Annotation of Chronic Kidney Disease Risk Loci
Varun Aggarwala (Voight Lab) Heptanucleotide sequence context explains substantial variability in nucleotide substitution probabilities across the human genome
Abstract: The rate of single nucleotide polymorphism varies by ~1000-fold across the human genome and fundamentally impacts evolution and the incidence of genetic disease. The identities of the single nucleotides that immediately flank a polymorphic site (the site's trinucleotide local sequence context) substantially influence the probability that a nucleotide change will occur. In human populations, the impact of local sequence context on polymorphism rate has not been fully described and is untested beyond the trinucleotide context. To examine the boundaries of the window of local sequence that impacts the probability of polymorphism, we developed a statistical framework to compare different local sequence lengths using non-coding genomic data obtained from the 1000 Genomes Project. We demonstrate that a heptanucleotide sequence context (that is, a model that incorporates the three adjacent nucleotides located both 5′ and 3′ to a polymorphic site) accounts for up to 93% of the variability in the probability of nucleotide substitution observed genome-wide. Our study also reveals previously undocumented variability in the probability of cytosine-to-thymine transition substitutions at CpG dinucleotides. Extension of our statistical framework into coding genomic data demonstrates additional context-specific variability in the probabilities of amino acid substitutions. Based on these observations, we present two statistics, informed by our best-performing sequence context model, that are relevant for clinical studies: a substitution tolerance score for genes and a novel tolerance score for amino acids.
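To make the idea of a sequence context concrete, the following minimal Python sketch tabulates, for each window of a chosen width, the fraction of sites that are polymorphic. It is an illustration only and not the statistical framework described in the abstract: the function name `context_rates`, its parameters, and the toy input are hypothetical, and the sketch ignores substitution type and strand symmetry.

```python
from collections import Counter


def context_rates(sequence, polymorphic_positions, flank=3):
    """Estimate, per sequence context, the fraction of sites that are polymorphic.

    flank=3 gives a heptanucleotide (7-mer) context; flank=1 gives a
    trinucleotide context. Positions are 0-based indices into `sequence`.
    This is a hypothetical helper for illustration, not the authors' model.
    """
    sequence = sequence.upper()
    polymorphic = set(polymorphic_positions)
    totals = Counter()    # how often each context occurs
    variable = Counter()  # how often its central site is polymorphic

    # Only sites with a full flank on both sides have a defined context.
    for i in range(flank, len(sequence) - flank):
        context = sequence[i - flank : i + flank + 1]
        if set(context) - set("ACGT"):
            continue  # skip windows containing ambiguous bases such as N
        totals[context] += 1
        if i in polymorphic:
            variable[context] += 1

    return {ctx: variable[ctx] / n for ctx, n in totals.items()}


# Toy usage: one polymorphic site at index 3 of a short made-up sequence.
hepta = context_rates("ACGTACGTACG", [3])           # 7-mer contexts
tri = context_rates("ACGTACGTACG", [3], flank=1)    # 3-mer contexts
```

Calling the function with different `flank` values on the same data yields the per-context estimates that a comparison of context lengths would start from; how those models are then compared is left out of this sketch.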