Penn Genomics Analysis Core
454 GS FLX Sequencer
A Roche/454 GS FLX sequencer was installed at the DNA Sequencing Facility, School of Medicine, in early summer 2008. The sequencer was funded by an NIH shared instruments grant (PI Frederic Bushman). The start-up cost, including the laboratory setup and the first-year salaries of a full-time technician and a half-time programmer/analyst, was provided jointly by the School of Medicine and the Penn Genome Frontier Institute. The sequencer uses a massively parallel pyrosequencing technique to generate as much as five hundred million bases in a single overnight run, with a read length of about 400 bases using Titanium chemistry.
After installation we set up sample preparation procedures including library preparation, emulsion PCR and pyrosequencing for both genomic DNA and amplicons. We did a number of trial sequencing runs and gradually enhanced the quality and throughput of sequence reads. With the recently introduced Titanium kit the throughput has gone up from 100 Mb to ~500 Mb and read length from 250 bases to 400 bases as announced by Roche.
We accept samples (genomic DNA, long amplicons, and regular-sized amplicons) for processing and pyrosequencing on the 454 GS FLX sequencer using Titanium chemistry.
Please note that any new generation sequencing platform, including Roche/454, is ideally suited for generating millions of sequenced bases from a few samples. These sequencers cannot yet be fully exploited for sequencing only a few hundred bases (e.g. 1 or 2 exons of a gene to scan for a few known SNPs) from tens or hundreds of samples.
454 pyrosequencing technology generates much longer reads than other new generation sequencers. The longer reads allow easier assembly of repeat-rich sequences and the use of a barcoding strategy.
In cancer research and other fields the technology is primarily suited for:
- De novo whole genome sequencing supported by paired-end mapping.
- Ultra-deep un-targeted or targeted sequencing to detect the spectrum of genetic variations in specific cancer-associated regions or genes including low-abundance variants.
- Transcriptome sequencing or RNA-Seq to identify alternative splice variants and discover novel transcripts in tumor cell lines or tissues.
- Metagenomics for simultaneous sequencing of DNA from many different microbes extracted from biological or environmental samples.
- Paired-end mapping that identifies extensive structural variations (SV) and maps break points in the cancer genome.
- Viral integration studies to detect potential cancer causing viral elements in tumors.
Potential overlap in applications with Illumina (Solexa) sequencing service offered by the Genomics core
There are applications where one platform is more suitable than the other, some that can use either the 454 or the Illumina platform, and some where the two platforms play a complementary role.
- Sequencing of human microbiome: The long reads generated by 454 sequencer are especially suitable for de novo sequencing.
- Metagenomics and Comparative Genomics: 454 sequencer is particularly suitable to interrogate the number and types of species present in environmental samples.
- Targeted or untargeted resequencing: both platforms are suitable but cost/base is lower and coverage is higher with Illumina. However, in many cases 454 long reads analyze entire exons and determine allelic variations with higher confidence.
- Transcriptome sequencing (RNA-Seq): both are suitable. However, Illumina provides higher coverage and lower cost when tag counting is important, e.g. for quantitative gene expression studies. Illumina is also more cost-effective for sequencing small RNA. 454 is better suited for studies involving alternative splice isoforms. When sequencing and assembly of complex transcript structures are required, the longer 454 reads easily outperform the Illumina reads.
- Gene fusion in tumors: A recent publication by Dr Arul Chinnaiyan's group in Nature (Maher et al., 2009, Nature, 458:97-101) describes an integrative analysis of high throughput long and short read transcriptome sequencing of cancer cells to discover novel gene fusions.
In summary, all factors, e.g. the cost/base, the fold coverage, and the ease of data analysis in terms of assembly and mapping, are to be considered when selecting the right platform for a particular sequencing project. In many resequencing projects both platforms are being used simultaneously for cross-platform validation.
To help with this choice, investigators are first directed to this website and the other website before coming to a consultation session. During the consultation, all advantages and disadvantages of long read vs short read sequencers, including differences in throughput, cost per base, and ease of data analysis, are clearly explained to the potential user. Users are also given information on the new generation sequencing resources available on the Penn campus. A joint decision is then made as to which platform to use. If Illumina turns out to be the more suitable platform for a particular application, the investigator is directed to the Genomics core.
Sample Preparation for a 454 Sequencer Run
This consists of three steps: Library preparation, Titration and emulsion PCR (emPCR), and Pyrosequencing. Shotgun libraries for genomic DNA and long amplicons are prepared at the facility. Amplicon libraries are to be made by the user laboratory. The rest of the steps are carried out at the facility.
gDNA or long PCR products greater than 1.5 Kb
Genomic DNA or long PCR products submitted by the users are fragmented by nebulization, and end-polished. Following double-stranded adaptor ligation, the fragments are immobilized onto streptavidin coated beads, via the biotin moiety of one of the adaptors. A strand-displacing DNA polymerase does a fill-in to repair the gaps generated by the ligation of non-phosphorylated adaptors to the fragments. Next a single-stranded library is created by melting off the non-biotinylated strand of bead-bound fragments.
Library preparation (PCR products up to 700 - 800 bp, preferably below 500 bp)
The PCR products are generated by the user lab using sequence-specific primers that have 454 primers A and B fused to their 5′ ends. The 454 primer A and B sequences are available on request. Single or pooled PCRs are then submitted to the facility.
Instead of using the very labor-intensive and expensive method of generating a large number of long or short amplicons to analyze several genes or large genomic regions, targeted regions of the genome can be captured using Nimblegen Sequence Capture Technology or the Agilent Target Enrichment System. A few other companies also offer their own capture technologies.
Nimblegen's sequence capture arrays provide enrichment of up to 5 Mb of selected genomic regions, either contiguous or non-contiguous. Recently they have introduced a Human Exome Array that targets all human protein coding and miRNA exons. The capture technology has been optimized to be used in conjunction with 454 sequencing. After capture, the DNA is processed at the facility to attach 454 adaptors. We have done 454 sequencing of Nimblegen-captured cancer genes. Nimblegen offers both array- and solution-based capture technologies. Agilent's target enrichment method, based on hybridization in solution, also offers enrichment of a particular segment of the genome. The method has been optimized for 454 sequencing.
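For planning purposes, the average depth over a captured region can be roughly estimated by dividing the usable run throughput by the target size. The Python sketch below is a back-of-envelope calculation only, not a facility tool. The function name and the `on_target_fraction` parameter are hypothetical, and the 0.5 used in the second call is a placeholder, since actual capture efficiency varies from experiment to experiment.

```python
def mean_coverage(throughput_bases, target_bases, on_target_fraction=1.0):
    """Estimate average fold coverage as on-target bases / target size."""
    if target_bases <= 0:
        raise ValueError("target_bases must be positive")
    if not 0.0 <= on_target_fraction <= 1.0:
        raise ValueError("on_target_fraction must be between 0 and 1")
    return throughput_bases * on_target_fraction / target_bases


run_bases = 400e6   # ~400 Mb, roughly one full Titanium plate
target = 5e6        # 5 Mb captured region

# Ideal case: every sequenced base falls on the target.
print(mean_coverage(run_bases, target))        # 80.0
# Placeholder assumption: only half of the bases are on target.
print(mean_coverage(run_bases, target, 0.5))   # 40.0
```

This is an average only; coverage across a captured region is typically uneven, so individual positions can fall well below the estimate.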
In order to reduce the cost of sequencing, multiplexing can be done by attaching a unique tag to each primer before PCR amplification. After sequencing an equimolar mixture of PCR products from a number of samples, the sequences can be assigned to each sample based on the unique barcodes. Roche offers 12 barcoded adaptors to pool 12 samples together (sequences available).
A few publications involving 454 sequencing with barcoded samples are cited below.
- Hoffman et al. (2007) DNA bar coding and pyrosequencing to identify rare HIV drug resistant mutations. Nuc. Acids Res., 35, No.13, e91 - This one, from Rick Bushman's (Microbiology, Penn) group, describes sequencing of barcoded and pooled amplicons.
- Meyer et al. (2008) Parallel tagged sequencing on the 454 platform. Nature Protocols, 3, No.2, 267-278 - This one describes a method for barcoding shotgun DNA libraries as well as PCR products.
- Hamady et al. (2008) Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex. Nature Methods, 5, 235-237.
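The barcode-based assignment of reads to samples described above can be sketched in a few lines of Python. This is a minimal illustration assuming exact matching of a barcode at the start of each read. The barcode sequences, sample names and reads are made-up placeholders, not the Roche adaptor sequences, and the sketch does not attempt the error correction described by Hamady et al.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Group reads by the barcode found at the start of each read.

    Returns a dict mapping sample name -> list of reads with the barcode
    removed. Reads without an exact barcode match go under "unassigned".
    Assumes all barcodes have the same length, so at most one can match.
    """
    bins = defaultdict(list)
    for read in reads:
        for barcode, sample in barcode_to_sample.items():
            if read.startswith(barcode):
                bins[sample].append(read[len(barcode):])
                break
        else:
            # No barcode matched this read.
            bins["unassigned"].append(read)
    return dict(bins)


# Hypothetical 8-base barcodes and toy reads, for illustration only.
barcodes = {"ACGTACGT": "sample_1", "TGCATGCA": "sample_2"}
reads = ["ACGTACGTTTGGCC", "TGCATGCAAACC", "GGGGGGGGAAAA"]

for sample, seqs in demultiplex(reads, barcodes).items():
    print(sample, seqs)
```

Exact matching discards any read with a sequencing error inside the barcode, which is one reason error-tolerant barcode designs are attractive when many samples are pooled.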
Titration and Emulsion PCR
After preparation, the quality assessment and quantitation of a library are done by fluorometry and analysis on a Bioanalyzer. A functional quantitation (titration) is performed by setting up small-scale emPCRs to determine the optimum number of DNA molecules per bead.
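To relate a fluorometric concentration to a molecules-per-bead ratio, the mass concentration first has to be converted into a number of molecules. The Python sketch below shows that arithmetic under stated assumptions: the library is single-stranded, the average mass of one nucleotide is taken as roughly 330 g/mol (an approximation), and the function names, bead count and ratios are hypothetical values chosen for illustration, not the facility's protocol.

```python
AVOGADRO = 6.022e23          # molecules per mole
NT_MASS_G_PER_MOL = 330.0    # assumed average mass of one ssDNA nucleotide


def molecules_per_ul(conc_ng_per_ul, mean_length_nt):
    """Convert a mass concentration (ng/ul) to molecules per microliter."""
    grams_per_ul = conc_ng_per_ul * 1e-9
    grams_per_mole = mean_length_nt * NT_MASS_G_PER_MOL
    return grams_per_ul / grams_per_mole * AVOGADRO


def library_volume_ul(n_beads, copies_per_bead, conc_molecules_per_ul):
    """Volume of library needed to reach a target molecules-per-bead ratio."""
    return n_beads * copies_per_bead / conc_molecules_per_ul


# Hypothetical library: 1 ng/ul, mean fragment length 500 nt.
stock = molecules_per_ul(1.0, 500)
print(f"stock: {stock:.2e} molecules/ul")

# Hypothetical titration series on a working dilution of 1e6 molecules/ul.
for ratio in (0.5, 1, 2, 4):
    volume = library_volume_ul(n_beads=1_000_000, copies_per_bead=ratio,
                               conc_molecules_per_ul=1e6)
    print(f"{ratio} copies/bead -> {volume} ul")
```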
The library of DNA fragments is amplified from a single bead-bound copy to millions of copies per bead in an emulsion of water-in-oil mixture. Emulsion PCR ensures functional clonality by physically separating the DNA carrying beads in an emulsion during amplification. Following amplification on a thermocycler the emulsion is broken and the beads carrying the amplified library are recovered. The procedure generates a certain number of beads without any amplified DNA. The beads carrying amplified DNA are separated from empty beads by an enrichment process based on the binding of biotinylated amplification primers to streptavidin coated beads. The sequencing primer is then annealed to the bead bound amplified fragments.
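One way to see why some beads end up empty, and why the molecules-per-bead ratio matters, is to model the number of template molecules captured by each bead as a Poisson random variable. This is a simplifying assumption made here for illustration, not something the emPCR protocol states. Under that model, the Python sketch below gives the expected fractions of empty, clonal (exactly one template) and mixed (two or more templates) beads.

```python
import math


def bead_fractions(mean_copies_per_bead):
    """Return (empty, clonal, mixed) bead fractions under a Poisson model."""
    lam = mean_copies_per_bead
    empty = math.exp(-lam)           # P(0 templates)
    clonal = lam * math.exp(-lam)    # P(exactly 1 template)
    mixed = 1.0 - empty - clonal     # P(2 or more templates)
    return empty, clonal, mixed


for ratio in (0.1, 0.5, 1.0, 2.0):
    empty, clonal, mixed = bead_fractions(ratio)
    print(f"{ratio:>4} copies/bead: empty={empty:.3f} "
          f"clonal={clonal:.3f} mixed={mixed:.3f}")
```

Lower ratios give fewer mixed beads at the cost of more empty ones, which the enrichment step then removes. Higher ratios waste fewer beads but produce more mixed signals.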
Beads with bound DNA are loaded into the wells of a picotiter plate (PTP) such that the wells contain single DNA beads. The DNA beads are spiked with control DNA beads of known sequence. The packing beads (to stabilize all immobilized components) and the enzyme beads (carrying the enzymes for primer extension and chemiluminescence) are deposited thereafter sequentially. The loaded PTP is inserted into the FLX instrument and the sequencing reagents are sequentially flowed over the entire plate. The PPi released after incorporation of a nucleotide into the growing DNA strand by DNA polymerase is detected by ATP sulfurylase and luciferase in a coupled reaction. The light generated as a result is recorded by a CCD camera from every well on the plate simultaneously in a massively parallel fashion. Each flow of nucleotide is followed by a wash with apyrase to degrade unused nucleotides. After 200 flow cycles the FLX sequencer produces about 1,000,000 reads of length 400b in an overnight run. Sequence accuracy is estimated at 99%.
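Because one nucleotide is flowed at a time, the light signal from a well is roughly proportional to the number of identical bases incorporated in that flow. The Python sketch below illustrates the idea by rounding each flow signal to the nearest whole number of bases. It is a toy illustration, not the Roche base caller: the cyclic flow order "TACG", the function name and the signal values are assumptions made for the example, and real signal processing also handles noise, carry-forward and quality scoring.

```python
def flowgram_to_sequence(flow_values, flow_order="TACG"):
    """Translate per-flow signal intensities into a base sequence.

    Each signal is rounded to the nearest non-negative integer, which is
    taken as the homopolymer length for the nucleotide in that flow.
    """
    bases = []
    for i, signal in enumerate(flow_values):
        base = flow_order[i % len(flow_order)]
        count = max(0, int(signal + 0.5))  # round half up, never negative
        bases.append(base * count)
    return "".join(bases)


# Made-up signals for two cycles of T, A, C, G flows.
signals = [1.02, 0.05, 0.98, 2.10, 0.10, 1.90, 0.03, 1.00]
print(flowgram_to_sequence(signals))  # TCGGAAG
```

Signals that fall between integers, such as 2.5, are where homopolymer length errors arise, since the rounding step has to guess between two and three bases.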
Data processing occurs in two phases. The run-time phase includes three successive steps: GS Sequencer (acquisition of the raw images), Image Processing, and Signal Processing. This is an automated process that occurs as part of a sequencing run. However, Image Processing and Signal Processing can also be carried out on a separate server called DataRig. The end output is a set of Standard Flowgram Format (SFF) files containing the flowgrams for individual reads, the basecalled read sequences, and per-base quality scores. The total number of reads and bases obtained from a run refers to the high-quality reads that have passed all filters included in the GS Run Browser of the 454 GS FLX, namely the mixed and dot filters, the primer filter, and the signal intensity filter.
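Per-base quality scores are commonly expressed on the Phred scale, where Q = -10 log10(error probability). Assuming that convention applies to the scores in the SFF files, the Python sketch below relates an accuracy figure to a Phred score and to the expected number of errors in a read. The function names are hypothetical, and the 99% and 400-base inputs are simply the approximate figures quoted for a Titanium run.

```python
import math


def accuracy_to_phred(accuracy):
    """Convert per-base accuracy (between 0 and 1) to a Phred score."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


def expected_errors(read_length, accuracy):
    """Expected number of miscalled bases in a read of the given length."""
    return read_length * (1.0 - accuracy)


print(round(accuracy_to_phred(0.99), 2))     # Phred score for 99% accuracy
print(round(expected_errors(400, 0.99), 2))  # expected errors per 400-base read
```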
The post-run phase of data processing is the most time-consuming step. Roche offers three software packages to generate the final output in the desired format. De Novo Assembler assembles the reads into contigs to generate a consensus sequence. Reference Mapper maps the reads to a known reference sequence to generate a consensus sequence along with a list of high-confidence mutations. Amplicon Variant Analyzer identifies and quantitates sequence variants by ultra-deep sequencing of amplicons.
A number of companies, including DNA Star and SoftGenetics, provide off-the-shelf analysis software. In addition, there are numerous free analysis tools available on the web. We currently have the Windows-based NextGene software from SoftGenetics, which provides a biologist-friendly interface and works as a supplement to other tools.
The facility performs a preliminary analysis that includes trimming the standard primers from the reads and sorting the reads by barcode. Further data analysis, e.g. assembly of the reads into contigs and mapping to a reference sequence to generate a variation report, can be provided on a fee-for-service basis. Working with the investigator, we will do customized downstream analysis such as annotation of known SNPs, generation of a consensus sequence and/or a list of mutations and rare sequence variants in numerous genes, and assessment of the possible implications at the level of coding or non-coding sequence changes.
Expected Run Results (Titanium Chemistry)
Throughput based on ~400 b read length
|Pico Titer Plate (PTP) Device||No of Regions per PTP||Reads per region (x 10^3)||Throughput per region||Total Reads per PTP||Throughput per PTP|
|70x75||2||450 - 650||180 - 280 Mb||~1,000,000||360 - 560 Mb|
|70x75||4||160 - 250||60 - 110 Mb||~800,000||240 - 440 Mb|
|70x75||8||80 - 120||30 - 55 Mb||~800,000||240 - 440 Mb|
|70x75||16||25 - 40||10 - 20 Mb||~512,000||160 - 320 Mb|
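The per-plate throughput figures in the table are the per-region figures multiplied by the number of regions. The Python sketch below reproduces that arithmetic and shows how the same numbers give a rough expectation for a request covering only part of a plate. The dictionary and function names are hypothetical, and the ranges are copied from the table rather than guaranteed yields.

```python
# Per-region throughput ranges in Mb, keyed by number of regions per PTP.
REGION_THROUGHPUT_MB = {
    2: (180, 280),
    4: (60, 110),
    8: (30, 55),
    16: (10, 20),
}


def requested_throughput_mb(regions_per_plate, regions_requested):
    """Expected (low, high) throughput in Mb for the requested regions."""
    if not 1 <= regions_requested <= regions_per_plate:
        raise ValueError("regions_requested must be between 1 and "
                         "regions_per_plate")
    low, high = REGION_THROUGHPUT_MB[regions_per_plate]
    return regions_requested * low, regions_requested * high


# Full plates: these match the "Throughput per PTP" column.
for regions in REGION_THROUGHPUT_MB:
    print(regions, requested_throughput_mb(regions, regions))

# Half of a 2-region plate and a quarter of a 4-region plate.
print(requested_throughput_mb(2, 1))  # (180, 280)
print(requested_throughput_mb(4, 1))  # (60, 110)
```

The table also shows that dividing a plate into more regions lowers the total yield, so a half of a 2-region plate is expected to give more data than two quarters of a 4-region plate.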
Turnaround time is between 4 and 6 weeks, depending on the workload. A lot depends on the quality of the DNA submitted.
Users can submit requests for only a 1/2 or 1/4 plate in order to reduce the time needed to fill up a plate.
Included in the request form. There is a 10% surcharge for off-campus users. Data analysis is done after consultation with the investigator at the rate of $100/hr.
For further information and to set up a meeting, please contact Tapan Ganguly Tel: 215-573-7238, e-mail: firstname.lastname@example.org
Penn Genomics Analysis Core is an Abramson Cancer Center Shared Resource that is approved and partially funded by the National Cancer Institute.