Perelman School of Medicine at the University of Pennsylvania

Grice Lab

MSGseqTK


News

MSGseqTK is now easier to cite through a DOI on Zenodo.

Latest Version

Click MSGseqTK for the latest version.

Introduction

MSGseqTK is a new FMD-index based toolkit for mapping and cleaning WGS metagenomics data. It is built on a new incremental and parallel algorithm that can construct very large, whole-metagenome-scale databases with moderate memory and time. For read mapping, MSGseqTK implements a previously published algorithm based on a Bayesian posterior-probability framework to achieve accurate alignments. For cleaning NGS reads of host contamination, MSGseqTK implements a fast search for maximal exact match (MEM) seeds against both reference and background databases, with very high accuracy and speed.
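
The posterior-probability idea behind the read mapping can be illustrated with a small Python sketch. This is not MSGseqTK's or AlignerBoost's actual code: the function names, the uniform prior, and the cap of 60 on the mapping quality are all assumptions made for the example. It only shows how per-candidate likelihoods can be turned into a posterior and a Phred-scaled mapping quality.

```python
import math


def posterior_probs(log10_likelihoods, priors=None):
    """Posterior probability of each candidate alignment of one read.

    log10_likelihoods: log10 P(read | candidate locus), one per candidate.
    priors: optional positive prior weights; uniform if omitted (assumption).
    """
    n = len(log10_likelihoods)
    if n == 0:
        return []
    if priors is None:
        priors = [1.0 / n] * n
    # Work in log space and subtract the maximum for numerical stability.
    weighted = [ll + math.log10(p) for ll, p in zip(log10_likelihoods, priors)]
    top = max(weighted)
    scaled = [10 ** (w - top) for w in weighted]
    total = sum(scaled)
    return [s / total for s in scaled]


def mapping_quality(posterior, cap=60):
    """Phred-scaled probability that the chosen alignment is wrong."""
    error = 1.0 - posterior
    if error <= 0.0:
        return cap
    return min(cap, int(round(-10 * math.log10(error))))


# Two hypothetical candidate loci; the first is 100x more likely than the second.
probs = posterior_probs([-1.0, -3.0])
best_quality = mapping_quality(max(probs))
```

A read whose best candidate is only slightly more likely than the runner-up gets a low mapping quality under this scheme, which is how ambiguous placements between closely related genomes can be flagged.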

Core programs

  • msgseqtk-build is used to build an MSGseqTK database. It takes a FASTA sequence file and optionally a genome name list, and outputs an FMD-index based metagenomics database. It uses an incremental algorithm to build the database block by block, so it requires limited RAM and disk space and is fairly fast.
  • msgseqtk-anno is used to annotate a pre-built MSGseqTK database. It takes an MSGseqTK database and optionally a genome annotation list, and outputs a GFF3 annotation for the database. The GFF3 annotation file contains new features at the "genome" and "metagenome" levels, which can later be used by third-party tools to count genome-level abundance (e.g. featureCounts from the Subread package).
  • msgseqtk-inspect is used to validate and inspect a pre-built MSGseqTK database. It can optionally output a genome name list, and the genomic sequences underlying the given database.
  • msgseqtk-mergedb is used to merge multiple pre-built MSGseqTK databases. It takes 2 or more database names and outputs a new database with all genomes from the input databases. It uses the same incremental algorithm and requires limited RAM and space.
  • msgseqtk-align is used to map single-end (SE) or paired-end (PE) NGS reads to a pre-built MSGseqTK database. It takes a database name and 1 or 2 NGS sequence files in FASTQ/FASTA format, and outputs binary BAM or text SAM alignment files. It implements our published algorithm based on a Bayesian framework to achieve accurate read mapping (see AlignerBoost), and depends on the htslib package from the Samtools project to output BAM files through direct API calls.
  • msgseqtk-clean is used to clean NGS reads of potential host contamination. It takes a reference database (e.g. microbial), a background database (e.g. human or another host) and input SE/PE sequences in FASTQ/FASTA format, and outputs the cleaned SE/PE sequences in FASTQ/FASTA format. It optionally outputs a TSV table with detailed information on how reads were assigned to the reference and background databases.
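
The seed-based cleaning described for msgseqtk-clean can be sketched in a few lines of Python. This is an illustration only, not the tool's implementation: it replaces the FMD-index with a naive substring search, and the decision rule (assign the read to whichever database gives the longer exact match on either strand, with a hypothetical `min_seed` cutoff) is an assumption made for the example.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_exact_match(read, genome):
    """Length of the longest substring of `read` that occurs in `genome`.

    Naive stand-in for an index-based MEM search; fine for toy inputs only.
    """
    best = 0
    n = len(read)
    for i in range(n):
        # Only try matches longer than the best found so far: if the
        # (best + 1)-long prefix at i is absent, no longer one can match.
        j = i + best + 1
        while j <= n and read[i:j] in genome:
            best = j - i
            j += 1
    return best


def classify_read(read, reference, background, min_seed=5):
    """Assign a read to 'reference', 'background' or 'unclassified'."""
    strands = (read, revcomp(read))
    ref_len = max(longest_exact_match(s, reference) for s in strands)
    bg_len = max(longest_exact_match(s, background) for s in strands)
    if max(ref_len, bg_len) < min_seed:
        return "unclassified"
    # Ties go to the reference here; this is an arbitrary choice.
    return "reference" if ref_len >= bg_len else "background"


# Toy sequences standing in for a microbial and a host database.
reference = "ACGTACGTTTGACCA"
background = "GGGCCCAAATTTGGGCCC"
label = classify_read("CGTTTGACC", reference, background)
```

In this sketch, reads labelled "background" would be the ones dropped from the cleaned output, while "reference" and "unclassified" reads would be kept or handled according to the user's preference.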

Download

Please download the source code (written in pure C++11) and pre-compiled binaries from our MSGseqTK GitHub home page.

Pre-built databases

You need to build an MSGseqTK database for each metagenome before using its tools. You can build your own database using msgseqtk-build, or alternatively download the pre-built databases below. All microbial reference genomes are downloaded and updated regularly from the NCBI RefSeq Microbial Database.

  • Bacteria_refrep_chrcomp All Bacteria reference+representative genomes with complete or chromosome level assembly sequences. The metagenomics GFF annotation file contains all the GFF annotations from NCBI RefSeq for all its bacteria genomes.
  • Archaea_refrep_chrcomp All Archaea reference+representative genomes with complete or chromosome level assembly sequences. The metagenomics GFF annotation file contains all the GFF annotations from NCBI RefSeq for all its archaea genomes.
  • Fungi_refrep_chrcomp All Fungi reference+representative genomes with complete or chromosome level assembly sequences. The metagenomics GFF annotation file contains all the GFF annotations from NCBI RefSeq for all its fungi genomes.
  • Viruses_refrep_chrcomp All Viruses reference+representative genomes with complete or chromosome level assembly sequences. The metagenomics GFF annotation file contains all the GFF annotations from NCBI RefSeq for all its virus genomes.
  • Microbial_refrep_chrcomp Merged microbial database of above four databases. Recommended for microbiome studies with samples potentially from all microbial kingdoms. The metagenomics GFF annotation file contains all the GFF annotations from NCBI RefSeq for all its microbial genomes.
  • Protists_refrep_chrcomp All (eukaryotic) Protists reference+representative genomes with complete or chromosome level assembly sequences. The metagenomics GFF annotation file contains all the GFF annotations from NCBI RefSeq for all its protist genomes.
  • hg38 Human reference genome built from the UCSC hg38/NCBI GRCh38 assembly. Recommended for background read cleaning if the metagenomics reads are from human host. The metagenomics GFF annotation file contains the GFF annotations from GENCODE human genome v27.
  • mm10 Mouse reference genome built from the UCSC mm10/NCBI GRCm38 assembly. Recommended for background read cleaning if the metagenomics reads are from mouse host. The metagenomics GFF annotation file contains the GFF annotations from GENCODE mouse genome vM18.

Citations

 

Contact us

Please contact Qi Zheng or Elizabeth Grice with any questions.