Perelman School of Medicine at the University of Pennsylvania

Grice Lab

MSGseqTK


News

MSGseqTK has been fully updated and now has a much smaller memory footprint for database building and read cleaning.

Latest Version

Click MSGseqTK for the latest version.

Introduction

MSGseqTK is a new FMD-index-based toolkit for mapping and cleaning WGS metagenomics data. It is based on a new incremental and parallel algorithm that can build very large, whole-metagenome-scale databases with O(2 * N) maximum RAM usage, where N is the bi-directional size of the metagenome. MSGseqTK implements a previously published algorithm based on a Bayesian posterior probability framework to achieve accurate read mapping quality (mapQ) values. For NGS read cleaning (removal of host contamination), MSGseqTK implements a lightweight and fast maximal exact match (MEM) seed search algorithm that uses both reference and background databases, and cleans reads based on the log-odds ratios of finding MEMs in the background database compared to the reference database, an approach that has proven very fast and accurate.
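
The log-odds idea behind read cleaning can be illustrated with a minimal Python sketch. It is not MSGseqTK's actual implementation: the function names, the per-read log-likelihood inputs, and the threshold of 0 are all assumptions made for illustration, and the real tool derives its scores from MEM searches against both databases.

```python
import math

def log_odds_background_vs_reference(loglik_bg, loglik_ref):
    """Natural-log odds that a read's MEMs come from the background
    (host) database rather than the reference (microbial) database.
    A read with no MEM in a database is represented by -inf (assumption)."""
    if math.isinf(loglik_bg) and math.isinf(loglik_ref):
        return 0.0  # no evidence either way in this sketch
    return loglik_bg - loglik_ref

def clean_reads(scores, threshold=0.0):
    """Return the names of reads to keep: those whose log-odds of being
    background do not exceed the (hypothetical) threshold."""
    kept = []
    for name, (loglik_bg, loglik_ref) in scores.items():
        if log_odds_background_vs_reference(loglik_bg, loglik_ref) <= threshold:
            kept.append(name)
    return kept

# Hypothetical per-read log-likelihoods: (background, reference)
example_scores = {
    "read1": (-2.0, -30.0),           # fits the host far better: removed
    "read2": (-40.0, -3.0),           # fits the microbes far better: kept
    "read3": (-math.inf, -5.0),       # no host MEM at all: kept
    "read4": (-math.inf, -math.inf),  # no MEMs anywhere: kept in this sketch
}

if __name__ == "__main__":
    print(clean_reads(example_scores))
```

Working in log space keeps the comparison numerically stable when the underlying probabilities are very small, and raising the threshold makes the filter more conservative about discarding reads.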

Installation

MSGseqTK is developed using GNU Autotools and can be installed with the traditional sequence of `autoreconf`, `./configure`, `make`, `make check` (optional), and `make install`. See the MSGseqTK GitHub home page for details.

Core programs

  • msgseqtk-build is used to build a metagenome database. It converts metagenome sequences in FASTA format into an MSGseqTK database. It uses an incremental algorithm to build the database block by block, and needs about 2N of maximum RAM provided the block size is not too large (N is the bi-directional size of the metagenome). It can also be used to update an existing metagenome database.
  • msgseqtk-anno is used to annotate a pre-built MSGseqTK database. It takes an MSGseqTK database and optionally a genome annotation list, and outputs a GFF3 annotation for the entire metagenome. The GFF3 annotation file contains new features at the "genome" and "metagenome" levels, which can later be used by third-party tools to count genome-level abundance (e.g. featureCounts from the Subread package).
  • msgseqtk-inspect is used to validate and inspect a pre-built MSGseqTK database. It can optionally output a genome name list, and the genomic sequences underlying the given database.
  • msgseqtk-mergedb is used to merge multiple pre-built MSGseqTK databases. It takes two or more database names and merges them into a new database. It uses the same incremental algorithm as msgseqtk-build, and has similar maximum RAM usage limits.
  • msgseqtk-align is used to map NGS reads, either single-end (SE) or paired-end (PE), to a pre-built MSGseqTK metagenome database. It takes a database name and 1 (SE) or 2 (PE) sequence files in FASTQ/FASTA format, and outputs binary BAM or text SAM alignment files. It implements our published Bayesian framework-based algorithm to achieve accurate read mapping (see our other tool AlignerBoost), and depends on the HTSlib library from the Samtools project to write BAM files through direct API calls.
  • msgseqtk-clean is used to clean NGS reads of potential host contamination. It takes a reference database (e.g. microbial), a background database (e.g. human/host) and input SE/PE sequences in FASTQ/FASTA format, and outputs the cleaned SE/PE sequences in FASTQ/FASTA format. It optionally outputs a TSV table with detailed information on how reads were assigned to the reference and background databases.
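
To make the mapQ values reported by msgseqtk-align more concrete, here is a minimal Python sketch of how a posterior probability can be turned into a Phred-scaled mapping quality, following the SAM convention mapQ = -10 * log10(P(mapping is wrong)). It is only an illustration under stated assumptions: the uniform prior over candidate loci, the cap of 60, and the function name `mapq_values` are hypothetical and are not taken from MSGseqTK or AlignerBoost.

```python
import math

def mapq_values(log_likelihoods, max_mapq=60):
    """Given natural-log alignment likelihoods for all candidate loci of
    one read, return a Phred-scaled mapQ for each candidate.

    Assumes a uniform prior over candidates, so each posterior is the
    candidate's likelihood divided by the sum over all candidates."""
    # Subtract the maximum before exponentiating to avoid underflow.
    best = max(log_likelihoods)
    weights = [math.exp(ll - best) for ll in log_likelihoods]
    total = sum(weights)

    mapqs = []
    for w in weights:
        # Probability that this candidate is NOT the true origin.
        error = (total - w) / total
        if error <= 0.0:
            mapqs.append(max_mapq)  # unique hit: cap at the assumed maximum
        else:
            mapqs.append(min(max_mapq, round(-10.0 * math.log10(error))))
    return mapqs

if __name__ == "__main__":
    # One strong candidate and one that is 100 times less likely.
    print(mapq_values([0.0, math.log(0.01)]))
```

Computing the error as the summed weight of the competing candidates, rather than as one minus the posterior, avoids loss of precision when the best hit is overwhelmingly likely.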

Download

The source code of MSGseqTK (written in C++11) and pre-compiled binaries are freely available from the MSGseqTK GitHub home page.

How to use

Please see the User's Manual for how to use MSGseqTK for your shotgun metagenomics NGS data analysis.

Pre-built databases

You need to build an MSGseqTK database for each metagenome before using its tools. You can build your own database using msgseqtk-build, or alternatively download the pre-built databases below. All microbial reference genomes are downloaded and updated regularly from the NCBI RefSeq Microbial Database.

  • Bacteria_refrep_chrcomp: All Bacteria reference+representative genomes with complete or chromosome-level assembly sequences. The metagenomics GFF annotation file contains all the GFF annotations from NCBI RefSeq for all its bacterial genomes.
  • Archaea_refrep_chrcomp: All Archaea reference+representative genomes with complete or chromosome-level assembly sequences. The metagenomics GFF annotation file contains all the GFF annotations from NCBI RefSeq for all its archaeal genomes.
  • Fungi_refrep_chrcomp: All Fungi reference+representative genomes with complete or chromosome-level assembly sequences. The metagenomics GFF annotation file contains all the GFF annotations from NCBI RefSeq for all its fungal genomes.
  • Viruses_refrep_chrcomp: All Viruses reference+representative genomes with complete or chromosome-level assembly sequences. The metagenomics GFF annotation file contains all the GFF annotations from NCBI RefSeq for all its viral genomes.
  • Microbial_refrep_chrcomp: Merged microbial database of the above four databases. Recommended for microbiome studies with samples potentially from all microbial kingdoms. The metagenomics GFF annotation file contains all the GFF annotations from NCBI RefSeq for all its microbial genomes.
  • Protists_refrep_chrcomp: All (eukaryotic) Protists reference+representative genomes with complete or chromosome-level assembly sequences. The metagenomics GFF annotation file contains all the GFF annotations from NCBI RefSeq for all its protist genomes.
  • hg38: Human reference genome built from the UCSC hg38/NCBI GRCh38 assembly. Recommended for background read cleaning if the metagenomics reads are from a human host. The metagenomics GFF annotation file contains the GFF annotations from GENCODE human genome v27.
  • mm10: Mouse reference genome built from the UCSC mm10/NCBI GRCm38 assembly. Recommended for background read cleaning if the metagenomics reads are from a mouse host. The metagenomics GFF annotation file contains the GFF annotations from GENCODE mouse genome vM18.

Citations

 

Contact us

Please contact Qi Zheng or Elizabeth Grice with any questions.