SqueezeMeta: a fully automated metagenomics pipeline, from reads to bins
Version 1.0beta, April 2019
Javier Tamames and Fernando Puente-Sánchez
CNB-CSIC, Madrid, Spain

Index

1. What is SqueezeMeta?
2. Installation
3. Downloading or building databases
4. Execution, restart and running scripts
5. Using external databases for functional annotation
6. Extra-sensitive detection of ORFs
7. Testing SqueezeMeta
8. Working with Oxford Nanopore MinION and PacBio reads
9. Working on a low memory environment
10. Setting up the MySQL database
11. Updating SqueezeMeta
12. License and third-party software
13. About
Scripts, files and file format
Other scripts in the utils directory
Explanation of SqueezeMeta algorithms

1. What is SqueezeMeta?

SqueezeMeta is a fully automatic pipeline for metagenomics/metatranscriptomics, covering all steps of the analysis. SqueezeMeta includes multi-metagenome support, allowing the co-assembly of related metagenomes and the retrieval of individual genomes via binning procedures. SqueezeMeta features several unique characteristics:

1. Co-assembly procedure with read mapping for estimation of the abundances of genes in each metagenome
2. Co-assembly of a large number of metagenomes via merging of individual metagenomes
3. Binning and bin checking, for retrieving individual genomes
4. Storage of the results in a database, from which they can be easily exported and shared, and inspected anywhere using a web interface
5. Internal checks for the assembly and binning steps that report the consistency of contigs and bins, making it possible to spot potential chimeras
6. Metatranscriptomic support via mapping of cDNA reads against reference metagenomes

SqueezeMeta can be run in three different modes, depending on the type of multi-metagenome support. These modes are:

• Sequential mode: All samples are treated individually and analysed sequentially. This mode does not include binning.
• Coassembly mode: Reads from all samples are pooled and a single assembly is performed. Then reads from individual samples are mapped to the co-assembly to obtain gene abundances in each sample. Binning methods are then applied to obtain genome bins.
• Merged mode: If many big samples are available, co-assembly could crash because of memory requirements. This mode allows the co-assembly of a very large number of samples, using a procedure inspired by the one used by Benjamin Tully for analysing TARA Oceans data (https://dx.doi.org/10.17504/protocols.io.hfqb3mw). Briefly, samples are assembled individually and the resulting contigs are merged into a single co-assembly. Then the analysis proceeds as in coassembly mode. This is not the recommended procedure (use co-assembly if possible), since the chance of creating chimeric contigs is higher, but it is a viable alternative when standard co-assembly is not possible.
SqueezeMeta uses a combination of custom scripts and external software packages for the different steps of the analysis:

1. Assembly
2. RNA prediction and classification
3. ORF (CDS) prediction
4. Homology searching against taxonomic and functional databases
5. HMMER searching against the Pfam database
6. Taxonomic assignment of genes
7. Functional assignment of genes
8. BlastX on parts of the contigs with no gene prediction or no hits (optional)
9. Taxonomic assignment of contigs, and check for taxonomic disparities
10. Coverage and abundance estimation for genes and contigs
11. Estimation of taxa abundances
12. Estimation of function abundances
13. Merging of previous results to obtain the ORF table
14. Binning with MaxBin
15. Binning with MetaBAT
16. Binning integration with DAS Tool
17. Taxonomic assignment of bins, and check for taxonomic disparities
18. Checking of bins with CheckM
19. Merging of previous results to obtain the bin table
20. Merging of previous results to obtain the contig table
21. Prediction of KEGG and MetaCyc pathways for each bin
22. Final statistics for the run

2. Installation

To install SqueezeMeta, download the latest release from the GitHub repository and uncompress the tarball in a suitable directory. The tarball includes the SqueezeMeta scripts as well as the third-party software redistributed with SqueezeMeta (see section 12). The INSTALL files contain detailed installation instructions, including all the external libraries required to make SqueezeMeta run on a vanilla Ubuntu 14.04 or CentOS 7 (DVD iso) installation.

3. Downloading or building databases

SqueezeMeta uses several databases: GenBank nr for taxonomic assignment, and eggNOG, KEGG and Pfam for functional assignment. The script download_databases.pl can be run to download a pre-formatted version of all the databases required by SqueezeMeta:

<installpath>/SqueezeMeta/scripts/preparing_databases/download_databases.pl <datapath>

where <datapath> is the destination folder. This is the recommended option.

Alternatively, the script make_databases.pl can be run to download from source and format the latest version of the databases:

<installpath>/SqueezeMeta/scripts/preparing_databases/make_databases.pl <datapath>

The databases occupy 200 Gb, but we recommend having at least 350 Gb of free disk space during the building process.

If the SqueezeMeta databases are already built in another location in the system, a different copy of SqueezeMeta can be configured to use them with:

<installpath>/SqueezeMeta/scripts/preparing_databases/configure_nodb.pl <database location>
4. Execution, restart and running scripts

Scripts location
The scripts composing the SqueezeMeta pipeline can be found in the .../SqueezeMeta/scripts directory. We recommend adding it to your $PATH environment variable.

Execution
The command for running SqueezeMeta has the following syntax:

SqueezeMeta.pl -m <mode> -p <project name> -s <samples file> -f <fastq dir>

Arguments

Mandatory parameters
• -m <mode>: Run mode (sequential, coassembly, merged) (REQUIRED)
• -p <project name>: Project name (REQUIRED in coassembly and merged modes)
• -s|-samples <samples file>: Samples file (REQUIRED)
• -f|-seq <fastq dir>: Directory containing the fastq read files (REQUIRED)

Filtering
• --cleaning: Filter reads with Trimmomatic (off by default)
• -cleaning_options [options]: Options for Trimmomatic (default if not specified: LEADING:8 TRAILING:8 SLIDINGWINDOW:10:15 MINLEN:30)

Assembly
• -a [assembler]: Assembler (megahit, spades, canu) (default: megahit)
• -assembly_options [options]: Options for the assembler (refer to the manual of the specified assembler)
• -c|-contiglen <length>: Minimum length of contigs to keep (default: 200)
• -extassembly <file>: Name of a file containing an external assembly provided by the user. The file must contain contigs in fasta format. This overrides the assembly step of SqueezeMeta.

Mapping
• -map <software>: Mapping (aligner) software (bowtie, bwa, minimap2-ont, minimap2-pb, minimap2-sr) (default: bowtie)

Annotation
• --nocog: Skip COG assignment
• --nokegg: Skip KEGG assignment
• --nopfam: Skip Pfam assignment
• -extdb <database file>: List of user-provided databases for functional annotation
• -b|-block-size <size>: Block size for DIAMOND against the nr database. Lower values reduce RAM usage; set it to 3 or below for running on a desktop computer (default: 8)
• --D|--doublepass: Extra-sensitive ORF prediction. First pass looking for genes using gene prediction, second pass using BlastX (off by default)

Binning
• --nobins: Skip all binning
• --nomaxbin: Skip MaxBin binning
• --nometabat: Skip MetaBAT2 binning

Performance
• -t <threads>: Number of threads (default: 12)
• -canumem <Gb>: Memory for canu, in Gb (default: 32)
• --lowmem: Run on less than 16 Gb of RAM (default: no)

Settings for MinION
• --minion: Run on MinION reads (use canu and minimap2)

Information
• -v: Version number
• -h: Help

Example SqueezeMeta call:

SqueezeMeta.pl -m coassembly -p test -s test.samples -f mydir --nopfam -miniden 60

This will create a project "test" for co-assembling the samples specified in the file "test.samples", using a minimum identity of 60% for taxonomic and functional assignment, and skipping Pfam annotation. The -p parameter indicates the name under which all results and data files will be saved. It is not required in sequential mode, where the name is taken from the samples file instead. The -f parameter indicates the directory where the read files specified in the samples file are stored.

The samples file
The samples file specifies the samples, the names of their corresponding raw read files, and the sequencing pair represented in those files, separated by tabs. It has the format:

<sample name>   <read file name>   <pair1|pair2>   [noassembly]

An example would be:

Sample1   readfileA_1.fastq      pair1
Sample1   readfileA_2.fastq      pair2
Sample1   readfileB_1.fastq      pair1
Sample1   readfileB_2.fastq      pair2
Sample2   readfileC_1.fastq.gz   pair1
Sample2   readfileC_2.fastq.gz   pair2
Sample3   readfileD_1.fastq      pair1   noassembly
Sample3   readfileD_2.fastq      pair2   noassembly

The first column indicates the sample id (this will be the project name in sequential mode), the second contains the file names of the sequences, and the third specifies the pair number of the reads. A fourth optional column can take the "noassembly" value, indicating that this sample must not be assembled with the rest (but will still be mapped against the assembly to get abundances). This is the case for RNAseq reads, which could hamper the assembly but which we want mapped in order to get transcript abundances for the genes in the assembly. Notice that paired reads are expected, and that a sample can have more than one set of paired reads. The sequence files can be in either fastq or fasta format, and can be gzipped.
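Because a malformed samples file is a common cause of failed runs, it can be worth sanity-checking before launching SqueezeMeta. Below is a minimal, hypothetical validation sketch in Python (not part of the SqueezeMeta distribution); it only assumes the tab-separated layout described above:

# check_samples.py -- hypothetical helper, not part of SqueezeMeta.
# Validates a samples file against the layout described above:
# <sample> <read file> <pair1|pair2> [noassembly]
import sys

def check_samples(path):
    errors = []
    with open(path) as fh:
        for n, line in enumerate(fh, 1):
            if not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                errors.append(f"line {n}: expected at least 3 tab-separated fields")
                continue
            pair = fields[2]
            if pair not in ("pair1", "pair2"):
                errors.append(f"line {n}: third column must be pair1 or pair2, got '{pair}'")
            if len(fields) > 3 and fields[3] != "noassembly":
                errors.append(f"line {n}: fourth column can only be 'noassembly'")
    return errors

if __name__ == "__main__":
    problems = check_samples(sys.argv[1])
    print("\n".join(problems) if problems else "samples file looks OK")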
The parameters file
The file parameters.pl, stored in the scripts directory, sets several parameters used by SqueezeMeta's scripts. You can change them to adjust the behaviour of the pipeline.

Restart
Any interrupted SqueezeMeta run can be restarted using the program restart.pl. It has the syntax:

restart.pl <project name>

This command must be issued in the directory above the project, and will restart the run of that project by reading the progress.txt file to find out the point where the run stopped. Alternatively, the user can specify the step at which to restart the analysis, using the -step option (refer to the scripts section for the list of steps):

restart.pl <project name> -step <step number>

Running scripts
Also, any individual script of the pipeline can be run in the directory above the project, using the same syntax:

<script> <project name>

(for instance, 04.rundiamond.pl <project name> to repeat the DIAMOND run for the project)

5. Using external databases for functional annotation

Version 1.0 of SqueezeMeta makes it possible to use one or several external (user-provided) databases for functional annotation. This is invoked using the --extdb option. The argument must be a file (the external database file) with the following format (tab-separated fields):

<database name>   <path to database>   <path to annotations (optional)>

For example, we can create the file mydb.list containing information for two databases:

DB1   /path/to/my/database1   /path/to/annotations/database1
DB2   /path/to/my/database2   /path/to/annotations/database2

and give it to SqueezeMeta using --extdb mydb.list.

Each database must be a fasta file of amino acid sequences, in which each sequence has a header in the format:

>ID|...|Function

where ID can be any identifier for the entry, and Function is the associated function that will be used for annotation. For example, a KEGG entry could look like:

>WP_002852319.1|K02835
MKEFILAKNEIKTMLQIMPKEGVVLQGDLASKTSLVQAWVKFLVLGLDRVDSTPTFSTQKYE...

You can put anything you want between the first and last pipe, because these are the only fields that matter. For instance, the previous entry could also be:

>WP_002852319.1|KEGGDB|27/02/2019|K02835
MKEFILAKNEIKTMLQIMPKEGVVLQGDLASKTSLVQAWVKFLVLGLDRVDSTPTFSTQKYE...

Just remember not to put blank spaces in the header, because they act as field separators in the fasta format.

The database must be formatted for DIAMOND usage. To avoid compatibility issues between different versions of DIAMOND, it is advisable to use the DIAMOND binary shipped with SqueezeMeta, which is placed in the bin directory of the SqueezeMeta distribution. You can do the formatting with the command:

/path/to/SqueezeMeta/bin/diamond makedb -d /path/to/my/formatted/database --in /path/to/my/fastadatabase

For each database, you can OPTIONALLY provide a file with functional annotations, such as the name of the enzyme or any other information you want. Its location must be specified in the last field of the external database file. It must have only two columns separated by a tab: the first with the function, the second with the additional information. For instance:

K02835   peptide chain release factor 1

The ORF table will show both the database ID and the associated annotation for each external database you provided.
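Since header mistakes (such as stray spaces) are easy to make, the following small, hypothetical Python sketch (not part of SqueezeMeta; the file names are placeholders) checks the >ID|...|Function header convention of a custom database and loads the optional two-column annotation file:

# check_extdb.py -- hypothetical helper, not part of SqueezeMeta.
def check_fasta_headers(fasta_path):
    """Flag headers that break the >ID|...|Function convention."""
    bad = []
    with open(fasta_path) as fh:
        for line in fh:
            if not line.startswith(">"):
                continue
            header = line[1:].rstrip("\n")
            if " " in header:
                bad.append(header)      # spaces act as fasta field separators
            elif "|" not in header:
                bad.append(header)      # need at least >ID|Function
    return bad

def load_annotations(ann_path):
    """Load the optional tab-separated <function> <description> file."""
    ann = {}
    with open(ann_path) as fh:
        for line in fh:
            if "\t" not in line:
                continue
            fun, desc = line.rstrip("\n").split("\t", 1)
            ann[fun] = desc
    return ann

print(check_fasta_headers("mydb.fasta"))                  # placeholder file names
print(load_annotations("mydb.annotations").get("K02835"))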
6. Extra-sensitive detection of ORFs

Version 1.0 implements the -D (doublepass) option, which attempts to provide more sensitive ORF detection by combining the Prodigal prediction with a BlastX search on the parts of the contigs where no ORFs were predicted, or where the predicted ORFs did not match anything in the taxonomic and functional databases. The procedure starts after the usual taxonomic and functional annotation. It masks the parts of the contigs containing a predicted ORF with some (taxonomic or functional) annotation. The remaining sequence corresponds to zones with no ORF prediction or to orphan genes (no hits); the former could correspond to missed ORFs, the latter to wrongly predicted ORFs. A DIAMOND BlastX search is then run only on these zones, using the same databases. The resulting hits are added as newly predicted ORFs, and the pipeline continues taking these new ORFs into account.

The pros: This procedure is able to detect missed genes and to correct errors in gene prediction (for example, those derived from frameshifts). For prokaryotic metagenomes, we estimate a gain of 2-3% in the number of ORFs. The method is especially useful when you suspect that gene prediction may underperform, for instance when eukaryotes or viruses are present: Prodigal is a prokaryotic gene predictor, and its behaviour for other kingdoms is uncertain. In these cases, the gain can be higher than for prokaryotes.

The con: Since it has to do an additional DIAMOND run (and a six-frame BlastX at that), it slows down the analysis, especially for big and/or many metagenomes.

7. Testing SqueezeMeta

The download_databases.pl and make_databases.pl scripts also download two datasets for testing that the program is running correctly. Assuming either was run with <datapath> as its target, the test run can be executed with:

cd <datapath>/test
SqueezeMeta.pl -m coassembly -p Hadza -s test.samples -f raw

Alternatively, -m sequential or -m merged can be used.

8. Working with Oxford Nanopore MinION and PacBio reads

Since version 0.3.0, SqueezeMeta is able to seamlessly work with single-end reads. In order to obtain better mappings of MinION and PacBio reads against the assembly, we advise using minimap2 for read counting, by including the -map minimap2-ont (MinION) or -map minimap2-pb (PacBio) flag when calling SqueezeMeta. We also include the canu assembler, which is specially tailored to work with long, noisy reads; it can be selected by including the -a canu flag when calling SqueezeMeta. As a shortcut, the --minion flag will use both canu and minimap2 for Oxford Nanopore MinION reads.

9. Working on a low memory environment

In our experience, assembly and the DIAMOND search against the nr database are the most memory-hungry parts of the pipeline. DIAMOND memory usage can be controlled via the -b parameter (DIAMOND will consume ~5*b Gb of memory). Assembly memory usage is trickier, as memory requirements increase with the number of reads in a sample. We have managed to run SqueezeMeta with as many as 42M 2x100 Illumina HiSeq pairs on a virtual machine with only 16 Gb of memory. Conceivably, larger samples could be split and assembled in chunks using the merged mode. We include the shortcut flag --lowmem, which will set the DIAMOND block size to 3 and canu memory usage to 15 Gb. This is enough to make SqueezeMeta run on 16 Gb of memory, and allows the in situ analysis of Oxford Nanopore MinION reads. Under such computational limitations, we have been able to co-assemble and analyze 10 MinION metagenomes (taken from SRA project SRP163045) in less than 4 hours.

10. Setting up the MySQL database

SqueezeMeta includes a built-in MySQL database that can be queried via a web-based interface, in order to facilitate the exploration of metagenomic results. Code and installation instructions can be found at https://github.com/jtamames/SqueezeMdb.

11. Updating SqueezeMeta

Assuming your databases are not inside the SqueezeMeta directory, just remove it, download the new version and configure it with:

<installpath>/SqueezeMeta/scripts/preparing_databases/configure_nodb.pl <database location>

12. License and third-party software

SqueezeMeta is distributed under a GPL-3 license.
Additionally, SqueezeMeta redistributes the following third-party software:

• Megahit
• SPAdes
• canu
• prinseq
• prodigal
• cd-hit
• AMOS
• MUMmer
• HMMER
• DIAMOND
• bwa
• minimap2
• bowtie2
• barrnap
• MaxBin
• MetaBAT
• DAS Tool
• CheckM
• MinPath
• RDP classifier
• pullseq

13. About

SqueezeMeta is developed by Javier Tamames and Fernando Puente-Sánchez. Feel free to contact us for support (jtamames@cnb.csic.es, fpuente@cnb.csic.es).

Scripts, files and file format

Files marked in blue (in the original PDF) are placed in the "results" directory; in green, in the "intermediate" directory; in orange, in the "ext_tables" directory.

Step 1: Assembly
Script: 01.run_assembly.pl (or 01.run_assembly_merged.pl)
Files produced:
• 01.<project>.fasta: Fasta file containing the contigs resulting from the assembly
• 01.<project>.lon: Lengths of the contigs (merged mode will also produce a .fasta and a .lon file for each sample)
• 01.<project>.stats: Some statistics on the assembly (N50, N90, number of reads, etc.)

Step 2: RNA finding
Script: 02.run_barrnap.pl
Files produced:
• 02.<project>.rnas: Fasta file containing all RNAs found
• 02.<project>.16S: Assignment (RDP classifier) for the 16S rRNA sequences found
• 02.<project>.maskedrna.fasta: Fasta file containing the contigs resulting from the assembly, masking the positions where an RNA was found

Step 3: Gene prediction
Script: 03.run_prodigal.pl
Files produced:
• 03.<project>.fna: Nucleotide sequences for the predicted ORFs
• 03.<project>.faa: Amino acid sequences for the predicted ORFs
• 03.<project>.gff: Features and positions in contigs for each of the predicted genes (moves to intermediate if the -D option is selected)

Step 4: Homology searching against taxonomic (nr) and functional (COG, KEGG) databases
Script: 04.rundiamond.pl
Files produced:
• 04.<project>.nr.diamond: Result of the homology search against nr
• 04.<project>.eggnog.diamond: Result of the homology search against COG
• 04.<project>.kegg.diamond: Result of the homology search against KEGG
(nr and COG databases are updated regularly. The KEGG database requires licensing, therefore we use the last publicly available version)
• 04.<project>.optdb.diamond: Result of the homology search against the optional database(s) provided

Step 5: HMM search against the Pfam database
Script: 05.run_hmmer.pl
Files produced:
• 05.<project>.pfam.hmm: Results of the HMM search (the Pfam database is updated regularly)

Step 6: Taxonomic assignment
Script: 06.lca.pl
Files produced:
• 06.<project>.fun3.tax.wranks: Taxonomic assignment for each ORF, including taxonomic ranks (moves to intermediate if the -D option is selected)
• 06.<project>.fun3.tax.noidfilter.wranks: Same as above, but the assignment was done without considering identity filters (refer to the explanation of the LCA algorithm in the algorithms section)

Step 7: Functional assignment
Script: 07.fun3assign.pl
Files produced:
• 07.<project>.fun3.cog: COG functional assignment for each ORF (moves to intermediate if the -D option is selected)
• 07.<project>.fun3.kegg: KEGG functional assignment for each ORF (moves to intermediate if the -D option is selected)
• 07.<project>.fun3.optdb: Functional assignment for each ORF against the optional database provided (moves to intermediate if the -D option is selected)
Format of these files:
§ Column 1: Name of the ORF
§ Column 2: Best hit assignment
§ Column 3: Best average assignment (refer to the explanation of the fun3 algorithm)
• 07.<project>.pfam: Pfam functional assignment for each ORF
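Since the fun3 files are plain tab-separated tables with the three-column layout described above, they can be summarized with a few lines of code. A hypothetical Python sketch (the input file name is a placeholder, and any header lines are assumed to start with "#") counting how many ORFs are supported by both classification methods:

# Hypothetical snippet, not part of SqueezeMeta: summarize a fun3 file
# with the layout described above (ORF, best hit, best average).
besthit_only = both = 0
with open("07.project.fun3.cog") as fh:        # placeholder file name
    for line in fh:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 2:
            continue                           # skip headers / unassigned ORFs
        besthit = fields[1]
        bestaver = fields[2] if len(fields) > 2 else ""
        if bestaver and bestaver == besthit:
            both += 1                          # supported by both methods
        else:
            besthit_only += 1
print(both, "ORFs supported by best hit and best average;",
      besthit_only, "by best hit only")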
Step 8: BlastX on parts of the contigs without gene prediction or without hits
Script: 08.blastx.pl
Files produced:
• 08.<project>.gff: Features and positions in contigs for the mix of Prodigal and BlastX ORFs
• 08.<project>.blastx.fna: Nucleotide sequences for the BlastX ORFs
• 08.<project>.fun3.tax.wranks: Taxonomic assignment for the mix of Prodigal and BlastX ORFs, including taxonomic ranks
• 08.<project>.fun3.cog: COG functional assignment for the mix of Prodigal and BlastX ORFs
• 08.<project>.fun3.kegg: KEGG functional assignment for the mix of Prodigal and BlastX ORFs
• 08.<project>.fun3.opt_db: Functional assignment for the mix of Prodigal and BlastX ORFs, against the optional database provided
The format of these last three files is the same as above (step 7).

Step 9: Contig assignment
Script: 09.summarycontigs3.pl
Files produced:
• 09.<project>.contiglog: Assignment of contigs based on the annotations of their ORFs
Format of the file:
§ Column 1: Name of the contig
§ Column 2: Taxonomic assignment, with ranks
§ Column 3: Lowest rank of the assignment
§ Column 4: Disparity value (refer to the algorithms section)
§ Column 5: Number of genes in the contig
For detailed information on the algorithm, please refer to the algorithms section at the end of this manual.

Step 10: Mapping of reads to contigs and calculation of abundance measures
Script: 10.mapsamples.pl
Files produced:
• 10.<project>.mapcount: Several measures regarding the mapping of reads to ORFs
Format of the file:
§ Column 1: ORF name
§ Column 2: Length of the ORF
§ Column 3: Number of reads mapped to that ORF
§ Column 4: Number of bases mapped to that ORF
§ Column 5: RPKM value for the ORF (Reads mapped × 10^6 / (ORF length × Total reads))
§ Column 6: Coverage value for the ORF (Bases mapped / ORF length)
§ Column 7: TPM value for the ORF ((Reads mapped / ORF length) × 10^6 / Σ(Reads mapped / ORF length))
§ Column 8: Sample to which these abundance values correspond
• 10.<project>.contigcov: Several measures regarding the mapping of reads to contigs
Format of the file:
§ Column 1: Contig name
§ Column 2: Coverage value for the contig (Bases mapped / contig length)
§ Column 3: RPKM value for the contig (Reads mapped × 10^6 / (contig length × Total reads))
§ Column 4: TPM value for the contig ((Reads mapped / contig length) × 10^6 / Σ(Reads mapped / contig length))
§ Column 5: Length of the contig
§ Column 6: Number of reads mapped to that contig
§ Column 7: Number of bases mapped to that contig
§ Column 8: Sample to which these abundance values correspond
• 10.<project>.mappingstat: Mapping statistics of reads per sample
Format of the file:
§ Column 1: Sample name
§ Column 2: Total reads for the sample
§ Column 3: Mapped reads
§ Column 4: Percentage of mapped reads
§ Column 5: Total bases for the sample
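To make these abundance measures concrete, the following sketch (illustrative only; the lengths, read counts and totals are made up) computes coverage, RPKM and TPM for a toy set of ORFs, following the formulas as defined above:

# Hypothetical illustration of the abundance measures defined in step 10.
orfs = {             # name: (length in bp, reads mapped, bases mapped)
    "ORF1": (900, 150, 13500),
    "ORF2": (300, 30, 2700),
    "ORF3": (1500, 120, 10800),
}
total_reads = 1_000_000                        # total reads in the sample

rpk_sum = sum(reads / length for length, reads, _ in orfs.values())
for name, (length, reads, bases) in orfs.items():
    coverage = bases / length                  # bases mapped / ORF length
    rpkm = reads * 1e6 / (length * total_reads)      # as defined in step 10
    tpm = (reads / length) * 1e6 / rpk_sum           # per-ORF rate over summed rates
    print(f"{name}\tcov={coverage:.1f}\tRPKM={rpkm:.3f}\tTPM={tpm:.1f}")

Note that, by construction, the TPM values of all ORFs in a sample add up to 10^6, which makes them directly comparable across samples.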
Step 11: Calculation of the abundance of all taxa
Script: 11.mcount.pl
Files produced:
• 11.<project>.mcount
Format of the file:
§ Column 1: Taxonomic rank of the taxon
§ Column 2: Taxon
§ Column 3: Accumulated contig size: sum of the lengths of all contigs for that taxon
§ Column 4 (and all even columns from this one): Number of reads mapping to the taxon in the corresponding sample
§ Column 5 (and all odd columns from this one): Number of bases mapping to the taxon in the corresponding sample

Step 12: Calculation of the abundance of all functions
Script: 12.funcover.pl
Files produced:
• 12.<project>.cog.stamp: COG function table for STAMP (http://kiwi.cs.dal.ca/Software/STAMP)
Format of the file:
§ Column 1: Functional class of the COG
§ Column 2: COG id and function name
§ Column 3 and above: Abundance of reads for that COG in the corresponding sample
• 12.<project>.kegg.stamp: KEGG function table for STAMP
Format of the file:
§ Column 1: KEGG id and function name
§ Column 2 and above: Abundance of reads for that KEGG in the corresponding sample
• 12.<project>.cog.funcover: Several measures of the abundance and distribution of each COG
Format of the file:
§ Column 1: COG id
§ Column 2: Sample name
§ Column 3: Number of different ORFs of this function in the corresponding sample (copy number)
§ Column 4: Sum of the lengths of all ORFs of this function in the corresponding sample (total length)
§ Column 5: Sum of the bases mapped to all ORFs of this function in the corresponding sample (total bases)
§ Column 6: Coverage of the function (total bases / total length)
§ Column 7: Coverage normalized by sample size (total bases × 10^6 / (total length × total number of bases in the sample))
§ Column 8: Coverage normalized by sample size, per copy (total bases × 10^6 × copy number / (total length × total number of bases in the sample))
§ Column 9: TPM for the function (Reads mapped × 10^6 / Σ(Reads mapped / Σ contig length for the function))
§ Column 10: Number of different taxa per rank (k: kingdom; p: phylum; c: class; o: order; f: family; g: genus; s: species) in which this COG has been found
§ Column 11: Function of the COG
• 12.<project>.kegg.funcover: Several measures of the abundance and distribution of each KEGG
Format of the file: Same as the previous one, but replacing COGs by KEGGs
In addition, .funcover and .stamp files will be created for the user-provided databases specified via the --extdb argument.

Step 13: Creation of the ORF table
Script: 13.mergeannot2.pl
File produced:
• 13.<project>.orftable
Format of the file:
§ Column 1: ORF name
§ Column 2: Contig name
§ Column 3: Molecule (CDS or RNA)
§ Column 4: Method of ORF prediction (prodigal, barrnap, blastx)
§ Column 5: Length of the ORF (nucleotides)
§ Column 6: Length of the ORF (amino acids)
§ Column 7: GC percentage for the ORF
§ Column 8: Functional name of the ORF
§ Column 9: Taxonomy for the ORF
§ Column 10: KEGG id for the ORF (a * sign here means that the functional assignment was supported by both the best hit and best average scores, and is therefore more reliable; otherwise, the assignment was done using just the best hit, and there is evidence of a conflicting annotation)
§ Column 11: KEGG function
§ Column 12: KEGG functional class
§ Column 13: COG id for the ORF (the * sign has the same meaning as for column 10)
§ Column 14: COG function
§ Column 15: COG functional class
§ Column 16: Function in the external database provided
§ Column 17: Functional class or associated information provided for the external database
(if more than one external database was provided, all of them will be shown here)
§ Column 18: Pfam annotation
§ Column 19 and beyond: TPM, coverage, read count and base count for the ORF in the different samples

Step 14: Binning with MaxBin
Script: 14.bin_maxbin.pl
Files produced:
• Directory maxbin, containing fasta files with the contigs corresponding to each bin

Step 15: Binning with MetaBAT2
Script: 15.bin_metabat2.pl
Files produced:
• Directory metabat, containing fasta files with the contigs corresponding to each bin

Step 16: Merging bins with DAS Tool
Script: 16.dastool.pl
Files produced:
• Directory DAS, containing fasta files with the contigs corresponding to each bin

Step 17: Taxonomic assignment of bins
Script: 17.addtax2.pl
Files produced:
• .tax files for each fasta file in the DAS directory (or the binning directory)
• 17.<project>.bintax
Format of the file:
§ Column 1: Binning method
§ Column 2: Name of the bin
§ Column 3: Taxonomic assignment for the bin, with ranks
§ Column 4: Size of the bin (accumulated sum of contig lengths)
§ Column 5: Disparity of the bin (refer to the algorithms section)
For detailed information on the algorithm, please refer to the algorithms section at the end of this manual.

Step 18: Bin assessment with CheckM
Script: 18.checkM_batch.pl
File produced:
• 18.<project>.checkM
Format of the file: Concatenated CheckM output for each bin

Step 19: Creation of the bin table
Script: 19.getbins.pl
Files produced:
• 19.<project>.bincov: Coverage and RPKM values for each bin
Format of the file:
§ Column 1: Bin name
§ Column 2: Binning method
§ Column 3: Coverage of the bin in the corresponding sample (sum of bases from reads in the sample mapped to contigs in the bin / sum of lengths of the contigs in the bin)
§ Column 4: RPKM for the bin in the corresponding sample (sum of reads from the sample mapping to contigs in the bin × 10^6 / (sum of lengths of the contigs in the bin × total number of reads))
§ Column 5: Sample name
• 19.<project>.bintable: Compilation of all data for bins
Format of the file:
§ Column 1: Bin name
§ Column 2: Binning method
§ Column 3: Taxonomic annotation (from the annotations of the contigs)
§ Column 4: Taxonomy for the 16S rRNAs of the bin (if any)
§ Column 5: Bin size (sum of the lengths of the contigs)
§ Column 6: GC percentage for the bin
§ Column 7: Number of contigs in the bin
§ Column 8: Disparity of the bin
§ Column 9: Completeness of the bin (CheckM)
§ Column 10: Contamination of the bin (CheckM)
§ Column 11: Strain heterogeneity of the bin (CheckM)
§ Column 12 and beyond: Coverage and RPKM values for the bin in each sample

Step 20: Creation of the contig table
Script: 20.getcontigs.pl
Files produced:
• 20.<project>.contigsinbins: List of contigs and their corresponding bins
• 20.<project>.contigtable: Compilation of data for contigs
Format of the file:
§ Column 1: Contig name
§ Column 2: Taxonomic annotation for the contig (from the annotations of the ORFs)
§ Column 3: Disparity of the contig
§ Column 4: GC percentage for the contig
§ Column 5: Contig length
§ Column 6: Number of genes in the contig
§ Column 7: Bin to which the contig belongs (if any)
§ Column 8 and beyond: Coverage, RPKM and number of mapped reads for the contig in each sample
Step 21: Prediction of pathway presence in bins using MinPath
Script: 21.minpath.pl
Files produced:
• 21.<project>.kegg.pathways: Prediction of KEGG pathways in bins
Format of the file:
§ Column 1: Bin name
§ Column 2: Taxonomic annotation for the bin
§ Column 3: Number of KEGG pathways found
§ Column 4 and beyond: NF indicates that the pathway was not predicted; a number means that the pathway was predicted to be present, and corresponds to the number of enzymes of that pathway that were found
• 21.<project>.metacyc.pathways: Prediction of MetaCyc pathways in bins
Format of the file: Same as for KEGG, but using MetaCyc pathways

Step 22: Final statistics for the run
Script: 22.stats.pl
File produced:
• 22.<project>.stats: Several statistics regarding ORFs, contigs and bins

Other scripts in the utils directory

The utils directory contains several accessory scripts that are not directly related to the main analysis but extend the SqueezeMeta distribution with additional capabilities.

SQM_reads.pl

This script performs taxonomic and functional assignments directly on the reads. This is useful when the assembly is poor, usually because of low sequencing depth, high diversity of the microbiome, or both. One indication that this is happening can be found in the mappingstat file: if the mapping percentages there are low (below 50%), most of your reads are not represented in the assembly, and we can try to classify the reads instead of the genes/contigs. This will probably increase the number of annotations. On the other hand, the annotations can be less precise (we are working with shorter sequences), and you lose the ability to map reads onto an assembly, and thus to compare metagenomes using a common reference. This method is also less suited to analyzing long MinION reads, where more than one gene can be represented per read.

The usage of SQM_reads.pl is very similar to that of SqueezeMeta:

SQM_reads.pl -p <project name> -s <samples file> -f <fastq dir>

Arguments

Mandatory parameters
• -p <project name>: Project name (REQUIRED)
• -s|-samples <samples file>: Samples file (REQUIRED)
• -f|-seq <fastq dir>: Directory containing the fastq read files (REQUIRED)

Options
• --nocog: Skip COG assignment
• --nokegg: Skip KEGG assignment
• -t <threads>: Number of threads (default: 12)
• -b|-block-size <size>: Block size for DIAMOND against the nr database. Lower values reduce RAM usage; set it to 3 or below for running on a desktop computer (default: 8)
• -e|-evalue <value>: Maximum e-value for hits in the DIAMOND run; hits above it are discarded (default: 1e-03)
• -miniden <percentage>: Minimum identity percentage for hits in the DIAMOND run; hits below it are discarded (default: 50)

The script runs a DIAMOND BlastX alignment of the reads against the nr, COG and KEGG databases, and assigns taxa and functions using the LCA and fun3 methods, as SqueezeMeta does.
Output

It produces the following files:
• <project>.out.allreads: Taxonomic and functional assignment for each read
Format of the file:
§ Column 1: Sample name
§ Column 2: Read name
§ Column 3: Corresponding taxon
§ Column 4 and beyond: Functional assignments (COG, KEGG)
• <project>.out.mcount: Abundance of all taxa
Format of the file:
§ Column 1: Taxonomic rank of the taxon
§ Column 2: Taxon
§ Column 3: Accumulated read number: number of reads for that taxon in all samples
§ Column 4 and beyond: Number of reads for the taxon in the corresponding sample
• <project>.out.funcog: Abundance of all COG functions
Format of the file:
§ Column 1: COG ID
§ Column 2: Accumulated read number: number of reads for that COG in all samples
§ Column 3 and beyond: Number of reads for the COG in the corresponding sample
§ Next-to-last column: COG function
§ Last column: COG functional class
• <project>.out.funkegg: Abundance of all KEGG functions
Format of the file:
§ Column 1: KEGG ID
§ Column 2: Accumulated read number: number of reads for that KEGG in all samples
§ Column 3 and beyond: Number of reads for the KEGG in the corresponding sample
§ Next-to-last column: KEGG function
§ Last column: KEGG functional class

sqm2itol.pl

This program generates the files for creating a radial plot of abundances using iTOL (https://itol.embl.de/), such as the ones described below, taken from the SqueezeMeta paper:

Figure 1: Taxonomic plot. Abundance of bins in the diverse samples. Bins were compared with the CompareM software (https://github.com/dparks1134/CompareM) to estimate their reciprocal similarities. The distances calculated between the bins were used to create a phylogenetic tree illustrating their relationships, shown in the inner part of the figure. The branches corresponding to the four most abundant phyla (Firmicutes, Bacteroidetes, Proteobacteria and Spirochaetes) are colored. Bins were named with their id number and original genus, and the labels of the most abundant genera are also colored. The outer circles correspond to the completeness of the bins (green, innermost circle) and the abundance of each bin in each sample (red). Each circle corresponds to a different sample, and the red color intensity corresponds to the bin's abundance in that sample.

Figure 2: Functional plot. Presence of several carbohydrate degradation pathways in bins. The outer circles indicate the percentage of genes from a pathway present in each of the bins. According to that gene profile, MinPath estimates whether or not the pathway is present; only pathways inferred to be present are colored. As in Figure 1, the bin tree is built from a distance matrix of the amino acid identity of orthologous genes, calculated with the CompareM software. The branches of the four most abundant phyla and the labels of the most abundant genera are colored. The picture was created with iTOL.

Usage:

sqm2itol.pl -p <project name>

Arguments
• -p <project name>: Project name, containing a valid SqueezeMeta analysis (REQUIRED)
• -completion [percentage]: Select only bins with completeness above this threshold (default: 30)
• -contamination [percentage]: Select only bins with contamination below this threshold (default: 100)
• -classification [metacyc|kegg]: Functional classification to use (default: metacyc)
• -functions [file]: File containing the names of the functions to be considered (for functional plots).
For example:

arabinose degradation
galactose degradation
glucose degradation

The program will generate several data files that you must upload to iTOL to produce the figure.

make-tables.py

This script generates tabular outputs, suitable for analysis in environments such as R, from a SqueezeMeta run. It will aggregate the abundances of the ORFs assigned to the same feature (be it a given taxon or a given function) and produce tables with features in rows and samples in columns.

Usage:

make-tables.py [options] <project name> <output directory>

Arguments

Mandatory parameters
• <project name>: Project name (REQUIRED)
• <output directory>: Output directory (REQUIRED)

Options
• --ignore_unclassified: Ignore ORFs with no functional classification when aggregating abundances for functional categories (KO, COG, PFAM)

Output
• For each functional classification system (KO, COG, PFAM), the script will produce the following files:
◦ <project>.<classification>.abunds.tsv: raw abundances of each functional category in the different samples
◦ <project>.<classification>.tpm.tsv: normalized (TPM) abundances of each functional category in the different samples. This normalization takes into account both sequencing depth and gene length.
▪ The --ignore_unclassified flag can be used to control whether unclassified ORFs are counted towards the total for normalization.
• For each taxonomic rank (superkingdom, phylum, class, order, family, genus, species), the script will produce the following files:
◦ <project>.<rank>.allfilter.abund.tsv: raw abundances of each taxon of that taxonomic rank in the different samples, applying the identity filters for taxonomic assignment (see the explanation of the LCA algorithm below)
◦ <project>.<rank>.allfilter.percent.tsv: percent abundances of each taxon of that taxonomic rank in the different samples, with the same filtering
◦ <project>.<rank>.prokfilter.abund.tsv: raw abundances of each taxon of that taxonomic rank in the different samples. Identity filters for taxonomic assignment are applied to prokaryotic (bacteria + archaea) ORFs, but not to eukaryotes (see below)
◦ <project>.<rank>.prokfilter.percent.tsv: percent abundances of each taxon of that taxonomic rank in the different samples, with the same filtering
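Since these are plain tab-separated tables with features in rows and samples in columns, they can be loaded directly into R or Python for downstream analysis. A minimal hypothetical example using pandas (the file name is a placeholder built from the naming scheme above):

# Hypothetical example: load a make-tables.py output into pandas.
import pandas as pd

# Features (KOs, COGs, taxa...) in rows, samples in columns.
tpm = pd.read_csv("project.KO.tpm.tsv", sep="\t", index_col=0)  # placeholder name
print(tpm.shape)                        # (number of features, number of samples)
print(tpm.sum(axis=0))                  # per-sample totals
print(tpm.mean(axis=1).nlargest(10))    # ten most abundant features on average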
Details
• By default, SqueezeMeta applies the identity cutoffs of Luo et al. (2014) in order to assign an ORF to a given taxonomic rank (see the explanation of the LCA algorithm below). In our tests, these cutoffs resulted in a very low percentage of annotation for eukaryotic ORFs. To circumvent this issue, the *.prokfilter.* files generated by this script contain the aggregated taxonomic abundances obtained by applying Luo's filters only to Bacteria and Archaea, but not to Eukaryotes.
• SqueezeMeta uses NCBI's nr database for taxonomic annotation, and reports the superkingdom, phylum, class, order, family, genus and species ranks. In some cases, the NCBI taxonomy is missing some intermediate ranks. For example, the NCBI taxonomy for the order Trichomonadida is:
◦ superkingdom: Eukaryota
◦ no rank: Parabasalia
◦ order: Trichomonadida
NCBI does not assign Trichomonadida to any taxa at the class and phylum ranks. For clarity, make-tables.py will indicate this by recycling the highest available taxonomy and appending the "(no <rank> in NCBI)" string. For example, ORFs that can be classified down to the Trichomonadida order (but are unclassified at the family level) will be reported as:
◦ superkingdom: Eukaryota
◦ phylum: Trichomonadida (no phylum in NCBI)
◦ class: Trichomonadida (no class in NCBI)
◦ order: Trichomonadida
◦ family: Unclassified Trichomonadida
◦ genus: Unclassified Trichomonadida
◦ species: Unclassified Trichomonadida
• Some ORFs will have multiple KEGG/COG annotations in the 13.<project>.orftable file. This is because their best hit in the KEGG/COG databases is itself annotated with more than one function. The script will split the abundances of those ORFs between the different functions they have been assigned to.

make-SqueezeMdb-files.py

Generates all the files required for loading a SqueezeMeta project into the web interface (https://github.com/jtamames/SqueezeMdb).

Usage:

make-SqueezeMdb-files.py <project name> <output directory>

Explanation of SqueezeMeta algorithms

The LCA algorithm

We use a Last Common Ancestor (LCA) algorithm to assign taxa to genes. For the amino acid sequence of each gene, DIAMOND (blastp) homology searches are done against the GenBank nr database (which is updated weekly). An e-value cutoff of 1e-03 is set by default. The best hit is obtained, and then we select a range of hits (the valid hits) having at least 80% of the bit-score of the best hit and differing by less than 10% identity from the best hit (these values can be adjusted). The LCA of all these hits is obtained, that is, the lowest taxon common to all hits. This LCA can be found at diverse taxonomic ranks (from phylum to species). We allow some flexibility in the definition of the LCA: a small number of hits belonging to taxa other than the LCA can be tolerated, in order to deal with putative gene transfer events or incorrect annotations in the database. This number is by default 10% of the total number of valid hits, but it can be set by the user. Also, the minimum number of hits to the LCA taxon can be set. An example is shown in the next table:

GenID  Hit ID  Hit taxonomy                                                               Identity  e-value
Gen1   Hit1    Genus: Polaribacter; Family: Flavobacteriaceae; Order: Flavobacteriales    75.2      1e-94
Gen1   Hit2    Genus: Polaribacter; Family: Flavobacteriaceae; Order: Flavobacteriales    71.3      6e-88
Gen1   Hit3    Family: Flavobacteriaceae; Order: Flavobacteriales                         70.4      2e-87
Gen1   Hit4    Genus: Algibacter; Family: Flavobacteriaceae; Order: Flavobacteriales      68.0      2e-83
Gen1   Hit5    Genus: Rhodospirillum; Family: Rhodospirillaceae; Order: Rhodospirillales  60.2      6e-68

In this case, the first four hits are the valid ones; Hit 5 does not meet the identity criterion, since it differs by more than 10% identity from the best hit. The LCA of the four valid hits is Family: Flavobacteriaceae, which would be the reported result.

Our LCA algorithm includes strict identity cut-offs for the different taxonomic ranks, following Luo et al., Nucleic Acids Research 2014, 42:e73. This means that hits must pass a minimum (amino acid) identity level in order to be used for assigning particular taxonomic ranks. These thresholds are 85%, 60%, 55%, 50%, 46%, 42% and 40% for the species, genus, family, order, class, phylum and superkingdom ranks, respectively. Hits below these levels cannot be used to make assignments at the corresponding rank. For instance, a protein will not be assigned to the species level if it has no hits above 85% identity. Also, a protein will remain unclassified if it has no hits above 40% identity. The inclusion of these thresholds guarantees that no assignments are made based on weak, inconclusive hits.
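As an illustration of the rules above, the following simplified sketch selects valid hits and computes the rank-aware LCA. It is not the actual 06.lca.pl code: it omits the flexibility (allowed fraction of outlier hits) and minimum-hits options, and the bit-scores in the demo are invented, since the table above does not list them.

# Simplified sketch of the LCA logic described above; not the actual
# 06.lca.pl implementation.
RANKS = ["species", "genus", "family", "order", "class", "phylum", "superkingdom"]
RANK_CUTOFFS = {"species": 85, "genus": 60, "family": 55, "order": 50,
                "class": 46, "phylum": 42, "superkingdom": 40}

def valid_hits(hits, bitscore_frac=0.8, identity_window=10.0):
    """hits: list of (taxonomy dict, identity, bitscore)."""
    best_bits = max(h[2] for h in hits)
    best_iden = max(h[1] for h in hits)
    return [h for h in hits
            if h[2] >= bitscore_frac * best_bits
            and best_iden - h[1] < identity_window]

def lca(hits):
    """Deepest rank at which all valid hits agree and pass the identity cutoff."""
    for rank in RANKS:
        taxa = [h[0].get(rank) for h in hits]
        if (None not in taxa and len(set(taxa)) == 1
                and all(h[1] >= RANK_CUTOFFS[rank] for h in hits)):
            return rank, taxa[0]
    return None, "Unclassified"

# The example from the table above (bit-scores invented for the demo):
hits = [
    ({"genus": "Polaribacter", "family": "Flavobacteriaceae", "order": "Flavobacteriales"}, 75.2, 300),
    ({"genus": "Polaribacter", "family": "Flavobacteriaceae", "order": "Flavobacteriales"}, 71.3, 280),
    ({"family": "Flavobacteriaceae", "order": "Flavobacteriales"}, 70.4, 275),
    ({"genus": "Algibacter", "family": "Flavobacteriaceae", "order": "Flavobacteriales"}, 68.0, 260),
    ({"genus": "Rhodospirillum", "family": "Rhodospirillaceae", "order": "Rhodospirillales"}, 60.2, 200),
]
print(lca(valid_hits(hits)))   # -> ('family', 'Flavobacteriaceae')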
The fun3 algorithm

fun3 is the algorithm that produces functional assignments (for COGs, KEGG and external databases). It reads the DIAMOND output of the homology search of the metagenomic genes against these databases. The homology search is done with the defined e-value and identity parameters, so that no hits above the maximum e-value or below the minimum identity are reported. Also, partial hits (where the query and the hit align over less than a given percentage of the query, 30% by default) are discarded. The hits that pass the filters can correspond to more than one functional id (for instance, COG or KEGG ID). fun3 provides two types of classification: Best hit is simply the functional id of the highest-scoring hit. Best average also tries to evaluate whether that functional id is significantly better than the rest: it takes the first n hits for each functional id (n is set by the user; the default is 5) and calculates their average bit-score. The gene is assigned to the functional id with the highest average bit-score, provided that this average exceeds that of the second-best functional id by a given percentage (set by the user; 10% by default). This method reports fewer assignments, but it is also more precise, avoiding confusion between closely related protein families. A unique functional assignment, the best hit, is shown in the gene table; there, the functional id is marked with a * symbol when the assignment is also supported by the best average method.
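The best-average rule can be summarized in a short sketch (a simplification, not the actual fun3 implementation; the hit list, IDs and bit-scores are invented):

# Simplified sketch of fun3's "best average" rule described above.
from collections import defaultdict

def best_average(hits, n=5, margin=0.10):
    """hits: list of (functional_id, bitscore), sorted by decreasing bitscore.
    Returns the winning id only if its average bit-score over its first n hits
    exceeds the runner-up's average by the given margin."""
    scores = defaultdict(list)
    for fid, bits in hits:
        if len(scores[fid]) < n:               # keep the first n hits per id
            scores[fid].append(bits)
    avgs = sorted(((sum(v) / len(v), fid) for fid, v in scores.items()),
                  reverse=True)
    if len(avgs) == 1:
        return avgs[0][1]
    (top, fid1), (second, _) = avgs[0], avgs[1]
    return fid1 if top >= second * (1 + margin) else None

hits = [("K02835", 310), ("K02835", 305), ("K01890", 260), ("K02835", 250)]
print(best_average(hits))                      # -> K02835

When no functional id wins by the required margin, the sketch returns None, mirroring the fact that the best average method reports fewer (but more reliable) assignments than the best hit method.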
Doublepass: BlastX on contig gaps

The -D option activates the doublepass procedure, in which regions of the contigs where no ORFs were predicted, or where the predicted ORFs could not be assigned taxonomically and functionally, are queried against the databases using BlastX. This method makes it possible to recover putative ORFs missed by Prodigal, and to correct wrongly predicted ORFs. (A figure in the original PDF illustrates the steps of the doublepass procedure.)

Consensus taxonomic annotation for contigs and bins

The consensus algorithm attempts to obtain a consensus taxonomic annotation for each contig from the annotations of its individual genes. The consensus taxon is the one fulfilling both of the following criteria:

• at least 50% of the genes in the contig are annotated to this taxon, and
• at least 70% of the annotated genes in the contig are annotated to this taxon.

Notice that the first criterion refers to all genes in the contig, whether annotated or not, while the second refers exclusively to annotated genes. As the assignment can be done at different taxonomic ranks, the consensus is the deepest taxon fulfilling the criteria above. For instance, consider the following example for a contig with 6 genes:

Gen1: k_Bacteria;p_Proteobacteria;c_Gamma-Proteobacteria;o_Enterobacteriales;f_Enterobacteriaceae;g_Escherichia;s_Escherichia coli
Gen2: k_Bacteria;p_Proteobacteria;c_Gamma-Proteobacteria;o_Enterobacteriales;f_Enterobacteriaceae;g_Escherichia
Gen3: k_Bacteria;p_Proteobacteria;c_Gamma-Proteobacteria;o_Enterobacteriales;f_Enterobacteriaceae;g_Escherichia
Gen4: k_Bacteria;p_Proteobacteria;c_Gamma-Proteobacteria;o_Enterobacteriales;f_Enterobacteriaceae
Gen5: No hits
Gen6: k_Bacteria;p_Firmicutes

In this case, the contig will be assigned to k_Bacteria;p_Proteobacteria;c_Gamma-Proteobacteria;o_Enterobacteriales;f_Enterobacteriaceae, which is the deepest taxon fulfilling both criteria: 4/6 = 66% of all the genes belong to that taxon (above the 50% cutoff), and 4/5 = 80% of the annotated genes do (above the 70% cutoff). The assignment to genus Escherichia was not made, since only 3/5 = 60% of the annotated genes belong to it, which is below the 70% cutoff. For annotating bins, the procedure is the same, but using the annotations of the corresponding contigs instead of genes.

Disparity calculation

Notice that in the example above, the last part of the contig seems to depart from the common taxonomic origin of the rest. This can be due to misassembly resulting in a chimeric contig, or to other causes such as a recent lateral gene transfer or a wrong annotation of the gene. The disparity index attempts to measure this effect, so that contigs can be flagged accordingly (for instance, we could decide not to trust contigs with high disparity). The disparity index is calculated at the taxonomic rank assigned by the consensus algorithm (family, in the previous example). We compare the assignments at that rank for every pair of genes in the contig, and count the number of agreements and disagreements. If one of the genes has no annotation at that rank, the pair is not counted as an agreement, but it is counted as a disagreement if the higher ranks do not coincide (we assume that if the higher ranks disagree, the lower ranks will too). That is:

Gen1-Gen2: Agree
Gen1-Gen3: Agree
Gen1-Gen4: Agree
Gen1-Gen5: Unknown
Gen1-Gen6: Disagree (at phylum level)
Gen2-Gen3: Agree
Gen2-Gen4: Agree
Gen2-Gen5: Unknown
Gen2-Gen6: Disagree (at phylum level)
Gen3-Gen4: Agree
Gen3-Gen5: Unknown
Gen3-Gen6: Disagree (at phylum level)
Gen4-Gen5: Unknown
Gen4-Gen6: Disagree (at phylum level)
Gen5-Gen6: Unknown

The disparity index is the ratio between the number of disagreements and the total number of comparisons, in this case 4/15 ≈ 0.27. For calculating the disparity of bins, the procedure is the same, just using the annotations of the contigs belonging to the bin instead.
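Both rules are simple enough to reproduce in a few lines. The following sketch (a simplification, not the actual pipeline code) recovers the results of the worked examples above:

# Simplified sketch of the consensus (50%/70%) and disparity rules described above.
RANKS = ["k", "p", "c", "o", "f", "g", "s"]

genes = [  # the 6-gene example above; each gene maps rank -> taxon
    {"k": "Bacteria", "p": "Proteobacteria", "c": "Gamma-Proteobacteria",
     "o": "Enterobacteriales", "f": "Enterobacteriaceae", "g": "Escherichia",
     "s": "Escherichia coli"},
    {"k": "Bacteria", "p": "Proteobacteria", "c": "Gamma-Proteobacteria",
     "o": "Enterobacteriales", "f": "Enterobacteriaceae", "g": "Escherichia"},
    {"k": "Bacteria", "p": "Proteobacteria", "c": "Gamma-Proteobacteria",
     "o": "Enterobacteriales", "f": "Enterobacteriaceae", "g": "Escherichia"},
    {"k": "Bacteria", "p": "Proteobacteria", "c": "Gamma-Proteobacteria",
     "o": "Enterobacteriales", "f": "Enterobacteriaceae"},
    {},                                        # Gen5: no hits
    {"k": "Bacteria", "p": "Firmicutes"},      # Gen6
]

def consensus(genes, min_total=0.5, min_annotated=0.7):
    """Deepest taxon held by >=50% of all genes and >=70% of annotated genes."""
    annotated = [g for g in genes if g]
    result = (None, None)
    for rank in RANKS:                         # walk down to the deepest rank
        counts = {}
        for g in annotated:
            if rank in g:
                counts[g[rank]] = counts.get(g[rank], 0) + 1
        if not counts:
            break
        taxon, n = max(counts.items(), key=lambda kv: kv[1])
        if n / len(genes) >= min_total and n / len(annotated) >= min_annotated:
            result = (rank, taxon)
        else:
            break
    return result

def disparity(genes, rank):
    """Fraction of gene pairs disagreeing at the consensus rank."""
    disagree = 0
    higher = RANKS[:RANKS.index(rank)]
    pairs = [(a, b) for i, a in enumerate(genes) for b in genes[i + 1:]]
    for a, b in pairs:
        if rank in a and rank in b:
            if a[rank] != b[rank]:
                disagree += 1
        elif a and b and any(h in a and h in b and a[h] != b[h] for h in higher):
            disagree += 1                      # higher ranks already disagree
        # otherwise the pair is "unknown" and only counts towards the total
    return disagree / len(pairs)

rank, taxon = consensus(genes)
print(rank, taxon)                             # -> f Enterobacteriaceae
print(round(disparity(genes, rank), 2))        # -> 0.27 (i.e. 4/15)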