Octopus User Manual
User Manual:
Open the PDF directly: View PDF
.
Page Count: 42
| Download | |
| Open PDF In Browser | View PDF |
Octopus User Manual Version 0.5.3 beta 1 January 2019 University of Oxford ! of 42 1 ! OVERVIEW 6 Introduction ........................................................................................................................6 What’s in this manual? ........................................................................................................6 Availability ..........................................................................................................................6 License and copyright .........................................................................................................6 Further assistance ..............................................................................................................6 INTRODUCTION 7 Variant calling .....................................................................................................................7 Hybrid mapping based variant calling ..................................................................................7 Haplotype based variant calling ........................................................................................... 7 Local variant phasing ..........................................................................................................7 INSTALLATION 8 System Requirements ................................................................................................ 8 Hardware ...........................................................................................................................8 Required Software ..............................................................................................................8 Optional software ...............................................................................................................9 Downloading .............................................................................................................. 9 Building .....................................................................................................................9 Easy install with Python ......................................................................................................9 Building with CMake .........................................................................................................10 Debug builds ....................................................................................................................10 RUNNING TESTS 10 GETTING STARTED 11 Basic usage ......................................................................................................................11 Required arguments .........................................................................................................11 Optional arguments ..........................................................................................................11 Reporting bugs .................................................................................................................11 Requesting new features ...................................................................................................12 CALLING MODELS 13 Individual ..........................................................................................................................13 Population ........................................................................................................................13 Trio................................................................................................................................... 13 Cancer .............................................................................................................................13 ! of 42 2 ! Polyclone .........................................................................................................................13 EXAMPLES 14 Calling germline variants in a single sample .......................................................................14 Calling variants in a targeted exome panel .........................................................................14 Ignoring decoy contigs from a whole genome run ..............................................................14 Calling germline variants in a population ............................................................................15 Calling de novo mutations in a trio .....................................................................................15 Calling somatic mutations in a tumour-normal pair .............................................................15 HLA genotyping ................................................................................................................15 Calling variants in haploid organism ...................................................................................15 Running in multithread mode .............................................................................................16 Using a configuration file ...................................................................................................16 Random forest filtering ......................................................................................................16 BEST PRACTICES 17 Reference selection ..........................................................................................................17 Read mapping ..................................................................................................................17 Read preprocessing ..........................................................................................................17 Variant calling ...................................................................................................................17 Variant call filtering ............................................................................................................17 COMMAND LINE REFERENCE 18 General .................................................................................................................... 18 Read Pre-Processing ................................................................................................20 Variant Generation ....................................................................................................22 Haplotype Generation ............................................................................................... 23 Calling .....................................................................................................................24 Trio ..........................................................................................................................25 Cancer .....................................................................................................................25 Polyclone ................................................................................................................. 26 Phasing ...................................................................................................................26 Call Filtering ............................................................................................................. 26 VARIANT FILTERING 28 Measure reference ............................................................................................................28 Threshold filtering .............................................................................................................30 ! of 42 3 ! Random forest filtering ......................................................................................................30 Training random forests .....................................................................................................31 OUTPUT FORMAT 32 PERFORMANCE OPTIMISATION 33 Execution time ..................................................................................................................33 Memory consumption .......................................................................................................33 Multithreading ...................................................................................................................34 Variant generation .............................................................................................................34 Haplotype generation and phasing ....................................................................................35 Calling model selection and parametrisation ......................................................................35 TROUBLESHOOTING 36 Building ...................................................................................................................36 Why are the requirements so strict? ..................................................................................36 CMake chooses a bad compiler ........................................................................................36 Compilation fails ...............................................................................................................36 Linking fails ......................................................................................................................36 Boost libraries fail to link ...................................................................................................37 Compilation has lots of #pragma warnings ........................................................................37 Runtime ...................................................................................................................37 Segmentation fault ............................................................................................................37 Execution is slow ..............................................................................................................37 Execution delays after initialising calling components in threaded mode .............................. 37 Run hangs in decoy contigs ..............................................................................................37 Behaviour ................................................................................................................38 No calls are reported ........................................................................................................38 Regions are skipped because of too many haplotypes .......................................................38 A call changes when a different input region is given ..........................................................38 Why doesn’t octopus report genotype likelihoods? ............................................................38 Why do octopus VCF files contain * and .? ........................................................................38 SNP accuracy improves in fast mode ................................................................................ 39 Calling performance is worse with assembler .....................................................................39 CONTACT 40 APPENDIX 40 Installing Requirements .............................................................................................40 ! of 42 4 ! OCTOPUS USER MANUAL OS X ................................................................................................................................40 Ubuntu .............................................................................................................................40 Variant Generation ....................................................................................................41 Haplotype Generation ............................................................................................... 42 Phasing ...................................................................................................................42 GLOSSARY 42 ! of 42 5 ! OCTOPUS USER MANUAL OVERVIEW Introduction Octopus is a command line tool that detects genetic variation from high-throughput sequencing data (reads) relative to a reference sequence. The tool must be provided with an indexed FASTA reference file and one or more SAM format mapped and aligned read files, and will produce a set of phased variants in the VCF format. Octopus is able to call single nucleotide variants (SNVs) and small indels (< 2000bp), and can be used detect and classify germline, somatic, or de novo mutations across multiple samples. What’s in this manual? This is a user manual intended for novice to advanced users to get octopus running optimally. It is not intended to give a detailed description of the algorithms implemented in octopus (although some pertinent details are given in the appendix to help understand some parameters), nor is it a technical manual for software developers. Please refer to the octopus paper and developer manual for detailed descriptions of these topics. Availability Octopus is hosted on Github. License and copyright Octopus is distributed under the MIT license. The copyright holder is the Daniel Cooke (daniel.cooke@well.ox.ac.uk) who reserves the right to change the license terms. Further assistance • There is additional documentation on the Github wiki. • For general discussion, please use the octopus Gitter chat. • For bugs and feature requests, please use the octopus issue tracker. • Other questions can be directed to Daniel Cooke: daniel.cooke@well.ox.ac.uk. ! of 42 6 ! OCTOPUS USER MANUAL INTRODUCTION Variant calling Variant calling is an inference task; the aim is to report the underlying genome of the sample under consideration, given a set of indirect observations of the samples genome (reads). This is statistical problem as the underlying read data is noisy due to the sequencing process (errors can be introduced in the library preparation and sequencing itself). In practise, not all inferred genetic information is reported as the vast majority of information is conserved amongst populations. Instead only differences compared to a reference sequence for the population are reported, these are called variants. Hybrid mapping based variant calling Mapping based variant callers require preprocessed input from a read mapper. A read mapper takes raw sequencing reads and attempts to determine the origin of each read independently relative to a reference genome. Most mappers will subsequently align the read around the mapped location. The variant calling task is much simplified when read mapping information is available as the domain of possible variants is significantly reduced. However, with mapping based approaches, the overall accuracy of the caller may be bounded by the accuracy of the mapper and alignment algorithm; the mapping stage itself can be viewed as a variant calling process as the mapper must also account for deviations from the reference sequence due to real variation and sequencing errors. The other method is to avoid using a mapper entirely; just take the raw reads and assemble them into full contigs. Such approaches do exist, and usually empty De Bruin graphs and similar algorithms, but are usually underpowered compared to mapping based approaches. There is also a significant computational overhead attached to assembly based approaches. Experience is showing that the best overall solution is a hybrid approach where read mapping information is used, but only partially. Reads mapping within a certain genomic interval are locally reassembled and then aligned to the assembled contig. The idea being that the read mapper may be wrong, but it is unlikely to be very wrong; the true read origin is unlikely to be far way from the mapped location. Haplotype based variant calling Haplotype based variant callers attempt to jointly genotype more than one genomic position simultaneously. This is in contrast to traditional positional based variant calling that only genotype a single location at once. The advantage of haplotype based is that the space of possible errors increases exponentially with haplotype length, while the space of true haplotypes remains constant in the number of samples and organism ploidy. It is therefore much easier to classify true variation and sequencing errors. Local variant phasing Phasing refers to assigning called alleles to a particular haplotype; calls are phased if information indicating which called alleles occur on the same haplotype is provided. It is only possible to fully reconstruct a samples genome if phased calls are generated. Octopus is able to generate phased calls - the phase information is provided in the final call set. ! of 42 7 ! OCTOPUS USER MANUAL INSTALLATION This section gives detailed instructions on how to obtain, build, and install octopus. Please refer to the troubleshooting section for common installation problems not addressed here. SYSTEM REQUIREMENTS Octopus is mostly written in C++, and therefore requires the source code to be compiled for the target machine architecture with a C++ compiler. You will need to consult your operating systems technical documentation to determine a suitable C++ compiler. Hardware In principle octopus can run on any machine capable of compiling a C++14 program. However, given the complex numerical algorithms involved in running octopus the following guidelines are offered 1: Technology Minimum Recommended Processor Intel Core i5 32 x Intel Core i5 Memory 8GB 16GB Disk 500GB 1TB Most modern desktop and laptop computers should satisfy these requirements. The user should understand that hardware requirements will vary greatly depending on the use-case and workload. For example, calling many high coverage samples will require far greater memory than a single low coverage sample. Octopus is fully multithreaded, and to achieve reasonable runtime performance on large tasks it is recommended to make multiple processor cores available. Octopus requires SSE2 hardware support. Required Software • A C++14 compliant compiler with SSE2 support2 • An implementation of the C++14 standard library3 • Boost 1.65 or greater • htslib 1.4 or greater • CMake 3.9 or greater 1 These guidelines are based on running octopus on a single high coverage (~50x) human sample. GCC 6.2.1 and below have bugs which affect octopus; only use GCC 6.3 and above. LLVM Clang 3.8 has been tested and compiles. Visual Studios and Intel C++ compilers have not been tested. 2 3 It is highly recommended to use the compilers native C++ standard library implementation: libstdc++ for GCC and libc++ for Clang. ! of 42 8 ! OCTOPUS USER MANUAL Optional software • Git 2.5 or greater • Python3 Instructions on obtaining the requirements on OS X and Ubuntu are given in the Appendix. DOWNLOADING Octopus is distributed via the project hosting website Github. There are two ways to obtain a copy of the source code from Github: 1. Visit the octopus Github webpage and click the Clone or download box. This will download a zip file named octopus-master.zip containing the octopus source code. Move the zip file to a suitable location, unzip it, and rename the folder to octopus. 2. Open a command line terminal and move to a directory where you would like octopus to be downloaded, then execute the git command: $ git clone https://github.com/luntergroup/octopus.git The octopus source code will be downloaded into a folder named octopus. BUILDING Once the source code is obtained there are two methods to create an executable for your target machine, both require CMake to generate a native makefile which is used by a native build-tool to build the final executable. It is highly recommended to do an out of source build. Easy install with Python In the scripts directory there is a Python3 script install.py which will execute all the necessary build steps: $ ./scripts/install.py Which will install into the octopus bin directory. To install into a different location (e.g. /usr/local/bin) use: $ ./scripts/install.py --prefix /usr/local/bin If CMake is not able to find a suitable C++ compiler, it may be necessary to explicitly specify where such a compiler exists: $ ./scripts/install.py --cxx_compiler /path/to/compiler/cpp On some systems, you may also need to specify a C compiler which is the same version as your C++ compiler, this can be done with the c_compiler option, e.g.: $ ./scripts/install.py --cxx_compiler g++-7 --c_compiler gcc-7 ! of 42 9 ! OCTOPUS USER MANUAL The installation script can also be used to install all dependencies, including a suitable compiler: $ ./scripts/install.py --install-dependencies Building with CMake It is also possible to build the source directly with CMake: $ cd build $ cmake .. && make install $ cmake -DCMAKE_INSTALL_PREFIX=/usr/local/bin .. $ cmake -DCMAKE_CXX_COMPILER=g++-4.2 .. Using the python script is recommended however as it ensures an out of source build. Debug builds It is possible to build octopus with debug information. This is only recommended for debugging and will hopefully not be needed for users. To do so, add the command --sanitize to the Python install script. RUNNING TESTS If you downloaded a developmental version of octopus, it is good practise to run all the packaged tests before using any of the tools for production work. The release versions are guaranteed to have passed all tests. To install octopus for testing and run the tests use: $ test/install.py Like the other install script this command can also be supplied with a compiler. ! of !42 10 OCTOPUS USER MANUAL GETTING STARTED Once successfully installed octopus is ready to run. This section is for novice users who want a gentle introduction to variant calling. Advanced users should consult the command line reference section for detailed descriptions of specific features. Basic usage Octopus is a command line tool and must be executed from a command terminal. The simplest octopus run is without any arguments: $ octopus Which will report a user error informing there are missing required arguments! You can request a reminder of all required and optional parameters with the --help command: $ octopus --help This will display a similar table to the command reference below. Required arguments Only two command line argument are required. First, the reference genome to use for analysis specified with --reference; -R, which must be given a path to a FASTA file containing the reference genome. A FASTA index file with the same name, but extension .fai is also required to exist in the same directory as the given reference. Second, a list of read file (BAM or CRAM format) paths must be supplied. These can either be supplied directly with the --reads; -I option, or with the --reads-file; -i option, which must be given a path which itself contains a list of paths to read files. These two options can also be used conjunctively, any duplicate files will be ignored. Each read file must have an associated index file that exists in the same directory as the read file (.bai for BAM and .crai for CRAM). Optional arguments Octopus has many optional arguments that affect accuracy and runtime performance. The default parameters have been chosen with human germline sequence data mapped with BWA-MEM in mind; many users will find the default arguments offer adequate performance on human samples. For non human data samples, the default parameters may not offer good performance, especially for non-diploid organisms, and users are advised to carefully read the available options. Even for users only interested in human samples, it is recommended they briefly acquaint themselves with the available options. A detailed description of all command line options can be found later in this manual. Reporting bugs Octopus is currently in pre-release, so it is likely that some bugs will be present. If you encounter a bug, please first check the octopus issue tracker to make sure it is not already reported. Also, if you're using a tagged ! of !42 11 OCTOPUS USER MANUAL release build, please check closed issues and newer releases before reporting the bug as it may already have been fixed! Once a bug has been verified, try if possible to find a minimal verifiable example (MVE); that is, the least amount of data that triggers the bug. The first step to finding an MVE is usually to locate the approximate genomic region where the bug is triggered, and then calling with smaller targeted regions to try to pinpoint the problem. Usually this task is easier when running in a single thread, however, this can be time consuming if calling over large amount of data, in which case you could try running with multiple threads and with the -debug command which should help indicate where the issue occurred. Once an MVE is found, recompile octopus in sanitize mode which adds significant debugging information to the executable. Any errors will be reported to stderr should be recorded, and sent along with octopus's own debug log to the octopus issue tracker. Requesting new features Feedback is very welcome! Please start by suggesting a new feature on the octopus forum and then if well received make an official feature requests to the octopus issue tracker. ! of !42 12 OCTOPUS USER MANUAL CALLING MODELS Octopus provides a framework for genotyping samples given different states of knowledge about those samples, such as different sample biology or experimental method, expressed via a calling model. A calling model serves two purposes: firstly, it defines the type of calls and inferences that should be made (e.g. somatic or de novo classification), and secondly, it defines a probability model to calculate posterior probabilities for the given call types. Octopus currently has five calling models, which are briefly discussed below. Individual The individual calling model is the simplest, it is intended to model a single healthy individual with known chromosome copy number (ploidy) for all chromosomes (or contigs). The advantage of having a bespoke model for an individual is that the genotype posterior distribution can be calculated exactly. Population The population calling model is intended for genotyping multiple unrelated samples from a population. Like the individual model, it is assumed each sample is healthy with a known chromosome copy number. The first advantage of calling samples jointly, as apposed to calling each sample individually and then merging the results, is that power is increased to call common variation. This is particularly true for low coverage data. The second advantage is that genotyping samples jointly allows a consistent call set to be produced; merging independent call-sets can be very challenging. Trio The trio calling model is used to genotype a family consisting of a mother, father, and offspring. All members of the trio are assumed to be healthy with known chromosome copy numbers. However, unlike the population model, the trio model explicitly models the relationship between the samples, and is therefore able to classify de novo mutations in the child. Cancer The cancer model is used to genotype tumours from a single individual. All tumours are assumed to be metastasis from the same primary tumour (or the primary itself). The model can be used to classify somatic mutations, and infer local copy number changes around called mutations. Unlike the other calling models, the chromosome copy number of each tumour is not assumed to be known, however, if a normal sample with known chromosome copy number is also present the classification power of the model is increased. Polyclone The polyclone calling model is designed for calling variants in a mixed haploid sample where the number and mixture frequency of clones is unknown. An application is calling variants in bacterium samples which could contain more than one isolate due to contamination, mixed infection, or in-host evolution. The number of clones is automatically inferred from the data. ! of !42 13 OCTOPUS USER MANUAL EXAMPLES This section contains some common use-case examples to get started. Please refer to the following sections for more information regarding calling models and parameters. Note all of the examples in this section use the default output mode (standard output) for brevity, to write to a file just add the --output; -o command. Calling germline variants in a single sample As previously described, octopus has two distinct models for germline variant calling - one for a single individual and another for populations. Fortunately there is no concern for the user as the appropriate model is selected automatically: $ octopus -R human.fa -I NA12878.bam Assuming the file NA12878.bam contains a single sample, this will use the individual calling model. Octopus does not care how many samples are actually in a read file, so if the input read file contains multiple samples but only a single sample is required for analysis, the name of the sample is required as input: $ octopus -R human.fa -I multi_sample.bam --samples NA12878 Calling variants in a targeted exome panel All octopus calling models can be supplied with a list of target intervals to analyse, for a small number of regions the option --regions; -T can be used: $ octopus -R human.fa -I NA12878.bam -T 22:35,799,116-35,799,685 This option can be used multiple times, or can be supplied with a space separated list of arguments. However, for longer target interval lists it may be easier to create a file which lists the regions (one per line) and pass this to octopus using the --regions-file; -t option: $ octopus -R human.fa -I NA12878.bam -t exome-panel.txt Note these options can be used in conjunction, and there is no need to worry about duplicates or overlaps octopus will resolve this internally. Ignoring decoy contigs from a whole genome run There is a useful option, --skip-regions; -K, that serves as the converse of the --regions command; it informs octopus not to analyse the given options. The main utility of this is for ignoring decoy contigs or centromeres in whole genome runs. There is a homologous command, --skip-regions-file; -k, which takes a path to a file containing regions to ignore: $ octopus -R human.fa -I NA12878.bam -k human-decoy.txt It is possible to use all the region specific commands in conjunction to get fine grain control over which regions to call. ! of !42 14 OCTOPUS USER MANUAL Calling germline variants in a population The population model is the default for more than a single sample, so just supply a list of samples: $ octopus -R human.fa -I NA12878.bam NA12891.bam For larger sample sets, it is usually better to have the read paths in a file: $ octopus -R human.fa -i reads.txt Calling de novo mutations in a trio To call germane and de novo mutation in a trio, just specify --maternal-sample; -M and --paternalsample; -F: $ octopus -R human.fa -i ceu_trio.txt -M NA12892 -F NA12892 The child is automatically deduced. The trio can also be specified with a PED file: $ octopus -R human.fa -i ceu_trio.txt --pedigree ceu_trio.ped Calling somatic mutations in a tumour-normal pair To call germline and somatic variants in tumour samples, supply either a normal sample: $ octopus -R human.fa -I normal.bam tumour.bam -N normal If a normal sample is unavailable, tumour only calling can be invoked by explicitly selecting the cancer calling model: $ octopus -R human.fa -I tumour1.bam tumour2.bam -C cancer HLA genotyping Octopus is able to call very long haplotypes, especially in variant dense regions, which makes it an ideal tool for calling HLA haplotypes. By default octopus will not make maximally long haplotypes - and therefore phase regions - due to the computational complexity involved in such optimisation, and the diminishing return of very long haplotypes. But in the HLA, longer haplotypes are desired, which can be achieved using the -phasing-level; -l command: $ octopus -R human.fa -I NA12878.bam -t hla-regions.txt -l aggressive It may also be beneficial to increase the default value of --max-haplotypes to 256 or 512. Calling variants in haploid organism The default parameters are set with human sequence data in mind, for non-human samples it is recommended to adjust the options for the organism being analysed. For haploid organisms such as bacteria and viruses, the most important parameter to change is --organism-ploidy; -P which sets the default ploidy to use. Depending on the organism, it may also be important to adjust the variant priors: --snpheterozygosity and --indel-heterozygosity: ! of !42 15 OCTOPUS USER MANUAL $ octopus -R ecoli.fa -I ecoli.bam -P 1 --snp_heterozygosity 0.01 To call variants in a haploid sample which potentially contains an unknown mix of multiple clones (e.g. bacteria or viral samples), specify the polyclone calling model. $ octopus -R H37Rv.fa -I mycobacterium_tuberculosis.bam -C polyclone Running in multithread mode By default all octopus runs execute using a single thread, but it is trivial to use multiple threads using the -threads command: $ octopus -R human.fa -I NA12878.bam --threads This is the recommended approach to multithreading with octopus, but the command also takes an optional number of threads to use, which must be specified immediately after the command: $ octopus -R human.fa -I NA12878.bam --threads=4 The former form is recommended because it allows octopus to optimise thread usage, and also enables the use of specific multithreaded algorithms. Using a configuration file Octopus allows all command line options to be specified using a configuration file, which some users may prefer is the same configuration us used often. The configuration file is just a text file with each line containing a option=value pair: $ octopus --config my_octopus_config.txt Random forest filtering To use random forest filtering just specify the --forest-file option for germline calls and the -somatic-forest-file option for somatic calls: $ octopus -R human.fa -I NA12878.bam --forest-file germline.forest ! of !42 16 OCTOPUS USER MANUAL BEST PRACTICES This section gives some brief advise on best practise workflow from FASTQ to VCF. Reference selection Use the latest possible reference genome for your sample. Reference assemblies are often updated to reflect resolutions of complex loci, or to add decoy sequence which reduces mapping issues and improve calling quality. Read mapping Octopus requires mapped and aligned reads in the SAM format.The quality of the mapping software is therefore an essential part of the variant calling process. While the performance of mappers can vary considerably depending on the type of sequencing data used, BWA-MEM (default settings) is recommended as it is widely used, well tested, and has been shown to perform well on a wide range of data - in particular human genetic data. Read preprocessing Octopus does not require any read preprocessing after mapping, such as duplicate marking, indel realignment, or base quality recalibration. Unlikely other variant callers, octopus is unlikely to benefit from such techniques as reads are preprocessed internally, indels are essentially realigned during calling, and base quality scores are also internally manipulated depending on sequence context. However, if you're data is already preprocessed, octopus should perform equally well with this data. Variant calling Ensure the correct calling model is selected for the type of data to be analysed. Look at the private parameters for the chosen calling model and verify the defaults are reasonable. At the very least, check the ploidy assumptions are correct. Once the calling model is appropriately configured, consult the performance optimisation section to help tune other calling parameters. Variant call filtering This version of octopus provides random forest and threshold based variant call filtering. We recommend using the random forest for germline and somatic calling, and the default filter expressions for trio calling. ! of !42 17 OCTOPUS USER MANUAL COMMAND LINE REFERENCE This section contains a description of each command line option. The commands are separated into sections which roughly correspond to different area of concern. At the end of each section a detailed explanation of any non-trivial commands is given. Some options have so called default implicit values, that is, they are be default disabled, but can be enabled with the implicit default value by just specifying the option name. Implicit options are labelled as =(default implicit value). Entries with a red border are currently placeholders and are not yet implemented, they are included to give an indication of what will be available in the first official release. Entries with an orange border are currently implemented, but are likely to change before the first official release. GENERAL Default value Command Description --reference, -R The reference genome to use for analysis. Must match the reference genome used to map reads against. None --reads, -I The read files to use for analysis. Can be specified multiple times and given a space separated list of argument. None --reads-file, -i A path that contains a list of read file paths, one per line, to use for analysis. None --regions, -T A list of genomic intervals to analyse. --regions-file, -t A path to a file that contains a list of genomic intervals to analyse. Must have one region per line. BED format is accepted. None --skip-regions, -K A list of genomic intervals that should be ignored. None --skip-regions-file, k A path to a file that contains a list of genomic intervals that should be ignored. Must have one region per line. BED format is accepted. None ! of !42 18 All regions present in the reference index. OCTOPUS USER MANUAL Default value Command Description --one-based-indexing Reads all user input regions using one-based indexing rather than zero based. --samples, -S A list of samples to analyse, which must be a subset of those samples in the reads. --samples-file, -s A path to a file containing a list of samples to analyse. None --pedigree PED file containing sample pedigree. Only currently used by trio calling model. None --fast Disables various algorithmic features to significantly reduce runtime, at the cost of worse calling accuracy. Equivalent to -a off -l minimal -x 50. Off --very-fast Disables various algorithmic features to significantly reduce runtime, at the cost of worse calling accuracy. Equivalent to --fast --inactive-flankscoring off. Off --threads Enables multithreading. If not supplied with an argument (recommended), the number of threads is automatically determined. Otherwise the number of threads is limited to the given number. --working-directory, -w Any path given to octopus will be relative to the working directory, unless the path is already valid. None --temp-directory-prefix Prefix name of octopus temporary directory used during calling (created in working directory). octopus-temp --max-reference-cachefootprint, -X The maximum amount of memory available to cache reference sequence. Caching reference sequence reduces file IO. --target-read-bufferfootprint, -B The recommended amount of memory available for buffering read data. This is not a strict limit. --target-working-memory Target maximum working memory for analysis. This is not a strict limit, but may disable certain memory intensive optimisations. --max-open-read-files Limits the number of open read files to the given number. Note each read file also has an index which is not accounted for. 250 --contig-output-order Which order should contigs appear in the final output? Possible values are: lexicographicalAscending, lexicographicalDescending, contigSizeAscending, contigSizeDescending, asInReference, asInReferenceReversed. asInReferenceI ndex ! of !42 19 No All samples found in the reads. Disabled. =(automatic) 500MB 6GB None OCTOPUS USER MANUAL Default value Command Description --sites-only Remove genotype calls and associated information from final VCF output. No --regenotype A VCF file specifying sites to regenotype; only calls listed in this files will appear in the output. None --legacy Outputs a more conventional VCF file in addition to the standard octopus format. --debug Writes verbose debug information to a log file. Can be supplied with a path. Off =(octopus_de bug.log) --trace Writes very verbose debug information to a log file. For maintainer use only. Can be supplied with a path. Off =(octopus_tra ce.log) --version Displays the current version number and other meta information. None --config A configuration file that contains values for some or all of the options listed here. None --bamout Output realigned BAM files. Full path for single sample calling, output directory for multi-sample calling. None --split-bamout Output realigned split BAM files. Output prefix for single sample calling, directory for multi-sample calling. None Off READ PRE-PROCESSING Default value Command Description --read-transforms Use to turn off all read transformations. Reads can still be filtered. On --soft-clip-masking Use to turn off soft clip masking (assigning base quality zero) of soft clipped read flanks. On --mask-tails Set this many tail base qualities of all reads to zero. --mask-low-quality-tails Masks (assigns base quality zero) the tail given number of bases of each read. --mask-soft-clippedboundries Masks (assigns base quality zero) to the soft clipped flanks of reads, plus an additional number of given bases. ! of !42 20 None No =(3) No OCTOPUS USER MANUAL Default value Command Description --adapter-masking Prevents read bases that are considered likely adapter contaminants, as determined by octopuses native adapter contamination detector, from being masked (assigned base quality zero). This command is redundant unless the command --allow-adaptercontaminated-reads is also used. On --overlap-masking Prevents masking (assigning base quality zero) of read bases that overlap (w.r.t mapping location) of other segments within the reads template. For paired-end reads, this usually refers to the reads mate. Only one corresponding base of each read is masked; the other is left untouched. On --read-filtering Prevents any read from being quality control filtered, this does not affect downsampling. On --consider-unmapped-reads Turns off filtering of reads marked as unmapped. Note this is not the same as reads with mapping quality zero. No --min-mapping-quality Discards reads with mapping quality less than this before calling. 20 --good-base-quality The base quality threshold to use for the the options --min-good-base-fraction and --min-goodbases. 20 --min-good-base-fraction The maximum fraction of bases below --min-goodbase-quality before the read is discarded. Off --min-good-bases The minimum number of bases equal to or above -min-good-base-quality before a read is considered. 20 --allow-qc-fails Prevents removal of reads marked as QC failed. No --min-read-length Discards reads with less bases than this. None --max-read-length Discards reads with more bases than this. None --allow-marked-duplicates Prevents removal of reads pre-marked as duplicates. No --allow-octopusduplicates Prevents removal of reads that octopuses native duplicate detector marks as duplicates. No --allow-secondaryalignmenets Allows reads marked as being secondary alignments. Yes --allow-suplementaryalignments Allows reads marked as being supplementary alignments. Yes ! of !42 21 OCTOPUS USER MANUAL Command Default value Description --no-reads-with-unmapped- Filter reads where one or more segments in the reads segments template are marked as unmapped. For paired-end reads, this usually refers to the read mate. No --no-reads-with-distance- Filter reads that have template segments mapped to a segments different contig. For paired end reads, this usually refers to the read mate. No --no-adaptercontaminated-reads Prevents removal of reads that are likely to contain adapter contamination, as determined by octopuses native adapter contamination detector. No --disable-downsampling Turns off all downsampling. Reads may still be filtered. No --downsample-above Trigger downsampling of a sample when the read depth in a region is above this value. 500 --downsample-target Once a region has been flagged for downsampling, try to remove reads in the region to achieve this level of coverage. Must be greater than --downsampleabove. 400 VARIANT GENERATION Default value Command Description --raw-cigar-candidategenerator, -g Turn on or off the raw cigar variant candidate generator to propose candidate variants. On --repeat-candidategenerator Turn on or off the repeat candidate generator to propose candidate variants. On --assembly-candidategenerator, -a Turn on or off the local reassembler generator to propose candidate variants. On --source-candidates Consider all sites in the given VCF format as candidate variants. This differs from the option --regenotype as the final call set is not required to be a subset of these calls. None --max-variant-size The maximum variant size (w.r.t genomic interval span) that any candidate variant generator may propose. 2,000 --min-supporting-reads Overrides the default raw cigar generator and applies a simple threshold inclusion predicate based on the number of observed reads. Observations must have base quality greater than that indicated in --minbase-quality. 2 --min-base-quality The minimum base quality a read base must have before it is considered as supporting a variant. 20 ! of !42 22 OCTOPUS USER MANUAL Default value Command Description --kmer-sizes Default k-mer sizes to use for assembly. --num-fallback-kmers The number of fallback k-mer sizes to try if the default sizes fail to provide a valid graph. 10 --fallback-kmer-gap The gap size of fallback k-mers. 10 --max-region-to-assemble The maximum region size that will be used for local reassembly. Larger sizes may result in larger structural variation being found, but reduces sensitivity to smaller variation. 400 --max-assemble-regionoverlap The maximum number of bases assembly windows are allowed to overlap. A higher overlap may increase sensitivity but increase runtime. 200 --assemble-all Forces local reassembly of all genomic regions. No --assembler-mask-basequality Mismatching bases with quality less than this will be masked as reference before being threaded into the assembly graph. 10 --min-kmer-prune The minimum number of k-mer observations to keep the k-mer in the graph after pruning. 2 --max-bubbles The maximum number of bubbles to extract from the assembly graph. 30 --min-bubble-score The minimum bubble score to extract from the assembly graph. 10 15 20 2 HAPLOTYPE GENERATION Command Description --max-haplotypes, -x The maximum number of haplotypes that can be used to generate candidate genotypes. If the haplotype generator proposes more haplotypes than this then the excess will be filtered. 200 --haplotype-holdoutthreshold If a region contains more haplotypes than this, then a subset of alternative alleles will be temporarily removed (held out) and only be analysed once some haplotypes have been discarded. 2,500 --haplotype-overflow The maximum number of haplotypes a region may have before the region is unconditionally skipped (without attempting to hold out alternative alleles). --max-holdout-depth The maximum number attempts to hold out alternative alleles in a region before the region is skipped. Default value ! of !42 23 200,000 20 OCTOPUS USER MANUAL Default value Command Description --extension-level Level of haplotype extension. Possible values are conservative, normal, optimistic, and aggressive. --haplotype-extensionthreshold, -e Haplotypes with posterior probability (of occurrence in the sample set) can be removed before haplotype extension. 100 --dedup-haplotypes-withprior-model Deduplicate haplotypes using mutation prior model, rather than naive method. Yes --protect-referencehaplotype Never filter the reference haplotype. Yes Normal The option --max-haplotype is a target for the haplotype generator as well as a strict limit for the caller; the haplotype generator will attempt to satisfy the request, but if it fails to do so, the caller will filter the generated haplotype set to this limit. CALLING Default value Command Description --caller, -C Which calling model to use. --organism-ploidy, -P The autosome ploidy for the analysed organism. All contigs will have this ploidy unless marked otherwise. --contig-ploidies, -p Assigns ploidies to contigs, overriding the default organism ploidy. --snp-heterozygosity The SNP heterozygosity in the sample population. 0.001 --snp-heterozygositystdev The SNP heterozygosity standard deviation in the sample population. 0.01 --indel-heterozygosity The INDEL heterozygosity in the sample population. --min-variant-posterior The minimum posterior probability (QUAL) for a variant to be reported. --use-uniform-genotypepriors Use uniform genotype priors. No --use-independentgenotype-priors Use independent genotype priors for joint calling. No --model-posterior Calculate model posteriors for every call. Off --inactive-flank-scoring Use to disable calculation to account for flank mismatches in HMM routine. On --model-mapping-quality Use read mapping quality in read likelihood calculation. Yes ! of !42 24 individual or population 2 Y=1 chrY=1 MT=1 chrM=1 0.0001 2 OCTOPUS USER MANUAL Default value Command Description --max-genotypes Maximum number of genotypes to consider. Currently only used by cancer and polyclone calling models. --max-joint-genotypes Maximum number of joint genotype vectors that can be considered (applicable to population and trio calling models). --sequence-error-model The sequencing error model to use for read likelihood calculation. Possible values are hiseq and x10. --max-vb-seeds Maximum number of seeds to use in Variational Bayes inference. 12 --refcall Report reference confidence calls. Off --min-refcall-posterior The minimum posterior probability (QUAL) for a reference allele to be reported. 5,000 1,000,000 hiseq 2 TRIO Command Description --maternal-sample; -M Which of the given samples is the mother in the trio. None --paternal-sample; -F Which of the given samples is the father in the trio. None --denovo-snv-mutationrate The germline snv de novo mutation rate. --denovo-indel-mutationrate The germline indel de novo mutation rate. --min-denovo-posterior The minimum posterior probability (phred scale) to emit a de novo mutation call. --denovos-only Only report DENOVO mutations. Default value 1.38 x 10-8 10-9 3 No CANCER Default value Command Description --normal-sample; -N Which of the given samples is the normal. --max-somatic-haplotypes Maximum number of somatic haplotypes to consider. --somatic-snv-mutationrate The somatic SNV mutation rate for the cancer to be analysed. 10-4 --somatic-indelmutation-rate The somatic INDEL mutation rate for the cancer to be analysed. 10-5 ! of !42 25 None 2 OCTOPUS USER MANUAL Default value Command Description --min-expected -somatic-frequency The minimum expected somatic allele frequency in the sample. 0.03 --min-credible -somatic-frequency The minimum inferred somatic allele frequency that will be emitted. 0.01 --credible-mass Mass of the posterior allele frequency distribution to use when calculating allele frequency. 0.9 --tumour-germlineconcentration Dirichlet concentration parameter for tumour germline haplotypes. 1.5 --normal-contaminationrisk The risk level that the normal contains contamination from the tumour. Possible values: low, high. low --min-somatic-posterior The minimum posterior probability an allele is somatic to be reported. 0.5 --somatics-only Only report SOMATIC mutations. No POLYCLONE Command Description --max-clones The maximum number of clones to try use when calling subclonal variants. --min-clone-frequency Minimum expected clone frequency in the sample. Default value 3 0.01 PHASING Default value Command Description --phasing-level, -l The level of phasing. Possible values are: minimal, conservative, moderate, normal, and aggressive. --min-phase-score The minimum phase score (phred scale) a potential phase set may have to be called. Normal 10 CALL FILTERING Default value Command Description --call-filtering, -f Use to enable Call Set Refinement (CSR). ! of !42 26 On OCTOPUS USER MANUAL Command Description Default value --filter-expression Boolean expression to use to filter calls. Current version only supports OR operations and the measure name must appear on the left hand side of the comparator. QUAL < 10 | MQ < 10 | MP < 10 | AF < 0.05 | SB > 0.98 | BQ < 15 | DP < 1 --somatic-filterexpression Filter expression for somatic calls. QUAL < 2 | GQ < 20 | MQ < 30 | SB > 0.9 | SD > 0.9 | BQ < 20 | DP < 3 | MF > 0.2 | NC > 1 | FRF > 0.5 --denovo-filterexpression Filter expression for de novo calls. QUAL < 50 | PP < 40 | GQ < 20 | MQ < 30 | AF < 0.1 | SB > 0.95 | BQ < 20 | DP < 10 | DC > 1 | MF > 0.2 | FRF > 0.5 | MP < 30 | MQ0 > 2 --refcall-filterexpression Filter expression for homozygous reference calls. QUAL < 2 | GQ < 20 | MQ < 10 | DP < 10 | MF > 0.2 --use-calling-reads-forfiltering Use the reads used for calling for filtering. Otherwise filtering reads will use default read filters and transforms. No --keep-unfiltered-calls If variant call filtering is turned on, also keep a copy of unfiltered calls. No --training-annotations Emits CSR measures to the output VCF in INFO fields. None --filter-vcf Run CSR filtering on this octopus VCF without calling. None --forest-file Trained ranger forest to use for germline variant filtering. None --somatic-forest-file Trained ranger forest to use for somatic variant filtering. None ! of !42 27 OCTOPUS USER MANUAL VARIANT FILTERING Variant filtering is used to remove false positive calls that may be introduced due to systematic errors in sequencing or mapping. Ideally, these sources of errors would be fully modelled by the data likelihood model, but capturing all types of error at this stage is extremely difficult. A number of approaches to variant filtering have been proposed, including simple threshold based approaches and sophisticated methods using machine learning. However, all approaches first require defining a set of statistics, or measures, that will be used to classify calls as passing or failing in one way or another. These quality of these statistics will ultimately decide the accuracy of variant filtering, regardless of the actual methodology implemented. The default read filter has been chosen to minimise the chance of filtering true positives, whilst eliminating high quality false positives. To achieve very high specificity, it may be necessary to increase the stringency of the filter conditions. Not all measures available in Octopus are computed during the calling phase, hence some filter expressions require re-access to the read data. The main reason for this is that it may be beneficial to relax the read filtering constraints used for calling compared to filtering. For example, during calling it is usually advisable to filter reads with mapping quality less than 20 as these reads are a common source of false positives. To use them during the calling step would likely increase the false positive rate considerably and increase computation time as more candidates would need to be considered. However, these reads are useful for filtering as they indicate the region where the reads are mapped is likely to contain mapping artefacts. Measure reference Below is a list of all available measures: Measure name Requires Sample reads specific? Description AC No No Number of ALT alleles called. AD Yes Yes Minor empirical allele depth. AF Yes Yes Minor empirical allele frequency. ARF Yes Yes Fraction of reads overlapping the call that cannot be assigned to a unique haplotype. BMC Yes Yes Number of base mismatches at variant position in reads supporting variant haplotype. BMF Yes Yes Fraction of reads with base mismatches at variant position. BQ Yes Yes Median base quality of read bases supporting ALT alleles. CC No No Classification confidence: PP / QUAL CRF Yes No Fraction of reads supporting ALT alleles that are soft clipped. DC Yes Yes Number of reads supporting a de novo haplotype in the normal. DENOVO No No Is the call DENOVO? ! of !42 28 Measure name Requires Sample reads specific? Description DP Yes Yes Number of reads overlapping the call. This is recalculated for filtering so may be higher than the calling depth. FRF Yes No Fraction of reads overlapping the call that were filtered for calling. GC No No GC content around the call. GQ No Yes The sample GQ. GQD Yes Yes Genotype Quality by Depth: GQ / DP MC Yes Yes Number of allele mismatches at variant position in reads supporting variant haplotype. MF Yes Yes Fraction of reads with mismatches at variant position. MP No No The model posterior is the probability the model Octopus used for calling is true, compared to other possible models. MQ Yes No Root Mean Squared (RMQ) mapping quality of reads overlapping the call. MQ0 Yes No Number of reads overlapping the call with mapping quality zero. MQD Yes No Maximum pairwise difference in median mapping qualities of reads supporting each haplotype. MRC Yes Yes Number of reads supporting the call that appear misaligned. NC Yes Yes Number of reads supporting a somatic haplotype in the normal. PP No Yes The calls Posterior Probability. PPD Yes Yes Posterior Probability by Depth: PP / DP QUAL No No The calls quality score. QD No No Quality by Depth: QUAL / DP REFCALL No No Are all samples homozygous reference? REB Yes Yes Bias of variants at end (head or tail) of reads. RSB Yes Yes Bias of variant side in supporting reads. RTB Yes Yes Bias of variants at tail of reads. SB Yes Yes Strand bias of reads based on haplotype support. SD Yes Yes Strand bias of reads overlapping the site; probability mass in tails of Beta distribution. ! of !42 29 Measure name Requires Sample reads specific? Description SF Yes Yes Maximum fraction of reads supporting ALT alleles that are supplementary. SHC Yes Yes Number of called somatic haplotypes. SMQ Yes Yes Median mapping quality of reads assigned to called somatic haplotypes. SOMATIC No Yes Does the sample have somatic mutations? STR_LENGTH No Yes Length of overlapping STR. STR_PERIOD No Yes Period of overlapping STR. Threshold filtering Octopus currently provides simple threshold based filtering. A number of measures that can be used to define a Boolean filter expression. Currently, Octopus only supports expressions with OR operations and the less than (<) and greater than (>) comparators. Furthermore, measure name must appear on the left hand side of each condition in the expression. Random forest filtering Random forest filtering is a more flexible and powerful method to filter variant calls than threshold filtering, if sufficient training data is available. Octopus supports random forest filtering for forests that have been trained with the open source Ranger package. Pre-trained forests for germline and somatic variant calling are available on Google Cloud. These can be downloaded manually, or automatically by adding the --downloadforests command to the Python installer: $ ./scripts/install.py --download-forests The forest files (ending in .forest) will be downloaded to the /resources/forests directory in the top level source directory. Octopus currently allows two random forests to be used: one for germline variants (--forest-file), and another for somatic variants (--somatic-forest-file). In principle, it would be possible to just use one forest that has been trained to cover all call types, but it is usually preferable to train separate forests when there is known structure in the data, and sufficient training examples are available. To apply random forest filtering to typical germline calling add the --forest-file option: $ octopus -R human.fa -I NA12878.bam --forest-file germline.forest To filter germline and somatic calls, both the --forest-file and --somatic-forest-file options need to be given: $ octopus -C cancer -R human.fa -I tumour.bam \ ! of !42 30 --forest-file germline.forest --somatic-forest-file somatic.forest The pre-trained germline forest has been trained on various whole-genome replicates of NA12878, while the somatic forest has been trained on synthetic whole genome tumour data. The coverages of the training data is typical for WGS (10-60X), and may not be suitable for extremely high depth sequencing (e.g. amplicon). Training random forests Octopus expects ranger random forests that have been trained on all available measures. The full list is: AC AD AF ARF BQ CC CRF DP FRF GC GQ GQD NC MC MF MP MRC MQ MQ0 MQD PP PPD QD QUAL REFCALL REB RSB RTB SB SD SF SHC SMQ SOMATIC STR_LENGTH STR_PERIOD Ranger expects a text file (either csv, tsv, or space separated) containing values for each measure (in the order above), and a binary variable in the final column labelled (TP) indicating if the measures in the row originate from a true or false call. Each row should therefore contain measures from a single sample. In order to generate this file, Octopus needs to produce a VCF file annotated with each of these measures, which is requested with the --training-annotations option provided with a list of of measures, or for convincing, just "forest": $ octopus -R human.fa -I NA12878.bam -o octopus.NA12878.annotated.vcf.gz \ --training-annotations forest Ranger supports several random forest types, Octopus requires the probability classification variants which is specified with the --probability option. There are various other parameters that control the random forest that can be specified. We have generally found a forest containing 100-500 trees, and a minimum node size of 5-20 works well. There are two Python3 scripts in the /scripts directory: train_random_forest.py and train_somatic_random_forest.py - in the top level source tree that can be used to train ranger forests. ! of !42 31 OUTPUT FORMAT Octopus outputs variants using a simple but rich VCF format. Although the format is fully compliant with the VCF specification (version 4.3), some users may find it unfamiliar, and some tools will fail to fully parse all variants. Variant call output is challenging; the output should be consistent but succinct, however, many tools use representations that is one or the other. For example, records such as the following are not uncommon: #CHROM POS ID 1 102738191 1 102738191 REF . . ALT QUAL FILTER INFO ATTATTTAT A . ATTATTTATTTAT A . FORMAT NA12878 . . GT . . GT 1/0 1/0 The problem with this representation is that the two records are not consistent as both records infer the reference allele at the same position, but the site is heterozygous non-reference for two different deletions. The site can be consistently by joining both records, such as: #CHROM POS ID 1 102738191 REF . ALT QUAL FILTER INFO ATTATTTATTTAT ATTAT,A FORMAT NA12878 . . . GT 1/2 But this representation rapidly becomes unmanageable (and unreadable) as the length and number of overlapping alleles increases. Octopus solves this issue by making use of two additional symbols: • The asterisk symbol (*) in the ALT field is specified in the VCF specification as "The ‘*’ allele is reserved to indicate that the allele is missing due to a upstream deletion". • The dot symbol (.) in the GT field is specified in the VCF specification as "If a call cannot be made for a sample at a given locus, ‘.’ should be specified for each missing allele in the GT fields". Using these two symbols, octopus would represent the above site like: #CHROM POS ID 1 102738191 1 102738191 REF . . ALT QUAL FILTER INFO ATTATTTAT A,* . ATTATTTATTTAT A . FORMAT NA12878 . . GT . . GT 1|2 .|1 There are three important observations here: • The records are phased, so must be considered together. • The first record, which is always contains the shorter allele (i.e. the one which ends first along the reference sequence), specifies the allele on the first haplotype is the deletion of the length given in the ALT field, while the other haplotype is non-reference, but is specified in a later record. • The second record, specifies the deletion given in the ALT field on the other haplotype than the one in the previous record. And a missing allele on the other haplotype. When read sequentially, these observations suggest a single unique genotype for the sample at the site. Although it may seem odd to specify the first allele in the second genotype as missing, this is only true without the context of the first record, and crucially does not contradict the first record. Both records are consistent when considered together. To support tools unable to process octopus's default representation (e.g. RTG Tools), octopus has the -legacy command line option which produces an additional VCF file using the first format in the example. ! of !42 32 PERFORMANCE OPTIMISATION Octopus is a sophisticated program with many parameters. The default values for these parameters have been chosen with an emphasis on calling accuracy, while keeping runtime reasonable. They should provide good all round performance for most users. However, if octopus is not performing adequately on your particular dataset it may be due to non-optimal parameterisation. Generally, there is a direct tradeoff between calling accuracy and runtime resource consumption (memory and CPU time). Execution time By default octopus favours slower, more accurate variant calling. If accuracy is not critical, in particular around highly polymorphic and complex indel regions, it is possible to achieve significant reductions in runtime by altering the behaviour of certain components. In summary, the main components of interest are: Component Associated commands Explanation Variant generation -a, --kmersizes, --minbubble-score The number of candidate variants that must be considered directly affects runtime complexity. In general, more sensitive variant generation will result in more accurate results but longer runtimes. However, it is important to be aware generating too many candidate variants may actually decrease accuracy as the model may become overwhelmed. Variant generation using local reassembly is also inherently computational expensive, and the proportion of sites that must be assembled will directly affect runtime. Haplotype generation -x The number of haplotypes considered has a direct impact on calling accuracy and runtime. Phasing -l Phasing requires potentially multiple marginalisation across the entire genotype posterior distribution. This can be costly when there are many genotypes, or many candidate variants. Calling model Model dependent (see below) Some calling models have computationally complex inference procedures. For example, the cancer calling model implements an iterative Variational Bayes algorithm. These algorithms usually have convergence and performance limits which can affect runtime. There are two convenience command line options --fast and --very-fast that can be used that automatically adjust these parameters to achieve exceptional runtimes. Memory consumption Memory consumption will naturally fluctuate during an run depending on the complexity of the region currently being analysed, however, by far the main source of memory consumption in octopus is from buffering read data. While the size of the read buffer is not directly controllable, the user is able to hint at the maximum buffer size with the --target-read-buffer-footprint option. As this is only a hint, it can be ignored, but in ! of !42 33 most cases the request is satisfied. The exception to this is when multithreading is used, where there is a greater possibility of the requested limit being exceeded. In this case it is recommended to set the target around 20% less than the true limit. Another source of memory consumption if reference caching which can be used to improve runtime execution speed as it reduces the need to read from disk. A moderate default reference cache size is used, which can be adjusted by the user. Finally, significant memory usage can be observed if unreasonable calling parameters are chosen. For example, setting --max-haplotypes above 5,000 is likely to lead to memory explosion. Octopus may issue a warning in such cases, but will try to satisfy the users request. The --target-working-memory option can be specified which may limit peak memory use. Although this is not a strict limit, some algorithms have certain optimisations that require high memory use that may be disabled when using this option. The Variational Bayes algorithm is an example. Multithreading The default behaviour is to run using only a single thread. To enable octopuses built in multithreading capabilities, the --threads command must be used. If an argument is provided to this command then octopus will use this many threads, if no argument is given to this command then octopus will automatically determine the number of threads to use. It is highly recommended not to specify the number of threads, as octopus can probably optimise thread usage better than the user, and it also enables use of specialised multithreaded algorithms which are not available when the thread count is restricted. When running a multithreaded job, it is also important to consider the sizes of the reference cache and read buffer which are set with the -X and -B commands respectively. Octopus will always try to respect these buffer sizes; even when using multiple threads. In doing so, a small read buffer size can see each thread being assigned little work, and a small reference cache can lead to cache thrashing. While a larger reference cache size will always improve runtime performance, increasing the read buffer sizes too much can lead to large workloads for each thread which is undesirable as overall thread throughput can decrease. The optimal balance is highly dependent on the data, most critically the depth of coverage. It is recommended the user experiment with different buffer sizes to find an optimal throughput. As a rough guide it is recommended to at least double the default reference cache and read buffer sizes when using multithreading, and quadruple the default sizes when running on a machine with more than 100 cores. Variant generation Octopus uses two default candidate variant generators; the raw cigar generator and the local reassembly generator. While is almost never a good reason to disable the raw cigar generator, it is worth considering the benefit of leaving the assembler enabled, which can bring significant runtime overheads. The main reason to have the assembler enabled is to resolve larger indels and complex variation. As a rule of thumb, the assembler will have less impact with greater read length and decreasing sample diversity. For samples with high diversity, it is recommended to leave the assembler on. ! of !42 34 Haplotype generation and phasing Haplotype generation and phasing are closely related; longer haplotypes will produce longer range variant phasing, and in general, longer haplotypes usually result in more accurate variant calls. Therefore it is important to recognise that adjusting the level of phasing will also impact the quality of variant calling. By default phasing is medium range. This means octopus will perform haplotype extension when possible, but will stop if too much time is spent in a particular region, or if there are too many haplotypes supported by the data. Experimentation has shown this to provide the best overall calling accuracy while also giving accurate phasing in the vast majority of cases. For human data, the exception is the HLA which may benefit from higher phase levels. In general, increasing the phasing level beyond the default level will not usually improve variant call accuracy, and may decrease it, unless the number of allowed haplotypes is also increased, as the risk of pruning a true haplotype increases with each haplotype extension. Calling model selection and parametrisation The most important consideration to make before variant calling is the calling model to use. Using the population calling model on tumour samples will not result in high quality calls (or classified somatic mutations). In most instances, if all relevant sample information is correctly entered to the command line then octopus will automatically select the appropriate calling model. However, there may be exceptions (e.g. tumour only calling), so it is important to check which calling model was invoked when execution begins, or explicitly set the calling model. Once the appropriate calling model has been chosen, it is vital to consider the parameters specific to that model. For example, the de novo mutation rate is specific to the trio calling model. The more accurately the calling model is parametrised, the more accurate the calls will be. It is therefore worth spending time reviewing the parameters. Some parameters are common to all calling models. Of these, the copy number directives are the most important. In addition to choosing a default organism ploidy, octopus also allows copy number specification of individual chromosomes, so it is important to set correct copy number for sex chromosomes if applicable. ! of !42 35 TROUBLESHOOTING BUILDING Why are the requirements so strict? Octopus uses some advances features of the latest official C++ standard (C++14) and therefore requires a mature C++ compiler. The other requirements have been set to recent versions because these are the versions used to develop and test octopus, and we cannot guarantee that earlier versions will perform as well. In particular CMake introduces improved interprocedural optimisation in version 3.9, whilst Boost 1.65 contains bug fixes and improvements which octopus uses. This also reduces the burden on future releases to be backwards compatible with ageing tools, which allows focus on new features and improvements. CMake chooses a bad compiler As the previous issue explains, some compilers advertised as being C++14 compliant are not due to compiler bugs. Unfortunately, CMake cannot recognise this and will happily select a buggy compiler if you have one in a standard location on your system. If you have installed another working compiler in a non-standard location, you need to tell CMake how to find it with the CMAKE_CXX_COMPILER command: $ cmake -D CMAKE_CXX_COMPILER=/path/to/compiler/cpp .. Or if you’re using the Python install script (recommended): $ ./scripts/install.py --cxx_compiler=/path/to/compiler/cpp Compilation fails Octopus uses advanced C++14, and therefore requires a robust C++14 compiler and standard library implementation. Many of the compiler versions advertised as being C++14 compliant have bugs that prevent compilation. The only compilers that have currently been shown to work are Clang 3.8 and GCC 6.2. Linking fails Ensure the correct compiler driver was selected: using Clang this means selecting clang++ and not clang: $ ./scripts/install.py --cxx_compiler=/path/to/clang/bin/clang++ Similarly for GCC use g++ rather than gcc: $ ./scripts/install.py --cxx_compiler=/path/to/gcc/bin/g++ On some systems, you may also need to specify a C compiler which is the same version as your C++ compiler, in which case you can also specify this with the c_compiler option: $ ./install.py --cxx_compiler=g++-7 --c_compiler=gcc-7 If the issue is specifically with linking against Boost then see the next issue. ! of !42 36 Boost libraries fail to link First ensure that Boost was built using the same compiler used to compile octopus - using different compilers will often cause linking problems. Second, if the required Boost libraries are not installed in a standard location on your system, and you have not set appropriate environment variables, you may see an error message when running octopus (even after a successful build). To resolve this you just need to set the appropriate environment variable: $ export LD_LIBRARY_PATH=/path/to/boost/lib before executing octopus. Compilation has lots of #pragma warnings This is a Boost issue that was introduced in Boost 1.60, and should be fixed in future releases of Boost. This issue does not affect octopus. RUNTIME Segmentation fault Sorry! Please see the advise under ‘reporting bugs’. Execution is slow The best way to improve runtime is to use multithreading with the --threads command. If this does not offer adequate performance then consider switching off the assembler with the -a command or try the command --fast. Please see the section Execution time under performance optimisation. Execution delays after initialising calling components in threaded mode This is due to the way tasks are distributed to threads: the challenge is to split the genome into chunks (or tasks) that are independent in the sense that the union of the calls produced by calling each independently is the same as calling them together. Clearly this is always true for different contigs, but it is non-trivial to detect independent tasks for a single contig. In addition, if tasks are too small or too large, runtime performance is not optimal. Octopus solves this problem by focusing on creating tasks that will lead good thread usage, while relying on a conflict detection algorithm to find non-independent tasks after calling has completed. This algorithm requires all tasks for a given contig to be known before any resolutions can be made, and therefore all tasks for a contig must be generated before calling can begin. Run hangs in decoy contigs Decoy contigs often have very high coverage (above 20,000X) and therefore the downsampling will likely occur. The downsampling algorithm octopus implements is non-trivial as it attempts to keep an even coverage across the downsampled region to avoid any bias. Unfortunately in very high coverage regions this can be very slow. We are working on improving this, but for now we suggest either increasing the downsampling thresholds, or removing these decoy contigs from analysis completely (using the option --skip-regions(file). ! of !42 37 BEHAVIOUR No calls are reported Try turning off read filters with the command --read-filtering off. If this results in calls then one or more of the filters is probably over-filtering, which usually implies a quirk of the read mapper that octopus does not recognise. In such cases it is best to find which filter is triggering (the --debug command can be used to help) and disable it. Regions are skipped because of too many haplotypes This warning can occur in very complex regions and is not normally a problem as it is usually reflective of bad read mappings. However, it is possible to force octopus to call these regions by increasing the --maxhaplotypes and --haplotype-overflow command line options. A call changes when a different input region is given Haplotype based variant callers gain power by jointly calling adjacent alleles, hence the result for one position is dependent on other nearby positions. If two target regions differ then different haplypes may be proposed for each, which can affect the overall result for a position even if it is present in both target regions, and even if no other calls result from the difference. Octopus tries to avoid this problem as much as possible by considering regions beyond those requested, but this cannot eliminate the problem entirely - the only complete solution is to call entire contigs. Why doesn’t octopus report genotype likelihoods? Octopus calculates the likelihood of the reads given a haplotype while records in the VCF file are alleles. It is non-trivial to calculate the genotype likelihood w.r.t an allele given the genotype likelihoods w.r.t haplotypes as the calculation must also condition on all other alleles used to compute the haplotypes likelihoods. This is not the case for posteriors, which can be marginalised in a trivial way, as all required information regarding other alleles is already present in the single posterior value. While we appreciate users like to see genotype likelihoods. We feel we can offer users more accurate results by providing flexible priors, and reporting exact genotype posteriors in the final output. Why do octopus VCF files contain * and .? Octopus uses advanced parts of the VCF specification to archive a consistent genotyping over all samples. The problem appears because primarily because multi-allelic sites must either be represented on a single line or split into multiple records. For shorter alleles the former option is fine, but the latter option is better for longer alleles (otherwise records containing 100+ bases to represent a SNP occur). The principle problem with splitting records like this is that inconstancies can arise if the records are treated independently, for example, consider a simple example: #CHROM POS 1 100 ID . REF A ALT C,G QUAL . FILTER INFO . . Now suppose we split the record into two separate records: ! of !42 38 FORMAT NA12878 GT 1|2 #CHROM POS 1 100 1 100 ID . . REF A A ALT C G QUAL . . FILTER INFO . . . . FORMAT NA12878 GT 1|0 GT 0|1 There is an inconstancy here! The first record states the genotype is C|A while the second claims it is A|G! The problem is that the reference is overrepresented in the genotype call. What we should be saying here is that the genotype cannot be fully represented due to a conflict with another call, which is exactly what octopus does! #CHROM POS 1 100 1 100 ID . . REF A A ALT C G QUAL . . FILTER INFO . . . . FORMAT NA12878 GT 1|. GT .|1 Note: octopus would actual represent this call as a multi-allelic record (as in the first case), this example is just intended to demonstrate the behaviour. Now for the “*” which can appear as an ALT allele. This is somewhat like the dot used to indicate an interaction, but specifically denotes an overlap with a deletion. Unfortunately, many other variant callers - and therefore downstream analysis tools - do not output consistent calls. In order to force octopus to output an additional potentially inconsistent VCF file which may work better with downstream tools, just add the option --legacy. SNP accuracy improves in fast mode In fast mode octopus turns off haplotype extension, which means there is no haplotype posterior based filtering. Haplotype extension is usually good; it can help resolve complex regions and allows long range phasing. However, in regions where the data cannot be modelled correctly (e.g. if many reads are missmapped), it can lead to significant portions of the overall posterior distribution being removed. This causes false positive calls to be given far higher posterior (and hence QUAL) than if no extension was used. This is especially true for SNPs. Many of these high quality false positives will filtered with the default model filters, but it is possible for some to pass these filters if too many reasonable haplotypes are removed before the model selection is applied. We are working on ways to improve this, one idea is to do a double pass of any complex regions to improve resolution. If this is causing a significant problem for you now, you can help alleviate the issue by increasing the posterior filter threshold (--haplotype-extension-threshold), or simply turn off haplotype extension by setting --phasing-level to minimal. Calling performance is worse with assembler The assembler often proposes many complex candidate variants which increases the number of haplotypes considerably. This can force octopus to rely on its haplotype filtering algorithms more than without the assembler, and while these algorithms often perform well, they are not foolproof. In addition, many gold standard reference sets will not contain real complex variants of the type the assembler is able to propose, as these are challenging to validate, and many variant callers are not even capable of calling such variants. It is advised the user interest such results with a degree of caution. ! of !42 39 CONTACT • Author and maintainer: Daniel Cooke (dcooke@well.ox.ac.uk). • Author: Gerton Lunter (gerton.lunter@well.ox.ac.uk). APPENDIX INSTALLING REQUIREMENTS OS X On OS X, Clang is recommended, and all requirements can be installed with the package manager Homebrew: $ brew update $ brew install git $ brew install --with-clang llvm $ brew install boost $ brew install cmake $ brew install python@3 $ brew install htslib If you already have any of these packages installed on your system with Homebrew the command will fail, but you can update to the latest version by using brew upgrade instead of brew install. Ubuntu On Ubuntu, most requirements can be obtained with apt-get. GCC 7 is recommended as this will simplify installing Boost. To obtain the requirements execute: $ sudo add-apt-repository ppa:ubuntu-toolchain-r/test $ sudo apt-get update && sudo apt-get upgrade $ sudo apt-get install gcc-7 $ sudo apt-get install git-all $ sudo apt-get install python3 Test GCC was successfully installed by typing $ g++-7 --version As only htslib 1.2.1 is available using apt-get, htslib 1.4 must be installed manually: $ sudo apt-get install autoconf $ git clone https://github.com/samtools/htslib.git $ cd htslib $ autoheader $ autoconf ! of !42 40 $ ./configure $ make && sudo make install CMake is also easy to install: $ sudo apt-get purge cmake $ mkdir ~/temp && cd ~/temp $ wget https://cmake.org/files/v3.9/cmake-3.9.6.tar.gz $ tar -xzvf cmake-3.9.6.tar.gz $ cd cmake-3.9.6/ $ ./bootstrap $ make -j4 $ sudo make install $ cmake --version Boost can be installed as follows: $ wget -O boost_1_65_1.tar.gz http://sourceforge.net/projects/boost/ files/boost/1.65.1/boost_1_65_1.tar.gz/download $ tar xzvf boost_1_65_1.tar.gz && cd boost_1_65_1 $ sudo apt-get update $ sudo apt-get install build-essential g++ python-dev autotools-dev libicu-dev build-essential libbz2-dev $ ./bootstrap.sh --prefix=/usr/ $ ./b2 $ sudo ./b2 install VARIANT GENERATION Genotype calls are made by generating a set of candidate variants and weighing up the evidence for each. The space of all possible variants is infinite, so heuristics must be employed to generate the candidate set. The performance of the variant generator is critical as the final calls will be a subset of the generated candidate variant set. Octopus can make use of multiple independent variant generators and take the union of the result of each. The two default variant generators are the raw cigar generator and the local reassembly generator. The raw cigar generator proposes a candidate whenever a mismatch, as indicated in a reads cigar string, satisfies some condition. The default condition used is dependent on the calling model selected, but can be modified by the user. The local reassembly generator constructs a De Bruin graph of all reads in a genomic interval, threaded around the reference sequence, and then extracts paths through the graph which deviate from the single reference path. Currently the only other non-default generator is the source file generator which simply extracts previously made calls from a VCF format file. We are working on a generator that will extract known variants from online databases. ! of !42 41 HAPLOTYPE GENERATION The haplotype generator takes the set of candidate variants generated by the variant generators and constructs sets of candidate haplotypes. The advantage of having these two stages independent is that haplotypes can then by dynamically manipulated after construction. In particular, it allows a haplotype to be extended conditionally on the posterior probability that the haplotype is present in the union of all sample genotypes. PHASING Phasing takes place after variants have been called - the algorithm uses the regions spanned by the calls and the inferred genotype posterior distribution to find optimal phase regions. It proceeds by arranging the called regions into groups such that the entropy of marginal posterior distributions between groups is maximised. In other words, if the posterior distribution assigns similar mass to different phasing of the same variant calls, then the true phasing is unknown, but if one particular phasing carries most of the posterior mass, then there is strong evidence that phasing is the true physical phasing. GLOSSARY Allele A particular instance of a genomic region, that is, a nucleotide sequence and a genomic region. Genomic region/interval A coordinate range with respect to a reference genome. All genomic regions must specify the contig (chromosome) and begin and end positions in the reference. Genotype A collection of haplotypes or alleles of known cardinality. Haplotype An ordered set of alleles which occur on the same chromosome in the sample. ! of !42 42
Source Exif Data:
File Type : PDF File Type Extension : pdf MIME Type : application/pdf PDF Version : 1.3 Linearized : No Page Count : 42 Title : octopus-user-manual Producer : macOS Version 10.14.2 (Build 18C54) Quartz PDFContext Creator : Pages Create Date : 2019:01:02 12:51:06Z Modify Date : 2019:01:02 12:51:06ZEXIF Metadata provided by EXIF.tools