Manual

User Manual:

Open the PDF directly: View PDF .
Page Count: 16

SomaticSeq Documentation

Li Tai Fang

May 23, 2018

1 Introduction

SomaticSeq is a ﬂexible post-somatic-mutation-calling workﬂow for improved accuracy. We have incorporated

multiple somatic mutation caller(s) to obtain a combined call set, and then it uses machine learning to

distinguish true mutations from false positives from that call set. We have incorporated the following

somatic mutation caller: MuTect/Indelocator/MuTect2, VarScan2, JointSNVMix, SomaticSniper, VarDict,

MuSE, LoFreq, Scalpel, Strelka, and TNscope. You may incorporate some or all of those callers into your

own pipeline with SomaticSeq.

The manuscript, An ensemble approach to accurately detect somatic mutations using Somat-

icSeq, is published in Genome Biology 2015, 16:197. The SomaticSeq project is located at

http://bioinform.github.io/somaticseq/ . The data described in the manuscript is also described at

http://bioinform.github.io/somaticseq/data.html. There have been some major improvements since the pub-

lication.

SomaticSeq.Wrapper.sh is a bash script that calls a series of scripts to combine the output of the somatic

mutation caller(s), after the somatic mutation callers are run. Then, depending on what R scripts are fed to

SomaticSeq.Wrapper.sh, it will either 1) train the call set into a classiﬁer, 2) predict high-conﬁdence somatic

mutations from the call set based on a pre-deﬁned classiﬁer, or 3) simply label the calls (i.e., PASS, LowQual,

or REJECT) based on majority vote of the tools.

1.1 Dependencies

•Python 3, plus regex, pysam, numpy, and scipy libraries. All the .py scripts are written in Python 3.

•R, plus the ada package in R.

•BEDTools (if there is an inclusion and/or an exclusion region ﬁle)

•Optional: Free GATK version 3. You can use GATK CombineVariants to merge VCF ﬁles if the jar

ﬁle is provided to --gatk. Otherwise, the sorting is done by the vcfsort.pl script written by German

Gaston Leparc.

•Optional: dbSNP and COSMIC ﬁles in VCF format (if you want to use these features as a part of the

training).

•At least one of MuTect/Indelocator/MuTect2, VarScan2, JointSNVMix2, SomaticSniper, VarDict,

MuSE, LoFreq, Scalpel, and/or Strelka2. Those are the tools we have incorporated in SomaticSeq. If

there are other somatic tools that may be good addition to our list, please make the suggestion to us.

1.2 Docker repos

SomaticSeq and (almost) all the somatic mutation callers we have incorporated are now dockerized. The

exception is VarScan2 because we do not have distribution rights.

•SomaticSeq is dockerized at https://hub.docker.com/r/lethalfang/somaticseq/.

•MuTect2 (tested with GATK:4.beta.6): https://hub.docker.com/r/broadinstitute/gatk

•VarScan2 (untested): https://hub.docker.com/r/djordjeklisic/sbg-varscan2/

•JointSNVMix2: https://hub.docker.com/r/lethalfang/jointsnvmix2/

•SomaticSniper: https://hub.docker.com/r/lethalfang/somaticsniper/

•VarDict: https://hub.docker.com/r/lethalfang/vardictjava/

•MuSE: https://hub.docker.com/r/marghoob/muse/

•LoFreq: https://hub.docker.com/r/marghoob/lofreq/

•Scalpel: https://hub.docker.com/r/lethalfang/scalpel/

•Strelka2: https://hub.docker.com/r/lethalfang/strelka/

2 To use SomaticSeq.Wrapper.sh

The SomaticSeq.Wrapper.sh is a wrapper script that calls a series of programs and procedures after you

have run your individual somatic mutation callers. In the next section, we will describe the workﬂow in more

detail, so you may not be dependent on this wrapper script. You can either modify this wrapper script or

create your own workﬂow.

2.1 To train data set into a classiﬁer

To create a trained classiﬁer, ground truth ﬁles are required for the data sets. There is also an option to

include a list of regions to include and/or exclude. The exclusion or inclusion regions can be VCF or BED

ﬁles. An inclusion region may be subset of the call sets where you have validated their true/false mutation

status, so that only those regions will be used for training. An exclusion region can be regions where the

“truth” is ambigious.

All the output VCF ﬁles from individual callers are optional. Those VCF ﬁles can be bgzipped if they

have .vcf.gz extensions. It is imperative that the parameters used for the training and prediction are identical.

1# For t r a i n i n g , t ru th f i l e and the c o r r e c t R s c r i p t a r e r e q u i r e d .

3Somatic Seq . Wrapper . sh \

−−mutect MuTect/ v a r i a n t s . snp . v c f \

5−−mutect2 MuTect2/ v a r i a n t s . v c f \

−−i n d e l o c a t o r I n d e l o c a t o r / v a r i a n t s . i n d e l . v c f \

7−−varscan−snv VarScan2/ v a r i a n t s . snp . v c f \

−−varscan−i n d e l VarScan2/ v a r i a n t s . i n d e l . v c f \

9−−jsm JointSNVMix2/ v a r i a n t s . snp . v c f \

−−s n i p e r Somatic Sniper / v a r i a n t s . snp . v c f \

11 −−v a r d i c t VarDict / v a r i a n t s . v c f \

−−muse MuSE/ v a r i a n t s . snp . v c f \

13 −−l o f r e q −snv LoFreq / v a r i a n t s . snp . v c f \

−−l o f r e q −i n d e l LoFreq / v a r i a n t s . i n d e l . v c f \

15 −−s c a l p e l S c a l p e l / v a r i a n t s . i n d e l . v c f \

−−strelka−snv S t r e l k a / v a r i a n t s . s nv . v c f \

17 −−strelka−i n d e l S t r e l k a / v a r i a n t s . i n d e l . v c f \

−−t n scope TNscope . v c f . gz \

19 −−normal−bam matched normal . bam \

−−tumor−bam tumor . bam \

21 −−ada−r−s c r i p t a d a m o d e l b u i l d e r .R \

−−genome−r e f e r e n c e human b37 . f a s t a \

23 −−co sm ic co smi c . b37 . v71 . v c f \

−−dbsnp dbSNP . b37 . v141 . v c f \

25 −−gatk $PATH/TO/GenomeAnalysisTK . j a r \

−−exclusion−r e g i o n i g n o r e . bed \

27 −−inclusion−r e g i o n v a l i d a t e d . bed

−−tr ut h −snv t r u t h . snp . v c f \

29 −−tr ut h −i n d e l tr ut h . i n d e l . v cf \

−−output−d i r $OUTPUT DIR

SomaticSeq.Wrapper.sh supports any combination of the somatic mutation callers we have incorporated

into the workﬂow. SomaticSeq will run based on the output VCFs you have provided. It will train for SNV

and/or INDEL if you provide the truth.snp.vcf and/or truth.indel.vcf ﬁle(s) as well as the proper R script

(ada model builder.R). Otherwise, it will fall back to the simple caller consensus mode.

2.2 To predict somatic mutation based on trained classiﬁers

Make sure the classiﬁers and the proper R script (ada model predictor.R) are supplied, Without either of

them, it will fall back to the simple caller consensus mode.

# The ∗. RData f i l e s a re t ra i ne d c l a s s i f i e r from the t r a i n i n g mode .

2SomaticSeq . Wrapper . sh \

−−mutect MuTect/ v a r i a n t s . snp . v c f \

4−−mutect2 MuTect2/ v a r i a n t s . v c f \

−−i n d e l o c a t o r I n d e l o c a t o r / v a r i a n t s . i n d e l . v c f \

6−−varscan−snv VarScan2 / v a r i a n t s . snp . v c f \

−−varscan−i n d e l VarScan2 / v a r i a n t s . i n d e l . v c f \

8−−jsm JointSNVMix2/ v a r i a n t s . snp . v c f \

−−s n i p e r So ma ti cS ni pe r / v a r i a n t s . snp . v c f \

10 −−v a r d i c t VarDict / v a r i a n t s . v c f \

−−muse MuSE/ v a r i a n t s . snp . v c f \

12 −−l o f r e q −snv LoFreq / v a r i a n t s . snp . v c f \

−−l o f r e q −i n d e l LoFreq / v a r i a n t s . i n d e l . v c f \

14 −−s c a l p e l S c a l p e l / v a r i a n t s . i n d e l . v c f \

−−strelka−snv S t r e l k a / v a r i a n t s . snv . v c f \

16 −−strelka−i n d e l S t r e l k a / v a r i a n t s . i n d e l . v c f \

−−t n sc o pe TNscope . v c f . gz \

18 −−normal−bam matched normal . bam \

−−tumor−bam tumor . bam \

20 −−ada−r−s c r i p t a d a m o d e l p r e d i c t o r . R \

−−genome−r e f e r e n c e human b37 . f a s t a \

22 −−co sm ic co smi c . b37 . v71 . v c f \

−−dbsnp dbSNP . b37 . v141 . v c f \

24 −−snpeff−d i r $PATH/TO/DIR/ s n p S i f t \

−−gatk $PATH/TO/GenomeAnalysisTK . j a r \

26 −−exclusion−r e g i o n i g n o r e . bed \

−−inclusion−r e g i o n v a l i d a t e d . bed

28 −− classifier −snv sSNV . C l a s s i f i e r . RData \

−− classifier −i n d e l sINDEL . C l a s s i f i e r . RData \

30 −−pa ss−threshold 0.5 \

−−lowqual−threshold 0.1 \

32 −−output−d i r $OUTPUT DIR

If both MuTect2 and MuTect/Indelocator VCF ﬁles are provided, the script is written such that it will

use MuTect2 and ignore MuTect.

2.3 To classify based on simple majority vote

Same as the command previously, but not including the R script or the ground truth ﬁles. Without those

information, SomaticSeq will fall back into a simple majority vote.

3 The step-by-step SomaticSeq Workﬂow

We’ll describe the workﬂow here, so you may modify the workﬂow and/or create your own workﬂow instead

of using the wrapper script described previously.

3.1 Combine the call sets

We use GATK CombineVariants to combine the VCF ﬁles fkeep-intermediates:rom diﬀerent callers, although

it does not matter what tools are used to merge VCF ﬁles. We GATK CombineVariants because it’s quite

fast. To make them compatible with GATK, the VCF ﬁles are modiﬁed.

A simple alternative method is to use GNU sort and uniq in Linux to list all the unique ﬁrst-5-column

(i.e., CHROM, POS, ID, REF, and ALT) from all the VCF ﬁles, and then ﬁll the remaining required VCF

columns with whatever string and sort it according to the reference. That VCF ﬁle will do the job just ﬁne.

For our GATK CombineVariants procedure:

1. Modify MuTect and/or Indelocator output VCF ﬁles. Since MuTect’s output VCF do not always put

the tumor and normal samples in the same columns, the script will determine that information based

on either the BAM ﬁles (the header has sample name information), or based on the sample information

that you tell it, and then determine which column belongs to the normal, and which column belongs

to the tumor.

1# Modify MuTect and I n d e l o c a t o r ’ s o ut pu t VCF ba se d on BAM f i l e s

modify MuTect . py −i n f i l e i n put . v c f −o u t f i l e ou tpu t . v cf −nbam normal . bam −tbam tumor . bam

# Based on t he sample name you su pp l y :

5modify MuTect . py −i n f i l e i n put . v c f −o u t f i l e ou tpu t . v cf −nsm NormalSampleName −tsm

TumorSampleName

2. For MuTect2, this script will split multi-allelic records into one variant per line in the VCF ﬁle. This

is to make thing easier for the SSeq merged.vcf2tsv.py script later.

1# Based on t he sample name you su pp l y :

modify MuTect2 . py −i n f i l e MuTect2 . F i l t e r e d . v c f −snv m ute ct . snp . v c f −i n d e l mutect . i n d e l .

v c f

3. Modify VarScan’s output VCF ﬁles to be rigorously concordant to VCF format standard, and to attach

the tag ’VarScan2’ to somatic calls.

# Do i t f o r both th e SNV and i n d e l

2modify VJSD . py −method VarScan2 −i n f i l e i npu t . v c f −o u t f i l e output . v cf

4. JointSNVMix2 does not output VCF ﬁles. In our own workﬂow, we have already converted its text

ﬁle into a basic VCF ﬁle with an 2 awk one-liners, which you may see in the Run 5 callers directory,

which are:

# To a vo i d t e x t f i l e s i n t he o r d e r o f t e r a b y t e s , t h i s awk one−l i n e r k e ep s e n t r i e s where

th e r e f e r e n c e i s not ”N” , and th e so ma tic p r o b a b i l i t i e s a re at l e a s t 0 . 9 5 .

2awk −F”\t ” ’NR!=1 && $4 !=”N” && $10+$11 >=0.95 ’

4# This awk one−l i n e r c o n v e r t s th e t e x t f i l e i n t o a b a s i c VCF f i l e

awk −F”\t ” ’{p r i n t $1 ”\t ” $2 ”\t . \t ” $3 ”\t ” $4 ”\t . \t . \tAAAB=” $10 ” ;AABB=” $11 ”\

tRD :AD\t ” $5 ” : ” $6 ”\t ” $7 ” : ” $8 }’

## The a c t u a l commands we ’ ve used i n our wo rk fl ow :

8echo −e ’##f i l e f o r m a t=VCFv4 . 1 ’ >unsorted . vcf

echo −e ’##INFO=<ID=AAAB, Number=1,Type=Fl oa t , D e s c r i p t i o n=” P r o b a b i l i t y o f J oi n t Genotype

AA i n Normal and AB i n Tumor”>’>> u ns o rt ed . v c f

10 echo −e ’##INFO=<ID=AABB, Number=1 ,Type=F lo at , D e s c r i p t i on =” P r o b a b i l i t y o f J o in t Genotype

AA i n Normal and BB i n Tumor”>’>> u n s or te d . v c f

echo −e ’##FORMAT=<ID=RD, Number=1 ,Type=I n t eg e r , D e s c r i p t i o n =”Depth o f r e f e r e n c e −

s u p p o r t i n g b a s e s ( r e a d s 1 ) ”>’>> u ns o rt ed . v c f

12 echo −e ’##FORMAT=<ID=AD, Number=1,Type=I n t eg e r , D e s c r i p t i o n =”Depth o f v ar i an t −supporting

b a s e s ( r e a d s 2 ) ” >’>> u ns o rt ed . v c f

echo −e ’#CHROM\tPOS\tI D \tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tTUMOR’ >>

u ns o rt ed . v c f

python $PATH/TO/ jsm . py c l a s s i f y j oi n t s nv m i x t w o genome . GRCh37. f a normal . bam tumor . bam

t r a i n e d . pa ra me te r . c f g / dev / s t do u t | \

16 awk −F”\t ” ’NR!=1 && $4 !=”N” && $10+$11 >=0.95 ’ | \

awk −F”\t ” ’{p r i n t $1 ”\t ” $2 ”\t . \t ” $3 ”\t ” $4 ”\t . \t . \tAAAB=” $10 ” ;AABB=” $11 ”\

tRD :AD\t ” $5 ” : ” $6 ”\t ” $7 ” : ” $8 }’>> u n s or te d . v c f

After that, you’ll also want to sort the VCF ﬁle. Now, to modify that basic VCF into something that

will be compatible with other VCF ﬁles under GATK CombineVariants:

1modify VJSD . py −method JointSNVMix2 −i n f i l e i n p u t . v cf −o u t f i l e outpu t . v cf

5. Modify SomaticSniper’s output:

1modify VJSD . py −method So mat i c Sni p e r −i n f i l e i n put . v c f −o u t f i l e ou tpu t . v cf

6. VarDict has both SNV and indel, plus some other variants in the same VCF ﬁle. Our script will

create two ﬁles, one for SNV and one for indel, while everything else is ignored for now. By default,

LikelySomatic/StrongSomatic and PASS calls will be labeled VarDict. However, in our SomaticSeq

paper, based on our experience in DREAM Challenge, we implemented two custom ﬁlters to relax the

VarDict tagging criteria.

1# D e f a u l t V ar Di ct t a g g i n g c r i t e r i a , o nl y PASS ( and L i k e l y o r S tr o ng So ma ti c ) :

modify VJSD . py −method VarDict −i n f i l e i n t p u t . v c f −o u t f i l e ou tput . v cf

# When ru nn ing VarDict , i f v a r 2 v c f p a i r e d . p l i s used t o g e ne r a te t he VCF f i l e , you may

r e l a x t he t a g g i n g c r i t e r i a w it h −f i l t e r p a ir e d

5modify VJSD . py −method VarDict −i n f i l e i n t p u t . v c f −o u t f i l e ou tput . v cf −f i l t e r p a ir e d

7# When ru nni ng VarD ict , i f v a r 2 v c f s o m a t i c . p l i s us ed t o g e n e r a t e t he VCF f i l e , you may

r e l a x t he t a g g i n g c r i t e r i a w it h −f i l t e r so mat i c

modify VJSD . py −method VarDict −i n f i l e i n t p u t . v c f −o u t f i l e ou tput . v cf −f i l t e r s o m ati c

In the SomaticSeq paper, -ﬁlter somatic was used because var2vcf somatic.pl was used to generate

VarDict’s VCF ﬁles. In the SomaticSeq.Wrapper.sh script, however, -ﬁlter paired is used because

VarDict authors have since recommended var2vcf paired.pl script to create the VCF ﬁles. While there

are some diﬀerences (diﬀerent stringencies in some ﬁlters) in what VarDict labels as PASS between

the somatic.pl and paired.pl scripts, the diﬀerence is miniscule after applying our custom ﬁlter (which

relaxes the ﬁlter, resulting in a diﬀerence about 5 calls out of 15,000).

The output ﬁles will be snp.output.vcf and indel.output.vcf.

7. MuSE was not a part of our analysis in the SomaticSeq paper. We have implemented it later.

modify VJSD . py −method MuSE −i n f i l e inp u t . v c f −o u t f i l e output . v c f

8. LoFreq and Scalpel do not require modiﬁcation. LoFreq has no sample columns anyway.

9. Add “GT” ﬁeld to sample columns to make it compatible with GATK CombineVariants.

1m o d i f y S t r e l k a . py −i n f i l e som a t ic . snv s . v cf . gz −o u t f i l e s t r a l k a . snv . v c f

10. Finally, with the VCF ﬁles modiﬁed, you may combine them with GATK CombineVariants: one for

SNV and one for indel separately. There is no particular reason to use GATK CombineVariants.

Other combiners should also work. The only useful thing here is to combine the calls and GATK

CombineVariants does it well and pretty fast.

1# Combine th e VCF f i l e s f o r SNV. Any or a l l o f the VCF f i l e s may be p r es en t .

#−nt 6 means to u se 6 t h r e a d s i n p a r a l l e l

3ja v a −j a r $PATH/TO/ GenomeAnalysisTK . j a r −T CombineVariants −R genome . GRCh37 . f a −nt 6 −−

setKey n u l l −−genotypemergeoption UNSORTED −V mutec t . v c f −V v ar sc an . snp . v c f −V

jointsnvmix . vcf −V snp . v a r d i c t . v c f −V muse . v c f −−out CombineVariants . snp . vcf

ja v a −j a r $PATH/TO/ GenomeAnalysisTK . j a r −T CombineVariants −R genome . GRCh37 . f a −nt 6 −−

setKey n u l l −−genotypemergeoption UNSORTED −V i n d e l o c a t o r . v c f −V v ar sc an . snp . v c f −V

i n d e l . v a r d i c t . v c f −−ou t C ombineVari ants . i n d e l . v c f

3.2 For model training: process and annotate the VCF ﬁles (union of call sets)

This step may be needed for model training. The workﬂow in SomaticSeq.Wrapper.sh allows for inclusion

and exclusion region. An inclusion region means we will only use calls inside these regions. An exclusion

region means we do not care about calls inside this region. DREAM Challenge had exclusion regions, e.g.,

blacklisted regions, etc.

# In t he DREAM Stage 3 d i r e c t o r y , we have i n c l u d ed an e x c l u s i o n r e g i o n BED f i l e a s an

example

2# This command u se s BEDtools to r i d o f a l l c a l l s in t he e x c l u s i o n r e g i on

intersectBed −header −a BINA somatic . snp . v c f −b i g n o r e . bed −v>s om at i c . snp . p r o c e s s e d . v c f

4intersectBed −header −a BINA somatic . i n d e l . v c f −b i g n o r e . bed −v>so m at ic . i n d e l . p r o c e s s e d .

v c f

6# A l t e r n a t i v e l y ( or both ) , t h i s command u se s BEDtools to keep on ly c a l l s i n the i n c l u s i o n

r e g i o n

intersectBed −header −a BINA somatic . snp . v c f −b i n c l u s i o n . bed >so ma ti c . snp . p r o c e s s e d . v c f

8intersectBed −header −a BINA somatic . i n d e l . v c f −b i n c l u s i o n . bed >so ma t ic . i n d e l . p r o c e s s e d .

v c f

3.3 Convert the VCF ﬁle, annotated or otherwise, into a tab separated ﬁle

This script works for all VCF ﬁles. It extracts information from BAM ﬁles as well as some VCF ﬁles created

by the individual callers. If the ground truth VCF ﬁle is included, a called variant will be annotated as a

true positive, and everything will be annotated as a false positive.

1# SNV

SSe q merged . v c f 2 t s v . py −r e f genome . GRCh37 . f a −myvcf s om a ti c . snp . p r o c e s s e d . v c f −t r u t h Ground .

t r u t h . snp . v c f −mutect MuTect/ v a r i a n t s . snp . v c f . gz −v ar sc an VarScan2 / v a r i a n t s . snp . v c f −jsm

JSM2/ v a r i a n t s . v c f −s n i p e r S om ati cS ni pe r / v a r i a n t s . v c f −v a r d i c t Va rDict / snp . v a r i a n t s . v c f

−muse MuSE/ v a r i a n t s . v c f −l o f r e q LoFreq / v a r i a n t s . snp . v c f −s t r e l k a S t r e l k a / v a r i a n t s . snp .

v c f −dedup −tbam tumor . bam −nbam normal . bam −o u t f i l e Ensemble . sSNV . t s v

That was for SNV, and indel is almost the same thing. After version 2.1, we have replaced all information

from SAMtools and HaplotypeCaller with information directly from the BAM ﬁles. The accuracy diﬀerences

are negligible with signiﬁcant improvement in usability and resource requirement.

# INDEL :

2SSe q merged . v c f 2 t s v . py −r e f genome . GRCh37 . f a −myvcf s om a ti c . i n d e l . p r o c e s s e d . v c f −truth

Ground . t r u t h . i n d e l . v c f −va rs ca n VarScan2 / v a r i a n t s . snp . v c f −v a r d i c t VarDict / i n d e l .

v a r i a n t s . v c f −l o f r e q LoFreq / v a r i a n t s . i n d e l . v c f −s c a l p e l S c a l p e l / v a r i a n t s . i n d e l . v c f −

s t r e l k a S t r e l k a / v a r i a n t s . i n d e l . v c f −tbam tumor . bam −nbam normal . bam −dedup −o u t f i l e

Ensemble . sINDEL . t s v

At the end of this, Ensemble.sSNV.tsv and Ensemble.sINDEL.tsv are created.

All the options for SSeq merged.vcf2tsv.py are listed here. They can also be displayed by running

SSeq merged.vcf2tsv.py --help.

2−myvcf I nput VCF f i l e o f t he merged c a l l s [REQUIRED]

−r e f Genome r e f e r e n c e f a / f a s t a f i l e [REQUIRED]

4−nbam BAM f i l e o f t he matched normal sample [REQUIRED]

−tbam BAM f i l e o f t he tumor sample [REQUIRED]

6−r e f Genome r e f e r e n c e f a / f a s t q f i l e [REQUIRED]

−t r u t h Ground t r u t h VCF f i l e . Ev ery o t h e r p o s i t i o n i s a F a l s e P o s i t i v e .

8−dbsnp dbSNP VCF f i l e

−cosm ic COSMIC VCF f i l e

10 −mutect VCF f i l e from e i t h e r MuTect2 , MuTect , or I n d e l o c a t o r

−s n i p e r VCF f i l e from S om ati cS ni pe r

12 −va r s c a n VCF f i l e from VarScan2

−jsm VCF f i l e from Bina ’ s wo rk fl ow t h at c o n t a i n s JointSNVMix2

14 −v a r d i c t VCF f i l e t ha t c o n ta i n s o nl y SNV o r on ly INDEL from VarDic t

−muse VCF f i l e from MuSE

16 −l o f r e q VCF f i l e from LoFreq

−s c a l p e l VCF f i l e from S c a l p e l

18 −s t r e l k a VCF f i l e from S t r e l k a

−dedup A f l a g t o c o n s i d e r o nl y pr imar y r ea d s

20 −minMQ Minimum mapping q u a l i t y f o r r ea d s t o be c o n s i d e r e d ( D e f au l t = 1 )

−minBQ Minimum bas e q u a l i t y f o r r e ad s to be c o n s i d e r e d ( D e f a ul t = 5)

22 −m i n c a l l e r Minimum number o f caller classification f o r a c a l l to be c o n s i d e r e d ( Use 0 . 5

to c o n s i d e r some LowQual c a l l s . D e fa u lt = 0 ) .

−s c a l e The o p t i o n s a r e ph red , f r a c t i o n , or None , t o c o n v e r t numbers t o Phred s c a l e

o r f r a c t i o n a l s c a l e . ( d e f a u l t = None , i . e . , no c o n v e r s i o n )

24 −o u t f i l e Output TSV f i l e name

Note: Do not worry if Python throws a warning like this.

1RuntimeWarning : i n v a l i d va lu e e nco un ter ed i n double scalars

z = ( s −ex pe ct ed ) / np . s q r t ( n1∗n2∗( n1+n2+1) / 1 2 . 0 )

This is to tell you that scipy was attempting some statistical test with empty data. That’s usually due

to the fact that normal BAM ﬁle has no variant reads at that given position. That is why lots of values are

NaN for the normal.

3.4 Model Training or Mutation Prediction

You can use Ensemble.sSNV.tsv and Ensemble.sINDEL.tsv ﬁles either for model training (provided that their

mutation status is annotated with 0 or 1) or mutation prediction. This is done with stochastic boosting

algorithm we have implemented in R.

Model training:

# T r a i n i n g :

2r s c r i p t / a d a m o d e l b u i l d e r .R Ensemble . sSNV . t s v

r s c r i p t / a d a m o d e l b u i l d e r .R Ensemble . sINDEL . t s v

Ensemble.sSNV.tsv.Classiﬁer.RData and Ensemble.sINDEL.tsv.Classiﬁer.RData will be created from

model training.

Mutation prediction:

1# Mutation p r e d i c t i o n :

r s c r i p t / a d a m o d e l p r e d i c t o r . R Ensemble . sSNV . t s v . C l a s s i f i e r . RData Ensemble . sSNV . t s v

T ra in e d . sSNV . t s v

3r s c r i p t / a d a m o d e l p r e d i c t o r . R Ensemble . sINDEL . t s v . C l a s s i f i e r . RData Ensemble . sINDEL . t s v

T ra in e d . sINDEL . t s v

After mutation prediction, if you feel like it, you may convert Trained.sSNV.tsv and Trained.sINDEL.tsv

into VCF ﬁles. Use -tools to list ONLY the individual tools used to have appropriately annotated VCF

ﬁles. Accepted tools are CGA (for MuTect/Indelocator), VarScan2, JointSNVMix2, SomaticSniper, VarDict,

MuSE, LoFreq, and/or Scalpel. To list a tool without having run it, the VCF will be annotated as if the

tool was run but did not identify that position as a somatic variant.

1# P r o b a b i l i t y above 0 . 7 l a b e l e d PASS (−pa ss 0 . 7 ) , and between 0 . 1 and 0 . 7 l a b e l e d LowQual (−

low 0 . 1 ) :

# Use −a l l to i n c l u d e REJECT c a l l s i n t he VCF f i l e

3# Use −phred to c on ve rt p r o b a b i l i t y v al u es ( between 0 to 1 ) i nt o Phred s c a l e i n t he QUAL

column i n th e VCF f i l e

5S S e q t s v 2 v c f . py −t s v T ra in ed . sSNV . t s v −v c f Tra in ed . sSNV . v c f −p as s 0 . 7 −low 0 . 1 −tools

CGA VarScan2 JointSNVMix2 So ma ti cS ni pe r VarDict MuSE LoFreq S t r e l k a −all −phred

7S S e q t s v 2 v c f . py −t s v T ra in ed . sINDEL . t s v −v c f T ra in ed . sINDEL . v c f −p a ss 0 . 7 −low 0 . 1 −tools

CGA VarScan2 VarDict LoFreq S c a l p e l S t r e l k a −a l l −phred

4 To run the dockerized somatic mutation callers

For your convenience, we have created a couple of scripts that can generate run script for the dockerized

somatic mutation callers.

4.1 Location

•somaticseq/utilities/dockered pipelines/

4.2 Requirements

•Have internet connection, and able to pull and run docker images from docker.io

•Have cluster management system such as Sun Grid Engine, so that the ”qsub” command is valid

4.3 Example commands

For single-threaded jobs, best suited for whole exome sequencing or less.

1# Example command to submit t he run s c r i p t s f o r each o f the f o l l o w i n g s oma tic mutation

c a l l e r s

u t i l i t i e s / d o c k e r e d p i p e l i n e s / s in gl eT h re ad / s u b m i t c a l l e r s s i n g l e T h r e a d . sh \

3−−normal−bam /ABSOLUTE/PATH/TO/ nor ma l s am ple . bam \

−−tumor−bam /ABSOLUTE/PATH/TO/ t umor sa mpl e . bam \

5−−human−r e f e r e n c e /ABSOLUTE/PATH/TO/GRCh38 . f a \

−−output−d i r /ABSOLUTE/PATH/TO/RESULTS \

7−−s e l e c t o r /ABSOLUTE/PATH/TO/Exome Capture . GRCh38 . bed \

−−dbsnp /ABSOLUTE/PATH/TO/dbSNP . GRCh38 . v c f \

9−−a c t i o n qsub \

−−mutect2 −−jointsnvmix2 −−somaticsniper −−vardict −−muse −−lofreq −−scalpel −−strelka −−

somaticseq

For multi-threaded jobs, best suited for whole genome sequencing.

# Su bmi tt ing mu tation c a l l e r j o b s by s p l i t t i n g ea ch j ob i n t o 36 even r e g i o n s .

2u t i l i t i e s / d o c k e r e d p i p e l i n e s / mu lt iTh rea ds / s u b m i t c a l l e r s m u l t i T h r e a d s . sh \

−−normal−bam /ABSOLUTE/PATH/TO/ nor ma l s am ple . bam \

4−−tumor−bam /ABSOLUTE/PATH/TO/ t umor sa mpl e . bam \

−−human−r e f e r e n c e /ABSOLUTE/PATH/TO/GRCh38 . f a \

6−−output−d i r /ABSOLUTE/PATH/TO/RESULTS \

−−dbsnp /ABSOLUTE/PATH/TO/dbSNP . GRCh38 . v c f \

8−−threads 36 \

−−a c t i o n qsub \

10 −−mutect2 −−somaticsniper −−vardict −−muse −−lofreq −− scalpel −−strelka −−somaticseq

4.3.1 Parameters

−−normal−bam /ABSOLUTE/PATH/TO/ no rm al sampl e . bam ( Re qu ired )

2−−tumor−bam /ABSOLUTE/PATH/TO/ tumo r sampl e . bam ( R equ ir ed )

−−human−r e f e r e n c e /ABSOLUTE/PATH/TO/ hu man ref er e nce . f a ( Requ ire d )

4−−dbsnp /ABSOLUTE/PATH/TO/dbsnp . v c f ( R eq ui red f o r MuSE and LoFreq )

−−cos mic /ABSOLUTE/PATH/TO/ co smic . v c f ( O p tio n al )

6−−s e l e c t o r /ABSOLUTE/PATH/TO/ C a p t u r e r e g i o n . bed ( O pt i on a l . W il l assume

whole genome from the . f a i f i l e w ith ou t i t . )

−−e x cl u d e /ABSOLUTE/PATH/TO/ B l a c k l i s t r e g i o n . bed ( O pt io na l )

8−−min−a f ( Op tio nal . The minimum VAF c u t o f f for VarDict and VarScan2 .

D e f a u l t s a r e 0 . 1 0 f o r VarScan2 and 0 . 0 5 f o r VarDict ) .

−−a c t i o n qsub ( O pt i on a l : t he command p r e c e d i n g t he . cmd s c r i p t s .

Default is echo)

10 −−t h r e a d s 36 ( O pt i o n al for mu lt iT hr ea ds and i n v a l i d f o r singleThread :

e v e n l y s p l i t t h e genome i n t o 36 BED f i l e s . D e f a u l t = 1 2 ) .

−−mutect2 ( O pt io na l f l a g to in v o ke MuTect2 )

12 −−va rsc an2 ( Op t ion a l f l a g to in v ok e VarScan2 )

−−j o i n t s n v m i x 2 ( O pt i on a l f l a g t o i n v o k e JointSNVMix2 )

14 −−s o m a t i c s n i p e r ( Op ti on al f l a g t o i nv ok e S om at ic Sn ip er )

−−v a r d i c t ( Op ti ona l f l a g t o in vo ke VarDict )

16 −−muse ( Op t io na l f l a g to in v o ke MuSE)

−−l o f r e q ( Op t io n a l f l a g t o in v o k e LoFreq )

18 −−s c a l p e l ( Op tio na l f l a g to i nv ok e S c a l p e l )

−−s t r e l k a ( O pt io na l f l a g t o in vo ke S t r e l k a )

20 −−s o m a t i c s e q ( O p ti o n a l f l a g t o i n v ok e S om at ic Se q . Th i s s c r i p t al wa y s be

echo ’ ed , a s i t s h oul d not be sub mi tt ed until a l l t he c a l l e r s above com ple te ) .

−−output−d i r /ABSOLUTE/PATH/TO/OUTPUT DIRECTORY ( R equir ed )

22 −−somaticseq−d i r So ma ti cSe q Ou tp ut D ire ct or y ( Op ti on al . The d i r e c t o r y name of

t he So ma ti cS eq o ut pu t . D e f a u l t = S om at icS eq ) .

−−somaticseq−t r a i n ( O pt i o nal f l a g to i nvo k e SomaticSeq to produce c l a s s i f i e r s i f

gr ound t r u t h VCF f i l e s a r e p r o v i de d . Only recommended i n s i n g l e T h r e a d mode , b e ca u s e

o t h e r w i s e i t ’ s b e t t e r t o com bine th e ou tpu t TSV f i l e s f i r s t , and then t r a i n c l a s s i f i e r s

. )

24 −−somaticseq−a c t i o n ( O pt io na l . What to do w i th t h e s o m a t i c s e q . cmd . D e f a u l t i s echo

. Only do ” qsu b ” i f you have al r ea dy compl eted a l l th e mutation c a l l e r s , but want t o run

SomaticSeq at a d i f f e r e n t s e t t i n g . )

−− classifier −sn v T r a i n e d s S N V C l a s s i f i e r . RData ( O pt io na l i f t h e r e i s a

c l a s s i f e r you want t o u se )

26 −− classifier −i n d e l T r a i n e d s I N D E L C l a s s i f i e r . RData ( O pt io n al i f t h e r e i s a

c l a s s i f e r you want t o u se )

−−tr ut h −snv sS NV gro un d t ru th . v c f ( O pt i on a l i f t h e r e i s a gr oun d t ru th ,

and everything e l s e w i l l be l a b e l e d false positive )

28 −−tr ut h −i n d e l sIND EL g ro un d truth . v c f ( O p ti on al i f t h e r e i s a g round t ru th ,

and everything e l s e w i l l be l a b e l e d false positive )

−−exome ( Op t i on al f l a g f o r S t r e l k a )

30 −−scalpel−two−p a s s ( Op t i on a l p arame ter f o r S c a l p e l . D e fa ul t = false . )

−−mutect2−arguments ( Extra p a ra m ete r s t o pa s s o nto Mutect2 , e . g . , −−mutect2−

arguments ’−− i n i t i a l t u m o r l o d 3 . 0 −−log somatic prior −5.0 −−min base quality score

2 0 ’ )

32 −−mutect2−f i l t e r −a rguments ( Extr a p ar am ete rs t o p as s ont o F i l t e r M u t e c t C a l l s )

−−varscan−arguments ( Extra p a ram e te r s t o pa s s onto VarScan2 )

34 −−varscan−pileup−arguments ( Extra p ara met ers to pas s onto sa mt oo ls mpileup t ha t c r e a t e s

p i l e u p f i l e s f o r VarScan)

−−jsm−train−arguments ( Extra p ara met ers to pas s onto JointSNVMix2 ’ s t r a i n command)

36 −−jsm−classify−arguments ( Extra p a ra m ete r s t o pa s s o nto JointSNVMix2 ’ s c l a s s i f y command

)

−−somaticsniper−arguments ( Extra p a ram e te r s t o pa s s on to S o m ati c S n ip e r )

38 −−vardict−arguments ( Extra p ara m et e rs t o pa s s onto VarDict )

−−muse−arguments ( Extra par a me t ers to pa s s on to MuSE)

40 −−l o f r e q −arguments ( Extra pa r ame t er s to pa s s o nto LoFreq )

−−scalpel−discovery−arguments ( Ex tra p ar am et er s to p as s onto S c a l pe l ’ s d i s c o v e r y command)

42 −−scalpel−ex po rt −a rguments ( E xtr a p a ra me te rs t o p a s s on to S c a l p e l ’ s e x p o r t command)

−−strelka−config−ar guments ( E xt ra p ar a me t er s t o p a s s on to S t r e l k a ’ s c o n f i g command)

44 −−strelka−run−a rguments ( Extra p ar am e te r s t o p a ss o nt o S t r e k l a ’ s run command )

−−somaticseq−arguments ( Extra p ara m et e rs t o pa s s onto SomaticSeq . Wrapper . sh )

4.3.2 What does the single-threaded command do

•For each ﬂag such as --mutect2, --jointsnvmix2, ...., --strelka, a run script ending with .cmd will

be created in /ABSOLUTE/PATH/TO/RESULTS/logs. By default, these .cmd scripts will only be

created, and their ﬁle path will be printed on screen. However, if you do “--action qsub”, then these

scripts will be submitted via the qsub command. The default action is “echo.”

–Each of these .cmd script correspond to a mutation caller you speciﬁed. They all use docker

images.

–We may improve their functionalities in the future to allow more tunable parameters. For the

initial releases, POC and reproducibility take precedence.

•If you do “--somaticseq,” the somaticseq script will be created in /ABSOLUTE/PATH/TO/RESULT-

S/SomaticSeq/logs. However, it will not be submitted until you manually do so after each of these

mutation callers is ﬁnished running.

–In the future, we may create more sophisticated solution that will automatically solves these

dependencies. For the initial release, we’ll focus on stability and reproducibility.

•Due to the way those run scripts are written, the Sun Grid Engine’s standard error log will record the

time the task completes (i.e., Done at 2017/10/30 29:03:02), and it will only do so when the task is

completed with an exit code of 0. It can be a quick way to check if a task is done, by looking at the

ﬁnal line of the standard error log ﬁle.

4.3.3 What does the multi-threaded command do

It’s very similar to the single-threaded WES solution, except the job will be split evenly based on genomic

lengths.

•If you speciﬁed “--threads 36,” then 36 BED ﬁles will be created. Each BED ﬁle represents 1/36

of the total base pairs in the human genome (obtained from the .fa.fai ﬁle, but only including 1,

2, 3, ..., MT, or chr1, chr2, ..., chrM contigs). They are named 1.bed, 2.bed, ..., 36.bed, and will

be created into /ABSOLUTE/PATH/TO/RESULTS/1, /ABSOLUTE/PATH/TO/RESULTS/2, ...,

/ABSOLUTE/PATH/TO/RESULTS/36. You may, of course, specify any number. The default is 12.

•For each mutation callers you specify (with the exception of SomaticSniper), a script will be created

into /ABSOLUTE/PATH/TO/RESULTS/1/logs, /ABSOLUTE/PATH/TO/RESULTS/2/logs, etc.,

with partial BAM input. Again, they will be automatically submitted if you do “--action qsub.”

•Because SomaticSniper does not support partial BAM input (one would have to manually split the

BAMs in order to parallelize SomaticSniper this way), the above mentioned procedure is not applied

to SomaticSniper. Instead, a single-threaded script will be created (and potentially qsub’ed) into

/ABSOLUTE/PATH/TO/RESULTS/logs.

–However, because SomaticSniper is by far the fastest tool there, single-thread is doable even for

WGS. Even single-threaded SomaticSniper will likely ﬁnish before parallelized Scalpel. When I

benchmarked the DREAM Challenge Stage 3 by splitting it into 120 regions, Scalpel took 10

hours and 10 minutes to complete 1/120 of the data. SomaticSniper took a little under 5 hours

for the whole thing.

–After SomaticSniper ﬁnishes, the result VCF ﬁles will be split into each of the /ABSOLUTE/-

PATH/TO/RESULTS/1, /ABSOLUTE/PATH/TO/RESULTS/2, etc.

•JointSNVMix2 also does not support partial BAM input. Unlike SomaticSniper, it’s slow and takes

massive amount of memory. It’s not a good idea to run JointSNVMix2 on a WGS data. The only way

to do so is to manually split the BAM ﬁles and run each separately. We may do so in the future, but

JointSNVMix2 is a 5-year old that’s no longer being supported, so we probably won’t bother.

•Like the single-threaded case, a SomaticSeq run script will also be created for each partition like

/ABSOLUTE/PATH/TO/RESULTS/1/SomaticSeq/logs, but will not be submitted until you do so

manually.

–For simplicity, you may wait until all the mutation calling is done, then run a command like

1f i n d /ABSOLUTE/PATH/TO/RESULTS −name ’ somaticseq∗. cmd ’ −ex e c qsub {} \ ;

5 Use BAMSurgeon to create training set

For your convenience, we have created a couple of wrapper scripts that can generate the run script to create

training data using BAMSurgon at somaticseq/utilities/dockered pipelines/bamSimulator. Descriptions and

example commands can be found in the README there.

5.1 Requirements

•Have internet connection, and able to pull and run docker images from docker.io

•Have cluster management system such as Sun Grid Engine, so that the ”qsub” command is valid

6 Release Notes

Make sure training and prediction use the same version. Otherwise the prediction is not valid.

6.1 Version 1.0

Version used to generate data in the manuscript and Stage 5 of the ICGC-TCGA DREAM Somatic Mutation

Challenge, where SomaticSeq’s results were #1 for INDEL and #2 for SNV.

In the original manuscript, VarDict’s var2vcf somatic.pl script was used to generate VarDict VCFs,

and subsequently “-ﬁlter somatic” was used for SSeq merged.vcf2tsv.py. Since then (including DREAM

Challenge Stage 5), VarDict recommends var2vcf paired.pl over var2vcf somatic.pl, and subsequently “-ﬁlter

paired” was used for SSeq merged.vcf2tsv.py. The diﬀerence in SomaticSeq results, however, is pretty much

negligible.

6.2 Version 1.1

Automated the SomaticSeq.Wrapper.sh script for both training and prediction mode. No change to any

algorithm.

6.3 Version 1.2

Have implemented the following improvement, mostly for indels:

•SSeq merged.vcf2tsv.py can now accept pileup ﬁles to extract read depth and DP4 (reference forward,

reference reverse, alternate forward, and alternate reverse) information (mainly for indels). Previously,

that information can only be extracted from SAMtools VCF. Since the SAMtools or HaplotypeCaller

generated VCFs hardly contain any indel information, this option improves the indel model. The

SomaticSeq.Wrapper.sh script is modiﬁed accordingly.

•Extract mapping quality (MQ) from VarDict output if this information cannot be found in SAMtools

VCF (also mostly beneﬁts the indel model).

•Indel length now positive for insertions and negative for deletions, instead of using the absolute value

previously.

6.4 Version 2.0

•Removed dependencies for SAMtools and HaplotypeCaller during feature extraction.

SSeq merged.vcf2tsv.py extracts those information (plus more) directly from BAM ﬁles.

•Allow not only VCF ﬁle, but also BED ﬁle or a list of chromosome coordinate as input format for

SSeq merged.vcf2tsv.py, i.e., use -mybed or -mypos instead of -myvcf.

•Instead of a separate step to annotate ground truth, that can be done directly by

SSeq merged.vcf2tsv.py by supplying the ground truth VCF via -truth.

•SSeq merged.vcf2tsv.py can annotate dbSNP and COSMIC information directly if BED ﬁle or a list

of chromosome coordinates are used as input in lieu of an annotated VCF ﬁle.

•Consolidated feature sets, e.g., removed some redunda Fixed a bug: if JointSNVMix2 is not included,

the values should be “NaN” instead of 0’s. This is to keep consistency with how we handle all other

callersnt feature sets coming from diﬀerent resources.

6.5 Version 2.0.2

•Incorporated LoFreq.

•Used getopt to replace getopts in the SomaticSeq.Wrapper.sh script to allow long options.

6.6 Version 2.1.2

•Properly handle cases when multiple ALT’s are calls in the same position. The VCF ﬁles can either

contain multiple calls in the ALT column (i.e., A,G), or have multiple lines corresponding to the same

position (one line for each variant call). Some functions were signiﬁcantly re-written to allow this.

•Incorporated Scalpel.

•Deprecated HaplotypeCaller and SAMTools dependencies completely as far as feature generation is

concerned.

•The Wrapper script removed SnpSift/SnpEﬀ dependencies. Those information can be directly obtained

during the SSeq merged.vcf2tsv.py step. Also removed some additional legacy steps that has become

useless since v2 (i.e., score Somatic.Variants.py). Added a step to check the correctness of the input.

The v2.1 and 2.1.1 had some typos in the wrapper script, so only describing v2.1.2 here.

6.7 Version 2.2

•Added MuTect2 support.

6.8 Version 2.2.1

•InDel 3bp now stands for indel counts within 3 bps of the variant site, instead of exactly 3 bps from

the variant site as it was previously (likewise for InDel 2bp).

•Collapse MQ0 (mapping quality of 0) reads supporting reference/variant reads into a single metric of

MQ0 reads (i.e., tBAM MQ0 and nBAM MQ0). From experience, the number of MQ0 reads is at

least equally predictive of false positive calls, rather than distinguishing if those MQ0 reads support

reference or variant.

•Obtain SOR (Somatic Odds Ratio) from BAM ﬁles instead of VarDict’s VCF ﬁle.

•Fixed a typo in the SomaticSeq.Wrapper.sh script that did not handle inclusion region correctly.

6.9 Version 2.2.2

•Got around an occasional unexplained issue in then ada package were the SOR is sometimes categorized

as type, by forcing it to be numeric.

•Defaults PASS score from 0.7 to 0.5, and make them tunable in the SomaticSeq.Wrapper.sh script

(--pass-threshold and --lowqual-threshold).

6.10 Version 2.2.3

•Incorporated Strelka2 since it’s now GPLv3.

•Added another R script (ada model builder ntChange.R) that uses nucleotide substitution pattern as

a feature. Limited experiences have shown us that it improves the accuracy, but it’s not heavily tested

yet.

•If a COSMIC site is labeled SNP in the COSMIC VCF ﬁle, if cosmic and CNT will be labeled as 0.

The COSMIC ID will still appear in the ID column. This will not change any results because both of

those features are turned oﬀ in the training R script.

•Fixed a bug: if JointSNVMix2 is not included, the values should be “NaN” instead of 0’s. This is to

keep consistency with how we handle all other callers.

6.11 Version 2.2.4

•Resolved a bug in v2.2.3 where the VCF ﬁles of Strelka INDEL and Scalpel clash on GATK Com-

bineVariants, by outputting a temporary VCF ﬁle for Strelka INDEL without the sample columns.

•Caller classiﬁcation: consider if Scalpel = 1 only if there is a SOMATIC ﬂag in its INFO.

6.12 Version 2.2.5

•Added a dockerﬁle. Docker repo at https://hub.docker.com/r/lethalfang/somaticseq/.

•Ability to use vcfsort.pl instead of GATK CombineVariants to merge VCF ﬁles.

6.13 Version 2.3.0

•Moved some scripts to the utilities directory to clean up the clutter.

•Added the split Bed into equal regions.py to utilities, which will split a input BED ﬁle into multiple

BED ﬁles of equal size. This is to be used to parallelize large WGS jobs.

•Made compatible with MuTect2 from GATK4.

•Removed long options for the SomaticSeq.Wrapper.sh script because it’s more readable this way.

•Added a script to add “GT” ﬁeld to Strelka’s VCF output before merging it with other VCF ﬁles.

That was what caused GATK CombineVariants errors mentioned in v2.2.4’s release notes.

•Added a bunch of scripts at utilities/dockered pipelines that can be used to submit (requiring Sun

Grid Engine or equivalent) dockerized pipeline to a computing cluster.

6.14 Version 2.3.1

•Improve the automated run script generator at utilities/dockered pipelines.

•No change to SomaticSeq algorithm

6.15 Version 2.3.2

•Added run script generators for dockerized BAMSurgeon pipelines at utilities/dock-

ered pipelines/bamSurgeon

•Added an error message to r scripts/ada model builder ntChange.R when TrueVariants or False don’t

have both 0’s and 1’s. Other than this warning message change, no other change to SomaticSeq

algorithm.

6.16 Version 2.4.0

•Restructured the utilities scripts.

•Added the utilities/ﬁlter SomaticSeq VCF.py script that “demotes” PASS calls to LowQual based on

a set of tunable hard ﬁlters.

•BamSurgeon scripts invokes modiﬁed BamSurgeon script that splits a BAM ﬁle without the need to

sort by read name. This works if the BAM ﬁles have proper read names, i.e., 2 and only 2 identical

read names for each paired-end reads.

•No change to SomaticSeq algorithm

6.17 Version 2.4.1

•Updated some docker job scripts.

•Added a script that converts some items in the VCF’s INFO ﬁeld into the sample ﬁeld, to pre-

cipitate the need to merge multiple VCF ﬁles into a single multi-sample VCF, i.e., utilities/refor-

mat VCF2SEQC2.py.

•No change to SomaticSeq algorithm

6.18 Version 2.5.0

•In modify VJSD.py, get rid of VarDict’s END tag (in single sample mode) because it causes problem

with GATK CombineVariants.

•Added limited single-sample support, i.e., ssSomaticSeq.Wrapper.sh is the wrapper script. singleSam-

ple callers singleThread.sh is the wrapper script to submit single-sample mutation caller scripts.

•Added run scripts for read alignments and post-alignment processing, i.e,. FASTQ →BAM, at utili-

ties/dockered pipelines/alignments.

•Fixed a bug where the last two CD4 numbers were both alternate concordant reads in the output VCF

ﬁle, when the last number should’ve been alternate discordant reads.

•Changed the output ﬁle names from Trained.s(SNV|INDEL).vcf and Untrained.s(SNV|INDEL).vcf to

SSeq.Classiﬁed.s(SNV|INDE).vcf and Consensus.s(SNV|INDEL).vcf. No change to the actual tumor-

normal SomaticSeq algorithm.

•Added utilities/modify VarDict.py to VarDict’s “complex” variant calls (e.g., GCA¿TAC) into SNVs

when possible.

•Modiﬁed r scripts/ada model builder ntChange.R to allow you to ignore certain features, e.g.,

r scripts/ada model builder ntChange.R Training Data.tsv nBAM REF BQ tBAM REF BQ SiteHo-

mopolymer Length ...

Everything after the input ﬁle are features to be ignored during training.

Also added r scripts/ada cross validation.R.

6.19 Version 2.5.1

•Additional passable parameters options to pass extra parameters to somatic mutation callers. Fixed a

bug where the “two-pass” parameter is not passed onto Scalpel in multiThreads scripts.

•Ignore Strelka QSS and Strelka TQSS for indel training in the SomaticSeq.Wrapper.sh script.

6.20 Version 2.5.2

•Ported some pipeline scripts to singularities at utilities/singularities.

6.21 Version 2.6.0

•VarScan2 Score is no longer extracted from VarScan’s output. Rather, it’s now calculated directly using

Fisher’s Exact Test, which reproduces VarScan’s output, but will have a real value when VarScan2

does not output a particular variant.

•Incorporate TNscope’s output VCF into SomaticSeq, but did not incorporate TNscope caller into the

dockerized workﬂow because we don’t have distribution license.

6.22 Version 2.6.1

•Optimized memory for singularity scripts.

•Updated utilities/bamQC.py and added utilities/trimSoftClippedReads.py (removed soft-clipped bases

on soft-clipped reads)

•Added some docker scripts at utilities/dockered pipelines/QC

6.23 Version 2.7.0

•Added another feature: consistent/inconsistent calls for paired reads if the position is covered by both

forward and reverse reads. However, they’re excluded as training features in SomaticSeq.Wrapper.sh

script for the time being.

•Change non-GCTA characters to N in VarDict.vcf ﬁle to make it conform to VCF ﬁle speciﬁcations.

6.24 Version 2.7.1

•Without –gatk $PATH/TO/GenomeAnalysisTK.jar in the SomaticSeq.Wrapper.sh script, it will use

utilities/getUniqueVcfPositions.py and utilities/vcfsorter.pl to (in lieu of GATK3 CombineVariants) to

combine all the VCF ﬁles.

•Fixed bugs in the docker/singularities scripts where extra arguments for the callers are not correctly

passed onto the callers.

6.25 Version 2.7.2

•Make compatible with .cram format

•Fixed a bug where Strelka-only calls are not considered by SomaticSeq.

7 Contact Us

For suggestions, bug reports, or technical support, please post in

https://github.com/bioinform/somaticseq/issues. The developers are alerted when issues are created

there. Alternatively, you may also email li tai.fang@roche.com.

Manual

Navigation menu

Versions of this User Manual:

Views

Navigation