Manual

User Manual:

Open the PDF directly: View PDF PDF.
Page Count: 22

DownloadManual
Open PDF In BrowserView PDF
SomaticSeq Documentation
Li Tai Fang / li tai.fang@roche.com
October 4, 2018

Contents
1 Introduction
1.1 Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2 Docker images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

2
2
2

2 How to run SomaticSeq
2.1 SomaticSeq Training Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2 SomaticSeq Prediction Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.3 Consensus Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

3
3
4
4

3 Use SomaticSeq as a Python library
3.1 somaticseq.somaticseq modules . . . . . . . . . . . . . . .
3.1.1 Module: run somaticseq . . . . . . . . . . . . . . .
3.1.2 Module: somatic vcf2tsv and single sample vcf2tsv
3.1.3 Module: SSeq tsv2vcf . . . . . . . . . . . . . . . .
3.2 somaticseq.utilities modules . . . . . . . . . . . . . . . . .
3.2.1 Module: split Bed into equal regions . . . . . . . .
3.2.2 Module: lociCounterWithLabels . . . . . . . . . .

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

5
5
5
6
7
7
7
7

4 The
4.1
4.2
4.3
4.4

step-by-step SomaticSeq Workflow
Apply inclusion and exclusion regions .
Combine the call sets . . . . . . . . . . .
Convert the VCF file into TSV file . . .
Model Training or Mutation Prediction

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

7
7
8
8
9

5 To run the dockerized somatic mutation callers
5.1 Location . . . . . . . . . . . . . . . . . . . . . . . .
5.2 Requirements . . . . . . . . . . . . . . . . . . . . .
5.3 Example commands . . . . . . . . . . . . . . . . .
5.3.1 Single-threaded Jobs . . . . . . . . . . . . .
5.3.2 Multi-threaded Jobs . . . . . . . . . . . . .
5.3.3 SomaticSeq Training . . . . . . . . . . . . .
5.3.4 SomaticSeq Prediction . . . . . . . . . . . .
5.3.5 Parameters . . . . . . . . . . . . . . . . . .
5.3.6 What does the single-threaded command do
5.3.7 What does the multi-threaded command do

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

10
10
10
10
10
10
11
11
11
12
13

6 Use BAMSurgeon to create training data
6.1 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6.2 Three scenario to simulate somatic mutations . . . . . . . . . . . . .
6.2.1 When you have sequencing replicates of normal samples . . .
6.2.2 This example mimicks DREAM Challenge . . . . . . . . . . .
6.2.3 Merge and then split the input tumor and normal BAM files

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

14
14
14
14
15
15

.
.
.
.

.
.
.
.

.
.
.
.

1

.
.
.
.

.
.
.
.

6.3
6.4

Parameters and Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6.3.1 –merge-bam / –split-bam / –indel-realign . . . . . . . . . . . . . . . . . . . . . . . . .
To create SomaticSeq classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

16
17
17

7 Release Notes

18

8 Contact Us

22

1

Introduction

SomaticSeq is a flexible post-somatic-mutation-calling algorithm for improved accuracy. It is compatible
with 10+ somatic mutation caller(s). Any combination of them can be used to obtain a combined call set
with sequencing features extracted into TSV and VCF files. In addition, SomaticSeq uses machine learning (adaptive boosting) to distinguish true mutations from false positives from that call set. The mutation
callers we have incorporated are MuTect/Indelocator/MuTect2, VarScan2, JointSNVMix, SomaticSniper,
VarDict, MuSE, LoFreq, Scalpel, Strelka, and TNscope. You may incorporate some or all of those callers
into your own pipeline with SomaticSeq.
The manuscript, An ensemble approach to accurately detect somatic mutations using SomaticSeq, was published in Genome Biology 2015, 16:197. The SomaticSeq project is located at
https://github.com/bioinform/somaticseq. There have been some major improvements in SomaticSeq since
that Genome Biology publication in 2015.
The wrapper script somaticseq/run somaticseq.py and its parallelized cousin somaticseq parallel.py can 1)
train the call set into a classifier, 2) predict high-confidence somatic mutations from the call set based on
a pre-defined classifier, or 3) default to consensus mode, i.e., extract sequencing features and output the
TSV and VCF files, and then label the calls (i.e., PASS, LowQual, or REJECT) based on majority vote of
the tools.

1.1

Dependencies

• Python 3, plus pysam (v0.14.1), numpy (v1.14.3), and scipy (v1.1.0). The versions in parentheses are
in our docker images and validated to work, though other versions should work, too.
• R, plus the ada package in R.
• BEDTools (if inclusion and/or an exclusion region files are supplied, and/or running somaticseq parallel.py instead of somaticseq/run somaticseq.py)
• Optional: dbSNP in VCF format (if you want to use dbSNP membership as a part of the training).
• At least one of MuTect/Indelocator/MuTect2, VarScan2, JointSNVMix2, SomaticSniper, VarDict,
MuSE, LoFreq, Scalpel, Strelka2, TNscope, and/or Platypus. Those are the tools we have incorporated in SomaticSeq. If there are other somatic tools that may be good addition to our list, please
make the suggestion to us.

1.2

Docker images

SomaticSeq and the somatic mutation callers that we routinely use were dockerized.
• SomaticSeq: https://hub.docker.com/r/lethalfang/somaticseq
• MuTect2: https://hub.docker.com/r/broadinstitute/gatk
• VarScan2: https://hub.docker.com/r/djordjeklisic/sbg-varscan2
• JointSNVMix2: https://hub.docker.com/r/lethalfang/jointsnvmix2
• SomaticSniper: https://hub.docker.com/r/lethalfang/somaticsniper
2

• VarDict: https://hub.docker.com/r/lethalfang/vardictjava
• MuSE: https://hub.docker.com/r/marghoob/muse
• LoFreq: https://hub.docker.com/r/marghoob/lofreq
• Scalpel: https://hub.docker.com/r/lethalfang/scalpel
• Strelka2: https://hub.docker.com/r/lethalfang/strelka

2

How to run SomaticSeq

The somaticseq/run somaticseq.py calls a series of programs and procedures after you have run your individual somatic mutation callers, and somaticseq parallel.py is a wrapper script that allows parallel processing. Section 5 will teach you how to run those mutation callers that we have been dockerized. It also
includes ways to create semi-simulated training data that can be used to create SomaticSeq classifiers. In
the next section, we will describe the workflow in this wrapper script in detail.
Both paired and single modes are supported, although single mode is not as well validated scientifically as
the paired mode. To see the required and optional input files and parameters to somaticseq parallel.py:
1

# See t h e g l o b a l i n p u t p a r a m e t e r s
s o m a t i c s e q p a r a l l e l . py −−h e l p

3

5

7

# I n p u t p a r a m e t e r s f o r p a i r e d −sample mode ( i . e . , tumor−normal )
s o m a t i c s e q p a r a l l e l . py p a i r e d −−h e l p
# I n p u t p a r a m e t e r s f o r s i n g l e −sample mode
s o m a t i c s e q p a r a l l e l . py s i n g l e −−h e l p

2.1

SomaticSeq Training Mode

To create SomaticSeq classifiers, you need a VCF file containing true SNVs and a VCF file containing true
INDELs. There is also an option to include a list of regions to include and/or exclude from this exercise.
The exclusion or inclusion regions can be VCF or BED files. An inclusion region may be subset of the call
sets where you have validated their true/false mutation status, so that only those regions will be used for
training. An exclusion region can be regions where the “truth” is ambigious. All the variants in the truth
VCF files are assumed to be true positives. Every mutation call not in the truth VCF files is assumed to
be false positives (as long as the genomic coordiante is in inclusion region and not in exclusion region if
those regions are provided).
All the output VCF files from individual callers are optional. Those VCF files can be bgzipped if they
have .vcf.gz extensions. It is imperative that you will use the same parameter for prediction as you do for
training.
2

4

6

8

10

12

14

16

# An example command f o r SomaticSeq T r a i n i n g .
s o m a t i c s e q p a r a l l e l . py \
−−s o m a t i c s e q −t r a i n \
−−output−d i r e c t o r y $OUTPUT DIR \
−−genome−r e f e r e n c e GRCh38 . f a \
−−t r u t h −snv
t r u e P o s i t i v e s . snv . v c f \
−−t r u t h −i n d e l
truePositives . indel . vcf \
−−i n c l u s i o n −r e g i o n genome . bed \
−−e x c l u s i o n −r e g i o n
b l a c k l i s t . bed \
−−t h r e a d s
12 \
paired \
−−tumor−bam− f i l e
tumor . bam \
−−normal−bam− f i l e
matched normal . bam \
−−mutect2−v c f
MuTect2/ v a r i a n t s . v c f \
−−v a r s c a n −snv
VarScan2 / v a r i a n t s . snp . v c f \
−−v a r s c a n −i n d e l
VarScan2 / v a r i a n t s . i n d e l . v c f \

3

18

20

22

24

−−jsm−v c f
−−s o m a t i c s n i p e r −v c f
−−v a r d i c t −v c f
−−muse−v c f
−−l o f r e q −snv
−−l o f r e q −i n d e l
−−s c a l p e l −v c f
−−s t r e l k a −snv
−−s t r e l k a −i n d e l

JointSNVMix2 / v a r i a n t s . snp . v c f \
S o m a t i c S n i p e r / v a r i a n t s . snp . v c f \
VarDict / v a r i a n t s . v c f \
MuSE/ v a r i a n t s . snp . v c f \
LoFreq / v a r i a n t s . snp . v c f \
LoFreq / v a r i a n t s . i n d e l . v c f \
Scalpel / variants . indel . vcf \
S t r e l k a / v a r i a n t s . snv . v c f \
Strelka / variants . indel . vcf

For the command’s argument placement, caller output and bam files are input “after” paired or single option. Everything else goes before, e.g., reference, ground truths, resources such as dbSNP and COSMIC,
etc.
Parallel processing is achieved by splitting the inclusion BED file into a number of sub-BED files of equal
region sizes, named 1.th.input.bed, 2.th.input.bed, ..., n.th.input.bed. Then each process will be run using
each sub-BED file as the inclusion BED file. If there is no inclusion BED file in the command argument, it
will split the reference.fa.fai file instead.
SomaticSeq supports any combination of the somatic mutation callers we have incorporated into the
workflow. SomaticSeq will run based on the output VCFs you have provided. It will train for SNV
and/or INDEL if you provide the truePositives.snv.vcf and/or truePositives.indel.vcf file(s) and invoke the
--somaticseq-train option. Otherwise, it will fall back to the simple caller consensus mode.

2.2

SomaticSeq Prediction Mode

Make sure the classifiers (.RData files) are supplied, Without either of them, or it will fall back to the simple caller consensus mode.
1

3

5

7

9

11

13

15

17

19

21

23

# The ∗ . RData f i l e s
somaticseq parallel
−− c l a s s i f i e r −snv
−− c l a s s i f i e r −i n d e l
−−output−d i r e c t o r y
−−genome−r e f e r e n c e
−−i n c l u s i o n −r e g i o n
−−e x c l u s i o n −r e g i o n
−−t h r e a d s
paired \
−−tumor−bam− f i l e
−−normal−bam− f i l e
−−mutect2−v c f
−−v a r s c a n −snv
−−v a r s c a n −i n d e l
−−jsm−v c f
−−s o m a t i c s n i p e r −v c f
−−v a r d i c t −v c f
−−muse−v c f
−−l o f r e q −snv
−−l o f r e q −i n d e l
−−s c a l p e l −v c f
−−s t r e l k a −snv
−−s t r e l k a −i n d e l

2.3

a r e t r a i n e d c l a s s i f i e r from t h e t r a i n i n g mode .
. py \
Ensemble . sSNV . t s v . ntChange . C l a s s i f i e r . RData \
Ensemble . sINDEL . t s v . ntChange . C l a s s i f i e r . RData \
$OUTPUT DIR \
GRCh38 . f a \
genome . bed \
b l a c k l i s t . bed \
12 \
tumor . bam \
matched normal . bam \
MuTect2/ v a r i a n t s . v c f \
VarScan2 / v a r i a n t s . snp . v c f \
VarScan2 / v a r i a n t s . i n d e l . v c f \
JointSNVMix2 / v a r i a n t s . snp . v c f \
S o m a t i c S n i p e r / v a r i a n t s . snp . v c f \
VarDict / v a r i a n t s . v c f \
MuSE/ v a r i a n t s . snp . v c f \
LoFreq / v a r i a n t s . snp . v c f \
LoFreq / v a r i a n t s . i n d e l . v c f \
Scalpel / variants . indel . vcf \
S t r e l k a / v a r i a n t s . snv . v c f \
Strelka / variants . indel . vcf

Consensus Mode

Same as the commands previously, but without including classifiers or invoking –somaticseq-train. Without those information, SomaticSeq will forgo machine learning, and fall back into a simple majority vote.

4

3

Use SomaticSeq as a Python library

Section 2 described how to use SomaticSeq as a standalone software, but SomaticSeq can also be treated
as a python library for your own software.

3.1

somaticseq.somaticseq modules

The directory somaticseq/somaticseq contain some of the critical modules of SomaticSeq, and many can be
used as a function of a library.
3.1.1

Module: run somaticseq

The script somaticseq/somaticseq/run somaticseq.py contains the module to convert individual VCF files
(each from a popular somatic mutation caller) to SomaticSeq TSV and VCF files.
The code to produce the .TSV and .VCF files described in Section 2, for example, would be something like
this:
2

# Module i s l o c a t e d s o m a t i c s e q / s o m a t i c s e q / r u n s o m a t i c s e q . py
im po rt s o m a t i c s e q . s o m a t i c s e q . r u n s o m a t i c s e q a s r u n s o m a t i c s e q

4

r u n s o m a t i c s e q . r u n P a i r e d ( o u t d i r = ’/PATH/TO/ SomaticSeq ’ , r e f = ’/PATH/TO/GRCh38 . f a ’ , tbam= ’/
PATH/TO/ tumor . bwa . bam ’ , nbam= ’/PATH/TO/ normal . bwa . bam ’ , tumor name =’TUMOR’ , normal name
=’NORMAL’ , t r u t h s n v=None , t r u t h i n d e l=None , c l a s s i f i e r s n v =None , c l a s s i f i e r i n d e l =None ,
p a s s t h r e s h o l d =0.5 , l o w q u a l t h r e s h o l d =0.1 , hom threshold =0.85 , h e t t h r e s h o l d =0.01 ,
dbsnp = ’/PATH/TO/ dbSNP 138 . hg38 . v c f . v c f ’ , c o s m i c = ’/PATH/TO/COSMIC . v85 . v c f ’ , i n c l u s i o n = ’/
PATH/TO/ Exon Capture . bed ’ , e x c l u s i o n = ’/PATH/TO/ i g n o r e . bed ’ , mutect=None , i n d e l o c a t o r=
None , mutect2 = ’/PATH/TO/MuTect2 . v c f ’ , v a r s c a n s n v=None , v a r s c a n i n d e l=None , jsm=None ,
s n i p e r=None , v a r d i c t = ’/PATH/TO/ VarDict . v c f ’ , muse= ’/PATH/TO/MuSE . v c f ’ , l o f r e q s n v = ’/PATH
/TO/ LoFreq . snv . v c f . gz ’ , l o f r e q i n d e l = ’/PATH/TO/ LoFreq . i n d e l . v c f . gz ’ , s c a l p e l=None ,
s t r e l k a s n v = ’/PATH/TO/ S t r e l k a / r e s u l t s / v a r i a n t s / s o m a t i c s s n v . v c f . gz ’ , s t r e l k a i n d e l = ’/
PATH/TO/ S t r e l k a / r e s u l t s / v a r i a n t s / s o m a t i c s i n d e l . v c f . gz ’ , t n s c o p e=None , p l a t y p u s=None ,
min mq=1 , min bq =5 , m i n c a l l e r = 0 . 5 , s o m a t i c s e q t r a i n=F a l s e , e n s e m b l e O u t P r e f i x =’ Ensemble
. ’ , c o n s e n s u s O u t P r e f i x =’ Consensus . ’ , c l a s s i f i e d O u t P r e f i x =’ SSeq . C l a s s i f i e d . ’ ,
k e e p i n t e r m e d i a t e s=F a l s e )

The parameters of ensembleOutPrefix, consensusOutPrefix, and classifiedOutPrefix will dictate the output
file names under outdir.
Likewise, the single sample mode to convert various individual VCF outputs would be something like this:
2

im po rt s o m a t i c s e q . s o m a t i c s e q . r u n s o m a t i c s e q a s r u n s o m a t i c s e q

4

r u n s o m a t i c s e q . r u n S i n g l e ( o u t d i r = ’/PATH/TO/ SomaticSeq ’ , r e f = ’/PATH/TO/GRCh38 . f a ’ , bam= ’/
PATH/TO/ tumor . bwa . bam ’ , tumor name =’TUMOR’ , t r u t h s n v=None , t r u t h i n d e l=None ,
c l a s s i f i e r s n v =None , c l a s s i f i e r i n d e l =None , p a s s t h r e s h o l d = 0 . 5 , l o w q u a l t h r e s h o l d = 0 . 1 ,
h o m t h r e s h o l d = 0 . 8 5 , h e t t h r e s h o l d = 0 . 0 1 , dbsnp = ’/PATH/TO/ dbSNP 138 . hg38 . v c f . v c f ’ , c o s m i c
= ’/PATH/TO/COSMIC . v85 . v c f ’ , i n c l u s i o n = ’/PATH/TO/ Exon Capture . bed ’ , e x c l u s i o n = ’/PATH/TO/
i g n o r e . bed ’ , mutect=None , mutect2 = ’/PATH/TO/MuTect2 . v c f ’ , v a r s c a n=None , v a r d i c t = ’/PATH/
TO/ VarDict . v c f ’ , l o f r e q = ’/PATH/TO/ LoFreq . v c f ’ , s c a l p e l=None , s t r e l k a = ’/PATH/TO/ S t r e l k a .
v c f ’ , min mq=1 , min bq =5 , m i n c a l l e r = 0 . 5 , s o m a t i c s e q t r a i n=F a l s e , e n s e m b l e O u t P r e f i x =’
Ensemble . ’ , c o n s e n s u s O u t P r e f i x =’ Consensus . ’ , c l a s s i f i e d O u t P r e f i x =’ SSeq . C l a s s i f i e d . ’ ,
k e e p i n t e r m e d i a t e s=F a l s e )

Parameters:
• truth snv/truth indel: if present, then the variants in these VCF files will be considered true positives, and everything else will be considered false positive. If None, then nothing with regard to true
positive or false positive will be annotated.

5

• classifier snv/classifier indel: if present, then SomaticSeq prediction will be invoked to create machine learning classified VCF files. if None, only majority-vote consensus VCF files will be created.
• inclusion: bed file so only variants in it will be considered (requires BEDTools on execution path)
• exclusion: bed file so variants in it will be tossed out (requires BEDTools on the execution path)
• mutect/mutect2/varscan/jsm/vardict/muse/lofreq/strelka/scalpel/tnscope: output VCF files from
the callers. If None, then it assumes that tool was not used.
• min caller: only output variants if at least N number of callers have called it. Since some LowQual
calls are considered 0.5, an input of 0.5 tells the function to also return variants even if it’s only
been called as a “LowQual” by a tool. However, it will still filter out variants that’s only been “REJECTED” by a caller.
somaticseq train: if True, and also if truth snv or truth indel are present, then it will create SomaticSeq classifiers. If False, then will not invoke training mode.
3.1.2

Module: somatic vcf2tsv and single sample vcf2tsv

Another useful module is the command to extract SomaticSeq features for variants in any VCF file, and
output the results to a TSV file. The following function requires both tumor and normal BAM files, and
the reference genome. COSMIC, dbSNP, etc. are optional. None for any null inputs. min mq = 0 for
this purpose. This is a filter to only output variants that has been called by a minimum number of tools
(which you may specify as VCF inputs such as mutect, varscan, etc.)
1

im po rt s o m a t i c s e q . s o m a t i c s e q . s o m a t i c v c f 2 t s v a s s o m a t i c v c f 2 t s v
3

s o m a t i c v c f 2 t s v . v c f 2 t s v ( i s v c f = ’/PATH/TO/ v a r i a n t s . v c f ’ , i s b e d=None , i s p o s=None , nbam fn
= ’/PATH/TO/ normal . bam ’ , tbam fn = ’/PATH/TO/ tumor . bam ’ , t r u t h=None , c o s m i c = ’/PATH/TO/
COSMIC . v85 . v c f ’ , dbsnp = ’/PATH/TO/ dbSNP 138 . hg38 . v c f . v c f ’ , mutect=None , v a r s c a n=None , jsm
=None , s n i p e r=None , v a r d i c t=None , muse=None , l o f r e q=None , s c a l p e l=None , s t r e l k a=None ,
t n s c o p e=None , p l a t y p u s=None , dedup=True , min mq=1 , min bq =5 , m i n c a l l e r =0 , r e f f a = ’/PATH
/TO/GRCh38 . f a ’ , p s c a l e=None , o u t f i l e = ’/PATH/TO/ SomaticSeq . F e a t u r e s E x t r a c t e d . t s v ’ )

You may also extract sequencing info for any VCF file if you just have one bam file
1

im po rt s o m a t i c s e q . s o m a t i c s e q . s i n g l e s a m p l e v c f 2 t s v a s s i n g l e s a m p l e v c f 2 t s v
3

s i n g l e s a m p l e v c f 2 t s v . v c f 2 t s v ( i s v c f = ’/PATH/TO/ v a r i a n t s . v c f ’ , i s b e d=None , i s p o s=None ,
bam fn = ’/PATH/TO/ tumor . bam ’ , t r u t h=None , c o s m i c = ’/PATH/TO/COSMIC . v85 . v c f ’ , dbsnp = ’/PATH/
TO/ dbSNP 138 . hg38 . v c f . v c f ’ , mutect=None , v a r s c a n=None , v a r d i c t=None , muse=None , l o f r e q=
None , s c a l p e l=None , s t r e l k a=None , dedup=True , min mq=1 , min bq =5 , m i n c a l l e r =0 , r e f f a
= ’/PATH/TO/GRCh38 . f a ’ , p s c a l e=None , o u t f i l e = ’/PATH/TO/ SomaticSeq . F e a t u r e s E x t r a c t e d . t s v
’)

Both somaticseq/somaticseq/somatic vcf2tsv.py and somaticseq/somaticseq/single sample vcf2tsv.py may
also be run as standalone scripts. Invoke the script with -h to learn their usages.
Parameters:
• is vcf: the VCF file serves as the input file, from which every variant will have its sequencing feature
extracted from the BAM file(s).
• mutect/varscan/jsm/sniper/vardict/muse/lofreq/scalpel/strelka/tnscope: VCF files from these tools.
If present, the function will extract information from these files such as if a variant is called by the
tool. If None, everything associated with that tool will be “nan” in the TSV file.

6

3.1.3

Module: SSeq tsv2vcf

This module converts SomaticSeq’s TSV file (described in Sec. 3.1.2) to SomaticSeq VCF files.
1

im po rt s o m a t i c s e q . s o m a t i c s e q . S S e q t s v 2 v c f a s S S e q t s v 2 v c f
3

S S e q t s v 2 v c f . t s v 2 v c f ( t s v f n = ’/PATH/TO/ SomaticSeq . t s v ’ , v c f f n = ’/PATH/TO/ SomaticSeq . v c f ’ ,
t o o l s =[ ’ MuTect2 ’ , ’ S o m a t i c S n i p e r ’ , ’ S t r e l k a ’ ] , p a s s s c o r e = 0 . 5 , l o w q u a l s c o r e = 0 . 1 ,
h o m t h r e s h o l d = 0 . 8 5 , h e t t h r e s h o l d = 0 . 0 1 , s i n g l e m o d e=F a l s e , p a i r e d m o d e=True ,
normal sample name =’NORMAL’ , tumor sample name =’TUMOR’ , p r i n t r e j e c t=True , p h r e d s c a l e d=
True )

Parameters:
• tools: A list of tools that were run, can only be selected from MuTect2, MuTect, VarScan2,
JointSNVMix2, SomaticSniper, VarDict, MuSE, LoFreq, Scalpel, Strelka, TNscope, and/or Platypus.
• print reject: if False, will only print PASS and LowQual variants into VCF. If True, will print everything from TSV to VCF.
• phred scaled: if True, will print Phred-scaled score in QUAL column (if the TSV was produced with
SomaticSeq prediction). If False, will print the 0-1 scale. If no SomaticSeq prediction was done, will
print 0.

3.2
3.2.1

somaticseq.utilities modules
Module: split Bed into equal regions

Given a .bed or a .fa.fai file, it will split the input region into N number of bed files, such that each bed
file has equal-sized regions in them.
3.2.2

Module: lociCounterWithLabels

Given a list of .bed files and a .fa.fai file, it will return a .bed file detailing which regions were contained
from which .bed inputs.
1

s o m a t i c s e q / u t i l i t i e s / l o c i C o u n t e r W i t h L a b e l s . py − f a i GRCh38 . f a . f a i −beds 1 . bed 2 . bed 3 . bed −
l a b e l s 01 02 03 −out o v e r l a p p i n g . bed

Parameters:
• labels: A list of labels to be written in 4th column of the output bed file. If absent, the 4th column
will be populated by the input bed file names.

4

The step-by-step SomaticSeq Workflow

We’ll describe the workflow here.

4.1

Apply inclusion and exclusion regions

This step may be needed for model training. BEDTools is invoked by SomaticSeq. An inclusion region
means we will only use calls inside these regions. An exclusion region means we do not care about calls
inside this region. DREAM Challenge had exclusion regions, e.g., blacklisted regions, etc. It is also a routine used to parallelize the process by splitting large regions into equal-sized (in terms of number of pairs)
regions, so that they can be processed in parallel.
7

4.2

Combine the call sets

We use vcfModifier/getUniqueVcfPositions.py and bedtools sort to combine the VCF files from different
callers. For each caller output, intermediate VCF file(s) may be created to separate the SNVs and INDELs
calls, and also remove some REJECT calls to reduce file sizes.
The following scripts are used to modify original VCF outputs.
1

3

5

7

9

v c f M o d i f i e r / modify JointSNVMix2 . py
v c f M o d i f i e r / modify MuTect2 . py
v c f M o d i f i e r / modify MuTect . py
v c f M o d i f i e r / m o d i f y S o m a t i c S n i p e r . py
v c f M o d i f i e r / modify ssMuTect2 . py
v c f M o d i f i e r / m o d i f y s s S t r e l k a . py
v c f M o d i f i e r / m o d i f y S t r e l k a . py
v c f M o d i f i e r / m o d i f y V a r D i c t . py
v c f M o d i f i e r / modify VarScan2 . py

modify ssTOOL.py denotes it’s for single-sample mode.
JointSNVMix2 does not output VCF files. In our own workflow, we convert its output into a basic VCF file with an 2 awk one-liners, which you may see at utilities/dockered pipelines/mutation callers/submit JointSNVMix2.sh.
1

# To a v o i d t e x t f i l e s on t h e o r d e r o f t e r a b y t e s , t h i s awk one−l i n e r k e e p s e n t r i e s where t h e
r e f e r e n c e i s not ”N” , and t h e s o m a t i c p r o b a b i l i t i e s a r e a t l e a s t 0 . 9 5 .
awk −F ” \ t ” ’NR!=1 && $4 != ”N” && $10+$11 >=0.95 ’

3

5

7

9

11

13

15

17

# This awk one−l i n e r c o n v e r t s t h e t e x t f i l e i n t o a b a s i c VCF f i l e
awk −F ” \ t ” ’ { p r i n t $1 ” \ t ” $2 ” \ t . \ t ” $3 ” \ t ” $4 ” \ t . \ t . \tAAAB=” $10 ” ;AABB=” $11 ” \tRD :AD\
t ” $5 ” : ” $6 ” \ t ” $7 ” : ” $8 } ’
## The a c t u a l commands we ’ ve used i n our w o r k f l o w :
echo −e ’##f i l e f o r m a t=VCFv4 . 1 ’ > u n s o r t e d . v c f
echo −e ’##INFO=’ >> u n s o r t e d . v c f
echo −e ’##INFO=’ >> u n s o r t e d . v c f
echo −e ’##FORMAT=’ >> u n s o r t e d . v c f
echo −e ’##FORMAT=’ >> u n s o r t e d . v c f
echo −e ’#CHROM\tPOS\ tID \tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tTUMOR’ >> u n s o r t e d .
vcf
python $PATH/TO/ jsm . py c l a s s i f y j o i n t s n v m i x t w o genome . GRCh37 . f a normal . bam tumor . bam
t r a i n e d . p a r a m e t e r . c f g / dev / s t d o u t | \
awk −F ” \ t ” ’NR!=1 && $4 != ”N” && $10+$11 >=0.95 ’ | \
awk −F ” \ t ” ’ { p r i n t $1 ” \ t ” $2 ” \ t . \ t ” $3 ” \ t ” $4 ” \ t . \ t . \tAAAB=” $10 ” ;AABB=” $11 ” \tRD :AD\
t ” $5 ” : ” $6 ” \ t ” $7 ” : ” $8 } ’ >> u n s o r t e d . v c f

4.3

Convert the VCF file into TSV file

This script somaticseq/somatic vcf2tsv.py works for all VCF files (requires input for two BAM files). It extracts information from the BAM files, as well as some individual callers’ output VCF files. If the ground
truth VCF file is included, a called variant will be annotated as a true positive, and everything will be annotated as a false positive. somaticseq/single sample vcf2tsv.py is used for single-sample mode.
At the end of this, Ensemble.sSNV.tsv and Ensemble.sINDEL.tsv are created.
All the options for somaticseq/somatic vcf2tsv.py or somaticseq/single sample vcf2tsv.py can be found by
running.

8

1

s o m a t i c s e q / s o m a t i c v c f 2 t s v . py −h
s o m a t i c s e q / s i n g l e s a m p l e v c f 2 t s v . py −h

Note: Do not worry if Python throws a warning like this.
2

RuntimeWarning : i n v a l i d v a l u e e n c o u n t e r e d i n d o u b l e s c a l a r s
z = ( s − e x p e c t e d ) / np . s q r t ( n1 ∗ n2 ∗ ( n1+n2+1) / 1 2 . 0 )

This is to tell you that scipy was attempting some statistical test with empty data. That’s usually due to
the fact that normal BAM file has no variant reads at that given position. That is why lots of values are
NaN for the normal.

4.4

Model Training or Mutation Prediction

You can use Ensemble.sSNV.tsv and Ensemble.sINDEL.tsv files either for model training (provided that
their mutation status is annotated with 0 or 1) or mutation prediction. This is done with stochastic boosting algorithm we have implemented in R.
Model training:
2

# Training :
r s c r i p t s / a d a m o d e l b u i l d e r n t C h a n g e .R Ensemble . sSNV . t s v
Consistent Mates
Inconsistent Mates
r s c r i p t s / a d a m o d e l b u i l d e r n t C h a n g e .R Ensemble . sINDEL . t s v S t r e l k a Q S S Strelka TQSS
Consistent Mates Inconsistent Mates

Ensemble.sSNV.tsv.ntChange.Classifier.RData and Ensemble.sINDEL.tsv.ntChange.Classifier.RData will
be created from model training. The arguments after Ensemble.sSNV.tsv and Ensemble.sINDEL.tsv tells
the builder script to ignore those features in training. These features do not improve accuracy in our data
sets (mostly WGS data, but they may help other data sets)
Mutation prediction:
1

3

# Mutation p r e d i c t i o n :
r s c r i p t / a d a m o d e l p r e d i c t o r .R Ensemble . sSNV . t s v . C l a s s i f i e r . RData
Ensemble . sSNV . t s v
T r a i n e d . sSNV . t s v
r s c r i p t / a d a m o d e l p r e d i c t o r .R Ensemble . sINDEL . t s v . C l a s s i f i e r . RData Ensemble . sINDEL . t s v
T r a i n e d . sINDEL . t s v

After mutation prediction, if you feel like it, you may convert Trained.sSNV.tsv and Trained.sINDEL.tsv
into VCF files. Use -tools to list ONLY the individual tools used to have appropriately annotated VCF
files. Accepted tools are MuTect2/MuTect/Indelocator, VarScan2, JointSNVMix2, SomaticSniper, VarDict, MuSE, LoFreq, Scalpel, Strelka, and/or TNscope. To list a tool without having run it, the VCF will
be annotated as if the tool was run but did not identify that position as a somatic variant, which is probably undesireable.
1

3

5

7

# P r o b a b i l i t y above 0 . 7 l a b e l e d PASS (− p a s s 0 . 7 ) , and between 0 . 1 and 0 . 7 l a b e l e d LowQual (−
low 0 . 1 ) :
# Use − a l l t o i n c l u d e REJECT c a l l s i n t h e VCF f i l e
# Use −phred t o c o n v e r t p r o b a b i l i t y v a l u e s ( between 0 t o 1 ) i n t o Phred s c a l e i n t h e QUAL
column i n t h e VCF f i l e
s o m a t i c s e q / S S e q t s v 2 v c f . py −t s v T r a i n e d . sSNV . t s v
−v c f T r a i n e d . sSNV . v c f
−p a s s 0 . 7 −low
0 . 1 − t o o l s MuTect2 VarScan2 JointSNVMix2 S o m a t i c S n i p e r VarDict MuSE LoFreq S t r e l k a − a l l
−phred
s o m a t i c s e q / S S e q t s v 2 v c f . py −t s v T r a i n e d . sINDEL . t s v −v c f T r a i n e d . sINDEL . v c f −p a s s 0 . 7 −low
0 . 1 − t o o l s MuTect2 VarScan2 VarDict LoFreq S c a l p e l S t r e l k a − a l l −phred

9

5

To run the dockerized somatic mutation callers

For your convenience, we have created a couple of scripts that can generate run script for the dockerized
somatic mutation callers.

5.1

Location

• somaticseq/utilities/dockered pipelines/

5.2

Requirements

• Have internet connection, and able to pull and run docker images from docker.io
• Have cluster management system such as Sun Grid Engine, so that the ”qsub” command is valid

5.3
5.3.1

Example commands
Single-threaded Jobs

This is best suited for whole exome sequencing or less.
1

3

5

7

9

# Example command t o submit t h e run s c r i p t s f o r each o f t h e f o l l o w i n g s o m a t i c mutation
callers
$PATH/TO/ s o m a t i c s e q / u t i l i t i e s / d o c k e r e d p i p e l i n e s / s u b m i t c a l l e r s s i n g l e T h r e a d . sh \
−−normal−bam
/ABSOLUTE/PATH/TO/ n o r m a l s a m p l e . bam \
−−tumor−bam
/ABSOLUTE/PATH/TO/ tumor sample . bam \
−−human−r e f e r e n c e /ABSOLUTE/PATH/TO/GRCh38 . f a \
−−output−d i r
/ABSOLUTE/PATH/TO/RESULTS \
−−dbsnp
/ABSOLUTE/PATH/TO/dbSNP . GRCh38 . v c f \
−−s o m a t i c s e q −d i r /ABSOLUTE/PATH/TO/ SomaticSeq \
−−a c t i o n
echo \
−−mutect2 −−s o m a t i c s n i p e r −−v a r d i c t −−muse −−l o f r e q −−s c a l p e l −−s t r e l k a −−s o m a t i c s e q

The command shown above will create scripts for MuTect2, SomaticSniper, VarDict, MuSE, LoFreq,
Scalpel, and Strelka. Then, it will create the SomaticSeq script that merges those 7 callers. This command
defaults to majority-vote consensus.
Since it’s --aciton echo, it will echo the mutation caller scripts locations, but these scripts will not be
run. If you do --action qsub instead, then those mutation caller scripts will be qsub’ed. You’ll still need
to mantually run/submit the SomaticSeq script after all the caller jobs are done.
5.3.2

Multi-threaded Jobs

This is best suited for whole genome sequencing. This is same as above, except it will create 36 equal-size
regions in 36 bed files, and parallelize the jobs into 36 regions.
2

4

6

8

10

# S u b m i t t i n g mutation c a l l e r j o b s by s p l i t t i n g each j o b i n t o 36 even r e g i o n s .
$PATH/TO/ s o m a t i c s e q / u t i l i t i e s / d o c k e r e d p i p e l i n e s / s u b m i t c a l l e r s m u l t i T h r e a d s . sh \
−−normal−bam
/ABSOLUTE/PATH/TO/ n o r m a l s a m p l e . bam \
−−tumor−bam
/ABSOLUTE/PATH/TO/ tumor sample . bam \
−−human−r e f e r e n c e /ABSOLUTE/PATH/TO/GRCh38 . f a \
−−output−d i r
/ABSOLUTE/PATH/TO/RESULTS \
−−dbsnp
/ABSOLUTE/PATH/TO/dbSNP . GRCh38 . v c f \
−−t h r e a d s
36 \
−−a c t i o n
echo \
−−mutect2 −−s o m a t i c s n i p e r −−v a r d i c t −−muse −−l o f r e q −−s c a l p e l −−s t r e l k a −−s o m a t i c s e q

10

5.3.3

SomaticSeq Training

Two classifiers will be created (*.RData files), one for SNV and one for INDEL.
2

4

6

8

10

12

# S u b m i t t i n g mutation c a l l e r j o b s by s p l i t t i n g each j o b i n t o 36 even r e g i o n s .
$PATH/TO/ s o m a t i c s e q / u t i l i t i e s / d o c k e r e d p i p e l i n e s / s u b m i t c a l l e r s s i n g l e T h r e a d . sh \
−−normal−bam
/ABSOLUTE/PATH/TO/ n o r m a l s a m p l e . bam \
−−tumor−bam
/ABSOLUTE/PATH/TO/ tumor sample . bam \
−−t r u t h −snv
/ABSOLUTE/PATH/TO/ snvTruth . v c f \
−−t r u t h −i n d e l
/ABSOLUTE/PATH/TO/ i n d e l T r u t h . v c f \
−−human−r e f e r e n c e /ABSOLUTE/PATH/TO/GRCh38 . f a \
−−output−d i r
/ABSOLUTE/PATH/TO/RESULTS \
−−dbsnp
/ABSOLUTE/PATH/TO/dbSNP . GRCh38 . v c f \
−−s o m a t i c s e q −d i r /ABSOLUTE/PATH/TO/ SomaticSeq \
−−a c t i o n
echo \
−−mutect2 −−s o m a t i c s n i p e r −−v a r d i c t −−muse −−l o f r e q −−s c a l p e l −−s t r e l k a −−s o m a t i c s e q −−
s o m a t i c s e q −t r a i n

Notice the command includes –truth-snv and –truth-indel, and invokes somaticseq-train.
For multi-threaded job, you should not invoke somaticseq-train. Instead, you should combine all the Ensemble.sSNV.tsv and Ensemble.sINDEL.tsv files (separately), and then train on the combined files.
5.3.4

2

4

6

8

10

12

SomaticSeq Prediction

# S u b m i t t i n g mutation c a l l e r j o b s by s p l i t t i n g each j o b i n t o 36 even r e g i o n s .
$PATH/TO/ s o m a t i c s e q / u t i l i t i e s / d o c k e r e d p i p e l i n e s / s u b m i t c a l l e r s s i n g l e T h r e a d . sh \
−−normal−bam
/ABSOLUTE/PATH/TO/ n o r m a l s a m p l e . bam \
−−tumor−bam
/ABSOLUTE/PATH/TO/ tumor sample . bam \
−− c l a s s i f i e r −snv
/ABSOLUTE/PATH/TO/ Ensemble . sSNV . t s v . ntChange . C l a s s i f i e r . RData \
−− c l a s s i f i e r −i n d e l /ABSOLUTE/PATH/TO/ Ensemble . sINDEL . t s v . ntChange . C l a s s i f i e r . RData \
−−human−r e f e r e n c e /ABSOLUTE/PATH/TO/GRCh38 . f a \
−−output−d i r
/ABSOLUTE/PATH/TO/RESULTS \
−−dbsnp
/ABSOLUTE/PATH/TO/dbSNP . GRCh38 . v c f \
−−s o m a t i c s e q −d i r
/ABSOLUTE/PATH/TO/ SomaticSeq \
−−a c t i o n
echo \
−−mutect2 −−s o m a t i c s n i p e r −−v a r d i c t −−muse −−l o f r e q −−s c a l p e l −−s t r e l k a −−s o m a t i c s e q

Notice the command includes –classifier-snv and –classifier-indel.
5.3.5

2

4

6

8

10

Parameters

−−normal−bam
/ABSOLUTE/PATH/TO/ n o r m a l s a m p l e . bam ( R e q u i r e d )
−−tumor−bam
/ABSOLUTE/PATH/TO/ tumor sample . bam ( R e q u i r e d )
−−human−r e f e r e n c e
/ABSOLUTE/PATH/TO/ h u m a n r e f e r e n c e . f a ( R e q u i r e d )
−−dbsnp
/ABSOLUTE/PATH/TO/ dbsnp . v c f ( R e q u i r e d f o r MuSE and LoFreq )
−−c o s m i c
/ABSOLUTE/PATH/TO/ c o s m i c . v c f ( O p t i o n a l )
−− s e l e c t o r
/ABSOLUTE/PATH/TO/ C a p t u r e r e g i o n . bed ( O p t i o n a l . W i l l assume
whole genome from t h e . f a i f i l e w i t h o u t i t . )
−−e x c l u d e
/ABSOLUTE/PATH/TO/ B l a c k l i s t r e g i o n . bed ( O p t i o n a l )
−−min−a f
( O p t i o n a l . The minimum VAF c u t o f f f o r VarDict and VarScan2 .
D e f a u l t s a r e 0 . 1 0 f o r VarScan2 and 0 . 0 5 f o r VarDict ) .
−−a c t i o n
qsub ( O p t i o n a l : t h e command p r e c e d i n g t h e . cmd s c r i p t s .
D e f a u l t i s echo )
−−t h r e a d s
36 ( O p t i o n a l f o r m u l t i T h r e a d s and i n v a l i d f o r s i n g l e T h r e a d :
e v e n l y s p l i t t h e genome i n t o 36 BED f i l e s . D e f a u l t = 1 2 ) .
−−mutect2
( O p t i o n a l f l a g t o i n v o k e MuTect2 )

11

12

14

16

18

20

22

24

26

28

30

32

34

36

38

40

42

44

−−v a r s c a n 2
( O p t i o n a l f l a g t o i n v o k e VarScan2 )
−−j o i n t s n v m i x 2
( O p t i o n a l f l a g t o i n v o k e JointSNVMix2 )
−−s o m a t i c s n i p e r
( Optional f l a g to invoke SomaticSniper )
−−v a r d i c t
( O p t i o n a l f l a g t o i n v o k e VarDict )
−−muse
( O p t i o n a l f l a g t o i n v o k e MuSE)
−−l o f r e q
( O p t i o n a l f l a g t o i n v o k e LoFreq )
−−s c a l p e l
( Optional f l a g to invoke S c a l p e l )
−−s t r e l k a
( Optional f l a g to invoke S t r e l k a )
−−s o m a t i c s e q
( O p t i o n a l f l a g t o i n v o k e SomaticSeq . This s c r i p t a l w a y s be
echo ’ ed , a s i t s h o u l d not be s u b m i t t e d u n t i l a l l t h e c a l l e r s above c o m p l e t e ) .
−−output−d i r
/ABSOLUTE/PATH/TO/OUTPUT DIRECTORY ( R e q u i r e d )
−−s o m a t i c s e q −d i r
S o m a t i c S e q O u t p u t D i r e c t o r y ( O p t i o n a l . The d i r e c t o r y name o f
t h e SomaticSeq o ut pu t . D e f a u l t = SomaticSeq ) .
−−s o m a t i c s e q −t r a i n
( O p t i o n a l f l a g t o i n v o k e SomaticSeq t o p r o d u c e c l a s s i f i e r s i f
ground t r u t h VCF f i l e s a r e p r o v i d e d . Only recommended i n s i n g l e T h r e a d mode , b e c a u s e
o t h e r w i s e i t ’ s b e t t e r t o combine t h e ou tp ut TSV f i l e s f i r s t , and then t r a i n c l a s s i f i e r s
.)
−−s o m a t i c s e q −a c t i o n
( O p t i o n a l . What t o do with t h e s o m a t i c s e q . cmd . D e f a u l t i s echo
. Only do ” qsub ” i f you have a l r e a d y c o m p l e t e d a l l t h e mutation c a l l e r s , but want t o run
SomaticSeq a t a d i f f e r e n t s e t t i n g . )
−− c l a s s i f i e r −snv
T r a i n e d s S N V C l a s s i f i e r . RData ( O p t i o n a l i f t h e r e i s a
c l a s s i f e r you want t o u s e )
−− c l a s s i f i e r −i n d e l
T r a i n e d s I N D E L C l a s s i f i e r . RData ( O p t i o n a l i f t h e r e i s a
c l a s s i f e r you want t o u s e )
−−t r u t h −snv
s S N V g r o u n d t r u t h . v c f ( O p t i o n a l i f t h e r e i s a ground t r u t h ,
and e v e r y t h i n g e l s e w i l l be l a b e l e d f a l s e p o s i t i v e )
−−t r u t h −i n d e l
sINDEL ground truth . v c f ( O p t i o n a l i f t h e r e i s a ground t r u t h ,
and e v e r y t h i n g e l s e w i l l be l a b e l e d f a l s e p o s i t i v e )
−−exome
( Optional f l a g f o r S trel ka )
−−s c a l p e l −two−p a s s
( Optional parameter f o r S c a l p e l . Default = f a l s e . )
−−mutect2−arguments
( Extra p a r a m e t e r s t o p a s s onto Mutect2 , e . g . , −−mutect2−
arguments ’−− i n i t i a l t u m o r l o d 3 . 0 −−l o g s o m a t i c p r i o r −5.0 −−m i n b a s e q u a l i t y s c o r e
20 ’)
−−mutect2− f i l t e r −arguments
( Extra p a r a m e t e r s t o p a s s onto F i l t e r M u t e c t C a l l s )
−−v a r s c a n −arguments
( Extra p a r a m e t e r s t o p a s s onto VarScan2 )
−−v a r s c a n −p i l e u p −arguments
( Extra p a r a m e t e r s t o p a s s onto s a m t o o l s mpileup t h a t c r e a t e s
p i l e u p f i l e s f o r VarScan )
−−jsm−t r a i n −arguments
( Extra p a r a m e t e r s t o p a s s onto JointSNVMix2 ’ s t r a i n command )
−−jsm−c l a s s i f y −arguments
( Extra p a r a m e t e r s t o p a s s onto JointSNVMix2 ’ s c l a s s i f y command
)
−−s o m a t i c s n i p e r −arguments
( Extra p a r a m e t e r s t o p a s s onto S o m a t i c S n i p e r )
−−v a r d i c t −arguments
( Extra p a r a m e t e r s t o p a s s onto VarDict )
−−muse−arguments
( Extra p a r a m e t e r s t o p a s s onto MuSE)
−−l o f r e q −arguments
( Extra p a r a m e t e r s t o p a s s onto LoFreq )
−−s c a l p e l −d i s c o v e r y −arguments ( Extra p a r a m e t e r s t o p a s s onto S c a l p e l ’ s d i s c o v e r y command )
−−s c a l p e l −e x p o r t −arguments
( Extra p a r a m e t e r s t o p a s s onto S c a l p e l ’ s e x p o r t command )
−−s t r e l k a −c o n f i g −arguments
( Extra p a r a m e t e r s t o p a s s onto S t r e l k a ’ s c o n f i g command )
−−s t r e l k a −run−arguments
( Extra p a r a m e t e r s t o p a s s onto S t r e k l a ’ s run command )
−−s o m a t i c s e q −arguments
( Extra p a r a m e t e r s t o p a s s onto SomaticSeq . Wrapper . sh )

5.3.6

What does the single-threaded command do

• For each flag such as --mutect2, --jointsnvmix2, ...., --strelka, a run script ending with .cmd will
be created in /ABSOLUTE/PATH/TO/RESULTS/logs. By default, these .cmd scripts will only be
created, and their file path will be printed on screen. However, if you do “--action qsub”, then these
scripts will be submitted via the qsub command. The default action is “echo.”
– Each of these .cmd script correspond to a mutation caller you specified. They all use docker
images.
– We may improve their functionalities in the future to allow more tunable parameters. For the
initial releases, POC and reproducibility take precedence.

12

• If you do “--somaticseq,” the somaticseq script will be created in /ABSOLUTE/PATH/TO/RESULTS/SomaticSeq/logs. However, it will not be submitted until you manually do so after each of
these mutation callers is finished running.
– In the future, we may create more sophisticated solution that will automatically solves these
dependencies. For the initial release, we’ll focus on stability and reproducibility.
• Due to the way those run scripts are written, the Sun Grid Engine’s standard error log will record
the time the task completes (i.e., Done at 2017/10/30 29:03:02), and it will only do so when the task
is completed with an exit code of 0. It can be a quick way to check if a task is done, by looking at
the final line of the standard error log file.
5.3.7

What does the multi-threaded command do

It’s very similar to the single-threaded WES solution, except the job will be split evenly based on genomic
lengths.
• If you specified “--threads 36,” then 36 BED files will be created. Each BED file represents 1/36 of
the total base pairs in the human genome (obtained from the .fa.fai file, but only including 1, 2, 3,
..., MT, or chr1, chr2, ..., chrM contigs). They are named 1.bed, 2.bed, ..., 36.bed, and will be created into /ABSOLUTE/PATH/TO/RESULTS/1, /ABSOLUTE/PATH/TO/RESULTS/2, ..., /ABSOLUTE/PATH/TO/RESULTS/36. You may, of course, specify any number. The default is 12.
• For each mutation callers you specify (with the exception of SomaticSniper), a script will be created
into /ABSOLUTE/PATH/TO/RESULTS/1/logs, /ABSOLUTE/PATH/TO/RESULTS/2/logs, etc.,
with partial BAM input. Again, they will be automatically submitted if you do “--action qsub.”
• Because SomaticSniper does not support partial BAM input (one would have to manually split the
BAMs in order to parallelize SomaticSniper this way), the above mentioned procedure is not applied
to SomaticSniper. Instead, a single-threaded script will be created (and potentially qsub’ed) into
/ABSOLUTE/PATH/TO/RESULTS/logs.
– However, because SomaticSniper is by far the fastest tool there, single-thread is doable even for
WGS. Even single-threaded SomaticSniper will likely finish before parallelized Scalpel. When
I benchmarked the DREAM Challenge Stage 3 by splitting it into 120 regions, Scalpel took 10
hours and 10 minutes to complete 1/120 of the data. SomaticSniper took a little under 5 hours
for the whole thing.
– After SomaticSniper finishes, the result VCF files will be split into each of the /ABSOLUTE/PATH/TO/RESULTS/1, /ABSOLUTE/PATH/TO/RESULTS/2, etc.
• JointSNVMix2 also does not support partial BAM input. Unlike SomaticSniper, it’s slow and takes
massive amount of memory. It’s not a good idea to run JointSNVMix2 on a WGS data. The only
way to do so is to manually split the BAM files and run each separately. We may do so in the future, but JointSNVMix2 is a 5-year old that’s no longer being supported, so we probably won’t
bother.
• Like the single-threaded case, a SomaticSeq run script will also be created for each partition like
/ABSOLUTE/PATH/TO/RESULTS/1/SomaticSeq/logs, but will not be submitted until you do
so manually.
– For simplicity, you may wait until all the mutation calling is done, then run a command like
1

f i n d /ABSOLUTE/PATH/TO/RESULTS −name ’ s o m a t i c s e q ∗ . cmd ’ −e x e c qsub {} \ ;

13

6

Use BAMSurgeon to create training data

For your convenience, we have created a couple of wrapper scripts that can generate the run script to create training data using BAMSurgon at somaticseq/utilities/dockered pipelines/bamSimulator. Descriptions and example commands can be found in the README there.
This pipeline is used to spike in in silico somatic mutations into existing BAM files in order to create a
training set for somatic mutations.
After the in silico data are generated, you can use the somatic mutation pipeline on the training data to
generate the SomaticSeq classifiers.
Classifiers built on training data work if the training data is similar to the data you want to predict. Ideally, the training data are sequenced on the same platform, same sample prep, and similar depth of coverage as the data of interest.
This method is based on BAMSurgeon, slightly modified into our own fork for some speedups.
The proper citation for BAMSurgeon is Ewing AD, Houlahan KE, Hu Y, et al. Combining tumor genome
simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection. Nat Methods.
2015;12(7):623-30.

6.1

Requirements

• Have internet connection, and able to pull and run docker images from docker.io
• Have cluster management system such as Sun Grid Engine, so that the ”qsub” command is valid

6.2

Three scenario to simulate somatic mutations

Which scenario to use depend on the data sets available to you.
6.2.1

When you have sequencing replicates of normal samples

This is our approach to define high-confidence somatic mutations in SEQC2 consortium’s cancer reference
samples, presented here.
In this case, in silico mutations will be spiked into Replicate 002.bam. Since Replicate 002.bam and Replicate 001.bam are otherwise the same sample, any mutations detected that you did not spike in are false
positives. The following command is a single-thread example.
1

3

5

7

9

11

13

15

$PATH/TO/ s o m a t i c s e q / u t i l i t i e s / d o c k e r e d p i p e l i n e s / bamSimulator / B a m S i m u l a t o r s i n g l e T h r e a d . sh \
−−genome−r e f e r e n c e /ABSOLUTE/PATH/TO/GRCh38 . f a \
−−tumor−bam−i n
/ABSOLUTE/PATH/TO/ R e p l i c a t e 0 0 1 . bam \
−−normal−bam−i n
/ABSOLUTE/PATH/TO/ R e p l i c a t e 0 0 2 . bam \
−−tumor−bam−out
s y n t h e t i c T u m o r . bam \
−−normal−bam−out
s y n t h e t i c N o r m a l . bam \
−−s p l i t −p r o p o r t i o n
0.5 \
−−num−s n v s
20000 \
−−num−i n d e l s
8000 \
−−min−v a f
0.0 \
−−max−v a f
1.0 \
−−l e f t −b e t a
2 \
−−r i g h t −b e t a
5 \
−−min−v a r i a n t −r e a d s 2 \
−−output−d i r
/ABSOLUTE/PATH/TO/ t r a i n i n g S e t \
−−a c t i o n
qsub

BamSimulator *.sh creates semi-simulated tumor-normal pairs out of your input tumor-normal pairs. The
”ground truth” of the somatic mutations will be synthetic snvs.vcf, synthetic indels.vcf in the output directory.
For multi-thread job (WGS), use BamSimulator multiThreads.sh instead. See below for additional options
and parameters.
A schematic of the BAMSurgeon simulation procedure
14

6.2.2

This example mimicks DREAM Challenge

DREAM Somatic Mutation Calling Challenge was an international competition to find algorithms that
gave the most accurate performances.
In that case, a high-coverage BAM file is randomly split into two. One of which is designated normal, and
the other one is designated tumor where mutations will be spiked in. Like the previous example, any mutations found between the designated tumor and designated normal are false positive, since not only are
they from the same sample, but also from the same sequencing run. This example will not capture false
positives as a result of run-to-run biases if they exist in your sequencing data. It will, however, still capture artefacts related to sequencing errors, sampling errors, mapping errors, etc.
2

$PATH/TO/ s o m a t i c s e q / u t i l i t i e s / d o c k e r e d p i p e l i n e s / bamSimulator / Ba mSimulato r mult iThreads . sh \
−−genome−r e f e r e n c e /ABSOLUTE/PATH/TO/GRCh38 . f a −−tumor−bam−i n /ABSOLUTE/PATH/TO/
highCoverageGenome . bam −−tumor−bam−out s y n t h e t i c T u m o r . bam −−normal−bam−out
s y n t h e t i c N o r m a l . bam −−s p l i t −p r o p o r t i o n
0 . 5 −−num−s n v s 10000 −−num−i n d e l s 8000 −−num−s v s
1500 −−min−v a f 0 . 0 −−max−v a f 1 . 0 −−l e f t −b e t a 2 −−r i g h t −b e t a 5 −−min−v a r i a n t −r e a d s 2 −−
output−d i r /ABSOLUTE/PATH/TO/ t r a i n i n g S e t −−t h r e a d s 24 −−a c t i o n qsub −−s p l i t −bam −−i n d e l −
r e a l i g n −−merge−output−bams

The –split-bem will randomly split the high coverage BAM file into two BAM files, one of which is designated normal and the other one designated tumor for mutation spike in. The –indel-realign is an option
that will perform GATK Joint Indel Realignment on the two BAM files. You may or may not invoke it depending on your real data sets. The –merge-output-bams creates another script that will merge the BAM
and VCF files region-by-region. It will need to be run manually after all the spike in is done.
A schematic of the DREAM Challenge simulation procedure

6.2.3

2

Merge and then split the input tumor and normal BAM files

$PATH/TO/ s o m a t i c s e q / u t i l i t i e s / d o c k e r e d p i p e l i n e s / bamSimulator / Ba mSimulato r mult iThreads . sh \
−−genome−r e f e r e n c e /ABSOLUTE/PATH/TO/GRCh38 . f a −−tumor−bam−i n /ABSOLUTE/PATH/TO/ Tumor Sample
. bam −−normal−bam−i n /ABSOLUTE/PATH/TO/ Normal Sample . bam −−tumor−bam−out s y n t h e t i c T u m o r .
bam −−normal−bam−out
s y n t h e t i c N o r m a l . bam −−s p l i t −p r o p o r t i o n
0 . 5 −−num−s n v s 30000 −−

15

num−i n d e l s 10000 −−num−s v s 1500 −−min−v a f 0 . 0 −−max−v a f 1 . 0 −−l e f t −b e t a 2 −−r i g h t −b e t a 5
−−min−v a r i a n t −r e a d s 2 −−output−d i r /ABSOLUTE/PATH/TO/ t r a i n i n g S e t −−t h r e a d s 24 −−a c t i o n
qsub −−merge−bam −−s p l i t −bam −−i n d e l −r e a l i g n −−merge−output−bams

The --merge-bam will merge the normal and tumor BAM files into a single BAM file. Then, --split-bem
will randomly split the merged BAM file into two BAM files. One of which is designated normal, and one
of which is designated tumor. Synthetic mutations will then be spiked into the designated tumor to create
”real” mutations. This is the approach described in our 2017 AACR Abstract.
A schematic of the simulation procedure

6.3
2

4

6

8

10

12

14

16

18

20

Parameters and Options

−−genome−r e f e r e n c e /ABSOLUTE/PATH/TO/ h u m a n r e f e r e n c e . f a ( R e q u i r e d )
−− s e l e c t o r
/ABSOLUTE/PATH/TO/ c a p t u r e r e g i o n . bed (BED f i l e t o l i m i t where mutation
s p i k e i n w i l l be attempted )
−−tumor−bam−i n
I n p u t BAM f i l e ( R e q u i r e d )
−−normal−bam−i n
I n p u t BAM f i l e ( O p t i o n a l , but r e q u i r e d i f you want t o merge i t with t h e
tumor i n p u t )
−−tumor−bam−out
Output BAM f i l e f o r t h e d e s i g n a t e d tumor a f t e r BAMSurgeon mutation s p i k e
in
−−normal−bam−out
Output BAM f i l e f o r t h e d e s i g n a t e d normal i f −−s p l i t −bam i s c h o s e n
−−s p l i t −p r o p o r t i o n The f a c t i o n o f t o t a l r e a d s d e s g i n a t e d t o t h e normal . ( Defaut = 0 . 5 )
−−num−s n v s
Number o f SNVs t o s p i k e i n t o t h e d e s i g n a t e d tumor
−−num−i n d e l s
Number o f INDELs t o s p i k e i n t o t h e d e s i g n a t e d tumor
−−num−s v s
Number o f SVs t o s p i k e i n t o t h e d e s i g n a t e d tumor ( D e f a u l t = 0 )
−−min−depth
Minimum depth where s p i k e i n can t a k e p l a c e
−−max−depth
Maximum depth where s p i k e i n can t a k e p l a c e
−−min−v a f
Minimum VAF t o s i m u l a t e
−−max−v a f
Maximum VAF t o s i m u l a t e
−−l e f t −b e t a
L e f t b e t a o f b e t a d i s t r i b u t i o n f o r VAF
−−r i g h t −b e t a
Right b e t a o f b e t a d i s t r i b u t i o n f o r VAF
−−min−v a r i a n t −r e a d s Minimum number o f v a r i a n t −s u p p o r t i n g r e a d s f o r a s u c c e s s f u l s p i k e i n
−−output−d i r
Output d i r e c t o r y
−−merge−bam
F la g t o merge t h e tumor and normal bam f i l e i n p u t
−−s p l i t −bam
F la g t o s p l i t BAM f i l e f o r tumor and normal
−−c l e a n −bam
F la g t o go t h r o u g h t h e BAM f i l e and remove r e a d s where more than 2
i d e n t i c a l r e a d names a r e p r e s e n t , o r r e a d s where i t s r e a d l e n g t h and CIGAR s t r i n g do not
match . This was n e c e s s a r y f o r some BAM f i l e s downloaded from TCGA. However , a p r o p e r
p a i r −end BAM f i l e s h o u l d not have t h e same r e a d name a p p e a r i n g more than t w i c e . Use t h i s
o n l y when n e c e s s a r y a s i t f i r s t s o r t s BAM f i l e by qname , g o e s t h r o u g h t h e c l e a n i n g
p r o c e d u r e , then re −s o r t by c o o r d i n a t e s .

16

22

24

−−i n d e l −r e a l i g n
Conduct GATK J o i n t I n d e l Realignment on t h e two o ut pu t BAM f i l e s .
I n s t e a d o f s y n t h e t i c N o r m a l . bam and s y n t h e t i c T u m o r . bam , t h e f i n a l BAM f i l e s w i l l be
s y n t h e t i c N o r m a l . J o i n t R e a l i g n e d . bam and s y n t h e t i c T u m o r . J o i n t R e a l i g n e d . bam .
−−s e e d
Random s e e d . P ic k any i n t e g e r f o r r e p r o d u c i b i l i t y p u r p o s e s .
−−t h r e a d s
S p l i t t h e BAM f i l e s e v e n l y i n N r e g i o n s , then p r o c e s s each ( p a i r ) o f sub
−BAM f i l e s i n p a r a l l e l .
−−a c t i o n
The command p r e c e d i n g t h e run s c r i p t c r e a t e d i n t o /ABSOLUTE/PATH/TO/
BamSurgeoned SAMPLES/ l o g s . ” qsub ” i s t o submit t h e s c r i p t i n SGE system . D e f a u l t = echo

6.3.1

–merge-bam / –split-bam / –indel-realign

If you have sequenced replicate normal, that’s the best data set for training. You can use one of the normal as normal, and designate the other normal (of the same sample) as tumor. Use --indel-realign to invoke GATK IndelRealign.
When you have a normal that’s roughly 2X the coverage as your data of choice, you can split that into
two halves. One designated as normal, and the other one designated as tumor. That DREAM Challenge’s
approach. Use --split-bam --indel-realign options.
Another approach is to merge the tumor and normal data, and then randomly split them as described
above. When you merge the tumor and normal, the real tumor mutations are relegated as germline or
noise, so they are considered false positives, because they are supposed to be evenly split into the designated normal. To take this approach, use --merge-bam --split-bam --indel-realign options.
Don’t use --indel-realign if you do not use indel realignment in your alignment pipeline.
In some BAM files, there are reads where read lengths and CIGAR strings don’t match. Spike in will fail
in these cases, and you’ll need to invoke --clean-bam to get rid of these problematic reads.
You can control and visualize the shape of target VAF distribution with python command:
im po rt s c i p y . s t a t s a s s t a t s
im po rt numpy a s np
im po rt m a t p l o t l i b . p y p l o t a s p l t

1

3

leftBeta , rigthBeta = 2 ,5
minAF , maxAF = 0 , 1
x = np . l i n s p a c e ( 0 , 1 , 1 0 1 )
y = s t a t s . b e t a . p d f ( x , l e f t B e t a , r i g t h B e t a , l o c = minAF , s c a l e = minAF + maxAF)
= plt . plot (x , y)

5

7

9

6.4

To create SomaticSeq classifiers

After the mutation simulation jobs are completed, you may create classifiers with the training data with
the following command:
See our somatic mutation pipeline for more details.
1

3

5

7

9

11

$PATH/TO/ s o m a t i c s e q / u t i l i t i e s / d o c k e r e d p i p e l i n e s / s u b m i t c a l l e r s m u l t i T h r e a d s . sh \
−−output−d i r
/ABSOLUTE/PATH/TO/ t r a i n i n g S e t / s o m a t i c M u t a t i o n P i p e l i n e \
−−normal−bam
/ABSOLUTE/PATH/TO/ t r a i n i n g S e t / s y n t h e t i c N o r m a l . bam \
−−tumor−bam
/ABSOLUTE/PATH/TO/ t r a i n i n g S e t / s y n t h e t i c T u m o r . bam \
−−human−r e f e r e n c e /ABSOLUTE/PATH/TO/GRCh38 . f a \
−−dbsnp
/ABSOLUTE/PATH/TO/dbSNP . GRCh38 . v c f \
−−t h r e a d
24 \
−−t r u t h −snv
/ABSOLUTE/PATH/TO/ t r a i n i n g S e t / s y n t h e t i c s n v s . v c f \
−−t r u t h −i n d e l
/ABSOLUTE/PATH/TO/ t r a i n i n g S e t / s y n t h e t i c i n d e l s . l e f t A l i g n . v c f \
−−a c t i o n
echo \
−−mutect2 −−s o m a t i c s n i p e r −−v a r d i c t −−muse −−l o f r e q −−s t r e l k a −−s o m a t i c s e q

17

7

Release Notes

Make sure training and prediction use the same SomaticSeq version, or at least make sure the different
minor version changes do not change the results significantly.
1. Version 1.0
Version used to generate data in the manuscript and Stage 5 of the ICGC-TCGA DREAM Somatic
Mutation Challenge, where SomaticSeq’s results were #1 for INDEL and #2 for SNV.
In the original manuscript, VarDict’s var2vcf somatic.pl script was used to generate VarDict VCFs,
and subsequently “-filter somatic” was used for SSeq merged.vcf2tsv.py. Since then (including
DREAM Challenge Stage 5), VarDict recommends var2vcf paired.pl over var2vcf somatic.pl, and
subsequently “-filter paired” was used for SSeq merged.vcf2tsv.py. The difference in SomaticSeq results, however, is pretty much negligible.
2. Version 1.1
Automated the SomaticSeq.Wrapper.sh script for both training and prediction mode. No change to
any algorithm.
3. Version 1.2
Have implemented the following improvement, mostly for indels:
• SSeq merged.vcf2tsv.py can now accept pileup files to extract read depth and DP4 (reference
forward, reference reverse, alternate forward, and alternate reverse) information (mainly for indels). Previously, that information can only be extracted from SAMtools VCF. Since the SAMtools or HaplotypeCaller generated VCFs hardly contain any indel information, this option improves the indel model. The SomaticSeq.Wrapper.sh script is modified accordingly.
• Extract mapping quality (MQ) from VarDict output if this information cannot be found in
SAMtools VCF (also mostly benefits the indel model).
• Indel length now positive for insertions and negative for deletions, instead of using the absolute
value previously.
4. Version 2.0
• Removed dependencies for SAMtools and HaplotypeCaller during feature extraction.
SSeq merged.vcf2tsv.py extracts those information (plus more) directly from BAM files.
• Allow not only VCF file, but also BED file or a list of chromosome coordinate as input format
for SSeq merged.vcf2tsv.py, i.e., use -mybed or -mypos instead of -myvcf.
• Instead of a separate step to annotate ground truth, that can be done directly by
SSeq merged.vcf2tsv.py by supplying the ground truth VCF via -truth.
• SSeq merged.vcf2tsv.py can annotate dbSNP and COSMIC information directly if BED file or a
list of chromosome coordinates are used as input in lieu of an annotated VCF file.
• Consolidated feature sets, e.g., removed some redunda Fixed a bug: if JointSNVMix2 is not
included, the values should be “NaN” instead of 0’s. This is to keep consistency with how we
handle all other callersnt feature sets coming from different resources.
5. Version 2.0.2
• Incorporated LoFreq.
• Used getopt to replace getopts in the SomaticSeq.Wrapper.sh script to allow long options.
6. Version 2.1.2

18

• Properly handle cases when multiple ALT’s are calls in the same position. The VCF files can
either contain multiple calls in the ALT column (i.e., A,G), or have multiple lines corresponding to the same position (one line for each variant call). Some functions were significantly rewritten to allow this.
• Incorporated Scalpel.
• Deprecated HaplotypeCaller and SAMTools dependencies completely as far as feature generation is concerned.
• The Wrapper script removed SnpSift/SnpEff dependencies. Those information can be directly
obtained during the SSeq merged.vcf2tsv.py step. Also removed some additional legacy steps
that has become useless since v2 (i.e., score Somatic.Variants.py). Added a step to check the
correctness of the input. The v2.1 and 2.1.1 had some typos in the wrapper script, so only describing v2.1.2 here.
7. Version 2.2
• Added MuTect2 support.
8. Version 2.2.1
• InDel 3bp now stands for indel counts within 3 bps of the variant site, instead of exactly 3 bps
from the variant site as it was previously (likewise for InDel 2bp).
• Collapse MQ0 (mapping quality of 0) reads supporting reference/variant reads into a single
metric of MQ0 reads (i.e., tBAM MQ0 and nBAM MQ0). From experience, the number of
MQ0 reads is at least equally predictive of false positive calls, rather than distinguishing if
those MQ0 reads support reference or variant.
• Obtain SOR (Somatic Odds Ratio) from BAM files instead of VarDict’s VCF file.
• Fixed a typo in the SomaticSeq.Wrapper.sh script that did not handle inclusion region correctly.
9. Version 2.2.2
• Got around an occasional unexplained issue in then ada package were the SOR is sometimes
categorized as type, by forcing it to be numeric.
• Defaults PASS score from 0.7 to 0.5, and make them tunable in the SomaticSeq.Wrapper.sh
script (--pass-threshold and --lowqual-threshold).
10. Version 2.2.3
• Incorporated Strelka2 since it’s now GPLv3.
• Added another R script (ada model builder ntChange.R) that uses nucleotide substitution pattern as a feature. Limited experiences have shown us that it improves the accuracy, but it’s not
heavily tested yet.
• If a COSMIC site is labeled SNP in the COSMIC VCF file, if cosmic and CNT will be labeled
as 0. The COSMIC ID will still appear in the ID column. This will not change any results because both of those features are turned off in the training R script.
• Fixed a bug: if JointSNVMix2 is not included, the values should be “NaN” instead of 0’s. This
is to keep consistency with how we handle all other callers.
11. Version 2.2.4
• Resolved a bug in v2.2.3 where the VCF files of Strelka INDEL and Scalpel clash on GATK
CombineVariants, by outputting a temporary VCF file for Strelka INDEL without the sample
columns.
19

• Caller classification: consider if Scalpel = 1 only if there is a SOMATIC flag in its INFO.
12. Version 2.2.5
• Added a dockerfile. Docker repo at https://hub.docker.com/r/lethalfang/somaticseq/.
• Ability to use vcfsort.pl instead of GATK CombineVariants to merge VCF files.
13. Version 2.3.0
• Moved some scripts to the utilities directory to clean up the clutter.
• Added the split Bed into equal regions.py to utilities, which will split a input BED file into
multiple BED files of equal size. This is to be used to parallelize large WGS jobs.
• Made compatible with MuTect2 from GATK4.
• Removed long options for the SomaticSeq.Wrapper.sh script because it’s more readable this
way.
• Added a script to add “GT” field to Strelka’s VCF output before merging it with other VCF
files. That was what caused GATK CombineVariants errors mentioned in v2.2.4’s release notes.
• Added a bunch of scripts at utilities/dockered pipelines that can be used to submit (requiring
Sun Grid Engine or equivalent) dockerized pipeline to a computing cluster.
14. Version 2.3.1
• Improve the automated run script generator at utilities/dockered pipelines.
• No change to SomaticSeq algorithm
15. Version 2.3.2
• Added run script generators for dockerized BAMSurgeon pipelines at utilities/dockered pipelines/bamSurgeon
• Added an error message to r scripts/ada model builder ntChange.R when TrueVariants or False
don’t have both 0’s and 1’s. Other than this warning message change, no other change to SomaticSeq algorithm.
16. Version 2.4.0
• Restructured the utilities scripts.
• Added the utilities/filter SomaticSeq VCF.py script that “demotes” PASS calls to LowQual
based on a set of tunable hard filters.
• BamSurgeon scripts invokes modified BamSurgeon script that splits a BAM file without the
need to sort by read name. This works if the BAM files have proper read names, i.e., 2 and
only 2 identical read names for each paired-end reads.
• No change to SomaticSeq algorithm
17. Version 2.4.1
• Updated some docker job scripts.
• Added a script that converts some items in the VCF’s INFO field into the sample field, to precipitate the need to merge multiple VCF files into a single multi-sample VCF, i.e., utilities/reformat VCF2SEQC2.py.
• No change to SomaticSeq algorithm
18. Version 2.5.0
20

• In modify VJSD.py, get rid of VarDict’s END tag (in single sample mode) because it causes
problem with GATK CombineVariants.
• Added limited single-sample support, i.e., ssSomaticSeq.Wrapper.sh is the wrapper script. singleSample callers singleThread.sh is the wrapper script to submit single-sample mutation caller
scripts.
• Added run scripts for read alignments and post-alignment processing, i.e,. FASTQ → BAM, at
utilities/dockered pipelines/alignments.
• Fixed a bug where the last two CD4 numbers were both alternate concordant reads in the output VCF file, when the last number should’ve been alternate discordant reads.
• Changed the output file names from Trained.s(SNV|INDEL).vcf and
Untrained.s(SNV|INDEL).vcf to SSeq.Classified.s(SNV|INDE).vcf and
Consensus.s(SNV|INDEL).vcf. No change to the actual tumor-normal SomaticSeq algorithm.
• Added utilities/modify VarDict.py to VarDict’s “complex” variant calls (e.g., GCA¿TAC) into
SNVs when possible.
• Modified r scripts/ada model builder ntChange.R to allow you to ignore certain features, e.g.,
r scripts/ada model builder ntChange.R Training Data.tsv nBAM REF BQ tBAM REF BQ
SiteHomopolymer Length ...
Everything after the input file are features to be ignored during training.
Also added r scripts/ada cross validation.R.
19. Version 2.5.1
• Additional passable parameters options to pass extra parameters to somatic mutation callers.
Fixed a bug where the “two-pass” parameter is not passed onto Scalpel in multiThreads scripts.
• Ignore Strelka QSS and Strelka TQSS for indel training in the SomaticSeq.Wrapper.sh script.
20. Version 2.5.2
• Ported some pipeline scripts to singularities at utilities/singularities.
21. Version 2.6.0
• VarScan2 Score is no longer extracted from VarScan’s output. Rather, it’s now calculated directly using Fisher’s Exact Test, which reproduces VarScan’s output, but will have a real value
when VarScan2 does not output a particular variant.
• Incorporate TNscope’s output VCF into SomaticSeq, but did not incorporate TNscope caller
into the dockerized workflow because we don’t have distribution license.
22. Version 2.6.1
• Optimized memory for singularity scripts.
• Updated utilities/bamQC.py and added utilities/trimSoftClippedReads.py (removed softclipped bases on soft-clipped reads)
• Added some docker scripts at utilities/dockered pipelines/QC
23. Version 2.7.0
• Added another feature: consistent/inconsistent calls for paired reads if the position is covered
by both forward and reverse reads. However, they’re excluded as training features in SomaticSeq.Wrapper.sh script for the time being.
• Change non-GCTA characters to N in VarDict.vcf file to make it conform to VCF file specifications.
21

24. Version 2.7.1
• Without –gatk $PATH/TO/GenomeAnalysisTK.jar in the SomaticSeq.Wrapper.sh script, it
will use utilities/getUniqueVcfPositions.py and utilities/vcfsorter.pl to (in lieu of GATK3 CombineVariants) to combine all the VCF files.
• Fixed bugs in the docker/singularities scripts where extra arguments for the callers are not correctly passed onto the callers.
25. Version 2.7.2
• Make compatible with .cram format
• Fixed a bug where Strelka-only calls are not considered by SomaticSeq.
26. Version 2.8.0
• The program is now designed to crash if the VCF file(s) are not sorted according to the .fasta
reference file.
27. Version 2.8.1
• Fixed a bug in the ssSomaticSeq.Wrapper.sh script (single-sample mode), where the SNV algorithm weren’t looking for SNV VCF files during merging when using utilities/getUniqueVcfPositions.py, causing empty SNV files. For previous commands (invoking –gatk for CombineVariants), the results have never changed.
28. Version 3.0.0
Refactored the codes.
• The wrapper scripts written in bash script (i.e., SomaticSeq.Wrapper.sh and ssSomaticSeq.Wrapper.sh) are replaced by somaticseq/run somaticseq.py, though they’re still kept for
backward-compatibility.
• Allow parallel processing using somaticseq parallel.py
29. Version 3.0.1
• Fixed a bug that didn’t handle Strelka/LoFreq indel calls correctly in somaticseq/combine callers.py module.
30. Version 3.1.0
• When splitting MuTect2 files into SNV and INDEL, make sure either the ref base or the alt
base (but not both) consists of a single base, i.e., discarding stuff like GCAA¿GCT.
• Fixed a bug introduced in v3.0.1 that caused the program to handle .vcf.gz files incorrectly.
• Incorporated Platypus into paired mode.

8

Contact Us

For suggestions, bug reports, or technical support, please post in
https://github.com/bioinform/somaticseq/issues. The developers are alerted when issues are created
there. Alternatively, you may also email li tai.fang@roche.com.

22



Source Exif Data:
File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
PDF Version                     : 1.5
Linearized                      : No
Page Count                      : 22
Page Mode                       : UseOutlines
Author                          : 
Title                           : 
Subject                         : 
Creator                         : LaTeX with hyperref package
Producer                        : pdfTeX-1.40.16
Create Date                     : 2018:10:04 23:09:32-07:00
Modify Date                     : 2018:10:04 23:09:32-07:00
Trapped                         : False
PTEX Fullbanner                 : This is pdfTeX, Version 3.14159265-2.6-1.40.16 (TeX Live 2015) kpathsea version 6.2.1
EXIF Metadata provided by EXIF.tools

Navigation menu