Manual Sigma

Package ‘SigMA’

December 5, 2018

Title Signature Multivariate Analysis

Version 1.0.0.0

Description SigMA is a signature analysis tool optimized to detect Signature 3, the mutational signature associated with homologous recombination (HR) defects, from hybrid capture panels, exomes, and whole genome sequencing (WGS). For panels with low SNV counts, conventional signature analysis tools do not perform well, whereas SigMA detects Signature 3-positive tumors with 74% sensitivity at a 10% false positive rate.

The first novelty of SigMA is likelihood-based matching: a new patient's mutational spectrum is associated with tumor subtypes according to their signature composition. The tumor subtypes are defined from the WGS data of the ICGC and TCGA consortia by hierarchical clustering of signature fractions. The likelihood that the sample belongs to each tumor subtype is calculated, and the likelihood of Signature 3 is the sum of the likelihoods of all Signature 3-positive tumor subtypes.

The second novel step is a multivariate analysis with gradient boosting machines, which yields a final score for the presence of Signature 3 by combining the likelihood with the cosine similarity and the Signature 3 exposure obtained with the non-negative least squares (NNLS) algorithm. The multivariate analysis automatically handles different sequencing platforms, for which different signature analysis methods are most efficient. For WGS data, for example, it is not necessary to associate the tumor with a tumor subtype, because Signature 3 can be determined accurately with NNLS. For these cases SigMA adds a further feature, the likelihood that the NNLS decomposition is unique. This likelihood was found to be the most influential feature in the multivariate analysis.
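The summation over Signature 3-positive subtypes described above can be sketched in a few lines of base R. This is only an illustration: the cluster names, the likelihood values, the positive/negative flags, and the helper sig3_likelihood() are all hypothetical and are not taken from the package.

```r
# Hypothetical per-subtype likelihoods for one sample (normalized to sum to 1).
subtype_llh <- c(cluster_1 = 0.10, cluster_2 = 0.55,
                 cluster_3 = 0.05, cluster_4 = 0.30)

# Hypothetical flags marking which subtypes are Signature 3-positive.
sig3_positive <- c(cluster_1 = FALSE, cluster_2 = TRUE,
                   cluster_3 = FALSE, cluster_4 = TRUE)

# Likelihood of Signature 3 = sum over the Signature 3-positive subtypes.
sig3_likelihood <- function(llh, is_positive) {
  stopifnot(identical(names(llh), names(is_positive)))
  sum(llh[is_positive])
}

sig3_llh <- sig3_likelihood(subtype_llh, sig3_positive)
print(sig3_llh)
```

With these made-up numbers the two positive clusters contribute 0.55 and 0.30, so the Signature 3 likelihood is their sum.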

Depends R (>= 3.4.0)

License What license is it under?

Encoding UTF-8

LazyData true

RoxygenNote 6.1.0

Imports BSgenome,

BSgenome.Hsapiens.UCSC.hg19,

devtools,

DT,

GenomicRanges,

ggplot2,

gbm,

grid,

gridExtra,

IRanges,

nnls,

reshape2,

Rmisc,

shinycssloaders,

VariantAnnotation

R topics documented:

assignment
calc_llh
cosine
decompose
lite_df
make_matrix
match_to_catalog
plot_detailed
plot_summary
plot_tribase_dist
predict_mva
run

assignment Assigns a boolean indicating whether the signature is identified, based on a threshold on the likelihood or MVA score

Description

Assigns a boolean indicating whether the signature is identified, based on a threshold on the likelihood or multivariate analysis (MVA) score.

Usage

assignment(df_in, method = "mva", signame = "Signature_3",

data = NULL, tumor_type = "breast", do_strict = T, weight_cf)

Arguments

df_in input data.frame

method 'median_catalog' for likelihood-based selection or 'mva' for selection based on the multivariate analysis score

signame name of the signature that the user wants to identify, 'Signature_3' or 'Signature_msi'

data 'msk', 'seqcap' or 'wgs'

tumor_type tumor type as listed in https://github.com/parklab/SigMA/; needed because the thresholds are tumor type specific

do_strict sets whether a strict or a loose threshold is applied

Value

a data.frame with a single column containing the boolean that indicates the presence of the signature
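As a rough illustration of threshold-based assignment, the sketch below applies a strict or a loose cutoff to a column of scores and returns a single-column data.frame of booleans. The cutoff values, the column names, and the helper assign_by_threshold() are assumptions made for this example; the package's actual thresholds are tumor type specific and are not reproduced here.

```r
# Hypothetical scores for three samples; the column name is illustrative.
scores <- data.frame(score = c(0.12, 0.48, 0.91),
                     row.names = c("s1", "s2", "s3"))

# Made-up cutoffs, NOT the package's tumor-type-specific thresholds.
cutoffs <- c(strict = 0.80, loose = 0.40)

# Return a one-column data.frame flagging samples at or above the cutoff.
assign_by_threshold <- function(df, do_strict = TRUE) {
  cutoff <- if (do_strict) cutoffs[["strict"]] else cutoffs[["loose"]]
  data.frame(pass = df$score >= cutoff, row.names = rownames(df))
}

strict_calls <- assign_by_threshold(scores, do_strict = TRUE)
loose_calls  <- assign_by_threshold(scores, do_strict = FALSE)
print(cbind(strict = strict_calls$pass, loose = loose_calls$pass))
```

A strict cutoff trades sensitivity for a lower false positive rate, which is why the same sample (s2 here) can pass the loose threshold and fail the strict one.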

calc_llh Calculates the likelihood of the genome with respect to the available signature probability distributions

Description

Calculates the likelihood of the genome with respect to the available signature probability distributions

Usage

calc_llh(spectrum, signatures, counts = NULL, normalize = T)

Arguments

spectrum is the mutational spectrum

signatures is the reference signature catalog with the probability distributions

counts is the number of cases in each cluster that is represented in the catalog. They are used as weights for each signature in the catalog

normalize is true by default. Only when the function is used together with NNLS in the match_to_catalog function is the likelihood not normalized here; in that case it is normalized outside of the function
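One way to picture such a likelihood calculation is a multinomial model in which each catalog column is a probability distribution over mutation channels. The sketch below is written under that assumption and is not the package's implementation: the function calc_llh_sketch(), the pseudo-probability eps, the way counts multiply the likelihood, and the 4-channel toy data are all illustrative choices.

```r
calc_llh_sketch <- function(spectrum, signatures, counts = NULL,
                            normalize = TRUE) {
  signatures <- as.matrix(signatures)
  stopifnot(length(spectrum) == nrow(signatures))
  if (is.null(counts)) counts <- rep(1, ncol(signatures))
  # A small floor avoids log(0) for channels a signature never uses.
  eps <- 1e-10
  log_p <- log(pmax(signatures, eps))
  # Multinomial log-likelihood up to a constant shared by all signatures,
  # plus the log of the cluster weight (assumed weighting scheme).
  log_llh <- as.vector(spectrum %*% log_p) + log(counts)
  names(log_llh) <- colnames(signatures)
  if (!normalize) return(exp(log_llh))
  # Subtract the maximum before exponentiating for numerical stability.
  w <- exp(log_llh - max(log_llh))
  w / sum(w)
}

# Toy catalog with 4 channels instead of 96, purely for illustration.
signatures <- data.frame(sig_a = c(0.7, 0.1, 0.1, 0.1),
                         sig_b = c(0.1, 0.1, 0.1, 0.7))
spectrum <- c(6, 1, 1, 0)

llh <- calc_llh_sketch(spectrum, signatures)
weighted_llh <- calc_llh_sketch(spectrum, signatures, counts = c(1, 1e6))
print(llh)
```

In the unweighted call the spectrum clearly favors sig_a. The second call shows how a very large weight on sig_b can overturn that preference, which is the effect the counts argument is described as having.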

cosine Calculates the cosine similarity between the spectrum and a set of signatures

Description

Calculates the cosine similarity between the spectrum and a set of signatures

Usage

cosine(x, y)

Arguments

spectrum is the mutational spectrum

signatures is the reference signature catalog with the probability distribution
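For reference, cosine similarity is the dot product of two vectors divided by the product of their Euclidean norms. The sketch below applies that standard formula to a spectrum and each column of a toy catalog; cosine_sketch() and the 4-channel data are hypothetical and only illustrate the calculation.

```r
# Standard cosine similarity between two numeric vectors of equal length.
cosine_sketch <- function(x, y) {
  stopifnot(length(x) == length(y))
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Toy catalog with 4 channels instead of 96, purely for illustration.
signatures <- data.frame(sig_a = c(0.7, 0.1, 0.1, 0.1),
                         sig_b = c(0.1, 0.1, 0.1, 0.7))
spectrum <- c(6, 1, 1, 0)

# One similarity value per signature column.
similarities <- vapply(signatures,
                       function(sig) cosine_sketch(spectrum, sig),
                       numeric(1))
print(similarities)
```

Because the measure depends only on the direction of the vectors, a spectrum of raw counts can be compared directly with a probability distribution without rescaling either one.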

decompose Decomposes the mutational spectrum of a genome in terms of tumor type specific signatures

Description

Decomposes the mutational spectrum of a genome in terms of tumor type specific signatures that were calculated through analysis of public WGS samples from the ICGC and TCGA consortia and are contained as a list in the package. The non-negative least squares (NNLS) algorithm is used, and the number of signatures considered in the decomposition is increased gradually. First, all pairs from among the available signatures are considered and the pair with minimal error is kept. Then all 3-signature combinations, 4-signature combinations and so on are considered. The result is updated if the error is smaller with the larger number of signatures.

Usage

decompose(spect, signatures, data)

Arguments

spect composite spectrum that is being decomposed

signatures a data.frame that contains the signatures in its columns

data sequencing platform, as in run(); used for setting the maximum number of signatures that is allowed in the decomposition
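The incremental search over signature combinations can be sketched in base R as follows. Note the substitution: instead of calling an NNLS solver, this sketch fits ordinary least squares on each combination and discards fits with negative exposures, which keeps the example dependency-free. The function decompose_sketch(), its max_sigs argument (standing in for the platform-dependent limit), and the toy data are assumptions for illustration only.

```r
decompose_sketch <- function(spect, signatures, max_sigs = 3) {
  S <- as.matrix(signatures)
  stopifnot(ncol(S) >= 2, max_sigs >= 2, length(spect) == nrow(S))
  best <- list(error = Inf, exposures = NULL)
  # Start with pairs, then move on to larger combinations.
  for (k in 2:min(max_sigs, ncol(S))) {
    combos <- combn(ncol(S), k)
    for (j in seq_len(ncol(combos))) {
      idx <- combos[, j]
      coefs <- qr.coef(qr(S[, idx, drop = FALSE]), spect)
      # Skip rank-deficient fits and fits with negative exposures.
      if (anyNA(coefs) || any(coefs < -1e-9)) next
      coefs <- pmax(coefs, 0)
      err <- sqrt(sum((spect - S[, idx, drop = FALSE] %*% coefs)^2))
      # Update only when the error is really smaller than the best so far.
      if (err < best$error - 1e-9) {
        exposures <- setNames(numeric(ncol(S)), colnames(S))
        exposures[idx] <- coefs
        best <- list(error = err, exposures = exposures)
      }
    }
  }
  best
}

# Toy catalog with 4 channels and 3 signatures, purely for illustration.
signatures <- data.frame(sig_a = c(0.7, 0.1, 0.1, 0.1),
                         sig_b = c(0.1, 0.1, 0.1, 0.7),
                         sig_c = c(0.1, 0.7, 0.1, 0.1))
spect <- 10 * signatures$sig_a + 5 * signatures$sig_b

fit <- decompose_sketch(spect, signatures, max_sigs = 3)
print(round(fit$exposures, 6))
```

Because the toy spectrum is an exact mixture of sig_a and sig_b, the pair already reaches zero error and the 3-signature combination does not replace it.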

lite_df produces a data.frame with fewer columns for easier use

Description

produces a data.frame with fewer columns for easier use

Usage

lite_df(merged_output)

Arguments

merged_output is the input data.frame

make_matrix Converts somatic mutation call files in a directory, in either vcf or maf format, into a 96-dimensional matrix

Description

Converts somatic mutation call files in a directory, in either vcf or maf format, into a 96-dimensional matrix. It also works for a general number of context bases and for 1 or 2 strands.

Usage

make_matrix(directory, file_type = "vcf",

ref_genome = BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19,

ncontext = 3, nstrand = 1, chrom_colname = NULL,

pos_colname = NULL, ref_colname = NULL, alt_colname = NULL)

Arguments

directory path to the directory where the input vcf or maf files reside

file_type 'maf', 'vcf' or 'custom'

ref_genome name of the BSgenome, currently set by default to BSgenome.Hsapiens.UCSC.hg19

ncontext number of bases in the nucleotide sequence that makes up the spectrum, default 3

nstrand number of strands to be considered; 1 contracts to a single strand, which for ncontext = 3 gives the commonly used 96 dimensions

chrom_colname used only for custom files; a character string defining the column name that holds the chromosome number

pos_colname used only for custom files; a character string defining the column name that holds the position information

ref_colname used only for custom files; a character string defining the column name that holds the ref allele

alt_colname used only for custom files; a character string defining the column name that holds the alt allele

Examples

# By default runs on vcf input and produces 96-dimensional spectra
make_matrix(directory = 'input')

# 5-base context on both strands, with the reference genome given explicitly
make_matrix(directory = 'input',
            file_type = 'vcf',
            ref_genome = BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19,
            ncontext = 5,
            nstrand = 2)
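To see where the 96 dimensions come from, the sketch below enumerates the 6 single-strand substitution types in each of the 4 x 4 flanking-base contexts and counts a few made-up mutations into that vector. The label format "A[C>T]G" is a common convention and is only assumed here; it may differ from the column names that make_matrix() writes. The helper n_channels() likewise assumes that 2 strands double the number of substitution types.

```r
bases <- c("A", "C", "G", "T")
# Substitutions collapsed onto the pyrimidine strand (single-strand view).
substitutions <- c("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")

contexts <- expand.grid(five_prime = bases, three_prime = bases,
                        substitution = substitutions,
                        stringsAsFactors = FALSE)
labels <- with(contexts,
               paste0(five_prime, "[", substitution, "]", three_prime))

# Assumed channel count for general context length and strand number.
n_channels <- function(ncontext = 3, nstrand = 1) {
  6 * nstrand * 4^(ncontext - 1)
}

# Count a few hypothetical mutations into the 96-channel spectrum.
observed <- c("A[C>T]G", "A[C>T]G", "T[T>G]A")
spectrum <- setNames(tabulate(match(observed, labels),
                              nbins = length(labels)),
                     labels)
print(length(labels))
```

Under the same assumption, ncontext = 5 with nstrand = 2 would give 6 * 2 * 4^4 channels, which is why larger contexts quickly become sparse for panels with few SNVs.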

match_to_catalog Calculates the compatibility of a list of genomes with an input catalog based on likelihood and cosine similarity

Description

Calculates the compatibility of a list of genomes with an input catalog based on likelihood and cosine similarity

Usage

match_to_catalog(genomes, signatures, data, cluster_fractions = NULL,

method = "median_catalog")

Arguments

genomes a data table or matrix with SNV spectra in the first ntype columns and one genome in each row

signatures the input catalog, a data table with signature spectra in each column

data sets the type of sequencing platform used, options are 'msk', 'seqcap', 'wgs'

method can be 'median_catalog', 'weighted_catalog', 'cosine_simil' or 'decompose'. 'median_catalog' uses the signature catalog formed by clustering genome SNV spectra and uses it as a probability distribution. The 'median_catalog' method can be used with any custom signatures data frame if the user intends to provide their own signature table.

Value

A data frame that contains the input genomes and, in addition, columns associated with each signature in the catalog with likelihood and cosine similarity values

plot_detailed Generates a detailed plot per sample

Description

Generates a detailed plot per sample

Usage

plot_detailed(file = NULL, sample = NULL)

Arguments

file the csv file produced by SigMA

sample name of the sample to be plotted

plot_summary Generates summary plot

Description

Generates summary plot

Usage

plot_summary(file = NULL)

Arguments

file the csv file produced by SigMA

plot_tribase_dist plots the 96 dimensional mutational spectrum

Description

plots the 96 dimensional mutational spectrum

Usage

plot_tribase_dist(df_snvs, file_name = "test.png", labely = "N SNVs",

legend = T, text_size = 10, signame = "")

Arguments

df_snvs a data frame with 96-dimensional spectra in its columns

file_name the name of the plot to be generated with the proper extension, e.g. "test.pdf", "test.png", etc.

labely string for the label of the y axis

legend boolean determining whether the legend should be printed

text_size size of the x and y axis text and titles

signame a text to be printed on the figure

predict_mva Uses the trained multivariate analysis (MVA) models, in particular gradient boosting models, inside the package to assign a probability for the existence of the signature of interest

Description

This function uses the trained multivariate analysis (MVA) models, in particular gradient boosting models, inside the package to assign a probability for the existence of the signature of interest.

Usage

predict_mva(input, signame, data, tumor_type = "breast", weight_cf)

Arguments

input is a data frame that has likelihood, cosine similarity and total SNV values in its columns

signame name of the signature which is being identified

data determines the sequencing platform, see run()

tumor_type tumor type tag, see ?run

Value

a data.frame with a single column with the score of MVA

run Runs SigMA: (1) calculates the likelihood, cosine similarity, NNLS exposures, and likelihood of the decomposition; (2) uses these features in the multivariate analysis; (3) makes a final decision on the existence of the signature based on the scores

Description

Runs SigMA: (1) calculates the likelihood, cosine similarity, NNLS exposures, and likelihood of the decomposition. (2) These features are then used in the multivariate analysis. (3) Based on the scores, a final decision is made on the existence of the signature.

Usage

run(genome_file, output_file = NULL, do_assign = T, data = "msk",

tumor_type = "breast", do_mva = T, check_msi = F, weight_cf = F,

lite_format = F, add_sig3 = F)

Arguments

genome_file a csv file with SNV spectra info; it can be created from a vcf file using the make_genome_matrix() function, see ?make_genome_matrix

output_file the output file name; can be NULL, in which case the input file name is used and appended with "_output"

do_assign boolean for whether a cutoff should be applied to determine the final decision or just the features should be returned

data the options are "msk" (for a panel that is similar in size to the MSK-IMPACT panel with 410 genes), "seqcap" (for whole exome sequencing), "seqcap_probe" (64 Mb SeqCap EZ Probe v3), or "wgs" (for whole genome sequencing)

tumor_type the options are "bladder", "bone_other" (Ewing's sarcoma or chordoma), "breast", "crc", "eso", "gbm", "lung", "lymph", "medullo", "osteo", "ovary", "panc_ad", "panc_en", "prost", "stomach", "thy", or "uterus". The exact correspondence of these names can be found in https://github.com/parklab/SigMA

do_mva a boolean for whether multivariate analysis should be run

check_msi a boolean which determines whether the user wants to identify tumors with microsatellite instability (MSI)

weight_cf determines whether the likelihood calculation takes into account the number of tumors in each cluster; when it is F the clusters get equal weights, and when it is T they are weighted according to the fraction of tumors in each cluster

lite_format saves the output in a lite format when set to true

add_sig3 should be set to T when the likelihood of Signature 3 is calculated for tumor types for which Signature 3 was not discovered by NMF in their WGS data

Examples

run(genome_file = "input_genomes.csv",
    data = "msk",
    tumor_type = "ovary")

run(genome_file = "input_genomes.csv",
    data = "seqcap",
    tumor_type = "bone_other")
