Reference Manual

Package ‘CRISPRcleanR’

June 8, 2018

Type Package

Title Unsupervised correction of gene independent cell responses to CRISPR-cas9 targeting

Version 0.3

Date 2017-09-11

Author Francesco Iorio

Maintainer Francesco Iorio <iorio@ebi.ac.uk>

License GPL-2

Description CRISPRcleanR is an R package for identifying and correcting gene independent responses to CRISPR-Cas9 targeting in genome-wide pooled sgRNA drop-out screens. CRISPRcleanR uses an unsupervised approach based on the segmentation of single-guide RNA (sgRNA) fold change values across the genome, without making any assumption on the copy number status of the targeted genes. CRISPRcleanR reports sgRNA fold changes and normalised sgRNA read counts, is therefore compatible with downstream analysis tools, and works with multiple sgRNA libraries.

Depends stringr, DNAcopy, pROC, stats, utils, grDevices, graphics, pracma, PRROC

RoxygenNote 6.0.1

R topics documented:

BAGEL_essential
BAGEL_nonEssential
CCLE.gisticCNA
ccr.cleanChrm
ccr.correctCounts
ccr.ExecuteMageck
ccr.geneMeanFCs
ccr.genes2sgRNAs
ccr.get.CCLEgisticSets
ccr.get.gdsc1000.AMPgenes
ccr.get.nonExpGenes
ccr.GWclean
ccr.impactOnPhenotype
ccr.logFCs2chromPos
ccr.multDensPlot
ccr.NormfoldChanges
ccr.perf_distributions
ccr.perf_statTests
ccr.PlainTsvFile
ccr.PrRc_Curve
ccr.RecallCurves
ccr.ROC_Curve
ccr.VisDepAndSig
CL.subset
EPLC.272HcorrectedFCs
EssGenes.DNA_REPLICATION_cons
EssGenes.KEGG_rna_polymerase
EssGenes.PROTEASOME_cons
EssGenes.ribosomalProteins
EssGenes.SPLICEOSOME_cons
GDSC.CL_annotation
GDSC.geneLevCNA
HT.29correctedFCs
KY_Library_v1.0
RNAseq.fpkms
Index

BAGEL_essential Reference core fitness essential genes

Description

A list of reference core fitness essential genes assembled from multiple RNAi studies, used as a classification template by the BAGEL algorithm to call gene depletion significance [1].

Usage

data(BAGEL_essential)

Format

A vector of strings containing HGNC symbols of reference core fitness essential genes.

References

[1] BAGEL: a computational framework for identifying essential genes from pooled library screens.

Traver Hart and Jason Moffat. BMC Bioinformatics, 2016 vol. 17 p. 164.

See Also

BAGEL_nonEssential

Examples

data(BAGEL_essential)

head(BAGEL_essential)

BAGEL_nonEssential Reference set of non essential genes

Description

A list of reference non essential genes assembled from multiple RNAi studies, used as a classification template by the BAGEL algorithm to call gene depletion significance [1].

Usage

data(BAGEL_nonEssential)

Format

A vector of strings containing HGNC symbols of reference non essential genes.

References

[1] BAGEL: a computational framework for identifying essential genes from pooled library screens.

Traver Hart and Jason Moffat. BMC Bioinformatics, 2016 vol. 17 p. 164.

See Also

BAGEL_essential

Examples

data(BAGEL_nonEssential)

head(BAGEL_nonEssential)

CCLE.gisticCNA Genome-wide copy number data for 13 human cancer cell lines.

Description

Genome-wide Gistic [1] scores quantifying copy number status across a subset of the cell lines in CL.subset that are used to assess CRISPRcleanR results in [2].

Usage

data(CCLE.gisticCNA)

Format

A data frame with one observation per gene across 13 variables (one per cell line). Row names indicate HGNC gene symbols and column names indicate cell line COSMIC identifiers [3].

Source

This data frame has been derived from the tsv file downloadable at
http://www.cbioportal.org/study?id=cellline_ccle_broad#summary.

This has been obtained by processing Affymetrix SNP array data in the Cancer Cell Line Encyclopaedia [4] repository
(https://portals.broadinstitute.org/ccle_legacy/data/).

References

[1] Mermel CH, Schumacher SE, Hill B, et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol. 2011;12(4):R41. doi: 10.1186/gb-2011-12-4-r41.

[2] Iorio, F., Behan, F. M., Goncalves, E., Beaver, C., Ansari, R., Pooley, R., et al. (n.d.). Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. http://doi.org/10.1101/228189

[3] Forbes SA, Beare D, Boutselakis H, et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Research, Volume 45, Issue D1, 4 January 2017, Pages D777-D783.

[4] Barretina J, Caponigro G, Stransky N, et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012 Mar 28;483(7391):603-7. doi: 10.1038/nature11003. Erratum in: Nature. 2012 Dec 13;492(7428):290.

Examples

data(CCLE.gisticCNA)

head(CCLE.gisticCNA)

ccr.cleanChrm Identification and correction of genomic regions of equal log fold changes involving sgRNAs targeting a minimal number of genes within a given chromosome.

Description

This function applies a circular binary segmentation algorithm [1, 2] to the genomic-sorted log fold changes of all the sgRNAs targeting genes on the same chromosome. This procedure yields a set of genomic regions of estimated equal sgRNAs’ log fold changes, each differing significantly on average from its adjacent regions. If some of these regions fulfil certain criteria (detailed below), they are deemed to respond to CRISPR-Cas9 targeting in a gene independent manner (i.e. they might be biased by local features of the DNA) and their pattern of log fold changes is mean centered [3].

Usage

ccr.cleanChrm(gwSortedFCs,CHR,display=TRUE,label='',

saveTO=NULL,min.ngenes=3,ignoredGenes=NULL)

Arguments

gwSortedFCs A data frame containing genome-wide genomic-sorted sgRNAs’ log fold changes. This data frame must include one named row per sgRNA and the following columns/headers:

• CHR: the chromosome of the gene targeted by the sgRNA under consideration;
• startp: the genomic coordinate of the starting position of the region targeted by the sgRNA under consideration;
• endp: the genomic coordinate of the ending position of the region targeted by the sgRNA under consideration;
• genes: the HGNC symbol of the gene targeted by the sgRNA under consideration;
• avgFC: the log fold change of the sgRNA under consideration averaged across replicates;
• BP: the genomic coordinate of the sgRNA, defined as STARTpos+(ENDpos-STARTpos)/2.

This can be generated using the ccr.logFCs2chromPos function, starting from a data frame containing sgRNAs’ log fold changes generated by the ccr.NormfoldChanges function from raw sgRNAs’ counts.

CHR Numerical value indicating the chromosome to analyse and correct. The X and Y chromosomes must be indicated with 23 and 24, respectively.

display A logical value indicating whether genomic plots showing the results of the biased regions’ identification and their log fold change correction should be generated.

label A string indicating the experiment name, used in the main title of the plots and for the name of the folder where results are saved.

saveTO If different from NULL, the path where pdf files with the genomic plots showing the results of the biased regions’ identification (and their log fold change correction) will be saved (within a folder named as defined in the label parameter).

min.ngenes A numerical value (>0) specifying the minimal number of different genes that the set of sgRNAs within a region of estimated equal log fold changes should target in order for that region to be corrected, i.e. mean centered.

ignoredGenes A vector of strings containing HGNC symbols of genes that should not be considered when computing the minimal number of different genes targeted by the sgRNAs in the same identified region of estimated equal log fold changes. This vector could contain, for example, a priori known essential genes. This parameter should be set to NULL for a completely unsupervised correction.
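
The interplay between min.ngenes and ignoredGenes can be illustrated with a minimal base-R sketch on invented data. The helper name centre_if_biased is hypothetical, the region is assumed to be already identified, and mean centering is assumed to simply subtract the region's average log fold change; the internals of ccr.cleanChrm may differ in detail.

```r
# Toy genome-sorted input for one chromosome (all values invented).
gwSortedFCs <- data.frame(
  CHR    = rep(8, 6),
  startp = c(100, 220, 340, 460, 580, 700),
  endp   = c(119, 239, 359, 479, 599, 719),
  genes  = c("G1", "G1", "G2", "G3", "G3", "G4"),
  avgFC  = c(-1.2, -0.8, -1.0, -1.1, -0.9, -1.0),
  row.names = paste0("sg", 1:6),
  stringsAsFactors = FALSE
)

# BP is the midpoint of the targeted region, as defined for the BP column.
gwSortedFCs$BP <- gwSortedFCs$startp + (gwSortedFCs$endp - gwSortedFCs$startp) / 2

# Hypothetical helper (not part of the package): mean-centre the rows in `idx`
# only if they target at least `min.ngenes` distinct genes, after discarding
# the genes listed in `ignoredGenes`.
centre_if_biased <- function(fcs, idx, min.ngenes = 3, ignoredGenes = NULL) {
  targeted  <- setdiff(unique(fcs$genes[idx]), ignoredGenes)
  corrected <- fcs$avgFC
  if (length(targeted) >= min.ngenes) {
    corrected[idx] <- fcs$avgFC[idx] - mean(fcs$avgFC[idx])
  }
  corrected
}

# Four distinct genes are targeted, so the region is mean centred.
all6 <- centre_if_biased(gwSortedFCs, 1:6)

# With G1 and G2 ignored only two genes remain, so nothing is corrected.
ignored <- centre_if_biased(gwSortedFCs, 1:6, ignoredGenes = c("G1", "G2"))
```

Listing known essential genes in ignoredGenes therefore makes the correction more conservative, since fewer regions reach the min.ngenes threshold.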

Value

A list containing two data frames. The first one (correctedFCs) contains one named row per sgRNA and the following columns/headers:

• CHR: the chromosome of the gene targeted by the sgRNA under consideration;
• startp: the genomic coordinate of the starting position of the region targeted by the sgRNA under consideration;
• endp: the genomic coordinate of the ending position of the region targeted by the sgRNA under consideration;
• genes: the HGNC symbol of the gene targeted by the sgRNA under consideration;
• avgFC: the log fold change of the sgRNA averaged across replicates;
• correction: the type of correction: 1 = increased, -1 = decreased;
• correctedFC: the corrected log fold change of the sgRNA.

The second one (regions) contains the identified regions of estimated equal log fold changes (one region per row) and the following columns/headers:

• CHR: the chromosome of the region under consideration;
• startp: the genomic coordinate of the starting position of the region under consideration;
• endp: the genomic coordinate of the ending position of the region under consideration;
• n.sgRNAs: the number of sgRNAs targeting sequences in the region under consideration;
• avg.logFC: the average log fold change of the sgRNAs targeting the region;
• guideIdx: the index range of the sgRNAs targeting the region under consideration, as they appear in the gwSortedFCs provided in input.

Author(s)

Francesco Iorio (iorio@ebi.ac.uk)

References

[1] Olshen, A. B., Venkatraman, E. S., Lucito, R., Wigler, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5: 557-572.

[2] Venkatraman, E. S., Olshen, A. B. (2007). A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23: 657-63.

[3] Iorio, F., Behan, F. M., Goncalves, E., Beaver, C., Ansari, R., Pooley, R., et al. (n.d.). Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. http://doi.org/10.1101/228189

See Also

ccr.logFCs2chromPos, ccr.NormfoldChanges

Examples

data(KY_Library_v1.0)

fn<-paste(system.file('extdata', package = 'CRISPRcleanR'),'/HT-29_counts.tsv',sep='')

normANDfcs<-ccr.NormfoldChanges(fn,min_reads=30,EXPname='Example',

libraryAnnotation=KY_Library_v1.0)

gwSortedFCs<-ccr.logFCs2chromPos(normANDfcs$logFCs,KY_Library_v1.0)

chr8cleaned<-ccr.cleanChrm(gwSortedFCs,8,display=TRUE,label='HT-29',

min.ngenes=3)

ccr.correctCounts Correction of sgRNA treatment counts for gene independent responses

to CRISPR-Cas9 targeting

Description

This function applies an inverse transformation (described in ...) to CRISPRcleanR corrected sgRNAs’ log fold changes and produces in output normalised corrected sgRNA counts (across treatments and control replicates), suitable for gene depletion/enrichment statistical testing via mean-variance modeling (for example through MAGeCK [1]*).

*MAGeCK should be executed excluding initial normalisation, as the corrected sgRNA counts outputted by this function are already normalised.

Usage

ccr.correctCounts(CL,normalised_counts,

correctedFCs_and_segments,

libraryAnnotation,

minTargetedGenes=3,

OutDir='./',

ncontrols=1)

Arguments

CL A string specifying the name of the experiment. This will be used to compose the names of the files and folders where results will be saved.

normalised_counts

A data frame containing normalised sgRNAs’ read counts, which can be computed using the ccr.NormfoldChanges function from raw sgRNAs’ counts.

correctedFCs_and_segments

sgRNAs’ log fold changes corrected for gene independent responses, generated with the function ccr.GWclean.

libraryAnnotation

A data frame containing the sgRNAs’ genome-wide annotations, with at least a named row for each of the sgRNAs included in the fold changes data frame provided in input. The following columns/headers should be present in this data frame (additional columns will be ignored):

• GENES: string vector containing the HGNC symbols of the genes targeted by the sgRNA under consideration;
• EXONE: string vector containing the gene exon targeted by the sgRNA under consideration (these should include the prefix "ex" followed by the exon number);
• CHRM: string vector containing the chromosome of the gene targeted by the sgRNA under consideration (the X and Y chromosomes should be specified as "X" and "Y");
• STRAND: string vector containing the strand targeted by the sgRNA under consideration ("+" or "-");
• STARTpos: numeric vector containing the genomic coordinate of the starting position of the region targeted by the sgRNA under consideration;
• ENDpos: numeric vector containing the genomic coordinate of the ending position of the region targeted by the sgRNA under consideration.

minTargetedGenes

Minimal number of different genes targeted by sgRNAs in a biased segment in order for the corresponding counts to be corrected (default = 3).

OutDir Path of the folder where results and plots will be saved.

ncontrols A numerical value indicating the number of control replicates (and therefore the number of columns to be considered as controls in the normalised counts).

Value

A data frame with one entry per sgRNA and individual columns for the control/treatment samples included in the normalised count data object specified by the normalised_counts parameter, and containing sgRNA counts corrected for gene independent responses to CRISPR-Cas9 targeting and median-ratio normalised.
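
For readers unfamiliar with the term, the following base-R sketch shows a generic median-of-ratios normalisation on invented counts. The helper name median_ratio_normalise is hypothetical, strictly positive counts are assumed, and whether the package uses exactly this variant is not stated in this manual.

```r
# Toy raw counts: 4 sgRNAs x 3 samples (invented numbers).
counts <- matrix(c(100, 200,  50,
                   200, 400, 100,
                   300, 600, 150,
                   400, 800, 200),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("sg", 1:4), c("ctrl", "rep1", "rep2")))

# Hypothetical helper (not part of the package): median-of-ratios scaling.
median_ratio_normalise <- function(m) {
  # Per-sgRNA geometric mean across samples, used as a pseudo-reference.
  ref <- exp(rowMeans(log(m)))
  # Per-sample size factor: median ratio of the sample to the reference.
  size_factors <- apply(m / ref, 2, median)
  # Divide every column by its size factor.
  sweep(m, 2, size_factors, "/")
}

norm_counts <- median_ratio_normalise(counts)
```

In this toy table the samples differ only by sequencing depth, so all columns coincide after normalisation.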

Author(s)

Francesco Iorio (fi9323@gmail.com)

References

[1] Li, W., Xu, H., Xiao, T., Cong, L., Love, M. I., Zhang, F., et al. (2014). MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens. Genome Biology, 15(12), 554.

[2] Hart, T., & Moffat, J. (2016). BAGEL: a computational framework for identifying essential genes from pooled library screens. BMC Bioinformatics, 17(1), 164.

See Also

ccr.NormfoldChanges, ccr.GWclean

Examples

## Loading sgRNA library annotation file

data(KY_Library_v1.0)

## Deriving the path of the file with the example dataset,

## from the mutagenesis of the EPLC-272H colorectal cancer cell line

fn<-paste(system.file('extdata', package = 'CRISPRcleanR'),

'/EPLC-272H_counts.tsv',sep='')

## Loading, median-normalizing and computing fold-changes for the example dataset

normANDfcs<-ccr.NormfoldChanges(fn,min_reads=30,

EXPname='EPLC-272H',

libraryAnnotation = KY_Library_v1.0)

## Genome-sorting of the fold changes

gwSortedFCs<-ccr.logFCs2chromPos(normANDfcs$logFCs,KY_Library_v1.0)

## Identifying and correcting biased sgRNAs'fold changes

correctedFCs<-ccr.GWclean(gwSortedFCs,display=FALSE,label='EPLC-272H')

## correcting individual sgRNA treatment counts

correctedCounts<-ccr.correctCounts('EPLC-272H',normANDfcs$norm_counts,

correctedFCs,

KY_Library_v1.0,

minTargetedGenes=3,

OutDir='./')

head(correctedCounts)

ccr.ExecuteMageck Executing MAGeCK from R command line

Description

This function executes MAGeCK [1] from the command line, taking in input the path of the file containing the sgRNA counts to be processed and saving the results in a user defined location. By default this function does not pre-normalise the counts. However, this preliminary step can be included as specified by the corresponding argument. Additionally, this function assumes that there is only one control sample, whose count values should be contained in the first column of the sgRNA counts’ file. This function requires python and the MAGeCK python package (v0.5.3, available at: https://sourceforge.net/projects/mageck/files/0.5/mageck-0.5.3.zip/download) to be installed.
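
As a rough illustration of what such a call involves, the sketch below composes (without running) the arguments of a MAGeCK test command in base R. The helper name build_mageck_args and the sample labels are hypothetical; the flags used are those of the MAGeCK command line (-k count table, -t treatment labels, -c control label, -n output prefix, --norm-method), and the way ccr.ExecuteMageck builds its own call may differ.

```r
# Hypothetical helper (not part of the package): compose the argument vector
# of a `mageck test` call without executing it.
build_mageck_args <- function(countFile, treatmentIds, controlId,
                              expName = "expName", normMethod = "none",
                              outputPath = "./") {
  c("test",
    "-k", countFile,                            # sgRNA count table
    "-t", paste(treatmentIds, collapse = ","),  # treatment sample labels
    "-c", controlId,                            # control sample label
    "-n", paste0(outputPath, expName),          # prefix of the output files
    "--norm-method", normMethod)                # "none" skips normalisation
}

# Sample labels below are invented; use the column names of the counts file.
mageck_args <- build_mageck_args("EPLC-272H_counts.tsv",
                                 treatmentIds = c("rep1", "rep2"),
                                 controlId = "ctrl",
                                 expName = "EPLC-272H")

# To actually run it (requires MAGeCK to be installed and on the PATH):
# system2("mageck", mageck_args)
```

Keeping normMethod at "none" matches the recommendation to skip MAGeCK's own normalisation when the counts have already been normalised.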

Usage

ccr.ExecuteMageck(mgckInputFile,

expName = "expName",

normMethod = "none",

outputPath = "./")

Arguments

mgckInputFile A string specifying the path of the (plain text) file containing the sgRNA counts to be processed.

expName A string specifying the experiment name. This is used as name prefix for all the files generated by MAGeCK.

normMethod A string specifying the normalisation method to be used ('none' by default).

outputPath A string specifying the folder where all the files outputted by MAGeCK will be saved.

Value

A string specifying the path to the gene summary file outputted by MAGeCK.

Author(s)

Francesco Iorio (fi9323@gmail.com)

References

[1] Li, W., Xu, H., Xiao, T., Cong, L., Love, M. I., Zhang, F., et al. (2014). MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens. Genome Biology, 15(12), 554.

[2] Hart, T., & Moffat, J. (2016). BAGEL: a computational framework for identifying essential genes from pooled library screens. BMC Bioinformatics, 17(1), 164.

Examples

## Loading sgRNA library annotation file

data(KY_Library_v1.0)

## Deriving the path of the file with the example dataset,

## from the mutagenesis of the EPLC-272H colorectal cancer cell line

fn<-paste(system.file('extdata', package = 'CRISPRcleanR'),

'/EPLC-272H_counts.tsv',sep='')

## Loading, median-normalizing and computing fold-changes for the example dataset

normANDfcs<-ccr.NormfoldChanges(fn,min_reads=30,

EXPname='EPLC-272H',

libraryAnnotation = KY_Library_v1.0)

uncorrected_fn<-ccr.PlainTsvFile(sgRNA_count_object = normANDfcs$norm_counts,

fprefix = 'EPLC-272H')

## execute MAGeCK saving files in the working directory

uncorrected_gs_fn<-ccr.ExecuteMageck(mgckInputFile = uncorrected_fn,

expName = 'EPLC-272H',

normMethod = 'none')

uncorrected_gs_fn

ccr.geneMeanFCs Gene level log fold changes

Description

This function computes gene level log fold changes based on the average log fold changes of targeting sgRNAs.

Usage

ccr.geneMeanFCs(sgRNA_FCprofile, libraryAnnotation)

Arguments

sgRNA_FCprofile

A named numerical vector containing the sgRNAs’ log fold-changes, with names corresponding to sgRNA identifiers.

libraryAnnotation

A data frame containing the sgRNA library annotation (with the same format as KY_Library_v1.0).

Value

A numerical vector containing gene average log fold-changes, with corresponding HGNC symbols

as names.
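
The averaging step can be sketched in base R as follows. The toy library, the invented fold changes and the helper name gene_mean_fcs are assumptions made for illustration; only the GENES column and the sgRNA row names of the library annotation format are relied upon.

```r
# Toy library annotation: rows named by sgRNA identifier, with a GENES column.
toyLibrary <- data.frame(GENES = c("A", "A", "B", "B", "B"),
                         row.names = paste0("sg", 1:5),
                         stringsAsFactors = FALSE)

# Toy named vector of sgRNA log fold changes (invented values).
FCs <- c(sg1 = -2, sg2 = -1, sg3 = 0.5, sg4 = 0, sg5 = 1)

# Hypothetical helper (not part of the package): average the log fold changes
# of the sgRNAs targeting each gene.
gene_mean_fcs <- function(sgRNA_FCprofile, libraryAnnotation) {
  genes <- libraryAnnotation[names(sgRNA_FCprofile), "GENES"]
  res <- tapply(sgRNA_FCprofile, genes, mean)
  setNames(as.numeric(res), names(res))
}

geneFCs <- gene_mean_fcs(FCs, toyLibrary)
```

Here gene A averages sg1 and sg2, and gene B averages sg3, sg4 and sg5.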

Author(s)

Francesco Iorio (fi9323@gmail.com)

See Also

KY_Library_v1.0

Examples

## loading corrected sgRNAs log fold-changes and segment annotations for

## an example cell line (EPLC-272H)

data(EPLC.272HcorrectedFCs)

## loading sgRNA library annotation

data(KY_Library_v1.0)

## storing sgRNA log fold-changes in a named vector

FCs<-EPLC.272HcorrectedFCs$corrected_logFCs$avgFC

names(FCs)<-rownames(EPLC.272HcorrectedFCs$corrected_logFCs)

## computing gene level log fold-changes

geneFCs<-ccr.geneMeanFCs(FCs,KY_Library_v1.0)

head(geneFCs)

ccr.genes2sgRNAs Targeting sgRNAs

Description

This function returns the set of sgRNAs targeting the set of genes provided in input, in a given

pooled library.

Usage

ccr.genes2sgRNAs(libraryAnnotation,genes)

Arguments

libraryAnnotation

A data frame with a named row for each sgRNA, with the same format as KY_Library_v1.0.

genes A list of strings containing HGNC symbols.

Value

A list of strings containing the identifiers of the sgRNAs targeting the inputted set of genes.
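
A minimal base-R sketch of this lookup, on an invented library, is given below. The helper name genes_to_sgRNAs is hypothetical, and the sketch assumes only that sgRNA identifiers are the row names and that gene symbols are stored in the GENES column.

```r
# Toy library annotation: rows named by sgRNA identifier, with a GENES column.
toyLibrary <- data.frame(GENES = c("A", "A", "B", "C"),
                         row.names = paste0("sg", 1:4),
                         stringsAsFactors = FALSE)

# Hypothetical helper (not part of the package): identifiers of the sgRNAs
# whose target gene is in `genes`.
genes_to_sgRNAs <- function(libraryAnnotation, genes) {
  rownames(libraryAnnotation)[libraryAnnotation$GENES %in% genes]
}

guides <- genes_to_sgRNAs(toyLibrary, c("A", "C"))
```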

Author(s)

Francesco Iorio (fi9323@gmail.com)

See Also

KY_Library_v1.0

Examples

## Loading an sgRNA pooled library annotation

data(KY_Library_v1.0)

## Loading an example set of genes

data(BAGEL_essential)

ccr.genes2sgRNAs(KY_Library_v1.0,BAGEL_essential)

ccr.get.CCLEgisticSets

CCLE gistic score gene sets

Description

This function splits all the genes into 5 classes (-2, -1, 0, +1 and +2) based on the CNA Gistic [1]

score observed in a given cell line.

Usage

ccr.get.CCLEgisticSets(cellLine,CCLE.gisticCNA=NULL)

Arguments

cellLine A string specifying the name of a cell line (or a COSMIC identifier [2]);

CCLE.gisticCNA Genome-wide Gistic [1] scores quantifying copy number status across cell lines, with the same format as CCLE.gisticCNA. If NULL, this function uses the built-in CCLE.gisticCNA data frame, containing data for 13 of the 15 cell lines used in [3] to assess the performance of CRISPRcleanR.

Value

A named list of vectors with the following fields:

gm2 A vector of strings containing identifiers of sgRNAs targeting genes with a Gistic score = -2 in the cell line under consideration;

gm1 A vector of strings containing identifiers of sgRNAs targeting genes with a Gistic score = -1 in the cell line under consideration;

gz A vector of strings containing identifiers of sgRNAs targeting genes with a Gistic score = 0 in the cell line under consideration;

gp1 A vector of strings containing identifiers of sgRNAs targeting genes with a Gistic score = +1 in the cell line under consideration;

gp2 A vector of strings containing identifiers of sgRNAs targeting genes with a Gistic score = +2 in the cell line under consideration.
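
The grouping into five classes can be sketched in base R as below, on an invented score table with a made-up column name. The helper name gistic_sets is hypothetical, and the sketch groups gene symbols only; mapping them to sgRNA identifiers would additionally need a library annotation.

```r
# Toy Gistic score table: genes in rows, one column per cell line
# (the column name is a made-up identifier).
toyGistic <- data.frame(CL1 = c(-2, -1, 0, 0, 1, 2),
                        row.names = paste0("G", 1:6))

# Hypothetical helper (not part of the package): group gene symbols by their
# Gistic score in one cell line, using the field names listed above.
gistic_sets <- function(gistic, cellLine) {
  scores  <- gistic[[cellLine]]
  classes <- c(gm2 = -2, gm1 = -1, gz = 0, gp1 = 1, gp2 = 2)
  lapply(classes, function(s) rownames(gistic)[scores == s])
}

GS <- gistic_sets(toyGistic, "CL1")
```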

Author(s)

Francesco Iorio (iorio@ebi.ac.uk)

References

[1] Mermel CH, Schumacher SE, Hill B, et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol. 2011;12(4):R41. doi: 10.1186/gb-2011-12-4-r41.

[2] Forbes SA, Beare D, Boutselakis H, et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Research, Volume 45, Issue D1, 4 January 2017, Pages D777-D783.

[3] Iorio, F., Behan, F. M., Goncalves, E., Beaver, C., Ansari, R., Pooley, R., et al. (n.d.). Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. http://doi.org/10.1101/228189

See Also

ccr.get.gdsc1000.AMPgenes

Examples

GS<-ccr.get.CCLEgisticSets('HT-29')

head(GS$gm2)

head(GS$gm1)

head(GS$gz)

head(GS$gp1)

head(GS$gp2)

ccr.get.gdsc1000.AMPgenes

Copy number amplified genes in a given cell line from the GDSC1000

Description

This function takes in input the name (or the COSMIC identifier [1]) of a cell line included in the GDSC1000 project [2] and identifies the genes that are copy number amplified (according to a user defined minimal copy number value) in that cell line, using gene level copy number data from the Genomics of Drug Sensitivity in 1,000 Cancer Cell lines (GDSC1000) [2].

Usage

ccr.get.gdsc1000.AMPgenes(cellLine, minCN = 8, exact = FALSE,

GDSC.geneLevCNA=NULL)

Arguments

cellLine A string specifying the name of a cell line (or a COSMIC identifier [1]);

minCN Lower threshold for the minimum copy number of any genomic segment containing coding sequence of a gene in order for it to be considered as copy number amplified.

exact If TRUE, then those genes for which any genomic segment containing coding sequence has a minimum copy number equal to minCN are considered as copy number amplified.

GDSC.geneLevCNA

Genome-wide copy number data with the same format as GDSC.geneLevCNA. This can be assembled from the xls sheet specified in the source section [a] (containing data for the GDSC1000 cell lines). If NULL, then this function uses the data in the built-in GDSC.geneLevCNA data frame, containing data derived from [a] for the 15 cell lines used in [3] to assess the performance of CRISPRcleanR.

Value

A data frame containing one row for each copy number amplified gene, with the following columns:

Gene HGNC symbol of the gene;

minCN Minimum copy number of any genomic segment containing coding sequence of

the gene in the cell line under consideration.
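
The effect of minCN and exact can be sketched in base R on invented per-gene minimum copy numbers. The helper name amp_genes is hypothetical, and the sketch assumes that the non-exact case keeps genes whose minimum copy number is at least minCN.

```r
# Toy per-gene minimum copy numbers in one cell line (invented values).
toyCN <- c(G1 = 2, G2 = 8, G3 = 10, G4 = 4)

# Hypothetical helper (not part of the package): select amplified genes and
# return them with the two columns described above.
amp_genes <- function(minCNs, minCN = 8, exact = FALSE) {
  keep <- if (exact) minCNs == minCN else minCNs >= minCN
  data.frame(Gene  = names(minCNs)[keep],
             minCN = unname(minCNs[keep]),
             stringsAsFactors = FALSE)
}

atLeast8 <- amp_genes(toyCN)                # G2 and G3
exactly8 <- amp_genes(toyCN, exact = TRUE)  # G2 only
```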

Author(s)

Francesco Iorio (fi9323@gmail.com)

Source

[a] ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/release-6.0/Gene_level_CN.xlsx.

References

[1] Forbes SA, Beare D, Boutselakis H, et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Research, Volume 45, Issue D1, 4 January 2017, Pages D777-D783.

[2] Iorio F, Knijnenburg TA, Vis DJ, Bignell GR, Menden MP, et al. A landscape of pharmacogenomic interactions in cancer. Cell 2016 Jul 28;166(3):740-54.

[3] Iorio, F., Behan, F. M., Goncalves, E., Beaver, C., Ansari, R., Pooley, R., et al. (n.d.). Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. http://doi.org/10.1101/228189

See Also

ccr.get.CCLEgisticSets

Examples

CNAgenes<-

ccr.get.gdsc1000.AMPgenes('HT-29')

head(CNAgenes)

ccr.get.nonExpGenes Non expressed genes in a given cell line

Description

This function takes in input the name (or the COSMIC identifier [1]) of a cell line and identifies genes that are not expressed (according to a user defined FPKM threshold) using a collection of RNAseq profiles from [2].

Usage

ccr.get.nonExpGenes(cellLine, th = 0.05,

amplified = FALSE, minCN = 8,

RNAseq.fpkms=NULL)

Arguments

cellLine A string specifying the name of a cell line (or a COSMIC identifier [1]);

th Minimum FPKM value for a gene to be considered as expressed;

amplified A logical value specifying whether the selected non expressed genes should also be copy number amplified;

minCN If amplified = TRUE, this parameter defines a lower threshold for the minimum copy number of any genomic segment containing coding sequence of a gene in order for it to be considered as copy number amplified.

RNAseq.fpkms Genome-wide expression values, in fragments per kilobase of exon per million reads mapped (FPKM), across cell lines. These can be derived from a comprehensive collection of RNAseq profiles described in [2]. The format must be the same as that of the built-in RNAseq.fpkms data frame. If NULL, this function uses the built-in RNAseq.fpkms data frame, containing data for the 15 cell lines used in [3] to assess CRISPRcleanR results.

Value

A vector of strings containing the HGNC symbols of non expressed (optionally copy number amplified) genes in the cell line under consideration.
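
The thresholding can be sketched in base R on an invented FPKM table with a made-up column name. The helper name non_expressed and its ampGenes argument are hypothetical; the sketch assumes that a gene counts as non expressed when its FPKM is strictly below th.

```r
# Toy FPKM table: genes in rows, one column per cell line (invented values).
toyFPKM <- data.frame(CL1 = c(0, 0.01, 3.2, 0.04, 12),
                      row.names = paste0("G", 1:5))

# Hypothetical helper (not part of the package): genes with FPKM below `th`,
# optionally restricted to a supplied set of copy number amplified genes.
non_expressed <- function(fpkms, cellLine, th = 0.05, ampGenes = NULL) {
  genes <- rownames(fpkms)[fpkms[[cellLine]] < th]
  if (!is.null(ampGenes)) {
    genes <- intersect(genes, ampGenes)
  }
  genes
}

nonExp    <- non_expressed(toyFPKM, "CL1")
nonExpAmp <- non_expressed(toyFPKM, "CL1", ampGenes = c("G2", "G3"))
```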

Author(s)

Francesco Iorio (fi9323@gmail.com)

References

[1] Forbes SA, Beare D, Boutselakis H, et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Research, Volume 45, Issue D1, 4 January 2017, Pages D777-D783.

[2] Garcia-Alonso L, Iorio F, Matchan A, et al. Transcription factor activities enhance markers of drug response in cancer. doi: https://doi.org/10.1101/129478

[3] Iorio, F., Behan, F. M., Goncalves, E., Beaver, C., Ansari, R., Pooley, R., et al. Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. http://doi.org/10.1101/228189

See Also

ccr.get.gdsc1000.AMPgenes

Examples

ccr.get.nonExpGenes('HT-29',amplified = TRUE)

ccr.GWclean Unsupervised identification and correction of gene independent cell responses to CRISPR-Cas9 targeting.

Description

This function takes in input a genome-wide essentiality profile derived from a CRISPR-Cas9 experiment employing a pooled library of single guide RNAs (sgRNAs) targeting protein coding genes, which are transfected in an in vitro model stably expressing Cas9. The essentiality profile quantifies the loss/gain-of-fitness caused by each sgRNA-targeting, and it is expressed as log fold changes (logFCs) between the abundance of the sgRNAs at an end point after cell purification and their abundance in the plasmid pool used for viral production, or at an initial time point, or in any other control condition.

A circular binary segmentation algorithm [1, 2] is applied by this function to the genome-wide pattern of logFCs provided in input, in order to identify genomic regions including sgRNAs with sufficiently equal logFC (and mean logFC sufficiently different from background) and targeting a minimal number of different genes. Assuming that it is very unlikely to observe the same loss/gain-of-fitness effect when targeting a large number of contiguous genes, if certain user-defined conditions (detailed below) are met then the logFCs of such regions are deemed as biased by some local feature of the involved genomic segment (which could be, for example, copy number amplified [3]), and they are corrected, i.e. mean centered [4].

Usage

ccr.GWclean(gwSortedFCs,label='',display=TRUE,

saveTO=NULL,ignoredGenes=NULL,min.ngenes=3)

Arguments

gwSortedFCs A data frame containing genome-wide genomic-sorted sgRNAs’ log fold changes. This data frame must include one named row per sgRNA and the following columns/headers:

• CHR: the chromosome of the gene targeted by the sgRNA under consideration;
• startp: the genomic coordinate of the starting position of the region targeted by the sgRNA under consideration;
• endp: the genomic coordinate of the ending position of the region targeted by the sgRNA under consideration;
• genes: the HGNC symbol of the gene targeted by the sgRNA under consideration;
• avgFC: the log fold change of the sgRNA under consideration averaged across replicates;
• BP: the genomic coordinate of the sgRNA, defined as STARTpos+(ENDpos-STARTpos)/2.

This can be generated using the ccr.logFCs2chromPos function, starting from a data frame containing sgRNAs’ log fold changes generated by the ccr.NormfoldChanges function from raw sgRNAs’ counts.

label A string indicating the experiment name. This is used to compose the main title of the plots generated by this function and the name of the folder where the results are saved.

display A logical value indicating whether genomic plots showing the results of the biased regions’ identification and their log fold change correction should be generated.

saveTO If different from NULL, the path where pdf files with the genomic plots showing the results of the biased regions’ identification (and their log fold change correction) will be saved (within a folder named as defined in the label parameter).

ignoredGenes A vector of strings containing HGNC symbols of genes that should not be considered when computing the minimal number of different genes targeted by sgRNAs in the same identified region of estimated equal log fold changes. This could contain, for example, a priori known essential genes.

min.ngenes A numerical value (>0) specifying the minimal number of different genes that the set of sgRNAs within a region of estimated equal logFCs should target in order for their logFCs to be corrected, i.e. mean centered.

Value

A list containing two data frames and a vector of strings. The ﬁrst data frame (corrected_logFCs)

contains a named row per each sgRNA and the following columns/header:

•CHR: the chromosome of the gene targeted by the sgRNA under consideration;

•startp: the genomic coordinate of the starting position of the region targeted by the sgRNA

under consideration;

•endp: the genomic coordinate of the ending position of the region targeted by the sgRNA

under consideration;

•genes: the HGNC symbol of the gene targeted by the sgRNA under consideration;

•avgFC: the log fold change of the sgRNA averaged across replicates;

•correction: the type of correction: 1 = increased log fold change, -1 = decreased log fold

change. 0 indicates no correction;

•correctedFC: the corrected log fold change of the sgRNA

The second data frame (segments) contains the identiﬁed region of estimated equal log fold changes

(one region per row) and the following columns/headers:

•CHR: the chromosome of the region under consideration;

•startp: the genomic coordinate of the starting position of the region under consideration;

•endp: the genomic coordinate of the ending position of the region under consideration;

•n.sgRNAs: the number of sgRNAs targeting sequences in the region under consideration;

•avg.logFC: the average log fold change of the sgRNAs in the region;

•guideIdx: the index range of the sgRNAs targeting the region under consideration, as they appear in the gwSortedFCs provided in input.

The vector of strings (SORTED_sgRNAs) contains the sgRNAs’ identifiers in the same order as they are reported in the gwSortedFCs input data frame, i.e. genome sorted.
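
As an illustration of how the avgFC, correctedFC and correction columns relate, the following sketch rebuilds the correction code on a toy data frame. The values are made up, and deriving correction with sign() is an assumption based on the column descriptions above, not the package's code.

```r
## Toy data frame mimicking the layout of corrected_logFCs (made-up values)
cfc <- data.frame(
  CHR         = c(1, 1, 2),
  genes       = c("G1", "G2", "G3"),
  avgFC       = c(-1.5, 0.3, 0.8),
  correctedFC = c(-0.5, 0.3, 0.2),
  row.names   = c("sg1", "sg2", "sg3")
)

## 1 = log fold change increased, -1 = decreased, 0 = no correction
cfc$correction <- sign(cfc$correctedFC - cfc$avgFC)

## Identifiers of the sgRNAs whose log fold change was modified
rownames(cfc)[cfc$correction != 0]
```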

Author(s)

Francesco Iorio (iorio@ebi.ac.uk)

References

[1] Olshen, A. B., Venkatraman, E. S., Lucito, R., Wigler, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5: 557-572.

[2] Venkatraman, E. S., Olshen, A. B. (2007). A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23: 657-63.

[3] Andrew J. Aguirre, Robin M. Meyers, Barbara A. Weir, Francisca Vazquez, Cheng-Zhong

Zhang, Uri Ben-David, April Cook, Gavin Ha, William F. Harrington, Mihir B. Doshi, Maria

Kost-Alimova, Stanley Gill, Han Xu, Levi D. Ali, Guozhi Jiang, Sasha Pantel, Yenarae Lee, Amy

Goodale, Andrew D. Cherniack, Coyin Oh, Gregory Kryukov, Glenn S. Cowley, Levi A. Garraway,

Kimberly Stegmaier, Charles W. Roberts, Todd R. Golub, Matthew Meyerson, David E. Root, Aviad

Tsherniak and William C. Hahn. Genomic copy number dictates a gene-independent cell response

to CRISPR-Cas9 targeting. Cancer Discov June 3 2016 DOI: 10.1158/2159-8290.CD-16-0154

[4] Iorio, F., Behan, F. M., Goncalves, E., Beaver, C., Ansari, R., Pooley, R., et al. (n.d.). Unsuper-

vised correction of gene-independent cell responses to CRISPR-Cas9 targeting.

http://doi.org/10.1101/228189

See Also

ccr.cleanChrm

Examples

## Loading sgRNA library annotation file

data(KY_Library_v1.0)

## Deriving the path of the file with the example dataset,

## from the mutagenesis of the HT-29 colorectal cancer cell line

fn<-paste(system.file('extdata', package = 'CRISPRcleanR'),'/HT-29_counts.tsv',sep='')

## Loading, median-normalizing and computing fold-changes for the example dataset

normANDfcs<-ccr.NormfoldChanges(fn,min_reads=30,EXPname='HT-29',

libraryAnnotation = KY_Library_v1.0)

## Genome-sorting of the fold changes

gwSortedFCs<-ccr.logFCs2chromPos(normANDfcs$logFCs,KY_Library_v1.0)

## Identifying and correcting biased sgRNAs'fold changes

correctedFCs<-ccr.GWclean(gwSortedFCs,display=TRUE,label='HT-29')

## Visualising the first entries of the corrected fold changes

head(correctedFCs$corrected_logFCs)

ccr.impactOnPhenotype Assessing the impact and potential distortion introduced by the

CRISPRcleanR correction on the genes showing loss/gain-of-ﬁtness

effect.

Description

This function compares two MAGeCK [1] gene summaries (obtained from sgRNA count files pre/post CRISPRcleanR correction) and computes the percentages of genes whose loss/gain-of-fitness effect is attenuated post CRISPRcleanR correction or potentially distorted (i.e. loss-of-fitness genes are detected post CRISPRcleanR correction as gain-of-fitness genes, and vice versa). Results are returned in output and optionally plotted as bar/pie charts.

Usage

ccr.impactOnPhenotype(MO_uncorrectedFile,

MO_correctedFile,

sigFDR = 0.05,

expName = "expName",

display = TRUE)

Arguments

MO_uncorrectedFile

String specifying the path to a MAGeCK gene summary ﬁle produced by MAGeCK

from non corrected sgRNA counts.

MO_correctedFile

String specifying the path to a MAGeCK gene summary ﬁle produced by MAGeCK

from CRISPRcleanR corrected sgRNA counts.

sigFDR A numerical value in [0,1] specifying the false discovery rate threshold at which genes are called as significantly exerting a loss/gain-of-fitness effect.

expName A string specifying the experiment name, used as main title in the figures (ignored if the display argument is set to FALSE).

display Boolean value specifying whether figures summarising the comparison results should be plotted.

Details

For each of the considered MAGeCK gene summaries, this function calls loss/gain-of-fitness genes based on the MAGeCK negative/positive false discovery rates and the user-defined threshold (as specified by the sigFDR argument). In particular, genes with a negative fdr < sigFDR and a positive fdr >= sigFDR are called as significant loss-of-fitness genes, and genes with a positive fdr < sigFDR and a negative fdr >= sigFDR are called as significant gain-of-fitness genes. All other genes are deemed as not exerting any effect on cellular fitness.
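
The calling rule described above can be sketched in a few lines of base R. The helper call_fitness and the FDR values are hypothetical; the labels dep., null and enr. follow those used for the geneCounts output of this function.

```r
## Hypothetical helper: call fitness effects from MAGeCK-style negative and
## positive FDRs, following the rule described in the Details section.
call_fitness <- function(neg.fdr, pos.fdr, sigFDR = 0.05) {
  calls <- rep("null", length(neg.fdr))
  calls[neg.fdr < sigFDR & pos.fdr >= sigFDR] <- "dep."  # loss of fitness
  calls[pos.fdr < sigFDR & neg.fdr >= sigFDR] <- "enr."  # gain of fitness
  factor(calls, levels = c("dep.", "null", "enr."))
}

## Made-up FDRs for four genes
neg <- c(geneA = 0.001, geneB = 0.40, geneC = 0.90, geneD = 0.01)
pos <- c(geneA = 0.95,  geneB = 0.60, geneC = 0.02, geneD = 0.03)

calls <- call_fitness(neg, pos)
setNames(as.character(calls), names(neg))
```

Note that geneD, significant on both sides, falls in the null class: a gene is called only when exactly one of the two FDRs is below the threshold.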

Value

A list containing the following four numerical values and two data frames:

•GW_impact %: Percentage of genes impacted by the CRISPRcleanR correction, i.e. showing a gain/loss-of-fitness effect in the MAGeCK gene summary obtained from uncorrected sgRNA counts that is attenuated post correction, over the total number of screened genes;

•Phenotype_G_impact %: Percentage of genes impacted by the CRISPRcleanR correction (as defined above), over the total number of genes showing a gain/loss-of-fitness effect in the MAGeCK gene summary obtained from uncorrected sgRNA counts;

•GW_distortion %: Percentage of genes distorted by the CRISPRcleanR correction, i.e. showing a gain/loss-of-fitness effect in the MAGeCK gene summary obtained from corrected sgRNA counts that is opposite to the effect in that obtained from uncorrected sgRNA counts, over the total number of screened genes;

•Phenotype_G_distortion %: Percentage of genes distorted by the CRISPRcleanR correction (as defined above), over the total number of genes showing a gain/loss-of-fitness effect in the MAGeCK gene summary obtained from uncorrected sgRNA counts;

•geneCounts: a contingency table with gene counts as entries, with data referring to the original (uncorrected) sgRNA counts on the columns, and to the corrected sgRNA counts on the rows. Each dimension has three entries: the number of genes showing a significant loss-of-fitness effect (dep.), the number of genes not showing any fitness effect, or with an unclear effect, i.e. showing both gain and loss of fitness (null), and the number of genes showing a significant gain-of-fitness effect (enr.);

•distortion: a data frame showing genes whose fitness effect has been distorted by the CRISPRcleanR correction: one row per gene (as specified by the row names), with two columns per condition (i.e. prior/post correction), indicating the loss-of-fitness effect fdr (neg.fdr and ccr.neg.fdr) and the gain-of-fitness effect fdr (pos.fdr and ccr.pos.fdr) as outputted by MAGeCK.
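
To make the layout of geneCounts and the two distortion percentages concrete, the sketch below derives them from two made-up vectors of gene calls. It is only an illustration under the definitions given above; the variable names are hypothetical and the package may compute these quantities differently.

```r
lev <- c("dep.", "null", "enr.")

## Made-up calls for six genes, before and after correction
uncorrected <- factor(c("dep.", "dep.", "enr.", "null", "enr.", "null"), levels = lev)
corrected   <- factor(c("dep.", "null", "dep.", "null", "enr.", "null"), levels = lev)

## Corrected calls on the rows, uncorrected calls on the columns
geneCounts <- table(corrected = corrected, uncorrected = uncorrected)

## Distorted genes: significant in both summaries, with opposite effects
distorted <- (uncorrected == "dep." & corrected == "enr.") |
             (uncorrected == "enr." & corrected == "dep.")

## Percentages over all screened genes and over the genes with a
## significant effect before correction
GW_distortion          <- 100 * sum(distorted) / length(uncorrected)
Phenotype_G_distortion <- 100 * sum(distorted) / sum(uncorrected != "null")

geneCounts
c(GW_distortion, Phenotype_G_distortion)
```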

Author(s)

Francesco Iorio (ﬁ9323@gmail.com)

References

[1] Li, W., Xu, H., Xiao, T., Cong, L., Love, M. I., Zhang, F., et al. (2014). MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens. Genome Biology, 15(12), 554.

[2] Hart, T., & Moffat, J. (2016). BAGEL: a computational framework for identifying essential genes from pooled library screens. BMC Bioinformatics, 17(1), 164.

See Also

ccr.ExecuteMageck

Examples

## Loading sgRNA library annotation file

data(KY_Library_v1.0)

## Deriving the path of the file with the example dataset,

## from the mutagenesis of the EPLC-272H lung cancer cell line

fn<-paste(system.file('extdata', package = 'CRISPRcleanR'),

'/EPLC-272H_counts.tsv',sep='')

## Loading, median-normalizing and computing fold-changes for the example dataset

normANDfcs<-ccr.NormfoldChanges(fn,min_reads=30,

EXPname='EPLC-272H',

libraryAnnotation = KY_Library_v1.0)

uncorrected_fn<-ccr.PlainTsvFile(sgRNA_count_object = normANDfcs$norm_counts,

fprefix = 'EPLC-272H')

## execute MAGeCK on uncorrected normalised counts

uncorrected_gs_fn<-ccr.ExecuteMageck(mgckInputFile = uncorrected_fn,

expName = 'EPLC-272H',

normMethod = 'none')

## Genome-sorting of the fold changes

gwSortedFCs<-ccr.logFCs2chromPos(normANDfcs$logFCs,KY_Library_v1.0)

## Identifying and correcting biased sgRNAs'fold changes

correctedFCs<-ccr.GWclean(gwSortedFCs,display=FALSE,label='EPLC-272H')

## correcting individual sgRNA treatment counts

correctedCounts<-ccr.correctCounts('EPLC-272H',normANDfcs$norm_counts,

correctedFCs,

KY_Library_v1.0,

minTargetedGenes=3,

OutDir='./')

## saving corrected/uncorrected sgRNA count files as plain tsv files

corrected_fn<-ccr.PlainTsvFile(sgRNA_count_object = correctedCounts,

fprefix = 'EPLC-272H_ccleaned')

## execute MAGeCK on corrected normalised counts

corrected_gs_fn<-ccr.ExecuteMageck(mgckInputFile = corrected_fn,

expName = 'EPLC-272H_ccleaned')

## Assessing the impact of CRISPRcleanR correction on gain/loss-of-fitness genes

RES<-ccr.impactOnPhenotype(MO_uncorrectedFile = uncorrected_gs_fn,

MO_correctedFile = corrected_gs_fn,

expName = 'EPLC-272H')

## Percentage of genes whose gain/loss-of fitness effect is impacted by CRISPRcleanR

## over the total number of screened genes

RES[1]

## Percentage of genes whose gain/loss-of fitness effect is impacted by CRISPRcleanR

## over the total number of genes with a significant gain/loss-of fitness effect when

## using uncorrected sgRNA counts

RES[2]

## Percentage of genes whose gain/loss-of fitness effect is distorted by CRISPRcleanR

## over the total number of screened genes

RES[3]

## Percentage of genes whose gain/loss-of fitness effect is distorted by CRISPRcleanR

## over the total number of genes with a significant gain/loss-of fitness effect when

## using uncorrected sgRNA counts

RES[4]

## Contingency table showing the impact of the CRISPRcleanR correction on the phenotype

RES$geneCounts

## Genes whose gain/loss-of-fitness effect has been distorted by the CRISPRcleanR correction

RES$distortion

ccr.logFCs2chromPos Genomic sorting of sgRNAs’ log fold changes.

Description

This function maps genome-wide sgRNAs’ log fold changes (averaged across replicates) on the genome and returns them sorted according to the position of their targeted region on the chromosomes.

Usage

ccr.logFCs2chromPos(foldchanges, libraryAnnotation)

Arguments

foldchanges A data frame containing genome-wide sgRNAs’ log fold changes, one column

per library transfection replicate, with ﬁrst and second column containing the

sgRNAs’ identiﬁers and the HGNC symbols of the targeted genes, respectively.

This can be generated from raw count ﬁles using the ccr.NormfoldChanges

function.

libraryAnnotation

A data frame containing the sgRNAs’ genome-wide annotations with at least

a named row for each of the sgRNAs included in the foldchanges data frame

provided in input. The following columns/headers should be present in this data

frame (additional columns will be ignored):

• GENES: string vector containing the HGNC symbols of the genes targeted

by the sgRNA under consideration;

• EXONE: string vector containing the gene exon targeted by the sgRNA under consideration (these should include the prefix "ex" followed by the exon number);

• CHRM: string vector containing the chromosome of the gene targeted by the sgRNA under consideration (X and Y chromosomes should be specified as "X" and "Y");

• STRAND: string vector containing the strand targeted by the sgRNA under consideration ("+" or "-");

• STARTpos: numeric vector containing the genomic coordinate of the starting position of the region targeted by the sgRNA under consideration;

• ENDpos: numeric vector containing the genomic coordinate of the ending position of the region targeted by the sgRNA under consideration;

Additional columns can be optionally included and will be ignored by this function. The annotation for the genome-wide sgRNA library presented in [1] is included in the KY_Library_v1.0 data object, formatted as described above.

Value

A data frame with a named row per each sgRNA and the following columns/headers:

•CHR: the chromosome where the gene targeted by the sgRNA under consideration resides;

•startp: the genomic coordinate of the starting position of the region targeted by the sgRNA

under consideration;

•endp: the genomic coordinate of the ending position of the region targeted by the sgRNA

under consideration;

•avgFC: the log fold change of the sgRNA averaged across replicates;

•BP: the genomic coordinate of the sgRNA deﬁned as STARTpos+(ENDpos-STARTpos)/2.
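
The mapping and sorting step can be sketched in base R as follows. This is a simplified illustration on made-up data, not the package's code: the chromosome ordering (1 to 22, then X and Y) is an assumption, and the genes column is added only because it appears in the input expected by ccr.GWclean.

```r
## Toy inputs (made up); column names follow the formats described above
foldchanges <- data.frame(
  sgRNA = c("sg1", "sg2", "sg3"),
  gene  = c("G1", "G2", "G3"),
  rep1  = c(-1.0, 0.2, 0.5),
  rep2  = c(-2.0, 0.4, 0.1),
  row.names = c("sg1", "sg2", "sg3")
)
libraryAnnotation <- data.frame(
  GENES    = c("G1", "G2", "G3"),
  CHRM     = c("2", "1", "1"),
  STARTpos = c(500, 900, 100),
  ENDpos   = c(520, 920, 120),
  row.names = c("sg1", "sg2", "sg3"),
  stringsAsFactors = FALSE
)

## Align the annotation to the sgRNAs, average replicates, compute midpoints
ann   <- libraryAnnotation[rownames(foldchanges), ]
avgFC <- rowMeans(foldchanges[, -(1:2), drop = FALSE])
BP    <- ann$STARTpos + (ann$ENDpos - ann$STARTpos) / 2

mapped <- data.frame(CHR = ann$CHRM, startp = ann$STARTpos, endp = ann$ENDpos,
                     genes = ann$GENES, avgFC = avgFC, BP = BP,
                     row.names = rownames(foldchanges))

## Sort by chromosome (assumed order: 1..22, X, Y), then by position
chrOrder <- match(mapped$CHR, c(as.character(1:22), "X", "Y"))
mapped   <- mapped[order(chrOrder, mapped$BP), ]
mapped
```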

Author(s)

Francesco Iorio (iorio@ebi.ac.uk)

References

[1] Tzelepis K, Koike-Yusa H, De Braekeleer E, Li Y, Metzakopian E, Dovey OM, Mupo A,

Grinkevich V, Li M, Mazan M, Gozdecka M, Onishi S, Cooper J, Patel M, McKerrell T, Chen

B, Domingues AF, Gallipoli P, Teichmann S, Ponstingl H, McDermott U, Saez-Rodriguez J, Huntly

BJP, Iorio F, Pina C, Vassiliou GS, Yusa K. A CRISPR dropout screen identiﬁes genetic vulnerabili-

ties and therapeutic targets in acute myeloid leukaemia. Cell Reports 2016 Oct 18;17(4):1193-1205

See Also

ccr.NormfoldChanges,KY_Library_v1.0

Examples

data(KY_Library_v1.0)

fn<-paste(system.file('extdata', package = 'CRISPRcleanR'),'/A2058_counts.tsv',sep='')

normANDfcs<-ccr.NormfoldChanges(fn,min_reads=30,

EXPname='Example',

libraryAnnotation=KY_Library_v1.0)

mappedLogFCs<-ccr.logFCs2chromPos(normANDfcs$logFCs,KY_Library_v1.0)

head(mappedLogFCs)

ccr.multDensPlot Multiple shaded density plot

Description

This function plots multiple distribution densities with solid colors for the curves and shaded colors for the underlying areas.

Usage

ccr.multDensPlot(TOPLOT, COLS,

XLIMS, TITLE, LEGentries, XLAB)

Arguments

TOPLOT A list of density objects computed using the density function of the stats package.

COLS A vector of colors, of the same length as TOPLOT, used to plot the density curves. Alpha-reduced versions of these colors are used to fill the underlying areas.

XLIMS A vector of two numerical values optionally specifying x-axis limits (NULL by

default).

TITLE A string containing the plot title.

LEGentries A vector of strings (one per each density in TOPLOT) specifying corresponding

legend entries.

XLAB A string containing the x-axis label.
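
The effect of the COLS argument (solid curves over alpha-reduced fills) can be reproduced with base graphics alone. The sketch below is an independent illustration, not the body of ccr.multDensPlot; the alpha value 0.3 is an arbitrary choice.

```r
## Two made-up densities and their colors
dens <- list(x = density(rnorm(500)), y = density(rnorm(500, mean = 2)))
cols <- c("red", "blue")

## Axis limits covering all densities
xlim <- range(sapply(dens, function(d) range(d$x)))
ylim <- c(0, max(sapply(dens, function(d) max(d$y))))

plot(NA, xlim = xlim, ylim = ylim, xlab = "value", ylab = "density",
     main = "example")
for (i in seq_along(dens)) {
  ## alpha-reduced version of the curve color, used for the shaded area
  fill <- grDevices::adjustcolor(cols[i], alpha.f = 0.3)
  polygon(dens[[i]], col = fill, border = NA)
  lines(dens[[i]], col = cols[i], lwd = 2)
}
legend("topright", legend = names(dens), fill = cols)
```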

Author(s)

Francesco Iorio (ﬁ9323@gmail.com)

Examples

## generating random data

x <- rnorm(1000, 0, 0.5)

y <- rnorm(1000, 2, 0.4)

z <- rnorm(1000, -1, 1.5)

## assembling kernel estimated distributions into a list

ToPlot<-list(x=density(x),y=density(y),z=density(z))

## density visualisation

ccr.multDensPlot(ToPlot,COLS = c('red','blue','gray'),

TITLE = 'example',LEGentries = c('x','y','z'),

XLIMS = c(-5,3))

ccr.NormfoldChanges Median-ratio normalisation of sgRNA counts and fold change computation

Description

This function median-ratio normalises [1,2] sgRNAs’ counts stored in a tsv file whose path is provided in input, to adjust for the effect of library size and read count distributions. It then computes log fold changes of transfected library replicates versus controls (typically the sgRNA counts in the plasmid). The output of this function is returned as a list, and it is also saved into two tsv files.

Usage

ccr.NormfoldChanges(filename, display=TRUE, saveToFig=FALSE,

outdir='./', min_reads=30, EXPname='',

libraryAnnotation, ncontrols=1)

Arguments

filename A string specifying the path of a tsv ﬁle containing the raw sgRNA counts.

This must be a tab delimited ﬁle with one row per sgRNA and the following

columns/headers:

• sgRNA: containing alphanumerical identifiers of the sgRNA under consideration;

• gene: containing HGNC symbols of the genes targeted by the sgRNA under consideration;

followed by the columns containing the sgRNAs’ counts for the controls and the columns for the library-transfected samples.

display A logical value specifying whether figures containing boxplots with the count values pre/post normalisation and log fold-changes should be visualised (TRUE, by default).

saveToFig A logical value specifying whether figures containing boxplots with the count values pre/post normalisation and log fold-changes should be saved as pdf files (FALSE, by default). Setting this parameter to TRUE overrides the value of the display parameter.

outdir Path of the directory where the normalised sgRNAs’ counts and the log fold

changes, as well as the pdf ﬁles (if the parameter saveToFig is set to TRUE),

must be saved.

min_reads This parameter defines a filter threshold value for sgRNAs, based on their average counts in the control sample. Specifically, it indicates the minimal number of counts that each individual sgRNA needs to have in the controls (on average) in order to be included in the output.

EXPname A string specifying the name of the experiment. This will be used to compose the main title of the generated figures and the file names.

libraryAnnotation

A data frame containing the sgRNA annotations, with a named row for each sgRNA, and columns for targeted genes, genomic coordinates and possibly other information. This should be formatted as the KY_Library_v1.0 data object containing the annotation of the sgRNA library presented in [3].

ncontrols A numerical value indicating the number of control replicates (therefore columns

to be considered as control counts after the ﬁrst two, in the inputted tsv ﬁle).

Value

A list containing two data frames: the normalised sgRNAs’ counts (norm_counts) and the sgRNAs’ log fold changes (logFCs), respectively. The first two columns in these data frames contain the sgRNAs’ identifiers and the HGNC symbols of the targeted genes, respectively.
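
For readers unfamiliar with median-ratio normalisation, the sketch below applies the size-factor scheme of reference [2] to a toy count table and then derives log2 fold changes against the control. It is an illustration under stated assumptions, not the code of ccr.NormfoldChanges: the data are made up, the 0.5 pseudo-count is an assumption, and the package's own computation may differ.

```r
## Toy raw counts: one control (plasmid) column followed by two replicates
counts <- data.frame(
  sgRNA   = paste0("sg", 1:5),
  gene    = c("G1", "G1", "G2", "G2", "G3"),
  plasmid = c(100, 200, 50, 400, 20),
  rep1    = c(150, 380, 90, 820, 10),
  rep2    = c(60, 90, 30, 210, 40)
)
min_reads <- 30
ncontrols <- 1
ctrl <- 3:(2 + ncontrols)

## Keep sgRNAs with enough reads (on average) in the controls
keep   <- rowMeans(counts[, ctrl, drop = FALSE]) >= min_reads
counts <- counts[keep, ]

## Median-ratio size factors: per sample, the median ratio of each sgRNA
## count to its geometric mean across samples
m          <- as.matrix(counts[, -(1:2)])
geoMean    <- exp(rowMeans(log(m)))
sizeF      <- apply(m / geoMean, 2, median)
normCounts <- sweep(m, 2, sizeF, "/")

## log2 fold changes versus the control (0.5 pseudo-count is an assumption)
ctrlMean <- rowMeans(normCounts[, 1:ncontrols, drop = FALSE])
logFCs   <- log2((normCounts[, -(1:ncontrols), drop = FALSE] + 0.5) /
                 (ctrlMean + 0.5))
logFCs
```

With min_reads = 30, the sgRNA with 20 plasmid reads is dropped before normalisation, which mirrors the filtering role of the min_reads argument.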

Author(s)

Francesco Iorio (ﬁ9323@gmail.com)

References

[1] Wang T, Wei JJ, Sabatini DM, Lander ES. Genetic screens in human cells using the CRISPR-

Cas9 system. Science. 2014, 343: 80-84.

[2] Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol.

2010, 11: R106

[3] Tzelepis K, Koike-Yusa H, De Braekeleer E, et al. A CRISPR dropout screen identifies genetic vulnerabilities and therapeutic targets in acute myeloid leukaemia. Cell Reports 2016 Oct 18;17(4):1193-1205

See Also

KY_Library_v1.0

Examples

## loading sgRNA library annotation

data(KY_Library_v1.0)

## derive path for an example dataset

fn<-paste(system.file('extdata', package = 'CRISPRcleanR'),'/HT-29_counts.tsv',sep='')

## sgRNAs'normalisation and computation of log fold-changes

normANDfcs<-ccr.NormfoldChanges(fn,

min_reads=30,

EXPname='Example',

libraryAnnotation=KY_Library_v1.0)

## inspecting the first entries of the data frames containing the

## normalised counts and the log fold-changes

head(normANDfcs$norm_counts)

head(normANDfcs$logFCs)

ccr.perf_distributions

CRISPRcleanR correction assessment: inspection of sgRNA log fold

changes distributions

Description

This function creates density plots of the distributions of sgRNA log fold changes for defined sets of targeted genes prior/post CRISPRcleanR correction.

Usage

ccr.perf_distributions(cellLine, correctedFCs,

GDSC.geneLevCNA = NULL,

CCLE.gisticCNA = NULL,

RNAseq.fpkms = NULL,

minCNs = c(8, 10),

libraryAnnotation)

Arguments

cellLine A string specifying the name of a cell line (or a COSMIC identiﬁer [1]);

correctedFCs sgRNAs log fold changes corrected for gene independent responses to CRISPR-

Cas9 targeting, generated with the function ccr.GWclean (ﬁrst data frame in-

cluded in the list outputted by ccr.GWclean, i.e. corrected_logFCs).

GDSC.geneLevCNA

Genome-wide copy number data in the same format as GDSC.geneLevCNA. This can be assembled from the xls sheet specified in the source section [a] (containing data for the GDSC1000 cell lines). If NULL, then this function uses the builtin GDSC.geneLevCNA data frame, containing data derived from [a] for 15 cell lines used in [2] to assess the performances of CRISPRcleanR.

CCLE.gisticCNA Genome-wide Gistic [3] scores quantifying copy number status across cell lines, in the same format as CCLE.gisticCNA. If NULL, then this function uses the CCLE.gisticCNA builtin data frame, containing data for 13 cell lines of the 15 used in [2] to assess the performances of CRISPRcleanR.

RNAseq.fpkms Genome-wide gene expression values, in fragments per kilobase of exon per million reads mapped (FPKM), across cell lines. These can be derived from a comprehensive collection of RNAseq profiles described in [4]. The format must be the same as that of the RNAseq.fpkms builtin data frame. If NULL, then this function uses the RNAseq.fpkms builtin data frame, containing data for 15 cell lines used in [2] to assess CRISPRcleanR results.

minCNs A numerical vector with two entries specifying the minimal copy number for a

gene in order to be considered ampliﬁed based on the data in GDSC.geneLevCNA.

These two values can be 2, 4, 8 or 10.

libraryAnnotation

The sgRNA library annotations formatted as speciﬁed in the reference manual

entry of the KY_Library_v1.0 built in library.

Details

This function generates 4 sets of plots. They contain density plots of the log fold change distributions prior/post CRISPRcleanR correction, respectively for

• (i) Copy number ampliﬁed genes according to the data in GDSC.geneLevCNA based on the two

threshold values speciﬁed in minCNs;

• (ii) Copy number ampliﬁed genes according to the data in CCLE.gisticCNA (gistic score =

+2);

• (iii) Copy number ampliﬁed non expressed genes according to the data in GDSC.geneLevCNA

based on the two threshold values speciﬁed in minCNs, and the data in RNAseq.fpkms (FPKM

< 0.05);

• (iv) reference sets of core fitness essential genes from MSigDB [5] (included in the builtin vectors EssGenes.DNA_REPLICATION_cons, EssGenes.KEGG_rna_polymerase, EssGenes.PROTEASOME_cons, EssGenes.ribosomalProteins, EssGenes.SPLICEOSOME_cons), and reference core-fitness-essential and non-essential genes assembled from multiple RNAi studies, used as classification template by the BAGEL algorithm to call gene depletion significance [6] (BAGEL_essential, BAGEL_nonEssential).
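
The selection in set (iii) combines a copy number threshold with an expression threshold. A minimal sketch on made-up per-gene values for a single cell line is given below; the vectors cn and fpkm are hypothetical stand-ins for the corresponding columns of GDSC.geneLevCNA and RNAseq.fpkms, whose actual layout is not reproduced here.

```r
## Made-up per-gene values for one cell line
cn   <- c(G1 = 2,  G2 = 9,    G3 = 12, G4 = 8, G5 = 3)     # copy number
fpkm <- c(G1 = 10, G2 = 0.01, G3 = 5,  G4 = 0, G5 = 0.02)  # expression
minCNs <- c(8, 10)

## For each threshold: genes with copy number >= threshold and FPKM < 0.05
ampNonExp <- lapply(minCNs, function(th) {
  names(cn)[cn >= th & fpkm[names(cn)] < 0.05]
})
names(ampNonExp) <- paste0("CN>=", minCNs)
ampNonExp
```

Because these genes are not expressed, any depletion of the sgRNAs targeting them is unlikely to reflect gene function, which is why they are a useful set for inspecting the correction.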

Author(s)

Francesco Iorio (ﬁ9323@gmail.com)

Source

[a] ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/release-6.0/Gene_level_

CN.xlsx.

References

[1] Forbes SA, Beare D, Boutselakis H, et al. COSMIC: somatic cancer genetics at high-resolution

Nucleic Acids Research, Volume 45, Issue D1, 4 January 2017, Pages D777-D783.

[2] Iorio, F., Behan, F. M., Goncalves, E., Beaver, C., Ansari, R., Pooley, R., et al. (n.d.). Unsuper-

vised correction of gene-independent cell responses to CRISPR-Cas9 targeting.

http://doi.org/10.1101/228189

[3] Mermel CH, Schumacher SE, Hill B, et al. GISTIC2.0 facilitates sensitive and conﬁdent lo-

calization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol.

2011;12(4):R41. doi: 10.1186/gb-2011-12-4-r41.

[4] Garcia-Alonso L, Iorio F, Matchan A, et al. Transcription factor activities enhance markers of

drug response in cancer doi: https://doi.org/10.1101/129478

[5] Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., et

al. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-

wide expression proﬁles. Proceedings of the National Academy of Sciences of the United States of

America, 102(43), 15545-15550. http://doi.org/10.1073/pnas.0506580102

[6] BAGEL: a computational framework for identifying essential genes from pooled library screens.

Traver Hart and Jason Moffat. BMC Bioinformatics, 2016 vol. 17 p. 164.

See Also

KY_Library_v1.0,ccr.GWclean,

GDSC.geneLevCNA,CCLE.gisticCNA,RNAseq.fpkms,

EssGenes.DNA_REPLICATION_cons,EssGenes.KEGG_rna_polymerase,EssGenes.PROTEASOME_cons,

EssGenes.ribosomalProteins,EssGenes.SPLICEOSOME_cons

BAGEL_essential,BAGEL_nonEssential

Examples

## loading corrected sgRNAs log fold-changes and segment annotations for an example

## cell line (HT-29)

data(HT.29correctedFCs)

## loading library annotation

data(KY_Library_v1.0)

## inspecting sgRNA log fold change distributions prior/post CRISPRcleanR correction

ccr.perf_distributions('HT-29',HT.29correctedFCs$corrected_logFCs,

libraryAnnotation = KY_Library_v1.0)

ccr.perf_statTests CRISPRcleanR correction assessment: Statistical tests

Description

This function tests the log fold changes of sgRNAs targeting different sets of genes for statistically

signiﬁcant differences with respect to background pre and post CRISPRcleanR correction, creating

two sets of boxplots with outcomes and outputting statistical indicators.

Usage

ccr.perf_statTests(cellLine, libraryAnnotation, correctedFCs,

outDir = "./",

GDSC.geneLevCNA = NULL,

CCLE.gisticCNA = NULL,

RNAseq.fpkms = NULL)

Arguments

cellLine A string specifying the name of a cell line (or a COSMIC identiﬁer [1]);

libraryAnnotation

The sgRNA library annotations formatted as speciﬁed in the reference manual

entry of the KY_Library_v1.0 built in library.

correctedFCs sgRNAs log fold changes corrected for gene independent responses to CRISPR-

Cas9 targeting, generated with the function ccr.GWclean (ﬁrst data frame in-

cluded in the list outputted by ccr.GWclean, i.e. corrected_logFCs).

outDir The path of the folder where the boxplot will be saved.