Reference Manual
Package ‘CRISPRcleanR’

June 8, 2018

Type Package
Title Unsupervised correction of gene independent cell responses to CRISPR-cas9 targeting
Version 0.3
Date 2017-09-11
Author Francesco Iorio
Maintainer Francesco Iorio
License GPL-2
Description CRISPRcleanR is an R package for identifying and correcting gene independent responses to CRISPR-Cas9 targeting in genome-wide pooled sgRNA dropout screens. CRISPRcleanR uses an unsupervised approach based on the segmentation of single-guide RNA (sgRNA) fold change values across the genome, without making any assumption on the copy number status of the targeted genes. CRISPRcleanR reports sgRNA fold changes and normalised sgRNA read counts; it is therefore compatible with downstream analysis tools and works with multiple sgRNA libraries.
Depends stringr, DNAcopy, pROC, stats, utils, grDevices, graphics, pracma, PRROC
RoxygenNote 6.0.1

R topics documented:

BAGEL_essential
BAGEL_nonEssential
CCLE.gisticCNA
ccr.cleanChrm
ccr.correctCounts
ccr.ExecuteMageck
ccr.geneMeanFCs
ccr.genes2sgRNAs
ccr.get.CCLEgisticSets
ccr.get.gdsc1000.AMPgenes
ccr.get.nonExpGenes
ccr.GWclean
ccr.impactOnPhenotype
ccr.logFCs2chromPos
ccr.multDensPlot
ccr.NormfoldChanges
ccr.perf_distributions
ccr.perf_statTests
ccr.PlainTsvFile
ccr.PrRc_Curve
ccr.RecallCurves
ccr.ROC_Curve
ccr.VisDepAndSig
CL.subset
EPLC.272HcorrectedFCs
EssGenes.DNA_REPLICATION_cons
EssGenes.KEGG_rna_polymerase
EssGenes.PROTEASOME_cons
EssGenes.ribosomalProteins
EssGenes.SPLICEOSOME_cons
GDSC.CL_annotation
GDSC.geneLevCNA
HT.29correctedFCs
KY_Library_v1.0
RNAseq.fpkms
Index

BAGEL_essential
Reference Core fitness essential genes

Description
A list of reference core fitness essential genes assembled from multiple RNAi studies, used as classification template by the BAGEL algorithm to call gene depletion significance [1].

Usage
data(BAGEL_essential)

Format
A vector of strings containing HGNC symbols of reference core fitness essential genes.

References
[1] BAGEL: a computational framework for identifying essential genes from pooled library screens. Traver Hart and Jason Moffat. BMC Bioinformatics, 2016 vol. 17 p. 164.

See Also
BAGEL_nonEssential

Examples
data(BAGEL_essential)
head(BAGEL_essential)

BAGEL_nonEssential
Reference set of non essential genes

Description
A list of reference non essential genes assembled from multiple RNAi studies, used as classification template by the BAGEL algorithm to call gene depletion significance [1].

Usage
data(BAGEL_nonEssential)

Format
A vector of strings containing HGNC symbols of reference non essential genes.

References
[1] BAGEL: a computational framework for identifying essential genes from pooled library screens. Traver Hart and Jason Moffat. BMC Bioinformatics, 2016 vol. 17 p. 164.

See Also
BAGEL_essential

Examples
data(BAGEL_nonEssential)
head(BAGEL_nonEssential)

CCLE.gisticCNA
Genome-wide copy number data for 13 human cancer cell lines.

Description
Genome-wide Gistic [1] scores quantifying copy number status across a subset of the cell lines in CL.subset that are used to assess CRISPRcleanR results in [2].

Usage
data(CCLE.gisticCNA)

Format
A data frame with one observation per gene across 13 variables (one per cell line). Row names indicate HGNC gene symbols and column names indicate cell line COSMIC identifiers [3].

Source
This data frame has been derived from the tsv file downloadable at http://www.cbioportal.org/study?id=cellline_ccle_broad#summary. This has been obtained by processing Affymetrix SNP array data in the Cancer Cell Line Encyclopaedia [4] repository (https://portals.broadinstitute.org/ccle_legacy/data/).

References
[1] Mermel CH, Schumacher SE, Hill B, et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol. 2011;12(4):R41. doi: 10.1186/gb-2011-12-4-r41.
[2] Iorio, F., Behan, F. M., Goncalves, E., Beaver, C., Ansari, R., Pooley, R., et al. (n.d.). Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. http://doi.org/10.1101/228189
[3] Forbes SA, Beare D, Boutselakis H, et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Research, Volume 45, Issue D1, 4 January 2017, Pages D777-D783.
[4] Barretina J, Caponigro G, Stransky N, et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012 Mar 28;483(7391):603-7. doi: 10.1038/nature11003. Erratum in: Nature. 2012 Dec 13;492(7428):290.

Examples
data(CCLE.gisticCNA)
head(CCLE.gisticCNA)

ccr.cleanChrm
Identification and correction of genomic regions of equal log fold changes involving sgRNAs targeting a minimal number of genes within a given chromosome.

Description
This function applies a circular binary segmentation algorithm [1, 2] to genomic-sorted log fold changes of all the sgRNAs targeting genes on the same chromosome. This procedure yields a set of genomic regions of estimated equal sgRNAs’ log fold changes, significantly differing on average from adjacent regions. If some of these regions fulfil certain criteria (detailed below), then they are deemed as responding to CRISPR-Cas9 targeting in a gene independent manner (i.e. they might be biased by local features of the DNA) and their pattern of log fold changes is mean centered [3].

Usage
ccr.cleanChrm(gwSortedFCs,CHR,display=TRUE,label='',
              saveTO=NULL,min.ngenes=3,ignoredGenes=NULL)

Arguments
gwSortedFCs
A data frame containing genome-wide genomic-sorted sgRNAs’ log fold changes. This data frame must include one named row per sgRNA and the following columns/headers:
• CHR: the chromosome of the gene targeted by the sgRNA under consideration;
• startp: the genomic coordinate of the starting position of the region targeted by the sgRNA under consideration;
• endp: the genomic coordinate of the ending position of the region targeted by the sgRNA under consideration;
• genes: the HGNC symbol of the gene targeted by the sgRNA under consideration;
• avgFC: the log fold change of the sgRNA under consideration averaged across replicates;
• BP: the genomic coordinate of the sgRNA defined as STARTpos+(ENDpos-STARTpos)/2.
This can be generated using the ccr.logFCs2chromPos function, starting from a data frame containing sgRNAs’ log fold changes generated by the ccr.NormfoldChanges function from raw sgRNAs’ counts.

CHR
Numerical value indicating the chromosome to analyse and correct. The X and Y chromosomes must be indicated with 23 and 24, respectively.

display
A logical value indicating whether genomic plots showing the results of the biased regions’ identification and their log fold change correction should be generated.

label
A string indicating the experiment name, used in the main title of the plots and for the name of the folder where results are saved.

saveTO
If different from NULL, the path where pdf files with the genomic plots showing the results of the biased regions’ identification (and their log fold change correction) will be saved (within a folder named as defined in the label parameter).

min.ngenes
A numerical value (>0) specifying the minimal number of different genes that the set of sgRNAs within a region of estimated equal log fold changes should target in order for that region to be corrected, i.e. mean centered.

ignoredGenes
A vector of strings containing HGNC symbols of genes that should not be considered when computing the minimal number of different genes targeted by the sgRNAs in the same identified region of estimated equal log fold changes. This vector could contain, for example, a priori known essential genes. This parameter should be set to NULL for a completely unsupervised correction.

Value
A list containing two data frames. The first one (correctedFCs) contains a named row per sgRNA and the following columns/headers:
• CHR: the chromosome of the gene targeted by the sgRNA under consideration;
• startp: the genomic coordinate of the starting position of the region targeted by the sgRNA under consideration;
• endp: the genomic coordinate of the ending position of the region targeted by the sgRNA under consideration;
• genes: the HGNC symbol of the gene targeted by the sgRNA under consideration;
• avgFC: the log fold change of the sgRNA averaged across replicates;
• correction: the type of correction: 1 = increased, -1 = decreased;
• correctedFC: the corrected log fold change of the sgRNA.
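
The relationship between the avgFC, correction and correctedFC columns listed above can be illustrated with a small base R sketch. It does not call CRISPRcleanR: the toy values, the segment boundaries, and the reading of "mean centered" as "subtract the segment mean" are assumptions made for illustration, as is the use of 0 for sgRNAs left untouched.

```r
# Toy genome-sorted log fold changes for six sgRNAs on one chromosome
# (values and segment boundaries are invented for illustration).
fcs <- data.frame(
  avgFC = c(-0.1, 0.2, -1.4, -1.6, -1.5, 0.1),
  row.names = paste0("sgRNA_", 1:6)
)

# Hypothetical biased segment: sgRNAs 3 to 5 share a similar, shifted logFC.
segment_idx <- 3:5

# Mean centering, assumed here to mean: subtract the segment mean.
fcs$correctedFC <- fcs$avgFC
fcs$correctedFC[segment_idx] <-
  fcs$avgFC[segment_idx] - mean(fcs$avgFC[segment_idx])

# Direction of the correction: 1 = increased, -1 = decreased,
# 0 = unchanged (the 0 code is an assumption of this sketch).
fcs$correction <- sign(fcs$correctedFC - fcs$avgFC)

print(fcs)
```

In this toy case the segment mean is negative, so the three sgRNAs in the segment are shifted upwards and receive a correction code of 1.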

The second one (regions) contains the identified regions of estimated equal log fold changes (one region per row) and the following columns/headers:
• CHR: the chromosome of the region under consideration;
• startp: the genomic coordinate of the starting position of the region under consideration;
• endp: the genomic coordinate of the ending position of the region under consideration;
• n.sgRNAs: the number of sgRNAs targeting sequences in the region under consideration;
• avg.logFC: the average log fold change of the sgRNAs targeting the region;
• guideIdx: the index range of the sgRNAs targeting the region under consideration, as they appear in the gwSortedFCs provided in input.

Author(s)
Francesco Iorio (iorio@ebi.ac.uk)

References
[1] Olshen, A. B., Venkatraman, E. S., Lucito, R., Wigler, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5: 557-572.
[2] Venkatraman, E. S., Olshen, A. B. (2007). A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23: 657-63.
[3] Iorio, F., Behan, F. M., Goncalves, E., Beaver, C., Ansari, R., Pooley, R., et al. (n.d.). Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. http://doi.org/10.1101/228189

See Also
ccr.logFCs2chromPos, ccr.NormfoldChanges

Examples
data(KY_Library_v1.0)
fn<-paste(system.file('extdata', package = 'CRISPRcleanR'),'/HT-29_counts.tsv',sep='')
normANDfcs<-ccr.NormfoldChanges(fn,min_reads=30,EXPname='Example',
                                libraryAnnotation=KY_Library_v1.0)
gwSortedFCs<-ccr.logFCs2chromPos(normANDfcs$logFCs,KY_Library_v1.0)
chr8cleaned<-ccr.cleanChrm(gwSortedFCs,8,display=TRUE,label='HT-29',
                           min.ngenes=3)

ccr.correctCounts
Correction of sgRNA treatment counts for gene independent responses to CRISPR-Cas9 targeting

Description
This function applies an inverse transformation (described in ...) to CRISPRcleanR corrected sgRNAs’ log fold changes and produces in output normalised corrected sgRNA counts (across treatments and control replicates), suitable for gene depletion/enrichment statistical testing via mean-variance modeling (for example through MAGeCK [1]*).
*MAGeCK should be executed excluding initial normalisation, as the corrected sgRNA counts outputted by this function are already normalised.

Usage
ccr.correctCounts(CL,normalised_counts,
                  correctedFCs_and_segments,
                  libraryAnnotation,
                  minTargetedGenes=3,
                  OutDir='./',
                  ncontrols=1)

Arguments
CL
A string specifying the name of the experiment. This will be used to compose the names of the files and folders where results will be saved.

normalised_counts
A data frame containing normalised sgRNAs’ read counts, which can be computed using the ccr.NormfoldChanges function from raw sgRNAs’ counts.

correctedFCs_and_segments
sgRNAs’ log fold changes corrected for gene independent responses, generated with the function ccr.GWclean.

libraryAnnotation
A data frame containing the sgRNAs’ genome-wide annotations, with at least a named row for each of the sgRNAs included in the fold changes data frame provided in input. The following columns/headers should be present in this data frame (additional columns will be ignored):
• GENES: string vector containing the HGNC symbols of the genes targeted by the sgRNA under consideration;
• EXONE: string vector containing the gene exon targeted by the sgRNA under consideration (these should include the prefix "ex" followed by the exon number);
• CHRM: string vector containing the chromosome of the gene targeted by the sgRNA under consideration (X and Y chromosomes should be specified as "X" and "Y");
• STRAND: string vector containing the strand targeted by the sgRNA under consideration ("+" or "-");
• STARTpos: numeric vector containing the genomic coordinate of the starting position of the region targeted by the sgRNA under consideration;
• ENDpos: numeric vector containing the genomic coordinate of the ending position of the region targeted by the sgRNA under consideration.

minTargetedGenes
Minimal number of different genes targeted by sgRNAs in a biased segment in order for the corresponding counts to be corrected (default = 3).

OutDir
Path of the folder where results and plots will be saved.

ncontrols
A numerical value indicating the number of control replicates (and therefore the number of columns to be considered as controls in the normalised counts).

Value
A data frame with one entry per sgRNA and individual columns for the control/treatment samples included in the normalised count data object specified by the normalised_counts parameter, containing sgRNA counts corrected for gene independent responses to CRISPR-Cas9 targeting and median-ratio normalised.

Author(s)
Francesco Iorio (fi9323@gmail.com)

References
[1] Li, W., Xu, H., Xiao, T., Cong, L., Love, M. I., Zhang, F., et al. (2014). MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens. Genome Biology, 15(12), 554.
[2] Hart, T., & Moffat, J. (2016). BAGEL: a computational framework for identifying essential genes from pooled library screens. BMC Bioinformatics, 17(1), 164.

See Also
ccr.NormfoldChanges, ccr.GWclean

Examples
## Loading sgRNA library annotation file
data(KY_Library_v1.0)

## Deriving the path of the file with the example dataset,
## from the mutagenesis of the EPLC-272H colorectal cancer cell line
fn<-paste(system.file('extdata', package = 'CRISPRcleanR'),
          '/EPLC-272H_counts.tsv',sep='')

## Loading, median-normalizing and computing fold-changes for the example dataset
normANDfcs<-ccr.NormfoldChanges(fn,min_reads=30,
                                EXPname='EPLC-272H',
                                libraryAnnotation = KY_Library_v1.0)

## Genome-sorting of the fold changes
gwSortedFCs<-ccr.logFCs2chromPos(normANDfcs$logFCs,KY_Library_v1.0)

## Identifying and correcting biased sgRNAs' fold changes
correctedFCs<-ccr.GWclean(gwSortedFCs,display=FALSE,label='EPLC-272H')

## correcting individual sgRNA treatment counts
correctedCounts<-ccr.correctCounts('EPLC-272H',normANDfcs$norm_counts,
                                   correctedFCs,
                                   KY_Library_v1.0,
                                   minTargetedGenes=3,
                                   OutDir='./')

head(correctedCounts)

ccr.ExecuteMageck
Executing MAGeCK from R command line

Description
This function executes MAGeCK [1] from the command line, taking in input the path of the file containing the sgRNA counts to be processed and saving the results in a user defined location. By default this function does not pre-normalise the counts; however, this preliminary step can be included as specified by the corresponding argument. Additionally, this function assumes that there is only one control sample, whose count values should be contained in the first column of the sgRNA counts’ file. This function requires python and the MAGeCK python package (v0.5.3, available at: https://sourceforge.net/projects/mageck/files/0.5/mageck-0.5.3.zip/download) to be installed.
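
Since the function depends on an external MAGeCK installation, it can be useful to check for it before running a long analysis. The following base R sketch assumes that the installation exposes an executable named `mageck` on the system PATH; the manual does not state how ccr.ExecuteMageck locates MAGeCK, so this check is only indicative.

```r
# Look up the 'mageck' executable on the system PATH.
# Sys.which() returns an empty string when the command is not found.
# The executable name 'mageck' is an assumption of this sketch.
mageck_path <- Sys.which("mageck")
mageck_available <- nzchar(mageck_path)

if (mageck_available) {
  message("MAGeCK found at: ", mageck_path)
} else {
  message("MAGeCK not found on PATH; install it before calling ccr.ExecuteMageck")
}
```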

Usage
ccr.ExecuteMageck(mgckInputFile,
                  expName = "expName",
                  normMethod = "none",
                  outputPath = "./")

Arguments
mgckInputFile
A string specifying the path of the (plain text) file containing the sgRNA counts to be processed.

expName
A string specifying the experiment name. This is used as name prefix for all the files generated by MAGeCK.

normMethod
A string specifying the normalisation method to be used (’none’ by default).

outputPath
A string specifying the folder where all the files outputted by MAGeCK will be saved.

Value
A string specifying the path to the gene summary file outputted by MAGeCK.

Author(s)
Francesco Iorio (fi9323@gmail.com)

References
[1] Li, W., Xu, H., Xiao, T., Cong, L., Love, M. I., Zhang, F., et al. (2014). MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens. Genome Biology, 15(12), 554.
[2] Hart, T., & Moffat, J. (2016). BAGEL: a computational framework for identifying essential genes from pooled library screens. BMC Bioinformatics, 17(1), 164.

Examples
## Loading sgRNA library annotation file
data(KY_Library_v1.0)

## Deriving the path of the file with the example dataset,
## from the mutagenesis of the EPLC-272H colorectal cancer cell line
fn<-paste(system.file('extdata', package = 'CRISPRcleanR'),
          '/EPLC-272H_counts.tsv',sep='')

## Loading, median-normalizing and computing fold-changes for the example dataset
normANDfcs<-ccr.NormfoldChanges(fn,min_reads=30,
                                EXPname='EPLC-272H',
                                libraryAnnotation = KY_Library_v1.0)

uncorrected_fn<-ccr.PlainTsvFile(sgRNA_count_object = normANDfcs$norm_counts,
                                 fprefix = 'EPLC-272H')

## execute MAGeCK saving files in the working directory
uncorrected_gs_fn<-ccr.ExecuteMageck(mgckInputFile = uncorrected_fn,
                                     expName = 'EPLC-272H',
                                     normMethod = 'none')

uncorrected_gs_fn

ccr.geneMeanFCs
Gene level log fold changes

Description
This function computes gene level log fold changes based on the average log fold changes of the targeting sgRNAs.

Usage
ccr.geneMeanFCs(sgRNA_FCprofile, libraryAnnotation)

Arguments
sgRNA_FCprofile
A named numerical vector containing the sgRNAs’ log fold-changes, with names corresponding to sgRNA identifiers.

libraryAnnotation
A data frame containing the sgRNA library annotation (with the same format as KY_Library_v1.0).

Value
A numerical vector containing gene average log fold-changes, with corresponding HGNC symbols as names.
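
The averaging described above can be sketched in base R without the package. The snippet below is an illustration rather than the package's implementation: the sgRNA identifiers, gene names and values are invented, and `toy_library` is a hypothetical stand-in that only mimics the GENES column of a library annotation.

```r
# Toy sgRNA log fold changes (invented values), named by sgRNA identifier.
sgRNA_FCs <- c(g1_a = -2.0, g1_b = -1.0, g2_a = 0.5, g2_b = 0.1, g2_c = 0.3)

# Hypothetical stand-in for a library annotation: one named row per sgRNA,
# with a GENES column holding the targeted gene symbol.
toy_library <- data.frame(
  GENES = c("GENE1", "GENE1", "GENE2", "GENE2", "GENE2"),
  row.names = names(sgRNA_FCs),
  stringsAsFactors = FALSE
)

# Look up the gene targeted by each sgRNA, then average within each gene.
genes <- toy_library[names(sgRNA_FCs), "GENES"]
geneFCs <- tapply(sgRNA_FCs, genes, mean)

# Convert the 1-d array returned by tapply() into a plain named vector.
geneFCs <- setNames(as.numeric(geneFCs), names(geneFCs))

print(geneFCs)
```

Indexing the annotation by `names(sgRNA_FCs)` keeps the gene labels aligned with the fold changes even when the two objects are ordered differently.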

Author(s)
Francesco Iorio (fi9323@gmail.com)

See Also
KY_Library_v1.0

Examples
## loading corrected sgRNAs log fold-changes and segment annotations for
## an example cell line (EPLC-272H)
data(EPLC.272HcorrectedFCs)

## loading sgRNA library annotation
data(KY_Library_v1.0)

## storing sgRNA log fold-changes in a named vector
FCs<-EPLC.272HcorrectedFCs$corrected_logFCs$avgFC
names(FCs)<-rownames(EPLC.272HcorrectedFCs$corrected_logFCs)

## computing gene level log fold-changes
geneFCs<-ccr.geneMeanFCs(FCs,KY_Library_v1.0)

head(geneFCs)

ccr.genes2sgRNAs
Targeting sgRNAs

Description
This function returns the set of sgRNAs targeting the set of genes provided in input, in a given pooled library.

Usage
ccr.genes2sgRNAs(libraryAnnotation,genes)

Arguments
libraryAnnotation
A data frame with a named row for each sgRNA, with the same format as KY_Library_v1.0.

genes
A list of strings containing HGNC symbols.

Value
A list of strings containing the identifiers of the sgRNAs targeting the inputted set of genes.

Author(s)
Francesco Iorio (fi9323@gmail.com)

See Also
KY_Library_v1.0

Examples
## Loading an sgRNA pooled library annotation
data(KY_Library_v1.0)

## Loading an example set of genes
data(BAGEL_essential)

ccr.genes2sgRNAs(KY_Library_v1.0,BAGEL_essential)

ccr.get.CCLEgisticSets
CCLE gistic score gene sets

Description
This function splits all the genes into 5 classes (-2, -1, 0, +1 and +2) based on the CNA Gistic [1] score observed in a given cell line.

Usage
ccr.get.CCLEgisticSets(cellLine,CCLE.gisticCNA=NULL)

Arguments
cellLine
A string specifying the name of a cell line (or a COSMIC identifier [2]);

CCLE.gisticCNA
Genome-wide Gistic [1] scores quantifying copy number status across cell lines, with the same format as CCLE.gisticCNA. If NULL, then this function uses the CCLE.gisticCNA built-in data frame, containing data for 13 of the 15 cell lines used in [3] to assess the performance of CRISPRcleanR.
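
The five-way split can be illustrated with a small base R sketch on toy data. It is not the package's implementation: the gene names, the column name `CL_A` and the scores are invented, and for simplicity the sketch collects gene symbols in each class. The list element names follow the fields documented in the Value section of this topic.

```r
# Toy Gistic-like scores: genes in rows, one column per cell line
# (gene names, the column name and all values are invented).
toy_gistic <- data.frame(
  CL_A = c(-2, -1, 0, 0, 1, 2),
  row.names = c("G1", "G2", "G3", "G4", "G5", "G6")
)

scores <- toy_gistic[["CL_A"]]
genes  <- rownames(toy_gistic)

# One element per Gistic class, from deep loss (-2) to high gain (+2).
gistic_sets <- list(
  gm2 = genes[scores == -2],
  gm1 = genes[scores == -1],
  gz  = genes[scores == 0],
  gp1 = genes[scores == 1],
  gp2 = genes[scores == 2]
)

str(gistic_sets)
```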

Value
A named list of vectors with the following fields:
gm2: A vector of strings containing identifiers of sgRNAs targeting genes with a Gistic score = -2 in the cell line under consideration;
gm1: A vector of strings containing identifiers of sgRNAs targeting genes with a Gistic score = -1 in the cell line under consideration;
gz: A vector of strings containing identifiers of sgRNAs targeting genes with a Gistic score = 0 in the cell line under consideration;
gp1: A vector of strings containing identifiers of sgRNAs targeting genes with a Gistic score = +1 in the cell line under consideration;
gp2: A vector of strings containing identifiers of sgRNAs targeting genes with a Gistic score = +2 in the cell line under consideration.

Author(s)
Francesco Iorio (iorio@ebi.ac.uk)

References
[1] Mermel CH, Schumacher SE, Hill B, et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol. 2011;12(4):R41. doi: 10.1186/gb-2011-12-4-r41.
[2] Forbes SA, Beare D, Boutselakis H, et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Research, Volume 45, Issue D1, 4 January 2017, Pages D777-D783.
[3] Iorio, F., Behan, F. M., Goncalves, E., Beaver, C., Ansari, R., Pooley, R., et al. (n.d.). Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. http://doi.org/10.1101/228189

See Also
ccr.get.gdsc1000.AMPgenes

Examples
GS<-ccr.get.CCLEgisticSets('HT-29')
head(GS$gm2)
head(GS$gm1)
head(GS$gz)
head(GS$gp1)
head(GS$gp2)

ccr.get.gdsc1000.AMPgenes
Copy number amplified genes in a given cell line from the GDSC1000

Description
This function takes in input the name (or the COSMIC identifier [1]) of a cell line included in the GDSC1000 project [2] and identifies the genes that are copy number amplified (according to a user defined minimal copy number value) in that cell line, using gene level copy number data from the Genomics of Drug Sensitivity in 1,000 Cancer Cell lines (GDSC1000) [2].

Usage
ccr.get.gdsc1000.AMPgenes(cellLine,
                          minCN = 8,
                          exact = FALSE,
                          GDSC.geneLevCNA=NULL)

Arguments
cellLine
A string specifying the name of a cell line (or a COSMIC identifier [1]);

minCN
Lower threshold for the minimum copy number of any genomic segment containing coding sequence of a gene in order for it to be considered as copy number amplified.

exact
If TRUE, then those genes for which any genomic segment containing coding sequence has a minimum copy number equal to minCN are considered as copy number amplified.

GDSC.geneLevCNA
Genome-wide copy number data with the same format as GDSC.geneLevCNA. This can be assembled from the xls sheet specified in the source section [a] (containing data for the GDSC1000 cell lines). If NULL, then this function uses the data in the built-in GDSC.geneLevCNA data frame, containing data derived from [a] for 15 cell lines used in [3] to assess the performance of CRISPRcleanR.

Value
A data frame containing one row for each copy number amplified gene, with the following columns:
Gene: HGNC symbol of the gene;
minCN: Minimum copy number of any genomic segment containing coding sequence of the gene in the cell line under consideration.
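
The effect of the minCN and exact arguments can be sketched on a toy table in base R. This is an illustration only: `toy_cn` and the helper `amp_genes` are hypothetical, the values are invented, and treating the threshold as inclusive (copy number greater than or equal to minCN) is an assumption, since the manual only calls it a lower threshold.

```r
# Toy gene-level copy number table for one cell line (invented values):
# minimum copy number of any segment containing coding sequence of each gene.
toy_cn <- data.frame(
  Gene  = c("G1", "G2", "G3", "G4"),
  minCN = c(2, 8, 12, 9),
  stringsAsFactors = FALSE
)

# Hypothetical helper mirroring the minCN / exact arguments.
# The inclusive comparison (>=) is an assumption of this sketch.
amp_genes <- function(cn, minCN = 8, exact = FALSE) {
  keep <- if (exact) cn$minCN == minCN else cn$minCN >= minCN
  cn[keep, , drop = FALSE]
}

amp_default <- amp_genes(toy_cn)                # copy number of at least 8
amp_exact   <- amp_genes(toy_cn, exact = TRUE)  # copy number of exactly 8

print(amp_default)
print(amp_exact)
```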

Author(s)
Francesco Iorio (fi9323@gmail.com)

Source
[a] ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/release-6.0/Gene_level_CN.xlsx.

References
[1] Forbes SA, Beare D, Boutselakis H, et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Research, Volume 45, Issue D1, 4 January 2017, Pages D777-D783.
[2] Iorio F, Knijnenburg TA, Vis DJ, Bignell GR, Menden MP, et al. A landscape of pharmacogenomic interactions in cancer. Cell 2016 Jul 28;166(3):740-54.
[3] Iorio, F., Behan, F. M., Goncalves, E., Beaver, C., Ansari, R., Pooley, R., et al. (n.d.). Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. http://doi.org/10.1101/228189

See Also
ccr.get.CCLEgisticSets

Examples
CNAgenes [...]

[...] (>0) specifying the minimal number of different genes that the set of sgRNAs within a region of estimated equal logFCs should target in order for their logFCs to be corrected, i.e. mean centered.

Value
A list containing two data frames and a vector of strings. The first data frame (corrected_logFCs) contains a named row per sgRNA and the following columns/headers:
• CHR: the chromosome of the gene targeted by the sgRNA under consideration;
• startp: the genomic coordinate of the starting position of the region targeted by the sgRNA under consideration;
• endp: the genomic coordinate of the ending position of the region targeted by the sgRNA under consideration;
• genes: the HGNC symbol of the gene targeted by the sgRNA under consideration;
• avgFC: the log fold change of the sgRNA averaged across replicates;
• correction: the type of correction: 1 = increased log fold change, -1 = decreased log fold change, 0 = no correction;
• correctedFC: the corrected log fold change of the sgRNA.
The second data frame (segments) contains the identified region of estimated equal log fold changes (one region per row) and the following columns/headers: • CHR: the chromosome of the region under consideration; • startp: the genomic coordinate of the starting position of the region under consideration; • endp: the genomic coordinate of the ending position of the region under consideration; • n.sgRNAs: the number of sgRNAs targeting sequences in the region under consideration; • avg.logFC: the average log fold change of the sgRNAs in the region; 18 ccr.GWclean • guideIdx: the indexes range of the sgRNAs targeting the region under consideration as they appear in the gwSortedF Cs provided in input. The string of vectors (SORTED_sgRNAs) contains the sgRNAs’ identifiers in the same order as they are reported in the gwSortedFCs input data frame, i.e. genome sorted. Author(s) Francesco Iorio (iorio@ebi.ac.uk) References [1] Olshen, A. B., Venkatraman, E. S., Lucito, R., Wigler, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5: 557-572. \ [2] Venkatraman, E. S., Olshen, A. B. (2007). A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 23: 657-63. \ [3] Andrew J. Aguirre, Robin M. Meyers, Barbara A. Weir, Francisca Vazquez, Cheng-Zhong Zhang, Uri Ben-David, April Cook, Gavin Ha, William F. Harrington, Mihir B. Doshi, Maria Kost-Alimova, Stanley Gill, Han Xu, Levi D. Ali, Guozhi Jiang, Sasha Pantel, Yenarae Lee, Amy Goodale, Andrew D. Cherniack, Coyin Oh, Gregory Kryukov, Glenn S. Cowley, Levi A. Garraway, Kimberly Stegmaier, Charles W. Roberts, Todd R. Golub, Matthew Meyerson, David E. Root, Aviad Tsherniak and William C. Hahn. Genomic copy number dictates a gene-independent cell response to CRISPR-Cas9 targeting. Cancer Discov June 3 2016 DOI: 10.1158/2159-8290.CD-16-0154 [4] Iorio, F., Behan, F. M., Goncalves, E., Beaver, C., Ansari, R., Pooley, R., et al. (n.d.). 
Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. http://doi.org/10.1101/228189 See Also ccr.cleanChrm Examples ## Loading sgRNA library annotation file data(KY_Library_v1.0) ## Deriving the path of the file with the example dataset, ## from the mutagenesis of the HT-29 colorectal cancer cell line fn<-paste(system.file('extdata', package = 'CRISPRcleanR'),'/HT-29_counts.tsv',sep='') ## Loading, median-normalizing and computing fold-changes for the example dataset normANDfcs<-ccr.NormfoldChanges(fn,min_reads=30,EXPname='HT-29', libraryAnnotation = KY_Library_v1.0) ## Genome-sorting of the fold changes gwSortedFCs<-ccr.logFCs2chromPos(normANDfcs$logFCs,KY_Library_v1.0) ## Identifying and correcting biased sgRNAs' fold changes correctedFCs<-ccr.GWclean(gwSortedFCs,display=TRUE,label='HT-29') ## Visualising first five entries of the corrected fold changes head(correctedFCs$corrected_logFCs) ccr.impactOnPhenotype 19 ccr.impactOnPhenotype Assessing the impact and potential distortion introduced by the CRISPRcleanR correction on the genes showing loss/gain-of-fitness effect. Description This function compares two MAGeCK [1] gene summaries (obtained from sgRNA count files pre/post CRISPRcleanR correction) and it computes the percentages of genes whose loss/gain-offitness effect is attenuated post CRISPRcleanR correction or potentially distorted (i.e. loss-of-fitness genes are detected post CRISPRcleanR correction as gain-of-fitness genes, and viceversa). Results are returned in output and optionally plotted as bar/pie charts. Usage ccr.impactOnPhenotype(MO_uncorrectedFile, MO_correctedFile, sigFDR = 0.05, expName = "expName", display = TRUE) Arguments MO_uncorrectedFile String specifying the path to a MAGeCK gene summary file produced by MAGeCK from non corrected sgRNA counts. MO_correctedFile String specifying the path to a MAGeCK gene summary file produced by MAGeCK from CRISPRcleanR corrected sgRNA counts. 
sigFDR A numerical value in [0,1] False discovery rate threshold at which genes are called as significantly exerting a loss/gain-of-fitness effect. expName A string specifying the experiment name, used as main title in the figures (ignored if the display argument is set to FALSE). display Boolean value specifying whether figures sumarising the comparison results should be plotted. Details For each of the considered MAGeCK gene summaries, this function calls loss/gain-of-fitness based on the MAGeCK negative/positive false discovery rate and the user defined threshold (as specified by the sigFDR argument). Particularly, are called as significant loss-of-fitness genes those with a negative fdr < sigFDR and a positive fdr >= sigFDR, and as significant gain-of-fitness genes those those with a positive fdr < sigFDR and a negative fdr >= sigFDR. All the other genes are deemed as not exerting any effect on cellular fitness. Value A list containing the following four numerical values and two data frames: 20 ccr.impactOnPhenotype • GW_impact %: Percentage of genes impacted by the CRISPRcleanR correction, i.e. showing a gain/loss-of-fitness genes effect in the MAGeCK gene summary obtained from uncorrected sgRNA counts, over the total number of screened genes; • Phenotype_G_impact %: Percentage of genes impacted by the CRISPRcleanR correction, i.e. showing a gain/loss-of-fitness genes effect in the MAGeCK gene summary obtained from uncorrected sgRNA counts, over the total number of genes showing a gain/loss of fitness effect in the MAGeCK gene summary obtained from uncorrected sgRNA counts; • GW_distortion %: Percentage of genes distorted by the CRISPRcleanR correction, i.e. 
showing a gain/loss-of-fitness effect in the MAGeCK gene summary obtained from corrected sgRNA counts that is opposite to the effect in that obtained from uncorrected sgRNA counts, over the total number of screened genes;
• Phenotype_G_distortion %: Percentage of genes distorted by the CRISPRcleanR correction, i.e. showing a gain/loss-of-fitness effect in the MAGeCK gene summary obtained from corrected sgRNA counts that is opposite to the effect in that obtained from uncorrected sgRNA counts, over the total number of genes showing a gain/loss-of-fitness effect in the MAGeCK gene summary obtained from uncorrected sgRNA counts;
• geneCounts: A contingency table with gene counts as entries, with data referring to the original (uncorrected) sgRNA counts on the columns, and to the corrected sgRNA counts on the rows. There are three entries for each dimension, respectively the number of genes showing a significant loss-of-fitness effect (dep.), the number of genes not showing any fitness effect, or with an unclear effect, i.e. showing both gain- and loss-of-fitness effect (null), and the number of genes showing a significant gain-of-fitness effect (enr.);
• distortion: A data frame showing genes whose fitness effect has been distorted by the CRISPRcleanR correction: one row per gene (as specified by the row names), with two columns per condition (i.e. prior/post correction), indicating the loss-of-fitness effect fdr (neg.fdr and ccr.neg.fdr) and the gain-of-fitness effect fdr (pos.fdr and ccr.pos.fdr) as outputted by MAGeCK.

Author(s)

Francesco Iorio (fi9323@gmail.com)

References

[1] Li, W., Xu, H., Xiao, T., Cong, L., Love, M. I., Zhang, F., et al. (2014). MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens. Genome Biology, 15(12), 554.
[2] Hart, T., & Moffat, J. (2016). BAGEL: a computational framework for identifying essential genes from pooled library screens.
BMC Bioinformatics, 17(1), 164.

See Also

ccr.ExecuteMageck

Examples

## Loading sgRNA library annotation file
data(KY_Library_v1.0)

## Deriving the path of the file with the example dataset,
## from the mutagenesis of the EPLC-272H cancer cell line
fn<-paste(system.file('extdata', package = 'CRISPRcleanR'),
    '/EPLC-272H_counts.tsv',sep='')

## Loading, median-normalizing and computing fold-changes for the example dataset
normANDfcs<-ccr.NormfoldChanges(fn,min_reads=30,
    EXPname='EPLC-272H',
    libraryAnnotation = KY_Library_v1.0)

## saving uncorrected normalised sgRNA counts as a plain tsv file
uncorrected_fn<-ccr.PlainTsvFile(sgRNA_count_object = normANDfcs$norm_counts,
    fprefix = 'EPLC-272H')

## execute MAGeCK on uncorrected normalised counts
uncorrected_gs_fn<-ccr.ExecuteMageck(mgckInputFile = uncorrected_fn,
    expName = 'EPLC-272H',
    normMethod = 'none')

## Genome-sorting of the fold changes
gwSortedFCs<-ccr.logFCs2chromPos(normANDfcs$logFCs,KY_Library_v1.0)

## Identifying and correcting biased sgRNAs' fold changes
correctedFCs<-ccr.GWclean(gwSortedFCs,display=FALSE,label='EPLC-272H')

## correcting individual sgRNA treatment counts
correctedCounts<-ccr.correctCounts('EPLC-272H',normANDfcs$norm_counts,
    correctedFCs,
    KY_Library_v1.0,
    minTargetedGenes=3,
    OutDir='./')

## saving corrected sgRNA counts as a plain tsv file
corrected_fn<-ccr.PlainTsvFile(sgRNA_count_object = correctedCounts,
    fprefix = 'EPLC-272H_ccleaned')

## execute MAGeCK on corrected normalised counts
corrected_gs_fn<-ccr.ExecuteMageck(mgckInputFile = corrected_fn,
    expName = 'EPLC-272H_ccleaned')

## Assessing the impact of CRISPRcleanR correction on gain/loss-of-fitness genes
RES<-ccr.impactOnPhenotype(MO_uncorrectedFile = uncorrected_gs_fn,
    MO_correctedFile = corrected_gs_fn,
    expName = 'EPLC-272H')

## Percentage of genes whose gain/loss-of fitness effect is impacted by CRISPRcleanR
## over the total number of screened genes
RES[1]

## Percentage of genes whose gain/loss-of fitness effect is impacted by CRISPRcleanR
## over the
## total number of genes with a significant gain/loss-of fitness effect when
## using uncorrected sgRNA counts
RES[2]

## Percentage of genes whose gain/loss-of fitness effect is distorted by CRISPRcleanR
## over the total number of screened genes
RES[3]

## Percentage of genes whose gain/loss-of fitness effect is distorted by CRISPRcleanR
## over the total number of genes with a significant gain/loss-of fitness effect when
## using uncorrected sgRNA counts
RES[4]

## Contingency table showing the impact of the CRISPRcleanR correction on the phenotype
RES$geneCounts

## Genes whose gain/loss-of-fitness effect has been distorted by the CRISPRcleanR correction
RES$distortion

ccr.logFCs2chromPos    Genomic sorting of sgRNAs’ log fold changes.

Description

This function maps genome-wide sgRNAs’ log fold changes (averaged across replicates) on the genome and returns them sorted according to the position of their targeted region on the chromosomes.

Usage

ccr.logFCs2chromPos(foldchanges, libraryAnnotation)

Arguments

foldchanges    A data frame containing genome-wide sgRNAs’ log fold changes, one column per library transfection replicate, with first and second column containing the sgRNAs’ identifiers and the HGNC symbols of the targeted genes, respectively. This can be generated from raw count files using the ccr.NormfoldChanges function.
libraryAnnotation    A data frame containing the sgRNAs’ genome-wide annotations with at least a named row for each of the sgRNAs included in the foldchanges data frame provided in input.
The following columns/headers should be present in this data frame (additional columns can optionally be included and will be ignored by this function):
• GENES: string vector containing the HGNC symbols of the genes targeted by the sgRNA under consideration;
• EXONE: string vector containing the gene exon targeted by the sgRNA under consideration (these should include the prefix "ex" followed by the exon number);
• CHRM: string vector containing the chromosome of the gene targeted by the sgRNA under consideration (X and Y chromosome should be specified as "X" and "Y");
• STRAND: string vector containing the strand targeted by the sgRNA under consideration ("+" or "-");
• STARTpos: numeric vector containing the genomic coordinate of the starting position of the region targeted by the sgRNA under consideration;
• ENDpos: numeric vector containing the genomic coordinate of the ending position of the region targeted by the sgRNA under consideration.
The annotation for the genome-wide sgRNA library presented in [1] is included in the KY_Library_v1.0 data object, formatted as described above.

Value

A data frame with a named row for each sgRNA and the following columns/headers:
• CHR: the chromosome where the gene targeted by the sgRNA under consideration resides;
• startp: the genomic coordinate of the starting position of the region targeted by the sgRNA under consideration;
• endp: the genomic coordinate of the ending position of the region targeted by the sgRNA under consideration;
• avgFC: the log fold change of the sgRNA averaged across replicates;
• BP: the genomic coordinate of the sgRNA defined as STARTpos+(ENDpos-STARTpos)/2.
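The mapping described above can be sketched in base R on toy data. This is a minimal illustration under stated assumptions, not the package's implementation: the function name sortByGenomicPosition, the toy sgRNA identifiers and coordinates are hypothetical, and ranking X and Y after the autosomes (as 23 and 24) is an assumption made only to obtain a sortable chromosome index.

```r
## Toy annotation (hypothetical sgRNAs and coordinates, not real data),
## with one named row per sgRNA as described for libraryAnnotation
toyAnnotation <- data.frame(
  GENES    = c("geneA", "geneB", "geneC", "geneD"),
  CHRM     = c("X", "2", "1", "1"),
  STARTpos = c(500, 1000, 3000, 100),
  ENDpos   = c(520, 1020, 3020, 120),
  row.names = c("sg4", "sg3", "sg2", "sg1"),
  stringsAsFactors = FALSE
)

## Toy log fold changes: sgRNA id, gene symbol, then one column per replicate
toyFCs <- data.frame(
  sgRNA = c("sg3", "sg1", "sg4", "sg2"),
  gene  = c("geneB", "geneD", "geneA", "geneC"),
  rep1  = c(-0.4, -1.0, 0.6, 0.2),
  rep2  = c(-0.6, -0.8, 0.2, 0.0),
  stringsAsFactors = FALSE
)

## Hypothetical helper: average replicates, attach coordinates, sort by position
sortByGenomicPosition <- function(foldchanges, libraryAnnotation) {
  ids <- foldchanges$sgRNA
  ann <- libraryAnnotation[ids, ]
  ## assumption: X and Y are ranked after the autosomes as 23 and 24
  chrRank <- ann$CHRM
  chrRank[chrRank == "X"] <- "23"
  chrRank[chrRank == "Y"] <- "24"
  chrRank <- as.numeric(chrRank)
  out <- data.frame(
    CHR    = chrRank,
    startp = ann$STARTpos,
    endp   = ann$ENDpos,
    ## log fold change averaged across the replicate columns
    avgFC  = rowMeans(foldchanges[, -(1:2), drop = FALSE]),
    ## midpoint of the targeted region
    BP     = ann$STARTpos + (ann$ENDpos - ann$STARTpos) / 2,
    row.names = ids
  )
  out[order(out$CHR, out$BP), ]
}

sortedToy <- sortByGenomicPosition(toyFCs, toyAnnotation)
sortedToy
```

In this sketch the sort key is the chromosome index followed by the BP midpoint, so sgRNAs targeting neighbouring regions end up in adjacent rows.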
Author(s)

Francesco Iorio (iorio@ebi.ac.uk)

References

[1] Tzelepis K, Koike-Yusa H, De Braekeleer E, Li Y, Metzakopian E, Dovey OM, Mupo A, Grinkevich V, Li M, Mazan M, Gozdecka M, Onishi S, Cooper J, Patel M, McKerrell T, Chen B, Domingues AF, Gallipoli P, Teichmann S, Ponstingl H, McDermott U, Saez-Rodriguez J, Huntly BJP, Iorio F, Pina C, Vassiliou GS, Yusa K. A CRISPR dropout screen identifies genetic vulnerabilities and therapeutic targets in acute myeloid leukaemia. Cell Reports 2016 Oct 18;17(4):1193-1205

See Also

ccr.NormfoldChanges, KY_Library_v1.0

Examples

data(KY_Library_v1.0)
fn<-paste(system.file('extdata', package = 'CRISPRcleanR'),'/A2058_counts.tsv',sep='')
normANDfcs<-ccr.NormfoldChanges(fn,min_reads=30,
    EXPname='Example',
    libraryAnnotation=KY_Library_v1.0)
mappedLogFCs<-ccr.logFCs2chromPos(normANDfcs$logFCs,KY_Library_v1.0)
head(mappedLogFCs)

ccr.multDensPlot    Multiple shaded density plot

Description

This function plots multiple distribution densities with solid colors for the curves and shaded colors for the underlying areas.

Usage

ccr.multDensPlot(TOPLOT, COLS, XLIMS, TITLE, LEGentries, XLAB)

Arguments

TOPLOT    A list of density objects computed using the density function of the stats package.
COLS    A vector of colors of the same length as TOPLOT that are used to plot the density curves. Alpha-reduced versions of these colors are used to fill the underlying areas.
XLIMS    A vector of two numerical values optionally specifying x-axis limits (NULL by default).
TITLE    A string containing the plot title.
LEGentries    A vector of strings (one for each density in TOPLOT) specifying corresponding legend entries.
XLAB    A string containing the x-axis label.
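To make the idea of solid curves over alpha-reduced fills concrete, here is a minimal base R sketch. It is not the code used by ccr.multDensPlot: the opacity of 0.3, the variable names and the toy data are all assumptions chosen for illustration, and only standard stats, graphics and grDevices calls are used.

```r
## random toy data (seeded so the figure is reproducible)
set.seed(1)
x <- rnorm(500, 0, 0.5)
y <- rnorm(500, 2, 0.4)

## kernel density estimates assembled into a named list
dens <- list(x = density(x), y = density(y))
cols <- c("red", "blue")

## alpha-reduced fill colours; the 0.3 opacity is an arbitrary choice
fillCols <- grDevices::adjustcolor(cols, alpha.f = 0.3)

## common axis limits covering all densities
xl <- range(sapply(dens, function(d) range(d$x)))
yl <- c(0, max(sapply(dens, function(d) max(d$y))))

## empty canvas, then one shaded polygon and one solid curve per density
plot(0, 0, type = "n", xlim = xl, ylim = yl,
     xlab = "value", ylab = "density", main = "example")
for (i in seq_along(dens)) {
  polygon(dens[[i]]$x, dens[[i]]$y, col = fillCols[i], border = NA)
  lines(dens[[i]], col = cols[i], lwd = 2)
}
legend("topleft", legend = names(dens), fill = fillCols, border = cols)
```

Drawing the filled polygon first and the solid line second keeps each curve visible where the shaded areas overlap.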
Author(s)

Francesco Iorio (fi9323@gmail.com)

Examples

## generating random data
x <- rnorm(1000, 0, 0.5)
y <- rnorm(1000, 2, 0.4)
z <- rnorm(1000, -1, 1.5)

## assembling kernel estimated distributions into a list
ToPlot<-list(x=density(x),y=density(y),z=density(z))

## density visualisation
ccr.multDensPlot(ToPlot,COLS = c('red','blue','gray'),
    TITLE = 'example',LEGentries = c('x','y','z'),
    XLIMS = c(-5,3))

ccr.NormfoldChanges    Median-ratio normalisation of sgRNA counts and fold change computation

Description

This function median-ratio normalises [1,2] sgRNAs’ counts stored in a tsv file whose path is provided in input, to adjust for the effect of library size and read count distributions. It computes log fold changes of transfected library replicates versus controls (typically the sgRNA counts in the plasmid). The output of this function is returned as a list, and it is also saved into two tsv files.

Usage

ccr.NormfoldChanges(filename, display=TRUE, saveToFig=FALSE, outdir='./',
    min_reads=30, EXPname='', libraryAnnotation, ncontrols=1)

Arguments

filename    A string specifying the path of a tsv file containing the raw sgRNA counts. This must be a tab-delimited file with one row per sgRNA and the following columns/headers:
• sgRNA: containing alphanumerical identifiers of the sgRNA under consideration;
• gene: containing HGNC symbols of the genes targeted by the sgRNA under consideration;
followed by the columns containing the sgRNAs’ counts for the controls and the columns for the library-transfected samples.
display    A logical value specifying whether figures containing boxplots with the count values pre/post normalisation and log fold-changes should be visualised (TRUE, by default).
saveToFig    A logical value specifying whether figures containing boxplots with the count values pre/post normalisation and log fold-changes should be saved as pdf files (FALSE, by default). Setting this parameter to TRUE overrides the value of the display parameter.
outdir    Path of the directory where the normalised sgRNAs’ counts and the log fold changes, as well as the pdf files (if the parameter saveToFig is set to TRUE), must be saved.
min_reads    This parameter defines a filter threshold value for sgRNAs, based on their average counts in the control sample. Specifically, it indicates the minimal number of counts that each individual sgRNA needs to have in the controls (on average) in order to be included in the output.
EXPname    A string specifying the name of the experiment. This will be used to compose the main title of the generated figures and the file names.
libraryAnnotation    A data frame containing the sgRNA annotations, with a named row for each sgRNA, and columns for targeted genes, genomic coordinates and possibly other information. This should be formatted as the KY_Library_v1.0 data object containing the annotation of the sgRNA library presented in [3].
ncontrols    A numerical value indicating the number of control replicates (i.e. the number of columns to be considered as control counts after the first two, in the inputted tsv file).

Value

A list containing two data frames: the normalised sgRNAs’ counts (norm_counts) and the sgRNAs’ log fold changes (logFCs), respectively. The first two columns in these data frames contain the sgRNAs’ identifiers and the HGNC symbols of the targeted genes, respectively.

Author(s)

Francesco Iorio (fi9323@gmail.com)

References

[1] Wang T, Wei JJ, Sabatini DM, Lander ES. Genetic screens in human cells using the CRISPR-Cas9 system. Science. 2014, 343: 80-84.
[2] Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010, 11: R106
[3] Tzelepis K, Koike-Yusa H, De Braekeleer E, et al. A CRISPR dropout screen identifies genetic vulnerabilities and therapeutic targets in acute myeloid leukaemia.
Cell Reports 2016 Oct 18;17(4):1193-1205

See Also

KY_Library_v1.0

Examples

## loading sgRNA library annotation
data(KY_Library_v1.0)

## derive path for an example dataset
fn<-paste(system.file('extdata', package = 'CRISPRcleanR'),'/HT-29_counts.tsv',sep='')

## sgRNAs' normalisation and computation of log fold-changes
normANDfcs<-ccr.NormfoldChanges(fn,
    min_reads=30,
    EXPname='Example',
    libraryAnnotation=KY_Library_v1.0)

## inspecting the first entries of the data frames containing the
## normalised counts and the log fold-changes
head(normANDfcs$norm_counts)
head(normANDfcs$logFCs)

ccr.perf_distributions    CRISPRcleanR correction assessment: inspection of sgRNA log fold changes distributions

Description

This function creates density plots of the distributions of sgRNA log fold changes for defined sets of targeted genes prior/post CRISPRcleanR correction.

Usage

ccr.perf_distributions(cellLine, correctedFCs,
    GDSC.geneLevCNA = NULL,
    CCLE.gisticCNA = NULL,
    RNAseq.fpkms = NULL,
    minCNs = c(8, 10),
    libraryAnnotation)

Arguments

cellLine    A string specifying the name of a cell line (or a COSMIC identifier [1]);
correctedFCs    sgRNAs log fold changes corrected for gene independent responses to CRISPR-Cas9 targeting, generated with the function ccr.GWclean (first data frame included in the list outputted by ccr.GWclean, i.e. corrected_logFCs).
GDSC.geneLevCNA    Genome-wide copy number data with the same format as GDSC.geneLevCNA. This can be assembled from the xls sheet specified in the source section [a] (containing data for the GDSC1000 cell lines). If NULL, then this function uses the built-in GDSC.geneLevCNA data frame, containing data derived from [a] for 15 cell lines used in [2] to assess the performances of CRISPRcleanR.
CCLE.gisticCNA    Genome-wide Gistic [3] scores quantifying copy number status across cell lines with the same format of CCLE.gisticCNA.
If NULL then this function uses the CCLE.gisticCNA built-in data frame, containing data for 13 cell lines of the 15 used in [2] to assess the performances of CRISPRcleanR.
RNAseq.fpkms    Genome-wide expression values, as fragments per kilobase of exon per million reads mapped (FPKM), across cell lines. These can be derived from a comprehensive collection of RNAseq profiles described in [4]. The format must be the same as that of the RNAseq.fpkms built-in data frame. If NULL then this function uses the RNAseq.fpkms built-in data frame containing data for 15 cell lines used in [2] to assess CRISPRcleanR results.
minCNs    A numerical vector with two entries specifying the minimal copy number for a gene in order to be considered amplified based on the data in GDSC.geneLevCNA. These two values can be 2, 4, 8 or 10.
libraryAnnotation    The sgRNA library annotations formatted as specified in the reference manual entry of the KY_Library_v1.0 built-in library.

Details

This function generates 4 sets of plots. They contain density plots of the log fold change distributions prior/post CRISPRcleanR correction, respectively for
• (i) Copy number amplified genes according to the data in GDSC.geneLevCNA based on the two threshold values specified in minCNs;
• (ii) Copy number amplified genes according to the data in CCLE.gisticCNA (gistic score = +2);
• (iii) Copy number amplified non expressed genes according to the data in GDSC.geneLevCNA based on the two threshold values specified in minCNs, and the data in RNAseq.fpkms (FPKM < 0.05);
• (iv) reference sets of core fitness essential genes from MSigDB [5] (included in the built-in vectors EssGenes.DNA_REPLICATION_cons, EssGenes.KEGG_rna_polymerase, EssGenes.PROTEASOME_cons, EssGenes.ribosomalProteins, EssGenes.SPLICEOSOME_cons), and reference core-fitness-essential and non-essential genes assembled from multiple RNAi studies used as classification template by the BAGEL algorithm to call gene depletion significance [6] (BAGEL_essential, BAGEL_nonEssential).
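As a rough illustration of how gene sets (i) and (iii) above could be assembled, the following base R sketch works on toy named vectors. It does not reproduce the format of GDSC.geneLevCNA or RNAseq.fpkms: the vectors, gene names and helper functions are hypothetical, and treating "minimal copy number" as a greater-than-or-equal threshold is an assumption.

```r
## Hypothetical per-gene values for a single cell line (toy numbers, not real data)
minCopyNumber <- c(geneA = 12, geneB = 8, geneC = 3, geneD = 10, geneE = 2)
fpkm          <- c(geneA = 0.01, geneB = 5.2, geneC = 0.00, geneD = 0.04, geneE = 7.5)

## assumption: a gene is amplified when its copy number is >= the threshold
amplifiedGenes <- function(cn, minCN) names(cn)[cn >= minCN]

## non expressed genes, using the FPKM < 0.05 cut-off quoted in the text
nonExpressedGenes <- function(expr, th = 0.05) names(expr)[expr < th]

## set (i): amplified genes at the two default thresholds of minCNs
amp8  <- amplifiedGenes(minCopyNumber, 8)
amp10 <- amplifiedGenes(minCopyNumber, 10)

## set (iii): amplified and non expressed genes
nonExp      <- nonExpressedGenes(fpkm)
ampNonExp8  <- intersect(amp8, nonExp)
ampNonExp10 <- intersect(amp10, nonExp)

list(amp8 = amp8, amp10 = amp10,
     ampNonExp8 = ampNonExp8, ampNonExp10 = ampNonExp10)
```

Amplified genes that are not expressed are a useful control because any depletion of their sgRNAs is unlikely to reflect a genuine loss-of-fitness effect of the gene itself.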
Author(s)

Francesco Iorio (fi9323@gmail.com)

Source

[a] ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/release-6.0/Gene_level_CN.xlsx.

References

[1] Forbes SA, Beare D, Boutselakis H, et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Research, Volume 45, Issue D1, 4 January 2017, Pages D777-D783.
[2] Iorio, F., Behan, F. M., Goncalves, E., Beaver, C., Ansari, R., Pooley, R., et al. (n.d.). Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. http://doi.org/10.1101/228189
[3] Mermel CH, Schumacher SE, Hill B, et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol. 2011;12(4):R41. doi: 10.1186/gb-2011-12-4-r41.
[4] Garcia-Alonso L, Iorio F, Matchan A, et al. Transcription factor activities enhance markers of drug response in cancer. doi: https://doi.org/10.1101/129478
[5] Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., et al. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545-15550. http://doi.org/10.1073/pnas.0506580102
[6] BAGEL: a computational framework for identifying essential genes from pooled library screens. Traver Hart and Jason Moffat. BMC Bioinformatics, 2016 vol. 17 p. 164.
See Also

KY_Library_v1.0, ccr.GWclean, GDSC.geneLevCNA, CCLE.gisticCNA, RNAseq.fpkms, EssGenes.DNA_REPLICATION_cons, EssGenes.KEGG_rna_polymerase, EssGenes.PROTEASOME_cons, EssGenes.ribosomalProteins, EssGenes.SPLICEOSOME_cons, BAGEL_essential, BAGEL_nonEssential

Examples

## loading corrected sgRNAs log fold-changes and segment annotations for an example
## cell line (HT-29)
data(HT.29correctedFCs)

## loading library annotation
data(KY_Library_v1.0)

## inspecting sgRNA log fold change distributions prior/post CRISPRcleanR correction
ccr.perf_distributions('HT-29',HT.29correctedFCs$corrected_logFCs,
    libraryAnnotation = KY_Library_v1.0)

ccr.perf_statTests    CRISPRcleanR correction assessment: Statistical tests

Description

This function tests the log fold changes of sgRNAs targeting different sets of genes for statistically significant differences with respect to background pre and post CRISPRcleanR correction, creating two sets of boxplots with outcomes and outputting statistical indicators.

Usage

ccr.perf_statTests(cellLine, libraryAnnotation, correctedFCs,
    outDir = "./",
    GDSC.geneLevCNA = NULL,
    CCLE.gisticCNA = NULL,
    RNAseq.fpkms = NULL)

Arguments

cellLine    A string specifying the name of a cell line (or a COSMIC identifier [1]);
libraryAnnotation    The sgRNA library annotations formatted as specified in the reference manual entry of the KY_Library_v1.0 built-in library.
correctedFCs    sgRNAs log fold changes corrected for gene independent responses to CRISPR-Cas9 targeting, generated with the function ccr.GWclean (first data frame included in the list outputted by ccr.GWclean, i.e. corrected_logFCs).
outDir    The path of the folder where the boxplots will be saved.
GDSC.geneLevCNA    Genome-wide copy number data with the same format as GDSC.geneLevCNA. This can be assembled from the xls sheet specified in the source section [a] (containing data for the GDSC1000 cell lines).
If NULL, then this function uses the built-in GDSC.geneLevCNA data frame, containing data derived from [a] for 15 cell lines used in [2] to assess the performances of CRISPRcleanR.
CCLE.gisticCNA    Genome-wide Gistic [3] scores quantifying copy number status across cell lines with the same format as CCLE.gisticCNA. If NULL then this function uses the CCLE.gisticCNA built-in data frame, containing data for 13 cell lines of the 15 used in [2] to assess the performances of CRISPRcleanR.
RNAseq.fpkms    Genome-wide expression values, as fragments per kilobase of exon per million reads mapped (FPKM), across cell lines. These can be derived from a comprehensive collection of RNAseq profiles described in [4]. The format must be the same as that of the RNAseq.fpkms built-in data frame. If NULL then this function uses the RNAseq.fpkms built-in data frame containing data for 15 cell lines used in [2] to assess CRISPRcleanR results.

Details

This function assesses the statistical difference pre/post CRISPRcleanR correction of log fold changes for sgRNAs targeting respectively:
• copy number (CN) deleted genes according to the GDSC1000 repository
• CN deleted genes (gistic score = -2) according to the CCLE repository
• non expressed genes (FPKM < 0.05)
• genes with gistic score = 1
• genes with gistic score = 2
• non expressed genes (FPKM < 0.05) with gistic score = 1
• non expressed genes (FPKM < 0.05) with gistic score = 2
• genes with minimal CN = 2, according to the GDSC1000
• genes with minimal CN = 4, according to the GDSC1000
• genes with minimal CN = 8, according to the GDSC1000
• genes with minimal CN = 10, according to the GDSC1000
• non expressed genes (FPKM < 0.05) with minimal CN = 2, according to the GDSC1000
• non expressed genes (FPKM < 0.05) with minimal CN = 4, according to the GDSC1000
• non expressed genes (FPKM < 0.05) with minimal CN = 8, according to the GDSC1000
• non expressed genes (FPKM < 0.05) with minimal CN = 10, according to the GDSC1000
• core fitness essential genes, assembling signatures from MSigDB [5], included in the built-in vectors EssGenes.DNA_REPLICATION_cons, EssGenes.KEGG_rna_polymerase, EssGenes.PROTEASOME_cons, EssGenes.ribosomalProteins, EssGenes.SPLICEOSOME_cons
• Reference core fitness essential genes assembled from multiple RNAi studies used as classification template by the BAGEL algorithm to call gene depletion significance [6] (BAGEL_essential)
• Reference core fitness essential genes assembled from multiple RNAi studies used as classification template by the BAGEL algorithm to call gene depletion significance [6], after the removal of core fitness essential genes from MSigDB [5]
• Reference non essential genes assembled from multiple RNAi studies used as classification template by the BAGEL algorithm to call gene depletion significance [6] (BAGEL_nonEssential)

Value

A list of three named 2x19 matrices, one per statistical indicator, with rows indicating pre/post CRISPRcleanR correction sgRNAs’ log fold changes and one column for each tested gene set. The entries of each matrix contain, respectively:
PVALS    P-value resulting from a Student’s t-test assessing the differences between sgRNAs log fold changes pre (first row) and post (second row) CRISPRcleanR correction with respect to background
SIGNS    The sign of the difference (1 = mean log fold change of the tested set larger than the mean of the background population, -1 = mean log fold change of the tested set smaller than the mean of the background population)
EFFsizes    Effect size (computed via Cohen’s d): difference of the means / pooled standard deviation.

Author(s)

Francesco Iorio (fi9323@gmail.com)

Source

[a] ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/release-6.0/Gene_level_CN.xlsx.

References

[1] Forbes SA, Beare D, Boutselakis H, et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Research, Volume 45, Issue D1, 4 January 2017, Pages D777-D783.
[2] Iorio, F., Behan, F.
M., Goncalves, E., Beaver, C., Ansari, R., Pooley, R., et al. (n.d.). Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. http://doi.org/10.1101/228189
[3] Mermel CH, Schumacher SE, Hill B, et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol. 2011;12(4):R41. doi: 10.1186/gb-2011-12-4-r41.
[4] Garcia-Alonso L, Iorio F, Matchan A, et al. Transcription factor activities enhance markers of drug response in cancer. doi: https://doi.org/10.1101/129478
[5] Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., et al. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545-15550. http://doi.org/10.1073/pnas.0506580102
[6] BAGEL: a computational framework for identifying essential genes from pooled library screens. Traver Hart and Jason Moffat. BMC Bioinformatics, 2016 vol. 17 p. 164.

See Also

KY_Library_v1.0, ccr.GWclean, GDSC.geneLevCNA, CCLE.gisticCNA, RNAseq.fpkms, EssGenes.DNA_REPLICATION_cons, EssGenes.KEGG_rna_polymerase, EssGenes.PROTEASOME_cons, EssGenes.ribosomalProteins, EssGenes.SPLICEOSOME_cons, BAGEL_essential, BAGEL_nonEssential

Examples

## loading corrected sgRNAs log fold-changes and segment annotations for an example
## cell line (EPLC-272H)
data(EPLC.272HcorrectedFCs)

## loading library annotation
data(KY_Library_v1.0)

## Evaluate correction effects.
## Boxplots will be saved in EPLC-272H.pdf
## in the current directory
RES<-ccr.perf_statTests('EPLC-272H',libraryAnnotation = KY_Library_v1.0,
    correctedFCs = EPLC.272HcorrectedFCs$corrected_logFCs)
RES$PVALS
RES$EFFsizes

ccr.PlainTsvFile    Saving a sgRNA counts’ object in a plain tsv file

Description

This function takes in input a sgRNA counts’ object, as outputted (for example) by the ccr.NormfoldChanges function, and saves it as a plain tab-delimited text file (which can be processed by MAGeCK [1]).

Usage

ccr.PlainTsvFile(sgRNA_count_object, fprefix = "", path = "./")

Arguments

sgRNA_count_object    sgRNA counts data object.
fprefix    A string specifying a name prefix of the tsv file which will contain the inputted sgRNA counts data object.
path    A string specifying the location where the tsv file will be saved.

Value

A string specifying the complete path of the saved tsv file.

Author(s)

Francesco Iorio (fi9323@gmail.com)

References

[1] Li, W., Xu, H., Xiao, T., Cong, L., Love, M. I., Zhang, F., et al. (2014). MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens. Genome Biology, 15(12), 554.
[2] Hart, T., & Moffat, J. (2016). BAGEL: a computational framework for identifying essential genes from pooled library screens. BMC Bioinformatics, 17(1), 164.
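For readers who want to see what saving a counts object as a plain tab-delimited file amounts to, here is a minimal base R sketch. It is not the body of ccr.PlainTsvFile: the helper saveCountsAsTsv and the toy counts are hypothetical, and the '_sgRNA_count.tsv' suffix is only inferred from the file name mentioned in the example of this entry.

```r
## Hypothetical helper mirroring the documented arguments (fprefix, path)
saveCountsAsTsv <- function(countObject, fprefix = "", path = "./") {
  ## assumption: the file name is the prefix followed by '_sgRNA_count.tsv'
  fn <- file.path(path, paste0(fprefix, "_sgRNA_count.tsv"))
  write.table(countObject, file = fn, sep = "\t",
              quote = FALSE, row.names = FALSE)
  ## return the complete path of the saved file
  fn
}

## Toy counts object: sgRNA id, gene symbol, then count columns (not real data)
toyCounts <- data.frame(
  sgRNA   = c("sg1", "sg2", "sg3"),
  gene    = c("geneA", "geneA", "geneB"),
  control = c(120, 85, 240),
  sample1 = c(60, 90, 230),
  stringsAsFactors = FALSE
)

## writing to a temporary directory and reading the file back
outFile  <- saveCountsAsTsv(toyCounts, fprefix = "toy", path = tempdir())
readBack <- read.delim(outFile, stringsAsFactors = FALSE)
readBack
```

Omitting row names and quotes keeps the file to a header line plus one tab-separated row per sgRNA, a layout that generic count-table readers can parse.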
See Also

ccr.NormfoldChanges

Examples

## Loading sgRNA library annotation file
data(KY_Library_v1.0)

## Deriving the path of the file with the example dataset,
## from the mutagenesis of the EPLC-272H cancer cell line
fn<-paste(system.file('extdata', package = 'CRISPRcleanR'),
    '/EPLC-272H_counts.tsv',sep='')

## Loading, median-normalizing and computing fold-changes for the example dataset
normANDfcs<-ccr.NormfoldChanges(fn,min_reads=30,
    EXPname='EPLC-272H',
    libraryAnnotation = KY_Library_v1.0,
    display=FALSE)

## saving median-normalised sgRNA counts as a plain tsv file in ./EPLC-272H_sgRNA_count.tsv
uncorrected_fn<-ccr.PlainTsvFile(sgRNA_count_object = normANDfcs$norm_counts,fprefix = 'EPLC-272H')
uncorrected_fn

ccr.PrRc_Curve    Classification performances of reference sets of genes (or sgRNAs) based on depletion log fold-changes

Description

This function computes the Precision/Recall (or PPV/Sensitivity, PrRc) curve, the area under the PrRc curve and (optionally) the Recall (i.e. TPR) at a fixed false discovery rate (computed as 1 - Precision, or PPV) and the corresponding log fold change threshold, when classifying reference sets of genes (or sgRNAs) based on their depletion log fold-changes.

Usage

ccr.PrRc_Curve(FCsprofile, positives, negatives, display = TRUE, FDRth = NULL)

Arguments

FCsprofile    A numerical vector containing gene average depletion log fold changes (or sgRNAs’ depletion log fold changes) with names corresponding to HGNC symbols (or sgRNAs’ identifiers).
positives    A vector of strings containing a reference set of positive cases: HGNC symbols of essential genes or identifiers of their targeting sgRNAs. This must be a subset of FCsprofile names, disjoint from negatives.
negatives    A vector of strings containing a reference set of negative cases: HGNC symbols of non-essential genes or identifiers of their targeting sgRNAs. This must be a subset of FCsprofile names, disjoint from positives.
display    A logical parameter specifying if a plot containing the computed precision/recall curve with ROC indicators should be plotted (default = TRUE).
FDRth    If different from NULL, a numerical value >=0 and <=1 specifying the false discovery rate threshold at which the fixed recall will be computed. In this case, if the display parameter is TRUE, a horizontal dashed line will be added to the plot at the resulting recall and its value will be visualised in the legend.

Value

A list containing three numerical variables, AUC, Recall, and sigthreshold, indicating respectively the area under the PrRc curve and (if FDRth is not NULL) the recall at the specified false discovery rate and the corresponding log fold change threshold (both equal to NULL, if FDRth is NULL).

Author(s)

Francesco Iorio (fi9323@gmail.com)

See Also

BAGEL_essential, BAGEL_nonEssential, ccr.genes2sgRNAs, ccr.VisDepAndSig, ccr.ROC_Curve

Examples

## loading corrected sgRNAs log fold-changes and segment annotations for an example
## cell line (EPLC-272H)
data(EPLC.272HcorrectedFCs)

## loading reference sets of essential and non-essential genes
data(BAGEL_essential)
data(BAGEL_nonEssential)

## loading library annotation
data(KY_Library_v1.0)

## storing sgRNA log fold-changes in a named vector
FCs<-EPLC.272HcorrectedFCs$corrected_logFCs$avgFC
names(FCs)<-rownames(EPLC.272HcorrectedFCs$corrected_logFCs)

## deriving sgRNAs targeting essential and non-essential genes (respectively)
BAGEL_essential_sgRNAs<-ccr.genes2sgRNAs(KY_Library_v1.0,BAGEL_essential)
BAGEL_nonEssential_sgRNAs<-ccr.genes2sgRNAs(KY_Library_v1.0,BAGEL_nonEssential)

## computing classification performances at the sgRNA level
ccr.PrRc_Curve(FCs,BAGEL_essential_sgRNAs,BAGEL_nonEssential_sgRNAs)

## computing gene level log fold-changes
geneFCs<-ccr.geneMeanFCs(FCs,KY_Library_v1.0)

## computing classification performances at the gene level, with Recall at 5% FDR
ccr.PrRc_Curve(geneFCs,BAGEL_essential,BAGEL_nonEssential,FDRth
    = 0.05)

ccr.RecallCurves    CRISPRcleanR correction assessment: Recall curve inspection

Description

This function creates plots with Recall curve outcomes (and it computes the areas under the Recall curves) resulting from classifying defined sets of sgRNAs (respectively genes) based on their log fold change (respectively log fold changes averaged across targeting sgRNAs).

Usage

ccr.RecallCurves(cellLine, correctedFCs,
    GDSC.geneLevCNA = NULL,
    RNAseq.fpkms = NULL,
    minCN = 8,
    libraryAnnotation,
    GeneLev = FALSE)

Arguments

cellLine    A string specifying the name of a cell line (or a COSMIC identifier [1]);
correctedFCs    sgRNAs log fold changes corrected for gene independent responses to CRISPR-Cas9 targeting, generated with the function ccr.GWclean (first data frame included in the list outputted by ccr.GWclean, i.e. corrected_logFCs).
GDSC.geneLevCNA    Genome-wide copy number data with the same format as GDSC.geneLevCNA. This can be assembled from the xls sheet specified in the source section [a] (containing data for the GDSC1000 cell lines). If NULL, then this function uses the built-in GDSC.geneLevCNA data frame, containing data derived from [a] for 15 cell lines used in [2] to assess the performances of CRISPRcleanR.
RNAseq.fpkms    Genome-wide expression values, as fragments per kilobase of exon per million reads mapped (FPKM), across cell lines. These can be derived from a comprehensive collection of RNAseq profiles described in [4]. The format must be the same as that of the RNAseq.fpkms built-in data frame. If NULL then this function uses the RNAseq.fpkms built-in data frame containing data for 15 cell lines used in [2] to assess CRISPRcleanR results.
minCN    A numerical value specifying the minimal copy number for a gene in order to be considered amplified based on the data in GDSC.geneLevCNA. This value can be 2, 4, 8 or 10.
libraryAnnotation    The sgRNA library annotations formatted as specified in the reference manual entry of the KY_Library_v1.0 built-in library.
GeneLev    A logical value specifying whether the Recall should be computed at the level of genes. In this case, average gene log fold changes are computed from the inputted corrected log fold changes across targeting sgRNAs.

Details

This function generates 2 plots, showing Recall curves resulting from classifying the following 4 sets of sgRNAs (or genes, depending on the parameter GeneLev), based on their log fold changes (or log fold changes averaged across targeting guides):
• (i) Copy number amplified genes, according to the data in GDSC.geneLevCNA and the threshold value specified in minCN;
• (ii) Copy number amplified non-expressed genes, according to the data in GDSC.geneLevCNA and the threshold value specified in minCN, and the data in RNAseq.fpkms (FPKM < 0.05);
• (iv) reference sets of core-fitness-essential and non-essential genes, assembled from multiple RNAi studies and used as classification template by the BAGEL algorithm to call gene depletion significance [5] (BAGEL_essential, BAGEL_nonEssential).

Author(s)

Francesco Iorio (fi9323@gmail.com)

Source

[a] ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/release-6.0/Gene_level_CN.xlsx

References

[1] Forbes SA, Beare D, Boutselakis H, et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Research, Volume 45, Issue D1, 4 January 2017, Pages D777-D783.
[2] Iorio, F., Behan, F. M., Goncalves, E., Beaver, C., Ansari, R., Pooley, R., et al. (n.d.). Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. http://doi.org/10.1101/228189
[3] Mermel CH, Schumacher SE, Hill B, et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol. 2011;12(4):R41. doi: 10.1186/gb-2011-12-4-r41.
[4] Garcia-Alonso L, Iorio F, Matchan A, et al. Transcription factor activities enhance markers of drug response in cancer. doi: https://doi.org/10.1101/129478
[5] Hart T, Moffat J. BAGEL: a computational framework for identifying essential genes from pooled library screens. BMC Bioinformatics, 2016 vol. 17 p. 164.

See Also

KY_Library_v1.0, ccr.GWclean, GDSC.geneLevCNA, RNAseq.fpkms, BAGEL_essential, BAGEL_nonEssential

Examples

## loading corrected sgRNAs log fold-changes and segment annotations for an example
## cell line (EPLC-272H)
data(EPLC.272HcorrectedFCs)
## loading library annotation
data(KY_Library_v1.0)
## Creating recall curve plots and computing corresponding underlying area
## at the level of sgRNAs
ccr.RecallCurves('EPLC-272H',EPLC.272HcorrectedFCs$corrected_logFCs, libraryAnnotation=KY_Library_v1.0)
## Creating recall curve plots and computing corresponding underlying area
## at the gene level
ccr.RecallCurves('EPLC-272H',EPLC.272HcorrectedFCs$corrected_logFCs, libraryAnnotation=KY_Library_v1.0,GeneLev = TRUE)


ccr.ROC_Curve    Classification performances of reference sets of genes (or sgRNAs) based on depletion log fold-changes

Description

This function computes the Specificity/Sensitivity (or TNR/TPR, or ROC) curve, the area under the ROC curve and (optionally) the Recall (i.e. TPR) at a fixed false discovery rate (computed as 1 - Precision, or Positive Predicted Value) together with the corresponding log fold change threshold, when classifying reference sets of genes (or sgRNAs) based on their depletion log fold-changes.

Usage

ccr.ROC_Curve(FCsprofile, positives, negatives, display = TRUE, FDRth = NULL)

Arguments

FCsprofile    A numerical vector containing gene average depletion log fold changes (or sgRNAs’ depletion log fold changes) with names corresponding to HGNC symbols (or sgRNAs’ identifiers).
positives    A vector of strings containing a reference set of positive cases: HGNC symbols of essential genes or identifiers of their targeting sgRNAs.
This must be a subset of FCsprofile names, disjoint from negatives.
negatives    A vector of strings containing a reference set of negative cases: HGNC symbols of non-essential genes or identifiers of their targeting sgRNAs. This must be a subset of FCsprofile names, disjoint from positives.
display    A logical parameter specifying whether a plot containing the computed ROC curve should be produced (default = TRUE).
FDRth    If different from NULL, a numerical value >= 0 and <= 1 specifying the false discovery rate threshold at which the fixed recall will be computed. In this case, if the display parameter is TRUE, a horizontal dashed line is added to the plot at the resulting recall, and its value is shown in the legend.

Value

A list containing three numerical variables, AUC, Recall and sigthreshold, indicating respectively the area under the ROC curve and (if FDRth is not NULL) the recall at the specified false discovery rate and the corresponding log fold change threshold (both equal to NULL if FDRth is NULL).
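The relation between FDRth, Recall and sigthreshold described above can be illustrated with a minimal base R sketch. The helper recall_at_fdr and all values are hypothetical and written only for illustration; this is not the CRISPRcleanR implementation, and it assumes that more negative log fold changes indicate stronger depletion and that the reference sets are subsets of the vector names.

```r
# Hypothetical helper (not part of CRISPRcleanR): recall at a fixed FDR,
# computed from a named vector of log fold changes and two reference sets.
recall_at_fdr <- function(fcs, positives, negatives, fdr_th = 0.05) {
  # keep only the reference elements, most depleted first
  fcs <- sort(fcs[c(positives, negatives)])
  is_pos <- names(fcs) %in% positives
  tp <- cumsum(is_pos)             # true positives at each rank cut-off
  fp <- cumsum(!is_pos)            # false positives at each rank cut-off
  fdr <- fp / (tp + fp)            # FDR = 1 - precision
  ok <- which(fdr <= fdr_th)
  if (length(ok) == 0) return(list(Recall = 0, sigthreshold = NA_real_))
  k <- max(ok)                     # deepest cut-off still within the FDR limit
  list(Recall = tp[k] / length(positives), sigthreshold = unname(fcs[k]))
}

# Synthetic log fold changes: 4 "essential" (E) and 4 "non-essential" (N) genes
fcs <- c(E1 = -3, E2 = -2.5, E3 = -2, E4 = -0.2,
         N1 = -0.5, N2 = 0, N3 = 0.3, N4 = 0.6)
res <- recall_at_fdr(fcs, paste0("E", 1:4), paste0("N", 1:4), fdr_th = 0.05)
print(res)
```

In this toy profile the first negative case (N1) ranks above E4, so at a 5% FDR only three of the four positives are recalled and the threshold sits at the log fold change of E3.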

Author(s)

Francesco Iorio (fi9323@gmail.com)

See Also

BAGEL_essential, BAGEL_nonEssential, ccr.genes2sgRNAs, ccr.VisDepAndSig, ccr.PrRc_Curve

Examples

## loading corrected sgRNAs log fold-changes and segment annotations for an example
## cell line (EPLC-272H)
data(EPLC.272HcorrectedFCs)
## loading reference sets of essential and non-essential genes
data(BAGEL_essential)
data(BAGEL_nonEssential)
## loading library annotation
data(KY_Library_v1.0)
## storing sgRNA log fold-changes in a named vector
FCs<-EPLC.272HcorrectedFCs$corrected_logFCs$avgFC
names(FCs)<-rownames(EPLC.272HcorrectedFCs$corrected_logFCs)
## deriving sgRNAs targeting essential and non-essential genes (respectively)
BAGEL_essential_sgRNAs<-ccr.genes2sgRNAs(KY_Library_v1.0,BAGEL_essential)
BAGEL_nonEssential_sgRNAs<-ccr.genes2sgRNAs(KY_Library_v1.0,BAGEL_nonEssential)
## computing classification performances at the sgRNA level
ccr.ROC_Curve(FCs,BAGEL_essential_sgRNAs,BAGEL_nonEssential_sgRNAs)
## computing gene level log fold-changes
geneFCs<-ccr.geneMeanFCs(FCs,KY_Library_v1.0)
## computing classification performances at the gene level, with Recall at 5% FDR
ccr.ROC_Curve(geneFCs,BAGEL_essential,BAGEL_nonEssential,FDRth = 0.05)


ccr.VisDepAndSig    Depletion profile visualisation with genes signatures superimposed and recall

Description

This function ranks the gene (or sgRNA) log fold changes. It then determines a log fold change threshold for a user defined false discovery rate when classifying two positive/negative reference sets of genes (or sgRNAs), typically core-fitness-essential and non-essential genes, and computes the Recall (or True Positive Rate) of the genes in other user defined sets at that threshold. It produces a plot in which the log fold changes are shown alongside the rank positions of the genes included in the inputted sets, their recall, and the determined FDR threshold.

Usage

ccr.VisDepAndSig(FCsprofile,SIGNATURES,TITLE='', pIs=NULL,nIs=NULL, th=0.05,plotFCprofile=TRUE)

Arguments

FCsprofile    A numerical vector containing gene average depletion log fold changes (or sgRNAs’ depletion log fold changes) with names corresponding to HGNC symbols (or sgRNAs’ identifiers).
SIGNATURES    A named list of vectors containing HGNC gene symbols. Two of these vectors are used as classification template (respectively for positive and negative cases) to determine a log fold-change threshold providing a user defined classification false discovery rate.
TITLE    A string specifying the title of the plot.
pIs    The index position of the signature that contains the positive cases of the classification template.
nIs    The index position of the signature that contains the negative cases of the classification template.
th    A numerical value specifying the desired classification false discovery rate (this must be a real number between 0 and 1).
plotFCprofile    A logical value specifying whether the log fold changes should be plotted.

Value

A named numerical vector containing recall scores for all the inputted signatures at the computed false discovery rate threshold for log fold-changes.
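As a rough illustration of what a vector of per-signature recall scores looks like, the following base R sketch computes, for each gene set, the fraction of its genes at or below a given log fold change threshold. The function signature_recall, the threshold and the log fold change values are assumptions made for this example; the sketch takes the threshold as given and does not reproduce how ccr.VisDepAndSig derives it or draws the plot.

```r
# Hypothetical illustration (base R only): recall of gene sets at a given
# log fold-change threshold; not the CRISPRcleanR implementation.
signature_recall <- function(fcs, signatures, threshold) {
  sapply(signatures, function(genes) {
    genes <- intersect(genes, names(fcs))   # ignore genes that were not screened
    mean(fcs[genes] <= threshold)           # fraction at or below the threshold
  })
}

# Made-up gene-level log fold changes
fcs <- c(RPL3 = -2.8, RPS6 = -2.1, POLR2A = -1.9, PSMA1 = -0.4, GENE_X = 0.2)
signatures <- list(Ribosomal_Proteins = c("RPL3", "RPS6"),
                   Mixed = c("POLR2A", "PSMA1", "GENE_X", "NOT_SCREENED"))
scores <- signature_recall(fcs, signatures, threshold = -1.5)
print(scores)
```

The result is a named numerical vector with one entry per signature, in the same order as the input list.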

Author(s)

Francesco Iorio (iorio@gmail.com)

See Also

ccr.ROC_Curve, ccr.PrRc_Curve

Examples

## loading corrected sgRNAs log fold-changes and segment annotations
## for an example cell line (EPLC-272H)
data(EPLC.272HcorrectedFCs)
## loading reference sets of essential and non-essential genes
data(BAGEL_essential)
data(BAGEL_nonEssential)
## loading other sets of core fitness genes
data(EssGenes.ribosomalProteins)
data(EssGenes.DNA_REPLICATION_cons)
data(EssGenes.KEGG_rna_polymerase)
data(EssGenes.PROTEASOME_cons)
data(EssGenes.SPLICEOSOME_cons)
## storing the sgRNA log fold changes into a named vector
FCs<-EPLC.272HcorrectedFCs$corrected_logFCs$avgFC
names(FCs)<-rownames(EPLC.272HcorrectedFCs$corrected_logFCs)
## loading sgRNA library annotation
data(KY_Library_v1.0)
## computing gene average log fold changes
FCs<-ccr.geneMeanFCs(FCs,KY_Library_v1.0)
## Assembling a named list with all the considered gene sets
SIGNATURES<-list(Ribosomal_Proteins=EssGenes.ribosomalProteins,
    DNA_Replication = EssGenes.DNA_REPLICATION_cons,
    RNA_polymerase = EssGenes.KEGG_rna_polymerase,
    Proteasome = EssGenes.PROTEASOME_cons,
    Spliceosome = EssGenes.SPLICEOSOME_cons,
    CFE=BAGEL_essential,
    non_essential=BAGEL_nonEssential)
## Visualising log fold change profile with superimposed signatures specifying
## that the reference gene sets are in positions 6 and 7
Recall_scores<-ccr.VisDepAndSig(FCsprofile = FCs, SIGNATURES = SIGNATURES, TITLE = 'EPLC-272H', pIs = 6, nIs = 7)
Recall_scores


CL.subset    COSMIC identifiers of 15 immortalised human cancer cell lines

Description

COSMIC identifiers [1] of the 15 cell lines included in the GDSC1000 panel [2] that are used in [3] to assess CRISPRcleanR results.

Usage

data(CL.subset)

Format

A vector of strings.

References

[1] Forbes SA, Beare D, Boutselakis H, et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Research, Volume 45, Issue D1, 4 January 2017, Pages D777-D783.
[2] Iorio F, Knijnenburg TA, Vis DJ, Bignell GR, Menden MP, et al. A landscape of pharmacogenomic interactions in cancer. Cell 2016 Jul 28;166(3):740-54
[3] Iorio, F., Behan, F. M., Goncalves, E., Beaver, C., Ansari, R., Pooley, R., et al. (n.d.). Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. http://doi.org/10.1101/228189

Examples

data(CL.subset)
## Loading annotation for the GDSC1000 cell lines
data(GDSC.CL_annotation)
## Visualising annotation
GDSC.CL_annotation[CL.subset,]


EPLC.272HcorrectedFCs    CRISPRcleanR corrected data for an example cell line

Description

This list contains corrected sgRNAs log fold-changes and segment annotations for an example cell line (EPLC-272H), obtained using the ccr.GWclean function, as detailed in its reference manual entry (ccr.GWclean).

Usage

data("EPLC.272HcorrectedFCs")

Format

A list containing two data frames and a vector of strings.

The first data frame (corrected_logFCs) contains one named row per sgRNA and the following columns:
• CHR: the chromosome of the gene targeted by the sgRNA under consideration;
• startp: the genomic coordinate of the starting position of the region targeted by the sgRNA under consideration;
• endp: the genomic coordinate of the ending position of the region targeted by the sgRNA under consideration;
• genes: the HGNC symbol of the gene targeted by the sgRNA under consideration;
• avgFC: the log fold change of the sgRNA averaged across replicates;
• correction: the type of correction: 1 = increased log fold change, -1 = decreased log fold change, 0 = no correction;
• correctedFC: the corrected log fold change of the sgRNA.
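
The columns listed above lend themselves to a quick summary of what the correction did. The sketch below uses a small synthetic data frame with the documented column names (all values and sgRNA identifiers are made up) rather than the real EPLC.272HcorrectedFCs object, so it runs with base R alone.

```r
# Synthetic stand-in for a corrected_logFCs data frame (values are made up)
corrected_logFCs <- data.frame(
  CHR = c(1, 1, 8, 8),
  startp = c(100, 250, 900, 950),
  endp = c(120, 270, 920, 970),
  genes = c("G1", "G1", "G2", "G2"),
  avgFC = c(-0.1, -0.2, -1.5, -1.7),
  correction = c(0, 0, 1, 1),
  correctedFC = c(-0.1, -0.2, -0.3, -0.5),
  row.names = c("sg1", "sg2", "sg3", "sg4")
)

# number of sgRNAs per correction type (-1, 0 or 1)
print(table(corrected_logFCs$correction))

# size of the shift applied to each sgRNA
shift <- corrected_logFCs$correctedFC - corrected_logFCs$avgFC
names(shift) <- rownames(corrected_logFCs)
print(shift)

# sgRNAs whose log fold change was increased by the correction
increased <- rownames(corrected_logFCs)[corrected_logFCs$correction == 1]
print(increased)
```

With the packaged data, the same expressions could presumably be applied to EPLC.272HcorrectedFCs$corrected_logFCs.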

The second data frame (segments) contains the identified regions of estimated equal log fold changes (one region per row) and the following columns:
• CHR: the chromosome of the region under consideration;
• startp: the genomic coordinate of the starting position of the region under consideration;
• endp: the genomic coordinate of the ending position of the region under consideration;
• n.sgRNAs: the number of sgRNAs targeting sequences in the region under consideration;
• avg.logFC: the average log fold change of the sgRNAs in the region;
• guideIdx: the index range of the sgRNAs targeting the region under consideration, as they appear in the gwSortedFCs data frame provided in input.

The vector of strings (SORTED_sgRNAs) contains the sgRNAs’ identifiers in the same order as they are reported in the gwSortedFCs data frame inputted to the ccr.GWclean function.
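
Since guideIdx refers to positions in the sorted sgRNA order, it can be combined with SORTED_sgRNAs to list the guides falling in a segment. The sketch below assumes that guideIdx is stored as a string of the form "first, last"; this encoding is an assumption made for illustration and should be checked against the actual data. The data frame, identifiers and the helper guides_in_segment are all hypothetical.

```r
# Synthetic stand-ins for the segments data frame and SORTED_sgRNAs vector.
# Assumption: guideIdx holds "first, last" positions as a string.
segments <- data.frame(CHR = c(1, 1),
                       startp = c(100, 500),
                       endp = c(400, 900),
                       n.sgRNAs = c(3, 2),
                       avg.logFC = c(-0.1, -1.4),
                       guideIdx = c("1, 3", "4, 5"),
                       stringsAsFactors = FALSE)
SORTED_sgRNAs <- c("sgA", "sgB", "sgC", "sgD", "sgE")

# Hypothetical helper: identifiers of the sgRNAs in the i-th segment
guides_in_segment <- function(segments, sorted_ids, i) {
  idx <- as.integer(trimws(strsplit(segments$guideIdx[i], ",")[[1]]))
  sorted_ids[idx[1]:idx[2]]
}

seg2 <- guides_in_segment(segments, SORTED_sgRNAs, 2)
print(seg2)
```

Under this assumption, the number of identifiers returned for a segment should match its n.sgRNAs value.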

Examples

data(EPLC.272HcorrectedFCs)
head(EPLC.272HcorrectedFCs$corrected_logFCs)
head(EPLC.272HcorrectedFCs$segments)
head(EPLC.272HcorrectedFCs$SORTED_sgRNAs)


EssGenes.DNA_REPLICATION_cons    Core Fitness essential genes involved in DNA replication

Description

List of core fitness essential genes involved in DNA replication, assembled by merging together multiple DNA replication signatures from MSigDB [1] as detailed in [2].

Usage

data("EssGenes.DNA_REPLICATION_cons")

Format

A vector of strings containing HGNC symbols.

References

[1] Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., et al. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545-15550. http://doi.org/10.1073/pnas.0506580102
[2] Iorio, F., Behan, F. M., Goncalves, E., Beaver, C., Ansari, R., Pooley, R., et al. (n.d.). Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. http://doi.org/10.1101/228189

Examples

data(EssGenes.DNA_REPLICATION_cons)
head(EssGenes.DNA_REPLICATION_cons)


EssGenes.KEGG_rna_polymerase    Core Fitness essential RNA polymerase genes

Description

List of core fitness essential RNA polymerase genes downloaded from MSigDB [1].

Usage

data("EssGenes.KEGG_rna_polymerase")

Format

A vector of strings containing HGNC symbols.

References

[1] Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., et al. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545-15550. http://doi.org/10.1073/pnas.0506580102
[2] Iorio, F., Behan, F. M., Goncalves, E., Beaver, C., Ansari, R., Pooley, R., et al. (n.d.). Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. http://doi.org/10.1101/228189

Examples

data(EssGenes.KEGG_rna_polymerase)
head(EssGenes.KEGG_rna_polymerase)


EssGenes.PROTEASOME_cons    Core Fitness essential proteasome genes

Description

List of core fitness essential proteasome genes, assembled by merging together multiple signatures from MSigDB [1] as detailed in [2].

Usage

data("EssGenes.PROTEASOME_cons")

Format

A vector of strings containing HGNC symbols.

References

[1] Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., et al. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545-15550. http://doi.org/10.1073/pnas.0506580102
[2] Iorio, F., Behan, F. M., Goncalves, E., Beaver, C., Ansari, R., Pooley, R., et al. (n.d.). Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. http://doi.org/10.1101/228189

Examples

data(EssGenes.PROTEASOME_cons)
head(EssGenes.PROTEASOME_cons)


EssGenes.ribosomalProteins    Core Fitness essential genes coding for ribosomal proteins

Description

List of core fitness essential genes coding for ribosomal proteins, curated from [1].

Usage

data("EssGenes.ribosomalProteins")

Format

A vector of strings containing HGNC symbols.

References

[1] Yoshihama, M. et al. The human ribosomal protein genes: sequencing and comparative analysis of 73 genes. Genome Res. 12, 379-390 (2002)
[2] Iorio, F., Behan, F. M., Goncalves, E., Beaver, C., Ansari, R., Pooley, R., et al. (n.d.). Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. http://doi.org/10.1101/228189

Examples

data(EssGenes.ribosomalProteins)
head(EssGenes.ribosomalProteins)


EssGenes.SPLICEOSOME_cons    Core Fitness essential spliceosome genes

Description

List of core fitness essential spliceosome genes, assembled by merging together multiple signatures from MSigDB [1] as detailed in [2].

Usage

data("EssGenes.SPLICEOSOME_cons")

Format

A vector of strings containing HGNC symbols.

References

[1] Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., et al. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545-15550. http://doi.org/10.1073/pnas.0506580102
[2] Iorio, F., Behan, F. M., Goncalves, E., Beaver, C., Ansari, R., Pooley, R., et al. (n.d.). Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. http://doi.org/10.1101/228189

Examples

data(EssGenes.SPLICEOSOME_cons)
head(EssGenes.SPLICEOSOME_cons)


GDSC.CL_annotation    Tissue type and other annotations for 1,001 human cancer cell lines

Description

Tissue type and other annotations for 1,001 human cancer cell lines.

Usage

data(GDSC.CL_annotation)

Format

A data frame with 1,001 observations of the following 7 variables.
CL.name    Cell line name;
COSMIC.ID    COSMIC identifier of the cell line;
GDSC.description_1    Tissue descriptor (Genomics of Drug Sensitivity in Cancer - Level 1);
GDSC_description_2    Tissue descriptor (Genomics of Drug Sensitivity in Cancer - Level 2);
‘TCGA type’    Manually curated matched TCGA cancer type;
MMR    Microsatellite instability status (MSI-S = Stable, MSI-L = Instable, MSI-H = highly-Instable).

Source

This data frame has been derived from the xls table available at http://www.cancerrxgene.org/gdsc1000/GDSC1000_WebResources//Data/suppData/TableS1E.xlsx.

References

[1] Iorio F, Knijnenburg TA, Vis DJ, Bignell GR, Menden MP, et al. A landscape of pharmacogenomic interactions in cancer. Cell 2016 Jul 28;166(3):740-54

Examples

data(GDSC.CL_annotation)
head(GDSC.CL_annotation)


GDSC.geneLevCNA    Genome-wide copy number data for 15 human cancer cell lines

Description

Genome-wide copy number data derived from PICNIC analysis of Affymetrix SNP6 segmentation data (EGAS00001000978, part of the Genomics of Drug Sensitivity in 1,000 Cancer Cell Lines (GDSC1000) panel [1]) for the 15 cell lines used in [2] to assess CRISPRcleanR results.

Usage

data(GDSC.geneLevCNA)

Format

A data frame with HGNC gene symbols on the rows and cancer cell lines’ COSMIC identifiers on the columns. The entry in position i,j indicates the copy number status of gene i in cell line j.

Details

Each entry of the data frame is a string made of four comma-separated pieces of data (n1,n2,n3,n4); a hyphen (-) is used when the corresponding data is unknown. The four values indicate:
• n1: Maximum copy number of any genomic segment containing coding sequence of the gene (-1 indicates a value could not be assigned).
• n2: Minimum copy number of any genomic segment containing coding sequence of the gene (-1 indicates a value could not be assigned).
• n3: Zygosity: (H) if all segments containing gene sequence are heterozygous, (L) if any segment containing coding sequence has LOH, (0) if the complete coding sequence of the gene falls within a homozygous deletion.
• n4: Disruption: (D) if the gene spans more than 1 genomic segment, (-) if no disruption occurs.

Source

This data frame has been derived from the xls table available at ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/release-6.0/Gene_level_CN.xlsx.

References

[1] Iorio F, Knijnenburg TA, Vis DJ, Bignell GR, Menden MP, et al. A landscape of pharmacogenomic interactions in cancer. Cell 2016 Jul 28;166(3):740-54
[2] Iorio, F., Behan, F. M., Goncalves, E., Beaver, C., Ansari, R., Pooley, R., et al. (n.d.). Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. http://doi.org/10.1101/228189

Examples

data(GDSC.geneLevCNA)
GDSC.geneLevCNA[1:10,1:10]


HT.29correctedFCs    CRISPRcleanR corrected data for an example cell line

Description

This list contains corrected sgRNAs log fold-changes and segment annotations for an example cell line (HT-29), obtained using the ccr.GWclean function, as detailed in its reference manual entry (ccr.GWclean).

Usage

data("HT.29correctedFCs")

Format

A list containing two data frames and a vector of strings.

The first data frame (corrected_logFCs) contains one named row per sgRNA and the following columns:
• CHR: the chromosome of the gene targeted by the sgRNA under consideration;
• startp: the genomic coordinate of the starting position of the region targeted by the sgRNA under consideration;
• endp: the genomic coordinate of the ending position of the region targeted by the sgRNA under consideration;
• genes: the HGNC symbol of the gene targeted by the sgRNA under consideration;
• avgFC: the log fold change of the sgRNA averaged across replicates;
• correction: the type of correction: 1 = increased log fold change, -1 = decreased log fold change, 0 = no correction;
• correctedFC: the corrected log fold change of the sgRNA.

The second data frame (segments) contains the identified regions of estimated equal log fold changes (one region per row) and the following columns:
• CHR: the chromosome of the region under consideration;
• startp: the genomic coordinate of the starting position of the region under consideration;
• endp: the genomic coordinate of the ending position of the region under consideration;
• n.sgRNAs: the number of sgRNAs targeting sequences in the region under consideration;
• avg.logFC: the average log fold change of the sgRNAs in the region;
• guideIdx: the index range of the sgRNAs targeting the region under consideration, as they appear in the gwSortedFCs data frame provided in input.

The vector of strings (SORTED_sgRNAs) contains the sgRNAs’ identifiers in the same order as they are reported in the gwSortedFCs data frame inputted to the ccr.GWclean function.

Examples

data(HT.29correctedFCs)
head(HT.29correctedFCs$corrected_logFCs)
head(HT.29correctedFCs$segments)
head(HT.29correctedFCs$SORTED_sgRNAs)


KY_Library_v1.0    sgRNAs’ genome-wide annotation for the Sanger sgRNA pooled Library v1.0

Description

A data frame with a named row for each sgRNA of the Sanger sgRNA pooled library presented in [1], including annotations such as targeted genes and genomic coordinates.

Usage

data("KY_Library_v1.0")

Format

A row-named data frame with 90709 observations (one for each sgRNA) of the following 7 variables.
CODE    alphanumerical identifier of the sgRNA;
GENES    targeted gene;
EXONE    exon of the targeted genomic region (string with ’ex’ prefix followed by the exon number);
CHRM    chromosome where the targeted region resides (string);
STRAND    targeted DNA strand (’+’ or ’-’);
STARTpos    starting genomic coordinate of the targeted genomic region (numeric);
ENDpos    ending genomic coordinate of the targeted genomic region (numeric).

References

[1] Tzelepis K, Koike-Yusa H, De Braekeleer E, Li Y, Metzakopian E, Dovey OM, Mupo A, Grinkevich V, Li M, Mazan M, Gozdecka M, Onishi S, Cooper J, Patel M, McKerrell T, Chen B, Domingues AF, Gallipoli P, Teichmann S, Ponstingl H, McDermott U, Saez-Rodriguez J, Huntly BJP, Iorio F, Pina C, Vassiliou GS, Yusa K. A CRISPR dropout screen identifies genetic vulnerabilities and therapeutic targets in acute myeloid leukaemia. Cell Reports 2016 Oct 18;17(4):1193-1205

Examples

data(KY_Library_v1.0)
head(KY_Library_v1.0)


RNAseq.fpkms    RNAseq derived genome-wide basal expression profiles for 15 cell lines

Description

Genome-wide expression values in fragments per kilobase of exon per million reads mapped (FPKM) for the 15 cell lines specified in CL.subset, derived from the comprehensive collection of RNAseq profiles described in [1] and used in [2] to assess CRISPRcleanR results.
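
The ccr.RecallCurves entry treats genes with FPKM < 0.05 as non-expressed. A minimal sketch of that filter on a data frame shaped like RNAseq.fpkms (genes on rows, cell lines on columns) is given below. The helper non_expressed, the gene names, the cell line identifiers and the values are all made up for illustration; the package provides its own function, ccr.get.nonExpGenes, whose behaviour is not reproduced here.

```r
# Synthetic FPKM matrix: genes on rows, cell lines on columns (made-up values)
fpkms <- data.frame(c(12.3, 0.01, 0), c(0.02, 5.4, 0.03),
                    row.names = c("GENE_A", "GENE_B", "GENE_C"))
colnames(fpkms) <- c("CL_0001", "CL_0002")   # made-up cell line identifiers

# Hypothetical helper: genes below the FPKM threshold in one cell line
non_expressed <- function(fpkms, cell_line, th = 0.05) {
  rownames(fpkms)[fpkms[[cell_line]] < th]
}

ne1 <- non_expressed(fpkms, "CL_0001")
ne2 <- non_expressed(fpkms, "CL_0002")
print(ne1)
print(ne2)
```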

Usage

data(RNAseq.fpkms)

Format

A data frame with one observation per gene and one variable per cell line. Row names indicate HGNC symbols and column names indicate cell line COSMIC identifiers [3].

References

[1] Garcia-Alonso L, Iorio F, Matchan A, et al. Transcription factor activities enhance markers of drug response in cancer. doi: https://doi.org/10.1101/129478
[2] Iorio, F., Behan, F. M., Goncalves, E., Beaver, C., Ansari, R., Pooley, R., et al. (n.d.). Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. http://doi.org/10.1101/228189
[3] Forbes SA, Beare D, Boutselakis H, et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Research, Volume 45, Issue D1, 4 January 2017, Pages D777-D783.

See Also

CL.subset

Examples

data(RNAseq.fpkms)
head(RNAseq.fpkms)