CRISPhieRmix Manual CRISPhie Rmix
CRISPhieRmixManual
User Manual:
Open the PDF directly: View PDF .
Page Count: 23
Download | ![]() |
Open PDF In Browser | View PDF |
CRISPhieRmix Manual Timothy Daley 8/2/2018 Contents 1 Introduction to CRISPhieRmix 2 2 Input parameters for CRISPhieRmix 2.1 Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Return . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3 3 3 Examples 3.1 A CRISPRi dropout screen . . . . . . . . . 3.2 Checking the negative control distribution . 3.3 Using a normal hierarchical mixture to rank 3.4 Estimating gene effect sizes after filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . genes without negative control guides . . . . . . . . . . . . . . . . . . . . . . Contents 1 . . . . . . . . . . . . . . . . . . . . . . . . 4 4 11 12 17 1 Introduction to CRISPhieRmix CRISPhieRmix is an R package for analysing large CRISPR interference and activation (CRISPRi/a) screens. CRISPhieRmix uses a hierarchical mixture model approach to identify genes that are unlikely to follow the null distribution. CRISPhieRmix assumes that genes follow a mixture of null genes (genes with no effect) and non-null genes (genes with a significant effect). All guides from null genes follow a common null distribution. The guides for the non-null genes, on the other hand, are assumed to follow a mixture distribution, where some guides are ineffective (possibly with little or no change in the target gene expression) and follow the null distribution and some guides have an effect and follow an alternative distribution. CRISPhieRmix can be used with out without negative control guides. We find that negative control guides help to better model the null distribution, as in our experience the null distribution tends to have long tails, and this helps to control the false discovery rate. A critical assumption is that the negative control guides accurcately reflect the null distribution. This assumption can be violated in some screens, and we will discuss this later in depth. If you have any issues with software, please create an issue in the github page https://github.com/timydaley/ CRISPhieRmix. If you have any questions, please email me at tdaley@stanford.edu. 2 2 Input parameters for CRISPhieRmix 2.1 Input • x log2 fold changes of guides targeting genes (required) • geneIds gene ids corresponding to x (required) • negCtrl log2 fold changes of negative control guides; if negCtrl is not included then CRISPhieRmix uses a normal hierachical mixture model • max_iter maximum number of iterations for EM algorithm, default = 100 • tol tolerance for convergence of EM algorithm, default = 1e-10 • pq initial value of p·q, default = 0.1 • mu initial value of mu for the interesting genes, default = -4 • sigma initial value of sigma for the interesting genes, default = 1 • nMesh the number of points to use in numerical integration of posterior probabilities, default = 100 • BIMODAL boolean variable for BIMODAL mode to fit both positive and negative sides, used for cases such as Jost et al. 2017 where both sides are of interest. • VERBOSE boolean variable for VERBOSE mode, default = FALSE • PLOT boolean variable to produce plots, default = FALSE 2.2 Return • mixFit a list containing the mixture fit for x • genes a vector of genes from geneIds, in the same order as the following return values • locfdr a vector of local false discovery rates, the posterior probablity a gene is null in the same order as genes • FDR a vector of global false discovery rates in the same order as genes • genePosteriors a vector of posterior probabilities that the genes are non-null (equal to 1 - locfdr), in the same order as genes; when BIMODAL is set to TRUE then both negGenePosteriors and posGenePosteriors are returned that give the posterior probability that a gene is negative and positive, respectively 3 3 3.1 Examples A CRISPRi dropout screen Gilbert et al. 2014 performed genome wide screens for ricin resistance and susceptibility. In table 2, sheet 1 the authors provide the results of the CRISPRi screen at the sgRNA level. This includes the growth phenotype gamma and the ricin phenotype rho. They were interested in the ricin phenotype, but we will look at the growth phenotype to identify genes that lead to decreased growth and can thus be considered essential. In the CRISPhieRmix paper we used log2 fold changes computed by DESeq2 for a fairer comparison of algorithms (since other algorithms such as MAGeCK work directly on the counts), but here we will use the scores computed by Gilbert et al to show the flexibility of the CRISPhieRmix approach. Gilbert2014Table2CRISPRi = read.table(file = "Gilbert2014Table2CRISPRi.txt", sep = "\t", header = TRUE) head(Gilbert2014Table2CRISPRi) ## gene sgRNA.ID Transcripts.targeted Protospacer.sequence ## 1 A1BG A1BG-1 all GCAAGAGAAAGACCACGAGCA ## 2 A1BG A1BG-10 all GCGGGAACAGGAGCCTTACGG ## 3 A1BG A1BG-2 all GTCTGCAGCAATGAGGCCCCA ## 4 A1BG A1BG-3 all GCAGCCATATGTGAGTGCAG ## 5 A1BG A1BG-4 all GACATGATGGTCGCGCTCACTC ## 6 A1BG A1BG-5 all GAATGGTGGGCCCAGGCCGGG ## Growth.phenotype..gamma. CTx.DTA.phenotype..rho. ## 1 -0.008127700 0.011324965 ## 2 -0.006738800 0.011671726 ## 3 -0.009591072 -0.028942999 ## 4 -0.017039814 0.046149024 ## 5 -0.004868127 0.004907662 ## 6 -0.012730560 -0.046096108 # identify the negative control guides head(sort(table(Gilbert2014Table2CRISPRi$gene), decreasing = TRUE)) ## ## negative_control ## 11219 ## STON1 ## 45 HLA 193 SSPO 44 NKX2 85 C1QTNF9B 45 geneTargetingGuides = which(Gilbert2014Table2CRISPRi$gene != "negative_control") negCtrlGuides = which(Gilbert2014Table2CRISPRi$gene == "negative_control") gamma = Gilbert2014Table2CRISPRi$Growth.phenotype..gamma.[geneTargetingGuides] negCtrl = Gilbert2014Table2CRISPRi$Growth.phenotype..gamma.[negCtrlGuides] geneIds = Gilbert2014Table2CRISPRi$gene[geneTargetingGuides] # need to remove the negative_control factor geneIds = factor(geneIds, levels = unique(geneIds)) x = data.frame(gamma = c(gamma, negCtrl), category = c(rep("gene targeting", times = length(gamma)), rep("negative control", times = length(negCtrl)))) x$category = factor(x$category, levels = c("negative control", "gene targeting")) library(ggplot2) ggplot(x, aes(x = gamma, colour = category)) + geom_density() + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.line = element_line 4 density 30 category 20 negative control gene targeting 10 0 −0.5 0.0 gamma We see from the figure above that most of the gene targeting guides follow the same distribution as the negative control guides, but with a longer tail on the negative end. This is the signal due to cells dying as a result of the gene effects. We can now apply CRISPhieRmix to this data to identify genes that led to decreased growth or cell death. library(CRISPhieRmix) gamma.CRISPhieRmix = CRISPhieRmix(x = gamma, geneIds = geneIds, negCtrl = negCtrl, mu = -0.2, sigma = 0.1, VERBOSE = TRUE, PLOT = TRUE) 5 20 15 0 5 10 Density 25 30 negative control fit −0.8 −0.6 −0.4 −0.2 negCtrl ## ## ## ## ## ## fit negative control distributions 2 groups EM converged mu = -0.1575749 sigma = 0.1315886 pq = 0.04938287 6 0.0 0.2 0.4 15 10 0 5 Density 20 25 mixture fit to observations −1.0 −0.5 0.0 0.5 x hist(gamma.CRISPhieRmix$FDR, breaks = 100) 200 400 600 800 0 Frequency 1200 Histogram of gamma.CRISPhieRmix$FDR 0.0 0.2 0.4 0.6 gamma.CRISPhieRmix$FDR sum(gamma.CRISPhieRmix$FDR < 0.1) ## [1] 1373 7 0.8 The peak near zero indicates the genes that are likely to be essential. Let’s look at how much these genes overlap the gold standard list of core constitutive genes and non-essential genes of Hart et al. 2014. ConstitutiveCoreEssentialGenes = scan("ConstitutiveCoreEssentialGenes.txt", what = character()) length(ConstitutiveCoreEssentialGenes) ## [1] 217 length(intersect(ConstitutiveCoreEssentialGenes, gamma.CRISPhieRmix$genes[which(gamma.CRISPhieRmix$FDR < 0.1)])) ## [1] 170 NonEssentialGenes = scan("NonEssentialGenes.txt", what = character()) length(NonEssentialGenes) ## [1] 927 length(intersect(NonEssentialGenes, gamma.CRISPhieRmix$genes[which(gamma.CRISPhieRmix$FDR < 0.1)])) ## [1] 3 From this we can estimate the empirical false discovery rate at FDR level 0.1 as 3/1373 ≈ 0.002 and an empirical true positive rate as 170/217 ≈ 0.78. The empirical vs estimated FDR curve is plotted below. EssentialGenes = data.frame(gene = factor(c(sapply(ConstitutiveCoreEssentialGenes, toString), sapply(NonEssentialGenes, toString))), essential = c(rep(1, times = length(ConstitutiveCoreEssentialGenes)), rep(0, times = length(NonEssentialGenes)))) EssentialGenes = EssentialGenes[which(EssentialGenes$gene %in% gamma.CRISPhieRmix$genes), ] gamma.CRISPhieRmixEssential = data.frame(genes = gamma.CRISPhieRmix$genes, FDR = gamma.CRISPhieRmix$FDR) gamma.CRISPhieRmixEssential = gamma.CRISPhieRmixEssential[which(gamma.CRISPhieRmixEssential$genes %in% E gamma.CRISPhieRmixEssential = gamma.CRISPhieRmixEssential[match(EssentialGenes$gene, gamma.CRISPhieRmixE fdr.curve <- function(thresh, fdrs, baseline){ w = which(fdrs < thresh) if(length(w) > 0){ return(sum(1 - baseline[w])/length(w)) } else{ return(NA) } } s = seq(from = 0, to = 1, length = 1001) gamma.CRISPhieRmixFdrCurve = sapply(s, function(t) fdr.curve(t, gamma.CRISPhieRmixEssential$FDR, EssentialGenes$essential)) plot(c(0, s[!is.na(gamma.CRISPhieRmixFdrCurve)]), c(0, gamma.CRISPhieRmixFdrCurve[!is.na(gamma.CRISPhieR ylab = "empirical FDR", main = "Estimated vs Empirical Fdr", xlim = c(0, 1), ylim = c(0, 1), lwd = 2, col = "deepskyblue") abline(0, 1) 8 0.6 0.4 0.0 0.2 empirical FDR 0.8 1.0 Estimated vs Empirical Fdr 0.0 0.2 0.4 0.6 0.8 1.0 estimated FDR The receiver operator curve based on the above annotated genes is given below. gamma.CRISPhieRmixROC = pROC::roc(EssentialGenes$essential, gamma.CRISPhieRmixEssential$FDR, auc = TRUE) gamma.CRISPhieRmixROC ## ## ## ## ## ## Call: roc.default(response = EssentialGenes$essential, predictor = gamma.CRISPhieRmixEssential$FDR, auc Data: gamma.CRISPhieRmixEssential$FDR in 455 controls (EssentialGenes$essential 0) > 217 cases (Essen Area under the curve: 0.9259 plot(gamma.CRISPhieRmixROC, col = "deepskyblue", lwd = 2, xlim = c(0, 1), ylim = c(0, 1), main = "ROC") 9 0.0 0.2 0.4 0.6 Sensitivity 0.8 1.0 ROC 0.0 0.5 Specificity 10 1.0 3.2 Checking the negative control distribution The data shown here is taken from Wang et al, 2015. The authors performed a genome-wide screen for essential genes. Wang2015counts = read.table(file = "aac7041_SM_Table_S2.txt", sep = "\t", header = TRUE) Wang2015counts = Wang2015counts[-which(rowSums(Wang2015counts[ ,-c(1)]) == 0), ] which.negCtrl = which(startsWith(sapply(Wang2015counts$sgRNA, toString), "CTRL")) geneIds = sapply(Wang2015counts$sgRNA[-which.negCtrl], function(g) unlist(strsplit(toString(g), split = "_"))[1]) geneIds = sapply(geneIds, function(g) substring(g, first = 3)) geneIds = factor(geneIds, levels = unique(geneIds)) counts = Wang2015counts[ , -c(1)] colData = data.frame(cellType = sapply(colnames(counts), function(x) unlist(strsplit(toString(x), split = ".", fixed = TRUE))[1]), condition = factor(rep(c(0, 1), times = 5))) rownames(colData) = colnames(counts) Wang2015DESeq = DESeq2::DESeqDataSetFromMatrix(countData = counts, colData = colData, design = ~ condition) Wang2015DESeq = DESeq2::DESeq(Wang2015DESeq) ## estimating size factors ## estimating dispersions ## gene-wise dispersion estimates ## mean-dispersion relationship ## final dispersion estimates ## fitting model and testing Wang2015DESeq = DESeq2::results(Wang2015DESeq) log2fc = Wang2015DESeq$log2FoldChange log2fc.negCtrl = log2fc[which.negCtrl] log2fc.geneTargeting = log2fc[-which.negCtrl] library(ggplot2) x = data.frame(log2fc = log2fc, category = c(rep("negative control", times = length(which.negCtrl)), rep("gene targeting", times = length(log2fc.geneTargeting)) ggplot(x, aes(x = log2fc, colour = category)) + geom_density() + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.line = element_line 11 1.5 1.0 density category gene targeting negative control 0.5 0.0 −10 −5 0 5 log2fc As we can see, the distribution of the negative control guides do not line up with the central peak of the gene targeting guides. This indicates that the distribution of the negative control guides does not reflect the null genes. In this case, using the negative control guides to estimate the null will likely lead to very incorrect inferences and is not suggested. It’s unclear why the negative control guides do not look like the central peak. If this is the case in your experiment, it is worth investigating why. 3.3 Using a normal hierarchical mixture to rank genes without negative control guides We’ll look at ranking the genes without negative control guides for this data. Wang2015NormalMix = CRISPhieRmix(x = log2fc.geneTargeting, geneIds = geneIds, PLOT = TRUE, VERBOSE = TRU ## no negative controls provided, fitting hierarchical normal model ## number of iterations= 52 12 0.4 0.0 0.2 Density 0.6 0.8 Density Curves −10 −5 0 5 Data Wang2015NormalMixGeneScores = data.frame(genes = Wang2015NormalMix$genes, FDR = Wang2015NormalMix$FDR) head(Wang2015NormalMixGeneScores[order(Wang2015NormalMixGeneScores$FDR, decreasing = FALSE), ], 20) ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## genes FDR 7 AAAS 0 18 AAMP 0 22 AARS 0 23 AARS2 0 26 AASDHPPT 0 28 AATF 0 49 ABCB7 0 51 ABCB9 0 67 ABCE1 0 68 ABCF1 0 78 ABHD11 0 86 ABHD16B 0 90 ABHD2 0 100 ABL1 0 102 ABLIM1 0 108 ABT1 0 117 ACAD8 0 130 ACBD3 0 137 ACD 0 138 ACE 0 hist(Wang2015NormalMixGeneScores$FDR, breaks = 100, col = "grey") 13 3000 0 1000 Frequency 5000 Histogram of Wang2015NormalMixGeneScores$FDR 0.0 0.1 0.2 0.3 0.4 0.5 Wang2015NormalMixGeneScores$FDR sum(Wang2015NormalMixGeneScores$FDR == 0) ## [1] 2777 EssentialGenes = data.frame(gene = factor(c(sapply(ConstitutiveCoreEssentialGenes, toString), sapply(NonEssentialGenes, toString))), essential = c(rep(1, times = length(ConstitutiveCoreEssentialGenes)), rep(0, times = length(NonEssentialGenes)))) EssentialGenes = EssentialGenes[which(EssentialGenes$gene %in% Wang2015NormalMixGeneScores$genes), ] Wang2015NormalMixGeneScoresEssential = Wang2015NormalMixGeneScores[which(Wang2015NormalMixGeneScores$gen Wang2015NormalMixGeneScoresEssential = Wang2015NormalMixGeneScoresEssential[match(EssentialGenes$gene, W Wang2015NormalMixGeneScoresEssentialFdrCurve = sapply(s, function(t) fdr.curve(t, Wang2015NormalMixGeneScoresEssential EssentialGenes$essential)) plot(c(0, s[!is.na(Wang2015NormalMixGeneScoresEssentialFdrCurve)]), c(0, Wang2015NormalMixGeneScoresEsse ylab = "empirical FDR", main = "Estimated vs Empirical Fdr", xlim = c(0, 1), ylim = c(0, 1), lwd = 2, col = "deepskyblue") abline(0, 1) 14 0.6 0.4 0.0 0.2 empirical FDR 0.8 1.0 Estimated vs Empirical Fdr 0.0 0.2 0.4 0.6 0.8 1.0 estimated FDR Wang2015NormalMixGeneScoresEssentialROC = pROC::roc(EssentialGenes$essential, Wang2015NormalMixGeneScoresEssential$FDR, auc = TRUE) Wang2015NormalMixGeneScoresEssentialROC ## ## ## ## ## ## Call: roc.default(response = EssentialGenes$essential, predictor = Wang2015NormalMixGeneScoresEssential$FDR Data: Wang2015NormalMixGeneScoresEssential$FDR in 771 controls (EssentialGenes$essential 0) > 216 cas Area under the curve: 0.9555 plot(Wang2015NormalMixGeneScoresEssentialROC, col = "darkgreen", lwd = 2, xlim = c(0, 1), ylim = c(0, 1) 15 0.6 0.4 0.0 0.2 Sensitivity 0.8 1.0 ROC 0.0 0.5 1.0 Specificity As we can see, the normal hierarchical mixture model is calling way too many genes as significant, but does a good job at ranking the essential genes ahead of the non-essential genes. We see that it calls way too many genes are essential at any level of FDR. This will be problematic for most applications, as we can’t test all the genes called positive. One strategy to use a broader tailed that is estimated from the central peak, as discussed in Efron, 2012. The idea is to use a semi-parametric distribution to estimate the null distribution from the central peak. We have found that this does not well approximate the negative control distribution because of the long tails, but it does a better job than the normal distribution. We have also found that for some data it is very difficult to get a approximation of the central peak, so we have not implemented this in the package. We show it here for the interest of the reader. 16 3.4 Estimating gene effect sizes after filtering One issue with CRISPhieRmix is that we only rank genes and compute false discovery rates. Unlike MAGeCK MLE, CRISPhieRmix is currently not able to compute gene effect sizes. The major problem is the identifiability issue. With off-targets and gene to gene variation in guide efficiency, it is difficult to consistently estimate gene effect sizes. But if the genes are first chosen to be likely non-null and then gene effect sizes are estimated, then it is relatively easy to estimate gene effect sizes. Below I will show one way of how using Rstan and setting informative priors from the full analysis, though there are other ways. # redefine geneIds geneIds = Gilbert2014Table2CRISPRi$gene[geneTargetingGuides] # need to remove the negative_control factor geneIds = factor(geneIds, levels = unique(geneIds)) gamma.CRISPhieRmix$mixFit$mu ## [1] -0.1575749 gamma.CRISPhieRmix$mixFit$sigma ## [1] 0.1315886 gamma.CRISPhieRmix$mixFit$pq ## [1] 0.04938287 topGenes = gamma.CRISPhieRmix$genes[which(gamma.CRISPhieRmix$FDR < 0.01)] length(topGenes)/length(gamma.CRISPhieRmix$genes) ## [1] 0.05940469 x = data.frame(gamma = c(gamma[which(geneIds %in% topGenes)], negCtrl), category = c(rep("gene targeting", times = sum(geneIds %in% topGenes)), rep("negative control", times = length(negCtrl)))) x$category = factor(x$category, levels = c("negative control", "gene targeting")) ggplot(x, aes(x = gamma, fill = category)) + geom_density(alpha = 0.6) + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.line = element_line 17 density 30 category 20 negative control gene targeting 10 0 −0.9 −0.6 −0.3 0.0 0.3 gamma topGenesGammaMixFit = mixtools::normalmixEM(gamma[which(geneIds %in% topGenes)], k = 2, mu = c(0, -0.15) ## number of iterations= 55 hist(gamma[which(geneIds %in% topGenes)], breaks = 50, probability = TRUE, xlab = "gamma", main = "Histo s = seq(from = -2, to = 2, length = 1001) lines(s, topGenesGammaMixFit$lambda[1]*dnorm(s, mean = topGenesGammaMixFit$mu[1], sd = topGenesGammaMixFit$sigma[1]), col = "red", lwd = 2) lines(s, topGenesGammaMixFit$lambda[2]*dnorm(s, mean = topGenesGammaMixFit$mu[2], sd = topGenesGammaMixFit$sigma[2]), col = "deepskyblue", lwd = 2) lines(s, topGenesGammaMixFit$lambda[1]*dnorm(s, mean = topGenesGammaMixFit$mu[1], sd = topGenesGammaMixFit$sigma[1]) + topGenesGammaMixFit$lambda[2]*dnorm(s, mean = topGenesGammaMixFit$mu[2], sd = topGenesGammaMixFit$sigma[2]), col = "darkviolet", lwd = 2, lty = 2) 18 6 0 2 4 Density 8 10 Histogram of top genes −0.8 −0.6 −0.4 −0.2 0.0 0.2 gamma Using a normal distribution as the null distribution likely overestimates the gene effects. The skew-t can be used by storing the probabilities in a look-up table, but this is much slower. For this example, we will use a normal distribution for the null. We’re going to assume the following hierarchical model with informative priors and fit it in stan. γi ∼ (1 − πgi )N (0, 0.0152 ) + πgi N (µgi , σ12 ); µg ∼ N (−0.175, σ22 ); πg ∼ Uniform(0, 1); σ1 , σ2 ∼ N+ (0, 0.52 ). model <-' data { intn_sgRNAs; int n_genes; real x[n_sgRNAs]; // gamma int gene_ids[n_sgRNAs]; } parameters { real mu_g[n_genes]; // gene effect sizes real pi_g[n_genes]; // mixing parameters real sigma1; // guide-level sd real sigma2; // gene-level sd } model{ sigma1 ~ normal(0, 0.5); sigma2 ~ normal(0, 0.5); mu_g ~ normal(-0.175, sigma2); pi_g ~ beta(1, 1); for(i in 1:n_sgRNAs){ target += log_mix(pi_g[gene_ids[i]], 19 normal_lpdf(x[i] | mu_g[gene_ids[i]], sigma1), normal_lpdf(x[i] | 0, 0.015)); } } ' library(rstan) options(mc.cores = 2) gene_ids = geneIds[which(geneIds %in% topGenes)] gene_ids = factor(gene_ids, levels = unique(gene_ids)) sgRNAdata = list(n_sgRNAs = length(gene_ids), n_genes = length(levels(gene_ids)), x = gamma[which(geneIds %in% topGenes)], gene_ids = as.numeric(gene_ids)) gammaHierMixFit = stan(model_code = model, data = sgRNAdata, chains = 4, iter = 1000); ## Warning: There were 46 transitions after warmup that exceeded the maximum treedepth. Increase max_tre ## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded ## Warning: Examine the pairs() plot to diagnose sampling problems gammaHierMixFit.stan.summary = summary(gammaHierMixFit, probs = c(0.05, 0.5, 0.95))$summary # trace plot for sigmas bayesplot::mcmc_trace(as.array(gammaHierMixFit), pars = c("sigma1", "sigma2")) sigma1 sigma2 0.070 0.122 Chain 0.065 0.120 1 2 3 4 0.118 0.060 0.116 0.055 0 100 200 300 400 500 0 100 200 300 400 gammaHierMixFit.stan.summary = data.frame(parameter = as.factor(c(paste0("mu[", levels(gene_ids), "]"), paste0("pi[", levels(gene_ids), "]"), 20 500 "sigma1", "sigma2", "lp__")), gammaHierMixFit.stan.summary) rownames(gammaHierMixFit.stan.summary) = c() tail(gammaHierMixFit.stan.summary, 3) ## ## ## ## ## ## ## ## parameter mean se_mean sigma1 1.190777e-01 2.551200e-05 sigma2 6.202982e-02 5.731784e-05 lp__ 1.366123e+04 1.388832e+00 X50. X95. n_eff 1889 1.190947e-01 1.209164e-01 2000.0000 1890 6.199475e-02 6.580674e-02 1588.7728 1891 1.366357e+04 1.371656e+04 666.3221 1889 1890 1891 sd X5. 0.001140931 1.171315e-01 0.002284655 5.826072e-02 35.850207884 1.360002e+04 Rhat 0.999373 1.000609 1.002904 hist(gammaHierMixFit.stan.summary$Rhat, breaks = 50, main = "histogram of Rhat") 80 60 0 20 40 Frequency 120 histogram of Rhat 0.998 0.999 1.000 1.001 1.002 1.003 gammaHierMixFit.stan.summary$Rhat gammaIndices = grep("mu", gammaHierMixFit.stan.summary$parameter) x = gammaHierMixFit.stan.summary$X50.[gammaIndices] hist(x, breaks = 100, xlim = c(min(x), 0), col = "grey", main = "estimated gene effects") abline(v = -0.175, lwd = 2, lty = 2) 21 30 20 0 10 Frequency 40 estimated gene effects −0.35 −0.30 −0.25 −0.20 −0.15 −0.10 −0.05 0.00 x mixingIndices = grep("pi", gammaHierMixFit.stan.summary$parameter) x = gammaHierMixFit.stan.summary$X50.[mixingIndices] hist(x, breaks = 100, xlim = c(0, 1), col = "grey", main = "estimated mixing parameter") abline(v = 0.5, lwd = 2, lty = 2) 20 15 10 5 0 Frequency 25 30 estimated mixing parameter 0.0 0.2 0.4 0.6 x 22 0.8 1.0 y = data.frame(genes = levels(gene_ids), geneGamma = gammaHierMixFit.stan.summary$X50.[gammaIndices], geneGammaLowerCI = gammaHierMixFit.stan.summary$X5.[gammaIndices], geneGammaUpperCI = gammaHierMixFit.stan.summary$X95.[gammaIndices], geneMixing = gammaHierMixFit.stan.summary$X50.[mixingIndices], geneMixingLowerCI = gammaHierMixFit.stan.summary$X5.[mixingIndices], geneMixingUpperCI = gammaHierMixFit.stan.summary$X95.[mixingIndices]) library(ggplot2) ## Warning: package 'ggplot2' was built under R version 3.4.4 ggplot(y, aes(x = geneGamma, y = geneMixing)) + geom_errorbar(aes(ymin = geneMixingLowerCI, ymax = geneMixingUpperCI), colour = "grey", alpha = 0.35) geom_errorbarh(aes(xmin = geneGammaLowerCI, xmax = geneGammaUpperCI), colour = "grey", alpha = 0.35) + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.line = geom_point(alpha = 0.75) + xlab("gamma") + ylab("pi") + ggtitle("gene level effect size vs guide effic gene level effect size vs guide efficiency 1.00 pi 0.75 0.50 0.25 −0.4 −0.3 −0.2 −0.1 0.0 gamma The trace plots and Rhat’s near 1 means the model is likely converged. We see that when the gene effect size is far away from zero, mixing parameters can vary a lot. On the other hand, when the gene effect size is near zero it is more difficult to find genes with low mixing parameters. This is likely because genes with small effects and low mixing parameters are chosen by CRISPhieRmix to less likely be null. 23
Source Exif Data:
File Type : PDF File Type Extension : pdf MIME Type : application/pdf PDF Version : 1.5 Linearized : No Page Count : 23 Page Mode : UseOutlines Author : Timothy Daley Title : CRISPhieRmix Manual Subject : Creator : LaTeX with hyperref package Producer : pdfTeX-1.40.16 Create Date : 2018:11:13 18:01:12-08:00 Modify Date : 2018:11:13 18:01:12-08:00 Trapped : False PTEX Fullbanner : This is pdfTeX, Version 3.14159265-2.6-1.40.16 (TeX Live 2015) kpathsea version 6.2.1EXIF Metadata provided by EXIF.tools