Manual
User Manual:
Open the PDF directly: View PDF .
Page Count: 6
Download | |
Open PDF In Browser | View PDF |
Utilizing fit∂a∂i: a short how-to guide Bernard Kim bernard.kim@ucla.edu Updated: November 8, 2018 1 Introduction This short guide outlines how to utilize the code we have made available for inferring of the distribution of fitness effects (DFE). This code serves as an add-on module for the package ∂a∂i [1], which is primarily used for demographic inference. The code examples shown here are meant to work with the example dataset. For simplicity’s sake, I have generated an example dataset with PReFerSIM [2]. Furthermore, we will work with a relatively small sample size and simple demographic model so that the examples can be worked through quickly on a laptop. Lastly, all the example code is provided in the example.py script as well as in this document. Another important thing to note: ∂a∂i characterizes genotype fitnesses as: 1, 1 + 2sh, and 1 + 2s, where 1 + 2sh is the fitness of the heterozygote. Furthermore, the DFEs inferred are scaled in terms of the ancestral population size: γ = 2NA s. This means that the selection coefficients must sometimes be rescaled, for instance when using the program SLiM [3]. If you have any additional questions, please feel free to email us: Bernard Kim (bernard.kim@ucla.edu) or Kirk Lohmueller (klohmueller@ucla.edu). 2 Installation ∂a∂i can be downloaded from the Gutenkunst Lab’s BitBucket site. Once you have ∂a∂i, there are two easy ways to utilize the fit∂a∂i module. The first method allows you to set up fit∂a∂i so that it is installed with the ∂a∂i code. To do this, you copy Selection.py and init .py into the dadi directory before installing ∂a∂i with python setup.py install. Note, it may be useful to comment out the line import matplotlib in init .py if you are trying to use ∂a∂i on a cluster. Alternately, you can import our code separately. To do this, you should install ∂a∂i in the standard manner. Copy Selection.py into your working directory. Then, our module can be loaded by including the line import Selection.py into your ∂a∂i/Python script. The latter method is the way in which the example script is set up: 1 2 3 import numpy import d a d i import S e l e c t i o n Note that in order to run ∂a∂i with the example script you must also copy Selection.py into the same directory as the example script. 1 3 Example dataset The example dataset used in the example script was generated with forward simulations under the PRF model, with the simulation program PReFerSIM. Additionally, we will assume we know the true underlying demographic model rather than trying to fit one. This dataset is summarized with a site frequency spectrum, has sample size 2n = 250 (125 diploids), and is saved in the file sample.sfs file. It was generated with a single size change demography and an underlying gamma DFE. Specifically, a population of size N = 10, 000 diploids expands to 20, 000 diploids 1000 generations ago and the gamma DFE has shape parameter 0.186 and scale paramter 686.7. This is the same gamma DFE that we inferred from the 1000 Genomes EUR dataset, but the scale parameter has been rescaled to the ancestral population size of 10,000 diploids. Finally, the amount of diversity in the sample dataset matches θN S = 4000 = 4NA µLN S . 4 Demographic inference Because the usage of ∂a∂i for demographic inference is extensively documented, it will not be discussed in detail here. In practice, we find that, as long as the demographic model that fits the synonymous sites reasonably well also works well for inference of the DFE. Briefly, we fit a demographic model to synonymous sites, which are assumed to be evolving in a neutral or nearly neutral manner. We believe this accurately represents the effects of linked selection and population structure, and condition upon this demographic model to fit the DFE. However, note the assumption of neutral synonymous variants may not hold for species with large population sizes, since this will increase the efficacy of selection on mutations with small fitness effects. Our sample dataset was generated with a two epoch (single size change) demography. Although we are skipping the demographic inference, the following ∂a∂i function describes a two epoch demographic model. 1 2 3 4 5 6 7 d e f two epoch ( params , ns , p t s ) : nu , T = params xx = d a d i . Numerics . d e f a u l t g r i d ( p t s ) p h i = d a d i . PhiManip . phi 1D ( xx ) p h i = d a d i . I n t e g r a t i o n . one pop ( phi , xx , T, nu ) f s = d a d i . Spectrum . f r o m p h i ( phi , ns , ( xx , ) ) return f s We will assume we infer a 2-fold population expansion 0.05*2NA generations ago, where NA is the ancestral population size. Therefore, the parameter vector is: [nu, T]. 1 demog params = [ 2 , 0 . 0 5 ] Again, we assume that the population scaled nonsynonymous mutation rate, θN S = 4, 000. In practice, we compute the synonymous mutation rate, θS , by using the multinomial likelihood to fit the demographic model. Because this method only fits the proportional SFS, θS is estimated with the dadi.Inference.optimal sfs scaling method. Then, we multiply θS by 2.31 to get θN S , θS ∗ 2.31 = θN S . Remember that our sample size is 125 diploids (250 chromosomes). 1 2 theta ns = 4000. ns = numpy . a r r a y ( [ 2 5 0 ] ) 2 5 Pre-computing of the SFS for many γ Next, we must generate frequency spectra for a range of gammas. The demographic function is modified to allow for a single selection coefficient. Here, each selection coefficient is scaled with the ancestral population size, γ = 2NA s. In other words, if s is constant, the same gamma should be used throughout the demographic function. If gamma=0, this function is the same as the original demographic function. 1 2 3 4 5 6 7 d e f t w o e p o c h s e l ( params , ns , p t s ) : nu , T, gamma = params xx = d a d i . Numerics . d e f a u l t g r i d ( p t s ) p h i = d a d i . PhiManip . phi 1D ( xx , gamma=gamma) p h i = d a d i . I n t e g r a t i o n . one pop ( phi , xx , T, nu , gamma=gamma) f s = d a d i . Spectrum . f r o m p h i ( phi , ns , ( xx , ) ) return f s Then, we generate the frequency spectra for a range of gammas. The following code generates expected frequency spectra, conditional on the demographic model fit to synonymous sites, over Npts log-spaced points over the range of int bounds. Additionally, the mp=True argument tells fit∂a∂i whether it should utilize multiple cores/threads, which is convenient since this step takes the longest. If the argument cpus is passed, it will utilize that many cores, but if mp=True and no cpus argument is passed, it will use n-1 threads, where n is the number of threads available. If mp=False, each SFS will be computed in serial. This step should take 1-10 minutes depending on your CPU. 1 2 3 4 p t s l = [ 6 0 0 , 800 , 1000] s p e c t r a = S e l e c t i o n . s p e c t r a ( demog params , ns , t w o e p o c h s e l , p t s l=p t s l , i n t b o u n d s =(1e −5 ,500) , Npts =300 , echo=True , mp=True ) Another manner in which this can be done is multiple log-spaced grids over some predetermined breaks. While this method is almost the same as the previous, this method is more appropriate for discretely binned DFEs. For example, if we were to create discrete bins at the breaks γ = [0.1, 1, 100], we would use the following command, importantly, passing a specific int breaks. 1 2 3 4 5 p t s l = [ 6 0 0 , 800 , 1000] i n t b r e a k s = [ 1 e −4 , 0 . 1 , 1 , 1 0 0 , 5 0 0 ] s p e c t r a = S e l e c t i o n . s p e c t r a ( demog params , ns , t w o e p o c h s e l , p t s l=p t s l , i n t b r e a k s=i n t b r e a k s , Npts =300 , echo=True , mp=True ) Note, one error message that will come up often with very negative selection coefficients is: WARNING:Numerics:Extrapolation may have failed. for unexpected results. Check resulting frequency spectrum One way to fix this is by increasing the pts l grid sizes – this will need to increase as the sample size increases and/or if the integration is done over a range which includes stronger selection coefficients. dadi.Numerics.make extrap func is used to extrapolate the frequency spectra to infinitely many gridpoints, but will sometimes return tiny negative values (often |Xi | < 1e − 50) due to floating point rounding errors. Using dadi.Numerics.make extrap log func will sometimes return Inf values and care should be taken that these numerical errors do not propagate to downstream steps of the analysis. In practice, it seems that the tiny negative values do not affect the integration because they are insignificant, but if the error message appears the SFS should be manually double-checked. Alternately, the small negative values can be manually replaced with 0. 3 In the example, the pre-computed SFSs are saved in the list spectra.spectra. For convenience’s sake, the spectra object can be pickled. 1 2 #import p i c k l e #p i c k l e . dump( s p e c t r a , open ( ” e x a m p l e s p e c t r a . sp ” , ”w” ) ) 6 Fitting a DFE 6.1 Fitting simple DFEs Fitting a DFE is the quickest part of this procedure, especially for simple distributions such as the gamma distribution. If you wish to get an SFS for a specific DFE, you can use the integrate method that is built into the spectra object: spectra.integrate(sel params, sel dist, theta). Another option is to use the spectra.integrate norm method. The former does not normalize the DFE and the second normalizes the DFE. We chose to use the former and assumed that mutations with selection coefficients outside of the range we integrated over were effectively lethal, that is, not seen in the sample. Note that we integrated over the range of gammas corresponding to |s| = [0, 1]. sel params is a list containing the DFE parameters, sel dist is the distribution used for the DFE, and theta is θN S . To compute the expected SFS for our simulations with the true parameter values, we would use spectra.integrate([0.186, 686.7], Selection.gamma dist, 4000.). First, load the sample data: 1 data = d a d i . Spectrum . f r o m f i l e ( ’ example . s f s ’ ) Similar to the way in which vanilla ∂a∂i is used, you should have a starting guess at the parameters. Set an upper and lower bound. Perturb the parameters to select a random starting point, then fit the DFE. This should be done multiple times from different starting points. We use the spectra.integrate methods to generate the expected SFSs during each step of the optimization. The following lines of code fit a gamma DFE to the example data: 1 2 3 4 5 6 7 8 9 sel params = [ 0 . 2 , 1000.] l o w e r b o u n d = [ 1 e −3 , 1 e −2] upper bound = [ 1 , 5 0 0 0 0 . ] p0 = d a d i . Misc . p e r t u r b p a r a m s ( s e l p a r a m s , l o w e r b o u n d=lower bound , upper bound=upper bound ) popt = S e l e c t i o n . o p t i m i z e l o g ( p0 , data , s p e c t r a . i n t e g r a t e , S e l e c t i o n . gamma dist , t h e t a n s , l o w e r b o u n d=lower bound , upper bound=upper bound , v e r b o s e=l e n ( s e l p a r a m s ) , m a x i t e r =30) If this runs correctly, you should infer something close to, but not exactly, the true DFE. The final results will be stored in popt: >>> popt [-678.338 , array([ 0.187071 , 666.092 ])] The expected SFS at the MLE can be computed with: 1 m o d e l s f s = s p e c t r a . i n t e g r a t e ( popt [ 1 ] , S e l e c t i o n . gamma dist , t h e t a n s ) 4 6.2 Fitting complex DFEs Fitting complex distributions is similar to fitting simple DFEs, but requires a few additional steps, outlined in the following lines of code. Here, we are fitting a neutral+gamma DFE to an SFS generated under a gamma DFE just for the sake of the example. Additionally, we assume that every selection coefficient γ < 1e − 4 is effectively neutral. Since this is a mixture of two distributions, we infer the proportion of neutral mutations, pneu , and assume the complement of that (i.e. 1 − pneu ) is the proportion of new mutations drawn from a gamma distribution. Therefore, the parameter vector is: [pneu ,shape, scale]. 1 2 3 4 5 6 7 8 d e f neugamma (mgamma, pneu , alpha , b e t a ) : mgamma=−mgamma #assume a n y t h i n g with gamma<1e−4 i s n e u t r a l i f ( 0 <= mgamma) and (mgamma < 1 e −4) : r e t u r n pneu / ( 1 e −4) + (1−pneu ) ∗ S e l e c t i o n . gamma dist(−mgamma, alpha , beta ) else : r e t u r n S e l e c t i o n . gamma dist(−mgamma, alpha , b e t a ) ∗ (1−pneu ) Then, the custom DFE needs to be vectorized. This is easily done with the numpy.frompyfunc function. 1 neugamma vec = numpy . frompyfunc ( neugamma , 4 , 1 ) Fit the DFE as before, accounting for the extra parameter to describe the proportion of neutral new mutations. Note that pneu is explicitly bounded to be 0 < pneu ≤ 1. 1 2 3 4 5 6 7 8 9 sel params = [ 0 . 2 , 0.2 , 1000.] l o w e r b o u n d = [ 1 e −3 , 1 e −3, 1 e −2] upper bound = [ 1 , 1 , 5 0 0 0 0 . ] p0 = d a d i . Misc . p e r t u r b p a r a m s ( s e l p a r a m s , l o w e r b o u n d=lower bound , upper bound=upper bound ) popt = S e l e c t i o n . o p t i m i z e l o g ( p0 , data , s p e c t r a . i n t e g r a t e , neugamma vec , t h e t a n s , l o w e r b o u n d=lower bound , upper bound=upper bound , v e r b o s e=l e n ( s e l p a r a m s ) , m a x i t e r =30) If this has run properly, your result should look like the following: >>> popt [-678.35389753002551, array([ 1.10498940e-03, 1.87041580e-01, 6.71354664e+02])] Another way to approach this problem is by defining the neutral+gamma DFE as a function of four parameters instead of assuming the two are complementary. While this is unnecessary for the neutral+gamma distribution since we bounded pneu to be between 0 and 1, it becomes necessary for mixtures of three or more distributions. However, the principles will be the same as shown here. First, define the neutral+gamma DFE as a function with parameter vector: [pneu ,pgamma ,shape, scale]. 1 2 3 4 5 6 7 d e f neugamma (mgamma, pneu , pgamma , alpha , b e t a ) : mgamma=−mgamma #assume a n y t h i n g with gamma<1e−4 i s n e u t r a l i f ( 0 <= mgamma) and (mgamma < 1 e −4) : r e t u r n pneu / ( 1 e −4) + pgamma∗ S e l e c t i o n . gamma dist(−mgamma, alpha , b e t a ) else : r e t u r n S e l e c t i o n . gamma dist(−mgamma, alpha , b e t a ) ∗ pgamma 5 Here we do not explicitly set pneu +pgamma = 1, so we need to apply some additional constraints to enforce this. A function that equals 0 when the constraint is satisfied should be used: 1 2 def consfunc (x , ∗ args ) : r e t u r n 1−sum ( x [ 0 : − 2 ] ) In other words, the constraint is satisfied when sum([pneu ,pgamma ,shape, scale][0:-2])=1, that is, when sum([pneu ,pgamma ])=1. Then, use the function Selection.optimize cons to fit the DFE. This function is the same as Selection.optimize except that it requires the constraint function to be passed as an additional argument. 1 neugamma vec = numpy . frompyfunc ( neugamma , 5 , 1 ) 2 3 4 5 sel params = [ 0 . 2 , 0.8 , 0.2 , 1000.] l o w e r b o u n d = [ 1 e −3 , 1 e −3, 1 e −3, 1 e −2] upper bound = [ 1 , 1 , 1 , 5 0 0 0 0 . ] 6 7 8 9 10 11 12 p0 = d a d i . Misc . p e r t u r b p a r a m s ( s e l p a r a m s , l o w e r b o u n d=lower bound , upper bound=upper bound ) popt = S e l e c t i o n . o p t i m i z e c o n s ( p0 , data , s p e c t r a . i n t e g r a t e , neugamma vec , t h e t a n s , l o w e r b o u n d=lower bound , upper bound=upper bound , v e r b o s e=l e n ( s e l p a r a m s ) , m a x i t e r =30 , c o n s t r a i n t=c o n s f u n c ) The result is roughly similar to the previous method: >>> popt [-678.40252289901571, array([ 1.00000000e-03, 9.99000000e-01, 1.85534565e-01, 6.93897711e+02])] References [1] Ryan N Gutenkunst, Ryan D Hernandez, Scott H Williamson, and Carlos D Bustamante. Inferring the joint demographic history of multiple populations from multidimensional snp frequency data. PLoS Genet, 5(10):e1000695, 2009. [2] Diego Ortega-Del Vecchyo, Clare D Marsden, and Kirk E Lohmueller. Prefersim: Fast simulation of demography and selection under the poisson random field model. Bioinformatics, page btw478, 2016. [3] Benjamin C Haller and Philipp W Messer. Slim 2: Flexible, interactive forward genetic simulations. Molecular Biology and Evolution, page msw211, 2016. 6
Source Exif Data:
File Type : PDF File Type Extension : pdf MIME Type : application/pdf PDF Version : 1.5 Linearized : No Page Count : 6 Page Mode : UseOutlines Author : Title : Subject : Creator : LaTeX with hyperref package Producer : pdfTeX-1.40.18 Create Date : 2018:11:08 15:12:34-08:00 Modify Date : 2018:11:08 15:12:34-08:00 Trapped : False PTEX Fullbanner : This is pdfTeX, Version 3.14159265-2.6-1.40.18 (TeX Live 2017) kpathsea version 6.2.3EXIF Metadata provided by EXIF.tools