Manual

User Manual:

Open the PDF directly: View PDF PDF.
Page Count: 6

Utilizing fitai: a short how-to guide
Bernard Kim
bernard.kim@ucla.edu
Updated: November 8, 2018
1 Introduction
This short guide outlines how to utilize the code we have made available for inferring of the dis-
tribution of fitness effects (DFE). This code serves as an add-on module for the package ai [1],
which is primarily used for demographic inference.
The code examples shown here are meant to work with the example dataset. For simplicity’s
sake, I have generated an example dataset with PReFerSIM [2]. Furthermore, we will work with
a relatively small sample size and simple demographic model so that the examples can be worked
through quickly on a laptop. Lastly, all the example code is provided in the example.py script as
well as in this document.
Another important thing to note: ai characterizes genotype fitnesses as: 1, 1 + 2sh, and
1 + 2s, where 1 + 2sh is the fitness of the heterozygote. Furthermore, the DFEs inferred are scaled
in terms of the ancestral population size: γ= 2NAs. This means that the selection coefficients
must sometimes be rescaled, for instance when using the program SLiM [3].
If you have any additional questions, please feel free to email us: Bernard Kim
(bernard.kim@ucla.edu) or Kirk Lohmueller (klohmueller@ucla.edu).
2 Installation
ai can be downloaded from the Gutenkunst Lab’s BitBucket site. Once you have ai, there
are two easy ways to utilize the fitai module. The first method allows you to set up fitai so
that it is installed with the ai code. To do this, you copy Selection.py and init .py into
the dadi directory before installing ai with python setup.py install. Note, it may be useful
to comment out the line import matplotlib in init .py if you are trying to use ai on a
cluster. Alternately, you can import our code separately. To do this, you should install ai in
the standard manner. Copy Selection.py into your working directory. Then, our module can be
loaded by including the line import Selection.py into your ai/Python script.
The latter method is the way in which the example script is set up:
1import numpy
2import dadi
3import Selection
Note that in order to run ai with the example script you must also copy Selection.py into
the same directory as the example script.
1
3 Example dataset
The example dataset used in the example script was generated with forward simulations under the
PRF model, with the simulation program PReFerSIM. Additionally, we will assume we know the
true underlying demographic model rather than trying to fit one.
This dataset is summarized with a site frequency spectrum, has sample size 2n= 250 (125
diploids), and is saved in the file sample.sfs file. It was generated with a single size change
demography and an underlying gamma DFE. Specifically, a population of size N= 10,000 diploids
expands to 20,000 diploids 1000 generations ago and the gamma DFE has shape parameter 0.186
and scale paramter 686.7. This is the same gamma DFE that we inferred from the 1000 Genomes
EUR dataset, but the scale parameter has been rescaled to the ancestral population size of 10,000
diploids. Finally, the amount of diversity in the sample dataset matches θN S = 4000 = 4NAµLN S .
4 Demographic inference
Because the usage of ai for demographic inference is extensively documented, it will not be
discussed in detail here. In practice, we find that, as long as the demographic model that fits the
synonymous sites reasonably well also works well for inference of the DFE.
Briefly, we fit a demographic model to synonymous sites, which are assumed to be evolving
in a neutral or nearly neutral manner. We believe this accurately represents the effects of linked
selection and population structure, and condition upon this demographic model to fit the DFE.
However, note the assumption of neutral synonymous variants may not hold for species with large
population sizes, since this will increase the efficacy of selection on mutations with small fitness
effects.
Our sample dataset was generated with a two epoch (single size change) demography. Although
we are skipping the demographic inference, the following ai function describes a two epoch
demographic model.
1def two epoch ( params , ns , pts ) :
2nu , T = params
3xx = dad i . Numerics . d e f a u l t g r i d ( pt s )
4phi = dad i . PhiManip . phi 1D ( xx )
5phi = dad i . I n t e g r a t i o n . one pop ( phi , xx , T, nu )
6f s = da di . Spectrum . fr om p hi ( phi , ns , ( xx , ) )
7return f s
We will assume we infer a 2-fold population expansion 0.05*2NAgenerations ago, where NAis
the ancestral population size. Therefore, the parameter vector is: [nu, T].
1demog params = [ 2 , 0 . 0 5 ]
Again, we assume that the population scaled nonsynonymous mutation rate, θN S = 4,000. In
practice, we compute the synonymous mutation rate, θS, by using the multinomial likelihood to fit
the demographic model. Because this method only fits the proportional SFS, θSis estimated with
the dadi.Inference.optimal sfs scaling method. Then, we multiply θSby 2.31 to get θN S ,
θS2.31 = θN S . Remember that our sample size is 125 diploids (250 chromosomes).
1t h e t a n s = 4 0 00 .
2ns = numpy . a r r a y ( [ 2 5 0 ] )
2
5 Pre-computing of the SFS for many γ
Next, we must generate frequency spectra for a range of gammas. The demographic function is
modified to allow for a single selection coefficient. Here, each selection coefficient is scaled with the
ancestral population size, γ= 2NAs. In other words, if sis constant, the same gamma should be
used throughout the demographic function. If gamma=0, this function is the same as the original
demographic function.
1def t w o e p o c h s e l ( params , ns , p t s ) :
2nu , T, gamma = params
3xx = dad i . Numerics . d e f a u l t g r i d ( p ts )
4phi = dad i . PhiManip . phi 1D ( xx , gamma=gamma)
5phi = dad i . I n t e g r a t i o n . one pop ( phi , xx , T, nu , gamma=gamma)
6f s = da di . Spectrum . fr om p hi ( phi , ns , ( xx , ) )
7return f s
Then, we generate the frequency spectra for a range of gammas. The following code generates
expected frequency spectra, conditional on the demographic model fit to synonymous sites, over
Npts log-spaced points over the range of int bounds. Additionally, the mp=True argument tells
fitai whether it should utilize multiple cores/threads, which is convenient since this step takes
the longest. If the argument cpus is passed, it will utilize that many cores, but if mp=True and no
cpus argument is passed, it will use n-1 threads, where nis the number of threads available. If
mp=False, each SFS will be computed in serial. This step should take 1-10 minutes depending on
your CPU.
1p t s l = [ 6 0 0 , 80 0 , 1 0 00 ]
2s p e c t r a = S e l e c t i o n . s p e c t r a ( demog params , ns , t w o e p o ch s e l , p t s l=p t s l ,
3in t b ou nd s =(1e 5 ,5 00 ) , Npts =300 , ec ho=True ,
4mp=True )
Another manner in which this can be done is multiple log-spaced grids over some predetermined
breaks. While this method is almost the same as the previous, this method is more appropriate
for discretely binned DFEs. For example, if we were to create discrete bins at the breaks γ=
[0.1,1,100], we would use the following command, importantly, passing a specific int breaks.
1p t s l = [ 6 0 0 , 80 0 , 1 0 00 ]
2i n t b r e a k s = [ 1 e 4 , 0 . 1 , 1 , 1 00 , 5 0 0 ]
3s p e c t r a = S e l e c t i o n . s p e c t r a ( demog params , ns , t w o e p o ch s e l , p t s l=p t s l ,
4i n t b r e a k s=i n t b r e ak s , Npts =300 ,
5echo=True , mp=True )
Note, one error message that will come up often with very negative selection coefficients is:
WARNING:Numerics:Extrapolation may have failed. Check resulting frequency spectrum
for unexpected results.
One way to fix this is by increasing the pts l grid sizes – this will need to increase as the
sample size increases and/or if the integration is done over a range which includes stronger selec-
tion coefficients. dadi.Numerics.make extrap func is used to extrapolate the frequency spectra
to infinitely many gridpoints, but will sometimes return tiny negative values (often |Xi|<1e50)
due to floating point rounding errors. Using dadi.Numerics.make extrap log func will some-
times return Inf values and care should be taken that these numerical errors do not propagate to
downstream steps of the analysis. In practice, it seems that the tiny negative values do not affect
the integration because they are insignificant, but if the error message appears the SFS should be
manually double-checked. Alternately, the small negative values can be manually replaced with 0.
3
In the example, the pre-computed SFSs are saved in the list spectra.spectra. For conve-
nience’s sake, the spectra object can be pickled.
1#import p i c k l e
2#p i c k l e . dump ( s p e c t r a , ope n ( ” e x a m p l e s p e c t r a . s p , ”w” ) )
6 Fitting a DFE
6.1 Fitting simple DFEs
Fitting a DFE is the quickest part of this procedure, especially for simple distributions such as the
gamma distribution. If you wish to get an SFS for a specific DFE, you can use the integrate
method that is built into the spectra object: spectra.integrate(sel params, sel dist, theta).
Another option is to use the spectra.integrate norm method. The former does not normalize
the DFE and the second normalizes the DFE. We chose to use the former and assumed that muta-
tions with selection coefficients outside of the range we integrated over were effectively lethal, that
is, not seen in the sample. Note that we integrated over the range of gammas corresponding to
|s|= [0,1]. sel params is a list containing the DFE parameters, sel dist is the distribution used
for the DFE, and theta is θN S . To compute the expected SFS for our simulations with the true
parameter values, we would use spectra.integrate([0.186, 686.7], Selection.gamma dist,
4000.).
First, load the sample data:
1data = d adi . Spectrum . f r o m f i l e ( ’ example . s f s )
Similar to the way in which vanilla ai is used, you should have a starting guess at the param-
eters. Set an upper and lower bound. Perturb the parameters to select a random starting point,
then fit the DFE. This should be done multiple times from different starting points. We use the
spectra.integrate methods to generate the expected SFSs during each step of the optimization.
The following lines of code fit a gamma DFE to the example data:
1s e l p a r a m s = [ 0 . 2 , 1 0 0 0 . ]
2lowe r b oun d = [ 1 e 3 , 1e 2]
3upper bound = [ 1 , 5 0 0 0 0 . ]
4p0 = dad i . Misc . per tu rb params ( s el p ar a m s , lowe r bound=lower bound ,
5upper bound=upper bound)
6popt = S e l e c t i o n . o p t i mi z e l o g ( p0 , data , s p e ct r a . i n t eg r a te , S e l e c t i o n . gamma dist ,
7t h e t a n s , lo we r b o un d=lowe r b ou nd ,
8upper bound=upper bound , v e r b o s e=len( sel params) ,
9max it er =30)
If this runs correctly, you should infer something close to, but not exactly, the true DFE. The
final results will be stored in popt:
>>> popt
[-678.338 , array([ 0.187071 , 666.092 ])]
The expected SFS at the MLE can be computed with:
1m od e l s f s = s p e ct r a . i n t e g r a t e ( popt [ 1 ] , S e l e c t i o n . gamma dist , t h et a n s )
4
6.2 Fitting complex DFEs
Fitting complex distributions is similar to fitting simple DFEs, but requires a few additional steps,
outlined in the following lines of code. Here, we are fitting a neutral+gamma DFE to an SFS
generated under a gamma DFE just for the sake of the example. Additionally, we assume that every
selection coefficient γ < 1e4 is effectively neutral. Since this is a mixture of two distributions, we
infer the proportion of neutral mutations, pneu, and assume the complement of that (i.e. 1 pneu)
is the proportion of new mutations drawn from a gamma distribution. Therefore, the parameter
vector is: [pneu,shape, scale].
1def neugamma (mgamma, pneu , alpha , bet a ) :
2mgamma=mgamma
3#assume an yt hi ng with gamma<1e4 i s n e u t r a l
4i f (0 <= mgamma) and (mgamma <1 e 4) :
5return pneu / (1 e 4) + (1pneu ) S e l e c t i o n . gamma dist(mgamma, alpha ,
6beta)
7e l s e :
8return S e l e c t i o n . gamma dist(mgamma, alpha , b eta ) (1pneu )
Then, the custom DFE needs to be vectorized. This is easily done with the numpy.frompyfunc
function.
1neugamma vec = numpy . frompyfunc ( neugamma , 4 , 1)
Fit the DFE as before, accounting for the extra parameter to describe the proportion of neutral
new mutations. Note that pneu is explicitly bounded to be 0 < pneu 1.
1s e l p a r a m s = [ 0 . 2 , 0 . 2 , 1 0 0 0 . ]
2lowe r b oun d = [ 1 e 3 , 1e 3, 1e 2]
3upper bound = [ 1 , 1 , 5 0 0 0 0 . ]
4p0 = dad i . Misc . per tu rb params ( s el p ar a m s , lowe r bound=lower bound ,
5upper bound=upper bound)
6popt = S e l e c t i o n . o p t i m i z e l o g ( p0 , data , s p e c t r a . i n t e g r a t e , neugamma vec ,
7t h e t a n s , lo we r b o un d=lowe r b ou nd ,
8upper bound=upper bound , v e r b o s e=len( sel params) ,
9max it er =30)
If this has run properly, your result should look like the following:
>>> popt
[-678.35389753002551, array([ 1.10498940e-03, 1.87041580e-01, 6.71354664e+02])]
Another way to approach this problem is by defining the neutral+gamma DFE as a function of
four parameters instead of assuming the two are complementary. While this is unnecessary for the
neutral+gamma distribution since we bounded pneu to be between 0 and 1, it becomes necessary
for mixtures of three or more distributions. However, the principles will be the same as shown here.
First, define the neutral+gamma DFE as a function with parameter vector: [pneu,pgamma,shape,
scale].
1def neugamma (mgamma, pneu , pgamma , alp ha , be ta ) :
2mgamma=mgamma
3#assume an yt hi ng with gamma<1e4 i s n e u t r a l
4i f (0 <= mgamma) and (mgamma <1 e 4) :
5return pneu / (1 e 4) + pgammaS e l e c t i o n . gamma dist(mgamma, alpha , be ta )
6e l s e :
7return S e l e c t i o n . gamma dist(mgamma, alpha , b eta ) pgamma
5
Here we do not explicitly set pneu +pgamma = 1, so we need to apply some additional constraints
to enforce this. A function that equals 0 when the constraint is satisfied should be used:
1def c on s fu n c ( x , a r g s ) :
2return 1sum( x [ 0 : 2])
In other words, the constraint is satisfied when sum([pneu,pgamma,shape, scale][0:-2])=1,
that is, when sum([pneu,pgamma])=1.
Then, use the function Selection.optimize cons to fit the DFE. This function is the same as
Selection.optimize except that it requires the constraint function to be passed as an additional
argument.
1neugamma vec = numpy . frompyfunc ( neugamma , 5 , 1)
2
3s e l p a r a m s = [ 0 . 2 , 0 . 8 , 0 . 2 , 1 0 0 0 . ]
4lowe r b oun d = [ 1 e 3 , 1e 3, 1e 3, 1 e 2]
5upper bound = [ 1 , 1 , 1 , 5 0 0 0 0 . ]
6
7p0 = dad i . Misc . per tu rb params ( s el p ar a m s , lowe r bound=lower bound ,
8upper bound=upper bound)
9popt = S e l e c t i o n . o p t i m i z e c o n s ( p0 , data , s p e c t r a . i n t e g r a t e , neugamma vec ,
10 t h e t a n s , lo we r b o un d=lowe r b ou nd ,
11 upper bound=upper bound , v e r b o s e=len( sel params) ,
12 max it er =30 , c o n s t r a i n t=c ons fu nc )
The result is roughly similar to the previous method:
>>> popt
[-678.40252289901571, array([ 1.00000000e-03, 9.99000000e-01, 1.85534565e-01,
6.93897711e+02])]
References
[1] Ryan N Gutenkunst, Ryan D Hernandez, Scott H Williamson, and Carlos D Bustamante. Infer-
ring the joint demographic history of multiple populations from multidimensional snp frequency
data. PLoS Genet, 5(10):e1000695, 2009.
[2] Diego Ortega-Del Vecchyo, Clare D Marsden, and Kirk E Lohmueller. Prefersim: Fast simula-
tion of demography and selection under the poisson random field model. Bioinformatics, page
btw478, 2016.
[3] Benjamin C Haller and Philipp W Messer. Slim 2: Flexible, interactive forward genetic simu-
lations. Molecular Biology and Evolution, page msw211, 2016.
6

Navigation menu