Manual

User Manual:

Open the PDF directly: View PDF .
Page Count: 6

Introduction
Installation
Example dataset
Demographic inference
Pre-computing of the SFS for many
Fitting a DFE
- Fitting simple DFEs
- Fitting complex DFEs

Utilizing ﬁt∂a∂i: a short how-to guide

Bernard Kim

bernard.kim@ucla.edu

Updated: November 8, 2018

1 Introduction

This short guide outlines how to utilize the code we have made available for inferring of the dis-

tribution of ﬁtness eﬀects (DFE). This code serves as an add-on module for the package ∂a∂i [1],

which is primarily used for demographic inference.

The code examples shown here are meant to work with the example dataset. For simplicity’s

sake, I have generated an example dataset with PReFerSIM [2]. Furthermore, we will work with

a relatively small sample size and simple demographic model so that the examples can be worked

through quickly on a laptop. Lastly, all the example code is provided in the example.py script as

well as in this document.

Another important thing to note: ∂a∂i characterizes genotype ﬁtnesses as: 1, 1 + 2sh, and

1 + 2s, where 1 + 2sh is the ﬁtness of the heterozygote. Furthermore, the DFEs inferred are scaled

in terms of the ancestral population size: γ= 2NAs. This means that the selection coeﬃcients

must sometimes be rescaled, for instance when using the program SLiM [3].

If you have any additional questions, please feel free to email us: Bernard Kim

(bernard.kim@ucla.edu) or Kirk Lohmueller (klohmueller@ucla.edu).

2 Installation

∂a∂i can be downloaded from the Gutenkunst Lab’s BitBucket site. Once you have ∂a∂i, there

are two easy ways to utilize the ﬁt∂a∂i module. The ﬁrst method allows you to set up ﬁt∂a∂i so

that it is installed with the ∂a∂i code. To do this, you copy Selection.py and init .py into

the dadi directory before installing ∂a∂i with python setup.py install. Note, it may be useful

to comment out the line import matplotlib in init .py if you are trying to use ∂a∂i on a

cluster. Alternately, you can import our code separately. To do this, you should install ∂a∂i in

the standard manner. Copy Selection.py into your working directory. Then, our module can be

loaded by including the line import Selection.py into your ∂a∂i/Python script.

The latter method is the way in which the example script is set up:

1import numpy

2import dadi

3import Selection

Note that in order to run ∂a∂i with the example script you must also copy Selection.py into

the same directory as the example script.

3 Example dataset

The example dataset used in the example script was generated with forward simulations under the

PRF model, with the simulation program PReFerSIM. Additionally, we will assume we know the

true underlying demographic model rather than trying to ﬁt one.

This dataset is summarized with a site frequency spectrum, has sample size 2n= 250 (125

diploids), and is saved in the ﬁle sample.sfs ﬁle. It was generated with a single size change

demography and an underlying gamma DFE. Speciﬁcally, a population of size N= 10,000 diploids

expands to 20,000 diploids 1000 generations ago and the gamma DFE has shape parameter 0.186

and scale paramter 686.7. This is the same gamma DFE that we inferred from the 1000 Genomes

EUR dataset, but the scale parameter has been rescaled to the ancestral population size of 10,000

diploids. Finally, the amount of diversity in the sample dataset matches θN S = 4000 = 4NAµLN S .

4 Demographic inference

Because the usage of ∂a∂i for demographic inference is extensively documented, it will not be

discussed in detail here. In practice, we ﬁnd that, as long as the demographic model that ﬁts the

synonymous sites reasonably well also works well for inference of the DFE.

Brieﬂy, we ﬁt a demographic model to synonymous sites, which are assumed to be evolving

in a neutral or nearly neutral manner. We believe this accurately represents the eﬀects of linked

selection and population structure, and condition upon this demographic model to ﬁt the DFE.

However, note the assumption of neutral synonymous variants may not hold for species with large

population sizes, since this will increase the eﬃcacy of selection on mutations with small ﬁtness

eﬀects.

Our sample dataset was generated with a two epoch (single size change) demography. Although

we are skipping the demographic inference, the following ∂a∂i function describes a two epoch

demographic model.

1def two epoch ( params , ns , pts ) :

2nu , T = params

3xx = dad i . Numerics . d e f a u l t g r i d ( pt s )

4phi = dad i . PhiManip . phi 1D ( xx )

5phi = dad i . I n t e g r a t i o n . one pop ( phi , xx , T, nu )

6f s = da di . Spectrum . fr om p hi ( phi , ns , ( xx , ) )

7return f s

We will assume we infer a 2-fold population expansion 0.05*2NAgenerations ago, where NAis

the ancestral population size. Therefore, the parameter vector is: [nu, T].

1demog params = [ 2 , 0 . 0 5 ]

Again, we assume that the population scaled nonsynonymous mutation rate, θN S = 4,000. In

practice, we compute the synonymous mutation rate, θS, by using the multinomial likelihood to ﬁt

the demographic model. Because this method only ﬁts the proportional SFS, θSis estimated with

the dadi.Inference.optimal sfs scaling method. Then, we multiply θSby 2.31 to get θN S ,

θS∗2.31 = θN S . Remember that our sample size is 125 diploids (250 chromosomes).

1t h e t a n s = 4 0 00 .

2ns = numpy . a r r a y ( [ 2 5 0 ] )

5 Pre-computing of the SFS for many γ

Next, we must generate frequency spectra for a range of gammas. The demographic function is

modiﬁed to allow for a single selection coeﬃcient. Here, each selection coeﬃcient is scaled with the

ancestral population size, γ= 2NAs. In other words, if sis constant, the same gamma should be

used throughout the demographic function. If gamma=0, this function is the same as the original

demographic function.

1def t w o e p o c h s e l ( params , ns , p t s ) :

2nu , T, gamma = params

3xx = dad i . Numerics . d e f a u l t g r i d ( p ts )

4phi = dad i . PhiManip . phi 1D ( xx , gamma=gamma)

5phi = dad i . I n t e g r a t i o n . one pop ( phi , xx , T, nu , gamma=gamma)

6f s = da di . Spectrum . fr om p hi ( phi , ns , ( xx , ) )

7return f s

Then, we generate the frequency spectra for a range of gammas. The following code generates

expected frequency spectra, conditional on the demographic model ﬁt to synonymous sites, over

Npts log-spaced points over the range of int bounds. Additionally, the mp=True argument tells

ﬁt∂a∂i whether it should utilize multiple cores/threads, which is convenient since this step takes

the longest. If the argument cpus is passed, it will utilize that many cores, but if mp=True and no

cpus argument is passed, it will use n-1 threads, where nis the number of threads available. If

mp=False, each SFS will be computed in serial. This step should take 1-10 minutes depending on

your CPU.

1p t s l = [ 6 0 0 , 80 0 , 1 0 00 ]

2s p e c t r a = S e l e c t i o n . s p e c t r a ( demog params , ns , t w o e p o ch s e l , p t s l=p t s l ,

3in t b ou nd s =(1e −5 ,5 00 ) , Npts =300 , ec ho=True ,

4mp=True )

Another manner in which this can be done is multiple log-spaced grids over some predetermined

breaks. While this method is almost the same as the previous, this method is more appropriate

for discretely binned DFEs. For example, if we were to create discrete bins at the breaks γ=

[0.1,1,100], we would use the following command, importantly, passing a speciﬁc int breaks.

1p t s l = [ 6 0 0 , 80 0 , 1 0 00 ]

2i n t b r e a k s = [ 1 e −4 , 0 . 1 , 1 , 1 00 , 5 0 0 ]

3s p e c t r a = S e l e c t i o n . s p e c t r a ( demog params , ns , t w o e p o ch s e l , p t s l=p t s l ,

4i n t b r e a k s=i n t b r e ak s , Npts =300 ,

5echo=True , mp=True )

Note, one error message that will come up often with very negative selection coeﬃcients is:

WARNING:Numerics:Extrapolation may have failed. Check resulting frequency spectrum

for unexpected results.

One way to ﬁx this is by increasing the pts l grid sizes – this will need to increase as the

sample size increases and/or if the integration is done over a range which includes stronger selec-

tion coeﬃcients. dadi.Numerics.make extrap func is used to extrapolate the frequency spectra

to inﬁnitely many gridpoints, but will sometimes return tiny negative values (often |Xi|<1e−50)

due to ﬂoating point rounding errors. Using dadi.Numerics.make extrap log func will some-

times return Inf values and care should be taken that these numerical errors do not propagate to

downstream steps of the analysis. In practice, it seems that the tiny negative values do not aﬀect

the integration because they are insigniﬁcant, but if the error message appears the SFS should be

manually double-checked. Alternately, the small negative values can be manually replaced with 0.

In the example, the pre-computed SFSs are saved in the list spectra.spectra. For conve-

nience’s sake, the spectra object can be pickled.

1#import p i c k l e

2#p i c k l e . dump ( s p e c t r a , ope n ( ” e x a m p l e s p e c t r a . s p ” , ”w” ) )

6 Fitting a DFE

6.1 Fitting simple DFEs

Fitting a DFE is the quickest part of this procedure, especially for simple distributions such as the

gamma distribution. If you wish to get an SFS for a speciﬁc DFE, you can use the integrate

method that is built into the spectra object: spectra.integrate(sel params, sel dist, theta).

Another option is to use the spectra.integrate norm method. The former does not normalize

the DFE and the second normalizes the DFE. We chose to use the former and assumed that muta-

tions with selection coeﬃcients outside of the range we integrated over were eﬀectively lethal, that

is, not seen in the sample. Note that we integrated over the range of gammas corresponding to

|s|= [0,1]. sel params is a list containing the DFE parameters, sel dist is the distribution used

for the DFE, and theta is θN S . To compute the expected SFS for our simulations with the true

parameter values, we would use spectra.integrate([0.186, 686.7], Selection.gamma dist,

4000.).

First, load the sample data:

1data = d adi . Spectrum . f r o m f i l e ( ’ example . s f s ’ )

Similar to the way in which vanilla ∂a∂i is used, you should have a starting guess at the param-

eters. Set an upper and lower bound. Perturb the parameters to select a random starting point,

then ﬁt the DFE. This should be done multiple times from diﬀerent starting points. We use the

spectra.integrate methods to generate the expected SFSs during each step of the optimization.

The following lines of code ﬁt a gamma DFE to the example data:

1s e l p a r a m s = [ 0 . 2 , 1 0 0 0 . ]

2lowe r b oun d = [ 1 e −3 , 1e −2]

3upper bound = [ 1 , 5 0 0 0 0 . ]

4p0 = dad i . Misc . per tu rb params ( s el p ar a m s , lowe r bound=lower bound ,

5upper bound=upper bound)

6popt = S e l e c t i o n . o p t i mi z e l o g ( p0 , data , s p e ct r a . i n t eg r a te , S e l e c t i o n . gamma dist ,

7t h e t a n s , lo we r b o un d=lowe r b ou nd ,

8upper bound=upper bound , v e r b o s e=len( sel params) ,

9max it er =30)

If this runs correctly, you should infer something close to, but not exactly, the true DFE. The

ﬁnal results will be stored in popt:

>>> popt

[-678.338 , array([ 0.187071 , 666.092 ])]

The expected SFS at the MLE can be computed with:

1m od e l s f s = s p e ct r a . i n t e g r a t e ( popt [ 1 ] , S e l e c t i o n . gamma dist , t h et a n s )

6.2 Fitting complex DFEs

Fitting complex distributions is similar to ﬁtting simple DFEs, but requires a few additional steps,

outlined in the following lines of code. Here, we are ﬁtting a neutral+gamma DFE to an SFS

generated under a gamma DFE just for the sake of the example. Additionally, we assume that every

selection coeﬃcient γ < 1e−4 is eﬀectively neutral. Since this is a mixture of two distributions, we

infer the proportion of neutral mutations, pneu, and assume the complement of that (i.e. 1 −pneu)

is the proportion of new mutations drawn from a gamma distribution. Therefore, the parameter

vector is: [pneu,shape, scale].

1def neugamma (mgamma, pneu , alpha , bet a ) :

2mgamma=−mgamma

3#assume an yt hi ng with gamma<1e−4 i s n e u t r a l

4i f (0 <= mgamma) and (mgamma <1 e −4) :

5return pneu / (1 e −4) + (1−pneu ) ∗S e l e c t i o n . gamma dist(−mgamma, alpha ,

6beta)

7e l s e :

8return S e l e c t i o n . gamma dist(−mgamma, alpha , b eta ) ∗(1−pneu )

Then, the custom DFE needs to be vectorized. This is easily done with the numpy.frompyfunc

function.

1neugamma vec = numpy . frompyfunc ( neugamma , 4 , 1)

Fit the DFE as before, accounting for the extra parameter to describe the proportion of neutral

new mutations. Note that pneu is explicitly bounded to be 0 < pneu ≤1.

1s e l p a r a m s = [ 0 . 2 , 0 . 2 , 1 0 0 0 . ]

2lowe r b oun d = [ 1 e −3 , 1e −3, 1e −2]

3upper bound = [ 1 , 1 , 5 0 0 0 0 . ]

4p0 = dad i . Misc . per tu rb params ( s el p ar a m s , lowe r bound=lower bound ,

5upper bound=upper bound)

6popt = S e l e c t i o n . o p t i m i z e l o g ( p0 , data , s p e c t r a . i n t e g r a t e , neugamma vec ,

7t h e t a n s , lo we r b o un d=lowe r b ou nd ,

8upper bound=upper bound , v e r b o s e=len( sel params) ,

9max it er =30)

If this has run properly, your result should look like the following:

>>> popt

[-678.35389753002551, array([ 1.10498940e-03, 1.87041580e-01, 6.71354664e+02])]

Another way to approach this problem is by deﬁning the neutral+gamma DFE as a function of

four parameters instead of assuming the two are complementary. While this is unnecessary for the

neutral+gamma distribution since we bounded pneu to be between 0 and 1, it becomes necessary

for mixtures of three or more distributions. However, the principles will be the same as shown here.

First, deﬁne the neutral+gamma DFE as a function with parameter vector: [pneu,pgamma,shape,

scale].

1def neugamma (mgamma, pneu , pgamma , alp ha , be ta ) :

2mgamma=−mgamma

3#assume an yt hi ng with gamma<1e−4 i s n e u t r a l

4i f (0 <= mgamma) and (mgamma <1 e −4) :

5return pneu / (1 e −4) + pgamma∗S e l e c t i o n . gamma dist(−mgamma, alpha , be ta )

6e l s e :

7return S e l e c t i o n . gamma dist(−mgamma, alpha , b eta ) ∗pgamma

Here we do not explicitly set pneu +pgamma = 1, so we need to apply some additional constraints

to enforce this. A function that equals 0 when the constraint is satisﬁed should be used:

1def c on s fu n c ( x , ∗a r g s ) :

2return 1−sum( x [ 0 : −2])

In other words, the constraint is satisﬁed when sum([pneu,pgamma,shape, scale][0:-2])=1,

that is, when sum([pneu,pgamma])=1.

Then, use the function Selection.optimize cons to ﬁt the DFE. This function is the same as

Selection.optimize except that it requires the constraint function to be passed as an additional

argument.

1neugamma vec = numpy . frompyfunc ( neugamma , 5 , 1)

3s e l p a r a m s = [ 0 . 2 , 0 . 8 , 0 . 2 , 1 0 0 0 . ]

4lowe r b oun d = [ 1 e −3 , 1e −3, 1e −3, 1 e −2]

5upper bound = [ 1 , 1 , 1 , 5 0 0 0 0 . ]

7p0 = dad i . Misc . per tu rb params ( s el p ar a m s , lowe r bound=lower bound ,

8upper bound=upper bound)

9popt = S e l e c t i o n . o p t i m i z e c o n s ( p0 , data , s p e c t r a . i n t e g r a t e , neugamma vec ,

10 t h e t a n s , lo we r b o un d=lowe r b ou nd ,

11 upper bound=upper bound , v e r b o s e=len( sel params) ,

12 max it er =30 , c o n s t r a i n t=c ons fu nc )

The result is roughly similar to the previous method:

>>> popt

[-678.40252289901571, array([ 1.00000000e-03, 9.99000000e-01, 1.85534565e-01,

6.93897711e+02])]

References

[1] Ryan N Gutenkunst, Ryan D Hernandez, Scott H Williamson, and Carlos D Bustamante. Infer-

ring the joint demographic history of multiple populations from multidimensional snp frequency

data. PLoS Genet, 5(10):e1000695, 2009.

[2] Diego Ortega-Del Vecchyo, Clare D Marsden, and Kirk E Lohmueller. Prefersim: Fast simula-

tion of demography and selection under the poisson random ﬁeld model. Bioinformatics, page

btw478, 2016.

[3] Benjamin C Haller and Philipp W Messer. Slim 2: Flexible, interactive forward genetic simu-

lations. Molecular Biology and Evolution, page msw211, 2016.

Manual

Navigation menu

Versions of this User Manual:

Views

Navigation