Manual

User Manual:

Open the PDF directly: View PDF .
Page Count: 31

Introduction
Getting help
Hands on
Input data
Specifying scheme
Specifying a model
Several repeats and parallel computing
Output
Plotting model
- Time units on model's plot
- Plotting after GADMA finished
Usefull options
Extra parameters file
Units
Decreasing number of model's parameters
Dependencies
Installation
Appendices
Example of full parameters file
Example of generated code

GADMA user manual

Corresponding to version 1.0.0

Ekaterina Noskova

Contents

1 Introduction 3

2 Getting help 3

3 Hands on 3

4 Input data 5

4.1 VCFdataformat ................................. 6

4.2 Frequency spectrum ﬁle format . . . . . . . . . . . . . . . . . . . . . . . . . 7

4.3 SNPdataformat ................................. 7

5 Specifying scheme 8

6 Specifying a model 8

6.1 Dynamics of population size change . . . . . . . . . . . . . . . . . . . . . . . 8

6.2 Specifying structure of the model . . . . . . . . . . . . . . . . . . . . . . . . 9

6.2.1 Initialstructure.............................. 10

6.2.2 Finalstructure .............................. 10

6.3 Specifying model in details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

7 Several repeats and parallel computing 12

8 Output 12

8.1 Stdoutandlogﬁle ................................ 13

8.2 Modelrepresentation............................... 13

8.3 Output directory content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

8.4 Generated code of models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

9 Plotting model 16

9.1 Time units on model’s plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

9.2 Plotting after GADMA ﬁnished . . . . . . . . . . . . . . . . . . . . . . . . . 17

10 Usefull options 17

10.1Timeforgeneration................................ 17

10.2 Theta0 ....................................... 18

10.2.1 Estimating Theta0 ............................ 18

10.2.2 Changing Theta0 ............................. 18

10.3 Examples of Theta0 and time for generation . . . . . . . . . . . . . . . . . . 19

10.4 Relative to NAparameters............................ 19

10.5Disablemigrations ................................ 19

10.6 Upper bound for time split . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

10.7Resumelaunch .................................. 20

10.7.1 Onlymodels................................ 20

11 Extra parameters ﬁle 20

12 Units 20

13 Decreasing number of model’s parameters 20

14 Dependencies 22

15 Installation 22

15.1 ∂a∂i ........................................ 22

15.2 moments ...................................... 22

15.3Pillow ....................................... 23

15.4GADMA...................................... 23

15.4.1 Verifying installation . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

Appendices 24

A Example of full parameters ﬁle 24

B Example of generated code 30

1 Introduction

Welcome to GADMA!

GADMA implements methods for automatic inferring joint demographic history of mul-

tiple populations from the genetic data.

GADMA is based on two open source packages: the ∂a∂i developed by Ryan Gutenkunst

[https://bitbucket.org/gutenkunstlab/dadi/] and the moments developed by Simon

Gravel [https://bitbucket.org/simongravel/moments/].

In contrast, GADMA is a command-line tool. It presents a series of launches of the

genetic algorithm and infer demographic history from Allele Frequency Spectrum of multiple

populations (up to three).

2 Getting help

Please don’t be afraid to contact me for diﬀerent problems and oﬀers via email ekate-

rina.e.noskova@gmail.com. I will be glad to answer all questions.

3 Hands on

Test case

GADMA has test case for simple demographic model for 1 population: just constant size of

10000 individuals in population. To run test case type:

$ gad ma -- test

Example 2

Suppose we have SNP data for two populations. Data is in ∂a∂i’s SNP format in ﬁle

snp_data.txt. Suppose we want to get all output in some gadma_output directory:

$ gadma -i sn p_data . txt -o gadma _ou tpu t

Example 3

We didn’t specify AFS size or labels for populations, they are taken automaticaly from the

input ﬁle.We can see parameters ﬁle of our run in gadma_output/params ﬁle. We see that

there are

# gadma_output / params

...

Population labels : pop_name_1 , pop_name_2

Pro jec ti on s : 18 , 20

...

But we know that spectrum should be 20 ×20! To specify size of AFS we need to create

parameters ﬁle and set Projections:

# params_file

Projections : 20 ,20

Also we can change order of populations. We should add:

# params_file

Projections : 20 ,20

Population labels : pop_name_2 , pop_name_1

If we want to rename populations, we should change names in snp_data.txt ﬁle.

Now assume we want to get the simplest demographic model that we can as faster as we

can. We will tell GADMA that we don’t need no other dynamics of population sizes except

sudden (constant) population size change and that we want to use moments library.

We add corresponding string to parameters ﬁle and now it looks like:

# params_file

Projections : 20 ,20

Population labels : pop_name_2 , pop_name_1

Only sudden : True

Use moments or dadi : moments

To run GADMA we need to specify -p/--params command-line option in cmd:

$ gadma -i snp_data . txt -o gadma_output -p params_file

Example 4

Consider some AFS ﬁle fs_data.fs. There is spectrum for three populations: YRI, CEU,

CHB. However axes are mixed up: CHB, YRI, CEU. To run GADMA we should order them

from most ancient to last formed:

# params_file

Population labels : YRI , CEU , CHB

We want to allow exponential growth (it is default behaviour) and have some extra change

in size of the ancient population. To do so we should specify Initial structure. It is list

of three numbers: ﬁrst — number of time intervals before ﬁrst split (we want here 2); second

— number of time periods betseen forst and second split events (at least 1); third — number

of time periods after second split.

# params_file

Population labels : YRI , CEU , CHB

Initial structu re : 2 ,1 ,1

Also we can put information about input ﬁle and output directory to our parameters ﬁle:

# params_file

In put fil e : fs _d ata . fs

Output directory : gadma_output

Population labels : YRI , CEU , CHB

Initial structu re : 2 ,1 ,1

Now we can run GADMA the following way:

$ gadma -p params

Example 5

We have our GADMA launch interrupted for some reason. We want to resume it:

$ gadma -- resume ga dma _out put

where gadma_output is output directory of previous run. We can ﬁnd resumed run in

gadma_output_resumed

Example 6

Our launch was ﬁnished, we used ∂a∂i with default grid size which GADMA determines

automatically if it is not specify by user. We found out that it would be better to ﬁnd some

models using greater number of grid points in ∂a∂i scheme, but we want to take ﬁnal models

from previous run:

# params_file

Pts : 40 , 50 , 60 # Greater value of grid size than it was

And run GADMA:

$ gadma -- resume ga dma _out put -- on ly_m ode ls -p params

--only_models tell GADMA to take from gadma_output ﬁnal models only.

There is another way to do the same:

# params_file

Resume from : gadma_output

Only models : True

Pts : 40 , 50 , 60 # Greater value of grid size than it was

And run GADMA the following way:

$ gadma -p params

Example YRI, CEU

GADMA has example of full parameters ﬁle example_params, which is preseted here too at

the end of the manual. To run GADMA with this parameters one should just run from the

GADMA’s home directory:

$ gadma -p example_params

4 Input data

GADMA handles two formats of input ﬁles:

•Frequency spectrum ﬁle format (.fs or .sfs),

•SNP data format (.txt).

These data formats should be familiar to one who was used ∂a∂i or moments.

Input ﬁle can be speciﬁed to GADMA with two ways:

1. Through command-line option -i/--input:

$ gadma -i fs_file . fs -o out_dir

$ gadma -- input s np_file . txt -o out_dir

2. In parameters ﬁle params_file:

# params_file

# Input file path

In put fil e : fs_file . fs

...

Extra information about input AFS can also be put in parameters ﬁle. For example,

AFS can be projected to smaller size with Projections option, populations can be

named or their order can be changed with Population labels option. If parameters

ﬁle does not contain some options, they are automaticaly pulled out from the input

ﬁle.

# params_file

# Input file path

In put fil e : fs_file . fs

# ( New ) size of the AFS

Projections : 20 ,20

# Labels of populations

Population labels : CEU , YRI

...

GADMA can be launched with parameters ﬁle the following way:

$ gadma -p params_file -o out_dir

4.1 VCF data format

If one have .vcf ﬁle, it can be converted it into .sfs ﬁle using https://github.com/isaacovercast/

easySFS.

4.2 Frequency spectrum ﬁle format

Each ﬁle begins with any number of comment lines beginning with #. The ﬁrst non-comment

line contains Pintegers giving the dimensions of the FS array, where Pis the number of

populations represented. For a FS representing data from 4x4x2 samples, this would be

5 5 3. (Each dimension is one larger than the number of samples, because the number of

observations can range, for example, from 0 to 4 if there are 4 samples, for a total of 5

possibilities.) On the same line, the string folded or unfolded denoting whether or not the

stored FS is folded.

The actual data is stored in a single line listing all the FS elements separated by spaces,

in the order fs[0,0,0] fs[0,0,1] fs[0,0,2]. . . fs[0,1,0] fs[0,1,1]. . . . This is followed by a single line

giving the elements of the mask in the same order as the data, with 1indicating masked and

0indicating unmasked.

4.3 SNP data format

Human Chimp Allele1 YRI CEU Allele2 YRI CEU Gene Position

ACG ATG C 29 24 T 1 0 abcb1 289

CCT CCT C 29 23 G 3 2 abcb1 345

Listing 1: Example of SNP ﬁle format

The data ﬁle begins with any number of comment lines that being with #. The ﬁrst

parsed line is a column header line. Whitespace is used to separate entries within the table,

so no spaces are allowed within any entry. Individual rows make be commented out using #.

The ﬁrst column contains the in-group reference sequence at that SNP, including the

ﬂanking bases. If the ﬂanking bases are unknown, they can be denoted by -. The header

label is arbitrary.

The second column contains the aligned outgroup reference sequence at that SNP, in-

cluding the ﬂanking bases. Unknown entries can be denoted by -. The header label is

arbitrary.

The third column gives the ﬁrst segregating allele. The column header must be exactly

Allele1.

Then follows an arbitrary number of columns, one for each population, each giving the

number of times Allele1 was observed in that population. The header for each column should

be the population identiﬁer.

The next column gives the second segregating allele. The column header must be exactly

Allele2.

Then follows one column for each population, each giving the number of times Allele2

was observed in that population. The header for each column should be the population

identiﬁer, and the columns should be in the same order as for the Allele1 entries.

Then follows an arbitrary number of columns which will be concatenated with _to assign

a label for each SNP.

The Allele1 and Allele2 headers must be exactly those values because the number of

columns between those two is used to infer the number of populations in the ﬁle.

5 Specifying scheme

GADMA use either ∂a∂i or moments to simulate expected AFS from the demographic model.

By default ∂a∂i is used. To use ∂a∂i it is recommended to check value of the Pts option in

parameters ﬁle. Pts is sequence of three numbers, each of which is equal to the number of

points in grid size. The greater numbers are, the more accurate ∂a∂i numerical solution of

partial diﬀerential equation is, however, it would take more time. By default, GADMA takes

Pts : n, n + 10, n + 20, where n— is the maximum sample size among populations of

interest.

moments library does not need Pts to be speciﬁed. To change ∂a∂i to moments scheme,

specify option in parameters ﬁle:

# params file

...

Use moments or dadi : moments

...

6 Specifying a model

6.1 Dynamics of population size change

Figure 1: Three main demographic dynamics of population size change.

In GADMA size of population can be changed due to one of three dynamics: sudden

change, linear change and exponential change of the eﬀective population size (Figure 1).

In order to infer demographic model with sudden changes of populations sizes only, option

Only sudden in parameters ﬁle should be set to True:

# params_file

...

Only sudden : True

...

By default, this option is set to False and dynamics are found like other parameters of the

demographic model.

6.2 Specifying structure of the model

GADMA infers a demographic model from an AFS with nothing required from the user,

except simple information, that determines how much the detailed model is required — the

structure of the model.

Assume a division of one population into two new subpopulations and ﬁxed temporal

order of the populations: from the ancient to the most recently formed population.

We can divide time of our model into splits events and time intervals, during which a

certain dynamics of change of eﬀective size is maintained for each population and migration

rates are constant. The number of split events is one less than the number of populations

under consideration. Now we can deﬁne the concept of the structure of the demographic

model:

Structure of the demographic model is:

•the number of time intervals in case of one population;

•the number of time intervals as those that occur before and after a single split, in the

case of two populations;

•the number of time intervals prior to the ﬁrst split, those between the ﬁrst and second

split, and the ones after the second split, in the case of three populations.

For example, we can divide time of the model on the Figure 2 to 4 time intervals: T1,

T2,T3and T4, and two populations’ splits: S1and S2. The structure of this model is 2,1,1,

because two intervals (T1as T2) before ﬁrst split S1, one interval (T3) between ﬁrst and

second splits and one interval (T4) after second split S2.

Figure 2: Example of representation of demographic model. Time is on the axis of abscissa

and population size is on the axis of ordinates. The structure of that model is (2,1,1). Here

colors are just a formality and are used for highlighting diﬀerent demographic dynamics.

6.2.1 Initial structure

To specify Structure of the inferred model one should set Initial structure in the param-

eters ﬁle:

# params file

...

Initial structure : 2

...

# params file

...

Initial stru ctur e : 2 ,1

...

# params file

...

Initial structu re : 2 ,1 ,1

...

By default the simplest structure is used (1 or 1,1 or 1,1,1).

6.2.2 Final structure

It is also possible to start with more simple structure in order to get more complex. To do

so one should specify option Final structure in the parameters ﬁle. For example:

# params file

In put fil e : s om e_ 2d _f s . fs

Initial stru ctur e : 1 ,1

Fi nal struct ure : 2 ,1

...

During this parameters ﬁle GADMA will ﬁnd parameters for demographic model with 1,1

structure, then increase structure for 2,1 and ﬁnd parameters for the model with this struc-

ture. Found parameters for more simple structure (in this case it is 1,1) are used in search

of the parameters of more complex structure (2,1).

Note: Structure increases random, so if one specify initial structure to 1,1 and ﬁnal to

2,2, it is not garanteed to ﬁnad optimal parameters for demographic models with structures

between 1,1 and 2,2, i.e. intermediate state can be either 1,2 or 2,1.

Recommendation: Use scheme with increase of the structure, as it produce more

stable solutions.

Recommendation: Choose not very large values for model’s structure. Final structure

is recommended to be diﬀer by one element from initial structure, for example, 1,1 and 2,1;

1,2,1 and 2,2,1.

6.3 Specifying model in details

It is also possible to use Genetic algorithm from GADMA to a proposed model that is deﬁned

as Python function using ∂a∂i or moments. It is the way that ∂a∂i and moments work with

demographic models inference. To understand how to specify model like that one can read

manuals to the packages.

For example, consider simple bottleneck model for one population: at time TF + TB in

the past, an equilibrium population goes through a bottleneck of depth nuB, recovering to

relative size nuF:

def bottleneck ( params , ns , pts ):

nuB , nuF , TB , TF = params

xx = dadi . Numeri cs . de fau lt_ gri d ( pts )

phi = d adi . P hi Ma nip . p hi_1D ( xx )

phi = dadi . Integration . one_pop ( phi , xx , TB , nuB )

phi = Integration . one_pop ( phi , xx , TF , nuF )

fs = dadi . Spectrum . from_phi ( phi , ns , (xx ,))

return fs

To run optimization from GADMA one need to run optimization function just like in

∂a∂i and moments:

# Import GADMA ’s optimization function :

import gadma

# Specify input data and its parameters :

data = da di . Spectr um . f ro m_ fi le ( " f s_ fi le na me . fs " )

ns = data.sample_sizes # size of AFS

pts = [40 ,50 ,60] # grid size for dadi

# Wrap our bottleneck function :

func_ex = dadi . Numerics . m ake_e xtrap_lo g_func ( bottleneck )

# Specify upper and lower bounds for parameters :

upper_bound = [100 , 100 , 3, 3]

low er_ bo un d = [1 e -2 , 1 e -2 , 0 , 0]

# Run optimizations:

# Beginning GADMA optimization

popt = gadma . Inference . optimize_ga (4 , data , func_ex , pts_l ,

lower_bound = lower_bound ,

upper_bound=upper_bound)

# Beginning local optimization from dadi

popt = dadi . Inference . optimize_log ( popt , data , func_ex , pts_l ,

lower_bound = lower_bound ,

upper_bound = upper_bound ,

verbose = len ( p0 ), m ax iter =3)

print (’ Found parameters : {0} ’. format ( popt ))

Note: No initial parameters p0 are set in gadma.Inference.optimize_ga function as

GADMA optimization is global search algorithm. However it is possible to specify p0:

# Initia l p ara mete rs can be set too :

p0 = [0.01 , 1.5 , 0.2 , 0.2]

# Beginning GADMA optimization

popt = gadm a . In fe re nc e . op ti mi ze _g a (4 , data , func_ex , pts_l , p0 = p0

lower_bound = lower_bound ,

upper_bound=upper_bound)

If one want to ﬁnd other parameters fo gadma.Inference.optimize_ga function:

>>> import gadma

>>> help ( gadma . Inference . optimize_ga )

7 Several repeats and parallel computing

By default GADMA run optimization that uses Genetic algorithm once. However it is

recomended to run optimization several times and get best model among the result ones.

Option Number of repeats in parameters ﬁle tell GADMA to execute optimization the

necessary number of times. Moreover there are another option Number of processes allows

GADMA to parallelize those runs.

Note: GADMA parallelize exactly several runs, not every of them. So if one ask to

repeat optimization 2 times and parallelize in more than 2 processes, only two processes will

be used eventually.

Recomendation: Number of processes should be lesser than Number of repeats

and it will be better if it is aliquot to the Number of repeats.

Note: Number of processes shouldn’t be greater than number of kernels on one’s

computer. Otherwise there will be no sence in parallelization.

# params file

...

Number of repeats : 6

Number of processes : 2 # or 3 or 6

...

8 Output

GADMA put all ﬁles to the directory that user set through -o/--output command-line

option:

$ gad ma -o o ut pu t_ di r - i in pu t_ fs . fs

or through Output directory option in parameters ﬁle:

# params file

Output directory : output_dir

...

8.1 Stdout and log ﬁle

GADMA prints its progress about every minute in stdout and in output_dir/GADMA.log

ﬁle:

[ hhh : mm : ss ]

All best logLL models :

GA number logLL BIC Model

All best BIC models :

GA number logLL BIC Model

-- Best model by log likeliho od - -

Log likelihood : Best logLL

with BIC score : BIC_score

Model : rep resenta tion of best logLL model

-- Best model by BIC score --

Log likelihood : logLL

with BIC score : Best BIC score

Model : rep resenta tion of best BIC model

Note: One can set Silence option in parameters ﬁle to True to disable output in

stdout, ﬁle output_dir/GADMA.log will still have it.

8.2 Model representation

Every model is printed as a line of parameters. All model parameters, except mutation rates,

have precision equals to 2.

Consider designations:

•T— time,

•P— percent of split,

•Ni — size of population number i,

•di — dynamic of changing of the size of population number i,

•mij — mutation rate from population ito population j.

Dynamic of population size change has numerical values:

•di = 0 — sudden change of size for population number i;

•di = 1 — linear change of size for population number i;

•di = 2 — exponential change of size for population number i;

Model is printed as sequence of time intervals and splits that are represented the following

way:

•First period (NA — size of ancestry population):

[ NA ]

•Split:

–If there are one population before split event, so there will be two populations

after it:

[ P%, [N1, N2] ]

–If there are two population before split event, so last formed population is divided

and there will be three populations after it:

[ P%, [N1, N2, N3] ]

•Usual time period:

–If there is one population:

[ T, [N1], [d1] ]

–If there are two populations:

[ T, [N1, N2], [d1, d2], [[None, m12], [m21, None]] ]

–If there are three populations:

[ T, [N1, N2, N3], [d1, d2, d3],

[[None, m12, m13], [m21, None, m23], [m31, m32, None]] ]

At the end of the string, that corresponds to the model, there is information about

model’s ancestry in the genetic algorithm:

•c— for model, that is child of crossover,

•r— if it was formed random way,

•m— if it was mutated,

•f— ﬁnal model of genetic algorithm.

Note: mis added as many times as model was mutated.

Example of the demographic model for two populations:

[ [144.38] ][ 16.00%, [23.10, 121.28] ][ 375.77, [143.31, 30.07],

[2, 2], [[None, 3.33e-03][7.53e-04, None]] ]

8.3 Output directory content

For every repeat of Genetic algorithm GADMA create new folder in output directory with

correspond number.

In every folder there is GADMA_GA.log, where every iteration of algorithm is saved, folders

pictures and python_code.

When genetic algorithm ﬁnishes GADMA saves picture and python code of result model

in the corresponding folder.

When all GA ﬁnishes picture and code of best model among them are saved in root

directory.

- <output_dir >

- 1

GADMA_GA . log

- pictures

- python_code

- dadi

- moments

result_model.png

result_model_code.py

- 2

GADMA_GA . log

- pictures

- python_code

- dadi

- moments

result_model.png

result_model_code.py

params

extra_params

GADMA . log

best_logLL_model.png

best_logLL_model_dadi_code.py

b es t_ lo gL L_ mo de l_ mo me nt s_ co de . py

best_bic_model.png

best_bic_model_dadi_code.py

best_bic_model_moments_code.py

8.4 Generated code of models

By default GADMA generates python code only for ﬁnal models both for ∂a∂i and moments.

However it can do it every Niteration of genetic algorithm also. In this case option should be

set in the parameters ﬁle. GADMA save ﬁles with code to the output_dir/<GA_number>/python_code

directory. Both ∂a∂i and moments code are generated and saved in diﬀerent folders there.

Each code contain function of the model, which takes values of the parameters as input,

and strings that load observed AFS, simulate expected AFS from the model’s function and

calculate log likelihood between two AFS’. The result log likelihood is printed to stdout. For

the moments code picture is also draw.

All code can be run the following way:

$ python fi le _w ith _c od e . py

Example of generated code one can ﬁnd at the end of this manual in the corresponding

section.

9 Plotting model

Figure 3: Exaple of demographic model plot that GADMA draw during run.

GADMA always draws ﬁnal best models during genetic algorithms search for best solu-

tions.

However models can be drawing during the pipeline every N’s iteration of each genetic al-

gorithm. Nis equal to the value of Draw models every N iteration option in parameters

ﬁle. So to enable drawing one should set this option in ﬁle.

Recommendation: Don’t draw models very often, because changes are usually not

very signiﬁcant and drawing takes some time, so optimization will be slower.

Note: One can disable drawing by setting Draw models every N iteration : 0 in

the parameters ﬁle. This is also default behaviour.

Models are drawn with moments library, so it should be installed if one want to have

pictures. In the top left corner there is size of ancestry population. Other parameters one

can ﬁnd in string representation of the model in the log ﬁles.

9.1 Time units on model’s plot

Time on the demographic model’s plot can be drawn in units of years, thousand years

or in genetic units. By default choice depends on the Time for generation option in the

parameters ﬁle: if it is set to some value (in years) then time will be shown in years, otherwise

it will be in genetic units.

But, of course, it is possible to tell GADMA what units are preferable. For example, if

one wants time to be in thousand of years on the pictures, as it is big values in years:

# params file

...

Units of time in drawing : thousand years

...

9.2 Plotting after GADMA ﬁnished

Figure 4: Example of model’s plot that was drawn with generated python script.

Sometimes ﬁnal pictures aren’t satisfying or one didn’t draw for some reasons (don’t want

to install moments, want fast launch) it is possible to plot demographic model (and only it)

again.

To do it one should run corresponding generated python script for the model. For exam-

ple, ﬁnal model can be draw again the following way:

$ python b es t_ lo gl l_ mo del _m om en ts _co de . py

Note: Again it’s possible only if one have moments installed.

Note: One can change code inside ﬁle and draw again if picture isn’t satisfying.

10 Usefull options

10.1 Time for generation

Option Time for generation in parameters ﬁle is corresponding to the time per one gener-

ation in Wright-Fisher model. It is responsible basically for one thing: time on the model’s

plots. If it is speciﬁed, then time on the pictures will be converted from genetic units by

scaling with this value. Otherwise, if it is not set, time will be shown in genetic units.

Note: If Time for generation is speciﬁed, it should be consistent with another option:

Theta0, which is described in the corresponding section.

10.2 Theta0

Theta0 is equal to the expected number of mutation that occur in one chromosome in one

generation in the inﬁnite-sited model. GADMA can scale all values of demographic model

parameters due to known value of Theta0. However, it is not always possible to ﬁnd it.

There is a way to solve this problem: one can to set Theta0 to None or just not to specify

it at all, so GADMA will take it as 1.0 and after launch one can scale result values due to

found Theta0 (how to do it is described in the corresponding section).

10.2.1 Estimating Theta0

If µ— is the neutral mutation rate per site per generation and L— length of the sequence,

then:

θ0= 4 ∗µ∗L

Note: Lis the eﬀective sequenced length, which accounts for losses in alignment and

missed calls.

Note: µshould be estimated based on generation time. One can leave Time per generation

option in parameters ﬁle unspeciﬁed (then time on the model’s plots will be in the genetic

units), but don’t forget to recalculate µ!

For example (Gutenkunst et al, 2009 [1]):

We estimate the neutral mutation rate µusing the divergence between human and

chimp. Comparing aligned sequences in our data, we estimate the divergence to

be 1.13%. Assuming a divergence time of 6 million years and a mean generation

time for human and chimp over this interval of 25 years, we have

µ= 0.0113 ·25/(2 ·6·106) = 2.35 ×10−8per generation.

10.2.2 Changing Theta0

If GADMA was launched with one Theta0 and now one want to use another or if it was

launched with default Theta0 = 1.0 and now one have estimated its real value, model’s

parameters can be simply scaled:

Let a=θNEW

θOLD

•Size of population /a,

•Time /a,

•Migration rates ∗a,

•Split percent stay the same.

10.3 Examples of Theta0 and time for generation

The following tables produces diﬀerent possible values for the demographic model inference

for three populations of modern people: YRI, CEU, CHB.

FS ﬁlename Gen. time µ L Theta0

(years) (per site per gen.) (base pair) (per chr. per gen.)

YRI CEU CHB.fs 25 [1] 2.35 ·10−8[1] 4.04 ·106[1] 0.37976

YRI CEU CHB.fs 24 [3] 2.26 ·10−8[1] 4.04 ·106[1] 0.36521

YRI CEU CHB.fs 29 [2] 1.44 ·10−8[2] 4.04 ·106[1] 0.23270

YRI CEU CHB.fs 24 [3] 1.2·10−8[2] 4.04 ·106[1] 0.19392

Table 1: Examples of diﬀerent values of generation time and its inﬂuence on µand Theta0

In Gutenkunst et al. [1] generation time for human populations was equal to 25 years and

mutation rate µwas estimated as 2.35 ·10−8. If one want to change time for one generation

to 24 years, one need to scale µ:µ/25 ·24 = 2.26 ·10−8.

In Jouganous et al. [2] generation time was grater — 29 years and mutation rate was

equal to 1.44 ·10−8. To change generation time to 24, one need to change value of mutation

rate: µNEW =µ/29 ·24 = 1.2·10−8.Theta0 is calculated then by the formula above.

Note: One can ﬁnd FS ﬁles in fs_examples directory.

10.4 Relative to NAparameters

Sometimes it is more important to see parameters scaled to Nref =NA. To tell GADMA

show models with scaled parameters, option Relative parameters should be set to True.

By default, it is False. It is conveniently when Theta0 is unknown.

10.5 Disable migrations

GADMA can to exclude migrations rates from optimization and consider them be equal to

zero. In that case all migrations are disabled. One should set option No migrations to True

in the parameters ﬁle.

10.6 Upper bound for time split

To limit time of some split one should specify option in parameters ﬁle. Splits are numerated

from the most ancient. So split 1 is split that occurred with ancient population and split

2 is next division of second population (exist only for three populations). There are two

appropriate options: Upper bound of first split and Upper bound of second split.

One should translate time from years into genetic units, therefore divide it by 2 ·Tg,

where Tgis time (in years) for one generation. For example, one want to limit last split with

2000 years. Time for one generation is estimated as 24 years, then one should specify in

parameters ﬁle:

...

Upper bound of second split : 83.333

...

10.7 Resume launch

To resume interrupted launch one can use --resume command-line option or set Resume from

in the parmeters ﬁle. One need to set output directory of previous run.

If neither Output directory or -o/--output aren’t speciﬁed, GADMA will continue

evaluation in the directory: <previous_output_dir>_resumed.

10.7.1 Only models

GADMA can resume launch taking ﬁnal models only from previous run. This means, that

it is not resumption, but run from some initial values. It is usufull, for example, when one

have run GADMA with some small grid size for dadi and then want to restart it with greater

number of grid points. To do so, one should set command-line option --only_models with

--resume or specify Only models option in parameters ﬁle to True.

11 Extra parameters ﬁle

GADMA take extra parameters ﬁle as input. However one probably do not need

them. Nevertheless, if one is interested, extra_params_template with all options and their

descriptions can be found in gadma folder.

12 Units

GADMA shows model parameters in genetic units. To scale them from one should multiply

migration rates by 2 only. Other units are as usual. In case when option Relative parameters

is set to True one should ﬁrst rescale from units of Nref : sizes of populations and time —

multiply by NA, migration rates — divide by NA.

13 Decreasing number of model’s parameters

There are several ways to decrease number of demographic model’s parameters in GADMA.

Firstly, as it was discussed above, one can consider only sudden dynamics of population size

changes (Only sudden : True). Second way to decrease number of parameters is so-called

multinom scheme: when at every step NAis chosen to ﬁt the observed data best. This

scheme leads to faster work of GADMA, however it is more possible to get stuck in local

optima. To use it one should set option Multinom to True in parameters ﬁle. Third way is

to disable migration by setting No migrations to True in parameters ﬁle.

The last and more reliable way to decrease number of model’s parameters is to use

GADMA optimization on the custom model written with ∂a∂i or moments (section Speci-

fying model in details). However it requires some extra work for user.

14 Dependencies

The absolute dependencies for GADMA are:

•Python (2.5, 2.6, 2.7)

•NumPy (≥1.2.0)

•Scipy (≥0.6.0)

•∂a∂i (≥1.7.0) or moments (≥1.0.0)

To draw demographic models install:

•matplotlib (≥0.98.1)

•Pillow (≥4.2.1)

•moments (≥1.0.0)

15 Installation

Before GADMA installation one need either ∂a∂i or moments been installed.

15.1 ∂a∂i

To install ∂a∂i, go to the work directory and run:

$ git clone https :// bitbuc ket . org / g ute nku nst lab / dadi /

Then go to the dadi directory and install package:

$ cd dadi

$ python s etup . py install

To check ∂a∂i’s installation, type in python interpreter:

$ python

>>> import dadi

15.2 moments

To install moments, go to the work directory and run:

$ git clone https :// bitbuc ket . org / s imon grav el / moments /

Check that Cython is installed:

$ python -m pip install -- upgrade Cython

Then go to the moments directory and install package:

$ cd moments

$ python s etup . py install

To check moments’s installation, run in python interpreter:

$ python

>>> import moments

15.3 Pillow

Install Pillow throgh pip:

$ python -m pip install Pillow

To check Pillow’s installation, type in python interpreter:

$ python

>>> import PIL

15.4 GADMA

To download GADMA, go to the work directory and run:

$ git clone https :// github . com / ctlab / GADMA

Then go to the GADMA directory and install GADMA:

$ cd GADMA

$ python s etup . py install

Now one can run it like this:

$ gad ma -- help

15.4.1 Verifying installation

To verify installation run test:

$ gad ma -- test

If the installation is successful, one will ﬁnd the following information at the end:

-- Finish pipeline - -

-- Test passed correctly - -

Thank you for using GADMA !

Appendices

A Example of full parameters ﬁle

# It is parameters file for GADMA software .

# Lines started from # are ignored .

# Also comments at the end of a line are ignored too .

# Every line contains : Ide nti fic ato r of parameter : value .

# If one want to change some default parameters , one need to

# remove # at the begining of line and change corresponding

# parameter.

# Output directory to write all GADMA out .

# One need to set it to not existing or empty directory .

# If it is resumed from other directory and output directory

# isn ’t set , GADMA will add ’_resumed ’ for previous output

# directory.

Output directory : my_ex ample_r un

# One can resume from some other launch of GADMA by setting

# output directory of it to ’Resume from ’ parameter .

# You can set again new parameters of resumed launch .

# Resume from : ano ther_ outp ut_d ir

# If you want to take only models from previous run set this

# flag . Then iterations of GA will start from 0 and values of

# mutation rate and strength will be initial .

# Default : None

# Only models : None

# Input file can be sfs file ( should end with .fs) or

# file of SNP ’s in dadi format ( should end with .txt ).

In put fil e : f s_ ex am pl es / Y RI_CEU . fs

# ’ Population labels ’ is sequence of population names (the same

# names as in input file )

# If .fs file is in old format then it would rename population

# labels that are absent .

# It is nec essary to put them in order from most ancient to less .

# (In case of more than two populations )

# It is important , because the last of formed populations take

# part in next split .

# For example , if we have YRI - African population ,

# CEU - European population and CHB - Chineese populaion ,

# then we can write YRI , CEU , CHB or YRI , CHB , CEU

# ( YRI must be at the first pl ace )

# Default : from input file

Population labels : CEU , YRI # we change populations order

# (in input file YRI is first )

# Also one can project your spectrum to less size .

# For example , we have 80 individuals in each of three

# populations , then spectrum will be 81 x81x81 and one can

# project it to 21 x21x21 by set ’Projections ’ parameter

# to 20, 20, 20.

# Default : from input file

Projections : None # will be 20, 20

# Now all main parameters :

# Total mutation flux - theta .

# It is equal to:

# theta = 4 * mu * L

# where mu - mutation rate per site per generation and

# L - effective sequenced length , which accounts for losses

# in alignment and missed calls .

# Note : one should e stima te mu based on gener ati on time .

# Default : 1.0

Theta0 : 0.37976 # the same as in Gutenkunst et al 2009

# Time ( years ) for one generation . Can be float .

# Is important for drawing models . If one don ’t want to draw ,

# one can pass it.

# Default : 1.0

Time for generation : 25 # the same as in Gutenkunst et al 2009

# Parameters for demographic models :

# Use moments or dadi

# Default : dadi

Use moments or dadi : moments

# Use multinom scheme : N_A is not parameter for search ,

# it is calculated through optimal_sfs_scaling .

# Multinom scheme decrease number of parameter by one and

# is usually faster , however non multinom scheme usually

# finds better solutions .

# Default : False

Multinom : False

# If you choose to use dadi , please set pts parameter - number

# of points in grid

# Default : Let n = max number of individuals in one population ,

# then pts = n , n +10 , n +20

# Pts : 20, 30 , 40

# Structure of model for one population - number of time periods

# (e.g. 5).

# Structure of model for two populations - number of time periods

# before split of ancestral population and after it (e.g. 2,2).

# Structure of model for three populations - number of time periods

# before first split , between first and second splits and after

# second split (e.g. 2 ,1 ,2).

# St ructure of initial model :

# Default : all is ones - 1 or 1 ,1 or 1 ,1,1

Initial stru ctur e : 2 ,1

# St ructure of final model :

# Default : equals to initial structure

# Final structur e : 2 ,2

# It is possible to limit time of splits .

# Split 1 is the most ancient split .

# ! Note that time is in genetic units (2 * time for 1 g ene rat ion ):

# e.g. we want to limit by 150 kya , time for one generation is

# 25 years , then bound will be 150000 / (2*25) = 3000.

# Upper bound for split 1 (in case of 2 or 3 populations ).

# Default : None

# Upper bound of first split : None

# Upper bound for split 2 (in case of 3 populations ).

# Default : None

# Upper bound of second split : None

# Print parameters of model in units of N_ref = N_A .

# N_A will be placed in brackets at the end of string .

# Default : False

Relative parameters : False

# Disable migrations in demograph ic models .

# Default : False

No migrations : false

# Parameters for Genetic Algorithm .

# Size of population of demographic models in GA:

# Default : 10

Size of population in GA : 10

# Fractions of current models , mutated models and crossed models

# to be taken to new population .

# Sum of fractions should be <= 1, the remaining fraction is

# fraction of random models .

# Default : 0.2 ,0.3 ,0.3

# Fractions in GA : 0.2 ,0.3 ,0.3

# Mutation strength - fraction of parameters in model to mutate

# during global mutation process of model .

# Number of parameters to mutate is sampled from binomial

# distribution , so we need to set mean .

# Default : 0.2

Mean mutation strength : 0.3

# Mutation strength can be adaptive : if mutation is good ,

# i.e. has the best fitness function ( log likelihood ),

# then mutation strength is increased multiplying by const

# otherwise it decreases dividing by (1/4)^ const .

# When const is 1.0 it is not adaptive .

# Default : 1.0

Const for mutation strength : 1.05

# Mutation rate - fraction of any parameter to change during

# its mutation .

# Mutation rate is sampled from truncated normal distribution ,

# so we need mean (std can be specified in extra params ).

# Default 0.2

Mean mutation rate : 0.1

# Mutation rate also can be adaptive as mutation strength .

# Default : 1.02

Const for mutation rate : 1.01 # very small changes

# Genetic algorithm stops when it couldn ’t improve model by

# more that epsilon in logLL

# Default : 1e -2

Epsilon : 1e -2

# and it happens during N iterations :

# Default : 100

Stop iteration : 50

# Local optimization .

# Choice of local optimization , that is launched after

# each genetic algorithm .

# Choices:

# * optimize ( BFGS method )

# * opt imi za_l og ( BFGS method )

# * opti mize_p owell ( Powell ’s co nju gat e d ire cti on method )

# ( Note : is implemented in moments : one need to have moments

# installed .)

# (If optimizations are often hitting the parameter bounds ,

# try using these methods :)

# * optimize_lbfgsb

# * optimize_log_lbfgsb

# ( Note that it is probably best to start with the vanilla BFGS

# methods , because the L -BFGS -B methods will always try parameter

# values at the bounds during the search .

# This can dramatically slow model fitting .)

# * opti mize _log _fmi n ( simplex (a.k .a. amoeba ) method )

# * hill_climbing

# Default : optimize_powell

Name of local optimization : optimize_log

# Parameters of pipeline

# One can automatically draw models every N iteration .

# If 0 then never .

# Pictures are saved in GA ’s directory in picture folder .

# Default : 0

Draw models every N iteration : 100

# One can automatically generate dadi and moments code for models .

# If 0 then only current best model will be printed in GA ’s

# working directory .

# Also result model will be saved there .

# If specified ( not 0) then every N iteration model will be saved

# in python code folder .

# Default : 0

Print models ’ code every N iteration : 100

# One can choose time units in models ’ plots : years or thousand

# years (kya , KYA ). If time for one generation isn ’t specified

# then time is in genetic units .

# Default : years

Units of time in drawing : thousand years

# No std output .

# Default : False

Silence : False

# How many times launch GADMA with this parameters .

# Default : 1

Number of repeats : 3

# How many processes to use for this repeats .

# Note that one repeat isn ’t parallelized , so increasing number

# of processes doesn ’t effect on time of one repeat .

# It is desirable that the number of repeats is a multiple of

# the number of processes .

# Default : 1

Number of processes : 3

B Example of generated code

For example, moments generated code for 2d AFS for YRI, CEU populations from [1]:

# current best params = [7194.792822462478 , 13410.251542201073 , 0.9582544565961783 , 13542.979276844108 , 12114.968575519626 , 2683.3787253409746 , 846.6668954957415 , 0.00014172779289593632 , 0.00012195685425105608]

import matplotlib

matplo tlib . use (" Agg " )

import moments

import numpy as np

def g ene ra ted _m ode l ( params , ns ):

Ns = params [:5]

Ts = params [5:7]

Ms = params [7:]

theta1 = 0.37976

sts = moments . LinearSyste m_1D . steady_sta te_1D ( sum (ns ),

th eta = theta1 , N = Ns [0])

fs = moments . Spectrum ( sts )

bef ore = [ Ns [ 0]]

T = Ts [0]

after = Ns [1:2]

growth_funcs = [ lambda t: after [0]]

list_growth_funcs = lambda t: [ f (t ) for f in growth_funcs ]

fs . i nt eg ra te ( Npop = lis t_gr owt h_f uncs , tf =T , d t_fac =0.1 ,

theta = theta1 )

before = after

fs = moments . Manips . split_1D_to_2D ( fs , ns [0] , sum ( ns [1:]))

before . append ((1 - Ns [2]) * before [ -1])

before [ -2] *= Ns [2]

T = Ts [1]

after = Ns [3:5]

growth_funcs = [ lambda t: after [0] ,

lambda t : before [1] * ( after [1] / before [1]) ** (t / T)]

list_growth_funcs = lambda t: [ f (t ) for f in growth_funcs ]

m = np . array ([[0 , params [7]] ,[ params [8] , 0]])

fs . i nt eg ra te ( Npop = lis t_gr owt h_f uncs , tf =T , m =m , dt _fac =0.1 ,

theta = theta1 )

before = after

return fs

data = m oments . Spec trum . f ro m_f il e ( ’ data / YRI_CEU . fs ’ )

popt = [7194.792822462478, 13410.251542201073, 0.9582544565961783,

13542.979276844108, 12114.968575519626, 2683.3787253409746,

846.6668954957415 , 0.00014172779289593632 , 0 .0001 21956 85425 105608]

ns = [20 , 20]

mo del = ge ne ra ted _m od el ( popt , ns )

ll_mode l = moments . Inf ere nce . ll ( model , data )

pr int ( ’ M ode l log li ke li ho od ( LL ( model , dat a )): {0} ’ . form at ( ll_ model ))

# now we need to norm vector of params so that first value is 1:

popt_norm = [1.0, 1.8638829321580523, 0.9582544565961783,

1.8823306815120924, 1.6838523185401713, 0.3729612223111333,

0.1176777311575127, 1.0197021070711312, 0.8774542996156008]

pr int ( ’ Drawing mod el to m od el _f ro m_ GA DM A_ fr om _s im pl e . png ’)

model = moments . ModelPlot . generate_model ( generated_model , popt_norm , ns )

moments . ModelPlot . plot_model ( model ,

save _fi le = ’ mo del _f ro m_ GA DM A_ fr om _si mp le . png ’ ,

fig_ tit le = ’ ’ ,

pop_labels =[ ’YRI ’, ’ CEU ’],

nref =7194 ,

dra w_ sc al e = True ,

gen_time =0.025 ,

ge n_ ti me_ un it s = " T hou sand yea rs " ,

reverse_timeline=True)

References

[1] Ryan N Gutenkunst, Ryan D Hernandez, Scott H Williamson, and Carlos D Bustamante,

Inferring the joint demographic history of multiple populations from multidimensional snp

frequency data, PLoS genetics 5(2009), no. 10, e1000695.

[2] Julien Jouganous, Will Long, Aaron P. Ragsdale, and Simon Gravel, Inferring the joint

demographic history of multiple populations: Beyond the diﬀusion approximation, Genet-

ics 206 (2017), no. 3, 1549–1567.

[3] Marguerite Lapierre, Amaury Lambert, and Guillaume Achaz, Accuracy of demographic

inferences from the site frequency spectrum: the case of the yoruba population, Genetics

(2017), genetics–116.

Manual

Navigation menu

Versions of this User Manual:

Views

Navigation