Manual
User Manual:
Open the PDF directly: View PDF .
Page Count: 31
Download | |
Open PDF In Browser | View PDF |
GADMA user manual Corresponding to version 1.0.0 Ekaterina Noskova Contents 1 Introduction 3 2 Getting help 3 3 Hands on 3 4 Input data 4.1 VCF data format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Frequency spectrum file format . . . . . . . . . . . . . . . . . . . . . . . . . 4.3 SNP data format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 6 7 7 5 Specifying scheme 8 6 Specifying a model 6.1 Dynamics of population size change 6.2 Specifying structure of the model . 6.2.1 Initial structure . . . . . . . 6.2.2 Final structure . . . . . . . 6.3 Specifying model in details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Several repeats and parallel computing 8 Output 8.1 Stdout and log file . . . . 8.2 Model representation . . . 8.3 Output directory content . 8.4 Generated code of models 8 8 9 10 10 11 12 . . . . 12 13 13 15 15 9 Plotting model 9.1 Time units on model’s plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.2 Plotting after GADMA finished . . . . . . . . . . . . . . . . . . . . . . . . . 16 16 17 . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Usefull options 10.1 Time for generation . . . . . . . . . . . . . . 10.2 Theta0 . . . . . . . . . . . . . . . . . . . . . 10.2.1 Estimating Theta0 . . . . . . . . . . 10.2.2 Changing Theta0 . . . . . . . . . . . 10.3 Examples of Theta0 and time for generation 10.4 Relative to NA parameters . . . . . . . . . . 10.5 Disable migrations . . . . . . . . . . . . . . 10.6 Upper bound for time split . . . . . . . . . . 10.7 Resume launch . . . . . . . . . . . . . . . . 10.7.1 Only models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 17 18 18 18 19 19 19 19 20 20 11 Extra parameters file 20 12 Units 20 13 Decreasing number of model’s parameters 20 14 Dependencies 22 15 Installation 15.1 ∂a∂i . . . . . . 15.2 moments . . . . 15.3 Pillow . . . . . 15.4 GADMA . . . . 15.4.1 Verifying 22 22 22 23 23 23 . . . . . . . . . . . . . . . . . . . . . . . . . . . . installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Appendices 24 A Example of full parameters file 24 B Example of generated code 30 2 1 Introduction Welcome to GADMA! GADMA implements methods for automatic inferring joint demographic history of multiple populations from the genetic data. GADMA is based on two open source packages: the ∂a∂i developed by Ryan Gutenkunst [https://bitbucket.org/gutenkunstlab/dadi/] and the moments developed by Simon Gravel [https://bitbucket.org/simongravel/moments/]. In contrast, GADMA is a command-line tool. It presents a series of launches of the genetic algorithm and infer demographic history from Allele Frequency Spectrum of multiple populations (up to three). 2 Getting help Please don’t be afraid to contact me for different problems and offers via email ekaterina.e.noskova@gmail.com. I will be glad to answer all questions. 3 Hands on Test case GADMA has test case for simple demographic model for 1 population: just constant size of 10000 individuals in population. To run test case type: $ gadma -- test Example 2 Suppose we have SNP data for two populations. Data is in ∂a∂i’s SNP format in file snp_data.txt. Suppose we want to get all output in some gadma_output directory: $ gadma -i snp_data . txt -o gadma_output Example 3 We didn’t specify AFS size or labels for populations, they are taken automaticaly from the input file.We can see parameters file of our run in gadma_output/params file. We see that there are # gadma_output / params ... Population labels : pop_name_1 , pop_name_2 Projections : 18 , 20 ... But we know that spectrum should be 20 × 20! To specify size of AFS we need to create parameters file and set Projections: 3 # params_file Projections : 20 ,20 Also we can change order of populations. We should add: # params_file Projections : 20 ,20 Population labels : pop_name_2 , pop_name_1 If we want to rename populations, we should change names in snp_data.txt file. Now assume we want to get the simplest demographic model that we can as faster as we can. We will tell GADMA that we don’t need no other dynamics of population sizes except sudden (constant) population size change and that we want to use moments library. We add corresponding string to parameters file and now it looks like: # params_file Projections : 20 ,20 Population labels : pop_name_2 , pop_name_1 Only sudden : True Use moments or dadi : moments To run GADMA we need to specify -p/--params command-line option in cmd: $ gadma -i snp_data . txt -o gadma_output -p params_file Example 4 Consider some AFS file fs_data.fs. There is spectrum for three populations: YRI, CEU, CHB. However axes are mixed up: CHB, YRI, CEU. To run GADMA we should order them from most ancient to last formed: # params_file Population labels : YRI , CEU , CHB We want to allow exponential growth (it is default behaviour) and have some extra change in size of the ancient population. To do so we should specify Initial structure. It is list of three numbers: first — number of time intervals before first split (we want here 2); second — number of time periods betseen forst and second split events (at least 1); third — number of time periods after second split. # params_file Population labels : YRI , CEU , CHB Initial structure : 2 ,1 ,1 Also we can put information about input file and output directory to our parameters file: # params_file Input file : fs_data . fs Output directory : gadma_output Population labels : YRI , CEU , CHB Initial structure : 2 ,1 ,1 Now we can run GADMA the following way: 4 $ gadma -p params Example 5 We have our GADMA launch interrupted for some reason. We want to resume it: $ gadma -- resume gadma_output where gadma_output is output directory of previous run. We can find resumed run in gadma_output_resumed Example 6 Our launch was finished, we used ∂a∂i with default grid size which GADMA determines automatically if it is not specify by user. We found out that it would be better to find some models using greater number of grid points in ∂a∂i scheme, but we want to take final models from previous run: # params_file Pts : 40 , 50 , 60 # Greater value of grid size than it was And run GADMA: $ gadma -- resume gadma_output -- only_models -p params --only_models tell GADMA to take from gadma_output final models only. There is another way to do the same: # params_file Resume from : gadma_output Only models : True Pts : 40 , 50 , 60 # Greater value of grid size than it was And run GADMA the following way: $ gadma -p params Example YRI, CEU GADMA has example of full parameters file example_params, which is preseted here too at the end of the manual. To run GADMA with this parameters one should just run from the GADMA’s home directory: $ gadma -p example_params 4 Input data GADMA handles two formats of input files: • Frequency spectrum file format (.fs or .sfs), 5 • SNP data format (.txt). These data formats should be familiar to one who was used ∂a∂i or moments. Input file can be specified to GADMA with two ways: 1. Through command-line option -i/--input: $ gadma -i fs_file . fs -o out_dir or $ gadma -- input snp_file . txt -o out_dir 2. In parameters file params_file: # params_file # Input file path Input file : fs_file . fs ... Extra information about input AFS can also be put in parameters file. For example, AFS can be projected to smaller size with Projections option, populations can be named or their order can be changed with Population labels option. If parameters file does not contain some options, they are automaticaly pulled out from the input file. # params_file # Input file path Input file : fs_file . fs # ( New ) size of the AFS Projections : 20 ,20 # Labels of populations Population labels : CEU , YRI ... GADMA can be launched with parameters file the following way: $ gadma -p params_file -o out_dir 4.1 VCF data format If one have .vcf file, it can be converted it into .sfs file using https://github.com/isaacovercast/ easySFS. 6 4.2 Frequency spectrum file format Each file begins with any number of comment lines beginning with #. The first non-comment line contains P integers giving the dimensions of the FS array, where P is the number of populations represented. For a FS representing data from 4x4x2 samples, this would be 5 5 3. (Each dimension is one larger than the number of samples, because the number of observations can range, for example, from 0 to 4 if there are 4 samples, for a total of 5 possibilities.) On the same line, the string folded or unfolded denoting whether or not the stored FS is folded. The actual data is stored in a single line listing all the FS elements separated by spaces, in the order fs[0,0,0] fs[0,0,1] fs[0,0,2]. . . fs[0,1,0] fs[0,1,1]. . . . This is followed by a single line giving the elements of the mask in the same order as the data, with 1 indicating masked and 0 indicating unmasked. 4.3 SNP data format Human Chimp Allele1 YRI CEU Allele2 ACG ATG C 29 24 T CCT CCT C 29 23 G Listing 1: Example of SNP YRI CEU Gene Position 1 0 abcb1 289 3 2 abcb1 345 file format The data file begins with any number of comment lines that being with #. The first parsed line is a column header line. Whitespace is used to separate entries within the table, so no spaces are allowed within any entry. Individual rows make be commented out using #. The first column contains the in-group reference sequence at that SNP, including the flanking bases. If the flanking bases are unknown, they can be denoted by -. The header label is arbitrary. The second column contains the aligned outgroup reference sequence at that SNP, including the flanking bases. Unknown entries can be denoted by -. The header label is arbitrary. The third column gives the first segregating allele. The column header must be exactly Allele1. Then follows an arbitrary number of columns, one for each population, each giving the number of times Allele1 was observed in that population. The header for each column should be the population identifier. The next column gives the second segregating allele. The column header must be exactly Allele2. Then follows one column for each population, each giving the number of times Allele2 was observed in that population. The header for each column should be the population identifier, and the columns should be in the same order as for the Allele1 entries. Then follows an arbitrary number of columns which will be concatenated with _ to assign a label for each SNP. The Allele1 and Allele2 headers must be exactly those values because the number of 7 columns between those two is used to infer the number of populations in the file. 5 Specifying scheme GADMA use either ∂a∂i or moments to simulate expected AFS from the demographic model. By default ∂a∂i is used. To use ∂a∂i it is recommended to check value of the Pts option in parameters file. Pts is sequence of three numbers, each of which is equal to the number of points in grid size. The greater numbers are, the more accurate ∂a∂i numerical solution of partial differential equation is, however, it would take more time. By default, GADMA takes Pts : n, n + 10, n + 20, where n — is the maximum sample size among populations of interest. moments library does not need Pts to be specified. To change ∂a∂i to moments scheme, specify option in parameters file: # params file ... Use moments or dadi : moments ... 6 6.1 Specifying a model Dynamics of population size change Sudden Figure 1: Three main demographic dynamics of population size change. In GADMA size of population can be changed due to one of three dynamics: sudden change, linear change and exponential change of the effective population size (Figure 1). In order to infer demographic model with sudden changes of populations sizes only, option Only sudden in parameters file should be set to True: # params_file ... Only sudden : True ... By default, this option is set to False and dynamics are found like other parameters of the demographic model. 8 6.2 Specifying structure of the model GADMA infers a demographic model from an AFS with nothing required from the user, except simple information, that determines how much the detailed model is required — the structure of the model. Assume a division of one population into two new subpopulations and fixed temporal order of the populations: from the ancient to the most recently formed population. We can divide time of our model into splits events and time intervals, during which a certain dynamics of change of effective size is maintained for each population and migration rates are constant. The number of split events is one less than the number of populations under consideration. Now we can define the concept of the structure of the demographic model: Structure of the demographic model is: • the number of time intervals in case of one population; • the number of time intervals as those that occur before and after a single split, in the case of two populations; • the number of time intervals prior to the first split, those between the first and second split, and the ones after the second split, in the case of three populations. For example, we can divide time of the model on the Figure 2 to 4 time intervals: T1 , T2 , T3 and T4 , and two populations’ splits: S1 and S2 . The structure of this model is 2, 1, 1, because two intervals (T1 as T2 ) before first split S1 , one interval (T3 ) between first and second splits and one interval (T4 ) after second split S2 . Figure 2: Example of representation of demographic model. Time is on the axis of abscissa and population size is on the axis of ordinates. The structure of that model is (2,1,1). Here colors are just a formality and are used for highlighting different demographic dynamics. 9 6.2.1 Initial structure To specify Structure of the inferred model one should set Initial structure in the parameters file: # params file ... Initial structure : 2 ... or # params file ... Initial structure : 2 ,1 ... or # params file ... Initial structure : 2 ,1 ,1 ... By default the simplest structure is used (1 or 1,1 or 1,1,1). 6.2.2 Final structure It is also possible to start with more simple structure in order to get more complex. To do so one should specify option Final structure in the parameters file. For example: # params file Input file : some_2d_fs . fs Initial structure : 1 ,1 Final structure : 2 ,1 ... During this parameters file GADMA will find parameters for demographic model with 1,1 structure, then increase structure for 2,1 and find parameters for the model with this structure. Found parameters for more simple structure (in this case it is 1,1) are used in search of the parameters of more complex structure (2,1). Note: Structure increases random, so if one specify initial structure to 1,1 and final to 2,2, it is not garanteed to finad optimal parameters for demographic models with structures between 1,1 and 2,2, i.e. intermediate state can be either 1,2 or 2,1. Recommendation: Use scheme with increase of the structure, as it produce more stable solutions. Recommendation: Choose not very large values for model’s structure. Final structure is recommended to be differ by one element from initial structure, for example, 1,1 and 2,1; 1,2,1 and 2,2,1. 10 6.3 Specifying model in details It is also possible to use Genetic algorithm from GADMA to a proposed model that is defined as Python function using ∂a∂i or moments. It is the way that ∂a∂i and moments work with demographic models inference. To understand how to specify model like that one can read manuals to the packages. For example, consider simple bottleneck model for one population: at time TF + TB in the past, an equilibrium population goes through a bottleneck of depth nuB, recovering to relative size nuF: def bottleneck ( params , ns , pts ): nuB , nuF , TB , TF = params xx = dadi . Numerics . default_grid ( pts ) phi = dadi . PhiManip . phi_1D ( xx ) phi = dadi . Integration . one_pop ( phi , xx , TB , nuB ) phi = Integration . one_pop ( phi , xx , TF , nuF ) fs = dadi . Spectrum . from_phi ( phi , ns , ( xx ,)) return fs To run optimization from GADMA one need to run optimization function just like in ∂a∂i and moments: # Import GADMA ’s optimization function : import gadma # Specify input data and its parameters : data = dadi . Spectrum . from_file ( " fs_filename . fs " ) ns = data . sample_sizes # size of AFS pts = [40 ,50 ,60] # grid size for dadi # Wrap our bottleneck function : func_ex = dadi . Numerics . make_extrap_log_func ( bottleneck ) # Specify upper and lower bounds for parameters : upper_bound = [100 , 100 , 3 , 3] lower_bound = [1 e -2 , 1e -2 , 0 , 0] # Run optimizations : # Beginning GADMA optimization popt = gadma . Inference . optimize_ga (4 , data , func_ex , pts_l , lower_bound = lower_bound , upper_bound = upper_bound ) # Beginning local optimization from dadi popt = dadi . Inference . optimize_log ( popt , data , func_ex , pts_l , lower_bound = lower_bound , 11 upper_bound = upper_bound , verbose = len ( p0 ) , maxiter =3) print ( ’ Found parameters : {0} ’. format ( popt )) Note: No initial parameters p0 are set in gadma.Inference.optimize_ga function as GADMA optimization is global search algorithm. However it is possible to specify p0: # Initial parameters can be set too : p0 = [0.01 , 1.5 , 0.2 , 0.2] # Beginning GADMA optimization popt = gadma . Inference . optimize_ga (4 , data , func_ex , pts_l , p0 = p0 lower_bound = lower_bound , upper_bound = upper_bound ) If one want to find other parameters fo gadma.Inference.optimize_ga function: >>> import gadma >>> help ( gadma . Inference . optimize_ga ) 7 Several repeats and parallel computing By default GADMA run optimization that uses Genetic algorithm once. However it is recomended to run optimization several times and get best model among the result ones. Option Number of repeats in parameters file tell GADMA to execute optimization the necessary number of times. Moreover there are another option Number of processes allows GADMA to parallelize those runs. Note: GADMA parallelize exactly several runs, not every of them. So if one ask to repeat optimization 2 times and parallelize in more than 2 processes, only two processes will be used eventually. Recomendation: Number of processes should be lesser than Number of repeats and it will be better if it is aliquot to the Number of repeats. Note: Number of processes shouldn’t be greater than number of kernels on one’s computer. Otherwise there will be no sence in parallelization. # params file ... Number of repeats : 6 Number of processes : 2 # or 3 or 6 ... 8 Output GADMA put all files to the directory that user set through -o/--output command-line option: 12 $ gadma -o output_dir -i input_fs . fs or through Output directory option in parameters file: # params file Output directory : output_dir ... 8.1 Stdout and log file GADMA prints its progress about every minute in stdout and in output_dir/GADMA.log file: [ hhh : mm : ss ] All best logLL models : GA number logLL BIC Model All best BIC models : GA number logLL BIC Model -- Best model by log likelihood - Log likelihood : Best logLL with BIC score : BIC_score Model : representation of best logLL model -- Best model by BIC score - Log likelihood : logLL with BIC score : Best BIC score Model : representation of best BIC model Note: One can set Silence option in parameters file to True to disable output in stdout, file output_dir/GADMA.log will still have it. 8.2 Model representation Every model is printed as a line of parameters. All model parameters, except mutation rates, have precision equals to 2. Consider designations: • T — time, • P — percent of split, • Ni — size of population number i, • di — dynamic of changing of the size of population number i, • mij — mutation rate from population i to population j. 13 Dynamic of population size change has numerical values: • di = 0 — sudden change of size for population number i; • di = 1 — linear change of size for population number i; • di = 2 — exponential change of size for population number i; Model is printed as sequence of time intervals and splits that are represented the following way: • First period (NA — size of ancestry population): [ NA ] • Split: – If there are one population before split event, so there will be two populations after it: [ P%, [N1, N2] ] – If there are two population before split event, so last formed population is divided and there will be three populations after it: [ P%, [N1, N2, N3] ] • Usual time period: – If there is one population: [ T, [N1], [d1] ] – If there are two populations: [ T, [N1, N2], [d1, d2], [[None, m12], [m21, None]] ] – If there are three populations: [ T, [N1, N2, N3], [d1, d2, d3], [[None, m12, m13], [m21, None, m23], [m31, m32, None]] ] At the end of the string, that corresponds to the model, there is information about model’s ancestry in the genetic algorithm: • c — for model, that is child of crossover, • r — if it was formed random way, • m — if it was mutated, • f — final model of genetic algorithm. Note: m is added as many times as model was mutated. Example of the demographic model for two populations: [ [144.38] ][ 16.00%, [23.10, 121.28] ][ 375.77, [143.31, 30.07], [2, 2], [[None, 3.33e-03][7.53e-04, None]] ] 14 8.3 Output directory content For every repeat of Genetic algorithm GADMA create new folder in output directory with correspond number. In every folder there is GADMA_GA.log, where every iteration of algorithm is saved, folders pictures and python_code. When genetic algorithm finishes GADMA saves picture and python code of result model in the corresponding folder. When all GA finishes picture and code of best model among them are saved in root directory. - < output_dir > - 1 GADMA_GA . log - pictures - python_code - dadi - moments result_model . png result_model_code . py - 2 GADMA_GA . log - pictures - python_code - dadi - moments result_model . png result_model_code . py params extra_params GADMA . log best_logLL_model . png best_logLL_model_dadi_code . py best_logLL_model_moments_code . py best_bic_model . png best_bic_model_dadi_code . py best_bic_model_moments_code . py 8.4 Generated code of models By default GADMA generates python code only for final models both for ∂a∂i and moments. However it can do it every N iteration of genetic algorithm also. In this case option should be set in the parameters file. GADMA save files with code to the output_dir//python_code directory. Both ∂a∂i and moments code are generated and saved in different folders there. Each code contain function of the model, which takes values of the parameters as input, and strings that load observed AFS, simulate expected AFS from the model’s function and 15 calculate log likelihood between two AFS’. The result log likelihood is printed to stdout. For the moments code picture is also draw. All code can be run the following way: $ python file_with_code . py Example of generated code one can find at the end of this manual in the corresponding section. 9 Plotting model Figure 3: Exaple of demographic model plot that GADMA draw during run. GADMA always draws final best models during genetic algorithms search for best solutions. However models can be drawing during the pipeline every N ’s iteration of each genetic algorithm. N is equal to the value of Draw models every N iteration option in parameters file. So to enable drawing one should set this option in file. Recommendation: Don’t draw models very often, because changes are usually not very significant and drawing takes some time, so optimization will be slower. Note: One can disable drawing by setting Draw models every N iteration : 0 in the parameters file. This is also default behaviour. Models are drawn with moments library, so it should be installed if one want to have pictures. In the top left corner there is size of ancestry population. Other parameters one can find in string representation of the model in the log files. 9.1 Time units on model’s plot Time on the demographic model’s plot can be drawn in units of years, thousand years or in genetic units. By default choice depends on the Time for generation option in the parameters file: if it is set to some value (in years) then time will be shown in years, otherwise it will be in genetic units. But, of course, it is possible to tell GADMA what units are preferable. For example, if one wants time to be in thousand of years on the pictures, as it is big values in years: 16 # params file ... Units of time in drawing : thousand years ... 9.2 Plotting after GADMA finished Figure 4: Example of model’s plot that was drawn with generated python script. Sometimes final pictures aren’t satisfying or one didn’t draw for some reasons (don’t want to install moments, want fast launch) it is possible to plot demographic model (and only it) again. To do it one should run corresponding generated python script for the model. For example, final model can be draw again the following way: $ python best_logll_model_moments_code . py Note: Again it’s possible only if one have moments installed. Note: One can change code inside file and draw again if picture isn’t satisfying. 10 10.1 Usefull options Time for generation Option Time for generation in parameters file is corresponding to the time per one generation in Wright-Fisher model. It is responsible basically for one thing: time on the model’s plots. If it is specified, then time on the pictures will be converted from genetic units by scaling with this value. Otherwise, if it is not set, time will be shown in genetic units. Note: If Time for generation is specified, it should be consistent with another option: Theta0, which is described in the corresponding section. 17 10.2 Theta0 Theta0 is equal to the expected number of mutation that occur in one chromosome in one generation in the infinite-sited model. GADMA can scale all values of demographic model parameters due to known value of Theta0. However, it is not always possible to find it. There is a way to solve this problem: one can to set Theta0 to None or just not to specify it at all, so GADMA will take it as 1.0 and after launch one can scale result values due to found Theta0 (how to do it is described in the corresponding section). 10.2.1 Estimating Theta0 If µ — is the neutral mutation rate per site per generation and L — length of the sequence, then: θ0 = 4 ∗ µ ∗ L Note: L is the effective sequenced length, which accounts for losses in alignment and missed calls. Note: µ should be estimated based on generation time. One can leave Time per generation option in parameters file unspecified (then time on the model’s plots will be in the genetic units), but don’t forget to recalculate µ! For example (Gutenkunst et al, 2009 [1]): We estimate the neutral mutation rate µ using the divergence between human and chimp. Comparing aligned sequences in our data, we estimate the divergence to be 1.13%. Assuming a divergence time of 6 million years and a mean generation time for human and chimp over this interval of 25 years, we have µ = 0.0113 · 25/(2 · 6 · 106 ) = 2.35 × 10−8 per generation. 10.2.2 Changing Theta0 If GADMA was launched with one Theta0 and now one want to use another or if it was launched with default Theta0 = 1.0 and now one have estimated its real value, model’s parameters can be simply scaled: θN EW Let a = 0OLD , θ0 • Size of population /a, • Time /a, • Migration rates ∗a, • Split percent stay the same. 18 10.3 Examples of Theta0 and time for generation The following tables produces different possible values for the demographic model inference for three populations of modern people: YRI, CEU, CHB. FS filename YRI YRI YRI YRI CEU CEU CEU CEU CHB.fs CHB.fs CHB.fs CHB.fs Gen. time (years) 25 [1] 24 [3] 29 [2] 24 [3] µ (per site per gen.) 2.35 · 10−8 [1] 2.26 · 10−8 [1] 1.44 · 10−8 [2] 1.2 · 10−8 [2] L (base pair) 4.04 · 106 [1] 4.04 · 106 [1] 4.04 · 106 [1] 4.04 · 106 [1] Theta0 (per chr. per gen.) 0.37976 0.36521 0.23270 0.19392 Table 1: Examples of different values of generation time and its influence on µ and Theta0 In Gutenkunst et al. [1] generation time for human populations was equal to 25 years and mutation rate µ was estimated as 2.35 · 10−8 . If one want to change time for one generation to 24 years, one need to scale µ: µ/25 · 24 = 2.26 · 10−8 . In Jouganous et al. [2] generation time was grater — 29 years and mutation rate was equal to 1.44 · 10−8 . To change generation time to 24, one need to change value of mutation rate: µN EW = µ/29 · 24 = 1.2 · 10−8 . Theta0 is calculated then by the formula above. Note: One can find FS files in fs_examples directory. 10.4 Relative to NA parameters Sometimes it is more important to see parameters scaled to Nref = NA . To tell GADMA show models with scaled parameters, option Relative parameters should be set to True. By default, it is False. It is conveniently when Theta0 is unknown. 10.5 Disable migrations GADMA can to exclude migrations rates from optimization and consider them be equal to zero. In that case all migrations are disabled. One should set option No migrations to True in the parameters file. 10.6 Upper bound for time split To limit time of some split one should specify option in parameters file. Splits are numerated from the most ancient. So split 1 is split that occurred with ancient population and split 2 is next division of second population (exist only for three populations). There are two appropriate options: Upper bound of first split and Upper bound of second split. One should translate time from years into genetic units, therefore divide it by 2 · Tg , where Tg is time (in years) for one generation. For example, one want to limit last split with 2000 years. Time for one generation is estimated as 24 years, then one should specify in parameters file: 19 ... Upper bound of second split : 83.333 ... 10.7 Resume launch To resume interrupted launch one can use --resume command-line option or set Resume from in the parmeters file. One need to set output directory of previous run. If neither Output directory or -o/--output aren’t specified, GADMA will continue evaluation in the directory: _resumed. 10.7.1 Only models GADMA can resume launch taking final models only from previous run. This means, that it is not resumption, but run from some initial values. It is usufull, for example, when one have run GADMA with some small grid size for dadi and then want to restart it with greater number of grid points. To do so, one should set command-line option --only_models with --resume or specify Only models option in parameters file to True. 11 Extra parameters file GADMA take extra parameters file as input. However one probably do not need them. Nevertheless, if one is interested, extra_params_template with all options and their descriptions can be found in gadma folder. 12 Units GADMA shows model parameters in genetic units. To scale them from one should multiply migration rates by 2 only. Other units are as usual. In case when option Relative parameters is set to True one should first rescale from units of Nref : sizes of populations and time — multiply by NA, migration rates — divide by NA. 13 Decreasing number of model’s parameters There are several ways to decrease number of demographic model’s parameters in GADMA. Firstly, as it was discussed above, one can consider only sudden dynamics of population size changes (Only sudden : True). Second way to decrease number of parameters is so-called multinom scheme: when at every step NA is chosen to fit the observed data best. This scheme leads to faster work of GADMA, however it is more possible to get stuck in local optima. To use it one should set option Multinom to True in parameters file. Third way is to disable migration by setting No migrations to True in parameters file. 20 The last and more reliable way to decrease number of model’s parameters is to use GADMA optimization on the custom model written with ∂a∂i or moments (section Specifying model in details). However it requires some extra work for user. 21 14 Dependencies The absolute dependencies for GADMA are: • Python (2.5, 2.6, 2.7) • NumPy (≥ 1.2.0) • Scipy (≥ 0.6.0) • ∂a∂i (≥ 1.7.0) or moments (≥ 1.0.0) To draw demographic models install: • matplotlib (≥ 0.98.1) • Pillow (≥ 4.2.1) • moments (≥ 1.0.0) 15 Installation Before GADMA installation one need either ∂a∂i or moments been installed. 15.1 ∂a∂i To install ∂a∂i, go to the work directory and run: $ git clone https :// bitbucket . org / gutenkunstlab / dadi / Then go to the dadi directory and install package: $ cd dadi $ python setup . py install To check ∂a∂i’s installation, type in python interpreter: $ python >>> import dadi 15.2 moments To install moments, go to the work directory and run: $ git clone https :// bitbucket . org / simongravel / moments / Check that Cython is installed: $ python -m pip install -- upgrade Cython Then go to the moments directory and install package: 22 $ cd moments $ python setup . py install To check moments’s installation, run in python interpreter: $ python >>> import moments 15.3 Pillow Install Pillow throgh pip: $ python -m pip install Pillow To check Pillow’s installation, type in python interpreter: $ python >>> import PIL 15.4 GADMA To download GADMA, go to the work directory and run: $ git clone https :// github . com / ctlab / GADMA Then go to the GADMA directory and install GADMA: $ cd GADMA $ python setup . py install Now one can run it like this: $ gadma -- help 15.4.1 Verifying installation To verify installation run test: $ gadma -- test If the installation is successful, one will find the following information at the end: -- Finish pipeline - -- Test passed correctly - Thank you for using GADMA ! 23 Appendices A # # # # # # # Example of full parameters file It is parameters file for GADMA software . Lines started from # are ignored . Also comments at the end of a line are ignored too . Every line contains : Identificator of parameter : value . If one want to change some default parameters , one need to remove # at the begining of line and change corresponding parameter . # Output directory to write all GADMA out . # One need to set it to not existing or empty directory . # If it is resumed from other directory and output directory # isn ’t set , GADMA will add ’ _resumed ’ for previous output # directory . Output directory : my_example_run # One can resume from some other launch of GADMA by setting # output directory of it to ’ Resume from ’ parameter . # You can set again new parameters of resumed launch . # Resume from : another_output_dir # # If you want to take only models from previous run set this # flag . Then iterations of GA will start from 0 and values of # mutation rate and strength will be initial . # Default : None # Only models : None # Input file can be sfs file ( should end with . fs ) or # file of SNP ’s in dadi format ( should end with . txt ). Input file : fs_examples / YRI_CEU . fs # # # # # ’ Population labels ’ is sequence of population names ( the same names as in input file ) If . fs file is in old format then it would rename population labels that are absent . It is necessary to put them in order from most ancient to less . 24 # ( In case of more than two populations ) # It is important , because the last of formed populations take # part in next split . # For example , if we have YRI - African population , # CEU - European population and CHB - Chineese populaion , # then we can write YRI , CEU , CHB or YRI , CHB , CEU # ( YRI must be at the first place ) # Default : from input file Population labels : CEU , YRI # we change populations order # ( in input file YRI is first ) # Also one can project your spectrum to less size . # For example , we have 80 individuals in each of three # populations , then spectrum will be 81 x81x81 and one can # project it to 21 x21x21 by set ’ Projections ’ parameter # to 20 , 20 , 20. # Default : from input file Projections : None # will be 20 , 20 # Now all main parameters : # # Total mutation flux - theta . # It is equal to : # theta = 4 * mu * L # where mu - mutation rate per site per generation and # L - effective sequenced length , which accounts for losses # in alignment and missed calls . # Note : one should estimate mu based on generation time . # Default : 1.0 Theta0 : 0.37976 # the same as in Gutenkunst et al 2009 # Time ( years ) for one generation . Can be float . # Is important for drawing models . If one don ’t want to draw , # one can pass it . # Default : 1.0 Time for generation : 25 # the same as in Gutenkunst et al 2009 # # # # Parameters for demographic models : Use moments or dadi Default : dadi 25 Use moments or dadi : moments # Use multinom scheme : N_A is not parameter for search , # it is calculated through optimal_sfs_scaling . # Multinom scheme decrease number of parameter by one and # is usually faster , however non multinom scheme usually # finds better solutions . # Default : False Multinom : False # If you choose to use dadi , please set pts parameter - number # of points in grid # Default : Let n = max number of individuals in one population , # then pts = n , n +10 , n +20 # Pts : 20 , 30 , 40 # Structure of model for one population - number of time periods # (e.g. 5). # Structure of model for two populations - number of time periods # before split of ancestral population and after it (e.g. 2 ,2). # Structure of model for three populations - number of time periods # before first split , between first and second splits and after # second split (e.g. 2 ,1 ,2). # # Structure of initial model : # Default : all is ones - 1 or 1 ,1 or 1 ,1 ,1 Initial structure : 2 ,1 # Structure of final model : # Default : equals to initial structure # Final structure : 2 ,2 # It is possible to limit time of splits . # Split 1 is the most ancient split . # ! Note that time is in genetic units (2 * time for 1 generation ): # e.g. we want to limit by 150 kya , time for one generation is # 25 years , then bound will be 150000 / (2*25) = 3000. # # Upper bound for split 1 ( in case of 2 or 3 populations ). # Default : None # Upper bound of first split : None # Upper bound for split 2 ( in case of 3 populations ). # Default : None # Upper bound of second split : None 26 # Print parameters of model in units of N_ref = N_A . # N_A will be placed in brackets at the end of string . # Default : False Relative parameters : False # Disable migrations in demographic models . # Default : False No migrations : false # Parameters for Genetic Algorithm . # # Size of population of demographic models in GA : # Default : 10 Size of population in GA : 10 # Fractions of current models , mutated models and crossed models # to be taken to new population . # Sum of fractions should be <= 1, the remaining fraction is # fraction of random models . # Default : 0.2 ,0.3 ,0.3 # Fractions in GA : 0.2 ,0.3 ,0.3 # Mutation strength - fraction of parameters in model to mutate # during global mutation process of model . # Number of parameters to mutate is sampled from binomial # distribution , so we need to set mean . # Default : 0.2 Mean mutation strength : 0.3 # # Mutation strength can be adaptive : if mutation is good , # i.e. has the best fitness function ( log likelihood ), # then mutation strength is increased multiplying by const # otherwise it decreases dividing by (1/4)^ const . # When const is 1.0 it is not adaptive . # Default : 1.0 Const for mutation strength : 1.05 # # # # Mutation rate - fraction of any parameter to change during its mutation . Mutation rate is sampled from truncated normal distribution , so we need mean ( std can be specified in extra params ). 27 # Default 0.2 Mean mutation rate : 0.1 # # Mutation rate also can be adaptive as mutation strength . # Default : 1.02 Const for mutation rate : 1.01 # very small changes # Genetic algorithm stops when it couldn ’t improve model by # more that epsilon in logLL # Default : 1e -2 Epsilon : 1e -2 # # and it happens during N iterations : # Default : 100 Stop iteration : 50 # # # # # # # # # # # # # # # # # # # # # # # # # # # # Local optimization . Choice of local optimization , that is launched after each genetic algorithm . Choices : * optimize ( BFGS method ) * optimiza_log ( BFGS method ) * optimize_powell ( Powell ’s conjugate direction method ) ( Note : is implemented in moments : one need to have moments installed .) ( If optimizations are often hitting the parameter bounds , try using these methods :) * optimize_lbfgsb * optimize_log_lbfgsb ( Note that it is probably best to start with the vanilla BFGS methods , because the L - BFGS -B methods will always try parameter values at the bounds during the search . This can dramatically slow model fitting .) * optimize_log_fmin ( simplex (a.k.a. amoeba ) method ) * hill_climbing Default : optimize_powell 28 Name of local optimization : optimize_log # Parameters of pipeline # # One can automatically draw models every N iteration . # If 0 then never . # Pictures are saved in GA ’s directory in picture folder . # Default : 0 Draw models every N iteration : 100 # One can automatically generate dadi and moments code for models . # If 0 then only current best model will be printed in GA ’s # working directory . # Also result model will be saved there . # If specified ( not 0) then every N iteration model will be saved # in python code folder . # Default : 0 Print models ’ code every N iteration : 100 # One can choose time units in models ’ plots : years or thousand # years ( kya , KYA ). If time for one generation isn ’t specified # then time is in genetic units . # Default : years Units of time in drawing : thousand years # No std output . # Default : False Silence : False # How many times launch GADMA with this parameters . # Default : 1 Number of repeats : 3 # How many processes to use for this repeats . # Note that one repeat isn ’t parallelized , so increasing number # of processes doesn ’t effect on time of one repeat . # It is desirable that the number of repeats is a multiple of # the number of processes . # Default : 1 Number of processes : 3 29 B Example of generated code For example, moments generated code for 2d AFS for YRI, CEU populations from [1]: # current best params = [7194.792822462478 , 13410.251542201073 , 0.958254456 import matplotlib matplotlib . use ( " Agg " ) import moments import numpy as np def generated_model ( params , ns ): Ns = params [:5] Ts = params [5:7] Ms = params [7:] theta1 = 0.37976 sts = moments . LinearSystem_1D . steady_state_1D ( sum ( ns ) , theta = theta1 , N = Ns [0]) fs = moments . Spectrum ( sts ) before = [ Ns [0]] T = Ts [0] after = Ns [1:2] growth_funcs = [ lambda t : after [0]] list_growth_funcs = lambda t : [ f ( t ) for f in growth_funcs ] fs . integrate ( Npop = list_growth_funcs , tf =T , dt_fac =0.1 , theta = theta1 ) before = after fs = moments . Manips . split_1D_to_2D ( fs , ns [0] , sum ( ns [1:])) before . append ((1 - Ns [2]) * before [ -1]) before [ -2] *= Ns [2] T = Ts [1] after = Ns [3:5] growth_funcs = [ lambda t : after [0] , lambda t : before [1] * ( after [1] / before [1]) ** ( t / T )] list_growth_funcs = lambda t : [ f ( t ) for f in growth_funcs ] m = np . array ([[0 , params [7]] ,[ params [8] , 0]]) fs . integrate ( Npop = list_growth_funcs , tf =T , m =m , dt_fac =0.1 , theta = theta1 ) before = after return fs data = moments . Spectrum . from_file ( ’ data / YRI_CEU . fs ’) popt = [7194.792822462478 , 13410.251542201073 , 0.9582544565961783 , 30 13542.979276844108 , 12114.968575519626 , 2683.3787253409746 , 846.6668954957415 , 0.00014172779289593632 , 0.00012195685425105608] ns = [20 , 20] model = generated_model ( popt , ns ) ll_model = moments . Inference . ll ( model , data ) print ( ’ Model log likelihood ( LL ( model , data )): {0} ’. format ( ll_model )) # now we need to norm vector of params so that first value is 1: popt_norm = [1.0 , 1.8638829321580523 , 0.9582544565961783 , 1.8823306815120924 , 1.6838523185401713 , 0.3729612223111333 , 0.1176777311575127 , 1.0197021070711312 , 0.8774542996156008] print ( ’ Drawing model to model_from_GADMA_from_simple . png ’) model = moments . ModelPlot . generate_model ( generated_model , popt_norm , ns ) moments . ModelPlot . plot_model ( model , save_file = ’ model_from_GADMA_from_simple . png ’ , fig_title = ’ ’ , pop_labels =[ ’ YRI ’ , ’ CEU ’] , nref =7194 , draw_scale = True , gen_time =0.025 , gen_time_units = " Thousand years " , reverse_timeline = True ) References [1] Ryan N Gutenkunst, Ryan D Hernandez, Scott H Williamson, and Carlos D Bustamante, Inferring the joint demographic history of multiple populations from multidimensional snp frequency data, PLoS genetics 5 (2009), no. 10, e1000695. [2] Julien Jouganous, Will Long, Aaron P. Ragsdale, and Simon Gravel, Inferring the joint demographic history of multiple populations: Beyond the diffusion approximation, Genetics 206 (2017), no. 3, 1549–1567. [3] Marguerite Lapierre, Amaury Lambert, and Guillaume Achaz, Accuracy of demographic inferences from the site frequency spectrum: the case of the yoruba population, Genetics (2017), genetics–116. 31
Source Exif Data:
File Type : PDF File Type Extension : pdf MIME Type : application/pdf PDF Version : 1.5 Linearized : Yes Author : Create Date : 2018:10:24 12:44:13Z Creator : LaTeX with hyperref package Modify Date : 2018:10:24 12:44:13Z PTEX Fullbanner : This is pdfTeX, Version 3.14159265-2.6-1.40.18 (TeX Live 2017) kpathsea version 6.2.3 Producer : pdfTeX-1.40.18 Subject : Title : Trapped : False Page Mode : UseOutlines Page Count : 31EXIF Metadata provided by EXIF.tools