Pyvolve Manual

User Manual:

Open the PDF directly: View PDF .
Page Count: 27

Download
Open PDF In Browser	View PDF

P YVOLVE U SER M ANUAL
Stephanie J. Spielman, PhD
Email: spielman@growan.com

Contents
1

Introduction

2

2

Installation

2

3

Citation

2

4

Basic Usage

3

5

Defining phylogenies

3

6

Defining Evolutionary Models
6.1 Nucleotide Models . . . . . . . . . . . . . . . . . .
6.2 Amino-acid models . . . . . . . . . . . . . . . . . .
6.3 Mechanistic (dN/dS) codon models . . . . . . . .
6.4 Mutation-selection models . . . . . . . . . . . . . .
6.5 Empirical codon model . . . . . . . . . . . . . . . .
6.6 Specifying custom rate matrices . . . . . . . . . . .
6.7 Specifying equilibrium frequencies . . . . . . . . .
6.7.1 EqualFrequencies class . . . . . . . . . . . .
6.7.2 RandomFrequencies class . . . . . . . . . .
6.7.3 CustomFrequencies class . . . . . . . . . .
6.7.4 ReadFrequencies class . . . . . . . . . . . .
6.7.5 EmpiricalModelFrequencies class . . . . .
6.7.6 Restricting frequencies to certain states . .
6.7.7 Converting frequencies between alphabets
6.8 Specifying mutation rates . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

5
6
7
7
9
10
11
12
13
13
13
14
14
15
15
16

7

Mutation rates vs. branch lengths

16

8

Defining Partitions
8.1 Specifying ancestral sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

17
17

9

Evolving sequences
9.1 Evolver output files . . . . . . . . . . . . . . . . . . . .
9.2 Sequence post-processing . . . . . . . . . . . . . . . .
9.2.1 Interpreting the "site_rates.txt" output file . . .
9.2.2 Interpreting the "site_rates_info.txt" output file
9.3 Simulating replicates . . . . . . . . . . . . . . . . . . .

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

17
18
19
19
19
20

10 Implementing site-wise rate heterogeneity
10.1 Implementing site-wise heterogeneity for nucleotide and amino-acid models
10.1.1 Gamma-distributed rate categories . . . . . . . . . . . . . . . . . . . . .
10.1.2 Custom-distributed rate categories . . . . . . . . . . . . . . . . . . . . .
10.2 Implementing site-wise heterogeneity for mechanistic codon models . . . . .
10.3 Implementing site-wise heterogeneity for mutation-selection models . . . . .
10.4 Implementing site-wise heterogeneity for the Empirical Codon Model . . . .

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

20
20
20
21
22
22
22

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

11 Implementing branch (temporal) heterogeneity

22

12 Implementing branch-site heterogeneity

25

1

1

Introduction

Pyvolve is an open-source Python module for simulating genetic data along a phylogeny using Markov
models of sequence evolution, according to standard methods [19]. Pyvolve is freely available under a
FreeBSD license and is hosted on github: http://sjspielman.org/pyvolve/. Pyvolve has several dependencies, including BioPython, NumPy, and SciPy. These modules must be properly installed and in your
Python path. Please file any and all bug reports on the github repository Issues section.
Pyvolve is written such that it can be seamlessly integrated into your Python pipelines without having to
interface with external software platforms. However, please note that for extremely large (>10,000 taxa)
and/or extremely heterogenous simulations (e.g. where each site evolves according to a unique evolutionary model), Pyvolve may be quite slow and thus may take several minutes or more to run. Faster sequence
simulators you may find useful are detailed in ref. [1], which gives an overview of various sequence simulation softwares (from 2012).

Pyvolve supports a variety of evolutionary models, including the following:
• Nucleotide Models
– Generalized time-reversible model [14] and all nested variants
• Amino-acid exchangeability models
– JTT [7], WAG [16], LG [10], AB [11], mtMam [18], mtREV24 [17], and DAYHOFF [2]
• Codon models
– Mechanistic (dN/dS) models (MG-style [12] and GY-style [4])
– Empirical codon model [9]
• Mutation-selection models
– Halpern-Bruno model [5], implemented for codons and nucleotides
Both site- and branch- (temporal) heterogeneity are supported. A detailed and highly-recommended overview
of Markov process evolutionary models, for DNA, amino acids, and codons, is available in the book Computational Molecular Evolution, by Ziheng Yang [19].
Although Pyvolve does not simulate insertions and deletions (indels), Pyvovle does include several novel
options not available (to my knowledge) in other sequence simulation softwares. These options, detailed
in Section ??, include custom rate-matrix specification, novel matrix-scaling approaches, and branch length
perturbations.

2

Installation

Pyvolve may be downloaded and installed using pip or easy_install. Source code is available from
https://github.com/sjspielman/pyvolve/releases.

3

Citation

If you use Pyvolve, or code derived from Pyvolve, please cite us:
Spielman, SJ and Wilke, CO. 2015. Pyvolve: A flexible Python module for simulating sequences along
phylogenies. PLOS ONE. 10(9): e0139047.

2

4

Basic Usage

Similar to other simulation platforms, Pyvolve evolves sequences in groups of partitions (see, for instance,
the Indelible simulation platform [3]). Each partition has an associated size and model (or set of models, if
branch heterogeneity is desired). Note that all partitions will evolve according to the same phylogeny1 .
The general framework for a simple simulation is given below. In order to simulate sequences, you must
define the phylogeny along which sequences evolve as well as any evolutionary model(s) you’d like to use,
and assign model(s) to partition(s). Each evolutionary model has associated parameters which you can
customize, as detailed in Section 6.
1
2

######### General pyvolve framework #########
#############################################

3
4
5

# Import the Pyvolve module
import pyvolve

6
7
8

# Read in phylogeny along which Pyvolve should simulate
my_tree = pyvolve.read_tree(file = "file_with_tree_for_simulating.tre")

9
10
11

# Define evolutionary model(s) with the Model class
my_model = pyvolve.Model(, )

12
13
14

# Define partition(s) with the Partition class
my_partition = pyvolve.Partition(models = my_model, size = 100)

15
16
17
18

# Evolve partitions with the callable Evolver class
my_evolver = pyvolve.Evolver(tree = my_tree, partitions = my_partition)
my_evolver() # evolve sequences

Each of these steps is explained below, in detail with several examples. For additional information, consult
the API documentation, at http://sjspielman.org/pyvolve. Further, all functions and classes in Pyvolve
have highly descriptive docstrings, which can be accessed with Python’s help() function.

5

Defining phylogenies

Phylogenies must be specified as newick strings (see this wikipedia page for details) with branch lengths.
Pyvolve reads phylogenies using the function read_tree, either from a provided file name or directly from
a string:
1
2

# Read phylogeny from file with the keyword argument "file"
phylogeny = pyvolve.read_tree(file = "/path/to/tree/file.tre")

3
4
5

# Read phylogeny from string with the keyword argument "tree"
phylogeny = pyvolve.read_tree(tree = "(t4:0.785,(t3:0.380,(t2:0.806,(t5:0.612,t1:0.660
):0.762):0.921):0.207);")
1 If you wish to have different partitions evolve according to distinct phylogenies, I recommend performing several simulations
and then merging the resulting alignments in the post-processing stage.

3

To implement branch (temporal) heterogeneity, in which different branches on the phylogeny evolve according to different models, you will need to specify model flags at particular nodes in the newick tree, as
detailed in Section 11.
Further, to assess that a phylogeny has been parsed properly (or to determine the automatically-assigned
names of internal nodes), use the print_tree function:
1
2

# Read phylogeny from string
phylogeny = pyvolve.read_tree(tree = "(t4:0.785,(t3:0.380,(t2:0.806,(t5:0.612,t1:0.660
):0.762):0.921):0.207);")

3
4
5
6
7
8
9
10
11
12
13
14
15
16

# Print the parsed phylogeny
pyvolve.print_tree(phylogeny)
## Output from the above statement: node_name
'''
>>>
t4 0.785 None
>>>
internalNode3 0.207 None
>>>
t3 0.38 None
>>>
internalNode2 0.921 None
>>>
t2 0.806 None
>>>
internalNode1 0.762 None
>>>
t5 0.612 None
>>>
t1 0.66 None
'''

branch_length

model_flag

In the above output, tabs represent nested hierarchies in the phylogeny. Each line shows the node name
(either a tip name, "root", or an internal node), the branch length leading to that node, and the model flag
associated with that node. This final value will be None if model flags are not provided in the phylogeny.
Again, note that model flags are only required in cases of branch heterogeneity (see Section 11).
It is also possible to provide a phylogeny with named internal nodes. Any internal nodes without provided
names will be automatically assigned.
1
2

# Read phylogeny with some internal node names (myname1, myname2)
phylogeny = pyvolve.read_tree(tree = "(t4:0.785,(t3:0.380,(t2:0.806,(t5:0.612,t1:0.660
)myname1:0.762)myname2:0.921):0.207);")

3
4
5
6
7
8
9
10
11
12
13
14
15
16

# Print the parsed phylogeny
pyvolve.print_tree(phylogeny)
## Output from the above statement: node_name
'''
>>>
t4 0.785 None
>>>
internalNode1 0.207 None
>>>
t3 0.38 None
>>>
myname2 0.921 None
>>>
t2 0.806 None
>>>
myname1 0.762 None
>>>
t5 0.612 None
>>>
t1 0.66 None
'''

4

branch_length

model_flag

You can also rescale, as desired, branch lengths on your input phylogeny using the keyword argument
scale_tree in the read_tree function. This argument takes a numeric value and multiplies all branch
lengths in your tree by this scalar:
t = pyvolve.read_tree(file = "name_of_file.tre", scale_tree = 2.5) # Multiply all
branch lengths by 2.5

1

This argument is useful for changing the overall tree length, hence increasing or decreasing the expected
number of substitutions over the course of simulation.

6

Defining Evolutionary Models

The evolutionary models built into Pyvolve are outlined in Table 1 of this manual. Pyvolve uses Model
objects to store evolutionary models:
1
2

# Basic framework for defining a Model object (second argument optional)
my_model = Model(, )

A single argument, , is required when defining a Model object. Available model types are
shown in Table 1. Each model type has various associated parameters, which can be customized via the
second optional argument to Model, written above as . This
argument should be a dictionary of parameters to customize, and each modeling framework has particular keys which can be included in this dictionary. Available model types and associated customizable
parameters are shown in Table 1 and detailed in the subsections below.
Note that there are additional optional keyword arguments which may be passed to Model, including arguments pertaining to site-rate heterogeneity (see Section 10).

5

Optional parameters ("key")

Modeling framework

Pyvolve model type(s)

Nucleotide models

"nucleotide"

• Equilibrium frequencies ("state_freqs")
• Mutation rates ("mu" or "kappa")

Empirical amino-acid
models

"JTT", "WAG", "LG", "AB",
"DAYHOFF", "MTMAM", or
"MTREV24"

• Equilibrium frequencies ("state_freqs")

Mechanistic (dN/dS)
codon models

"GY", "MG", or "codon"

Mutation-selection
models

"MutSel"

Empirical codon model
(ECM)

"ECMrest", "ECMunrest", or
"ECM"

• Equilibrium frequencies ("state_freqs")
• Mutation rates ("mu" or "kappa")
• dN/dS ("alpha", "beta", and/or "omega")
• Equilibrium frequencies ("state_freqs") OR fitness values
("fitness")
• Mutation rates ("mu" or "kappa")
• Equilibrium frequencies ("state_freqs")
• Transition-tranversion bias(es) ("k_ti" and/or "k_tv")
• dN/dS † ("alpha", "beta", and/or "omega")

Table 1. Accepted model types in Pyvolve with associated customizable parameters. Names given in the column "Pyvolve model
type(s)" should be specified as the first argument to Model as strings (case-insensitive). Customizable parameters indicated in the
column "Optional parameters" should be specified as keys in the custom model-parameters dictionary, the second argument using
when defining a Model object.
† Note that the interpretation of this dN/dS value is different from the usual interpretation.

Subsections below explain each modeling framework in detail, with examples of parameter customizations.

6.1

Nucleotide Models

Nucleotide rate matrix elements, for the substitution from nucleotide i to j, are generally given by
qij = µij πj

(1)

where µij describes the rate of change from nucleotide i to j (i.e. mutation rate), and πj represents the
equilibrium frequency of the target nucleotide j. Note that mutation rates are symmetric, e.g. µij = µji .
By default, nucleotide models have equal equilibrium frequencies and equal mutation rates. A basic model
can be constructed with,
1
2

# Simple nucleotide model
nuc_model = pyvolve.Model("nucleotide")

To customize a nucleotide model, provide a custom-parameters dictionary with optional keys "state_freqs"
for custom equilibrium frequencies and "mu" for custom mutation rates (see Section 6.7 for details on frequency customization and Section 6.8 for details on mutation rate customization).
1
2
3

# Define mutation rates in a dictionary with keys giving the nucleotide pair
# Below, the rate from A to C is 0.5, and similarly C to A is 0.5
custom_mu = {"AC":0.5, "AG":0.25, "AT":1.23, "CG":0.55, "CT":1.22, "GT":0.47}

4
5

# Define custom frequencies, in order A C G T. This can be a list or numpy array.

6

6

freqs = [0.1, 0.45, 0.3, 0.15]

7
8
9

# Construct nucleotide model with custom mutation rates and frequencies.
nuc_model = pyvolve.Model( "nucleotide", {"mu":custom_mu, "state_freqs":freqs} )

As nucleotide model mutation rates are symmetric, if you provide a rate for A → T (key "AT"), it will
automatically be applied as the rate for T → A. Any unspecified mutation rate pairs will have a value of 1.
As an alternate to "mu", you can provide the key "kappa", which corresponds to the transition:transversion
ratio (e.g. for an HKY85 model [6]), in the custom-parameters dictionary. When kappa is specified, tranversion mutation rates are set to 1, and transition mutation rates are set to the provided "kappa" value.
1

2

# Construct nucleotide model with transition-to-transversion bias, and default
frequencies
nuc_model = pyvolve.Model( "nucleotide", {"kappa":2.75, "state_freqs":freqs} )

6.2

Amino-acid models

Amino-acid exchangeability matrix elements, for the substitution from amino acid i to j, are generally given
by
qij = rij πj
(2)
where rij is a symmetric matrix that describes the probability of changing from amino acid i to j, and
πj is the equilibrium frequency of the target amino acid j. The rij matrix corresponds to an empirically
determined model, such as WAG [16] or LG [10].
By default, Pyvolve assigns the default model equilibrium frequencies for empirical models. These frequencies correspond to those published with each respective model’s original paper. A basic amino-acid model
can be constructed with,
1
2

# Simple amino-acid model
aa_model = pyvolve.Model("WAG") # Here, WAG can be one of JTT, WAG, LG, DAYHOFF, MTMAM
, MTREV24 (case-insensitive)

To customize an amino-acid model, provide a custom-parameters dictionary with the key "state_freqs"
for custom equilibrium frequencies (see Section 6.7 for details on frequency customization). Note that
amino-acid frequencies must be in the order A, C, D, E, ... Y.

6.3

Mechanistic (dN/dS) codon models

GY-style [4] matrix elements, for the substitution from codon i to j, are generally given by


 µoi tj πj α synonymous change



qij =
,
µoi tj πj β nonsynonymous change





0 multiple nucleotide changes

(3)

where µoi tj is the mutation rate (e.g. for a change AAA to AAC, the corresponding mutation rate would be
A → C), πj is the frequency of the target codon j, α is the rate of synonymous change (dS), and β is the rate
of nonsynonymous change (dN ).
7

MG-style [12] matrix elements, for the substitution from codon i to j, are generally given by



µoi tj πtj α synonymous change



qij =
,
µoi tj πtj β nonsynonymous change





0 multiple nucleotide changes

(4)

where µoi tj is the mutation rate, πtj is the frequency of the target nucleotide tj (e.g. for a change AAA to
AAC, the target nucleotide would be C), α is the rate of synonymous change (dS), and β is the rate of
nonsynonymous change (dN ).
Both GY-style and MG-style codon models use symmetric mutation rates. Codon models require that you
provide a dN/dS rate ratio as a parameter in the custom-parameters dictionary. There are several ways to
specify this value:
• Specify a single parameter, "omega". This option sets the synonymous rate to 1.
• Specify a single parameter, "beta". This option sets the synonymous rate to 1.
• Specify a two parameters, "alpha" and "beta". This option sets the synonymous rate to α and the
nonsynonymous rate to β.
By default, mechanistic codon models have equal mutation rates and equal equilibrium frequencies. Basic
mechanistic codon models can be constructed with,
1
2

# Simple GY-style model (specify as GY)
gy_model = pyvolve.Model("GY", {"omega": 0.5})

3
4
5

# Simple MG-style model (specify as MG)
mg_model = pyvolve.Model("MG", {"alpha": 1.04, "beta": 0.67})

6
7
8

# Specifying "codon" results in a *GY-style* model
codon_model = pyvolve.Model("codon", {"beta": 1.25})

To customize a mechanistic codon model, provide a custom-parameters dictionary with optional keys
"state_freqs" for custom equilibrium frequencies and "mu" for custom mutation rates (see Section 6.7
for details on frequency customization and Section 6.8 for details on mutation rate customization). Note
that codon frequencies must ordered alphabetically (AAA, AAC, AAG, ..., TTG, TTT) without stop codons.
1
2
3

# Define mutation rates in a dictionary with keys giving the nucleotide pair
# Below, the rate from A to C is 0.5, and similarly C to A is 0.5
custom_mu = {"AC":0.5, "AG":0.25, "AT":1.23, "CG":0.55, "CT":1.22, "GT":0.47}

4
5
6

# Construct codon model with custom mutation rates
codon_model = pyvolve.Model( "codon", {"mu":custom_mu, "omega":0.55} )

Mechanistic codon model mutation rates are symmetric; if you provide a rate for A → T (key "AT"), it will
automatically be applied as the rate for T → A. Any unspecified mutation rate pairs will have a value of 1.
As an alternate to "mu", you can provide the key "kappa", which corresponds to the transition:transversion
ratio (e.g. for an HKY85 model [6]), in the custom-parameters dictionary. When kappa is specified, tranversion mutation rates are set to 1, and transition mutation rates are set to the provided "kappa" value.
1
2

# Construct codon model with transition-to-transversion bias, and default frequencies
codon_model = pyvolve.Model( "codon", {"kappa":2.75, "alpha":0.89, "beta":0.95} )

8

Importantly, by default, mechanistic codon models are scaled so that the mean substitution rate per unit
time is 1. Here, "mean substitution rate" includes both synonymous and nonsynonymous substitutions in
its calculations. An alternative scaling scheme (note, which the author strongly prefers) would be to scale
the matrix such that the mean neutral substitution rate per unit time is 1. To specify this approach, include
the argument neutral_scaling = True when defining a Model:
1
2

# Construct codon model with neutral scaling
codon_model = pyvolve.Model( "codon", {"omega":0.5}, neutral_scaling = True )

6.4

Mutation-selection models

Mutation-selection (MutSel) model [5] matrix elements, for the substitution from codon (or nucleotide) i to
j, are generally given by

Sij


µij 1−e−S
single nucleotide change

ij


qij =
,
(5)





0 multiple nucleotide changes
where µij is the mutation rate, and Sij is the scaled selection coefficient. The scaled selection coefficient
indicates the fitness difference between the target and source state, e.g. f itnessj −f itnessi . MutSel mutation
rates are not constrained to be symmetric (e.g. µij can be different from µji ).
MutSel models are implemented both for codons and nucleotides, and they may be specified either with
equilibrium frequencies or with fitness values. Note that equilibrium frequencies must sum to 1, but fitness
values are not constrained in any way. (The relationship between equilibrium frequencies and fitness values
for MutSel models is detailed in refs. [5, 13]). Pyvolve automatically determines whether you are evolving
nucleotides or codons based on the provided vector of equilibrium frequencies or fitness values; a length
of 4 indicates nucleotides, and a length of 61 indicates codons. Note that, if you are constructing a codon
MutSel model based on fitness values, you can alternatively specify a vector of 20 fitness values, indicating
amino-acid fitnesses (in the order A, C, D, E, ... Y). These fitness values will be directly assigned to codons,
such that all synonymous codons will have the same fitness.
Basic nucleotide MutSel models can be constructed with,
1

2
3

# Simple nucleotide MutSel model constructed from frequencies, with default (equal)
mutation rates
nuc_freqs = [0.1, 0.4, 0.3, 0.2]
mutsel_nuc_model_freqs = pyvolve.Model("MutSel", {"state_freqs": nuc_freqs})

4
5

6
7

# Simple nucleotide MutSel model constructed from fitness values, with default (equal)
mutation rates
nuc_fitness = [1.5, 0.88, -4.2, 1.3]
mutsel_nuc_model_fits = pyvolve.Model("MutSel", {"fitness": nuc_fitness})

Basic codon MutSel models can be constructed with,
1

import numpy as np # imported for convenient example frequency/fitness generation

2
3

# Simple codon MutSel model constructed from frequencies, with default (equal)
mutation rates

9

4

5

codon_freqs = np.repeat(1./61, 61) # constructs a vector of equal frequencies, as an
example
mutsel_codon_model_freqs = pyvolve.Model("MutSel", {"state_freqs": codon_freqs})

6
7

8

9

# Simple codon MutSel model constructed from codon fitness values, with default (equal
) mutation rates
codon_fitness = np.random.normal(size = 61) # constructs a vector of normally
distributed codon fitness values, as an example
mutsel_codon_model_fits = pyvolve.Model("MutSel", {"fitness": codon_fitness})

10
11

12

13

# Simple codon MutSel model constructed from *amino-acid* fitness values, with default
(equal) mutation rates
aa_fitness = np.random.normal(size = 20) # constructs a vector of normally distributed
amino-acid fitness values, as an example
mutsel_codon_model_fits2 = pyvolve.Model("MutSel", {"fitness": aa_fitness})

Mutation rates can be customized with either the "mu" or the "kappa" key in the custom-parameters dictionary. Note that mutation rates in MutSel models do not need to be symmetric. However, if you a rate
for A → C (key "AC") and no rate for C → A (key "CA"), then Pyvolve will assume symmetry and assign
C → A the same rate as A → C. If neither pair is provided (e.g. both "AC" and "CA" are not defined), then
both will be given a rate of 1.

6.5

Empirical codon model

Matrix elements of the empirical codon model (ECM) [9] are given by,


 sij πj κ(i, j)α synonymous change
qij =
,

 sij πj κ(i, j)β nonsynonymous change

(6)

where sij is the symmetric, empirical matrix indicating the probability of changing from codon i to j,
πj is the equilibrium frequency of the target codon j, κ(i, j) is a mutational parameter indicating transition and/or tranversion bias, and α and β represent dS and dN , respectively. Importantly, because this
model is empirically-derived, the parameters κ(i, j), α, and β as used in ECM each represent the transitiontranversion bias, synonymous rate, and nonsynonymous rate, respectively, relative to the average level
present in the PANDIT database [15], from which this model was constructed 2 . The parameter κ(i, j) is
described in depth in ref. [9], specifically in the second half of the Results section Application of the ECM.
Importantly, there are two versions of this model: restricted and unrestricted. The restricted model restricts
instantaneous change to single-nucleotide only, whereas the unrestricted model also allows for double- and
triple-nucleotide changes. Pyvolve refers to these models, respectively, as ECMrest and ECMunrest.
By default, Pyvolve assumes that κ(i, j), α, and β all equal 1, and Pyvolve uses the default empirical model
equilibrium frequencies. These frequencies correspond to those published in the original paper publishing
ECM.
Basic ECM can be constructed by specifying either "ECMrest" or "ECMunrest" (case-insensitive) when
defining a Model object,
2 Personally,

I would not recommend using any of these parameters when simulating (although they have been fully implemented
in Pyvolve), as their interpretation is neither straight-forward nor particularly biological.

10

1
2

# Simple restricted ECM
ecm_model = pyvolve.Model("ECMrest")

3
4
5

# Simple unrestricted ECM
ecm_model = pyvolve.Model("ECMunrest")

6
7
8

# Specifying "ECM" results in a *restricted ECM* model
ecm_model = pyvolve.Model("ECM")

As with mechanistic codon models, the dS and dN parameters can be specified with custom model parameter dictionary keys α, β, and/or ω (but again, these parameters do not correspond to dN/dS in the
traditional sense!):
1
2

# Restricted ECM with dN/dS parameter of 0.75
ecm_model = pyvolve.Model("ECMrest", {"omega":0.75})

The κ(i, j) parameter is specified using the keys "k_ti" for transition bias and "k_tv", for transversion
bias. Specifically, "k_ti" corresponds to ts, and "k_tv" corresponds to tv in equations 9-11 in ref. [9]. Thus,
each of these parameters can be specified as either 0, 1, 2, or 3 (the Pyvolve default is 1).
Finally, equilibrium frequencies can be customized with the "state_freqs" key in the custom model parameters dictionary (see Section 6.7 for details on frequency customization).

6.6

Specifying custom rate matrices

Rather than using a built-in modeling framework, you can specify a custom rate matrix. This rate matrix
must be square and all rows in this matrix must sum to 0. Pyvolve will perform limited sanity checks on
your matrix to ensure that these conditions are met, but beyond this, Pyvolve takes your matrix at facevalue. In particular, Pyvolve will not scale the matrix in any manner.
Importantly, if you have separate state frequencies which have not been incorporated into your rate matrix
already, the supplied matrix must be symmetric. If you do not supply state frequencies explicitly, Pyvolve
will automatically determine them directly from your provided matrix.
When providing a custom matrix, you also have the option to provide a custom code, or custom states
which are evolved. In this way, you can evolve characters of any kind according to any specified transition
matrix. If you do not provide a custom code, Pyvolve checks to make sure that your matrix has dimensions
of either 4 × 4, 20 × 20, or 61 × 61 (for nucleotide, amino-acid, or codon evolution, respectively). Otherwise,
Pyvolve will check that your provided code and matrix are compatible (in terms of dimensions). Providing
a custom code is, therefore, an attractive option for specifying arbitrary models of character evolution.
To specify a custom rate matrix, provide the argument "custom" as the first argument when defining a
Model object, and provide your matrix in the custom-parameters dictionary using the key matrix. Any
custom matrix specified should be either a 2D numpy array or a python list of lists. Below is an example of
specifying a custom nucleotide rate matrix:
1

import numpy as np # import to construct matrix

2
3
4
5

# Define a 4x4 custom rate matrix
custom_matrix= np.array([[-1.0, 0.33, 0.33, 0.34],
[0.25, -1.0, 0.25, 0.50],

11

6
7

[0.10, 0.80, -1.0, 0.10],
[0.34, 0.33, 0.33, -1.0]] )

8
9
10

# Construct a model using the custom rate-matrix
custom_model = pyvolve.Model("custom", {"matrix":custom_matrix})

Pyvolve automatically assumes that any 4 × 4 matrix indicates nucleotide evolution. As stated above,
Pyvolve will extract equilibrium frequencies from this matrix and check that they are acceptable. This
frequency vector will be automatically saved to a file called "custom_matrix_frequencies.txt", and these
values will be used to generate the root sequence during simulation.
To provide a custom code, include the additional key "code"in your dictionary. Note that this key would
be ignored for any built-in model.
1

import numpy as np # import to construct matrix

2
3
4
5
6

# Define a 3x3 custom rate matrix
custom_matrix= np.array([[-0.50, 0.30, 0.20],
[0.25, -0.50, 0.25],
[0.40, 0.10, 0.50]] )

7
8
9
10

custom_code = ["0", "1", "2"]
# Construct a model using the custom rate-matrix and the custom code
custom_model = pyvolve.Model("custom", {"matrix":custom_matrix, "code":custom_code})

The resulting data simulated using the above model will contain characters 0, 1, and 2. Although the above
example shows a 3×3 matrix, it is certainly possible to specify custom matrices and codes for the "standard"
dimensions of 4, 20, and 61.

6.7

Specifying equilibrium frequencies

Equilibrium frequencies can be specified for a given Model object with the key "state_freqs" in the
custom-parameters dictionary. This key’s associated value should be a list (or numpy array) of frequencies, summing to 1. The values in this list should be ordered alphabetically. For nucleotides, the list should
be ordered ACGT. For amino-acids, the list should be ordered alphabetically, with regards to single-letter
amino-acids abbreviations: ACDEFGHIKLMNPQRSTVWY. Finally, for codons, the list should be ordered
AAA, AAC, AAG, AAT, ACA, ... TTT, excluding stop codons.
By default, Pyvolve assumes equal equilibrium frequencies (e.g. 0.25 for nucleotides, 0.05, for amino-acids,
1/61 for codons). These conditions are not, however, very realistic, so I strongly recommend that you
specify custom equilibrium frequencies for your simulations. Pyvolve provides a convenient class, called
StateFrequencies, to help you with this step, with several child classes:
• EqualFrequencies (default)
– Computes equal frequencies
• RandomFrequencies
– Computes (semi-)random frequencies
• CustomFrequencies
– Computes frequencies from a user-provided dictionary of frequencies
• ReadFrequencies

12

– Computes frequencies from a sequence or alignment file
• EmpiricalModelFrequencies3
– Sets frequencies to default values for a given empirical model
All of these classes should be used with the following setup (the below code uses EqualFrequencies as a
representative example):
1
2

3

# Define frequency object
f = pyvolve.EqualFrequencies("nucleotide") # or "amino_acid" or "codon", depending on
your simulation
frequencies = f.compute_frequencies() # returns a vector of equilibrium frequencies

The constructed vector of frequencies (named "frequencies" in the example above) can then be provided
to the custom model parameters dictionary with the key "state_freqs". In addition, to conveniently
save this vector of frequencies to a file, use the argument savefile =  when calling
.construct_frequencies():
1
2
3

# Define frequency object
f = pyvolve.EqualFrequencies("nucleotide")
frequencies = f.compute_frequencies(savefile = "my_frequency_file.txt") # returns a
vector of equilibrium frequencies and saves them to file

6.7.1

EqualFrequencies class

Pyvolve uses this class to construct the default equilibrium frequencies. Usage should be relatively straightforward, according to the example above.

6.7.2

RandomFrequencies class

This class is used to compute "semi-random" equilibrium frequencies. The resulting frequency distributions
are not entirely random, but rather are virtually flat distributions with some amount of noise.

6.7.3

CustomFrequencies class

With this class, you can provide a dictionary of frequencies, using the argument freq_dict, from which a
vector of frequencies is constructed. The keys for this dictionary are the nucleotides, amino-acids (single
letter abbreviations!), or codons, and the values should be the frequencies. Any states not included in this
dictionary will be assigned a 0 frequency, so be sure the values in this dictionary sum to 1.
In the example below, CustomFrequencies is used to create a vector of amino-acid frequencies in which
aspartate and glutamate each have a frequency of 0.25, and tryptophan has a frequency of 0.5. All other
amino acids will have a frequency of 0.
1
2
3

# Define CustomFrequencies object
f = pyvolve.CustomFrequencies("amino_acid", freq_dict = {"D":0.25, "E":0.25, "W":0.5})
frequencies = f.compute_frequencies()
3 Note

that this is not actually a child class of StateFrequencies, but its behavior is virtually identical.

13

6.7.4

ReadFrequencies class

The ReadFrequencies class can be used to compute equilibrium frequencies from a file of sequences
and/or multiple sequence alignment. Frequencies can be computed either using all data in the file, or,
if the file contains an alignment, using specified alignment column(s). Note that Pyvolve will ignore all
ambiguous characters present in this sequence file.
When specifying a file, use the argument file, and to specify the file format (e.g. "fasta" or "phylip"), use
the argument format. Pyvolve uses BioPython to read the sequence file, so consult the BioPython AlignIO
module documentation (or this nice wiki) for available formats. Pyvolve assumes a default file format of
FASTA, so the format argument is not needed when the file is FASTA.
1
2
3

# Build frequencies using *all* data in the provided file
f = pyvolve.ReadFrequencies("amino_acid", file = "a_file_of_sequences.fasta")
frequencies = f.compute_frequencies()

To read frequencies from a specific column in a multiple sequence alignment, use the argument columns,
which should be a list (indexed from 1) of integers giving the column(s) which should be considered in
frequency calculations.
1
2

3

# Build frequencies using alignment columns 1 through 5 (inclusive)
f = pyvolve.ReadFrequencies("amino_acid", file = "alignment_file.fasta", columns =
range(1,6))
frequencies = f.compute_frequencies()

4
5
6

7

# Build frequencies using only phylip-formatted alignment column 15
f = pyvolve.CustomFrequencies("amino_acid", file = "alignment_file.phy", format = "
phylip", columns = 15)
frequencies = f.compute_frequencies()

6.7.5

EmpiricalModelFrequencies class

The EmpiricalModelFrequencies class will return the default vector of equilibrium frequencies for a given
empirical model [amino-acid models and the codon model ECM, restricted and unrestricted versions (see
ref. [9] for details)]. These default frequencies correspond to the frequencies originally published with each
respective empirical model. Provide EmpiricalModelFrequencies with the name of the desired empirical
model to obtain these frequencies:
1
2
3

# Obtain frequencies for the WAG model
f = pyvolve.EmpiricalModelFrequencies("WAG")
frequencies = f.compute_frequencies()

4
5

6
7

# For the ECM models, use the argument "ECMrest" for restricted, and "ECMunrest" for
unrestricted
f = pyvolve.EmpiricalModelFrequencies("ECMrest") # restricted ECM frequencies
frequencies = f.compute_frequencies()

Note that Pyvolve uses these empirical frequencies as the default frequencies, if none are provided, for each
respective empirical model!

14

6.7.6

Restricting frequencies to certain states

When using the classes EqualFrequencies and RandomFrequencies, it is possible to specify that only certain states be considered during calculations using the restrict argument, when defining the object. This
argument takes a list of states (nucleotides, amino-acids, or codons) which should have non-zero frequencies. All states not included in this list will have a frequency of zero. Thus, by specifying this argument,
frequencies will be distributed only among the indicated states.
The following example will return a vector of amino-acid frequencies evenly divided among the five specified amino-acids; therefore, each amino acid in the restrict list will have a frequency of 0.2.
1
2
3

# Compute equal frequencies among 5 specified amino acids
f = pyvolve.EqualFrequencies("amino_acid", restrict = ["A", "G", "V", "E", "F"])
frequencies = f.compute_frequencies()

Note that specifying this argument will have no effect on the CustomFrequencies, ReadFrequencies, or
EmpiricalModelFrequencies classes.

6.7.7

Converting frequencies between alphabets

When defining a StateFrequencies object, you always have to indicate the alphabet (nucleotide, amino
acid, or codon) in which frequency calculations should be performed. However, it is possible to have
the .construct_frequencies() method return frequencies in a different alphabet, using the argument
type. This argument takes a string specifying the desired type of frequencies returned (either "nucleotide",
"amino_acid", or "codon").
This functionality is probably most useful when used with the ReadFrequencies class; for example, you
might want to obtain amino-acid frequencies from multiple sequence alignment of codons:
1
2
3

# Define frequency object
f = pyvolve.ReadFrequencies("codon", file = "my_codon_alignment.fasta")
frequencies = f.compute_frequencies(type = "amino_acid")

As another example, you might want to obtain amino-acid frequencies which correspond to equal codon
frequencies of 1/61 each:
1
2

f = pyvolve.EqualFrequencies("codon")
frequencies = f.compute_frequencies(type = "amino_acid") # returns a vector of aminoacid frequencies that correspond to equal codon frequencies

Alternatively, you can also go the other way (amino acids to codons):
1
2

f = pyvolve.EqualFrequencies("amino_acid")
frequencies = f.compute_frequencies(type = "codon")

When converting amino acid to codon frequencies, Pyvolve assumes that there is no codon bias and assigns
each synonymous codon the same frequency.

15

6.8

Specifying mutation rates

Nucleotide, mechanistic codon (dN/dS), and mutation-selection (MutSel) models all use nucleotide mutation rates as parameters. By default, mutation rates are equal for all nucleotide changes (e.g. the Jukes
Cantor model [8]). These default settings can be customized, in the custom model parameters dictionary,
in one of two ways:
1. Using the key "mu" to define custom rates for any/all nucleotide changes
2. Using the key "kappa" to specify a transition-to-transversion bias ratio (e.g. the HKY85 mutation
model. [6])
The value associated with the "mu" key should itself be a dictionary of mutation rates, with keys "AC",
"AG", "AT", etc, such that, for example, the key "AC" represents the mutation rate from A to C. Importantly,
nucleotide and codon models use symmetric mutation rates; therefore, if a rate for "AC" is defined, the same
value will automatically be applied to the change C to A. Thus, there are a total of 6 nucleotide mutation
rates you can provide for a custom nucleotide and/or mechanistic codon model. Note that any rates not
specified will be set to 1.
Alternatively, MutSel models do not constrain mutation rates to be symmetric, and thus, for instance, the
"AC" rate may be different from the "CA" rate. Thus, there are a total of 12 nucleotide mutation rates you
can provide for a custom MutSel model. Again, if a rate for "AC" but not "CA" is defined, then the "AC"
rate will be automatically applied to "CA". Any unspecified nucleotide rate pairs will be set to 1.
1
2

3

# Example using customized mutation rates to construct a nucleotide model
custom_mutation_rates = {"AC":1.5, "AG":0.5, "AT":1.75, "CG":0.6, "CT":1.25, "GT":1.88
}
my_model = pyvolve.Model("nucleotide", {"mu": custom_mutation_rates})

If, instead, the key "kappa" is specified, then the mutation rate for all transitions (e.g. purine to purine or
pyrimidine to pyrimidine) will be set to the specified value, and the mutation rate for all transversions (e.g.
purine to pyrimidine or vice versa) will be set to 1. This scheme corresponds to the HKY85 [6] mutation
model.
1
2

# Example using customized kappa to construct a nucleotide model
my_model = pyvolve.Model("nucleotide", {"kappa": 3.5})

7

Mutation rates vs. branch lengths

In the context of Markov models implemented in Pyvolve, mutation rates do not have the same interpretation as they would in a population genetics framework. Here, mutation rates indicate the relative
probabilities of mutating between different nucleotides. Mutation rates do not correspond the underlying
rate of change across the tree – this quantity is represented instead by branch lengths. The main purpose
of different mutation rates is to set up biases in mutation, for example if you want A→T to occur at a 5times higher rate than C→T. To increase/decrease the rate of overall change, you’ll want to change branch
lengths. Note that branch lengths can be changed across the whole tree using the scale_tree keyword
argument when calling pyvolve.read_tree() (see Section 5).
Importantly, here is what your branch lengths represent for each modeling framework • Nucleotide and amino-acid models: Mean number of substitutions per unit time

16

• Mechanistic codon models (dN/dS): By default, these branch lengths mean the mean number of
substitutions per unit time, agnostic in terms of nonsynonymous vs. synonymous substitutions. To
instead force branch lengths to represent the mean number of neutral substitutions per unit time,
supply the argument neutral_scaling = True when defining a Model instance.
• Mutation–selection models: Mean number of neutral substitutions per unit time.

8

Defining Partitions

Partitions are defined using the Partition() class, with two required keyword arguments: models, the
evolutionary model(s) associated with this partition, and size, the number of positions (sites) to evolve
within this partition.
1
2

# Define a default nucleotide model
my_model = pyvolve.Model("nucleotide")

3
4
5

# Define a Partition object which evolves 100 positions according to my_model
my_partition = pyvolve.Partition(models = my_model, size = 100)

In cases of branch homogeneity (all branches evolve according to the same model), each partition is associated with a single model, as shown above. When branch hetergeneity is desired, a list of models used
should be provided to the models argument (as detailed, with examples, in Section 11).

8.1

Specifying ancestral sequences

For each partition, you can assign an ancestral sequence which will be automatically used at the root of the
phylogeny for the given partition. This can be accomplished using the keyword argument root_sequence:
1
2

# Define a default nucleotide model
my_model = pyvolve.Model("nucleotide")

3
4
5

# Define a Partition object with a specified ancestral sequence
my_partition = pyvolve.Partition(models = my_model, root_sequence = "GATAGAAC")

When providing an ancestral sequence, it is not required to also specify a size for the partition. Pyvolve
will automatically determine this information from the provided root sequence.
Note that, when ancestral sequences are specified, site heterogeneity is not allowed (even if it was provided
to the model used in this partition). Multiple partitions must be specified, each with different rates, to
specify root sequences which will experience different rates across sites.

9

Evolving sequences

The callable class Evolver is Pyvolve’s engine for all sequence simulation. Defining an Evolver object
requires two keyword arguments: partitions, either the name of a single partition or a list of partitions to
evolve, and tree, the phylogeny along which sequences are simulated.

17

Examples below show how to define an Evolver object and then evolve sequences. The code below
assumes that the variables my_partition and my_tree were previously defined using Partition and
read_tree, respectively.
1
2
3

# Define an Evolver instance to evolve a single partition
my_evolver = pyvolve.Evolver(partitions = my_partition, tree = my_tree)
my_evolver() # evolve sequences

4
5
6

7

# Define an Evolver instance to evolve several partitions
my_multpart_evolver = pyvolve.Evolver(partitions = [partition1, partition2, partition3
], tree = my_tree)
my_multpart_evolver() # evolve sequences

9.1

Evolver output files

Calling an Evolver object will produce three output files to the working directory:
1. simulated_alignment.fasta, a FASTA-formatted file containing simulated data
2. site_rates.txt, a tab-delimited file indicating to which partition and rate category each simulated site
belongs (described in Section 9.2.1)
3. site_rates_info.txt, a tab-delimited file indicating the rate factors and probabilities associated with
each rate category (described in Section 9.2.2)
In the context of complete homogeneity, in which all sites and branches evolve according to a single model,
the files "site_rates.txt" and "site_rates_info.txt" will not contain much useful information. However, when
sites evolve under site-wise and/or branch heterogeneity, these files will provide useful information for
any necessary post-processing.
To change the output file names for any of those files, provide the arguments seqfile ("simulated_alignment.fasta"),
ratefile ("site_rates.txt"), and/or infofile ("site_rates_info.txt") when calling an Evolver object:
1
2
3
4

# Define an Evolver object
my_evolver = pyvolve.Evolver(tree = my_tree, partitions = my_partition)
# Evolve sequences with custom file names
my_evolver(ratefile = "custom_ratefile.txt", infofile = "custom_infofile.txt", seqfile
= "custom_seqfile.fasta" )

To suppress the creation of any of these files, define the argument(s) as either None or False:
1
2
3

# Only output a sequence file (suppress the ratefile and infofile)
my_evolver = pyvolve.Evolver(tree = my_tree, partitions = my_partition)
my_evolver(ratefile = None, infofile = None)

The output sequence file’s format can be changed with the argument seqfmt. Pyvolve uses BioPython
to write sequence files, so consult the BioPython AlignIO module documentation (or this nice wiki) for
available formats.
1
2
3

# Save the sequence file as seqs.phy, in phylip format
my_evolver = pyvolve.Evolver(tree = my_tree, partitions = my_partition)
my_evolver(seqfile = "seqs.phy", seqfmt = "phylip")

18

By default, the output sequence file will contain only the tip sequences. To additionally output all ancestral (including root) sequences, provide the argument write_anc = True when calling an Evolver object.
Ancestral sequences will be included with tip sequences in the output sequence file (not in a separate file!).
When ancestral sequences are written, the root sequence is denoted with the name "root", and internal
nodes are named "internal_node1", "internal_node2", etc. To see precisely to which node each internal node
name corresponds, it is useful to print the parsed newick tree with the function print_tree, as explained
in Section 5.
1
2
3

# Output ancestral sequences along with the tip sequences
my_evolver = pyvolve.Evolver(tree = my_tree, partitions = my_partition)
my_evolver(write_anc = True)

9.2

Sequence post-processing

In addition to saving sequences to a file, Evolver can also return sequences back to you for post-processing
in Python. Sequences can be easily obtained using the method .get_sequences(). This method will return
a dictionary of sequences, where the keys are IDs and the values are sequences (as strings). Note that you
must evolve sequences by calling your Evolver object before sequences can be returned!
1
2
3
4

# Return simulated sequences as dictionary
my_evolver = pyvolve.Evolver(tree = my_tree, partitions = my_partition)
my_evolver()
simulated_sequences = my_evolver.get_sequences()

By default, .get_sequences() will contain only the tip (leaf) sequences. To include ancestral sequences
(root and internal node sequences) in this dictionary, specify the argument anc = True:
1

simulated_sequences = my_evolver.get_sequences(anc = True)

9.2.1

Interpreting the "site_rates.txt" output file

The output file "site_rates.txt" has three columns of data:
• Site_Index
– Indicates a given position in the simulated data (indexed from 1)
• Partition_Index
– Indicates the partition associated with this site
• Rate_Category
– Indicates the rate category index associated with this site
The values in "Partition_Index" are ordered, starting from 1, based on the partitions argument list specified when setting up the Evolver() instance. Similarly, the values in "Rate_Category" are ordered, starting from 1, based on the rate heterogeneity lists (see Section 10 for details) specified when initializing the
Model() objects used in the respective partition.

9.2.2

Interpreting the "site_rates_info.txt" output file

The output file "site_rates_info.txt" provides more detailed rate information for each partition. This file has
give columns of data:
19

• Partition_Index
– Indicates the partition index (can be mapped back to the Partition_Index column in "site_rates.txt")
• Model_Name
– Indicates the model name (note that, if no name provided, this is None. Also, only relevant for
branch het)
• Rate_Category
– Indicates the rate category index (can be mapped back to the Rate_Category column in "site_rates.txt")
• Rate_Probability
– Indicates the probability of a site being in the respective rate category
• Rate_Factor
– Indicates either the rate scaling factor (for nucleotide and amino-acid models), or dN/dS value
for this rate category for codon models

9.3

Simulating replicates

The callable Evolver class makes simulating replicates of given modeling scheme straight-forward: simply
define an Evolver object, and then call this object in a for-loop as many times as needed.
1
2
3
4

# Simulate 50 replicates
my_evolver = pyvolve.Evolver(tree = my_tree, partitions = my_partition)
for i in range(50):
my_evolver(seqfile = "simulated_replicate" + str(i) + ".fasta") # Change seqfile
name to avoid overwriting!

10

Implementing site-wise rate heterogeneity

This section details how to implement heterogeneity in site-wise rates within a partition.

10.1

Implementing site-wise heterogeneity for nucleotide and amino-acid models

In the context of nucleotide and amino-acid models, rate heterogeneity is applied by multiplying the rate
matrix by scalar factors. Thus, sites evolving at different rates exhibit the same evolutionary patterns but
differ in how quickly evolution occurs. Two primary parameters govern this sort of rate heterogeneity: the
rate factors used to scale the matrix, and the probability associated with each rate factor (in other words,
the probability that a given site is in each rate category).
Pyvolve models site-rate heterogeneity discretely, using either a discrete gamma distribution or a userspecified discrete rate distribution. Rate heterogeneity is incorporated into a Model object with several
additional keyword arguments, detailed below.

10.1.1

Gamma-distributed rate categories

Gamma (Γ) distributed heterogeneity is specified with two-four keyword arguments when initializing a
Model object:

20

• alpha, the shape parameter of the discrete gamma distribution from which rates are drawn (Note:
following convention, α = β in these distributions [19]).
• num_categories, the number of rate categories to draw
• pinv, a proportion of invariant sites. Use this option to simulate according to Γ + I heterogeneity.
Examples for specifying Γ rate heterogeneity are shown below.
1

2

# Gamma-distributed heterogeneity for a nucleotide model. Gamma shape parameter is 0.5
, and 6 categories are specified.
nuc_model_het = pyvolve.Model("nucleotide", alpha = 0.5, num_categories = 6)

3
4

5

# Gamma-distributed heterogeneity for an amino-acid model. Gamma shape parameter is 0.
5, and 6 categories are specified.
aa_model_het = pyvolve.Model("WAG", alpha = 0.5, num_categories = 6)

6
7

8

# Gamma+I heterogeneity for a nucleotide model with a proportion (0.25) invariant
sites. Remaining sites are distributed according to a discrete gamma, with 5
categories
nuc_model_het = pyvolve.Model("nucleotide", alpha = 0.2, num_categories = 5, pinv = 0.
25)

10.1.2

Custom-distributed rate categories

A user-determined heterogeneity distribution is specified with one (or two) arguments when initializing a
Model object:
• rate_factors, a list of scaling factors for each category
• rate_probs, an optional list of probabilities for each rate category. If unspecified, all rate categories
are equally probable. This list should sum to 1!
Examples for specifying custom rate heterogeneity distributions are shown below.
1
2

# Custom heterogeneity for a nucleotide model, with four equiprobable categories
nuc_model_het = pyvolve.Model("nucleotide", rate_factors = [0.4, 1.87, 3.4, 0.001])

3
4

5

# Custom heterogeneity for a nucleotide model, with four categories, each with a
specified probability (i.e. rate 0.4 occurs with a probability of 0.15, etc.)
nuc_model_het = pyvolve.Model("nucleotide", rate_factors = [0.4, 1.87, 3.4, 0.001],
rate_probs = [0.15, 0.25, 0.2, 0.5])

6
7

8

# Gamma-distributed heterogeneity for an amino-acid model, with four equiprobable
categories
aa_model_het = pyvolve.Model("WAG", rate_factors = [0.4, 1.87, 3.4, 0.001])

If you would like to specify a proportion of invariant sites, simply set one of the rate factors to 0 and assign
it a corresponding probability as usual:
1
2

# Custom heterogeneity with proportion (0.4) invariant sites
nuc_model_het = pyvolve.Model("nucleotide", rate_factors = [0.4, 1.87, 3.4, 0.],
rate_probs = [0.2, 0.2, 0.2, 0.4])

21

10.2

Implementing site-wise heterogeneity for mechanistic codon models

Due to the nature of mechanistic codon models, rate heterogeneity is not modeled with scalar factors, but
with a distinct model for each rate (i.e. dN/dS value) category. To define a Model object with dN/dS
heterogeneity, provide a list of dN/dS values the custom-parameters dictionary, rather than a single rate
ratio value. As with standard codon models, you can provide dN/dS values with keys "omega", "beta",
or "alpha" and "beta" together (to incorporate both synonymous and nonsynonymous rate variation).
By default, each discrete dN/dS category will have the same probability. To specify custom probabilities,
provide the argument rate_probs, a list of probabilities, when initializing the Model object.
Examples for specifying heterogeneous mechanistic codon models are shown below (note that a GY-style
model is shown in the examples, but as usual, both GY-style and MG-style are allowed.)
1

2

# Define a heterogeneous codon model with dN/dS values of 0.1, 0.5, 1.0, and 2.5 .
Categories are, by default, equally likely.
codon_model_het = pyvolve.Model("GY", {"omega": [0.1, 0.5, 1.0, 2.5]})

3
4

5

# Define a heterogeneous codon model with two dN/dS categories: 0.102 (from 0.1/0.98)
and 0.49 (from 0.5/1.02). Categories are, by default, equally likely.
codon_model_het = pyvolve.Model("GY", {"beta": [0.1, 0.5], "alpha": [0.98, 1.02]})

6
7

8

# Define a heterogeneous codon model with dN/dS values of 0.102 (with a probability of
0.4) and 0.49 (with a probably of 0.6).
codon_model_het = pyvolve.Model("GY", {"beta": [0.1, 0.5], "alpha": [0.98, 1.02]},
rate_probs = [0.4, 0.6])

10.3

Implementing site-wise heterogeneity for mutation-selection models

Due to the nature of MutSel models, site-wise heterogeneity should be accomplished using a series of
partitions, in which each partition evolves according to a unique MutSel model. These partitions can then
be provided as a list when defining an Evolver object.

10.4

Implementing site-wise heterogeneity for the Empirical Codon Model

Due to the peculiar features of this model (both empirically-derived transition probabilities and "mechanistic" parameters such as dN/dS), site-wise heterogeneity is not supported for these models, at this time.
Pyvolve will simply ignore any provided arguments for site-rate heterogeneity with this model. Feel free
to email the author to discuss and/or request this feature.

11

Implementing branch (temporal) heterogeneity

This section details how to implement branch (also known as temporal) heterogeneity within a partition,
thus allowing different branches to evolve according to different models. To implement branch heterogeneity, your provided newick phylogeny should contain model flags at particular nodes of interest.

22

Model flags may be specified with either hashtags (#) or underscores (_), and they may be specified to
satisfy one of two paradigms:
• Using both trailing and leading symbols, e.g. _flagname_ or #flagname# . Specifying a model
flag with this format will cause ALL descendents of that node to also follow this model, unless a
new model flag is given downstream. In other words, this model will be propagated to apply to all
children of that branch.
• Using only a leading symbol, e.g. _flagname or #flagname. Specifying a model flag with this format
will cause ONLY that branch/edge to use the provided model. Descendent nodes will NOT inherit
this model flag. Useful for changing model along a single branch, or towards a single leaf.
Model flags must be provided AFTER the branch length. Model flags may be repeated throughout the tree,
but the model associated with each model flag will always be the same. Note that these model flag names
must have correspondingly named model objects.
For example, a tree specified as
(t4:0.785,(t3:0.380,(t2:0.806,(t5:0.612,t1:0.660):0.762_m1_):0.921_m2_):0.207);
will be interpreted as in Figure 1. Trees with model flags, just like any other tree, are defined with the function read_tree. Some examples of trees with model flags:
1
2

3
4

# Define a tree with propagating model flags m1 and m2, with a string
het_tree = pyvolve.read_tree(tree = "(t4:0.785,(t3:0.380,(t2:0.806,(t5:0.612,t1:0.660)
:0.762_m1_):0.921_m2_):0.207);")
#OR
het_tree = pyvolve.read_tree(tree = "(t4:0.785,(t3:0.380,(t2:0.806,(t5:0.612,t1:0.660)
:0.762\#m1\#):0.921\#m2\#):0.207);")

5
6
7
8
9
10
11
12
13
14
15
16
17
18

# Print het_tree to see how model flags are applied:
pyvolve.print_tree(het_tree)
'''
>>> root None None
>>>
t4 0.785 None
>>>
internalNode3 0.207 None
>>>
t3 0.38 None
>>>
internalNode2 0.921 m2
>>>
t2 0.806 m2
>>>
internalNode1 0.762 m1
>>>
t5 0.612 m1
>>>
t1 0.66 m1
'''

19
20
21

22
23

# Define a tree with non-propagating model flags m1 and m2
het_tree = pyvolve.read_tree(tree = "(t4:0.785,(t3:0.380,(t2:0.806,(t5:0.612,t1:0.660)
:0.762_m1):0.921_m2):0.207);")
#OR
het_tree = pyvolve.read_tree(tree = "(t4:0.785,(t3:0.380,(t2:0.806,(t5:0.612,t1:0.660)
:0.762#m1):0.921#m2):0.207);")

24
25
26
27

# Print het_tree to see how model flags are applied:
pyvolve.print_tree(het_tree)
'''

23

28
29
30
31
32
33
34
35
36
37

>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
'''

root None None
t4 0.785 None
internalNode3 0.207 None
t3 0.38 None
internalNode2 0.921 m2
t2 0.806 None
internalNode1 0.762 m1
t5 0.612 None
t1 0.66 None

0.785

0.38

t4

t3

0.207

0.806

t2

0.921
0.612
Root model

t5

0.762

m1 model
m2 model

0.66

t1

Figure 1: The newick tree with model flags given by
"(t4:0.785,(t3:0.380,(t2:0.806,(t5:0.612,t1:0.660):0.762_m1_):0.921_m2_):0.207);"

indicates the model assign-

ments shown.
All model flags specified in the newick phylogeny must have corresponding models. To link a model to
a model flag, specify a given model’s name using the keyword argument name when initializing a Model
object. This name must be identical to a given model flag, without the leading and trailing symbols (e.g. the
name "m1" corresponds to the flag _m1_ and/or #m1#).
The model at the root of the tree will not have a specific model flag, but nonetheless a model must be used
at the root (obviously), and indeed at all other nodes which are not assigned a model flag (not that all
branches on the tree which are not assigned a model flag will evolve according to the model used at the
root). To specify a model at the root of the tree, simply create a model, with a name, and indicate this name
when defining your partition.
Examples for defining models with names are shown below (for demonstrative purposes, nucleotide models with extreme state frequency differences are used here):
1
2

# Define the m1 model, with frequencies skewed for AT-bias
m1_model = pyvolve.Model("nucleotide", {"state_freqs":[0.4, 0.1, 0.1, 0.4]}, name = "
m1")

3
4
5

# Define the m2 model, with frequencies skewed for GC-bias
m2_model = pyvolve.Model("nucleotide", {"state_freqs":[0.1, 0.4, 0.4, 0.1]}, name = "

24

m2")
6
7
8

# Define the root model, with default equal nucleotide frequecies
root_model = pyvolve.Model("nucleotide", name = "root")

Alternatively, you can assign/re-assign a model’s name with the .assign_name() method:
1
2

# (Re-)assign the name of the root model
root_model.assign_name("new_root_model_name")

Finally, when defining the partition that uses all of these models, provide all Model objects in list to the
models argument. In addition, you must specify the name of the model you wish to use at the root of the
tree with the keyword argument root_model_name.
1
2

# Define partition with branch heterogeneity, with 50 nucleotide positions
temp_het_partition = pyvolve.Partition(models = [m1_model, m2_model, root_model], size
= 50, root_model_name = root_model.name)

12

Implementing branch-site heterogeneity

Simulating according to so-called "branch-site" models, in which there are both site-wise and branch heterogeneity, is accomplished using the same strategies shown for each individual aspect (branch, Section 11
and site, Section 10). However, there is a critical caveat to these models: all models within a given partition
must have the same number of rate categories. Furthermore, the rate probabilities must be the same across
models within a partition; if different values for rate_probs are indicated, then the probabilities provided
for the root model will be applied to all subsequent branch models. (Note that this behavior is identical for
other simulation platforms, like Indelible [3].)
The example below shows how to specify a branch-site heterogeneous nucleotide model with two models,
root and model1 (note that this code assumes that the provided phylogeny contained the flag _model1_),
when the rate categories are not equiprobable.
1

2

# Shared rate probabilities. Must be explicitly specified for all models (not just the
root model)!
shared_rate_probs = [0.25, 0.3, 0.45]

3
4
5

# Construct a nucleotide model with 3 rate categories
root = Model("nucleotide", name = "root", rate_probs = shared_rate_probs, rate_factors
= [1.5, 1.0, 0.05])

6
7
8

# Construct a second nucleotide model with 3 rate categories
model1 = Model("nucleotide", name = "model1", rate_probs = shared_rate_probs,
rate_factors = [0.06, 2.5, 0.11])

9
10
11

# Construct a partition with these models, defining the root model nameas "root"
part = Partition(models = [root, model1], root_model_name = "root", size = 50)

25

References
[1] M Arenas. Simulation of molecular data under diverse evolutionary scenarios. PLoS Comp Biol,
8:e1002495, 2012.
[2] MO Dayhoff, RM Schwartz, and BC Orcutt. A model of evolutionary change in proteins. Atlas of
Protein Sequence and Structure, 5(3):345–352, 1978.
[3] W Fletcher and Z Yang. INDELible: A flexible simulator of biological sequence evolution. Mol Biol
Evol, 26(8):1879–1888, 2009.
[4] N Goldman and Z Yang. A codon-based model of nucleotide substitution for protein-coding DNA
sequences. Mol Biol Evol, 11:725–736, 1994.
[5] AL Halpern and WJ Bruno. Evolutionary distances for protein-coding sequences: modeling sitespecific residue frequencies. Mol Biol Evol, 15:910–917, 1998.
[6] M Hasegawa, H Kishino, and T Yano. Dating of human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol, 22(2):160–174, 1985.
[7] DT Jones, WR Taylor, and JM Thornton. The rapid generation of mutation data matrices from protein
sequences. CABIOS, 8:275–282, 1992.
[8] TH Jukes and CR Cantor. Evolution of protein molecules. In HN Munro, editor, Mammalian protein
metabolism. Academic Press, New York, 1969.
[9] C. Kosiol, I. Holmes, and N. Goldman. An empirical codon model for protein sequence evolution. Mol
Biol Evol, 24:1464 – 1479, 2007.
[10] SQ Le and O Gascuel. An improved general amino acid replacement matrix. Mol Biol Evol, 25:1307–
1320, 2008.
[11] Alexander Mirsky, Linda Kazandjian, and Maria Anisimova. Antibody-specific model of amino acid
substitution for immunological inferences from alignments of antibody sequences. Mol Biol Evol, 32:806
– 819, 2015.
[12] SV Muse and BS Gaut. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol, 11:715–724, 1994.
[13] SJ Spielman and CO Wilke. The relationship between dn/ds and scaled selection coefficients. Mol Biol
Evol, 32:1097 – 1108, 2015.
[14] S Tavare. Lines of descent and genealogical processes, and their applications in population genetics
models. Theor Popul Biol, 26:119–164, 1984.
[15] S Whelan, P de Bakker, E Quevillion, N Rodriguez, and N Goldman. PANDIT: an evolution-centric
database of protein and associated nucleotide domains with inferred trees. Nucleic Acids Res, 34:D327
– D331, 2006.
[16] S Whelan and N Goldman. A general empirical model of protein evolution derived from multiple
protein families using a maximum likelihood approach. Mol Biol Evol, 18:691–699, 2001.
[17] N Yang, R Nielsen, and M Hasegawa. MOLPHY version 2.3: programs for molecular phylogenetics
based on maximum likelihood. Comput Sci Monogr, 28:1–150, 19896.
[18] N Yang, R Nielsen, and M Hasegawa. Models of amino acid substitution and applications to mitochondrial protein evolution. Mol Biol Evol, 15:1600–1611, 1998.
[19] Z. Yang. Computational Molecular Evolution. Oxford University Press, 2006.

26

Source Exif Data:

File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
PDF Version                     : 1.5
Linearized                      : No
Page Count                      : 27
Page Mode                       : UseOutlines
Author                          : 
Title                           : 
Subject                         : 
Creator                         : LaTeX with hyperref package
Producer                        : pdfTeX-1.40.17
Create Date                     : 2019:01:20 14:54:14-05:00
Modify Date                     : 2019:01:20 14:54:14-05:00
Trapped                         : False
PTEX Fullbanner                 : This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2

EXIF Metadata provided by EXIF.tools

Pyvolve Manual

Navigation menu

Versions of this User Manual:

Views

Navigation