Bayesian SSC

Bayesian Serial SimCoal

from The Hadly Lab at
Stanford University

Programmed and maintained by:
Christian Anderson

Now at Havard University

at UC San Diego

Visualize your modern and paleo data via Tempnet

Because this program owes so much to Excoffier et al.'s SIMCOAL (ver 1.0), we ask that users of Serial SimCoal cite the following in any publications.

Excoffier L, Novembre J and Schneider S (2000) SIMCOAL: a general coalescent program for simulation of molecular data in interconnected populations with arbitrary demography. J. Hered., 91, 506-509.

Version History

Bayesian Serial SimCoal, (BayeSSC) is a modification of SIMCOAL 1.0, a program written by Laurent Excoffier, John Novembre, and Stefan Schneider. Their website presents the most thorough documentation available for this family of programs, and should be regarded as authoritative.

The original version of the software on this site, Serial Simcoal, was first described in:

Anderson CNK, Ramakrishnan U, Chan YL and Hadly EA (2005) Serial SimCoal: A population genetic model for data from multiple populations and points in time. Bioinformatics, 21, 1733-1734.

The modifications were made explicitly to be backwards compatable with SIMCOAL 1.0, and SSC, so any input file from any previous version can also be used for input into BayeSSC. After work on this project began, a new version (SimCoal2) was released. Details can be found at any of the links above.

SIMCOAL 1.0 is a very flexible package that allows for almost any sampling regime and population history. Serial SimCoal allows samples to be taken from different points in time. Using ancient DNA, one could "scenario test", and get approximations of how likely one scenario was relative to another. BayeSSC Bayes SSC is powerful because it allows flexible coalescent modelling from a variety of different priors. The enables parameter estimation, likelihood calculations, and Bayesian inference. For example, SIMCOAL 1.0 could be used to simulate the effect of a historical event of modern samples, Serial SimCoal incorporates ancient samples from before the event, and BayeSSC derives the most likely date and severity of the event.

Bayesian Coalescence

The tutorial below explains the basics of coalescence theory. Users who are familiar with the coalescent process can skip ahead.

Typically, BayeSSC generates thousands of hypothetical trees using slightly different population parameters. The simulated genetics of these trees can then be compared to the actual genetics of the user's samples to investigate which history of these many simulated histories is the most likely to have generated the samples.

There are many ways to do this. One particularly useful approach is Mark Beaumont's ABC method (Approximate Bayesian Computation described here). In this method, the average euclidean "distance" between the simulated genetic statistics and the observed genetic statistics is calculated for thousands of parameter combinations. This approximates the error associated with each simulation; parameters yielding low errors are more likely to represent the true population history than parameters with high errors. If errors are very high, the simulation is rejected entirely. These results can then be used to estimate the Bayesian posterior probability of different parameter combinations.

The sections below give examples of how this works, and how to create input files specifying the range of histories under consideration.

Input and Output

Schematic Representation

The program flow of Serial SimCoal can be summarized in the following figure:

Input

Only one input file is needed to run the simulations. The format of the input file is almost identical to the "input file notation" section of the SimCoal 1.0 Help Manual. To help learn the input format, we'll walk through an example. Imagine you have 300-year-old samples from a population with high levels of genetic diversity. But when you take samples from the modern population, diversity is either low or quite different than the older samples. One reasonable hypothesis is this change in the genetic structure is due to a bottleneck in the population between now and 300 years ago. (We're assuming for simplicity here that 1 generation = 1 year). With that hypothesis, BayeSSC could be used to answer the following questions:

When did this bottleneck occur?
How severe was it? That is, how small was the population during the bottleneck?

Here is an example of an input file that could be used to investigate this question. The fields are described in detail below. If you wish, you can download this file as a template, or for practice running the program.

Input File: eg_bayes.par

Parameter Type

An example of input parameters for BayesSSC
1 population with ancient data
Deme sizes (haploid number of genes)
3000
Sample sizes: # samples, age, deme, stat group
2 sample groups
10 0 0 0
10 300 0 1
Growth rates
{ln([4]/3000)/[2]}
Number of migration matrices
0
Historical event: date from to %mig new_N new_r migmat
1 event
{U:1,299} 0 0 0 {3000/[4]} 0 0
Mutations per generation for the whole sequence
0.0003
Number of loci
300
//Data type : either DNA, RFLP, or MICROSAT
DNA 1
//Mutation rates: gamma parameters, theta and k
0 0
{E:200}

Comment
Populations
Comment
DemeSize
Comment
# of Samples
SampleSize
SampleSize
Comment
GrowthRate
Comment
# of Mig_Matrix
Comment
# of Events
Event
Comment
MutationRate
Comment
Loci
Comment
DataType
Comment
Mut_Model
AbstractPrior

What follows is an explanation of each type of parameter listed in the second column. As mentioned above, most of these parameters are identical to the parameters of SimCoal 1.0, and alternative explanations can be found in the Help Manual to that program.

Type Explanation

Comments:

These lines do not actually affect the program in any way. They serve as a place for users to jot notes to themselves about what comes next. In most example files, comments begin with a double slash (//). They don't need to, but it's a good way to distinguish notes from lines that contain information the program will use. They can be no longer than one line, or you can leave them blank, but you must have one comment line before each line of information.

Example:
//This is a comment, blah blah blah

Populations:

Put the number of demes you wish to simulate after the first comment line, along with whatever text you want. If you are using samples from more than one time point then the text following the number must include the words "with ancient".

Example 1 (a simulation with only modern DNA):
//Begining of SimCoal file:
3 populations from Fiji, Vanuatu, and Kiribati

Example 2 (with old samples):
//Parameters for a SimCoal File:
3 demes with ancient DNA data

Deme sizes:

One number per population representing the effective population size (N_ef) of that population. Keep in mind that the effective population size is very different than the census size: although there are nearly 7 billion people on the planet, most races have an effective population size much smaller than 50,000.

Example (sim with 3 populations):
//Deme sizes:
1000
3200
8743

Sample sizes:

Without ancient information: One sample group per population is assumed. List the number of samples from each population

With ancient information: An arbitrary number of sampling groups can be added to each population, and they can be pooled together in any combination for statistical analysis. The first line begins with the total number of sampling groups, and can end with any text you want. After that the format is:
First: Number of individuals in sample
Second: Age of the sample (in generations)
Third: The number of the deme the sample belongs to (0,1,2,...)
Fourth: Which stat group the sample group should be pooled with.

Example 1: No ancient data, three populations
//Sample Sizes:
20
12
31

Example 2: Three populations with ancient data
//Format: Samples, age, deme, stat group
8 sample groups; that's quite a few
10 0   0 0
10 200 0 1
10 0   1 0
2  450 1 2
10 0   2 0
10 200 2 1
7  450 2 2
4  450 2 3

In both examples, there are 20, 12 and 31 samples taken from three populations respectively. In the second example, 10 of the samples are recent in each population, the rest are ancient. The 20 samples that are 200 generations old are pooled together for statistical calculations, and 9 of 13 450-generation-old samples are pooled. The last four will have their statistics calculated independently (just to show you that it's possible).

Growth rates:

This one is a bit tricky. It is the NEGATIVE of the intrinsic rate of growth (r) from the standard equation for exponential growth:
N(t)=N(0)e^rt
Enter one value per population. Because coalescent simulations run backward through time, a negative growth rate implies a population larger now than in the past.

Example: Two stable populations, and one that is growing 2% per generation
//Growth rates:
0
0
-.02

Migration matrices:

Several matrices can be listed in the input file. The first line begins with the number of matrices (0 is fine). The next lines define the ratio of migrants from each deme to each deme; each migration matrix must be preceeded by a comment. The first migration matrix is assumed to represent the migration in the present (or at t=0). If you have more than one population but no migration, then the demes will NEVER coalesce and you will get no information. Note that the diagonal elements of the matrix are meaningless, but the simulations will run faster if you set them to 0.

Example 1: two migration matrices for a simulation with three populations
//Migration matrices
2
//Matrix 0: Deme0 <-> Deme1 <-> Deme2
0   .01 0
.01 0   .01
0   .01 0
//Matrix 1: Migration stopped
0 0 0
0 0 0
0 0 0

Example 2: Some lineages "today" in deme 0 are descendants of lineages originally in deme 1, but not vice versa. That is, as the tree builds back through time, there is a 1%/gen chance that each lineage in deme 0 will migrate to deme 1.
//Asymetric migration matrix
1 matrix
//Migration from deme 0 to deme 1 *backwards* through time
0   .01
0   0

Historical events:

Like migration matricies, the first line gives the number of events (0 is fine). Each subsequent line then lists one such event (no comment lines between). An event consists of the following:

The time (in generations) when the event occured
The source deme (0,1,2...)
The sink deme.
The proportion of the source that migrates to the sink. It also represents the probability for each lineage in the source deme to migrate in the sink deme. If no migration is involved in the event, then just specify the same source, sink, and a migration probability of 0.
The new effective population size of the sink deme relative to one generation later in time. Remeber, coalescent simulations run backwards. So a value of 0.5 here implies the event doubled the population (think, "The population used to be half as big").
The new growth rate of the sink deme. Negative values mean the population is growing.
The id of the new migration matrix to use for all demes.

Example: 2000 generations ago, deme 0 and 2 split from what used to be a larger deme 1
//Format: time, src, sink, % mig, new Nef, new r, MigMat
2000 0 1 1 2 0 1
2000 2 1 1 1 0 1

You could read the first event like this: When you go back to 2000 generations, take deme0 and move 100% of those lineages back to deme1, make deme1 twice as big, but then keep deme1's population constant thereafter (r=0), and use migration matrix 1 from now on.

COMMON MISTAKE: Remember that even though several events may happen during the same generation, the computer will apply them sequentially. For example,
100 0 1 1 5 .5 1
100 2 1 1 5 0 1
will first make Deme1 5x bigger and growing, but then make it 5x bigger again (=25x bigger) and reset the growth rate to 0. Similarly,
100 0 1 .5 1 0 0
100 0 2 .5 1 0 0
will send half of deme0 to deme1, and then send half of the half that is left to deme2, leaving 25% of the lineages in deme0.

Mutation rate:

This number refers to the mutation rate for all loci taken as a whole. For DNA sequences, it should correspond to the average mutation number of mutations per generation per nucleotide, times the number of nucleotides.

It is possible to have the mutation rate vary with time. See the section on prior distributions below.

Example: 10%/bp/million years for a 300bp sequence and a species whose generations are 5 years long
10%/bp/1,000,000yr = .00000001/bp/yr * 300bp = .00003/yr * 5 yr/gen = .00015/gen
//Mutation rate
.00015

Number of loci:

For DNA, the length of the sequence to simulate. For RFLP and STRs, the number of RFLPs/STRs to simulate. Complete linkage is assumed.

Example: A 300bp sequence
//Number of loci:
300

Data type:

Type of data to simulate:

MICROSAT: Microsatellites are simulated with a pure stepwise model, and can be followed with a range constraint if you wish (no number implies no limit).
DNA: followed by the transition/transversion bias number. Mutation probabilities can be heterogenous (see "Gamma")
RFLP: a two allele model.

Example 1: Using DNA where 1/3 of the mutations are A<->G or C<->T (all mutations are equally likely)
//Number of loci:
DNA 0.33333

Mutation model:

These parameters control the heterogeneity of DNA mutation rates along the sequence. The first number is the shape parameter a of a Gamma distribution of mutation rates. If a value of zero is entered, then an even mutation rate model is implemented. The second number is the number of rate classes to simulate. If a value of zero is entered, then a continuous distribution is used (as many classes as there are loci or nucleotides). Note that this is the reverse order and nomenclature from that used by statisticians when talking about gamma distributions. For a statistician, the first number is the scale parameter q and the second number is the shape parameter k.

Example 1: Uniform mutation rates (Cantor-Jukes model)
//Gamma distribution for mutation:
0 0

Example 2: Heterogenous mutation (Kimura 2-Parameter model)
//Gamma distribution for mutation:
0.4 10

Abstract Priors:

See below

Bayesian Prior Distributions

At any point in the input file, rather than typing in an actual number, you can also specify a range of possible values to investigate. BayeSSC allows five different prior distributions:

In the example file above there are two priors: a uniform distribution allowing the date of the bottleneck to vary uniformly from 1 to 299 generations in the past {U:1,299}; and an exponential distribution for the population size during the bottleneck {E:200}. This will investigate mostly bottleneck population sizes less than 200, but occasionally try one much larger.

The program will also evaluate expressions, such as {4*6+2^3} (which equals 32) and {4*(6+2^3)} (equalling 56). All items contained in curly brackets {} recieve an id number, starting at 1 (not 0). To reference this number, use square brackets. Thus, {6*[2]^3}, means "six times the second prior-distribution-or-equation cubed". The equation interpreter understands the following symbols:

+ - * / ^ ( ) ln exp
Parentheses can be nested; that is {((4-3)*2)^2} would be valid, and equal 4.

In the example file above, there are two expressions given. The growth rate is given as {ln([4]/3000)/[2]}, meaning the modern population size of 3000 is the result of exponential growth following a bottleneck when the population was size (prior 4). This bottleneck occured (prior 2) generations in the past. The equation comes from rearranging the exponential growth equation:

The second expression is the scaling factor in the historical event: {3000/[4]}. Here we are assuming the population crashed from an ancestral size of 3000 individuals to the bottleneck size (prior [4]) in a single generation. Therefore, the generation before the bottleneck, the population was 3000/[4] times larger. We could just as easily assume the effective ancestral population size was 10000: 10000/[4]; or that the ancestral population was always the same size as the bottleneck population (ie, there was no bottleneck, just recent growth from a historically small population): 1.

For the mutation rate (and only for the mutation rate), it is possible to vary with time. If, for example, you wanted the mutation rate to begin at .05 and decrease to nothing going back in time, you could do this by specifying the mutation rate as {.05*.9^[T]}. The program understands [T] to be the time in generations before present. You can use any formula you wish. While this makes the program quite flexible, it also makes it run slower than if you specify a constant mutation rate or one based only on "once-per-simulation" priors.

Output

Most of the output from this program is the same as from SimCoal 1.0. If you tell the program to run B simulations you will get:

B Arlequin project files (*.arp) containing sequence information. This was meant to be used with the Arlequin program (also by Excoffier and company) to calculate statistics on large samples of genetic data. Each file contains the sequence/allele information for each sample in the simulation.
An Arlequin batch file (*.arb) containing B links to the Arlequin project files above.
NOTE: Arlequin files are not output by default. To get these output files, add the "-a" flag to the command line: "BayeSSC.exe -a".
B Paup files (*.pau) containing sequence/allele information from each sample for each iteration, and also the resulting genealogical tree. This is in NEXUS format.
One huge Paup file (*.pau) containing all the information from the B smaller Paup files above, and all B trees as well.
NOTE: These files are repressed by default. To get these output files, add the "-p" flag to the command line: "BayeSSC.exe -p".
A NEXUS batch file (*.bat) that contains links to all B of the paup files.
One unique output file (*.gen) that lists all the input values and calculates a few statistics on the results. These include the mutation rate per locus, the mean and standard deviations of coalescence times, the number and standard deviation of pairwise differences, and the average and standard deviation of the number of mutations per genealogy.
A file listing the number of mutations per locus (mut_hits_distr.sum) for each of the B simulations.
Two genealogy files (*.trees) one listing the branch lengths using the true generation length (*_true_trees.trees) and one listing the branch lengths in number of mutations (*_mut_trees.trees).

Serial SimCoal adds two more to this impressive list:

A tracer file (*.log) that outputs the sample size and number of extant lineages every generation. By default, this file is supressed. If you want this file output to the harddrive, execute BayeSSC with a "-t" flag from the command line. These files can be quite large, so probably one should run only 1 or 2 simulations in tracer mode.

A database of statistics (*_stat.csv) listing a number of genetic statistics for each iteration of the program. These are reported for each stat_group (as defined in the "Sample size" field of the input file), and for all samples combined. The statistics returned in this file include...

Within-StatGroup Statistics

Number of alleles

Number of private alleles. That is, for each allele in a stat_group, how many of the alleles are unique to individuals from one deme of the stat_group. For example, consider the following arrangmenent of alleles A-E.

	DEME 0:	DEME 1:
stat_group 0	A, B, C	A, C, D
stat_group 1	A, B	A, D, E

For stat group 0, there are 2 private alleles (B and D). In stat_group 1, there are 3 private alleles (B, D and E). In the combined stat_group, there is only 1 private haplotype (E). All three results are in the output file.

For STR and RFLP data: Heterozygosity
For STR and RFLP data: Allelic variance (for the first locus only)
For STR: Repeat frequency distribution
For DNA: Segregating sites
For DNA: Pairwise Differences
For DNA: Haplotype Diversity (biased by a factor of (n-1)/n)
For DNA: Nucleotide Diversity
For DNA: Tajima's D
For DNA: Mismatch Distribution. That is, a series of numbers where the first represents the number of pairs of numbers with no differences, the second all the pairs with 1 difference, etc. For example, in the following haplotype network at right has three populations, one with 7 samples one with two samples and 2bp different, and one with three samples and 1bp different. there are 7*6/2=21 identical pairs in the central popuation, 1 in the left, and 3 in the right for a total of 25 pairs with 0 differences. Then there are 7*3=21 pairs with one difference, 7*2=14 with two differences, and 2*3=6 pairs with three differences. So the mismatch distribution would be {25 21 14 6}.

Between-StatGroup Statistics

Number of alleles private to each stat group
For STR and RFLPs: Pearson's G² (how well sample 1 matches expectations from the pool of sample 1 and sample 2)
For STR and RFLPs: Slatkin's R_ST
For DNA: Pairwise Differences
For DNA: Mean Diversity ()
For DNA: Pooled Diversity (H_T)
For DNA: F_ST*
And the number of generations to the most recent common ancestor (MRCA)

* We use the definition proposed by Hudson et al. 1992 Genetics, Eq., 3. Arlequin will do this calculation if you choose "Population Comparisons -> Compute pairwise Fst -> Compute distance matrix -> Pairwise difference -> Gamma a value = 0.

This file can be opened using Microsoft Excel, or any other spreadsheet, database or statistical program (it's in .csv format). You can download a sample file here. This was produced by running the example file above through 50 simulations. This is a very small number of runs for statistical purposes. As a general rule of thumb, you should run 10^(3+p) simulations, where p is the number of parameters you are approximating. Hypothesis testing (with no parameters) can be adequately done in 1000 runs, but several published studies have done 10,000,000 simulations or more.

Modifications

Aside from the changes necessary to make SimCoal 1.0 compatible with ancient genetic data, there were four other alterations made to the source code.

Statistics Package

As previously discussed, a statistics program was integrated into SimCoal itself so that relevent population statistics could be calculated on the fly, not in an ex post facto analysis in Arlequin or Paup. The statistics are output into a *_stat.csv file. For a description of the contents of this file, see the section on output.

Multiple Coalescences

The original version of SimCoal allowed only one coalescent event per generation. This assumption is a reasonably good approximation for situations where the population size (N) is much larger than the sample size (k). However there are situations where this assumption is not met, as demonstrated below.

The probability of a coalescent event occuring in a population of size N with k lineages is:

After these two lineages have coalesced, there are (k-1) lineages remaining, so the probability of one more coalescence is simply:

Thus the probability of two coalescent events occuring at the same time is [1]*[2], or

This equation can be greatly simplified by the following approximation:

This approximation is better than it may at first appear (the error is <5% for k>5 and <1% for k>10). The implication of formula [4] is that there is a 100% chance that more than one lineage will coalesce in each generation where:

As a result, if condition [5] is ever met during a simulation, the program will surely produce a long-biased tree. Even if this condition is not met, the tree will still probably be biased longer than appropriate due to missing less frequent double or triple coalescences. This may translate into greater genotypic diversity than predicted.

This was changed in Serial SimCoal to allow as many coalescences per generation as was appropriate. After the first coalescence, the number of extant lineages is adjusted (decremented by one), and the probability of coalescence is recalculated. If a second coalescence occurs, again the probabilities are adjusted. This process continues until either a coalescence fails to occur (Unif(0,1)>Pr(coalescence)) or the tree coalesces completely. Again, it should be noted that this bias has also been adjusted in Excoffier's new version of the program,
SimCoal2.

Hudsonian Process

We chose to modify SimCoal 1.0 because of its unusually good tolerance of ambiguities: you can have any amount of migration, any size of populations, any number of samples, and all of these can change arbitrarily at any time thanks to historical events. However, in almost every simulation there comes a time when (1) all historical events have occured, (2) deme size has ceased to fluctuate and (3) no migration is occurring. When these three conditions are met, the waiting time to the next coalescent event can be calculated extremely rapidly using a random exponential variable. R.R. Hudson was among the first to exploit this convenient property. In Serial SimCoal, we use a Hudsonian process in such situations. This turns out to be a profitable modification, since these three conditions are often met near the root of the genealogy, when waiting times to the last few coalescent events very long. Thus, instead of waiting several thousand generations for one random number to be small enough for a coalescence, each coalescence time is generated from just one random number. This process, when implemented with example 1, sped the simulation up by a factor of 3. In simpler situations, gains would be even greater.

Mersenne Twister

The default pseudorandom number generator in C++ has certain, usually unimportant problems. One of them is that (on most operating systems) it produces a random number between 0 and 32,767. This means that if you run several million simulations with prior distributions, you will be trying the same 32,768 values multiple times, but not the values in between. BayeSSC uses the Mersenne Twister, which generates random numbers with a granularity of 2^63, is extremely fast, and only repeats every 43*10^6000 random numbers.

Model Comparison

How do I decide which model is "right"?

Often, researchers are interested not just in the parameters of any one model, but want to compare many different scenarios of past history. For example, the example file above eg_bayes.par assumes that there was a bottleneck in the population at some point in the last 300 generations. We might want to know if this model is any more likely than a model which has no bottleneck, that is, if the population has been the same size since the most recent common ancestor (MRCA). One way to do this is to evaluate the likelihood of the two models. This requires a few steps of analysis to be performed on the output of BayeSSC. The method outlined below is only one of many possible (and published) methods for doing this analysis, and researchers are encouraged to try their own techniques. The code for this method is written for the R statstical package, which is free, open source, and quickly becoming the most important mathematical package in the scientific world.

Approximate Bayesian Computation

Once you have run your simulations, you want to know which combinations of prior values give results that match your data. For example, it may be the case that your genetic sequences came from a population whose size was 10,000, but that a population of size 20,000 might have produced the same genetic sequences as well; in fact it may have been even more likely to have produced them! What ABC tells you is the relative probability of getting your data at different prior values. This means that the "right answer", the right population size for example, is not a single number, but a probability distribution. This is called a posterior distribution.

Consider the example file above. In this file we are trying to approximate two parameters: the date when the bottleneck happened, and the size of the population during the bottleneck (the modern growth rate and ratio of the bottleneck size to the ancient population size are not "free parameters", since they can be calculated once we know the other two). Let's say that we suspect a bottleneck, because the haplotype diversity in the modern is 0 (all samples have the same haplotype), but 300 generations ago it was 0.66. We are therefore trying to find the relative likelihood of bottlenecks of ~200 individuals at about 1-299 generations ago to produce this signiture. Here is an example of _stat.csv output you might get by running eg_bayes.par:

...and a bit later on in the file:

When BayeSSC simulated the first history, it randomly chose a bottleneck population size of 264 ("Abstract 0") and bottleck time of 164 generations ago ("Event Tim"). The simulation run under those conditions produced a haplotype diversity of 0.54 in the modern group (Group 0), and a haplotype diversity of 0.46 in the old group (Group 1). Of course it is generally not the case that such a bottleneck produces higher diversity in the modern group, but the simulation shows that it is possible. This is why we need to run many, many simulations to get a sense of the relative probability of different parameter values to produce data "like" ours.

One option for determining posterior distributions from this data is to use a rejection method. You will need to download the R statstical package. The first time you do this analysis in R you will also need to install the locfit, akima, and lattice packages (from the dropdown menu in R), and copy and paste this source code into R. Once you've done all that, then at the > prompt, type reject("[_stat.csv file name]") into R.

You will be given a list of the columns in the file, and asked which ones you want to use in the analysis. In this case, we want the haplotype diversity in group 0 (column #6) and the haplotype diversity in group 1 (column #22). For this example, lets say our data had 0 diversity in group 0 and 0.66 in group 1. In the next step, the program calculates the "euclidean distance" from each simulation result to the observed data. The smaller this distance, the more closely the simulated values match the real data. The program then asks you for a delta value. Simulations that lie within delta units of the observed data are "accepted", while simulations that produced data further away are rejected. Researchers tend to use delta values that accept from 0.1% to 10% of the simulations. However, you also want many "acceptances" to draw valid inferences about your posterior (certainly 50 or more), which is why it is better to do large numbers of simulations. In the example at right, we see that almost 400 of the 5000 simulations were within 0.1 units of the "right" answer, so we should choose a value of 0.1 or lower. For this example, I somewhat arbitrarily chose 0.05. That value produces the following posterior distributions (there are 4 priors in the file, but we are only interested in Posterior 2: the date of the bottleneck, and Posterior 4: the population size during the bottleneck):

The transparent bars represent the prior probabilities (uniform from 0 to 300, and exponential 200 respectively). The plum-colored distribution is the posterior. We have little confidence in the date of the event, though 250 generations b.p. is more likely than 50 generations b.p. We can have more confidence in claiming that the bottleneck population size was less than 200 individuals.

Posteriors

The reject() method in R will return the accepted simulation values, and the cumulitive density functions (cdfs) of the posterior distributions. Armed with this information, you can now rerun the simulations using the "best" (or "maximally credible") version of the model. There are two options at this point. The simple option is to replace each prior distribution with its maximum likelihood estimate (MLE), the peak of each posterior distribution, which is returned in the _post.pdf file. They can also be determined directly from the reject function using the mle() function (for example, reject(...)->eg; mle(eg) ). In the example above, we would replace the time of the first event in eg_bayes.par "{U:1,299}" with the MLE value "250". The other, more complicated option is to fit a standard statistical distribution to the posterior distribution, and replace the prior with this approximate posterior. Here I give an example of how to do this.

To estimate the posterior distribution, save the results of the rejection method (reject("[_stat.csv file name]")->eg). Then analyze them using the approx.posterior() function (for example, approx.posterior(eg$accept.sims$Event.Time.0)). This will attempt to fit four families of probability distribution to your posterior, and return the -log likelihood value for each one (the lower the negative log likelihood, the better the fit). In the case of the bottleneck time, shown at left, we had so little information about the time that the uniform posterior from 12 to 299 was actually the best fit, scarcely different than the prior distribution: ~U(1,299). The population size at the time of the bottleneck also fits the same distribution family as its prior (an exponential distribution), but with an optimal rate parameter of 73 rather than 200. The gamma distribution fits slightly better, and would also be acceptable. In this case, you would replace the abstract parameter at the end of eg_bayes.par "{E:200}" with either "{E:73}" or "{G:.806,.011}".

Whether to rerun your maximally credible model with the MLE or the approximate posterior depends on the purpose of your simulation. Generaly, it is more mathematically rigorous to use the posterior distribution at this step. Using the MLE presupposes that the scenario most likely to generate your data was in fact what happened; but part of the point of doing Bayesian inference is to have some idea of the error associated with that value. The error information is lost when you use just one number. On the other hand, one of the best ways to compare models is using the AIC value, which should only be done using MLEs. As a rule of thumb, if you are running the simulation to generate probability distributions of summary statistics it is best to use the approximate posterior; if you are comparing models, it is best to use MLEs.

Calculating likelihoods

As of yet, there is no single accepted "best" method to compare models. In the R-script, there are several different options available. In each of these methods, the output from the maximally credible models determined above are used to generate a second _stat.csv file. The statistics in this file are then compared to your observed data to determine how well the model fits observation.

Two-tailed probability: This is perhaps the most intuitively obvious method, but is also problematic. It uses a frequentist paradigm, which, though not invalid, is a confusing mixture of bayesian and frequentist approaches. Here, the probability of a model is equal to two times the number of simulations whose value is equal-to-or-more-extreme-than the observed divided by the total number of simulations. In the above example, running 1000 simulations of eg_bayes.par with {E:73} as the "fitted" abstract parameter generated 302 scenarios where the nucleotide diversity in group 0 matches the observed value of 0. Therefore, using this criterion the probability of the model is pr=2*302/1000=.604, and the model "fails to be rejected". An alternative model where no bottleneck occurs (this can be simulated conveniently by using {N:3000,0} as the abstract parameter) generated only 19 runs of 0 diversity, hence pr=.038 and the model is rejected. This can be loosely thought of as "bootstrapping" the model, and as such is a non-parametric test.

Relative support: This method was proposed by Mark Beaumont, and his R-implementation is included as the calmod() function. An alternative implementation is the shorter mod.comp() function, which may be more intuitive for those following the rubric above. Here, the number of "accepted" simulations from each model is compared in an ABC framework to determine their relative validity. In the example above, the two models have 302 and 19 "accepts" respectively, which provides 94.1% support for model 1, and 5.9% support for model 2. Note that this procedure is only valid if you run the same number of simulations of both models. Because models with more free parameters are better able to fit observed data, this method may be unfairly biased towards more complex models in some cases.

Estimated Likelihood: In this model, simulations "near" the observed value are considered "correct" and included in the likelihood. In this case, "near" means within 5% of the range of the simulation results. This may be preferable to the Two-tailed method, because models that generate an extremely broad range of summary statistic values will have many "extreme" values, and therefore have an unrealistically high probability value. This method is problematic because the summary statistic's range may not be the same between two different models. To limit this difficulty, the range is first trimmed by 2.5% on both sides, so that individual extreme values will not unduly influence the results. This technique is implemented in the SSC.Like() function, which allows the user to adjust the width of the acceptance kernel. Example: SSC.Like("eg_post.csv", cols=c(6,23), obs=c(0,.66), acc=.05) returns the probability density at the observed values plus-or-minus 2.5% instead of 5%.

Akaike Information Criterion: This is the most recommended method, according to my own personal bias, because it corrects for the complexity of the model. That is, models with hundreds of free parameters that almost always generate exactly the right answer are not necessarily "better" than a simple model (with just a few parameters) that gets pretty close most of the time. The AIC formula is -2*LL + 2*p, where LL is the model's log-likelihood and p is the number of free parameters in the model. Likelihood is here calculated as the product of the probability density function height at the observed values relative to the maximum pdf height. All statistics are unit-normalized so that the large range of say segregating sites does not make this statistic less important than small-ranged nucleotide diversity. In the example above, the first model has two free parameters: the bottleneck time and size at the bottleneck. There are two other priors in the file (the growth rate and the event size), but these are not FREE parameters, because they are completely determined (ie, calcluated from) the other two. Running aic.ssc("eg_post1.csv", c(6,23), c(0,.66), params=2) on the output from the maximally credible version of this model produces probability peaks at both observed values (left side of figure), and an AIC value of 4.27. By contrast, the parameter-free model (a population of constant size Ne=3000; right side of figure below) is unlikely to have a 0 diversity in the first sample group, though still likely to have diversity of 0.66 in the second group. Though not as good a fit to the data, the fact that such a fit can be achieved with 0 parameters makes this model a reasonable alternative, with an AIC of 5.65.

A third model (not shown) that suggests the bottleneck size of 73 has actually always been the constant population size would have one free parameter (the fitted value of 73) and would almost guarenteed to have 0 diversity in the modern group (LL=0), but highly unlikely to have 0.66 diversity in the ancient group (LL=-3.87), for an AIC value of 9.74.

Trace mode and log-files

BayeSSC is written to provide the maximum flexibility to the user, making it possible to simulate tremendously complex demographic scenarios. With this freedom comes the possibility for model misspecification to occur. As a fail-safe method, BayeSSC can be run from the command line with a �-t� flag, which causes a ".log" file to be generated. This lists the population size and the number of active lineages in each deme at each point in time for each simulation. It is therefore advisable to run only a few simulations in this mode, or else extremely large .log files will result. To better visualize these results, the plot.log() routine was written in R, which sequentially displays population sizes and lineages for each simulation run in trace mode. In the example shown below, the user can confirm that (in this simulation, at least) his or her model did in fact have two populations, one constant and one that is recovering from a bottleneck (solid lines), and that the two populations were exchanging lineages (dashed lines) through migration at a fairly brisk rate relative to the time required for coalescence. In this simulation, coalescence was reached in just under 4000 generations. Ten lineages were added to both demes via ancient sampling at 500 generations before the present, though enough coalescence had already happened by that point that no population ever contained more than 16 lineages.

The function viewlog() provides a similar error-checking function in the form of a spindle-diagram. Here, the number of lineages is represented by the black spindles, the deme's population size by the colored spindles that contain them. To help orient you to the two different ways of looking at the same data, note that (due to random migration in this simulation) all of the sampled lineages happen to be in deme 0 (red) 980 years ago, and none of them are in deme 1 (cyan).

Support and FAQs

More questions and answers will be added to the list below as they come in. For now, if you have any problems with BayeSSC or Serial SimCoal, contact the programmer, Christian Anderson at Harvard University. Comments and suggestions for improvements or extentions are also more than welcome.

On a Mac, which program should I use to open SimCoal? On some versions of MacOS-X you'll get an annoying message asking you which program to use to open the program. While we appreciate the metatextual irony, we have not been able to find a direct way around this problem. You will unfortunately need to work around it by executing directly from Terminal.
1. Open "Terminal" from the applications folder.
2. Type "cd" and the directory where you have BayeSSC.
3. Type "./BayeSSC" to run the program. If you get a "Permission denied" error, type "chmod 777 BayeSSC" and try again.
Why don't I get Arlequin / PAUP files? BayeSSC typically needs to run many thousands of simulations, which would produce many thousands of these files. By default, this extra output is turned off. In order to output this information add a -a or a -p flag (or both) to the command line, like this (for Windows)
BayeSSC.exe -a -p
Why can't SimCoal open my input files? Some operating systems are unable to open files that have a space in the path name, like "C:\Documents And Settings\test.par". If you put your input files in the same directory as SimCoal, then you don't need to enter path information at all.
Why can't SimCoal find my input file? In MacOS 10.4, UNIX executables always look for files on the desktop if you don't give them a path name, no matter where the program looking for the files is located. In 10.5, they always look in the user's directory. To bypass this built-in inconvinience, open terminal, go to the directory where SimCoal is located, and run it from there. Alternatively, if you have an input file on your desktop, you can type
Input file name (*.par): Desktop/examplefile.par
(See detailed instructions for the first FAQ).
Is a LINUX version available? No, but you can compile from the source code. Unzip the code in its own directory, and then use the command g++ *.cpp -fpermissive to make the executable. (This goes for people using UNIX and old versions of MacOS as well). If you have the 2011 version of the ming compiler, which is extremely unfriendly to prior versions of the C language, you will get lots of warnings but the program itself will work fine.

Downloads

Bayesian Serial SimCoal

Windows Executable (95, 98, 2000, NT, XP, Vista, 7) (628 KB)
Mac OS 10.4 (Intel processor), Mac OS 10.5 (Intel processor) ( hold down [ctrl] to "save link")
Source Code (64 KB)

Updated: Windows executable and source code - April 18, 2011

Serial SimCoal

Windows Executable (95, 98, 2000, NT, XP) (606 KB)
Macintosh Executable (OS-X) (476 KB)
Source Code (52 KB)

Bibliography

Drummond AJ, OG Pybus, A Rambaut, R Forsberg, AG Rodrigo (2003) Measurably evolving populations. Trends in Ecology and Evolution, 18:481-488.
Hudson R (1990) Gene genealogies and the coalescent proces. In Oxford Surveys in Evolutionary Biology (Futuyma DJ and JD Antonovics, eds.), New York: Oxford University Press; p. 1-44.
Kingman JFC (1982) The coalescent. Stochastic Processes and their Applications, 13:235-248.

Funding for development of this program and website was received from NSF (grant DEB#0108541 to Liz Hadly and Joanna Mountain), as well as from the Stanford's Office of Technology and Licensing (OTL) via a Research Incentive Fund award to Joanna Mountain.