|
George Weinberg
Coalescent theory appears to have less to do with genetics than I would have guessed, and more to do with underwear.
Email | Homepage | 12.22.07 - 1:09 pm | #
|
p-ter
:)
I think that used to be this site:
http://www.roberts-publishers.co...rs.com/wakeley/
Email | Homepage | 12.22.07 - 3:52 pm | #
|
John Hawks
The coalescent works very well applied to single locus simulations, but runs into excessive complexity very quickly when considering recombination over long ranges.
Every recombination event in a sample does two things: it increases by one the number of "independent" genealogies that must be accounted for, and it increases the simulated sample by one. If recombination is very rare (say, over a kilobase), this isn't a problem. But over a megabase, a sample of 100 sequences will have one recombinant or so per generation.
Given this problem, the assumptions required by the coalescent may be less appealing than direct simulation of relevant demographic models.
It seems quite possible to me that most genetics students won't learn the coalescent five years from now.
Email | Homepage | 12.23.07 - 4:47 pm | #
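The arithmetic behind "one recombinant or so per generation" can be checked with a short sketch. It assumes a recombination rate of about 1e-8 per base pair per generation, the figure quoted elsewhere in this thread; the function name is hypothetical and the calculation is a simple expectation, not a simulation.

```python
def expected_recombinants_per_generation(n_sequences, region_bp, rate_per_bp=1e-8):
    """Expected number of recombination events per generation among the sampled sequences.

    rate_per_bp is an assumed per-base-pair, per-generation recombination rate.
    """
    return n_sequences * region_bp * rate_per_bp


# A sample of 100 sequences over one kilobase versus one megabase.
per_kilobase = expected_recombinants_per_generation(100, 1_000)
per_megabase = expected_recombinants_per_generation(100, 1_000_000)
print(f"1 kb: {per_kilobase:g} events per generation")
print(f"1 Mb: {per_megabase:g} events per generation")
```

Under that assumed rate, the kilobase case gives roughly one event per thousand generations, while the megabase case gives roughly one per generation, which is the contrast drawn in the comment above.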
|
p-ter
The coalescent works very well applied to single locus simulations, but runs into excessive complexity very quickly when considering recombination over long ranges.
http://www.sph.umich.edu/csg/lia...g/liang/genome/
:)
Email | Homepage | 12.23.07 - 6:53 pm | #
|
p-ter
for those who don't want to click through, here's a summary:
"GENOME proposes a rapid coalescent-based approach to simulate whole genome data. In addition to features of standard coalescent simulators, the program allows for recombination rates to vary along the genome and for flexible population histories. Within small regions, we have evaluated samples simulated by GENOME to verify that GENOME provides the expected LD patterns and frequency spectra. The program can be used to study the sampling properties of any statistic for a whole genome study."
I'm not wedded to coalescent simulations (forward-time approaches have advantages in certain situations), but they will remain an essential tool for population geneticists for years to come. and contra john hawks, coalescent theory is absolutely worth learning for anyone interested in analysing molecular data.
Email | Homepage | 12.23.07 - 7:10 pm | #
|
John Hawks
p-ter, did you even read that link? The only thing that's "coalescent" about GENOME is that it's implemented in backward-time and throws out non-ancestral sequences. Otherwise it's functionally identical to a forward-time simulation! In particular, it computes all its likelihoods sequence-wise one generation at a time.
That's precisely what I mean when I say students won't learn the coalescent approach in five years. If they're learning something like GENOME, they're not learning coalescent theory. If "coalescent" is renamed to mean something like GENOME, then its meaning has changed beyond recognition -- which means those students won't recognize the Kingman approach as being "coalescent"!
Email | Homepage | 12.24.07 - 10:47 am | #
|
p-ter
john, it's obviously coalescent-based, and is incomprehensible without understanding the "vanilla" coalescent. but in any case, my main point was that your statement that the coalescent has problems with sequences on the order of 1 Mb is not true. the above coalescent-based approach simulates whole genomes, and even if you don't want to call that a "true" coalescent, from that paper:
"Overall, the standard coalescent approach which is suitable for short genomic segments (less than 2–3 Mb) becomes very slow for larger regions (greater than 100 Mb)."
seriously, run ms (or whatever your favorite coalescent simulator is) with a recombination rate appropriate for 1 Mb -- it doesn't take very long (on my machine, less than a minute: the command I tried is `ms 100 1 -t 400 -r 400 1000000`. this assumes Ne = 10000 and the recombination and mutation rates/base pair are both ~1e-8). again, the standard coalescent will remain a useful tool for years to come. other approaches (including ones that modify the classic coalescent) will of course also be useful. I'm not sure if this is a point of disagreement between us or just a difference in emphasis.
Email | Homepage | 12.24.07 - 12:29 pm | #
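The `-t 400 -r 400` values in the ms command above follow from the stated assumptions (Ne = 10000, mutation and recombination rates of about 1e-8 per base pair, a 1 Mb region). Here is a minimal sketch of that conversion; the helper names are hypothetical, and the recombination parameter is approximated as 4 * Ne * r * L rather than using L - 1, which makes no practical difference at this length.

```python
def ms_scaled_rates(ne, mu_per_bp, r_per_bp, n_sites):
    """Population-scaled mutation and recombination parameters for the whole locus."""
    theta = 4 * ne * mu_per_bp * n_sites  # passed to ms as -t
    rho = 4 * ne * r_per_bp * n_sites     # passed to ms as -r (approximation, see lead-in)
    return theta, rho


def ms_command(n_samples, n_reps, ne, mu_per_bp, r_per_bp, n_sites):
    """Build the ms command line as a string; this does not run ms."""
    theta, rho = ms_scaled_rates(ne, mu_per_bp, r_per_bp, n_sites)
    return f"ms {n_samples} {n_reps} -t {theta:g} -r {rho:g} {n_sites}"


# Values taken from the comment above.
print(ms_command(100, 1, ne=10_000, mu_per_bp=1e-8, r_per_bp=1e-8, n_sites=1_000_000))
# prints: ms 100 1 -t 400 -r 400 1000000
```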
|
John Hawks
Well, I think you may be right that we are really disagreeing about emphasis rather than substance.
Still, the mathematical simplicity of the coalescent is lost once the genealogies start bifurcating through recombination, and once you go to a generation-by-generation simulation model there is no sense in which the model has the clear algebraic relationships that made the original such an accessible way into the theory. I think it's misleading to call it "coalescent" -- heck, there's not even any theoretical necessity that the lineages will coalesce! It's certainly a kludge, which is really useless as a teaching tool.
Now, nothing against kludges, but if you're forced to teach students a kludge, they will be better off if you teach them a theoretically simple kludge. You may lose a few computer cycles using the forward-time model, but not very many -- the parameters you plugged into ms above can be done in less than two minutes on my system with the forward-time model, and extensible to arbitrarily large sample sizes with no time penalty.
Brain cycles are a lot more valuable than computer cycles, and better to fill those with something really useful like diffusion theory.
Email | Homepage | 12.24.07 - 10:58 pm | #
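For readers unfamiliar with the forward-time model discussed above, a bare-bones neutral Wright-Fisher sketch may help. It is an illustration only, not the simulator described in the comment: it has no mutation, recombination, or selection, and all names are hypothetical. Each gene copy is labelled by its founding ancestor, and every generation each copy picks a parent uniformly at random.

```python
import random


def forward_wright_fisher(n_copies, generations, seed=None):
    """Neutral forward-time Wright-Fisher model with a constant number of gene copies.

    Returns the final population as a list of founding-ancestor labels.
    """
    rng = random.Random(seed)
    population = list(range(n_copies))  # copy i starts as its own ancestor
    for _ in range(generations):
        # Every copy in the next generation inherits from a random parent copy.
        population = [population[rng.randrange(n_copies)] for _ in range(n_copies)]
    return population


def draw_sample(population, sample_size, seed=None):
    """Sample gene copies without replacement from the final population."""
    rng = random.Random(seed)
    return rng.sample(population, sample_size)


final_population = forward_wright_fisher(n_copies=200, generations=500, seed=1)
sample = draw_sample(final_population, sample_size=20, seed=2)
print("distinct founding ancestors in population:", len(set(final_population)))
print("distinct founding ancestors in sample:", len(set(sample)))
```

The generation loop touches every copy in the population regardless of how many are sampled at the end, so the cost grows with population size and number of generations but not with sample size.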
|
Rich Lawler
Hi John,
That's an interesting take on the coalescent. I agree, it performs poorly when taking into account all the factors you mentioned (recombination, multiple loci), but it certainly seems a strong statement to say "coalescent theory" is a kludge (if I am reading you correctly), particularly when you compare it to diffusion theory.
The math for diffusion theory is much more complex than in the coalescent process, particularly because of its sampling properties, in which we just seek to determine the statistical properties of alleles in a sample, rather than model an entire population and then use some math to derive the behavior of combinations of evolutionary processes as if we had drawn a sample from it (like diffusion theory does).
The fact that coalescent approaches can be used to derive various measures of inbreeding effective population size (Nei) is certainly of some utility in seeing how we can connect allelic variation to ecological processes that cause this variation to be lost (e.g., variation in family size, different numbers of males/females, etc.).
I suspect that coalescent theory will continue to be taught in classes, not because it is the most useful/versatile way to model stochastic processes in evolution, but because it forces us to see connections between individual (nonrecombining) loci and their evolutionary fate. Diffusion theory focuses on population-level parameters (usually variance effective population size in combination with mutation, selection, etc.) and thus doesn't keep track of each copy of the locus in a manner that connects demographic processes to evolutionary ones. I think there's some value to that...
Email | Homepage | 12.25.07 - 7:26 am | #
|
Rich Lawler
"The math for diffusion theory is much more complex than in the coalescent process, particularly because of its sampling properties, in which we just seek to determine the statistical properties of alleles in a sample, rather than model an entire population and then use some math to derive the behavior of combinations of evolutionary processes as if we had drawn a sample from it (like diffusion theory does)."
That is one awkwardly written paragraph... I hope it is clear above that I meant coalescent theory involves less complex math than diffusion theory.
Email | Homepage | 12.25.07 - 8:33 am | #
|
p-ter
the parameters you plugged into ms above can be done in less than two minutes on my system with the forward-time model, and extensible to arbitrarily large sample sizes with no time penalty
seriously? how is there no scaling with population size? and how fast does the sample reach equilibrium? I believe you can simulate allele frequencies that fast, but i'm skeptical you can output megabases of sequence--if this is published, I'll switch right over...
Email | Homepage | 12.25.07 - 9:21 am | #
|
p-ter
it performs poorly when taking into account all the factors you mentioned (recombination, multiple loci)
I'm not sure what you mean by poorly--the algorithm runs faster than anything else out there, and the math isn't *that* bad. what metric are we talking here?
Email | Homepage | 12.25.07 - 10:07 am | #
|
John Hawks
how is there no scaling with population size?
Sure, it scales with population size, but there's no penalty for larger sample sizes -- the forward-time case has to keep track of the entire population all the time anyway. I think this may be useful, since the next generation of samples will be tens of thousands of individuals.
Anyway, there's nothing amazing about the speed -- what you're perceiving is how slow the coalescent approach must be for these kinds of samples. Over a megabase, you're adding an average of one new genealogy to track every generation backward in time, because of recombination. Backward in time 10,000 generations, that's a lot of genealogies!
it certainly seems a strong statement to say "coalescent theory" is a kludge
Coalescent theory is one of the most beautiful formulations in genetics. It has two things that combine to make it useful for simulations -- you can ignore the genealogies of everyone who wasn't sampled, and you don't use discrete generations, all times being approximations in terms of N. My dissertation was all coalescent-based!
The problem is that its beauty mainly applies to the infinite sites model with no recombination. Some recent formulations lose the second advantage entirely, by counting time in discrete generations. The first loses much of its force when recombination is high enough, because every small segment of a sequence has a different genealogy. And simulating genealogies with selection in backward-time has always been sort of ugly.
Please don't get the wrong idea: I'm no purist, but I'm playing one to keep the conversation going.
For the relevant human demographic models, I'm using a hybrid approach with backward-time in the most recent (large population size) epochs and forward-time in earlier (small population size) epochs. It just doesn't make enough difference to me, burning a few more computer cycles, when I get the advantage of more flexibility.
There are also algorithmic complexity and debugging to consider, which always take a lot more time than running the simulations. Both the plain-vanilla coalescent and the forward-time equivalent can be done in less than 30 lines of code. Neither is very complicated. Population structure, recombination, or selection all add complexity, so today's versions are pretty weighty. The main advantage of the coalescent today is the installed base of people used to writing in backward-time. But that community of people is relatively small compared to those who might be in a position to write such simulations in the future, so who knows?
Email | Homepage | 12.26.07 - 8:50 pm | #
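To illustrate how compact the plain-vanilla coalescent can be, here is a minimal sketch of the Kingman waiting times for a single nonrecombining locus. It is not code from any of the commenters: the function names are hypothetical, tree topology and mutations are left out, and time is measured in units of 2N generations, where 2N is the assumed number of gene copies in the population.

```python
import random


def kingman_coalescent_times(n_samples, seed=None):
    """Waiting times while k = n, n-1, ..., 2 lineages remain (units of 2N generations).

    With k lineages, the time to the next coalescence is exponential
    with rate k * (k - 1) / 2.
    """
    rng = random.Random(seed)
    times = []
    for k in range(n_samples, 1, -1):
        rate = k * (k - 1) / 2
        times.append(rng.expovariate(rate))
    return times


def expected_tmrca(n_samples):
    """Expected time to the most recent common ancestor, in units of 2N generations."""
    return 2 * (1 - 1 / n_samples)


# Compare the simulated mean TMRCA with the analytical expectation.
replicates = 20_000
rng_seed = 42
total = 0.0
for i in range(replicates):
    total += sum(kingman_coalescent_times(10, seed=rng_seed + i))
print("simulated mean TMRCA:", total / replicates)
print("expected TMRCA:", expected_tmrca(10))
```

Only the sampled lineages are tracked, and time is continuous rather than counted in discrete generations, which are the two properties the comment above credits for the coalescent's usefulness in simulation.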
|
p-ter
Please don't get the wrong idea: I'm no purist, but I'm playing one to keep the conversation going.
ah, ok, this clears things up a bit in my head :)
yes, forward-time simulations are useful in some situations, and will probably keep getting more useful as processors get faster. it's kind of funny--I'm also currently using a hybrid forward-time and coalescent approach, though a somewhat different one than yours.
but in many situations, the coalescent is still the method of choice, and will probably remain so for quite a while to come--it's just as flexible as a forward-time scheme for neutral simulations, if not more so, and still has a speed advantage.
Email | Homepage | 12.27.07 - 9:33 am | #
|