The Merck Computational Biology and Chemistry Program |
Distinguished Seminar Abstract |
The next high-priority phase of human genomics will involve the development of a full Haplotype Map of the human genome. It will be used in large-scale screens of populations to associate specific haplotypes with specific complex genetic-influenced diseases. However, most studies will collect genotype data rather than haplotype data, requiring the deduction of haplotype information from genotype data. Input to the haplotyping problem is a set of N genotypes, and output is N haplotype pairs that "explain" the genotypes. Several computational approaches to this problem have been developed and used. Most of these use a statistical framework, such as MLE, or statistical computing methods such as EM, Gibbs sampling etc. We have developed several distinct combinatorial approaches to the haplotyping problem. I will talk mainly about the most recent of these approaches, based on viewing the haplotyping problem in the context of perfect phylogeny. The biological motivation for this is the surprising fact that some genomic DNA can be partitioned into long blocks where recombination has been rare, leading to strikingly fewer distinct haplotypes in the population than previously expected. This, along with assumption of infinite sites implies that a "permitted" solution to the haplotyping problem should fit (or almost fit) a perfect phylogeny. This is a severe combinatorial constraint on the permitted solutions to the haplotyping problem, and it leads to efficient deterministic algorithms (I will talk about two of them) to deduce all features of the permitted haplotype solution(s) that can be known with certainty. We obtain: a) an efficient algorithm to find (if one exists) one permitted solution to the haplotype problem; b) a simple test to determine if it is the unique permitted solution; c) an efficient way to count the number of permitted solutions; d) and an efficient way to implicitly represent the set of all permitted solutions so that each can be efficiently created. As a by-product, we prove that the number of permitted solutions is bounded by 2k, where k is less than or equal to the number of sites. This is a huge reduction from the number of solutions if we do not require that a solution fit a perfect phylogeny. This implicit representation also allows simple solutions to secondary questions about the genotypes. We also observe a strong phase-transition in the probability that a given number of genotypes will be sufficient to uniquely determine the tree. Finally, I will discuss recent results obtained trying to incorporate constrained recombination into the underlying perfect phylogeny model. Parts of this work are joint with different collaborators, including R.H. Chung, V. Bafna, G. Lancia, C. Langley, and S. Yooseph.