The supplemental materials for:
HapBlock - Dynamic Programming Algorithms for Haplotype Block Partitioning and Tag SNP
Selection Using Haplotype Data and Genotype Data
Programs
Our algorithms have been implemented in a C++ program. The executable files are:
- The executable file for Unix (Sun OS Release 5.6 Version).
- The executable file for Windows (Windows 95, NT, 2000, XP).
- A brief help file (PDF format) describing how to use this program.
New updates:
The HapBlock program can now handle genotype data.
Attention: This program is free for academic use; commercial use is not permitted under any circumstances.
Part of the program was produced in collaboration with Steve Qin and Jun Liu of the
Department of Statistics at Harvard University.
The source code of the program is not provided here but may be available upon request.
Copyright © 2003 The University of Southern California. All rights reserved.
Test Data Sets
The haplotype data were simulated by a coalescent process with recombination, as implemented in the lab of
Richard Hudson. The following data sets were used
in our testing and can be used to explore our program:
- A haplotype data file - TestHap.dat, which contains 80 haplotypes consisting of 234 SNPs.
- A genotype data file - TestGeno.dat, which contains 40 individuals and is generated by randomly pairing up haplotypes.
- A file for all the possible blocks and their number of tag SNPs - TestBT.dat, which is generated by our program.
- A map file for all the SNPs - TestGenoPos.dat, in which the genomic position of each SNP is listed.
- A map file for all the SNPs - TestIndexPos.dat, in which the position of each SNP is coded consecutively from 1 to the number of SNPs.
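The random pairing used to build the genotype data can be illustrated with a minimal sketch. It is not the code that produced TestGeno.dat. The function name `pair_haplotypes`, the '0'/'1' string encoding of haplotypes, and the 0/1/2 allele-count coding of genotypes are all assumptions made for illustration; the actual file formats are described in the help file.

```python
import random


def pair_haplotypes(haplotypes, seed=None):
    """Randomly pair haplotypes into unphased genotypes.

    Assumptions (hypothetical, for illustration only):
    - each haplotype is a string of '0'/'1' alleles of equal length;
    - a genotype is stored as the per-SNP count of allele '1' (0, 1 or 2).
    """
    if len(haplotypes) % 2 != 0:
        raise ValueError("an even number of haplotypes is required")
    rng = random.Random(seed)
    order = list(range(len(haplotypes)))
    rng.shuffle(order)  # random order, then pair neighbours
    genotypes = []
    for a, b in zip(order[0::2], order[1::2]):
        h1, h2 = haplotypes[a], haplotypes[b]
        genotypes.append([int(x) + int(y) for x, y in zip(h1, h2)])
    return genotypes
```

With 80 input haplotypes this yields 40 individuals, matching the sizes of the test data sets above.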
Results: Blocks, Tag SNPs and Haplotype Patterns
We tested our program on the data described above using several different parameter settings. Below we list each parameter file,
the corresponding output files, and short explanations of the parameters and results. For the contents and format
of these files and the meaning of each parameter in the parameter file, please refer to our
help file (PDF format).
Because only one definition
of a block has been implemented in the current version of the program and the same data sets are used throughout, the parameter
files share many common parameters. The following definition is used in
all parameter files: a set of one or more consecutive SNPs is defined as a block only if
the percentage of common haplotypes is more than 80%. We also set the maximum number of samples, the maximum number
of SNPs, and the maximum length of a block to 100, 250, and 100, respectively.
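The block definition can be made concrete with a small sketch. It is an illustration under stated assumptions, not the program's implementation: haplotypes are taken to be equal-length strings with no missing or ambiguous alleles, SNP ranges are 0-based and half-open, and a haplotype counts as common when it is represented more than 2 times (one of the rules used in the parameter settings listed below). The function names and the `min_count` parameter are hypothetical.

```python
from collections import Counter


def common_haplotype_fraction(haplotypes, start, end, min_count=3):
    """Fraction of samples carrying a common haplotype over SNPs [start, end).

    A local haplotype is treated as common when it appears at least
    `min_count` times, i.e. "more than 2 times" for the default of 3.
    """
    segments = Counter(h[start:end] for h in haplotypes)
    common = sum(n for n in segments.values() if n >= min_count)
    return common / len(haplotypes)


def is_block(haplotypes, start, end, threshold=0.80, min_count=3):
    """True if common haplotypes make up more than `threshold` of the sample."""
    return common_haplotype_fraction(haplotypes, start, end, min_count) > threshold
```

For example, in a sample of ten haplotypes where two patterns occur five and four times and a third occurs once, the common fraction is 0.9 and the segment qualifies as a block.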
- The dynamic programming algorithm is used to find the block partition with the minimum number
of tag SNPs on the basis of haplotype data. Haplotypes with frequency greater than 4.99% are
considered common haplotypes. The tag SNPs are a minimum set of SNPs that can distinguish
a set of common haplotypes that account for at least 80% of all unambiguous haplotypes.
The haplotype patterns are output. 100 permutations are performed and the statistics
of interest are stored in a file. The parameter settings and results are in the
following files:
- A file containing the parameter settings.
- A file containing the block and tag SNP information.
- A file containing the haplotype patterns in each block.
- A file containing the results of the permutation test.
- The dynamic programming algorithm is used to find the block partition with the minimum number
of tag SNPs on the basis of haplotype data. Haplotypes represented more than 2 times are
considered common haplotypes. The tag SNPs are a minimum set of SNPs that can distinguish
a set of common haplotypes that account for at least 80% of all unambiguous haplotypes.
The haplotype patterns are output. 100 permutations are performed and the statistics
of interest are stored in a file. The parameter settings and results are in the
following files:
- A file containing the parameter settings.
- A file containing the block and tag SNP information.
- A file containing the haplotype patterns in each block.
- A file containing the results of the permutation test.
- The dynamic programming algorithm is used to find the block partition with the minimum number
of tag SNPs on the basis of haplotype data. Haplotypes represented more than 2 times are
considered common haplotypes. The tag SNPs are a minimum set of SNPs that account for at
least 95% of the overall haplotype diversity.
The haplotype patterns are output. 100 permutations are performed and the statistics
of interest are stored in a file. The parameter settings and results are in the
following files:
- A file containing the parameter settings.
- A file containing the block and tag SNP information.
- A file containing the haplotype patterns in each block.
- A file containing the results of the permutation test.
- The dynamic programming algorithm is used to find the block partition with the minimum number
of tag SNPs on the basis of genotype data. Haplotypes represented more than 2 times are
considered common haplotypes. The tag SNPs are a minimum set of SNPs that account for at
least 95% of the overall haplotype diversity.
The haplotype patterns are output, but no permutation test is performed.
The parameter settings and results are in the following files:
- A file containing the parameter settings.
- A file containing the block and tag SNP information.
- A file containing the haplotype patterns in each block.
- The parametric dynamic programming algorithm is used to find the block partition
with the minimum number of tag SNPs that covers at least 80% of the genome, on the basis of
all the possible blocks and the corresponding numbers of tag SNPs.
The blocks and their numbers of tag SNPs are in a file,
which is extracted from another file.
The position of a SNP is defined as its index and stored in a
file.
The parameter settings and results are in the
following files:
- A file containing the parameter settings.
- A file containing the block and tag SNP information.
- The two-dimensional dynamic programming algorithm is used to find the block partition
with 40 tag SNPs that covers the maximum fraction of the genome, on the basis of
all the possible blocks and the corresponding numbers of tag SNPs.
The blocks and their numbers of tag SNPs are in a file,
which is extracted from another file.
The position of a SNP is defined as its index and stored in a
file.
The parameter settings and results are in the
following files:
- A file containing the parameter settings.
- A file containing the block and tag SNP information.
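The last two settings both start from a table of candidate blocks and their tag SNP counts. The sketch below shows one way such a table could drive a two-dimensional dynamic program. It is an illustration under assumptions, not the HapBlock code: the names `max_coverage_table`, `min_tags_for_coverage`, and `tag_counts` are hypothetical, SNP positions are taken to be their indices (as in the index map file), blocks may not overlap, and coverage is counted in SNPs. The minimum number of tag SNPs for a target coverage is read directly from the same table here, which is a simpler device than a parametric algorithm.

```python
from typing import Dict, List, Optional, Tuple

Block = Tuple[int, int]  # (first SNP, last SNP), 1-based and inclusive


def max_coverage_table(n_snps: int, tag_counts: Dict[Block, int],
                       max_tags: int) -> List[List[int]]:
    """cover[j][k] = largest number of SNPs among SNPs 1..j that can lie
    inside non-overlapping blocks using at most k tag SNPs in total."""
    # Group the candidate blocks by their last SNP.
    ending_at: Dict[int, List[Tuple[int, int]]] = {}
    for (start, end), tags in tag_counts.items():
        ending_at.setdefault(end, []).append((start, tags))

    cover = [[0] * (max_tags + 1) for _ in range(n_snps + 1)]
    for j in range(1, n_snps + 1):
        for k in range(max_tags + 1):
            best = cover[j - 1][k]  # SNP j is left outside every block
            for start, tags in ending_at.get(j, []):
                if tags <= k:  # block [start, j] fits in the tag budget
                    best = max(best,
                               cover[start - 1][k - tags] + (j - start + 1))
            cover[j][k] = best
    return cover


def min_tags_for_coverage(cover: List[List[int]],
                          fraction: float) -> Optional[int]:
    """Smallest tag budget whose best partition covers at least `fraction`
    of all SNPs, or None if no budget in the table reaches it."""
    n_snps = len(cover) - 1
    for k, covered in enumerate(cover[n_snps]):
        if covered >= fraction * n_snps:
            return k
    return None
```

Under these assumptions, `cover[n_snps][40] / n_snps` would correspond to the maximum fraction covered with 40 tag SNPs, and `min_tags_for_coverage(cover, 0.8)` to the minimum number of tag SNPs covering at least 80%.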
References
- Clayton D (2001) http://www.nature.com/ng/journal/v29/n2/extref/ng1001-233-S10.pdf
- Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J,
DeFelice M, Lochner A, Faggart M, Liu-Cordero SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES,
Daly MJ, Altshuler D (2002) The structure of haplotype blocks in the human genome.
Science 296: 2225-2229.
- Patil N, Berno AJ, Hinds DA, Barrett WA, Doshi JM, Hacker CR, Kautzer CR, Lee DH, Marjoribanks C,
McDonough DP, et al. (2001) Blocks of limited haplotype diversity revealed by high-resolution scanning of
human chromosome 21. Science 294: 1719-1723.
- Qin Z, Niu T, Liu J (2002) Partition-ligation-expectation-maximization algorithm for
haplotype inference with single-nucleotide polymorphisms. Am J Hum Genet 71: 1242-1247.
- Wang N, Akey JM, Zhang K, Chakraborty R, Jin L (2002)
Distribution of recombination crossovers and the origin of haplotype blocks:
The interplay of population history, recombination, and mutation. Am J Hum Genet 71: 1227-1234.
- Zhang K, Deng M, Chen T, Waterman MS, Sun F (2002)
A dynamic programming algorithm for haplotype partitioning.
Proc Natl Acad Sci USA 99: 7335-7339.
- Zhang K, Qin Z, Liu JS, Sun F (2003a) Haplotype block partitioning and
tag SNP selection via genotype data and assessment of their accuracy. Working paper.
- Zhang K, Sun F, Waterman MS, Chen T (2003b) Dynamic programming algorithms for
haplotype block partitioning: applications to human chromosome 21 haplotype data. In (eds)
Proceedings of the Seventh Annual International Conference on Research in Computational Molecular Biology (RECOMB 2003).
ACM, New York, pp???-???. In press.
Created Date: March 20, 2003
Last Updated Date: March 25, 2003
Contact Us:
Program in Molecular and Computational Biology
Department of Biological Sciences
University of Southern California
Los Angeles, CA, 90089
(213) 740-2413 (phone)
(213) 740-2424 (fax)
Email: fsun@hto.usc.edu