Genetic type 1 Error Calculator (GEC) (Version 0.2)
User Manual

Miao-Xin Li
Department of Psychiatry and State Key Laboratory for Cognitive and Brain Sciences; the Centre for Reproduction, Development and Growth; and Genome Research Centre, the University of Hong Kong, Pokfulam, Hong Kong

Contents
1. Introduction
2. Installation
2.1 Installation of Java Runtime Environment (JRE)
2.2 Installation of GEC
3. Input files
3.1 Input files used to obtain LD information of SNPs
3.2. Input files for multiple testing
4. Examples
4.1 Estimate effective number of independent test
4.2 Do multiple-testing in a set of p-values

1. Introduction
The Genetic Type I error calculator (GEC) is a Java-based application developed to address the multiple-testing issue with dependent single-nucleotide polymorphisms (SNPs). Its core is a new measure of the effective number of independent tests, Me [Hum Genet. 2012 May;131(5):747-56.], which is more robust than available methods [Figure 1].
Based on this new measure, several popular multiple-testing methods, including the Bonferroni, Holm and Simes corrections, were improved to evaluate the significance level of SNP p-values in genome-wide association studies. A standalone version of this tool is provided to process large datasets on users’ local computers, and an online version (GEC, http://statgenpro.psychiatry.hku.hk/gec/estimateB.php?function=Bonferroni) is available for quickly handling a SMALL dataset.

Figure 1: Box plot of MVN-derived FWERs for 5 different methods. For each method, the nominal FWER was set to 0.05. The bottom and top of each box mark the 25th and 75th percentiles, respectively, and the band in the box denotes the 50th percentile. The lines above and below each box denote the upper and lower 1.5 interquartile range (IQR). This result indicates that our new measure (The proposed) is more robust (see the detailed description of the analysis in our paper).

When may you need GEC to help you?
I want to know
- the number of independent tests, or the degree of redundancy, in a set of dependent SNPs.
- the p-value thresholds to declare significant SNPs for association in a set of dependent p-values.

2. Installation
2.1 Installation of Java Runtime Environment (JRE)
The JRE is required to run GEC on any operating system (OS). It can be downloaded for free from http://java.sun.com/javase/downloads/index.jsp.

2.2 Installation of GEC
GEC does not yet have an installation wizard. After it has been downloaded from our website and decompressed, it can be launched from a command prompt window provided by the OS with the command:

java -jar -Xms256m -Xmx1300m "./GEC.jar"

In this command, -Xms and -Xmx set the initial and maximum Java heap sizes for GEC, respectively. A larger maximum heap size can speed up the analysis. A higher setting such as -Xmx1300m is suggested when dealing with a large number of SNPs, say more than 2,000,000. The value, however, should be less than the size of physical memory.

3. Input files
3.1 Input files used to obtain LD information of SNPs
GEC can currently read LD information of SNPs from datasets in 4 different formats. Users can choose whichever format is appropriate.

3.1.1 HapMap LD dataset format
Users can download the LD data from the HapMap website (http://hapmap.ncbi.nlm.nih.gov/downloads/ld_data/latest/) and use it as input to GEC without any modification. Notably, it is not necessary to uncompress the file downloaded from HapMap, because GEC is able to recognize the “.gz” compressed format.

3.1.2 Plink Binary format
GEC can directly read genotypes in the binary format generated by Plink (http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#bed), which is a compressed format that can be stored and processed more efficiently. GEC will calculate the genotypic correlation to approximate the degree of LD between SNPs. The Plink binary file set always includes three linked files, *.fam, *.bim and *.bed, which should be put in the same folder.

3.1.3 Linkage format
Genotypes in the linkage format are also a valid input to GEC. The linkage file set includes two linked files: a Pedigree file and a Map file. A detailed description of this format is available at http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped and http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#map. The following is a simple example. GEC is also able to recognize the “.gz” compressed format of these files.

Linkage Pedigree Format (Example):
1 100 0 0 1 0 cg tg tg ca …
2 101 0 0 2 2 cc tt tt cc …
3 307 0 0 2 2 cg tg tg ca …
4 502 0 0 1 1 00 00 tt aa …
5 501 0 0 1 1 cc tt 00 cc …
6 306 0 0 1 1 gg gg gg aa …

The first five columns indicate the Pedigree ID, Individual ID, Father ID, Mother ID, and Gender, respectively. The sixth column is for the phenotype and the remaining columns are for genotypes. Genotypes must be denoted by the standard single-nucleotide symbols (a, t, g and c), which are case-insensitive. Missing genotypes are indicated by “0 0”.
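The pedigree column layout described above can be illustrated with a short Python sketch. It is not part of GEC. It assumes that the two alleles of each genotype are separate whitespace-delimited tokens, as the “0 0” missing code suggests; the names PedRecord and parse_ped_line are hypothetical and used only for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple

VALID_ALLELES = {"a", "t", "g", "c"}


class PedRecord(NamedTuple):
    """One row of a linkage pedigree file (hypothetical helper type)."""
    pedigree_id: str
    individual_id: str
    father_id: str
    mother_id: str
    gender: str
    phenotype: str
    # None marks a missing genotype ("0 0").
    genotypes: List[Optional[Tuple[str, str]]]


def parse_ped_line(line: str) -> PedRecord:
    """Split one pedigree row into its six leading columns and genotypes.

    Assumption: alleles are whitespace-separated, two tokens per SNP.
    """
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("expected at least six leading columns")
    alleles = [a.lower() for a in fields[6:]]  # symbols are case-insensitive
    if len(alleles) % 2 != 0:
        raise ValueError("alleles must come in pairs")
    genotypes: List[Optional[Tuple[str, str]]] = []
    for first, second in zip(alleles[0::2], alleles[1::2]):
        if first == "0" or second == "0":
            genotypes.append(None)  # missing genotype
        elif first in VALID_ALLELES and second in VALID_ALLELES:
            genotypes.append((first, second))
        else:
            raise ValueError(f"unexpected allele symbols: {first} {second}")
    return PedRecord(*fields[:6], genotypes)


if __name__ == "__main__":
    record = parse_ped_line("1 100 0 0 1 0 c g t g 0 0 C A")
    print(record.individual_id, record.genotypes)
```

A check of this kind can be useful before running GEC, since a row with an odd number of allele tokens or an unexpected symbol usually points to a formatting problem in the pedigree file.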
Hint: In the current version, GEC assumes all subjects are independent when calculating the genotypic correlation between SNPs, so please do not input samples containing genetically related individuals.

Linkage Map File Format (Example):
1 rs2980300 0 775852
1 rs10907175 0 1120590
1 rs2887286 0 1145994
1 rs307378 0 1258710
1 rs7540231 0 1495898

The map file describes the SNPs whose genotypes are in the pedigree file. It has four columns: chromosome, rsID, genetic map position, and physical position on the human genome assembly. The chromosome, rsID and physical position columns are required.

3.1.4 Haplotype data of 1000 Genome project in VCF files
This is a simple format of phased haplotype data that we provide at http://statgenpro.psychiatry.hku.hk/limx/kgg/phasedgty.html. You can download the ancestry-matched VCF data to account for LD in your own dataset.

3.2. Input files for multiple testing
GEC can read p-values and report significant SNPs. The output of many genetic association tools can be input to GEC without any modification. Users need to specify the column order of the chromosome, SNP ID, physical position of the SNP, and p-value.

4. Examples
4.1 Estimate effective number of independent test

1) By genotype data in conventional linkage format
I. on whole genome:
java -Xmx1g -jar gec.jar --effect-number \
  --linkage-file D:/tmp/test \
  --genome --out test1

II. in some regions:
java -Xmx1g -jar gec.jar --effect-number \
  --linkage-file D:/tmp/test \
  --regions 'chr2:0-233434,chr4:2323-54564554' \
  --coordinate-version "hg19>hg18" \
  --out test1
Hint: --coordinate-version specifies the reference genome versions of the physical positions (coordinates) in the linkage map file and in --regions, respectively. The value is quoted in these examples because most command shells treat an unquoted > as output redirection.

III. in some chromosomes:
java -Xmx1g -jar gec.jar --effect-number \
  --linkage-file D:/tmp/test \
  --regions 'chr2,chr4' \
  --out test1

2) By genotype data in plink binary format
I. on whole genome:
java -Xmx1g -jar gec.jar --effect-number \
  --plink-binary D:/tmp/test \
  --genome --out test1

II. in some regions:
java -Xmx1g -jar gec.jar --effect-number \
  --plink-binary D:/tmp/test \
  --regions 'chr2:0-233434,chr4:2323-54564554' \
  --coordinate-version "hg19>hg18" \
  --out test1
Hint: --coordinate-version specifies the reference genome versions of the physical positions (coordinates) in the plink map file and in --regions, respectively.

III. in some chromosomes:
java -Xmx1g -jar gec.jar --effect-number \
  --plink-binary D:/tmp/test \
  --regions 'chr2,chr4' \
  --out test1

3) By genotypes from 1000 Genomes Project VCF format
I. on whole genome:
java -Xmx1g -jar gec.jar --effect-number \
  --vcf-file D:/tmp/CEU/1kg.phase1.v3.shapeit2.eur.hg19.chr_CHROM_.vcf.gz \
  --genome \
  --out test1
Hint: 1. We have provided a set of compiled 1KG 1000 Genomes data at http://statgenpro.psychiatry.hku.hk/limx/kgg/phasedgty.html. 2. _CHROM_ is a variable standing for the chromosome name (1,2,…X,Y).

II. in some regions:
java -Xmx1g -jar gec.jar --effect-number \
  --vcf-file D:/tmp/CEU/1kg.phase1.v3.shapeit2.eur.hg19.chr_CHROM_.vcf.gz \
  --regions 'chr2:0-233434,chr4:2323-54564554' \
  --coordinate-version "hg19>hg18" \
  --out test1
Hint: _CHROM_ is a variable standing for the chromosome name (1,2,…X,Y). --coordinate-version specifies the reference genome versions of the physical positions (coordinates) in the vcf file and in --regions, respectively.

III. in some chromosomes:
java -Xmx1g -jar gec.jar --effect-number \
  --vcf-file D:/tmp/CEU/1kg.phase1.v3.shapeit2.eur.hg19.chr_CHROM_.vcf.gz \
  --regions 'chr2,chr4' \
  --out test1
Hint: _CHROM_ is a variable standing for the chromosome name (1,2,…X,Y).

4) By HapMap LD data
I. on whole genome:
java -Xmx1g -jar gec.jar --effect-number \
  --ld-file D:/tmp/CEU/ld_chr_CHROM__CEU.txt.gz \
  --genome \
  --out test1
Hint: _CHROM_ is a variable standing for the chromosome name (1,2,…X,Y).

II. in some regions:
java -Xmx1g -jar gec.jar --effect-number \
  --ld-file D:/tmp/CEU/ld_chr_CHROM__CEU.txt.gz \
  --regions 'chr2:0-233434,chr4:2323-54564554' \
  --coordinate-version "hg19>hg18" \
  --out test1
Hint: _CHROM_ is a variable standing for the chromosome name (1,2,…X,Y). --coordinate-version specifies the reference genome versions of the physical positions (coordinates) in the ld file and in --regions, respectively.

III. in some chromosomes:
java -Xmx1g -jar gec.jar --effect-number \
  --ld-file D:/tmp/CEU/ld_chr_CHROM__CEU.txt.gz \
  --regions 'chr2,chr4' \
  --out test1
Hint: _CHROM_ is a variable standing for the chromosome name (1,2,…X,Y).

5) By Haplotype data with the format the same as MACH
I. on whole genome:
java -Xmx1g -jar gec.jar --effect-number \
  --haplotype-file D:/tmp/CEU/EUR.chr_CHROM_.hap \
  --haplotype-map D:/tmp/CEU/EUR.chr_CHROM_.map \
  --genome \
  --out test1
Hint: _CHROM_ is a variable standing for the chromosome name (1,2,…X,Y).

II. in some regions:
java -Xmx1g -jar gec.jar --effect-number \
  --haplotype-file D:/tmp/CEU/EUR.chr_CHROM_.hap \
  --haplotype-map D:/tmp/CEU/EUR.chr_CHROM_.map \
  --regions 'chr2:0-233434,chr4:2323-54564554' \
  --coordinate-version "hg19>hg18" \
  --out test1
Hint: _CHROM_ is a variable standing for the chromosome name (1,2,…X,Y). --coordinate-version specifies the reference genome versions of the physical positions (coordinates) in the haplotype map file and in --regions, respectively.

III. in some chromosomes:
java -Xmx1g -jar gec.jar --effect-number \
  --haplotype-file D:/tmp/CEU/EUR.chr_CHROM_.hap \
  --haplotype-map D:/tmp/CEU/EUR.chr_CHROM_.map \
  --regions 'chr2,chr4' \
  --out test1
Hint: _CHROM_ is a variable standing for the chromosome name (1,2,…X,Y).
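To make the role of the _CHROM_ variable concrete, the following Python sketch lists the per-chromosome file names that a path template stands for. It is only an illustration and not GEC's own code: the function name expand_chrom_template is hypothetical, and the sketch assumes the chromosome names are 1 to 22, X and Y.

```python
from typing import List, Sequence

# Assumed human chromosome names: 1..22, X, Y.
CHROMOSOMES: Sequence[str] = tuple(str(n) for n in range(1, 23)) + ("X", "Y")
PLACEHOLDER = "_CHROM_"


def expand_chrom_template(template: str,
                          chromosomes: Sequence[str] = CHROMOSOMES) -> List[str]:
    """Return one path per chromosome by substituting the placeholder.

    A template without the placeholder is returned unchanged as a
    single-element list.
    """
    if PLACEHOLDER not in template:
        return [template]
    return [template.replace(PLACEHOLDER, chrom) for chrom in chromosomes]


if __name__ == "__main__":
    for path in expand_chrom_template("D:/tmp/CEU/ld_chr_CHROM__CEU.txt.gz")[:3]:
        print(path)
```

Listing the expanded names this way is a quick means of checking that every expected per-chromosome file is present in the folder before starting a whole-genome run.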
4.2 Do multiple-testing in a set of p-values

6) By genotype data in conventional linkage format
java -Xmx1g -jar gec.jar --effect-number [or --multiple-test] --error 0.05 \
  --linkage-file D:/tmp/test \
  --pvalue-file c:/tmp/test-results.txt \
  --chrom-column 1 \
  --marker-column 2 \
  --marker-position-column 3 \
  --coordinate-version hg19-hg18 \
  --pvalue-column 7 \
  --out test1
Hint: --multiple-test applies modified multiple testing by the Holm and Simes procedures; however, it is very slow for a large number of SNPs. --error specifies the global type I error rate. --coordinate-version specifies the reference genome versions of the physical positions (coordinates) in the linkage map file and the p-value file, respectively.

7) By genotype data in plink binary format
java -Xmx1g -jar gec.jar --effect-number [or --multiple-test] --error 0.05 \
  --plink-binary D:/tmp/test \
  --pvalue-file c:/tmp/test-results.txt \
  --chrom-column 1 \
  --marker-column 2 \
  --marker-position-column 3 \
  --coordinate-version hg19-hg18 \
  --pvalue-column 7 \
  --out test1
Hint: --multiple-test applies modified multiple testing by the Holm and Simes procedures; however, it is very slow for a large number of SNPs. --error specifies the global type I error rate. --coordinate-version specifies the reference genome versions of the physical positions (coordinates) in the plink map file and the p-value file, respectively.

8) By genotypes from 1000 Genomes Project VCF format
java -Xmx1g -jar gec.jar --effect-number [or --multiple-test] --error 0.05 \
  --vcf-file D:/tmp/CEU/1kg.phase1.v3.shapeit2.eur.hg19.chr_CHROM_.vcf.gz \
  --pvalue-file c:/tmp/test-results.txt \
  --chrom-column 1 \
  --marker-column 2 \
  --marker-position-column 3 \
  --pvalue-column 7 \
  --coordinate-version hg19-hg18 \
  --out test1
Hint: We have provided a set of compiled 1KG 1000 Genomes data at http://statgenpro.psychiatry.hku.hk/limx/kgg/phasedgty.html. _CHROM_ is a variable standing for the chromosome name (1,2,…X,Y).
--multiple-test applies modified multiple testing by the Holm and Simes procedures; however, it is very slow for a large number of SNPs. --error specifies the global type I error rate. --coordinate-version specifies the reference genome versions of the physical positions (coordinates) in the vcf file and the p-value file, respectively. If the coordinate versions are identical, you can give the same version ID twice, say, hg19-hg19.

9) By HapMap LD data
java -Xmx1g -jar gec.jar --effect-number [or --multiple-test] --error 0.05 \
  --ld-file D:/tmp/CEU/ld_chr_CHROM__CEU.txt.gz \
  --pvalue-file c:/tmp/test-results.txt \
  --chrom-column 1 \
  --marker-column 2 \
  --marker-position-column 3 \
  --pvalue-column 7 \
  --coordinate-version hg19-hg18 \
  --out test1
Hint: The reference genome version of the LD file is assumed to be hg18. _CHROM_ is a variable standing for the chromosome name (1,2,…X,Y). --multiple-test applies modified multiple testing by the Holm and Simes procedures; however, it is very slow for a large number of SNPs. --error specifies the global type I error rate. --coordinate-version specifies the reference genome versions of the physical positions (coordinates) in the ld file and the p-value file, respectively.

10) By Haplotype data with the format the same as MACH
java -Xmx1g -jar gec.jar --effect-number [or --multiple-test] --error 0.05 \
  --haplotype-file D:/tmp/CEU/EUR.chr_CHROM_.hap \
  --haplotype-map D:/tmp/CEU/EUR.chr_CHROM_.map \
  --pvalue-file c:/tmp/test-results.txt \
  --chrom-column 1 \
  --marker-column 2 \
  --marker-position-column 3 \
  --pvalue-column 7 \
  --coordinate-version hg19-hg18 \
  --out test1
Hint: _CHROM_ is a variable standing for the chromosome name (1,2,…X,Y). --multiple-test applies modified multiple testing by the Holm and Simes procedures; however, it is very slow for a large number of SNPs. --error specifies the global type I error rate.
--coordinate-version specifies the reference genome versions of the physical positions (coordinates) in the haplotype map file and the p-value file, respectively.
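As a rough illustration of how the column options and --error fit together, the Python sketch below reads a whitespace-delimited p-value file and keeps the SNPs that pass a Bonferroni-style threshold, error / Me. It is not GEC's implementation: it assumes 1-based column numbers, takes Me as a value supplied by the user, and uses a made-up function name (significant_snps) and made-up p-values. The modified Holm and Simes procedures are not reproduced here.

```python
from typing import Iterable, List, Tuple

Hit = Tuple[str, str, str, float]  # chromosome, SNP ID, position, p-value


def significant_snps(lines: Iterable[str],
                     chrom_col: int,
                     marker_col: int,
                     pos_col: int,
                     pvalue_col: int,
                     effective_tests: float,
                     error: float = 0.05) -> Tuple[float, List[Hit]]:
    """Return the threshold error / Me and the rows whose p-value passes it.

    Column numbers are assumed to be 1-based. Rows whose p-value field is
    not numeric (for example a header row) or that are too short are skipped.
    """
    if effective_tests <= 0:
        raise ValueError("effective_tests must be positive")
    threshold = error / effective_tests
    needed = max(chrom_col, marker_col, pos_col, pvalue_col)
    hits: List[Hit] = []
    for line in lines:
        fields = line.split()
        if len(fields) < needed:
            continue
        try:
            p_value = float(fields[pvalue_col - 1])
        except ValueError:
            continue  # header or non-numeric p-value
        if p_value <= threshold:
            hits.append((fields[chrom_col - 1], fields[marker_col - 1],
                         fields[pos_col - 1], p_value))
    return threshold, hits


if __name__ == "__main__":
    # Hypothetical association results; the p-values are invented.
    rows = [
        "CHR SNP BP A1 TEST NMISS P",
        "1 rs2980300 775852 A ADD 100 0.00001",
        "1 rs10907175 1120590 G ADD 100 0.2",
    ]
    threshold, hits = significant_snps(rows, 1, 2, 3, 7, effective_tests=1000)
    print(threshold, hits)
```

With the illustrative value Me = 1000 and --error 0.05, the threshold is 0.05 / 1000, so only the first SNP in the made-up data is reported.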