
Genetic Type I Error Calculator (GEC)

(Version 0.2)

User Manual

Miao-Xin Li

Department of Psychiatry and State Key Laboratory for Cognitive and Brain Sciences;

the Centre for Reproduction, Development and Growth; and Genome Research Centre,

the University of Hong Kong, Pokfulam, Hong Kong

Contents

1. Introduction
2. Installation
2.1 Installation of Java Runtime Environment (JRE)
2.2 Installation of GEC
3. Input files
3.1 Input files used to obtain LD information of SNPs
3.2. Input files for multiple testing
4. Examples
4.1 Estimate effective number of independent tests
4.2 Do multiple-testing in a set of p-values

1. Introduction

The Genetic Type I Error Calculator (GEC) is a Java-based application developed to address the multiple-testing issue with dependent single-nucleotide polymorphisms (SNPs). Its core is a new measure of the effective number of independent tests, Me [Hum Genet. 2012 May;131(5):747-56.], which is more robust than other available methods [Figure 1]. Based on this new measure, several popular multiple-testing methods, including the Bonferroni, Holm and Simes corrections, were improved to evaluate the significance level of SNP p-values in genome-wide association studies. A standalone version of this tool is provided to process large datasets on users' local computers, and an online version (GEC, http://statgenpro.psychiatry.hku.hk/gec/estimateB.php?function=Bonferroni) is available for handling a SMALL dataset quickly and conveniently.

Figure 1: Box plot of MVN-derived FWERs for 5 different methods.

For each method, the nominal FWER was set to 0.05. The bottom and top of each box mark the 25th and 75th percentiles, respectively, and the band in the box denotes the 50th percentile. The lines above and below each box denote the upper and lower 1.5 interquartile range (IQR). This result indicates that our new measure (labelled "The proposed") is more robust (see our paper for a detailed description of the analysis).

When may you need GEC to help you?

I want to know

- the number of independent tests, or the degree of redundancy, in a set of dependent SNPs.
- the p-value thresholds for declaring SNPs significant for association in a set of dependent p-values.

2. Installation

2.1 Installation of Java Runtime Environment (JRE)

The JRE is required to run GEC on any operating system (OS). It can be downloaded for free from http://java.sun.com/javase/downloads/index.jsp.

2.2 Installation of GEC

GEC does not yet have an installation wizard. After it has been downloaded from our website and decompressed, it can be launched with the command java -Xms256m -Xmx1300m -jar "./GEC.jar" <arguments> in a command prompt window provided by the OS. In the command, -Xms<size> and -Xmx<size> set the initial and maximum Java heap sizes for GEC, respectively. A larger maximum heap size can speed up the analysis. A higher setting such as -Xmx1300m is suggested when dealing with a large number of SNPs, say more than 2,000,000. The maximum heap size, however, should be less than the size of the physical memory.
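The launch command can also be assembled from a script, which keeps the heap settings in one place. The following is only a minimal sketch: the helper name build_gec_command and its parameters are illustrative and not part of GEC, and the GEC arguments shown are copied from the first example in Section 4.1.

```python
def build_gec_command(jar_path, gec_args, initial_heap_mb=256, max_heap_mb=1300):
    """Return the argument list for launching GEC with explicit Java heap sizes.

    Hypothetical helper for illustration; GEC itself only needs the
    resulting 'java ... -jar GEC.jar <arguments>' command line.
    """
    if initial_heap_mb > max_heap_mb:
        raise ValueError("initial heap size must not exceed maximum heap size")
    return [
        "java",
        f"-Xms{initial_heap_mb}m",  # initial Java heap size
        f"-Xmx{max_heap_mb}m",      # maximum Java heap size
        "-jar", jar_path,
        *gec_args,
    ]


command = build_gec_command(
    "./GEC.jar",
    ["--effect-number", "--linkage-file", "D:/tmp/test", "--genome", "--out", "test1"],
)
print(" ".join(command))
```

The returned list can be passed to subprocess.run(command, check=True) once Java and GEC.jar are available on the machine.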

3. Input files

3.1 Input files used to obtain LD information of SNPs

GEC can currently recognize datasets containing LD information of SNPs in 4 different formats. Users can choose whichever format is appropriate.

3.1.1 HapMap LD dataset format.

Users can download the LD data from the HapMap website (http://hapmap.ncbi.nlm.nih.gov/downloads/ld_data/latest/) and use it as input to GEC without any modification. Notably, it is not necessary to uncompress the file downloaded from HapMap because GEC is able to recognize the ".gz" compressed format.

3.1.2 Plink Binary format

GEC can directly read genotypes in the binary format generated by Plink (http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#bed), which is a compressed format that can be stored and processed more efficiently. GEC will calculate the genotypic correlation to approximate the degree of LD between SNPs. The Plink binary file set always includes three linked files, *.fam, *.bim and *.bed, which should be put in the same folder.
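Before starting a long run, it can be worth checking that all three files of the set are present. The sketch below assumes the three files share one path prefix (for example D:/tmp/test, as passed to --plink-binary in Section 4); the function name missing_plink_files is illustrative and not part of GEC.

```python
from pathlib import Path


def missing_plink_files(prefix):
    """Return the paths of the Plink binary files missing for a given prefix."""
    expected = [Path(prefix + extension) for extension in (".bed", ".bim", ".fam")]
    return [str(path) for path in expected if not path.is_file()]


# An empty list means the *.bed, *.bim and *.fam files were all found.
print(missing_plink_files("D:/tmp/test"))
```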

3.1.3 Linkage format

Genotypes in the linkage format are also a valid input to GEC. The linkage file set includes two linked files: a pedigree file and a map file. A detailed description of this format can be found at http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped and http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#map. The following is a simple example. GEC is also able to recognize the ".gz" compressed format of these files.

Linkage Pedigree Format (Example):

100  c g  t g  c a  …
101  c c  t t  c c  …
307  c g  t g  c a  …
502  0 0  t t  a a  …
501  c c  t t  0 0  c c  …
306  g g  a a  …

The first five columns give the Pedigree ID, Individual ID, Father ID, Mother ID and Gender, respectively. The sixth column is the phenotype and the remaining columns are genotypes. Genotypes must be denoted by the standard single-nucleotide symbols (a, t, g and c), which are case-insensitive. Missing genotypes are indicated by "0 0".

Hint: In the current version, GEC assumes that all subjects are independent when calculating the genotypic correlation between SNPs, so please do not input samples containing genetically related individuals.
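The column layout described above can be checked with a few lines of code. This is a minimal sketch, assuming whitespace-separated fields; the function name parse_pedigree_line and the sample line are made up for illustration and are not taken from GEC or from a real dataset.

```python
VALID_ALLELES = {"a", "t", "g", "c"}


def parse_pedigree_line(line):
    """Split one linkage pedigree line into its six ID columns and genotype pairs."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 leading columns followed by allele pairs")
    keys = ("pedigree_id", "individual_id", "father_id", "mother_id", "gender", "phenotype")
    record = dict(zip(keys, fields[:6]))
    alleles = [allele.lower() for allele in fields[6:]]  # symbols are case-insensitive
    genotypes = []
    for first, second in zip(alleles[0::2], alleles[1::2]):
        if first == "0" and second == "0":
            genotypes.append(None)  # "0 0" marks a missing genotype
        elif first in VALID_ALLELES and second in VALID_ALLELES:
            genotypes.append((first, second))
        else:
            raise ValueError(f"unexpected genotype: {first} {second}")
    record["genotypes"] = genotypes
    return record


# Hypothetical line: pedigree 1, individual 100, no parents, gender 1, phenotype 2.
example = parse_pedigree_line("1 100 0 0 1 2 c g T g 0 0")
print(example["genotypes"])
```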

Linkage Map File Format (Example):

rs2980300   775852
rs10907175  1120590
rs2887286   1145994
rs307378    1258710
rs7540231   1495898

The map file describes the SNPs whose genotypes are given in the pedigree file. Each line has four attributes: chromosome, rsID, genetic map position, and physical position on the human genome assembly. The values for chromosome, rsID and physical position must be provided.
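A map line with these four attributes could be read as sketched below. The sketch assumes whitespace-separated columns in the order listed above; the function name parse_map_line is illustrative, and in the sample line only the rsID and the position come from the example in this manual, while the chromosome and genetic map values are placeholders.

```python
def parse_map_line(line):
    """Parse one linkage map line: chromosome, rsID, genetic map, physical position."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError("expected 4 columns: chromosome, rsID, genetic map, position")
    chromosome, rs_id, genetic_map, position = fields
    return {
        "chromosome": chromosome,
        "rs_id": rs_id,
        "genetic_map": float(genetic_map),
        "position": int(position),  # physical position on the genome assembly
    }


# Chromosome "1" and genetic map "0" are assumed values for illustration.
snp = parse_map_line("1 rs2980300 0 775852")
print(snp)
```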

3.1.4 Haplotype data of the 1000 Genomes Project in VCF files

This is a simple format of phased haplotype data, which we provide at http://statgenpro.psychiatry.hku.hk/limx/kgg/phasedgty.html . You can download the ancestry-matched VCF data to account for LD in your own dataset.

3.2. Input files for multiple testing

GEC can read p-values and report significant SNPs. The output of many genetic association tools can be used as input to GEC without any modification. Users need to specify which columns contain the chromosome, the SNP ID, the physical position of the SNP and the p-value.
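To illustrate what specifying the columns means, the sketch below picks the four fields out of each row by column number. It assumes that columns are counted from 1 (as the values 1, 2, 3 and 7 in the Section 4.2 examples suggest), that fields are whitespace-separated and that the file has a header row; the function name read_pvalues and the sample rows are made up for illustration.

```python
def read_pvalues(lines, chrom_column, marker_column, position_column,
                 pvalue_column, has_header=True):
    """Extract (chromosome, SNP ID, position, p-value) tuples using 1-based columns."""
    rows = iter(lines)
    if has_header:
        next(rows, None)  # skip the header row
    records = []
    for line in rows:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        records.append((
            fields[chrom_column - 1],
            fields[marker_column - 1],
            int(fields[position_column - 1]),
            float(fields[pvalue_column - 1]),
        ))
    return records


# Hypothetical association output with the p-value in column 7.
sample = [
    "CHR SNP BP A1 TEST NMISS P",
    "1 rs2980300 775852 A ADD 500 0.0312",
    "1 rs10907175 1120590 C ADD 498 4.1e-09",
]
records = read_pvalues(sample, chrom_column=1, marker_column=2,
                       position_column=3, pvalue_column=7)
print(records)
```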

4. Examples

4.1 Estimate effective number of independent tests

1) By genotype data in conventional linkage format

I. on whole genome:

java -Xmx1g -jar gec.jar --effect-number \

--linkage-file D:/tmp/test \

--genome --out test1

II. in some regions:

java -Xmx1g -jar gec.jar --effect-number \

--linkage-file D:/tmp/test \

--regions 'chr2:0-233434,chr4:2323-54564554' \
--coordinate-version hg19-hg18 \
--out test1

Hint: --coordinate-version specifies the reference genome versions of the physical positions (coordinates) in the linkage map file and in --regions, respectively.

III. in some chromosomes:

java -Xmx1g -jar gec.jar --effect-number \

--linkage-file D:/tmp/test \

--regions 'chr2,chr4' \

--out test1
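The two forms of the --regions value used above (a chromosome with a coordinate range, or a whole chromosome) can be told apart as in the following sketch. It only mirrors the syntax visible in these examples; the function name parse_regions is illustrative and does not describe how GEC parses the option internally.

```python
def parse_regions(spec):
    """Parse a --regions value into (chromosome, start, end) tuples.

    Whole-chromosome entries such as 'chr2' get start and end set to None.
    """
    regions = []
    for item in spec.split(","):
        chromosome, _, span = item.strip().partition(":")
        if not chromosome.startswith("chr"):
            raise ValueError(f"unexpected chromosome name: {chromosome!r}")
        if span:
            start_text, _, end_text = span.partition("-")
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"start is larger than end in {item!r}")
            regions.append((chromosome, start, end))
        else:
            regions.append((chromosome, None, None))
    return regions


print(parse_regions("chr2:0-233434,chr4:2323-54564554"))
print(parse_regions("chr2,chr4"))
```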

2) By genotype data in plink binary format

I. on whole genome:

java -Xmx1g -jar gec.jar --effect-number \

--plink-binary D:/tmp/test \

--genome --out test1

II. in some regions:

java -Xmx1g -jar gec.jar --effect-number \

--plink-binary D:/tmp/test \

--regions 'chr2:0-233434,chr4:2323-54564554' \
--coordinate-version hg19-hg18 \
--out test1

Hint: --coordinate-version specifies the reference genome versions of the physical positions (coordinates) in the plink map file and in --regions, respectively.

III. in some chromosomes:

java -Xmx1g -jar gec.jar --effect-number \

--plink-binary D:/tmp/test \

--regions 'chr2,chr4' \

--out test1

3) By genotypes from 1000 Genomes Project VCF format

I. on whole genome:

java -Xmx1g -jar gec.jar --effect-number \
--vcf-file D:/tmp/CEU/1kg.phase1.v3.shapeit2.eur.hg19.chr_CHROM_.vcf.gz \
--genome \
--out test1

Hint: 1. We have provided a set of compiled 1000 Genomes (1KG) data at http://statgenpro.psychiatry.hku.hk/limx/kgg/phasedgty.html.
2. _CHROM_ is a variable standing for the chromosome name (1,2,…X,Y).

II. in some regions:

java -Xmx1g -jar gec.jar --effect-number \

--vcf-file D:/tmp/CEU/1kg.phase1.v3.shapeit2.eur.hg19.chr_CHROM_.vcf.gz \
--regions 'chr2:0-233434,chr4:2323-54564554' \
--coordinate-version hg19-hg18 \
--out test1

Hint: _CHROM_ is a variable standing for the chromosome name (1,2,…X,Y).
--coordinate-version specifies the reference genome versions of the physical positions (coordinates) in the VCF file and in --regions, respectively.

III. in some chromosomes:

java -Xmx1g -jar gec.jar --effect-number \

--vcf-file D:/tmp/CEU/1kg.phase1.v3.shapeit2.eur.hg19.chr_CHROM_.vcf.gz \
--regions 'chr2,chr4' \

--out test1

Hint: _CHROM_ is a variable standing for the chromosome name

(1,2,…X,Y).
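To see which per-chromosome files a pattern containing _CHROM_ refers to, the placeholder can be expanded by hand as sketched below. This is only an illustration: it assumes the chromosome names are 1 to 22 plus X and Y, and the function name expand_chrom_pattern is not part of GEC, which performs the substitution itself.

```python
CHROMOSOMES = [str(number) for number in range(1, 23)] + ["X", "Y"]


def expand_chrom_pattern(pattern, chromosomes=CHROMOSOMES):
    """Map each chromosome name to the file path obtained by replacing _CHROM_."""
    if "_CHROM_" not in pattern:
        raise ValueError("pattern does not contain the _CHROM_ placeholder")
    return {name: pattern.replace("_CHROM_", name) for name in chromosomes}


paths = expand_chrom_pattern(
    "D:/tmp/CEU/1kg.phase1.v3.shapeit2.eur.hg19.chr_CHROM_.vcf.gz"
)
print(paths["2"])
print(paths["X"])
```

Listing the expanded paths in this way makes it easy to confirm that every expected per-chromosome file has actually been downloaded.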

4) By HapMap LD data

I. on whole genome:

java -Xmx1g -jar gec.jar --effect-number \

--ld-file D:/tmp/CEU/ld_chr_CHROM__CEU.txt.gz \

--genome \

--out test1

Hint: _CHROM_ is a variable standing for the chromosome name

(1,2,…X,Y).

II. in some regions:

java -Xmx1g -jar gec.jar --effect-number \

--ld-file D:/tmp/CEU/ld_chr_CHROM__CEU.txt.gz \

--regions 'chr2:0-233434,chr4:2323-54564554' \
--coordinate-version hg19-hg18 \
--out test1

Hint: _CHROM_ is a variable standing for the chromosome name (1,2,…X,Y).
--coordinate-version specifies the reference genome versions of the physical positions (coordinates) in the LD file and in --regions, respectively.

III. in some chromosomes:

java -Xmx1g -jar gec.jar --effect-number \
--ld-file D:/tmp/CEU/ld_chr_CHROM__CEU.txt.gz \
--regions 'chr2,chr4' \

--out test1

Hint: _CHROM_ is a variable standing for the chromosome name

(1,2,…X,Y).

5) By haplotype data in the same format as MACH

I. on whole genome:

java -Xmx1g -jar gec.jar --effect-number \

--haplotype-file D:/tmp/CEU/EUR.chr_CHROM_.hap \

--haplotype-map D:/tmp/CEU/EUR.chr_CHROM_.map \

--genome \

--out test1

Hint: _CHROM_ is a variable standing for the chromosome name

(1,2,…X,Y).

II. in some regions:

java -Xmx1g -jar gec.jar --effect-number \

--haplotype-file D:/tmp/CEU/EUR.chr_CHROM_.hap \

--haplotype-map D:/tmp/CEU/EUR.chr_CHROM_.map \

--regions 'chr2:0-233434,chr4:2323-54564554' \
--coordinate-version hg19-hg18 \
--out test1

Hint: _CHROM_ is a variable standing for the chromosome name (1,2,…X,Y).
--coordinate-version specifies the reference genome versions of the physical positions (coordinates) in the haplotype map file and in --regions, respectively.

III. in some chromosomes:

java -Xmx1g -jar gec.jar --effect-number \

--haplotype-file D:/tmp/CEU/EUR.chr_CHROM_.hap \

--haplotype-map D:/tmp/CEU/EUR.chr_CHROM_.map \

--regions 'chr2,chr4' \

--out test1

Hint: _CHROM_ is a variable standing for the chromosome name

(1,2,…X,Y).

4.2 Do multiple-testing in a set of p-values

6) By genotype data in conventional linkage format

java -Xmx1g -jar gec.jar --effect-number [or --multiple-test] --error 0.05 \

--linkage-file D:/tmp/test \

--pvalue-file c:/tmp/test-results.txt \

--chrom-column 1 \

--marker-column 2 \

--marker-position-column 3 \

--coordinate-version hg19-hg18 \

--pvalue-column 7 \

--out test1

Hint:
--multiple-test applies the modified Holm and Simes multiple-testing procedures. However, it will be very slow for a large number of SNPs.
--error specifies the global type I error rate.
--coordinate-version specifies the reference genome versions of the physical positions (coordinates) in the linkage map file and the p-value file, respectively.

7) By genotype data in plink binary format

java -Xmx1g -jar gec.jar --effect-number [or --multiple-test] --error 0.05 \

--plink-binary D:/tmp/test \

--pvalue-file c:/tmp/test-results.txt \

--chrom-column 1 \

--marker-column 2 \

--marker-position-column 3 \

--coordinate-version hg19-hg18 \

--pvalue-column 7 \

--out test1

Hint:
--multiple-test applies the modified Holm and Simes multiple-testing procedures. However, it will be very slow for a large number of SNPs.
--error specifies the global type I error rate.
--coordinate-version specifies the reference genome versions of the physical positions (coordinates) in the plink map file and the p-value file, respectively.

8) By genotypes from 1000 Genomes Project VCF format

java -Xmx1g -jar gec.jar --effect-number [or --multiple-test] --error 0.05 \
--vcf-file D:/tmp/CEU/1kg.phase1.v3.shapeit2.eur.hg19.chr_CHROM_.vcf.gz \

--pvalue-file c:/tmp/test-results.txt \

--chrom-column 1 \

--marker-column 2 \

--marker-position-column 3 \

--pvalue-column 7 \

--coordinate-version hg19-hg18 \

--out test1

Hint:
We have provided a set of compiled 1000 Genomes (1KG) data at http://statgenpro.psychiatry.hku.hk/limx/kgg/phasedgty.html.
_CHROM_ is a variable standing for the chromosome name (1,2,…X,Y).
--multiple-test applies the modified Holm and Simes multiple-testing procedures. However, it will be very slow for a large number of SNPs.
--error specifies the global type I error rate.
--coordinate-version specifies the reference genome versions of the physical positions (coordinates) in the VCF file and the p-value file, respectively. If the two coordinate versions are identical, give the same version ID twice, e.g. hg19-hg19.

9) By HapMap LD data

(Assume the reference genome version of the LD data is hg18.)

java -Xmx1g -jar gec.jar --effect-number [or --multiple-test] --error 0.05 \
--ld-file D:/tmp/CEU/ld_chr_CHROM__CEU.txt.gz \

--pvalue-file c:/tmp/test-results.txt \

--chrom-column 1 \

--marker-column 2 \

--marker-position-column 3 \

--pvalue-column 7 \

--coordinate-version hg19-hg18 \

--out test1

Hint:
_CHROM_ is a variable standing for the chromosome name (1,2,…X,Y).
--multiple-test applies the modified Holm and Simes multiple-testing procedures. However, it will be very slow for a large number of SNPs.
--error specifies the global type I error rate.
--coordinate-version specifies the reference genome versions of the physical positions (coordinates) in the LD file and the p-value file, respectively.

10) By haplotype data in the same format as MACH

java -Xmx1g -jar gec.jar --effect-number [or --multiple-test] --error 0.05 \

--haplotype-file D:/tmp/CEU/EUR.chr_CHROM_.hap \

--haplotype-map D:/tmp/CEU/EUR.chr_CHROM_.map \

--pvalue-file c:/tmp/test-results.txt \

--chrom-column 1 \

--marker-column 2 \

--marker-position-column 3 \

--pvalue-column 7 \

--coordinate-version hg19-hg18 \

--out test1

Hint:
_CHROM_ is a variable standing for the chromosome name (1,2,…X,Y).
--multiple-test applies the modified Holm and Simes multiple-testing procedures. However, it will be very slow for a large number of SNPs.
--error specifies the global type I error rate.
--coordinate-version specifies the reference genome versions of the physical positions (coordinates) in the haplotype map file and the p-value file, respectively.
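As a rough illustration of how the effective number of independent tests and the global type I error rate given by --error combine, the sketch below applies a Bonferroni-style threshold. It assumes the threshold is the error rate divided by Me instead of by the total number of SNPs; the modified Holm and Simes procedures are not shown, and the function names, the Me value and the p-values are all made up for illustration, so this should not be read as GEC's exact implementation.

```python
def bonferroni_threshold(error_rate, effective_tests):
    """Return the p-value threshold error_rate / Me (assumed Bonferroni variant)."""
    if not 0 < error_rate < 1:
        raise ValueError("error rate must lie strictly between 0 and 1")
    if effective_tests <= 0:
        raise ValueError("effective number of tests must be positive")
    return error_rate / effective_tests


def significant_snps(pvalues, error_rate, effective_tests):
    """Return the SNP IDs whose p-values do not exceed the threshold."""
    threshold = bonferroni_threshold(error_rate, effective_tests)
    return [snp for snp, pvalue in pvalues.items() if pvalue <= threshold]


# Hypothetical inputs: Me = 250000 and three made-up p-values.
pvalues = {"rsA": 1e-8, "rsB": 3e-7, "rsC": 0.2}
threshold = bonferroni_threshold(0.05, 250000)
hits = significant_snps(pvalues, 0.05, 250000)
print(threshold, hits)
```

Because Me is never larger than the number of SNPs tested, a threshold of this kind is never stricter than the conventional Bonferroni threshold for the same error rate.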
