UPA Guide
User Manual:
Open the PDF directly: View PDF .
Page Count: 14
Download | |
Open PDF In Browser | View PDF |
UPA Matthew Jobin 1. Introduction The Universal Pipeline Accessory is a Python tool for automating processing and population genetics analyses downstream of generating BAM files from NGS sequencing runs. It was developed in response to numerous tasks that the members of the UCSC Human Paleogenomics lab performed on a regular basis with a myriad of separate scripts and one-liners in bash, in an attempt to standardize and simplify work in population genetics. While the Human Paleogenomics lab largely works with ancient DNA, many of the functions are applicable to modern DNA data. 2. Setup 2.1. Pre-Install You will need to have installed the following for full operation of UPA. Please note that failure to install any of these might result in less-than-obvious error output. plink: https://www.cog-genomics.org/plink2/ Bcftools: https://samtools.github.io/bcftools/ Samtools: https://github.com/samtools/samtools R: https://www.r-project.org and Rscript. SNPRelate: https://bioconductor.org/packages/release/bioc/html/SNPRelate.html yHaplo: https://github.com/23andMe/yhaplo Haplogrep: You will need to get in touch with the folks who created and maintain Haplogrp https://haplogrep.uibk.ac.at In order to obtain a Java program that allows querying of haplogroups. Without this, the .hsd files generated can still be uploaded manually at the Haplogrep site. 1 Java:https://java.com/ Biopython:https://biopython.org The EIGENSOFT packages, including convertf, smartpca and qp3pop: https://www.hsph.harvard.edu/alkes-price/software/ The following Python modules: progressbar linecache re 2.2. Installing UPA The simplest way to install UPA is from GitHub. From your install directory, type: git clone https://github.com/mjobin/UPA.git and then cd into UPA. You can either place the UPA folder in your PATH or move all the ups scripts to whichever directory in your path you would like to use. 2.3. Making permanent modifications You might want to modify the defaults for UPA, say so that your own scripts and executables folder does not need to be explicitly invoked from he command line. Any text editor will allow you to do this, though be aware that pulling from git will overwrite these changes, so you might need to make them again after an update. 3. Input UPA takes BAM files as input. One available structure is a text list of BAM files defining their locations on disk. The second is a “barcode file” list, wherein sequence ID’s samples and their barcodes (if any) are listed. This is to allow UPA input to stay in sync with a data pipeline by this author, Batpipe. 3.1. Barcode files A simple list of BAM files, with one BAM file per line. Note that the file should not 2 contain any blank lines. A typical barcode file, with annotations for the columns, is shown below: Figure 1: The “barcode”-type input file standard. 3.2. BAM files These must be in subfolders arranged by name and then sub-subfolder named BWA_. The file name must contain the reference name and a .q extension followed by the quality score. 3.3. VCF file The user may wish to input a VCF (Danecek et al. 2011) directly into UPA so that he/she may specify a method for calling bases before UPA runs. This is usually done because you would like to use a calling method apart from the provided bcftools pileup method, such as ANGSD or ATLAS. 4. Program Flow 3 Figure 2: UPA Program Flow 4 4.1. Options Command-line Option -bc_file -bc_leftspec -bc_rightspec Description Default Location of barcode-style input file. None Extensions or name to the left of the reference in sample name. Extensions of names to the right of the quality score in sample name. .M.cf. .s -bam_list Location of BAM list-style input file. None -vcf_file Location of VCF input file. None -wd Working directory . -verbose Print verbose output. False -overwrite -threads -ref Overwrite existing files and directories. Number of threads where applicable (e.g. ADMIXTURE). Reference FASTA file. Must be indexed. -q BWA minimum quality. -samindex Generate indexes for SAM/BAM files. -diploid Is data diploid? -regionrestrict - False 23 /data/genomes/hg19.fa 20 Restrict merge/call to a region of the genome. File with two columns, one for your BAMs existing chromosomes names 5 vcfchromrename and one for the new names. -mergevcffile -mergebamfile Name of external dataset in VCF format. Name of external dataset in BAM format. -mito Use mitochondrial functions. -ychr Use Y chromosomal functions. -imputor -imptree -maxheight -maxdepth -nsize -msize -ncollect -maxhops -yhaplo Imputation and correction (haploid data only). Pre-made phylogenetic tree for Imputor. Maximum height for IMPUTOR root ward search. Maximum depth for IMPUTOR rootward search. Minimum threshold neighbors to correct sequencing error. Minimum threshold number of neighbors to impute missing. IMPUTOR neighbor-collection method (rootward, hops, distance). Maximum number of hops to search in IMPUTOR’s hops method. 3 3 2 3 rootward 5 Location of yHaplo scripts. Leave blank to prevent yHaplo fro running. 6 Text file for population assignment. -poplistfile Also functions as a keep list. Plink formatted: first column is population, second is individual. -lowk Lowest K value for Admixture run. 2 -hik Highest K value for Admixture run. 10 -reps -tohaploid -tvonly -termcrit -optmethod -haplogrepjava -maxgap -mindepth Number of Admixture replicates per K. 10 Convert to haploid before running Admixture. Keep only transversions in adpipe functions. Termination criterion for ADMIXTURE. Optimizaiton method for ADMIXTURE. Can be em or block. 0.0001 block Invoke java version of Haplogrep Maximum gap in read before it is counted as a new region. Minimum depth (coverage) to be counted in a region. -admixture Run ADMIXTURE. -snprelatepca Run SnpRelatePCA. -smartpca Run smartpca. 1 1 Turn on arguments related to -ancient 7 processing ancient DNA. -scriptsloc -binloc -stripchr -addreadgroup -callmethod -gcbedfile Location of external scripts. Location of external binary executables. /data/scripts /usr/local/bin Strip out “chr” from chromosome names. Add read group (RG) back to your sample BAMs. Genotype calling method. Options: bcf, genocaller. bcf UCSC BED file for use with GenoCaller. -gcindent Indent depth for use with GenoCaller. 2 -plinkgeno Value for plink geno argument. 0.99 Annotate the ID column of your -annotate samples’ VCF using an external dataset. - When merging VCF files, only keep mergefoundonly sites found in both files Table 1: Command-line options for UPA Where UPA options listed above refer to external software, they are usually passing those options straight to that software, and thus consulting the manual for that software will give the use more detail about the option’s effect. 8 4.2. Processing input files A number of common steps need to be taken in many cases to process BAM files for population genetic analysis. One necessary step is the calling of genotypes from the raw BAMs, for which there are several methods. UPA provides two methods internally, and also allows user-defined VCF files to be imported, skipping the calling step and instead allowing the calling to be done externally. For internally-called BAMs, UPA also provides some optional steps for preparing files for analysis that should circumvent common bottlenecks in a data analysis pipeline. 4.2.1. Preparing BAM files for analysis There are a number of common preparatory steps tp working with BAMs from a sequencer. When merging your samples with another dataset, you might find that the chromosome names do not match those of your samples. Some forms of mapping strips the read group information needed by genotype callers. 4.2.1.1. Stripping 'chr' from sample chromosomes UPA provides a convenience function that strips the ‘chr’ element from the names of chromosomes. This may be helpful when names with the convention “chr1, chr2…” etc are used in your samples but your reference set uses “1,2,3,..”. Use the argument stripchr to invoke this 4.2.1.2. Adding ReadGroup formation Processing during sequencing can sometimes strip read group information from BAM files. If the user uses the -addreadgroup argument, UPA will add in basic read group information by invoking Picard. The sample name will be taken from the file name, while other information such as RGPL will default to “Illumina”. 4.2.2. Calling genotypes from BAM files UPA can use one of three methods for calling genotypes from BAM files. Each of 9 these have relative merits for speed, convenience and ability to work with degreased/ancient DNA. For a full comparison of these software, please consult their respective GitHub pages or manuals. 4.2.2.1. Bcftools mpileup UPA can call genotypes using BCFTools’s mpileup and call methods. This is also the approach taken if the user supplies external dataset in BAM format. The BCFTools options used by UPA are: -C 50: Coefficient for downgrading mapping quality for reads containing excessive mismatches. -d 8000: Maximum depth of 8000. -f : Uses as reference the FASTA files supplied by -ref. -q : Uses a quality score supplied by the user. 4.2.2.2. GenoCaller To use Genocaller (https://github.com/kveeramah/GenoCaller_indent) you will need a UCSC-style (non-binary) BED file. To obtain one from your external dataset in BED/BIM/FAM format, you can use steps similar to the below in plink: 1. plink --bfile 1240K --keep Keeplist.txt --make-bed --out 1240KTest -output-chr MT 2. plink --bfile 1240KTest --recode --output-chr MT --out 1240KTest 3. awk '{print $1, $4-1, $4}' 1240KTest.map > 1240KTest.bed You will need the resulting BED file to use GenoCaller in UPA. Reference it using the gcbedfile argument. GenoCaller directly outputs VCF files, which UPA then merges into a samples VCF file before proceeding. 4.2.3. Renaming chromosomes 10 The merging step in the following section will not work if the sample files’ conventions for chromosome names do not follow that of the external data. Hint: If you need to figure out what naming conventions were used for your chromosomes for a big external file, try something like: zgrep -o 'chrM’.vcf.gz | wc -l 4.2.4. Annotation It is often necessary to annotate (i.e. fill in missing information) for BAM files that are just returned from sequencing. In particular, the ID column of a VCF file generated from BAMs in UPA will need information in the ID column for successful merging with external data. The most straightforward way to do this is to annotate from the ID column of the external dataset you are merging, and this is the approach taken by UPA. When troubleshooting, it is a good idea to check your external VCF dataset to make sure that it does indeed contain the annotation information necessary. 4.2.5. Merging with an external dataset After the calling step, or after importing a sample VCF, UPA will attempt to merge the samples with a VCF from an external dataset if -mergevcffile is used. It is not uncommon for the REF alleles of the two files to fail to match, often an artifact of the assumptions used to generate that VCF file from BED/BIM/FAM format. If the user invokes UPA with -mergevcffile or -mergebamfile arguments, then it will merge that file with the VCF file generated from the BAM files or VCF file generated from the samples. The external data will be rearranged such that the REF allele matches that of the sample data, since the sample data is using a user-specified reference in the case of inputting BAM files. To save space, UPA will not merge lines where there are only missing alleles across all samples for the site. The name of the merged VCF file will be -MERGED.vcf. 11 If you use the -annotate argument, the ID column of the merged file will be drawn from the external data file. If you use the -mergefoundonly argument, the merged file will only contain sites found in both files, while leaving an option out will fill in missing data for the external data wherever it does not have a site matching the samples. NOTE: The QUAL and FILTER columns will preserve the information from your samples, NOT that of the external dataset. 5. Population genetics functions The functions of UPA are divided by type of data and function. The initial stages of the program perform data file conversion if necessary and/or requested, while the later stages perform commonly-used analyses based on the type of data and the external repositories available. 5.1. Autosomal only 5.1.1. ADMIXTURE UPA can invoke the software ADMIXTURE for demographic inference (Alexander et al. 2009). the user can specify the maximum and minimum K values, and the number of the replicates per K. The best run of each K is moved to a BEST directory based on its cross-validation score. Cross-validation plots are also generated in R using Rscript. 5.2. Y chromosome 5.2.1. yHaplo The yHaplo software calls haplogroups for Y chromosomal data (Poznik 2016). 5.3. Mitochondrion 5.3.1. Haplogrep 12 UPA can generate HSD files for use on Haplogrep website (Weissensteiner et al. 2016). Using the -mito switch creates a user-submittable HSD files. If the user has installed the Haplogrep software (on request from the maintainers of the Haplogrep site. The haplogrepjava switch will submit the generated file to Haplogrep if the software is installed. Please note that all regions that are not SNPs will be stripped from the HSD file. 5.4. All 5.4.1. ADMIXTURE 5.4.2. Rename and filter for populations An example of a population list file is shown below If a poplistfile is submitted, then only those individuals who have a matching population in that file will be preserved for further processing. 5.4.3. Prinicpal Components Analysis UPA can invoke one of two methods for performing Principal Components Analysis (PCA) on the samples: 5.4.3.1. SmartPCA SmartPCA is part of the Eigensoft package (Patterson et al. 2006) (Price et al. 2006). 5.4.3.2. SNPRelate SNPRelate is a parallel-processing PCA package for the R statistical suite (Zheng et al. 2012). 6. Bibliography Alexander, D.H., Novembre, J. & Lange, K., 2009. Fast model-based estimation of ancestry in unrelated individuals. Genome Research, 19(9), pp.1655–1664. Available at: http://genome.cshlp.org/cgi/doi/10.1101/gr.094052.109. 13 Danecek, P. et al., 2011. The variant call format and VCFtools. Bioinformatics, 27(15), pp.2156–2158. Available at: https://academic.oup.com/bioinformatics/articlelookup/doi/10.1093/bioinformatics/btr330. Patterson, N., Price, A.L. & Reich, D., 2006. Population Structure and Eigenanalysis. PLoS Genet, 2(12), p.NaN–NaN. Available at: http://dx.plos.org/10.1371/journal.pgen.0020190. Poznik, G.D., 2016. Identifying Y-chromosome haplogroups in arbitrarily large samples of sequenced or genotyped men. , pp.1–5. Available at: http://biorxiv.org/lookup/doi/10.1101/088716. Price, A.L. et al., 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet, 38(8), pp.904–909. Available at: http://www.nature.com/articles/ng1847. Weissensteiner, H. et al., 2016. HaploGrep 2: mitochondrial haplogroup classification in the era of high-throughput sequencing. Nucleic Acids Research, 44, p.W58. Available at: https://academic.oup.com/nar/articlelookup/doi/10.1093/nar/gkw233. Zheng, X. et al., 2012. A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics, 28(24), pp.3326–3328. Available at: https://academic.oup.com/bioinformatics/articlelookup/doi/10.1093/bioinformatics/bts606. 14
Source Exif Data:
File Type : PDF File Type Extension : pdf MIME Type : application/pdf Linearized : No Page Count : 14 PDF Version : 1.4 Title : UPA User Guide Producer : macOS Version 10.14 (Build 18A389) Quartz PDFContext Creator : Manuscripts Create Date : 2018:11:01 01:57:19Z Modify Date : 2018:11:01 01:57:19ZEXIF Metadata provided by EXIF.tools