Victor Manual
User Manual:
Open the PDF directly: View PDF .
Page Count: 15
MANUALMANUAL TUTORIALTUTORIALTUTORIAL USENOWUSENOWUSENOW HISTORYHISTORYHISTORY FAQFAQFAQ CONTACTCONTACTCONTACT
About
‑VICTOR
Functions
Contents
Features
Versions
‑Options
‑Configure file
‑Pipe
‑Standard error
‑Input files
‑Genome
1.About
VICTOR(VariantInterpretationforClinicalTestingOrResearch)isapipelinefortheanalysisofnextgeneration
sequencingdatastartingfromamultisamplerawVCFfilewithoutdecomposing,normalization,orannotation.
Itcanbeusedfordiseasegenediscoveryresearchorclinicalgenetictesting.Itisdesignedtobescalableto
wholegenomesequencing(WGS)ofalargesampleofindividualsthatistypicalofaresearchonacomplex
disease.ThedownloadablepackageincludesPERCH,VANNER,utilityprograms,scripttemplates,anddatafiles.
Itismostlyaselfcontainedpackage,wherebytherequirementforthirdpartyapplicationsisminimal(seethe
“USENOW”pagefordetails),andallnecessarydatabasesarealreadyincluded.Itprovidesdataupdatesona
monthlybasis.
ThispipelinerunsonLinuxorMacOSX.Itsupportsparallelcomputing.Theprimaryinterfaceofthispipelineis
aSLURMscripttemplatenamedslurm.all_steps.SLURMisajobschedulerforLinuxsystemsthatisusedby
manyoftheworld’ssupercomputersandcomputerclusters.Thisscripttemplatecanalsobeconvertedtowork
forotherjobschedulerssuchasPBSorMOAB.Alternatively,itcanbeusedasaBashscriptwithoutajob
scheduler.Usingthistemplate,userscanmodifyanalysisparametersandinputfilesspecifictotheirresearch
projectandsubmitajobtoacomputercluster.Thisscriptsupportsreanalysisfromthemiddleofthepipeline
ifpreviousstepsaresuccessful.Userscanmodifyparametersinthescript,selectthestepstobeexecuted,and
thensubmitajobtotheclusteragain.Thescriptwillautomaticallywritelogfilessequentially.
Thiswebpagedescribesthefeaturescommontoallincludedprogramsandtheusageoftheutilitytools.Forthe
detailsofPERCHandVANNER,pleasebereferredtothecorrespondinguserManual.
WhatVICTORdoes:
1. Conductsgenotype,variant,andsamplewisequalitycontrolofdata.Pleaseclickhereforalistofmethods.
2. Performsprinciplecomponentanalysisandautomaticallyadjustforpopulationstructureinassociationtests.
3. Calculatesrelatedness,removecorrelatedindividuals,anddetectpedigreestructureerrors.
4. Annotatesallelefrequency,deleteriousness,functionalconsequence,regulatoryregions,microRNAbinding
sites,proteindomains,andclinicalsignificanceofvariants.
5. ForgenediscoveryresearchonaMendeliandisease,complexdisease,orthediseasethatishypothesizedto
becausedbydenovomutations,itperformsvariantprioritization,geneprioritization,andgenesetanalysis.
Itworksfordifferentstudydesignsincludingcasecontrol,caseonly,extendedpedigrees,trios,ormixed.
Statisticsincludelinkageanalysis,linearorlogisticregression,sumsquaredU,Fisher’sexact,ranksum,etc.
6. Forclinicaltesting,itintegratesdeleteriousness,cosegregationandassociationtesttocalculateaposterior
probabilityofpathogenicityforavariant.ItimplementsthecomponentsintheIARCguidelinesthatarenot
availableinothersoftware.YoucanalsoconverttheVICTORoutputstoastrengthofevidence(supporting,
moderate,strong,verystrong)forintegrationwiththeACMGguidelines.
7. Itreportsincidentalfindingsforthereturnofresultstostudyparticipants.
Whatareincludedinthepackage:
1. Avariant/gene/genesetprioritizationandvariantclassificationsoftwarebundlenamedPERCH
2. AvariantannotationsoftwarebundlenamedVANNER
3. Asetofutilityprograms
4. AsetofscriptsandSLURMtemplatesforsubmittingjobstoacomputercluster
5. ThereferencesequenceGRCh37,GRCh38,andhg19
1.1.VICTOR
[Backtotop]
Topics
‑Implemented quality Control
‑Compare to ExAC
‑BayesDel score
Programs
‑vBED/vBED2
‑vConvertVCF
‑vConvertPed
‑vConvertTest
‑vSPLIT
‑vQC
‑vQS
‑vINFO
‑vPROV
‑HGVS_to_genomic
Scripts
‑ slurm.all_steps
‑slurm.step2
‑victor.sbatch
6. dbSNP150
7. Ensembltranscriptdatabaserelease92forGRCh38and87forGRCh37
8. RefSeqtranscriptdatabase20180521
9. AlightweightedEnsembldatabasecontainingprincipletranscriptsfromAPPRIS
10.AlightweightedRefSeqdatabasecontainingprincipletranscriptsfromAPPRIS
11.DatabaseofdiscrepanciesbetweenRefSeqsequenceandreferencesequence
12.MicroRNAbindingsitespredictedbyTargetScan7.1
13.Splicealtering(intronicorexonic)SNVspredictedbydbscSNV
14.Ensemblregulatorybuildmotiffeatures
15.AconvertedGeneManiagenenetworkreadytobeusedbyPERCH
16.PROVEANscoresforInDelsobservedinpublicdatabases
17.ThemaximumallelefrequenciesobtainedfromUK10K,gnomAD,GoNL,and1kJapanese
18.BayesDelscoresforallpossibleSNVsintheentirehumangenome
19.MaximumBayesDelscoresamongallpossibleSNVsforeachgene
20.ClinVarvariants,excludingthosewithconflictingreports
21.Databasesofmappability(DukeExcludableandDacExcludable)
22.InterProproteindomaindatabase
23.Severalgenepanelsforreportingincidentalfinding
24.Severalpathwaydatabasesforgenesetanalysis
25.ExACnonTCGAv1
FeaturesinadditiontothoseinheritedfromPERCHorVANNER:
1. Supportsparallelcomputing.
2. Supportsreanalysisfromthemiddleofthepipeline.Itwriteslogssequentially.
3. Supportsmultiplegenomesincludinghg19,GRCh37,andGRCh38.
4. AutomaticallydeterminestheVQSLODcutoffsforSNVsandInDelsseparately.
5. Automaticallyadjustsforpopulationstructuredetectedfromaprinciplecomponentanalysis.
6. Automaticallyremovescorrelatedindividualsthathavehighermissingrates.
7. Providesmonthlydatabaseupdates.
Versionchecking
Theversionstringofthissoftwarepackageisaversionnumberandabuilddate.Ifitisabetaversion,the
versionnumberisfollowedbytheword“beta”.Allprogramshavethesameversion.Programversionanddata
versionareseparate;sometimesyouonlyneedtoupgradetheprogramsbutnotthedata.The(version)
optionofanyprogramwillcheckfornewversionsforbothprogramsanddatathroughtheInternet.Thisoption
willnotsendoutanyinformation.
Ingeneral,theusageofprogramoptionsfollowstheserules:
1) Optionswith2+argumentsmustbeusedlikethis:“optionarg1arg2”;
2) Optionswithonly1augmentcanbeusedlikethis:“option=arg”or“optionarg”;
3) Youcanomitabooleanargumentthatispositive,ie:“option=yes”isequalto“option”;
4) Itisbettertoalwaysuse“option=arg”wheneverpossible,whichismorereadable;
5) Iftheargumentisanarray,youcanreplacethedefaultby“arrayoption=v1,v2”;
6) Youcanclearthedefaultbyassignmenttoanemptyarray“arrayoption=”;
7) Youcankeepthedefaultandinsertadditionalitemsby“arrayoption+=v1,v2”;
8) Theorderofdifferentoptionsdoesnotmatter.Forexample,“ab”isequivalentto“ba”;
9) Theorderofthesameoptionmatters.Thelastoneoverridestheothers.E.g.,“a=yesa=no”is“a=no”;
10)Youcannotmergemultiplesingleletteroptionsintoone:“ab”cannotbewrittenas“ab”.
Becarefulaboutarraytypearguments:thecorrectwaytoassigntwovaluesis“arrayoption=v1,v2”butnot
“arrayoption=v1arrayoption=v2”.Thelatterwillfirstassignv1tothearray,thenassignv2andgetridof
v1,leavingonlyoneelement(v2)inthearray.
Allescapesequencesinprogramoptionsandargumentswillbereplaced.Thefollowingsequencesare
recognized:\\(backslash),\a(alert),\b(backspace),\f(formfeed),\n(newline),\r(carriagereturn),\t
(horizontaltab),\v(verticaltab),and\xHH(bytewith2digitshexadecimalvalue).
The(h)or(help)optionwillprintashorthelptextfortheprogram.Inthehelptext,thedescriptionofeach
programoptionfollowstheconventionof“ProgramOptionArgumentDataTypeFunction{CurrentValue}”.
1.2.Options
[Backtotop]
Anythingwithinapairofsquarebracketsisoptional.Thingswithincurlybracketsarethecurrentvalueafter
settingbyfactorydefaults,configurefiles,andprogramoptions.Tomakethehelptextmorereadable,please
settheterminalwindowsizetoatleast160columns.
ArgumentDataTypeisrepresentedbyaletterorastringasshownbelow:
DorDBLafloatingpointnumber
IorINTanintegerwithoutcommasforthousands(1234butnot1,234)
BorBOOLacaseinsensitivebooleanvalue(0/1/no/yes/n/y/false/true/f/t)
CorCHARacharacter
SorSTRastring,escapesequenceallowed(e.g.,"hi\n"=hi\\n=$'hi\n')
ForFILEafilename,mayincluderelativeorabsolutepath,environmentvariablesallowed
PorPATHapath,relativeorabsolute,environmentvariablesallowed,betterendswith/
FLDorFIELDafieldnumber.Thefirstfieldis1,andsoon.Thelastfieldis1,secondlastis2,andsoon.
FLDsorFIELDsanarrayoffieldnumbersdividedbyacomma
SsorSTRsanarrayofstringsseparatedbyacomma
FsorFILEsanarrayoffilenamesseparatedbyacolon
Ifspecified,theremaybeastrictrequirementfortheformatandthenumberofarguments.Forexample,
“D,D,D”referstoexactly3floatingpointnumbersseparatedbyacomma,suchasthepenetrancearguments.
ArgumentDataTypecouldbefollowedbyanintegertohelpthewritingofthedescriptions,suchasS1,S2,S3for
thexctpfxoption,whereS1isExACfileprefix,S2iscases'qc.logfileprefix,S3iscases'coveragefileprefix.
AConfigureFileisanotherwaytopassparameterstoprogramsbesidesprogramoptions.Althoughbeing
optional,usingaConfigureFileismoreconvenientthanprogramoptions,especiallyfortheparametersthat
needtobethesameforallVICTORprograms.TheprogramsreadtwoConfigureFilesiftheyexist.Thefirstis
/path/to/VICTOR/etc/par.txt,whichcanbeusedtosetupparametersthatareuniversaltoallusers.The
secondis./par.txt,whichshouldbeusedtospecifyparametersinthecurrentanalysis.Theformeroverrides
defaultvalues;thelatteroverridestheformer;andcommandlineoptionsoverrideallofthem.
Inthisfile,onerowisoneoption.Theformatis"option=argument".Spaceisnotallowedinthearguments.
Anythingaftertheargumentwillbeomitted.Multipleconsecutivespacesortabsaretreatedasonedelimiter.
Linesstartingwitha#arecomments.Commentsandblanklineswillbeomitted.Thefirstrowofthefilehasa
specialcontent(seetheexamplesbelow),sothattheprogramswillnotaccidentallyreadawrongfile.
NotallprogramoptionscanbesetinaConfigureFile.Belowisalistofsupportedoptions:
--col-one Ss Header of the first column {"#CHROM","#Chrom","CHROM","Chrom","#CHR","#Chr","#chr","Chr","CHR","chr"}
--col-func S Header of the column for functional consequence {Func_Type}
--col-gene S Header of the column for gene symbol {Func_Gene}
--col-fdet S Header of the column for functional details {Func_Detail}
--info-func S The name of the INFO sub-field for functional consequence annotation {vAnnGene}
--gnuplot S The command for gnuplot 4.6.0 or later {gnuplot}
--gdb S The gene database (refGeneLite/refGeneFull/ensGeneLite/ensGeneFull) {refGeneLite}
--filt-vqs-nan B Exclude variants if VQSLOD is missing {no}
--filt-vqs-snv D Exclude variants if VQSLOD<D (-inf = no filtering) for SNVs {-5.368}
--filt-vqs-indel D Exclude variants if VQSLOD<D (-inf = no filtering) for InDels {-4.208}
--filt-miss-rate D Exclude variants if missing rate in cases or controls is > D (1 = no filtering) {0.01}
--filt-filter Ss Exclude variants if FILTER is not one of the Ss {.,PASS}
--filt-QD D Exclude variants if QD (Qual. By Depth.) < D (0 = no filtering, GATK default = 2) {0}
--filt-MQ D Exclude variants if MQ (Mapping Quality) < D (0 = no filtering, GATK default = 40) {0}
--filt-FS D Exclude variants if FS (Fisher Strand) > D (0 = no filtering, GATK default = 60) {0}
--filt-HS D Exclude variants if HaplotypeScore > D (0 = no filtering, GATK default = 13) {0}
--filt-MR D Exclude variants if MQRankSum < D (0 = no filtering, GATK default = -12.5) {0}
--filt-RP D Exclude variants if ReadPosRankSum < D (0 = no filtering, GATK default = -8) {0}
--hard-filter B Exclude variants by hard filters. Will set thresholds with GATK defaults if not set. {No}
--HardFiltIfNoVQ B Exclude variants by hard filters if there's no VQSLOD. Will set thresholds with GATK defaults. {yes}
--filt-DP I Exclude genotypes if DP<I (0 = no filter) {10}
--filt-GQ I Exclude genotypes if GQ<I (0 = no filter) {40}
--filt-MaxAF D Exclude variants if MaxAF > D (0 = no filter) {0.01}
--filt-SplAF D Exclude variants if SplAF > D (0 = no filter) {0}
--filt-FdrAF D Exclude variants if FdrAF > D (0 = no filter) {0.05}
--filt-del D Exclude variants if BayesDel<D (-inf = no filtering) {-0.0592577}
--prevalence D Prevalence {0.025}
--penetrance D,D,D Penetrance {0.02,0.1,0.5}
--include Fs Restrict analysis to regions in FILEs. Use 2+ files to define intersection. {}
--exclude Fs Restrict analysis to regions not in FILEs. Use 2+ files to define union. {}
--lof-only B Restrict analysis to loss-of-function variants only, BayesDel still applies {no}
--lof-no-del B Restrict analysis to loss-of-function variants only, BayesDel not considered {no}
1.3.Configurefile
[Backtotop]
--rm-ind Fs Remove individuals listed in file(s) Fs {}
--vc B Run mode is Variant Classification {no}
--no-web B Do not check version from web {false}
Belowisanexampleofthe/path/to/VICTOR/etc/par.txt:
VICTOR_parameters_1.0 << Do not delete or modify. This line prevents reading a wrong file.
--gnuplot=/opt/gnuplot/4.6.0/bin/gnuplot << This is necessary if gnuplot 4.6.0 or above is not included in $PATH.
Belowisanexampleofthe./par.txt:
VICTOR_parameters_1.0 << Do not delete or modify. This line prevents reading a wrong file.
--prevalence=0.025 << prevalence
--penetrance=0.02,0.1,0.5 << penetrance
MostVICTORprogramsreadadatafilefromthestandardinput,doanalyses,thenwriteamodifieddatafileto
thestandardoutput.Therefore,youcanusetheUnixpipe(“|”)torunmultipleprogramssequentially.This
featurecanhelpyoureduceCPUusagetimeandsavediskspaces.
Analysislogs,messages,andruntimeerrorsarenormallywrittentothestandarderror(StdErr)output.
However,theusageofUnixpipeandmultiprocessingparallelism(e.g.,usingGNUparalleltorunmultiple
instancessimultaneously,usuallyseparatedbychromosomes)willmaketheStdErroutputsbetweenprograms
interminglewitheachother.Tosolvethisproblem,pleasesetanenvironmentvariableSTDERR_MUTEXtoa
nonemptystring.TheprovidedSLURMscripttemplatesalreadyimplementedthisfeature.Ifyouusethescript
templatetorunananalysis,youdon’thavetosetthevariableseparately.
MessageswrittentothestandarderroroutputbyVICTORprogramsareoneliners,i.e.,onemessageperline.
Eachlinestartswithanumber,whichistheprocessIDoftheprogram.Bythisnumber,youcanseetowhich
programamessagebelongs.YoucaneasilycheckforerrorsandwarningsusingtheLinuxcommand“grep”.To
checkforerrors:grepi'error\|exception\|slurmstepd\|NODE_FAIL\|segmentation'<yourStdErrfile>
Tocheckforwarnings:grepi'warning'<yourStdErrfile>
Unlessotherwisestated,inputfilesforthissoftwarehavecolumnsdividedbyaspaceoratab.Multiple
successivedelimitersarenottreatedasone.Linesstartingwitha‘#’arecommentsandwillbeignored.The
readingofinputfilesisrobusttoLinux/Mac/Windows/mixedlinebreaks.Itisalsorobusttothemissingofaline
breakattheendofthelastline.Programswillstopreadingafileatthefirstblankline;therefore,youcanwrite
somethingusefultoyouinthefileafterablankline.Theinputfiledoesnotneedtobeuncompressedifitisa
.gzor.bz2becausetheprogramscanreadthemdirectly.Bothbgzipandgzipcompressionsaresupported.You
don’tevenneedtotype“.gz”or“.bz2”inthecommand,astheprogramswillfirstlookfortheuncompressed
file,thenfile.gz,followedbyfile.bz2.
1.6.1SampleFile
ASampleFilecontainsID,sex,outcomeandcovariatesforallsamples,includingunrelatedcasesandcontrols
forassociationtests,duplicatedsamplesforqualitycontrol,individualsfrompedigreesforlinkageanalysisorde
novomutationdetection,andanyothersamplesthathavebeensequencedandcouldbeusedforquality
control.Thisfilehasaheaderrow.Thefirstcolumnshouldbenamed“SeqID”,whichcontainstheSequenceID
foreachsequencedsampleinaGenotypeFile,i.e.,theIDinaVCFheaderrow.Thesecondcolumnis“Sex”
withcontentsoffemaleormale;otherstringsrepresentunknownsex.Ifyoudon’tknowthesexofsome
samples,setthemtounknown;don’tsetunknownsexasmalesorfemalesbecausethiswillleadtowrong
allelefrequencycalculationsforchromosomeXorY.Thethirdcolumnistheoutcomevariable,whichcanbean
affectionstatus(unaffectedoraffected;otherstringsareunknownaffectionstatus)oraquantitativetrait(if
thereisanynumbersnotequalto0/1/2).SexandAffectionStatusarecaseinsensitive.Samplesnottobe
usedinacasecontrolassociationtestshouldhaveanunknownaffectionstatus.Theyareinthisfilejustto
providethesexinformation.Columnsstartingfromthefourthareoptional.Theyarecovariatesforassociation
testsandqualitycontrol.Missingvaluescanbe.orunknown.Covariatescanbestringtypeornumerictype
butnotmixed,i.e.,acovariatecannotcontainbothnumbersandnonmissingstrings.Thenameofacovariate
cannotstartwith“_PC_”or“_iPop”,whichisreservedbyVICTORtostoretheprinciplecomponentanalysis
1.4.Pipe
[Backtotop]
1.5.Standarderror
[Backtotop]
1.6.Inputfiles
[Backtotop]
results.IfyouwanttoprovidethepopulationorigininformationinsteadofperformingaPCA,youcanadda
stringtypecovariatewiththeheader“pop”.ASampleFilemaycontainmoreorfewerpeoplethanthe
GenotypeFile.Onlytheoverlappingsampleswillbeanalyzed.Inanassociationtest,sampleswithamissing
valuefortheoutcomeoranycovariatewillbeomitted.Linesinthisfileneednotbesortednorinthesame
orderastheGenotypeFile.Belowisanexample,wherethecolumnsareseparatedbymultiplespacesforweb
displaypurposeonly.Inarealfiletheyshouldbedividedbyonetab.
SeqID Sex Disease Pop
ped1_i1 male unknAff Eur
ped1_i2 female unknAff Eur
ped1_i3 male affected Eur
ped1_i4 female unknAff Eur
random unknSex unknAff SAS
HG00096 male unaffected Eur
HG00097 female unaffected Eur
HG00099 female unaffected Eur
HG00100 female unaffected Eur
1.6.2PedigreeFile
APedigreeFilecontainsallpedigreesforlinkageanalysisbyvSEGordenovomutationanalysisbyvAAA.This
fileissimilartoPLINK.fam,butismorerobust,flexible,andinformative.IfyoualreadyhaveaPLINK.famfile
youcandirectlyuseitasaPedigreeFileforPERCH.Ithas6columnscorrespondingtoPedigreeID,
IndividualID,FatherID,MotherID,Sex,andAffectionStatus.Ithasaheaderlineasshownintheexample
below.Optionally,itcanalsohavethe7thcolumnforLiabilityClass.Pedigreesshouldnothaveany
consanguinityloopormarriageloop.Ifyourpedigreehasloops,youcanmanuallybreaktheloopsby
founderizingaminimumnumberofindividualswhilemaintainingtherelationshipsamongaffectedindividualsas
muchaspossible.Myprogram“pedpro”hasthefunctiontodosoautomatically.Youcanusethatprogramto
breaktheloopsbeforeanalysisbyvSEG.Thereisnootherrestrictiononpedigreestructure.Forvariant
classification,PedigreeFileshouldindicatewhoistheprobandineachpedigree,whichcanbedonebyadding
thephrase“(proband)”attheendoftheIndividualID.Bewarethattheword“proband”maymeansdifferent
conceptstodifferentpeopletoageneticcounseloritmaybethefirstpersonrequestedforagenetictest,
whomayormaynotbeacarrierofthemutation;whiletoastatisticianitisthefirstpersoninthepedigree
whotestedpositiveforthemutation.Ifyoudon’tknowhowtosetaproband,thenjustleaveittotheprogram;
vSEGwillautomaticallychooseaprobandinaconservativeway.YoucanusethetraditionalcodingofSex(1
formale,2female,0forunknown)andAffectionStatus(1forunaffected,2foraffected,0forunknown)oruse
thesamecodingasinaSampleFile.LiabilityClassisanintegerstartingfrom1.Ifanindividualwas
sequenced,theIndividualIDshouldmatchwiththeheaderoftheVCFfile.ThismakesthePLINK.famfileabit
moredifficulttoprepare.However,aprogramnamedvConvertPedintheVICTORpackagecanhelpyoucreate
thisfilefromaPERCHformatPedigreeFile,whichismoreconvenientinsomesituations.Belowisanexample.
Herecolumnsareseparatedbymultiplespacesforwebdisplaypurpose.Inarealfiletheyshouldbedividedby
exactlyonetab.
PedID IndID Father Mother Sex Aff
ped1 ped1_i1 0 0 1 2
ped1 ped1_i2 0 0 2 1
ped1 ped1_i3 ped1_i1 ped1_i2 1 2
ped1 ped1_i4 ped1_i1 ped1_i2 2 2
1.6.3SeedFile
ASeedFileliststheseedgenesforguiltbyassociationanalysisbygnGBA.Thefirstcolumnofthisfileisa
diseasenameandthesecondcolumngenesymbols.Spaceisnotallowedinthediseasenameorthegene
symbol.MakesurethatgenesymbolsareNCBIOfficialSymbols,notfullnamesorsynonymsorGeneIDs.This
filedoesnothaveaheaderrow.Belowisanexample:
Disease Gene1
Disease Gene2
1.6.4AnnotationFile
AnAnnotationFileisadatabasefileforvariantannotationbyvAnnDel.Theinformationforannotationcouldbe
deleteriousnessscores,dbSNPIDs,allelefrequency,etc.Normallyyoudon’tneedtocreatethisfilebecause
theseinformationarealreadyincludedintheVICTORpackage.Ifyouhaveallelefrequencydataspecifictothe
studypopulation,youcanmakeanAnnotationFilewithallelefrequencyforannotation.Theformatofthisfile:
1) Linesstartingwith##arecomments;
2) Thefirstrowistheheaderrow.Contentsofthisrowwillbeusedasheadersintheoutput;
3) Thefirst4columnsare#CHROM,POS,REF,andALT;
4) Thereisnolimittothenumberofcolumns;
5) Variantswithmultiplealternativeallelesaresplitintoseparatelines;
6) Linesaresortedby#CHROM,POS,REFandALT;
7) Thefileiscompressedbybgzipandindexedbytabix;
8) ColumnheadercannotbeMaxAF,whichisalreadyreservedforinternalusagebyVICTOR.
1.6.5CohortFile
Ifyoursamplesweretargetenrichedbydifferentreagentsorsequencedbydifferentplatformsorwithdifferent
depths,thenyouneedtogenerateaCohortFile(cohort.txt)andsetcohort=cohort.txtforQC1in
slurm.all_steps.Hereacohortisasamplesetthatistargetenrichedbythesamereagentandsequencedby
thesameplatformwiththesamedepth.ACohortFilehas2columns,SeqIDandCohortID,delimitedbyatab.
TheCohortIDshouldnotcontainwhitespaces.Itcancontainsanynumberofalphanumericorothercharacters.
SampleswithanunknownCohortID(emptyor“.”or“unknown”)willbeexcludedfrommissingratecalculation
byvQC.Belowisanexample.Herecolumnsareseparatedbymultiplespacesforwebdisplaypurpose.Inareal
filetheyshouldbedividedbyonetab.
SeqID Cohort
ped1_i1 MyStudy
ped1_i2 MyStudy
ped1_i3 MyStudy
ped1_i4 MyStudy
HG00096 1000Genomes
HG00097 1000Genomes
HG00099 1000Genomes
HG00100 1000Genomes
VICTORprovidesdatafilesforhg19,GRCh37,andGRCh38.Becausetheremaybemultiplegenomedatabases
storedinVICTOR’sdatafolder,usersneedtotelltheprogramswhichonetousebyeitheroneofthetwo
methods.1)Usethegenomeoptioneitherinthecommandlineorapar.txt.2)Placethecurrentdirectory
withinafullpaththathasthenameofthegenome,forexample,/path/to/GRCh37/PI/project_A/analysis_1.
Thegenomenamedoesnothavetobethewholefoldername;itcanbesomethinglike“assembly_GRCh37”.
Butifthefullpathcontainsmultiplegenomenames,suchas“GRCh38_liftover_from_GRCh37”,thenthe
programscannotinferthegenome.Inthiscase,youhavetousethegenomeoption.
2.Scripts
ThisisaSLURMscripttemplateforyoutosubmitjobstoacomputercluster.SLURMscriptsarebasicallyBash
scriptswithadditionalSLURMparameters,whicharethelinesstartingwith"#SBATCH".Youcaneasilyconvert
aSLURMscripttoaPBS/MOABscriptoraplainBashscript.
Thisscriptwillsubmitonejobandperformparallelanalysesononecomputernode.Itprovidesamechanismto
savelogfilessequentiallywhenyouexecutethisscriptmultipletimes.Forthispurpose,youneedtousearray
insubmittingthejob(e.g.:sbatcharray=1slurm.all_steps).Belowtakesarray=2asanexample.Whena
jobstartsrunning,afilenamed"slurm.all_steps.run_2.start"willappearinthefolder.Whenthejobfinishes
successfully,anotherfilenamed"slurm.all_steps.run_2.stop"willappear.Standardoutputsanderrorswillbe
writtento"slurm.all_steps.run_2.stdout"and"slurm.all_steps.run_2.stderr",respectively.Thescriptitselfwill
besavedtoslurm.all_steps.run_2.script.Thelocalparameterfilepar.txtwillbesavedto
slurm.all_steps.run_2.par.TheversionofVICTORwillbewrittentoslurm.all_steps.run_2.version.
Thisscript
1. Conductsgenotypewise,variantwise,andsamplewisequalitycontrolofdata(see4.1below).
2. DeterminesVQSLODcutoffsforSNVsandInDels,separately.
3. Performsprinciplecomponentanalysisandadjustforpopulationstructureinanalyses.
4. Calculaterelatedness.Selectsonesamplefromeachrelatedgroupbyminimizingmissingrate.
5. Detectpedigreestructureerrors.
6. Annotatesfunctionalconsequence,allelefrequencies,andthedeleteriousnessscoreBayesDel.
7. Testswhethercasesandcontrolsarecomparablebycountingthenumberofrarevariantspersample.
1.7.Genome
[Backtotop]
2.1slurm.all_steps
[Backtotop]
8. PerformsPERCHanalysis,whichmaybeaburdentest,variancebasedassociationtest,linkageanalysis,
geneprioritization,variantprioritization,genesetanalysis,orreportingincidentalfindings.
Thisscriptconductsthesecondstepofslurm.all_steps.Itisdifferentfromslurm.all_stepsinthatitwillsubmit
multiplejobstoacomputercluster.Eachjobperformsananalysisononechromosome.Thisisusefulwhenyou
needtodophasing,whichistimeconsuming.
Thisisascripttorunthe“sbatch”commandtosubmitajobtoacomputercluster.Insteadofusingsbatch,this
scripthastheadvantageofworkinginharmonywithslurm.all_steps.Sinceslurm.all_stepshijacksthearray
numbertomakesequentiallogsifyourunslurm.all_stepsformultipletimes,victor.sbatchmakessurethatyou
dousethearrayoptionandthearraynumberhasnotbeenusedbefore.Italsohastheadvantageofsaving
thejobIDtoafilenamedslurm.all_steps.account,sothatyoucaneasilykeeptrackofthejoborcancelthe
job.victor.sbatchcanalsoworkforotherslurmscripts.
3. Utilitytools
ItreadscoveragedataoutputfromsamtoolsorGATK,calculatessomestatistics,andthenwriteaBEDfilefor
wellcoveredregions.Herethestatisticsistheproportionofaregionaboveacertainreaddepth(dp)ina
minimumpercentageofthesamples(pc).Theregionscouldbetargetedregions(target)orgenes(gene).
Aftercalculatingthesestatistics,itcombinesthedataandwriteaBEDfileofwellcoveredregions,whichhave
readdepthsaboveathreshold(cutoff).Ifthe(log)optionisset,itwillalsowritetoafilethenumberof
sampleswellcoveredforeachlocus.
TheinputfilescouldalsobeBEDfiles(format=bed).Inthiscase,thecoveragestatisticscannotbecalculated,
butitstillintegratesthedataandwriteamergedBEDfilefordownstreamanalysis.Themergedregionsare
thosecoveredbyaminimumpercentageoftheinputBEDfiles(pc).IfyouhaveonlyoneinputBEDfile,the
outputshouldhavethesameregionsasinput,butyoucanuseittoseehowmanybasepairsareincludedinthe
BEDfilebythiscommand:"vBEDformat=bedinput.bed>/dev/null".
vBED2isafasterversionofvBED,butsomeoptionsarenotimplemented.
vConvertVCFmodifiesVCFfilestobereadbyKINGorPLINK1.9.IfyouprovidemultipleVCFfiles,theywillbe
concatenated.YouneedtoprovideaPedigreeFile(ped)and/oraSampleFile(spl).Sequencedindividuals
notinthesefileswillbedropped;individualsinthesefilesbutnotintheVCFcanbeadded(add);VCF
headerswillbechangedto<FID><delimiter><IID>forsamplesinthePedigreeFile,or<SeqID><delimiter>
<SeqID>forthoseintheSampleFile.Thedelimitercanbecustomized(iddelim).Ifasampleisinboth
PedigreeFileandSampleFile,theformerhasthepriority.ItwillalsowriteaPLINK.famfiletoreplacetheone
createdbyPLINK(fam).Pleasenotethattheorderofindividualsinthis.famfileisnotthesameasthatin
theoutputVCFfile.Sotousethis.famfile,youneedtoenableindivsortforPLINK:
plink_1.9 --vcf vConvertVCF.output.vcf --id-delim : --indiv-sort file vConvertVCF.output.fam --make-bed
mv vConvertVCF.output.fam plink.fam
VICTORandsomeotherprogramssupportaPedigreeFileinPLINKorLINKAGEformat,wheretheIndividual
IDsshouldmatchwiththeheadersofaVCFfile.However,sometimesthisformatisnotideal.Youmayalready
haveanoldpedigreefilewithdifferentIndividualIDs.YoumayalsowanttoseparateIndividualIDsfrom
SequenceIDs,sincetheyaredifferentconcepts.Sometimesyoumayhavesequencedsomeindividualsmore
thanonce.ThesewouldmakethecreationofaPedigreeFileinPLINKformatabitmoredifficult.
Previously,PERCHsupportsanotherPedigreeFileformat(hereafterPERCHformat),whichisbasicallyaPLINK
formatPedigreeFilewithanextracolumnofSequenceIDthatmatchestotheheaderrowofaVCFfile.Forthe
2.2slurm.step2
[Backtotop]
2.3victor.sbatch
[Backtotop]
3.1.vBEDandvBED2
[Backtotop]
3.2.vConvertVCF
[Backtotop]
3.3.vConvertPed
[Backtotop]
individualsthatarenotsequencedtheSequenceIDis0.Thisfilehasaheaderrow,wheretheSequenceID
columnshouldhavetheheadernameSeqID.Becauseofthisextracolumn,theIndividualIDdoesnotneedto
matchwiththeVCFheader,nordoesitneedtobeuniqueacrosspedigrees.Preparingapedigreefileinthis
formatismoreconvenientthanthePLINKformat.Nowthisfileformatisnolongersupportedbymostofmy
programs,butyoucanstillusethisformatandthenusethevConvertPedtooltoconvertthefiletoPLINK
format.Thistoolreadstwofiles,aPedigreeFileinPERCHformat(ped)andaGenotypeFileinVCF(geno).
ThereasontoreadaVCFistofindthesequencedindividuals,sothatitcantranslatetheIndividualIDsforthe
sequencedindividualsonly.
Thisprogramreadavariant(chrposrefalt)andapedigreefile(ped),thenwriteaVCFfileanda
pedigreefile(prefix)forvariantclassificationbyPERCH.
Theinputpedigreefileshouldhave8columns:Pedigree_ID,Individual_ID,Father,Mother,Sex,Affection
Status,Age,andGenotypeofthevariant.Ageshouldbeapositivenumberlessthan150,otherwiseitis
deemedmissing.Genotypeishet,hom,neg,+/+,+/,/+,/,2/2,2/1,1/2,1/1.Otherstringsaredeemed
missingvalue.
Theprogramwillcalculatesexandagespecificliabilityclasses(age).Thereareoptionstospecifywhattodo
withamissingage(unkage),sex(unksex),oraffectionstatus(unkaff).
vSPLITisaprogramtosplitmultiallelicvariantsintomultiplelines.Thisishelpfulforthedownstreamanalysis,
aseachalternativeallelemayhaveadifferentdeleteriousnessscore,populationfrequency,andfunctional
consequence.
Insplittingalternativealleles,theprogramsplitsthegenotypestooifthefilecontainsanygenotypecolumns.
Forexample,ifasample’sgenotypeis1/2(i.e.,alternativeallele1andalternativeallele2),theperson’s
genotypewillbecome1/aanda/1,respectively.Here‘a’denotes“anotheralternativeallele”,whichcanbe
changedbyoneoftheoptions(a).Theprogramassumesthatstartingfromthe10thcolumnaregenotypes,
whichconformstotheVCFformat.Ifthelastfewcolumnsarenotgenotypes,usetheoption(l)tochangethe
columnnumberofthelastsequencedindividual.
ThisprogramalsoreadstheINFOfieldandsplitscertainvariables.Thereareseveralmodesofsplitting:a)
DirectSplitting(ds),whenthevaluescorrespondtoeachalternativealleleinthesameorderasthe“ALT”
field;b)PlusReference(pr),whenthereisanextravalueforthereferencealleleattheend;andc)Genotype
Counts(gc),whicharethenumberofobservationsforallpossiblegenotypes.Pleaseusethe(h)optionto
seethenamesofthedefaultvariablesthatwillbesplitineachmode.Thesedefaultvariablesaredesignedto
workforVCFfilesfromthe1000GenomeProject,theNHLBIGOExomeSequencingProject,theExome
AggregationConsortium,andVCFfilescreatedbytheGenomeAnalysisToolkit.Youcanspecifymorevariables
thanthoseactuallyexistinaVCFfile.Soyoudon’thavetochangethevariablenamesunlessyouhavea
variablethatisnotsetbydefault.
ThisprogramdoesnotsplittheADfieldineachgenotypecolumn.ThisisbecausetheADfieldisnotpartofthe
standardVCF,andtheusageofADisnotrecommendedbyGATK:“BecausetheADincludesreadsandbases
thatwerefilteredbythecaller(andincaseofindels,isbasedonastatisticalcomputation),itshouldnotbe
usedtomakeassumptionsaboutthegenotypethatitisassociatedwith.Ultimately,thephredscaledgenotype
likelihoods(PLs)arewhatdeterminesthegenotypecalls.”
InputandOutput
vQCperformsgenotype,variantandsamplewisequalitycontrol(QC).TheinputisaVCFfilewithorwithout
genotypefields.Ifithasnogenotypes(suchasthoseinExAC),itisrequiredtospecifythetotalnumberof
samplesinthesequencing(totspl)andtheINFOsubfieldnameforAC(infoac)andAN(infoan).The
outputisamodifiedVCFfileafterremovinglowqualitygenotypesandvariants.Thereasonfortheremovalof
eachvariantcanbeloggedtoafile(log).Somesampleswillbeexcludedfromtheoutputifspecified(rm
3.4.vConvertTest
[Backtotop]
3.5.vSPLIT
[Backtotop]
3.6.vQC
[Backtotop]
ind).Uncommonintergenicorintronicvariantsmaybeexcluded(rmiiMaxAFlt)toreducethecomputation
timeofdownstreamanalysis,suchasphasing.Toidentifyintergenicandintronicvariants,theVCFinputneeds
tohavefunctionalannotationsinanINFOsubfield(infofunc,supportsannotationbyVANNER).vQCcanalso
filtervariantsbyvariationtype(snvonly,indelonly),chromosomalregion(include,exclude),
chromosometype(autochr).Denovomutationsmaybelostduringasubsequentanalysis,suchasphasing
byBEAGLE.vQCcansolvethisissuebyexportingdenovomutationstoafilebeforetheanalysis(outdn),
theninsertthembacktotheVCFafterward(insertdn).
TheformatsofPedigreeFile(ped)andSampleFile(spl)aredescribedintheManualpageforPERCH.Ifthe
SampleFilecontainspopulationinformation(column4,header"pop",caseinsensitivestrings,missingvalueis
anemptystringor.orunknown),HardyWeinbergequilibrium(HWE)tests(filthwepv)willbeperformedin
unaffectedsampleswithineachpopulation.NonfounderindividualsfromthePedigreeFilewillbeexcluded
automatically(hwefounder).Ifpopulationisnotdefined,youhavetheoptiontochoosewhethertodoHWE
testinallindependentcontrols(hwecontrols).
Filters
MostoptionsofthisprogramareparametersforclassicalQCprocedures,suchashardfiltering(hardfilter,
HardFiltIfNoVQ)orVQSLODfiltering(filtvqssnv,filtvqsindel),missingVQSLODscore(filtvqsnan),
HardyWeinbergequilibriumtest(filthwepv,filthweinfo),FILTERfield(filtfilter),minorallelecount(
filtmac),totalallelecount(filtan),missingrateincasesorcontrols(filtmissrate),missingrate
discrepancybetweencasesandcontrols(filtmisspv),genotypediscordanceamongduplicatedsamples(
filtdiscord),Mendelianerrorsexcludingdenovomutations(filtMendelian),thesamedenovomutationin
multipleindividuals(filtdenovo),unoriginaldenovomutations(filtuodn),theproportionofheterozygous
haploidyamongnonmissinghaploidyindividuals(filthh),theproportionofsampleswithsufficientcoverage
(filtcovDP,filtcovpc),andthemeandepthifsmallerorgreaterthanacertainthreshold(filtminDP,
filtmaxDP).Bydefault,homozygousdenovomutationsaretreatedasMendelianerrorsbutnotadenovo
mutation.Pleaseseetheprogramhelptextfortheoptionsforeachoftheabovefilters.
Inaddition,vQCimplementsauniqueQCmethodusingallelefrequency.Ifyouaresearchingforlowfrequency
variants,andyouexpectalowfoundereffectandhighlocusand/orallelicheterogeneity,youmaywantto
excludecommonvariantsfromsubsequentanalysis.ThisfilteringmaybeanefficientQCtoobecausesome
sequencingartifactsmaybe“observed”inalmosteverysequencedsamples.Similartothisidea,vQCremoves
variantsthataresupposedtoberarebutaretoocommoninthedata.Ittakesaminorallelefrequencycutoff(
filtobsmaf).ForvariantswhoseMaxAFissmallerthanthecutoffitcalculatesaonesidedpvaluefor
observingthedataamongthesamples(casesandcontrolsaltogether)givenanallelefrequencyequaltothe
cutoff,andthenremovesthevariantsifthepvalueissmallerthanauserdefinedthreshold(filtobspv).As
youmaysee,thismethodisratherconservativebecausevariantswerefilteredbasedonpvalueinsteadofthe
observedfrequency.Sothismethodisrobusttofoundermutations.ThisQCmethodmaybevalidevenifthe
samplesareenrichedforacertainphenotypethatisdistinctfromthediseaseofinterest.Avariantthatistoo
commoninthosesamplesmaybe:1)anartifact;2)acommonvariantspecifictothestudiedpopulation;3)a
causalvariantfortheenrichedphenotype;or4)aprotectivevariantforthediseaseofinterest.Formost
studiesthelastsituationisveryunlikelyandhenceitissafetoremovethesevariants.Nevertheless,useitwith
caution.
Themissingratefilteringcanbedoneincasesandcontrolsseparately(filtmissea=yes)orjointly(filt
missea=no).However,sometimesitmakesmoresensetodoitbycohort.Hereacohortmeansasampleset
thatisexomecapturedbythesamereagentandsequencedbythesameplatformwiththesamedepth.Also,if
youhavesequencedpedigrees,somepedigreemembersmaybeindicatedwithanunknownaffectionstatusin
thesamplefile(becausetheyarenotindexcases),whowillnotbeincludedinthemissingratecalculationif(
filtmissea=yes).Inthiscase,youcancreateaCohortFilewith2columns,SampleIDandCohortID,
delimitedbyatab.Usethe(cohort)optiontospecifythisfile,thentheprogramwilldothemissingrate
filteringbycohortonly,disregardingcase/controlstatusorpedigrees.ThecohortIDshouldnotcontainwhite
spaces.Itcancontainsanynumberofalphanumericorspecialcharacters.SampleswithanunknowncohortID
(emptyor“.”or“unknown”)willbeomitted.
DuplicationsandReplications
Toperformqualitycontrolbygenotypeconcordanceamongduplicatedsamples,vQCreadsaDuplicationFile
(dup).Thisfiledoesnothaveaheaderrow.EachrowarethedifferentSeqIDsforonebiologicalsample.The
numberofcolumnsineachrowmayvary,butthereshouldbeatleasttwocolumnsastheseareduplications.
ThevQCwillcomparecolumn2withcolumn1,column3withcolumn1,andsoon.Therefore,itisbettertoput
the“goldstandard”incolumn1,andeachexperimentinaspecificcolumnconsistentlyforalllines.Missing
SeqIDshouldbe“0”or“.”orempty.Thefirstcolumnshouldnotbemissing.AtleastoneSeqIDperlineinthis
fileshouldalsobeintheSampleFiletoo,becausevQCneedstoobtainsexinformationfromtheSampleFile.
Belowisanexampleofthisfile.Supposeyouhaveperformedfourexperiments:Sangersequencing,exome
array,andnextgenerationsequencingbytwoplatforms,thentheDuplicationFilemaylookslikethis:
#SangerArrayNGS1NGS2
S1SS1AS1N1S1N2
S2S0S2N1S2N2
S3SS3A0S3N2
...
Besidesduplications(asamplewentthroughthesameexperimenttwice),vQCcanalsoperformqualitycontrol
bygenotypeconcordanceamongreplications(asamplewentthroughanotherexperimenttovalidatethe
sequencingresults,suchashighthroughputSNParraygenotypingortargetedsequencing).Variantsinan
arraymostlikelyhaveacommonorlowallelefrequency,soitisnotveryhelpfulfortheQCofrarevariants
discoveredfromsequencing.Butitcanstillbeusedtoestimatetheconcordancerateandfindoutagood
thresholdforGQandDP.Todothis,vQCreadstwofiles(rep).ThefirstoneisaReplicationIDFile,which
translatesIDsbetweenthetwoexperiments.ThesecondoneisaReplicationGenotypeFile,atabixindexed
cleanedVCFfilethatcontainsgenotypesfromthereplicationexperiment.Thenumberandorderofsamplesin
thesecondfileneednotbethesameasintheprimaryVCFinputfile.Theoption(outdup)willwritetheGQ
anDPvaluesofallgenotypestoafile.Inthatfile,thefirstthreecolumnsarethecomparisonresult(Cfor
concordance,Dfordiscordance,followedbytheprimarygenotypeinsequencing),GQ,andDP,respectively.If
theoption(outdup)isnotset,vQCwillwriteagenotypecomparisonsummarytothestandarderror.
SamplewiseQC
vQCcanalsooutputseveralstatisticsforsamplewisequalitycontrol(sampleqc),whichincludemissingrate,
heterozygoustononreferencehomozygousratio,Ti/TvratioofcodingSNVs,geneticsex,meanGQ,meanDP,
andthegenotypeconcordancerateamongreplicatedvariants(ifyouhavearraydata).Sinceitismostlikely
thattheproportionofbadsamplesissmall,whiletheproportionofbadvariantscouldbelarge,thesestatistics
arecalculatedaftergenotypewiseandvariantwisequalitycontrol.Ti/Tvwascalculatedforcodingregionsonly
sothatitiscomparableevenbetweenawholeexomeandawholegenomesequencing.Toidentifycoding
variants,theGenotypeFileshouldhavebeenannotatedforfunctionalconsequences,wheretheannotationisin
theINFOfield(infofunc).vQClooksforthephraseMissense,Synonymous,StopLoss,StopGainand
Translationwithinthissubfieldforcodingvariants.
vQCinfersexfromXandYseparately.Itdoesn'tjointhetwochromosomestomakeoneinferencebecause
theymaycontradicttoeachother,whichmaybeduetosamplecontaminationorrareconditions.Forsex
inferencefromX,vQCrequiresanestimatedgenotypingerrorrate(xerrrate).Unlessyouuseverygentle
QCfilters,thedefaultvalueshouldbefineformoststudies.
Ifyouhavearraydataanduseitinthequalitycontrol(rep),vQCreportsthegenotypecallrateandthe
concordancerateamongtheoverlappingvariants.
joinsampleqc
IfyouperformQCforonechromosomeatatime,whichisrecommendedforalargestudy,youcanaggregate
theresultsandcalculatethegenomewideoverallstatistics(joinsampleqc).Ifyouhavecalculatedgenetic
relatednessbetweenindependentsamplesbyKING,vQCcanreadtheKINGoutput(king)andselectthe
individualstoberemovedinsubsequentanalyses.Itfirstfindsgroupsofpeoplethatarecorrelatedbykinship.
Thenumberofpeopleineachgroupcouldbemorethantwo.Fromeachgroup,vQCselectsthepersonwhohas
thelowestmissingrate.Theremainingpeoplewillbewrittentoafilenamed<prefix>.rel_problem.Forthis
optiontowork,youalsoneedtosetthe(spl)and(prefix)option.Inasubsequentanalysisthatrequires
independentindividuals,youcanremovethesampleslistedinthisfile(rmind+=<prefix>.rel_problem).The
defaultofthe(king)optionremovesseconddegreerelativesorcloser(kincutoff),becausethekinship
calculationisnotveryreliablebeyondthisrelatednesslevel.Ifyou’dliketoincluderelatedindividualsinan
associationtest,usingindividualweightstocontrolforgeneticrelatedness,thenyoucanchoosetoremove
duplicatesandmonozygotictwinsonly(kincutoff=0.35355).
Table:contentsofsampleqcorjoinsampleqcoutput.
------------------------------------------------------------------------------------------------
Column Header Content
------------------------------------------------------------------------------------------------
1 SeqID Sequence ID
2 Sex Input sex from Sample File
3 ms Missing genotypes on autosomal chromosomes
4 non-ms Non-missing genotypes on autosomal chromosomes
5 Het Heterozygous genotypes on autosomal chromosomes
6 AltHom Homozygous ALT genotypes on autosomal chromosomes
7 HetRare Heterozygous rare variants within the minimum exome on autosomal chromosomes
8 HomRare Homozygous rare variants within the minimum exome on autosomal chromosomes
9 HetPers Heterozygous personal variant within the minimum exome on autosomal chromosomes
10 HomPers Homozygous personal variant within the minimum exome on autosomal chromosomes
11 HetSing Heterozygous Singleton within the minimum exome on autosomal chromosomes
12 HomSing Homozygous Singleton within the minimum exome on autosomal chromosomes
13 Ti Ti coding SNVs on autosomal chromosomes
14 Tv Tv coding SNVs on autosomal chromosomes
15 MsRate Missing rate on autosomal chromosomes
16 Het/Hom Het/Hom ratio on autosomal chromosomes
17 HetRate Heterozygous rate on autosomal chromosomes
18 Ti/Tv Ti/Tv ratio on autosomal chromosome coding regions
19 Y_ms Missing on Y
20 Y_call Genotype calls on Y
21 Y_call% Genotype call rate on Y
22 Y_sex Genetic sex inferred from Y
23 X_call Genotype calls on X
24 X_het Heterozygous genotypes on X
25 X_LOD LOD score for sex inference from X
26 X_sex Genetic sex inferred from X
27 RepGeno Number of variants that replication experiment has a genotype call
28 RepBthG Number of variants that both replication and this VCF have a genotype call
29 RepConc Number of concordant genotype calls
30 Call% Call rate in this VCF among the variants that replication has a genotype call
31 Conc% Concordance rate among the variants that both have a genotype call
32 numGQ Number of non-missing GQ score
33 numDP Number of non-missing DP score
34 meanGQ mean GQ score
35 meanDP mean DP score
------------------------------------------------------------------------------------------------
Overallquality
vQCassessestheoverallqualitybythenumberofrarevariantspersample.RarevariantmeansMaxAF<0.05(
rv).Thismeasureismoreinterpretablethansingletonsasthelatterreliesonthenumberofsamples
sequenced.Tomakethismeasurecomparablebetweenstudieswherethetargetedregionsmaybedifferent,
rarevariantarecountedonlywithintheminimumoverlapregionsacrossmultipleexomecapturingreagents
includingAgilentV4V5V6,NimbleGenV2V3,VCRomeV2,thewellcovered(15xin95%ofthesamples)
regionsinExAC,andthegeneregionsinrefSeqandEnsembl.vQCreportsthesenumbersbreakdownbycase
controlstatus,sothatyoucanevaluatewhetherthere'sabatcheffectbetweencasesandcontrolsandwhether
thecasesandcontrolsarecomparable.Thesenumbersarealsoreportedforeachindividualsothatitcanbe
usedforsamplewiseQCtoo(sampleqc).
ThisprogramreadstheVQSLODscoresoutputfrom“vQCoutvqs”anddeterminestheVQSLODcutofffor
SNVandInDelseparately.ThecutoffvalueisthelowerfenceforoutliersfromknownvariantsbyKimber’s
method.Ifthiscutoffisbeyondthelowerbound(lb)orupperbound(ub)ofthepresumedrange,the
programwillwriteanerrormessagetothestandardoutput,otherwiseitwillwriteaprogramoptionwiththe
cutoffvalueforsubsequentanalyses.
ThisprogrammodifiesaVCFfile.ItcanremovespecificINFOfields(remove)orremoveallINFOfieldsexcept
thespecificones(keep).AuniqueutilityofthisprogramistorestoretheoriginalPOS,REF,ALTdataafter
vAnnGenerun(restore).InfunctionalconsequenceannotationbyvAnnGene,thesefieldsaremodifiedto
reflectthebiologicalconsequenceofthevariant.Therefore,somevariantsmaybeleftaligned,whileothers
rightaligned.Thisishelpfulforacorrectdeleteriousnessannotations.Fortunately,vAnnGenealsostoresthe
originalPOS,REF,ALTdataintheINFOfield,soyoucanrestorethembythisprogramifyouwant.
3.7.vQS
[Backtotop]
3.8.vINFO
[Backtotop]
3.9.vPROV
[Backtotop]
ThisprogramannotatePROVEANforinframeinsertionsordeletionsbycallingthePROVEANprogram.It
appliesmultithreading(nt)forfastercomputation.
ThisprogramreadsHGVSnomenclaturesandtranslatethemintogenomiccoordinatesandsequencechanges.
InputisafilewithHGVSinthefirstcolumn;othercolumnswillbeignored.Outputtostdoutisanotherfilewith
7columns:HGVS,CHR,POS,HGVS,REF,ALT,INFO.Notethatthefirstandfourthcolumnsarethesamefor
historicalreasons.YoucaneasilyconvertthisoutputintoaVCFfilebythefollowingcommands:
echo '##fileformat=VCFv4.1' > HGVS_to_genomic.output.vcf
echo $'#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO' >> HGVS_to_genomic.output.vcf
grep -v ^# HGVS_to_genomic.output | cut -f 2- | sed 's/\t/\t.\t.\t/5' >> HGVS_to_genomic.output.vcf
Featuresofthisprograminclude:
1) Automaticallycheckforreferencesequenceerrors
2) Ifonlyagenesymbolisgiven,translateittotheprincipaltranscriptdefinedinAPPRIS;
3) Ifonlyagenesynonymisgiven,translateittoagenesymbolandthentheprincipaltranscript;
4) IfmultipleHGVSnomenclaturesaregiven,selecttheunambiguousone(choosec.xxxoverp.xxx);
5) SomenonstandardordeprecatedHGVSaresupported.Forexample,c.IVS15T>C,p.(Arg3381Ser).
6) IfaproteinlevelHGVShavemultiplegenomicinterpretations,selectthemostprobableonethathasa
higherallelefrequency,hasamatchingindbSNPorClinVar,oralowerdeleteriousnessscore.
Pleaseusetheoption(singular)toturnonthefeatureofchoosingthemostprobableSNVforaproteinlevel
HGVS.Forthistowork,theprogramrequiresaMaxAFdatabase(MaxAF),adbSNPdatabase,andthe
BayesDelscoredatabase(del).VICTORalreadyprovidesallthesedatabases,soyoudon'thavetosetthese
options.
4.Topics
GenotypewiseQC:
1) Removebadgenotypes.GenotypesaredeemedmissingiftheDP(readdepth)orGQ(genotypequality
conditionalonthevariantbeingtrue)scoreislessthanathreshold,whichcanbechangedbyaprogram
option,(filtdp)and(filtgq),respectively.
2) CorrectgenotypesbasedontheexpectedploidyofchrX,chrY,chrMforeachindividual(sexdependent).It
takesthepseodoautosomalregion(PAR)intoaccount.
VariantwiseQC:
1) Removebadvariants.vQCfiltervariantsbyVQSLOD,theFILTERfield,missingrate,missingrate
discrepancybetweencasesandcontrols,Mendelianerrorsinpedigreesexcludingdenovomutations,the
samedenovomutationinmultipleindividuals,theprobabilityofobservingdatagivealowallelefrequency,
HardyWeinbergdisequilibriuminindependentcontrolsand/orexternalsamples,genotypediscordances
amongduplicatedsamples,thepercentageofsamplescoveredbysequencingwithaminimumdepth,the
meandepthsmallerorgreaterthanacertainthreshold,andtheproportionofheterozygousgenotypes
amongnonmissinghaploidyindividuals.IfVQSLODisnotavailable,vQCautomaticallyapplieshardfiltering
byQD,MQ,FS,HaplotypeScore,MQRankSum,andReadPosRankSum.
2) Quantitativelyintegratequalityofvariantcalls.Qualitycontrolbyvariantfilteringmaynotbesufficient
becausevariantsattheborderlineofafilteringthresholdmaynotbecleanenough.Toalleviatethis
problem,PERCHintegratesVQSLODintoavariantweighttogetherwithadeleteriousnessscore.Therefore,
thismethodtakesabalancebetweenvariantcallqualityandbiologicalrelevance.
3) Selectregionsorvarianttypes.vQCcanfiltervariantsbychromosomalregionandvarianttype.Acommon
usageofthisfeatureistoselectvariantswithintheintersectionofcapturedregionsbetweencohorts,and/or
excludetheregionsthataretoodifficultfornextgenerationsequencing.Youcanalsorestricttheanalysisto
SNVonly.
4) WrongREF.TheannotationprogramvAnnGenecheckswhethertheREFsequencematcheswiththe
referencegenomicsequence.Variantsthatdonotmatchwillbereported.
SamplewiseQC:
1) Sex
3.10.HGVS_to_genomic
[Backtotop]
4.1.ImplementedqualityControl
[Backtotop]
2) Het/nonrefhomratio
3) Ti/Tvratio
4) Missingrate
5) MeanGQandDPacrossallvariants
6) pedigreestructureerrors(bytheprogramKING)
7) crypticrelatednessamongunrelatedindividuals(bytheprogramKING)
8) populationstratification(bytheprogramPLINK)
Thescriptslurm.all_stepsdoesn’tconductQCbasedon.bamfiles,suchascontaminationratesandsequencing
yields.ButIhaveprovidedascripttodosointheTutorial.OtherQCmethodsarenotyetimplementedbut
maybeconsideredinourfuturework,suchasmultiplerecombinationeventsononechromosomeand
homozygousorcompoundheterozygousmutationsincriticalgenes.
Youneedtoprovidethesequencingcoverageoneachbasepairofthegenomeinyourcases.Thecoveragefiles
(onechromosomeeachfile)canbegeneratedbythefollowingscriptfrom.bamfiles:
export DP_CUT=20
export PC_CUT=80
export BAM_PFX=prefix.of.bam.files
export BAM_SFX=suffix.of.bam.files
export SPL=samples.txt
export OUT=VICTOR
export GENOME=GRCh37
# first, make the list of cases that have a bam file.
cut -f 1 $SPL | sed "s/\(.*\)/\1$BAM_SFX ]; then echo \1; fi/" | sed "s;^;if [ -s $BAM_PFX;" | bash > samples.bam_available
# calculate read depth on each position for all samples
mkdir -p ${DP_CUT}x/perInd
my_func() { samtools depth -q 10 -Q 20 ${BAM_PFX}$1${BAM_SFX} | awk '($3>='${DP_CUT}')' | gzip -c > ${DP_CUT}x/perInd/$1.depth.gz; }
export -f my_func
module load parallel
parallel -a samples.bam_available my_func
# join the results.
cat samples.bam_available | sed "s;^;${DP_CUT}x/perInd\/;" | sed "s/$/.depth.gz/" > samples.to_join
vBED2 --genome=$GENOME --filelist samples.to_join --format=samtools --dp=${DP_CUT} --pc=0 --log=${DP_CUT}x/files.to_join --nt=auto >/dev/null
rm samples.to_join*
# to save time, do it by a batch of 300 to 500 samples, and set the options --filelist and --log accordingly. Below is an example:
# cat samples.bam_available | sed "s;^;${DP_CUT}x/perInd\/;" | sed "s/$/.depth.gz/" | head -300 > samples.to_join.1
# vBED2 --genome=$GENOME --filelist samples.to_join.1 --format=samtools --dp=${DP_CUT} --pc=0 --log=${DP_CUT}x/files.to_join.1 --nt=auto \
# >/dev/null
# now create the coverage files
mkdir -p ${DP_CUT}x/${PC_CUT}pc
TotSpl=`cat samples.bam_available | wc -l`
RequiredSpl=`echo "$TotSpl * 0.$PC_CUT" | bc`
RequiredSpl=`echo "($RequiredSpl+0.5)/1" | bc`
vBED2 ${DP_CUT}x/files.to_join* --genome=$GENOME --format=vBED --min-spl=$RequiredSpl --tot-spl=$TotSpl \
--log=${DP_CUT}x/${PC_CUT}pc/p_samples.txt --nt=auto > ${DP_CUT}x/${PC_CUT}pc/Covered.bed
bgzip ${DP_CUT}x/${PC_CUT}pc/p_samples.txt
tabix -p vcf ${DP_CUT}x/${PC_CUT}pc/p_samples.txt.gz
CHRS=( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X )
for CHR in ${CHRS[@]}; do
echo "#DP_CUT=$DP_CUT" > ${DP_CUT}x/${PC_CUT}pc/$OUT.chr$CHR.cov
echo "#PC_CUT=0.$PC_CUT" >> ${DP_CUT}x/${PC_CUT}pc/$OUT.chr$CHR.cov
tabix ${DP_CUT}x/${PC_CUT}pc/p_samples.txt.gz $CHR:1-300,000,000 >> ${DP_CUT}x/${PC_CUT}pc/$OUT.chr$CHR.cov
gzip -f ${DP_CUT}x/${PC_CUT}pc/$OUT.chr$CHR.cov
done
Then,inslurm.all_stpes,addthefollowingparameters.
# What to do.
ExACnT=yes
# Main Parameters.
export COV=Prefix.of.the.coverage.files.for.your.cases # if you use the above script, COV=${DP_CUT}x/${PC_CUT}pc/$OUT
export EFP=Prefix.of.the.ExAC.files # the default is ExAC1/noTCGA, it uses the provided nonTCGA data
export AWX=additional.options.for.vAAA # use this parameter to specify which ExAC population to use
IfyouwanttocomparetoaspecificpopulationratherthantheentireExAC,usethe(xctpop)optionforthe
AWXparameterintheslurm.all_stepsscript.AnexampleisAWX=”xctpop=NFE”.Nowyoucandothe
analysistocompareyourcasestoExACnonTCGAsamples.
4.2.ComparetoExAC
[Backtotop]
NotethattheannotationparametersfortheExACdatamustmatchwiththoseforthecases.TheExAC1folder
providedbyVICTORisprecomputedfromExACnonTCGAsamples,annotatedwithrefGeneLiteanddefault
parameters.Ifyouliketoannotatewithanothergenedatabase,thenyouhavetogeneratetheExACfiles.Also,
ifyouliketouseanotherExACsubset(nonPsych)orthewholeExAC,youneedtogeneratetheExACfilestoo.
Belowisthescripttomakethesefiles.
# set parameters according to ExAC subset (nonTCGA, nonPsych, all)
export ExAC_FILE=ExAC_nonTCGA.r1.sites.vep.vcf.gz # downloaded file
export DP_CUT=20 # depth cutoff
export PC_CUT=60 # percent of samples with read depth >= $DP_CUT
export TotSpl=53105 # 53105 for nonTCGA, 60702 for all
export OUT=ExAC1 # ExAC1 for ExAC r1
export PFX=noTCGA # noTCGA or noPsych or all
# 2) Prepare ExAC's coverage file $PFX.chr$CHR.cov.gz.
mkdir coverage && cd coverage
wget -r -np -nd -nH ftp://ftp.broadinstitute.org/pub/ExAC_release/release0.1/coverage/
if [ "$DP_CUT" == 30 ]; then COLUMN=11
elif [ "$DP_CUT" == 25 ]; then COLUMN=10
elif [ "$DP_CUT" == 20 ]; then COLUMN=9
elif [ "$DP_CUT" == 15 ]; then COLUMN=8
elif [ "$DP_CUT" == 10 ]; then COLUMN=7
else echo >&2 ERROR: DP_CUT is wrong.
fi
CHRS=( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X )
for CHR in ${CHRS[@]}; do
echo "#DP_CUT=$DP_CUT" > $PFX.chr$CHR.cov
echo "#PC_CUT=0.$PC_CUT" >> $PFX.chr$CHR.cov
zcat Panel.chr$CHR.coverage.txt.gz | sed 1d | awk '($'$COLUMN'>=0.'$PC_CUT')' | awk '{print $1,$2,$'$COLUMN'}' >> $PFX.chr$CHR.cov
gzip -f $PFX.chr$CHR.cov
done
mkdir -p ${PC_CUT}pc${DP_CUT}x && mv *.cov.gz ${PC_CUT}pc${DP_CUT}x
cd ..
# 3) prepare ExAC's annotation file $PFX.chr$CHR.ann.del.gz and quality control log file $PFX.chr$CHR.qc.log
export MaxAFDB=MaxAF_20170801.gz
export VAG=
export DEL=
HWE_FIELDS=AN_NFE,Het_NFE,Hom_NFE,AN_AFR,Het_AFR,Hom_AFR,AN_SAS,Het_SAS,Hom_SAS,AN_EAS,Het_EAS,Hom_EAS,AN_AMR,Het_AMR,Hom_AMR
rm -rf ${OUT} ${OUT}_log
mkdir -p ${OUT}
mkdir -p ${OUT}_log
my_func() {
tabix $ExAC_FILE $1:0-300,000,000 -h | vSPLIT |\
vAnnGene --norm-only | vAnnDel --ann=dbSNP --ms=index |\
vAnnGene --no-filter --no-split --add-info --info-func=vAnnGeneAll |\
vAnnDel --add-info --ann=$MaxAFDB --add-ms |\
vAnnBase --add-info --ann=BayesDel_nsfp33a_noAF -x=3 --min=-1.5 --step=0.01 --indel=max --padding=1 |\
vAnnDel --add-info --ann=InDels.provean.tsv.gz -f 5 --wr=PROVEAN --preload --indel-only |\
vDEL --no-sort --check-ms=${OUT}_log/${PFX}.chr$1 --info-func=vAnnGeneAll |\
vQC --info-func=vAnnGeneAll --tot-spl=$TotSpl --info-ac=AC_Adj --info-an=AN_Adj --filt-miss-rate=1 \
--filt-hwe-info=$HWE_FIELDS --filt-vqs-snv=-inf --filt-vqs-indel=-inf --log=${OUT}/${PFX}.chr$1.qc.log.gz --nt=auto-12 |\
vAnnGene --all-ann=vAnnGeneAll $VAG | gzip -c > ${OUT}_log/${PFX}.chr$1.ann.gz; }
export -f my_func
module load parallel
parallel -a ~/local/VICTOR/script_template/chr_noMnoY my_func
cat ${OUT}_log/${PFX}.chr*.for_PROV > ${OUT}_log/${PFX}.provean
### submit ${OUT}_log/${PFX}.provean to provean.jcvi.org, calculate, and save results to ${OUT}_log/${PFX}.provean.result.one.tsv
sed 1d ${OUT}_log/${PFX}.provean.result.one.tsv | cut -f 2,11 | grep -v NA | grep -v $'\t$' | tr , '\t' |\
sort -k 1,1 -k 2,2n -k 3,3 -k 4,4 | sed '1s/^/#CHROM\tPOS\tREF\tALT\tPROVEAN\n/' > ${OUT}_log/${PFX}.provean.tsv
if [ -s ${OUT}_log/${PFX}.provean.tsv ]; then
bgzip -f ${OUT}_log/${PFX}.provean.tsv
tabix -f -p vcf ${OUT}_log/${PFX}.provean.tsv.gz
my_func() {
vAnnDel ${OUT}_log/${PFX}.chr$1.ann.gz --indel-only --ann=${OUT}_log/${PFX}.provean.tsv.gz --preload -f 5 --wr=PROVEAN --add-info |\
vDEL --add-af $DEL |\
vINFO --remove CSQ,DP_HIST,GQ_HIST,AGE_HISTOGRAM_HET,AGE_HISTOGRAM_HOM |\
gzip -c > ${OUT}/${PFX}.chr$1.ann.del.gz;
cp coverage/${PC_CUT}pc${DP_CUT}x/$PFX.chr$1.cov.gz ${OUT}/${PFX}.chr$1.cov.gz;
}
export -f my_func
module load parallel
parallel -a ~/local/VICTOR/script_template/chr_noMnoY my_func
else
my_func() {
vDEL ${OUT}_log/${PFX}.chr$1.ann.gz --add-af $DEL |\
vINFO --remove CSQ,DP_HIST,GQ_HIST,AGE_HISTOGRAM_HET,AGE_HISTOGRAM_HOM |\
gzip -c > ${OUT}/${PFX}.chr$1.ann.del.gz;
cp coverage/${PC_CUT}pc${DP_CUT}x/$PFX.chr$1.cov.gz ${OUT}/${PFX}.chr$1.cov.gz;
}
export -f my_func
module load parallel
parallel -a ~/local/VICTOR/script_template/chr_noMnoY my_func
fi
BayesDelisadeleteriousnessmetascore.YoucancalculateBayesDelforaVCFfileusingtheprovidedscript
template.PleaseseetheTutorialfordetailedprocedures.Ifyouwanttoannotatejustafewnumberof
4.2.BayesDelscore
[Backtotop]
variants,youcanalsousetheonlinecalculator(seetheUSENOWpagefortheURL).Therangeofthescoreis
from1.29334to0.75731.Thehigherthescore,themorelikelythevariantispathogenic.Thereisauniversal
cutoffvalue(0.0692655withMaxAF)thatwasobtainedbymaximizingsensitivityandspecificityatthesame
time.Therearealsogenespecificcutoffvalues.PleaseseetheBayesDel_GSTfilewithinthedatafolderforthe
cutoff.Inthatfilethefirstcolumnisgenesymbol,thesecondiscutoffforBayesDelwithoutMaxAF,thethirdis
cutoffforBayesDelwithMaxAF.It’simportanttonotethatthesecutoffvaluesweredesignedforgenediscovery
research,notforclinicaloperations,whereyoumaywanttohaveadoublecutoffsystem,i.e.,
BayesDel>cutoff1islikelypathogenic,BayesDel<cutoff2islikelybenign,whileothersarevariantsofuncertain
significance.