Arima Mapping User Guide A160146 V00
User Manual:
Open the PDF directly: View PDF .
Page Count: 6
Arima-HiC Mapping Pipeline
Doc A160154 v00
Arima-HiC Mapping Pipeline
Document Part Number:
A160146 v00
Release Date:
November 2018
Workflow Co-Developed With:
Bing Ren Lab - Ludwig Institute for Cancer Research
HiC Overview
The workflow is meant to assist in mapping Hi-C paired-end reads (e.g. obtained via Arima-HiC kits or
services) to reference sequences.
The Arima-HiC kits and services utilize an experimental protocol that captures inherent three-
dimensional (3D) conformation of the genome. The Hi-C libraries are subjected to Illumina short-read
sequencing in “paired-end” mode and the resulting data is referred to as Hi-C paired-end reads. When
Hi-C paired-end reads are mapped to reference sequences, conformational information manifested in
the Hi-C procedure can be used to generate chromosome-span contiguity.
In the Arima-HiC methodology, a sample (cells or tissues) is first crosslinked to preserve the genome
conformation. The crosslinked DNA is then digested using restriction enzymes. The single-stranded 5’-
overhangs are then filled in causing digested ends to be labeled with a biotinylated nucleotide. Next,
spatially proximal digested ends of DNA are ligated, preserving both short- and long-range DNA
contiguity. The DNA is then purified and sheared to a size appropriate for Illumina short-read
sequencing. After shearing, the biotinylated fragments are enriched to assure that only fragments
originating from ligation events are sequenced in paired-end mode via Illumina sequencers to inform DNA
contiguity. See figure above.
Below, this documents describes the necessary steps to map Hi-C paired-end reads (FASTQ format) to
reference sequences.
Mapping Pipeline
In this section, we describe specific steps to map Hi-C paired-end reads to reference sequences. This
mapping procedure also includes steps to filter Hi-C reads to correct for erroneous mapping that can
confound downstream analyses. See figure above.
The mapping pipeline will output a single binary alignment map file (BAM file) that contains paired and
filtered Hi-C paired-end reads mapped to reference sequences. Below, we walk through an example of
our mapping pipeline.
Input – Illumina paired-end FASTQ
Output – BAM file
The first section of the pipeline defines the paths to the files, scripts, and output directories utilized by
our pipeline, and then creates any of those output directories that do not already exist. For the mapping
pipeline, you will need the software BWA, SAMtools, and Picard Tools installed on your system. You also
need the scripts “filter_five_end.pl” and “two_read_bam_combiner.pl” that are provided. Please
change the file paths, file names, and label names as appropriate. Note that the $REF and $FAIDX
variables correspond to your reference sequence FASTA file and indexed reference sequence FASTA file,
respectively.
#! /bin/bash
SRA=’basename_of_fastq_files’
LABEL=’overall_exp_name’
BWA=’/software/bwa/bwa-0.7.12/bwa’
SAMTOOLS=’/software/samtools/samtools-1.3.1/samtools’
IN_DIR=’/path/to/gzipped/fastq/files’
REF=’/path/to/reference_sequences/reference_sequeneces.fa’
FAIDX=’$REF.fai’
RAW_DIR=’/path/to/write/out/bams’
FILT_DIR=’/path/to/write/out/filtered/bams’
FILTER=’/path/to/filter_five_end.pl’
COMBINER=’/path/to/two_read_bam_combiner.pl’
STATS=’/path/to/get_stats.pl’
PICARD=’/software/picard/picard-2.6.0/build/libs/picard.jar’
TMP_DIR=’/path/to/write/out/temporary/files’
PAIR_DIR=’/path/to/write/out/paired/bams’
REP_DIR=’/path/to/where/you/want/deduplicated/files’
REP_LABEL=$LABEL\_rep1
MERGE_DIR=’/path/to/final/merged/alignments/from/any/biological/replicates’
MAPQ_FILTER=10
echo "### Step 0: Check output directories exist & create them as needed"
[ -d $RAW_DIR ] || mkdir -p $RAW_DIR
[ -d $FILT_DIR ] || mkdir -p $FILT_DIR
[ -d $TMP_DIR ] || mkdir -p $TMP_DIR
[ -d $PAIR_DIR ] || mkdir -p $PAIR_DIR
[ -d $REP_DIR ] || mkdir -p $REP_DIR
[ -d $MERGE_DIR ] || mkdir -p $MERGE_DIR
Next, we use BWA-MEM to align the Hi-C paired-end reads to reference sequences. Because Hi-C
captures conformation via proximity-ligated fragments, paired-end reads are first mapped independently
(as single-ends) using BWA-MEM and are subsequently paired in a later step.
echo "### Step 1.A: FASTQ to BAM (1st)"
$BWA mem -t 12 -B 8 $REF $IN_DIR/$SRA\_1.fastq.gz | $SAMTOOLS view -Sb - >
$RAW_DIR/$SRA\_1.bam
echo "### Step 1.B: FASTQ to BAM (2nd)"
$BWA mem -t 12 -B 8 $REF $IN_DIR/$SRA\_2.fastq.gz | $SAMTOOLS view -Sb - >
$RAW_DIR/$SRA\_2.bam
Subsequent to mapping as single-ends, some of these single-end mapped reads can manifest a ligation
junction and are therefore considered “chimeric” (i.e. they do not originate from a contiguous piece of
DNA). When BWA-MEM maps these chimeric reads, there can be high quality mapping on both the 5’-
side and 3’-side of the ligation junction within a given read. In such cases, only the 5’-side should be
retained because the 3’- side can originate from the same contiguous DNA as the 5’-side of the reads
mate-pair. Therefore, we retain only the portion of the chimeric read that maps in the 5’-orientation in
relation to its read orientation. This is accomplished using the script “filter five end.pl.”
After filtering, we pair the filtered single-end Hi-C reads using “two_read_bam_combiner.pl,” which
outputs a sorted, mapping quality filtered, paired-end BAM file. We then add read groups to this BAM
file using Picard Tools.
We also use Picard Tools to discard any PCR duplicates present in the paired-end BAM file generated
above. If applicable, we require that you merge paired-end BAM files that were sequenced via multiple
Illumina lanes from the same library (i.e. technical replicates) before removing PCR duplicates. Below is
example code for how to accomplish this merging step.
###############################################################################
# How to Accommodate Technical Replicates
# This pipeline is currently built for processing a single sample with
a read1 and read2 fastq
file.
# Technical replicates (eg. one library split across multiple lanes) should
# be merged before running the MarkDuplicates command.
# If this step is run, the names and locations of input files to subsequent
# steps will need to be modified in order for subsequent steps to runcorrectly.
# The code below is an example of how to merge technical replicates.
REP_NUM=X
#number of the technical replicate set e.g. 1
REP_LABEL=$LABEL\_rep$REP_NUM
INPUTS_TECH_REPS=(’bash’ ’array’ ’of’ ’bams’ ’from’ ’replicates’)
#BAM files
you want
combined
as technical replicates
#example bash array -
#INPUTS_TECH_REPS=(’INPUT=A.L1.bam’ ’INPUT=A.L2.bam’ ’INPUT=A.L3.bam’)
java -Xms4G -Xmx4G -jar $PICARD MergeSamFiles $INPUTS_TECH_REPS
OUTPUT=$TMP_DIR/$REP_LABEL.bam USE_THREADING=TRUE ASSUME_SORTED=TRUE
VALIDATION_STRINGENCY=LENIENT
echo "### Step 2.A: Filter 5’ end (1st)"
$SAMTOOLS view -h $RAW_DIR/$SRA\_1.bam | perl $FILTER | $SAMTOOLS view -Sb -
> $FILT_DIR/$SRA\_1.bam
echo "### Step 2.B: Filter 5’ end (2nd)"
$SAMTOOLS view -h $RAW_DIR/$SRA\_2.bam | perl $FILTER | $SAMTOOLS view -Sb -
> $FILT_DIR/$SRA\_2.bam
echo "### Step 3A: Pair reads & mapping quality filter"
perl $COMBINER $FILT_DIR/$SRA\_1.bam $FILT_DIR/$SRA\_2.bam $SAMTOOLS
$MAPQ_FILTER | $SAMTOOLS view -bS -t $FAIDX - | $SAMTOOLS sort -o
$TMP_DIR/$SRA.bam -
echo "### Step 3.B: Add read group"
java -Xmx2g -jar $PICARD AddOrReplaceReadGroups INPUT=$TMP_DIR/$SRA.bam
OUTPUT=$PAIR_DIR/$SRA.bam ID=$SRA LB=$SRA SM=$LABEL PL=ILLUMINA PU=none
Note, that if you perform merging of technical replicates above, then the file names and locations will
change from the written flow of this pipeline. You will need to adjust the file names and locations that are
used as input in the following step - PCR duplicate removal.
At this point in the pipeline, if you have two or more libraries prepared from the same sample (i.e.
biological replicates), the biological replicate paired-end BAM files should be merged prior to
subsequent analyses. Below is example code for how to accomplish this merging step. If you do not
have biological replicates, then the pipeline is complete. The final output is a single BAM file that contains
the paired, 5’-filtered, and duplicate-removed Hi-C reads mapped to the reference sequences of choice.
The resulting statistics file has a breakdown of the total number of intra-contig read-pairs, long-range
intra-contig read-pairs, and inter-contig read-pairs in the final processed BAM file.
###############################################################################
### How to Accommodate Biological Replicates
### This pipeline is currently built for processing a single sample
### with one read1 and read2 fastq
file.
### Biological replicates (eg. multiple libraries made from the same
### sample) should be merged before proceeding with subsequent steps.
### The code below is an example of how to merge biological replicates.
###############################################################################
INPUTS_BIOLOGICAL_REPS=(’bash’ ’array’ ’of’ ’bams’ ’from’ ’replicates’)
#BAM
files
you
want combined
as biological replicates
###example bash array -
#INPUTS_BIOLOGICAL_REPS=(’INPUT=A_rep1.bam’ ’INPUT=A_rep2.bam’
’INPUT=A_rep3.bam’)
java -Xms4G -Xmx4G -jar $PICARD MergeSamFiles $INPUTS_BIOLOGICAL_REPS
OUTPUT=$MERGE_DIR/$LABEL.bam USE_THREADING=TRUE ASSUME_SORTED=TRUE
VALIDATION_STRINGENCY=LENIENT
$SAMTOOLS index $MERGE_DIR/$LABEL.bam
perl $STATS $MERGE_DIR/$LABEL.bam > $MERGE_DIR/$LABEL.bam.stats
echo "Finished Mapping Pipeline through merging Biological Replicates"
The final output of this pipeline is a single BAM file that contains the paired, 5’-filtered, and duplicate-
removed Hi-C reads mapped to the reference sequences of choice. The resulting statistics file has a
breakdown of the total number of intra-contig read-pairs, long-range intra-contig read-pairs, and
inter-contig read-pairs in the final processed BAM file.
echo "### Step 4: Mark duplicates"
java -Xms24G -XX:-UseGCOverheadLimit -Xmx24G -jar $PICARD MarkDuplicates
INPUT=$PAIR_DIR/$SRA.bam OUTPUT=$REP_DIR/$REP_LABEL.bam
METRICS_FILE=$REP_DIR/metrics.$REP_LABEL.txt TMP_DIR=$TMP_DIR
ASSUME_SORTED=TRUE
VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=TRUE
$SAMTOOLS index $REP_DIR/$REP_LABEL.bam
perl $STATS $REP_DIR/$REP_LABEL.bam > $REP_DIR/$REP_LABEL.bam.stats
echo "Finished Mapping Pipeline through Duplicate Removal"