Page Count: 654


The Staden Package Manual
Last update on 25 April 2016

James Bonfield, Kathryn Beal, Mark Jordan,
Yaping Cheng and Rodger Staden

Copyright © 1999-2002, Medical Research Council, Laboratory of Molecular Biology. Made
available under the standard BSD licence.
Copyright © 2002-2006, Genome Research Limited (GRL). Made available under the standard BSD licence.
Portions of this code are derived from a modified Primer3 library. This bears the following
copyright notice:
Copyright © 1996,1997,1998 Whitehead Institute for Biomedical Research. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions must reproduce the above copyright notice, this list of conditions and
the following disclaimer in the documentation and/or other materials provided with the
distribution. Redistributions of source code must also reproduce this information in the
source code itself.
2. If the program is modified, redistributions must include a notice (in the same places as
above) indicating that the redistributed program is not identical to the version distributed
by Whitehead Institute.
3. All advertising materials mentioning features or use of this software must display the
following acknowledgment: This product includes software developed by the Whitehead
Institute for Biomedical Research.
4. The name of the Whitehead Institute may not be used to endorse or promote products
derived from this software without specific prior written permission.
We also request that use of this software be cited in publications as
Steve Rozen, Helen J. Skaletsky (1996,1997,1998) Primer3. Code available at http://www-genome.wi.mit.edu/genome_software/other/primer3.html
THIS SOFTWARE IS PROVIDED BY THE WHITEHEAD INSTITUTE “AS IS” AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE WHITEHEAD INSTITUTE BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Permission is given to duplicate this manual in both paper and electronic forms.

Short Contents
1 Next generation assembly editing with Gap5 . . . . . . . . . . . . . . . 3
2 Sequence assembly and finishing using Gap4 . . . . . . . . . . . . . . 95
3 Searching for point mutations using pregap4 and gap4 . . . . . 309
4 Preparing readings for assembly using pregap4 . . . . . . . . . . . . 325
5 Marking poor quality and vector segments of readings . . . . . . 399
6 Screening Against Vector Sequences . . . . . . . . . . . . . . . . . . . . 401
7 Screening Readings for Contaminant Sequences . . . . . . . . . . . 413
8 Viewing and editing trace data using trev . . . . . . . . . . . . . . . . 417
9 Analysing and comparing sequences using spin . . . . . . . . . . . . 429
10 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
11 File Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
12 Man Pages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 611
General Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
File Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625
Variable Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627
Function Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629


Table of Contents
Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1 Next generation assembly editing with Gap5 . . . . . . . . . . 3
1.1 Gap5 Databases . . . . . . . . . . 4
1.1.1 Creating databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.2 Opening/closing databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.3 Changing directories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.4 Check Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Contig Selector / Comparator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Contig Selector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1.1 Selecting Contigs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1.2 Changing the Contig Order . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.1.3 The Contig Selector Menus . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.2 Contig Comparator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.2.1 Examining Results and Using Them to Select Commands . . . . . . . . . . 11
1.2.2.2 Automatic Match Navigation . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3 Template Display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3.1 Filtering data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3.2 Template plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3.2.1 Controlling The Y Layout.. . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3.3 Depth / Coverage Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.4 Editing in Gap5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.4.1 Moving the visible segment of the contig . . . . . . . . . . . . . . . . . . 22
1.4.2 Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.4.3 Editing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.4.3.1 Moving the editing cursor . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.4.3.2 Adjusting the Quality Values . . . . . . . . . . . . . . . . . . . . . . . . 26
1.4.3.3 Adjusting the alignment coordinates . . . . . . . . . . . . . . . . . 27
1.4.3.4 Adjusting the Cutoff Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.4.3.5 Summary of Editing Commands . . . . . . . . . . . . . . . . . . . . . 27
1.4.4 Cut and Paste Control of Sequence . . . . . . . . . . . . . . . . . . . . . . . 28
1.4.5 Selecting Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
1.4.6 Annotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
1.4.6.1 Annotation Macros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
1.4.7 Searching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
1.4.7.1 Search by Annotation Comments . . . . . . . . . . . . . . . . . . . . 32
1.4.7.2 Search by Tag Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.4.7.3 Search by Padded Position . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.4.7.4 Search by Unpadded Position . . . . . . . . . . . . . . . . . . . . . . . . 33
1.4.7.5 Search by Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.4.7.6 Search by Reading Name . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.4.7.7 Search by Reference InDel . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

1.4.7.8 Search by Consensus Quality . . . . . . . . . . 34
1.4.7.9 Search by Consensus Discrepancy . . . . . . . . . . 34
1.4.7.10 Search by Consensus Heterozygosity . . . . . . . . . . 34
1.4.7.11 Search by Low Coverage . . . . . . . . . . 34
1.4.7.12 Search by High Coverage . . . . . . . . . . 34
1.4.8 The Settings Menu . . . . . . . . . . 34
1.4.8.1 Group Readings . . . . . . . . . . 35
1.4.8.2 Highlight Disagreements . . . . . . . . . . 36
1.4.8.3 Pack Sequences . . . . . . . . . . 36
1.4.8.4 Hide Annotations . . . . . . . . . . 36
1.4.9 Primer Selection . . . . . . . . . . 36
1.4.10 Traces . . . . . . . . . . 38
1.4.11 The Editor Information Line . . . . . . . . . . 40
1.4.11.1 Reading Information . . . . . . . . . . 41
1.4.11.2 Contig Information . . . . . . . . . . 42
1.4.11.3 Tag Information . . . . . . . . . . 42
1.4.12 The Join Editor . . . . . . . . . . 43
1.4.13 Using Several Editors at Once . . . . . . . . . . 44
1.4.14 Quitting the Editor . . . . . . . . . . 44
1.4.15 Summary . . . . . . . . . . 44
1.4.15.1 Keyboard summary for editing window . . . . . . . . . . 44
1.4.15.2 Mouse summary for editing window . . . . . . . . . . 45
1.4.15.3 Mouse summary for names window . . . . . . . . . . 46
1.4.16 Plotting Restriction Enzymes . . . . . . . . . . 47
1.4.16.1 Selecting Enzymes . . . . . . . . . . 47
1.4.16.2 Examining the Plot . . . . . . . . . . 48
1.4.16.3 Reconfiguring the Plot . . . . . . . . . . 48
1.4.16.4 Textual Outputs . . . . . . . . . . 48
1.5 Importing and Exporting Data . . . . . . . . . . 50
1.5.1 Assembly . . . . . . . . . . 50
1.5.1.1 Importing with tg_index . . . . . . . . . . 50
1.5.1.2 Importing fasta/fastq files . . . . . . . . . . 52
1.5.1.3 Mapped assembly by bwa aln . . . . . . . . . . 52
1.5.1.4 Mapped assembly by bwa dbwtsw . . . . . . . . . . 52
1.5.2 Importing GFF . . . . . . . . . . 54
1.5.3 Export Tags . . . . . . . . . . 54
1.5.4 Export Sequences . . . . . . . . . . 55
1.6 Finding Sequence Matches . . . . . . . . . . 56
1.6.1 Find Internal Joins . . . . . . . . . . 56
1.6.1.1 Find Internal Joins Dialogue . . . . . . . . . . 59
1.6.2 Find Repeats . . . . . . . . . . 62
1.6.3 Find Read Pairs . . . . . . . . . . 64
1.6.3.1 Find Read Pairs Graphical Output . . . . . . . . . . 64
1.6.4 Sequence Search . . . . . . . . . . 67
1.7 Checking Assemblies and Removing Readings . . . . . . . . . . 69
1.7.0.1 Checking Assemblies . . . . . . . . . . 70
1.7.1 Removing Readings and Breaking Contigs . . . . . . . . . . 72
1.7.1.1 Breaking Contigs . . . . . . . . . . 73
1.7.1.2 Disassembling Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
1.7.1.3 Delete Contigs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
1.8 Tidying up alignments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
1.8.1 Shuffle Pads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
1.8.2 Remove Pad Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
1.8.3 Remove Contig Holes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
1.9 Calculating Consensus Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
1.9.1 Normal Consensus Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
1.9.2 The Consensus Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
1.9.2.1 Consensus Calculation Using Base Frequencies . . . . . . . 83
1.9.2.2 Consensus Calculation Using Weighted Base Frequencies . . . . . . . . . . 83
1.9.2.3 Consensus Calculation Using Confidence values . . . . . . 84
1.9.2.4 The Quality Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
1.9.3 List Consensus Confidence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
1.9.4 List Base Confidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
1.10 Other Miscellany . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
1.10.1 List Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
1.10.2 Results Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
1.10.3 Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
1.10.3.1 Special List Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
1.10.3.2 Basic List Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
1.10.3.3 Contigs To Readings Command . . . . . . . . . . . . . . . . . . . . 93
1.10.3.4 Search Sequence Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

2 Sequence assembly and finishing using Gap4 . . . . . . . . . . 95
2.1 Organisation of the gap4 Manual . . . . . . . . . . 95
2.2 Introduction . . . . . . . . . . 95
2.2.1 Summary of the Files used and the Preprocessing Steps . . . 97
2.2.2 Summary of Gap4’s Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
2.2.3 Introduction to the gap4 User Interface . . . . . . . . . . . . . . . . . . 101
2.2.3.1 Introduction to the Contig Selector . . . . . . . . . . . . . . . . . 102
2.2.3.2 Introduction to the Contig Comparator . . . . . . . . . . . . . 103
2.2.3.3 Introduction to the Template Display . . . . . . . . . . . . . . . 105
2.2.3.4 Introduction to the Consistency Display . . . . . . . . . . . . 107
2.2.3.5 Introduction to the Restriction Enzyme Map . . . . . . . 109
2.2.3.6 Introduction to the Stop Codon Map . . . . . . . . . . . . . . . 110
2.2.3.7 Introduction to the Contig Editor . . . . . . . . . . . . . . . . . . 111
2.2.3.8 Introduction to the Contig Joining Editor. . . . . . . . . . . 114
2.2.4 Gap4 Menus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
2.2.4.1 Gap4 File menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
2.2.4.2 Gap4 Edit menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
2.2.4.3 Gap4 View menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
2.2.4.4 Gap4 Options menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
2.2.4.5 Gap4 Experiments menu . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
2.2.4.6 Gap4 Lists menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
2.2.4.7 Gap4 Assembly menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

2.2.5 The use of numerical estimates of base calling accuracy . . . . . . . . . . 118
2.2.6 Use of the "hidden" poor quality data . . . . . . . . . . 120
2.2.7 Annotating and masking readings and contigs . . . . . . . . . . 121
2.2.7.1 Standard tag types . . . . . . . . . . 121
2.2.7.2 Active tags and masking . . . . . . . . . . 121
2.3 Contig Selector . . . . . . . . . . 123
2.3.1 Selecting Contigs . . . . . . . . . . 123
2.3.2 Changing the Contig Order . . . . . . . . . . 125
2.3.3 The Contig Selector Menus . . . . . . . . . . 125
2.4 Contig Comparator . . . . . . . . . . 126
2.4.1 Examining Results and Using Them to Select Commands . . . . . . . . . . 127
2.4.2 Automatic Match Navigation . . . . . . . . . . 128
2.5 Contig Overviews . . . . . . . . . . 130
2.5.1 Template Display . . . . . . . . . . 130
2.5.1.1 Reading and Template Plot . . . . . . . . . . 132
2.5.1.2 Reading and Template Plot Display . . . . . . . . . . 132
2.5.1.3 Reading and Template Plot Options . . . . . . . . . . 135
2.5.1.4 Reading and Template Plot Operations . . . . . . . . . . 136
2.5.1.5 Quality Plot . . . . . . . . . . 137
2.5.1.6 Restriction Enzyme Plot . . . . . . . . . . 139
2.5.2 Consistency Display . . . . . . . . . . 140
2.5.2.1 Confidence Values Graph . . . . . . . . . . 142
2.5.2.2 Reading Coverage Histogram . . . . . . . . . . 142
2.5.2.3 Read-Pair Coverage Histogram . . . . . . . . . . 143
2.5.2.4 Strand Coverage . . . . . . . . . . 144
2.5.2.5 2nd-Highest Confidence . . . . . . . . . . 145
2.5.2.6 Diploid Graph . . . . . . . . . . 147
2.5.3 SNP Candidates . . . . . . . . . . 147
2.5.4 Plotting Consensus Quality . . . . . . . . . . 153
2.5.4.1 Examining the Quality Plot . . . . . . . . . . 153
2.5.5 Plotting Stop Codons . . . . . . . . . . 156
2.5.5.1 Examining the Plot . . . . . . . . . . 156
2.5.5.2 Updating the Plot . . . . . . . . . . 156
2.5.6 Plotting Restriction Enzymes . . . . . . . . . . 157
2.5.6.1 Selecting Enzymes . . . . . . . . . . 157
2.5.6.2 Examining the Plot . . . . . . . . . . 158
2.5.6.3 Reconfiguring the Plot . . . . . . . . . . 158
2.5.6.4 Creating Tags for Cut Sites . . . . . . . . . . 158
2.5.6.5 Textual Outputs . . . . . . . . . . 158
2.6 Editing in Gap4 . . . . . . . . . . 160
2.6.1 Moving the visible segment of the contig . . . . . . . . . . 162
2.6.2 Names . . . . . . . . . . 163
2.6.3 Editing . . . . . . . . . . 165
2.6.3.1 Moving the editing cursor . . . . . . . . . . 165
2.6.3.2 Editing Modes . . . . . . . . . . 166
2.6.3.3 Adjusting the Quality Values . . . . . . . . . . 169
2.6.3.4 Adjusting the Cutoff Data . . . . . . . . . . 169
2.6.3.5 Summary of Editing Commands . . . . . . . . . . 169
2.6.4 Selections . . . . . . . . . . 170
2.6.5 Annotations . . . . . . . . . . 171
2.6.6 Searching . . . . . . . . . . 174
2.6.6.1 Search by Position . . . . . . . . . . 174
2.6.6.2 Search by Problem . . . . . . . . . . 175
2.6.6.3 Search by Annotation Comments . . . . . . . . . . 175
2.6.6.4 Search by Tag Type . . . . . . . . . . 175
2.6.6.5 Search by Sequence . . . . . . . . . . 175
2.6.6.6 Search by Quality . . . . . . . . . . 175
2.6.6.7 Search by Consensus Quality . . . . . . . . . . 175
2.6.6.8 Search by file . . . . . . . . . . 176
2.6.6.9 Search by Reading Name . . . . . . . . . . 176
2.6.6.10 Search by Edit . . . . . . . . . . 176
2.6.6.11 Search by Evidence for Edit (1) . . . . . . . . . . 176
2.6.6.12 Search by Evidence for Edit (2) . . . . . . . . . . 176
2.6.6.13 Search by Discrepancies . . . . . . . . . . 177
2.6.6.14 Search by Consensus Discrepancies . . . . . . . . . . 177
2.6.7 The Commands Menu . . . . . . . . . . 177
2.6.7.1 Search . . . . . . . . . . 177
2.6.7.2 Create Tag . . . . . . . . . . 177
2.6.7.3 Edit Tag . . . . . . . . . . 177
2.6.7.4 Delete Tag . . . . . . . . . . 177
2.6.7.5 Save Contig . . . . . . . . . . 177
2.6.7.6 Dump Contig to File . . . . . . . . . . 178
2.6.7.7 Save Consensus Trace . . . . . . . . . . 178
2.6.7.8 List Confidence . . . . . . . . . . 178
2.6.7.9 Report Mutations . . . . . . . . . . 178
2.6.7.10 Select Primer . . . . . . . . . . 179
2.6.7.11 Align . . . . . . . . . . 179
2.6.7.12 Remove Reading . . . . . . . . . . 179
2.6.7.13 Break Contig . . . . . . . . . . 179
2.6.8 The Settings Menu . . . . . . . . . . 179
2.6.8.1 Status Line . . . . . . . . . . 180
2.6.8.2 Trace Display . . . . . . . . . . 181
2.6.8.3 Auto-display Traces . . . . . . . . . . 181
2.6.8.4 Show Read-pair Traces . . . . . . . . . . 182
2.6.8.5 Auto-diff Traces . . . . . . . . . . 182
2.6.8.6 Y scale differences . . . . . . . . . . 183
2.6.8.7 Consensus Algorithm . . . . . . . . . . 183
2.6.8.8 Group Readings . . . . . . . . . . 183
2.6.8.9 Highlight Disagreements . . . . . . . . . . 184
2.6.8.10 Compare Strands . . . . . . . . . . 184
2.6.8.11 Toggle auto-save . . . . . . . . . . 184
2.6.8.12 3 Character Amino Acids . . . . . . . . . . 184
2.6.8.13 Show Reading and Consensus Quality . . . . . . . . . . 184
2.6.8.14 Show edits . . . . . . . . . . 185
2.6.8.15 Show Unpadded Positions . . . . . . . . . . 185

2.6.8.16 Show Template Names . . . . . . . . . . 185
2.6.8.17 Set Active Tags . . . . . . . . . . 186
2.6.8.18 Set Output List . . . . . . . . . . 186
2.6.8.19 Set Default Confidences . . . . . . . . . . 186
2.6.8.20 Set or unset saving of undo . . . . . . . . . . 186
2.6.9 Removing Readings . . . . . . . . . . 186
2.6.10 Primer Selection . . . . . . . . . . 187
2.6.10.1 Parameters . . . . . . . . . . 188
2.6.10.2 Template selection . . . . . . . . . . 188
2.6.11 Traces . . . . . . . . . . 188
2.6.12 Reference Sequence and Traces . . . . . . . . . . 191
2.6.12.1 Reference sequences . . . . . . . . . . 191
2.6.12.2 Reference traces . . . . . . . . . . 191
2.6.13 Template Status Codes . . . . . . . . . . 192
2.6.14 The Editor Information Line . . . . . . . . . . 193
2.6.14.1 Reading Information . . . . . . . . . . 194
2.6.14.2 Contig Information . . . . . . . . . . 195
2.6.14.3 Tag Information . . . . . . . . . . 195
2.6.14.4 Base Information . . . . . . . . . . 195
2.6.15 The Join Editor . . . . . . . . . . 196
2.6.16 Using Several Editors at Once . . . . . . . . . . 197
2.6.17 Quitting the Editor . . . . . . . . . . 197
2.6.18 Editing Techniques . . . . . . . . . . 197
2.6.18.1 Consensus and Quality Cutoffs . . . . . . . . . . 198
2.6.18.2 Editing by Base Change or Confidence . . . . . . . . . . 199
2.6.18.3 Base Overcalls . . . . . . . . . . 199
2.6.18.4 Base Undercalls . . . . . . . . . . 200
2.6.18.5 Multiple Base Disagreements . . . . . . . . . . 200
2.6.18.6 Poor Quality . . . . . . . . . . 201
2.6.18.7 Checking for Errors . . . . . . . . . . 201
2.6.19 Summary . . . . . . . . . . 202
2.6.19.1 Keyboard summary for editing window . . . . . . . . . . 202
2.6.19.2 Mouse summary for editing window . . . . . . . . . . 203
2.6.19.3 Mouse summary for names window . . . . . . . . . . 203
2.6.19.4 Mouse summary for scrollbar . . . . . . . . . . 204
2.7 Assembling and Adding Readings to a Database . . . . . . . . . . 205
2.7.1 Normal Shotgun Assembly . . . . . . . . . . 205
2.7.1.1 Assemble Independently . . . . . . . . . . 209
2.7.1.2 Assemble Into Single Stranded Regions . . . . . . . . . . 209
2.7.1.3 Stack Readings . . . . . . . . . . 210
2.7.1.4 Put All Readings In Separate Contigs . . . . . . . . . . 211
2.7.2 Directed Assembly . . . . . . . . . . 211
2.7.3 Screen Only . . . . . . . . . . 213
2.7.4 General Comments and Tips on Assembly . . . . . . . . . . 215
2.7.5 Assembly Failure Codes . . . . . . . . . . 216
2.8 Ordering and Joining Contigs . . . . . . . . . . 217
2.8.1 Order contigs . . . . . . . . . . 219
2.8.2 Find Read Pairs . . . . . . . . . . 222
2.8.2.1 Find Read Pairs Graphical Output . . . . . . . . . . . . . . . . .
2.8.2.2 Find Read Pairs Text Output . . . . . . . . . . . . . . . . . . . . . .
2.8.2.3 The Template Lines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.8.2.4 The Reading Lines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.8.3 Find Internal Joins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.8.3.1 Find Internal Joins Dialogue. . . . . . . . . . . . . . . . . . . . . . . .
2.8.4 Find Repeats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.9 Checking Assemblies and Removing Readings . . . . . . . . . . . . . . . .
2.9.0.1 Checking Assemblies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.9.1 Removing Readings and Breaking Contigs . . . . . . . . . . . . . . .
2.9.1.1 Breaking Contigs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.9.1.2 Disassembling Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.10 Finishing Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.10.1 Double Stranding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.10.2 Suggest Primers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.10.3 Suggest Long Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.10.4 Compressions and Stops. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.10.5 Suggest Probes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.11 Calculating Consensus Sequences. . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.11.1 Normal Consensus Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.11.2 Extended Consensus Output . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.11.3 Unfinished Consensus Output . . . . . . . . . . . . . . . . . . . . . . . . . .
2.11.4 Quality Consensus Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.11.5 The Consensus Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.11.5.1 Consensus Calculation Using Base Frequencies . . . . .
2.11.5.2 Consensus Calculation Using Weighted Base Frequencies . . . . . . . . . .
2.11.5.3 Consensus Calculation Using Confidence values . . . .
2.11.5.4 The Quality Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.11.6 List Consensus Confidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.11.7 List Base Confidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.12 Miscellaneous functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.12.1 Complement a Contig . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.12.2 Enter Tags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.12.3 Shuffle Pads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.12.4 Show Relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.12.5 Contig Navigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.12.6 Sequence Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.12.7 Extract Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.12.8 Automatic Clipping by Quality and Sequence Similarity . . . . . . . . . .
2.12.8.1 Difference Clipping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.12.8.2 Quality Clipping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.12.8.3 Quality Clip Ends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.12.8.4 N-Base Clipping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.13 Results Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.14 Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.14.1 Special List Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

2.14.2 Basic List Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.14.3 Contigs To Readings Command . . . . . . . . . . . . . . . . . . . . . . . .
2.14.4 Minimal Coverage Command . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.14.5 Unattached Readings Command. . . . . . . . . . . . . . . . . . . . . . . .
2.14.6 Highlight Readings List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.14.7 Search Sequence Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.14.8 Search Template Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.14.9 Search Annotation Contents. . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.15 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.15.1 Selecting Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.15.2 Editing Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.15.3 Special Note Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.16 Gap4 Database Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.16.1 Directories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.16.2 Opening a New Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.16.3 Opening an Existing Database . . . . . . . . . . . . . . . . . . . . . . . . .
2.16.4 Making Backups of Databases . . . . . . . . . . . . . . . . . . . . . . . . . .
2.16.5 Reading and Contig Names and Numbers . . . . . . . . . . . . . .
2.17 Copy Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.17.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.17.1.1 Copy Reads Dialogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.18 Check Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.18.1 Database Checks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.18.2 Contig Checks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.18.3 Reading Checks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.18.4 Annotation Checks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.18.5 Note Checks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.18.6 Template Checks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.18.7 Vector Checks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.18.8 Clone Checks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.19 Doctor Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.19.1 Structures Menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.19.1.1 Database Structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.19.1.2 Reading Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.19.1.3 Contig Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.19.1.4 Annotation Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.19.1.5 Template Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.19.1.6 Original Clone Structure . . . . . . . . . . . . . . . . . . . . . . . . . .
2.19.1.7 Note Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.19.2 Ignoring Check Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.19.3 Extending Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.19.4 Listing and Removing Annotations . . . . . . . . . . . . . . . . . . . . .
2.19.5 Shift Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.19.6 Delete Contig . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.19.7 Reset Contig Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.20 Configuring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.20.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.20.2 Consensus Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

2.20.3 Set Maxseq/Maxdb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.20.4 Set Fonts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.20.5 Configuring Menus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.20.6 Set Genetic Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.20.7 Alignment Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.20.8 Trace File Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.20.9 The Tag Selector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.20.10 The GTAGDB File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.20.11 Template Status . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.21 Command Line Arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Searching for point mutations using pregap4 and gap4 . . . . . . . . . . . . . . . . . . . .
3.1 Introduction to mutation detection . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1.1 Mutation Detection Programs . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1.2 Mutation Detection Reference Data . . . . . . . . . . . . . . . . . . . . .
3.1.3 Reference Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1.4 Reference Traces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1.5 Using The Template Display With Mutation Data . . . . . . .
3.1.6 Configuring The Gap4 Editor For Mutation Data . . . . . . . .
3.1.7 Using The Gap4 Editor With Mutation Data . . . . . . . . . . . .
3.1.8 Processing Batches Of Mutation Data Trace Files . . . . . . . .
3.1.9 Processing Batches Of Mutation Data Trace Files Using Pregap4 . . . . . . . .
3.1.10 Configuration Of Pregap4 For Mutation Data . . . . . . . . . .
3.1.11 Discussion Of Mutation Data Processing Methods . . . . . .

Preparing readings for assembly using pregap4 . . . . . . . . . . . . . . . . . . . .
4.1 Organisation of the Pregap4 Manual . . . . . . . . . . . . . . . . . . . . . . . . .
4.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2.1 Summary of the Files used and the Processing Steps . . . . .
4.2.2 Introduction to the Pregap4 User Interface . . . . . . . . . . . . . .
4.2.2.1 Introduction to the Files to Process Window . . . . . . . .
4.2.2.2 Introduction to the Configure Modules Window . . . . .
4.2.2.3 Introduction to the Textual Output Window . . . . . . . .
4.2.2.4 Introduction to Running Pregap4 . . . . . . . . . . . . . . . . . . .
4.2.3 Pregap4 Menus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2.3.1 Pregap4 File menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2.3.2 Pregap4 Modules menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2.3.3 Pregap4 Information source menu . . . . . . . . . . . . . . . . . .
4.2.3.4 Pregap4 Options menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.3 Specifying Files to Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.4 Running Pregap4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.5 Configuring the Pregap4 User Interface . . . . . . . . . . . . . . . . . . . . . . .
4.5.1 Fonts and Colours . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.5.2 Window Styles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4.6 Configuring Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.1 General Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.2 Estimate Base Accuracies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.3 Phred . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.4 ATQA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.5 Trace Format Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.6 Initialise Experiment Files. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.7 Augment Experiment Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.8 Quality Clip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.9 Sequencing Vector Clip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.10 Cross match . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.11 Cloning Vector Clip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.12 Screen for Unclipped Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.13 Screen Sequences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.14 Blast Screen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.15 Interactive Clipping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.16 Extract Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.17 RepeatMasker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.18 Tag Repeats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.19 Mutation Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.20 Reference Traces and Reference Sequences . . . . . . . . . . . . . .
4.6.21 Trace Difference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.22 Mutation Scanner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.23 Gap4 Shotgun Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.24 Cap2 Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.25 Cap3 Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.26 FakII Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.27 Phrap Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.28 Enter Assembly into Gap4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.29 Email . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6.30 Old Cloning Vector Clip - Obsolete . . . . . . . . . . . . . . . . . . . . .
4.6.31 ALF/ABI to SCF Conversion - Obsolete . . . . . . . . . . . . . . .
4.7 Using Config Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.8 Pregap4 Naming Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.8.1 Mutation Detection Naming Scheme . . . . . . . . . . . . . . . . . . . . .
4.8.2 Old Sanger Centre Naming Scheme . . . . . . . . . . . . . . . . . . . . . .
4.8.3 New Sanger Centre Naming Scheme . . . . . . . . . . . . . . . . . . . . .
4.8.4 Writing Your Own Naming Schemes . . . . . . . . . . . . . . . . . . . . .
4.9 Pregap4 Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.10 Information Sources. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.10.1 Simple Text Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.10.2 Experiment File Line Types . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.11 Adding and Removing Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.12 Low Level Pregap4 Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.1 Low Level Global Configuration . . . . . . . . . . . . . . . . . . . . . . . .
4.12.2 Low Level Component Configuration . . . . . . . . . . . . . . . . . . .
4.12.3 Low Level Module Configuration . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.1 General Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4.12.3.2 ALF/ABI to SCF Conversion . . . . . . . . . . . . . . . . . . . . .
4.12.3.3 Estimate Base Accuracies . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.4 Phred . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.5 ATQA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.6 Trace Format Conversion . . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.7 Initialise Experiment Files . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.8 Augment Experiment Files . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.9 Uncalled Base Clip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.10 Quality Clip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.11 Sequencing Vector Clip. . . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.12 Cross match . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.13 Cloning Vector Clip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.14 Old Cloning Vector Clip . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.15 Screen for Unclipped Vector . . . . . . . . . . . . . . . . . . . . . .
4.12.3.16 Screen Sequences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.17 Blast Screen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.18 Interactive Clipping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.19 Extract Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.20 Tag Repeats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.21 RepeatMasker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.22 Mutation Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.23 Gap4 Shotgun Assembly . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.24 Cap2 Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.25 Cap3 Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.26 FakII Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.27 Phrap Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.28 Enter Assembly into Gap4 . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.29 Email . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.12.3.30 Shutdown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.13 Writing New Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.13.1 An Overview of a Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.13.2 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.13.3 Module Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.13.4 Global Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.13.5 Builtin Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.13.6 An Example Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Marking poor quality and vector segments of readings . . . . . . . . . . . . . . . . . . . .
Introduction to read clipping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Screening Against Vector Sequences . . . . . . . . . . . . . . . . . . . .
6.1 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6.2 Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6.3 Parameters (defaults in brackets) . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6.4 Error codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6.6 Vector Primer file format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6.7 Vector Primer File Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6.8 Defining Cloning and Primer Sites for Vector Clip . . . . . . . . . . . .
6.9 Finding the Cloning and Primer Sites . . . . . . . . . . . . . . . . . . . . . . . .

Screening Readings for Contaminant Sequences . . . . . . . . . . . . . . . . . . . .
7.1 Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
7.2 Limits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
7.3 Error codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
7.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Viewing and editing trace data using trev . . . . . . . . . . . . . . . . . . . .
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.2 Opening trace files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.2.1 Opening a trace file from the command line . . . . . . . . . . . . .
8.2.2 Opening a trace file from within Trev. . . . . . . . . . . . . . . . . . . .
8.3 Viewing the trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.3.1 Searching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.3.2 Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.4 Editing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.4.1 Setting the left and right cutoffs . . . . . . . . . . . . . . . . . . . . . . . . .
8.4.2 Editing the sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.4.3 Undoing clip edits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.5 Saving a trace file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.6 Processing multiple files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.7 Printing a trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.7.1 Page options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.7.1.1 Paper options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.7.1.2 Panels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.7.1.3 Fonts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.7.2 Trace options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.7.2.1 Title . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.7.2.2 Line width and colour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.7.2.3 Dash pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.7.2.4 Print bases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.7.2.5 Print magnification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.7.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.8 Quitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Analysing and comparing sequences using spin . . . . . . . . . . . . . . . . . . . .
9.1 Organisation of the Spin Manual . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.2.1 Summary of the Spin Single Sequence Functions . . . . . . . . .
9.2.2 Summary of the Spin Comparison Functions . . . . . . . . . . . . .
9.2.3 Introduction to the Spin User Interface . . . . . . . . . . . . . . . . . .
9.2.3.1 Introduction to the Spin Plot . . . . . . . . . . . . . . . . . . . . . . .
9.2.3.2 Introduction to the Spin Sequence Display . . . . . . . . . .
9.2.3.3 Introduction to the Spin Sequence Comparison Plot . . . . . . . . . .
9.2.3.4 Introduction to the Spin Sequence Comparison Display . . . . . . . . . .
9.2.4 Spin Menus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.2.4.1 Spin File Menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.2.4.2 Spin View Menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.2.4.3 Spin Options Menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.2.4.4 Spin Sequences Menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.2.4.5 Spin Statistics Menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.2.4.6 Spin Translation Menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.2.4.7 Spin Search Menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.2.4.8 Spin Comparison Menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.3 Spin’s Analytical Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.3.1 Count Sequence Composition . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.3.2 Count Dinucleotide Frequencies . . . . . . . . . . . . . . . . . . . . . . . . .
9.3.3 Plot base composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.3.4 Calculate codon usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.3.5 Set genetic code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.3.6 Translation - general . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.3.7 Find open reading frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.3.8 Restriction enzyme search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.3.8.1 Selecting Enzymes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.3.8.2 Examining the Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.3.8.3 Reconfiguring the Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.3.8.4 Printing the sites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.3.9 Subsequence search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.3.10 Motif search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.3.11 Gene finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.3.11.1 Start codon search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.3.11.2 Stop codon search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.3.11.3 Codon usage method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.3.11.4 Positional base preferences . . . . . . . . . . . . . . . . . . . . . . . .
9.3.11.5 Author test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.3.11.6 Uneven positional base preferences . . . . . . . . . . . . . . . .
9.3.11.7 Splice site search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.3.11.8 tRNA search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.4 Spin Comparison Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.4.1 Finding Similar Spans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9.4.2 Finding Matching Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.4.3 Finding the Best Diagonals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.4.4 Aligning Sequences Globally . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.4.5 Aligning Sequences Locally . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.5 Controlling and Managing Results . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.5.1 Probabilities and expected numbers of matches . . . . . . . . . .
9.5.2 Changing the maximum number of matches . . . . . . . . . . . . .
9.5.3 Changing the default number of matches . . . . . . . . . . . . . . . .
9.5.4 Hide duplicate matches. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.5.5 Changing the score matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.5.6 Set protein alignment symbols . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.6 The Spin User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.6.1 SPIN Sequence Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.6.1.1 Cursors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.6.1.2 Crosshairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.6.1.3 Zoom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.6.1.4 Drag and drop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.6.2 Sequence display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.6.2.1 Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.6.2.2 Save . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.6.3 SPIN Sequence Comparison Plot . . . . . . . . . . . . . . . . . . . . . . . .
9.6.3.1 Cursors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.6.3.2 Crosshairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.6.3.3 Zoom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.6.3.4 Drag and drop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.6.4 Sequence Comparison Display . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.7 Controlling and Managing Results . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.7.1 Result manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.7.1.1 Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.7.1.2 List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.7.1.3 Configure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.7.1.4 Hide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.7.1.5 Reveal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.7.1.6 Remove . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.8 Reading and Managing Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.8.1 Use of feature tables in spin . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.8.2 Reading in sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.8.2.1 Simple search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.8.2.2 Extracting a sequence from a personal archive file . . .
9.8.3 Sequence manager. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.8.3.1 Change the active sequence . . . . . . . . . . . . . . . . . . . . . . . . .
9.8.3.2 Set the range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.8.3.3 Copy Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.8.3.4 Sequence type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.8.3.5 Complement sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.8.3.6 Interconvert t and u . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.8.3.7 Translate sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.8.3.8 Scramble sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

485
487
489
493
497
498
498
498
499
499
501
501
502
505
506
506
506
509
510
511
512
513
513
513
514
514
515
515
517
517
517
517
517
517
517
517
518
518
519
519
520
520
520
520
520
520
520
521

xvii
9.8.3.9 Rotate sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.8.3.10 Save sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.8.3.11 Delete sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.8.4 Selecting a sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

10 User Interface

10.1 Introduction
10.2 Basic Interface Controls
10.2.1 Buttons
10.2.2 Menus
10.2.3 Text Windows
10.2.4 Text Entry Boxes
10.3 Standard Mouse Operations
10.4 The Output and Error Windows
10.5 Graphics Window
10.5.1 Zooming
10.6 Colour Selector
10.7 File Browser
10.7.1 Directories and Files
10.7.2 Filters
10.8 Font Selection

11 File Formats

11.1 SCF
11.1.1 Header Record
11.1.2 Sample Points
11.1.3 Sequence Information
11.1.4 Comments
11.1.5 Private data
11.1.6 File structure
11.1.7 Notes
11.1.7.1 Byte ordering and integer representation
11.1.7.2 Compression of SCF Files
11.2 ZTR
11.2.1 Header
11.2.2 Chunk Format
11.2.2.1 Data format 0 - Raw
11.2.2.2 Data format 1 - Run Length Encoding
11.2.2.3 Data format 2 - ZLIB
11.2.2.4 Data format 64/0x40 - 8-bit delta
11.2.2.5 Data format 65/0x41 - 16-bit delta
11.2.2.6 Data format 66/0x42 - 32-bit delta
11.2.2.7 Data format 67-69/0x43-0x45 - reserved
11.2.2.8 Data format 70/0x46 - 16 to 8 bit conversion
11.2.2.9 Data format 71/0x47 - 32 to 8 bit conversion
11.2.2.10 Data format 72/0x48 - "follow" predictor
11.2.2.11 Data format 73/0x49 - floating point 16-bit chebyshev polynomial predictor
11.2.2.12 Data format 74/0x4A - integer based 16-bit chebyshev polynomial predictor
11.2.3 Chunk Types
11.2.3.1 SAMP
11.2.3.2 SMP4
11.2.3.3 BASE
11.2.3.4 BPOS
11.2.3.5 CNF4
11.2.3.6 TEXT
11.2.3.7 CLIP
11.2.3.8 CR32
11.2.3.9 COMM
11.2.4 Text Identifiers
11.2.5 References
11.3 Experiment File
11.3.1 Records
11.3.2 Explanation of Records
11.3.3 Example
11.3.4 Unsupported Additions (From LaDeana Hillier)
11.4 Restriction Enzyme File
11.5 Vector primer File
11.6 Vector Sequence Format

12 Man Pages

12.1 Convert_trace
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
EXAMPLES
NOTES
SEE ALSO
12.2 Copy_db
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
EXAMPLES
NOTES
12.3 Copy_reads
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
EXAMPLE
12.4 Eba
NAME
SYNOPSIS
DESCRIPTION
EXAMPLES
SEE ALSO
12.5 Extract_seq
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
SEE ALSO
12.6 Extract_fastq
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
SEE ALSO
12.7 Find_renz
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
SEE ALSO
12.8 GetABIfield
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
EXAMPLES
SEE ALSO
12.9 Get_comment
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
SEE ALSO
12.10 Get_scf_field
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
SEE ALSO
12.11 Hash_exp
NAME
SYNOPSIS
DESCRIPTION
SEE ALSO
12.12 Hash_extract
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
SEE ALSO
12.13 Hash_list
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
SEE ALSO
12.14 Hash_tar
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
EXAMPLES
SEE ALSO
12.15 Init_exp
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
NOTES
SEE ALSO
12.16 MakeSCF
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
EXAMPLES
NOTES
SEE ALSO
12.17 Make_weights
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
EXAMPLE
SEE ALSO
12.18 PolyA_clip
NAME
SYNOPSIS
OPTIONS
DESCRIPTION
SEE ALSO
12.19 Qclip
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
EXAMPLE
SEE ALSO
12.20 Screen_seq
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
EXAMPLES
NOTES
SEE ALSO
12.21 TraceDiff
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
12.22 Trace_dump
NAME
SYNOPSIS
DESCRIPTION
SEE ALSO
12.23 Vector_clip
NAME
SYNOPSIS
DESCRIPTION
OPTIONS
EXAMPLES
NOTES
SEE ALSO

References
Publications

General Index
File Index
Variable Index
Function Index

Preface
This manual describes the sequence handling and analysis software developed at the Medical
Research Council Laboratory of Molecular Biology, Cambridge, UK, which has come to be
known as the Staden Package.
The vast bulk of work on the package was done at LMB within Rodger Staden’s group,
which over time has consisted of Tim Gleeson, Simon Dear, James Bonfield, Kathryn Beal,
Mark Jordan and Yaping Cheng. Besides the group members, a number of people have
made important contributions, most notably David Judge (feedback and tutorials) and
John Taylor (development of the Windows release).
The group at LMB was disbanded in mid-2003. The package became “open source”
and moved onto SourceForge in early 2004. The only active maintainer (James Bonfield)
now works at the Wellcome Trust Sanger Institute. The new package homepage may be
found at http://staden.sourceforge.net/ and the SourceForge project page at
https://sourceforge.net/projects/staden/ .

The focus of the development since 1990 has been to produce improved methods for
processing the data for large scale sequencing projects, and this is reflected in the scope of
the package: the most advanced components (trev, prefinish, pregap4 and gap4) are those
used in that area. Nevertheless the package also contains a program (spin) for the analysis
and comparison of finished sequences. The latter also provides a graphical user interface to
EMBOSS.
Since the LMB group disbanded it has become necessary to reduce the scope of further
development, so active work is primarily being directed to the Gap4 program.
Gap4 performs sequence assembly, contig ordering based on read pair data, contig joining
based on sequence comparisons, assembly checking, repeat searching, experiment suggestion, read pair analysis and contig editing. It has graphical views of contigs, templates,
readings and traces which all scroll in register. Contig editor searches and experiment suggestion routines use confidence values to calculate the confidence of the consensus sequence
and hence identify only places requiring visual trace inspection or extra data. The result is
extremely rapid finishing and a consensus of known accuracy.
Pregap4 provides a graphical user interface to set up the processing required to prepare
trace data for assembly or analysis. It also automates these processes. The possible processes which can be set up and automated include trace format conversion, quality analysis,
vector clipping, contaminant screening, repeat searching and mutation detection.
Trev is a rapid and flexible viewer and editor for ABI, ALF, SCF and ZTR trace files.
Prefinish analyses partially completed sequence assemblies and suggests the most efficient
set of experiments to help finish the project.
Tracediff and hetscan automatically locate mutations by comparing trace data against
reference traces. They annotate the mutations found ready for viewing in gap4.
Spin analyses nucleotide sequences to find genes, restriction sites, motifs, etc. It can
perform translations, find open reading frames, count codons, etc. Many results are presented
graphically and a sliding sequence window is linked to the graphics cursor. Spin also
compares pairs of sequences in many ways. It has very rapid dot matrix analysis, global and
local alignment algorithms, plus a sliding sequence window linked to the graphical plots. It
can compare nucleic acid against nucleic acid, protein against protein, and protein against
nucleic acid.
The manual describes, in turn, each of the main programs in the package: gap4, and
then pregap4 and its associated programs such as trev, and then spin. This is followed by a
description of the graphical user interface, the ZTR, SCF and Experiment file formats used
by our software, UNIX manpages for several of the smaller programs, and finally a list of
papers published about the software. The description of each program includes an
introductory section intended to be sufficient for people to start using it. However,
to get the most from the programs, and to find the most efficient ways of using them,
we recommend reading the whole manual once. The mini-manual is made
up from the introductory sections for each of the main programs.

Chapter 1: Next generation assembly editing with Gap5

1 Next generation assembly editing with Gap5

1.1 Gap5 Databases
1.1.1 Creating databases
Gap5 cannot work directly on assemblies in their native formats. This is a substantial
difference from tools such as BAM file viewers, but the reason is simply that those formats
do not structure their data in a manner suitable for in-place editing. Gap5 is first
and foremost an assembly editor.
Gap5 databases are currently created outside Gap5 using a command-line program
named tg_index.
tg_index [options] input_file ...
The most general usage is simply to specify one or more data files (it accepts SAM/BAM,
CAF, ACE, BAF, MAQ and, in a more limited fashion, FASTA/FASTQ), optionally specifying
the output database with -o database_name. This creates a database suitable for
editing by Gap5.
Valid options are:
-m
    Input is MAQ format
-M
    Input is MAQ-long format
-A
    Input is ACE format
-B
    Input is BAF format
-C
    Input is CAF format
-f
    Input is FASTA format
-F
    Input is FASTQ format
-b
    Input is BAM format
-s
    Input is SAM format (with @SQ headers)
-u
    Also store unmapped reads (SAM/BAM only)
-x
    Also store auxiliary records (SAM/BAM only)
-r
    Store reference-position data (on) (SAM/BAM only)
-R
    Don’t store reference-position data (SAM/BAM only)
-D
    Do not remove duplicates (SAM/BAM only)
-p
    Link read-pairs together (default on)
-P
    Do not link read-pairs together
-q value
    Number of reads to queue in memory while waiting for pairing. Use to reduce
    memory requirements for assemblies with lots of single reads at the expense of
    running time. 0 for all in memory, suggest 1000000 if used (default 0).
-a
    Append to existing db
-n
    New contigs always (relevant if appending)
-g
    When appending to an existing db, assume the alignment was performed against
    an ungapped copy of the existing consensus. Add gaps back in to reads and/or
    consensus as needed.
-t
    Index sequence names (default)
-T
    Do not index sequence names
-z value
    Specify minimum bin size (default is ’4k’)
-f
    Fast mode: read-pair links are unidirectional. Intended for large databases,
    e.g. more than 100 million sequences.
-d data_types
    Only copy over certain data types. This is a comma separated list containing
    one or more words from: seq, qual, anno, name, all or none
-c method
    Specifies the compression method. This should be one of ’none’, ’zlib’ or ’lzma’.
    Zlib is the default.
-[1-9]
    Use a fixed compression level from 1 to 9
-v version_num
    Request a specific database format version
To merge existing gap5 databases you will need to export either one or both into an
intermediate format (we suggest SAM) and then use tg_index to import the data again.
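When tg_index is driven from a script, it can help to assemble the argument list programmatically. The following is a minimal sketch only: it uses a handful of the flags listed above, the file and database names are hypothetical, and the relative order of the options is an assumption rather than a documented requirement.

```python
import shlex

# Input format flags copied from the option list above. The FASTA flag is
# left out of this sketch.
FORMAT_FLAGS = {"sam": "-s", "bam": "-b", "caf": "-C", "ace": "-A",
                "baf": "-B", "maq": "-m", "fastq": "-F"}


def tg_index_command(inputs, database=None, input_format=None,
                     store_unmapped=False, store_aux=False, append=False):
    """Build an argument list for tg_index (illustrative helper, not part of Gap5)."""
    args = ["tg_index"]
    if database is not None:
        args += ["-o", database]      # output database name
    if append:
        args.append("-a")             # append to an existing database
    if store_unmapped:
        args.append("-u")             # SAM/BAM only
    if store_aux:
        args.append("-x")             # SAM/BAM only
    if input_format is not None:
        args.append(FORMAT_FLAGS[input_format])
    args += list(inputs)
    return args


# Hypothetical file names, for illustration only.
cmd = tg_index_command(["reads.bam"], database="Egu.0",
                       input_format="bam", store_unmapped=True)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run() once tg_index is available on the PATH.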

1.1.2 Opening/closing databases
The Open menu item is in the main gap5 File menu. It brings up a file browser allowing
selection of the gap5 database name. Databases consist of two files - a main data block
(.g5d) and a data index (.g5x). It does not matter which you choose as gap5 will open both.
Alternatively you can specify the database name on the command line when launching
gap5. Additionally this supports read-only access if you specify the -ro flag. For example
to open a database named Egu.0 (the old Gap4 convention implying version 0) in read-only
mode we would type:
gap5 -ro Egu.0 &

1.1.3 Changing directories
By default gap5 changes to the directory containing the database you have open. All local
output files specified (for example Save Consensus or Export Sequences) will be relative
to that location unless you use a full pathname. The current working directory may be
changed by using the Change Directory dialogue, found in the main File menu.

1.1.4 Check Database
This function (which is available from the Gap5 File menu) is used to perform a check on
the logical consistency of the database. No user intervention is required. If the checks are
passed the program will report zero errors. Otherwise a report of each error is displayed.

On a large database these checks can take a considerable amount of time. The default
is a thorough, but slow, check. However a faster mode is available which only performs
gross contig and contig-binning level checks, omitting the per sequence and per annotation
validation.
The dialogue also offers the choice of attempting to fix any problems that are found.
It is strongly recommended that you back up the gap5 database before performing fixes
because, depending on the nature of the corruption, the choices made may not necessarily be an
improvement. Note that this may not fix every problem that is found, and the fixes
themselves may cause other errors to be found, so it is best to run the check again afterwards.

1.2 Contig Selector / Comparator
1.2.1 Contig Selector
The Gap5 Contig Selector is used to display, select and reorder contigs. It can be invoked
from the Gap5 View menu, but will automatically appear when a database is opened.
In the Contig Selector all contigs are shown as colinear horizontal lines separated by short
vertical lines. The length of the horizontal lines is proportional to the length of the contigs
and their left to right order represents the current ordering of the contigs. This Contig Order
is stored in the gap5 database and users can change it by dragging the lines representing the
contigs in the display. The Contig Selector can also be used to select contigs for processing.
Unlike gap4, gap5 does not display annotations within the Contig Selector window.

The figure shows a typical display from the Contig Selector. At the top are the File, View
and Results menus. Below that are buttons for zooming and for displaying the crosshair.
The four boxes to the right are used to display the X and Y coordinates of the crosshair.
The rightmost two display the Y coordinates when the contig selector is transformed into
the contig comparator (see Section 2.4 [Contig Comparator], page 126). The two leftmost
boxes display the X coordinates: the leftmost is the position in the contig and the other is
the position in the overall consensus. The crosshair is the vertical line spanning the panel
below.
This panel shows the lines that represent the contigs and the currently active tags. Those
tags shown above the contig lines are on readings and those below are on the consensus.
Right clicking on a tag gives a menu containing “information” (to see the tag contents) and
“Edit contig at tag” which invokes the contig editor centred on the selected tag.
The information line shows data for the contig that is currently under the crosshair.

1.2.1.1 Selecting Contigs
Contigs can be selected by either clicking with the left mouse button on the line representing
the required contig in the contig selector window or alternatively by choosing the "List
contigs" option from the "View" menu. This option invokes a "Contig List" list box where
the contig names and numbers are listed in the same order as they appear in the contig
selector window.

Within this list box the contig names can be sorted alphabetically on contig name or
numerically on contig number. This is done by selecting the corresponding item from the
sort menu at the top of the list box. Clicking on a name within the list box is equivalent
to clicking on the corresponding contig in the contig selector. More than one contig can
be selected by dragging out a region with the left mouse button. Dragging the mouse off
the bottom of the list will scroll it to allow selection of a range larger than the displayed
section of the list. When the left button is pressed any existing selection is cleared. To
select several disjoint entries in the list press control and the left mouse button. The “Copy”
button copies the current selection to the paste buffer.
Most commands require a contig identifier, which can be the contig name itself or the
name/number of any reading within that contig. Gap5 always knows reading record
numbers, but depending on the options used in tg_index when creating the assembly database the reading names may not be indexed. To specify a reading by record number, precede
it by a # character, e.g. “#10000” means reading record number 10000, but “10000” means
the contig or reading with name 10000.
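The distinction between a record number and a name can be illustrated with a short sketch. This is not Gap5 code; the function name and the returned tuple are hypothetical and only mirror the rule described above.

```python
import re


def parse_identifier(text):
    """Classify an identifier string following the '#' rule described above.

    '#10000' -> ('record', 10000)   a reading record number
    '10000'  -> ('name', '10000')   a contig or reading name
    """
    text = text.strip()
    match = re.fullmatch(r"#([0-9]+)", text)
    if match:
        return ("record", int(match.group(1)))
    return ("name", text)


print(parse_identifier("#10000"))
print(parse_identifier("10000"))
```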
Also any currently active dialogue boxes that require a contig to be selected can be
updated simply by clicking on a contig in the contig selector or clicking on an entry in
the "Contig Names" list box. For example, if the Edit contig command is selected from
the Edit menu it will bring up a dialogue requesting the identity of the contig to edit. If
the user clicks the left mouse button on a contig in the contig selector window, the contig
editor dialogue will automatically change to contain the name of the selected contig. Some
commands, such as the Contig Editor, can be selected from a popup menu that is activated
by clicking the right mouse button on the contig line in the Contig Selector or clicking the
right mouse button on the corresponding name within the "Contig List" list box. This
simultaneously defines the contig to operate on and so the command starts up without
dialogue.

Several contigs can be selected at once by either clicking on each contig with the left
mouse button or dragging out a selection rectangle by holding the left mouse button down.
Contigs which are entirely enclosed within the rectangle will be selected. Alternatively,
selecting several contigs from the "Contig Names" list box will also result in each contig
being selected. Selected contigs are highlighted in bold. Selecting the same contig again
will unselect it.
The currently selected contigs are also kept in a ’list’ named contigs.

1.2.1.2 Changing the Contig Order
The order of contigs is shown by the order of the lines representing them within the Contig
Selector. The order can be changed by moving these lines using the middle
mouse button, or Alt + left mouse button. Several contigs may be moved at once by first selecting
them using the method described above. After selection, drag the contigs with the middle
mouse button, or Alt + left mouse button, and position the mouse cursor where you want the
selection to be moved to. Upon release of the mouse button the contigs will be shuffled to
reflect their new order. The separator line at the point the contig was moved from increases
in height.
The contig order is saved automatically whenever a contig is created or removed (e.g. by auto
assemble), including operations like disassemble which temporarily create contigs. The order
can also be saved manually using the Save Contig Order option on the File menu.

1.2.1.3 The Contig Selector Menus
The File menu contains only one command: "Exit". This simply quits the contig selector
display.
The View menu gives access to the Results Manager (see Section 2.13 [Results Manager],
page 277), allows contigs to be selected using a list box containing the contig names (see
Section 2.3.1 [Selecting Contigs], page 123), and allows the list of selected contigs to be cleared.
The Results menu is updated on the fly to contain cascading menus for each of the
plots shown when the contig selector is in its 2D Contig Comparator mode (see Section 2.4
[Contig Comparator], page 126). The contents of these cascading menus are identical to
the pulldown menus available from within the Results Manager.

1.2.2 Contig Comparator
Gap5 commands such as Find Internal Joins (see Section 2.8.3 [Find Internal Joins],
page 227) and Find Repeats (see Section 2.8.4 [Find Repeats], page 233) automatically
transform the Contig Selector (see Section 2.3 [Contig Selector], page 123) to produce the
Contig Comparator. To produce this transformation a copy of the Contig Selector is added
at right angles to the original window to create a two dimensional rectangular surface
on which to display the results of comparing or checking contigs. Each of the functions
plots its results as diagonal lines of different colours. If the plotted points are close to the
main diagonal they represent results from pairs of contigs that are in the correct relative
order. Lines parallel to the main diagonal represent contigs that are in the correct relative
orientation to one another. Those perpendicular to the main diagonal show results for which
one contig would need to be reversed before the pair could be joined. The manual contig
dragging procedure can be used to change the relative positions of contigs. See Section 2.3.2
[Changing the Contig Order], page 125. As the contigs are dragged the plotted results will
be automatically moved to their corresponding new positions. This means that if users
drag the contigs to move their plotted results close to the main diagonal they will be
simultaneously putting their contigs into the correct relative positions.

By use of popup menus the plotted results can be used to invoke a subset of commands.
For example if the user clicks the right mouse button over a result from Find Internal Joins
a menu containing Invoke Join Editor (see Section 2.6.15 [The Join Editor], page 196) and
Invoke Contig Editors (see Section 2.6 [Editing in gap4], page 160) will pop up. If
the user selects Invoke Join Editor the Join Editor will be started with the two contigs
aligned at the match position contained in the result. If required one of the contigs will be
complemented to allow their alignment.

A typical display from the Contig Comparator is shown below. It includes results for
Find Internal Joins in black, Find Repeats in red and Sequence Search in green. The
currently highlighted item is shown in pink with a summary at the bottom of the screen.
The orientation of this is from top-left to bottom-right indicating that the match is in the
same orientation within both contigs (we can see some in the opposite orientation indicating
that we need to reverse complement either of the two contigs before attempting any joins,
although this will happen automatically). The crosshairs show the positions for a pair of
contigs. The vertical line continues into the Contig Selector part of the display, and the
position represented by the horizontal line is also duplicated there.

1.2.2.1 Examining Results and Using Them to Select Commands
Moving the cursor over plotted results highlights them, and the information line gives a
brief description of the currently highlighted match. This is in the form:
match name: contig1 number@position in contig1, with contig2 number@position in contig2,
length of the match
For Find Internal Joins the percentage mismatch is also displayed.

Several operations can be performed on each match. Pressing the right mouse button
over a match invokes a popup menu. This menu will contain a set of options which depends
on the type of result to which the match corresponds. The following is a complete list, but
not all will appear for each type of result.
Information
    Sends a textual description of the match to the Output Window.
Hide
    Removes the match from the Contig Comparator. The match can be revealed
    again by using "Reveal all" within the Results Manager.
Invoke contig editors
Invoke join editors
    When invoked these options bring up their respective displays to show the
    match in greater detail.
Remove
    Removes the match from the Contig Comparator. The match cannot be revealed
    again by using "Reveal all" within the Results Manager.

One of the items in the popup menu may have an asterisk next to it. This is the default
operation which can also be performed by double clicking the left mouse button on the
match. For Find Repeats or Find Internal Joins matches this will normally be the Join Editor,
or two Contig Editors when the match is between two points in the same contig.
The crosshairs can be toggled on and off and a diagonal line going from top left to bottom
right of the plot can also be displayed if required. This is useful as a guide for moving the
contigs such that their matches lie upon the diagonal line.
The "Results" menu on the contig selector window provides a similar mechanism of
accessing results, but at the level of all matches in a particular search. This is simply a
menu driven interface to the Results Manager window (see Section 2.13 [Results Manager],
page 277), but containing only the results relevant to the contig comparator window.

1.2.2.2 Automatic Match Navigation
The "Next" button of the contig comparator window automatically invokes the default
operation on the next match from the current active result. This provides a mechanism to
step through each match in turn ensuring that no matches have been missed.
With a single result (set of matches) plotted, the "Next" button simply steps through
each match in turn until all have been seen. Moving the mouse above the "Next" button,
without pressing it, highlights the next match and displays brief information about it in
the status line at the bottom of the window. To step through the matches in "best first"
order, select the "Sort Matches" option from the relevant name in the Results menu. The
exact order is dependent on the result in question, but is generally arranged to be the most
interesting ones first.
Bringing up another result now directs "Next" to step through each of the new matches.
To change the result that "Next" operates on, use the Result menu to select the "Use for
’Next’" option in the desired result. Alternatively, double clicking on a match also causes
"Next" to process the list starting from the selected result.

The "Next" scheme remembers any matches that have been previously examined either by itself or by manually double clicking, and will skip these. To clear this ’visited’
information select "Reset ’Next’" in the Results Manager.

1.3 Template Display
The template display is a graphical overview of a single contig. It allows us to see how much
data we have, how long the fragments are and how they relate to each other (whether they
are forming valid pairs).

The window consists of one or more tracks, by default showing the reading template
layout at the top and a sequence / read-pair coverage plot at the bottom. The Tracks menu
allows us to turn these on and off.
Below the main menu bar is a series of buttons that bring up new dialogues for controlling
how the data is to be displayed and what is to be displayed.
Next comes one graphical plot per track. A cross-hair automatically tracks the cursor, indicating the X and Y coordinates (in appropriate units) in the status line at the bottom of
the window. The track displays can be moved either by using the horizontal and vertical
scrollbars at the bottom and right-hand edges of the window, or by clicking and dragging
the contents of the window. While dragging, the display will not update to show newly
visible regions of a contig until the left mouse button is released.
Finally the bottom contains a scrollbar and ruler for positioning and a series of controls.
The X scale simply controls how many base-pairs of the contig are covered by the window.
The X scale number is arbitrary, but is interpreted in an exponential manner so it is easy
to rapidly zoom in or zoom out. The other controls in the bottom panel do not affect the
reading coverage track, so they are covered in the template track section below.
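The manual does not give the formula behind the X scale, only that it is interpreted exponentially. The sketch below illustrates what such a mapping could look like. The constants min_bases and growth are hypothetical and are not taken from Gap5.

```python
def visible_bases(x_scale, min_bases=100.0, growth=1.1):
    """Map an arbitrary slider value to a window width in base pairs.

    Hypothetical formula: each unit step multiplies the width by `growth`,
    so equal slider movements give equal zoom ratios.
    """
    return min_bases * growth ** x_scale


for step in (0, 25, 50, 75, 100):
    print(step, round(visible_bases(step)))
```

With an exponential mapping a short slider can cover both a few hundred bases and many megabases, which is why zooming in and out is fast.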

1.3.1 Filtering data
By default all templates are used for drawing the tracks, but there are times when we may
wish to focus on specific problem data or to exclude it from our graphics.

The Filter button at the top of the Template Display brings up the dialogue shown
above. Changes made in this dialogue either take effect immediately (when
“Auto update” is enabled) or only when we press Apply or OK.
The Pairs section allows us to select either reads on all templates, reads that are the sole
read for that template, or reads that are paired on a template. Note that the definition of a
pair here depends strictly on how many reads for a template are in the gap5 database
rather than on the library preparation strategy. So a paired-end template for which only one
read is in the gap5 database (perhaps due to failure to map) is classified as “single”.
The Consistency section can be used to select all, consistent only or inconsistent only
data. This requires read-paired data (single reads cannot be inconsistent and so are considered
consistent). The current interpretation of inconsistent is that the two reads of a pair
do not point towards one another. Future releases are planned to check the correct
orientation for the library type instead, as for some constructions it is normal to have reads pointing
in the same orientation.
The Spanning section governs whether to display read pairs with one read in this contig
and the other read in another contig. Handling templates with more than two reads is still
on-going work, but when finished a spanning read-pair will be one with any read not in this
contig.
Underneath these are two sliders applied in addition to the above filters. They allow
removal of any read or read-pair (depending on the type of data being plotted) with a
mapping quality outside the selected range.
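The way these filters combine can be sketched as a single predicate. This is an illustration under stated assumptions, not the Gap5 implementation: the Template record, the option names and the boolean treatment of the Spanning section are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Template:
    reads_here: int        # reads of this template in the displayed contig
    reads_elsewhere: int   # reads of this template in other contigs
    consistent: bool       # the pair's reads point towards one another
    mapq: int              # mapping quality used for filtering


def passes_filter(t, pairs="all", consistency="all", show_spanning=True,
                  mapq_min=0, mapq_max=255):
    """Return True if the template survives every selected filter."""
    total = t.reads_here + t.reads_elsewhere   # reads present in the database
    if pairs == "single" and total != 1:
        return False
    if pairs == "paired" and total < 2:
        return False
    # Single reads cannot be inconsistent, so they count as consistent.
    is_consistent = t.consistent or total == 1
    if consistency == "consistent" and not is_consistent:
        return False
    if consistency == "inconsistent" and is_consistent:
        return False
    if not show_spanning and t.reads_elsewhere > 0:
        return False
    return mapq_min <= t.mapq <= mapq_max


pair = Template(reads_here=2, reads_elsewhere=0, consistent=False, mapq=40)
print(passes_filter(pair, pairs="paired", consistency="inconsistent"))
```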

1.3.2 Template plot
This is the main body of the template display window. The default plot shows read-pairs, mainly coloured by mapping quality, with the insert size governing the Y coordinate.
Larger inserts are at the bottom of the track while shorter ones are at the top.

The colours used are as follows:
blue
    This is a template with only one reading present. It could be either a pair with
    one end not in this assembly, or a true single-ended sequencing experiment. The
    horizontal size of the line is now the length of the individual sequence rather
    than the computed length of the insert.
orange
    This is a template with one reading present in another contig. The size of the
    line is derived from the size of the data in this contig (typically a single reading).
red
    This template is considered as inconsistent in some manner, typically due to
    the relative position and orientation of the forward and reverse sequences being
    incorrect.
grey (various shades)
    Any consistent read-pair is coloured by the mapping quality, by default using the
    average of the individual sequence mapping qualities. Lighter shades represent
    higher mapping qualities.
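The colour rules above can be summarised as a small decision function. This is a sketch only: the order in which the rules are tested, the max_mapq ceiling and the exact grey levels are assumptions, not values taken from Gap5.

```python
def template_colour(reads_here, reads_elsewhere, consistent, mapq, max_mapq=255):
    """Pick a display colour for one template following the list above."""
    if reads_elsewhere > 0:
        return "orange"          # other end lies in another contig
    if reads_here == 1:
        return "blue"            # only one reading present
    if not consistent:
        return "red"             # position/orientation of the pair is wrong
    # Consistent pair: grey level grows with mapping quality (lighter = higher).
    level = round(255 * min(mapq, max_mapq) / max_mapq)
    return "#{0:02x}{0:02x}{0:02x}".format(level)


print(template_colour(2, 0, True, 255))
print(template_colour(1, 0, True, 30))
```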
The row of scale bars at the bottom of the window controls how data is to be plotted.
They are:
X Scale
    Controls how many base-pairs in the contig to plot. Higher values indicate more
    base pairs, but with an exponentially growing scale.
Y Magnification
    Governs the amount of vertical space consumed by the template track. This
    has no impact on the depth track.
Y Offset
    Adds a small shift to the Y position of data prior to plotting. This is of little
    use unless Separate Strands has also been selected, whereupon it allows the
    two halves of the plot to be brought closer together. (Effectively this means the
    plot can go from -1000 to -100 and +100 to +1000 instead of -1000 to +1000
    with a blank area in the middle if our sequences are a minimum of 100 bases
    long.)
Stacking Y Size
    Only of use in Stacking Y-Position mode. This vertically groups together data
    of similar length, allowing a basic approach to separating short-read and long-read
    technologies. The Y layout is performed in steps of “Stacking Y Size”. To
    pack reads tightly together regardless of length, set this to the maximum value
    possible.
Y Spread
    This adds a small perturbation to the computed Y coordinates of lines in the
    template track. When the Y coordinate is derived from the insert size of
    the read-pair it is not always clear whether a line represents a single item or
    many items stacked perfectly on top of one another. The Y spread control
    compensates for this.

Template track with Y spread of 0.

Template track with Y spread of 50.
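The effect of the Y Spread control can be illustrated with a short sketch. How Gap5 actually computes the perturbation is not documented here, so the uniform random offset used below is an assumption made purely for illustration.

```python
import random


def spread_y(y_values, y_spread, seed=0):
    """Add a perturbation of at most +/- y_spread/2 to each Y coordinate."""
    rng = random.Random(seed)        # fixed seed keeps the layout repeatable
    half = y_spread / 2.0
    return [y + rng.uniform(-half, half) for y in y_values]


# Three read-pairs with an identical insert size of 500 bases.
print(spread_y([500, 500, 500], 0))    # all drawn on the same line
print(spread_y([500, 500, 500], 50))   # now distinguishable
```

With a spread of 0 the three templates are indistinguishable; with a non-zero spread each becomes a separate line close to its true Y position.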

1.3.2.1 Controlling The Y Layout.
The layout and type of data in the template track can be controlled using the Template
button at the top of the main template display window.

The Y Position section controls how the Y coordinates are computed when plotting data
(with X being tied to the position in the assembly or reference). It can be one of three
settings.

Template size
    The default mode. The size of an object is defined to be the number of bases it
    spans. This is normally the size of a read-pair, or if the pair spans contigs or if
    only readings are shown it is the size of a single reading instead. Larger objects
    are at the bottom of the window. This Y method very clearly reveals indels in
    a mapped assembly. It sometimes also reveals misassemblies.
    Given that items of identical size will stack on top of one another, the Y Spread
    control in the main window is of particular use in this display mode.
Stacking
    A more traditional view: each and every item is allocated its own
    non-overlapping Y coordinate (although at low Y magnifications several items
    may be drawn at the same Y pixel).
    It is still possible to partially group items by their insert size using the “Stacking
    Y Size” control in the main window.
Mapping Quality
    Finally we can display data collated by the mapping score. This is typically only
    available for mapped assemblies. This plot sometimes helps to reveal regions
    where all the data present is of poor mapping quality, indicating a likely repeat.
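The Stacking mode can be thought of as a row allocation problem. The following greedy sketch assigns each item the lowest row in which it overlaps nothing already placed. It is a common way to produce such layouts and is offered as an illustration, not as the algorithm Gap5 uses.

```python
def stack_rows(intervals):
    """Assign a row to each (start, end) interval, coordinates inclusive.

    Items are placed in order of start position into the first row whose
    last item ends before the new item starts.
    """
    row_ends = []     # end coordinate of the last item placed in each row
    rows = {}
    for idx, (start, end) in sorted(enumerate(intervals), key=lambda p: p[1]):
        for row, last_end in enumerate(row_ends):
            if start > last_end:
                row_ends[row] = end
                rows[idx] = row
                break
        else:
            row_ends.append(end)          # no free row: open a new one
            rows[idx] = len(row_ends) - 1
    return [rows[i] for i in range(len(intervals))]


print(stack_rows([(1, 100), (50, 150), (101, 200)]))
```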
Adjacent to the Y Position frame is the Colour frame. This controls the colour of the
lines drawn in the template display rather than their location.
Combined mapping quality
Minimum mapping quality
Maximum mapping quality
    For templates with multiple reads visible, we have a variety of mapping qualities.
    Often these individual sequence mapping qualities will differ, but we wish to
    draw a single line for the template with a single colour. These three methods
    control whether we take the average, minimum or maximum values from the
    individual sequences on this template.
Reads
    The line typically represents the entire span of the insert, but we may not have
    sequence data for all of the template. This colour mode will also draw the
    portions of the template that we have known sequence for, in green for forward
    strand sequences and magenta for reverse strand sequences. Any remaining
    portion of template between the reads is drawn using the combined mapping
    quality.
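The three mapping quality methods reduce the per-read values to one number per template. A minimal sketch, with hypothetical function and method names:

```python
from statistics import mean


def template_mapq(read_mapqs, method="combined"):
    """Reduce the mapping qualities of a template's reads to a single value."""
    if method == "combined":
        return mean(read_mapqs)      # average of the individual reads
    if method == "minimum":
        return min(read_mapqs)
    if method == "maximum":
        return max(read_mapqs)
    raise ValueError("unknown method: " + method)


for method in ("combined", "minimum", "maximum"):
    print(method, template_mapq([20, 60], method))
```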

At the bottom of this dialogue is a row of check buttons.
“>>Acc” enables accurate mode, but be warned this can be very slow. When the template
display is drawn it fetches all data within the visible portion plus a little bit either side.
From this data, reads from the same template are paired up. However when a template spans
a substantially larger range than is shown we may only have fetched one read for this
template. We do know that such a template forms a pair, but we do not know the exact
location of the other end or even whether it is in this contig. The assumption is that it is
not, and the template is drawn in orange. Enabling accurate mode will work out the precise
location of the other end and if it is present elsewhere within this contig then the insert size
will be correctly determined and the plot adjusted accordingly.
The “Reads” checkbutton (not to be confused with the Reads colour selector) disables
all drawing of read-pairing and template lines, drawing lines to represent the known
DNA sequence instead.
“Y-log scale” controls whether we plot our Y values using log or linear scales.
“Separate strands” attempts to classify all templates as coming from the top or bottom
strand of DNA (based on the orientation of the sequences on that template, although
sometimes these are conflicting). It then splits the plot in two, forming an approximate
mirror image. This may be of use in some transcriptome sequencing experiments.

1.3.3 Depth / Coverage Plot
The depth track shows coverage of both individual readings and read-pairs, where a read-pair counts as +1 coverage over the entire length it spans rather than just the portion
directly sequenced.
The filter options for (in)consistent read pairs also apply here, giving the option to only
show depth of consistent pairs.
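The difference between the two kinds of coverage can be sketched as follows. The input layout (a list of templates, each a list of 1-based inclusive read positions) is hypothetical, and treating only templates with two or more reads as contributing to read-pair coverage is an assumption.

```python
from itertools import accumulate


def depth_tracks(contig_length, templates):
    """Return (read_depth, pair_depth) lists for positions 1..contig_length."""
    read_diff = [0] * (contig_length + 2)
    pair_diff = [0] * (contig_length + 2)
    for reads in templates:
        for start, end in reads:
            read_diff[start] += 1            # coverage starts here
            read_diff[end + 1] -= 1          # and stops after the read ends
        if len(reads) >= 2:
            span_start = min(s for s, _ in reads)
            span_end = max(e for _, e in reads)
            pair_diff[span_start] += 1       # +1 over the whole insert span
            pair_diff[span_end + 1] -= 1
    read_depth = list(accumulate(read_diff))[1:contig_length + 1]
    pair_depth = list(accumulate(pair_diff))[1:contig_length + 1]
    return read_depth, pair_depth


reads, pairs = depth_tracks(10, [[(1, 3), (8, 10)], [(2, 4)]])
print(reads)
print(pairs)
```

In the example the unsequenced middle of the first template has no read coverage but still counts towards read-pair coverage.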

1.4 Editing in Gap5
The Gap5 Contig Editor is designed to allow rapid checking and editing of characters
in assembled readings. Very large savings in time can be achieved by its sophisticated
problem finding procedures which automatically direct the user only to the bases that
require attention. The following is a selection of screenshots to give an overview of its use.

The figure above shows a screendump from the Contig Editor showing the consensus for
a small region of a contig and the aligned reads. The main components are: the menu
bar at the top; common buttons and controls beneath this; the main name and sequence panels to the
left and right; scrollbars and jog control; and a status text line at the bottom.
The names panel on the left can show either reading names or a small ASCII diagram
representing their position, orientation and mapping quality as a grey-scale. The sequences
to the right in the screenshot have base quality shown in grey (dark being poor, light being
good) with disagreements with the consensus at the top shown in blue. The consensus line
also shows base qualities. You may notice we have a mixture of long and short sequences,
with the longer ones being at the top. This screenshot is from a mixed assembly of Illumina
short-read data and ABI Sanger-method capillary sequences.
One base is drawn in inverse video (a “G”). This is the current location of the editing
cursor. We can move this with the arrow keys or by clicking with the left mouse button. It behaves
much like the editing cursor in a word processor and need not be visible in the portion of
the contig we are viewing.

Also visible is a set of bases coloured yellow. These mark an OLIGO annotation. Gap5
supports a wide variety of annotation types (often also referred to as “tags”). These are
covered later in more detail.

This figure is an example of the Trace Display showing three capillary traces and an
Illumina trace from readings in the previous Contig Editor screendumps. Note that this
demonstrates the possibility of showing the raw trace data for new short-read sequencing
technologies, but typically this is not available due to the high storage size.

1.4.1 Moving the visible segment of the contig
The contig editor displays only one segment of the entire contig, although several contig
editors can be in use at once. Below the sequence is a scrollbar and below that a “jog”
control. The scrollbar behaves as expected, allowing rapid positioning anywhere within the
contig using the middle mouse button or left-clicking and dragging the slider. However with
extremely long contigs (for example 100Mb) it can become tricky to move by the desired
amount. Each pixel on the scrollbar may represent 100Kb worth of data, so dragging the
scrollbar gives only approximate positioning. Equally, clicking in the trough to move a
screen-full at a time can be too small a step. This is where the jog control can be of use.
By default this is always centred. Clicking and dragging this left or right starts to scroll
the editor, at a speed proportional to how far away from the centre the jog is dragged.
Releasing the mouse button stops the automatic scrolling and recentres the jog control.
The final, more precise, manner of positioning the editor view is with the text entry box
in the bottom left corner. Type in any coordinate here and press return to jump straight
to that location. Note however that Gap5’s coordinates are currently always in padded
form; that is to say that a gap in the consensus caused by an insertion in one of the aligned
sequences is still counted as a base position.
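The relationship between padded and unpadded coordinates can be illustrated with a short Python sketch. This is not Gap5 code: the function names and the 1-based position convention are assumptions made purely for illustration, with “*” used as the pad character.

```python
PAD = "*"  # alignment padding character


def padded_to_unpadded(consensus: str, padded_pos: int) -> int:
    """Count the real (non-pad) bases up to and including padded_pos (1-based)."""
    if not 1 <= padded_pos <= len(consensus):
        raise ValueError("position outside consensus")
    return sum(1 for base in consensus[:padded_pos] if base != PAD)


def unpadded_to_padded(consensus: str, unpadded_pos: int) -> int:
    """Return the padded position (1-based) of the unpadded_pos-th real base."""
    seen = 0
    for i, base in enumerate(consensus, start=1):
        if base != PAD:
            seen += 1
            if seen == unpadded_pos:
                return i
    raise ValueError("position outside consensus")


consensus = "AC*GT**A"
print(unpadded_to_padded(consensus, 3))  # the third real base, G, sits at padded position 4
print(padded_to_unpadded(consensus, 8))  # the final A is the fifth real base
```

Because every pad counts as a position, a padded coordinate is never smaller than the corresponding unpadded one.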
For particularly deep displays the vertical scrollbar on the right edge of the window will
also be useful. While scrolling in X, the editor attempts to keep the same sequences visible
on screen. To do this it may automatically adjust the Y scrollbar for you due to changing
layout of sequences. (By default the top-most sequence is always the sequence that starts
furthest left and the bottom most is the sequence starting furthest right.)

If you have a mouse wheel, this may also be used for small scrolling. By itself it scrolls
in Y one sequence at a time. With the Control key held down it scrolls in larger increments. Using the Shift key in conjunction with the mouse wheel scrolls in X instead, with
Shift+Control to scroll in larger increments.
The displayed portion of the contig is separate from the current location of the editing
cursor. This is displayed as a black rectangle, typically with a light coloured letter inside it.
Any editing keys operate on the base underneath this or, for Delete, on the base immediately
preceding it. We cover the topic of editing later (see Section 2.6.3 [Editing], page 165);
however, moving the editing cursor is also another way of scrolling the editor.
Finally the Page Up and Page Down keys scroll the editor left or right by 90% of the current
screen width. Used with Shift they move in increments of 1Kb, with Control in increments
of 10Kb and with both Shift and Control in increments of 100Kb. The Home and End
keys jump to the start or end of the current item underneath the editing cursor: either a
sequence or the consensus.
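The Page Up and Page Down step sizes can be summarised as a small lookup. The sketch below is illustrative only; the function name is hypothetical, and it assumes that 1Kb means 1000 bases and that the screen width is measured in bases.

```python
def page_step(screen_width: int, shift: bool = False, control: bool = False) -> int:
    """Return the number of bases moved by one Page Up or Page Down press."""
    if shift and control:
        return 100_000  # 100Kb
    if control:
        return 10_000   # 10Kb
    if shift:
        return 1_000    # 1Kb
    return screen_width * 9 // 10  # 90% of the visible width (integer arithmetic)


print(page_step(80))
print(page_step(80, shift=True, control=True))
```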

1.4.2 Names
At the left side of the editor window is the “names panel”. This either displays an ASCII
pictorial summary of the sequence layout or the actual sequence names themselves depending on the settings in use. Between the names panel and the sequences panel is a vertical
line, visible at the right edge of the above image. This can be dragged left and right to
adjust the proportion of display dedicated to the names and sequence panels.
The default name display looks like this:

This plot is a mini diagram of the way the sequences overlap. Here the > and < symbols
represent the start of sequences, assembled on either the forward or reverse strand, with
the ... sections reflecting their relative lengths. The background shading indicates the
mapping quality of the sequence (which may not be available in many cases, depending on
how the assembly was derived). This should indicate the likelihood that the sequence has
been assembled to the correct point. Sequence that appears to map elsewhere, e.g. due to
a repeat, will be dark grey while unique sequence will be light grey or white. Moving the
mouse cursor over a sequence will tell you the precise mapping quality along with additional
information such as the sequence name, the technology used (Sanger, Illumina, 454, etc),
and whether it is part of a pair of sequences.
In the editor Settings menu is a checkbox labelled “Pack Sequences”. When checked we
permit multiple sequences to be drawn in the same row. Unchecking this reverts to the
Gap4 style of display where each sequence has its own dedicated row. This also has an
effect on the names panel, which switches to showing the sequence names, as below.

This still uses the > and < symbols to reflect strand and grey scales for representing the
mapping quality. The > and < are now also coloured independently.
• light blue: The read is not paired
• white: Forms a consistent pair
• grey: Paired, but the insert size is too large or too small
• red: Paired, but in an invalid orientation
• orange: Paired, but the other end is in another contig

At the bottom of the names panel is an editable text field containing the current display
position. Adjacent to this is a small “P” indicating these coordinates are “padded”. Clicking
this will alternate with “R” to indicate reference coordinates, although these may not be
available in all situations. Note that currently, for speed reasons, it cannot directly display
unpadded coordinates.
Typing into this position entry-box allows us to direct the editor to a specific location. If
we end the number with “u” it performs an unpadded to padded conversion before jumping
to this location.
Left clicking on a name will toggle the background between the current grey to a shade
of blue (with luminosity once again reflecting mapping quality). This indicates that the
sequence name has been added to the “readings” list. Multiple names may be selected or
deselected by pressing and holding the left mouse button while moving the mouse cursor.
In both display modes, pressing the right mouse button brings up a context sensitive
menu containing operations relevant to that specific sequence. This may contain the following commands.

Copy name to clipboard
Copy #number to clipboard
These copy the sequence name or the record number to the clipboard for use
in a subsequent paste operation. Note that there is no visual cue that this
has happened. The same function may also be achieved by left-clicking and
dragging the mouse horizontally, as if attempting to highlight a region of text.
These two items are also available when right clicking on the Consensus label,
but in this case it copies the contig name or number to the clipboard instead.
Goto...

This lists other sequences sharing the same template, such as the other end of
a read-pair. Selecting this command will jump the editor to the left-most base
in that sequence. If the sequence is in another contig then a new editor will be
created, unless one already exists for that contig in which case that other editor
will be moved accordingly.

Join to...

In the case of read-pairs that span contigs, the join to function will bring up
the join editor for both contigs involved, automatically complementing the other
contig if appropriate based on the library pair orientation statistics.

Right clicking on the contig name also pops up a menu. In here are options to change
the contig name or the starting coordinate. These options are also available in the editor
Commands menu.

1.4.3 Editing
Editing can take up a significant portion of the time taken to finish a sequencing project.
Gap5 has a selection of searches (see Section 2.6.6 [Searching], page 174) designed to speed
up this process. The problems that require most attention are conflicts between good bases.
Where base confidence values are present it should be unnecessary to edit all conflicting
bases as, generally, this will amount to adjusting poor quality data to agree with good
quality data in which case the consensus sequence should be correct anyway.
Pads in the consensus should not be considered a problem requiring edits because it
is possible to output the consensus sequence (from the main Gap5 File menu) with pads
stripped out. Obviously poorly defined pads (a mixture of several alignment padding characters and real bases) require checking in the same manner as other poorly defined consensus
bases.
To change a base simply overtype with a new base call, one of a,c,g or t in lowercase.
Alternatively a base can be changed to an alignment padding character by pressing “*”.
These new bases and pads automatically get given a quality value of 100, but see below for
how to adjust this. The consensus cannot be edited in this manner.
To insert a gap into sequence press “i” or the Insert key. At present only alignment pads
can be inserted, not bases, although the pads can subsequently be edited to turn them into
bases. The “i” and Insert keys also permit insertion of gaps into the consensus, which is
achieved by inserting into every sequence aligned at that position.
Bases may be deleted by pressing the Delete or Backspace key. This deletes the base
immediately to the left of the current editing cursor. Note that if Delete or Backspace is
pressed with the editing cursor on the consensus this removes an entire column of data.

Deleting anything other than alignment padding characters (either in sequences or the
consensus) is a dangerous operation needing careful thought. To prevent accidental removal
of data, you must therefore press Control in conjunction with Delete or Backspace to delete
anything other than “*”.
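The safeguard on deletion can be sketched as follows. This is a simplified model rather than the editor's implementation: the function name is hypothetical, the sequence is held as a Python list of characters, and the cursor is a 0-based index.

```python
def delete_left(seq: list, cursor: int, control: bool = False) -> int:
    """Delete the base to the left of the cursor and return the new cursor index.

    Only the pad character "*" is removed unless control is True.
    """
    if cursor <= 0:
        return cursor          # nothing to the left of the cursor
    target = seq[cursor - 1]
    if target != "*" and not control:
        return cursor          # refuse: deleting a real base needs Control
    del seq[cursor - 1]
    return cursor - 1


seq = list("AC*G")
cursor = delete_left(seq, 3)                  # removes the pad
cursor = delete_left(seq, cursor)             # refused, "C" is a real base
cursor = delete_left(seq, cursor, control=True)
print("".join(seq), cursor)
```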

1.4.3.1 Moving the editing cursor
Nearly all editing operations happen at the location of the editing cursor. This cursor
appears as a black block containing the base in a light colour, instead of the usual black
base on a light background.
The simplest mechanism of moving the cursor is using the left mouse button. Alternatively the following keys can be used.
Left arrow or Control b     Move left one base
Right arrow or Control f    Move right one base
Up arrow or Control p       Move up one base
Down arrow or Control n     Move down one base
Control a                   Move editing cursor to start of sequence
Control e                   Move editing cursor to end of sequence
Home                        Move editing cursor to start of sequence
End                         Move editing cursor to end of sequence
Meta or Alt <               Move editing cursor to start of contig
Meta or Alt >               Move editing cursor to end of contig

If any of these move the editing cursor outside of the visible region, the editor will scroll
to accommodate. Control-a and Control-e with the editing cursor on the consensus line will also
jump to the start and end of the contig.
If “Cutoffs” are shown (see Section 2.6.3.4 [Adjust the Cutoff Data], page 169) the cursor
may be placed in the cutoff data too. Note that turning off the display of cutoff data would
then leave the editing cursor on an invisible base, so it is moved to the consensus line instead.

1.4.3.2 Adjusting the Quality Values
Each base has its own quality value. Assembly will allow only values between 1 and 99
inclusive. A quality value of 0 means that this base should be ignored. A quality value of
100 means that this base is definitely correct and the consensus will be forced to be the
same base type and will be given a consensus confidence of 100. If two conflicting bases
both have a quality of 100 the consensus will be a dash with a confidence of 0.
Newly added bases or replaced bases are assigned a quality of 100.
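The effect of quality 100 bases on the consensus can be sketched in Python. This models only the forcing rule described above; the normal consensus calculation is not reproduced, the function name is hypothetical, and treating upper and lower case as the same base type is an assumption.

```python
def forced_consensus(column):
    """column: list of (base, quality) pairs covering one consensus position.

    Returns (base, confidence) when quality 100 bases decide the call, or
    None when no base has quality 100 and the usual algorithm would apply.
    """
    # Assumption: "a" and "A" count as the same base type.
    forced = {base.upper() for base, quality in column if quality == 100}
    if not forced:
        return None
    if len(forced) == 1:
        return forced.pop(), 100   # consensus forced to this base
    return "-", 0                  # conflicting quality 100 bases


print(forced_consensus([("A", 30), ("c", 100), ("A", 45)]))
print(forced_consensus([("A", 100), ("G", 100)]))
```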
Several keyboard commands are available to edit the quality value of an individual base.
[                     Set quality to 0 and move cursor right
]                     Set quality to 100 and move cursor right
Shift Up-Arrow        Increment quality by 1
Control Up-Arrow      Increment quality by 10
Shift Down-Arrow      Decrement quality by 1
Control Down-Arrow    Decrement quality by 10
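As a rough model of these key bindings, the sketch below maps each key to its effect on a quality value. The key labels and function name are hypothetical, and clamping the result to the range 0 to 100 is an assumption; the manual does not state what happens at the limits.

```python
QUALITY_KEYS = {
    "[": ("set", 0),
    "]": ("set", 100),
    "Shift-Up": ("add", 1),
    "Control-Up": ("add", 10),
    "Shift-Down": ("add", -1),
    "Control-Down": ("add", -10),
}


def apply_quality_key(quality: int, key: str) -> int:
    """Return the quality value after pressing one of the quality keys."""
    operation, value = QUALITY_KEYS[key]
    new_quality = value if operation == "set" else quality + value
    # Assumption: results are clamped to the valid 0..100 range.
    return max(0, min(100, new_quality))


print(apply_quality_key(30, "Control-Up"))
print(apply_quality_key(95, "Control-Up"))
```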

Finally note that quality values can also be made visible by clicking on the “Quality”
checkbutton at the top of the editor. This shows the quality by use of a grey scale.

1.4.3.3 Adjusting the alignment coordinates
On rare occasions we may need to move an entire sequence a small amount to achieve an
optimal alignment, rather than simply inserting or deleting pads.
This is achieved by using Control plus the left and right arrow keys while the editing
cursor is anywhere on the sequence.
Control Left-Arrow     Shift sequence left
Control Right-Arrow    Shift sequence right

1.4.3.4 Adjusting the Cutoff Data
Sequences typically consist of a good quality “used” portion and poor quality “clipped”
or “cutoff” portions at the 5’ and 3’ ends of the sequence. The ends are clipped because
they may contain enough errors that the sequence alignment algorithms are no longer
confident they have the correct bases aligned, or even that the sequence simply disagrees
too much. For short-read sequencing technologies, however, it is quite likely that there is
no cutoff data at all.
By default these are not shown, although you may see blank lines in the display as room
is left for this sequence even when it is not visible. The cutoff data may be displayed by
pressing the “Cutoffs” check-button at the top of the editor. The cutoff sequence will then
be displayed in grey. We call the boundary between the cutoff data and the used data the
cutoff position. These positions can be adjusted by pressing the “<” (left cutoff) or “>”
(right cutoff) keys. In both cases the cutoff point is between the base with the editing
cursor and the base to the left of the editing cursor.
Using the “<” and “>” keys with the editing cursor in the consensus performs bulk
versions of these edits by clipping every single sequence to that point. One small difference
here though is that the bulk versions only ever shrink cutoff data and do not grow it.
<    In sequence: set left cutoff position
>    In sequence: set right cutoff position

<    In consensus: bulk clip left cutoff
>    In consensus: bulk clip right cutoff

1.4.3.5 Summary of Editing Commands
A brief summary of these editing operations can be seen below:
Key              Location     Action
---------------------------------------------------
a,c,g,t,*        Reading      Change base
i, Insert        Reading      Insert pad
Delete           Reading      Delete * to left
Ctrl Delete      Reading      Delete any base to left

Control Left     Reading      Move reading left
Control Right    Reading      Move reading right

[                Reading      Set quality to 0
]                Reading      Set quality to 100
Shift Up         Reading      Incr. quality by 1
Shift Down       Reading      Decr. quality by 1
Ctrl Up          Reading      Incr. quality by 10
Ctrl Down        Reading      Decr. quality by 10
<                Reading      Set left cutoff
>                Reading      Set right cutoff

i, Insert        Consensus    Insert column of pads
Delete           Consensus    Delete * to left
Ctrl Delete      Consensus    Delete any base to left
<                Consensus    Bulk clip left cutoff
>                Consensus    Bulk clip right cutoff

1.4.4 Cut and Paste Control of Sequence
It is possible to highlight an area of a reading or the consensus sequence in preparation for
performing some further action upon it. Such examples of actions are: creating annotations
and pasting into a new window. We call these highlighted areas “selections”. They are
displayed as an underlined region.
The simplest way to make a selection is using the left mouse button. Pressing the mouse
button marks the base beneath the cursor as the start of the selection. Then, without
releasing the button, moving the mouse cursor adjusts the end of the selection. Finally
releasing the button will allow normal use of the mouse again. If while marking a selection
we reach the edge of the window then the editor will automatically start scrolling for us.
Sometimes we may wish to make a particularly long selection, or just extend an existing
selection after we’ve already released the mouse button. This can be done by using shift
left mouse button to adjust the end of the selection. Hence we can mark the start of the
selection using the left button, scroll along the contig to the desired position, and set the
end using the shift left button.
The selection is stored in the “clipboard”. This allows for the usual “cut and paste”
operations between applications, although the contig editor only supports this in one direction (as it is not possible to “paste” into the window). The mechanism employed for this
follows the usual X Windows standard of using the middle mouse button.
A quick summary of the mouse selection commands follows.
Left button                          Position editing cursor to mouse cursor
Left button (drag)                   Mark start and end of selection
Shift left button                    Adjust end of selection
Middle button (in another window)    Copy selected sequence

1.4.5 Selecting Sequences
The list named “readings” is used for all sequences selected in all editors. This is automatically updated whenever a sequence is selected or deselected.
Individual sequence names can be (de)selected by clicking on them with the left mouse
button, or by clicking and dragging out a region. This works well for a few sequences.
If you need to select all readings overlapping a specific consensus base or a region of
consensus bases, mark the range of the consensus you wish to select over by pressing and
dragging the left mouse button (as if you were going to create an annotation) and then
either right click in the consensus or use the Commands menu to choose Select Reads.
When using the Commands menu you get a dialogue asking for confirmation of the start
and end positions and the option of whether to select sequences that overlap this range
or only those which are entirely contained within that range. When using the right-click
popup on the consensus it simply takes the defaults (overlapping sequences).
Deselection follows the same procedure.
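The difference between the two Select Reads modes (overlapping versus entirely contained) can be sketched as below. The data layout and function name are hypothetical; read extents are taken to be inclusive padded positions.

```python
def select_reads(reads, start, end, contained_only=False):
    """reads: dict mapping name -> (first, last) inclusive positions.

    Returns the names selected for the consensus range start..end.
    """
    selected = []
    for name, (first, last) in reads.items():
        if contained_only:
            hit = start <= first and last <= end   # wholly inside the range
        else:
            hit = first <= end and last >= start   # any overlap with the range
        if hit:
            selected.append(name)
    return selected


reads = {"r1": (1, 50), "r2": (40, 120), "r3": (60, 80), "r4": (130, 200)}
print(select_reads(reads, 45, 100))
print(select_reads(reads, 45, 100, contained_only=True))
```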

1.4.6 Annotations
Annotations (or tags) can be placed at any position on readings or on the consensus. They
are usually used to record positions of primers for walking, or to mark sites, such as repeats
or compressions, that have caused problems during sequencing. Each annotation has a type
such as “primer”, a position, a length, a strand (forward, reverse or both) and an optional
comment. Each type and strand has an associated colour that will be shown on the display.
For information on searching for annotations see Section 2.6.6.4 [Searching by Tag Type],
page 175, and Section 2.6.6.3 [Searching by Annotation Comments], page 175.

FIXME: not all of the tag editor features are supported yet; specifically the Move/Copy
functionality is currently missing.
To create an annotation, make a selection and then select “Create Tag” from the contig
editor commands menu at the top of the editor or by pressing the right mouse button.
See Section 2.6.7 [The Commands Menu], page 177. This will bring up a further window;
the “tag editor” (shown above). The “Type:” button at the top of the editor invokes a
selectable list from which tag types can be chosen. See below.

Use this to select the desired type of annotation.
Next the strand of the annotation can be selected. This will be displayed as one of
“<---->”, “<----”, “---->” and “?----?” indicating both strands, top strand only, bottom
strand only, and stranded but unknown strand respectively. These mirror the GFF strand
definitions. The comment (the box beneath the buttons) can be edited using the usual
combination of keyboard input and arrow keys. The “Save” button will exit the tag editor
and create the annotation. To abandon editing without creating the annotation use the
“Cancel” button.
To edit an existing annotation, position the editing cursor within an annotation and select
“Edit Tag” from the commands menu. This will be a cascading menu, typically showing
one tag. If multiple tags coincide at the same sequence position you will be able to choose
which tag to edit. Once again the tag editor will be invoked and operates as before. The
F11 key is also a shortcut for editing the top-most tag underneath the editor cursor. When
editing, “Save” will save the changes and “Cancel” will abandon them.
Removing an annotation involves positioning the editing cursor within an annotation and
selecting “Delete Tag” from the commands menu. As with “Edit Tag” this is a cascading
menu to allow you to choose which tag at a specific point to delete. The F12 key is a shortcut
to remove the top-most tag underneath the editor cursor.
As usual, “undo” can be used to undo any of these annotation creations, edits and
removals.
Some tags may contain graphical controls instead of the usual text panel. These are
encoded with the master gap4/5 tag database (GTAGDB) by specifying the default tag
text to be a piece of “ACD” code. A full description of the (modified for gap4/5) ACD
syntax is not available currently, but it is strongly modelled on the EMBOSS ACD
syntax which has documentation at
http://www.emboss.org/Acd/index.html .
It is possible to add your own tag types by modifying either the system GTAGDB file
or creating your own GTAGDB file in your home directory (for all your databases) or the
current directory (for just those in that directory).
For rapid editing and deleting the F11 and F12 keys may be used. These edit and
delete the top-most tag underneath the editing cursor. If you wish to edit or delete the
tag underneath the mouse cursor instead (and hence save a mouse click) use Shift F11 and
Shift F12 for edit and delete.
The Control-Q key sequence may be used to toggle the displaying of tags. Pressing it
once will prevent all tags from being displayed in the editor. This is sometimes useful to see
any colouring information underneath the tag. Pressing Control-Q once more will redisplay
them.

1.4.6.1 Annotation Macros

For rapid annotating a series of 10 macros may be programmed. Press Shift and a
function key between F1 and F10 to bring up the macro editor. This looks much like the
normal tag editor except that Save is replaced with Save Macro and saving does not actually
create a tag on the sequence. To use the macro, highlight the bases you wish and press the
function key corresponding to that macro - F1 to F10. For a single base pair tag you do
not need to underline a region as the tag will automatically cover the base underneath the
editing cursor. To remember these permanently use the “Save Tag Macros” option in the
“Settings” menu.
If you have an existing tag you wish to rapidly duplicate to many places, use Control
plus a function key to copy the tag underneath the editing cursor to that numbered tag
macro. This is simply a short cut for Shift and the function key, but without needing to
manually replicate the tag type and textual comment.
You may find that some function keys are already programmed to do other things (such
as raise or lower windows), depending on the windowing environment in use. If this is the
case either modify the configuration of your windowing system or simply use another macro
key.

Shift F1-F10      Create a tag macro via a dialogue window
Control F1-F10    Create a tag macro from tag at editor cursor
F1-F10            Apply a tag macro (create a real tag)

1.4.7 Searching
The contig editor’s searching ability and its links to the consensus calculation algorithm
are crucial in determining the efficiency with which contigs can be checked and corrected.
The consensus is calculated “on the fly” and changes in response to edits. For editing, the
most important search functions are those which reveal problems in the consensus whilst
ignoring all bases that are adequately well determined. The standard search type is therefore
by consensus quality. By default this is done in the forward direction and for a quality value
of 30, although this is configurable by changing the following lines in the gap5rc file.
set_def CONTIG_EDITOR.SEARCH.DEFAULT_TYPE         consquality
set_def CONTIG_EDITOR.SEARCH.DEFAULT_DIRECTION    forward
set_def CONTIG_EDITOR.SEARCH.CONSQUALITY_DEF      30
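To check which search defaults a gap5rc file sets, the set_def lines can be read with a few lines of Python. This is only a sketch with a hypothetical function name: it handles simple “set_def NAME VALUE” lines like those above and makes no attempt to cover the rest of the gap5rc syntax.

```python
def parse_set_defs(text: str) -> dict:
    """Collect simple 'set_def NAME VALUE' lines into a dictionary."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        parts = line.split(None, 2)       # keyword, name, remainder as value
        if len(parts) == 3 and parts[0] == "set_def":
            settings[parts[1]] = parts[2]
    return settings


example = """\
set_def CONTIG_EDITOR.SEARCH.DEFAULT_TYPE         consquality
set_def CONTIG_EDITOR.SEARCH.DEFAULT_DIRECTION    forward
set_def CONTIG_EDITOR.SEARCH.CONSQUALITY_DEF      30
"""
settings = parse_set_defs(example)
print(settings["CONTIG_EDITOR.SEARCH.CONSQUALITY_DEF"])
```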

Pressing the “Search” button brings up a separate search window. This allows the user
to select the direction of search, the type of search, and a value to search on. The value
is entered into a value text box, then pressing the “search” button performs the search. If
successful, the cursor is positioned accordingly.

The Control-s and Control-r key bindings in the editor are equivalent to searching for
the next or previous match. Both key bindings will bring up the search window if it is not
currently displayed (without performing a search); otherwise they perform the search currently
selected in that window. Additionally, with the mouse focus in the search dialogue window, the
Page Up and Page Down keys will perform the previous and next search too.
As is described below, there are several search modes.

1.4.7.1 Search by Annotation Comments
This positions the cursor at the start of the next tag which has a comment containing the
string specified in the value box. The search performed is a regular expression search, and
certain characters have special meaning. Be careful when your string contains “.”, “*”, “[”,
“]”, “\”, “^” or “$”. The search can be performed either forwards or backwards from the
current cursor position. Searching with an empty value will find all tags.
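The effect of these special characters can be demonstrated with Python's re module. Gap5's own regular expression engine may differ in detail, so treat this purely as an illustration; the tag data and function name are invented for the example.

```python
import re


def find_tags_by_comment(tags, pattern, literal=False):
    """tags: list of (position, comment). Return positions whose comment matches."""
    if literal:
        pattern = re.escape(pattern)   # treat every character literally
    regex = re.compile(pattern)
    return [position for position, comment in tags if regex.search(comment)]


tags = [(10, "repeat [ALU]"), (55, "primer walk"), (90, "A or L or U")]

# As a regular expression, "[ALU]" means "any one of A, L or U".
print(find_tags_by_comment(tags, "[ALU]"))
# Escaped, it matches only the literal text "[ALU]".
print(find_tags_by_comment(tags, "[ALU]", literal=True))
# An empty pattern matches every comment, so all tags are found.
print(find_tags_by_comment(tags, ""))
```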

1.4.7.2 Search by Tag Type
This positions the cursor at the start of the next tag of the specified type. To change the
type, click on the currently listed tag type, which displays a tag type selection dialogue.
The search can be performed either forwards or backwards from the current cursor position.
To find all tags, use “Search by Annotation Comments”, with an empty text box.

1.4.7.3 Search by Padded Position
This jumps to a padded location in the editor and is directly equivalent to typing a number
into the position entry box in the bottom left corner of the editor followed by “p”.
It is also possible to do relative searches by prefixing the location with + or -. So +100
will skip ahead 100 bases.
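The forms of position entry described in this manual (a plain number, a “p” or “u” suffix, and a leading + or - for relative moves) could be interpreted along the following lines. The function name is hypothetical and the handling of a relative value combined with “u” is an assumption, as the manual does not describe that case.

```python
def parse_position(entry: str, current: int):
    """Return (target, is_unpadded) for a position entry.

    "1500"          absolute padded position
    "1500p"         the same, with an explicit padded suffix
    "1500u"         absolute unpadded position (to be converted before jumping)
    "+100" / "-100" relative to the current position
    """
    text = entry.strip()
    is_unpadded = text.endswith("u")
    if text.endswith(("u", "p")):
        text = text[:-1]
    if text.startswith(("+", "-")):
        # Assumption: relative moves are applied to the current coordinate as given.
        return current + int(text), is_unpadded
    return int(text), is_unpadded


print(parse_position("+100", 2000))
print(parse_position("1500u", 2000))
```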

1.4.7.4 Search by Unpadded Position
As per the padded search, but this jumps to an unpadded coordinate - essentially the number
of non-* bases since the start of the contig, regardless of whether the first consensus base
is labelled as base 1.

1.4.7.5 Search by Sequence
This positions the cursor at the start of the next segment of sequence that matches the
value specified in the text box. The search is case insensitive, ignores pads, and can allow
a specified number of mismatches. Unlike Gap4, Gap5’s sequence search only looks in the
consensus sequence. It also operates either forwards or backwards from the current editing
cursor position.
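A forward search with these properties (case insensitive, pads ignored, a permitted number of mismatches) can be sketched as follows. This is a naive illustration and not the algorithm Gap5 uses; the function name and the 0-based indices are assumptions.

```python
def search_consensus(consensus, query, start=0, max_mismatches=0):
    """Return the 0-based padded index of the first match at or after start, or -1."""
    query = query.upper()
    if not query:
        return -1
    # Keep (padded index, base) for real bases only, so that pads are skipped.
    bases = [(i, b.upper()) for i, b in enumerate(consensus) if b != "*"]
    for k in range(len(bases) - len(query) + 1):
        if bases[k][0] < start:
            continue
        window = bases[k:k + len(query)]
        mismatches = sum(1 for (_, base), q in zip(window, query) if base != q)
        if mismatches <= max_mismatches:
            return bases[k][0]
    return -1


consensus = "ACG*TTAGC*A"
print(search_consensus(consensus, "gt"))                      # match spans the pad
print(search_consensus(consensus, "GTA", max_mismatches=1))   # GTT accepted with one mismatch
```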

1.4.7.6 Search by Reading Name
This positions the cursor at the left end of the reading specified in the value text box.
Note that not all reading names may be indexed by Gap5 and that the search will not
find unindexed names. See tg_index -t for information on creating Gap5 databases with
reading name indices.
The reading name has to be an exact match and so currently does not find prefix strings.
If multiple sequences exist with the same name (which should be strongly discouraged) then
it is undefined which will be found first.

1.4.7.7 Search by Reference InDel
Note: this information may not be available in all scenarios. If you imported the gap5
database from a SAM or BAM file there is an implicit set of reference coordinates used within
SAM/BAM. Gap5 can keep track of the relationship between gap5’s padded coordinate
system and the reference coordinates. This function uses this data to search for the next
or previous reference insertion or deletion.

1.4.7.8 Search by Consensus Quality
This positions the cursor on the consensus at the next position where the quality of the
consensus is below a given threshold. The quality threshold should be entered into the value
box and should be within the range of 0 to 100 inclusive.
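In outline, the quality search scans away from the cursor for the first consensus position below the threshold. The sketch below assumes the consensus qualities are available as a simple list indexed from 0; the function name is hypothetical.

```python
def next_low_quality(qualities, cursor, threshold=30, forward=True):
    """Return the index of the next position with quality below threshold, or None."""
    if forward:
        candidates = range(cursor + 1, len(qualities))
    else:
        candidates = range(cursor - 1, -1, -1)
    for position in candidates:
        if qualities[position] < threshold:
            return position
    return None


qualities = [90, 85, 12, 70, 29, 99]
print(next_low_quality(qualities, 0))
print(next_low_quality(qualities, 4, forward=False))
```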

1.4.7.9 Search by Consensus Discrepancy
The consensus algorithm can keep track of the expected number of differences to the consensus given sequence depth and sequence quality values. This search looks for locations where
the actual number of differences exceeds the expected amount by more than a specified
factor.

1.4.7.10 Search by Consensus Heterozygosity
The consensus algorithm has a simple heterozygous calling method. Rather than simply
weighing up the evidence for the base being A, C, G, T or a pad it also considers that it
may be a combination of any two of these values. The consensus scores for the individual
bases as well as the highest scoring consensus base can be seen in the editor information
line when the mouse cursor is moved over a consensus base.
This search is looking for consensus bases where the best heterozygous score is greater
than or equal to the specified value.

1.4.7.11 Search by Low Coverage
This jumps to the next or previous location where the sequence coverage drops below a
specified value.

1.4.7.12 Search by High Coverage
This jumps to the next or previous location where the sequence coverage is higher than a
specified value. Regions of extreme depth are often an indication of misassemblies.
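Both coverage searches can be modelled by computing the depth at each position and scanning from the cursor. The sketch below is illustrative only: the function names are hypothetical and read extents are given as inclusive, 0-based (start, end) pairs.

```python
def coverage(reads, length):
    """Per-position depth for reads given as inclusive 0-based (start, end) pairs."""
    diff = [0] * (length + 1)
    for start, end in reads:
        diff[start] += 1        # depth rises where a read starts
        diff[end + 1] -= 1      # and falls just after it ends
    depth, total = [], 0
    for change in diff[:length]:
        total += change
        depth.append(total)
    return depth


def next_coverage(depth, cursor, value, low=True):
    """Next position after cursor with depth below (low) or above (high) value."""
    for position in range(cursor + 1, len(depth)):
        hit = depth[position] < value if low else depth[position] > value
        if hit:
            return position
    return None


depth = coverage([(0, 4), (2, 7), (3, 9)], 10)
print(depth)
print(next_coverage(depth, 4, 2, low=True))
print(next_coverage(depth, 0, 2, low=False))
```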

1.4.8 The Settings Menu
The purpose of this menu is to configure the operation of the contig editor. Settings can
be saved using the “Save settings” button, which also saves preferences for the editor width
and height and the location of the divider between the names and sequence panels. It does
not save tag macros though; these may be saved separately using the “Save tag macros” option.
Settings for the following options can be changed.
• Group Readings
• Highlight Disagreements
• By dots
• By foreground colour
• By background colour
• Case sensitive
• Set quality threshold
• Pack sequences
• Hide annotations
• Background stripes
• Show Mapping Quality
• Show Template Status
• Padded coordinates
• Reference coordinates
• Save tag macros
• Save settings

1.4.8.1 Group Readings
Sequences have an “X” location in the editor defined by the location within the contig that
they align to. The “Y” location though is determined by the sequence layout algorithm,
governed by the Pack Sequences setting and Group Readings options.
By default sequences are grouped into distinct technologies, typically with longer sequences up the top (capillary) and shorter ones at the bottom (Illumina, SOLiD). Within
these technology groups the sequences are then sorted by their start location, so the topmost sequences start earlier and the bottom most sequences start later.
The Group Readings menu allows user control over these primary and secondary collating
orders. The sorting methods are defined below.
By technology
Sorted in order of unknown, sanger (capillary), Illumina, SOLiD, 454.
By clipped start
Sorted by the visible (non-cutoff) start position.
By start

Sorted by the start position, regardless of whether the base is in cutoff data or
not.

By template
Sorted by template name. In Gap5 this is always defined to be a prefix of
the sequence name, or optionally the same as the sequence name. The sorting
method is using a simple ASCII collation order.
By strand Sorts data into the top strand first followed by the bottom strand data.
By base

This sort order is different from all others in that it depends on the location of
the editor cursor.
Sorts sequences by the base type overlapping the last editor cursor location in
the consensus. The collation order is A, C, G, T, N and *. Sequences that
do not overlap that consensus location or those that only overlap in the cutoff
portion are not sorted by this method. If this is used as the primary sort then
these other sequences will be sorted using the secondary sort. If the secondary
sort is By Base then an implicit tertiary sort order of By Start is used.
Note that moving the editing cursor around sequences will not update the Y
order. Only placement of the editing cursor on the consensus will update this.
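The primary and secondary collating orders amount to sorting on a two-part key. The sketch below covers four of the sort methods; the record layout, key names and function name are hypothetical, and unrecognised technologies are treated as “unknown” by assumption.

```python
# Technology order as listed above: unknown, sanger, Illumina, SOLiD, 454.
TECH_ORDER = {"unknown": 0, "sanger": 1, "illumina": 2, "solid": 3, "454": 4}

SORT_KEYS = {
    "technology": lambda r: TECH_ORDER.get(r["tech"].lower(), 0),
    "start": lambda r: r["start"],
    "template": lambda r: r["template"],          # simple string collation
    "strand": lambda r: 0 if r["strand"] == "+" else 1,
}


def layout_order(reads, primary="technology", secondary="start"):
    """Return reads in top-to-bottom display order."""
    first, second = SORT_KEYS[primary], SORT_KEYS[secondary]
    return sorted(reads, key=lambda r: (first(r), second(r)))


reads = [
    {"name": "ill_1", "tech": "Illumina", "start": 5, "template": "t2", "strand": "+"},
    {"name": "cap_2", "tech": "Sanger", "start": 40, "template": "t1", "strand": "-"},
    {"name": "cap_1", "tech": "Sanger", "start": 10, "template": "t3", "strand": "+"},
]
print([r["name"] for r in layout_order(reads)])
print([r["name"] for r in layout_order(reads, primary="strand")])
```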

1.4.8.2 Highlight Disagreements
This toggles between the normal sequence display (showing the current base assignments)
and one in which those assignments that differ from the consensus are highlighted. It makes
scanning for problems by eye much easier.
Several modes of highlighting are available. “By dots” displays only the good quality bases
that differ from the consensus; all other bases are displayed as full stops if they match or
colons if they mismatch but are of poor quality. The definition of poor quality here can be
adjusted using the “Set quality threshold” option of the Settings menu. The base colours are
as normal (i.e. reflecting tags and quality).
“By foreground colour” and “By background colour” display all base characters, but colour
those that differ from the consensus. Differences that are below the quality threshold are
shaded in light blue while high quality differences are dark blue. This allows easier visual
scanning of the context that a difference occurs in, but it may be wise to disable the
displaying of tags (hint: Control-Q toggles tags on and off).
Finally the “Case sensitive” toggle controls whether upper and lower case bases of the
same base type should be considered as differences.

1.4.8.3 Pack Sequences
This controls whether the editor allocates one row per sequence or whether it is permitted
to pack multiple sequences onto a single row, assuming they do not overlap.
The latter allows for a more compact plot which is desirable when dealing with short
sequences, however it has the side effect that the reading names can no longer be listed in
the names panel to the left.

1.4.8.4 Hide Annotations
Sometimes we need to see the background shading underneath an annotation, for example to
see the base quality or if we have Highlight Disagreements turned on using the by background
colour mode. This option simply hides all annotations from display until it is selected again
to reveal them once more.
The Control-Q keyboard shortcut has the same effect.

1.4.9 Primer Selection
The “Find Primer Walk” function from the Commands menu is an interface to the Primer3
program (built into Gap5, so it does not need an external installation). Currently it only
allows for selection of a single internal oligo suitable for “walking” along a template. It is
designed for manual finishing work and is not appropriate for automatic finishing. There
are plans to add PCR support in future.

The command brings up its own dialogue window.

The top portion of this window controls where to look for primers. By default it will be
either side of the editing cursor location. We also specify here what strand we wish to run
our experiment on.
Below this are a series of Primer3 parameters. Please see the Primer3 documentation
for a full description of these.
Upon hitting OK, and assuming that some primers can be found, a new window showing
the available choices is presented.

The primers shown are sorted by Primer3 score, with lower being better. Clicking on any
of the other headings in the table allows the data to be re-sorted by that column. Clicking
the left mouse button on any line will show the location of this primer in the main editor
window as an underlined region. It also updates the bottom half of the Oligos window with
further details.
At the bottom of the window are two editable selections. The left most labelled “Seq.
name to tag” allows us to pick a sequence we wish to place an oligo (OLIG) annotation on,
which defaults to the consensus sequence. The right selection box labelled “Template name”
is a list of identified templates at this region. This list is not necessarily exhaustive, as
it only includes the sequences at this position and may miss some read-pairs that span this
region. If you have a specific template in mind you can also type its name here.
Pressing the “Add annotation” button then creates an oligo annotation. The text associated with the annotation will depend on the primer chosen, but an example follows.
Sequence       AACACATGGTAAAGCAGATG
Template       zDH64-714h06
GC             40.0
Temperature    53.45
Score          1.54377204143
Date_picked    Thu Aug 12 17:31:18 BST 2010
Oligoname      ??

1.4.10 Traces
The original trace data from which the readings were derived can be displayed by double
clicking (two quick clicks) with the left or middle mouse button on the area of interest.
Control-t has the same effect. The trace will be displayed centred around the base clicked
upon and the name of the reading in the contig editor will be highlighted. Double clicking
on the consensus displays traces for all the readings covering that position.
Moving the mouse pointer over a trace base causes the display of an information line at
the bottom of the window. This gives the base type, its position in the sequence, and its
confidence value.
There are two forms of trace display which are selected using the “Compact” button at
the top of the Trace display. The compact form differs by not showing the Info, Diff, Comp.
and Cancel buttons at the left of each trace.
Note that Gap5 does not store the trace files in the project database: it stores only their
names and reads them when required. By default it will attempt to look for them in the
current working directory (likely the same directory as the gap database). However this
can be adjusted to look in other directories or via URLs using “Trace file location” in the
main Gap5 configure menu (see Section 2.20.8 [Trace File Location], page 302).

This figure is an example of the Trace Display showing three capillary traces and an
Illumina trace. On the top line, the Lock checkbutton keeps the trace data in sync with the
editor cursor position. The layout is controlled by the Columns and Rows selectors at the
top of the window; 2 columns by up to 3 rows in the above screenshot. Show confidence draws
coloured bars and a numerical value representing the quality of each individual base-call.
The main trace panels each have the sequence name displayed in the top left corner.
Below this are X and Y zoom controls on the left and the actual trace data on the right.
The style of this will depend on the type of trace. Sanger chromatograms take multiple
samples per base and are subsequently analysed (base-called) to identify the peaks and
the number/type of bases represented by that peak. These are drawn using smooth lines,
examples of which can be seen in the top row of the image above. Illumina GA instruments
are “clocked” in that each and every measurement corresponds to one base. These are
drawn using a stick plot, as seen in the bottom row of the screen-shot. Note that it is quite
likely you will not have the processed trace data available for Illumina GA sequences due
to size constraints, so the above is simply an example of what could be viewed rather than
a typical example.
454 instruments use pyro-sequencing and so produce a variable number of bases per measurement, with each measurement being clocked to a specific cycle (flow) on the sequencing
instrument. Hence 454 data is also drawn using a stick plot, although with potentially
multiple bases per measurement. An example is visible below.

The horizontal rulers in this plot correspond to normalised peak intensities for 1.0, 2.0
and so on to indicate 1, 2, 3... bases per flow. Clearly visible are flows of approximate
height 1 (C T A G T on the left), 2 (the following AA) and 0 (the G between the left most
C and T). Above these the confidence bars are visible.
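
The relationship between normalised flow intensity and the number of bases can be sketched as a simple rounding step. The Python below is only an illustration of that idea and not how any real 454 base-caller works; the flow order and intensity values are invented to mimic the flows described above.

```python
def flows_to_bases(flow_order, intensities):
    """Emit round(intensity) copies of the nucleotide used in each flow."""
    called = []
    for base, signal in zip(flow_order, intensities):
        count = max(0, int(signal + 0.5))  # round to the nearest whole base
        called.append(base * count)
    return "".join(called)


# Hypothetical flows: heights near 1 give one base, near 2 give two bases,
# and near 0 give none (for example the G flow between the first C and T).
flow_order = "CGTACGTA"
intensities = [1.02, 0.08, 0.95, 1.10, 0.04, 0.97, 1.05, 2.07]
print(flows_to_bases(flow_order, intensities))
```

Intensities that fall midway between two rulers are the ones where the base count is least certain.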
Right clicking on a trace will bring up a popup menu containing the following options.
Information
Displays some basic textual information about the trace. The information available will vary by trace type, but it may include details such as the length,
instrument and run-date.
Save

Saves the trace in ZTR format to a local file on disk. This can be useful
when you are using a remote service for fetching traces or extracting them from
an archive such as an .sff or .srf file.

Complement
Reverse complements the trace display. This does not modify data in any way,
but simply adjusts how it is drawn.
Quit

Removes this trace from the trace window. If it is the last displayed trace then
the window will be removed too.

1.4.11 The Editor Information Line
The very bottom line of the editor display is a text line used by the editor to display pieces
of useful information. Currently this gives information on individual bases, readings, the
contig, and tags, as the mouse is moved over the appropriate object. Each type of object
we move the mouse pointer over (sequence base, consensus base, sequence name panel,
annotation) has its own list of information to display which can be configured using a
format string stored in your $HOME/.gap5rc file.
Typically you will not need to modify these, but if you choose to do so the default values
to start from are shown below.
# Mouse-over a sequence in the reading name panel
set_def READ_BRIEF_FORMAT \
    Reading:%n(#%Rn) Tech:%V Length:%l(%L) MappingQ:%m%**/%*m Pos:%S%p / %*S%*p
# Mouse-over the "Consensus" label in the name panel
set_def CONTIG_BRIEF_FORMAT \
    Contig:%n(#%Rn) Length:%l Start:%s End:%e
# Mouse-over a base in a sequence
set_def BASE_BRIEF_FORMAT1 \
    Base %b confidence:%4.1c (Prob. %Rc, raw %4.1A %4.1C %4.1G %4.1T) Position %Rp
# Mouse-over a base in the consensus
set_def BASE_BRIEF_FORMAT2 \
    Base confidence:%4.1c (Prob. %Rc) A=%4.1A C=%4.1C G=%4.1G T=%4.1T *=%4.1* Position %p
# Mouse-over an annotation
set_def TAG_BRIEF_FORMAT \
    Tag type:%t Comment:"%.100c"

The text output is as listed above, but replacing percent-code strings with a relevant
piece of text. In many cases a capital R indicates raw mode to display a numerical value
instead of a string. For example %n in READ_BRIEF_FORMAT will be replaced by the
sequence name while %Rn will be replaced by the sequence record number. The full syntax
of percent expansion is as follows:
• A percent sign.
• An optional minus sign to request left alignment of the information. When the
information does not fill the entire field width allowed, it will by default be
right justified. Adding a minus character here requests left justification.
• An optional minimum field width. This is a decimal number indicating how much space
to leave for this information.
• An optional precision for numbers or maximum field width for strings. This is given
as a full stop followed by a decimal number.
• An optional ’R’ to specify Raw mode. This changes the meaning of many (but not all)
of the expansion requests to give a numerical representation of the data. For example
%n is a reading name and %Rn is a reading number.
• The expansion type itself. This is either one or two letters. See below for full details of
their meanings.
To programmers this syntax may seem very similar to printf. This is intentional, but
do not assume it is the same. Specifically the printf syntax of %#, %+ and %0 will not work.
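
As a rough illustration of the syntax just described, the following Python sketch parses and expands a format string. It is not the Gap5 implementation: it handles single-character expansion types only, and the lookup tables passed to the hypothetical expand function stand in for whatever the editor would supply.

```python
import re

# %  [-]  [minimum width]  [.precision]  [R]  type  (single-character types only)
SPEC = re.compile(r"%(-?)(\d*)(?:\.(\d+))?(R?)([A-Za-z%*#])")


def expand(fmt, values, raw_values):
    """Expand percent codes using caller-supplied lookup tables."""
    def substitute(match):
        left, width, precision, raw, code = match.groups()
        if code == "%":
            return "%"
        value = (raw_values if raw else values)[code]
        if isinstance(value, float):
            # Precision for numbers: digits after the decimal point.
            text = f"{value:.{int(precision)}f}" if precision else str(value)
        else:
            text = str(value)
            if precision:
                # Precision for strings: maximum field width.
                text = text[: int(precision)]
        width = int(width) if width else 0
        # Right justified by default; a minus sign requests left justification.
        return text.ljust(width) if left else text.rjust(width)

    return SPEC.sub(substitute, fmt)


values = {"n": "xc04a1.s1", "l": 295, "L": 474, "c": 37.5}
raw_values = {"n": 74}
print(expand("Reading:%n(#%Rn) Length:%l(%L)", values, raw_values))
print(expand("[%6.1c] [%-6.1c] 100%%", values, raw_values))
```

The first call reproduces part of the example output shown in the Reading Information section below; the second shows the default right justification next to the left justification requested by the minus sign.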

1.4.11.1 Reading Information
Used when we move the mouse over a sequence name in the names panel or a sequence
base-call. Example output is Reading:xc04a1.s1(#74) Tech:Sanger Length:295(474) MappingQ:50. Note that not all expansions make sense when used in the names panel as no
cursor X position is available.
%%      A single % sign

%n      Reading name. Raw mode: record number

Reading record number

%p      Position in sequence. Raw mode: position in contig.

%l      Clipped sequence length

%L      Unclipped sequence length

Start of clip

End of clip

%S      Sense (whether complemented) - “<<” or “>>”. Raw mode: 0/1

Strand - “+” or “-”. Raw mode: 0/1

%b      Base call

%c      Confidence value of called base (phred style). Raw mode: probability

%A
%C
%G
%T      Individual confidence (phred style) of A,C,G,T component in log-odds form.
        Raw mode: probability value.

%m      Mapping Quality. Raw mode: probability of correctly mapped.

%V      Instrument type - Sanger, Illumina, SOLiD, 454 or Unknown.

1.4.11.2 Contig Information
For the CONTIG_BRIEF_FORMAT and BASE_BRIEF_FORMAT2 the following expansions apply. These operate on contigs and the consensus sequence.
%%      Single % sign

%n      Contig name. Raw mode: contig record number.

Contig record number

%p      Position in contig

%l      Length of contig

%s      Contig start coordinate

%e      Contig end coordinate

Called consensus base

%c      Score for called consensus base. Raw mode: probability value

%A
%C
%G
%T
%*      Individual confidence for A,C,G,T,* base types in log-odds form. Raw mode:
        as a probability value.

1.4.11.3 Tag Information
The TAG_BRIEF_FORMAT string is used to display annotation summaries. The possible
percent encodings are as follows.
%%      Single % sign

Tag position

%t      Tag type (always 4 characters)

Tag length

Tag number (0 if unknown)

%c      Tag comment

1.4.12 The Join Editor
Contigs are joined interactively using the Join Editor. This is simply a pair of contig editor
displays stacked above one another. The top editor is flipped in Y so that the consensus
appears at the bottom. This allows the two consensus sequences to be adjacent to one
another, separated only by a “differences” line. Note that it is essential to align the contigs
over the full length of their overlap. It is much more difficult to achieve this after a join has
been made, and until the alignment is correct, the consensus sequence will be nonsense.
The few differences between the Join Editor and the Contig Editor can be seen in the
figure below. Otherwise all the commands and operations are the same as those for the
Contig Editor.

One difference is the Lock button. When set (as it is in the illustration) scrolling either
contig will also scroll the other contig.
The Align button aligns the overlapping consensus sequences and adds pads as necessary.
The alignment routine assumes that the two contigs are already in approximately the right
relative position (as they are immediately after the Join Editor has been invoked from Find
Internal Joins, or Find Repeats). If they are not you may get better results by manually
positioning them beforehand.
The “<” and “>” buttons either side of the “Align” button perform the alignment only from
the editing cursor to the start of the contig and only from the cursor to the end of the contig,
respectively. Alignment end-gaps are penalised at the cursor position but not at the
contig start/end position. These buttons are useful when multiple alignment
positions may be valid, such as is the case with an overlap consisting entirely of a short
tandem repeat.

It should be noted that each of the pair of editors comprising the Join Editor maintains
its own undo history, and using Align is likely to add to both undo histories. There is only
one Undo button, but it applies to the editor last clicked within. A hint is given as to which
of the two editors this is by highlighting the editor in a red border when the mouse is moved
over the Undo button.
Pressing the Join button will display a small dialogue box informing you of the length
and percentage match of the overlap between the two contigs. At this point you can decide
to make the join, to not make the join (both of which remove the editors from the screen)
or to cancel which leaves the join editor visible still to permit further editing.

1.4.13 Using Several Editors at Once
Several editors can be used simultaneously, even on the same contig. In the latter case, it
is useful to understand the difference between the data and the view of the data.
Each operating Contig Editor is a view of the data for a particular contig. With two
editors viewing the same contig, making changes in either will modify the data that both
are viewing, hence the change will be visible in both editors. Similarly, using Undo in either
will undo the changes to both.
Interaction between Contig Editors and Join Editors is more complicated and generally
isn’t advised. However such interactions work consistently with the notion of views of
contigs. For example, suppose there are two Contig Editors open on two separate contigs,
and in addition to these a Join Editor displaying both contigs. Making the join in the Join
Editor will update the two stand-alone Contig Editors so that they are each viewing the
correct positions in the new contig, even though they’re both now viewing the same contig.

1.4.14 Quitting the Editor
The Exit operation in the File menu quits the editor. If changes have been made since the
last save you will be asked whether you wish to save these changes. Answering “Cancel”
abandons the exit process and returns control to the editor; otherwise the appropriate
action will be taken and the editor will quit.

1.4.15 Summary
1.4.15.1 Keyboard summary for editing window
(“Left”, “Right”, “Up”, “Down” refer to the appropriate arrow keys.)
Page Up                          Scroll left by 1Kb
Shift-Page Up                    Scroll left by 10Kb
Control-Page Up                  Scroll left by 100Kb
Shift-Control-Page Up            Scroll left by 1Mb

Page Down                        Scroll right by 1Kb
Shift-Page Down                  Scroll right by 10Kb
Control-Page Down                Scroll right by 100Kb
Shift-Control-Page Down          Scroll right by 1Mb

Left arrow or Control-b          Move editing cursor left one base
Right arrow or Control-f         Move editing cursor right one base
Up arrow or Control-p            Move editing cursor up one base
Down arrow or Control-n          Move editing cursor down one base
Control-a or Home                Move editing cursor to start of sequence
Control-e or End                 Move editing cursor to end of sequence
Alt-comma                        Move editing cursor to start of contig
Alt-fullstop                     Move editing cursor to end of contig

Control-t                        Display trace
Control-s                        Search forward
Control-r                        Search backwards
Control-q                        Toggle tag display

<                                Set left cutoff clip point (in sequence)
>                                Set right cutoff clip point (in sequence)

<                                Bulk clip left cutoff (in consensus)
>                                Bulk clip right cutoff (in consensus)

[                                Set confidence to 0
]                                Set confidence to 100
Shift Up                         Increase confidence of base by 1
Shift Down                       Decrease confidence of base by 1
Control Up                       Increase confidence of base by 10
Control Down                     Decrease confidence of base by 10

a, c, g, t or *                  Overwrite base with a new call.
i or Insert                      Insert pad (or column if in consensus)
Backspace or Delete              Delete padding character
Ctrl-Backspace or Ctrl-Delete    Delete base (any base type)

Control-right arrow              Move sequence right 1 base-pair
Control-left arrow               Move sequence left 1 base-pair

F11                              Edit tag under editing cursor
F12                              Delete tag under editing cursor

Shift F1 to Shift F10            Edit tag macro 1 to 10
Control F1 to Control F10        Copy tag at editing cursor to macro 1 to 10
F1 to F10                        Create tag from macro 1 to 10

1.4.15.2 Mouse summary for editing window
Left button                      Position editing cursor to mouse cursor
Left button (drag)               Mark start and end of selection
Shift left button                Adjust end of selection
Left button (double click)       Display trace
Right button                     Display commands menu
Mouse-wheel                      Vertically scroll the editor
Control mouse-wheel              Vertically scroll the editor, fast
Shift mouse-wheel                Horizontally scroll the editor
Shift Control mouse-wheel        Horizontally scroll the editor, fast

1.4.15.3 Mouse summary for names window
Left button + drag               Copy sequence name to clip-board
Right button                     Display popup menu
Mouse-wheel                      Vertically scroll the editor
Control mouse-wheel              Vertically scroll the editor, fast

1.4.16 Plotting Restriction Enzymes
The restriction enzyme map function finds and displays restriction sites within a specified
region of a contig. It is invoked from the gap4 View menu. Users can select the enzyme
types to search for and can save the sites found as tags within the database.

This figure shows a typical view of the Restriction Enzyme Map in which the results for
each enzyme type have been configured by the user to be drawn in different colours. On
the left of the display the enzyme names are shown adjacent to the lines of plotted results.
If no result is found for a particular enzyme (e.g. APAI here), the line will still be drawn
so that zero cutters can be identified. Three of the enzyme types have been selected and
are shown highlighted. The results can be scrolled vertically (and horizontally if the plot is
zoomed in). A ruler is shown along the base and the current cursor position (the vertical
black line) is shown in the left hand box near the top right of the display. If the user clicks,
in turn, on two restriction sites their separation in base pairs will appear in the top right
hand box. Information about the last site touched is shown in the Information line at the
bottom of the display. At the top the edit menu is shown torn off and can be used to create
tags for highlighted enzyme types.

1.4.16.1 Selecting Enzymes
Restriction enzyme names and their cut sites are stored in files on disk. For the format
of these files and notes about creating new ones see Section 11.4 [Restriction enzyme files],
page 566.
When the file is read, the list of enzymes is displayed in a scrolling window. To select
enzymes press and drag the left mouse button within the list. Dragging the mouse off the
bottom of the list will scroll it to allow selection of a range larger than the displayed section
of the list. When the left button is pressed any existing selection is cleared. To select several
disjoint entries in the list press control and the left mouse button. Once the enzymes have
been chosen, pressing OK will create the plot.

1.4.16.2 Examining the Plot
Positioning the cursor over a match will cause its name and cut position to appear in
the information line. If the right mouse button is pressed over a match, a popup menu
containing Information and Configure will appear. The Information function in this menu
will display the data for this cut site and enzyme in the Output Window.
It is possible to find the distance between any two cut sites. Pressing the left mouse
button on a match will display "Select another cut" at the bottom of the window. Then,
pressing the left button on another match will display the distance, in bases, between the
two sites. This is shown in a box located at the top right corner of the window.

1.4.16.3 Reconfiguring the Plot
The plot displays the results for each restriction enzyme on a separate line. Enzymes with
no sites are also shown. The order of these lines may be changed by pressing and dragging
the middle mouse button or alt + left mouse button on one of the displayed names at the
left side of the screen.
The results are plotted as black lines but users can select colours for each enzyme type
by pressing the right button on any of its matches. A menu containing Information and
Configure will pop up. Configure will display a colour selection dialogue. Adjusting the
colour here will adjust the colour for all matches for this restriction enzyme.

1.4.16.4 Textual Outputs
The Results menu of the plot contains options to list the restriction enzyme sites found.
One option sorts the results by enzyme name and the other by the positions of the matches.
The output below shows the textual output from "Output enzyme by enzyme". The
Fragment column gives the size of the fragments between each of the cut sites. The Lengths
column contains the fragment sizes sorted on size.
Contig zf98g12.r1 (#801)
Number of enzymes = 3
Number of matches = 7
  Matches found=     1
       Name        Sequence     Position  Fragment  lengths
     1 AATII       GACGT'C          7130      7129      556
                                               556     7129
  Matches found=     5
       Name        Sequence     Position  Fragment  lengths
     1 ACCI        GT'CGAC           414       413      189
     2 ACCI        GT'CTAC          1296       882      413
     3 ACCI        GT'CTAC          3871      2575      882
     4 ACCI        GT'CTAC          5816      1945     1681
     5 ACCI        GT'CGAC          7497      1681     1945
                                               189     2575
  Matches found=     1
       Name        Sequence     Position  Fragment  lengths
     1 AHAII       GA'CGTC          7127      7126      559
                                               559     7126
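
The arithmetic behind the Fragment and Lengths columns can be reproduced with a few lines of Python. This is a sketch inferred from the sample output rather than the program's own code; in particular the contig length of 7685 is an assumption, obtained by summing the fragments listed for any one enzyme.

```python
def fragment_lengths(cut_positions, contig_length):
    """Return fragment sizes, left to right, for the given cut positions."""
    cuts = sorted(cut_positions)
    if not cuts:
        return [contig_length]  # a zero cutter leaves the contig whole
    fragments = [cuts[0] - 1]                              # contig start to first cut
    fragments += [b - a for a, b in zip(cuts, cuts[1:])]   # between neighbouring cuts
    fragments.append(contig_length - cuts[-1] + 1)         # last cut to contig end
    return fragments


# Cut positions for ACCI taken from the sample output; 7685 is the inferred contig length.
acci = [414, 1296, 3871, 5816, 7497]
frags = fragment_lengths(acci, 7685)
print(frags)          # the Fragment column
print(sorted(frags))  # the Lengths column
```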

The output below shows the textual output from "Output ordered on position".
Contig zf98g12.r1 (#801)
Number of enzymes = 3
Number of matches = 7
       Name        Sequence     Position  Fragment  lengths
     1 ACCI        GT'CGAC           414       413        3
     2 ACCI        GT'CTAC          1296       882      189
     3 ACCI        GT'CTAC          3871      2575      367
     4 ACCI        GT'CTAC          5816      1945      413
     5 AHAII       GA'CGTC          7127      1311      882
     6 AATII       GACGT'C          7130         3     1311
     7 ACCI        GT'CGAC          7497       367     1945
                                               189     2575

1.5 Importing and Exporting Data
1.5.1 Assembly
There are two main types of assembly, de novo and mapped, with the latter not really
being a true assembly at all.
De novo assembly consists of assembling DNA fragments, typically without knowing
any of the target sequence. Hence it compares sequence fragments against each other
in order to form contigs. Mapped assembly makes use of a known reference sequence and
compares all sequence fragments against the reference, which is a far simpler and faster
process than de novo assembly.
Gap5, however, has neither de novo nor mapped assembly built in. Instead it relies on
running standard external command-line tools. At present this consists purely of using
bwa for a mapped assembly, but in future this will be expanded upon.
This means that the Assembly menu currently only contains a “Map Reads” sub-menu,
which in turn has multiple choices for bwa usage. You will not be directly able to join contigs
using these facilities or to fill holes in the contig, although this is possible by manually
following some of the steps outlined below and using an alternate step for generating the
SAM file.

1.5.1.1 Importing with tg_index
To enable efficient editing of data, Gap5 needs its own database format for storing sequence
assemblies. Formats such as BAM are good at random access for read-only viewing, but
are not at all amenable to actions such as reverse complementing a contig and joining it to
another.
Hence we need a tool that can take existing assembly formats and convert them to a
form suitable for Gap5. The tg_index program performs this task. It is strictly a command
line tool, although in some specific cases Gap5 has basic GUI dialogues to wrap it up.
One or more input files may be specified. The general form is:
tg_index [options] -o gap5_db_name input_file_name ...
An example usage is:
tg_index -z 16384 -o test_data.g5 test_data.bam
gap5 test_data.g5 &
File formats supported are SAM, BAM, ACE, MAQ (both short and long variants), CAF,
BAF, Fasta and Fastq. The latter two have no assembly and/or alignment information so
they are simply loaded as single-read contigs instead. tg_index typically detects the type
of file automatically, but in rare cases you may need to explicitly state the input file type.
tg_index options:
-o filename
Creates a gap5 database named filename and filename.aux. If not specified the
default is “g_db”.

-a
Append to an existing database, instead of creating a new one (which is the
default action).

-n
When appending, the default behaviour is to add reads to existing contigs if
contigs with the appropriate names already exist. This option always forces
creation of new contigs instead.

-g
When appending to an existing database, assume that the alignment has been
performed against an ungapped copy of the consensus exported from this database.
(This is internally used when performing mapped assemblies as they
consist of exporting the consensus, running the external mapped alignment
tool, and then importing the newly generated alignments.)

-m
-M
Forces the input to be treated as MAQ. Both short (-m) and long (-M) formats
are supported. By default the file format is automatically detected.

-A
Forces the input to be treated as ACE format.

-B
Forces the input to be treated as BAF format.

-C
Forces the input to be treated as CAF format.

-b
-s
Forces the input to be treated as BAM (-b) or SAM (-s) format. SAM must
have @SQ headers present. Both need to be sorted by position.

-z bin_size
Modifies the size of the smallest allowable contig bin. Large contigs will contain
child bins, each of which will contain smaller bins, recursing down to a minimum
bin size. Sequences are then placed in the smallest bin they entirely fit
within. The default minimum bin size is 4096 bytes. For very shallow assemblies
increasing this will improve performance and decrease the disk space used.
Ideally 5,000 to 10,000 sequences per bin is an approximate figure to aim for.

-u
Store unmapped reads only (from SAM/BAM only).

-x
Store SAM/BAM auxiliary key:value records too.

-p
-P
Enable (-p) or disable (-P) read-pairing. By default this is enabled. The purpose
of this is to link sequences from the same template to each other such that gap5
knows the insert size and read-pairings. Generally this is desirable, but it adds
extra time and memory to identify the pairs. Hence for single-ended runs the
option exists to disable attempts at read-pairing.

-f
Attempt a faster form of read-pairing. In this mode we link the second occurrence
of a template to the first occurrence, but not vice versa. This is sufficient
for the template display graphical views to work, but will cause other parts of
the program to behave inconsistently. For example the contig editor “goto...”
popup menu will sometimes be missing.

-t
-T
Controls whether to index (-t) or not (-T) the sequence names. By default this
is disabled. Adding a sequence name index permits us to search by sequence
name or to use a sequence name in any dialogue that requires a contig identifier.
However it consumes more disc space to store this index and it can be time
consuming to construct it.

-r nseq
Reserves space for at least nseq sequences. This generally isn’t necessary, but
if the total number of records extends above 2 million (equivalent to 2 billion
sequences, or less if we have lots of contigs, bins and annotation records to write)
then we run out of suitable sequence record numbers. This option preallocates
the lower record numbers and reserves them solely for sequence records.

-c compression_method
Specifies an alternate compression method. This defaults to zlib, but can be set
to either none for fastest speed or lzma for best compression.

1.5.1.2 Importing fasta/fastq files
Sometimes we have a few individual sequences we wish to import as single-read contigs.
That is we won’t align them against each other or against existing data, but just load them
into our gap5 database so we can then run tools such as Find Repeats or Find Internal
Joins on them. (This can be ideal for importing consensus sequences.)
The “Import Fasta/Fastq as single-read contigs” function is designed for this purpose.
Behind the scenes it is nothing more than running tg_index -a to add a fasta or fastq file.

1.5.1.3 Mapped assembly by bwa aln
This function runs the bwa program using the “aln” method for aligning sequences. It is
appropriate for matching most types of short-read data.
The GUI is little more than a wrapper around command line tools, which can essentially
be repeated manually as follows.
1. Calculate and save the consensus for all contigs in the database in fastq format.
2. Index the consensus sequence using “bwa index”.
3. Map our input data against the bwa index using “bwa aln”. Repeat for reverse matches
too.
4. Generate SAM format from the alignments using “bwa samse” or “bwa sampe”.
5. Convert to BAM and sort by position.
6. Import the BAM file, appending to the existing gap5 database (equivalent to tg_index
-a).
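
The steps above can be sketched as a list of shell commands, built here in Python so that they can be inspected before being run. This is an outline under several assumptions and not what the GUI literally executes: the file names are hypothetical, step 1 (saving the consensus) is taken to have been done already, the run is single-ended (hence “bwa samse”), and samtools is assumed as the tool used for sorting and BAM conversion. The -g option described earlier may also be relevant when the consensus contains pads.

```python
import shlex


def bwa_aln_commands(consensus, reads, database, prefix="mapped"):
    """Return the shell commands for one single-ended bwa aln mapping run."""
    q = shlex.quote
    sai, sam, bam = f"{prefix}.sai", f"{prefix}.sam", f"{prefix}.bam"
    return [
        f"bwa index {q(consensus)}",                                 # step 2
        f"bwa aln {q(consensus)} {q(reads)} > {q(sai)}",             # step 3
        f"bwa samse {q(consensus)} {q(sai)} {q(reads)} > {q(sam)}",  # step 4
        f"samtools sort -o {q(bam)} {q(sam)}",                       # step 5 (samtools is an assumption)
        f"tg_index -a -o {q(database)} {q(bam)}",                    # step 6
    ]


commands = bwa_aln_commands("consensus.fastq", "reads.fastq", "test_data.g5")
for command in commands:
    print(command)
```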

1.5.1.4 Mapped assembly by bwa dbwtsw
This function runs the bwa program using the “dbwtsw” method for aligning sequences.
This should be used when attempting to align longer sequences or data with lots of indels.
The GUI is little more than a wrapper around command line tools, which can essentially
be repeated manually as follows.
1. Calculate and save the consensus for all contigs in the database in fastq format.
2. Index the consensus sequence using “bwa index”.
3. Map our input data against the bwa index using “bwa dbwtsw”.

4. Convert to BAM and sort by position.
5. Import the BAM file, appending to the existing gap5 database (equivalent to tg_index
-a).

1.5.2 Importing GFF
Annotations within GFF files can be imported to Gap5 as annotations (sometimes referred
to as tags). The “Import GFF Annotations” function in the main File menu performs
this task. Note that in order for this to work the contigs should not have been edited or
complemented since the GFF file was created, otherwise the coordinates in the GFF file
will not match.
One caveat to this relates to sequence gaps. By default consensus gaps/padding characters are excluded from the contig consensus sequences when counting GFF sequence
coordinates. In some cases we may wish to support annotations in a gapped sequence, so
the “GFF coordinates are already padded” checkbox may be used to disable this coordinate
de-padding process.

1.5.3 Export Tags
This dialogue allows annotations (“tags”) to be written to disk as a GFF version 3 file.
Currently this just uses the GFF “remark” type, but there are plans to support a
wider variety of GFF types.

By default the coordinates generated are de-padded, such that “*”s in the consensus
sequence are not counted when identifying the coordinate of an annotation. This may be
disabled by deselecting the “Unpadded coordinates” checkbox.
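The de-padding of coordinates can be pictured with a small sketch. This is not Gap5's own code: the function name is hypothetical, positions are taken to be 1-based, and we assume that a position falling on a pad maps to the preceding real base.

```python
def unpadded_position(consensus: str, padded_pos: int) -> int:
    """Map a 1-based padded consensus position to a 1-based unpadded one.

    Pads ("*") at or before the position are not counted. Assumption: a
    position that lands on a pad maps to the preceding real base (0 if
    there is none).
    """
    if not 1 <= padded_pos <= len(consensus):
        raise ValueError("position outside the consensus")
    return padded_pos - consensus.count("*", 0, padded_pos)


consensus = "GATTC*AAGAC"
print(unpadded_position(consensus, 5))   # 5: no pads before the C
print(unpadded_position(consensus, 7))   # 6: one pad precedes this A
```

An annotation spanning padded positions 5 to 7 would therefore be written with unpadded coordinates 5 to 6.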
The object a tag is attached to is typically the contig it is within, with the contig name
being used in the first column of the GFF file. This applies even for annotations placed on
a sequence rather than the consensus. This feature may also be disabled by deselecting the
“Map sequence tags to consensus” checkbox.
Example GFF output follows, with “...” to denote lines truncated for illustrative purposes.
Contig6  gap5  remark  4745  4745  .  type=COMM;Note=Possible SNP?
Contig2  gap5  remark  3178  3196  .  type=OLIG;Note=Template%09xb63f10%0AOligona

Note we can see URL style percent encoding being used to avoid GFF format metacharacters, as per the GFFv3 specification.
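To read such a file back, the attribute column has to be split and percent-decoded. The sketch below uses Python's standard urllib.parse.unquote for the decoding. The function name parse_gff_attributes is hypothetical, and the input is the truncated attribute text from the example above.

```python
from urllib.parse import unquote


def parse_gff_attributes(column9: str) -> dict:
    """Split a GFF3 attribute column into decoded key/value pairs."""
    attributes = {}
    for field in column9.split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        attributes[unquote(key)] = unquote(value)
    return attributes


tags = parse_gff_attributes("type=OLIG;Note=Template%09xb63f10%0AOligona")
print(repr(tags["Note"]))   # 'Template\txb63f10\nOligona'
```

Here %09 decodes to a tab and %0A to a newline, which is how a multi-line tag comment survives in a single GFF column.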

1.5.4 Export Sequences
This function exports sequence and annotation data from a Gap5 database to a variety of
assembly formats.

The fasta and fastq formats are basic sequence-only or sequence plus quality, with no
support for contigs or alignments. The BAF, CAF, ACE and SAM formats all hold assembly data and so are reasonably complete representatives of data within Gap5. Note that
ACE does not directly support quality values and this export function does not create the
associated phdball file that houses this data.
There is also no direct support for BAM, however command line tools like samtools or
picard can convert the SAM file into BAM format. The SAM file should already be sorted
by position.
For SAM only there are additional options: whether to fix mate-pair information and
whether to use depadded coordinates. The former will ensure that the MRNM (Mate
Reference Name), MPOS and ISIZE fields are filled out. Note that this considerably slows
down exporting, so it is disabled by default.

1.6 Finding Sequence Matches

1.6.1 Find Internal Joins
The purpose of this function (which is invoked from the Gap5 View menu) is to use sequences
already in the database to find possible joins between contigs. Generally these will be joins
that were missed or judged to be unsafe during assembly and this function allows users to
examine the overlaps and decide if they should be made. During assembly joins may have
been missed because of poor data, or not been made because the sequence was repetitive.
Also it may be possible to find potential joins by extending the consensus sequences with
the data from the 3’ ends of readings which was considered to be too unreliable to align
during assembly i.e. we can search in the "hidden data".

If it has not already occurred, use of this function will automatically transform the
Contig Selector into the Contig Comparator. Each match found is plotted as a diagonal
line in the Contig Comparator, and is written as an alignment in the Output Window. The
length of the diagonal line is proportional to the length of the aligned region. If the match
is for two contigs in the same orientation the diagonal will be parallel to the main diagonal;
if they are not in the same orientation the line will be perpendicular to the main diagonal.
The matches displayed in the Contig Comparator can be used to invoke the Join Editor (see
Section 2.6.15 [The Join Editor], page 196) or Contig Editor. See Section 2.6 [Editing in
gap5], page 160. Alternatively, the "Next" button at the top left of the Contig Comparator
can be used to select each result in turn, starting with the best, and ending with the worst.
When this is in use, users can find the match in the Contig Comparator which corresponds
to the next result by placing the cursor over the Next button. The plotted match and the
contigs involved will turn white.

A typical display from the Contig Comparator is shown in the figure above.
To define the match all numbering is relative to base number one in the contig: matches
off the left end of the contig (i.e. in the hidden data) have negative positions, and matches
off the right end of the contig (also in the hidden data) have positions greater than the
contig length.
The convention for reporting the positions of overlaps is as follows: if neither contig needs
to be complemented the positions are as shown. If the program says "contig x in the - sense"
then the positions shown assume contig x has been complemented. For example, in
the results given below the positions for the first overlap are as reported, but those for the
second assume that the contig in the minus sense (i.e. 443) has been complemented.

Possible join between contig 445 in the + sense and contig 405
Percentage mismatch after alignment = 4.9
               412        422        432        442        452        462
    405 TTTCCCGACT GGAAAGCGGG CAGTGAGCGC AACGCAATTA ATGTGAG,TT AGCTCACTCA
         ::::::::: : ::::::::  ::::: ::: :::::::::: :::::::::: ::::::::::
    445 *TTCCCGACT G,AAAGCGGG TAGTGA,CGC AACGCAATTA ATGTGAG*TT AGCTCACTCA
              -127       -117       -107        -97        -87        -77
               472        482        492        502        512
    405 TTAGGCACCC CAGGCTTTAC ACTTTATGCT TCCGGCTCGT AT
        :::::::::: :::::::::: :::::::::: :::::::::: ::
    445 TTAGGCACCC CAGGCTTTAC ACTTTATGCT TCCGGCTCGT AT
               -67        -57        -47        -37        -27
Possible join between contig 443 in the - sense and contig 423
Percentage mismatch after alignment = 10.4
                64         74         84         94        104        114
    423 ATCGAAGAAA GAAAAGGAGG AGAAGATGAT TTTAAAAATG AAACG*CGAT GTCAGATGGG
        :::: ::::: :::::::::: :::::::::: ::::::  :: ::::: :::: :::::::::
    443 ATCG,AGAAA GAAAAGGAGG AGAAGATGAT TTTAAA,,TG AAACGACGAT GTCAGATGG,
              3610       3620       3630       3640       3650       3660
               124        134        144        154        164
    423 TTG*ATGAAG TAGAAGTAGG AG*AGGTGGA AGAGAAGAGA GTGGGA
        ::: :::::: :::::::::: :: :::::::  ::: ::::: :: ::
    443 TTGGATGAAG TAGAAGTAGG AGGAGGTGGA ,GAG,AGAGA GTTGG*
              3670       3680       3690       3700       3710

1.6.1.1 Find Internal Joins Dialogue

The contigs to use in the search can be defined as "all contigs", a list of contigs in a file
"file", or a list of contigs in a list "list". If "file" or "list" is selected the browse button
is activated and gives access to file or list browsers. Two types of search can be selected:
"Probe all against all" compares all the contigs defined against one another, while
"Probe with single contig" compares one contig against all the contigs in the list. If the
latter option is selected the Contig identifier panel in the dialogue box is enabled. Both
senses of the sequences are compared.
If users elect not to "Use standard consensus" they can either "Mark active tags" or
"Mask active tags", in which case the "Select tags" button will be activated. Clicking on
this button will bring up a check box dialogue to enable the user to select the tag types
they wish to activate. Masking the active tags means that all segments covered by tags that
are "active" will not be used by the matching algorithms. A typical use of this mode is to
avoid finding matches in segments covered by tags of type ALUS (i.e. segments thought to
be Alu sequence) or REPT (i.e. segments that are known to be repeated elsewhere in the data)
(see Section 2.2.7.1 [Tag types], page 121). "Marking" is of less use: matches will be found
in marked segments during searching, but in the alignment shown in the Output Window,
marked segments will be shown in lower case.
Some alignments may be very large. For speed and ease of scrolling Gap5 does not
display the textual form of the longest alignments, although they are still visible within the
contig comparator window. The maximum length of the alignment to print up is controlled
by the “Maximum alignment length to list (bp)” control.
The default setting for the consensus is to "Use hidden data" which means that where
possible the contigs are extended using the poor quality data from the readings near their
ends. To ensure that this additional data is not so poor that matches will be missed, the
program uses algorithms which can be configured from the "Edit hidden data parameters"
dialogue. Two algorithms are available. Both slide a window along the reading until a set
criterion is met. By default an algorithm which sums confidence values within the window is
used. It stops when a window with an average confidence below "Minimum average confidence" is found. The other
algorithm counts the number of uncalled bases in the window and stops when the total
reaches "Max number of uncalled bases in window". The selected algorithm is applied to
all the readings near the ends of contigs and the data that extends the contig the furthest
is added to its consensus sequence.
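The two window tests can be sketched as follows. This is an illustration rather than Gap5's implementation: the function names are hypothetical, the window is moved one base at a time, and we assume the usable hidden data ends just before the first window that fails the test.

```python
from typing import Callable, Sequence


def _first_failing_window(values: Sequence[int], window: int,
                          fails: Callable[[int, int], bool]) -> int:
    """Return how many leading values precede the first failing window."""
    n = len(values)
    if n == 0:
        return 0
    window = min(window, n)
    total = sum(values[:window])
    start = 0
    while not fails(total, window):
        if start + window >= n:
            return n                      # no window failed: use everything
        total += values[start + window] - values[start]
        start += 1
    return start


def hidden_length_by_confidence(confidences: Sequence[int], window: int,
                                min_average: float) -> int:
    """Stop at the first window whose average confidence is too low."""
    return _first_failing_window(
        confidences, window, lambda total, w: total / w < min_average)


def hidden_length_by_uncalled(bases: str, window: int, max_uncalled: int) -> int:
    """Stop at the first window holding max_uncalled or more non-ACGT bases."""
    uncalled = [0 if b in "ACGTacgt" else 1 for b in bases]
    return _first_failing_window(
        uncalled, window, lambda total, w: total >= max_uncalled)


print(hidden_length_by_confidence([40, 38, 35, 30, 12, 8, 5, 3], 4, 20))  # 3
print(hidden_length_by_uncalled("ACGTNACNNGT", 5, 2))                     # 3
```

In the first call the windows starting at bases 1, 2 and 3 average 35.75, 28.75 and 21.25, and the window starting at base 4 averages 13.75, so three bases of hidden data would be kept.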
If your total consensus sequence length (including a 20 character header for each contig
that is used internally by the program) plus any hidden data at the ends of contigs is greater
than the current value of a parameter called maxseq, Find Internal Joins may produce an
error message advising you to increase maxseq. Maxseq can be set on the command line
(see Section 2.21 [Command line arguments], page 306) or by using the options menu (see
Section 2.20.3 [Set Maxseq], page 299).
The search algorithms first find matching words of length "Word length", and only
consider overlaps of length at least "Minimum overlap". Only alignments better than
"Maximum percent mismatches" will be reported.
There are three search algorithms: “Sensitive”, “Quick” and “Fastest”. The quick or
fastest algorithm should be applied first, and then the sensitive one employed to find any
less obvious overlaps.
The sensitive algorithm sums the lengths of the matching words of length "Word length"
on each diagonal. It then finds the centre of gravity of the most significant diagonals.
Significant diagonals are those whose probability of occurrence is < "Diagonal threshold". It
then uses a dynamic programming algorithm to align around the centre of gravity, using a
band size of "Alignment band size (percent)". For example, if the overlap was 1000 bases
long and the percentage set at 5, the aligner would only consider alignments within 50 bases
either side of the centre of gravity. The larger the percentage and the overlap,
the slower the alignment.
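The first stage of the sensitive search and the band-size arithmetic can be sketched as below. This is a simplified illustration, not the Gap5 code: the function names are hypothetical, the diagonal is taken as the position in the first sequence minus the position in the second, and the significance test and centre of gravity calculation are omitted.

```python
from collections import defaultdict
from typing import Dict


def diagonal_word_scores(a: str, b: str, word_length: int) -> Dict[int, int]:
    """Sum the lengths of matching words on each diagonal (i - j)."""
    index = defaultdict(list)
    for i in range(len(a) - word_length + 1):
        index[a[i:i + word_length]].append(i)
    scores: Dict[int, int] = defaultdict(int)
    for j in range(len(b) - word_length + 1):
        for i in index.get(b[j:j + word_length], ()):
            scores[i - j] += word_length
    return dict(scores)


def band_half_width(overlap_length: int, band_percent: int) -> int:
    """Bases considered either side of the centre of gravity."""
    return overlap_length * band_percent // 100


print(diagonal_word_scores("ACGTACGGTC", "TACGGT", 4))  # {3: 12}
print(band_half_width(1000, 5))                         # 50
```

In the first call three overlapping 4-base words all fall on diagonal 3, so that diagonal scores 12. The second call reproduces the 1000 base, 5 percent example from the text.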
The fastest and quick algorithms can find overlaps and align 100,000 base sequences in a
few seconds by considering, in their initial phase, only matching segments of length "Minimum
initial match length". However they perform a dynamic programming alignment of all the chunks
between the matching segments, and so produce an optimal alignment. Again a banded
dynamic programming algorithm can be selected, but as this only applies to the chunks between
matching segments, which for good alignments will be very short, it should make little
difference to the speed. The fastest and quick methods only differ in how aggressively they
prune potential alignments before entering the dynamic programming phase.
After the search the results will be sorted so that the best matches are at the top of a list
where best is defined as a combination of alignment length and alignment percent identity.
This list can be stepped through, one result at a time using the Contig Joining Editor, by
clicking on the "Next" button at the top left of the Contig Comparator.

1.6.2 Find Repeats
The purpose of this function (which is invoked from the Gap5 View menu) is to find exact
repeats in contig consensus sequences. An exact repeat is defined as a run of consecutive
identical ACGT characters; no mismatches or gaps are permitted.
If it has not already occurred, selection of this function will automatically transform
the Contig Selector into the Contig Comparator. See Section 2.4 [Contig Comparator],
page 126. Each match found is plotted as a diagonal line in the Contig Comparator. The
length of the diagonal line is proportional to the length of the match.
If the match is for two contigs in the same orientation the diagonal will be parallel to
the main diagonal; if they are not, the line will be perpendicular to the main diagonal. The
matches displayed in the Contig Comparator can be used to invoke the Join Editor (see
Section 2.6.15 [The Join Editor], page 196) or Contig Editors (see Section 2.6 [Editing in
Gap5], page 160), and an Information button will display data about the match in the
Output window. e.g.
Repeat match
From contig xb54a3.s1(#26) at 78
With contig xb62h3.s1(#3) at 1
Length 37
This means that position 78 in the contig with xb54a3.s1 (reading number 26) at its left
end matches 37 bases at position 1 in the contig with xb62h3.s1 (number 3) at its left end.

Users can elect to search a "single" contig, or compare "all contigs", or a subset of
contigs defined in a list or a file. If "file" or "list" is selected the browse button is activated
and gives access to file or list browsers. If they choose to analyse a single contig the
dialogue concerned with selecting the contig and the region to search becomes activated.

The "Minimum Repeat" defines the smallest match that the algorithm will report. The
algorithm will search only for repeats in the forward direction "Find direct repeats", or
only those in the reverse direction "Find inverted repeats", or both "Find both".
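A minimal sketch of an exact repeat search between two consensus sequences is given below. It is a simple diagonal scan written for clarity, not the algorithm used by Gap5. The function names are hypothetical and positions are reported 1-based. Inverted repeats are found by searching against the reverse complement and converting the coordinates back.

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case ACGT sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def direct_repeats(a: str, b: str, min_repeat: int) -> List[Tuple[int, int, int]]:
    """Return (pos_a, pos_b, length) for maximal exact ACGT runs."""
    hits = []
    for d in range(-(len(b) - 1), len(a)):       # one pass per diagonal
        i, j = max(d, 0), max(-d, 0)
        run = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j] and a[i] in "ACGT":
                run += 1
            else:
                if run >= min_repeat:
                    hits.append((i - run + 1, j - run + 1, run))
                run = 0
            i += 1
            j += 1
        if run >= min_repeat:                    # run reaching the sequence end
            hits.append((i - run + 1, j - run + 1, run))
    return hits


def inverted_repeats(a: str, b: str, min_repeat: int) -> List[Tuple[int, int, int]]:
    """As direct_repeats, but b is matched in the reverse sense.

    pos_b is the leftmost base of the match in b's original orientation.
    """
    return [(pa, len(b) - pb - n + 2, n)
            for pa, pb, n in direct_repeats(a, revcomp(b), min_repeat)]


a = "TTTTGACCGTACTTTT"
b = "AAAAGACCGTACAAAA"
print(direct_repeats(a, b, 8))             # [(5, 5, 8)]
print(inverted_repeats(a, revcomp(b), 8))  # [(5, 5, 8)]
```

Choosing "Find both" would correspond to running both functions and combining the results.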
If "Mask active tags" is selected the "Select tags" button is activated. Clicking on this
button will bring up a check box dialogue to enable the user to select the tag types they
wish to activate. Masking the active tags means that all segments covered by tags that are
"active" will not be used in the matching algorithm. A typical use of this mode is to avoid
finding matches in segments covered by tags of type ALUS (i.e. segments thought to be Alu
sequence) or those already covered by REPT tags. See Section 2.2.7.1 [Tag types], page 121.
After the search is complete clicking on "Yes" in the "Save tags to file" panel will activate
the "File name" box and all repeats on the list will be written to a file. This file can be
used with "Enter tags" (see Section 2.12.2 [Enter Tags], page 265) to create REPT tags
for all the repeats found. Note that "Enter tags" will remove all the results plotted in the
contig comparator.
Note that the current version of Find Repeats has a limit to the number of repeats it
can store. The limit depends on the current maximum consensus length, so if you want to
increase the limit, reset the maximum consensus length. This can be done using the "Set
maxseq" item in the "Options" menu.

1.6.3 Find Read Pairs
This function is used to check the positions and orientations of readings taken from the
same templates. It is invoked from the gap5 View menu.
For each template the relative position of its readings and the contigs they are in are
examined. This analysis can give information about the relative order, separation and
orientations of contigs and also show possible problems in the data. The search can be
over the whole database or a subset of contigs named in a list (see Section 2.14 [Lists],
page 278) or file of file names. The results are written to the Output Window and plotted
in the Contig Comparator (See Section 2.4 [Contig Comparator], page 126.). Read pair
information is also used to colour code the results displayed in the Template Display (see
Section 2.5.1 [Template Display], page 130).
Note that during assembly the template names and lengths are copied from the experiment files into the gap database. See Section 11.3 [Experiment Files], page 552. The
accuracy of the lengths will depend upon some size selection being performed during the
cloning procedures.

Users choose to process "all contigs" or a subset selected from a file of file names ("file")
or a list ("list"). If either of the subset options is selected the "browse" button will be
activated and can be clicked on to call up a file or list browser dialogue.

1.6.3.1 Find Read Pairs Graphical Output
The contig comparator is used to plot all templates with readings that span contigs. That
is, the lines drawn on the contig comparator are a visual representation of the relationship
(orientation and overlap) between contigs. When a template spans more than two contigs,
all the combinations of pairs of contigs are plotted. However such cases are uncommon.

The figure above shows a typical Contig Comparator plot which includes several types
of result in addition to those from Read Pair analysis.
The lines for the read-pairs are, by default, shown in blue. The length of the line is the
average length of the two readings within the pair. The slope of the line represents the
relative orientation of the two readings. If they are both the same orientation (including
both complemented) the line is drawn from top left to bottom right, otherwise the line is
drawn from top right to bottom left.

Clicking with the right mouse button on a read pair line brings up a menu containing,
amongst other things, "Invoke join editor" (see Section 2.6.15 [The Join Editor], page 196).
This will bring up the Join Editor with the two contigs shown end to end.

1.6.4 Sequence Search
The purpose of this function (which is available from the gap5 View menu) is to find
matches between the consensus sequence and short segments of sequence defined by the
user. The segments of sequence (or "strings") can be typed into the dialogue provided
or can be the sequences covered by consensus tag types (see Section 2.2.7.1 [Tag types],
page 121) selected by the user. The latter mode hence provides a way of checking to see
if a tagged segment of the sequence occurs elsewhere in the consensus. The function was
previously known as "Find Oligos".

Users can elect to search against a "single" contig, "all contigs", or a subset of contigs
defined in a list (see Section 2.14 [Lists], page 278) or a file. If "file" or "list" is selected
the browse button is activated and gives access to file or list browsers. If they choose to
analyse a single contig the dialogue concerned with selecting the contig and the region to
search becomes activated.
Both strands of the consensus are scanned using a very simple algorithm: insertions and
deletions are not allowed, but mismatches are. The "Minimum percent match" defines the
smallest percentage match which will be reported by the algorithm. A value of 75 means
that at least 75% of the bases must match the target sequence.
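The scan described above can be sketched in a few lines. This is an illustration only: the function names are hypothetical, U is treated as T, and every window reaching the threshold is reported with its 1-based start position and strand.

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case ACGT sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def sequence_search(consensus: str, query: str,
                    min_percent_match: float) -> List[Tuple[int, str, float]]:
    """Ungapped scan of both strands; returns (position, strand, percent)."""
    query = query.upper().replace("U", "T")
    m = len(query)
    hits = []
    for strand, probe in (("+", query), ("-", revcomp(query))):
        for start in range(len(consensus) - m + 1):
            window = consensus[start:start + m]
            matches = sum(1 for x, y in zip(window, probe) if x == y)
            percent = 100.0 * matches / m
            if percent >= min_percent_match:
                hits.append((start + 1, strand, percent))
    return hits


consensus = "GATTGGGATTTCCAATGTTTTATTCCAGTTGGGCACCCTAAG"
query = "CATAAGGATTTCCAATATTTTATTCCAGTTGGGCATCCTAGT"
for position, strand, percent in sequence_search(consensus, query, 80):
    print(position, strand, round(percent, 1))   # 1 + 83.3
```

The two strings are the 42 base segments from the Output Window example later in this section: 35 of the 42 bases agree, giving 83.3% match, or equivalently 16.7% mismatch.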
The user can elect to use tags or to specify their own sequences for the search. Selecting
"Use tags" will activate the "Select tags" browse button. Clicking on this button will bring
up a check box dialogue to enable the user to select the tag types they wish to activate.
Alternatively selecting "Enter sequence" will activate a text entry box and the user can
enter a string of characters. Only the characters ACGTU are allowed and there is no limit
to the length of the string.
If it has not already occurred, selection of this function will automatically transform
the Contig Selector into the Contig Comparator. See Section 2.4 [Contig Comparator],
page 126. Each match found is plotted as a diagonal line in the Contig Comparator. The
length of the diagonal line is proportional to the length of the search string. Self matches
from the tag search are not reported.

If the match between the search string and the contig is in the same orientation,
the diagonal match line will be parallel to the main diagonal, otherwise the line will be
perpendicular to the main diagonal. Matches found between a tag and a contig can be used
to invoke the Join Editor (see Section 2.6.15 [The Join Editor], page 196) or Contig Editors
(see Section 2.6 [Editing in gap5], page 160). Matches between a specified sequence and
a contig will only invoke the Contig Editor. All of the matches found are displayed in the
Output Window, e.g.
Match found between tag on contig 315 in the + sense and contig 495
Percentage mismatch 16.7
               957       967       977       987       997
    315 CATAAGGATTTCCAATATTTTATTCCAGTTGGGCATCCTAGT
         ::  ::::::::::: :::::::::::::::::: ::::
    495 GATTGGGATTTCCAATGTTTTATTCCAGTTGGGCACCCTAAG
                 2        12        22        32        42

1.7 Checking Assemblies and Removing Readings
After assembly, and prior to editing, it can be useful to examine the quality of the alignments
between individual readings and the sections of the consensus which they overlap. This may
reveal doubtful joins between sections of contigs, poorly aligned readings, or readings that
have been misplaced. By using this analysis in combination with other gap5 functions such
as Find internal joins (see Section 2.8.3 [Find Internal Joins], page 227) and Find repeats
(see Section 2.8.4 [Find Repeats], page 233), it is also possible to discover if readings have
been positioned in the wrong copies of repeat elements.
If readings are found to be misplaced or need removing for other reasons, gap5 has functions for breaking contigs (see Section 2.9.1.1 [Breaking Contigs], page 239), and removing
readings (see Section 2.9.1.2 [Disassembling Readings], page 240). These functions can be
accessed through the main gap5 Edit menu or from within the Contig Editor.
If readings are removed from contigs to start new contigs of one reading, these contigs can
then be processed by Find internal joins (see Section 2.8.3 [Find Internal Joins], page 227)
and the Join editor (see Section 2.6.15 [The Join Editor], page 196), which should reveal all
the other positions at which the reading matches.

1.7.0.1 Checking Assemblies
The Check Assembly routine (which is invoked from the gap5 View menu) is used to check
contigs for potentially misassembled readings by comparing them against the segment of
the consensus which they overlap. It simply slides a small window along the sequence
identifying regions of high disagreement between that portion of sequence and the consensus.
Results are displayed in the Output Window and plotted on the main diagonal in the Contig
Comparator. See Section 2.4 [Contig Comparator], page 126.
From the Contig Comparator the user can invoke the Contig Editor to examine the
alignment of any problem reading. See Section 2.6 [Editing in gap5], page 160. If the
reading appears to be correctly positioned the user can either edit it, or instead select the
name to add it to the “readings” list for subsequent disassembly or removal.

Users select either to search only one contig ("single"), all contigs ("all contigs"), or a
subset of contigs contained in a "file" or a "list". If "file" or "list" is selected the "browse"
button will be activated and clicking on it will invoke a file or list browser. If a single contig
is selected the "Contig identifier" dialogue will be activated and users should enter a contig
name.
The percentage disagreement threshold and the size of the window are both configurable parameters. Additionally there is a parameter to control whether N bases in the sequence
should be considered as disagreements or not. The choice will depend on whether you are
looking for sequences that appear to be in the wrong place (ignore Ns) or simply sequences
that appear to have a large number of incorrect base calls (keep Ns).
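The window test can be sketched as follows. This is an illustration rather than the Check Assembly code: the function name is hypothetical, the reading and consensus segment are assumed to be already aligned and of equal length, and each window whose disagreement reaches the threshold is reported by its 1-based start position.

```python
from typing import List, Tuple


def disagreement_windows(read: str, consensus: str, window: int,
                         min_percent: float,
                         ignore_n: bool = True) -> List[Tuple[int, float]]:
    """Return (start, percent) for windows of high disagreement."""
    if len(read) != len(consensus):
        raise ValueError("read and consensus must be aligned to equal length")

    def differs(r: str, c: str) -> bool:
        if ignore_n and r in "Nn":
            return False          # Ns are not counted as disagreements
        return r != c

    flags = [1 if differs(r, c) else 0 for r, c in zip(read, consensus)]
    hits = []
    for start in range(len(flags) - window + 1):
        percent = 100.0 * sum(flags[start:start + window]) / window
        if percent >= min_percent:
            hits.append((start + 1, percent))
    return hits


consensus = "ACGTACGTACACGTACGTAC"
read      = "ACGTACGTNNACGTTTTTAC"
print([s for s, _ in disagreement_windows(read, consensus, 5, 40)])
# [12, 13, 14, 15, 16]
print([s for s, _ in disagreement_windows(read, consensus, 5, 40, ignore_n=False)])
# [6, 7, 8, 9, 12, 13, 14, 15, 16]
```

With Ns ignored only the three real mismatches near the end are flagged. Keeping Ns also flags the windows around the two uncalled bases.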
The "Information" window produced by selecting "Information" from the Contig Comparator "Results" menu produces a summary of the results sorted in order of percentage
mismatch.
By clicking with the right mouse button on results plotted in the Contig Comparator
a pop-up menu is revealed which can be used to invoke the Contig Editor (see Section 2.6
[Editing in gap4], page 160). The editor will start up with the cursor positioned on the
problem reading. If the reading is found to be misplaced it can be marked for removal
from within the Editor (see Section 2.6.7.12 [Remove Reading], page 179). However, prior
to this it may be beneficial to use some of the other analyses such as Find internal joins
(see Section 2.8.3 [Find Internal Joins], page 227) and Find repeats (see Section 2.8.4 [Find
Repeats], page 233), which may help to find its correct location. Both of these functions
produce results plotted in the Contig Comparator (see Section 2.4 [Contig Comparator],
page 126) and any alternative locations will give matches on the same vertical or horizontal
projection as the problem reading.

1.7.1 Removing Readings and Breaking Contigs
Occasionally contigs require more drastic changes than simple basecall edits. Sometimes it
is necessary to remove readings that have been put in the wrong place, or to break contigs
that should not have been joined. Gap5 contains functions to help with these problems,
and two types of interface.
If a contig needs to be broken cleanly into two new contigs, with all the readings, other
than the two at the incorrect join, still linked together, then Break Contig (see Section 2.9.1.1
[Breaking Contigs], page 239), or (see Section 2.6.7.13 [Break Contig], page 179) should be
used. The former interface is available via the main gap5 Edit menu, and the latter as an
option in the Contig Editor.
If one or more readings need removing from contig(s), even if their removal will
break the contiguity of a contig, then Disassemble Readings (see Section 2.9.1.2 [Disassemble
Readings], page 240) or Remove Reading (see Section 2.6.7.12 [Remove Reading], page 179)
should be used. The former interface
is available via the main gap5 Edit menu, and the latter as an option in the Contig Editor.
Readings can be removed from the database completely, or moved to start individual new
contigs, one for each reading.

1.7.1.1 Breaking Contigs
The Break Contig function (which is available from the gap5 Edit menu) enables contigs to
be broken by removing the link between two adjacent readings. The user defines the contig
coordinate to break at. All sequences starting to the right of that position will be placed
into a new contig.
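The rule can be pictured with a small sketch. The data structure (a mapping from reading name to start and end coordinates) and the function name are hypothetical, and we assume that "to the right of" means a start coordinate strictly greater than the break position.

```python
from typing import Dict, Tuple

Reads = Dict[str, Tuple[int, int]]


def break_contig(reads: Reads, position: int) -> Tuple[Reads, Reads]:
    """Split readings into (remaining contig, new contig) at a coordinate."""
    left: Reads = {}
    right: Reads = {}
    for name, (start, end) in reads.items():
        # Readings starting after the break position move to the new contig.
        (right if start > position else left)[name] = (start, end)
    return left, right


reads = {"r1": (1, 100), "r2": (60, 160), "r3": (150, 250), "r4": (155, 255)}
left, right = break_contig(reads, 140)
print(sorted(left), sorted(right))   # ['r1', 'r2'] ['r3', 'r4']
```

Note that r2 still extends past the break position, so the original contig keeps bases up to coordinate 160.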

Breaking a contig can sometimes cause further holes to be created. The “Remove contig
holes” option will cause additional breaks to be made at these holes, producing more than one
additional contig. If we have aligned against a reference and expect regions of zero coverage
then this option should be disabled.

1.7.1.2 Disassembling Readings
This function is used to remove readings from a database or move readings to new contigs.

If readings are removed from the database all reference to them is deleted. If a reading
is moved to a “single-read contig” a new contig will be created containing this one single
reading, which may then be re-processed by Find Internal Joins (see Section 2.8.3 [Find Internal Joins], page 227) and the Join editor (see Section 2.6.15 [The Join Editor], page 196),
which should reveal all the other positions at which the reading matches.
More useful is the general “Move readings to new contigs”. This will keep any assembly
relationships intact between the set of readings to be disassembled. For example if three
readings overlap then when disassembled all three will end up in a single new contig. This
function is particularly useful for pulling apart false joins or repeats.
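The grouping behaviour of “Move readings to new contigs” can be sketched as an interval merge. This is an illustration only: the function name and the (name, start, end) tuples are hypothetical, all readings are assumed to come from the same contig, and two readings are treated as overlapping when they share at least one coordinate.

```python
from typing import List, Tuple


def group_overlapping(reads: List[Tuple[str, int, int]]) -> List[List[str]]:
    """Group readings so that each chain of overlaps forms one new contig."""
    groups: List[List[str]] = []
    current: List[str] = []
    current_end = 0
    for name, start, end in sorted(reads, key=lambda r: (r[1], r[2])):
        if current and start <= current_end:
            current.append(name)              # overlaps the group so far
            current_end = max(current_end, end)
        else:
            if current:
                groups.append(current)
            current = [name]                  # start a new group
            current_end = end
    if current:
        groups.append(current)
    return groups


reads = [("r1", 1, 100), ("r2", 50, 150), ("r3", 140, 240), ("r4", 500, 600)]
print(group_overlapping(reads))   # [['r1', 'r2', 'r3'], ['r4']]
```

Here r1, r2 and r3 overlap in a chain and so would end up together in a single new contig, while r4 would form a contig of its own.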
The set of readings to be processed can be read from a “file” or a “list” and clicking on
the “browse” button will invoke an appropriate browser. If just a single reading is to be
disassembled choose “single” and enter the reading name instead of the file or list of filenames.
Removal via a “list” is a particularly powerful option when controlled via the list generation functions within the contig editor. For example break contig could be viewed as
disassembling a list of readings selected using “Select this reading and all to right”.
Unlike gap4, gap5 can cope with having holes in contigs. (This is obviously a requirement
when dealing with mapped alignments.) Hence gap5 gives us a choice whether to break
contigs into two (or more) pieces when removing sequences produces holes in the contigs.
By default this is enabled.

1.7.1.3 Delete Contigs
While Disassemble Readings is capable of removing entire contigs, it is inefficient for this
task as it has a lot of additional house-keeping to perform.

Delete Contigs should be used when we wish to remove entire contigs. Be careful not
to accidentally choose this over disassemble readings as even when giving a single sequence
name, this function will interpret it as a request for removing all other sequences in that
contig too.
There is no Undo feature, so backups are advised beforehand.

1.8 Tidying up alignments
The Shuffle Pads, Remove Pad Columns and Remove Contig Holes functions all share a common
goal of tidying up sequence alignments, possibly also breaking the contig up.

1.8.1 Shuffle Pads
This function is an implementation of the Anson and Myers “ReAligner” algorithm. It analyses multiple sequence alignments to detect locations where the number of disagreements
to the consensus could be reduced by realignment of sequences, possibly also correcting the
consensus in the process. For example:
Sequence1:  GATTCAAAGAC
Sequence2:    TTCAA*GACGG
Sequence3:     TC*AAGAC
Consensus:  GATTCAAAGACGGATC

The consensus contains AAA, but the corrected alignment only has two As:

Sequence1:  GATTCAAAGAC
Sequence2:    TTC*AAGACGG
Sequence3:     TC*AAGAC
Consensus:  GATTC*AAGACGGATC

For speed we acknowledge that the new alignment will only deviate slightly from the old
one and so a narrow “band size” is used. This parameter may be adjusted if required, but
at the expense of speed.
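The quantity being reduced can be checked by hand for the example above. The sketch below does not realign anything; it only counts disagreements to the consensus before and after. The function name is hypothetical, and the offsets of each sequence (0, 2 and 3 bases from the left end of the consensus) are inferred from where the sequences match the consensus.

```python
from typing import List, Tuple


def count_disagreements(rows: List[Tuple[int, str]], consensus: str) -> int:
    """Count bases (or pads) that differ from the consensus beneath them."""
    return sum(1
               for offset, seq in rows
               for k, base in enumerate(seq)
               if base != consensus[offset + k])


before = [(0, "GATTCAAAGAC"), (2, "TTCAA*GACGG"), (3, "TC*AAGAC")]
after = [(0, "GATTCAAAGAC"), (2, "TTC*AAGACGG"), (3, "TC*AAGAC")]

print(count_disagreements(before, "GATTCAAAGACGGATC"))  # 2
print(count_disagreements(after, "GATTC*AAGACGGATC"))   # 1
```

Moving the pad in Sequence2 and padding the consensus reduces the total from two disagreements to one.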

1.8.2 Remove Pad Columns
There are cases where we may have multiple alignments where every single sequence has a
padding character such that the complete column is “*”. This can occur when disassembling
data from a falsely made join.
The Shuffle Pads algorithm will remove entire columns of pads when it finds them, but
it is time consuming and it may also edit alignments elsewhere. The Remove Pad Columns
function is a faster, more specific solution to this problem.

By default the function will only ever delete columns where 100% of the sequences have a
pad/gap. However with appropriate due care it is possible to reduce this and allow removal
of columns where a few sequences have a real base provided the overall percentage is still
high. This is achieved by reducing the “Percentage pad needed” parameter.
Reducing from 100% is not recommended though, as it is removal of data purely for
tidiness' sake, while the consensus algorithm will automatically find the correct solution.
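The column test can be sketched as follows. This is a simplified illustration: the function name is hypothetical, and every sequence is assumed to span every column, so that the depth is the same throughout.

```python
from typing import List


def pad_columns(rows: List[str], percent_needed: float = 100.0) -> List[int]:
    """Return 0-based columns whose pad percentage reaches the threshold."""
    if not rows:
        return []
    depth = len(rows)
    columns = []
    for col in range(len(rows[0])):
        pads = sum(1 for row in rows if row[col] == "*")
        if 100.0 * pads / depth >= percent_needed:
            columns.append(col)
    return columns


rows = ["AC*GT*A",
        "AC*GTCA",
        "AC*GT*A"]
print(pad_columns(rows))         # [2]
print(pad_columns(rows, 60.0))   # [2, 5]
```

At the default of 100% only the all-pad column is selected. Lowering the threshold to 60% also selects the column in which one reading has a real C, which is the loss of data the text warns about.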

1.8.3 Remove Contig Holes
Unlike Gap4, Gap5 permits contig regions with zero coverage. These can naturally occur
when using sequence mapping to known references. However in a de novo assembly context
they are not desirable.

Some algorithms have check boxes querying whether you wish holes to be removed by
breaking contigs up, but this dialogue offers a choice of fixing the holes at a later stage.
It identifies all regions of zero coverage and will break the contig into multiple fragments.
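
The idea can be sketched as merging read extents and starting a new fragment wherever coverage falls to zero. This assumes reads are given as inclusive (start, end) positions; it illustrates the principle only and does not reflect how Gap5 stores or breaks contigs.

```python
def split_at_holes(reads):
    """Merge inclusive (start, end) read extents into zero-hole fragments."""
    fragments = []
    for start, end in sorted(reads):
        if fragments and start <= fragments[-1][1] + 1:
            # Overlaps or abuts the current fragment: extend it.
            fragments[-1][1] = max(fragments[-1][1], end)
        else:
            # A gap with no coverage precedes this read: start a new fragment.
            fragments.append([start, end])
    return [tuple(f) for f in fragments]

# Hypothetical read extents with no coverage between 181 and 200.
reads = [(1, 100), (50, 180), (201, 300), (250, 400)]
print(split_at_holes(reads))
```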

1.9 Calculating Consensus Sequences
In this section we describe the types of consensus which gap4 can produce, the formats they
can be written in, and the algorithms that can be used. The algorithms are not only used
to produce consensus sequence files, but in many other places throughout gap4 where an
analysis of the current quality of the data is required. One important place is inside the
Contig Editor (see Section 2.6 [Editing in gap4], page 160) where they are used to produce
an "on-the-fly" consensus, responding to every edit made by the user.
The currently active consensus algorithm is selected from the "Consensus algorithm" dialogue in the main gap4 Options menu (see Section 2.20.2 [Consensus Algorithm], page 299).
There are four main types of consensus sequence file that can be produced by the program: Normal, Extended, Unfinished, and Quality. They are all invoked from the File
menu.
"Normal" is the type of consensus file that would be expected: a consensus from the
non-hidden parts of a contig. "Extended" is the same as "Normal" but the consensus is
extended by inclusion of the hidden, non-vector sequence, from the ends of the contig.
"Unfinished" is the same as "Normal" except that any position where the consensus
does not have good data for both strands is written using A,C,G,T characters, and the rest
(which has good data for both strands) is written using a different set of symbols. This
sequence can be used for screening against new readings: only the regions needing more
readings will produce matches. By screening readings in this way, prior to assembly, users
can avoid entering readings which will not help finish the project, and which may require
further editing work to be performed.
"Quality" produces a sequence of characters of the same length as the consensus, but
they instead encode the reliability of the consensus at each point.
Consensus sequence files can also encode the positions of the currently active tag types
by changing the case of the tagged characters (marking) or writing them in a different
character set (masking) (see Section 2.2.7.2 [Active tags and masking], page 121).
The consensus algorithms are usually configured to produce only the characters A,C,G,T
and "-", but it is possible to set them to produce the complete set of IUB codes. This mode
is useful for some types of work and allows the range of observed base types at any position
to be coded in the consensus. How the IUB codes are chosen is described in the introduction
to the consensus algorithms (see Section 2.11.5 [The Consensus Algorithms], page 257).
Depending on the type of consensus produced, the consensus sequence files can be written
in three different formats: Experiment files (see Section 11.3 [Experiment File], page 552),
FASTA (Pearson,W.R. Using the FASTA program to search protein and DNA sequence
databases. Methods in Molecular Biology. 25, 365-389 (1994)) or staden formats. If experiment file format is selected a further menu appears that allows users to select for the
inclusion of tag data in the output file. For FASTA format the sequence headers include the
contig identifier as the sequence name and the project database name, version number and
the number of the leftmost reading in the contig as comments. e.g. ">xyzzy.s1 B0334.0.274"
is database B0334, copy 0, and the left most reading for the contig is number 274, which has
a name of xyzzy.s1. For staden format the headers include the project database name and
the number of the leftmost reading in the contig. e.g. "" is database
B0334 and the left most reading for the contig is number 274. Staden format is maintained
only for historical reasons - i.e. there may still be a few unfortunate people using it. Obviously Experiment file format can contain much more information, and can serve as the
basis of a submission to the sequence library.
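
As an illustration of the FASTA header layout described above, the following sketch splits a header into its parts. The function and field names are hypothetical; only the example header itself comes from the text.

```python
def parse_consensus_header(line):
    """Split '>name database.version.leftmost' into its components."""
    name, comment = line.lstrip(">").split(None, 1)
    database, version, leftmost = comment.rsplit(".", 2)
    return {
        "contig": name,
        "database": database,
        "version": version,
        "leftmost_reading": int(leftmost),
    }

print(parse_consensus_header(">xyzzy.s1 B0334.0.274"))
```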

1.9.1 Normal Consensus Output
This is the usual consensus type that will be calculated (and is available from the gap4
File menu). The currently active consensus algorithm is selected from the "Consensus algorithm" dialogue in the main gap4 Options menu (see Section 2.20.2 [Consensus Algorithm],
page 299).
Contigs can be selected from a file of file names or a list. In addition, tagged regions can
be masked or marked (see Section 2.2.7.2 [Active tags and masking], page 121), and output
can be in Experiment file, fasta or staden formats. If experiment file format is selected a
further menu appears that allows users to select for the inclusion of tag data in the output
file.

If "mask" is selected the segments covered
by the tag types chosen will not be written as ACGT but as defi symbols. If "mark" is
selected the tagged segments will be written in lowercase characters. Masking is useful for
producing a sequence to screen against other sequences: only the unmasked segments will
produce hits.
The "strip pads" option will remove pads ("*"s) from the consensus sequence. In the
case of experiment files this will also automatically adjust the position and length of the
annotations to ensure that they still mark the correct segment of sequence.
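
The adjustment of annotations when pads are stripped can be sketched as follows. This assumes annotations are (start, length) pairs with 1-based positions in the padded consensus; the representation is an assumption made for the example and not the Experiment file layout.

```python
def strip_pads(consensus, annotations):
    """Remove '*' from the consensus and shift annotations to match."""
    # unpadded[i] is the number of non-pad characters in consensus[:i].
    unpadded = [0]
    for ch in consensus:
        unpadded.append(unpadded[-1] + (ch != "*"))
    stripped = consensus.replace("*", "")
    adjusted = []
    for start, length in annotations:
        end = start + length - 1  # inclusive, padded coordinates
        new_start = unpadded[start - 1] + 1
        new_length = unpadded[end] - unpadded[start - 1]
        adjusted.append((new_start, new_length))
    return stripped, adjusted

# The annotation covers padded positions 5-8 ("C*AA").
print(strip_pads("GATTC*AAGAC", [(5, 4)]))
```

After stripping, the same annotation covers "CAA" at positions 5 to 7, so its length shrinks by the one pad it contained.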
Normally the consensus sequences are named after the left-most reading in each contig.
For the purposes of single-template based sequencing projects (e.g. cDNA assemblies) the
option exists to “Name consensus by left-most template” instead of by left-most reading.
The routine can write its consensus sequence (plus extra data for experiment files) in
"experiment file", "fasta" and "staden" formats. The output file can be chosen with the
aid of a file browser. If experiment file format is selected the user can choose whether or not
to have "all annotations", "annotations except in hidden", or "no annotations" written out
with the sequence. If the user elects to include annotations the "select tags" button will
become active, and if it is clicked, a dialogue for selecting the types to include will appear.

1.9.2 The Consensus Algorithms
The consensus calculation is a very important component of gap4. It is used to produce
an "on-the-fly" consensus, responding to every individual change in the Contig Editor (see
Section 2.6 [Editing in gap4], page 160) and is used to produce the final sequence for
submission to the sequence libraries. Some years ago (Bonfield, J.K. and Staden, R. The
application of numerical estimates of base calling accuracy to DNA sequencing projects.
Nucleic Acids Res. 23, 1406-1410 (1995)) we put forward the idea of using base call accuracy
estimates in sequencing projects, and this has been partially realised with the values from
the Phred program (Ewing, B. and Green, P. Base-Calling of Automated Sequencer Traces
Using Phred. II. Error Probabilities. Genome Research. Vol 8 no 3. 186-194 (1998)).
These values are widely used and have defined a decibel type scale for base call confidence
values and gap4 is currently set to use confidence values defined on this scale. An overview
of our use of confidence values is contained in the introductory sections of the manual (see
Section 2.2.5 [The use of numerical estimates of base calling accuracy], page 118).
As is described elsewhere (see Section 2.11.6 [List Consensus Confidence], page 261)
being able to calculate the confidence for each base in the consensus sequence makes it
possible to estimate the number of errors it contains, and hence the number of errors that
will be removed if particular bases are checked and, if necessary, edited.
Gap4 caters for base calls with and without confidence values and hence provides a
choice of algorithms. There are currently three consensus algorithms that may be used.
The choice of the best algorithm will depend on the data that you have available and the
purpose for which you are using gap4.
The currently active consensus algorithm is selected from the "Consensus algorithm" dialogue in the main gap4 Options menu (see Section 2.20.2 [Consensus Algorithm], page 299).
The only way to produce a consensus sequence for which the reliability of each base is
known is to use reading data with base call confidence values. Their use, in combination
with the Confidence Value algorithm (see Section 2.11.5.3 [Consensus Calculation Using
Confidence Values], page 259) is strongly recommended.
For base calls without confidence values use the Base Frequencies algorithm (see
Section 2.11.5.1 [Consensus Calculation Using Base Frequencies], page 258). This is also
a fast algorithm so it may be appropriate for very high depth assemblies such as those for
mutation studies.
For data with simple base call accuracy estimates rather than those on the decibel scale,
the Weighted Base Frequencies algorithm should be used (see Section 2.11.5.2 [Consensus
Calculation Using Weighted Base Frequencies], page 259).
All confidence values lie in the range 0 to 100. When readings are entered into a database, gap4 assigns a confidence of 99 to all bases without confidence values. For all three
algorithms, a base with confidence of 100 is used to force the consensus base to that base
type and to have a confidence of 100. However, if two or more base types at any position
have confidence 100, the consensus will be set to "unknown", i.e. "-", and will have a
confidence of 0. Note that dash ("-") is our preferred symbol for "unknown" as, within a
sequence, it is more easily distinguished from A,C,G,T than "N".
The consensus sequence is also assigned a confidence, even when base call confidence
values are not used to calculate it. The scale and meaning of the consensus confidence
changes between consensus algorithms. However the consensus cutoff parameter always has
the same meaning. A consensus base with a confidence ’X’ will be called as a dash when
’X’ is lower than the consensus cutoff, otherwise it is the determined base type.
Both the consensus cutoff and quality cutoff values can be set by using the "Configure cutoffs" command in the "Consensus algorithm" dialogue in the main gap4 Options
menu (see Section 2.20.2 [Consensus Algorithm], page 299). Within the Contig Editor (see
Section 2.6 [Editing in gap4], page 160) these values can be adjusted by clicking on the "<"
and ">" symbols adjacent to the "C:" (consensus cutoff) and "Q:" (quality cutoff) displays
in the top left corner of the editor. These buttons are repeating buttons - the values will
adjust for as long as the left mouse button is held down. Changing these values lasts only
as long as that invocation of the contig editor.
The consensus algorithms are usually configured to produce only the characters
A,C,G,T,* and "-", but it is possible to set them to produce the complete set of IUB
codes. This mode is useful for some types of work and allows the range of observed base
types at any position to be coded in the consensus. The IUB code at any position is
determined in the following way.
We assume that the user wants to know which base types have occurred at any point,
but may want some control over the quality and relative frequency of those that are used to
calculate the "consensus". For the simplest consensus algorithm there is no control over the
quality of the base calls that are included, but the Consensus Cutoff can be used to control
how the relative frequency affects the chosen IUB code. All base types whose computed
"confidence" exceeds the Consensus Cutoff will be included in the selection of the IUB code.
For example if only base type T reaches the Consensus Cutoff the IUB code will be T; if both
T and C reach the cutoff the code will be Y; if A, C and T each reach the cutoff the code
will be H; if A, C, G and T all reach the cutoff the code will be "N". For the Confidence
Value algorithm the Quality Cutoff can be used to exclude base calls of low quality, so that
all those that do not reach the Quality Cutoff are excluded from the IUB code calculation.
Otherwise the logic of the code selection is the same as for the two simpler algorithms.
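
The selection of an IUB code can be sketched as below. The mapping from base sets to codes is the standard IUB/IUPAC table; treating "reaches the cutoff" as greater than or equal, and returning "-" when no base type qualifies, are assumptions made for the example.

```python
# Standard IUB codes keyed by the qualifying base types in A, C, G, T order.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N",
}

def iub_code(confidence, cutoff):
    """confidence maps base type to its computed confidence for one column."""
    chosen = "".join(b for b in "ACGT" if confidence.get(b, 0) >= cutoff)
    return IUB.get(chosen, "-")

print(iub_code({"T": 80}, 50))
print(iub_code({"C": 50, "T": 50}, 50))
print(iub_code({"A": 40, "C": 30, "T": 30}, 30))
```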
Both the consensus cutoff and quality cutoff values can be set by using the "Configure
cutoffs" command in the "Consensus algorithm" dialogue in the main gap4 Options menu
(see Section 2.20.2 [Consensus Algorithm], page 299).
The algorithms are explained below.

1.9.2.1 Consensus Calculation Using Base Frequencies
This algorithm can be used for any data, with or without confidence values. Each standard
base type is given the same weight. The consensus will be the most frequent base type in a
given column provided that the consensus cutoff parameter is low enough. All unrecognised
base types, including IUB codes, are treated as dashes. Dashes are given a weight of
1/10th that of recognised base types. Pads are given a weight which is the average of their
neighbouring bases.
The confidence of a consensus base for this method is expressed as a percentage. So for
example a column of bases of A, A, A and T will give a consensus base of A and a confidence
of 75. Therefore a consensus cutoff of 76 or higher will give a consensus base of "-".
In the event that more than one base type is calculated to have the same confidence, and
this exceeds the consensus cutoff, the bases are assigned in descending order of precedence:
A, C, G and T.
The quality cutoff parameter (Q in the Contig Editor) has no effect on this algorithm.
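
A minimal sketch of the frequency calculation for a single column follows. For brevity it counts only A, C, G and T and leaves out the special weights for dashes and pads described above, so it is a simplification rather than the gap4 implementation.

```python
def base_frequency_consensus(column, cutoff):
    """Return (consensus base, confidence as a percentage) for one column."""
    counts = {b: column.count(b) for b in "ACGT"}
    total = sum(counts.values())
    if total == 0:
        return "-", 0.0
    # max() keeps the first maximum, giving the A, C, G, T precedence on ties.
    best = max("ACGT", key=lambda b: counts[b])
    confidence = 100.0 * counts[best] / total
    return (best if confidence >= cutoff else "-"), confidence

print(base_frequency_consensus("AAAT", cutoff=75))
print(base_frequency_consensus("AAAT", cutoff=76))
```

The two calls reproduce the example in the text: a confidence of 75 yields A at a cutoff of 75, and a dash once the cutoff is raised to 76.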

1.9.2.2 Consensus Calculation Using Weighted Base Frequencies
This method can be used when simple, unquantified, base call quality values are available.
Instead of simply counting base type frequencies it sums the quality values. Hence a column
of 4 bases A, A, A and T with confidence values 10, 10, 10 and 50 would give combined
totals of 30/80 for A and 50/80 for T (compared to 3/4 for A and 1/4 for T when using
frequencies). As with the unweighted frequency method this sets the confidence value of
the consensus base to be the fraction of the chosen base type weights over the total
weights (62.5 in the above example).
The quality cutoff parameter controls which bases are used in the calculation. Only bases
with quality values greater than or equal to the quality cutoff are used, otherwise they are
completely ignored and have no effect on either the base type chosen for the consensus or
the consensus confidence value. In the above example setting the quality cutoff to 20 would
give a T with confidence 100 (100 * 50/50).
In the event that more than one base type is calculated to have the same weight, and
this exceeds the consensus cutoff, the bases are assigned in descending order of precedence:
A, C, G and T.
This is Rule IV of Bonfield,J.K. and Staden,R. The application of numerical estimates
of base calling accuracy to DNA sequencing projects. Nucleic Acids Research 23, 1406-1410
(1995).
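
The worked example above can be reproduced with a short sketch. The function name and argument layout are illustrative only; pads and unrecognised base types are ignored here for simplicity.

```python
def weighted_consensus(calls, quality_cutoff=0, consensus_cutoff=0):
    """calls is a list of (base, quality) pairs for one column."""
    weights = {b: 0 for b in "ACGT"}
    for base, quality in calls:
        if base in weights and quality >= quality_cutoff:
            weights[base] += quality
    total = sum(weights.values())
    if total == 0:
        return "-", 0.0
    # Ties resolve in A, C, G, T order because max() keeps the first maximum.
    best = max("ACGT", key=weights.get)
    confidence = 100.0 * weights[best] / total
    return (best if confidence >= consensus_cutoff else "-"), confidence

column = [("A", 10), ("A", 10), ("A", 10), ("T", 50)]
print(weighted_consensus(column))
print(weighted_consensus(column, quality_cutoff=20))
```

With no quality cutoff the result is T with confidence 62.5; with a quality cutoff of 20 the three A calls are ignored and T has confidence 100.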

1.9.2.3 Consensus Calculation Using Confidence values
This is the preferred consensus algorithm for reading data with Phred decibel scale confidence
values. As will become clear from the following description, it is more complicated than the
other algorithms, but produces a much more useful result.
A difficulty in designing an algorithm to calculate the confidence for a consensus derived
from several readings, possibly using different chemistries, and hopefully from both strands
of the DNA, is knowing the level of independence of the results from different experiments
- namely the readings. Given that sequencing traces are sequence dependent, we do not
regard readings as wholly independent, but at the same time, repeated readings which
confirm base calls may give us more confidence in their accuracy. In addition, if we get a
particularly good sequencing run, with consequently high base call confidence values, we
are more likely to believe its base call and confidence value assignments. The final point in
this preamble is that the Phred confidence values refer only to the probability for the called
base, and they tell us nothing about the relative likelihood of each of the other 3 base types
appearing at the same position. These difficulties are taken into account by our algorithm,
which is described below.
In what follows, a particular position in an alignment of readings is referred to as a
"column". The base calls in a column are classified by their chemistry and strand. We
currently group them into "top strand dye primer", "top strand dye terminator", "bottom
strand dye primer" and "bottom strand dye terminator" classes.
Within each class there may be zero or many base calls. For each class we check for
multiple occurrences of the same base type. For each base type we find the highest confidence
value, and then increase it by an amount dependent on the number of confirming reads.
Then Bayes formula is used to derive the probabilities and hence the confidence values for
each base type.
To further describe the method it is easiest to work through an example. Suppose we
have 5 readings with the following characteristics covering a particular column.
Dye primer,     top strand,    ’A’, confidence 20
Dye primer,     top strand,    ’A’, confidence 10
Dye primer,     top strand,    ’T’, confidence 20
Dye terminator, top strand,    ’T’, confidence 10
Dye primer,     bottom strand, ’A’, confidence  5

T | .0033 .0033 .0033 .990
Bayesian calculations on this table then give us probabilities of approximately .766 for
A, .00154 for C, .00154 for G and .231 for T.
The other classes give probabilities of .0333 for A, C, G and .9 for T, and .316 for A, and
.228 for C, G and T.
To combine the values for each class we produce a table for a further Bayesian calculation.
Once again we fill in the probabilities and spread the remainder evenly amongst the other
base types.
           |   A      C      G      T
-----------+----------------------------
Primer Top | .766   .00154 .00154 .231
Term   Top | .0333  .0333  .0333  .9
Primer Bot | .316   .228   .228   .228
From this Bayes gives the final probabilities of .145 for A, .0002 for C, .0002 for G and
.854 for T. This is what would be expected intuitively: the T signal was present in both
dye primer and dye terminator experiments with 1/100 and 1/10 error rates whilst the A
signal was present on both strands with 1/100 and 1/3 error rates. Hence the consensus
base is T with confidence 8.4 (-10*log10(1-.854)).
If a padding character is present in a column we consider the pad as a separate base
type and then evenly divide the remaining probabilities by 4 instead of 3.
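
The final combination step can be sketched as follows, assuming that Bayes' formula is applied with an equal prior for each base type, so that the per-class probabilities for a base type are multiplied together and the products normalised to sum to one. Under that assumption the three class rows above give the quoted probability for T and the quoted consensus confidence.

```python
import math

def combine_classes(rows):
    """Multiply per-class probabilities per base type and normalise."""
    product = {b: 1.0 for b in "ACGT"}
    for row in rows:
        for b in "ACGT":
            product[b] *= row[b]
    total = sum(product.values())
    return {b: p / total for b, p in product.items()}

rows = [
    {"A": .766, "C": .00154, "G": .00154, "T": .231},  # primer, top strand
    {"A": .0333, "C": .0333, "G": .0333, "T": .9},     # terminator, top strand
    {"A": .316, "C": .228, "G": .228, "T": .228},      # primer, bottom strand
]
posterior = combine_classes(rows)
best = max(posterior, key=posterior.get)
confidence = -10 * math.log10(1 - posterior[best])
print(best, round(posterior[best], 3), round(confidence, 1))
```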

1.9.2.4 The Quality Calculation
The Quality Calculation described here (which is available from the gap4 File menu) applies
either of the two simple consensus calculations (see Section 2.11.5.1 [Consensus Calculation
Using Base Frequencies], page 258) and (see Section 2.11.5.2 [Consensus Calculation Using
Weighted Base Frequencies], page 259) to the data for each strand of the DNA separately.
It produces, not a consensus sequence, but an encoding of the "quality" of the data which
defines whether it has been determined on both strands, and whether the strands agree.
This quality is used as the basis for problem searches, such as find next problem, and the
Quality Display within the Template Display (see Section 2.5.1.5 [Quality Plot], page 137).
The categories of data and the codes produced are shown in the table. For example ’c’
means bad data on one strand is aligned with good data on the other.
     +Strand  -Strand
a    Good     Good (in agreement)
b    Good     Bad
c    Bad      Good
d    Good     None
e    None     Good
f    Bad      Bad
g    Bad      None
h    None     Bad
i    Good     Good (disagree)
j    None     None
the "Configure cutoffs" command in the

In the "Consensus algorithm" dialogue in the main gap4 Options menu (see
Section 2.20.2 [Consensus Algorithm], page 299), setting the configuration to treat readings
flagged using the "Special Chemistry" Experiment File line (CH field) (see Section 11.3
[Experiment File], page 552) affects this calculation. When set, the reading counts for
both strands in the Consensus and Quality Calculations, and hence is equivalent to having
data on both strands.

1.9.3 List Consensus Confidence
The Confidence Value consensus algorithm (see Section 2.11.5.3 [Consensus Calculation
Using Confidence Values], page 259) produces a consensus sequence for which the expected
error rate for each base is known. The option described here (which is available from the
gap4 View menu) uses this information to calculate the expected number of errors in a
particular consensus sequence and to tabulate them.
The decibel type scale introduced in the Phred program uses the formula
-10 x log10(error rate) to produce confidence values for the base calls. A confidence value of
10 corresponds to an error rate of 1/10; 20 to 1/100; 30 to 1/1000; etc.
So for example, if 50 bases in the consensus had confidence 10, we would expect those 50
bases (with an error rate of 1/10) to contain 5 errors; and if 200 bases had confidence 20, we
would expect them to contain 2 errors. If these 50 bases with confidence 10, and 200 bases
with confidence 20 were the least accurate parts of the consensus, they are the bases which
we should check and edit first. In so doing we would be dealing with the places most likely
to be wrong, and would raise the confidence of the whole consensus. The output produced
by List Confidence shows the effect of working through all the lowest quality bases first,
until the desired level of accuracy is reached. To do this it shows the cumulative number
of errors that would be fixed by checking every consensus base with a confidence value less
than a particular threshold.
The List Confidence option is available from within the Commands menu of the Contig
Editor and the main gap4 View menu. From the main menu the dialogue simply allows
selection of one or more contigs. Pressing OK then produces a table similar to the following:
Sequence length = 164068 bases.
Expected errors = 168.80 bases (1/971 error rate).

Value   Frequencies   Expected   Cumulative    Cumulative   Cumulative
                      errors     frequencies   errors       error rate
    4            30      11.94            34        14.24       1/1061
    5             2       0.63            36        14.87       1/1065
    6           263      66.06           299        80.94       1/1867
    7           151      30.13           450       111.06       1/2841
    8           164      25.99           614       137.06       1/5168
    9            96      12.09           710       149.14       1/8344
   10            80       8.00           790       157.14      1/14069

The output above states that there are 164068 bases in the consensus sequence with an
expected 169 errors (giving an average error rate of one in 971). Next it lists each confidence
value along with its frequency of occurrence and the expected number of errors (as explained
above, frequency x error rate). For any particular confidence value the cumulative columns
state: how many bases in the sequence have the same or lower confidence, how many errors
are expected in those bases, and the new error rate if all these bases were checked and all
the errors fixed.
Above it states that there are 790 bases with confidence values of 10 or less, and estimates
there to be 157 errors in those 790 bases. As we expect there to be about 169 errors in the
whole consensus this implies that manually checking those 790 bases would leave only 12
undetected errors. Given that the sequence length is 164068 bases this means an average
error rate of 1 in 14069. It is important to note that by using this editing strategy, this
error rate would be achieved by checking only 0.48% of the total number of consensus bases.
This strategy is realised by use of the consensus quality search in the gap4 Contig Editor
(see Section 2.6.6.7 [Search by Consensus Quality], page 175).
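
The arithmetic behind the table can be sketched as below. The function names and the histogram layout (confidence value mapped to number of consensus bases) are assumptions for illustration; the error rate for a confidence value follows the Phred formula given earlier.

```python
def expected_errors(confidence, count):
    """Expected errors among `count` bases with the given confidence."""
    return count * 10 ** (-confidence / 10)

def list_confidence(histogram, sequence_length):
    """Return (total expected errors, table rows) for a confidence histogram."""
    total = sum(expected_errors(c, n) for c, n in histogram.items())
    rows, cum_bases, cum_errors = [], 0, 0.0
    for conf in sorted(histogram):
        n = histogram[conf]
        err = expected_errors(conf, n)
        cum_bases += n
        cum_errors += err
        remaining = total - cum_errors
        # "1 in N" error rate once all bases up to this confidence are fixed.
        rate = sequence_length / remaining if remaining > 0 else float("inf")
        rows.append((conf, n, err, cum_bases, cum_errors, rate))
    return total, rows

# The example from the text: 50 bases at confidence 10, 200 at confidence 20,
# in a hypothetical consensus of 10000 bases.
total, rows = list_confidence({10: 50, 20: 200}, 10000)
print(round(total, 2))
for row in rows:
    print(row)
```

For a real table the histogram must cover every confidence value present in the consensus, since the cumulative error rate depends on the total expected errors.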

1.9.4 List Base Confidence
The various base-callers may produce a confidence value for each base call. Previous sections
describe how this may be used to produce a consensus sequence along with a consensus
confidence.
This function tabulates the frequency of each base confidence value along with a count
of how many times it matches or mismatches the consensus. Given that the standard scale
for confidence values follows the -10log10(probability of error) formula we can determine
what the expected frequency of mismatches should be for any particular confidence value.
By comparing this with our observed frequencies we then have a powerful summary of the
amount of misassembled data.
Total bases considered : 45270
Problem score          : 1.337130

Conf.    Match    Mismatch    Expected    Over-
value    freq     freq        freq        representation
---------------------------------------------------------------------
    0        0           0        0.00        0.00
    1        0           0        0.00        0.00
    2        0           0        0.00        0.00
    3        0           0        0.00        0.00
    4       37          22       23.49        0.94
    5        0           0        0.00        0.00
    6       89          46       33.91        1.36
    7      119          26       28.93        0.90
    8      256          37       46.44        0.80
    9      368          30       50.11        0.60
   10      669          31       70.00        0.44
...

In the above example we see that there are 59 sequence bases with confidence 4, of which
37 match the consensus and 22 do not. If we work on the assumption that the consensus
is correct then we would expect approximately 40% of these to be incorrect, but we have
measured 37% to be incorrect (22/59) giving 0.94 fraction of the expected amount.
For a more problematic assembly, we may see a section of output like this:
Total bases considered : 1617511
Problem score          : 311.591358

Conf.    Match    Mismatch    Expected    Over-
value    freq     freq        freq        representation
---------------------------------------------------------------------
...
   20    13432         384      138.16        2.78
   21    23384         851      192.51        4.42
   22    18763         487      121.46        4.01
   23    13712         300       70.23        4.27
   24    21182         363       85.77        4.23
   25    20466         218       65.41        3.33
   26     9752         123       24.80        4.96
   27    23071         282       46.60        6.05
   28    13816         158       22.15        7.13
   29    27514         166       34.85        4.76
   30    15664         140       15.80        8.86
...
We can see here that the observed mismatch frequency is much greater than the expected
number. This indicates the number of misassemblies (or SNPs in the case of mixed samples)
within this project and is reflected by the combined “Problem score”. This score is simply
the sum of the final column (or 1 over that column for values less than 1.0).
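
The expected frequency and over-representation columns can be reproduced with a short sketch. It assumes the expected mismatch count is the number of bases at a confidence value multiplied by the Phred error rate for that value; the function name is illustrative, and the Problem score itself is not computed here.

```python
def over_representation(confidence, match, mismatch):
    """Return (expected mismatches, observed / expected) for one row."""
    expected = (match + mismatch) * 10 ** (-confidence / 10)
    ratio = mismatch / expected if expected else 0.0
    return expected, ratio

for conf, match, mismatch in [(4, 37, 22), (20, 13432, 384), (30, 15664, 140)]:
    expected, ratio = over_representation(conf, match, mismatch)
    print(conf, round(expected, 2), round(ratio, 2))
```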

1.10 Other Miscellany
1.10.1 List Libraries
The List Libraries window is perhaps misnamed as it handles arbitrary groups of reads,
possibly due to the use of multiple libraries, multiple instrument types or simply multiple
lanes on a single instrument. For SAM/BAM files this information comes from the @RG
header lines. For other formats Gap5 typically makes use of the input filename to group
data together.

The basic plot shows a list of library names and how frequently read pairs have been
identified as matching to the same contig. This is computed at the time of import via
tg_index and so will not be updated on contig joining or breakage. The Type field indicates
the instrument platform type (for example Illumina or 454), although this is often absent
from the input BAM files.

The Insert size and standard deviation (s.d.) are derived from the sequence alignments,
with assumptions of an approximately Gaussian distribution. While not entirely accurate
this is typically sufficient for most libraries when viewed in a summary table. Finally the
Orientation field indicates the relative orientation in which most of the read-pairs have been
assembled. This will be one of “-> <-”, “<- ->” or “-> -> / <- <-” to indicate the relative
orientations of the read-pair. Whether the observed orientation is correct will depend on
the particular sequencing strategy used.
Underneath the list is a histogram of observed insert sizes for the currently selected
library. The graph is currently very rudimentary with no controls, but it will auto-scale
to fit the data. The example shown above is an Illumina large insert library showing two
distinct distributions with the smaller being where the biotin enrichment failed and short
templates were included in the library. (Note in this example the sequence orientations
have been flipped so the bulk of the data is in the orientation expected by other tools.)

1.10.2 Results Manager
Some commands within Gap5 produce "results" that are updated automatically as data
is edited. The Result Manager provides a way to list these results, and to interact with
them.
A result is an abstract term used to define any collection of data. Typically this data can
be displayed, manipulated and is usually updated automatically when changes are made that
affect it. Each set of matches from a particular search plotted on the Contig Comparator
(see Section 2.4 [Contig Comparator], page 126) is a result, as are entire displays such as
the Template Display.

The "results" window, shown above, can be invoked either from the View menu in the
main display or from the View menu of the Contig Comparator. Each result is listed in the
window on a separate line containing the time that the result was created (which may not
be the same as when it was last updated), the name of the function that created the result,
and the result number. The number is simply a unique identifier to help distinguish two
results produced by the same function.
Each item in the list is consuming memory on your computer. Running functions over
and over again without removing the previous results will slow down your machine and it
will, eventually, run out of memory. Removing items from the list solves this.
Pressing the right mouse button over a listed item will display a popup menu of operations that can be performed on this result. The operations available will always contain
"Remove" which will delete this result and shut down any associated window, but others
listed will depend on the result selected. In the illustration above the popup menu for the
"Repeat search" can be seen. Here the operations relate to a set of repeat matches currently
being displayed in the Contig Comparator (not shown).
The Contig Comparator functions ("Find internal joins", "Find read pairs", "Find repeats", "Check assembly" and "Find Sequences") are all listed in the Results Manager
once per usage of the function. It is worth remembering that the only ways to completely
remove the plots from one of these functions are to use the "Remove" command within the
Results Manager or to use the "Clear" button within the Contig Comparator, which removes all
plots.

1.10.3 Lists
For many operations it is convenient to be able to process sets of data together - for example
to calculate a consensus sequence for a subset of the contigs. To facilitate this Gap5 uses
lists.
Most Gap5 commands dealing with batches of files or sets of readings or contigs can
use either files of filenames or lists. When selecting list names from within dialogues the
"browse" button will display a window containing all the currently existing lists. To select
a list simply double click on the list name. Alternatively the name may simply be typed in.
The List menu on the main menubar contains commands to Edit, Create, Delete, Copy,
Load, and Save lists. Some of these display a list editor. This is simply a scrollable text
window supporting simple editing facilities (see Section 10.2.3 [Text Windows], page 524).
The "Clear" button clears the list. The "Ok" button removes the list editor window.
It is not necessary to use "Ok" here before supplying the list name for input to another
option.

1.10.3.1 Special List Names
Some lists are automatically updated or are generated on-the-fly as needed. The lists named
"contigs" and "readings" correspond to the currently selected contigs in the contig selector
window and the currently selected readings in the template displays. Note that lists (with
any names) can also be created from selected items in the contig editor. See Section 2.6.8.18
[Set Output List], page 186. The "allcontigs" and "allreadings" lists are created as needed
and always contain an identifier for every contig and every reading respectively.
Because of the way the lists are implemented, as is outlined below, there are some useful
"tricks" that can be employed. A list name consisting of a contig identifier surrounded by
square brackets (’[’ and ’]’) will cause the creation of a list containing all of the readings
within that contig. For example, to use the Extract Readings option (see Section 2.12.7
[Extract Readings], page 273) to extract all the readings from contig ’xb54f8.s1’, the list
name given in the Extract Readings dialogue would be ’[xb54f8.s1]’.
A list name surrounded by curly brackets (’{’ and ’}’) will cause the creation of a list
containing all of the readings in the contigs named in the specified list name. So ’{contigs}’
is equivalent to all the readings in the contigs contained in the ’contigs’ list. Hence the
’allreadings’ list is identical to ’{allcontigs}’.
These tricks can be used anywhere where a list name is required except for editing and
deletion of lists. As a final example, to produce a file of filenames for the currently selected
contigs, save the list named ’{contigs}’ to a file.
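The bracket conventions described above can be illustrated with a minimal Python sketch. It is not the program's own implementation: the dictionaries stand in for the database, contigs are keyed by a single reading name for simplicity, and the function name resolve_list is hypothetical.

```python
# Hypothetical in-memory stand-ins for the database; contents are illustrative.
CONTIG_READINGS = {
    "xb54f8.s1": ["xb54f8.s1", "xb56a2.s1"],
    "xc01d3.s1": ["xc01d3.s1"],
}
LISTS = {"contigs": ["xb54f8.s1"]}


def resolve_list(name, lists=LISTS, contig_readings=CONTIG_READINGS):
    """Return the items denoted by a list name, expanding the bracket forms."""
    if name.startswith("[") and name.endswith("]"):
        # [contig] -> all of the readings within that one contig
        return list(contig_readings[name[1:-1]])
    if name.startswith("{") and name.endswith("}"):
        # {list} -> all of the readings in every contig named in that list
        readings = []
        for contig in resolve_list(name[1:-1], lists, contig_readings):
            readings.extend(contig_readings[contig])
        return readings
    if name == "allcontigs":
        # generated on the fly: one identifier per contig
        return list(contig_readings)
    if name == "allreadings":
        # identical to {allcontigs}, as stated in the text
        return resolve_list("{allcontigs}", lists, contig_readings)
    return list(lists[name])
```

With this sketch, resolve_list("[xb54f8.s1]") and resolve_list("{contigs}") both return the two readings of the first contig, because the hypothetical "contigs" list holds only that contig.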

1.10.3.2 Basic List Commands
The basic operations that can be performed on lists include copying, loading, saving, editing,
creation and deletion. Joining and splitting can only be performed using the list editors
and using cut and paste between windows.
The Load and Save commands require a list name and a file name. If only the name of
the file is given the list is assumed to have the same name. If it is desired to load or save
a list from/to a file of a different name then both should be specified. Creating a list that
already exists (or loading a file into an already existing list) is allowed, but will produce a
warning message.
The “Reading list” option controls whether the list to be loaded is a list of reading names
(which is normally the case). This will then turn on hyperlinking in any text views of this
list. Double-left clicking on an underlined reading name will bring up the contig editor
while right-clicking will bring up a command menu.

1.10.3.3 Contigs To Readings Command
This command produces a list or file of reading names for a single contig or for a set of
contigs. The user interface provides a dialogue to select the contigs and to select a list name
or filename.

1.10.3.4 Search Sequence Names
This command searches for sequences whose names match a prefix. The function produces both
a list in the text output window and a gap5 "list" of reading names. The highlighted
output is clickable, with the left mouse button invoking the contig editor and the right mouse
button displaying a popup-menu allowing additional operations (contig editor, template
display, reading notes and contig notes).
All searches are case sensitive and prefix only.
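As a rough illustration of what "case sensitive and prefix only" means in practice, the following Python sketch filters a set of names. The function name and the example names are assumptions made for the illustration; they are not part of the program.

```python
def search_sequence_names(names, prefix):
    """Return the names that start with `prefix`, compared case-sensitively."""
    return [name for name in names if name.startswith(prefix)]
```

Given the names xb54f8.s1, xb56a2.s1, XB54f8.s1 and axb54, a search for the prefix xb5 would return only the first two: the third differs in case and the fourth contains the text but does not start with it.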

Chapter 2: Sequence assembly and finishing using Gap4

2 Sequence assembly and finishing using Gap4
2.1 Organisation of the gap4 Manual
The main body of the gap4 manual is divided, where possible, into sections covering related
topics. If appropriate, these sections commence with an overview of the functions they
contain. After the Introduction, the manual contains chapters on some important components of the user interface: the Contig Selector (see Section 2.3 [Contig Selector], page 123),
the Contig Comparator (see Section 2.4 [Contig Comparator], page 126), and then, in the
chapter on Contig Overviews (see Section 2.5 [Contig Overviews], page 130) we describe the
Template Display (see Section 2.5.1 [Template Display], page 130), and its subcomponents
the Stop Codon Plot (see Section 2.5.5 [Stop Codon Map], page 156), and the Restriction
Enzyme Plot (see Section 2.5.6 [Restriction Enzyme Search], page 157).
Then there is a long chapter on the powerful Contig Editor (see Section 2.6 [Editor
introduction], page 160), followed by a chapter describing the many assembly engines and
assembly modes which gap4 can offer (see Section 2.7 [Assembly Introduction], page 205).
Gap4 contains functions to use the data in an assembly database to find the left to
right order of contigs, and to compare their consensus sequences to look for joins that
may have been missed during assembly. A "read-pair" is obtained by sequencing a DNA
template (or "insert") from both ends: we then know the relative orientations of the two
readings, and if we know the approximate template length, we know how far apart they
should be after assembly. The next chapter is on the use of read-pair data for ordering
contigs and checking assemblies and on the use of consensus comparisons for finding joins
(see Section 2.8 [Ordering and Joining Contigs], page 217).
The next chapter is on checking assemblies and removing readings (see Section 2.9
[Checking Assemblies and Removing Readings], page 235). The following chapter describes
gap4’s methods for suggesting experiments for helping to finish a sequencing project (see
Section 2.10 [Finishing Experiments], page 241). Then we describe the various consensus calculation algorithms, and the options for creating consensus sequence files (see Section 2.11.5
[The Consensus Calculation], page 257). Next is the description of a set of miscellaneous
functions (see Section 2.12 [Miscellaneous functions], page 265), followed by chapters on
the Results Manager (see Section 2.13 [Results Manager], page 277), Lists (see Section 2.14
[Lists Introduction], page 278), Notes (see Section 2.15 [Notes], page 281), Configuring gap4
(see Section 2.20.1 [Options Menu], page 298), gap4 Database Files (see Section 2.16 [Gap
Database Files], page 284), Checking Databases for corruptions (see Section 2.18 [Check
Database], page 290) and Doctoring corrupted databases (see Section 2.19 [Doctor database], page 293).

2.2 Introduction
Gap4 is a Genome Assembly Program. The program contains all the tools that would
be expected from an assembly program plus many unique features and a very easily used
interface. The original version was described in Bonfield,J.K., Smith,K.F. and Staden,R. A
new DNA sequence assembly program. Nucleic Acids Res. 23, 4992-4999 (1995).

Gap4 is very big and powerful. Everybody employs a subset of options and has their
favourite way of accessing and using them. Although there is a lot of it, users are encouraged
to go through the whole of the documentation once, just to discover what is possible, and
the way that best suits their own work. At the very least, the whole of this introductory
chapter should be read, as in the long run, it will save time.
This chapter serves as a cross reference point, to give an overview of the program and to
introduce some of the important ideas which it uses. The main topics that are introduced
are listed in the current section. We introduce the use of base call accuracy values for
speeding up sequencing projects (see Section 2.2.5 [The use of numerical estimates of base
calling accuracy], page 118). The ability to annotate segments of readings and the consensus
can be very convenient (see Section 2.2.7 [Annotating and masking readings and contigs],
page 121). Generally the 3’ ends of readings from sequencing instruments are of too low
a quality to be used to create reliable consensus, but they can be useful, for example, for
finding joins between contigs (see Section 2.2.6 [Use of the "hidden" poor quality data],
page 120).
One of the most powerful features of gap4 is its graphical user interface which enables
the data to be viewed and manipulated at several levels of resolution. The displays which
provide these different views are introduced, with several screenshots (see Section 2.2.3
[Introduction to the gap4 User Interface], page 101).
It is important to understand the different files used by our sequence assembly software,
and how the data is processed before it reaches gap4 (see Section 2.2.1 [Summary of the
Files used and the Preprocessing Steps], page 97).
Note that gap4 is a very flexible program, and is designed so that it can easily be
configured to suit different purposes and ways of working. For example it is easy to create
a beginner's version of gap4 which has only a subset of functions. What is described in this
manual is the full version, and so is likely to contain some perhaps more esoteric options
that few people will need to use. This introductory section also contains a complete list of
the options in the gap4 main menus (see Section 2.2.4 [Gap4 Menus], page 115).
In addition to sequence assembly, gap4 can be used for managing mutation study data
and for helping to discover and check for mutations (see Section 3.1 [Introduction to Searching for Mutations], page 309).
Two further useful facilities of gap4 are "Lists" and "Notes". For many operations it is
convenient to be able to process sets of data together - for example to calculate a consensus
sequence for a subset of the contigs. To facilitate this gap4 uses lists (see Section 2.14 [Lists
Introduction], page 278). A ‘Note’ (see Section 2.15 [Notes], page 281) is an arbitrary piece
of text which can be attached to any reading, any contig, or to the database in general.

2.2.1 Summary of the Files used and the Preprocessing Steps
Gap4 stores the data for an assembly project in a gap4 database. Before being entered into
the gap4 database the data must be passed through several preassembly steps, usually via
pregap4 (see Section 4.2 [Pregap4 introduction], page 326). These steps are outlined below.
The programs can handle data produced by a variety of sequencing instruments. They
can also handle data entered using digitisers or that has been typed in by hand. Usually
the trace files in proprietary format, such as those of ABI, are converted to SCF files (see
Section 11.1 [SCF introduction], page 533) or ZTR files. As originally put forward in Bonfield,J.K. and Staden,R. The application of numerical estimates of base calling accuracy to
DNA sequencing projects. Nucleic Acids Research 23, 1406-1410 (1995), gap4 makes important use of base call confidence values (see Section 2.2.5 [The use of numerical estimates
of base calling accuracy], page 118), which are normally stored in the reading’s SCF file.
One of the first steps in the preprocessing is to copy the base calls from the trace files
to text files known as Experiment files (see Section 11.3 [Experiment files], page 552). All
the subsequent processes operate on the Experiment files. Other preassembly steps include
quality and vector clipping. Each step is performed by a specific program controlled by the
program pregap4 (see Section 4.2 [Pregap4 introduction], page 326).
Experiment file format is similar to that of EMBL sequence entries in that each record
starts with a two letter identifier, but we have invented new records specific to sequencing
experiments. One of pregap4’s tasks is to augment the Experiment files to include data
about the vectors, primers and templates used in the production of each reading, and if
necessary it can extract this information from external databases. Some of the information
is needed by pregap4 and some by gap4. (Note that in order to get the most from gap4 it is
essential to make sure that it is supplied, via the Experiment files, with all the information
it needs.)
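To make the "two letter identifier" layout concrete, here is a minimal Python sketch that groups the lines of such a file by their leading identifier. It is an illustration only: the record names and values in EXAMPLE are assumed for the purpose of the example, and the definitive list of record types is given in the Experiment files section.

```python
def parse_records(text):
    """Group lines by their leading two-letter record identifier."""
    records = {}
    for line in text.splitlines():
        if len(line) < 2 or not line[:2].isalpha():
            continue  # skip blank lines and anything that is not a record line
        identifier, value = line[:2], line[2:].strip()
        records.setdefault(identifier, []).append(value)
    return records


# Illustrative content only; the values are invented for this sketch.
EXAMPLE = """\
ID   xb54f8.s1
LN   xb54f8.s1.scf
LT   SCF
"""
```

Calling parse_records(EXAMPLE) yields one entry per identifier, each holding the list of values seen for that identifier. Lines that do not begin with two letters, such as indented sequence lines, are ignored by this simplified parser.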
The trace files are not altered, but are kept as archival data so that it is always possible
to check the original base calls and traces. Any changes to the data prior to assembly (and
we recommend that none are made until readings can be viewed aligned with others) are
made to the copy of the sequence in the Experiment file.
The reading data, in Experiment file format, is entered into the project database (see
Section 2.16 [Gap Database Files], page 284), usually via one of the assembly engines.
Because Experiment file format was based on EMBL file format, EMBL files can also be
entered and their feature tables will be converted to tags. There is no limit to the length of
readings which can be entered.
All the changes to the data made by gap4 are made to the copies of the data in the
project database. Once the data has been copied into the gap4 database the Experiment
files are no longer required.
Gap4 uses the trace files to display the traces (see Section 2.6.11 [Traces], page 188),
and to compare the edited bases with the original base calls (see Section 2.6.6.11 [Search by
Evidence for Edit (1)], page 176), (see Section 2.6.6.12 [Search by Evidence for Edit (2)],
page 176). However gap4 databases do not store trace files: they record only the names
of the trace files (which are copied from the readings’ Experiment files). This means that
if the trace files for a project are not in the same directory/folder as the gap4 database,
gap4 needs to be told where they are, otherwise it cannot use them. Ideally, all the trace
files for a project should be stored in one directory. To tell gap4 where they are the "Trace
file location" command in the Options menu should be used (see Section 2.20.8 [Trace File
Location], page 302).
Gap4 databases have a number of size constraints, some of which can be altered by users
and others which are fixed.
While gap4 is running it often needs to calculate a consensus. The maximum size of
this sequence is controlled by a variable "maxseq". Most routines are able to automatically
increase the value of maxseq while they are running, but some of the older functions,
including some of the original assembly engines, are not. This means that it is important
for users to set maxseq to a sufficiently high value before running these elderly routines.
By default maxseq is currently set to 100000, but users can set it on the command line or
from within the Options menu.
Gap4 databases contain one record for each reading and one for each contig. The sum of
these two sets of records is the "database size", and the maximum value that database size
is permitted to reach is "maxdb". When databases are initialised maxdb is set, by default,
to 8000. Users can alter this value on the command line or from within the Options menu
of gap4.
Gap4 databases also limit the number of readings and the length of their names so that
various output routines know how many character positions are required: the maximum number
of readings imposed in this way is 99,999,999, and the maximum reading name length is 40 characters.
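The arithmetic behind these limits can be sketched in a few lines of Python. The constants are the values quoted in the text; the function names are hypothetical and the sketch is not how gap4 itself enforces the limits.

```python
MAXDB_DEFAULT = 8000            # default maxdb quoted in the text
MAX_READINGS = 99_999_999       # maximum number of readings quoted in the text
MAX_READING_NAME_LENGTH = 40    # maximum reading name length quoted in the text


def database_size(num_readings, num_contigs):
    """The database size is the number of reading records plus contig records."""
    return num_readings + num_contigs


def fits_in_database(num_readings, num_contigs, maxdb=MAXDB_DEFAULT):
    """True if the database size does not exceed maxdb."""
    return database_size(num_readings, num_contigs) <= maxdb


def reading_name_ok(name):
    """True if the reading name is within the 40 character limit."""
    return len(name) <= MAX_READING_NAME_LENGTH
```

For example, 7500 readings in 500 contigs just fit within the default maxdb of 8000, whereas one more contig would require maxdb to be raised.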
Currently we have sites with single gap4 databases containing over 200,000 readings with
consensus sequences in excess of 7,000,000 bases.
A gap4 database can be used by several users simultaneously, but only one is allowed
to change the contents of the database, and the others are given "readonly" access. As
part of its mechanism to prevent more than one person editing a database at once gap4
uses a "BUSY" file to signify that the database is opened for writing. Before opening a
database for writing, gap4 checks to see if the BUSY file for that database exists. If it does,
the database is opened only for reading; if not, gap4 creates the file, so that any additional
attempts to open the database for writing will be blocked. When the user with write access
closes the database, the BUSY file is deleted, so that the database can once again be opened
for changes. It is worth remembering that a side effect of this mechanism is that in the
event of a program or system crash the BUSY file will be left on the disk, even though the
database is not being used. In this case users must remove the BUSY file before using the
database (see Section 2.16 [Gap Database Files], page 284).
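The BUSY file protocol can be sketched as follows in Python. This is an illustration of the general lock-file technique, not gap4's own code: the function names are hypothetical, the caller supplies the path of the BUSY file, and the use of an exclusive create is a design choice of the sketch.

```python
import os


def open_database(busy_path):
    """Return "write" if the BUSY file was created by us, else "readonly"."""
    try:
        # O_EXCL makes the creation fail if the file already exists, so the
        # existence check and the creation happen as a single step.
        fd = os.open(busy_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return "readonly"
    os.close(fd)
    return "write"


def close_database(busy_path, mode):
    """Only the user holding write access removes the BUSY file."""
    if mode == "write":
        os.remove(busy_path)
```

If the program crashes between open_database and close_database, the BUSY file stays on disk and every later call returns "readonly" until the file is removed by hand, which is the side effect described above.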
The final result from a sequencing project is a consensus sequence (see Section 2.11.5
[The Consensus Calculation], page 257) and gap4 can write these in Experiment file format,
FASTA format or Staden format. Of course the whole database and all the trace files are also
useful for future reference as they allow any queries about the accuracy of the sequence to
be answered.

2.2.2 Summary of Gap4’s Functions
The tasks which gap4 can perform can be roughly divided into assembly (see Section 2.7
[Assembly Introduction], page 205), finishing (see Section 2.10 [Finishing Experiments],
page 241), and editing (see Section 2.6 [Editor introduction], page 160). But gap4 contains
many other functions which can help to complete a sequencing project with the minimum
amount of effort, and some of these are listed below.
Readings are entered into the gap4 database using the assembly algorithms (see
Section 2.7 [Assembly Introduction], page 205). In general these algorithms will build
the largest contigs they can by finding overlaps between the readings, however some,
perhaps more doubtful, joins between contigs may be missed, and these can be discovered,
checked and made using Find Internal Joins (see Section 2.8.3 [Find Internal Joins],
page 227), Find repeats (see Section 2.8.4 [Find repeats], page 233) and Join Contigs
(see Section 2.6.15 [The Join Editor], page 196). Find Internal Joins compares the ends
of contigs to see if there are possible overlaps and then presents the overlap in the Contig
Joining Editor, from where the user can view the traces, make edits and join the contigs.
Find Repeats can be used in a similar way, but unlike Find Internal Joins it does not
require the matches it finds to continue to the ends of contigs.
Read-pair data can be used to automatically put contigs into the correct order (see
Section 2.8.1 [Ordering Contigs], page 219), and information about contigs which share
templates can be plotted out (see Section 2.8.2 [Find Read Pairs], page 222). The relationships of readings and templates, within and between contigs can also be shown by
the Template Display (see Section 2.5.1 [Template Display], page 130) which has a wide
selection of display modes and uses.
Problems with the assembly can be revealed by use of Check Assembly (see Section 2.9
[Checking Assemblies], page 236), Find repeats (see Section 2.8.4 [Find repeats], page 233),
and Restriction Enzyme mapping (see Section 2.5.6 [Plotting Restriction Enzymes],
page 157). Check Assembly compares every reading with the segment of the consensus it
overlaps to see how well it aligns. Those that align poorly are plotted out in the Contig
Comparator. Find Repeats also presents its results in the Contig Comparator, so if used in
conjunction with Check Assembly, it can show cases where readings have been assembled
into the wrong copy of a repeated element. At the end of a project the Restriction Enzyme
map function can be used to compare the consensus sequence with a restriction digest of
the target sequence. Problems can also be found by use of the various Coverage Plots
available in the Consistency Display (see Section 2.5.2 [Consistency Display], page 140).
These plots will show regions of low or high reading coverage (see Section 2.5.2.2 [Reading
Coverage Histogram], page 142), places with data for only one strand (see Section 2.5.2.4
[Strand Coverage], page 144), or where there is no read-pair coverage (see Section 2.5.2.3
[Read-Pair Coverage Histogram], page 143). Errors can be corrected by Disassemble
Readings (see Section 2.9.1.2 [Disassembling Readings], page 240) and Break Contig (see
Section 2.9.1.1 [Breaking Contigs], page 239) which can remove readings from contigs or
databases or can break contigs.
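The idea behind Check Assembly, comparing each reading with the consensus segment it overlaps, can be illustrated by a simplified Python sketch. It assumes that the reading and the consensus segment have already been aligned to equal length, and the threshold parameter and function names are hypothetical; the real function is described in the Checking Assemblies section.

```python
def percent_mismatch(reading, consensus_segment):
    """Percentage of aligned positions at which the reading and consensus differ."""
    if len(reading) != len(consensus_segment):
        raise ValueError("sequences must be aligned to the same length")
    if not reading:
        return 0.0
    mismatches = sum(1 for r, c in zip(reading, consensus_segment) if r != c)
    return 100.0 * mismatches / len(reading)


def poorly_aligned(alignments, threshold):
    """Names of readings whose mismatch percentage exceeds the threshold.

    `alignments` maps a reading name to a (reading, consensus_segment) pair.
    """
    return [name for name, (reading, segment) in alignments.items()
            if percent_mismatch(reading, segment) > threshold]
```

Readings returned by poorly_aligned correspond to those that would be plotted for closer inspection.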
The general level of completeness of the consensus sequence can be seen diagrammatically
using the Quality Plot (see Section 2.5.1.5 [Quality Plot], page 137), and the confidence
values for each base in the consensus sequence can be plotted (see Section 2.5.2.1 [Confidence
Values Graph], page 142).
The most powerful component of gap4 is its Contig Editor (see Section 2.6 [Editor
introduction], page 160), which has many display modes and search facilities to enable very
rapid discovery and fixing of base call errors.
If working on a protein coding sequence, the consensus can be analysed using the Stop
Codon Map (see Section 2.5.5 [Stop Codon Map], page 156), and its translation viewed
using the Contig Editor (see Section 2.6.8.1 [Status Line], page 180).
The final result from a sequencing project is a consensus sequence (see Section 2.11.5
[The Consensus Calculation], page 257).

2.2.3 Introduction to the gap4 User Interface
Gap4 has a main window from which all the main options are selected from menus. When a
database is open it also has a Contig Selector which will transform into a Contig Comparator
whenever needed. In addition many of the gap4 functions, such as the Contig Editor or the
Template Display will create their own windows when they are activated. All the graphical
displays and the Contig Editor can be scrolled in register. The base of the graphical display
windows usually contains an Information Line for showing short textual data about results
or items touched by the mouse cursor. Gap4 is best operated using a three button mouse,
but alternative keybindings are available. Full details of the user interface are described
elsewhere (see [User Interface], page 523), and here we give an introduction based around
a series of screenshots.
The main window (shown below) contains an Output window for textual results, an Error
window for error messages, and a series of menus arranged along the top. The contents of
the two text windows can be searched, edited and saved. Each set of results is preceded by
a header containing the time and date when it was generated.
Some of the text will be underlined and shaded differently. These are hyperlinks which
perform an operation when clicked on with the left mouse button, typically invoking a
graphical display such as the contig editor. Clicking on these with the right mouse button
will bring up a menu of additional operations. At present only a few commands (Show
Relationships and the Search functions) produce hypertext, but if there is sufficient interest
this may be expanded on.

2.2.3.1 Introduction to the Contig Selector
The gap4 Contig Selector is used to display, select and reorder contigs. In the Contig
Selector all contigs are shown as colinear horizontal lines separated by short vertical lines.
The length of the horizontal lines is proportional to the length of the contigs and their
left to right order represents the current ordering of the contigs. Users can change the
contig order by dragging the lines representing the contigs. This is done by clicking and
holding the middle mouse button, or Alt and the left mouse button, on a line and then moving
the mouse cursor. The Contig Selector can also be used to select contigs for processing.
For example, clicking with the right mouse button on the line representing a contig will
invoke a menu containing the commands which can be performed on that contig. There
are several alternative ways of specifying which contig an operation should be performed
on. Contigs are identified by the name or number of any reading they contain. When a
dialogue is requesting a contig name, using the left mouse button to click on the contig in
the Contig Selector will transfer its name to the dialogue box. Other methods are available
(see Section 2.3.1 [Selecting Contigs], page 123).
As the mouse is moved over a contig, it is highlighted and the contig name (leftmost
reading name) and length are displayed in the Information Line. The number in brackets
is the contig number (actually the number of its leftmost reading). Tags or annotations
(see Section 2.2.7 [Annotating and masking readings and contigs], page 121) can also be
displayed in the Contig Selector window.

2.2.3.2 Introduction to the Contig Comparator
Gap4 commands such as Find Internal Joins (see Section 2.8.3 [Find Internal Joins],
page 227), Find Repeats (see Section 2.8.4 [Find Repeats], page 233), Check Assembly
(see Section 2.9 [Check Assembly], page 236), and Find Read Pairs (see Section 2.8.2 [Find
Read Pairs], page 222) automatically transform the Contig Selector (see Section 2.3 [Contig
Selector], page 123) to produce the Contig Comparator. To produce this transformation a
copy of the Contig Selector is added at right angles to the original window to create a two
dimensional rectangular surface on which to display the results of comparing or checking
contigs.
Each of the functions plots its results as diagonal lines of different colours. In general, if
the plotted points are close to the main diagonal they represent results from pairs of contigs
that are in the correct relative order. Lines parallel to the main diagonal represent contigs
that are in the correct relative orientation to one another. Those perpendicular to the
main diagonal show results for which one contig would need to be reversed before the pair
could be joined. The manual contig dragging procedure can be used to change the relative
positions of contigs. See Section 2.3.2 [Changing the Contig Order], page 125. As the
contigs are dragged the plotted results will automatically be moved to their corresponding
new positions. This means that, in general, if users drag the contigs to move their plotted
results close to the main diagonal they will simultaneously be putting their contigs into the
correct relative positions.
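The effect of contig order on the plot can be sketched in Python. The sketch assumes that contigs are laid end to end along both axes in their current order and that a result is plotted at the positions of its two matching regions; the function names and the example values are hypothetical.

```python
def contig_offsets(order, lengths):
    """Start offset of each contig when the contigs are laid end to end."""
    offsets, total = {}, 0
    for name in order:
        offsets[name] = total
        total += lengths[name]
    return offsets


def plot_point(order, lengths, match):
    """Plot coordinates of a match given as (contig_a, pos_a, contig_b, pos_b)."""
    offsets = contig_offsets(order, lengths)
    contig_a, pos_a, contig_b, pos_b = match
    return offsets[contig_a] + pos_a, offsets[contig_b] + pos_b


def distance_from_diagonal(point):
    """Horizontal distance of a plotted point from the main diagonal."""
    x, y = point
    return abs(x - y)
```

Suppose contigs A, B and C have lengths 1000, 5000 and 800, and a match joins position 950 of A to position 50 of C. In the order A, B, C the point is plotted at (950, 6050), far from the diagonal; after dragging C next to A (order A, C, B) it is plotted at (950, 1050), close to the diagonal.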
This plot can simultaneously show the results of independent types of search, making it
easy for users to see if different analyses produce corroborating evidence for the ordering of
contigs. Indications that a reading may have been assembled in an incorrect position can
also be seen - if for example a result from Check Assembly lies on the same horizontal or
vertical projection as a result from Find Repeats, users can see the alternative position to
place the doubtful reading.
The plotted results can be used to invoke a subset of commands by the use of pop-up
menus. For example if the user clicks the right mouse button over a result from Find
Internal Joins a menu containing Invoke Join Editor (see Section 2.6.15 [The Join Editor],
page 196) and Invoke Contig Editors (see Section 2.6 [Editing in gap4], page 160) will pop
up. If the user selects Invoke Join Editor the Join Editor will be started with the two contigs
aligned at the match position contained in the result. If required one of the contigs will be
complemented to allow their alignment.

A typical display from the Contig Comparator is shown above. It includes results for
Find Internal Joins in black, Find Repeats in red, Check Assembly in green, and Find Read
Pairs in blue. Notice that there are several internal joins, read pairs and repeats close to the
main diagonal near the top left of the display. This indicates that the contigs represented in
that area are likely to be in the correct positions relative to one another. In the middle of
the bottom right quadrant there is a blue diagonal line perpendicular to the main diagonal.
This indicates a pair of contigs that are in the wrong relative orientation. The crosshairs
show the positions for a pair of contigs. The vertical line continues into the Contig Selector
part of the display, and the position represented by the horizontal line is also duplicated
there (see Section 2.4 [Contig Comparator], page 126).

2.2.3.3 Introduction to the Template Display
The Template Display can show schematic plots of readings, templates, tags, restriction
enzyme sites and the consensus quality. Colour coding distinguishes reading, primer and
template types. The Template Display can also be used to reorder contigs and to invoke
the Contig Editor.
An example showing all these information types can be seen in the Figure below.

away from one another. Colour coding is used to distinguish between different types of
inconsistency, and whether or not the inconsistency involves readings within or between
contigs. For example, most of the problems shown in the screendump above are coloured
dark yellow, indicating an inconsistency between a pair of contigs. The rest of the data,
(mostly dark blue indicating templates sequenced from only one end), is plotted below the
data for the inconsistent templates. Forward readings are light blue and reverse readings
are orange. Templates in bright yellow have been sequenced from both ends, are consistent
and span a pair of contigs (and so indicate the relative orientation and separation of the
contigs).
At the bottom is the restriction enzyme plot. The coloured blocks immediately above
and below the ruler are tags. Those above the ruler can also be seen on their corresponding
readings in the large top section. The display can be zoomed. The position of a crosshair
is shown in the two leftmost boxes in the top right hand corner. The leftmost shows the
distance in bases between the crosshair and the start of the contig underneath the crosshair.
The middle box shows the distance between the crosshair and the start of the first contig.
The right box shows the distance between two selected cut sites in the restriction enzyme
plots (see Section 2.5.1 [Template Display], page 130).

2.2.3.4 Introduction to the Consistency Display
The Consistency Display provides plots designed to highlight potential problems in contigs.
It is invoked from the main gap4 View menu by selecting any of its plots. Once a plot has
been displayed, any of the other types of consistency plot can be displayed within the same
frame from the View menu of the Consistency Display.
An example showing the Confidence Values Graph and the corresponding Reading Coverage Histogram, Read-Pair Coverage Histogram and Strand Coverage is shown below.

If more than one contig is displayed, the contigs are drawn immediately after one another
but are staggered in the y direction.
The ruler ticks can be turned on or off from the View menu of the consistency display. The plots can be enlarged or reduced using the standard zooming mechanism. See
Section 10.5.1 [Zooming], page 528.
The crosshair toggle button controls whether the crosshair is visible. This is shown as a
black vertical and horizontal line. The position of the crosshair is shown in the 3 boxes to
the right of the crosshair toggle. The first box indicates the cursor position in the current
contig. The second box indicates the overall position of the cursor in the consensus. The
last box shows the y position of the crosshair (see Section 2.5.2 [Consistency Display],
page 140).
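The relationship between the first two crosshair boxes can be sketched in Python. The sketch assumes 1-based positions and contigs drawn immediately after one another with no gaps; the function name is hypothetical.

```python
def locate(overall_pos, contig_lengths):
    """Map a 1-based overall position to (contig_index, position_in_contig)."""
    if overall_pos < 1:
        raise ValueError("positions are counted from 1")
    offset = 0
    for index, length in enumerate(contig_lengths):
        if overall_pos <= offset + length:
            return index, overall_pos - offset
        offset += length
    raise ValueError("position lies beyond the last contig")
```

For three contigs of lengths 1000, 500 and 2000, an overall position of 1600 falls in the third contig (index 2) at position 100.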

2.2.3.5 Introduction to the Restriction Enzyme Map
The restriction enzyme map function finds and displays restriction sites within a specified
region of a contig. Users can select the enzyme types to search for and can save the sites
found as tags within the database.

This figure shows a typical view of the Restriction Enzyme Map in which the results for
each enzyme type have been configured by the user to be drawn in different colours. On the
left of the display the enzyme names are shown adjacent to their rows of plotted results.
If no result is found for any particular enzyme, e.g. here APAI, the row will still be shown
so that zero cutters can be identified. Three of the enzyme types have been selected and
are shown highlighted. The results can be scrolled vertically (and horizontally if the plot is
zoomed in). A ruler is shown along the base and the current cursor position (the vertical
black line) is shown in the left hand box near the top right of the display. If the user clicks,
in turn, on two restriction sites their separation in base pairs will appear in the top right
hand box. Information about the last site touched is shown in the Information line at the
bottom of the display. At the top the edit menu is shown and can be used to create tags
for highlighted enzyme types (see Section 2.5.6 [Restriction Enzyme Search], page 157).
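A restriction site search of this kind can be sketched in Python. This is a simplified illustration rather than gap4's algorithm: it reports the 1-based start of each recognition sequence instead of the cut position, handles only unambiguous recognition sequences, and the function name is hypothetical. The two recognition sequences shown are palindromic, so searching one strand is sufficient for them.

```python
ENZYMES = {"EcoRI": "GAATTC", "ApaI": "GGGCCC"}


def find_sites(sequence, enzymes=ENZYMES):
    """Return {enzyme name: [1-based start positions of its recognition sequence]}."""
    sequence = sequence.upper()
    results = {}
    for name, site in enzymes.items():
        positions = []
        start = sequence.find(site)
        while start != -1:
            positions.append(start + 1)
            start = sequence.find(site, start + 1)
        results[name] = positions  # stays empty for zero cutters
    return results
```

For the sequence AAGAATTCTTTTGAATTCAA the sketch reports EcoRI at positions 3 and 13, a separation of 10 bases, and an empty row for ApaI, which is therefore a zero cutter.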


2.2.3.6 Introduction to the Stop Codon Map
The Stop Codon Map plots the positions of all the stop codons on one or both strands of
a contig consensus sequence. If the Contig Editor is being used on the same contig, the
Refresh button will be enabled, and if used, will fetch the current consensus from the editor,
repeat the search and replot the stop codons.

The figure shows a typical zoomed in view of the Stop Codon Map display. The positions
for the stop codons in each reading frame (here all six frames are shown) are displayed in
horizontal strips. Along the top are buttons for zooming, the crosshair toggle, a refresh
button and two boxes for showing the crosshair position. The left box shows the current
position and the right-hand box the separation of the last two stop codons selected by the
user. Below the display of stop codons is a ruler and a horizontal scrollbar. The information
line is showing the data for the last stop codon the user has touched with the cursor. Also
shown on the left is the View menu which is used to select the reading frames to display
(see Section 2.5.5 [Stop Codon Map], page 156).


2.2.3.7 Introduction to the Contig Editor
The gap4 Contig Editor is designed to allow rapid checking and editing of characters in
assembled readings. Very large savings in time can be achieved by its sophisticated problem finding procedures which automatically direct the user only to the bases that require
attention. The following is a selection of screenshots to give an overview of its use.

The figure above shows a screendump from the Contig Editor which contains segments
of aligned readings, their consensus and a six phase translation. The Commands menu is
also shown. The main components are: the controls at the top; reading names on the left;
sequences to their right; and status lines at the bottom. Some of the reading names are
written in light grey which indicates that their traces/chromatograms are being displayed
(in another window, see below).
One reading name is written with inverse colours, which indicates that it has been
selected by the user. To the left of each reading name is the reading number, which is
negative for readings which have been reversed and complemented. The first of the status
lines, labelled "Strands", is showing a summary of strand coverage. The left half of the
segment of sequence being displayed is covered only by readings from one strand of the
DNA, but the right half contains data from both strands.
Along the top of the editor window is a row of command buttons and menus. The
rightmost pair of buttons provide help and exit. To their left are two menus, one of which
is currently in use. To the left of this is a button which initially displays a search dialogue
and, when pressed again, performs the selected search. Further left is the undo button:
each time the user clicks on this box the program reverses the previous edit command. The
next button, labelled "Cutoffs" is used to toggle between showing or hiding the reading
data that is of poor quality or is vector sequence. In this figure it has been activated,
showing the poor quality data in light grey. Within this, sequencing vector is displayed in


One of the readings contains a yellow tag, and elsewhere some bases are coloured red,
which indicates they are of poor quality. The Information Line at the bottom of the window
can show information about readings, annotations and base calls. In this case it is showing
information about the reliability of the base beneath the editing cursor.

A better way of displaying the accuracy of bases is to shade their surroundings so that
the lighter the background the better the data. In the figure above, this grey scale encoding
of the base accuracy or confidence has been activated for bases in the readings and the consensus. This screenshot also shows the Contig Editor displaying disagreements and edits.
Disagreements between the consensus and individual base calls are shown in dark green.
Notice that these disagreements are in poor quality base calls. Edits (here they are all pads)
are shown with a light green background. When they are present, replacements/insertions
are shown in pink, deletions in red and confidence value changes in purple. The consensus
confidence takes into account several factors, including individual base confidences, sequencing chemistry, and strand coverage. It can be seen that the consensus for the section covered
by data from only one strand has been calculated to be of lower confidence than the rest.
The Status Line includes two positions marked with exclamation marks (!) which means
that the sequence is covered by data from both strands, but that the consensus for each of
the two strands is different. The Information Line at the bottom of the window is showing
information about the reading under the cursor: its name, number, clipped length, full
length, sequencing vector and BAC clone name.

The Contig Editor can rapidly display the traces for any reading or set of readings. The
number of rows and columns of traces displayed can be set by the user. The traces scroll in
register with one another, and with the cursor in the Contig Editor. Conversely, the Contig
Editor cursor can be scrolled by the trace cursor. A typical view is shown above.
This figure is an example of the Trace Display showing three traces from readings in
the previous two Contig Editor screendumps. These are the best two traces from each
strand plus a trace from a reading which contains a disagreement with the consensus. The
program can be configured to automatically bring up this combination of traces for each
problem located by the "Next search" option. The histogram or vertical bars plotted top
down show the confidence value for each base call. The reading number, together with the
direction of the reading (+ or -) and the chemistry by which it was determined, is given
at the top left of each sub window. There are three buttons (’Info’, ’Diff’, and ’Quit’)
arranged vertically with X and Y scale bars to their right. The Info button produces a
window like the one shown in the bottom right hand corner. The Diff button is mostly used
for mutation detection, and causes a pair of traces to be subtracted from one another and
the result plotted, hence revealing their differences. (see Section 2.6.11 [Traces], page 188).


2.2.3.8 Introduction to the Contig Joining Editor
Contigs are joined interactively using the Join Editor. This is simply a pair of contig editor
displays stacked one above the other with a "differences" line in between. The Contig Join
Editor is usually invoked by clicking on a Find Internal Joins or Find Repeats result in the
Contig Comparator, in which case the two contigs appear with the match found by
these searches displayed.
The few differences between the Join Editor and the Contig Editor can be seen in the
figure below. Otherwise all the commands and operations are the same as those for the
Contig Editor.

In this figure the Cutoff or Hidden data is being displayed for the right hand contig. One
difference between the Contig Editor and the Join Editor is the Lock button. When set
(as it is in the illustration) the two contigs scroll in register, otherwise they can be scrolled
independently.
The Align button aligns the overlapping consensus sequences (see Section 2.6.15 [Editor
joining], page 196).


2.2.4 Gap4 Menus
The main window for gap4 contains File, Edit, View, Options, Experiments, Lists and
Assembly menus.

2.2.4.1 Gap4 File menu
The File menu includes database opening and copying functions and consensus calculation
options.
• Change Directory (see Section 2.16.1 [Directories], page 284)
• Check Database (see Section 2.18 [Check Database], page 290)
• New (see Section 2.16.2 [Opening a New Database], page 285)
• Open (see Section 2.16.3 [Opening an Existing Database], page 285)
• Copy Database (see Section 2.16.4 [Making Backups of Databases], page 285)
• Copy Readings (see Section 2.17.1 [Copying Readings], page 287)
• Save Consensus (see Section 2.11.5 [The Consensus Calculation], page 257)
• Extract Readings (see Section 2.12.7 [Extract Readings], page 273)

2.2.4.2 Gap4 Edit menu
The Edit menu contains options that alter the contents of the database.
• Edit Contig (see Section 2.6 [Editor introduction], page 160)
• Join Contigs (see Section 2.6.15 [Editor joining], page 196)
• Save Contig Order (see Section 2.8.1 [Order Contigs], page 219)
• Break Contig (see Section 2.9.1.1 [Break Contig], page 239)
• Complement a Contig (see Section 2.12.1 [Complement a Contig], page 265)
• Order Contigs (see Section 2.8.1 [Order Contigs], page 219)
• Quality Clip (see Section 2.12.8.2 [Quality Clipping], page 275)
• Quality Clip Ends (see Section 2.12.8.3 [Quality Clip Ends], page 275)
• Difference Clip (see Section 2.12.8.1 [Difference Clipping], page 274)
• N-Base Clip (see Section 2.12.8.4 [N-Base Clipping], page 276)
• Double Strand (see Section 2.10.1 [Double Strand], page 241)
• Disassemble Readings (see Section 2.9.1.1 [Break Contig], page 239)
• Enter Tags (see Section 2.12.2 [Enter Tags], page 265)
• Edit Notebooks (see Section 2.15 [Notes], page 281)
• Doctor Database (see Section 2.19 [Doctor database], page 293)

2.2.4.3 Gap4 View menu
The View menu contains options to look at the data at several levels of detail, and analytic
functions which present their results graphically.
• Contig Selector (see Section 2.3 [Contig Selector], page 123)
• Results Manager (see Section 2.13 [Results Manager], page 277)
• Find Internal Joins (see Section 2.8.3 [Find Internal Joins], page 227)
• Find Read Pairs (see Section 2.8.2 [Find Read Pairs], page 222)
• Find Repeats (see Section 2.8.4 [Find repeats], page 233)
• Check Assembly (see Section 2.9 [Check Assembly], page 236)
• Sequence Search (see Section 2.12.6 [Find Oligos], page 271)
• Template Display (see Section 2.5.1 [Template Display], page 130)
• Show Relationships (see Section 2.12.4 [Show Relationships], page 267)
• Restriction Enzyme map (see Section 2.5.6 [Restriction Enzyme Search], page 157)
• Stop Codon Map (see Section 2.5.5 [Stop Codon Map], page 156)
• Quality Plot (see Section 2.5.1.5 [Quality Plot], page 137)
• List Confidence (see Section 2.11.6 [List Confidence], page 261)
• Reading Coverage Histogram (see Section 2.5.2.2 [Reading Coverage Histogram],
page 142)
• Read-Pair Coverage Histogram (see Section 2.5.2.3 [Read-Pair Coverage Histogram],
page 143)
• Strand Coverage (see Section 2.5.2.4 [Strand Coverage], page 144)
• Confidence Values Graph (see Section 2.5.2.1 [Confidence Values Graph], page 142)

2.2.4.4 Gap4 Options menu
The Options menu contains options for configuring gap4.
• Consensus Algorithm (see Section 2.20.2 [Consensus Algorithm], page 299)
• Set Maxseq (see Section 2.20.3 [Set Maxseq], page 299)
• Set Fonts (see Section 2.20.4 [Set Fonts], page 299)
• Configure Menus (see Section 2.20.5 [Configuring Menus], page 300)
• Set Genetic Code (see Section 2.20.6 [Set Genetic Code], page 300)
• Alignment Scores (see Section 2.20.7 [Alignment Scores], page 301)
• Trace File Location (see Section 2.20.8 [Trace File Location], page 302)

2.2.4.5 Gap4 Experiments menu
The Experiments menu contains options to analyse the contigs and to suggest experimental
solutions to problems.
• Suggest Long Readings (see Section 2.10.3 [Suggest Long Readings], page 245)
• Suggest Primers (see Section 2.10.2 [Suggest Primers], page 243)
• Compressions and Stops (see Section 2.10.4 [Compressions and Stops], page 247)
• Suggest Probes (see Section 2.10.5 [Suggest Probes], page 249)

2.2.4.6 Gap4 Lists menu
The Lists menu contains a set of options for creating and editing lists for use in various
parts of the program.
• Creation and Editing (see Section 2.14 [Lists Introduction], page 278)
• Contigs To Readings (see Section 2.14.3 [Contigs To Readings Command], page 279)
• Minimal Coverage (see Section 2.14.4 [Minimum Coverage], page 279)
• Unattached Readings (see Section 2.14.5 [Unattached Readings], page 279)
• Highlight Readings List (see Section 2.14.6 [Highlight Readings List], page 279)
• Search Sequence Names (see Section 2.14.7 [Search Sequence Names], page 279)
• Search Template Names (see Section 2.14.8 [Search Template Names], page 280)
• Search Annotation Contents (see Section 2.14.9 [Search Annotation Contents],
page 280)

2.2.4.7 Gap4 Assembly menu
The Assembly menu contains various assembly and data entry methods.
• Normal Shotgun Assembly (see Section 2.7.1 [Normal Shotgun Assembly], page 205)
• Directed Assembly (see Section 2.7.2 [Directed Assembly], page 211)
• Screen Only (see Section 2.7.3 [Assembly Screen Only], page 213)
• Assembly Independently (see Section 2.7.1.1 [Assembly Independently], page 209)


2.2.5 The use of numerical estimates of base calling accuracy
In this section we give an overview of our use, when available, of base call accuracy estimates
or confidence values. We also explain the importance of the consensus calculations used by
gap4, and their role in minimising the work needed to complete sequencing projects.
We first put forward the idea of using numerical estimates of base calling accuracy in
our paper describing the SCF format (Dear, S. and Staden, R. A standard file format
for data from DNA sequencing instruments. DNA Sequence 3, 107-110 (1992)) and then expanded
on their use for editing and assembly in Bonfield, J.K. and Staden, R. The application of
numerical estimates of base calling accuracy to DNA sequencing projects. Nucleic Acids
Res. 23, 1406-1410 (1995).
In Bonfield and Staden (1995), we stated "...the most useful outcome of having a sequence reading determined by a computer-controlled instrument would be that each base
was assigned a numerical estimate of its probability of having been called correctly... having
numerical estimates of base accuracy is the key to further automation of data handling for
sequencing projects. ... The simple procedure we propose in this paper is a method of using
the numerical estimates of base calling accuracy to obviate much of the tedious and time
consuming trace checking currently performed during a sequencing project. In summary
we propose that the numerical estimates of base accuracy should be used by software to
decide if conflicts between readings require human expertise to help adjudicate. We argue
that if the accuracy estimates are reasonably reliable then the majority of conflicts can be
ignored... and so the time taken to check and edit a contig will be greatly reduced."
This has been achieved by making the consensus calculations (see Section 2.11.5 [The
Consensus Calculation], page 257) central to gap4, and by providing calculations which
make use of base call accuracy estimates to give each consensus base a quality measure.
The consensus is not stored in the gap4 database but is calculated when required by each
function that needs it, and hence always takes into account the current data. In the Contig
Editor the consensus is updated instantly to reflect any change made by the user.
In 1998 the first useable probability values became available through the program Phred
(Ewing, B. and Green, P. Base-Calling of Automated Sequencer Traces Using Phred. II.
Error Probabilities. Genome Research. Vol 8 no 3. 186-194 (1998)). Phred produces a
confidence value that defines the probability that the base call is correct. This was an
important step forward and these values are widely used and have defined a decibel type
scale for base call confidence values. Gap4 is currently set to use confidence values defined
on this scale.
The confidence value is given by the formula
C_value = -10*log10(probability of error)
A confidence value of 10 corresponds to an error rate of 1/10; 20 to 1/100; 30 to 1/1000;
and so on. Using the main gap4 consensus algorithm they enable the production of a
consensus sequence for which the expected error rate for each base is known.
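The relationship between confidence values and error probabilities can be illustrated with a short Python sketch. It is not part of gap4; the function names are chosen here purely for illustration and simply restate the formula above and its inverse.

```python
import math


def confidence_from_error(p_error: float) -> float:
    """Confidence value C = -10 * log10(probability of error)."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in the range (0, 1]")
    return -10.0 * math.log10(p_error)


def error_from_confidence(c_value: float) -> float:
    """Inverse of the formula: probability of error = 10 ** (-C / 10)."""
    return 10.0 ** (-c_value / 10.0)


# Print the error probability for the confidence values quoted in the text.
for c in (10, 20, 30):
    print(c, error_from_confidence(c))
```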
As is described elsewhere (see Section 2.11.6 [List Consensus Confidence], page 261)
being able to calculate the confidence for each base in the consensus sequence makes it
possible to estimate the number of errors it contains, and hence the number of errors that
will be removed if particular bases are checked and, if necessary, edited. For example, if
1000 bases in the consensus had confidence 20, we would expect those 1000 bases (with an
error rate of 1/100) to contain 10 errors.
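The arithmetic behind this estimate can be sketched in Python as follows. This is an illustration only, not the code used by List Confidence; the function names and the example data are assumptions made for the sketch.

```python
def expected_errors(confidences):
    """Sum of per-base error probabilities, 10 ** (-C / 10), over all bases."""
    return sum(10.0 ** (-c / 10.0) for c in confidences)


def errors_if_checked_below(confidences, threshold):
    """Return (number of bases to check, expected errors among them)
    if every consensus base with confidence below `threshold` is checked."""
    low = [c for c in confidences if c < threshold]
    return len(low), expected_errors(low)


# Hypothetical consensus: 1000 bases at confidence 20, 9000 at confidence 40.
consensus_conf = [20] * 1000 + [40] * 9000
print(expected_errors(consensus_conf))
print(errors_if_checked_below(consensus_conf, 30))
```

In this made-up example, checking only the 1000 bases with confidence below 30 would be expected to remove about 10 of the roughly 10.9 errors estimated for the whole consensus.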
Another program which produces decibel scale confidence values for ABI 377 data is
ATQA (Daniel H. Wagner Associates, http://www.wagner.com/).
For gap4 the confidence values are expected to lie in the range 1 to 99, with 0 and 100
having special meanings to the program.
The confidence values are stored in SCF or Experiment files and copied into gap4 databases during assembly or data entry.
The searches provided by the Contig Editor (see Section 2.6.6 [Searching], page 174)
are one of gap4’s most important time saving features. The user selects a search type, for
example to find places where the confidence for the consensus falls below a given threshold,
and the search automatically moves the cursor to the next such position in the consensus.
The Contig Editor locates the next problem by applying the consensus calculation to the
contig. To edit a contig the user selects "Search" repeatedly, knowing that it will only
move to places where there is a conflict between good data or where the data is poor. Note
that the program is usually configured to automatically display the relevant traces for each
position located by the search option.
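The idea of a threshold search can be sketched in a few lines of Python. This is a simplified illustration, not the Contig Editor's implementation: it assumes the consensus confidences are already available as a list, uses 0-based positions, and the function name is hypothetical.

```python
def next_low_confidence(consensus_conf, start, threshold):
    """Return the first position >= start whose consensus confidence is
    below `threshold`, or None if there is no such position."""
    for pos in range(start, len(consensus_conf)):
        if consensus_conf[pos] < threshold:
            return pos
    return None


# Repeated searching visits only the positions that need attention.
conf = [40, 35, 12, 50, 8, 45]
pos = next_low_confidence(conf, 0, 20)
while pos is not None:
    print("check position", pos, "confidence", conf[pos])
    pos = next_low_confidence(conf, pos + 1, 20)
```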
The main result is that far fewer disagreements between data are brought to the attention
of the user and fewer traces have to be inspected by eye, and so the whole process is faster.
Another consequence of the strategy is that, as fewer bases need changing to produce the
correct consensus, most of what appears on the screen will be the original base calls. Indeed
we have taken this a step further and suggest that if a base needs changing because it has
a high accuracy estimate, and is conflicting with other good data, then rather than change
the character shown on the screen, the user should lower its accuracy value. By so doing
more of the original base calls are left unchanged and hence are visible to the user. There
is a function within the contig editor to reset the accuracy value for the current base to 0.
Alternatively the accuracy value for the base that is thought to be correct can be set within
the contig editor to 100.


2.2.6 Use of the "hidden" poor quality data
In general sequences obtained from machines contain segments such as vector sequence and
poor quality data that need either to be removed or ignored during assembly and editing.
In our package we do not remove such segments but instead we mark them so that the
programs can deal with them appropriately. In gap4 such data is referred to as "hidden".
The positions to hide are determined initially by preprocessing programs such as vector_clip
(see Chapter 6 [Screening Against Vector Sequences], page 401) and qclip (see Section 12.19
[qclip], page 597).
The hidden data can be revealed in the Contig Editor by toggling the Cutoffs button (see
Section 2.6.3.4 [Adjusting the Cutoff data], page 169); can be used to search for possible
joins between contigs (see Section 2.8.3 [Find Internal Joins], page 227), and can be included
in the consensus sequence (see Section 2.11.2 [Extended consensus], page 253) to be used
by external screening programs. For these cases the program can distinguish data that is
hidden because it is vector and data that is hidden because it is of poor quality: only poor
quality data is included.
The position of hidden data can be changed interactively in the Contig Editor. In
addition the Double Strand function (see Section 2.10.1 [Double stranding], page 241) will
reduce the amount of hidden data for readings that cover single stranded regions of contigs,
if the data aligns well with that on the other strand.


2.2.7 Annotating and masking readings and contigs
Gap4 can label segments of readings and contigs using "tags" (see Section 2.6.5 [Create
Tag], page 171). The program recognises a set of standard tag types and users can also
invent their own. Each tag type has a unique four character identifier, a name, a direction,
a colour and a text string for recording notes. Tags can be created, edited and removed
by users and by internal routines. Tags can also be input along with readings. This is
important when reference sequences are used during mutation detection (see Section 3.1.3
[Reference sequences], page 314).

2.2.7.1 Standard tag types
The standard tag types include those shown below plus the FT records from EMBL sequence
file entries. Users can also invent their own and add them to their personal GTAGDB. This is
a file that describes the available tag types and their colours (see Section 2.20.10 [Configure
the tag database], page 304).

Code    Function
COMM    Comment
COMP    Compression
RCMP    Resolved compression
STOP    Stop
OLIG    Oligo (primer)
REPT    Repeat
ALUS    Alu sequence
SVEC    Sequencing vector
CVEC    Cloning vector
MASK    Mask me
FNSH    Finished segment
ENZ0    Restriction enzyme 0
ENZ9    Restriction enzyme 9
MUTN    Mutation
DIFF    Sequence different to consensus
HETE    Heterozygous mutation
HET+    Heterozygous mutation False +ve
HET-    Heterozygous mutation False -ve
HOM+    Homozygous mutation False +ve
HOM-    Homozygous mutation False -ve
FCDS    FEATURE: CDS
F***    All other (60) EMBL FT record types

2.2.7.2 Active tags and masking
Tags are used for a variety of purposes and for each function in the program the user can
choose which tag types are currently "active". Where they are being used to provide visual
clues this will determine which tag types appear in the displays, but for other functions
they can be used to control which parts of the sequence are omitted from processing. This
mode of tag use is called "masking". For example the program contains a routine to search
for repeats, and if any are found, the user needs to know if such sequence duplications
are caused by incorrect assembly or are genuine repeats. Once the user has checked a
duplication reported by the program and found it to be a repeat, it can be labelled with a
REPT tag. If the repeat routine is run in masking mode and with REPT tags active, any
segment covered by a REPT tag will not be reported as a match. So once the "problem"
has been dealt with it can be labelled so it is not reported on subsequent searches. In
addition the tag is available to provide annotation for the completed sequence when it is
sent to the data libraries.
A more complicated application of masking is available for two of the other search
procedures in the program: shotgun assembly (see Section 2.7.1 [Shotgun assembly], page 205) and
Find Internal Joins (see Section 2.8.3 [Find Internal Joins], page 227). The former is the general assembly function
and the latter is used to find potential joins between contigs in the database. Below we
describe how masking can be used during assembly and similar comments apply to Find
Internal Joins.
In the assembly function the user can choose to employ masking and then select the types
of tags to be used as masks. Readings are compared in two stages: first the program looks
for exact matches of some minimum length and then for each possible overlap it performs
an alignment. If the masking mode is selected the masked regions are not used during the
search for exact matches, but they are used during alignment. The effect of this is that
new readings that would lie entirely inside masked regions will not produce exact matches
and so will not be entered. However readings that have sufficient data outside of masked
segments can produce matches and will be correctly aligned even if they overlap the masked
data. A common use for masking during assembly or Find Internal Joins is to avoid finding
matches that are entirely contained in Alu segments.
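The effect of masking on the exact-match stage can be illustrated with the following Python sketch. It is a deliberately naive stand-in, not gap4's assembly code: the function names, the 0-based end-exclusive tag coordinates and the example sequences are all assumptions made for the illustration, and the alignment stage (which does use the masked data) is omitted.

```python
def masked_positions(length, mask_tags):
    """Boolean list, True where a base is covered by an active mask tag.
    Tags are (start, end) pairs, 0-based with `end` exclusive."""
    masked = [False] * length
    for start, end in mask_tags:
        for i in range(max(0, start), min(length, end)):
            masked[i] = True
    return masked


def exact_seed_hits(consensus, reading, min_len, mask_tags=()):
    """Return (consensus_pos, reading_pos) pairs where `min_len` bases match
    exactly, ignoring every consensus window that touches a masked base."""
    masked = masked_positions(len(consensus), mask_tags)
    index = {}
    for i in range(len(consensus) - min_len + 1):
        if any(masked[i:i + min_len]):
            continue  # masked data is not used when looking for exact matches
        index.setdefault(consensus[i:i + min_len], []).append(i)
    hits = []
    for j in range(len(reading) - min_len + 1):
        for i in index.get(reading[j:j + min_len], []):
            hits.append((i, j))
    return hits


consensus = "AAAACCCCGGGGTTTT"
mask = [(4, 12)]  # hypothetical tag covering CCCCGGGG
print(exact_seed_hits(consensus, "CCCGGG", 4, mask))    # reading inside the mask
print(exact_seed_hits(consensus, "AAAACCCC", 4, mask))  # reading overlapping the mask
```

A reading lying entirely inside the masked segment yields no seed and so would not be entered, whereas a reading with enough data outside the mask still produces a match that the alignment stage could then extend into the masked data.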
A further mode related to masking is "marking". Marking is available for the consensus
calculation (see Section 2.11 [Consensus calculation], page 251) and for Find Internal Joins
(see Section 2.8.3 [Find Internal Joins], page 227). Instead of masking the regions covered by
active tags these routines simply write these sections of the consensus sequence in lowercase
letters. That is they make it easy for users to see where the tagged segments are. Marking
has no other effect.
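Marking can be pictured with a small Python sketch. It assumes, for illustration only, that unmarked consensus is written in uppercase and that tags are given as 0-based, end-exclusive (start, end) pairs; the function name is hypothetical.

```python
def mark_consensus(consensus, active_tags):
    """Write the segments covered by active tags in lowercase letters."""
    chars = list(consensus.upper())
    for start, end in active_tags:
        for i in range(max(0, start), min(len(chars), end)):
            chars[i] = chars[i].lower()
    return "".join(chars)


# A hypothetical tag covering positions 2..4 of a ten-base consensus.
print(mark_consensus("ACGTACGTAC", [(2, 5)]))
```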


2.3 Contig Selector
The gap4 Contig Selector is used to display, select and reorder contigs. It can be invoked
from the gap4 View menu, but will automatically appear when a database is opened.
In the Contig Selector all contigs are shown as colinear horizontal lines separated by short
vertical lines. The length of the horizontal lines is proportional to the length of the contigs
and their left to right order represents the current ordering of the contigs. This Contig Order
is stored in the gap database and users can change it by dragging the lines representing the
contigs in the display. The Contig Selector can also be used to select contigs for processing.
Tags (see Section 2.2.7 [Annotating and masking readings and contigs], page 121) can
also be displayed in the Contig Selector window. As the mouse is moved over a contig, it
is highlighted and the contig name (left most reading name) and length are displayed in
the status line. The number in brackets is the contig number. Unlike gap4, gap5 does not
display annotations within the Contig Selector window.

2.3.1 Selecting Contigs
Contigs can be selected by either clicking with the left mouse button on the line representing
the required contig in the contig selector window or alternatively by choosing the "List
contigs" option from the "View" menu. This option invokes a "Contig List" list box where
the contig names and numbers are listed in the same order as they appear in the contig
selector window.

Within this list box the contig names can be sorted alphabetically on contig name or
numerically on contig number. This is done by selecting the corresponding item from the
sort menu at the top of the list box. Clicking on a name within the list box is equivalent
to clicking on the corresponding contig in the contig selector. More than one contig can
be selected by dragging out a region with the left mouse button. Dragging the mouse off
the bottom of the list will scroll it to allow selection of a range larger than the displayed
section of the list. When the left button is pressed any existing selection is cleared. To
select several disjoint entries in the list press control and the left mouse button. The “Copy”
button copies the current selection to the paste buffer.
Most commands require a contig identifier (which can be the name or number of any
reading on the contig) and gap4 contains several mechanisms for obtaining this information from users. The names or numbers can be typed or cut and pasted into dialogue
boxes (note that a reading number must be preceded by a # character, e.g. "#102" means
reading number 102 but "102" means the reading with name 102).
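The naming convention can be summarised by the following Python sketch. It only illustrates the "#" rule described above; the function name and return format are assumptions, and gap4's own parsing may differ in detail.

```python
def parse_identifier(text):
    """Classify a contig identifier typed by the user.

    "#102" is reading number 102; anything else, including "102",
    is treated as a reading name."""
    text = text.strip()
    if text.startswith("#") and text[1:].isdigit():
        return ("number", int(text[1:]))
    return ("name", text)


print(parse_identifier("#102"))
print(parse_identifier("102"))
```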
Also any currently active dialogue boxes that require a contig to be selected can be
updated simply by clicking on a contig in the contig selector or clicking on an entry in
the "Contig Names" list box. For example, if the Edit contig command is selected from
the Edit menu it will bring up a dialogue requesting the identity of the contig to edit. If
the user clicks the left mouse button on a contig in the contig selector window, the contig
editor dialogue will automatically change to contain the name of the selected contig. Some
commands, such as the Contig Editor, can be selected from a popup menu that is activated
by clicking the right mouse button on the contig line in the Contig Selector or clicking the
right mouse button on the corresponding name within the "Contig List" list box. This
simultaneously defines the contig to operate on and so the command starts up without
dialogue.


2.3.2 Changing the Contig Order
The order of contigs is shown by the order of the lines representing them within the Contig
Selector. The order of contigs can be changed by moving these lines using the middle
mouse button, or Alt left mouse button. Several contigs may be moved at once by selecting
several contigs using the above method. After selection, move the contigs with the middle
mouse button, or Alt left mouse button, and position the mouse cursor where you want the
selection to be moved to. Upon release of the mouse button the contigs will be shuffled to
reflect their new order. The separator line at the point the contig was moved from increases
in height.
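The reshuffling can be modelled as a simple list operation. The Python sketch below is an illustration under stated assumptions, not gap4 code: contigs are represented by their numbers, the selected contigs keep their relative order, and they are dropped immediately before a chosen unselected contig (or at the end if none is given).

```python
def move_contigs(order, selected, before=None):
    """Return a new contig order with the selected contigs moved as a block.

    `before` must be an unselected contig; the block is placed just in
    front of it. With before=None the block is moved to the end."""
    chosen = set(selected)
    moved = [c for c in order if c in chosen]
    rest = [c for c in order if c not in chosen]
    if before is None:
        return rest + moved
    i = rest.index(before)  # raises ValueError if `before` is selected or unknown
    return rest[:i] + moved + rest[i:]


print(move_contigs([1, 2, 3, 4, 5], [4, 2], before=1))
```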
The contig order is saved automatically whenever a contig is created or removed (e.g. by auto
assemble), including operations like disassemble which temporarily create contigs. The order
can be saved manually using the Save Contig Order option, which is listed under the gap4 Edit menu (see Section 2.2.4.2).

2.3.3 The Contig Selector Menus
The File menu contains only one command: "Exit". This simply quits the contig selector
display.
The View menu gives access to the Results Manager (see Section 2.13 [Results Manager],
page 277), allows contigs to be selected using a list box containing the contig names (see
Section 2.3.1 [Selecting Contigs], page 123), allows active tags (see Section 2.20.9 [TagSelector], page 304) to be selected, and allows the list of selected contigs to be cleared.
The Results menu is updated on the fly to contain cascading menus for each of the
plots shown when the contig selector is in its 2D Contig Comparator mode (see Section 2.4
[Contig Comparator], page 126). The contents of these cascading menus are identical to
the pulldown menus available from within the Results Manager.


2.4 Contig Comparator
Gap4 commands such as Find Internal Joins (see Section 2.8.3 [Find Internal Joins],
page 227), Find Repeats (see Section 2.8.4 [Find Repeats], page 233), Check Assembly (see
Section 2.9 [Check Assembly], page 236), and Find Read Pairs (see Section 2.8.2 [Find
Read Pairs], page 222) automatically transform the Contig Selector (see Section 2.3 [Contig
Selector], page 123) to produce the Contig Comparator. To produce this transformation a
copy of the Contig Selector is added at right angles to the original window to create a two
dimensional rectangular surface on which to display the results of comparing or checking
contigs. Each of the functions plots its results as diagonal lines of different colours. If the
plotted points are close to the main diagonal they represent results from pairs of contigs
that are in the correct relative order. Lines parallel to the main diagonal represent contigs
that are in the correct relative orientation to one another. Those perpendicular to the
main diagonal show results for which one contig would need to be reversed before the pair
could be joined. The manual contig dragging procedure can be used to change the relative
positions of contigs. See Section 2.3.2 [Changing the Contig Order], page 125. As the
contigs are dragged the plotted results will be automatically moved to their corresponding
new positions. This means that if users drag the contigs to move their plotted results
close to the main diagonal they will be simultaneously putting their contigs into the correct
relative positions.
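
The geometry described above can be sketched in a few lines of Python. This is only an illustration and not gap4 code: it assumes that contigs are laid end to end along both axes in the current contig order, and the names `contig_offsets`, `plot_point` and `diagonal_distance` are hypothetical. It shows why dragging two matching contigs next to each other moves their match towards the main diagonal.

```python
from itertools import accumulate

def contig_offsets(order, lengths):
    """Start offset of each contig when laid end to end in the given order."""
    ends = list(accumulate(lengths[name] for name in order))
    return dict(zip(order, [0] + ends[:-1]))

def plot_point(match, order, lengths):
    """Map a match (contig1, pos1, contig2, pos2) to (x, y) plot coordinates."""
    offsets = contig_offsets(order, lengths)
    contig1, pos1, contig2, pos2 = match
    return offsets[contig1] + pos1, offsets[contig2] + pos2

def diagonal_distance(match, order, lengths):
    """Distance of the plotted match from the main diagonal (x == y)."""
    x, y = plot_point(match, order, lengths)
    return abs(x - y)

# Hypothetical data: the end of contig A matches the start of contig C.
lengths = {"A": 1000, "B": 500, "C": 2000}
match = ("A", 950, "C", 40)
print(diagonal_distance(match, ["A", "B", "C"], lengths))  # 590, B lies between A and C
print(diagonal_distance(match, ["A", "C", "B"], lengths))  # 90, A and C are now adjacent
```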

Because this plot can simultaneously show the results of independent types of search,
users can see if different analyses produce corroborating evidence for the ordering of contigs.
Also if, for example, a result from Check Assembly lies on the same horizontal or vertical
projection as a result from Find Repeats, users can see the alternative position to place the
doubtful reading. That is, this is an indication that a reading may have been assembled in an
incorrect position.

A typical display from the Contig Comparator is shown below. It includes results for
Find Internal Joins in black, Find Repeats in red, Check Assembly in green, and Find
Read Pairs in blue. Notice that there are several Find Internal Joins, Find Read Pairs and
Find Repeats results close to the main diagonal near the top left of the display, indicating
that the contigs represented in that area are likely to be in the correct relative positions
to one another. In the middle of the bottom right quadrant there is a blue diagonal line
perpendicular to the main diagonal which indicates a pair of contigs that are in the wrong
relative orientation. The crosshairs show the positions for a pair of contigs. The vertical
line continues into the Contig Selector part of the display, and the position represented by
the horizontal line is also duplicated there.

2.4.1 Examining Results and Using Them to Select Commands
Moving the cursor over plotted results highlights them, and the information line gives a
brief description of the currently highlighted match. This is in the form:
match name: contig1 number@position in contig1, with contig2 number@position in contig2,
length of the match
For Find Internal Joins the percentage mismatch is also displayed.

Hide

Removes the match from the Contig Comparator. The match can be revealed
again by using "Reveal all" within the Results Manager.

Invoke contig editors
Invoke join editors
Invoke template display
When invoked these options bring up their respective displays to show the
match in greater detail.
Remove

Removes the match from the Contig Comparator. The match cannot be revealed again by using "Reveal all" within the Results Manager.

One of the items in the popup menu may have an asterisk next to it. This is the default
operation which can also be performed by double clicking the left mouse button on the
match. For Repeat or Find Internal Joins matches this will normally be the Join Editor,
or two Contig Editors when the match is between two points in the same contig. For Read
Pairs two Template Displays are shown.
The crosshairs can be toggled on and off and a diagonal line going from top left to bottom
right of the plot can also be displayed if required. This is useful as a guide for moving the
contigs such that their matches lie upon the diagonal line.
The "Results" menu on the contig selector window provides a similar mechanism of
accessing results, but at the level of all matches in a particular search. This is simply a
menu driven interface to the Results Manager window (see Section 2.13 [Results Manager],
page 277), but containing only the results relevant to the contig comparator window.

2.4.2 Automatic Match Navigation
The "Next" button of the contig comparator window automatically invokes the default
operation on the next match from the current active result. This provides a mechanism to
step through each match in turn ensuring that no matches have been missed.
With a single result (set of matches) plotted, the "Next" button simply steps through
each match in turn until all have been seen. Moving the mouse above the "Next" button,
without pressing it, highlights the next match and displays brief information about it in
the status line at the bottom of the window. To step through the matches in "best first"
order, select the "Sort Matches" option from the relevant name in the Results menu. The
exact order is dependent on the result in question, but is generally arranged to be the most
interesting ones first. For example, Find Internal Joins shows the lowest mismatch first
whilst Check Assembly shows the highest mismatches first.
Bringing up another result now directs "Next" to step through each of the new matches.
To change the result that "Next" operates on, use the Results menu to select the "Use for
’Next’" option in the desired result. Alternatively, double clicking on a match also causes
"Next" to process the list starting from the selected result.
The "Next" scheme remembers any matches that have been previously examined either by itself or by manually double clicking, and will skip these. To clear this ’visited’
information select "Reset ’Next’" in the Results Manager.
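
The stepping behaviour described in this section can be pictured with the following Python sketch. It is not the gap4 implementation: the match records, the `next_match` function and the use of a set for the ’visited’ information are assumptions made for illustration only.

```python
def next_match(matches, visited, key=None, reverse=False):
    """Return the next unvisited match, optionally in sorted order.

    matches: list of dicts with at least an "id" entry (hypothetical layout).
    visited: set of ids already examined, by "Next" or by double clicking.
    """
    ordered = sorted(matches, key=key, reverse=reverse) if key else list(matches)
    for match in ordered:
        if match["id"] not in visited:
            visited.add(match["id"])
            return match
    return None  # every match has been seen

# Find Internal Joins style ordering: lowest mismatch first.
joins = [
    {"id": 1, "mismatch": 4.2},
    {"id": 2, "mismatch": 0.8},
    {"id": 3, "mismatch": 2.5},
]
visited = set()
first = next_match(joins, visited, key=lambda m: m["mismatch"])   # id 2
visited.add(3)                      # user double clicks match 3 by hand
second = next_match(joins, visited, key=lambda m: m["mismatch"])  # id 1, match 3 is skipped
visited.clear()                     # equivalent of "Reset 'Next'"
```

For a Check Assembly style result the same function could be called with `reverse=True` so that the highest mismatches come first.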

2.5 Contig Overviews
Gap4 provides views of the data for an assembly project at 3 levels of resolution: the whole
project can be seen from the Contig Selector (see Section 2.3 [Contig Selector], page 123),
the most detail from the Contig Editor (see Section 2.6 [Editing in gap4], page 160), and
the Contig Overview Displays, described in this section, provide an intermediate level of
information and data manipulation. They are available from the main gap4 View menu.

These middle level resolution displays provide graphical overviews of individual contigs
or sets of contigs. The possible information shown includes readings, templates, tags, restriction enzyme sites, stop codons, plots of the consensus quality, read coverage, read-pair
coverage, strand coverage and consensus confidence. The displays of readings, templates,
tags, restriction enzyme sites and plots of the consensus quality can be shown in a single
window called the Template Display (see Section 2.5.1 [Template Display], page 130). The
plots of reading coverage, read-pair coverage, strand coverage and consensus confidence can
be shown in a single display called the Consistency Display (see Section 2.5.2 [Consistency
Display], page 140), or as separate plots. The Stop Codon Plot (see Section 2.5.5 [Plotting
Stop Codons], page 156) and a more informative version of the Restriction Enzyme Plot (see
Section 2.5.6 [Plotting Restriction Enzymes], page 157) can be shown in separate windows.

2.5.1 Template Display
The Template Display can show schematic plots of readings, templates, tags, restriction
enzyme sites and the consensus quality. It can be used to reorder contigs, create tags and
invoke the Contig Editor. It is invoked from the main gap4 View menu.

An example showing all these information types can be seen in the Figure below.

The large top section contains lines and arrows representing readings and templates.
Beneath this are rulers; one for each contig, and below those is the quality plot. The
template and reading section of the display is in two parts. The top part contains the
templates which have been sequenced from both ends but which are in some way inconsistent
- for example, given the current relative positions of their readings, they may have a length
that is smaller or larger than that expected, or the two readings may, as it were, face
away from one another. Colour coding is used to distinguish between different types of
inconsistency, and whether or not the inconsistency involves readings within or between
contigs. For example, most of the problems shown in the screendump above are coloured
dark yellow, indicating an inconsistency between a pair of contigs. The rest of the data,
(mostly dark blue indicating templates sequenced from only one end), is plotted below the
data for the inconsistent templates. Forward readings are blue and reverse readings are
orange. Templates in bright yellow have been sequenced from both ends, are consistent
and span a pair of contigs (and so indicate the relative orientation and separation of the
contigs).
The coloured blocks immediately above and below the ruler are tags. Those above the
ruler can also be seen on their corresponding readings in the large top section. Zooming is
available. The position of a crosshair is shown in the two left most boxes in the top right
hand corner. The leftmost shows the distance in bases between the crosshair and the start
of the contig underneath the crosshair. The middle box shows the distance between the
crosshair and the start of the first contig. The right box shows the distance between two
selected cut sites in the restriction enzyme plots.

As seen in the dialogue above, users can choose to display a single contig, all contigs, or
a subset of contigs from a file of filenames ("file") or a list ("list"). If either the file or list
options are chosen, the "browse" button will be activated and can be used to call up a file
or list browser dialogue.
The items to be shown in the initial template display can be selected from the list of
checkboxes. The default is to display all templates and readings. However, it is possible to
display only templates with more than one reading ("Ignore ’single’ templates") or templates
with both forward and reverse readings ("Show only read pairs"). These latter two options
may be beneficial if the database is very large.
In the section below we give details about the individual components of the overall
Template Display.

2.5.1.1 Reading and Template Plot
The Reading and Template Plot shows templates and readings. The following sections
describe the display, its options, and the operations which it can be used to perform. It is
invoked from the main gap4 View menu.

2.5.1.2 Reading and Template Plot Display
The Reading and Template Plot shows templates and readings. Colour is used to provide
additional information. The reading colour is used to convey the primer information. The
default colours are:

red           primer unknown
green         forwards primer
orange        reverse primer
dark cyan     custom forward primer
orange-red    custom reverse primer
Colour is used to distinguish the number and the location of the readings derived from
each template. Templates with readings derived from only one end are drawn in blue. Those
with readings from both ends are pink when both ends are contained within the same contig.
Those with readings from both ends are green when the readings are in different contigs
and one of the contigs is not being plotted.
For each template gap4 stores an expected length, as a range between two values. From
an assembly it is often possible to work out the actual length of a template based upon the
positions within a contig of readings sequenced using the forward and reverse primers. The
forward and reverse readings on a single template (called a read pair) are considered to be
inconsistent if this observed distance is outside of the range of acceptable sizes and then
the template is drawn in black. Alternatively it may be possible that both forward and
reverse readings are assembled on the same strand (in which case both arrows will point in
the same direction). This too is a problem and hence the templates are drawn in black.
If more than one contig is displayed then the distance between adjacent contigs is determined from any read pair information. If there are spanning templates between two
adjacent contigs and the readings on that template are consistent, i.e. are in the correct
orientation, the template is coloured yellow. Templates which span non-adjacent contigs in
the display or contain inconsistent readings are coloured dark yellow.
A summary of the default template colours follows.
blue

the template contains only readings from one end

pink

the template contains both forward and reverse readings in the same contig

green

the template contains both forward and reverse readings, but they are in separate contigs, and one of the contigs is not being displayed.

black

the readings on the template are within the same contig but are in contradictory
orientations or are an unexpected distance apart

yellow

the readings on the template are within different contigs (both of which are
being displayed) and are consistent

dark yellow

the readings on the template are within different contigs (both of which are
being displayed) and are inconsistent
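
The colour rules summarised above can be expressed as a small decision function. The Python below is a sketch written from the description in this section, not code taken from gap4; the function and parameter names are hypothetical, and the `adjacent` flag reflects the statement that templates spanning non-adjacent contigs are drawn in dark yellow.

```python
def is_consistent(observed_length, expected_range, same_strand):
    """A read pair is consistent if its readings are on opposite strands and
    the observed template length lies within the expected range."""
    shortest, longest = expected_range
    return (not same_strand) and shortest <= observed_length <= longest

def template_colour(has_both_ends, same_contig, both_contigs_shown,
                    consistent, adjacent=True):
    """Return the default template colour for one template."""
    if not has_both_ends:
        return "blue"                      # readings from one end only
    if same_contig:
        return "pink" if consistent else "black"
    if not both_contigs_shown:
        return "green"                     # other contig is not plotted
    return "yellow" if (consistent and adjacent) else "dark yellow"

# A read pair within one contig whose observed length is outside the range.
ok = is_consistent(observed_length=5200, expected_range=(1000, 2000),
                   same_strand=False)
print(template_colour(True, True, True, ok))  # black
```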
If more than one contig is displayed, the contigs are positioned in the same left to right
order as the input contig list, (which need not necessarily be in the same order as the
contig selector). Overlapping contigs are drawn as staggered lines. If the user selects the
"Calculate contig positions" option from the menu the horizontal distance between adjacent
contigs is determined from any available read pair information. Otherwise, or in the absence
of any read pair information, the second contig is positioned immediately following the first
contig, but will be drawn staggered in the vertical direction. If the readings on a template
spanning two contigs are consistent, the distance between the contigs is determined using the
template’s mean length. If there are several templates spanning a pair of contigs an average
distance is calculated and used as the final offset between the contigs. Templates which
span non-adjacent contigs or contain inconsistent readings are not used in the calculation
of the contig offsets. It is possible that data in the database is inconsistent to such an
extent that, although spanning templates have consistent readings, the averaging can lead
to a display which shows the templates to have inconsistent readings, e.g. the readings are
pointing in opposite directions.

A summary of the templates and readings used to calculate the distance between two
contigs is displayed in the output window. An example is given below:

============================================================
Wed 02 Apr 10:35:51 1997: template display
------------------------------------------------------------
Contig zf98g12.r1(651) and Contig zf23d2.s1(348)
Template      zf22h7( 376) length 1893
Reading    zf22h7.r1(  +10R), pos   6257 +208, contig  651
Reading    zf22h7.s1( -376F), pos    145 +331, contig  348
Template      zf49f5( 536) length 1510
Reading    zf49f5.r1( +255R), pos   6562 +239, contig  651
Reading    zf49f5.s1( -536F), pos    227 +135, contig  348
Gap between contigs = -11
Offset of contig 348 from the beginning = 7674

The contig names and numbers are given in the top line. Below this, the spanning
template name, number and length are displayed. Below this are the reading name, whether the
reading has been complemented (+: original, -: complemented), number, primer information,
starting position, length and contig number. This is of a similar format to that displayed by
the read pairs output. See Section 2.8.2.2 [Find Read Pairs], page 224. The average gap
between the contigs is given and finally the distance in bases between the start of the second
contig and the start of the left most contig in the display.
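
If this summary needs to be processed outside gap4, the Reading lines can be picked apart with a regular expression. The Python sketch below assumes the field layout described above (name, complement sign, number, primer letter, position, length, contig number); the pattern and the `parse_readings` name are illustrative and have not been checked against every form of output gap4 may produce.

```python
import re

# Matches e.g. "Reading zf22h7.s1( -376F), pos 145 +331, contig 348".
# Whitespace (including line breaks) between fields is tolerated.
READING_RE = re.compile(
    r"Reading\s+(?P<name>[^\s(]+)\(\s*(?P<sense>[+-])(?P<number>\d+)"
    r"(?P<primer>[A-Za-z])\),\s*pos\s+(?P<position>\d+)\s+\+(?P<length>\d+),"
    r"\s*contig\s+(?P<contig>\d+)"
)

def parse_readings(text):
    """Return one dict per Reading entry found in the output text."""
    readings = []
    for m in READING_RE.finditer(text):
        readings.append({
            "name": m["name"],
            "complemented": m["sense"] == "-",
            "number": int(m["number"]),
            "primer": m["primer"],
            "position": int(m["position"]),
            "length": int(m["length"]),
            "contig": int(m["contig"]),
        })
    return readings

sample = (
    "Reading zf22h7.r1( +10R), pos 6257 +208, contig 651\n"
    "Reading zf22h7.s1( -376F), pos 145 +331, contig 348\n"
)
for reading in parse_readings(sample):
    print(reading["name"], reading["complemented"], reading["position"])
```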

2.5.1.3 Reading and Template Plot Options

Within the figure shown above the contents of the View menu are visible. The "Templates", "Readings", "Quality Plot" and "Restriction Enzyme Plot" commands control
which attributes are displayed. The graphics are always scaled to fit the information within
the window size, subject to the current zoom level. This means that turning off templates,
but leaving readings displayed, will improve visibility of the reading information.
The "Ruler ticks" checkbox determines whether to draw numerical ticks on the contigs.
The number of ticks is defined in the .gaprc (see Section 2.20.1 [Options Menu], page 298)
file as NUM_TICKS although the actual number of ticks per contig that will be displayed
also depends on the space available on the screen.
The "ignore ’single’ templates" toggle controls whether to display all templates or only
those containing more than one reading. The "show only read pairs" toggle controls whether
all templates or only those containing both forward and reverse readings are displayed.
Hence when set the templates displayed are those with a known (observed) length. The
"Show only spanning read pairs" toggle controls whether to display all templates or only
those containing forward and reverse readings which are in different contigs.
The plot can be enlarged or reduced using the standard zooming mechanism. See
Section 10.5.1 [Zooming], page 528.
The crosshair toggle button controls whether the cursor is visible. This is shown as a
black vertical line. The position of the crosshair is displayed in the two boxes to the right
of the crosshair toggle. The first box indicates the cursor position in the current contig.
The second box indicates the overall position of the cursor in the consensus. The third
box is used to show the distance between restriction enzyme cut sites. See Section 2.5.1.6
[Restriction Enzyme Plot], page 139.
Tags that are on the consensus can only be seen on the ruler. These are marked beneath
the ruler line. Tags on readings can be seen both on the ruler (above the line) and on
their appropriate readings within the template window. To configure the tag types that
are shown use the "select Tags" command in the View menu. This brings up the usual tag
selection dialog box. See Section 2.20.9 [Tag Selector], page 304.

2.5.1.4 Reading and Template Plot Operations
The contig editor can be invoked by double clicking the middle mouse button, or Alt and the left
mouse button, in any of the displays, i.e. template, ruler, quality or restriction enzyme plots.
The editor will start up with the editing cursor on the base that corresponds to the position
clicked on in the Template Display. If more than one contig is currently being displayed
the editor decides which contig to show using the following rules. If the user clicks on the
Quality Plot, the contig lines or the Restriction Enzyme Display, the corresponding contigs
will be shown. If the user clicks on a gap between these displays the nearest contig will be
selected. If the user clicks on the template or reading lines, the editor will show the contig
whose left end is to the left of and closest to the cursor.
The long blue vertical line seen in the previous figure is the position of the editing cursor
within a Contig Editor. Each editor will produce its own cursor and each will be visible.
Moving the editing cursor within a contig editor automatically moves its cursor within the
Template Display. Similarly, clicking and dragging the editor cursor with the middle mouse
button, or Alt left mouse button, within the Template Display scrolls the associated Contig
Editor.
The order of the contigs can be changed within the Template Display by clicking with
the middle mouse button, or Alt left mouse button, on a contig line and dragging the line to
the new position. The Template Display will update automatically once the mouse button
is released. The change of a dark yellow template to bright yellow is indicative that the two
contigs are now in consistent positions and orientations. The order of the contigs in the
gap4 database, as displayed in the contig selector, can be updated by selecting the "Update
contig order" command in the Edit menu.
By clicking on any of the contig lines in the ruler a popup menu is invoked. From this,
information on the contig can be obtained, the contig editor can be started, the contig
can be complemented, and the templates within the contig can be highlighted (shown by
changing their line width).
A list named readings always exists. It contains the list of readings that are highlighted
in all the currently shown template displays. See Section 2.14 [Lists], page 278. The
highlighting mechanism used is to draw the readings as thicker, bolder, lines. The "clear
Active Readings" command from the View menu clears this list. The "highlight reading
list" command loads a new set of readings to use for the "readings" list and then highlights
these.
To interactively add and remove readings from the active list use the left mouse button.
Clicking on an individual reading will toggle its state from active to non active and back
again. Pressing and holding the left mouse button, and moving the mouse, will drag out a
bounding box. When the button is released all readings that are contained entirely within
the bounding box will be toggled.
Activating a reading (using any of the above methods) when an editor is running, will
also highlight the reading within the editor. Similarly, highlighting the reading in the editor
activates it within the template display and adds it to the active reading list.

2.5.1.5 Quality Plot
This option can be invoked from the main gap4 View menu, in which case it appears as a
single plot, or from the View menu of the Template Display, in which case it will appear as
part of the Template Display.
This display provides an overview of the quality of the consensus. The Contig Editor
can be used to examine the problems revealed. A typical plot is displayed below.

For each base in the consensus a quality is computed based on the accuracy of the data
on each strand. As can be seen in the Figure above, this information is then plotted using
colour and height to distinguish between the different quality assignments. The colour and
height codes are explained below.

Colour    Height     Meaning
grey       0 to 0    OK on both strands, both agree
blue       0 to 1    OK on plus strand only
green     -1 to 0    OK on minus strand only
red       -1 to 1    Bad on both strands
black     -2 to 2    OK on both strands but they disagree

For example, in the figure we see that the first four hundred or so bases are mostly only
well determined on the forward strand.
Note that when a large number of bases are being displayed the limited screen resolution
causes the quality codes for adjacent bases to be drawn as single pixels. However the use of
varying heights ensures that all problematic bases will be visible. Hence when the quality
plot consists of a single grey line all known quality problems have been resolved, at the
current consensus and quality cutoffs.
To check problems the contig editor can be invoked by double clicking on the middle
mouse button, or Alt left mouse button. It will appear centred on the base corresponding
to the position on which the mouse was clicked.
The quality plot appears as "Calculate quality" in the Results Manager window (see
Section 2.13 [Results Manager], page 277).
Within the Results Manager commands available, using the right mouse button, include
"Information", which lists a summary of the distribution of quality types to the output
window, and "List" which lists the actual quality values for each base to the output window.
These quality values are written in a textual form of single letters per base and are listed
below.
Code    +Strand    -Strand
a       Good       Good (in agreement)
b       Good       Bad
c       Bad        Good
d       Good       None
e       None       Good
f       Bad        Bad
g       Bad        None
h       None       Bad
i       Good       Good (disagree)
j       None       None
An example of the output using "Information" and "List" follows.
============================================================
Wed 02 Apr 12:14:06 1997: quality summary
------------------------------------------------------------
Contig xb56b6.s1 (#11)
 81.00 OK on both strands and they agree(a)
  3.94 OK on plus strand only(b,d)
 11.98 OK on minus strand only(c,e)
  1.85 Bad on both strands(f,g,h,j)
  1.22 OK on both strands but they disagree(i)
============================================================
Wed 02 Apr 12:14:09 1997: quality listing
------------------------------------------------------------
Contig xb56b6.s1 (#11)
        10         20         30         40         50         60
eeeeeeeeee eeeeeeeeee eeeeeeeeee eeeeeeehee eeeeeeeeee eeeeeeeeee
        70         80         90        100        110        120
eeeeeeeeee eeeeeeeeee eeeeeeeeee eeeeeeeeee eeeeeeeeee eeeeeeeeee
       130        140        150        160        170        180
eeeeeeeeee eeeeeeeeee eeeeeeeeee eeeeeeeeee eeeeeeeeee eeeeeeeeee
       190        200        210        220        230        240
eeeeeeeeee eeeeeeeeee heeeeeeeee eeeeeeeici iiaiaciiia aaaaaaaaac
       250        260        270        280        290        300
aaaacaaaaa aaaaaaaiia aaaaaaaaaa aaaaaaaaaa aaaabaaaaa aaaaaaaaaa
       310        320        330        340        350        360
aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa faaaaaaaaa
[ output removed for brevity ]
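
The relationship between the "List" and "Information" outputs can be illustrated by recomputing the summary percentages from the per-base letters. The Python below is a sketch only: it groups the letters as shown in the summary output above, namely (a), (b,d), (c,e), (f,g,h,j) and (i), and the `quality_summary` name and the rows-of-letters input layout are assumptions.

```python
from collections import Counter

# Letter groupings as printed in the "quality summary" output.
QUALITY_CLASSES = [
    ("OK on both strands and they agree", "a"),
    ("OK on plus strand only", "bd"),
    ("OK on minus strand only", "ce"),
    ("Bad on both strands", "fghj"),
    ("OK on both strands but they disagree", "i"),
]

def quality_summary(code_rows):
    """Percentage of bases in each class, given rows of quality letters.

    Only the letters a to j are counted, so the spaces between blocks of
    ten are ignored. Pass only the rows of letters, not the header lines.
    """
    counts = Counter(ch for row in code_rows for ch in row if ch in "abcdefghij")
    total = sum(counts.values())
    summary = {}
    for label, letters in QUALITY_CLASSES:
        n = sum(counts[ch] for ch in letters)
        summary[label] = round(100.0 * n / total, 2) if total else 0.0
    return summary

rows = ["eeeeeeeeee eeeeeeehee", "aaaaaaaaaa aaaabaaaai"]
for label, percent in quality_summary(rows).items():
    print(f"{percent:6.2f} {label}")
```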

2.5.1.6 Restriction Enzyme Plot
The restriction enzyme plot within the template display is a reduced version of the main
Restriction Enzyme Map function. The dialogue used for choosing the restriction enzymes
is identical and is described with the main function. See Section 2.5.6 [Plotting Restriction
Enzymes], page 157. It is invoked from the Template Display View menu. An example plot
from the template display can be seen below.

Here we see the searches for two restriction enzymes. Each vertical line is drawn at the
cut position of the matched restriction site. Unlike the main restriction enzyme plot, here
all matches are plotted on a single horizontal plot. Initially all sites are drawn in black.
To distinguish one site from another either touch the site with the mouse cursor and read
the template display information line, or place the mouse cursor above a site and press the
right mouse button. This pops up a menu containing "Information" and "Configure". The
"Configure" option can be used to change the colour of all matches found for this enzyme.
In the figure above we have changed the initial colours for both of the restriction enzymes
searched for. The "Information" command displays information for all sites found in the
text output window.

As with the main Restriction Enzyme Map function, clicking the left mouse button on
two restriction sites in turn displays the distance between the chosen sites in the information
line. This figure is also displayed in the box at the top right hand corner of the template
display.

2.5.2 Consistency Display
The Consistency Display provides plots designed to highlight potential problems in contigs.
It is invoked from the main gap4 View menu by selecting any of its plots. Once a plot has
been displayed, any of the other types of consistency plot can be displayed within the same
frame from the View menu of the Consistency Display.

An example showing the Confidence Values Graph and the corresponding Reading Coverage Histogram, Read-Pair Coverage Histogram and Strand Coverage is shown below.

One or more contigs can be displayed and are drawn in the same order as the input
contig list (which need not necessarily be in the same order as the contig selector). If more
than one contig is displayed, the contigs are drawn immediately after one another but are
staggered in the y direction.

The ruler ticks can be turned on or off from the View menu of the consistency display.
The plots can be enlarged or reduced using the standard zooming mechanism. See
Section 10.5.1 [Zooming], page 528.
The crosshair toggle button controls whether the crosshair is visible. This is shown as a
black vertical and horizontal line. The position of the crosshair is shown in the 3 boxes to
the right of the crosshair toggle. The first box indicates the cursor position in the current
contig. The second box indicates the overall position of the cursor in the consensus. The
last box shows the y position of the crosshair.

2.5.2.1 Confidence Values Graph
This option can be invoked from the main gap4 View menu, in which case it appears as a
single plot, or from the View menu of the Consistency Display, in which case it appears as part
of the Consistency Display.
The confidence values are determined from the current consensus algorithm (see
Section 2.11.5 [The Consensus Algorithms], page 257).

Please note that this plot can be very slow for long contigs. This is caused by the large
number of points (not the calculation) and we hope to speed it up in a future release.

2.5.2.2 Reading Coverage Histogram
This option can be invoked from the main gap4 View menu, in which case it appears as a
single plot, or from the View menu of the Consistency Display in which case it will appear
as part of the Consistency Display.

The number of readings which cover each base position along the contig is plotted as
a histogram.

As can be seen in the dialogue below, the user can select the contig(s) to display, and
whether to plot: Forward strand only, Reverse strand only, Both strands or the Summation
of both strands. In the example shown above both strands have been plotted: forward in
red and reverse in black.
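
As an illustration of what the histogram represents, per-base coverage for each strand can be computed from reading start positions and lengths. The following Python sketch is not gap4 code; it assumes 1-based positions and readings supplied as hypothetical (start, length, strand) tuples, and returns the forward, reverse and summed coverage corresponding to the plotting options in the dialogue.

```python
def reading_coverage(contig_length, readings):
    """Per-base reading depth for each strand.

    readings: iterable of (start, length, strand) with 1-based start and
    strand "+" (forward) or "-" (reverse). Returns a dict of three lists,
    keyed "+", "-" and "sum", each with one entry per base of the contig.
    """
    # Difference arrays: +1 where a reading starts, -1 just after it ends.
    diff = {"+": [0] * (contig_length + 2), "-": [0] * (contig_length + 2)}
    for start, length, strand in readings:
        first = max(start, 1)
        last = min(start + length - 1, contig_length)
        if last < first:
            continue  # reading lies entirely outside the contig
        diff[strand][first] += 1
        diff[strand][last + 1] -= 1

    coverage = {}
    for strand, changes in diff.items():
        depth, out = 0, []
        for pos in range(1, contig_length + 1):
            depth += changes[pos]
            out.append(depth)
        coverage[strand] = out
    coverage["sum"] = [f + r for f, r in zip(coverage["+"], coverage["-"])]
    return coverage

cov = reading_coverage(10, [(1, 5, "+"), (3, 6, "-"), (4, 4, "+")])
print(cov["+"])    # [1, 1, 1, 2, 2, 1, 1, 0, 0, 0]
print(cov["-"])    # [0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
print(cov["sum"])  # [1, 1, 2, 3, 3, 2, 2, 1, 0, 0]
```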

2.5.2.3 Read-Pair Coverage Histogram
This option can be invoked from the main gap4 View menu, in which case it appears as a
single plot, or from the View menu of the Consistency Display in which case it will appear
as part of the Consistency Display.

The number of read-pairs which cover each base position along the contig is plotted as
a histogram.

2.5.2.4 Strand Coverage
This option can be invoked from the main gap4 View menu, in which case it appears as a
single plot, or from the View menu of the Consistency Display in which case it will appear
as part of the Consistency Display.
The display is used to show which regions of the data are covered by readings from each
of the two strands of the DNA. A separate line is drawn for each strand: forward in red and
reverse in black. The function works in two complementary modes: it can plot the positions
which are covered, or the positions which are not. The latter is probably the most useful
as it directs users to the places requiring further data.
The figure below shows the covered positions, and the figure below that shows the
uncovered positions for the same contig.

The plot can be regarded as a coarse version of the Quality Plot (see Section 2.5.1.5
[Quality Plot], page 137), in that it shows the strand coverage using the Quality Calculation
(see Section 2.11.5.4 [The Quality Calculation], page 261), but does not reveal problems with
individual base positions.

The dialogue allows the user to select the contig(s) and strands to analyse and whether to
plot Coverage or Problems.
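
The "Problems" mode can be thought of as listing the stretches of a strand that have no usable data. The Python sketch below is a simplification: it takes a plain per-base depth list for one strand as input, whereas gap4 derives strand coverage from the Quality Calculation. The `uncovered_regions` name is hypothetical.

```python
def uncovered_regions(depth):
    """Return (start, end) pairs, 1-based and inclusive, where depth is zero.

    depth: list with one entry per consensus base for a single strand.
    """
    regions = []
    start = None
    for pos, value in enumerate(depth, start=1):
        if value == 0 and start is None:
            start = pos                      # a gap in coverage begins
        elif value > 0 and start is not None:
            regions.append((start, pos - 1)) # the gap ended at the previous base
            start = None
    if start is not None:
        regions.append((start, len(depth)))  # gap runs to the end of the contig
    return regions

forward_depth = [0, 0, 1, 1, 0, 1, 0, 0]
print(uncovered_regions(forward_depth))  # [(1, 2), (5, 5), (7, 8)]
```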

2.5.2.5 2nd-Highest Confidence
The traditional way to compute the consensus confidence values is to take into account
both the matching and mismatching bases within each individual column. If instead we
work on the hypothesis that a contig may have more than one sequence present then we
can instead compute five consensus confidence values at every point (four bases plus pad)
by only totalling up the bases that agree and ignoring those that mismatch.

In the case of zero conflicts the highest confidence value will be the same as the standard
consensus confidence. When a conflict occurs, the second highest confidence value can be
used as a measure of how strong the conflict could be. It is this value that is plotted.
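
The idea can be sketched for a single alignment column as follows. This Python example is a deliberate simplification and not the calculation gap4 performs: it assumes that the evidence for each of the five symbols (A, C, G, T and pad, written here as "*") is obtained by simply summing the confidence values of the readings that call that symbol, whereas gap4 uses its configured consensus algorithm. The function name is hypothetical.

```python
def second_highest_confidence(column):
    """Second highest per-symbol total for one alignment column.

    column: list of (symbol, confidence) pairs, one per reading covering
    the column, with symbol one of "A", "C", "G", "T" or "*" (pad).
    ASSUMPTION: totals are plain sums of confidences for agreeing calls.
    """
    totals = dict.fromkeys("ACGT*", 0)
    for symbol, confidence in column:
        if symbol in totals:
            totals[symbol] += confidence     # only agreeing calls add up
    ranked = sorted(totals.values(), reverse=True)
    return ranked[1]

print(second_highest_confidence([("A", 30), ("A", 25), ("A", 40)]))  # 0, no conflict
print(second_highest_confidence([("A", 30), ("A", 25), ("G", 20)]))  # 20, weak conflict
```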

2.5.2.6 Diploid Graph
At present this is a rather specialist function written for a particular in-house purpose. This
plot relates very closely to the 2nd-Highest Confidence plot (see Section 2.5.2.5 [2nd-Highest
Confidence], page 145), but it also takes into account depth information.

Specifically, an assumption is made that a contig may consist of two alleles with an approximately 50/50 ratio. Any discrepancies visible by looking at the second highest confidence
value should therefore also be backed up by a 50/50 split in sequence depth.

2.5.3 SNP Candidates
The 2nd-Highest Confidence (see Section 2.5.2.5 [2nd-Highest Confidence], page 145) and
the Diploid Graph (see Section 2.5.2.6 [Diploid Graph], page 147) both plot indicators of
how likely an alignment column is to be made up of 2 or more sequence populations.
By studying these in further detail we should be able to spot correlated differences and
to start assigning haplotypes. The SNP Candidate plot initially brings up a dialogue asking
for a single contig and range. After selecting this a window is displayed showing the likely
locations of SNPs as seen below.

The top row of this has controls to define how the 2nd-Highest Confidence or Diploid
Graph results are analysed in order to pick candidate locations for SNPs.
Going from right to left, the “2 alleles only” toggle switches between the two algorithms;
when enabled it uses the additional assumption coded into the Diploid Graph of there being
only two populations in an approximately 50:50 ratio. Next the minimum base quality may
be adjusted. Any difference with a poorer quality than this is completely ignored. The
minimum discrepancy score is a threshold (with high indicating a strong SNP) applied to
the results of the consistency plot. A spike in this plot needs to be at least as
high as this score to be accepted. This score is then adjusted for immediate proximity to
other SNPs (e.g. when it forms part of a run of bases) and this adjusted score is compared against the
minimum SNP score parameter. Typically this can be left low. If any of these parameters
are modified press the “Recalculate candidate SNPs” button to recompute.
The large central panel contains a vertically scrolled representation of the candidate
SNPs found. By default the left-most plot contains a pictorial view of the sequence depth.
Next to this is a vertical ruler showing the relative positions of candidate SNPs. Both of
these two plots are to scale based on the sequence itself. To the right of these come a series
of text based items with one row per candidate SNP. Initially this consists only of a check
button (“Use”), Position, Score and the frequency of base types observed at that consensus
column. Double clicking on any row will bring up the contig editor at that position showing
the potential SNP. You may manually curate which ones you consider to be true
by enabling or disabling the “Use” checkbox on that row. The score may also be
manually adjusted allowing certain differences to be forced apart by using a very high score.
The second row from the top contains a row of options controlling how the correlation
between candidate SNPs is used to assign haplotypes. For every template in the contig
the algorithm produces a fake sequence consisting only of the bases considered to be
a candidate SNP and enabled by having the “Use” checkbox set. These fake sequences
are then clustered to form groups. No re-alignment is performed as the existing multiple
alignment has already been made (although you may wish to run the Shuffle Pads algorithm
beforehand if the existing sequence alignment is poor).
This is a fairly standard clustering algorithm that starts with each sequence being the
sole member of a set. All sets are compared with each other based on the correlation between
sets using an adjusted correlation score (achieved by subtracting “Correlation offset”) and
then the overlaps are ranked by score. The best scoring sets are then merged together. If
Fast Mode is not being used the merged set is then compared against everything else once
more to obtain new scores, otherwise a simple adjustment is guessed at. Skipping this step
speeds up the algorithm considerably and generally gives sufficient results; hence the Fast
Mode toggle. This process is repeated until no two sets have an overlap score of greater
than or equal to the “Minimum merge score”.
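
The clustering loop described above can be sketched as follows. This is only an illustration under stated assumptions: the raw correlation between two templates is taken to be +1 for every shared candidate SNP position that agrees and -1 for every one that differs, and the score between two sets is the mean of these pairwise scores minus the correlation offset. Gap4's real scoring is not documented here. Every pair of sets is rescored after each merge, which corresponds to Fast Mode being switched off. All names are hypothetical.

```python
from itertools import combinations


def pair_score(a, b):
    """+1 per shared SNP position that agrees, -1 per one that differs."""
    shared = a.keys() & b.keys()
    return sum(1 if a[pos] == b[pos] else -1 for pos in shared)


def set_score(s1, s2, templates, offset):
    """Mean pairwise score between two sets, minus the correlation offset."""
    total = sum(pair_score(templates[x], templates[y]) for x in s1 for y in s2)
    return total / (len(s1) * len(s2)) - offset


def cluster(templates, offset=0.0, min_merge=1.0):
    """Greedy agglomerative clustering of templates by candidate SNP bases.

    `templates` maps a template name to its "fake sequence": a dict of
    {SNP position: base} restricted to the enabled candidate SNPs.
    """
    sets = [frozenset([name]) for name in templates]
    while len(sets) > 1:
        # Rescore every pair of sets and pick the best scoring pair.
        best = max(
            combinations(sets, 2),
            key=lambda pair: set_score(pair[0], pair[1], templates, offset),
        )
        # Stop once no pair reaches the minimum merge score.
        if set_score(best[0], best[1], templates, offset) < min_merge:
            break
        sets = [s for s in sets if s not in best] + [best[0] | best[1]]
    # Largest set first, matching the left-to-right order of the display.
    return sorted(sets, key=len, reverse=True)


templates = {
    "t1": {1503: "A", 2334: "C"},
    "t2": {1503: "A", 2334: "C"},
    "t3": {1503: "G", 2334: "T"},
    "t4": {1503: "G"},
    "t5": {1503: "A"},
}
result = cluster(templates)
```

In this toy input the templates separate into one set carrying A/C and one carrying G/T at the two candidate positions.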
The Filter Templates button brings up a new dialogue box containing an editable list
(initially blank) of template names. Adding a template name here will force this template
to be ignored by the clustering algorithm. You may also enter reading names here too
and they will be automatically converted to template names, hence filtering out all other
readings from the same template. If you suspect specific templates of being chimeric
then this is where they should be listed.
The Cluster by SNPs button starts the clustering process running. It cannot be interrupted and may take a few minutes. After completion the “Sets” component (rightmost) of
the central plot is updated as seen in the below screenshot. Each set is a group of templates
clustered together based on the candidate SNPs. They are sorted in left to right order
such that the left-most set contains the largest number of templates and the right-most set
contains the fewest. The consensus for members of that set is displayed in each square and
the quality of the consensus is shown in a similar fashion to the contig editor, with white
being good quality and dark grey being poor (usually due to low coverage within that
set).
The background to the entire row is also shaded to indicate the observed quality of that
SNP in the context of this clustering. A white background indicates that two or more sets
exist with high quality consensus bases (>= quality 90) that differ. A light grey background
is used where the consensus bases differ but not with high quality bases. A dark grey
background is used to indicate that the consensus in all sets covering that SNP candidate
agree. This typically happens when either the clustering has failed or when a candidate
SNP is not a real indicator of which haplotype a sequence belongs to, such as a base calling
error or a random fluctuation in homopolymer length. If you wish to force this SNP to be
used for clustering then try increasing its score and re-clustering.
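
The three background shades can be expressed as a small decision function. This is a sketch of the rules as stated in the text, not the code used by Gap4; the function name and input layout are hypothetical.

```python
HIGH_QUALITY = 90  # the ">= quality 90" threshold quoted in the text


def row_background(set_calls):
    """Classify one candidate SNP row.

    `set_calls` holds a (consensus_base, quality) pair for every set
    that covers this SNP position.
    """
    bases = {base for base, _ in set_calls}
    if len(bases) < 2:
        return "dark grey"   # all covering sets agree
    good = {base for base, quality in set_calls if quality >= HIGH_QUALITY}
    if len(good) >= 2:
        return "white"       # two or more high quality bases that differ
    return "light grey"      # sets differ, but not with high quality bases


print(row_background([("A", 95), ("G", 92)]))  # white
print(row_background([("A", 95), ("G", 40)]))  # light grey
print(row_background([("A", 95), ("A", 30)]))  # dark grey
```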

Hence in the above example we see two distinct good quality sets made from the SNPs
between 1503 and 2334 and two more good quality sets from 12039 onwards. This indicates
that we have no templates where one end spans SNPs in the 1503-2334 region and the other
end spans SNPs in the 12039 onwards region. We also have a series of smaller sets which
probably arise due to incorrect base calls or more rarely due to chimeras.
Now if we double click to get the contig editor up it will display an additional window
labelled “Tabs”. NOTE: this does not happen if a contig editor for this contig is already
being displayed. If so shut that one down first. Notice that the sequence names are also
coloured. This indicates the set the sequence has been assigned to. The picture below also
has the “Highlight Disagreements” mode enabled with a difference quality cutoff
set to match the one used in the SNP Candidates plot. Two clear SNP positions can be
seen.

The tabs window lists the set numbers and their size (except for “All”). Selecting a set
will show just sequences from that set. This allows for the set consensus and quality values
to be viewed. The editor also allows for sequences to be moved from one set to another,
but for now this purely serves a visual purpose and the movements are not passed back
to the main SNP candidates window (although this is an obvious change to make).
Moving back to the main SNP Candidates window note that we have a series of selection
buttons at the bottom of the window. These control automatic selection of rows (SNPs)
based on their quality assigned by observing the set consensus sequences. The clustering
algorithm only works on selected sets so this allows for poor quality SNPs to be removed from
further calculations. Additionally to simplify the view unselected SNPs may be removed
by pressing the “Remove unselected” button.
Each set has a checkbutton above it (not visible in the screenshot). Initially these
are not enabled, but they indicate which sets certain operations should be performed on.
Pressing the right mouse button over a set (or a set checkbox) brings up a menu indicating
the following operations.
Delete set
Merge selected sets
This removes either the clicked upon set or all enabled sets (those that have
their checkbox set) from the display.

Save this consensus
Save consensus for selected sets
This brings up a dialogue box allowing the consensus for a single or selected
sets to be saved in FASTA format. “Set numbers” is a space-separated list
of numbers representing the sets to save, starting with the leftmost set being
numbered as 1. Initially this is either the one you clicked on or all the selected
ones, but it may be edited in this dialogue too prior to saving. Strip pads
removes padding characters (’*’) from the consensus.
“Incorporate ungrouped templates” controls how template sequences that were
not assigned to at least one set are dealt with. It could be considered that sequences covering regions where no SNPs have been detected should be included
when computing the consensus, and this is the default action. However this can
be disabled such that only sequences that were specifically used for breaking
the assembly apart into sets form the consensus.
Produce fofn for this set
Produce fofn for selected sets
These options allow a file or list of reading names to be saved. A single fofn is
produced but multiple sets may be grouped together in one fofn. Here the set
number “0” is a placeholder for all of the sequences that were not assigned to
a set.
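
The set numbering and pad stripping used by these menu options can be sketched in a few lines. The helper names are hypothetical, and the sketch assumes a fofn is a plain text file with one reading name per line.

```python
def parse_set_numbers(text):
    """Turn a space-separated "set numbers" entry into a list of integers."""
    return [int(token) for token in text.split()]


def strip_pads(consensus):
    """Remove padding characters ('*') from a consensus string."""
    return consensus.replace("*", "")


def fofn_names(sets, unassigned, wanted):
    """Collect reading names for the requested set numbers.

    `sets` is a list of lists of reading names, leftmost set first, so
    sets[0] is set number 1. Set number 0 stands for `unassigned`, the
    readings that were not placed in any set.
    """
    names = []
    for number in wanted:
        names.extend(unassigned if number == 0 else sets[number - 1])
    return names


sets = [["r1", "r2"], ["r3"]]
names = fofn_names(sets, ["r9"], parse_set_numbers("2 0"))
fofn_text = "\n".join(names) + "\n"   # one reading name per line
```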
The final set of controls to discuss in the SNP Candidates window controls the splitting
of sets into contigs. This is a one-way action which cannot be undone, so make sure you
back up the database using Copy Database beforehand.
The “Split sets to contigs” button moves the readings in each selected set to its own
contig. In some cases a set may be non-contiguous. Remember that templates are assigned
to sets, but a template may often only have the end sequence known with the middle portion
being unsequenced. Gap4 does not currently handle scaffolds and super-contigs so in order
to keep such sets held together in a single contig the “Add fake consensus” option may
be used. This adds an additional sequence to the contig that contains the consensus for
the set (including from readings that were unassigned). This also handily means that new
contigs produced from multiple sets are already aligned and base coordinates are directly
comparable. Hence two such sets may be viewed in the Join Editor by typing their names
into the main Join Contigs dialogue. (Find Internal Joins will attempt to realign the contigs
and often fails if the set contains many regions of unknown consensus.)

2.5.4 Plotting Consensus Quality
This option can be invoked from the main gap4 View menu, in which case it appears as a
single plot, or from the View menu of the Template Display, in which case it will appear as
part of the Template Display.
For each base in the consensus a "quality" code is computed based on the accuracy of
the data on each strand and whether or not the two strands agree. In a future release it
will be renamed the "Strand Comparison Plot". This "quality" is then plotted using colour
and height to distinguish the quality codes shown below.
Colour   Height     Meaning
grey      0 to 0    OK on both strands, both agree
blue      0 to 1    OK on plus strand only
green    -1 to 0    OK on minus strand only
red      -1 to 1    Bad on both strands
black    -2 to 2    OK on both strands but they disagree

For example, in the figure we see that the first four hundred or so bases are mostly only
well determined on the forward strand.

2.5.4.1 Examining the Quality Plot
Note that when displaying many bases the screen resolution implies that the quality codes
for many bases will appear in the same screen pixel. However the use of varying heights
ensures that all problematic regions will be visible, even when the problem is only with a
single base position. Hence when the quality plot consists of a single grey line all known
quality problems have been resolved, at the current consensus and quality cutoffs.
The quality plot appears as "Calculate quality" in the Results Manager window (see
Section 2.13 [Results Manager], page 277).
Within the Results Manager commands available, using the right mouse button, include
"Information", which lists a summary of the distribution of quality types to the output
window, and "List" which lists the actual quality values for each base to the output window.
These quality values are written in a textual form of single letters per base and are listed
below.

Code  +Strand  -Strand
a     Good     Good (in agreement)
b     Good     Bad
c     Bad      Good
d     Good     None
e     None     Good
f     Bad      Bad
g     Bad      None
h     None     Bad
i     Good     Good (disagree)
j     None     None
An example of the output using "Information" and "List" follows.
============================================================
Wed 02 Apr 12:14:06 1997: quality summary
------------------------------------------------------------
Contig xb56b6.s1 (#11)
 81.00 OK on both strands and they agree(a)
  3.94 OK on plus strand only(b,d)
 11.98 OK on minus strand only(c,e)
  1.85 Bad on both strands(f,g,h,j)
  1.22 OK on both strands but they disagree(i)
============================================================
Wed 02 Apr 12:14:09 1997: quality listing
------------------------------------------------------------
Contig xb56b6.s1 (#11)
        10         20         30         40         50         60
eeeeeeeeee eeeeeeeeee eeeeeeeeee eeeeeeehee eeeeeeeeee eeeeeeeeee
        70         80         90        100        110        120
eeeeeeeeee eeeeeeeeee eeeeeeeeee eeeeeeeeee eeeeeeeeee eeeeeeeeee
       130        140        150        160        170        180
eeeeeeeeee eeeeeeeeee eeeeeeeeee eeeeeeeeee eeeeeeeeee eeeeeeeeee
       190        200        210        220        230        240
eeeeeeeeee eeeeeeeeee heeeeeeeee eeeeeeeici iiaiaciiia aaaaaaaaac
       250        260        270        280        290        300
aaaacaaaaa aaaaaaaiia aaaaaaaaaa aaaaaaaaaa aaaabaaaaa aaaaaaaaaa
       310        320        330        340        350        360
aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa faaaaaaaaa
[ output removed for brevity ]
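
The relationship between the single letter codes and the "Information" summary can be sketched as below. The letter assignments follow the groupings printed in the summary above (for example b and d both mean "OK on plus strand only"). The function names and the "good"/"bad"/"none" labels are hypothetical; this is not Gap4's own code.

```python
from collections import Counter

# (plus strand, minus strand) -> letter. The "good"/"good" case also needs
# to know whether the strands agree, so it is handled in quality_code().
CODES = {
    ("good", "bad"): "b", ("bad", "good"): "c",
    ("good", "none"): "d", ("none", "good"): "e",
    ("bad", "bad"): "f", ("bad", "none"): "g",
    ("none", "bad"): "h", ("none", "none"): "j",
}

# Summary classes and the letters they group together.
SUMMARY = {
    "OK on both strands and they agree": "a",
    "OK on plus strand only": "bd",
    "OK on minus strand only": "ce",
    "Bad on both strands": "fghj",
    "OK on both strands but they disagree": "i",
}


def quality_code(plus, minus, agree=True):
    """Return the letter for one consensus base."""
    if plus == "good" and minus == "good":
        return "a" if agree else "i"
    return CODES[(plus, minus)]


def summarise(codes):
    """Percentage of bases falling in each summary class."""
    counts = Counter(codes)
    total = len(codes) or 1  # avoid dividing by zero on empty input
    return {
        label: 100.0 * sum(counts[c] for c in letters) / total
        for label, letters in SUMMARY.items()
    }


summary = summarise("eeeeeeehee" + "aaaacaaaaa")
```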

2.5.5 Plotting Stop Codons
The Stop Codon Map plots the positions of all the stop codons on one or both strands of
a contig consensus sequence. It can be invoked from the gap4 View menu. If the Contig
Editor is being used on the same contig, the Refresh button will be enabled and if used will
fetch the current consensus from the editor, repeat the search and replot the stop codons.

The figure shows a typical zoomed in view of the Stop Codon Map display. The positions
for the stop codons in each reading frame (here all six frames are shown) are displayed in
horizontal strips. Along the top are buttons for zooming, the crosshair toggle, a refresh
button and two boxes for showing the crosshair position. The left box shows the current
position and the right-hand box the separation of the last two stop codons selected by the
user. Below the display of stop codons is a ruler and a horizontal scrollbar. The information
line is showing the data for the last stop codon the user has touched with the cursor. Also
shown on the left is a copy of the View menu which is used to select the reading frames to
display.

2.5.5.1 Examining the Plot
Positioning the cursor over a plotted point will cause its codon and position to appear in
the information line.
It is possible to find the distance between any two stop codons. Pressing the left mouse
button on a plotted point will display "Select another codon" at the bottom of the window.
Then, pressing the left button on another plotted point will display the distance, in bases,
between the two sites. This is shown in the box located at the top right corner of the
window.
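
As an illustration of what the plot contains, the following sketch locates stop codons in each reading frame of a consensus string and measures the distance between two of them. It assumes the standard stop codons TAA, TAG and TGA and a sequence without pads or ambiguity codes. Minus strand positions are reported relative to the reverse-complemented sequence, which is a simplification of this sketch rather than a statement of how Gap4 reports them.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def stop_codons(seq, frame):
    """1-based start positions of stop codons in forward frame 1, 2 or 3."""
    seq = seq.upper()
    return [
        i + 1
        for i in range(frame - 1, len(seq) - 2, 3)
        if seq[i:i + 3] in STOPS
    ]


def stop_codon_map(seq):
    """Stop codon positions for all six reading frames."""
    reverse = seq.upper().translate(COMPLEMENT)[::-1]
    frames = {f"+{f}": stop_codons(seq, f) for f in (1, 2, 3)}
    frames.update({f"-{f}": stop_codons(reverse, f) for f in (1, 2, 3)})
    return frames


consensus = "ATGTAAGGGTGA"
frames = stop_codon_map(consensus)
first, second = frames["+1"]
distance = abs(second - first)  # separation in bases, as shown in the box
```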

2.5.5.2 Updating the Plot
If the Contig Editor (see Section 2.6 [Editing in gap4], page 160) is currently running on the
same contig as is being displayed as a Stop Codon Map, the Refresh button will be shown
in bold lettering and hence be active, otherwise it will be greyed out. Pressing the button
will fetch the current consensus from the Contig Editor and replot its stop codons. Hence
the plot can be kept current with the changes being made in the editor.

2.5.6 Plotting Restriction Enzymes
The restriction enzyme map function finds and displays restriction sites within a specified
region of a contig. It is invoked from the gap4 View menu. Users can select the enzyme
types to search for and can save the sites found as tags within the database.
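
A site search of this kind can be sketched as follows. The sketch handles only recognition sequences made of A, C, G and T, with the cut point marked by an apostrophe as in GT'CGAC, and it searches a single strand. The reported position is assumed to be the 1-based index of the first base after the cut; this convention and the function names are assumptions of the example.

```python
def parse_site(pattern):
    """Split "GT'CGAC" into ("GTCGAC", 2): recognition sequence, cut offset."""
    cut = pattern.index("'")
    return pattern.replace("'", ""), cut


def find_cuts(seq, pattern):
    """Positions of the first base after each cut (assumed convention)."""
    recognition, cut = parse_site(pattern)
    seq = seq.upper()
    cuts = []
    start = seq.find(recognition)
    while start != -1:
        cuts.append(start + cut + 1)            # convert to 1-based
        start = seq.find(recognition, start + 1)  # allow overlapping sites
    return cuts


region = "AAGTCGACTTGTCGACAA"
cuts = find_cuts(region, "GT'CGAC")
```

Recognition sequences containing ambiguity codes would need a pattern match in place of the exact string search used here.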

2.5.6.1 Selecting Enzymes
Restriction enzyme names and their cut sites are stored in disk files. For the format
of these files and notes about creating new ones see Section 11.4 [Restriction enzyme files],
page 566.
When the file is read, the list of enzymes is displayed in a scrolling window. To select
enzymes press and drag the left mouse button within the list. Dragging the mouse off the
bottom of the list will scroll it to allow selection of a range larger than the displayed section
of the list. When the left button is pressed any existing selection is cleared. To select several
disjoint entries in the list press control and the left mouse button. Once the enzymes have
been chosen, pressing OK will create the plot.

2.5.6.2 Examining the Plot
Positioning the cursor over a match will cause its name and cut position to appear in
the information line. If the right mouse button is pressed over a match, a popup menu
containing Information and Configure will appear. The Information function in this menu
will display the data for this cut site and enzyme in the Output Window.
It is possible to find the distance between any two cut sites. Pressing the left mouse
button on a match will display "Select another cut" at the bottom of the window. Then,
pressing the left button on another match will display the distance, in bases, between the
two sites. This is shown in a box located at the top right corner of the window.

2.5.6.3 Reconfiguring the Plot
The plot displays the results for each restriction enzyme on a separate line. Enzymes with
no sites are also shown. The order of these lines may be changed by pressing and dragging
the middle mouse button or alt + left mouse button on one of the displayed names at the
left side of the screen.
The results are plotted as black lines but users can select colours for each enzyme type
by pressing the right button on any of its matches. A menu containing Information and
Configure will pop up. Configure will display a colour selection dialogue. Adjusting the
colour here will adjust the colour for all matches for this restriction enzyme.

2.5.6.4 Creating Tags for Cut Sites
Clicking the left mouse button on an enzyme name at the left of the display toggles a
highlight. The Create tags command from the Edit menu will add tags to the database for
all the matches whose enzyme names are highlighted. The command displays a dialogue
box listing the enzyme names on the left, and the tag type to create for that enzyme on the
right. Tag types must be chosen for all the listed restriction enzyme types before the tags
can be created. Suitable tag types to choose are the ENZ0, ENZ1 (etc) tags.

2.5.6.5 Textual Outputs
The Results menu of the plot contains options to list the restriction enzyme sites found.
One option sorts the results by enzyme name and the other by the positions of the matches.
The output below shows the textual output from "Output enzyme by enzyme". The
Fragment column gives the size of the fragments between each of the cut sites. The Lengths
column contains the fragment sizes sorted on size.
Contig zf98g12.r1 (#801)
Number of enzymes = 3
Number of matches = 7
Matches found=     1
     Name    Sequence    Position Fragment lengths
   1 AATII   GACGT'C         7130     7129     556
                                       556    7129
Matches found=     5
     Name    Sequence    Position Fragment lengths
   1 ACCI    GT'CGAC          414      413     189
   2 ACCI    GT'CTAC         1296      882     413
   3 ACCI    GT'CTAC         3871     2575     882
   4 ACCI    GT'CTAC         5816     1945    1681
   5 ACCI    GT'CGAC         7497     1681    1945
                                       189    2575
Matches found=     1
     Name    Sequence    Position Fragment lengths
   1 AHAII   GA'CGTC         7127     7126     559
                                       559    7126

Position Fragment lengths
     414      413       3
    1296      882     189
    3871     2575     367
    5816     1945     413
    7127     1311     882
    7130        3    1311
    7497      367    1945
              189    2575
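
The Fragment and Lengths columns can be reproduced from the cut positions alone. The sketch below assumes that a position marks the first base of the fragment following the cut, and that the contig is 7685 bases long; that length is not stated in the output but is the sum of the fragments listed for each enzyme. The function name is hypothetical.

```python
def fragments(positions, contig_length):
    """Fragment sizes between consecutive cut positions, in contig order."""
    edges = [1] + sorted(positions) + [contig_length + 1]
    return [right - left for left, right in zip(edges, edges[1:])]


acci = [414, 1296, 3871, 5816, 7497]
fragment_column = fragments(acci, 7685)
lengths_column = sorted(fragment_column)
print(fragment_column)  # [413, 882, 2575, 1945, 1681, 189]
print(lengths_column)   # [189, 413, 882, 1681, 1945, 2575]
```

Passing the positions of all three enzymes together gives the values of the position-ordered listing in the same way.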

2.6 Editing in Gap4
The gap4 Contig Editor is designed to allow rapid checking and editing of characters in
assembled readings. Very large savings in time can be achieved by its sophisticated problem finding procedures which automatically direct the user only to the bases that require
attention. The following is a selection of screenshots to give an overview of its use.

The figure above shows a screendump from the Contig Editor which contains segments
of aligned readings, their consensus and a six phase translation. The Commands menu is
also shown. The main components are: the controls at the top; reading names on the left;
sequences to their right; and status lines at the bottom. Some of the reading names are
written in light grey which indicates that their traces/chromatograms are being displayed
(in another window, see below).
One reading name is written with inverse colours, which indicates that it has been
selected by the user. To the left of each reading name is the reading number, which is
negative for readings which have been reversed and complemented. The first of the status
lines, labelled “Strands”, is showing a summary of strand coverage. The left half of the
segment of sequence being displayed is covered only by readings from one strand of the
DNA, but the right half contains data from both strands.
Along the top of the editor window is a row of command buttons and menus. The
rightmost pair of buttons provide help and exit. To their left are two menus, one of which
is currently in use. To the left of this is a button which initially displays a search dialogue
and, when pressed again, performs the selected search. Further left is the undo button:
each time the user clicks on this box the program reverses the previous edit command. The
next button, labelled “Cutoffs” is used to toggle between showing or hiding the reading
data that is of poor quality or is vector sequence. In this figure it has been activated,
revealing the poor quality data in light grey. Within this, sequencing vector is displayed in
lilac. The next button to the left is the Edit Modes menu which allows users to select which
editing commands are enabled. The next command toggles between insert and replace and
so governs the effect of typing in the edit window. The 2 entryboxes on the left hand side
labelled C and Q set the consensus and quality cutoff values (see Section 2.6.18.1 [Consensus
and Quality Cutoffs], page 198).
One of the readings contains a yellow tag, and elsewhere some bases are coloured red,
which indicates they are of poor quality. The Information Line at the bottom of the window
can show information about readings, annotations and base calls. In this case it is showing
information about the reliability of the base beneath the editing cursor.

register with one another, and with the cursor in the Contig Editor. Conversely, the Contig
Editor cursor can be scrolled by the trace cursor. A typical view is shown below.

This figure is an example of the Trace Display showing three traces from readings in
the previous two Contig Editor screendumps. These are the best two traces from each
strand plus a trace from a reading which contains a disagreement with the consensus. The
program can be configured to automatically bring up this combination of traces for each
problem located by the “Next search” option. The histogram or vertical bars plotted top
down show the confidence value for each base call. The reading number, together with the
direction of the reading (+ or -) and the chemistry by which it was determined, is given
at the top left of each sub window. There are three buttons (’Info’, ’Diff’, and ’Quit’)
arranged vertically with X and Y scale bars to their right. The Info button produces a
window like the one shown in the bottom right hand corner. The Diff button is mostly used
for mutation detection, and causes a pair of traces to be subtracted from one another and
the result plotted, hence revealing their differences (see Section 2.6.11 [Traces], page 188).

2.6.1 Moving the visible segment of the contig
The contig editor displays only one segment of the entire contig, although several contig
editors can be in use at once. Above the sequence display is a “scrollbar”. This line
represents the entire contig, with a greyed section representing the currently displayed
segment. To change the displayed segment put the mouse cursor in the scrollbar and use
the mouse buttons. The available controls are:
Middle Mouse Button      Set displayed section
Alt Left Mouse Button    Set displayed section
Left Mouse Button        Scroll left or right one screenful

On the far right side of the contig is a vertically oriented scrollbar. Typically the editor
will be showing all available data, in which case the vertical scrollbar cannot be scrolled.
In regions of exceptionally deep coverage, the editor makes sure that the controls, the
consensus, and any status lines are visible. The remaining space is taken up with however
many sequences fit. The vertical scrollbar can then be used, using the mouse buttons listed
above, to scroll through the sequences.
In addition to the scrollbars there are four buttons on the left hand side for scrolling by
fixed amounts.
<<    Scroll left half a screenful
<     Scroll left one base
>     Scroll right one base
>>    Scroll right half a screenful

Within the editor window itself two more key combinations can be used for scrolling
forwards and backwards an entire screenful. These, and several others, are modelled after
the Emacs key bindings.
Control v    Scroll right one screenful
Meta v       Scroll left one screenful

Finally, moving the editing cursor will always adjust the displayed section so that the
editing cursor is visible. Hence this can also be used to scroll around the editor in both
horizontal and vertical fashions.

2.6.2 Names
At the left side of the editor window is a display containing the reading names and numbers.
Each line consists of its orientation (“+” or “-”), reading number, a coloured template
consistency status and its name. The bottom line is always CONSENSUS. Also on the bottom
line is the current edit status. This is modelled on Emacs, and consists of one of ----, -%%-
and -**-, to symbolise “No unsaved edits made”, “No edits made - editor is in read only
mode”, and “Unsaved edits made”.
The maximum length of a reading name is 40 characters. Additionally there are 7
characters taken up with the direction and number of a reading. By default the names
display only shows 23 characters (enough to show 16 letters of a reading name). A horizontal
scrollbar just above the reading names can be used to scroll the reading names. Note that
the numbers and orientation are always visible. To change the width of the editor names
display set the CONTIG_EDITOR.NAMES_WIDTH setting in your ‘.gaprc’. For example:
set_def CONTIG_EDITOR.NAMES_WIDTH 23
The foreground colour for the text reveals whether the trace for this reading is shown
- a grey foreground indicates that the trace is visible. The background colour represents a
user highlight and the disassembly mode. The default background colour is light grey (the
same colour as the general editor background). Clicking the left mouse button on a reading
name toggles the background of the name component of the number-name pair to black. This
is particularly useful for keeping track of an individual reading whilst scrolling the editor.
As the editor scrolls an individual reading will move up and down the editor display. By
highlighting this reading it becomes easy to track. The number component of the number-name pair is used to highlight readings that are to be disassembled. See Section 2.9.1.2
[Disassemble Readings], page 240. In this case the background is dark grey.

If the template display is in use, highlighting a reading name in the editor will select
this reading in the template display (by marking it as bold). Similarly selecting a reading
in the template display (left mouse button) will highlight the reading in the contig editor.
Additionally the contig editor cursor is visible within the template display allowing the
position of the editor to be controllable from the template display and connected plots
(such as the quality plot). See Section 2.5.1 [Template Display], page 130.
The readings contained within the “readings” list are automatically highlighted when
the editor starts. Toggling the highlighted names in the editor updates the “readings” list
accordingly. See Section 2.14.1 [Special List Names], page 278.
Once an output list for the editor has been set, pressing the middle mouse button,
or Alt left mouse button, on the names display has the same effect as using the left
button, except that it adds (and never removes) the reading name to the specified list. See
Section 2.6.8.18 [Set Output List], page 186. This is similar to using the left mouse button
to add names to the “readings” list, except that it allows for multiple lists to be built up.
Pressing the right mouse button on a name will pop up a menu containing a variety of
operations to perform for that specific reading.
Goto...

This is a cascading menu containing all other readings on the same template,
including ones on other contigs. Selecting the appropriate read name will move
the editor to the left-most base in that sequence. If the sequence is in another
contig then the editor for that contig will be moved (or created if needed).

Join to...

This is only shown when a template has more than one reading in it and the
readings are within separate contigs. When this is the case a cascading menu
presents the list of readings in other contigs. Selecting one of these will bring up
the join editor with both sequences visible (so that you will need to manually
scroll to approximately the correct position in order to find the join).

Select this reading
Select this reading and all to right
Deselect this reading
Deselect this reading and all to right
Select readings on this template
Deselect readings on this template
These commands (de)select one or several readings. “Select this reading” is the
most simple method and this acts in the same way as simply left clicking on a
sequence name.
The other modes allow the (de)selection of sequences on this template (regardless of which contig they are in) or ranges. The “and all to right” modes are
designed with disassemble readings in mind. Disassembling all readings from
a specific point onwards using the “Move readings to new contigs” mode is
analogous to using break contig. Selecting all readings within a range may
be achieved by a combination of “select this reading and all to right” and a
subsequent “deselect this reading and all to right” further along the contig.
List notes
This invokes the note selector with this reading already listed (see Section 2.15.1
[Selecting Notes], page 281).
Set as reference sequence
This marks the sequence as the Reference sequence (see Section 2.6.12.1 [Reference sequences], page 191).
Set as reference trace
This marks the trace as a Reference trace for use with trace differencing (see
Section 2.6.12.2 [Reference traces], page 191).
Remove reading (this only)
Remove reading and all to right
This marks one or more readings as ready for removal by disassemble readings
(see Section 2.6.9 [Removing readings from the contig], page 186). You will
then be prompted when you exit the editor whether you wish to disassemble
the chosen readings.
Clear selection
This clears the current reading selection.

2.6.3 Editing
Editing can take up a significant portion of the time taken to finish a sequencing project.
Gap4 has a selection of searches (see Section 2.6.6 [Searching], page 174) designed to speed
up this process. The problems that require most attention are conflicts between good bases.
Where base confidence values are present it should be unnecessary to edit all conflicting
bases as, in general, this will amount to adjusting poor quality data to agree with good
quality data, in which case the consensus sequence should be correct anyway.
Pads in the consensus should not be considered a problem requiring edits because it
is possible to output the consensus sequence (from the main Gap4 File menu) with pads
stripped out. Obviously poorly defined pads (a mixture of several pads and real bases)
require checking in the same manner as other poorly defined consensus bases.
If you wish to check all base conflicts set the consensus algorithm to Frequency (see
Section 2.11.5 [The Consensus Algorithms], page 257) and the consensus cutoff to 100. The
consensus will then be a dash in all places where there is not a 100% agreement in the
sequences. The “Next Problem” editor button will then step one at a time through each
conflict.
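As a rough illustration of why a cutoff of 100 exposes every conflict, the following Python sketch computes a simple frequency consensus for a few aligned columns. It is not Gap4's implementation: the function name and the representation of a column as a plain string of base calls are assumptions made for this example only.

```python
from collections import Counter

def frequency_consensus(column, cutoff_percent=100):
    """Return the consensus character for one aligned column.

    column is a string holding the call from each reading at this
    position. The most common call is used only if its share of the
    column reaches cutoff_percent; otherwise a dash is returned.
    """
    if not column:
        return "-"
    base, count = Counter(column.upper()).most_common(1)[0]
    return base if 100 * count >= cutoff_percent * len(column) else "-"

# Four columns; the second and fourth contain a disagreeing call.
columns = ["AAAA", "AACA", "GGGG", "T*TT"]
consensus = "".join(frequency_consensus(c) for c in columns)
print(consensus)  # A-G-
```

With the cutoff at 100 any column containing even one differing call becomes a dash, so stepping from dash to dash visits every conflict.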

2.6.3.1 Moving the editing cursor
Nearly all editing operations happen at the location of the editing cursor. This cursor
appears as a solid block. The simplest way of moving the cursor is to use the
left mouse button. Alternatively the following keys can be used.

Left arrow or Control b     Move left one base
Right arrow or Control f    Move right one base
Up arrow or Control p       Move up one base
Down arrow or Control n     Move down one base
Control a                   Move editing cursor to start of used data
Control e                   Move editing cursor to end of used data
Meta a                      Move editing cursor to start of cutoff data
Meta e                      Move editing cursor to end of cutoff data
Meta <                      Move editing cursor to start of contig
Meta >                      Move editing cursor to end of contig

The difference between the “Control a”/“Control e” and “Meta a”/“Meta e” key combinations depends on
whether “Cutoffs” is set. If it is, then “Control a” will move to the start of the used data
for this reading and “Meta a” will move to the start of the cutoff data for this reading.
Otherwise they both move to the same point (the used data start). Similarly for “Control
e” and “Meta e”. The action of these four key presses in the consensus line is simply to
move to the start or end of the entire consensus sequence.
The cursor can be placed on any sequence data shown in the editor.

2.6.3.2 Editing Modes
The editor operates in two main edit modes - Replace and Insert. Replace allows a character
to be replaced by another and Insert allows characters to be inserted. Replace is the default
mode. The mode can be changed by pressing the button marked “Insert”. The checkbox
next to the button will be set (filled by a dark colour) when the mode is “Insert”. By
default these modes are restricted until the Edit Modes menu is used to change them.
The Edit Modes menu consists of a series of checkboxes and radiobuttons which control
which editing options are enabled.
Allow insert in read
Allow del in read
Insertion or deletion within a reading will shift the sequence characters and
so will alter their alignment. This is acceptable provided action is taken to
correct it, by either shifting the reading or by inserting or deleting a base
elsewhere. This functionality is disabled by default and is enabled by checking
the appropriate checkbox. Note though that insertion and deletion of bases
within the cutoff data will shuffle the cutoff data rather than the reading itself
and hence will not break alignment. However this operation still requires the
edit mode to be enabled.
Allow insert any in cons
Allow del dash in cons
Allow del any in cons
These operations control the editing actions allowed for the consensus. By
default the only operations allowed are insertion and deletion of pads. This is
because consensus editing is typically used for removing columns of pads where
a single reading has been overcalled.
When editing at 100% disagreement, such cases will be dashes in the consensus,
so “Allow del dash in cons” enables deletion of both dash and pads.
“Allow insert any in cons” and “Allow del any in cons” allow any column to
be completely inserted or deleted. These are potentially dangerous actions;
however, the “Evidence for edits” options can detect such edits.
Allow replace in cons
Replacing a base in the consensus changes all of the bases in readings at this
point that disagree with the typed base. The actual edit performed depends
upon the “Edit by base type” and “Edit by confidence” radiobuttons.
Allow reading shift
To shift a reading place the cursor at the far left end of the reading. If cutoffs
is set this should be the far left end of the cutoff data. Then typing space or
delete will move the reading right or left respectively by one position. This
operation is disabled by default.
Allow transpose any
Moving pads within a reading is often a useful procedure, and the ’movement’
of a pad alone will not break the alignment. For this reason it is possible to
move pads around without using insert/delete. Placing the cursor over a pad
in a reading and pressing “Control l” or “Control r” will move that pad left or
right one base. This operation will not work with the cursor on the consensus.
Pad movement is allowed at all times. The selection of “Allow transpose any”
allows any pair of adjacent characters to be swapped.
Allow uppercase
A rule often followed by users is to type all modifications in lower case which
makes edited characters easier to see. The “Allow uppercase” checkbox controls
whether this rule is enforced or not. By default “Allow uppercase” is checked
which means that the rule is not enforced.
Edit by base type
Edit by confidence
These two selections are radiobuttons, and are mutually exclusive. They control
the outcome when replacing bases in the consensus. When editing the consensus
“Edit by base type” changes bases that disagree with the consensus to the base
typed. “Edit by confidence” changes the confidence of disagreeing bases to 0.
If the consensus quality cutoff value is greater than or equal to zero, characters
with an accuracy value of 0 are ignored in the consensus calculation. That
is, although the characters still appear in the reading, they are not used to
calculate the consensus. In this way it is possible to maintain the original base
calls for visual inspection, but get the correct consensus.
Note that “Edit by confidence” will not work if the “frequency” consensus
algorithm is in use (see Section 2.11.5 [The Consensus Calculation], page 257).
If you wish to use “Edit by confidence”, make sure that the quality cutoff is zero
or higher, otherwise the frequency consensus algorithm will be used instead.
Allow F12 for fast tag deletion
F12 and Shift-F12 may be used to delete the tag underneath the contig editor
cursor (F12) or the mouse pointer (Shift-F12). Initially these are disabled to
prevent accidental deletions.
Mode set 1
Mode set 2
To make it easier to set the editing modes two user definable sets are available.
By default these are as follows.
Mode set 1:
− Disallow insert in read
− Disallow del in read
− Disallow insert any in cons
− Allow del dash in cons
− Disallow del any in cons
− Disallow replace in cons
− Disallow reading shift
− Disallow transpose any
− Allow uppercase
− Edit by confidence
Mode set 2:
− Allow insert in read
− Allow del in read
− Allow insert any in cons
− Allow del dash in cons
− Disallow del any in cons
− Allow replace in cons
− Allow reading shift
− Disallow transpose any
− Allow uppercase
− Edit by confidence
Currently the only way of redefining these sets is to add lines to your ‘.gaprc’
file. See Section 2.20.1 [Options Menu], page 298. The method is to define a
list of 1s and 0s to specify the states in the order listed above. The two default
sets are defined as follows.
set_def CONTIG_EDITOR.SE_SET.1 {0 0 0 1 0 0 0 0 1 1}
set_def CONTIG_EDITOR.SE_SET.2 {1 1 1 1 0 1 1 0 1 1}
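Writing the list of 1s and 0s by hand is error prone, so a small helper can build the line from option names. The Python sketch below is only an illustration: the helper name and the lower-case option labels are hypothetical, and it assumes that the final flag is 1 for “Edit by confidence” and 0 for “Edit by base type”, as the default sets suggest.

```python
# Option order as listed for the default mode sets above.
EDIT_MODE_ORDER = [
    "insert in read",
    "del in read",
    "insert any in cons",
    "del dash in cons",
    "del any in cons",
    "replace in cons",
    "reading shift",
    "transpose any",
    "uppercase",
    "edit by confidence",
]

def mode_set_line(set_number, enabled):
    """Build a set_def line for a mode set from the enabled option names."""
    unknown = set(enabled) - set(EDIT_MODE_ORDER)
    if unknown:
        raise ValueError(f"unknown edit modes: {sorted(unknown)}")
    flags = " ".join("1" if name in enabled else "0" for name in EDIT_MODE_ORDER)
    return f"set_def CONTIG_EDITOR.SE_SET.{set_number} {{{flags}}}"

# Reproduce the default definition of Mode set 1.
default_set_1 = mode_set_line(
    1, {"del dash in cons", "uppercase", "edit by confidence"}
)
print(default_set_1)  # set_def CONTIG_EDITOR.SE_SET.1 {0 0 0 1 0 0 0 0 1 1}
```

The printed line can then be pasted into the ‘.gaprc’ file.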

2.6.3.3 Adjusting the Quality Values
Each base has its own quality value. Assembly will allow only values between 1 and 99
inclusive. A quality value of 0 means that this base should be ignored. A quality value of
100 means that this base is definitely correct and the consensus will be forced to be the
same base type and will be given a consensus confidence of 100. If two conflicting bases
both have a quality of 100 the consensus will be a dash with a confidence of 0.
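The special meaning of the values 0 and 100 can be summarised in a few lines of Python. This is a sketch of the rules stated above only; the function name and the (base, quality) representation are assumptions, and the normal consensus calculation for all other cases is deliberately not modelled.

```python
def forced_consensus(calls):
    """Apply the special cases for quality 0 and 100 to one column.

    calls is a list of (base, quality) pairs. Returns (base, confidence)
    when a quality of 100 decides the column, or None when the normal
    consensus algorithm (not modelled here) would have to decide.
    """
    # Bases with quality 0 are ignored entirely.
    usable = [(base.upper(), q) for base, q in calls if q != 0]
    forced = {base for base, q in usable if q == 100}
    if len(forced) == 1:
        return forced.pop(), 100   # consensus forced to this base type
    if len(forced) > 1:
        return "-", 0              # conflicting bases both at quality 100
    return None

print(forced_consensus([("A", 30), ("C", 100)]))   # ('C', 100)
print(forced_consensus([("A", 100), ("C", 100)]))  # ('-', 0)
```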
Newly added bases or replaced bases are assigned their own quality values. By default
these are both 100. The “Set Default Confidence” option in the settings menu allows these
values to be changed.
Several keyboard commands are available to edit the quality value of an individual base.
The ’[’ and ’]’ keys set the quality to 0 and 100 respectively. To increment or decrement
the confidence of a base by 1 use Shift plus the Up and Down arrow keys. To increment
or decrement by 10 use Control plus the Up and Down arrow keys. The editor will beep if
you reach quality 0 or 100. Finally note that quality values can also be made visible by the
use of grey scales for the sequence background colour. See Section 2.6.8.13 [Show Quality],
page 184.
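The effect of these keys on a single base can be sketched as follows. The key labels used as dictionary keys are hypothetical names chosen for this example, and clamping the result to the range 0 to 100 is an assumption based on the statement that the editor beeps when either limit is reached.

```python
# Keys that set an absolute quality, and keys that step the quality.
SET_KEYS = {"[": 0, "]": 100}
STEP_KEYS = {"Shift-Up": 1, "Shift-Down": -1, "Control-Up": 10, "Control-Down": -10}

def adjust_quality(quality, key):
    """Return the quality value of a base after one key press."""
    if key in SET_KEYS:
        return SET_KEYS[key]
    # Steps never take the value outside 0..100.
    return max(0, min(100, quality + STEP_KEYS[key]))

print(adjust_quality(30, "Shift-Down"))   # 29
print(adjust_quality(95, "Control-Up"))   # 100
```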

2.6.3.4 Adjusting the Cutoff Data
The cutoff data is displayed by pressing the “Cutoffs” toggle at the top of the editor. The
cutoff sequence will be displayed in grey. We call the boundary between the cutoff data
and the used data the cutoff position. These positions can be shifted left or right for each
end of the reading using the Meta Left-arrow and Meta Right-arrow keys respectively. As
keyboards may not have a meta key, Control Left-arrow and Control Right-arrow also have
the same effect. These key combinations adjust the cutoff positions by a single base at
a time. They only work when the cursor is on the very first or very last “used” base,
depending on which cutoff you wish to adjust.
If large changes are required the cutoffs can be “zapped” to new positions using the “<”
and “>” keys. To use these, place the editing cursor at the position required (which may be
within the cutoff data or the used data) and press the “<” key to set the left cutoff to the
base between the cursor and the base leftwards of the cursor. Similarly “>” sets the right
cutoff to the base between the cursor and the base leftwards of the cursor. Note that many
keyboards have “<” and “>” above the “,” and “.” keys. In this case you will need to press
Shift in conjunction with “,” and “.” to perform the operations.

2.6.3.5 Summary of Editing Commands
A brief summary of these editing operations and which (if any) edit modes are required can
be seen below:
Key          Location    Ins/Rep  Edit Mode           Action
---------------------------------------------------------------------------
base/*       Reading     Replace  any                 Change base
base/*       Reading     Insert   Insert in read      Change base
delete       Reading     both     Delete in read      Del base left & move
Ctrl delete  Reading     both     Delete in read      Delete base to left
Ctrl d       Reading     both     Delete in read      Delete under cursor
delete       Read start  both     Reading shift       Shift left
space        Read start  both     Reading shift       Shift right
Ctrl l       Reading     both     any                 Move pad left
Ctrl r       Reading     both     any                 Move pad right
Ctrl l       Reading     both     Transpose any       Move base left
Ctrl r       Reading     both     Transpose any       Move base right
[            Reading     both     any                 Set quality to 0
]            Reading     both     any                 Set quality to 100
Shift Up     Reading     both     any                 Incr. quality by 1
Shift Down   Reading     both     any                 Decr. quality by 1
Ctrl Up      Reading     both     any                 Incr. quality by 10
Ctrl Down    Reading     both     any                 Decr. quality by 10
<            Reading     both     any                 Set left cutoff
>            Reading     both     any                 Set right cutoff
Meta left    Reading     both     any                 Adjust left cutoff
Meta right   Reading     both     any                 Adjust right cutoff
*            Consensus   both     any                 Insert pad column
base         Consensus   Insert   Insert any in cons  Insert column
base         Consensus   Replace  any                 Replace column
delete *     Consensus   both     any                 Delete column
delete -     Consensus   both     Del dash in cons    Delete column
delete any   Consensus   both     Del any in cons     Delete column
Ctrl d       Consensus   both     Del dash/any        Delete column
Shift F1-10  Read/Cons   both     any                 Create tag macro
F1 to F10    Read/Cons   both     any                 Use tag macro
F11          Read/Cons   both     any                 Edit tag under cursor
Shift F11    Read/Cons   both     any                 Edit tag under pointer
F12          Read/Cons   both     Fast tag deletion   Delete tag under cursor
Shift F12    Read/Cons   both     Fast tag deletion   Del. tag under pointer

2.6.4 Selections
It is possible to highlight an area of a reading or the consensus sequence in preparation for
performing some further action upon it. Examples of such actions are creating annotations
and aligning sequence. We call these highlighted areas “selections”. They will be displayed
as an underlined region.
The simplest way to make a selection is using the left mouse button. Pressing the mouse
button marks the base beneath the cursor as the start of the selection. Then, without
releasing the button, moving the mouse cursor adjusts the end of the selection. Finally
releasing the button will allow normal use of the mouse again.
Sometimes we may wish to make a selection longer than is visible on the screen, or to
extend our current selection. This can be done by using shift left mouse button to adjust
the end of the selection. Hence we can mark the start of the selection using the left button,
scroll along the contig to the desired position, and set the end using the shift left button.

The selection is stored in a “cut buffer”. This allows for the usual “cut and paste”
operations between applications, although the contig editor only supports this in one direction (as it is not possible to “paste” into the window). The mechanism employed for this
follows the usual X Windows standard of using the middle mouse button (or Alt left mouse
button). For example, to send a piece of sequence to a text editor (eg Emacs) mark the
desired region using the left mouse button in the editor window and then press the middle
button, or Alt left mouse button, whilst the mouse cursor is in the text editor window. The
sequence will then be inserted into the text editor.
A quick summary of the mouse commands follows.
Left button                        Position editing cursor to mouse cursor
Left button (drag)                 Mark start and end of selection
Shift left button                  Adjust end of selection
Middle button (another window)     Copy selected sequence
Alt left button (another window)   Copy selected sequence

2.6.5 Annotations
Annotations (or tags) can be placed at any position on readings or on the consensus. They
are usually used to record positions of primers for walking, or to mark sites, such as repeats
or compressions, that have caused problems during sequencing. They can also be used to
contain feature table data as read from an EMBL format sequence file (see Section 3.1.3
[Reference sequences], page 314). Each annotation has a type such as “primer”, a position, a
length, a strand (forward, reverse or both) and an optional comment. Each type and strand
has an associated colour that will be shown on the display. For information on searching
for annotations see Section 2.6.6.4 [Searching by Tag Type], page 175, and Section 2.6.6.3
[Searching by Annotation Comments], page 175.

To create an annotation, make a selection and then select “Create Tag” from the contig
editor commands menu. See Section 2.6.7 [The Commands Menu], page 177. This will
bring up a further window; the “tag editor” (shown above). The “Type:” button at the
top of the editor invokes a selectable list from which tag types can be chosen. See below.

Use this to select the desired type of annotation.
Next the strand of the annotation can be selected. This will be displayed as one of
“<---->”, “<----” and “---->”. The comment (the box beneath the buttons) can be edited using
the usual combination of keyboard input and arrow keys. The “Save” button will exit the
tag editor and create the annotation. To abandon editing without creating the annotation
use the “Cancel” button.
To edit an existing annotation, position the editing cursor within an annotation and select
“Edit Tag” from the commands menu. This will be a cascading menu, typically showing
one tag. If multiple tags coincide at the same sequence position you will be able to choose
which tag to edit. Once again the tag editor will be invoked and operates as before. The
F11 key is also a shortcut for editing the top-most tag underneath the editor cursor. When
editing, “Save” will save the edited changes and “Cancel” will abandon changes.
Removing an annotation involves positioning the editing cursor within an annotation and
selecting “Delete Tag” from the commands menu. As with “Edit Tag” this is a cascading
menu to allow you to choose which tag at a specific point to delete.
Within a tag editor two buttons “Move” and “Copy” may be used to reposition existing
tags. When editing a tag, the current location of the tag is underlined within the editor.
If a new region is highlighted (on the consensus, a different reading, or even in a different
contig) and either of these buttons is pressed the tag will be saved to the new location
and removed from the previous location if “Move” was used. This can be used as an easy
way to adjust the extents of an existing tag or as a way to annotate multiple locations
with the same tag contents.
As usual, “undo” can be used to undo any of these annotation creations, edits and
removals.
Some tags may contain graphical controls instead of the usual text panel. These are
encoded with the master gap4 tag database (GTAGDB) by specifying the default tag text
to be a piece of “ACD” code. A full description of the (modified for gap4) ACD syntax is
not available currently, but it is strongly modelled on the EMBOSS ACD syntax which
has documentation at
http://www.emboss.org/Acd/index.html .
It is possible to add your own tag types by modifying either the system GTAGDB file
or creating your own GTAGDB file in your home directory (for all your databases) or the
current directory (for just those in that directory).

For rapid annotation a series of 10 macros may be programmed. Press Shift and a
function key between F1 and F10 to bring up the macro editor. This looks much like the
normal tag editor except that Save is replaced with Save Macro and saving does not actually
create a tag on the sequence. To use the macro, highlight the bases you wish and press
the function key corresponding to that macro - F1 to F10. For a single base pair tag you
do not need to underline a region as the tag will automatically cover the base underneath
the editing cursor. To remember these permanently use the “Save Macros” option in the
“Settings” menu.
You may find that some function keys are already programmed to do other things (such
as raise or lower windows), depending on the windowing environment in use. If this is the
case either modify the configuration of your windowing system or simply use another macro
key.
For rapid editing and deleting the F11 and F12 keys may be used. These edit and
delete the top-most tag underneath the editing cursor. If you wish to edit or delete the
tag underneath the mouse cursor instead (and hence save a mouse click) use Shift F11 and
Shift F12 for edit and delete.
The Control-Q key sequence may be used to toggle the displaying of tags. Pressing it
once will prevent all tags from being displayed in the editor. This is sometimes useful to see
any colouring information underneath the tag. Pressing Control-Q once more will redisplay
them.

2.6.6 Searching
The contig editor’s searching ability and its links to the consensus calculation algorithm
are crucial in determining the efficiency with which contigs can be checked and corrected.
The consensus is calculated “on the fly” and changes in response to edits. For editing, the
most important search functions are those which reveal problems in the consensus whilst
ignoring all bases that are adequately well determined. The default search type is therefore
by consensus quality. By default this is done in the forward direction and for a quality value
of 30, although this is configurable by changing the following lines in the gaprc file.
set_def CONTIG_EDITOR.SEARCH.DEFAULT_TYPE       consquality
set_def CONTIG_EDITOR.SEARCH.DEFAULT_DIRECTION  forward
set_def CONTIG_EDITOR.SEARCH.CONSQUALITY_DEF    30

Selecting “Next Search” brings up a window which can remain present during normal
editor operation. The window allows the user to select the direction of search, the type of
search, and a value to search on. The value is entered into a value text box, then pressing
the “search” button performs the search. If successful, the cursor is positioned accordingly.
An audible tone indicates failure. Pressing the “Cancel” button removes the search window.
The search window is automatically removed when the contig editor is exited.

The “Cutoffs” button can be used to select whether or not searching should find matches
within the cutoff data.
The Control-s key binding in the editor is equivalent to searching forward for the next
match. The Escape Control-s key sequence performs a reverse search. Both key bindings
will bring up the search window if it is not currently displayed.
As is described below, there are fourteen different search modes.

2.6.6.1 Search by Position
The presence of padding characters in the consensus can greatly alter the length of the
sequence, and the positions of the bases along it. Positions can therefore be defined in two
ways: those which include pads and those which do not. This option (termed a search!)
moves the cursor to a specified position. The numeric position is specified in the value text
box. Eg a value of “1234” causes the cursor to be placed at base number 1234 in the contig.
Positioning within a reading is achieved by prefixing the number with the “@” character, eg
“@123” positions the cursor at base 123 of the sequence in which the cursor lies. Relative
positions can be specified by prefixing the number with a plus or minus character. Eg
“+1234” will advance the cursor 1234 bases. If possible, the cursor is positioned within the
same sequence. The direction buttons have no effect on this operation.
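The three forms of value accepted by this search can be told apart by their prefix. The following Python sketch shows one way of classifying them; the function name and the returned labels are hypothetical and are not taken from Gap4.

```python
import re

def parse_position(value):
    """Classify a 'Search by Position' value.

    Returns (kind, number) where kind is 'contig' for a plain number,
    'reading' for an '@' prefix and 'relative' for a '+' or '-' prefix.
    """
    match = re.fullmatch(r"\s*([@+-]?)(\d+)\s*", value)
    if match is None:
        raise ValueError(f"not a valid position: {value!r}")
    prefix, digits = match.groups()
    number = int(digits)
    if prefix == "@":
        return ("reading", number)
    if prefix == "+":
        return ("relative", number)
    if prefix == "-":
        return ("relative", -number)
    return ("contig", number)

print(parse_position("1234"))   # ('contig', 1234)
print(parse_position("@123"))   # ('reading', 123)
print(parse_position("+1234"))  # ('relative', 1234)
```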

2.6.6.2 Search by Problem
This positions the cursor at the next place in the consensus sequence which is “*”, “-” or
“N”. The search can be performed either forwards or backwards from the current cursor
position. Obviously the characters appearing in the consensus depend on the selected
consensus calculation algorithm and the thresholds set.

2.6.6.3 Search by Annotation Comments
This positions the cursor at the start of the next tag which has a comment containing the
string specified in the value box. Only currently active tag types are searched. The search
performed is a regular expression search, and certain characters have special meaning. Be
careful when your string contains “.”, “*”, “[”, “]”, “\”, “^” or “$”. The search can be
performed either forwards or backwards from the current cursor position. Searching with
an empty value will find all tags.
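The effect of the special characters is easiest to see with an example. The sketch below uses Python's re module purely for illustration; the regular expression dialect used by Gap4 may differ in detail, and the comments searched are invented for the example.

```python
import re

# Hypothetical tag comments.
comments = ["primer 5.2 (walk)", "repeat [ALU]", "primer 512"]

def find_comments(pattern, comments):
    """Return the comments containing a match for the regular expression."""
    regex = re.compile(pattern)
    return [c for c in comments if regex.search(c)]

# "." matches any character, so "5.2" also finds "512".
print(find_comments("5.2", comments))

# Escaping the special characters restricts the search to the literal text.
print(find_comments(re.escape("5.2"), comments))

# An empty pattern matches every comment.
print(find_comments("", comments))
```

In the value box the same effect as escaping is obtained by typing a backslash before each special character, e.g. “5\.2”.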

2.6.6.4 Search by Tag Type
This positions the cursor at the start of the next tag of the specified type. If the tag type
is not active, the tag will be found and underlined but will remain invisible. To change the
type, select from the menu that pops up when the mouse is clicked on the button labeled
“Type:”. The search can be performed either forwards or backwards from the current cursor
position. To find all tags, use “Search by Annotation Comments”, with an empty text box.

2.6.6.5 Search by Sequence
This positions the cursor at the start of the next segment of sequence that matches the
value specified in the text box. The search is case insensitive, ignores pads, and can allow a
specified number of mismatches. It may be performed on sequence only, consensus only or
both. It also operates either forwards or backwards from the current editing cursor position.
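A forward search of this kind can be sketched in Python as below. This is not the Gap4 algorithm, only a straightforward illustration of a match that ignores case, skips pads (“*”) and tolerates a given number of mismatches; the function name and arguments are assumptions.

```python
def find_sequence(sequence, query, max_mismatches=0, start=0):
    """Return the padded index of the first match at or after start, or -1.

    Pads ('*') in sequence are skipped and case is ignored. Up to
    max_mismatches differing bases are tolerated.
    """
    seq = sequence.upper()
    query = query.upper()
    if not query:
        return -1
    for i in range(start, len(seq)):
        if seq[i] == "*":
            continue                      # a match cannot start on a pad
        mismatches, matched, j = 0, 0, i
        while j < len(seq) and matched < len(query):
            if seq[j] != "*":             # pads do not consume query bases
                if seq[j] != query[matched]:
                    mismatches += 1
                    if mismatches > max_mismatches:
                        break
                matched += 1
            j += 1
        if matched == len(query) and mismatches <= max_mismatches:
            return i
    return -1

print(find_sequence("acg*tTGCA", "GTTG"))      # 2
print(find_sequence("ACGTACGT", "ACCT", 1))    # 0
```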

2.6.6.6 Search by Quality
This positions the cursor at the next place in the consensus sequence where the consensus
for each of the two strands disagree. Where there is only data for one strand the search
will stop at every base. The search can be performed either forwards or backwards from
the current cursor position.

2.6.6.7 Search by Consensus Quality
This positions the cursor on the consensus at the next position where the quality of the
consensus is below a given threshold. The quality of the consensus is calculated by the
consensus algorithm. For this search the quality threshold should be entered into the value
box and should be within the range of 0 to 100 inclusive.

2.6.6.8 Search by file
This steps the cursor through a set of positions specified in a file. The format for the
positions in the file is one per line with each line consisting of a reading name, a position
within that reading, and an optional comment. If a position is relative to the start of the
contig rather than the start of any particular reading, then simply use the first reading in
the contig. Positions that are beyond the ends for the reading are still valid, although the
editing cursor is moved onto the consensus sequence.
The comment can consist of any string. Multiline comments are possible, but they
must be written using \n in the comment string rather than an actual newline character
(which would signify the start of the next record). The comment for the current position
is displayed at the bottom of the editor search window in a text panel which is visible only
when in the “search by file” mode.
Any record containing a reading name that is not in the current contig is silently ignored.
This allows for a search file to have positions for all contigs. However at present there is no
mechanism for stepping through an entire search file bringing up editors for each contig as
required. This will be implemented in the future.
An example file follows.
xb63c7.s2  102
xb63c7.s2  30    A multi-\nline comment.
xb32a2.s1  56    Oligo, of length 12
xa17b1.r1  5714  Repeat from 5714 to 5780
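Files in this format are simple to generate or check with a short script. The Python sketch below parses the records described above; it assumes that the fields are separated by white space, and the function name is hypothetical.

```python
def parse_search_file(text):
    """Parse 'search by file' records.

    Each record is a reading name, a position within that reading and
    an optional comment in which the two characters backslash-n stand
    for a newline. Returns a list of (name, position, comment) tuples.
    """
    records = []
    for line in text.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 2:
            continue                       # skip blank or incomplete lines
        name, position = fields[0], int(fields[1])
        comment = fields[2].replace("\\n", "\n") if len(fields) > 2 else ""
        records.append((name, position, comment))
    return records

example = (
    "xb63c7.s2 102\n"
    "xb63c7.s2 30 A multi-\\nline comment.\n"
    "xb32a2.s1 56 Oligo, of length 12\n"
    "xa17b1.r1 5714 Repeat from 5714 to 5780\n"
)
records = parse_search_file(example)
for name, position, comment in records:
    print(name, position, repr(comment))
```

Records for readings outside the current contig could then be dropped by filtering on the reading name, mirroring the way the editor ignores them.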

2.6.6.9 Search by Reading Name
This positions the cursor at the left end of the reading specified in the value text box. If
the value is prefixed with a hash sign it is assumed to be a reading number. Otherwise it is
assumed to be a reading name. Eg “#123” positions the cursor at the left end of reading
number 123. “a16a12.s1” positions at the start of reading a16a12.s1. If the value was “a16”
the cursor is positioned at the first reading which starts with “a16”.
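The rule for interpreting the value can be sketched as follows. The list of readings and the function name are invented for this example; only the “#” prefix and the prefix matching of names are taken from the description above.

```python
def find_reading(value, readings):
    """Return the (number, name) pair selected by a search value, or None.

    readings is a list of (number, name) pairs in left-to-right order.
    A leading '#' selects by reading number; anything else is treated
    as a name prefix and the first reading starting with it is used.
    """
    if value.startswith("#"):
        wanted = int(value[1:])
        return next((r for r in readings if r[0] == wanted), None)
    return next((r for r in readings if r[1].startswith(value)), None)

# Hypothetical contig contents.
readings = [(7, "a15h3.s1"), (123, "a16a12.s1"), (9, "a16b02.s1")]
print(find_reading("#123", readings))  # (123, 'a16a12.s1')
print(find_reading("a16", readings))   # (123, 'a16a12.s1')
```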

2.6.6.10 Search by Edit
This positions the cursor at the next place in the contig where an edit has been made. Edits
include base insertions, deletions, replacements and confidence value changes. The search can
be performed either forwards or backwards from the current cursor position.

2.6.6.11 Search by Evidence for Edit (1)
The Evidence for Edit (1) option checks edited bases to find bases in the consensus for
which there is no evidence in the original readings. The definition of evidence is that at
least one reading had this original base call. Currently this search operates only in the
forward direction.

2.6.6.12 Search by Evidence for Edit (2)
The Evidence for Edit (2) option checks edited bases to find bases in the consensus for
which there is no evidence in the original readings. The definition of evidence is that at
least one reading from each strand had this original base call. Currently this searches only
in the forward direction.

2.6.6.13 Search by Discrepancies
This finds positions where two or more bases are above a particular quality level, but in
disagreement. The quality threshold is given in the value box and should be within the
range of 0 to 100 inclusive.
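The test applied at each position can be sketched in Python as below. The representation of a column as (base, quality) pairs is an assumption, as is the choice to treat a quality equal to the threshold as passing; Gap4 may draw the boundary differently.

```python
def find_discrepancies(columns, threshold):
    """Return the indices of columns holding high quality disagreements.

    columns is a list of columns, each a list of (base, quality) pairs.
    A column is reported when two or more different base types have a
    quality at or above threshold.
    """
    hits = []
    for index, calls in enumerate(columns):
        good = {base.upper() for base, quality in calls if quality >= threshold}
        if len(good) >= 2:
            hits.append(index)
    return hits

columns = [
    [("A", 40), ("A", 35)],   # agreement
    [("A", 40), ("C", 38)],   # two good bases in conflict
    [("G", 40), ("T", 5)],    # conflict, but one base is poor
]
print(find_discrepancies(columns, 30))  # [1]
```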

2.6.6.14 Search by Consensus Discrepancies
This finds positions where there is a significant disagreement in a particular consensus
base. Unlike “by Discrepancies” this does not look for individual base confidence values,
but rather it combines multiple bases together for each base type and searches for the second
highest confidence at any point. This is the same method used in the 2nd-highest confidence
graph (see Section 2.5.2.5 [2nd-Highest Confidence], page 145).

2.6.7 The Commands Menu
The Commands menu is available by either pressing the Commands button at the top of
the contig editor window, or by pressing the Control key and the left mouse button, or by
pressing right mouse button with the mouse cursor anywhere within the sequence display
section of the contig editor. A menu will be revealed containing the following options (which
are described in greater detail below).

2.6.7.1 Search
This Contig Editor Commands menu function performs a search. See Section 2.6.6 [Searching], page 174.

2.6.7.2 Create Tag
This Contig Editor Commands menu function creates an annotation. See Section 2.6.5
[Annotations], page 171.

2.6.7.3 Edit Tag
This Contig Editor Commands menu function edits an annotation. See Section 2.6.5 [Annotations], page 171.

2.6.7.4 Delete Tag
This Contig Editor Commands menu function removes an annotation. See Section 2.6.5
[Annotations], page 171.

2.6.7.5 Save Contig
This Contig Editor Commands menu function writes any edited data to disk. The undo
history is cleared and it is no longer possible to quit and abandon these saved changes.
Pressing Control-x followed by Control-s will also save the contig in the same manner as the
Save command.

2.6.7.6 Dump Contig to File
This Contig Editor Commands menu function outputs the current contig, as currently shown
(e.g. with status lines) to a file. The user can select the region to dump, the length of each
line, and the file name to use. The sequence names can be up to 40 characters, but often
projects do not use the full length. To avoid wasted space in the output the number of
columns to use for sequence names can be adjusted.

2.6.7.7 Save Consensus Trace
This Contig Editor Commands menu function produces a trace file for the consensus sequence by averaging the traces of the readings. The command brings up a dialogue containing controls to specify the filename, the consensus start and end positions, the strand,
and whether to use matching reads.
As the trace of a reading is dependent on the direction it was read, the consensus trace
can be computed from all the reads in either the forward or reverse directions, but not both
at once. When the “Use only matching reads” toggle is set to “Yes” only the readings of
the correct strand that have the same base call as the consensus sequence are used. The
option is useful for producing wild-type trace files for a mutation analysis project.

2.6.7.8 List Confidence
This Contig Editor Commands menu function operates in a very similar manner to the main
Gap4 List Confidence command (see Section 2.11.6 [List Confidence], page 261), except that
it only operates on the current contig, and it uses the current editor consensus confidences
rather than the ones saved to disk. It displays a dialogue requesting a range within the
contig and a question asking if only a summary of the results is required.
Pressing OK or Apply will add to the editor information line a count of the expected
number of errors and the error rate. If the “Only update information line” question was
answered “No” then the full frequency table will also be output. It will appear in the main
text output window in the same format as the “List Confidence” command in the main
Gap4 View menu. The Apply button can be used to calculate the number of errors without
removing the dialogue.
It is often the very ends of contigs (which are generally low coverage and bad quality)
that have most of the errors, and so it is sometimes useful to set a range which includes all
of the contig except for around 1000 bases from each end.
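
The arithmetic behind the expected error count can be sketched as follows. This is a minimal illustration, assuming the consensus confidences are phred-style values (error probability 10^(-c/10)); the function name and the trim parameter are hypothetical and not part of Gap4.

```python
def expected_errors(confidences, trim=0):
    """Return (expected error count, error rate) for a list of consensus
    confidence values, ignoring `trim` bases at each end of the contig.

    Assumption: confidences are phred-style, so P(error) = 10 ** (-c / 10).
    """
    region = confidences[trim:len(confidences) - trim]
    if not region:
        return 0.0, 0.0
    errors = sum(10 ** (-c / 10) for c in region)
    return errors, errors / len(region)


# Ten bases with poor quality ends and a good quality middle.
conf = [10, 10] + [30] * 6 + [10, 10]
print(expected_errors(conf))          # whole contig
print(expected_errors(conf, trim=2))  # poor quality ends excluded
```

Excluding the ends, as suggested above, removes the bases that contribute most of the expected errors in this toy example.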

2.6.7.9 Report Mutations
This Contig Editor Commands menu function is used to produce a list of all the bases
annotated with mutation tags (or those bases which differ from the consensus/reference
sequence). If the tags or differences are within segments of sequence which are also annotated with EMBL feature table CDS records, the report will include data describing its
effect. The report, which can be sorted by sequence or position, includes the reading names,
mutation positions relative to the reference sequence, the actual change, its effect, and the
evidence. An example is shown below.

001321_11aF  33885T>Y  (silent F)           (strand - only)
001321_11aF  34407G>K  (expressed E>[ED])   (strand - only)
001321_11cF  35512T>Y  (silent L)           (double stranded)
001321_11cF  35813C>Y  (expressed P>[PL])   (double stranded)
001321_11dF  36314A>R  (expressed E>[EG])   (double stranded)
001321_11eF  36749A>R  (expressed K>[KR])   (double stranded)
001321_11eF  37313T>K  (noncoding)          (strand - only)
000256_11eF  36749A>G  (expressed K>R)      (double stranded)

Here the first record is for reading 001321_11aF, position 33885, where T has changed to
T and C (i.e. the base is heterozygous), producing no amino acid change, with evidence
coming only from the complementary strand. The last record is for reading 000256_11eF,
position 36749, where A has changed to G, producing an amino acid change from K to R, with
evidence from both strands of the sequence. The penultimate record denotes a heterozygote
in a noncoding region.
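
The mutation codes in the report can be decoded mechanically. The sketch below is illustrative only: it assumes the codes use the standard IUPAC nucleotide ambiguity letters (so Y means C or T), and the function name and the returned dictionary layout are hypothetical.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def parse_mutation(code):
    """Split a code such as '33885T>Y' into position, reference base,
    the bases observed, and a flag saying whether it is heterozygous."""
    m = re.fullmatch(r"(\d+)([ACGT])>([ACGTRYKMSWBDHVN])", code)
    if m is None:
        raise ValueError(f"unrecognised mutation code: {code!r}")
    bases = IUPAC[m.group(3)]
    return {
        "position": int(m.group(1)),
        "ref": m.group(2),
        "bases": bases,
        "heterozygous": len(bases) > 1,
    }


print(parse_mutation("33885T>Y"))
print(parse_mutation("36749A>G"))
```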

2.6.7.10 Select Primer
This Contig Editor Commands menu function allows the user to employ the primer selection
algorithm OSP to find primers for sequencing experiments. See Section 2.6.10 [Searching
for Primers], page 187.

2.6.7.11 Align
This Contig Editor Commands menu function performs a sequence alignment between the
currently selected segment of a reading and the consensus sequence. It provides a simple
way of extending the visible part of a reading to use its hidden data, which is often useful to
double strand a short section of consensus without the need to perform further experiments.
On a sequence, highlight the cutoff data to align along with a small section of the good
quality non-cutoff data. Then select the align command and adjust the cutoff point as
desired. Pads are inserted in the consensus and readings as necessary, although pads will
not be inserted in the cutoff data of other sequences.

2.6.7.12 Remove Reading
This Contig Editor Commands menu function marks a reading for subsequent removal. See
Section 2.6.9 [Removing readings from the contig], page 186.

2.6.7.13 Break Contig
This Contig Editor Commands menu function breaks the contig so that the reading underneath the editing cursor is the left end of a new contig. In order to perform this operation
all edits are saved automatically first. Once saved these edits cannot be undone. This
operation is identical to the Break Contig command in the main menu. See Section 2.9.1.1
[Break Contig], page 239.

2.6.8 The Settings Menu
The purpose of this menu is to configure the operation of the contig editor, including the
consensus calculation, the active tags and the status lines. Settings can be saved using the
“Save settings” button, but this does not save any tag macros. These may be saved using
the “Save Macros” option. Settings for the following options can be changed.

• Status Line
• Trace Display
• Consensus algorithm
• Highlight Disagreements
• Compare Strands
• Toggle auto-save
• 3 Character Amino Acids
• Show reading quality
• Show consensus quality
• Show edits
• Show unpadded positions
• Show template names
• Set Active Tags
• Set Output List
• Set Default Confidences
• Store Undo: Set or unset saving of undo

2.6.8.1 Status Line
The contig editor can display several additional text lines underneath the consensus sequence. This “status” data is of textual form and can provide additional information about
the data displayed above. Currently, there are two forms of status line available. These are
“Strands” and “Translate Frame”. Both status line types update automatically as edits are
made that change the consensus.
The status line menu is accessed by cascading off the settings menu. It contains the
following.
• Show Strands
• Translate using feature tables
• Translate frame 1+
• Translate frame 2+
• Translate frame 3+
• Translate frame 1-
• Translate frame 2-
• Translate frame 3-
• Translate + frames
• Translate - frames
• Translate all frames
• Remove all

“Show Strands” creates a single line consisting of the +, -, = and ! characters. These
indicate: positive strand only, negative strand only, both strands (in agreement) and both
strands (in disagreement) respectively.
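
The meaning of the four characters can be sketched in a few lines. The input format here (one string of calls per strand, with a space where that strand has no coverage) is an assumption made for the illustration and is not how Gap4 stores its data.

```python
def strand_status(plus, minus):
    """Build a 'Show Strands' style line from per-position calls.

    `plus` and `minus` are equal-length strings holding the call made from
    each strand, with a space where that strand has no coverage.
    """
    if len(plus) != len(minus):
        raise ValueError("strand strings must have the same length")
    out = []
    for p, m in zip(plus, minus):
        if p != " " and m != " ":
            out.append("=" if p == m else "!")  # both strands present
        elif p != " ":
            out.append("+")                     # positive strand only
        elif m != " ":
            out.append("-")                     # negative strand only
        else:
            out.append(" ")                     # no coverage at all
    return "".join(out)


print(strand_status("ACG T", " CGAA"))  # prints "+==-!"
```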

The frame translation status lines provide translations in each of the six available reading
frames. Alternatively, using the “Translate using feature tables”, only segments described
in CDS records will be translated. The CDS records are those contained in the reference
sequence. Translations can be displayed in either the single character or the three character
amino acid codes.
Pressing the right mouse button on the ’name’ segment of the status line (on the left
hand side) pops up a menu. The commands available may depend on the type of the status
line chosen, however currently it will always only contain the “Remove” command. This,
as expected, removes the status line from the display. To remove all status lines use the
“Remove all” command from the “Status Line” cascading menu.
Note that the data in the status line cannot be cut and pasted, modified or searched; it
is not possible to move the cursor into these lines.

2.6.8.2 Trace Display
This is a cascading menu containing various options for configuring the trace views within
the editor.

2.6.8.3 Auto-display Traces
When switched on, auto-display traces will direct certain searches to automatically display
relevant traces to aid in solving problems. This works in conjunction with most appropriate
searches. The traces chosen to solve the “problem” will, by default, be the best trace from
each strand which agrees with the consensus (which is calculated at a low consensus cutoff)
and the best trace from each strand which disagrees with the consensus. This selection
of traces may be adjusted by modifying the CONTIG_EDITOR.AUTO_DISPLAY_TRACES_CONF
configuration variable. The default setting of this is “+ - +d -d”. Each of the space separated elements in this string corresponds to a trace file to choose. If one cannot be found,
then it is ignored. The order listed here is the order in which they will be displayed in the
trace window. The complete list of available trace specifiers is:
+       Best +ve strand trace agreeing with consensus
+p      Best +ve strand dye-primer trace agreeing with consensus
+t      Best +ve strand dye-terminator trace agreeing with consensus
-       Best -ve strand trace agreeing with consensus
-p      Best -ve strand dye-primer trace agreeing with consensus
-t      Best -ve strand dye-terminator trace agreeing with consensus
d       Best trace disagreeing with consensus
+d      Best +ve strand trace disagreeing with consensus
-d      Best -ve strand trace disagreeing with consensus
+2      Second best +ve strand trace agreeing with consensus
+2p     Second best +ve strand dye-primer trace agreeing with consensus
+2t     Second best +ve strand dye-terminator trace agreeing with consensus
-2      Second best -ve strand trace agreeing with consensus
-2p     Second best -ve strand dye-primer trace agreeing with consensus
-2t     Second best -ve strand dye-terminator trace agreeing with consensus
2d      Second best trace disagreeing with consensus
+2d     Second best +ve strand trace disagreeing with consensus
-2d     Second best -ve strand trace disagreeing with consensus
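
The structure of a CONTIG_EDITOR.AUTO_DISPLAY_TRACES_CONF value can be illustrated with a small parser. This is only a sketch: the grammar (optional strand, optional "2", optional "p", "t" or "d") is inferred from the specifiers listed above, and the function names and dictionary keys are hypothetical rather than anything Gap4 itself provides.

```python
import re


def parse_trace_spec(spec):
    """Decode one trace specifier such as '+', '-2t' or '+d'."""
    m = re.fullmatch(r"([+-]?)(2?)([ptd]?)", spec)
    if m is None:
        raise ValueError(f"unrecognised trace specifier: {spec!r}")
    strand, second, kind = m.groups()
    if not strand and kind != "d":
        # In the list above only disagreeing traces may omit the strand.
        raise ValueError(f"unrecognised trace specifier: {spec!r}")
    return {
        "strand": strand or "any",
        "rank": 2 if second else 1,
        "agrees": kind != "d",
        "chemistry": {"p": "dye-primer", "t": "dye-terminator"}.get(kind, "any"),
    }


def parse_conf(value):
    """Decode a space separated list, keeping the display order."""
    return [parse_trace_spec(s) for s in value.split()]


for entry in parse_conf("+ - +d -d"):  # the default setting
    print(entry)
```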

2.6.8.4 Show Read-pair Traces
When double-clicking on a sequence to view a trace this option will automatically identify
traces on both strands of this template. Both the forward strand and reverse strand traces
will then be shown, in that order.

2.6.8.5 Auto-diff Traces
Once this is activated, whenever the user double clicks on a base in the editor sequence
display, not only is the reading’s trace displayed, but also its designated reference trace
plus the difference between them. If its complementary reading is available, its trace and
reference trace and their differences are also displayed.
If no traces have been specified to be the reference traces then Gap4 will attempt to
automatically pick two. It chooses the highest quality pair of traces that come from the
same template and disagree with either the forward or reverse strand of the trace initially
double-clicked upon.

Trace differences display

As shown in the figure below, it is also possible to set the trace difference display to
use positive and negative references.

For further information about mutation detection, see Section 3.1 [Search for Mutations],
page 309.

2.6.8.6 Y scale differences
When performing trace alignments and differencing (using Auto-diff traces or via the manual
“difference” option in the trace display) this option controls whether to perform a trace
peak-height normalisation on both traces prior to alignment and subtraction.

2.6.8.7 Consensus Algorithm
This allows selection of the consensus algorithm to use within the Contig Editor. Like the
consensus and quality cutoff parameters, it is local to the specific editor being used. The
main Consensus algorithm option should be used to globally change the algorithm being
used. See Section 2.20.2 [Consensus Algorithm], page 299.

2.6.8.8 Group Readings
This is a cascading menu allowing the readings viewed in the editor to be sorted and grouped
by different criteria. By default the order (in Y) that readings are listed in the editor is
sorted by the position of the left-most used base. This option provides a choice of sorting by
position, strand (plus first, then minus), name, number, template and clone.
Where appropriate an automatic sub-ordering is applied. For example sorting by strand
will group the readings primarily into “+” and “-” groups, but within the group the readings
are still sorted by position. When grouping by template the sub-grouping is by strand.
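
The sub-ordering described above amounts to sorting on a compound key. The sketch below illustrates the idea; the Reading record and its field names are hypothetical, and only the two orderings spelled out in the text are shown.

```python
from collections import namedtuple

# Hypothetical record holding the fields the orderings need.
Reading = namedtuple("Reading", "name position strand template")


def group_by_strand(readings):
    """Plus strand first, then minus; sorted by position within each group."""
    return sorted(readings, key=lambda r: (r.strand != "+", r.position))


def group_by_template(readings):
    """Grouped by template; sub-grouped by strand within each template."""
    return sorted(readings, key=lambda r: (r.template, r.strand != "+"))


readings = [
    Reading("a.r", 5, "-", "a"),
    Reading("b.f", 9, "+", "b"),
    Reading("a.f", 1, "+", "a"),
    Reading("c.r", 2, "-", "c"),
]
print([r.name for r in group_by_strand(readings)])
print([r.name for r in group_by_template(readings)])
```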

2.6.8.9 Highlight Disagreements
This toggles between the normal sequence display (showing the current base assignments)
and one in which those assignments that differ from the consensus are highlighted. It makes
scanning for problems by eye much easier.
Several modes of highlighting are available: “By dots” will only display the bases that
differ from the consensus, displaying all other bases as full stops if they match or colons if
they mismatch but are poor quality. The definition of poor quality here can be adjusted
using the “Set quality threshold” option of the Settings menu. The base colours are as
normal (i.e. reflecting tags and quality).
Highlight disagreements “By foreground colour” and “By background colour” display all
base characters, but colour those that differ from the consensus. Bases which differ but are
below the difference quality threshold are not coloured. This allows easier visual scanning
of the context that a difference occurs in, but it may be wise to disable the displaying of
tags (hint: control-Q toggles tags on and off).
Finally the “Case sensitive” toggle controls whether upper and lower case bases of the
same base type should be considered as differences.
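
The “By dots” mode can be illustrated with a short sketch. It assumes that “poor quality” means a confidence strictly below the threshold; the function name and its arguments are hypothetical.

```python
def highlight_by_dots(reading, consensus, quality, threshold,
                      case_sensitive=False):
    """Render a reading in 'By dots' style against the consensus.

    '.' marks a base that matches the consensus, ':' a mismatch whose
    quality is below `threshold`, and any other mismatch is shown as the
    base itself.
    """
    out = []
    for base, cons, q in zip(reading, consensus, quality):
        if case_sensitive:
            same = base == cons
        else:
            same = base.upper() == cons.upper()
        if same:
            out.append(".")
        elif q < threshold:
            out.append(":")
        else:
            out.append(base)
    return "".join(out)


quals = [40, 40, 40, 5, 40]
print(highlight_by_dots("ACGTa", "ACCAA", quals, 10))
print(highlight_by_dots("ACGTa", "ACCAA", quals, 10, case_sensitive=True))
```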

2.6.8.10 Compare Strands
This toggles the consensus calculation routine between treating both strands together or
independently. In the independent case any difference between the two strands is shown
in the consensus as a ’-’. Hence these clashes are found as problems by the “Search by
problem” option.

2.6.8.11 Toggle auto-save
Selecting auto-save toggles the auto save feature. Initially this is turned off each time the
contig editor is invoked. Once toggled the adjacent checkbox will be set to indicate the
feature is enabled and the contig will be saved. From that point onwards the contig editor
will write its data to disk every 50 edits. Each time an auto save is performed it is announced
in the output window. Saving more frequently can still be performed manually by using
“Save Contig”.
Unlike “saves” made using the manual “Save Contig” command, the “Undo” button will
allow the user to undo edits regardless of when the last auto save occurred.

2.6.8.12 3 Character Amino Acids
By default, the codon translation within the status line displays single character amino acid
codes. Selecting “3 Character Amino Acids” will toggle the status line to display three
character amino acid codes.

2.6.8.13 Show Reading and Consensus Quality
When the quality cutoff value is 0 or higher and either of the “show reading quality” or
“show consensus quality” toggles is set, the background for bases is shaded in a grey level
dependent on their quality. There are ten levels of shading with the darkest representing
poor data and the lightest representing good data. For example, with the quality cutoff set
to 50, all bases with a quality of less than 50 are shown with a red foreground and a dark grey
background, bases with quality just above 50 will have the darkest grey background, and
bases with a quality of 100 will have the lightest background. When tags are present the
background colour is that of the tag rather than the quality.
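
The shading rule can be sketched as below. The colour values are the default CONTIG_EDITOR.QUAL0_COLOUR to QUAL9_COLOUR and QUAL_IGNORE settings. Two points are assumptions made for the illustration: that the ten shades are spread linearly between the cutoff and 100, and that bases below the cutoff use the darkest shade as their background. The function name is hypothetical.

```python
# Default shades, darkest (poor data) to lightest (good data).
QUAL_COLOURS = ["#494949", "#696969", "#898989", "#a9a9a9", "#b9b9b9",
                "#c9c9c9", "#d9d9d9", "#e0e0e0", "#e8e8e8", "#f0f0f0"]
QUAL_IGNORE = "#ff5050"  # red foreground for bases below the cutoff


def base_colours(quality, cutoff):
    """Return (foreground, background) for one base.

    A foreground of None means "use the normal base colour".
    Assumption: shades are spread linearly between `cutoff` and 100.
    """
    if quality < cutoff:
        return QUAL_IGNORE, QUAL_COLOURS[0]
    span = max(100 - cutoff, 1)
    level = min(9, (quality - cutoff) * 10 // span)
    return None, QUAL_COLOURS[level]


for q in (40, 51, 75, 100):
    print(q, base_colours(q, 50))
```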
The colours used are adjustable by modifying your ‘.gaprc’ file. The defaults are shown
below.
set_def CONTIG_EDITOR.QUAL0_COLOUR "#494949"
set_def CONTIG_EDITOR.QUAL1_COLOUR "#696969"
set_def CONTIG_EDITOR.QUAL2_COLOUR "#898989"
set_def CONTIG_EDITOR.QUAL3_COLOUR "#a9a9a9"
set_def CONTIG_EDITOR.QUAL4_COLOUR "#b9b9b9"
set_def CONTIG_EDITOR.QUAL5_COLOUR "#c9c9c9"
set_def CONTIG_EDITOR.QUAL6_COLOUR "#d9d9d9"
set_def CONTIG_EDITOR.QUAL7_COLOUR "#e0e0e0"
set_def CONTIG_EDITOR.QUAL8_COLOUR "#e8e8e8"
set_def CONTIG_EDITOR.QUAL9_COLOUR "#f0f0f0"
set_def CONTIG_EDITOR.QUAL_IGNORE  "#ff5050"

2.6.8.14 Show edits
When set, any change between the bases displayed and the original sequence held in the
trace files is shown by changing the background colour of the changed base. The detection
of these edits depends on the quality values and the “original position” data. Hence the
traces do not need to be present in order to detect edits. The colour of the bases reflects the
type of change found. The colours are adjustable by editing the ‘.gaprc’ file. The following
table lists the colour, gaprc variable name and the meaning.
red      CONTIG_EDITOR.EDIT_DEL_COLOUR    Deletion
pink     CONTIG_EDITOR.EDIT_BASE_COLOUR   Base change or insertion
green    CONTIG_EDITOR.EDIT_PAD_COLOUR    Padding character
purple   CONTIG_EDITOR.EDIT_CONF_COLOUR   Confidence value

2.6.8.15 Show Unpadded Positions
The ruler at the top of the contig editor displays every tenth base number in the consensus
sequence. Without “show unpadded positions” enabled any character in the consensus is
counted, including padding characters. If “show unpadded positions” is enabled the ruler
will only count non pad (“*”) characters. Please note that this may considerably slow down
the editor on large databases as the full consensus needs to be calculated in order to plot
the ruler. If you just need to obtain the occasional unpadded position it is better to press
the Enter key or to use the “unpadded position” search.
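
The difference between padded and unpadded numbering can be shown with a short sketch. It assumes 1-based positions and that a pad reports the number of the preceding real base; the function name is hypothetical.

```python
def unpadded_position(consensus, padded_pos):
    """Convert a 1-based padded position to its unpadded equivalent by
    counting the non-pad characters up to and including that position."""
    if not 1 <= padded_pos <= len(consensus):
        raise ValueError("position lies outside the consensus")
    return sum(1 for c in consensus[:padded_pos] if c != "*")


consensus = "AC*GT**A"
for pos in (2, 4, 8):
    print(pos, unpadded_position(consensus, pos))
```

Every conversion scans the consensus up to the requested position, which gives some idea of why keeping the whole ruler unpadded is costly on large contigs.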

2.6.8.16 Show Template Names
The names panel on the left hand side of the editor normally shows the reading names.
This option may be used to toggle this display to show the template names instead. When
enabled the trace display also switches from showing reading names to template names.

2.6.8.17 Set Active Tags
“Set Active Tags” allows configuration of which tag types should be displayed within the
editor. Note that searches for tag annotations will only examine active tags, but searching
for a specific tag type will find tags even when tags of this type are not visible. In this
situation the tag will still be invisible, but as usual the tag location will be underlined.
This option is particularly useful for exploring cases where a section of sequence has many
overlapping tags. An alternative to using this dialogue is using the Control-Q key, which
toggles the display of active tags.

2.6.8.18 Set Output List
“Set output list” pops up a dialogue asking for a list name to be used when outputting
reading names (see Section 2.14 [Lists], page 278). Once an output list has been specified,
pressing the middle button, or Alt left mouse button, on a reading name will add the name
to the end of the list. Note that selecting the same name more than once will add the name to
the list more than once. The list is never cleared by the editor. This allows multiple editors
to append to the same list. If required, use the list menu to clear the list.

2.6.8.19 Set Default Confidences
Replacing bases or inserting new bases in the editor can assign new confidence values to
those bases. The default setting is to set these confidence values to 100 which has the effect
of forcing the consensus to be that base. The “Set Default Confidences” dialogue allows
these default values to be changed. The allowable range of confidence values for a base
is from 0 to 100 inclusive. The dialogue also allows selection of confidence -1. This tells
the editor to not change the confidence value. When replacing a base this keeps the same
confidence value of the base that is being replaced. When inserting a base this uses the
average of the confidence values of the two surrounding bases.
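
The effect of the default confidence setting, including the special value -1, can be sketched as follows. The function names are hypothetical, and rounding the average down to an integer is an assumption.

```python
KEEP = -1  # "do not change the confidence value"


def replaced_confidence(old_conf, default=100):
    """Confidence given to a base that replaces an existing one."""
    return old_conf if default == KEEP else default


def inserted_confidence(left_conf, right_conf, default=100):
    """Confidence given to a newly inserted base."""
    if default == KEEP:
        # Assumed integer average of the two surrounding bases.
        return (left_conf + right_conf) // 2
    return default


print(replaced_confidence(37))              # default of 100 is applied
print(replaced_confidence(37, default=-1))  # original value is kept
print(inserted_confidence(20, 41, default=-1))
```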

2.6.8.20 Set or unset saving of undo
Storing the undo information takes up a great deal of computer memory and slows down the
alignment algorithm. Particularly when using the Join Editor for very large overlaps (e.g.
after copying batches of readings from one database to another), it can be useful to turn
off the saving of undo information. For this reason the settings menu contains an option to
turn off (or on) the saving of undo information.

2.6.9 Removing Readings
It is often desirable to completely remove a reading from a contig. When not using the editor
this is typically performed using the Disassemble Readings function. See Section 2.9.1.2
[Disassemble Readings], page 240. When using the editor, the “Remove Reading” option
on the editor commands menu performs a similar task.
The command marks the reading underneath the editing cursor to be removed once the
editor is quitted. Until then, the reading number in the names section of the display is
shown with a dark grey background. The reading will also not be used in the calculation
of the consensus. Thus, if all readings at a particular section of consensus are marked for
removal the consensus sequence will be shown as dashes. Selecting the “Remove Reading”
command again with the editing cursor on a reading already marked for removal will cancel
the removal request. The keyboard command of Control-H may also be used as a shortcut
to the “Remove Reading” command.
Once the editor has been quitted you will be asked whether you wish to disassemble the
marked readings. Answering “No” will simply quit the editor as normal without removing
any readings. Answering “Yes” will bring up the usual “Disassemble Readings” dialogue.
The options here allow removal of all readings from this contig, or non-crucial only. A crucial
reading is one that will cause this contig to be broken into two or more segments. A choice
is also given as to whether the readings should be completely removed from this database,
or for each reading to be placed in its own contig. Pressing “OK” now will remove the
readings from the contig, breaking the contig if necessary, and will quit the editor. Pressing
“Cancel” will close the “Disassemble Readings” dialogue without making any changes and
will not quit the editor.
At any time, quitting the editor and not disassembling the readings will leave a List
(see Section 2.14 [Lists], page 278) named “disassemble” containing the readings marked
for removal. These may then be disassembled at a later stage if necessary. However the list
will only be available until the next editor is quit (at which stage that editor will create its
own, possibly blank, disassemble list), so make a copy if necessary.

2.6.10 Primer Selection
The oligo selection engine is the one used in the program OSP. It is described in Hillier,
L., and Green, P. (1991). “OSP: an oligonucleotide selection program,” PCR Methods and
Applications, 1:124-128. Oligo selection is a complex operation. The normal mode of use
is outlined below:
1. Open the oligo selection window, by selecting “Select Primer” from the contig editor
commands menu.
2. Position the cursor to where you want the oligo to be chosen. While the oligo selection
window is visible, you will still have complete control over positioning and editing
within the contig editor.
3. Indicate the strand for which you require an oligo. This is done by toggling the direction
arrows (“----->” or “<-----”).
4. Press the “Find Oligos” button to find all suitable oligos (see the “Parameters” subsection below for further information on controlling this procedure). Information for
the closest suitable oligo to the cursor position is given in the output text window and
at the bottom of the editor in the information line. In the contig editor the position of
the oligo is marked by a temporary tag on the consensus. The window is recentered if
the oligo is off the screen.
5. If this oligo is not suitable (it may have been used before, and failed) the next closest
oligo can be viewed by pressing “Next”.
6. Suitable templates are automatically identified for the currently displayed oligo (see the
“Template selection” subsection below). By default, the template is that closest to the
oligo site. If the choice is not suitable (it may be known to be a poor quality template,
say) another can be chosen from the “Choose from” pull-down menu. Templates that
do not appear on the menu can be specified by simply typing their name in
the “Template name” entry box. However, the template must be on the correct strand
and be upstream of the oligo.
7. A tag can be created for the current oligo by pressing the button “Accept”. The
annotation for this tag holds the name of the template and the oligo primer sequence.
There are fields to allow the user to specify their own primer name (“serial#”) and
comments (“flags”) for this tag. An example of oligo tag annotation:
serial#=
template=a16a9.s1
sequence=CGTTATGACCTATATTTTGTATG
flags=
8. The oligo selection window is closed when “Accept” or “Quit” is selected.
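
The oligo tag annotation shown in step 7 is a simple list of key=value lines, so it is easy to read back in a script. The sketch below is illustrative; the function name is hypothetical and it assumes one field per line.

```python
def parse_oligo_tag(text):
    """Turn an oligo tag annotation into a dictionary of its fields."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {line!r}")
        fields[key] = value
    return fields


tag = parse_oligo_tag("""serial#=
template=a16a9.s1
sequence=CGTTATGACCTATATTTTGTATG
flags=""")
print(tag["template"], tag["sequence"])
```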

2.6.10.1 Parameters
The parameters controlling the selection of oligos can be changed by pressing the “Edit
parameters” button. This invokes a dialogue box which allows the specification of further
parameters.
By default, the oligos are selected from a window that extends 40 bases either side of the
cursor. The size and location of this window relative to the cursor position can be changed
in the “Edit parameters” window.
Primer constraints can be specified by melting temperature, length and G+C content.
In gap4 oligos are ranked according to their overall score, where the best oligos have
lower scores.

2.6.10.2 Template selection
For simplicity, each reading is considered to represent a template. In practice, many readings
can be made off the same template. Suitable templates that are identified are those that
satisfy all of the following conditions:
1. are in the appropriate sense,
2. have 5’ ends that start upstream of the oligo,
3. are sufficiently close to the oligo to be useful.
This last criterion relates to the insert size for the templates used for sequencing and the
average reading length. A template is considered useful if a full reading can be made from
it, taking into account both of these factors. The default insert size is 1000 bases (although
the size range should be included in the experiment file for each reading, and hence the
default would not be required), and the default average reading length is 400 bases. These
values can be changed in the “Edit parameters” window.
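
The three conditions can be sketched as a single test. This is not the OSP or gap4 implementation: the manual does not give the exact distance rule, so the sketch assumes that a template is close enough when a full reading started at the oligo still ends inside the insert. The function name, the argument names and the use of contig coordinates for the template 5’ end are all hypothetical.

```python
def template_is_suitable(template_start, template_strand,
                         oligo_pos, oligo_strand,
                         insert_size=1000, read_length=400):
    """Apply the three template conditions to one candidate template.

    `template_start` is the contig position of the template's 5' end.
    """
    # 1. The template must be in the appropriate sense.
    if template_strand != oligo_strand:
        return False
    # Distance from the template's 5' end to the oligo, measured in the
    # direction of sequencing.
    if oligo_strand == "+":
        offset = oligo_pos - template_start
    else:
        offset = template_start - oligo_pos
    # 2. The 5' end must start upstream of the oligo.
    if offset <= 0:
        return False
    # 3. Assumed closeness rule: a full reading must fit in the insert.
    return offset + read_length <= insert_size


print(template_is_suitable(100, "+", 500, "+"))  # True
print(template_is_suitable(100, "+", 800, "+"))  # False: reading would overrun
```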

2.6.11 Traces
The original trace data from which the readings were derived can be displayed by double
clicking (two quick clicks) with the left or middle mouse button on the area of interest.
Control t has the same effect. The trace will be displayed centred around the base clicked
upon and the name of the reading in the contig editor will be highlighted. Double clicking on
the consensus displays all the readings covering that position. Double clicking on a reading
which already has its trace displayed will cause the corresponding trace to be surrounded
by a red border.
Moving the mouse pointer over a base causes the display of an information line at the
bottom of the window. This gives the base type, its position in the sequence, and its
confidence value.
There are two forms of trace display which are selected using the “Compact” button at
the top of the Trace display. The compact form differs by not showing the Info, Diff, Comp.
and Cancel buttons at the left of each trace.
Note that gap4 does not store the trace files in the project database: it stores only their
names and reads them when required. However it does not know which directory they are
stored in, unless this is specified using the “Trace File Location” option (see Section 2.20.8
[Trace File Location], page 302).

The picture shows an example of three displayed traces. The reading number, together
with the direction of the reading (+ or -) and the chemistry by which it was determined,
is given at the top left of each sub window. The chemistry information is found from
comments in the experiment file. ’uf’ and ’ur’ indicate universal forward and universal
reverse, ’cf’ and ’cr’ indicate custom forward and custom reverse, and ’p’ and ’t’ indicate
primer and terminator. There are four buttons (’Info’, ’Diff’, ’Comp.’ and ’Cancel’) below
this information, and X and Y scale bars to the right.
The “Info” button will display a window like the one shown at the bottom right of the
picture. This contains the comments from the relevant SCF file.
The “Diff” buttons are used to produce a new trace showing the differences between two
existing traces. To use this, press “Diff” in any window. The mouse cursor then changes
to a cross symbol. Pressing the left mouse button anywhere on another trace that has a
“Diff” button will create the difference trace. Any other button cancels the operation. The
algorithm used for computing the difference trace is adjustable by parameters in the settings
menu (see Section 2.6.8.2 [Trace Display Settings], page 181). The trace differencing was
originally designed for visual inspection of suspected mutations (Bonfield, J.K., Rada, C.
and Staden, R. Automated detection of point mutations using fluorescent sequence trace
subtraction. Nucleic Acids Res. 26, 3404-3409, 1998).
The “Comp.” button complements the displayed trace. If the sequence in the editor
has been complemented then the trace will automatically be shown in the complementary
sense. This button may be used to toggle the complementarity.
The “Cancel” button will remove the trace.
The X and Y scale bars zoom the trace in the appropriate direction. The default Y scale
is to fit the highest peak on the screen without clipping. When the “Show confidence” checkbutton is selected, the confidence value for each base call will be displayed as a histogram,
overlaid on the trace displays. The base confidence values are not computed by gap4, but
rather are read from the SCF file which is assumed to have been generated by one of the
programs that compute confidence values (such as phred, ATQA or eba). When ABI files
are in use, confidence values may not be shown.
The trace is displayed on the right with a scrollbar directly below it and with the reading
name in the top left corner. The vertical line seen in these three traces shows the location
of the editing cursor in the contig editor window. The lock button on the trace displays
ties the editing cursor movement to the scrolling of the trace windows and vice versa.
The trace display supports the display of up to four columns of traces, and can display
any number of rows. The number of columns and rows can be configured and saved using
the buttons at the top of the window. A scrollbar is provided if there are more traces to
display than can be viewed with the current settings.
To modify the number of traces that are shown at any one time, and the heights of these,
add (and edit) the following lines to your ‘$HOME/.gaprc’ file.
set_def TRACE_DISPLAY.ROWS         5
set_def TRACE_DISPLAY.COLUMNS      2
set_def TRACE_DISPLAY.TRACE_HEIGHT 150
New traces are always added to the bottom right of the window.
Resizing the width of the trace window, moving the trace window and adjusting the X
magnification are all remembered and used when bringing up new trace displays.
The “Close” button at the top right of the Trace Display removes the Trace Display.

An example of the “Compact” form of the trace display is shown below.

2.6.12 Reference Sequence and Traces
Reference sequences can be used to provide standard base numbering for contigs. If they
have feature table tags which contain CDS records the Contig Editor can use them to
translate only the known coding segments, and in the correct reading frame. The primary
use for reference sequences is in mutation detection.
Reference traces provide standards, both positive and negative, for mutation detection
by trace comparison.

2.6.12.1 Reference sequences
In order to put readings and their mutations in context we use a reference sequence and
feature table. This enables mutations to be reported using positions defined by the reference
sequence, and also allows the effect of the mutations to be noted. To facilitate this gap4 is
able to store entries from the EMBL sequence library complete with their feature tables.
These feature tables are converted to gap4 database annotations (tags), which means that
they can be selectively displayed in the template display and editor, and used to translate
only the exons (in the correct reading frame). The reference sequence can be designated
(or reassigned) by right clicking on its name. Once set it should appear labelled “S” at the
left edge of the editor.

2.6.12.2 Reference traces
From the “settings” menu of the editor the trace display can be set to “Auto-Diff traces”.
Once this is activated, whenever the user double clicks on a base in the editor sequence
display, not only is the reading’s trace displayed, but also its designated reference trace
plus the difference between them. If its complementary reading is available, its trace and
reference trace and their differences are also displayed.

The preferred way of assigning reference traces to readings is by use of “naming conventions”; that is to have a simple set of rules which control the names given to the trace
files. It can be seen in the figures showing the editor that forward and reverse readings from
the same patient have names with a common root but which end either F or R. This both
ties the two together (so the software knows which is the corresponding complementary
trace when the user double clicks on a reading) and also enables the association of readings
and their reference traces. Once a convention has been adopted the rules can be defined
for pregap4 by loading them via the “Load Naming Scheme” option in its File menu (see
Section 4.8 [Pregap4 Naming Schemes], page 366). For any batch of readings the reference
traces are defined within pregap4’s “Reference Traces” module.
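
The pairing that such a convention makes possible can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the reading names are invented, the function name pair_by_root is hypothetical, and real naming schemes are defined within pregap4 and may be richer than a single trailing F or R.

```python
from collections import defaultdict


def pair_by_root(names):
    """Group reading names that share a root and end in F or R.

    Assumption for illustration: the final character alone marks the
    direction, and everything before it is the common root.
    """
    pairs = defaultdict(dict)
    for name in names:
        root, suffix = name[:-1], name[-1:]
        if suffix in ("F", "R"):
            pairs[root][suffix] = name
    return dict(pairs)


# Hypothetical reading names: one complete pair and one unpaired forward read.
groups = pair_by_root(["pat01_ex2F", "pat01_ex2R", "ctrlF", "misc"])
```

A root that maps to both an F and an R entry identifies a reading together with its complementary reading; names that do not follow the convention are simply left out.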
Within the Contig Editor reference traces can be set by right clicking on their names in
the editor. When this is done a menu will pop up. This allows the user to select whether
the trace is to be used as a positive or negative control.

2.6.13 Template Status Codes
Adjacent to the reading name is a coloured block indicating the reliability of the template.
Red          Strand conflict (e.g. two forward readings are assembled on opposing
             strands)
Blue         Position conflict (e.g. the start of this template can be derived at
             multiple positions due to more than one universal primer sequence,
             but at positions > 100 base pairs apart)
Pink         One end is not present in this contig, but is in another contig
Light grey   One template end sequence is not present in this database (i.e. not
             a read-pair)
Medium grey  The measured template size is too large or too small
Dark grey    Multiple problems
The “go to” and “select all readings from this template” commands (obtained by right
clicking on the reading name) are particularly useful when dealing with inconsistent templates.
The colour codes map to the (larger) set of single-letter identifiers used in the information
line (see Section 2.6.14 [The Editor Information Line], page 193). The letter codes are:
D      Distance (negative in size)
d      Distance (too large/small)
P      Primer position
S      Strand
E      Guessed start or end position of template
O      Spans contigs and contig-end distance is large
I      Spans contigs, but contig-end distance is small
?      Unknown problem

A further code indicates a template with no problems.

For templates with read-pairs spanning two contigs the distance from the end of each
contig (in the direction that the template ’reads’ in) is summed together to compute whether
a contig join is viable. This in turn yields the “O” and “I” codes.

2.6.14 The Editor Information Line
The very bottom line of the editor display is a text line used by the editor to display pieces
of useful information. Currently this gives information on individual bases, readings, the
contig, and tags, as the mouse is moved over the appropriate object. For bases (in both
readings and the consensus) this information is only displayed when a mouse button is
pressed. Pressing the left mouse button displays the information using BASE_BRIEF_FORMAT1,
and pressing ’Enter’ displays it using BASE_BRIEF_FORMAT2. By default the only difference
between the two is that ’Enter’ also displays the “unpadded position” of a base in the
consensus, i.e. its position in the consensus after pads have been removed. The contents
and format of the information displayed are completely configurable by adding the relevant
definitions to your ‘.gaprc’ file. The defaults are as follows.
set_def READ_BRIEF_FORMAT \
    {%n(#%Rn) Clone:%Cn Vector:%Tv Type:%P;%a Tmpl:%Tc %c}
set_def CONTIG_BRIEF_FORMAT \
    {Contig:%n(#%Rn) Length:%l %c}
set_def TAG_BRIEF_FORMAT \
    {Tag type:%t Direction:%d Comment:"%.100c"}
set_def BASE_BRIEF_FORMAT1 \
    {Base confidence:%c (Probability %p) Position %P}
set_def BASE_BRIEF_FORMAT2 \
    {Base confidence:%c (Probability %p) Position %P Unpadded position %U}

Tag information is shown when the mouse is moved over an annotation. Read information
is shown when the mouse is moved over the reading name in the names section of the
display. Contig information is displayed when the mouse is moved over the “Consensus”
line in the names display. If you wish to leave the contig editor window without changing
the information line contents as the mouse moves over other information press and hold the
Shift key whilst moving the mouse. This disables the automatic highlighting. The same
mechanism also works for other windows (such as the template display).
The general style of the formats is a string to display, in which % expansions are replaced
by particular values. For instance, in the reading format %n is replaced by the reading
name. The general format of a % expansion is:
• A percent sign.
• An optional minus sign to request left alignment of the information. When information
is displayed in a field that it does not completely fill, it is right justified by default.
Adding a minus character here requests left justification.
• An optional minimum field width. This is a decimal number indicating how much space
to leave for this information.
• An optional precision for numbers or maximum field width for strings. This is given
as a full stop followed by a decimal number.
• An optional ’R’ to specify Raw mode. This changes the meaning of many (but not all)
of the expansion requests to give a numerical representation of the data. For example
%n is a reading name and %Rn is a reading number.
• The expansion type itself. This is either one or two letters. See below for full details of
their meanings.
To programmers this syntax may seem very similar to printf. This is intentional, but
do not assume it is the same. Specifically the printf syntax of %#, %+ and %0 will not work.
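
To make the structure of an expansion concrete, the following Python sketch parses and expands a format string according to the rules listed above. It is not the editor's own code: the function name expand, the dictionary of values and the rule used to recognise two-letter types (a leading T or C, as in the template and clone requests) are assumptions made for illustration.

```python
import re

# Assumption: two-letter expansion types start with T (template) or C (clone);
# every other expansion type is treated as a single character.
_SPEC = re.compile(
    r"%(?P<left>-)?(?P<width>\d+)?(?:\.(?P<prec>\d+))?(?P<raw>R)?"
    r"(?P<type>%|[TC][a-z#]|[A-Za-z#])"
)


def expand(fmt, values):
    """Expand % requests in fmt using values, a dict keyed by expansion type.

    Raw mode requests such as %Rn are looked up under the key "Rn".
    """
    def substitute(match):
        if match["type"] == "%":
            return "%"
        key = ("R" if match["raw"] else "") + match["type"]
        value = values[key]
        text = str(value)
        if match["prec"] is not None:
            precision = int(match["prec"])
            if isinstance(value, float):
                text = f"{value:.{precision}f}"   # precision for numbers
            elif isinstance(value, str):
                text = value[:precision]          # maximum width for strings
        width = int(match["width"] or 0)
        # Right justified by default; a minus sign requests left justification.
        return text.ljust(width) if match["left"] else text.rjust(width)

    return _SPEC.sub(substitute, fmt)


# Hypothetical values for a single reading.
values = {"n": "xc04a1.s1", "Rn": 74, "Cn": "test", "Tv": "m13mp18", "c": "note text"}
line = expand("%n(#%Rn) Clone:%Cn Vector:%Tv", values)
```

With these values, the request %6Rn pads the reading number on the left to six characters, %-6Rn pads it on the right, and %.4c keeps only the first four characters of the comment.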

2.6.14.1 Reading Information
Example output is Reading:xc04a1.s1(#74) Length:295(474) Vector:m13mp18 Clone:test
Chemistry:primer Primer:forward universal.
%%

A single % sign

Reading name. Raw mode: number

Reading number

Trace name

Position

Clipped length

Total length

Start of clip

End of clip

Sense (whether complemented) - “+” or “-”. Raw mode: 0/1

Chemistry (eg “BigDyeV3”). Raw mode: integer version

Strand - “+” or “-”. Raw mode: 0/1

Primer - “unknown”, “forward universal”, “reverse universal”, “forward custom” or “reverse custom”. Raw mode: 0/1/2/3/4

%Tn

Template name. Raw mode: template number

%T#

Template number

%Tv

Template vector. Raw mode: template vector number

%Ti

Template insert size

%Tc

Template consistency (a mix of “DdPSEO?” or “ok”). Raw mode: as a number

%Cn

Clone name. Raw mode: clone number

%C#

Clone number

%Cv

Clone vector. Raw mode: clone vector number

Trace filename.

User defined text, taken from the first note of type INFO.

2.6.14.2 Contig Information
Example output is Contig:xc04a1.s1(#74) Length:1316.
%%

Single % sign

Left most reading name. Raw mode: reading number

(As %n)

Right most reading name. Raw mode: reading number

Contig number

Contig length

Expected number of errors (can be slow on large contigs)

User defined text, taken from the first note of type INFO.

2.6.14.3 Tag Information
Example output is Tag type:OLIG Direction:- Comment:"template=xc04a1 sequence=CGATTGCAGAATAAGACG".
%%

Single % sign

Tag position

Tag direction - “+”, “-” or “=”. Raw mode: 0/1/2

Tag direction - “—–>”, “<—–” or “<—->”. Raw mode: 0/1/2

Tag type (always 4 characters)

Tag length

Tag number (0 if unknown)

Tag comment

2.6.14.4 Base Information
Example output is Base confidence:13 (Probability 0.954020) Position 3805 Unpadded position 3678.
%%

Single % sign

%c

Confidence value (phred style)

%p

Confidence value (as probability)

%P

Padded consensus base position

%U

Unpadded consensus base position

2.6.15 The Join Editor
Contigs are joined interactively using the Join Editor. This is simply a pair of contig editor
displays stacked above one another with a “differences” line in between. Note that it is
essential to align the contigs over the full length of their overlap. It is much more difficult
to achieve this after a join has been made, and until the alignment is correct, the consensus
sequence will be nonsense.
The few differences between the Join Editor and the Contig Editor can be seen in the
figure below. Otherwise all the commands and operations are the same as those for the
Contig Editor.

One difference is the Lock button. When set (as it is in the illustration) scrolling either
contig, by using the scrollbar or the four movement buttons, will also scroll the other contig.
The Align button aligns the overlapping consensus sequences and adds pads. The alignment
routine assumes that the two contigs are already in approximately the right relative
position (as they are immediately after the Join Editor has been invoked from Find Internal
Joins, or Find Repeats). If they are not they must be positioned manually before using the
Align button.
The “<” and “>” buttons either side of the “Align” button perform the alignment from
the editing cursor to the start of the contig and from the cursor to the end of the contig
only. Alignment end-gaps are penalised at the cursor position but not for the alignment
end at the contig start/end position. These buttons are useful when multiple alignment
positions may be valid, as is the case with an overlap consisting entirely of a short tandem
repeat (STR).
It should be noted that each of the pair of editors comprising the Join Editor maintains
its own undo history, and using Align is likely to add to both undo histories. Hence, to
undo the results of the Align command the Undo button in both editors must be used.
Note also that storing the undo information takes up a great deal of computer memory
and slows down the alignment process. For this reason the settings menu contains an option
to turn off (or on) the saving of undo information. When aligning very long overlaps it is
advisable to turn off the undo saving.
When “Join/Quit” is pressed a dialogue box is displayed containing the percentage
mismatch of the overlap, and asking if the join should be made. For joins above a certain
level of mismatch (20 percent by default) a second confirmation is required.
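
The check made at this point can be pictured with the short Python sketch below. It is an illustration only: how the Join Editor actually counts pads and ambiguous bases when computing the mismatch is not specified here, so the sketch simply counts differing columns of two already aligned, equal-length overlap strings. The function names are hypothetical; the 20 percent default is taken from the text above.

```python
def percent_mismatch(cons_a, cons_b):
    """Percentage of differing columns between two aligned overlap strings."""
    if len(cons_a) != len(cons_b) or not cons_a:
        raise ValueError("overlaps must be aligned to the same non-zero length")
    differences = sum(1 for a, b in zip(cons_a, cons_b) if a != b)
    return 100.0 * differences / len(cons_a)


def needs_second_confirmation(mismatch, threshold=20.0):
    """True when the mismatch is above the threshold (20 percent by default)."""
    return mismatch > threshold


# Two hypothetical aligned overlaps differing in two of ten columns.
mismatch = percent_mismatch("ACGT*ACGTA", "ACGTTACGTC")
```

Here the two overlaps differ in two of ten columns, giving a mismatch of 20 percent, which is not above the default threshold.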

2.6.16 Using Several Editors at Once
Several editors can be used simultaneously, even on the same contig. In the latter case, it
is useful to understand the difference between the data and the view of the data.
Each operating Contig Editor is a view of the data for a particular contig. With two
editors viewing the same contig, making changes in either will affect the data that both are
viewing, hence the change will be visible in both editors. Similarly, using Undo in either
will undo the changes to both.
When quitting and saving changes, other editors for the same contig will act as if a “Save
Contig” request has been made by using the “Commands” menu (ie changes are written to
disk and the undo information will be reset). Answering “no” to the “Save changes” query,
simply shuts down the editor without saving. If there is no other editor for this contig then
the changes will be lost, otherwise the changes will be retained until the last editor for the
contig is exited.
Interaction between Contig Editors and Join Editors is more complicated and generally
isn’t advised. However such interactions work consistently with the notion of views of
contigs. For example, suppose there are two Contig Editors open on two separate contigs,
and in addition to these a Join Editor displaying both contigs. Making the join in the Join
Editor will update the two stand-alone Contig Editors so that they are each viewing the
correct positions in the new contig, even though they’re both now viewing the same contig.

2.6.17 Quitting the Editor
The “Quit” button quits the editor. If changes have been made since the last save (either a
“Save Contig” or an auto-save) you will be asked whether you wish to save these changes.
Answering “Cancel” abandons the quit process and returns control of the editor;
otherwise the appropriate action will be taken and the editor will quit.
Within a join editor, the “Quit” button is changed to “Join/Quit”. Pressing it will
prompt for making the join. You will be told the percentage mismatch of the overlapping
consensus sequences. The join can either be accepted, rejected, or cancelled (in which case
the editor is not quitted and the join is not made).

2.6.18 Editing Techniques
The editor documentation describes the available controls, but not how these should be used
most efficiently. Some editing is performed in a local style or is personal preference, but a
great deal of the common editing tasks are best dealt with in specific ways. This section
aims to give example methods of resolving the common problems. Typically, problems
will be found using one of the editor searches (such as “consensus quality” or “problem”).
Used in conjunction with “Auto-display traces” (see Section 2.6.8.2 [Trace Display Settings],
page 181) this will automatically bring up a set of traces that are likely to be of assistance
in resolving the problem. Prior to working on a contig it can be helpful to use “Shuffle
Pads” to try to align padding characters. See Section 2.12.3 [Shuffle Pads], page 265.

2.6.18.1 Consensus and Quality Cutoffs
The most rapid editing technique (see Section 2.2.5 [The use of numerical estimates of base
calling accuracy], page 118) is only available if base call confidence values have been assigned
to the reading data using a scale proportional to -log(error rate). Using the “confidence”
consensus method will make use of confidence values to give the most probable consensus
sequence and a probability of each base being correct. Using the editor “consensus quality”
search then provides an extremely quick way of identifying the lowest quality consensus
bases. The List Confidence command will give information on the expected number of
errors that can be fixed by examining all consensus bases with a quality less than a particular
amount. This gives a good indication of which threshold to use in the consensus quality
search. Additionally you will also be told the expected error rates. With this system it is
possible to stop editing once a particular average quality has been achieved.
Care should be taken in considering your desired error rate. An average error rate of
1 in 10,000 may be easily achievable. However there could still be consensus bases with
very low confidence. Hence it is perhaps best to choose both an average error rate and a
minimum consensus confidence for your finishing criteria. The consensus confidence values
are scaled such that a confidence of 20 is a 1 in 100 error rate, 30 is 1 in 1000, 40 is 1 in
10000 and so on.
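
The figures quoted above are consistent with the usual phred-style relationship, confidence = -10 log10(error probability). The Python sketch below works through that arithmetic and shows how an expected number of errors can be obtained by summing error probabilities. The function names are hypothetical, and the sketch is not a description of how List Confidence itself is implemented.

```python
import math


def error_probability(confidence):
    """Error probability implied by a phred-style confidence value."""
    return 10.0 ** (-confidence / 10.0)


def confidence_from_error(p_error):
    """Phred-style confidence value for a given error probability."""
    return -10.0 * math.log10(p_error)


def expected_errors(confidences, below=None):
    """Expected number of errors among the given consensus confidences.

    If below is given, only bases with confidence less than it are counted.
    """
    return sum(
        error_probability(q) for q in confidences if below is None or q < below
    )


# Four hypothetical consensus bases; only those below confidence 30 are summed.
errors_below_30 = expected_errors([10, 20, 30, 40], below=30)
```

In this example the bases with confidence 10 and 20 contribute 0.1 and 0.01 expected errors respectively, so checking every base below confidence 30 would address roughly 0.11 expected errors.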
The rest of this section describes methods to use when the aforementioned confidence
values are not available.
The Consensus and Quality cutoff values used whilst editing are personal preference.
Rather than state suggested values, we discuss the merits of using example values.
The meaning of the consensus and quality cutoff values changes slightly depending on
the consensus algorithm in use. For more information on the algorithms and these values
see Section 2.11.5 [The Consensus Calculation], page 257.
With the “Base type frequencies” and “Quality weighted base type” methods, a consensus cutoff value of 100 means that every sequence disagreement will yield a dash in the
consensus. Hence the “Next Search” button when in “problem” search mode can be used
to verify every potential problem. This is a lot of work, but if you wish to make sure that
all disagreements are checked this is the easiest way.
With a quality cutoff of -1, lowering the consensus cutoff value to (eg) 90 means that
a base in the consensus will only be a dash when over 10% of the bases disagree with the
majority at that point. So a base covered by 11 sequences, 10 of which state A and one of
which states C would not be considered a problem and would not be found by the problem
search. Note that this is regardless of the strand information. So if the As are on the positive
strand and the single C is on the negative strand then this is still not considered a problem.
However, see below.
Still working with the “Base type frequencies” and “Quality weighted base type” consensus methods, changing the quality cutoff to be 0 or more means that the consensus base
is derived from the relative quality of bases instead of simple frequency counts. A quality
cutoff of 0 and a consensus cutoff of 90 means that the base will be a dash only when the
sum of the quality values for the most common base type (defined by the highest quality
sum) is less than 90% of the total. In comparison with a quality cutoff of -1, this means
that the above example of 10 A bases and 1 C base would be considered a problem if the C
base had a sufficiently high quality.
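
The effect of the two cutoffs on a single column can be sketched as follows. This is a simplified model rather than the gap4 consensus code (see Section 2.11.5 for the real algorithms): it assumes that with a quality cutoff of -1 every base counts equally, that otherwise each base is weighted by its quality, and that bases below the quality cutoff are ignored. The function name and the quality values are hypothetical.

```python
def consensus_base(calls, consensus_cutoff=90.0, quality_cutoff=-1):
    """Return the consensus base for one column, or '-' if below the cutoff.

    calls is a list of (base, quality) pairs for the readings at this column.
    """
    totals = {}
    for base, quality in calls:
        if quality_cutoff < 0:
            weight = 1                 # simple frequency count
        elif quality >= quality_cutoff:
            weight = quality           # quality weighted
        else:
            continue                   # assumed: low quality bases are ignored
        totals[base] = totals.get(base, 0) + weight
    total = sum(totals.values())
    if total == 0:
        return "-"
    best = max(totals, key=totals.get)
    share = 100.0 * totals[best] / total
    return best if share >= consensus_cutoff else "-"


# The example from the text: ten A bases and one C base.
column = [("A", 10)] * 10 + [("C", 40)]
by_frequency = consensus_base(column, consensus_cutoff=90, quality_cutoff=-1)
by_quality = consensus_base(column, consensus_cutoff=90, quality_cutoff=0)
```

By frequency the A bases make up 10 of 11 calls (about 90.9%), so the column is called A at a cutoff of 90. Weighted by these illustrative qualities the A bases hold only 100 of 140 (about 71%), so the same column becomes a dash.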
If you have confidence values for each base available you may consider it unnecessary
to check disagreements caused by poor quality data disagreeing with good quality data,
although disagreements between good data and good data should always be checked. However
it should be obvious from this that with a quality cutoff of 0 and a consensus cutoff
of 100% every sequence conflict is still considered a potential problem. A specific change
in the consensus cutoff (eg from 100% to 90%) will typically find fewer problems when the
quality cutoff is 0 than when it is -1. This is entirely due to differences between good quality
data and poor quality data being excluded.
Finally, the “Compare Strands” editor setting calculates two independent consensus
sequences; one for each strand. The consensus shown is then the base calculated in each
of the two consensus sequences if they agree, or dash if they do not. The “confidence”
consensus algorithm already takes into account strand and chemistry when calculating the
consensus base type and confidence, but will only lower the confidence value for strand
disagreements, rather than setting the consensus base to be a dash. For all consensus
methods enabling “Compare Strands” will force you to check all consensus bases where the
evidence from each strand is conflicting.

2.6.18.2 Editing by Base Change or Confidence
Once a location has been found where an edit needs to be made there are two possible
methods of resolving the problem. Assuming that the edit is a base replacement, the first
way is to simply replace the differing base with the corrected base. This adds the new
base at 100% quality. A second solution is to set the confidence of the differing base to 0.
Assuming that we have the quality cutoff set to zero or more, this will remove the differing
base from the consensus calculation, thus enabling the consensus to be 100% identical.
Both of these methods may be used when replacing bases in the consensus and are
selectable using the “Edit Modes” menu. Fixing a problem by adjusting its confidence
leaves the original, conflicting, base visible on the screen. However if the changed reading is
the only one on a strand then adjusting the confidence means that the point only has good
data on one strand.

2.6.18.3 Base Overcalls
A common problem is that of base overcalls that will result in perhaps one reading having
an extra base, and all the others being padded by the alignment routines:
Read1      ACC*AG
Read2      ACC*AG
Read3      ACC*AG
Read4      ACCCAG
Read5      ACC*AG
Consensus  ACC*AG

In this first case we see that Read4 has an extra C, probably due to an overcall. Check
that the trace for Read4 shows an overcall. It is a good idea to check good quality traces
for both strands as well as the trace with the apparent problem. Also note that enabling
“Show reading quality” (Settings menu) will show the reading quality as grey scales.
We now need to remove the column. It would appear that this could be done by removing
the * from each of Readings 1, 2, 3 and 5, and removing the C from Reading 4. However
this will only make edits to those five readings. As we’re trying to remove an entire column
from the contig, we need to shift to the left by a single base the position of any readings to
the right. Naturally this is not the ideal method.
By placing the editing cursor in the consensus (on the second A) we can press Delete to
remove the entire column. This automatically makes sure that everything is consistent. If
we are editing at 100% consensus cutoff then this consensus base will be a ’-’ instead of a
’*’. For this to work we need to make sure that we have “Allow del dash in cons” enabled
in the Edit Modes menu. See Section 2.6.3.2 [Editing Modes], page 166.
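
Why deleting a column is more than five single-base edits can be seen in the following Python sketch. It uses a deliberately simple, hypothetical model of a contig (a dictionary mapping each reading name to its start position and padded sequence) and is not a description of the editor's internal data structures.

```python
def delete_column(readings, column):
    """Remove one contig column from every reading.

    readings maps a name to (start, sequence), where start is the 0-based
    contig position of the first base. Readings covering the column lose
    one character; readings starting to its right are shifted left by one.
    """
    result = {}
    for name, (start, seq) in readings.items():
        end = start + len(seq)
        if column < start:
            result[name] = (start - 1, seq)          # to the right: shift left
        elif column < end:
            offset = column - start
            result[name] = (start, seq[:offset] + seq[offset + 1:])
        else:
            result[name] = (start, seq)              # entirely to the left
    return result


# Read6 is a hypothetical reading starting further to the right.
contig = {"Read1": (0, "ACC*AG"), "Read4": (0, "ACCCAG"), "Read6": (5, "GTT")}
edited = delete_column(contig, 3)
```

After the deletion Read1 and Read4 both read ACCAG, and Read6 has moved from position 5 to position 4, so it still lines up with the final G of the other readings.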

2.6.18.4 Base Undercalls
Read1      ACCCAG
Read2      ACCCAG
Read3      ACCCAG
Read4      ACC*AG
Read5      ACCCAG
Consensus  ACCCAG

In the above case we see that Read4 has a C missing. Once again we must check the traces
to be sure that we wish to edit the reading. If so, then we can either make an edit specifically
to Read4 itself (in “replace” mode) or type C in the consensus at this column. The latter
will change either the base type of the * in Read4 to c, or will change its confidence value
to 0. This depends upon the value of the “Edit by base type”/“Edit by confidence” setting
in the Edit Modes menu.
When replacing base types, it is preferable to use lowercase letters. This makes the
modified base stand out. However even when using uppercase letters it is always possible
to search for edits at a later stage, although they won’t be as obvious to the human eye.
Finally, note that the “Allow replace in cons” mode must be set to enable this solution.

2.6.18.5 Multiple Base Disagreements
Read1      ACCGAG
Read2      ACCGAG
Read3      AC*GAG
Read4      ACCGAG
Read5      AC*GAG
Consensus  AC-GAG

Now we have a more complex case: two disagreements out of five readings. Care should
be taken to check these traces. Also note the strand of each reading. If the database is
highly repetitive, and Read1, Read2, and Read4 are all from one strand, with Read3 and
Read5 from the opposite strand, then there is a chance that a misassembly has occurred or
that the problem is a strand dependent sequencing artifact.
Typically this is clear when using the “Highlight disagreements” mode. See
Section 2.6.8.9 [Highlight Disagreements], page 184. By selecting this mode and by also
highlighting the reading names (see Section 2.6.2 [The sequence names display], page 163),
scanning along the contig will quickly show whether there are other disagreements in
common with these two readings versus the other three (which would support the evidence
of misassembly).
If any misassembled readings have been spotted then mark them for disassembly (see
Section 2.6.9 [Remove Readings], page 186) and they’ll no longer cause conflicts in the
consensus. If the problem is a simple case of needing to edit, then making the edit in the
consensus will require only one key stroke instead of the two needed to edit the individual
readings.

2.6.18.6 Poor Quality
Read1      ACC*AGT*CGTA
Read2      ACC*AGT*CGTA
Read3      ACC*AGT*CGTA
Read4      ACCCAGTCCG
Read5      ACC*AGT*CGTA
Consensus  ACC*AGT*CGTA

This is identical to the first case, except we have two edits within Read4 in close proximity.
This is usually due to a poor quality reading, which can be checked by examining the trace
and confidence values. Whilst we could continue to make edits in the normal fashion it may
be wiser to take another approach.
One technique is to adjust the cutoff data for Read4. By marking the data as hidden,
this portion of the reading will no longer be used for producing the consensus. However we
can only extend the cutoff data at one end or the other; it is not possible to have “hidden”
data part way through a reading except by modifying its confidence. Note though that
adjusting the cutoff data may mean that we have no data for one strand, which should be
solved by extra experiments.
If the reading is poor quality along its entire length, then disassembly is also a viable
option. Using Highlight Disagreements (see Section 2.6.8.9 [Highlight Disagreements],
page 184) or Check Assembly (see Section 2.9 [Check Assembly], page 236) is a good way
of finding such readings. Note that disassembling readings may have other implications. It
could cause a hole in the contig (in which case it will be broken in two) or it could cause
a single stranded segment. If this is the case, the user needs to weigh up the work involved
with making many edits along the length of this reading against performing another
experiment to obtain better quality data.

2.6.18.7 Checking for Errors
It is important to check the final sequence for any errors introduced by incorrect editing.
We strongly advise this when making use of the more dangerous options in “Edit Modes”
as it is possible to accidentally make changes. The editor provides several methods for
checking the edits performed on the data.
The search menu contains two search types for this purpose: “Evidence for edits (1)” (see
Section 2.6.6.11 [Search for evidence for edits(1)], page 176) and “Evidence for edits (2)”
(see Section 2.6.6.12 [Search for evidence for edits(2)], page 176).
“Evidence for edits (1)” searches for the next edited place where none of the original readings
agree with the consensus. This helps to spot cases where entire columns have been inserted
or deleted. “Evidence for edits (2)” performs the same checks as “Evidence for edits (1)”,
but on each strand independently. So it will find all edited places where there is not evidence
from both strands.
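
The difference between the two searches can be illustrated with a small Python sketch. The column representation used here (a consensus base, an "edited" flag and a list of original base calls with their strands) is hypothetical, as is the function name; the precise definitions used by the editor are given in the sections referenced above.

```python
def lacks_evidence(column, per_strand=False):
    """Report whether an edited column lacks supporting original data.

    column is a dict with keys "consensus" (a base), "edited" (bool) and
    "originals" (a list of (base, strand) pairs, strand being "+" or "-").
    With per_strand=False the column is flagged when no original base agrees
    with the consensus; with per_strand=True it is flagged unless both
    strands contain an agreeing original base.
    """
    if not column["edited"]:
        return False
    consensus = column["consensus"].upper()
    agreeing_strands = {
        strand for base, strand in column["originals"] if base.upper() == consensus
    }
    if per_strand:
        return agreeing_strands != {"+", "-"}
    return not agreeing_strands


# A hypothetical edited column supported by a single reading on the + strand.
column = {"consensus": "C", "edited": True,
          "originals": [("C", "+"), ("T", "+"), ("T", "-")]}
flagged_by_first = lacks_evidence(column)
flagged_by_second = lacks_evidence(column, per_strand=True)
```

In this example the first style of search passes over the column, because one original reading agrees with the consensus, whereas the per-strand search flags it because the negative strand offers no agreeing base.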
Further checks may be performed outside the editor. Using the Find Read Pairs command all templates containing both forward and reverse readings will be checked to make
sure that the relative orientation and distance of the sequences is correct. See Section 2.8.2
[Find Read Pairs], page 222.
Finally, the Check Assembly command can either check the hidden data for each reading
to check that it does not diverge from the consensus sequence, or the visible data can be
examined to locate segments with a high proportion of disagreements with the consensus.
See Section 2.9 [Check Assembly], page 236. Such cases arise from unnoticed sections of
vector sequence, chimeric reads, or a reading being in the wrong copy of a repeated
element.

2.6.19 Summary
2.6.19.1 Keyboard summary for editing window
(“Left”, “Right”, “Up”, “Down” refer to the appropriate arrow keys.)
Escape/Control v                  Scroll right one screenful
Meta/Alt v                        Scroll left one screenful

Left or Control b                 Move editing cursor left one base
Right or Control f                Move editing cursor right one base
Up or Control p                   Move editing cursor up one base
Down or Control n                 Move editing cursor down one base
Control a                         Move editing cursor to start of used
Control e                         Move editing cursor to end of used
Meta/Alt/Escape a                 Move editing cursor to start of cutoff
Meta/Alt/Escape e                 Move editing cursor to end of cutoff
Meta/Alt/Escape comma             Move editing cursor to start of contig
Meta/Alt/Escape fullstop          Move editing cursor to end of contig

Meta/Control/Alt Left             Extend left cutoff data
Meta/Control/Alt Right            Extend right cutoff data
Control l                         Move pad left
Control r                         Move pad right
<                                 Zap left cutoff data
>                                 Zap right cutoff data

[                                 Set confidence to 0
]                                 Set confidence to 100
Shift Up                          Increase confidence of base by 1
Shift Down                        Decrease confidence of base by 1
Control Up                        Increase confidence of base by 10
Control Down                      Decrease confidence of base by 10
Delete                            Delete base or Shift reading left
Backspace                         Delete base or Shift reading left
Control Delete                    Delete base from left and move left
Control d                         Delete base from right; do not move

Space                             Shift reading right
Undo key or Control underscore    Perform an undo
Control i                         Toggle insert mode
Control h                         Toggle a sequence for removal
Control t                         Display trace
Control s                         Search forward
Escape Control s                  Search backwards
Control x Control s               Save editor
Control q                         Toggle tag display
Control c or Control Insert       Copy underlined region to paste buffer
Insert                            Insert a padding character (*)
any ACGT1234DVBHKLMNRY5678*-      Insert or change base (both cases allowed)

2.6.19.2 Mouse summary for editing window
Left button                       Position editing cursor to mouse cursor
                                  Update editor information line
Left button (drag)                Mark start and end of selection
Shift left button                 Adjust end of selection
Enter key                         Update editor information line (unpadded pos.)
Return key                        As Enter key, but also moves editing cursor
Left button (double click)        Display trace
Middle button (double click)      Display trace
Control left button               Display commands menu
Right button                      Display commands menu
Mouse-wheel                       Vertically scroll the editor
Shift mouse-wheel                 Vertically scroll the editor, slow
Control mouse-wheel               Vertically scroll the editor, fast

2.6.19.3 Mouse summary for names window
Left button                       Toggle user highlight (not in status line)
Middle button/Alt left button     Add name to output list (if set)
Right button                      Display popup menu

2.6.19.4 Mouse summary for scrollbar
Middle button                     Set scrollbar position
Alt left button                   Set scrollbar position
Left button                       Scroll left or right one screenful

In addition to the scrollbar manipulation, the “<<”, “<”, “>”, “>>” buttons also scroll
the editor left or right by half a screenful or one base.

2.7 Assembling and Adding Readings to a Database
Assembly is performed by selecting one of the functions from the Assembly menu. The
options available are:
• Normal shotgun assembly
• Assemble independently
• Assembly into single stranded regions
• Stack readings
• Put all readings in separate contigs
• Directed assembly
• Enter pre-assembled data
• Screen only

The data for a project is stored in an assembly database (See Section 2.16 [Gap Database
Files], page 284.) All modes of assembly except CAP2, CAP3 and FAKII can either assemble
all the readings for a project in a single operation or can add batches of new data as they
are produced. CAP2, CAP3 and FAKII can only be used to assemble all the data for a
project as a single operation.
For all modes the names of the readings to assemble are read from a list or file of file
names, and the names of readings that fail to be entered are written to a list or a file of file
names. If only a single read is to be assembled the "single" button may be pressed and the
filename entered instead of the file of filenames.
Now that a sufficient number of readings to get close to contiguity can be obtained quite
quickly, and that more repetitive genomes are being sequenced it is sensible to use a "global"
algorithm for assembly, such as Cap2, Cap3, FakII or Phrap. These algorithms compare
each reading against all of the others to work out their most likely left to right order and
so have a better chance of correctly assembling repetitive elements than an algorithm that
only compares readings to the ones already assembled.
There is no limit to the length of the individual readings which can be assembled. Hence
reference sequences for use in mutation studies or for use as guide sequences can be assembled.
Note that Normal shotgun assembly (see Section 2.7.1 [Normal Shotgun Assembly],
page 205), Assemble independently (see Section 2.7.1 [Assembly Independently], page 205),
Assembly into single stranded regions (see Section 2.7.1.2 [Assembly Single], page 209),
Screen only (see Section 2.7.3 [Screen Only], page 213), Put all readings in separate contigs
(see Section 2.7.1.4 [Assembly new], page 211), may require the parameters maxseq and
maxdb to be set beforehand (see Section 2.20.3 [Set Maxseq], page 299). The maxseq
parameter defines the maximum length of consensus that can be created, and the maxdb
parameter the maximum number of readings and contigs that the database can hold (i.e.
number of readings + number of contigs).

2.7.1 Normal Shotgun Assembly
In the absence of any of the external assembly engines, which are in general superior,
particularly for repetitive data, this is the mode that most users will employ for all assembly.

It takes one reading at a time and compares it with all the data already assembled in the
database. If a reading matches it is aligned. If the alignment is good enough the reading is
entered into the database. If a reading aligns well with two contigs it is entered into one of
them, then the two contigs are compared. If they align well they are joined. If the reading
does not match it starts a new contig. If a reading matches but does not align well it can
either be entered as a new contig or rejected.
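
The rules in the paragraph above can be summarised in a short sketch. It only restates the decisions described in the text and is not taken from the gap4 source; the function name `place_reading`, its arguments and the action labels are hypothetical. Each entry in `hits` stands for one contig the reading matched, with a flag saying whether the alignment passed the entry criteria.

```python
def place_reading(hits, enter_failures=True, permit_joins=True):
    """Decide what happens to one reading (illustrative only).

    hits: list of (contig_id, aligned_ok) pairs, one per contig against
          which the reading produced an initial match.
    Returns a list of action tuples.
    """
    if not hits:
        # No match at all: the reading starts a new contig.
        return [("new_contig",)]
    good = [contig for contig, ok in hits if ok]
    if not good:
        # Matched somewhere but did not align well enough.
        return [("new_contig",)] if enter_failures else [("reject",)]
    actions = [("enter", good[0])]
    if len(good) >= 2 and permit_joins:
        # The reading aligns well with two contigs: after it is entered
        # into one, the two contigs are compared and may be joined.
        actions.append(("compare_contigs", good[0], good[1]))
    return actions
```

For example, a reading that aligns well with contigs 9 and 36 gives `[("enter", 9), ("compare_contigs", 9, 36)]`.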
A submode allows tagged regions of contigs to be masked and hence restricts the areas
into which data is entered. Users select the types of tags to be used as masks. As outlined
above readings are compared in two stages: first the program looks for exact matches of
some minimum length, and then for each possible overlap it performs an alignment. If
the masking mode is selected the masked regions are not used during the search for exact
matches, but they are used during alignment. The effect of this is that new readings that
would lie entirely inside masked regions will not produce exact matches and so will not be
entered. However readings that have sufficient data outside of masked areas can produce
hits and will be correctly aligned even if they overlap the masked data. For this mode the
names of readings that do not produce matches are written to the error file with code 5.
Note that new readings that carry tags of the types being used for masking will be masked
only after they have been entered.
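
The effect of masking on the search for exact matches can be illustrated as follows. This is a minimal sketch under stated assumptions, not the gap4 implementation: the name `exact_match_seeds`, the word-index approach and the representation of masked regions as 0-based half-open intervals are all chosen here for illustration. Consensus words that touch a masked position are left out of the index, so a reading lying wholly inside a masked region yields no hits, while a reading with enough sequence outside the mask still does.

```python
def exact_match_seeds(reading, consensus, masked, min_match):
    """Return (reading_pos, consensus_pos) pairs for exact matches of
    length min_match that avoid masked consensus positions.

    masked: list of (start, end) intervals, 0-based, end exclusive.
    """
    is_masked = [False] * len(consensus)
    for start, end in masked:
        for i in range(max(start, 0), min(end, len(consensus))):
            is_masked[i] = True

    # Index every consensus word of length min_match that is free of
    # masked positions.
    index = {}
    for c in range(len(consensus) - min_match + 1):
        if not any(is_masked[c:c + min_match]):
            index.setdefault(consensus[c:c + min_match], []).append(c)

    seeds = []
    for r in range(len(reading) - min_match + 1):
        for c in index.get(reading[r:r + min_match], []):
            seeds.append((r, c))
    return seeds
```

Any seed found this way would then be passed to the alignment stage, which does use the masked data.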

As explained above the user can select to "Apply masking", and if so, the "Select tags"
button will be activated and if it is clicked will bring up a dialogue to allow tag types to be
selected. See Section 2.20.9 [Tag Selector], page 304.

The "display mode" dialogue allows the type of output produced to be set. "Hide all
alignments" means that only the briefest amount of output will be produced. "Show passed
alignments" means that only alignments that fall inside the entry criteria will be displayed.
"Show all alignments" means that all alignments, including those that fail the entry criteria,
are displayed. "Show only failed alignments" displays alignments only for the readings that
fail the entry criteria. Adding text to the text output window will increase the processing
time.
When comparing each reading the program looks first for a "Minimum initial match",
and for each such matching region found it will produce an alignment. If the "Maximum
pads per read" and the "Maximum percent mismatch" are not exceeded the reading will
be entered. The pad limit applies both to pads inserted in the reading and to pads inserted in the consensus. If
users agree, we would prefer to replace the maximum pads criterion with a minimum overlap, i.e.
only overlaps of at least some minimum length would be accepted.
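
As an illustration of the entry criteria, the sketch below computes the three quantities from a pair of aligned strings and tests them against the limits. Exactly how gap4 counts pads and mismatches is not stated here, so the definitions used (a mismatch is any aligned column where the two characters differ, the pad character is `*`, and the pad limit is applied to each sequence separately) are assumptions, as are the function names.

```python
PAD = "*"  # assumed pad character for this sketch


def alignment_stats(consensus_aln, reading_aln, pad=PAD):
    """Return (percent_mismatch, pads_in_consensus, pads_in_reading)
    for two aligned strings of equal length."""
    if len(consensus_aln) != len(reading_aln) or not consensus_aln:
        raise ValueError("aligned strings must be non-empty and equal in length")
    mismatches = sum(1 for c, r in zip(consensus_aln, reading_aln) if c != r)
    percent = 100.0 * mismatches / len(consensus_aln)
    return percent, consensus_aln.count(pad), reading_aln.count(pad)


def passes_entry_criteria(stats, max_pads, max_percent_mismatch):
    """True if neither the pad limit nor the mismatch limit is exceeded."""
    percent, pads_in_consensus, pads_in_reading = stats
    return (pads_in_consensus <= max_pads
            and pads_in_reading <= max_pads
            and percent <= max_percent_mismatch)
```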
Assembly usually works on sets of reading names and they can be read from either a
"file" or a "list" and an appropriate browser is available to enable users to choose the name
of the file or list. If just a single reading is to be assembled choose "single" and enter the
filename instead of the file or list of filenames.
The routine writes the names of all the readings that are not entered to a "file" or a
"list" and an appropriate browser is available to enable users to choose the name of the file
or list. Occasionally it might be convenient to forbid joins between contigs to be made if a
new reading overlaps them both, but the default is to "Permit joins".
If a reading is found to match but does not align within the alignment criteria it can be
entered as a new contig or rejected. These two choices are described as "Enter all readings"
or "Reject failures". Pressing the "OK" button will start the assembly process.
Note that this option may require the parameter maxseq to be set beforehand (see
Section 2.20.3 [Set Maxseq], page 299). This parameter defines the maximum length of
consensus that can be created.
Typical output would be:
(Output removed to save space)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Processing 51 in batch
Reading name xb61h12.s1
Reading length 104
Total matches found 2
Contig 9 position 590 matches strand -1 at position 1
Contig 36 position 92 matches strand -1 at position 1
Trying to align with contig 9
Percent mismatch 2.1, pads in contig 0, pads in gel 1
Percentage mismatch 2.1
          590       600       610       620       630       640
Consensus TTGAAAAATTAAAAACTTTTTTTGAAAATAAAAAAGAGTGAAAGTAAAGTAAAAGACAAG
          ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
Reading   TTGAAAAATTAAAAACTTTTTTTGAAAATAAAAAAGAGTGAAAGTAAAGTAAAAGACAAG
          1         11        21        31        41        51

          650       660       670       680
Consensus TAGCATGTAAATCAACTAAAAATAACTAATATTTT
          ::::::::::::::::::::::::: ::::::::
Reading   TAGCATGTAAATCAACTAAAAATAA,TAATATTT
          61        71        81        91
Trying to align with contig 36
Percent mismatch 0.0, pads in contig 0, pads in gel
Percentage mismatch 0.0
          92        102
Consensus TTGAAAAATTAAAAACTTTT
          ::::::::::::::::::::
Reading   TTGAAAAATTAAAAACTTTT
          1         11

Overlap between contigs 36 and 9
Length of overlap between the contigs 111
Entering the new reading into contig 9
This gel reading has been given the number 47
Complementing contig 36
Complementing contig 9
Trying to align the two contigs
Percent mismatch 4.4, pads in contig 0, pads in gel 3
Percentage mismatch 5.3
          86        96        106       116       126       136
Consensus AAAAGTTTTTAATTTTTCAATTGTTTGGGTGTTCCTTTGACTATTAGAAAAACACCCCCC
          ::::::::::::::::::::::::::::::::::::::::::::::::::: :: :::::
Consensus AAAAGTTTTTAATTTTTCAATTGTTTGGGTGTTCCTTTGACTATTAGAAAA,CA,CCCCC
          1         11        21        31        41        51

          146       156       166       176       186       196
Consensus TTGCTCCTGTTGTGCAATTTTTGTTTTAAGTTTTCAATC*TTT*TATTTTAATA
          ::::::::::::::::::::::::::::::::::: ::: ::: :::::: :::
Consensus TTGCTCCTGTTGTGCAATTTTTGTTTTAAGTTTTC-ATC,TTTTTATTTT-ATA
          61        71        81        91        101       111
Editing contig 36
Completing the join between contigs 47 and
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
(Output removed to save space)
Batch finished
100 sequences processed
96 sequences entered into database
11 joins made
9 joins failed

2.7.1.1 Assemble Independently
This mode works in exactly the same way as normal shotgun assembly (see Section 2.7.1
[Normal Shotgun Assembly], page 205) with all its options and settings, except that the
new batch of data is assembled independently of all the data already in the database. This
means that the only overlaps found will be between the readings in the current batch. One
role for this mode would be to assemble a batch of data that was known from the way it
was produced (say a set of nested clones covering some problem region such as a repeat)
to overlap. Use of Assemble Independently will ensure that the batch of readings will only
be overlapped with one another, and will not be aligned with other similar regions of the
consensus. Once assembled in this way they can be joined to other contigs using Find
Internal Joins. See Section 2.8.3 [Find Internal Joins], page 227.

2.7.1.2 Assemble Into Single Stranded Regions
This mode works like normal assembly (see Section 2.7.1 [Normal Shotgun Assembly],
page 205) with masking, except that the masking is done for regions that already have
sufficient data on both strands of the sequence. This means that new readings will only be
assembled into regions that are single stranded or which border, and overlap, such segments.
Note that this means that readings that do not match are not entered, therefore those that
would actually lie between contigs are rejected.
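
To make the idea of masking double stranded regions concrete, the sketch below finds the stretches of a contig that are covered by readings on both strands. It is an illustration only: the text does not define what counts as "sufficient data", so the sketch assumes that one reading on each strand is enough, and the function name and the 0-based, end-exclusive coordinates are hypothetical.

```python
def double_stranded_regions(readings, contig_length):
    """Return (start, end) intervals covered on both strands.

    readings: list of (start, end, strand) with 0-based start,
              exclusive end and strand "+" or "-".
    """
    forward = [False] * contig_length
    reverse = [False] * contig_length
    for start, end, strand in readings:
        cover = forward if strand == "+" else reverse
        for i in range(max(start, 0), min(end, contig_length)):
            cover[i] = True

    regions = []
    run_start = None
    # One extra step past the end closes a run that reaches the last base.
    for i in range(contig_length + 1):
        both = i < contig_length and forward[i] and reverse[i]
        if both and run_start is None:
            run_start = i
        elif not both and run_start is not None:
            regions.append((run_start, i))
            run_start = None
    return regions
```

The intervals returned are the ones that would be masked; everything else is single stranded (or uncovered) and remains available for new readings.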

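The "display mode" dialogue allows the type of output produced to be set. "Hide all
alignments" means that only the briefest amount of output will be produced. "Show passed
alignments" means that only alignments that fall inside the entry criteria will be displayed.
"Show all alignments" means that all alignments, including those that fail the entry criteria,
are displayed. "Show only failed alignments" displays alignments only for the readings that
fail the entry criteria.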
When comparing each reading the program looks first for a "Minimum initial match",
and for each such matching region found it will produce an alignment. If the "Maximum
pads per read" and the "Maximum percent mismatch" are not exceeded the reading will
be entered. The pad limit applies both to pads inserted in the reading and to pads inserted in the consensus. If
users agree, we would prefer to replace the maximum pads criterion with a minimum overlap, i.e.
only overlaps of at least some minimum length would be accepted.
Assembly usually works on sets of reading names and they can be read from either a
"file" or a "list" and an appropriate browser is available to enable users to choose the name
of the file or list. If just a single reading is to be assembled choose "single" and enter the
filename instead of the file or list of filenames.
The routine writes the names of all the readings that are not entered to a "file" or a
"list" and an appropriate browser is available to enable users to choose the name of the file
or list. Occasionally it might be convenient to forbid joins between contigs to be made if a
new reading overlaps them both, but the default is to "Permit joins".
Pressing the "OK" button will start the assembly process.
Note that this option may require the parameter maxseq to be set beforehand (see
Section 2.20.3 [Set Maxseq], page 299). This parameter defines the maximum length of
consensus that can be created.

2.7.1.3 Stack Readings
This assembly mode assumes that all the readings are already aligned and simply stacks
them on top of one another in a new contig.

Assembly usually works on sets of reading names and they can be read from either a
"file" or a "list" and an appropriate browser is available to enable users to choose the name
of the file or list. If just a single reading is to be assembled choose "single" and enter the
filename instead of the file or list of filenames.
The routine writes the names of all the readings that are not entered to a "file" or a
"list" and an appropriate browser is available to enable users to choose the name of the file
or list.

2.7.1.4 Put All Readings In Separate Contigs
This algorithm simply loads the readings into the database without comparing them, each
starting a new contig. This can be of use to those employing the database for storage rather
than assembly.

2.7.2 Directed Assembly
This assembly method assumes that a preprocessing program, such as an external assembly
engine, has been used to map the relative positions of the readings to within a reasonable
level of accuracy or tolerance. The assembly is "directed" by use of special "Assembly
Position" or AP records included in each reading’s experiment file. It is expected that
these AP records will be added to the experiment files by the preprocessing program, or by
a program which parses the output from such a program, and so the details given below are
not of interest to the average user.
The experiment file for each reading must contain a special "Assembly Position" or AP
line that defines the position at which to assemble the reading. The position is not defined
absolutely, but relative to any other reading (the "anchor reading") that has already been
assembled. The definition includes the name of the anchor reading, the sense of the new
reading, its offset relative to the anchor reading and the tolerance, i.e.:
AP   anchor_reading sense offset tolerance

The sense is defined using + or - symbols.
The offset can be of any size and can be positive or negative. Offset positions are counted
from 0, i.e. the first base in a contig or a reading is base number 0.
For normal use tolerance is a non-negative value, and the first base of the new reading
must be aligned within plus or minus "tolerance" bases of "offset". If tolerance is zero, after
alignment the position must be exactly "offset" relative to the anchor reading. If tolerance
is negative then alignment is not performed and the reading is simply entered at position
"offset" relative to the anchor reading.
To start a new contig the reading must include an AP line containing the anchor reading
*new* and the sense.
Example AP line:
AP   fred.021 + 1002 40

Example AP line to start a new contig:
AP   *new* +
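
For those writing programs that produce or check AP lines, the following is a minimal sketch of a parser for the format described above. The names `AssemblyPosition` and `parse_ap_line` are hypothetical, and the handling of malformed lines (raising `ValueError`) is a choice made for this example, not a description of what gap4 does.

```python
from typing import NamedTuple, Optional


class AssemblyPosition(NamedTuple):
    anchor: str               # anchor reading name, or "*new*"
    sense: str                # "+" or "-"
    offset: Optional[int]     # None when starting a new contig
    tolerance: Optional[int]  # None when starting a new contig


def parse_ap_line(line):
    """Parse 'AP anchor_reading sense offset tolerance'."""
    fields = line.split()
    if len(fields) < 3 or fields[0] != "AP":
        raise ValueError("not an AP line: %r" % line)
    anchor, sense = fields[1], fields[2]
    if sense not in ("+", "-"):
        raise ValueError("sense must be + or -: %r" % line)
    if anchor == "*new*":
        # A new contig needs only the anchor name and the sense;
        # any further fields are ignored in this sketch.
        return AssemblyPosition(anchor, sense, None, None)
    if len(fields) != 5:
        raise ValueError("expected offset and tolerance: %r" % line)
    return AssemblyPosition(anchor, sense, int(fields[3]), int(fields[4]))
```

Applied to the two example lines above, this gives `AssemblyPosition("fred.021", "+", 1002, 40)` and `AssemblyPosition("*new*", "+", None, None)`.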

The algorithm is as follows. Get the next reading name, read the AP line, find the
anchor reading in the database, get the consensus for the region defined by anchor reading
+ offset +/- tolerance. Perform an alignment with the new reading, check the position and
the percentage mismatch. If OK enter the reading.
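
The final checks in this algorithm can be sketched as below. The sketch assumes the alignment has already been performed and has reported the offset at which the reading aligned relative to its anchor; the function name and argument names are hypothetical, and the code restates the rules given in the text without being derived from the program itself.

```python
def check_directed_placement(aligned_offset, offset, tolerance,
                             percent_mismatch, max_percent_mismatch):
    """Return (accepted, position) for one reading.

    aligned_offset: offset relative to the anchor reading found by the
                    alignment (ignored when tolerance is negative).
    """
    if tolerance < 0:
        # Negative tolerance: no alignment is performed and the reading
        # is simply entered at "offset".
        return True, offset
    if abs(aligned_offset - offset) > tolerance:
        # Aligned, but outside offset +/- tolerance.
        return False, None
    if percent_mismatch > max_percent_mismatch:
        return False, None
    return True, aligned_offset
```

With the example line `AP fred.021 + 1002 40`, a reading that aligns at offset 1010 with 2.0% mismatch and a 5.0% limit is accepted at 1010, whereas one that aligns at offset 1050 is not.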
Obviously the way the positions of readings are specified is very flexible but one example
of use would be to employ a file of file names containing a left-to-right ordered list of reading
names, with each reading using the one to its left as its anchor reading. In this way whole
contigs can be entered.
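
A preprocessing script could produce such a chain of AP lines along the following lines. This is a sketch only: it assumes the external program reports a start position for each reading, that the offset is measured from the start of the anchor reading, and that the sense is written as given by the external program; the function name and the reading names in the example are hypothetical.

```python
def ap_lines_for_ordered_readings(placements, tolerance):
    """Build one AP line per reading from a left-to-right ordered list.

    placements: list of (name, sense, start) sorted by start, where
                start is a 0-based position in the external assembly.
    Returns a list of (name, ap_line) pairs in the same order.
    """
    result = []
    previous = None
    for name, sense, start in placements:
        if previous is None:
            # The leftmost reading starts the contig.
            line = "AP   *new* %s" % sense
        else:
            prev_name, prev_start = previous
            line = "AP   %s %s %d %d" % (
                prev_name, sense, start - prev_start, tolerance)
        result.append((name, line))
        previous = (name, start)
    return result
```

For three readings `r1`, `r2` and `r3` starting at positions 0, 150 and 420, with a tolerance of 10, the lines produced are `AP   *new* +`, `AP   r1 - 150 10` and `AP   r2 + 270 10` (for senses +, - and + respectively). The returned order is also the order in which the names should appear in the file of file names.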
Although not specifically designed for the purpose this mode of assembly can be used
for "assembly onto template".

If required, the alignments can be shown in the Output window by selecting "Display
alignments". Only readings for which the "Maximum percent mismatch" after alignment
is not exceeded will be entered into the database, unless the "enter all readings" box is
checked. In that case a reading that does not match well enough will be placed in a
new contig. Specifying a "Maximum percent mismatch" of zero has a special meaning; it
implies that there should be no mismatches and so no alignments need to be performed,
and hence the consensus does not need to be computed either. For data that has already
been padded and aligned using an external tool (such as an external assembly program),
setting Maximum percent mismatch to zero can significantly improve the speed
of Directed Assembly.

The “Ignore svec (SL/SR) clips” option controls whether sequencing vector clip points
should be considered when setting the hidden data sections for the sequence. With this
option enabled only the quality clip (QL/QR) experiment file records will be used.

The routine writes the names of all the readings that are not entered to a "file" or a
"list" and an appropriate browser is available to enable users to choose the name of the file
or list.

It is important to note that the algorithm assumes that readings are entered in the
correct order, i.e. a reading can only be entered into the defined AP position after the
reading relative to which its position is defined. The order of the readings is defined by the
order in the list or file of file names, and hence should be ordered by the external assembly
engine. But if the browser is used to select a batch of sequences, they are unlikely to be
in the correct order by chance, so care must be taken in its use. If reading X specifies an
anchor reading that has not been entered the algorithm will start a new contig starting with
X.
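
Because a misordered list silently produces extra contigs, it can be worth checking the order before running Directed Assembly. The sketch below is one hypothetical way to do so: given the reading names and their anchor names in list order, it reports every reading whose anchor has not yet appeared. The function name and the optional `already_in_database` argument are inventions for this example.

```python
def readings_out_of_order(entries, already_in_database=()):
    """Return the names of readings whose anchor would not yet be entered.

    entries: list of (reading_name, anchor_name) in the order they
             appear in the list or file of file names.
    already_in_database: names of readings entered in earlier runs.
    """
    seen = set(already_in_database)
    problems = []
    for name, anchor in entries:
        if anchor != "*new*" and anchor not in seen:
            problems.append(name)
        seen.add(name)
    return problems
```

An empty result means every reading follows the reading relative to which its position is defined.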

2.7.3 Screen Only
This function is used to compare a batch of readings against the data in an assembly
database without entering them. It performs "normal shotgun assembly" and records the
percentage mismatch for each matching reading in a file. If required, this file could then
be sorted on percentage mismatch and used as a file of file names for "normal shotgun
assembly"; in which case the best matches would be entered first. The readings in the
batch are only compared to the current contents of the assembly database, and are not
compared against the other readings in the batch.
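
Sorting the screening results so that the best matches come first might be done as sketched below. The layout of the results file is not specified in this section, so the sketch assumes one reading per line with the name followed by the percentage mismatch, separated by white space; that layout, the function name and all reading names other than xb61h12.s1 are assumptions made for illustration.

```python
def order_by_mismatch(screen_lines):
    """Return reading names sorted by increasing percentage mismatch.

    screen_lines: iterable of 'name percent_mismatch' strings
                  (assumed layout); other lines are skipped.
    """
    scored = []
    for line in screen_lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            scored.append((float(fields[1]), fields[0]))
        except ValueError:
            continue  # not a score line
    scored.sort()
    return [name for _, name in scored]
```

The resulting list of names, written one per line, could then serve as the file of file names for normal shotgun assembly.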

As explained in normal assembly (see Section 2.7.1 [Normal Shotgun Assembly],
page 205) the user can select to "Apply masking", and if so, the "Select tags" button will
be activated and if it is clicked will bring up a dialogue to allow tag types to be selected.
See Section 2.20.9 [Tag Selector], page 304.
The "display mode" dialogue allows the type of output produced to be set. "Hide all
alignments" means that only the briefest amount of output will be produced. "Show passed
alignments" means that only alignments that fall inside the entry criteria will be displayed.
"Show all alignments" means that all alignments, including those that fail the entry criteria,
are displayed. "Show only failed alignments" displays alignments only for the readings that
fail the entry criteria.
When comparing each reading the program looks first for a "Minimum initial match",
and for each such matching region found it will produce an alignment. If the "Maximum
pads per read" and the "Maximum percent mismatch" are not exceeded the reading passes
the entry criteria (in this mode it is reported, not entered). The pad limit applies both to pads inserted in the reading and to pads inserted in the consensus. If
users agree, we would prefer to replace the maximum pads criterion with a minimum overlap, i.e.
only overlaps of at least some minimum length would be accepted.
Screening usually works on sets of reading names and they can be read from either a
"file" or a "list" and an appropriate browser is available to enable users to choose the name
of the file or list. If just a single reading is to be assembled choose "single" and enter the
filename instead of the file or list of filenames.
The routine writes the names of all the readings and their alignment scores expressed
as percentage mismatches to a "file" or a "list" and an appropriate browser is available to
enable users to choose the name of the file or list.
Previous versions of the package also had the ability to search for matches in the "hidden"
poor quality data at the ends of contigs. This feature is no longer available.
Note that this option may require the parameter maxseq to be set beforehand (see
Section 2.20.3 [Set Maxseq], page 299). This parameter defines the maximum length of
consensus that can be created.

2.7.4 General Comments and Tips on Assembly
The program has several methods for assembly and it may not be obvious which is most appropriate for a given problem. The following notes may help. They also contain information
on methods for checking the correctness of an assembly.
If you have access to an external program that can generate the order and approximate
positions of readings then Directed Assembly can be used. The same is true if the experimental method used generates an ordered set of readings (see Section 2.7.2 [Directed
Assembly], page 211).
If you have access to an external global assembly program that can produce an assembly
and write out correct experiment files then Directed Assembly can still be used by specifying
a "tolerance" of -1 (in the experiment file AP lines).
For routine shotgun assembly of whole data-sets or incremental data-sets Normal Shotgun Assembly can be used. Through the idea of "Masked assembly" this option can
also restrict the assembly to particular regions of the consensus (see Section 2.7.1 [Normal
shotgun assembly], page 205).
Note that Normal shotgun assembly (see Section 2.7.1 [Normal Shotgun Assembly],
page 205), Assemble independently (see Section 2.7.1.1 [Assembly Independently], page 209),
Assembly into single stranded regions (see Section 2.7.1.2 [Assembly Single], page 209),
Screen only (see Section 2.7.3 [Screen Only], page 213), Put all readings in separate contigs
(see Section 2.7.1.4 [Assembly new], page 211), may require the parameter maxseq to be
set beforehand (see Section 2.20.3 [Set Maxseq], page 299). This parameter defines the
maximum length of consensus that can be created. If you find that the assembly process is
only entering the first few hundred of a batch of readings, try increasing maxseq.
If you have a batch of readings that are known to overlap one another, but which, due
to repeats, may also match other places in the consensus, then it can be helpful to use
Assemble Independently. This will ensure that the batch of readings are compared only
to one another, and hence will not be assembled into the wrong places (see Section 2.7.1.1
[Assemble independently], page 209).
Almost all readings are assembled automatically in their first pass through the assembly
routine. Those that are not can be dealt with in two ways. Either they can be put through
assembly again with less stringent parameters, or entered using the "Put all readings in
separate contigs" routine and then joined to the contig they overlap using Find Internal Joins
(see Section 2.8.3 [Find Internal Joins], page 227). If it is found that readings are not being
assembled in their first pass through the assembler, then it is likely that the contigs require
some editing to improve the consensus. Also it may be that poor quality data is being used,
possibly by users over-interpreting films or traces. In the long term it can be more efficient
to stop reading early and save time on editing. For those using fluorescent sequencing
machines the unused data can be incorporated after assembly using the Contig Editor and
Double Strand.
An independent and important check on assembly is obtained by sequencing both ends
of templates. Provided the correct information is given in the experiment files, gap4 can
check the positions and orientations of readings from the same template (see Section 2.8.2
[Find read pairs], page 222). Any inconsistencies are shown both textually and graphically.
In addition this information can be used to find possible joins between contigs.

2.7.5 Assembly Failure Codes
0    The reading file was not found or is of invalid format
1    The reading file was too short (less than the minimum match length)
2    The reading appeared to match somewhere but failed to align sufficiently well
     (too many padding characters or too high a percentage mismatch)
3    A reading of the same name was already present in the database
4    This error number is no longer used
5    During a masked assembly, no sequence match with this reading was found.
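
Scripts that post-process the list of failed readings may find a lookup table convenient. The sketch below assumes that the codes run from 0 to 5 in the order listed above (code 5 for masked assembly is stated explicitly in the section on Normal Shotgun Assembly); the names `FAILURE_CODES` and `describe_failure` are hypothetical.

```python
# Assumed numbering: codes 0 to 5 in the order they are listed.
FAILURE_CODES = {
    0: "The reading file was not found or is of invalid format",
    1: "The reading file was too short (less than the minimum match length)",
    2: "The reading appeared to match somewhere but failed to align "
       "sufficiently well",
    3: "A reading of the same name was already present in the database",
    4: "This error number is no longer used",
    5: "During a masked assembly, no sequence match with this reading "
       "was found",
}


def describe_failure(code):
    """Return the description for a failure code, or a fallback text."""
    return FAILURE_CODES.get(code, "unknown failure code %d" % code)
```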

2.8 Ordering and Joining Contigs
After the initial rounds of assembly it is likely that the data for a sequencing project will
still not be contiguous. In order to minimise the number of experiments required to finish
the project it is useful to be able to get as much from the existing data as possible. The
functions described in this section can help to get the current set of contigs into a consistent
left to right order, can discover joins between contigs which were missed or overlooked by
the assembly engines, and can help in the analysis of repeats which may cause problems for
assembly. It is one of the strengths of gap4 that the results from several of these independent
types of analysis can be combined in a single display (see Section 2.4 [Contig Comparator],
page 126), and where they are seen to reinforce one another, users can feel more confident
in their decisions.

A typical Contig Comparator display is shown in the figure above. It is showing results
from other functions, as well as the ones described in this section.
The first function (see Section 2.8.1 [Order Contigs], page 219) automatically orders
contigs based on read-pair data. The orderings found can be examined in the Template
Display (see Section 2.5.1 [Template Display], page 130).
The next function (see Section 2.8.2 [Find read pairs], page 222) also examines read-pair
data, but instead of automatically ordering the contigs, plots out their relationships in the
Contig Comparator, from where the user can invoke the Template Display to check them,
and use the Contig Selector to reorder them.
Sometimes assembly engines will miss or regard some weak joins as too uncertain to be
made. The Find Internal Joins function (see Section 2.8.3 [Find Internal Joins], page 227),
compares contigs, including their hidden data, to find matches between the ends of contigs.
Again results are presented in the Contig Comparator, and users can invoke the Contig
Joining Editor (see Section 2.6.15 [The Join Editor], page 196) to examine and make joins.
Whereas Find Internal Joins makes sure that alignments between contigs continue right
to their ends, another search, Find Repeats (see Section 2.8.4 [Find Repeats], page 233) finds
any identical segments of sequence, wherever they lie in the consensus. This has several
uses. It gives another way of finding potential joins, and it provides a way of annotating
(tagging) repeats so that their positions are obvious to users, and can be taken into account
by other search procedures. Again results are presented in the Contig Comparator, and
users can invoke the Contig Joining Editor (see Section 2.6.15 [The Join Editor], page 196)
to examine and make joins.

2.8.1 Order contigs
This routine uses read-pair information to try to work out the left to right order of sets of
contigs. It is invoked from the gap4 Edit menu. At present it attempts to order all the
contigs in the database, and when finished it produces a listbox window containing
one or more sets (one set per line) of contigs listed by the names of their leftmost readings.
By clicking on their names in the listbox the user can request that these "super contigs"
should be shown in the standard Template display window (see Section 2.5.1 [Template
Display], page 130).
Using the tools available within this window the user can manually move or complement
any contigs which appear to have been misplaced. The combination of automatic ordering
and the facility to view the results by eye and manually correct any errors make this a
powerful tool. The new contig order can be saved to the database by selecting the "Update
contig order" command from the "Edit" menu of the Template display. Note, however, that
unlike the editing operations in the Contig editor, which are only committed to the disk
copy of the database at the user’s request, all the complementing operations in gap4 are
always performed both in memory and on the disk. This means that any complementing
done as part of the contig ordering process will be immediately committed to disk.
An example of the "Super contig" listbox is shown here.

The example seen in the figures shows a Template display before and after the application
of the algorithm.

Before ordering

After ordering
Notice how the operation has reduced the large number of dark yellow (inconsistent)
templates by ordering and complementing the contigs so that they are now consistent and
show in bright yellow. The few remaining dark yellow templates represent problems, possibly with misassembly or with misnaming of readings. The reliability of these dark yellow
templates is also questionable when noting that one or the other of the readings are typically
within the middle of large contigs, and hence are not likely to be spanning contigs. The
gaps between the contigs, shown in the ruler at the bottom of the template display, are real
estimates of size of the missing data, based on the expected lengths of the templates.
The algorithm is based on ideas used to build cosmid contigs using hybridisation data
(Zhang, P., Schon, E.A., Fischer, S.G., Cayanis, E., Weiss, J., Kistler, S. and Bourne, P. (1994) "An
algorithm based on graph theory for the assembly of contigs in physical mapping of DNA",
CABIOS 10, 309-317). A difficulty for algorithms of this type is dealing with errors in the
data, i.e. pairs of readings that have been incorrectly assigned to the same template (often
by simple typing errors made prior to the creation of the experiment files). Our algorithm
uses several simple heuristics to deal with such problems but one known problem is that
it does not correctly deal with cases where templates span non-adjacent contigs, or where
such contigs interleave.
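
The ordering algorithm itself is not given here, but the kind of input it works from can be illustrated. The sketch below is not that algorithm: it only groups contigs into connected sets using a union-find structure, linking two contigs when enough templates have readings in both. It does not order or orient them. The `min_links` threshold is a hypothetical heuristic for ignoring isolated, possibly misnamed, read pairs, and the function name is likewise an invention.

```python
def link_contigs(spanning_templates, min_links=1):
    """Group contigs connected by spanning templates.

    spanning_templates: list of (contig_a, contig_b), one entry per
        template whose readings lie in two different contigs.
    Returns a sorted list of sorted contig groups.
    """
    counts = {}
    for a, b in spanning_templates:
        if a == b:
            continue
        key = (min(a, b), max(a, b))
        counts[key] = counts.get(key, 0) + 1

    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), n in counts.items():
        root_a, root_b = find(a), find(b)  # registers both contigs
        if n >= min_links and root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for contig in parent:
        groups.setdefault(find(contig), []).append(contig)
    return sorted(sorted(group) for group in groups.values())
```

For example, links (46, 127), (46, 548) and two copies of (9, 36) give the groups `[[9, 36], [46, 127, 548]]`; requiring `min_links=2` leaves only contigs 9 and 36 grouped.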

2.8.2 Find Read Pairs
This function is used to check the positions and orientations of readings taken from the
same templates. It is invoked from the gap4 View menu.

For each template the relative position of its readings and the contigs they are in are
examined. This analysis can give information about the relative order, separation and
orientations of contigs and also show possible problems in the data. The search can be
over the whole database or a subset of contigs named in a list (see Section 2.14 [Lists],
page 278) or file of file names. The results are written to the Output Window and plotted
in the Contig Comparator (See Section 2.4 [Contig Comparator], page 126.). Read pair
information is also used to colour code the results displayed in the Template Display (see
Section 2.5.1 [Template Display], page 130).

Note that during assembly the template names and lengths are copied from the experiment files into the gap database. See Section 11.3 [Experiment Files], page 552. The
accuracy of the lengths will depend upon some size selection being performed during the
cloning procedures.

2.8.2.1 Find Read Pairs Graphical Output
The contig comparator is used to plot all templates with readings that span contigs. That
is, the lines drawn on the contig comparator are a visual representation of the relationship
(orientation and overlap) between contigs. When a template spans more than two contigs,
all the combinations of pairs of contigs are plotted. However such cases are uncommon.
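
The statement about templates spanning more than two contigs can be made concrete with a small sketch. Given the contigs in which a template's readings lie, it lists every pair of distinct contigs, which is the set of lines that would be plotted for that template. The function name is hypothetical.

```python
from itertools import combinations


def contig_pairs_to_plot(template_contigs):
    """Return every pair of distinct contigs touched by one template.

    template_contigs: contig numbers, one per reading on the template.
    """
    contigs = sorted(set(template_contigs))
    return list(combinations(contigs, 2))
```

A template with readings in contigs 46, 127 and 548 therefore yields three pairs, while a template whose readings are all in one contig yields none.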

Clicking with the right mouse button on a read pair line brings up a menu containing,
amongst other things, "Invoke template display" (see Section 2.5.1 [Template Display],
page 130). This creates a template display of the two contigs. The spanning template will
be coloured bright yellow if the readings on the template are consistent with one another,
or dark yellow if they are not. The ordering of the contigs may need to be altered, or one
contig may need complementing, before the readings on the template become consistent.
Using the "Invoke join editor" command (see Section 2.6.15 [The Join Editor], page 196)
from the same menu will bring up the Join Editor with the two contigs shown end to end.

2.8.2.2 Find Read Pairs Text Output
Two types of results are written to the Output Window: those containing apparently consistent data about the relative orientations and positions of contigs, and those that show
inconsistencies in the data. The inconsistencies will be due to misassembly or to misnaming
of readings and templates.
In the Output Window the program writes a line of information for each template and a
line of information for each reading from that template. In order to restrict this information
to fit on a standard 80 column display a few abbreviations are used. An example for two
consistent and one problematic template is shown below. Templates with possible problems
are separated from those without. The templates shown are sorted by problem; consistent
templates at the top followed by increasingly inconsistent templates at the bottom.
Template     zf18c8( 117), length 1400-2000(expected 1700)
    Reading  zf18a2.s1(   +1F), pos   5620  +91, contig   46
    Reading  zf18c8.s1( -117F), pos   1084 +288, contig  127

Template     zf98f4( 659), length 1400-2000(computed 7263)
    Reading  zf98f4.s1( -659F), pos     27 +238, contig  548
    Reading  zf98f4.r1( +800R), pos   5392 +211, contig   46

*** Possibly problematic templates listed below ***
Template     zf24g6( 262), length 1400-2000(observed 1365)
 D  Reading  zf24g6.r1( +808R), pos    463 +206, contig   46
 D  Reading  zf24g6.s1( -262F), pos   1559 +268, contig   46

2.8.2.3 The Template Lines
To describe the format of the template line we provide a detailed explanation of the lines
above for the last Template block.
"Template zf24g6( 262)"
This is the template with name "zf24g6" and number 262.
length 1400-2000
These are the minimum and maximum lengths specified for this template.
observed(1365)
This section has the general format of "comment(distance)", where "comment"
is one of the following.

observed

The template has both forward and reverse readings within this
contig. From this information the actual size of the template can
be seen. In the example this is "1365".

expected

The template length is estimated as the average of the specified
minimum and maximum size. This will be seen when the template
does not span contigs and does not have both forward and reverse
primers visible.

computed

The template has forward and reverse readings in different contigs.
The length is computed by butting the two contigs together, end
to end, and finding the resultant separation of the template ends.
It is not possible to tell whether the two contigs overlap, and if
so by how much. Hence the "computed" lengths should not be
considered as absolute.

2.8.2.4 The Reading Lines
"?DPS"

The first four characters may be either space or one of "?", "D", "P" or "S".
The meaning of each of these is as follows.
?

No primer information is available for these readings.

D

The distance between forward and reverse primers (ie the template
length) is not as expected.

P

The primer information for readings on this template is inconsistent.
An example of this is where two forward readings exist, both using the
universal primer, and the readings are not in close proximity to each other.

S

The template strand information is inconsistent. This problem can
be seen when the forward and reverse readings are from the same
strand, or two forward readings are pointing in opposite directions.

Absence of all of these characters means that the template is consistent.
"Reading zf24g6.r1"
The reading name
"( +808R)"
The reading number. The "+" or "-" character preceding the number represents
whether the reading has been complemented ("+" for original, "-" for complemented).
The letter following the number indicates the primer information
found for this reading. It may be one of:
?

Unknown

F

Forward, universal primer

f

Forward, custom primer (eg a walk)

R

Reverse, universal primer

r

Reverse, custom primer

"pos 463 +206"
The position and the length of the reading within the contig. In this case the
reading starts at position 463 and extends for 206 bases. For a complemented
reading the position marks the 3’ end of the reading. For both cases the position
can be considered as the ’left end’ of the reading as displayed within the contig.
"contig 46"
The reading number of the leftmost reading within this contig.
In the above example the template has two readings. It can be seen that the template
starts at contig position 463 and finishes at position 1827. The observed length is 1365,
which is just below the expected minimum length of 1400. Hence the template is flagged
as having an invalid distance. There are no other inconsistencies for this template and so
it is likely that the only "problem" is that the experimental size selection process was not
as precise as was thought.
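
The arithmetic behind this example can be reproduced with a few lines of Python. This is only an illustrative sketch and is not part of Gap4: the regular expression, the function names and the dictionary layout are assumptions based on the reading lines shown above. The end position is taken as "pos" plus length because that is what yields the figures of 1827 and 1365 quoted in the text.

```python
import re

# Assumed layout of a reading line, taken from the example output above, e.g.
# " D  Reading    zf24g6.r1( +808R), pos    463 +206, contig   46"
READING_RE = re.compile(
    r"Reading\s+(?P<name>\S+?)\(\s*(?P<sense>[+-])(?P<number>\d+)(?P<primer>\S)\),"
    r"\s*pos\s+(?P<pos>\d+)\s+\+(?P<length>\d+),\s*contig\s+(?P<contig>\d+)"
)


def parse_reading(line):
    """Split one reading line into its fields (hypothetical helper)."""
    match = READING_RE.search(line)
    if match is None:
        raise ValueError("not a reading line: %r" % line)
    fields = match.groupdict()
    for key in ("number", "pos", "length", "contig"):
        fields[key] = int(fields[key])
    # Whatever precedes the word "Reading" holds the ?DPS flags.
    fields["flags"] = line[:match.start()].strip()
    return fields


def observed_length(readings):
    """Span from the leftmost start to the rightmost end, counted inclusively."""
    left = min(r["pos"] for r in readings)
    right = max(r["pos"] + r["length"] for r in readings)
    return right - left + 1


def expected_length(minimum, maximum):
    """The "expected" length: the average of the specified limits."""
    return (minimum + maximum) / 2


lines = [
    " D  Reading    zf24g6.r1( +808R), pos    463 +206, contig   46",
    " D  Reading    zf24g6.s1( -262F), pos   1559 +268, contig   46",
]
readings = [parse_reading(line) for line in lines]
length = observed_length(readings)
# The template is outside its 1400-2000 range, hence the "D" flag.
print(length, 1400 <= length <= 2000)  # prints: 1365 False
```

The same helper gives 1700 for expected_length(1400, 2000), which matches the "expected 1700" shown for the first template in the example.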

2.8.3 Find Internal Joins
The purpose of this function (which is invoked from the gap4 View menu) is to use sequences
already in the database to find possible joins between contigs. Generally these will be joins
that were missed or judged to be unsafe during assembly and this function allows users to
examine the overlaps and decide if they should be made. During assembly joins may have
been missed because of poor data, or not been made because the sequence was repetitive.
Also it may be possible to find potential joins by extending the consensus sequences with
the data from the 3’ ends of readings which was considered to be too unreliable to align
during assembly i.e. we can search in the "hidden data".

If it has not already occurred, use of this function will automatically transform the
Contig Selector into the Contig Comparator. Each match found is plotted as a diagonal
line in the Contig Comparator, and is written as an alignment in the Output Window. The
length of the diagonal line is proportional to the length of the aligned region. If the match
is for two contigs in the same orientation the diagonal will be parallel to the main diagonal;
if they are not in the same orientation the line will be perpendicular to the main diagonal.
The matches displayed in the Contig Comparator can be used to invoke the Join Editor (see
Section 2.6.15 [The Join Editor], page 196) or Contig Editor. See Section 2.6 [Editing in
gap4], page 160. Alternatively, the "Next" button at the top left of the Contig Comparator
can be used to select each result in turn, starting with the best, and ending with the worst.
When this is in use, users can find the match in the Contig Comparator which corresponds
to the next result by placing the cursor over the Next button. The plotted match and the
contigs involved will turn white.

the results given below the positions for the first overlap are as reported, but those for the
second assume that the contig in the minus sense (i.e. 443) has been complemented.

2.8.3.1 Find Internal Joins Dialogue

in marked segments during searching, but in the alignment shown in the Output Window,
marked segments will be shown in lower case.
Some alignments may be very large. For speed and ease of scrolling Gap4 does not
display the textual form of the longest alignments, although they are still visible within the
contig comparator window. The maximum length of the alignment to print up is controlled
by the “Maximum alignment length to list (bp)” control.
The default setting for the consensus is to "Use hidden data" which means that where
possible the contigs are extended using the poor quality data from the readings near their
ends. To ensure that this additional data is not so poor that matches will be missed, the
program uses algorithms which can be configured from the "Edit hidden data parameters"
dialogue. Two algorithms are available. Both slide a window along the reading until a set
criterion is met. By default an algorithm which sums confidence values within the window is
used. It stops when it finds a window whose average confidence is below "Minimum average confidence". The other
algorithm counts the number of uncalled bases in the window and stops when the total
reaches "Max number of uncalled bases in window". The selected algorithm is applied to
all the readings near the ends of contigs and the data that extends the contig the furthest
is added to its consensus sequence.
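
The two window scans can be sketched as follows. This is a minimal illustration in Python and not the Gap4 implementation: the function names are hypothetical, and the manual does not say exactly where within the failing window the hidden data is cut, so the sketch assumes that everything before the first failing window is kept.

```python
def first_bad_window(values, window, is_bad):
    """Return the 0-based start of the first window for which is_bad is true,
    or len(values) if every window passes."""
    n = len(values)
    if n == 0:
        return 0
    # Assumption: data shorter than the window is judged as a single window.
    width = min(window, n)
    for start in range(n - width + 1):
        if is_bad(values[start:start + width]):
            return start
    return n


def hidden_by_confidence(confidences, window, min_average):
    """Number of hidden bases kept when scanning on average confidence."""
    return first_bad_window(
        confidences, window, lambda w: sum(w) / len(w) < min_average)


def hidden_by_uncalled(bases, window, max_uncalled):
    """Number of hidden bases kept when scanning on uncalled (non-ACGT) bases."""
    return first_bad_window(
        bases, window,
        lambda w: sum(1 for b in w if b.upper() not in "ACGT") >= max_uncalled)


print(hidden_by_confidence([30, 30, 30, 30, 5, 5, 5, 5], 4, 20))  # prints: 2
print(hidden_by_uncalled("ACGTAC-G--TT", 4, 2))                   # prints: 5
```

In both calls the scan stops at the first window that fails the test, so a single poor base does not end the extension unless it drags the whole window past the threshold.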
If your total consensus sequence length (including a 20 character header for each contig
that is used internally by the program) plus any hidden data at the ends of contigs is greater
than the current value of a parameter called maxseq, Find Internal Joins may produce an
error message advising you to increase maxseq. Maxseq can be set on the command line
(see Section 2.21 [Command line arguments], page 306) or by using the options menu (see
Section 2.20.3 [Set Maxseq], page 299).
The search algorithms first find matching words of length "Word length", and only
considers overlaps of length at least "Minimum overlap". Only alignments better than
"Maximum percent mismatches" will be reported.
There are two search algorithms: "Sensitive" or "Quick". The quick algorithm should
be applied first, and then the sensitive one employed to find any less obvious overlaps.
The sensitive algorithm sums the lengths of the matching words of length "Word length"
on each diagonal. It then finds the centre of gravity of the most significant diagonals.
Significant diagonals are those whose probability of occurrence is below "Diagonal threshold". It
then uses a dynamic programming algorithm to align around the centre of gravity, using a
band size of "Alignment band size (percent)". For example, if the overlap was 1000 bases
long and the percentage set at 5, the aligner would only consider alignments within 50 bases
either side of the centre of gravity. The larger the percentage and the overlap,
the slower the alignment.
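
The first stage of the sensitive search, and the band arithmetic from the example, can be sketched in Python. This is an assumption-laden illustration rather than the Gap4 code: the function names are hypothetical, a diagonal is identified simply as the difference between the two word positions, and the probability test against "Diagonal threshold" and the centre of gravity calculation are left out.

```python
from collections import defaultdict


def diagonal_scores(seq1, seq2, word_length):
    """Sum the lengths of matching words on each diagonal (i - j)."""
    positions = defaultdict(list)
    for j in range(len(seq2) - word_length + 1):
        positions[seq2[j:j + word_length]].append(j)
    scores = defaultdict(int)
    for i in range(len(seq1) - word_length + 1):
        for j in positions.get(seq1[i:i + word_length], ()):
            scores[i - j] += word_length
    return dict(scores)


def band_half_width(overlap_length, band_percent):
    """Bases considered either side of the centre of gravity."""
    return int(overlap_length * band_percent / 100)


# seq2 is an exact copy of seq1 starting at offset 4, so every word match
# falls on diagonal 4: six words of length 4 give a score of 24.
print(diagonal_scores("GGGGACGTACGGA", "ACGTACGGA", 4))  # prints: {4: 24}
print(band_half_width(1000, 5))                          # prints: 50
```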
The quick algorithm can find overlaps and align 100,000 base sequences in a few seconds
by considering, in its initial phase, only matching segments of length "Minimum initial match
length". However it does a dynamic programming alignment of all the chunks between
the matching segments, and so produces an optimal alignment. Again a banded dynamic
algorithm can be selected, but as this only applies to the chunks between matching segments,
which for good alignments will be very short, it should make little difference to the speed.

After the search the results will be sorted so that the best matches are at the top of a list
where best is defined as a combination of alignment length and alignment percent identity
(in some earlier Gap4 releases this was scored purely on percent identity). This list can be
stepped through, one result at a time using the Contig Joining Editor, by clicking on the
"Next" button at the top left of the Contig Comparator.

2.8.4 Find Repeats
The purpose of this function (which is invoked from the gap4 View menu) is to find exact
repeats in contig consensus sequences. An exact repeat is defined as a run of consecutive
identical ACGT characters; no mismatches or gaps are permitted.
If it has not already occurred, selection of this function will automatically transform
the Contig Selector into the Contig Comparator. See Section 2.4 [Contig Comparator],
page 126. Each match found is plotted as a diagonal line in the Contig Comparator. The
length of the diagonal line is proportional to the length of the match.
If the match is for two contigs in the same orientation the diagonal will be parallel to
the main diagonal; if they are not, the line will be perpendicular to the main diagonal.
The matches displayed in the Contig Comparator can be used to invoke the Join Editor
(see Section 2.6.15 [The Join Editor], page 196) or Contig Editors (see Section 2.6 [Editing
in gap4], page 160), and an Information button will display data about the match in the
Output window, for example:
Repeat match
From contig xb54a3.s1(#26) at 78
With contig xb62h3.s1(#3) at 1
Length 37
This means that position 78 in the contig with xb54a3.s1 (reading number 26) at its left
end matches 37 bases at position 1 in the contig with xb62h3.s1 (number 3) at its left end.
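
The definition of an exact repeat given above can be illustrated with a short Python sketch. It is not the algorithm Gap4 uses (which the manual does not describe): it simply walks every diagonal of the comparison and reports maximal runs of identical ACGT characters. The function name and the tuple layout are assumptions, and matches between complemented contigs are not handled.

```python
def exact_repeats(seq1, seq2, min_length):
    """Return (pos1, pos2, length) for each maximal exact match, 1-based."""
    hits = []
    n, m = len(seq1), len(seq2)
    for d in range(-(m - 1), n):          # d = i - j identifies a diagonal
        i = max(d, 0)
        j = i - d
        run = 0
        while i < n and j < m:
            if seq1[i] == seq2[j] and seq1[i] in "ACGT":
                run += 1
            else:
                if run >= min_length:
                    hits.append((i - run + 1, j - run + 1, run))
                run = 0
            i += 1
            j += 1
        if run >= min_length:             # run reaching the end of a sequence
            hits.append((i - run + 1, j - run + 1, run))
    return hits


# GATTACA occurs at position 6 of the first sequence and position 1 of the second.
print(exact_repeats("CCCCCGATTACAGG", "GATTACATT", 6))  # prints: [(6, 1, 7)]
```

Each tuple corresponds to the "at" positions and the "Length" field of a Repeat match entry.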

2.9 Checking Assemblies and Removing Readings
After assembly, and prior to editing, it can be useful to examine the quality of the alignments
between individual readings and the sections of the consensus which they overlap. This may
reveal doubtful joins between sections of contigs, poorly aligned readings, or readings that
have been misplaced. By using this analysis in combination with other gap4 functions such
as Find internal joins (see Section 2.8.3 [Find Internal Joins], page 227) and Find repeats (see
Section 2.8.4 [Find Repeats], page 233), it is also possible to discover if readings have been
positioned in the wrong copies of repeat elements. The functions for checking the alignment
of readings in contigs are described below. See Section 2.9.0.1 [Checking Assemblies], page 236.
If readings are found to be misplaced or need removing for other reasons, gap4 has functions for breaking contigs (see Section 2.9.1.1 [Breaking Contigs], page 239), and removing
readings (see Section 2.9.1.2 [Disassembling Readings], page 240). These functions can be
accessed through the main gap4 Edit menu or from within the Contig Editor.
If readings are removed from contigs to start new contigs of one reading, these contigs can
then be processed by Find internal joins (see Section 2.8.3 [Find Internal Joins], page 227)
and the Join editor (see Section 2.6.15 [The Join Editor], page 196), which should reveal all
the other positions at which the reading matches.

2.9.0.1 Checking Assemblies
The Check Assembly routine (which is invoked from the gap4 View menu) is used to check
contigs for potentially misassembled readings by comparing them against the segment of
the consensus which they overlap. It has two modes of use: the first simply counts the
percentage mismatch between each reading and the consensus it overlaps, and the second
performs an alignment between the hidden data for a reading and the consensus it overlaps.
If the percentage is above a user defined maximum, a result is produced. That is, one mode
compares the "visible" part of the readings, and the other aligns and compares the hidden
data. Results are displayed in the Output Window and plotted on the main diagonal in the
Contig Comparator. See Section 2.4 [Contig Comparator], page 126.
From the Contig Comparator the user can invoke the Contig Editor to examine the
alignment of any problem reading. See Section 2.6 [Editing in gap4], page 160. If the
reading appears to be correctly positioned the user can either edit it, or in the case of poor
alignment of the hidden data, place a tag, so that it does not produce a result if the search
is done again. Note however such data will then also be ignored by the automatic double
stranding routine. See Section 2.10.1 [Double Stranding], page 241. A typical textual output
from the analysis of hidden data is shown below.
Reading 802(fred.s1) has percentage mismatch of 25.86
                 375       385       395       405       415       425
  Reading *CCTGTTTTAAATTG-TGG-C-CCCG*-TTAACCGGGGT*CAAC**CTGGGTTGCTTA
           : ::::: :::::: ::  : :::::  ::: ::: ::::::  ::::: ::::: :
Consensus ACATGTTT*AAATTGATGAACACCCG*AATAAACGGTGT*CAAAA*CTGGATTGCTAA
                2929      2939      2949      2959      2969      2979
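
The figure of 25.86 in this example can be checked by hand or with the Python sketch below. The counting rule is an assumption, not a statement of how Gap4 computes it: every aligned column is counted, and a column is a mismatch whenever the two characters differ, so a pad ("*") opposite a pad counts as a match. Under that rule the example has 15 mismatches in 58 columns.

```python
def percent_mismatch(reading, consensus):
    """Percentage of aligned columns in which the two characters differ."""
    if len(reading) != len(consensus) or not reading:
        raise ValueError("need two non-empty aligned strings of equal length")
    mismatches = sum(1 for r, c in zip(reading, consensus) if r != c)
    return 100.0 * mismatches / len(reading)


reading = "*CCTGTTTTAAATTG-TGG-C-CCCG*-TTAACCGGGGT*CAAC**CTGGGTTGCTTA"
consensus = "ACATGTTT*AAATTGATGAACACCCG*AATAAACGGTGT*CAAAA*CTGGATTGCTAA"
print(round(percent_mismatch(reading, consensus), 2))  # prints: 25.86
```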

Selecting between analysing the visible or hidden data is done by clicking on "yes" or
"no" in the "Use cutoff data" dialogue. All alignments that are worse than "Maximum
percentage of mismatches" will produce a result in the Output Window and the Contig
Comparator. If "Use cutoff data" is selected then the dialogue that lets the user restrict the
quality and length of the hidden data that the program aligns is activated. First, to avoid
finding very short mismatching regions (where percentage mismatch figures could be very
high) users can set a "Minimum length of alignment" figure. Secondly to ensure that the
hidden data is not so bad that alignments will necessarily be poor, the program uses the
following algorithm. It slides a window of size "Window size for good data scan" along the
hidden data for each reading and stops if it finds a window that contains more than "Max
dashes in scan window" non-ACGT characters.
To check the used data for each reading ("Use cutoff data" is set to "No") the program
compares all segments of size ’window’ against the consensus sequence that they lie above
(obviously no alignment is required). If the percentage mismatch within any segment is
above the specified amount, then the entire ’alignment’ of the reading and consensus is
displayed. Note that in the output the program will first give the percentage mismatch
over the window length, and then the percentage over the whole reading. To check the
overall percentage mismatch of readings, simply set the "Window size for used data" to be
longer than the reading lengths. To check for divergence of segments within readings set
the window size accordingly.
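
The windowed check of the used data can be sketched as follows. This is a hypothetical Python illustration, not the Gap4 routine: the function names are invented, the reading and the consensus are assumed to be already aligned column for column, and a window longer than the reading is assumed to be clipped to the reading length, which is what makes the "overall percentage" trick described above work.

```python
def window_mismatch_percentages(reading, consensus, window):
    """Percentage mismatch of every window of the aligned used data."""
    if len(reading) != len(consensus) or not reading:
        raise ValueError("need two non-empty aligned strings of equal length")
    width = min(window, len(reading))
    differs = [r != c for r, c in zip(reading, consensus)]
    return [100.0 * sum(differs[s:s + width]) / width
            for s in range(len(differs) - width + 1)]


def check_used_data(reading, consensus, window, max_percent):
    """Return (flagged, worst window percentage, whole-reading percentage)."""
    worst = max(window_mismatch_percentages(reading, consensus, window))
    overall = window_mismatch_percentages(reading, consensus, len(reading))[0]
    return worst > max_percent, worst, overall


# Two mismatches at the end: 50% in the worst window of 4, 20% overall.
print(check_used_data("ACGTACGTAA", "ACGTACGTCC", 4, 30))    # (True, 50.0, 20.0)
print(check_used_data("ACGTACGTAA", "ACGTACGTCC", 100, 30))  # (False, 20.0, 20.0)
```

The two calls show the difference between checking for divergent segments (a small window) and checking the reading as a whole (a window longer than the reading).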
Selecting "Information" from the Contig Comparator "Results" menu produces a summary
of the results sorted in order of percentage mismatch.
By clicking with the right mouse button on results plotted in the Contig Comparator
a pop-up menu is revealed which can be used to invoke the Contig Editor (see Section 2.6
[Editing in gap4], page 160). The editor will start up with the cursor positioned on the
problem reading. If the reading is found to be misplaced it can be marked for removal
from within the Editor (see Section 2.6.7.12 [Remove Reading], page 179). However, prior
to this it may be beneficial to use some of the other analyses such as Find internal joins
(see Section 2.8.3 [Find Internal Joins], page 227) and Find repeats (see Section 2.8.4 [Find
Repeats], page 233), which may help to find its correct location. Both of these functions
produce results plotted in the Contig Comparator (see Section 2.4 [Contig Comparator],
page 126) and any alternative locations will give matches on the same vertical or horizontal
projection as the problem reading.

2.9.1 Removing Readings and Breaking Contigs
Occasionally contigs require more drastic changes than simple basecall edits. Sometimes it
is necessary to remove readings that have been put in the wrong place, or to break contigs
that should not have been joined. Gap4 contains functions to help with these problems,
and two types of interface.
If a contig needs to be broken cleanly into two new contigs, with all the readings, other
than the two at the incorrect join, still linked together, then Break Contig should be used
(see Section 2.9.1.1 [Breaking Contigs], page 239, or Section 2.6.7.13 [Break Contig], page 179).
The former interface is available via the main gap4 Edit menu, and the latter as an
option in the Contig Editor.
If one or more readings need removing from contig(s), even if their removal will
break the contiguity of a contig, then Disassemble Readings (see Section 2.9.1.2 [Disassemble
Readings], page 240) or Remove Reading (see Section 2.6.7.12 [Remove Reading], page 179)
should be used. The former interface
is available via the main gap4 Edit menu, and the latter as an option in the Contig Editor.
Readings can be removed from the database completely, or moved to start individual new
contigs, one for each reading.

2.9.1.1 Breaking Contigs
The Break Contig function (which is available from the gap4 Edit menu) enables contigs to
be broken by removing the link between two adjacent readings. The user defines the name
or number of the reading that, after the break, will be at the left end of the new contig.
That is, the break is made between the named reading and the reading to its left.

It is also possible to interactively select places to break the contig when using the Contig
Editor. See Section 2.6.7.13 [Break Contig], page 179.

2.9.1.2 Disassembling Readings
This function is used to remove readings from a database or move readings to new contigs.
There are two interfaces which allow sets of readings to be disassembled. One is to identify
the readings interactively when using the Contig Editor (see Section 2.6.9 [Remove Readings], page 186), and the other, described below, is available as a separate option from the
main gap4 Edit menu.

2.10 Finishing Experiments
Gap4 contains several functions for helping to select experiments to finish an assembly
project. These functions (which are all available from the gap4 Experiments menu) are
able to automatically analyse the contigs to find the regions which need attention, and to
suggest appropriate experiments.
Prior to performing any experiments it can be worthwhile to try to make the most of
the existing data by moving the boundary between the hidden and visible data of readings
to cover single stranded readings. (see Section 2.10.1 [Double Strand], page 241)
The following "Experiment Suggestion" functions analyse the contigs to find problems,
and then suggest the best templates to use for further experiments.
Primers and templates for primer walking experiments can be suggested. (see
Section 2.10.2 [Suggest Primers], page 243). Sometimes resequencing on a long gel machine
will help to fill a single stranded region or join a pair of contigs. (see Section 2.10.3 [Suggest
Long Readings], page 245). Compressions and stops can be solved by resequencing using
a different chemistry. (see Section 2.10.4 [Compressions and Stops], page 247). In order
to select oligos to use as probes for clones near the ends of contigs a further function is
available. (see Section 2.10.5 [Suggest Probes], page 249).

2.10.1 Double Stranding
The purpose of this function (which is available from the gap4 Edit menu) is to use hidden
data to fill regions of contigs that have data on only one strand (see Section 2.2.6 [Use
of the "hidden" poor quality data], page 120). First the routine finds a region that has
data for only one strand. Then it examines the nearby readings on the other strand to
see if they have hidden data that covers the single stranded region. If so it finds the best
alignment between this hidden data and the consensus over the region. If this alignment is
good enough the data is converted from hidden to visible. This process is continued over
all the selected contigs. The function can be run on a subsection of a single contig, on all
contigs, or on a subset of contigs that are named in a file or a list.
Significant portions of the sequence can be covered by this operation, hence saving a
great deal of experimental work, and it can be used as a standard part of cleaning up a
sequencing project. However it must be noted that an increased number of edits may be
required after its application. The amount of cutoff data used depends on the number of
mismatches and the percentage mismatch in the alignment. That is, it depends on the
quality of the alignment, not the quality of the data: if it aligns it is assumed to be correct!
The program reports its progress in the Output window as shown in the following example.
Wed 03:52:46 PM: double strand
-----------------------------------------------------------
Double stranding contig xf48g3.s1 between 1 and 6189
Double stranded zf23b2.s1 by 121 bases at offset 3752
Double stranded zf18g11.s1 by 194 bases at offset 5652
Positive strand :
Double stranded 315 bases with 2 inserts into consensus
Filled 0 holes
Complementing contig 358
Double stranded zg29a11.s1 by 42 bases at offset 5265 - Filled
Double stranded zf38c7.s1 by 131 bases at offset 5015 - Filled
Negative strand :
Double stranded 174 bases with 1 insert into consensus
Filled 2 holes

2.10.2 Suggest Primers
The purpose of this function (which is available from the gap4 Experiments menu) is to
suggest custom primer experiments to extend and "double strand" contigs. First the routine
finds regions of contigs with data on only one strand. Then it selects templates and primers,
which if used in sequencing experiments, would produce data to cover these single stranded
regions. This information is written to a file or a list and also appears in the Output window.
For each primer suggested a tag is automatically created containing the template name and
the sequence. See also Section 2.10.3 [Suggest Long], page 245, and Section 2.10.1 [Double
Strand], page 241.
The following example shows how the results appear in the Output window.
Wed 04:53:08 PM: Suggest Primers
-----------------------------------------------------------
Selecting oligos for contig xf23a3.s1 between 1 and 12379
At 3873 - template zf23b2, primer GAAACTGGATAATACGAC, number 1
At 5847 - template zf18g11, primer CCTCCAATAGCGTGAAG, number 2
At 7924 - template zf22d11, primer GTAAAGTGTAATTCAAGGAAG, number 3
At 9033 - template zf97c10, primer ATGATAGAAATCTCGTGG, number 4
At 9972 - template zf98b5, primer GCGGAAAGTTGAAAGAG, number 5
At 10506 - template zg09a9, primer ACACATCATTTCGGAGG, number 6
At 10958 - template zf24c1, primer CAGTTTACGAGAAAGTCC, number 7
At 11529 - template zg29a12, primer ACCTTCCCAAAAGTTCC, number 8
At 11897 - template zf97d7, primer AACCCGATTTTCGTAATG, number 9
Complementing contig 358
At 11400 - template zf38b1, primer CGAAGACCCAAAGAAAG, number 11
At 9902 - template zf98a4, primer CTTTTCTCTTTCAACTTTCC, number 12
At 7104 - template zf22h10, primer GTTGTCACGAAAATCGC, number 13
At 6564 - template zf21e6, primer CGGATCAAATATGGATGG, number 14
At 1499 - template zf98a11, primer CGTGATTTTTACACTATTTCC, number 15
At 774 - template zf19c4, primer TCCAATTTTGATTCAGGC, number 16
Complementing contig 46
The following shows the contents of the corresponding file. The fields are template name,
reading name, primer name, primer sequence, position and direction.
zf23b2 zf23b2.s1 B0334.1 GAAACTGGATAATACGAC 3818 +
zf18g11 zf18g11.s1 B0334.2 CCTCCAATAGCGTGAAG 5789 +
zf22d11 zf22d11.s1 B0334.3 GTAAAGTGTAATTCAAGGAAG 7883 +
zf97c10 zf97c10.s1 B0334.4 ATGATAGAAATCTCGTGG 8984 +
zf98b5 zf98b5.s1 B0334.5 GCGGAAAGTTGAAAGAG 9932 +
zg09a9 zg09a9.s1 B0334.6 ACACATCATTTCGGAGG 10460 +
zf24c1 zf24c1.s1 B0334.7 CAGTTTACGAGAAAGTCC 10902 +
zg29a12 zg29a12.r1 B0334.8 ACCTTCCCAAAAGTTCC 11487 +
zf97d7 zf97d7.s1 B0334.9 AACCCGATTTTCGTAATG 11855 +
zf23a3 zf23a3.s1 B0334.10 CAAAGCAATGTCCCCAG 12339 +
zf38b1 zf38b1.s1 B0334.11 CGAAGACCCAAAGAAAG 930 -
zf98a4 zf98a4.s1 B0334.12 CTTTTCTCTTTCAACTTTCC 2427 -
zf22h10 zf22h10.s1 B0334.13 GTTGTCACGAAAATCGC 5220 -
zf21e6 zf21e6.s1 B0334.14 CGGATCAAATATGGATGG 5771 -
zf98a11 zf98a11.s1 B0334.15 CGTGATTTTTACACTATTTCC 10833 -
zf19c4 zf19c4.s1 B0334.16 TCCAATTTTGATTCAGGC 11565 -
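
Because the file holds six whitespace-separated fields per primer, it is easy to read back into other tools. The following Python sketch shows one way to do so; the class and function names are hypothetical and the only format assumed is the field order stated above (template name, reading name, primer name, primer sequence, position and direction).

```python
from typing import NamedTuple


class PrimerSuggestion(NamedTuple):
    template: str
    reading: str
    primer_name: str
    sequence: str
    position: int
    direction: str  # "+" or "-"


def parse_primer_line(line):
    """Parse one line of the Suggest Primers output file."""
    fields = line.split()
    if len(fields) != 6 or fields[5] not in ("+", "-"):
        raise ValueError("unexpected primer line: %r" % line)
    template, reading, name, sequence, position, direction = fields
    return PrimerSuggestion(template, reading, name, sequence,
                            int(position), direction)


first = parse_primer_line("zf23b2 zf23b2.s1 B0334.1 GAAACTGGATAATACGAC 3818 +")
last = parse_primer_line("zf19c4 zf19c4.s1 B0334.16 TCCAATTTTGATTCAGGC 11565 -")
print(first.primer_name, first.position, len(first.sequence))  # B0334.1 3818 18
print(last.template, last.direction)                           # zf19c4 -
```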

The contigs to process can be a particular "single" contig, "all contigs", or a subset of
contigs whose names are stored in a "file" or a "list". If a file or list is selected the browse
button will be activated and, if it is clicked, an appropriate browser will be invoked. If the
user selects "single", then the dialogue for choosing the contig and the section to process
becomes active.
The primer sequences, their template names and their reading names can be written to
a file or a list and an appropriate browser can be used to aid its selection.
For each single stranded region located, the program will search for a primer on its 5’
side in the region "search start position", to "search end position". That is, it will try to
locate a primer starting at "search start position" and then will look increasingly further
away until it reaches "search end position".
If required, by employing the "number of primers per match" entry box, the user can
request that the program tries to suggest more than one primer per problem. The "primer
start number" is an attempt to generate a unique name for each primer suggested. If the
number was set to, say 11, and the database was named B0334, then the first primer would
be named B0334.11, the next B0334.12, etc in the output file.
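
The naming scheme amounts to appending a running number to the database name. A one-function Python sketch (the function name is hypothetical) makes the rule explicit:

```python
def primer_names(database, start_number, count):
    """Names given to successive primers, e.g. B0334.11, B0334.12, ..."""
    return ["%s.%d" % (database, start_number + k) for k in range(count)]


print(primer_names("B0334", 11, 3))  # prints: ['B0334.11', 'B0334.12', 'B0334.13']
```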
The "Edit parameters" button invokes a dialogue box which allows the specification of
further parameters. Primer constraints can be specified by melting temperature, length and
G+C content.

2.10.3 Suggest Long Readings
This routine (which is available from the gap4 Experiments menu) suggests which templates
could be resequenced on a long gel machine to fill in single stranded regions or extend contigs.
The "Estimated long reading length" tells the routine the expected length of reading that
will be produced by the sequencing machine. The routine finds all single stranded regions,
and where possible suggests solutions. Solutions will not be suggested using readings from
templates that have inconsistent read-pair information.
The example output below shows a list of problem segments followed by suggested templates.
Prob 1..1:  Extend contig start for joining.
   Long  c91d3.s1    367. T_pos=366, T_size=1000..1500 (1250), cov 189
   Long  c99e12.s1   340. T_pos=191, T_size=1000..1500 (1250), cov 216

Prob 1..456:  No +ve strand data.
   No solution.

Prob 1597..1736:  No +ve strand data.
   Long  c53c6.s1   1074. T_pos=341, T_size=1000..1500 (1250), cov 32
   Long  e04c11.s1  1076. T_pos=376, T_size=1000..1500 (1250), cov 34
   Long  e05h9.s1   1081. T_pos=377, T_size=1000..1500 (1250), cov 39
   Long  e05a1.s1   1198. T_pos=329, T_size=1000..1500 (1250), cov 156*
   Long  c53b11.s1  1382. T_pos=216, T_size=1000..1500 (1250), cov 340*

Prob 2530..2532:  No +ve strand data.
   Long  e03a8.s1   2283. T_pos=199, T_size=1000..1500 (1250), cov 308*
   Long  e05b10.s1  2331. T_pos=200, T_size=1000..1500 (1250), cov 356*

Prob 3974..4067:  No -ve strand data.
   No solution.

Prob 4067..4067:  Extend contig end for joining.
 D Long  e06a3.s1   3588. T_pos=366, T_size=1000..1500 (1582), cov 76
   Long  c53b1.s1   3709. T_pos=360, T_size=1000..1500 (1250), cov 197
Some brief notes on the above output; looking at the suggested rerun of reading e05a1.s1.
Prob 1597..1736: No +ve strand data.
A single stranded region has been identified in this contig at bases 1597 to 1736
inclusive.
"?D Long" The optional two letters before the word "Long" are used to flag possibly inconsistent templates (templates that are definitely inconsistent are ignored). "?"
means that no primer information is available for the template that the reading
is from. "D" means that the template size is not within the expected minimum
and maximum. In this case the observed size is displayed (see below).

"Long e05a1.s1 1198."
A possible solution; rerun reading e05a1.s1 as a long gel. The first used base
at the 5’ end of this reading is at position 1198 in the contig. Typically this
roughly corresponds to the primer position for this reading in the contig.
T_pos=329
The last used base at the 3’ end of the reading is estimated to be the 329th base
of the template. Together with the template lengths this gives us an estimate
of how much template there is available for a long gel or for walking.
T_size=1000..1500 (1250)
The estimated size for this template is 1250 bases. Gap4 is supplied a minimum
and maximum size when a reading is assembled. In this case the minimum is
1000 bases, and the maximum 1500. When forward and reverse reads are assembled
into the same contig, the real length can be estimated reasonably accurately. Otherwise
(as can be seen here), the estimated length is simply the average of the supplied
minimum and maximum lengths.
cov 156*

We would expect a long gel to cover our "hole" by 156 bases. This estimate is
based purely on the position of the start of the reading in relation to the start
of the hole, and the estimated length of a long gel. The asterisk here marks that
this coverage is more than enough to completely solve the problem by plugging
the positive strand hole.

For the problem "3974..4067" there is "No solution" listed. This is due to the fact that
there are no suitable readings within the estimated long gel reading length of this problem.
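
The "cov" figures for the positive strand holes can be reproduced with simple arithmetic, sketched below in Python. Several things here are assumptions rather than documented behaviour: the function names are hypothetical, the "Estimated long reading length" used for the example is not stated in the manual (the value 555 is inferred because it reproduces the listed figures), and the rule that an asterisk appears when the coverage is at least the length of the hole is a guess that agrees with the example rows.

```python
# Inferred from the example output, not stated in the manual.
ASSUMED_LONG_READING_LENGTH = 555


def forward_coverage(read_start, hole_start, long_length):
    """Bases of the hole a long forward reading is expected to cover."""
    return read_start + long_length - hole_start


def fills_hole(coverage, hole_start, hole_end):
    """True if the coverage spans the whole (inclusive) problem region."""
    return coverage >= hole_end - hole_start + 1


# Prob 1597..1736, readings e05h9.s1 (start 1081) and e05a1.s1 (start 1198).
for start in (1081, 1198):
    cov = forward_coverage(start, 1597, ASSUMED_LONG_READING_LENGTH)
    print(start, cov, fills_hole(cov, 1597, 1736))
# prints: 1081 39 False
#         1198 156 True
```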

2.10.4 Compressions and Stops
This option (which is available from the gap4 Experiments menu) searches through a region
of a contig looking for stop (STOP) or compression (COMP) tags. These tags could have
been added using the Contig Editor or by a suitable external program which can analyse
traces to detect these types of problems. For each such tag found the routine produces a
list of readings that could be resequenced to try to solve the problem. Obviously the types
of experiments available will change as the technology improves but at present the program
produces output that suggests "Taq terminator" experiments. We welcome suggestions for
other experiment types or news of any programs that can automatically assign the tags.
The results, in the form of suggestions, are written to the Output window.

Note that the Taq reading length is used as a guideline for deciding which readings are
suitable candidates for solving a problem. All readings in the correct orientation and with
their 5’ ends within this length are assumed to solve the problem. The actual distance is
listed in the output; an example of this is shown below.
Prob 1544..1545: COMP tag on strand 0 (forward)
    Taq for xd26d8.s1     1365 179
Prob 1554..1554: STOP tag on strand 0 (forward)
    Taq for xd26d8.s1     1365 189
Prob 5276..5288: COMP tag on strand 1 (reverse)
    Taq for xc34g11.s1    5299 23
    Taq for xc34g11.s1t   5298 22
    Taq for xc34d6.s1     5316 40
    Taq for xc45e1.s1     5463 187
Prob 24042..24046: COMP tag on strand 1 (reverse)
    Taq for xc50a12.s1    24167 125
    Taq for xc33d1.s1     24188 146
    Taq for xc36h4.s1     24208 166
    Taq for xc51c8.s1     24232 190
The format of the above output is:

Prob <start>..<end>: <type> tag on strand <strand>
    Taq for <reading> <position> <distance>
    ...

Where:

<start>..<end>
    marks the inclusive range for the tag in the contig.
<type>
    is the type of the current tag.
<strand>
    is the strand of the reading that the tag is placed upon.
<reading>
    is the gel reading name.
<position>
    is the position of the 5’ end of <reading> in the contig.
<distance>
    is the distance of the 5’ end from the tag.
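
The selection of candidate readings can be sketched as follows. This is a minimal illustration rather than gap4's own code: the function name `taq_candidates`, the tuple layout of the readings, and the rule that the distance is measured from the left end of the tag are all assumptions. The last of these is inferred from the example output (for instance 5299 - 5276 = 23).

```python
def taq_candidates(tag_start, tag_strand, readings, taq_length):
    """Return (name, position, distance) for readings that might cover a tag.

    readings is a list of (name, five_prime_position, strand) tuples, with
    strand 0 for forward and 1 for reverse (a hypothetical layout).
    """
    candidates = []
    for name, pos, strand in readings:
        if strand != tag_strand:
            continue  # reading is in the wrong orientation
        # Forward readings extend rightwards, reverse readings leftwards.
        distance = tag_start - pos if strand == 0 else pos - tag_start
        if 0 <= distance <= taq_length:
            candidates.append((name, pos, distance))
    return candidates


readings = [
    ("xc34g11.s1", 5299, 1),
    ("xc45e1.s1", 5463, 1),
    ("xd26d8.s1", 1365, 0),
]
print(taq_candidates(5276, 1, readings, taq_length=200))
```

With a Taq reading length of 200 (an arbitrary value chosen for this example) both reverse-strand readings are reported, with distances 23 and 187.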

2.10.5 Suggest Probes
The suggest probes function (which is available from the gap4 Experiments menu) looks for
oligos at the end of each contig suitable for use with an oligo probing strategy invented by
Jonathan Flint (Flint, J., Sims, M., Clark, K., Staden, R. and Thomas, K. An oligo-screening
strategy to fill gaps found during shotgun sequencing projects. DNA Sequence 8, 241-245).
The probing strategy is used part way through a sequencing project to find clones which
should help to extend contigs. The gap4 function described here is used to select oligos from
readings that are near the ends of the current contigs. These oligos are synthesised and
then used to probe a pool of sequencing clones. The clones that the probes select are then
sequenced in the hope that they will lengthen the contigs.

The dialogue contains the usual methods of selecting the set of contigs to operate on.
For each end of the selected contigs, oligos are chosen using the OSP selection criteria
(Hillier, L., and Green, P. (1991). "OSP: an oligonucleotide selection program," PCR Methods
and Applications, 1:124-128), which depend on the maximum and minimum size of
oligos specified. The "search from" and "search to" parameters control the area of consensus
sequence in which to search for oligos. For example, if they are set to 10 and 100 respectively,
the section of consensus sequence used is 90 bases long and starts 10 bases from the end
of the contig.
Once an oligo is found it is screened against all the existing consensus sequence. An oligo
is rejected if it matches with a score greater than or equal to the "maximum percentage
match". If a file of vector filenames has been specified then the oligos are also screened
against the vector sequences.
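
The search region and the screening rule can be sketched as below. This is an illustration only: the function names are hypothetical, the OSP scoring itself is not reproduced, and the ungapped sliding comparison (which skips the oligo's own position in the consensus) is an assumed stand-in for whatever matching gap4 actually performs.

```python
def search_region(consensus, search_from, search_to, at_start=True):
    """Slice of consensus between search_from and search_to bases from one end."""
    if at_start:
        return consensus[search_from:search_to]
    n = len(consensus)
    return consensus[n - search_to:n - search_from]


def best_percent_match(oligo, target, skip_pos=None):
    """Best ungapped percentage identity of oligo against target.

    skip_pos is the offset of the oligo itself within target, if any, so
    that it is not compared with its own origin (an assumption).
    """
    k = len(oligo)
    best = 0.0
    for i in range(len(target) - k + 1):
        if i == skip_pos:
            continue
        same = sum(a == b for a, b in zip(oligo, target[i:i + k]))
        best = max(best, 100.0 * same / k)
    return best


def is_rejected(best_match, max_percent_match):
    """An oligo is rejected when its best match reaches the maximum allowed."""
    return best_match >= max_percent_match


consensus = "ACGT" * 50
print(len(search_region(consensus, 10, 100)))
```

With "search from" 10 and "search to" 100 the region returned is 90 bases long, as in the example in the text.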
Typical output for a single contig follows. The output shows all oligos that have passed
the screening process. The information listed includes the distance of this oligo from the
end of the contig (Dist ??), the score returned from the OSP selection (primer=??), the
melting temperature (Tm=??), the best percentage match found (match=??%) and the oligo
sequence.
Contig zf37b5.s1(495): Start
    Rejected 8 oligos due to non uniqueness
Contig zf37b5.s1(495): End
    No oligos found
Contig zf48g3.s1(315): Start
    Pos   71, Dist  70, primer=16, Tm=52, match=75%, GCGTTTTACAATAACTTCTC
    Pos   80, Dist  79, primer=16, Tm=50, match=72%, AATAACTTCTCAGGCAAC
    Pos   69, Dist  68, primer=16, Tm=52, match=75%, GTGCGTTTTACAATAACTTC
    Pos   48, Dist  47, primer=20, Tm=50, match=72%, AAAATACCATTGCAGCTC
    Pos   52, Dist  51, primer=20, Tm=55, match=71%, TACCATTGCAGCTCACC
    Pos   51, Dist  50, primer=20, Tm=52, match=71%, ATACCATTGCAGCTCAC
    Pos   63, Dist  62, primer=22, Tm=55, match=76%, CTCACCGTGCGTTTTAC
    Pos   68, Dist  67, primer=24, Tm=50, match=72%, CGTGCGTTTTACAATAAC
    Pos   77, Dist  76, primer=24, Tm=50, match=72%, TACAATAACTTCTCAGGC
    Pos   46, Dist  45, primer=28, Tm=50, match=78%, TCAAAATACCATTGCAGC
    Rejected 1 oligo due to non uniqueness

This output is sent both to the Output Window and to a separate suggest probes
output window. This latter window allows selection of oligos from those
available for each contig by clicking the left mouse button on a line of the output. The
selected oligos are shown in blue. By default the first in each set is automatically selected.

The selected oligos can then be written to a file by filling in the "output filename" and
will have OLIG tags created for them when the "Create tags" checkbutton is selected. This
output window vanishes once OK is pressed, but the text in the main Output Window is
left intact.

2.11 Calculating Consensus Sequences
In this section we describe the types of consensus which gap4 can produce, the formats they
can be written in, and the algorithms that can be used. The algorithms are not only used
to produce consensus sequence files, but in many other places throughout gap4 where an
analysis of the current quality of the data is required. One important place is inside the
Contig Editor (see Section 2.6 [Editing in gap4], page 160) where they are used to produce
an "on-the-fly" consensus, responding to every edit made by the user.
The currently active consensus algorithm is selected from the "Consensus algorithm" dialogue in the main gap4 Options menu (see Section 2.20.2 [Consensus Algorithm], page 299).
There are four main types of consensus sequence file that can be produced by the program: Normal, Extended, Unfinished, and Quality. They are all invoked from the File
menu.
"Normal" is the type of consensus file that would be expected: a consensus from the
non-hidden parts of a contig. "Extended" is the same as "Normal" but the consensus is
extended by inclusion of the hidden, non-vector sequence, from the ends of the contig.
"Unfinished" is the same as "Normal" except that any position where the consensus
does not have good data for both strands is written using A,C,G,T characters, and the rest
(which has good data for both strands) is written using a different set of symbols. This
sequence can be used for screening against new readings: only the regions needing more
readings will produce matches. By screening readings in this way, prior to assembly, users
can avoid entering readings which will not help finish the project, and which may require
further editing work to be performed.
"Quality" produces a sequence of characters of the same length as the consensus, but
they instead encode the reliability of the consensus at each point.
Consensus sequence files can also encode the positions of the currently active tag types
by changing the case of the tagged characters (marking) or writing them in a different
character set (masking) (see Section 2.2.7.2 [Active tags and masking], page 121).
The consensus algorithms are usually configured to produce only the characters A,C,G,T
and "-", but it is possible to set them to produce the complete set of IUB codes. This mode
is useful for some types of work and allows the range of observed base types at any position
to be coded in the consensus. How the IUB codes are chosen is described in the introduction
to the consensus algorithms (see Section 2.11.5 [The Consensus Algorithms], page 257).
Depending on the type of consensus produced, the consensus sequence files can be written
in three different formats: Experiment files (see Section 11.3 [Experiment File], page 552),
FASTA (Pearson,W.R. Using the FASTA program to search protein and DNA sequence
databases. Methods in Molecular Biology. 25, 365-389 (1994)) or staden formats. If experiment file format is selected a further menu appears that allows users to select for the
inclusion of tag data in the output file. For FASTA format the sequence headers include the
contig identifier as the sequence name and the project database name, version number and
the number of the leftmost reading in the contig as comments. For example, ">xyzzy.s1 B0334.0.274"
denotes database B0334, copy 0, and a contig whose leftmost reading is number 274, which has
a name of xyzzy.s1. For staden format the headers include the project database name and

2.11.1 Normal Consensus Output
This is the usual consensus type that will be calculated (and is available from the gap4
File menu). The currently active consensus algorithm is selected from the "Consensus algorithm" dialogue in the main gap4 Options menu (see Section 2.20.2 [Consensus Algorithm],
page 299).
Contigs can be selected from a file of file names or a list. In addition, tagged regions can
be masked or marked (see Section 2.2.7.2 [Active tags and masking], page 121), and output
can be in Experiment file, fasta or staden formats. If experiment file format is selected a
further menu appears that allows users to select for the inclusion of tag data in the output
file.

The "strip pads" option will remove pads ("*"s) from the consensus sequence. In the
case of experiment files this will also automatically adjust the position and length of the
annotations to ensure that they still mark the correct segment of sequence.

Normally the consensus sequences are named after the left-most reading in each contig.
For the purposes of single-template based sequencing projects (e.g. cDNA assemblies) the
option exists to “Name consensus by left-most template” instead of by left-most reading.

The routine can write its consensus sequence (plus extra data for experiment files) in
"experiment file", "fasta" and "staden" formats. The output file can be chosen with the
aid of a file browser. If experiment file format is selected the user can choose whether or not
to have "all annotations", "annotations except in hidden", or "no annotations" written out
with the sequence. If the user elects to include annotations the "select tags" button will
become active, and if it is clicked, a dialogue for selecting the types to include will appear.

2.11.2 Extended Consensus Output
This consensus type (which is available from the gap4 File menu) is useful for those who
are too impatient to complete their sequence and want to compare it, in its fullest extent,
to other data. The sequence produced therefore includes hidden data from the ends of the
contigs.

The currently active consensus algorithm is selected from the "Consensus algorithm" dialogue in the main gap4 Options menu (see Section 2.20.2 [Consensus Algorithm], page 299).

Contigs can be selected from a file of file names or a list. In addition tagged regions can
be masked or marked (see Section 2.2.7.2 [Active tags and masking], page 121), and output
can be in fasta or staden formats.

The contigs for which to calculate a consensus can be a particular "single" contig, "all
contigs", or a subset of contigs whose names are stored in a "file" or a "list". If a file or list
is selected the browse button will be activated, and if it is clicked, an appropriate browser
will be invoked. If the user selects "single" then the dialogue for choosing the contig and
the section to process becomes active.
Where possible the contigs are extended using the poor quality data from the readings
near their ends. To ensure that this additional data is not too poor the program uses the
following algorithm. It slides a window of size "Window size for good data scan" along
the hidden data for each reading and stops if it finds a window that contains more than
"Max dashes in scan window" non-ACGT characters. The data that extends the contig the
furthest is added to its consensus sequence.
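
The window scan can be sketched as follows. This is an illustration under stated assumptions, not gap4's implementation: the function name and parameter names are hypothetical, and the text does not say exactly where the extension is cut once a bad window is found, so this sketch cuts at the start of the first bad window.

```python
def usable_extension(hidden, window_size, max_dashes):
    """Number of leading hidden bases accepted for extending the contig.

    hidden is the hidden sequence of one reading, ordered outwards from
    the contig end. The scan stops at the first window that contains more
    than max_dashes non-ACGT characters.
    """
    last_start = max(len(hidden) - window_size, 0)
    for start in range(last_start + 1):
        window = hidden[start:start + window_size]
        bad = sum(ch.upper() not in "ACGT" for ch in window)
        if bad > max_dashes:
            return start  # assumed cut point: start of the bad window
    return len(hidden)


print(usable_extension("ACGTACGT--N-ACGT", window_size=4, max_dashes=2))
```

Applying this to every reading near a contig end and keeping the longest accepted extension corresponds to "the data that extends the contig the furthest".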
If the user selects either "mask active tags" or "mark active tags" the "Select tags"
button is activated, and if it is clicked, a dialogue panel appears to enable the user to select
which tag types should be used in these processes. If "mask" is selected all segments covered
by the tag types chosen will not be written as A,C,G,T but as the symbols d,e,f,i. If "mark" is
selected the tagged segments will be written in lowercase characters. Masking is useful for
producing a sequence to screen against other sequences: only the unmasked segments will
produce hits.
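
Masking and marking can be illustrated with a short sketch. The function name `apply_tags` and the representation of tags as 1-based inclusive ranges are assumptions made for this example; the symbol mapping (d=a, e=c, f=g, i=t) is the one used elsewhere in this manual for staden format output.

```python
# Translation used for masked bases: A->d, C->e, G->f, T->i.
MASK = str.maketrans("ACGTacgt", "defidefi")


def apply_tags(consensus, tag_ranges, mode):
    """Mask or mark the segments of consensus covered by tags.

    tag_ranges holds (start, end) pairs, 1-based and inclusive.
    mode is "mask" (write d,e,f,i) or "mark" (write lowercase).
    """
    out = list(consensus)
    for start, end in tag_ranges:
        for i in range(start - 1, min(end, len(out))):
            if mode == "mask":
                out[i] = out[i].translate(MASK)
            else:
                out[i] = out[i].lower()
    return "".join(out)


print(apply_tags("ACGTACGT", [(3, 5)], "mask"))
print(apply_tags("ACGTACGT", [(3, 5)], "mark"))
```

In the masked form a screening program that only recognises A,C,G,T will find no matches in the tagged segment, which is the behaviour described above.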
The "strip pads" option will remove pads ("*"s) from the consensus sequence.

The routine can write its consensus sequence in "fasta" and "staden" formats. The
output file can be chosen with the aid of a file browser.

2.11.3 Unfinished Consensus Output
This option is available from the gap4 File menu. An "Unfinished" consensus is one in
which any position where the consensus does not have good data for both strands is written
using A,C,G,T characters, and the rest (which has good data for both strands) is written
using a different set of symbols (d,e,f,i). This sequence can be used for screening against
new readings: only the regions needing more readings will produce matches. By screening
readings in this way, prior to assembly, users can avoid entering readings which will not
help finish the project, and which may require further editing to be performed. This type
of consensus, when written in staden format, consists of A,C,G,T for single stranded regions
and d,e,f,i for finished sequence (d=a,e=c,f=g,i=t).
The currently active consensus algorithm is selected from the "Consensus algorithm" dialogue in the main gap4 Options menu (see Section 2.20.2 [Consensus Algorithm], page 299).
Contigs can be selected from a file of file names or a list, and output can be in fasta or
staden formats.

2.11.4 Quality Consensus Output
The Quality Consensus Output option described here (which is available from the gap4 File
menu) applies either of the two simple consensus calculations (see Section 2.11.5.1 [Consensus Calculation Using Base Frequencies], page 258, and Section 2.11.5.2 [Consensus
Calculation Using Weighted Base Frequencies], page 259) to the data for each strand of
the DNA separately. The currently active consensus algorithm is selected from the "Consensus algorithm" dialogue in the main gap4 Options menu (see Section 2.20.2 [Consensus
Algorithm], page 299).
It produces, not a consensus sequence, but an encoding of the "quality" of the data
which defines whether it has been determined on both strands, and whether the strands
agree. The categories of data and the codes produced are shown in the table. For example
’c’ means bad data on one strand is aligned with good data on the other.
     +Strand  -Strand
a    Good     Good (in agreement)
b    Good     Bad
c    Bad      Good
d    Good     None
e    None     Good
f    Bad      Bad
g    Bad      None
h    None     Bad
i    Good     Good (disagree)
j    None     None
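
The encoding can be written as a small lookup. This sketch assumes that the codes run from a to j in the row order of the table (only 'a' and 'c' are confirmed by the surrounding text) and uses hypothetical state names "good", "bad" and "none"; how gap4 decides that a strand is good or bad is not modelled here.

```python
# (plus strand state, minus strand state) -> code, for all cases other
# than good/good, which also depends on whether the strands agree.
_CODES = {
    ("good", "bad"): "b",
    ("bad", "good"): "c",
    ("good", "none"): "d",
    ("none", "good"): "e",
    ("bad", "bad"): "f",
    ("bad", "none"): "g",
    ("none", "bad"): "h",
    ("none", "none"): "j",
}


def quality_code(plus, minus, agree=True):
    """Return the quality code for one consensus position."""
    if plus == "good" and minus == "good":
        return "a" if agree else "i"
    return _CODES[(plus, minus)]


print(quality_code("bad", "good"))
```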

2.11.5 The Consensus Algorithms
The consensus calculation is a very important component of gap4. It is used to produce
an "on-the-fly" consensus, responding to every individual change in the Contig Editor (see
Section 2.6 [Editing in gap4], page 160) and is used to produce the final sequence for
submission to the sequence libraries. Some years ago (Bonfield, J.K. and Staden, R. The
application of numerical estimates of base calling accuracy to DNA sequencing projects.
Nucleic Acids Res. 23, 1406-1410 (1995)) we put forward the idea of using base call accuracy
estimates in sequencing projects, and this has been partially realised with the values from
the Phred program (Ewing, B. and Green, P. Base-Calling of Automated Sequencer Traces
Using Phred. II. Error Probabilities. Genome Research. Vol 8 no 3. 186-194 (1998)).
These values are widely used and have defined a decibel type scale for base call confidence
values, and gap4 is currently set to use confidence values defined on this scale. An overview
of our use of confidence values is contained in the introductory sections of the manual (see
Section 2.2.5 [The use of numerical estimates of base calling accuracy], page 118).
As is described elsewhere (see Section 2.11.6 [List Consensus Confidence], page 261)
being able to calculate the confidence for each base in the consensus sequence makes it
possible to estimate the number of errors it contains, and hence the number of errors that
will be removed if particular bases are checked and, if necessary, edited.
Gap4 caters for base calls with and without confidence values and hence provides a
choice of algorithms. There are currently three consensus algorithms that may be used.
The choice of the best algorithm will depend on the data that you have available and the
purpose for which you are using gap4.
The currently active consensus algorithm is selected from the "Consensus algorithm" dialogue in the main gap4 Options menu (see Section 2.20.2 [Consensus Algorithm], page 299).
The only way to produce a consensus sequence for which the reliability of each base is
known, is to use reading data with base call confidence values. Their use, in combination
with the Confidence Value algorithm (see Section 2.11.5.3 [Consensus Calculation Using
Confidence Values], page 259), is strongly recommended.
For base calls without confidence values use the Base Frequencies algorithm (see
Section 2.11.5.1 [Consensus Calculation Using Base Frequencies], page 258). This is also
a fast algorithm so it may be appropriate for very high depth assemblies such as those for
mutation studies.
For data with simple base call accuracy estimates rather than those on the decibel scale,
the Weighted Base Frequencies algorithm should be used (see Section 2.11.5.2 [Consensus
Calculation Using Weighted Base Frequencies], page 259).
All confidence values lie in the range 0 to 100. When readings are entered into a database, gap4 assigns a confidence of 99 to all bases without confidence values. For all three
algorithms, a base with confidence of 100 is used to force the consensus base to that base
type and to have a confidence of 100. However, if two or more base types at any position
have confidence 100, the consensus will be set to "unknown", i.e. "-", and will have a
confidence of 0. Note that dash ("-") is our preferred symbol for "unknown" as, within a
sequence, it is more easily distinguished from A,C,G,T than "N".
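
The special treatment of confidence 100 can be sketched as below. The function name `forced_consensus` and the representation of a column as (base, confidence) pairs are assumptions made for illustration.

```python
def forced_consensus(column):
    """Apply the confidence-100 rule to one column of base calls.

    column is a list of (base, confidence) pairs. Returns (base, 100) if
    exactly one base type carries confidence 100, ("-", 0) if two or more
    base types do, and None if the rule does not apply.
    """
    forced = {base for base, conf in column if conf == 100}
    if len(forced) == 1:
        return forced.pop(), 100
    if len(forced) >= 2:
        return "-", 0
    return None  # fall through to the normal consensus calculation


print(forced_consensus([("A", 30), ("T", 100), ("A", 40)]))
print(forced_consensus([("A", 100), ("T", 100)]))
```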

The consensus sequence is also assigned a confidence, even when base call confidence
values are not used to calculate it. The scale and meaning of the consensus confidence
changes between consensus algorithms. However the consensus cutoff parameter always has
the same meaning. A consensus base with a confidence ’X’ will be called as a dash when
’X’ is lower than the consensus cutoff, otherwise it is the determined base type.
Both the consensus cutoff and quality cutoff values can be set by using the "Configure cutoffs" command in the "Consensus algorithm" dialogue in the main gap4 Options
menu (see Section 2.20.2 [Consensus Algorithm], page 299). Within the Contig Editor (see
Section 2.6 [Editing in gap4], page 160) these values can be adjusted by clicking on the "<"
and ">" symbols adjacent to the "C:" (consensus cutoff) and "Q:" (quality cutoff) displays
in the top left corner of the editor. These buttons are repeating buttons - the values will
adjust for as long as the left mouse button is held down. Changing these values lasts only
as long as that invocation of the contig editor.
The consensus algorithms are usually configured to produce only the characters
A,C,G,T,* and "-", but it is possible to set them to produce the complete set of IUB
codes. This mode is useful for some types of work and allows the range of observed base
types at any position to be coded in the consensus. The IUB code at any position is
determined in the following way.
We assume that the user wants to know which base types have occurred at any point,
but may want some control over the quality and relative frequency of those that are used to
calculate the "consensus". For the simplest consensus algorithm there is no control over the
quality of the base calls that are included, but the Consensus Cutoff can be used to control
how the relative frequency affects the chosen IUB code. All base types whose computed
"confidence" exceeds the Consensus Cutoff will be included in the selection of the IUB code.
For example if only base type T reaches the Consensus Cutoff the IUB code will be T; if both
T and C reach the cutoff the code will be Y; if A, C and T each reach the cutoff the code
will be H; if A, C, G and T all reach the cutoff the code will be "N". For the Confidence
Value algorithm the Quality Cutoff can be used to exclude base calls of low quality, so that
all those that do not reach the Quality Cutoff are excluded from the IUB code calculation.
Otherwise the logic of the code selection is the same as for the two simpler algorithms.
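
The selection of an IUB code from the base types that reach the cutoff can be sketched as follows. The code table is the standard IUB/IUPAC nucleotide set; the function name `iub_code`, the dictionary input, and the use of "greater than or equal to" for reaching the cutoff are assumptions made for this example.

```python
# Standard IUB codes keyed by the set of base types they represent.
IUB = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
    frozenset("ACG"): "V", frozenset("ACT"): "H",
    frozenset("AGT"): "D", frozenset("CGT"): "B",
    frozenset("ACGT"): "N",
}


def iub_code(confidences, consensus_cutoff):
    """Return the IUB code for the base types whose confidence reaches the cutoff.

    confidences maps each of "A", "C", "G", "T" to its computed confidence.
    A dash is returned when no base type reaches the cutoff.
    """
    reached = frozenset(
        base for base, conf in confidences.items() if conf >= consensus_cutoff
    )
    return IUB.get(reached, "-")


print(iub_code({"A": 5, "C": 45, "G": 0, "T": 50}, consensus_cutoff=40))
```

Here C and T both reach a cutoff of 40, so the code Y is chosen.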
The algorithms are explained below.

2.11.5.1 Consensus Calculation Using Base Frequencies
This algorithm can be used for any data, with or without confidence values. Each standard
base type is given the same weight. The consensus will be the most frequent base type in a
given column provided that the consensus cutoff parameter is low enough. All unrecognised
base types, including IUB codes, are treated as dashes. Dashes are given a weight of
1/10th that of recognised base types. Pads are given a weight which is the average of their
neighbouring bases.

The confidence of a consensus base for this method is expressed as a percentage. So for
example a column of bases of A, A, A and T will give a consensus base of A and a confidence
of 75. Therefore a consensus cutoff of 76 or higher will give a consensus base of "-".
In the event that more than one base type is calculated to have the same confidence, and
this exceeds the consensus cutoff, the bases are assigned in descending order of precedence:
A, C, G and T.
The quality cutoff parameter (Q in the Contig Editor) has no effect on this algorithm.
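
A minimal sketch of the base frequencies calculation follows. It is not gap4's code: the function name is hypothetical, pads ("*") are simply skipped because weighting them requires the neighbouring columns, and counting the dash weight in the total is an assumption.

```python
def base_frequency_consensus(column, consensus_cutoff):
    """Consensus base and percentage confidence for one column of calls.

    column is a string of base calls. A, C, G and T weigh 1; any other
    character except a pad is treated as a dash with weight 0.1.
    """
    weights = {"A": 0.0, "C": 0.0, "G": 0.0, "T": 0.0, "-": 0.0}
    for call in column.upper():
        if call == "*":
            continue  # pads are not modelled in this sketch
        if call in "ACGT":
            weights[call] += 1.0
        else:
            weights["-"] += 0.1
    total = sum(weights.values())
    if total == 0:
        return "-", 0.0
    # max() keeps the first maximum, giving the A, C, G, T precedence.
    best = max("ACGT", key=lambda base: weights[base])
    confidence = 100.0 * weights[best] / total
    if confidence < consensus_cutoff:
        return "-", confidence
    return best, confidence


print(base_frequency_consensus("AAAT", consensus_cutoff=75))
print(base_frequency_consensus("AAAT", consensus_cutoff=76))
```

The two calls correspond to the example in the text: a confidence of 75 gives A at a cutoff of 75 and a dash at a cutoff of 76.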

2.11.5.2 Consensus Calculation Using Weighted Base Frequencies
This method can be used when simple, unquantified, base call quality values are available.
Instead of simply counting base type frequencies it sums the quality values. Hence a column
of 4 bases A, A, A and T with confidence values 10, 10, 10 and 50 would give combined
totals of 30/80 for A and 50/80 for T (compared to 3/4 for A and 1/4 for T when using
frequencies). As with the unweighted frequency method this sets the confidence value of
the consensus base to be the percentage of the total weight contributed by the chosen
base type (62.5 in the above example).
The quality cutoff parameter controls which bases are used in the calculation. Only bases
with quality values greater than or equal to the quality cutoff are used, otherwise they are
completely ignored and have no effect on either the base type chosen for the consensus or
the consensus confidence value. In the above example setting the quality cutoff to 20 would
give a T with confidence 100 (100 * 50/50).
In the event that more than one base type is calculated to have the same weight, and
this exceeds the consensus cutoff, the bases are assigned in descending order of precedence:
A, C, G and T.
This is Rule IV of Bonfield,J.K. and Staden,R. The application of numerical estimates
of base calling accuracy to DNA sequencing projects. Nucleic Acids Research 23, 1406-1410
(1995).
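
The worked example for this method can be reproduced with a short sketch. The function name and the (base, quality) pair representation are assumptions; dashes and pads are ignored here for brevity.

```python
def weighted_consensus(calls, consensus_cutoff, quality_cutoff):
    """Consensus base and percentage confidence from summed quality values.

    calls is a list of (base, quality) pairs for one column. Calls whose
    quality is below quality_cutoff are ignored completely.
    """
    totals = {base: 0.0 for base in "ACGT"}
    for base, quality in calls:
        if quality >= quality_cutoff and base in totals:
            totals[base] += quality
    total = sum(totals.values())
    if total == 0:
        return "-", 0.0
    # max() keeps the first maximum, giving the A, C, G, T precedence.
    best = max("ACGT", key=totals.get)
    confidence = 100.0 * totals[best] / total
    if confidence < consensus_cutoff:
        return "-", confidence
    return best, confidence


column = [("A", 10), ("A", 10), ("A", 10), ("T", 50)]
print(weighted_consensus(column, consensus_cutoff=50, quality_cutoff=0))
print(weighted_consensus(column, consensus_cutoff=50, quality_cutoff=20))
```

With no quality cutoff the result is T with confidence 62.5; with a quality cutoff of 20 the three A calls are discarded and T has confidence 100.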

2.11.5.3 Consensus Calculation Using Confidence values
This is the preferred consensus algorithm for reading data with Phred decibel scale confidence
values. As will become clear from the following description, it is more complicated than the
other algorithms, but produces a much more useful result.
A difficulty in designing an algorithm to calculate the confidence for a consensus derived
from several readings, possibly using different chemistries, and hopefully from both strands
of the DNA, is knowing the level of independence of the results from different experiments
- namely the readings. Given that sequencing traces are sequence dependent, we do not
regard readings as wholly independent, but at the same time, repeated readings which
confirm base calls may give us more confidence in their accuracy. In addition, if we get a
particularly good sequencing run, with consequently high base call confidence values, we
are more likely to believe its base call and confidence value assignments. The final point in
this preamble is that the Phred confidence values refer only to the probability for the called
base, and they tell us nothing about the relative likelihood of each of the other 3 base types
appearing at the same position. These difficulties are taken into account by our algorithm,
which is described below.

In what follows, a particular position in an alignment of readings is referred to as a
"column". The base calls in a column are classified by their chemistry and strand. We
currently group them into "top strand dye primer", "top strand dye terminator", "bottom
strand dye primer" and "bottom strand dye terminator" classes.
Within each class there may be zero or many base calls. For each class we check for
multiple occurrences of the same base type. For each base type we find the highest confidence
value, and then increase it by an amount dependent on the number of confirming reads.
Then Bayes formula is used to derive the probabilities and hence the confidence values for
each base type.
To further describe the method it is easiest to work through an example. Suppose we
have 5 readings with the following characteristics covering a particular column.
Dye primer,     top strand,    ’A’, confidence 20
Dye primer,     top strand,    ’A’, confidence 10
Dye primer,     top strand,    ’T’, confidence 20
Dye terminator, top strand,    ’T’, confidence 10
Dye primer,     bottom strand, ’A’, confidence 5

Hence there are three possible classes.
Examining the "dye primer top strand" class we see there are three readings (A, A and
T). The highest A is 20. We add to this a fixed quantity to indicate one other occurrence
of an A in this set. For this example we add 5. Now we have an adjusted confidence of
25 for A and 20 for T. This is equivalent to a .997 probability of A being correct and .99
probability of T being correct. To use Bayes we split the remaining probabilities evenly. A
has a probability of .997 and so the remaining .003 is spread amongst the other base types.
Similarly for the .01 of the T. The result is shown in the table below.
  |   A       C       G       T
--+--------------------------------
A | .997    .001    .001    .001
T | .0033   .0033   .0033   .990
Bayesian calculations on this table then give us probabilities of approximately .766 for
A, .00154 for C, .00154 for G and .231 for T.
The other classes give probabilities of .033 for A, C, G and .9 for T, and .316 for A, and
.228 for C, G and T.
To combine the values for each class we produce a table for a further Bayesian calculation.
Once again we fill in the probabilities and spread the remainder evenly amongst the other
base types.
           |   A       C        G        T
-----------+----------------------------------
Primer Top | .766    .00154   .00154   .231
Term   Top | .0333   .0333    .0333    .9
Primer Bot | .316    .228     .228     .228
From this Bayes gives the final probabilities of .145 for A, .0002 for C, .0002 for G and
.854 for T. This is what would be expected intuitively: the T signal was present in both
dye primer and dye terminator experiments with 1/100 and 1/10 error rates whilst the A
signal was present on both strands with 1/100 and 1/3 error rates. Hence the consensus
base is T with confidence 8.4 (-10*log10(1-.854)).
If a padding character is present in a column we consider the pad as a separate base
type and then evenly divide the remaining probabilities by 4 instead of 3.
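
The final combination step of the worked example can be checked with a short sketch. It assumes that the "Bayesian calculation" amounts to multiplying the per-class probabilities for each base type and normalising so that the four results sum to one (that is, equal prior probabilities for the four base types). The function names are hypothetical. Under that assumption the class table from the example gives a probability of about .854 for T and a confidence of about 8.4.

```python
import math


def combine(rows):
    """Multiply the class probabilities column by column and normalise.

    rows holds one (A, C, G, T) tuple of probabilities per class.
    """
    products = [math.prod(row[i] for row in rows) for i in range(4)]
    total = sum(products)
    return {base: p / total for base, p in zip("ACGT", products)}


def confidence(probability):
    """Convert a probability of being correct to the decibel type scale."""
    return -10.0 * math.log10(1.0 - probability)


classes = [
    (0.766, 0.00154, 0.00154, 0.231),  # dye primer, top strand
    (0.0333, 0.0333, 0.0333, 0.9),     # dye terminator, top strand
    (0.316, 0.228, 0.228, 0.228),      # dye primer, bottom strand
]
posterior = combine(classes)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3), round(confidence(posterior[best]), 1))
```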

2.11.5.4 The Quality Calculation
The Quality Calculation described here (which is available from the gap4 File menu) applies
either of the two simple consensus calculations (see Section 2.11.5.1 [Consensus Calculation
Using Base Frequencies], page 258) and (see Section 2.11.5.2 [Consensus Calculation Using
Weighted Base Frequencies], page 259) to the data for each strand of the DNA separately.
It produces, not a consensus sequence, but an encoding of the "quality" of the data which
defines whether it has been determined on both strands, and whether the strands agree.
This quality is used as the basis for problem searches, such as find next problem, and the
Quality Display within the Template Display (see Section 2.5.1.5 [Quality Plot], page 137).
The categories of data and the codes produced are shown in the table. For example ’c’
means bad data on one strand is aligned with good data on the other.
     +Strand  -Strand
a    Good     Good (in agreement)
b    Good     Bad
c    Bad      Good
d    Good     None
e    None     Good
f    Bad      Bad
g    Bad      None
h    None     Bad
i    Good     Good (disagree)
j    None     None
the "Configure cutoffs" command in the

2.11.6 List Consensus Confidence
The Confidence Value consensus algorithm (see Section 2.11.5.3 [Consensus Calculation
Using Confidence Values], page 259) produces a consensus sequence for which the expected
error rate for each base is known. The option described here (which is available from the
gap4 View menu) uses this information to calculate the expected number of errors in a
particular consensus sequence and to tabulate them.
The decibel type scale introduced in the Phred program uses the formula
-10 x log10(error rate) to produce confidence values for the base calls. A confidence value of
10 corresponds to an error rate of 1/10; 20 to 1/100; 30 to 1/1000; etc.
So for example, if 50 bases in the consensus had confidence 10, we would expect those 50
bases (with an error rate of 1/10) to contain 5 errors; and if 200 bases had confidence 20, we
would expect them to contain 2 errors. If these 50 bases with confidence 10, and 200 bases
with confidence 20 were the least accurate parts of the consensus, they are the bases which
we should check and edit first. In so doing we would be dealing with the places most likely
to be wrong, and would raise the confidence of the whole consensus. The output produced
by List Confidence shows the effect of working through all the lowest quality bases first,
until the desired level of accuracy is reached. To do this it shows the cumulative number
of errors that would be fixed by checking every consensus base with a confidence value less
than a particular threshold.
The List Confidence option is available from within the Commands menu of the Contig
Editor and the main gap4 View menu. From the main menu the dialogue simply allows
selection of one or more contigs. Pressing OK then produces a table similar to the following:
Sequence length = 164068 bases.
Expected errors = 168.80 bases (1/971 error rate).

Value  Frequencies  Expected  Cumulative   Cumulative  Cumulative
                    errors    frequencies  errors      error rate
--------------------------------------------------------------------------
    0            0      0.00            0        0.00       1/971
    1            1      0.79            1        0.79       1/976
    2            0      0.00            1        0.79       1/976
    3            3      1.50            4        2.30       1/985
    4           30     11.94           34       14.24       1/1061
    5            2      0.63           36       14.87       1/1065
    6          263     66.06          299       80.94       1/1867
    7          151     30.13          450      111.06       1/2841
    8          164     25.99          614      137.06       1/5168
    9           96     12.09          710      149.14       1/8344
   10           80      8.00          790      157.14       1/14069
The output above states that there are 164068 bases in the consensus sequence with an
expected 169 errors (giving an average error rate of one in 971). Next it lists each confidence
value along with its frequency of occurrence and the expected number of errors (as explained
above, frequency x error rate). For any particular confidence value the cumulative columns
state: how many bases in the sequence have the same or lower confidence, how many errors
are expected in those bases, and the new error rate if all these bases were checked and all
the errors fixed.

Above it states that there are 790 bases with confidence values of 10 or less, and estimates
there to be 157 errors in those 790 bases. As we expect there to be about 169 errors in the
whole consensus this implies that manually checking those 790 bases would leave only 12
undetected errors. Given that the sequence length is 164068 bases this means an average
error rate of 1 in 14069. It is important to note that by using this editing strategy, this
error rate would be achieved by checking only 0.48% of the total number of consensus bases.
This strategy is realised by use of the consensus quality search in the gap4 Contig Editor
(see Section 2.6.6.7 [Search by Consensus Quality], page 175).
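
The arithmetic behind the List Confidence table can be reproduced with a short sketch. This
is an illustration only, not gap4's own code: the function names are invented here, the
frequencies are copied from the example output, and the total of 168.80 expected errors is
taken from the output header because the higher confidence values are not listed.

```python
# Frequencies of consensus confidence values 0..10, copied from the example output.
FREQUENCIES = [0, 1, 0, 3, 30, 2, 263, 151, 164, 96, 80]
SEQUENCE_LENGTH = 164068
TOTAL_EXPECTED_ERRORS = 168.80  # quoted in the output header


def error_probability(confidence):
    """Invert confidence = -10 * log10(probability of error)."""
    return 10 ** (-confidence / 10)


def cumulative_table(frequencies, length, total_errors):
    """Return rows of (value, freq, expected, cum_freq, cum_errors, one_in)."""
    rows = []
    cum_freq = 0
    cum_err = 0.0
    for value, freq in enumerate(frequencies):
        expected = freq * error_probability(value)
        cum_freq += freq
        cum_err += expected
        # Errors left if every base at or below this confidence were checked and fixed.
        remaining = total_errors - cum_err
        rows.append((value, freq, expected, cum_freq, cum_err, length / remaining))
    return rows


TABLE = cumulative_table(FREQUENCIES, SEQUENCE_LENGTH, TOTAL_EXPECTED_ERRORS)
for value, freq, expected, cum_freq, cum_err, one_in in TABLE:
    print(f"{value:5d} {freq:6d} {expected:8.2f} {cum_freq:6d} {cum_err:8.2f}   1/{one_in:.0f}")
```

Because the total is only known to two decimal places, the final column can differ from the
gap4 output by a few units in the last digits.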

2.11.7 List Base Confidence
The various base-callers may produce a confidence value for each base call. Previous sections
describe how this may be used to produce a consensus sequence along with a consensus
confidence.
This function tabulates the frequency of each base confidence value along with a count
of how many times it matches or mismatches the consensus. Given that the standard scale
for confidence values follows the -10log10(probability of error) formula, we can determine
what the expected frequency of mismatches should be for any particular confidence value.
By comparing this with our observed frequencies we then have a powerful summary of the
amount of misassembled data.
Total bases considered : 45270
Problem score          : 1.337130
Conf.    Match    Mismatch    Expected           Over-
value     freq        freq        freq  representation
--------------------------------------------------------------------
    0        0           0        0.00            0.00
    1        0           0        0.00            0.00
    2        0           0        0.00            0.00
    3        0           0        0.00            0.00
    4       37          22       23.49            0.94
    5        0           0        0.00            0.00
    6       89          46       33.91            1.36
    7      119          26       28.93            0.90
    8      256          37       46.44            0.80
    9      368          30       50.11            0.60
   10      669          31       70.00            0.44
...
In the above example we see that there are 59 sequence bases with confidence 4, of which
37 match the consensus and 22 do not. If we work on the assumption that the consensus
is correct then we would expect approximately 40% of these to be incorrect, but we have
measured 37% to be incorrect (22/59) giving 0.94 fraction of the expected amount.
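
The calculation for the confidence 4 row can be written out as a short sketch. The function
names are invented for this illustration and are not part of gap4; the only formula used is
the -10log10(probability of error) scale described above.

```python
def expected_mismatches(confidence, match, mismatch):
    """Expected number of wrong base calls among all bases with this confidence."""
    total = match + mismatch
    return total * 10 ** (-confidence / 10)


def over_representation(confidence, match, mismatch):
    """Observed mismatches divided by expected mismatches (0.0 if none expected)."""
    expected = expected_mismatches(confidence, match, mismatch)
    return mismatch / expected if expected else 0.0


# Confidence 4: 37 matches and 22 mismatches, as in the example output.
print(f"{expected_mismatches(4, 37, 22):.2f}")
print(f"{over_representation(4, 37, 22):.2f}")
```

A ratio close to 1.0 means the base confidences and the assembly agree; values well above
1.0 mean more disagreement with the consensus than the confidences predict.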
For a more problematic assembly, we may see a section of output like this:
Total bases considered : 1617511
Problem score          : 311.591358
Conf.    Match    Mismatch    Expected           Over-
value     freq        freq        freq  representation
--------------------------------------------------------------------
...
   20    13432         384      138.16            2.78
   21    23384         851      192.51            4.42
   22    18763         487      121.46            4.01
   23    13712         300       70.23            4.27
   24    21182         363       85.77            4.23
   25    20466         218       65.41            3.33
   26     9752         123       24.80            4.96
   27    23071         282       46.60            6.05
   28    13816         158       22.15            7.13
   29    27514         166       34.85            4.76
   30    15664         140       15.80            8.86
...
We can see here that the observed mismatch frequency is far greater than the expected
number. This indicates the number of misassemblies (or SNPs in the case of mixed samples)
within this project and is reflected by the combined “Problem score”. This score is simply
the sum of the final column (or 1 over that column for values less than 1.0).

2.12 Miscellaneous functions
2.12.1 Complement a Contig
This function (which is available from the gap4 Edit menu) is used to complement a contig,
which means that it will complement and reverse all its readings and reorder them to
produce a contig with the opposite orientation. It operates on a single contig selected via a
dialogue box.
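
As an illustration of what complementing involves, the sketch below reverses and complements
every reading in a small three-reading contig and recomputes the reading positions. The data
layout (name, 1-based position, padded sequence) and the function names are assumptions made
for this example; gap4's internal representation is different.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base and reverse the order; pads (*) are left unchanged."""
    return seq.translate(COMPLEMENT)[::-1]


def complement_contig(readings, contig_length):
    """readings: list of (name, position, sequence) with 1-based left positions."""
    flipped = []
    for name, pos, seq in readings:
        right_end = pos + len(seq) - 1
        # The old right end becomes the new left end, counted from the other side.
        new_pos = contig_length - right_end + 1
        flipped.append((name, new_pos, reverse_complement(seq)))
    # Reorder so that readings are again sorted from left to right.
    return sorted(flipped, key=lambda r: r[1])


CONTIG = [
    ("Sequence1", 1, "GATTCAAAGAC"),
    ("Sequence2", 3, "TTCAA*GACGG"),
    ("Sequence3", 5, "CAAAGACGGATC"),
]
for name, pos, seq in complement_contig(CONTIG, 16):
    print(name, pos, seq)
```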

2.12.2 Enter Tags
This routine (which is available from the gap4 Edit menu) is used to add a set of tags (see
Section 2.2.7 [Annotation readings and contigs], page 121) stored in a file, to the database.
The file format (see below) is identical to the output produced by the "save tags to file"
option of "Find Repeats". See Section 2.8.4 [Find Repeats], page 233. The format is a
subset of the experiment file format. See Section 11.3 [Experiment Files], page 552. The
two are close enough for Enter tags to use an experiment file as input. The only input
required is the name of the file to read and a file browser can be used to aid its selection.
Note that "Enter tags" will remove any results plotted in the Contig Comparator.
The start of a typical file is shown below.
CC   Repeat number 0, end 1
ID   zf48g3.s1
TC   REPT b 1031..1072
TC        Repeats with contig zf48g3.s1, offset 957
CC   Repeat number 0, end 2
ID   zf48g3.s1
TC   REPT b 957..998
TC        Repeats with contig zf48g3.s1, offset 1031
CC
CC   Repeat number 1, end 1
ID   zf48g3.s1
TC   REPT b 1102..1130
TC        Repeats with contig zf48g3.s1, offset 953
CC   Repeat number 1, end 2
ID   zf48g3.s1
TC   REPT b 953..981
TC        Repeats with contig zf48g3.s1, offset 1102
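
For readers who want to generate or inspect such files with their own scripts, here is a
minimal parsing sketch. It assumes only what is visible in the example: a two-letter record
type at the start of each line, a TC line of the form "TYPE strand start..end", and further
TC lines carrying the tag comment. It is not the parser used by gap4 and it ignores the rest
of the experiment file format.

```python
import re

SAMPLE = """\
CC   Repeat number 0, end 1
ID   zf48g3.s1
TC   REPT b 1031..1072
TC        Repeats with contig zf48g3.s1, offset 957
CC   Repeat number 0, end 2
ID   zf48g3.s1
TC   REPT b 957..998
TC        Repeats with contig zf48g3.s1, offset 1031
"""

# Assumed shape of the first TC line of a tag: 4-letter type, strand, start..end.
TAG_START = re.compile(r"^(\w{4}) (\S) (\d+)\.\.(\d+)$")


def parse_tags(text):
    """Return a list of tag dictionaries, one per tag found in the text."""
    tags = []
    current_id = None
    for line in text.splitlines():
        record, data = line[:2], line[2:].strip()
        if record == "ID":
            current_id = data
        elif record == "TC":
            match = TAG_START.match(data)
            if match:
                tags.append({
                    "id": current_id,
                    "type": match.group(1),
                    "strand": match.group(2),
                    "start": int(match.group(3)),
                    "end": int(match.group(4)),
                    "comment": "",
                })
            elif tags:
                # Any other TC line is treated as a continuation of the comment.
                tags[-1]["comment"] = (tags[-1]["comment"] + " " + data).strip()
    return tags


for tag in parse_tags(SAMPLE):
    print(tag)
```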

2.12.3 Shuffle Pads
This function realigns all of the sequences within a contig to improve pad placement. It
can be considered the replacement for the old Shuffle Pads command within the contig
editor. (Being outside of the editor allows it to be scripted automatically.) The contigs
to realign are specified as a single contig, all contigs, or a set of contig names read from
a file or a gap4 list. Currently the entire contig will be shuffled, which can take some time
on large contigs. In future we plan to allow regions to be specified.

Padding (gapping) problems originate in many sequence assembly algorithms, including
gap4’s, where sequences are aligned against a consensus rather than a profile. As an example
let us consider aligning TCAAGAC (Sequence4) to the following contig:
Sequence1:  GATTCAAAGAC
Sequence2:    TTCAA*GACGG
Sequence3:      CAAAGACGGATC

Consensus:  GATTCAAAGACGGATC

The consensus contains a triple A because that is the most likely sequence. However, we
have three possible ways to align a sequence containing double A:
alignment1:    TCAA*GAC
alignment2:    TCA*AGAC
alignment3:    TC*AAGAC
Consensus:  GATTCAAAGACGGATC

All of these have identical alignment scores because the cost of inserting a gap into the
sequence is identical at all points. Alignment algorithms typically always pick the same end
to place pads (i.e. left end or right end), but after contigs get complemented and more data
inserted this often yields pads at both ends, as follows:
Sequence1:  GATTCAAAGAC
Sequence2:    TTCAA*GACGG
Sequence3:      CAAAGACGGATC
Sequence4:     TC*AAGAC
Consensus:  GATTCAAAGACGGATC

The new Shuffle Pads algorithm implements the ideas put forward by Anson and
Myers in ReAligner. It aligns each sequence against a consensus vector, in which the entire
column of bases at each consensus position is used to compute match, mismatch and indel
scores. The result is that pads generally get shuffled to the same end (not necessarily
always left or always right) and the total number of disagreements with the consensus is
reduced.
For speed we assume that the new alignment will deviate only slightly from the old
one, and so a narrow “band size” is used. This parameter may be adjusted if required, but
at the expense of speed.

2.12.4 Show Relationships
This function (which is available from the gap4 View menu) is used to show the relationships
of the gel readings in the database in three ways.
1. All contig descriptor lines followed by all gel descriptor lines.
2. All contigs one after the other, sorted: for each contig show its contig descriptor
   line followed by all its gel descriptor lines, sorted on position from left to right.
3. Selected contigs: show the contig line and, in left to right order, the gel readings. This
   can be done for a list or a file of contigs. For a single contig the output can be restricted
   to a user-defined region.

In the dialogue, a single contig, all contigs, a file or a list of contigs can be selected.
For a single contig, the contig identifier and range selector become enabled. Choosing a
file or list enables the "browse" button which will invoke either the file or list browser
respectively. When "all" contigs is selected a further choice is available: whether to ‘Show
readings in positional order’. This question determines whether to output using method
1 (No) or 2 (Yes) listed above.
The function is particularly useful for creating files or lists of reading names. To create a
list of reading names run Show Relationships to produce the desired output in the Output
Window. Then either use cut and paste from this window to a list editor, or use the right
mouse button in the output window to request the "Output to list" option. In this latter
case the "CONTIG LINES" and "GEL LINES" header lines should be removed (although most
functions will happily ignore, with warnings, a list containing unknown reading names).
In the output window the reading names are underlined, indicating that they are hyperlinks. Double clicking on a name with the left mouse button will bring up the contig editor
showing the start of that sequence, or it will move an existing contig editor to display that
position. (You may wish to turn off the "Scroll on output" button if you do not wish the
text output window to scroll to the bottom as it displays the "Edit contig" title.) Clicking
on a reading name with the right mouse button will bring up a popup menu containing Edit
contig, Template display, List reading notes and List contig notes.
Below is an example showing a contig from position 1 to 689. The leftmost gel reading is
number 6 and has archive name HINW.010; the rightmost gel reading is number 2 and
has archive name HINW.004. On each gel descriptor line is shown: the name of the archive
version, the gel number, the position of the left end of the gel reading relative to the left
end of the contig, the length of the gel reading (if this is negative it means that the gel
reading is in the opposite orientation to its archive), the number of the gel reading to the
left and the number of the gel reading to the right.
CONTIG LINES
 CONTIG LINE  LENGTH               ENDS
                                LEFT  RIGHT
          48     689               6      2
GEL LINES
 NAME      NUMBER POSITION LENGTH    NEIGHBOURS
                                    LEFT  RIGHT
 HINW.010       6        1   -279      0      3
 HINW.007       3       91   -265      6      5
 HINW.009       5      137   -299      3     17
 HINW.999      17      140    273      5     12
 HINW.017      12      193    265     17     18
 HINW.031      18      385   -245     12      2
 HINW.004       2      401   -289     18      0
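
The left and right neighbour columns form a doubly linked list of readings, with 0 marking
the end of the chain. The following sketch walks that list using the values from the example
output; the dictionary layout and function names are assumptions made for illustration only.

```python
# number: (name, position, length, left neighbour, right neighbour)
GEL_LINES = {
    6: ("HINW.010", 1, -279, 0, 3),
    3: ("HINW.007", 91, -265, 6, 5),
    5: ("HINW.009", 137, -299, 3, 17),
    17: ("HINW.999", 140, 273, 5, 12),
    12: ("HINW.017", 193, 265, 17, 18),
    18: ("HINW.031", 385, -245, 12, 2),
    2: ("HINW.004", 401, -289, 18, 0),
}
LEFT_END = 6  # from the contig line


def walk_left_to_right(gels, left_end):
    """Follow right-neighbour links from the leftmost reading until 0 is reached."""
    order = []
    number = left_end
    while number != 0:
        order.append(number)
        number = gels[number][4]
    return order


def contig_length(gels):
    """The contig ends at the furthest right end of any reading."""
    return max(pos + abs(length) - 1 for _, pos, length, _, _ in gels.values())


def complemented(gels):
    """Names of readings stored in the opposite orientation (negative length)."""
    return [name for name, _, length, _, _ in gels.values() if length < 0]


print(walk_left_to_right(GEL_LINES, LEFT_END))
print(contig_length(GEL_LINES))
```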

2.12.5 Contig Navigation
This function, which can be found under the View menu, allows the user to navigate to
areas of interest within contigs. When Contig Navigation is selected a dialogue box is raised
asking for the name of a file containing the regions. The format is the same as that used by
the Search by File function. See Section 2.6.6.8 [Search by file], page 176.

The user can either enter the name of the file or browse for it using the browse button.
Once OK is pressed, the file is loaded into a table for viewing.

The table has three fixed headers: contigID, Position and Problem Type. Clicking on
any of these causes the whole table to be sorted on that column. The regions can be viewed
in any order by double clicking on a row, by selecting a row and using the next (->>)
and previous (<<-) buttons at the bottom, or by pressing the Page Up and Page Down keys.
The corresponding contig editor will be opened and moved to the position indicated. Once
a row has been clicked on, its background will be changed to highlight that it has been
visited.
The reset button will clear the table and re-read the data from file. Auto-close editors
is set on by default. It closes any unneeded editors when the user selects a region on a
different contig. The Show Traces mode will automatically display some traces based on
the same mechanisms used in the editor's ‘Auto-display Traces’ option (see Section 2.6.8.2
[Trace Display Settings], page 181). Save will save the table list, including all rows previously
marked as selected, back to the file. If this file is re-read at a later stage then the table will
have the same sort order and tagging as when saved.
The format of the input file is as follows:
contig identifier position comment
If the comment contains “To:” and a number then the region indicator at the bottom of
the navigator window updates to show the size of the element; otherwise it just has a line
showing the position of the start. Finally the comment may end in the ‘nul’ character to
indicate that it has already been visited. (This is utilised by the Save command.)
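
A script that prepares or checks such a file might parse each line along the following
lines. This is a sketch under stated assumptions: fields are taken to be separated by
whitespace, the number is taken to follow “To:” directly or after spaces, and the sample
lines in the usage below are invented. None of these details is confirmed by this manual.

```python
import re

# Assumed form of the optional end position inside the comment.
TO_FIELD = re.compile(r"To:\s*(\d+)")


def parse_region(line):
    """Split one navigation line into contig, position, comment, end and visited."""
    line = line.rstrip("\n")
    visited = line.endswith("\0")  # a trailing nul marks an already visited row
    fields = line.rstrip("\0").split(None, 2)
    contig, position = fields[0], int(fields[1])
    comment = fields[2] if len(fields) > 2 else ""
    match = TO_FIELD.search(comment)
    end = int(match.group(1)) if match else None
    return {
        "contig": contig,
        "position": position,
        "comment": comment,
        "end": end,
        "visited": visited,
    }


# Hypothetical lines, for illustration only.
print(parse_region("zf48g3.s1 1031 Repeat To: 1072\0\n"))
print(parse_region("zf48g3.s1 957 Low quality\n"))
```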

2.12.6 Sequence Search
The purpose of this function (which is available from the gap4 View menu) is to find
matches between the consensus sequence and short segments of sequence defined by the
user. The segments of sequence (or "strings") can be typed into the dialogue provided
or can be the sequences covered by consensus tag types (see Section 2.2.7.1 [Tag types],
page 121) selected by the user. The latter mode hence provides a way of checking to see
if a tagged segment of the sequence occurs elsewhere in the consensus. The function was
previously known as "Find Oligos".

2.12.7 Extract Readings
This function (which is available from the gap4 File menu) is used to produce copies of
readings stored in the assembly database. The readings, and information about them,
are written to disk in experiment file format (see Section 11.3 [Experiment file format],
page 552) and will include any edits made and tags created. They are written in their
original orientation. No change is made to the copies in the assembly database: this process
creates copies and should not be confused with "Disassemble readings". See Section 2.9.1.2
[Disassemble Readings], page 240. The names of the readings to extract can be read from a
list or a file of file names. Clicking on the browse button will invoke an appropriate browser
dialogue. If just a single reading is to be extracted choose "single" and enter the filename
instead of the file or list of filenames. The files are written into the "Destination directory"
with their original file names.

If required, the files will include additional information suitable for processing by either
"Enter pre-assembled data" or "Directed assembly" (see Section 11.3.1 [Experiment file
format explained], page 552). Both contain the ON and AV Experiment File records.
Pre-assembled data also contains SE and PC records whilst Directed assembly contains AP
records. It is recommended that the Directed assembly format is always used in preference
to the Pre-assembled format.
To merge databases use the "Directed assembly" format to output the contigs required.
Then, within the database into which you wish to merge the data, use the Directed Assembly
(see Section 2.7.2 [Directed Assembly], page 211) command. By using Directed Assembly with
new blank databases it is also possible to create database subsets or to split databases.

2.12.8 Automatic Clipping by Quality and Sequence Similarity
Our consensus calculation algorithms use the data for all the unclipped bases covering
each position in a contig. However, some assembly engines may leave the ends of readings
unaligned, and these unaligned bases could therefore lead to the production of an incorrect
consensus. The two clipping methods described here (which are available from the gap4
Edit menu) are designed to overcome this potential problem.
In addition to improving the reliability of the consensus calculation, clipping in this
way tidies up the alignments, so helping the user to concentrate on the better data. It is
important to note that in no case is the clipped sequence thrown away. The contig editor
can show this hidden data, and the clip points may be manually adjusted to reveal any
clipped sequence.

2.12.8.1 Difference Clipping

The difference clipping method (which is available from the gap4 Edit menu) works
in stages. First it calculates the most likely consensus sequence. Then it compares each
reading with that consensus sequence and identifies areas at the ends of the reading where
there are enough differences to indicate the possibility of badly aligned bases. The clip
points are adjusted accordingly.
To identify the clip points for each reading the algorithm first finds a good matching
segment near the middle of the reading. It then steps, base by base, from this point to the
left, accumulating a score as it goes by using +1 for a match and -2 for a mismatch. It
sets the left clip point at the position of the highest score. The right clip point is set in
an equivalent way. These new clip points are used only if they are more severe than the
existing ones. The portions of readings which have been clipped are then tagged using a
DIFF tag type. To see which segments have been clipped use the contig editor search tool.
After clipping the algorithm then identifies any holes (breaks in the contigs) that may
have been created and fills them up again by extending the sequence(s) with the fewest
number(s) of expected errors.
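
The scoring walk used to place the clip points can be sketched as follows. The sketch takes
the seed position as given (finding the good matching segment is not shown), compares a
reading and consensus that are already aligned position by position, and uses invented
function names. It illustrates the +1/-2 rule only and is not gap4's implementation.

```python
def best_extent(match_flags):
    """Number of steps away from the seed at which the running score peaks."""
    score = best = 0
    keep = 0
    for step, is_match in enumerate(match_flags, start=1):
        score += 1 if is_match else -2
        if score > best:
            best, keep = score, step
    return keep


def difference_clip(read, consensus, seed):
    """Return (first, last) 0-based indices of the segment to keep."""
    left = best_extent(read[i] == consensus[i] for i in range(seed - 1, -1, -1))
    right = best_extent(read[i] == consensus[i] for i in range(seed + 1, len(read)))
    return seed - left, seed + right


# The first two bases of the reading disagree with the consensus and are clipped.
print(difference_clip("TTGACGTACGT", "ACGACGTACGT", seed=6))
```

Because a mismatch costs twice what a match gains, an isolated mismatch followed by a long
run of matches stays inside the kept segment, whereas a cluster of mismatches at the end of
a reading is clipped.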

2.12.8.2 Quality Clipping

The quality clipping function (which is available from the gap4 Edit menu) clips the
ends of readings when the average (over 31 bases) confidence value is lower than a
user-defined threshold. As with the difference clipping method, the clips are only adjusted
when the newly calculated clip points are more stringent than the originals.
After clipping Gap4 then identifies any holes (breaks in the contigs) that may have been
created and fills them up again by extending the sequence(s) with the fewest number of
expected errors.
An example output follows.
Hole from 32652 to 32725: extend #1378 and #1385 with 3.157324 expected errors
We have observed that when using confidence values expressed as -10*log(err rate), it is
sometimes better not to clip using the confidence values, but to use the difference clipping
method (see Section 2.12.8.1 [Difference Clipping], page 274).

2.12.8.3 Quality Clip Ends

This function performs a similar analysis to Quality Clipping, but trims only the
ends of contigs. This can be useful as Phrap automatically clips where sequences disagree,
but the ends of contigs will not be trimmed in such a manner. By trimming such poor
quality data from the ends, Find Internal Joins may find some problematic matches.

2.12.8.4 N-Base Clipping

The purpose of this function is to remove runs of Ns or -s from the ends of sequences.
Other bases may be interspersed in a run of dashes and the run will still be clipped,
provided there are a sufficient number of non-A/C/G/T base calls. The exact algorithm for
determining where a ’run’ will stop is as follows:
1. Set score to zero.
2. For each base call add 1 for N or -, -1 for A, C, G or T, zero for anything else.
3. Terminate when the score < -10.
4. Set the clip point at the highest score observed.

Generally this will have no effect (when on good data). It can never ’grow’ a sequence
(by extending the cutoffs into the good data). It will never form a hole in a contig by
clipping all sequences in a region (as it will extend the data from both ends of the hole to
join it back together again).
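
The four steps of the run-finding algorithm translate almost directly into code. The sketch
below handles the left end of a single sequence only; the function name is invented, and the
comparison with existing cutoffs and the hole filling described above are not shown.

```python
def n_clip_left(seq):
    """Return how many bases to clip from the left end of seq."""
    score = 0          # step 1
    best = 0
    clip = 0
    for i, base in enumerate(seq.upper()):
        if base in "N-":      # step 2: +1 for N or -
            score += 1
        elif base in "ACGT":  # step 2: -1 for A, C, G or T
            score -= 1
        # anything else (e.g. IUPAC ambiguity codes) leaves the score unchanged
        if score > best:
            best, clip = score, i + 1
        if score < -10:       # step 3
            break
    return clip               # step 4: position of the highest score


print(n_clip_left("NNNNANN" + "ACGT" * 5))
print(n_clip_left("ACGT" * 5))
```

To clip the right end, the same function can be applied to the reversed sequence.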

2.13 Results Manager
Some commands within gap4 produce "results" that are updated automatically as data
is edited. The Results Manager provides a way to list these results, and to interact with
them.
A result is an abstract term used to define any collection of data. Typically this data can
be displayed, manipulated and is usually updated automatically when changes are made that
affect it. Each set of matches from a particular search plotted on the Contig Comparator
(see Section 2.4 [Contig Comparator], page 126) is a result, as are entire displays such as
the Template Display.

2.14 Lists
For many operations it is convenient to be able to process sets of data together, for example
to calculate a consensus sequence for a subset of the contigs. To facilitate this gap4 uses
lists.
Most gap4 commands dealing with batches of files or sets of readings or contigs can
use either files of filenames or lists. When selecting list names from within dialogues the
"browse" button will display a window containing all the currently existing lists. To select
a list simply double click on the list name. Alternatively the name may simply be typed in.
The List menu on the main menubar contains commands to Edit, Create, Delete, Copy,
Load, and Save lists. Some of these display a list editor. This is simply a scrollable text
window supporting simple editing facilities (see Section 10.2.3 [Text Windows], page 524).
The "Clear" button clears the list. The "Ok" button removes the list editor window.
It is not necessary to use "Ok" here before supplying the list name for input to another
option.

2.14.1 Special List Names
Some lists are automatically updated or are generated on-the-fly as needed. The lists named
"contigs" and "readings" correspond to the currently selected contigs in the contig selector
window and the currently selected readings in the template displays. Note that lists (with
any names) can also be created from selected items in the contig editor. See Section 2.6.8.18
[Set Output List], page 186. The "allcontigs" and "allreadings" lists are created as needed
and always contain an identifier for every contig and for every reading respectively.
Because of the way the lists are implemented, as is outlined below, there are some useful
"tricks" that can be employed. A list name consisting of a contig identifier surrounded by
square brackets (’[’ and ’]’) will cause the creation of a list containing all of the readings
within that contig. For example, to use the Extract Readings option (see Section 2.12.7
[Extract Readings], page 273) to extract all the readings from contig ’xb54f8.s1’, the list
name given in the Extract Readings dialogue would be ’[xb54f8.s1]’.
A list name surrounded by curly brackets (’{’ and ’}’) will cause the creation of a list
containing all of the readings in the contigs named in the specified list name. So ’{contigs}’
is equivalent to all the readings in the contigs contained in the ’contigs’ list. Hence the
’allreadings’ list is identical to ’{allcontigs}’.
These tricks can be used anywhere where a list name is required except for editing and
deletion of lists. As a final example, to produce a file of filenames for the currently selected
contigs, save the list named ’{contigs}’ to a file.
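
The behaviour of the special list names can be modelled with a small resolver. This is a
sketch only: the in-memory tables, the reading name xb56b6.s1 and the contig xc04a1.s1 are
hypothetical, and gap4 itself implements lists in Tcl against the assembly database.

```python
# Hypothetical database: contig identifier -> names of the readings it contains.
CONTIG_READINGS = {
    "xb54f8.s1": ["xb54f8.s1", "xb56b6.s1"],
    "xc04a1.s1": ["xc04a1.s1"],
}
# Ordinary named lists, e.g. the current selection in the contig selector.
LISTS = {"contigs": ["xb54f8.s1"]}


def resolve(name):
    """Expand a list name into the items it stands for."""
    if name == "allcontigs":
        return list(CONTIG_READINGS)
    if name == "allreadings":
        return resolve("{allcontigs}")
    if name.startswith("[") and name.endswith("]"):
        # [contig] -> all readings within that contig
        return list(CONTIG_READINGS[name[1:-1]])
    if name.startswith("{") and name.endswith("}"):
        # {list} -> all readings in every contig named in that list
        readings = []
        for contig in resolve(name[1:-1]):
            readings.extend(CONTIG_READINGS[contig])
        return readings
    return list(LISTS[name])


print(resolve("[xb54f8.s1]"))
print(resolve("{contigs}"))
print(resolve("allreadings"))
```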

2.14.2 Basic List Commands
The basic operations that can be performed on lists include copying, loading, saving, editing,
creation and deletion. Joining and splitting can only be performed using the list editors
and using cut and paste between windows.
The Load and Save commands require a list name and a file name. If only the name of
the file is given the list is assumed to have the same name. If it is desired to load or save

2.14.3 Contigs To Readings Command
This command produces a list or file of reading names for a single contig or for a set of
contigs. The user interface provides a dialogue to select the contigs and to select a list name
or filename.

2.14.4 Minimal Coverage Command
This command produces a minimal list of readings that together span the entire length of
a contig. The dialogue allows contig names to be defined using a list or a file of filenames.
The output produced can be sent to a list or a file of filenames. An example use of this
function is to determine a minimal set of overlapping readings for resequencing.
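
This manual does not describe how gap4 selects the readings. One standard way to obtain a
minimal spanning set is a greedy interval cover, sketched below as an assumption rather than
as gap4's method. The reading extents are derived from the Show Relationships example
(position and absolute length), and the function name is invented.

```python
def minimal_coverage(readings, contig_length):
    """readings: list of (name, start, end), 1-based inclusive. Greedy interval cover."""
    remaining = sorted(readings, key=lambda r: r[1])
    chosen = []
    covered_to = 0
    i = 0
    while covered_to < contig_length:
        best = None
        # Among readings that start within the covered region, take the one
        # that extends furthest to the right.
        while i < len(remaining) and remaining[i][1] <= max(covered_to, 1):
            if best is None or remaining[i][2] > best[2]:
                best = remaining[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError("no reading overlaps position %d" % (covered_to + 1))
        chosen.append(best[0])
        covered_to = best[2]
    return chosen


READINGS = [
    ("HINW.010", 1, 279), ("HINW.007", 91, 355), ("HINW.009", 137, 435),
    ("HINW.999", 140, 412), ("HINW.017", 193, 457), ("HINW.031", 385, 629),
    ("HINW.004", 401, 689),
]
print(minimal_coverage(READINGS, 689))
```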

2.14.5 Unattached Readings Command
This command finds the contigs that consist of single readings. The output can be written
to a list or a file of filenames. One example use of the option is for tidying up projects by
removing the trivial and unrequired contigs. In this case the list would be used as input to
disassemble readings (see Section 2.9.1.2 [Disassembling Readings], page 240).

2.14.6 Highlight Readings List
This simply loads the “readings” list so that the template display and contig editor
auto-highlight the chosen readings. This function is the same as the Highlight Readings List
option in the template display.

2.14.7 Search Sequence Names
This command allows searching for sequences matching a given pattern. The function
produces both a list in the text output window and a gap4 "list" of reading names.
The highlighted output is clickable, with the left mouse button invoking the contig editor
and the right mouse button displaying a popup-menu allowing additional operations (contig
editor, template display, reading notes and contig notes).
The text search may be performed as either case-sensitive or case-insensitive. Additionally
the following pattern search types are available.
sub-string Matches any reading name where the pattern matches all or part of the name.
wild-cards Searches for a pattern using normal filename wild-card matching syntax. So *
           matches any sequence of characters, ? matches any single character, [chars]
           matches a set of characters defined by chars, and \char matches the literal
           character char. Character sets may use a minus sign to match a range. For
           example x*.[fr][1-9] matches any name starting with x and ending with a
           full stop followed by either f or r followed by a single digit between 1 and 9
           inclusive. To match a substring using wild-cards prepend and/or append * to
           the search string.
regular expression
           This uses the Tcl regular expression syntax to perform a match. These patterns
           are naturally sub-strings unless anchored to one or both ends using the
           ^expression$ syntax. A full description of regular expressions is beyond the
           scope of this manual.
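
The three pattern types can be tried out away from gap4 with the following sketch. It uses
Python's fnmatch and re modules as stand-ins for the Tcl matching that gap4 uses, so it is an
approximation: Python's fnmatch does not support the \char escape, and Python and Tcl regular
expressions differ in some details. The reading names are invented for the example.

```python
import fnmatch
import re

NAMES = ["xb54f8.f1", "xb54f8.r2", "xb54f8.s1", "yc12a3.f1"]  # hypothetical names


def substring(pattern, names):
    """Names containing the pattern anywhere."""
    return [n for n in names if pattern in n]


def wildcard(pattern, names):
    """Names matching a filename-style pattern over the whole name."""
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]


def regexp(pattern, names):
    """Names matched by a regular expression; unanchored patterns act as sub-strings."""
    return [n for n in names if re.search(pattern, n)]


print(substring("54f", NAMES))
print(wildcard("x*.[fr][1-9]", NAMES))
print(wildcard("*54f*", NAMES))
print(regexp(r"^x.*\.s1$", NAMES))
```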

2.14.8 Search Template Names
This searches for template names matching a given pattern. The list produced will contain
just the template names, but the information listed in the text output window lists the
template names and the readings contained within each template. The reading names
are hyperlinks and so double left-clicking on them will bring up the contig editor whilst
right-clicking brings up a popup menu.
For a description of the types of template search patterns see Section 2.14.7 [Search
Sequence Names], page 279.

2.14.9 Search Annotation Contents
This searches the contents of annotations on both the individual reading sequences and the
consensus sequences. A gap4 list will be produced containing the annotation number, contig
and position. In the text output window a more complete description is available listing the
annotation type and the contents of each annotation. Both the list and text-output window
will contain a highlighted section which is a hyperlink. Double clicking on this with the left
mouse button will bring up the contig editor at that point. Clicking with the right mouse
button will display a popup-menu with further options.
For a description of the types of annotation search patterns see Section 2.14.7 [Search
Sequence Names], page 279.

2.15 Notes
A ‘Note’ is an arbitrary piece of text which can be attached to any reading, any contig,
or to a database as a whole. Each note also contains a note type, a creation date and a
modification date. Any number of notes can be attached to each reading, contig or database.
They can be considered as positionless tags.

2.15.1 Selecting Notes
The primary interface for creating, viewing and editing notes is the Note Selector window.
This is accessible from a variety of places, including anywhere a contig or reading name (or
line in a graphical plot) is displayed, and also by using the "Edit Notebooks" command in
the main gap4 Edit menu.

The Note Selector initially starts up showing the database notes (unless selected from
a specific contig or reading plot). The picture above shows three notes attached to the
main gap4 database record. These are of type OPEN and RAWD, both of which have a specific
meaning to gap4, and type COMM.
The View Menu is used to see a list of notes for readings or contigs. If Reading Notes or
Contig Notes is selected, the interface will ask for a reading or contig identifier by adding
an extra line to the Note Selector Window, just beneath the menus. Typing one in and
pressing return will then list the notes for that reading or contig.
To speed up selection, it is possible to use the right mouse button on the Contig Selector
Window and in the contig rulers at the bottom of many plots (such as the Template Display),
to select the "List Notes" option. This will start the Note Selector if it is not already
running, and will direct it to display notes for the desired contig. Similarly, the right mouse
button can be used to popup a menu from a reading in the Template Display or from a
reading name in the Contig Editor.
To edit a note, double click anywhere in the Note Selector on the line for the note.
To delete a note, single click on the note line to highlight it and then select "Delete"
from the Note Selector Edit menu. To delete several notes at once, first highlight a range
by left clicking and dragging the mouse to mark a region of notes, and then use Delete.
Alternatively notes may be deleted by double clicking to bring up the note editor and
selecting Delete from the Note Editor File menu.

To create a new note use the "New" command from the Edit menu. The note will be
added to whatever data type is currently shown. To create a note for a particular contig,
select that contig using the Contig Notes option in the View menu, and then use New to
create a new note. New notes will have type COMM and the contents can be in any format.

2.15.2 Editing Notes
Double clicking on a note in the Note Selector, or creating a new note, will bring up the
Note Editor Window. This is a simple text editor, allowing use of keyboard arrow keys and
the mouse to position and edit text. It also has keyboard bindings for many of the simple
emacs movement commands.

At the top of the Notes Editor are three buttons. The leftmost is the File menu which
contains the "Save", "Delete" and "Exit" options. Next to this is the Type selector. This
menu name displays the currently selected note type. To change the note type select the
appropriate type from the Type menu. The final button gives access to the online Help.
Listed underneath the menu are the creation and modification dates. The creation date
is fixed when a note is created. The modification date is adjusted every time a note is
edited. (Simply viewing a note will not update the modification date, but saving changes
to it will.)
Underneath these is the note text itself. For convenience, the first line of each note is
shown in the note selector window (so it can be helpful to make it identifiable).

2.15.3 Special Note Types
Several types of note have special meanings. These include the OPEN, CLOS and RAWD note
types.

OPEN
CLOS       Notes of type OPEN and CLOS should contain pure Tcl code. If they exist,
           they will be executed when the database is opened (OPEN) and closed (CLOS).
           Take great care in creating and editing a note with these types! The purpose
           is to allow configuration options to be attached to a database, and hence allow
           for different gap4 configurations to be used when a UNIX directory contains
           more than one database. In general use of the ‘.gaprc’ file (see Section 2.20.1
           [Options Menu], page 298) is probably safer.
           If there is a problem with a database containing a malformed OPEN or CLOS
           note, it may be opened using gap4 -no_exec_notes. This will prevent gap4
           from executing the OPEN and CLOS notes and so allow them to be fixed using
           the Note Editor.

RAWD       This note specifies an alternative to the RAWDATA environment variable and
           should be set to be the full directory name for the location of the trace files for
           the database. If both the environment variable and the note exist then the
           note will take priority. The automatic use of this note can be disabled by using
           the -no_rawdata_note command line option to gap4.

INFO       When created on a reading or a contig, this note may be displayed in the contig
           editor "information line" (see Section 2.6.14 [The Editor Information Line],
           page 193) when the user moves the mouse over the editor sequence name list.

It is possible to create your own types by editing the ‘$STADENROOT/tables/NOTEDB’
file. The format is fairly self-explanatory, and is very similar to the ‘GTAGDB’ file. Each
note type should consist of the long name followed by a colon and id=4 letter short name,
optionally followed by dt="any default text for this note". Lines may be split at colons by
adding a backslash to the end of the line. See the standard ‘NOTEDB’ file for examples.

2.16 Gap4 Database Files
Gap4 stores the data for each sequencing project (e.g. the data for a single cosmid or BAC)
in a gap4 assembly database, so at the start of a sequencing project the user should employ
gap4 to create the database for the project (see Section 2.16.2 [Opening a New Database],
page 285). New databases are created with sufficient index space for around 8000 readings,
but this can be extended if required.
Gel reading data in experiment file format (see Section 11.3 [Experiment File Format],
page 552) is entered into the database using the methods available from the assembly menu
(see Section 2.7 [Entering Readings into the Database (Assembly)], page 205).
To assemble more data for the project or to edit or analyse readings already entered
the user should open the same project database (see Section 2.16.3 [Opening an Existing
Database], page 285).
Although the database files are designed to be free of corruption it is advisable to make
regular backups (see Section 2.16.4 [Making Backups of Databases], page 285).
Database names can have from one to 240 letters and must not include a full stop or
spaces. The database itself consists of two files; a file of records and an index file. If the
database is called ‘FRED’ then version 0 of the database comprises the pair of files named
‘FRED.0’ and ‘FRED.0.aux’, the latter of these being the index file. The "version" is the
character after the full stop in these filenames. Versions are not limited to numbers alone,
but must be single characters.
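The naming rules above can be sketched in Python. This is only an illustration: the function name db_file_names is hypothetical and is not part of gap4, and the sketch treats the 240 "letters" limit as a limit on characters.

```python
def db_file_names(name, version="0"):
    """Return (record_file, index_file) for a database name and version."""
    # Names are 1 to 240 characters and must not contain a full stop or spaces.
    if not 1 <= len(name) <= 240:
        raise ValueError("database name must be 1 to 240 characters long")
    if "." in name or " " in name:
        raise ValueError("database name must not contain a full stop or spaces")
    # The version is the single character after the full stop.
    if len(version) != 1:
        raise ValueError("version must be a single character")
    record_file = f"{name}.{version}"
    return record_file, record_file + ".aux"


print(db_file_names("FRED"))       # ('FRED.0', 'FRED.0.aux')
print(db_file_names("FRED", "A"))  # ('FRED.A', 'FRED.A.aux')
```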
When a database is opened for writing a ‘BUSY’ file is created. For the ‘FRED’ database
this will be named ‘FRED.0.BUSY’. When the database is closed the file is deleted. The
file is used by gap4 to signify that the database is opened for writing and is part of its
mechanism to prevent more than one person editing a database at any time. Before opening
a database for writing, gap4 checks to see if the BUSY file for that database exists. If it
does the database is opened only for reading, if not it creates the file, so that any additional
attempts to open the database for writing will be blocked. A side effect of this mechanism
is that in the event of a program or system crash the BUSY file will be left on the disk,
even though the database is not being used. In this case users must remove the BUSY file
(after checking that the database really isn’t in use!) before opening the database. On UNIX
use the rm command, e.g. "rm FRED.0.BUSY". On Windows use the Recycle Bin.
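The BUSY file mechanism is a form of lock file. The following Python sketch illustrates the idea only and is not gap4's own code: the function names are hypothetical, and it uses an exclusive create (O_EXCL) so that the existence check and the creation happen in one step, which may differ from how gap4 performs its check.

```python
import os


def busy_file(db_name, version="0"):
    """Name of the BUSY file for a database, e.g. FRED.0.BUSY."""
    return f"{db_name}.{version}.BUSY"


def open_mode(db_name, version="0", directory="."):
    """Return "rw" if the BUSY file was created here, otherwise "ro"."""
    path = os.path.join(directory, busy_file(db_name, version))
    try:
        # O_EXCL makes the call fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Another session (or a crashed one) left the BUSY file behind.
        return "ro"
    os.close(fd)
    return "rw"


def release(db_name, version="0", directory="."):
    """Remove the BUSY file when the database is closed."""
    os.remove(os.path.join(directory, busy_file(db_name, version)))
```

A stale BUSY file left by a crash makes open_mode return "ro" until the file is removed by hand, which mirrors the side effect described above.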
The gap4 database is robustly designed. Killing the program whilst updating the database should never yield an inconsistent state. A "roll-back" mechanism is utilised to undo
any partially written updates and revert to the last consistent database. Hence quitting
abnormally may result in the loss of some data. Always quit using the Exit command within
the File menu.
However it is advised that copies of the database are made regularly to safeguard against
any software bugs or disk corruptions.

2.16.1 Directories
By default, Gap4 expects files to be in the current directory. In dialogues which request
filenames, full pathnames can be specified, however it is generally tidier to keep files specific
to a particular project in the same directory as the project database. Creating new databases
and opening new databases will change directory to the directory containing the opened
project.
It is possible to change the current directory by selecting "Change directory" from the
File menu. Be warned that changing to a directory other than that containing the database
and the trace files may mean that gap4 can no longer find the trace files. The solutions to
this problem are discussed elsewhere (see Section 2.20.8 [Trace File Location], page 302).

2.16.2 Opening a New Database
To create a new gap4 database select the "New" command from the File menu. This brings
up a dialogue prompting for the new filename. Type the name of the database to create
without specifying the version number. For example, typing FRED creates version 0 of a
database named ‘FRED’, consisting of the two database files ‘FRED.0’ and ‘FRED.0.aux’.
If the database already exists you will be asked whether you wish to overwrite it. Any
database that was already open will be closed before the new database is created. The new
database is then opened, ready for input.
Note that Gap4 database names are case sensitive.

2.16.3 Opening an Existing Database
To open an existing database select the "Open..." command from the File menu. This brings
up a file browser where the database name can be selected. The databases will be listed in
a NAME.V notation (where V is the version number). Double clicking on the database name
will then open this database.
If the program already had a database open it will close it before the new one is opened.
If the new database is already in use by gap4 a dialogue will appear warning you that the
database has been opened in read only mode. This mode prevents any edits from being
made to the database by greying out certain options and disabling the editing capabilities
in the contig editor.
A database may also be opened by specifying the database name and version on the
unix command line. To open version 0 of the database ‘FRED’ use "gap4 FRED.0".

2.16.4 Making Backups of Databases
The importance of making regular backups of your data cannot be overstated. Using
the "Copy database" command from the File menu brings up a dialogue asking for a new
database version. Type in a single character for the new version and press "ok" or return.
If the new database already exists you will be asked whether you wish to overwrite it. Any
subsequent changes you make will still be to the database that you originally opened, not
to the database you have just saved to.
The database file may sometimes become fragmented. An option available when saving
is to use garbage collection. This creates the new database by only copying over the used
portions of data (and hence reduces fragmentation). However it is quite a lot slower than
the standard "Copy database" mechanism, so if this causes problems add
"set_def COPY_DATABASE.COLLECT 0" to your ‘.gaprc’ file to change the default to no garbage collection.
It should be noted that garbage collection also performs a rigorous database consistency
check.

Do not always use the same version character for your backups. Instead keep several different backups. Otherwise you may find that both your current database and the backup have
problems. It is also wise to run "check database" to verify data integrity. See Section 2.18
[Check Database], page 290.
It is also possible to backup databases from outside gap4 by using standard unix commands to copy both the record and index files. Care should be taken when doing this
to ensure that the database is not being modified whilst copying. See your unix or Windows manuals for further details or the copy_db manual page (see Section 12.2 [Copy db],
page 573) for the external garbage collecting database copy program.

2.16.5 Reading and Contig Names and Numbers
For various reasons there are restrictions on the characters used in file names and the length
of the file names.
Characters permitted in file names:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789._-
A reading name or experiment file name must not be longer than 40 characters.
These restrictions also apply to SCF files which means, in turn, also to the names given
to samples obtained from sequencing instruments. For example do not give sample names
such as 27/OCT/96/r.1 when using an ABI machine: the / symbols will be interpreted as
directory name separators on UNIX!
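A sample name can be checked against these restrictions before it is given to the sequencing instrument. The Python sketch below is an illustration only: the function name is hypothetical, and it assumes the permitted set is the letters, the digits, the full stop, the underscore and the hyphen.

```python
import string

# Assumed permitted set: letters, digits, full stop, underscore and hyphen.
PERMITTED = set(string.ascii_letters + string.digits + "._-")
MAX_NAME_LENGTH = 40


def name_problems(name):
    """Return a list of reasons why a reading or file name is unsuitable."""
    problems = []
    if not name:
        problems.append("name is empty")
    if len(name) > MAX_NAME_LENGTH:
        problems.append(f"longer than {MAX_NAME_LENGTH} characters")
    bad = sorted(set(name) - PERMITTED)
    if bad:
        problems.append("characters not permitted: " + " ".join(bad))
    return problems


print(name_problems("27/OCT/96/r.1"))  # ['characters not permitted: /']
print(name_problems("fred.gel"))       # []
```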
As each reading is entered into a project database it is given a unique number. The first
is numbered 1, the second 2 and so on. Their reading names are read from the ID line in
the experiment files and copied into the database. As new readings are created and existing
ones removed the reading numbers change in an unpredictable fashion. Hence when taking
notes on a project always record the reading name instead of the reading number.
The maximum number of readings a database can hold is 99,999,999.
Many options ask for a reading or contig identifier. A contig identifier is simply any
reading name or number within that contig. A reading identifier is either the reading name
or the hash ("#") character followed by the number. For example, if the reading name is
fred.gel with number 99 users could type "fred.gel" or "#99" when asked to identify
the contig.
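The identifier convention can be sketched as follows. This is a minimal Python illustration with a hypothetical function name, not the parsing code used by gap4.

```python
def parse_reading_identifier(text):
    """Split an identifier into ("number", int) or ("name", str)."""
    text = text.strip()
    if text.startswith("#"):
        # "#99" refers to reading number 99.
        return "number", int(text[1:])
    return "name", text


print(parse_reading_identifier("#99"))       # ('number', 99)
print(parse_reading_identifier("fred.gel"))  # ('name', 'fred.gel')
```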
Generally when prompting for a contig or reading name a default is supplied. This is
the last name you used, or if you’ve only just opened the database, the name of the longest
contig in the database. For more information about selecting contigs within the program
see Section 2.3.1 [Selecting Contigs], page 123.

2.17 Copy Readings
2.17.1 Introduction
During large scale sequencing projects where the genome is cloned into e.g. BACs prior to
being subcloned into sequencing vectors it is generally the case that the ends of the DNA
from one BAC will overlap that of two other BACs. Unless it is being used for quality
control, it is a waste of time to sequence the overlapping regions twice, and so most labs
transfer the relevant data between the adjacent gap4 databases. This is the function of
copy_reads, which copies readings from a "source" database to a "destination" database.
The consensus sequences for user selected contigs in each of the two databases are compared in both orientations. If an overlapping region is found, readings of sufficient quality
are automatically assembled into the destination database. In the source database readings which have been added to the destination database will be tagged with a "LENT"
tag and the equivalent readings in the destination database will be tagged with a "BORO"
(borrowed) tag.

2.17.1.1 Copy Reads Dialogue

The Copy reads function is available from either the File menu of gap4 or from the
command line.
The program must be able to write to both databases. It is recommended that you use
"Copy database" to create backups of both databases before commencing. See Section 2.16.4
[Making Backups of Databases], page 285.
From within gap4: The source database must be entered into the "Open source database"
entry box at the top of the dialogue box. The adjacent Browse button will list only gap4
databases, that is, files ending in .aux. Either select from the browser by double clicking on
the name or type in the database name. The ending of .aux is ignored. The destination
database is always the database which is currently open in gap4.
The location of the traces of the source database can either be determined from the
rawdata note (see Section 2.20.8 [Trace File Location], page 302) held within the database
("read from database") or can be entered via the "directory" option. The program will
add the location of the source traces into the rawdata note of the destination database.
If the environment variable RAWDATA is set, this will be taken to be the location of the
destination database traces and will also be added to the rawdata note of the destination
database. If there are no traces for the source database, no rawdata note will be created.
One or more contigs from the source database can be compared. These are selected
either by clicking on "all contigs" or providing a file containing a list of contig names (any
reading name from within that contig, typically the first reading name). Only contigs
over a user defined length will be used. A minimum reading quality can be set so that
only readings with an average quality over the specified amount will be entered into the
destination database.
Contigs from the destination database can be chosen by either selecting "all contigs" or
providing a file of contig names.
The consensus sequence is determined for each contig in both databases using either the
standard consensus algorithm or "Mask active tags". The latter option will activate the
"Select tags" button. Clicking on this button will bring up a check box dialogue to enable
the user to select the tag types they wish to activate. Masking the active tags means that
all segments covered by tags that are "active" will not be used by the matching algorithms.
A typical use of this mode is to avoid finding matches in segments covered by tags of type
ALUS (i.e. segments thought to be Alu sequence) or REPT (i.e. segments that are known to
be repeated elsewhere in the data) (see Section 2.2.7.1 [Tag types], page 121).
The consensus searching parameters are equivalent to those found in the find internal
joins algorithm (see Section 2.8.3 [Find Internal Joins], page 227). The search algorithm
first finds matching words of length "Word length", and only considers overlaps of length
at least "Minimum overlap". Only alignments better than "Maximum percent mismatch"
will be reported. Find internal joins has the option of either a quick or sensitive algorithm.
Here, it is only necessary to use the quick algorithm. The quick algorithm can find overlaps
and align 100,000 base sequences in a few seconds by considering, in its initial phase, only
matching segments of length "Minimum initial match length". However it does a dynamic
programming alignment of all the chunks between the matching segments, and so produces
an optimal alignment. A banded dynamic algorithm can be selected, but as this only applies
to the chunks between matching segments, which for good alignments will be very short, it
should make little difference to the speed. The alignments between the consensus sequences
can be displayed in the text output window by selecting "Display consensus alignments".
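The first phase of such a search, finding matching words of a fixed length, can be illustrated with a short Python sketch. This is a generic word-indexing approach written for this explanation and is not the gap4 implementation; the function names and the example sequences are hypothetical, and the later alignment phase is not shown.

```python
from collections import defaultdict


def word_matches(seq_a, seq_b, word_length=8):
    """Return (pos_a, pos_b) pairs where the same word starts in both sequences."""
    # Index every word of seq_a by its start position.
    index = defaultdict(list)
    for i in range(len(seq_a) - word_length + 1):
        index[seq_a[i:i + word_length]].append(i)
    # Look up each word of seq_b in the index.
    hits = []
    for j in range(len(seq_b) - word_length + 1):
        for i in index.get(seq_b[j:j + word_length], ()):
            hits.append((i, j))
    return hits


def best_diagonal(hits):
    """Return the offset (pos_a - pos_b) shared by the most word matches."""
    counts = defaultdict(int)
    for i, j in hits:
        counts[i - j] += 1
    return max(counts, key=counts.get) if counts else None


a = "TTTTACGTACGGATCCA"
b = "ACGTACGGATCCAGGGG"
hits = word_matches(a, b, word_length=8)
print(len(hits), best_diagonal(hits))  # 6 4
```

Here the end of a overlaps the start of b, so all six word matches lie on the same diagonal (offset 4). A shorter word length finds more candidate matches at the cost of more spurious hits.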
If a match between two consensus sequences is found, the readings in that overlap
are assembled into the destination database using the "directed assembly" function (see
Section 2.7.2 [Directed Assembly], page 211). Only readings for which the "Maximum percent mismatch" is not exceeded, and which have an average reading quality higher than the
specified minimum, will be entered into the database. Again, the alignments can be shown
in the Output window by selecting "Display sequence alignments".
From the command line:
copy_reads [-win] [-source_trace_dir ("")] [-contigs_from (all contigs)]
[-min_contig_len (2000)] [-min_average_qual (30.0)] [-contigs_to (all contigs)]
[-mask (none)] [-tag_types ("")] [-word_length (8)] [-min_overlap (20)]
[-max_pmismatch (30.0)] [-min_match (20)] [-band (1)] [-display_cons]
[-align_max_mism (10.0)] [-display_seq] [source database] [destination database]
The values in brackets () are the default values. The only mandatory values are the
source and destination databases. Details on these values are given in the copy_reads man
page (see Section 12.3 [Copy reads], page 574).
The -win option will bring up a new program which presently only has one function
(copy reads). This is accessed from the "File" menu. This brings up a dialogue the same
as that from within gap4 except for an extra entry box to select the destination database.

2.18 Check Database
This function (which is available from the gap4 File menu) is used to perform a check on
the logical consistency of the database. No user intervention is required. If the checks are
passed the message "Database is logically consistent" is written to the Output Window. If
the database is not found to be consistent diagnostic messages will appear in the Output
Window and Doctor Database from the Edit menu should be used to correct the problem.
See Section 2.19 [Doctor Database], page 293.
Several options, such as assembly, automatically perform a check database prior to executing. If the database is found to be inconsistent the option will not continue. However
some checks are considered as "non fatal" and will not block such operations. Currently
the only non fatal checks are the positional checks for annotations and for readings that
are never used. To fix the database, use the Doctor Database "ignore check database"
setting to disable the inconsistency checking. See Section 2.19.2 [Ignoring Check Database],
page 296.
The following sections define the checks and the order in which they are performed.

2.18.1 Database Checks
• Number of contigs used is <= number allocated
• Disk and memory values for "number of contigs" are consistent
• Number of readings used is <= number allocated
• Disk and memory values for "number of readings" are consistent
• Disk and memory values for "actual database size" are consistent
• Actual database size <= maximum size
• Data class is either DNA(0) or protein(1).
• Number of free annotations >= 0 and <= number allocated
• Contig order is consistent
• Number of free notes >= 0 and <= number allocated
• First note has prev type as GT Database
• Detect note loops

2.18.2 Contig Checks
• Has a left reading number
• Has a right reading number
• The left reading has no left neighbour
• The right reading has no right neighbour
• Chain right to
− check loops
− check holes
− flag a reading as used
• When finished chaining
− check length is correct
− check right reading number is correct
• Reference only valid reading numbers
• Chain left to
− check loops
− flag readings as used, if not done so in right chaining;
• When finished chaining, check left reading number is correct
• Chain along annotation list to
− flag as used
− detect annotation loops
− annotation is within the contig
− annotation is rightwards of previous
• First note has prev type as GT Contigs
• Detect note loops

2.18.3 Reading Checks
• Memory and disk values tally for
− left neighbour
− right neighbour
− relative position
− length + sense
• Left neighbour is a valid reading number
• Right neighbour is a valid reading number
• Reading is not used zero times
• Reading is not used more than once
• Hand holding: (lnbr[rnbr[reading]] == reading)
• Relative position of reading >= position of left neighbour
• Length != 0
• Used sequence length == "right clip position" - "left clip position"
• Has valid strand (0 or 1)
• Has valid primer
• Has valid sense (0 or 1)
• Chain along annotation list to
− flag as used
− detect annotation loops;
− annotation is rightwards of previous
• First note has prev type as GT Readings
• Detect note loops
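
Several of the contig and reading checks amount to walking a doubly linked list. The Python sketch below illustrates the loop, "hand holding" and relative position checks on a simplified model; it is not gap4's code, and the function name and the dictionary layout are assumptions made for the example.

```python
def chain_right(left_nbr, right_nbr, position, first):
    """Walk a contig rightwards from reading `first`; return (order, errors).

    left_nbr, right_nbr and position are dicts keyed by reading number.
    A neighbour of 0 means "no neighbour".
    """
    order, errors, seen = [], [], set()
    if left_nbr[first] != 0:
        errors.append(f"reading {first}: leftmost reading has a left neighbour")
    reading = first
    while reading != 0:
        if reading in seen:
            errors.append(f"reading {reading}: loop detected")
            break
        seen.add(reading)
        order.append(reading)
        nxt = right_nbr[reading]
        if nxt != 0:
            if nxt not in right_nbr:
                errors.append(f"reading {reading}: invalid right neighbour {nxt}")
                break
            # Hand holding: my right neighbour's left neighbour must be me.
            if left_nbr[nxt] != reading:
                errors.append(f"reading {reading}: hand holding broken with {nxt}")
            # A reading may not start to the left of its left neighbour.
            if position[nxt] < position[reading]:
                errors.append(f"reading {nxt}: positioned left of {reading}")
        reading = nxt
    return order, errors


left_nbr = {1: 0, 2: 1, 3: 2}
right_nbr = {1: 2, 2: 3, 3: 0}
position = {1: 1, 2: 50, 3: 120}
print(chain_right(left_nbr, right_nbr, position, first=1))  # ([1, 2, 3], [])
```

Readings that never appear in any contig's chain, or that appear in more than one, would then be reported by the "used zero times" and "used more than once" checks.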

2.18.4 Annotation Checks
• No loops in free annotation list
• Is neither used nor is on the free list
• Annotation is not used more than once
• Is used, yet is still on the free list
• Length >= 0
• Has valid strand (0 or 1)

2.18.5 Note Checks
• No loops in free note list
• Is neither used nor is on the free list
• Hand holding: (note->next->prev == note)
• Note is not used more than once
• Is used, yet is still on the free list

2.18.6 Template Checks
• Minimum insert length <= maximum insert length
• Has valid vector
• Has valid clone
• Has valid strand

2.18.7 Vector Checks
• Level > 0
• Level <= MAX LEVEL (MAX LEVEL currently is 10; a "feasibility" check)

2.18.8 Clone Checks
• Has valid vector

2.19 Doctor Database
Doctor Database (which is available from the gap4 Edit menu) is used to make arbitrary
changes to the database. It should be extremely unlikely that its use will be required, and
if so, it is for experts only. Very few checks are performed on the user’s input and there
are few limitations on what can be done. Consequently this option should never be used
without first making a backup using "Copy database". See Section 2.16.4 [Making Backups
of Databases], page 285. It is very easy to create inconsistencies within the database. Do
not feel that values (such as the maximum gel reading length) can be safely changed simply
because they are shown in a dialogue.

The main window consists of a menubar containing "File", "Structures" and "Commands" menus. The menus contain:
• File
− New
− Quit
• Structures
− Database
− Reading
− Contig
− Annotation
− Template
− Original clone
− Vector
− Note
• Commands
− Check
− Ignore check database
− Extend structures
    − Reading
    − Annotation
    − Template
    − Clone
    − Vector
− Delete contig
− Shift readings
− Reset contig order
− Output annotations to file
− Delete annotations
The New command in the File menu brings up another Doctor Database window
complete with its own menubar. This is useful for comparing structures. Whilst Doctor
Database is running all other program dialogues, including the main gap4 menubar, are
blocked. Control is reenabled once the last Doctor Database window is removed. Remember to perform a Check Database (Commands menu) before quitting to double check for
database consistency.

2.19.1 Structures Menu
The gap4 database consists of records of several predefined types. The types correspond to
the commands available within the Structures menu. All of these, except for the "Database"
command, insert a dialogue between the menubar and whatever is underneath it. In the
picture below we have selected "Annotations" from the menu which has prompted for
"Which annotation (1-380)" (the 1-380 is the valid range of inputs available).

Beneath the "Which annotation" question is a panel detailing an
annotation structure. In general the structure type and number are shown at the top of the
panel (in this case annotation number 100). Beneath this are the structure fields on the
left followed by the values for these fields on the right. Sometimes gap4 may store a value
as numeric, but display the structure as both a numeric and a string describing this value.
For instance here the annotation strand is "1" which is gap4’s way of storing "reverse".
Some values have an arrow next to them, such as with the "next" field in the illustration.
Clicking on this arrow will display the structure referenced by this value. Here it is another
annotation (annotation 357). It is stated that the annotation is part of Contig number 6.
Clicking on the arrow next to this will reveal that contig structure.

Selected notes on editing the structures follow.

2.19.1.1 Database Structure
There is only a single Database structure. A description of its more important fields follows.
num_contigs
The number of currently used contigs
num_readings
The number of currently used readings
Ncontigs
The number of currently allocated contigs
Nreadings
The number of currently allocated readings
contigs
readings
annotations
templates
clones
vectors
notes
Record numbers of arrays holding the record numbers of each item
free_annotations
A linked list of unused annotations
free_notes
A linked list of unused notes

2.19.1.2 Reading Structure
Some Reading Structure fields reference the record number in the gap4 database of a string.
Where this string is short, such as the reading name, both the record number and the
contents of the string can be edited. To edit a single name the string should be changed.
To swap two reading names around either edit both strings or swap the two name record
numbers.
The annotations value references an annotation number. If this is zero then this reading
has no annotations.
The length is the complete length of sequence, including hidden data. The
sequence length is the length of only the used sequence. The location of the hidden data
is specified by the start and end values. Note that sequence length=end-start-1.
A left or right value of zero means that this reading has no left or right neighbour.

2.19.1.3 Contig Structure
A Contig Structure is defined as a list of readings. The left and right values specify the
first and last reading numbers in the doubly linked list representing the contig.

2.19.1.4 Annotation Structure
Annotations are stored as linked lists. Each reading and each contig has a (possibly blank)
list. All other unused annotations are held on the free list. The next value is used to
reference the next annotation number. A value of zero represents the end of the list.

2.19.1.5 Template Structure
The Template name field can be edited as both a string and the record number pointing to
that string. The Template Structure display has links to a vector number and a clone.

2.19.1.6 Original Clone Structure
The original clone name is often the name of the database. The use of original clones is
primarily for large scale sequencing. When breaking down a sequence into cosmids and then
into sequencing templates, we say that each cosmid is a clone.

2.19.1.7 Note Structure
A Note may be considered as a positionless annotation (without the position, length or
strand fields). Notes store both their creation and last-modification dates. Notes may be
attached, in a linked-list fashion, to readings, contigs, or the database structure.

2.19.2 Ignoring Check Database
Many functions use the Check Database function to determine whether the database is
consistent. Often editing an inconsistent database can yield more and more inconsistencies.
However it is sometimes useful to use such an editing function in the process of fixing the
database. In such cases, the "Ignore check database" toggle should be set.
An example of the use is for the Break Contig function. If we find that a database is
inconsistent due to there being a gap in the contig, the obvious solution is to fix this using
Break Contig. But Break Contig checks for consistency, and refuses to work if the database
is inconsistent.

2.19.3 Extending Structures
Sometimes it is required to allocate new structures. The "Extend structures" item on the
Commands menu reveals a cascading menu containing the different structure types. Once a
type has been selected a dialogue appears asking how many extra structures to create.
The new structures created can then be modified using the Structures menu. Expect
strange behaviour if these structures are not initialised correctly.

2.19.4 Listing and Removing Annotations
The Commands menu contains two commands for manipulating lists of annotations. Output
annotations to file saves a list of annotations to file. The dialogue requests a filename
to save the annotations to and an annotation type. Only one type can be specified.
The format of the file is "Annotation_number Type Position Length Strand".
The "Delete annotations" command requests a file of annotations in this format. The
function then removes these annotations from readings and contigs and adds them to the
free annotation list.
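A file in this format can be inspected or filtered before it is passed to "Delete annotations". The Python sketch below is an illustration only: it assumes the five fields are separated by whitespace, keeps the strand as text because its representation in the file is not stated here, and uses hypothetical function names and a made-up example line.

```python
from collections import namedtuple

Annotation = namedtuple("Annotation", "number type position length strand")


def parse_annotation_line(line):
    """Parse one "Annotation_number Type Position Length Strand" line."""
    number, tag_type, position, length, strand = line.split()
    # The strand is kept as text; its encoding in the file is an assumption.
    return Annotation(int(number), tag_type, int(position), int(length), strand)


def read_annotations(lines):
    """Parse every non-blank line of an annotation list."""
    return [parse_annotation_line(line) for line in lines if line.strip()]


# Hypothetical file contents, for illustration only.
example = ["100 COMM 2045 12 1", "", "357 COMM 3100 4 0"]
for annotation in read_annotations(example):
    print(annotation.number, annotation.type, annotation.position)
```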

2.19.5 Shift Readings
The Shift Readings option allows the user to change the relative positions of a set of neighbouring readings starting at a selected reading. Hence it can be used to change the alignment
of readings within a contig. It prompts for the number of the first reading to shift and then
the relative distance to move by. A negative shift will move the readings leftwards.
The reading and all its rightward neighbours are moved by the requested distance. Tags
on the readings and the consensus are moved accordingly. The command also automatically
updates the length of the contig.
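The effect of a shift on reading positions and contig length can be sketched in Python. This is a simplified model, not the gap4 implementation: the function name and the dictionary keys are hypothetical, tags are not modelled, and the contig is assumed to start at position 1.

```python
def shift_readings(readings, first_index, distance):
    """Shift readings[first_index:] by `distance`; return the new contig length.

    `readings` is a list of dicts with "position" (1-based start in the contig)
    and "length" (used length), ordered left to right.
    """
    for reading in readings[first_index:]:
        reading["position"] += distance
    # The contig ends at the rightmost base covered by any reading.
    return max(r["position"] + r["length"] - 1 for r in readings)


readings = [
    {"position": 1, "length": 100},
    {"position": 60, "length": 100},
    {"position": 150, "length": 80},
]
# A negative distance moves the second reading and its right neighbours left.
print(shift_readings(readings, first_index=1, distance=-10))  # 219
```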

2.19.6 Delete Contig
The Delete Contig function removes a contig and all its readings. Annotations on the
removed readings and contig are added to the free annotations list.

2.19.7 Reset Contig Order
The contig order information contains a list of contig numbers. If a contig number does
not appear within this list, or if it appears more than once, then the contig order is inconsistent and windows such as the Contig Selector may not work. The Reset Contig Order
function resets the contig order to a consistent state, but will lose the existing contig order
information.
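The consistency rule for the contig order (every contig number appears exactly once) can be sketched in Python. The sketch is illustrative only: the function names are hypothetical, contigs are assumed to be numbered from 1, and resetting to plain numerical order is an assumption about what a "consistent state" looks like.

```python
from collections import Counter


def contig_order_problems(order, num_contigs):
    """Return (missing, repeated) contig numbers for a contig order list."""
    counts = Counter(order)
    missing = [c for c in range(1, num_contigs + 1) if counts[c] == 0]
    repeated = sorted(c for c, n in counts.items() if n > 1)
    return missing, repeated


def reset_contig_order(num_contigs):
    """Return a consistent order; the previous ordering is discarded."""
    return list(range(1, num_contigs + 1))


print(contig_order_problems([3, 1, 1, 4], num_contigs=4))  # ([2], [1])
print(reset_contig_order(4))                                # [1, 2, 3, 4]
```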

2.20 Configuring
2.20.1 Introduction
The Options menu allows selection of the Consensus algorithm and the genetic code to use,
and adjustment of various parameters used throughout gap4. It also provides a way of
setting more trivial things such as fonts and colours.
Most of these options have "OK Permanent" buttons in addition to the normal "OK"
button. The "OK Permanent" button will save the current settings to the ‘.gaprc’ file in
the user’s home directory. On Windows 95 this may be C:\.
In general users will not need to be aware of this method as the most important configuration options are all available from within the graphical user interface. However there are
many additional configurable parameters which may be referred to throughout the manual.
These too are stored in the ‘.gaprc’ files.
When gap4 starts up it will first load the complete set of configurations from the
‘$STADENROOT/tables/gaprc’ file. Next it loads ‘.gaprc’ from the user’s home directory,
and finally ‘.gaprc’ from the user’s current project directory. This means that the settings
stored in the ‘.gaprc’ file in the user’s project directory will have priority over those found
in the home directory, which, in turn, have priority over those found in the Staden Package
installation directories.
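The load order works like a series of overrides, with each file loaded later replacing values set earlier. A minimal Python sketch of this layering follows; the function name and all of the values shown are hypothetical.

```python
def merge_settings(*layers):
    """Merge settings dicts; later layers override earlier ones."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


# Hypothetical values, listed in the order the files are loaded.
package = {"FIJ.MAXMIS.VALUE": "30.00", "CONTIG_EDITOR.MAX_HEIGHT": "20"}
home = {"CONTIG_EDITOR.MAX_HEIGHT": "25"}
project = {"FIJ.MAXMIS.VALUE": "10.00"}

settings = merge_settings(package, home, project)
print(settings["FIJ.MAXMIS.VALUE"])          # 10.00 (project directory wins)
print(settings["CONTIG_EDITOR.MAX_HEIGHT"])  # 25 (home directory wins)
```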
Note that searching for the ‘.gaprc’ files only applies when starting gap4 and not when
opening new or different databases. Hence if the user double clicks on a database ‘.aux’
file then gap4 will read the ‘.gaprc’ file found in the same directory as the database. If
users start up gap4 from the Start menu and then open the project, the ‘.gaprc’ file in the
project directory will not be read.
The format of commands in the ‘.gaprc’ file is:
"#" followed by anything is a comment.
"set_def VARIABLE value" sets the parameter "VARIABLE" to the value
"value". Note that value must be enclosed in double quotes if it contains
spaces.
"set_defx temp VARIABLE value" sets a parameter in a temporary list named
"temp". This has no effect unless it is then used within a set_def command. In
this case we use "$temp" as the "value" parameter of a set_def command.
An example follows:
set_def FIJ.MAXMIS.VALUE 30.00
set_def TEMPLATE.PRIMER_REVERSE_COLOUR "green"
set_def CONTIG_EDITOR.DISAGREE_MODE 2
set_def CONTIG_EDITOR.DISAGREE_CASE 0
set_def CONTIG_EDITOR.MAX_HEIGHT 25
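
Simple files of this kind can be read by other tools, for example to list which parameters a project overrides. The Python sketch below is an approximation only: gap4 itself interprets these files, whereas this hypothetical parse_gaprc function uses shell-style word splitting, handles only plain set_def lines, and ignores set_defx and "$temp" references.

```python
import shlex


def parse_gaprc(text):
    """Return {VARIABLE: value} for the simple set_def lines in `text`."""
    settings = {}
    for line in text.splitlines():
        # comments=True drops "#" and everything after it outside quotes.
        words = shlex.split(line, comments=True)
        if len(words) == 3 and words[0] == "set_def":
            settings[words[1]] = words[2]
    return settings


sample = '''
# trace colours
set_def TRACE.COLOUR_T "#ff8000"
set_def CONTIG_EDITOR.MAX_HEIGHT 25
'''
print(parse_gaprc(sample))
```

Note that a value beginning with "#" must be quoted, otherwise this sketch would treat it as the start of a comment.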

Note that some adjustments will affect more than just gap4. For example, the colours
of traces are stored in the ‘.tk_utilsrc’ file, and this file is used by both gap4 and trev.
For colour blind users it can be useful to change these particular settings. For example the
following is a ‘.tk_utilsrc’ file to change the colours for the trace displays.
set_def TRACE.COLOUR_A white
set_def TRACE.COLOUR_C blue
set_def TRACE.COLOUR_G black
set_def TRACE.COLOUR_T "#ff8000"
set_def TRACE.LINE_WIDTH 2

2.20.2 Consensus Algorithm
Gap4 currently contains 3 consensus algorithms (see Section 2.11.5 [The Consensus Algorithms], page 257). This option (which is available from the gap4 Options menu) allows the
algorithm to be selected.
Note the consensus algorithm is used in several places throughout gap4: Assembly
(see Section 2.7.1 [Normal Shotgun Assembly], page 205), producing a consensus sequence
file (see Section 2.11.5 [The Consensus Algorithms], page 257), in the Contig Editor (see
Section 2.6 [Editor introduction], page 160), for Experiment Suggestion (see Section 2.10
[Finishing Experiments], page 241), and in the plot of the confidence values (see Section 2.5.2
[Consistency Display], page 140).

2.20.3 Set Maxseq/Maxdb
The "Set maxseq/maxdb" option (which is available from the gap4 Options menu) may be
used to adjust the maximum size of the total consensus sequence contained within gap4.
This includes concatenations of consensus sequences (with extra space for text headers) and
the cutoff data at either end of each contig.
When opening an already assembled project, maxseq is automatically increased accordingly (if required), so "Set maxseq" only needs to be used when adding in more data, such
as when using the sequence assembly algorithms.
The maxdb option controls the maximum combined number of readings and contigs
allowed. Note that changing this does not take effect on the currently opened database so
be sure to set it before opening your database.
Both these values can also be adjusted by using the -maxseq and -maxdb command line
arguments. See Section 2.21 [Command Line Arguments], page 306.

2.20.4 Set Fonts
"Set fonts" (which is available from the gap4 Options menu) controls the fonts used for the
various components of gap4’s windows. Note that for the correct operation of some displays,
careful font selection is necessary. For example it is not wise to choose a proportional font
for the Contig Editor, which displays fixed width sequence alignments. For more complete
documentation, see Section 10.8 [Font Selection], page 531.


2.20.5 Configuring Menus
When used for the first time gap4 will start up in beginner mode. What this means is that
some of the less widely used options will not appear in the menus. The "Configure menus"
command in the Options menu may be used to change between "beginner" and "expert"
mode. In expert mode all the menu items will be displayed.
To permanently set the menu level users select the appropriate level and press the "OK
Permanent" button. This will save the menu level information to the ‘.gaprc’ file in their
home directory.
If desired, other menu levels may be created by the package administrator. This is
achieved by editing the ‘$STADENROOT/tables/gaprc_menu_full’ file, changing the
MENU_LEVELS definition and adding the appropriate labels to the end of each command. Each
command specified in the menu file ends in a list of menu levels in which it is active. To
make a command active for several levels, enclose the level identifiers in a Tcl list, such as
{m e}. If this is missing, the command will be active at all menu levels.

2.20.6 Set Genetic Code
This function allows the user to change the genetic code used in all the options. The codes are
defined as a set of codon tables stored in the directory tables/gcodes distributed with the
package. The current list of codes and their codon table file names is shown at the end of
this section.
The user interface consists of the dialogue shown below. The user selects the required
code by clicking on it, and then clicking "OK" or "OK permanent". The former choice
selects the code for immediate use, and the latter also selects it for future uses of the
program.

When the dialogue is left the codon table selected will be displayed, as below, in the
Output Window.
===============================================
F ttt    S tct    Y tat    C tgt
F ttc    S tcc    Y tac    C tgc
L tta    S tca    * taa    W tga
L ttg    S tcg    * tag    W tgg
===============================================
L ctt    P cct    H cat    R cgt
L ctc    P ccc    H cac    R cgc
L cta    P cca    Q caa    R cga
L ctg    P ccg    Q cag    R cgg
===============================================
I att    T act    N aat    S agt
I atc    T acc    N aac    S agc
M ata    T aca    K aaa    G aga
M atg    T acg    K aag    G agg
===============================================
V gtt    A gct    D gat    G ggt
V gtc    A gcc    D gac    G ggc
V gta    A gca    E gaa    G gga
V gtg    A gcg    E gag    G ggg
===============================================
The following table shows the list of available genetic codes and the files in which they
are stored for use by the package. They were created from genetic code files obtained from
the NCBI.
code_1     Standard
code_2     Vertebrate Mitochondrial
code_3     Yeast Mitochondrial
code_4     Coelenterate Mitochondrial
code_4     Mold Mitochondrial
code_4     Protozoan Mitochondrial
code_4     Mycoplasma
code_4     Spiroplasma
code_5     Invertebrate Mitochondrial
code_6     Ciliate Nuclear
code_6     Dasycladacean Nuclear
code_6     Hexamita Nuclear
code_9     Echinoderm Mitochondrial
code_10    Euplotid Nuclear
code_11    Bacterial
code_12    Alternative Yeast Nuclear
code_13    Ascidian Mitochondrial
code_14    Flatworm Mitochondrial
code_15    Blepharisma Macronuclear
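
The codon table printed in the Output Window lists each amino acid letter followed by its codon, with "*" marking stop codons. As an illustration of how such a listing can be used, the following Python sketch reads the example table shown earlier in this section into a lookup and translates a short sequence. The function names are hypothetical, the code does not read the files in tables/gcodes, and "X" for an unknown codon is a choice made here.

```python
# The example codon table from this section (amino acid, then codon).
CODON_TABLE_TEXT = """
F ttt  S tct  Y tat  C tgt
F ttc  S tcc  Y tac  C tgc
L tta  S tca  * taa  W tga
L ttg  S tcg  * tag  W tgg
L ctt  P cct  H cat  R cgt
L ctc  P ccc  H cac  R cgc
L cta  P cca  Q caa  R cga
L ctg  P ccg  Q cag  R cgg
I att  T act  N aat  S agt
I atc  T acc  N aac  S agc
M ata  T aca  K aaa  G aga
M atg  T acg  K aag  G agg
V gtt  A gct  D gat  G ggt
V gtc  A gcc  D gac  G ggc
V gta  A gca  E gaa  G gga
V gtg  A gcg  E gag  G ggg
"""

def read_codon_table(text):
    """Build {codon: amino_acid} from alternating 'letter codon' fields."""
    fields = text.split()
    return {codon: aa for aa, codon in zip(fields[0::2], fields[1::2])}

def translate(dna, table):
    """Translate dna in frame 1; incomplete trailing codons are ignored."""
    dna = dna.lower()
    return "".join(table.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

table = read_codon_table(CODON_TABLE_TEXT)
print(translate("ATGTGATAA", table))   # prints MW*
```

Note that in this particular table tga codes for W rather than a stop, so the result depends on which genetic code has been selected.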

2.20.7 Alignment Scores
The Alignment Scores command (which is available from the gap4 Options menu) may
be used to adjust the gap open and gap extension penalties for some of the alignment
algorithms used within gap4. At present this will affect all alignments except the Find
Internal Joins function and most of the assembly algorithms.
For dealing with sequences where the alignment differences have been caused by real
evolutionary events, these parameters will probably need changing from the defaults. The
default values are set up with the assumption that any alignment differences are due to
base calling errors, and hence the gap extension penalty will be high.
The alignment matrix may also be adjusted, but this is not listed in the dialogue.
To do this take a copy of ‘$STADENROOT/tables/nuc_matrix’, edit the copy, and set the
ALIGNMENT.MATRIX_FILE parameter in your ‘.gaprc’ file.
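
To see how the two penalties interact, consider a generic affine gap model in which a gap of a given length costs the open penalty plus the extension penalty for each base. The Python sketch below illustrates only this general model; the exact formula and default values used by gap4 are not given in this section, and the numbers are invented for the example.

```python
def affine_gap_cost(length, gap_open, gap_extend):
    """Cost of one gap of `length` bases under a generic affine model."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length

# Invented penalties: the same 6-base gap under a high and a low
# extension penalty.
high = affine_gap_cost(6, gap_open=8, gap_extend=8)   # 56
low = affine_gap_cost(6, gap_open=8, gap_extend=1)    # 14
print(high, low)
```

With a high extension penalty long gaps are expensive, which suits data where gaps arise from isolated base calling errors. Lowering it makes the longer insertions and deletions produced by evolutionary events cheaper to align.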

2.20.8 Trace File Location
Gap4 does not store the trace data within the gap4 database. Instead it stores the filename
of the trace file. Usually the trace files are kept within the same directory as the gap4
database. If this is not the case gap4 needs to know where they are.
To make sure that gap4 can still display the traces we need to specify any alternative
locations where traces may be found. The "Trace File Location" command (which is available from the gap4 Options menu) performs this task. It brings up a dialogue asking for
the directory names. If there is just one directory to specify, its name should be typed in.
If there are several directories to search through, they must all be typed in, separated by
the colon character (":"). To include a directory name that contains a colon, use a double
colon.
For example, on Windows, to specify two directories use "F::\tfiles1:G::\tfiles2".
In addition to specifying directories, RAWDATA may also be used to indicate that
the trace files come from a variety of other sources using the general format
SOURCETYPE=path. These can be combined with directories if desired. For example
“.:/trace_cache:TAR=/traces/archived.tar”.
TAR=filename.tar
Searches for the trace name in the Unix tar archive named filename.tar.
By default the tar file is scanned sequentially. To speed up access, the
command line utility index_tar may be used to produce a text index containing
the filenames held within the tar file and their offsets within it. If
filename.tar.index exists and is in the format created by index_tar, the
trace name is looked up in the index instead. The syntax for index_tar is:
index_tar tar_filename > tar_filename.index. (For example "index_tar
traces.tar > traces.tar.index".)
SFF=filename.sff
Searches for the trace name in a 454 SFF archive named filename.sff. SFF files
have their own binary-sorted index which allows for random access.
HASH=archive.hash
This method supersedes the TAR= accessor. Tar files may be “hashed” using
the hash_tar tool. Similarly 454 SFF archives may be hashed using hash_sff.
In theory any type of archive may be indexed as a “.hash” provided that the
traces are stored uncompressed (or compressed only using their own methods,
such as with ZTR) so that random access is possible within the archive.
The Hash file contains a precomputed binary index of all the traces contained
within it stored in such a way that random access is very fast.
URL=url
This uses the external wget tool (not supplied as part of the Staden
Package) to fetch a given url. Anywhere that %s occurs within the
specified url will be replaced by the trace name. Hence, for example,
URL=http://trace.server.org/cgi-bin/lookup.pl?trace=%s could be
used to fetch named traces from a remote site. There are plans for such URL
access to be made available via the Ensembl TraceArchive.
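
The idea behind indexing a tar file, as described for the TAR= accessor above, can be sketched in Python using the standard tarfile module. This is an illustration of the principle only: it records each member's data offset and size so that a trace can be read with a single seek. It does not reproduce the file format written by index_tar or hash_tar, and the function names are hypothetical.

```python
import io
import tarfile

def index_tar(fileobj):
    """Map each member name to (offset of its data, size in bytes)."""
    index = {}
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        for member in tar:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index

def fetch(fileobj, index, name):
    """Read one member directly, without scanning the archive."""
    offset, size = index[name]
    fileobj.seek(offset)
    return fileobj.read(size)

# Build a small uncompressed archive in memory to demonstrate.
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w") as tar:
    for name, data in [("a.ztr", b"trace-a"), ("b.ztr", b"trace-b")]:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
archive.seek(0)

index = index_tar(archive)
print(fetch(archive, index, "b.ztr"))   # prints b'trace-b'
```

Direct seeking of this kind only works when the archive itself is not compressed, which is the same restriction noted above for “.hash” files.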
If the gap4 database has been opened with write-access this directory location will be
stored as a database RAWD note (see Section 2.15.3 [Special Note Types], page 282), which
is read by gap4 when it opens the database. The demonstration data supplied with the
package includes an example database (named DEMO.0) that has a RAWD note to specify
that traces are fetched from a tar file within the same directory.
An alternative way of specifying the trace file location is by setting the RAWDATA environment variable. On Unix and Windows NT this is straightforward (although system
and shell specific). However on Windows 95 this may prove difficult (and at least require a
reboot), so manually setting the environment variable is no longer recommended.
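
The search path syntax described in this section (entries separated by single colons, a double colon standing for a literal colon, and optional SOURCETYPE=path entries) can be illustrated with the following Python sketch. It is not the parser used by gap4; the function names are hypothetical and the label "DIR" for plain directories is introduced here for the example.

```python
import re

ACCESSORS = ("TAR", "SFF", "HASH", "URL")

def split_rawdata(path):
    """Split on single colons; '::' stands for one literal colon."""
    parts = re.split(r"(?<!:):(?!:)", path)
    return [p.replace("::", ":") for p in parts if p]

def classify(entry):
    """Return (kind, value); plain directories are labelled 'DIR'."""
    kind, sep, value = entry.partition("=")
    if sep and kind in ACCESSORS:
        return kind, value
    return "DIR", entry

def expand_url(template, trace_name):
    """Replace every %s in a URL= template with the trace name."""
    return template.replace("%s", trace_name)

for entry in split_rawdata(".:/trace_cache:TAR=/traces/archived.tar"):
    print(classify(entry))
```

For the Windows example given earlier, split_rawdata(r"F::\tfiles1:G::\tfiles2") yields the two directories F:\tfiles1 and G:\tfiles2.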


2.20.9 The Tag Selector
Each command using tags (for example to mask tagged sequence segments) can utilise the
Tag Selector to determine which tag types are to be used. As each command has its own
particular use for tags, the default tags are command specific.

The Tag Selector dialogue (which is available from the relevant gap4 options) consists
of a set of checkbuttons plus commands to select all tags or to deselect all tags. The "OK"
button quits the display and accepts the selected list as the current list of active tags.
The "Cancel" button quits the display without making any changes. The "As default"
button marks the current selected tags as the defaults to be used for all future uses of
this command. These selections are not saved to disk and will be lost when the program
quits. To permanently set the default tag types, users must modify their ‘.gaprc’ file.
Brief instructions on how to edit this file follow. They are also contained within the copy of
the file distributed with the package: ‘$STADENROOT/tables/gaprc’. Search for "Tag type
lists".

2.20.10 The GTAGDB File
To plot tags, gap4 uses a file describing the available tag types and their colours. It is possible
for users to edit their own local copies of this file to create new tag types.
The environment variable GTAGDB is used to specify the location of tag type databases.
The GTAGDB variable consists of one or more file pathnames separated by colons. The first
file read defines a set of tags and colours. Subsequent files can define additional tags and also
override the earlier tag definitions. To achieve this, gap4 loads the files named in the
GTAGDB variable in order from rightmost to leftmost. Thus, as with the Unix
shell PATH variable, the leftmost pathnames have the highest precedence for the resulting tag
definitions. The default GTAGDB specified in the staden login and profile scripts is:
GTAGDB:$HOME/GTAGDB:$STADTABL/GTAGDB


Hence the ‘$STADTABL/GTAGDB’ file is read and the ‘$HOME/GTAGDB’ and ‘GTAGDB’ (a file
in the current directory) files are merged if present. To add a new tag type only to the
database local to the current directory, create a ‘GTAGDB’ file in the current directory.
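
The precedence rule can be pictured as a series of dictionary updates applied from right to left. The Python sketch below assumes a caller-supplied function, here called load_file, which returns the tag definitions found in one file (or an empty dictionary if the file is absent). Both names are hypothetical.

```python
def merge_tag_dbs(gtagdb, load_file):
    """Merge tag definitions; leftmost paths in gtagdb take precedence."""
    merged = {}
    # Load the rightmost file first so that files further left override it.
    for path in reversed(gtagdb.split(":")):
        merged.update(load_file(path))
    return merged

# Stand-in for real files: path -> {tag id: colour}
fake_files = {
    "/staden/tables/GTAGDB": {"COMP": "red", "COMM": "MediumBlue"},
    "GTAGDB": {"COMP": "green"},
}
merged = merge_tag_dbs("GTAGDB:/home/user/GTAGDB:/staden/tables/GTAGDB",
                       lambda path: fake_files.get(path, {}))
print(merged)   # COMP comes from the local file, COMM from the system file
```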
The BNF grammar for the tag database is as follows:

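<tag_db>      ::= <tag> <tag_db> | <empty>
<tag>         ::= <tag_name> ’:’ <option_list> ’\n’
<option_list> ::= <option> | <option> ’:’ <option_list> | <empty>
<option>      ::= <option_name> ’=’ <string>
<option_name> ::= ’id’ | ’bg’ | ’dt’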

Quoting strings is optional for single words, but necessary when writing a string containing spaces. In plain English, this means that to define the compression tag (COMP) to
be displayed in red, with no default annotation string we write:
compression: id="COMP": bg=red
Any lines starting with hash (‘#’) are considered as comments. Lines ending in backslash
(‘\’) are joined with the next line. Hence the above definition can be written in a clearer
form using:
# For marking compressions
compression: \
id="COMP": \
bg=red:
An example including a default annotation string of "default string" follows:
# For general comments
comment: \
id="COMM": \
bg=MediumBlue: \
dt="default string"
Allowed names for colours are those recognised by the windowing system.
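
The rules above (comment lines, backslash continuation, colon-separated options and optional quoting) can be illustrated with a small Python reader. It is a sketch of the format as described in this section rather than the parser used by gap4, and the function name parse_gtagdb is hypothetical.

```python
import re

# option_name = value, where value is a quoted string or a single word
OPTION_RE = re.compile(r'(id|bg|dt)\s*=\s*(?:"([^"]*)"|([^:\s]+))')

def parse_gtagdb(text):
    """Return {tag name: {option: value}} for a GTAGDB-style text."""
    logical, pending = [], ""
    for raw in text.splitlines():
        if raw.lstrip().startswith("#"):      # comment line
            continue
        line = raw.rstrip()
        if line.endswith("\\"):               # join with the next line
            pending += line[:-1]
            continue
        logical.append(pending + line)
        pending = ""
    if pending:
        logical.append(pending)

    tags = {}
    for line in logical:
        if not line.strip():
            continue
        name, _, rest = line.partition(":")
        options = {}
        for match in OPTION_RE.finditer(rest):
            quoted, bare = match.group(2), match.group(3)
            options[match.group(1)] = quoted if quoted is not None else bare
        tags[name.strip()] = options
    return tags

example = r'''
# For general comments
comment: \
id="COMM": \
bg=MediumBlue: \
dt="default string"
'''
print(parse_gtagdb(example))
```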

2.20.11 Template Status
This option allows control over computation of the template status. The validity of a
template is computed by checking the size (based on the locations of assembled readings
and position of vector tags) and the orientation of sequences (based on their “primer type”
values).


The most likely item to need changing is the “size limit scale factor”. The expected
range of template sizes for a ligation is specified in each template record as a minimum-to-maximum
range. Gap4 takes a very simple approach: anything within this range is
valid and anything outside it is invalid. The scale factor is applied such that the maximum
size becomes “max * scale” and the minimum size becomes “min / scale”. So a scale
factor of 2 would adjust a range from 1.0-1.4Kb to 0.5-2.8Kb.
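
The arithmetic can be written out as a short Python sketch. The function names are hypothetical and sizes are given in bases.

```python
def scaled_range(min_size, max_size, scale):
    """Apply the size limit scale factor to a template size range."""
    return min_size / scale, max_size * scale

def template_size_ok(observed, min_size, max_size, scale=1.0):
    """True if the observed template size lies within the scaled range."""
    low, high = scaled_range(min_size, max_size, scale)
    return low <= observed <= high

print(scaled_range(1000, 1400, 2))                    # (500.0, 2800)
print(template_size_ok(2500, 1000, 1400))             # False
print(template_size_ok(2500, 1000, 1400, scale=2))    # True
```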
The “minimum valid vector tag length” is designed to work around problems where some
assemblies end up with SVEC tags only 1 or 2 bases long (which are common when converting
from phrap for some reason). The start and end of a template may be derived from observing
a single reading with sequencing vector at both ends, so the presence of very short falsely
added SVEC tags will mark many templates as inconsistent.
The “Ignore all primer-type values” and “Ignore custom primer-type values” options are methods
to disable Gap4’s trust in the primer type information for each sequence. Normally this
will be one of universal-forward, universal-reverse, custom-forward (e.g. from a primer-walk)
and custom-reverse.

2.21 Command Line Arguments
-bitsize

Specifies whether the database file size is 32-bit or 64-bit. Due to the use
of signed numbers in places, and the restriction to 32 bits for the
number of records in a database (even when using -bitsize 64 for 64-bit file
offsets), the practical limits are a 2Gb file size for -bitsize 32 and
around 100 million sequences for -bitsize 64.
Gap4 only needs this option for creating new databases. The bit-size of existing
databases is automatically detected when they are opened.
Databases produced in 64-bit format are not compatible with older versions of
Gap4, but old and newly created 32-bit databases still work with the 64-bit
Gap4 (and are maintained in 32-bit format so editing them will not invalidate
their use by older Gap4s). The copy_db program (see Section 12.2 [Copy db],
page 573) can be used to convert file formats.

-maxdb

Specifies the maximum number of readings plus contigs. This value is not
automatically adjusted whilst the program is running, but is not allowed to be
set to a value too small for the database to be opened. It controls the size of
some areas of memory (approximately 16*maxdb bytes) used during execution
of gap. The default value is 8000.

-maxseq

Specifies the maximum number of characters used in the concatenated consensus
sequences. This parameter is generally not required as the value is normally
computed and adjusted automatically. However a few functions (such as assembly) still
need to know a maximum size beforehand. The default is 100000
bases.


-ro
-read_only
Opens the database (if specified on the command line) in read only mode. This
does not apply to databases opened using the file browser.
-check
-no_check
Specifies whether to run the "Check Database" option when opening new databases.
-check forces this to always be done and -no_check forces it to never be
done. By default Check Database is always performed when opening databases
in read-write mode and never performed when opening in read-only mode.
-exec_notes
-no_exec_notes
Controls whether to search for and execute any Notes of type OPEN or CLOS.
This may be an important security measure if you are using foreign databases.
Gap4 defaults to -no_exec_notes.
-rawdata_note
-no_rawdata_note
Controls whether to make use of the RAWD note type for specifying the trace file
search path. Defaults to -rawdata_note.
-csel
-no_csel

Controls whether to automatically start up the contig selector when opening
a new gap4 database. In some cases (such as when dealing with many EST
clusters each in their own contig) the contig selector is not a practical tool; this
simply offers a way of speeding up database opening. Defaults to -csel.
--

Treat this as the last command line option. Only useful if the database name
is specified and the name starts with a minus character (not recommended!).
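
As a rough guide to choosing a value for -maxdb, the memory figure quoted above (approximately 16*maxdb bytes) can be evaluated directly. The Python function below is a trivial sketch of that estimate; its name is hypothetical and the result is approximate.

```python
def maxdb_memory_bytes(maxdb, bytes_per_entry=16):
    """Approximate memory used by the maxdb-sized tables."""
    return bytes_per_entry * maxdb

print(maxdb_memory_bytes(8000))       # default maxdb: 128000 bytes
print(maxdb_memory_bytes(1000000))    # 16000000 bytes, about 16 million
```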


3 Searching for point mutations using pregap4 and gap4

The original version of these methods was described in James K Bonfield, Cristina Rada
and Rodger Staden, "Automated detection of point mutations using fluorescent sequence
trace subtraction", Nucleic Acids Res. 26, 3404-3409, 1998. The more recent work has
been done by Mark Jordan and James Bonfield with advice from Graham Taylor, Andrew
Wallace, Will Wang and others.

3.1 Introduction to mutation detection
Our methods for detecting mutations are based on the alignment and comparison of the
fluorescent traces produced by Sanger DNA sequencing. To use clinical terminology, samples
from patients are compared to standard reference traces. Patient and reference traces should
be produced using the same primers and sequencing chemistry, ideally from both strands
of the DNA. The data shown in the examples below is from exon 11 of the BRCA1 gene.
The basic idea is illustrated in the following two figures which are screen dumps from
our program gap4 (see Section 2.2 [Gap4 introduction], page 95). The first shows a sample
containing a point mutation and the second contains a heterozygous base position. The
displays are bisected vertically: at the top left is the sample trace from one strand of
the DNA, below that the reference trace for that strand, and underneath the difference
between these traces which is obtained by subtracting one from the other. On the right is
corresponding data from the other DNA strand (shown complemented).


Figure 1. Top and bottom strand differences for a point mutation.

Figure 2. Top and bottom strand differences for a heterozygous base.
As can be seen, although no vertical scaling is performed the difference trace is quite
flat or is consistently either above or below the mid-line, except at the sites of mutations.
Near these are strong peaks, but notice that only for the mutated base are there peaks both
above and below the mid-line. The context effects caused by the mutation produce peaks
only in one direction.
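
The observation above can be turned into a simple illustration in Python. The sketch assumes that the sample and reference traces have already been aligned point by point and are held as four lists of intensities, one per base. It subtracts them channel by channel and reports positions where the difference has a peak above and a peak below the mid-line. This is not the algorithm used by mutscan, and the names, threshold and data are invented for the example.

```python
def difference_trace(sample, reference):
    """Subtract the reference from the sample, channel by channel."""
    return {base: [s - r for s, r in zip(sample[base], reference[base])]
            for base in "ACGT"}

def candidate_positions(diff, threshold):
    """Positions with a peak both above and below the mid-line."""
    hits = []
    for i in range(len(diff["A"])):
        values = [diff[base][i] for base in "ACGT"]
        if max(values) >= threshold and min(values) <= -threshold:
            hits.append(i)
    return hits

# Invented data: at point 2 the sample shows A where the reference has G;
# at point 4 the T peak is merely taller (a context effect).
sample = {"A": [0, 0, 90, 0, 0], "C": [0] * 5,
          "G": [80, 0, 0, 0, 0], "T": [0, 0, 0, 0, 70]}
reference = {"A": [0] * 5, "C": [0] * 5,
             "G": [80, 0, 85, 0, 0], "T": [0, 0, 0, 0, 30]}

diff = difference_trace(sample, reference)
print(candidate_positions(diff, threshold=30))   # prints [2]
```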
It is perhaps necessary to point out that analysis of the traces is essential because base
callers make mistakes: they can assign the wrong base types and also assign single bases
where the DNA is heterozygous. An example of the latter can be observed in Figure 2:
on one strand the base caller has assigned a "-" symbol at position 251, at least indicating
uncertainty, but on the other strand it has assigned "T". The DNA is clearly heterozygous
at this position. This means that simply looking for differences between patient sequences
and reference sequences will cause point mutations and heterozygous bases to be missed (of
course base calling errors will also create false differences).
These trace displays alone are very useful for visual inspection of data and are all some
users want. However we also have programs which automatically analyse the trace differences and tag the bases which have significant peaks as possible sites of mutation.
Trace viewing is initiated from within the gap4 editor (see Section 2.6 [Editing in gap4],
page 160). Each record in the editor shows an individual reading with its number and name
at the left. Negative numbers denote readings which have been complemented. Several
sequences have special status. At the top is a sequence labelled with a letter S at the
left edge. This is the reference sequence, here the EMBL entry HSLBRCA1 which covers
the entirety of the BRCA1 gene. The numbering at the top of the display corresponds to
positions in this reference sequence. The program has also coloured (green) all exons on the
reference sequence. The bottom DNA sequence in the editor is labelled "CONSENSUS".
For mutation detection work this sequence is forced to be identical to the reference. Below
the CONSENSUS sequence is the amino acid sequence for the reference. This is calculated
on the fly using the feature table of the reference sequence and so translates only exons and
in their correct reading frames. Two other sequences (near the top) are labelled R and F.
These are the readings providing the reverse and forward reference traces for this segment
of the data.

Figure 3. A set of aligned sequence readings displayed in the gap4 editor.
At the very bottom of the editor is an information line which is used to display data
about items touched by the mouse cursor. Here it is showing data about one of the positions
tagged as possibly being heterozygous. It includes the observed base types (G and A) and
the scores achieved by the automated analysis.
The editor can be set to show only differences between readings and the reference; all
matching bases appear as dots. For example, Figure 4 shows the same data as Figure 3,
but with the editor set to show differences, and the information line showing details about
a possible mutation.

Figure 4. An alternative view of aligned sequence readings in the gap4 editor.
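
The difference view described above can be mimicked in a few lines of Python. The sketch assumes the reading has already been aligned to the reference so that the two strings have matching columns; the function name is hypothetical.

```python
def show_differences(reference, reading):
    """Replace bases that match the reference with dots."""
    return "".join("." if ref.upper() == base.upper() else base
                   for ref, base in zip(reference, reading))

print(show_differences("ACGTAC", "ACATAC"))   # prints ..A...
```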

One column contains several bases tagged in red, signifying possible heterozygotes, and
some in orange denoting possible point mutations. During visual inspection the program
can be made to move the cursor from one tag to the next and to display the aligned traces
as shown above in Figures 1 and 2.

It is also possible to have positive controls for displaying the trace differences; i.e. reference
traces which contain the mutation. In this case the traces appear as shown in Figure 5.
Here the forward and reverse positive controls are shown to the right of the normal plots.
In Figure 5 the positive control difference plots are quite flat, hence, in this case, providing
confirmation of the presence of the heterozygous base.

Figure 5. Top and bottom strand differences and positive control for a heterozygous
base.
As mentioned above the package contains programs which can automatically compare
the traces and their reference sequences. The output from these programs is the set of tags
shown in the editor. Users can check the traces at these positions using the displays shown
in Figures 1, 2 and 5; if necessary removing or adding tags. Alternatively users can rely
entirely on visual inspection and create all tags themselves.
Once all the mutations are correctly tagged the program can produce a report which
includes the reading names, mutation positions relative to the reference sequence, the actual
change, its effect, and the evidence. An example is shown below in Figure 6.

001321_11aF    33885T>Y
001321_11aF    34407G>K
001321_11cF    35512T>Y
001321_11cF    35813C>Y
001321_11dF    36314A>R
001321_11eF    36749A>R
001321_11eF    37313T>K
000256_11eF    36749A>G

Figure 6. How gap4 reports mutations.
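
The entries in Figure 6 appear to consist of a position in the reference sequence, the reference base, and the observed base, where the observed base may be an IUPAC ambiguity code (for example Y for C or T, K for G or T, R for A or G). Under that reading, which is inferred from the figure rather than taken from a format specification, an entry can be decoded with the following Python sketch. The function name is hypothetical and only two-base ambiguity codes are handled.

```python
import re

# IUPAC codes for single bases and two-base ambiguities
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "K": "GT",
         "M": "AC", "S": "CG", "W": "AT"}

MUTATION_RE = re.compile(r"^(\d+)([ACGT])>([ACGTRYKMSW])$")

def parse_mutation(text):
    """Decode an entry such as '33885T>Y' into its parts."""
    match = MUTATION_RE.match(text)
    if match is None:
        raise ValueError("unrecognised mutation entry: %r" % text)
    observed = match.group(3)
    bases = set(IUPAC[observed])
    return {"position": int(match.group(1)),
            "reference": match.group(2),
            "observed": observed,
            "bases": bases,
            "heterozygous": len(bases) > 1}

print(parse_mutation("33885T>Y")["heterozygous"])   # prints True
print(parse_mutation("36749A>G")["heterozygous"])   # prints False
```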


3.1.1 Mutation Detection Programs
The software handles batches of trace data from sequencing instruments. It performs all
processing except base calling (although it can employ third party programs such as phred
for this step). This includes file format conversions, quality clipping, scanning for mutations
and heterozygotes, multiple sequence alignment, easy visual inspection of traces, production of reports, and the accumulation and storage of readings and traces. The software also
handles the initialisation/configuration of standard reference files and databases for any
project. The two main programs are pregap4 and gap4. Pregap4 (see Section 4.2 [Pregap4
introduction], page 326) prepares data for gap4 by automatically using a variety of smaller
programs, including those used to search for mutations: mutscan (see Section 4.6.22
[Mutation Scanner], page 359). Gap4 (see Section 2.2 [Gap4 introduction], page 95) is used to
store the aligned readings, to view the sequences and traces, and to produce a report listing
the observed mutations.
Any number of sequences can be processed in a single run, and for each individual patient
sample the operation is generally performed in two steps. First, via pregap4, the traces are
aligned and compared to the reference traces and any possible mutations or heterozygous
bases marked. Secondly, the data is transferred into a gap4 database from where users can
visually check the differences between the reference and patient traces.
The program mutscan (see Section 4.6.22 [Mutation Scanner], page 359) can automatically compare patient and reference traces to find point mutations and heterozygous bases.
Users can set parameters which control the sensitivity of the algorithms (and hence which
determine the ratio of false negative and positive results). Mutscan adds tags of type
“mutation” or “heterozygous” to the patient files. The tags contain the numerical scores
achieved at the site of the reported base changes, and they can be viewed via the gap4
editor (see Section 2.6 [Editing in gap4], page 160). Mutscan is normally run via pregap4
(see Section 4.2 [Pregap4 introduction], page 326).
The description of the programs given below is presented in reverse order of use i.e. gap4
then pregap4, but first we give further details about the use of reference data.

3.1.2 Mutation Detection Reference Data
The mutation detection methods require reference traces and optionally reference sequences.
Reference traces are used for automatic mutation detection and for visual inspection of trace
differences. Reference sequences are used in gap4 to provide a base numbering standard, and
if required to provide feature table entries to control translation and mutation reporting.

3.1.3 Reference Sequences
Reference sequences are used in gap4 (see Section 2.2 [Gap4 introduction], page 95). Here
they can be used to define a numbering system independent of gaps introduced to produce
alignments. The numbering can start at any point in the reference sequence. If the reference
sequence is entered with a feature table the features are converted to tags and can be used
to control translation of the sequence in the contig editor. For mutation detection work the
reference sequence and feature table enable mutations to be reported using positions defined
by the reference sequence, and also allow the effect of the mutations to be noted. Gap4
is able to store entries from the EMBL sequence library complete with their feature tables.
These feature tables are converted to gap4 database annotations (tags), which means that
they can be selectively displayed in the template display and editor, and used to translate
only the exons (in the correct reading frame). Obviously it may be useful to augment the
feature tables with the sites of known polymorphisms or deleterious mutations so that they
can be displayed in gap4 as landmarks. When it comes to producing a report of the observed
mutations the feature table is used to work out if a mutation is expressed and if so what
the amino acid change is. Additional tags can be created to specify the positions of the
primers or restriction sites used to obtain data covering segments of the sequence. For any
project the reference sequence need only be set up once. Either project databases can be
started with the reference sequence already configured or the reference can be assembled
along with the reading data. The reference sequence can be designated (or reassigned) as
follows. In pregap4 (see Section 4.2 [Pregap4 introduction], page 326) it can be named in
the module "Reference Traces". In the gap4 editor it can be set by right clicking on its
name. Once set it should appear labelled "S" at the left edge of the editor.
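
The idea of a numbering system that is independent of alignment gaps can be sketched in Python as follows. The function assigns a reference position to each column of a padded reference sequence and skips the padding characters. The use of "*" as the padding character, the start parameter and the function name are assumptions made for this illustration.

```python
def reference_positions(padded_reference, start=1, pad="*"):
    """Reference position for each alignment column (None at pads)."""
    positions = []
    current = start
    for char in padded_reference:
        if char == pad:
            positions.append(None)      # a gap introduced by alignment
        else:
            positions.append(current)
            current += 1
    return positions

print(reference_positions("AC*GT", start=100))
# prints [100, 101, None, 102, 103]
```

Because padding columns receive no number, inserting further pads during alignment does not change the position reported for any reference base.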

3.1.4 Reference Traces
Reference traces are used by the automatic mutation detection program mutscan (see
Section 4.6.22 [Mutation Scanner], page 359), and by the trace difference display in the gap4
editor (see Section 2.6 [Editing in gap4], page 160). Ideally forward and reverse reference
traces should be available and should be obtained using the same primers and sequencing
chemistry as the patient data. From the "settings" menu of the editor the trace display can
be set to "Auto-Diff traces". Once this is activated, whenever the user double clicks on a
base in the editor sequence display, not only is the reading’s trace displayed, but also its
designated reference trace plus the difference between them. If its complementary reading
is available, its trace and reference trace and their differences are also displayed. These
trace displays and the editing cursor scroll in synch.

Top and bottom strand differences for a heterozygous base.
The preferred way of assigning reference traces to readings is by use of "naming conventions"; that is to have a simple set of rules which control the names given to the trace
files. It can be seen in the figures showing the editor that forward and reverse readings from
the same patient have names with a common root but which end either F or R. This both
ties the two together (so the software knows which is the corresponding complementary
trace when the user double clicks on a reading) and also enables the association of readings
and their reference traces. Once a convention has been adopted the rules can be defined
for pregap4 by loading them via the "Load Naming Scheme" option in its File menu (see
Section 4.8 [Pregap4 Naming Schemes], page 366). For any batch of readings the reference
traces are defined within pregap4’s "Reference Traces" module. Note that this mode of
operation, by allowing the specification of only one forward and one reverse trace, limits
each batch of traces processed to those which correspond to a given pair of reference traces.
The size of the batch is unlimited.
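
A naming convention of the kind described above, where forward and reverse readings share a common root and end in F or R, makes it straightforward to pair readings automatically. The Python sketch below illustrates the principle only. It is not how pregap4 naming schemes are defined, the function name is hypothetical, and the reverse reading name in the example is invented.

```python
from collections import defaultdict

def pair_readings(names):
    """Group reading names by common root; the last letter gives the strand."""
    pairs = defaultdict(dict)
    for name in names:
        root, strand = name[:-1], name[-1]
        if strand in ("F", "R"):
            pairs[root][strand] = name
    return dict(pairs)

readings = ["001321_11aF", "001321_11aR", "000256_11eF"]
for root, strands in pair_readings(readings).items():
    print(root, strands.get("F"), strands.get("R"))
```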
The alternative way of specifying the reference traces is to right click on their names in
the editor. This also allows positive trace controls to be specified (which is not possible in
pregap4).


3.1.5 Using The Template Display With Mutation Data

Figure 7. The template display showing the whole of the BRCA1 gene (exons in green).

The view obtained from the Template display and shown in Figure 7 is not of practical
use but serves here to illustrate the overall arrangement of the data for our chosen example
the BRCA1 gene. This figure shows the entirety of the EMBL entry HSLBRCA1 with its
exons marked in green. Only exon 11 has patient trace data stacked above it.

Figure 8. A zoomed-in version of the data shown in Figure 7.
Here we can see all the readings covering exon 11. Forward readings are light blue,
reverse readings orange, primers are marked in yellow, mutations in red and orange. A
common mutation appears in the leftmost set of readings and illustrates the value of using
the template display for visualising the overall pattern of the tagged mutations.

3.1.6 Configuring The Gap4 Editor For Mutation Data
The current version of the gap4 editor contains very many options that are not needed
for mutation data. Given sufficient demand a version tailored for mutation studies could
be produced. For now it might make it easier to understand the program if its origin as
a genome assembly program is borne in mind. Here we outline the options and settings
relevant to mutation studies. The assignment of reference sequence and traces is described
above. From the editor they can be set by right clicking on the reading names.
Gap4 enables segments of sequences to be annotated (or tagged). Each tag has a type
(eg primer) and each type has an associated colour. Each instance of a tag can include
editable text. This text can be viewed and edited by right clicking on the tag and selecting
"Edit tag", after which a text box will appear. Gap4 can display annotations/tags as
background colour and the user can specify which tag types are shown. For mutation
studies the following tag types may usefully be activated, and all others turned off. Using
the "Set Active Tags" option in the "Settings" menu first click on "Clear all". Then click
on "primer". To add further types you must hold down the "Ctrl" key on the keyboard
while clicking. Now scroll down and click on "Mutation", "Heterozygous" and "FEATURE
CDS". Add any others required, then click "OK".
The following configurations are performed via the "Settings" menu.
Gap4 has three consensus generation algorithms. When using a reference sequence it
is convenient if the consensus shown in the editor is forced to be the same as the reference. This will be the case if either the "Weighted base frequencies" or the "Confidence
values" consensus algorithms are being used. This selection is made using the "Consensus
algorithm" option.
Translations are shown in what gap4 refers to as the "Status" line. To enable automatic
translation of the exons defined in the reference sequence, in the "Status Line" option set
"Translate using feature tables".
To enable automatic display of trace differences, in the "Trace Display" option set "AutoDiff Traces".
To show only the bases that differ from the consensus/reference, set "Highlight Disagreements". These can be shown by dots or colour.
To show base confidence values set "Show reading quality" and also make sure that the
value in the box labelled "Q" at the top left of the editor is set to 0 or greater.
To force forward and reverse reading pairs to be shown in adjacent records in the editor
set "Group readings by templates" (NB this assumes that an appropriate naming scheme
has been used).
If a reference sequence is assigned, the numbering at the top of the sequence will reflect
the base positions in that sequence. Any pads in the reference sequence are ignored. If
no reference sequence is assigned, the numbering will ignore pads if the "Show unpadded
positions" option is activated.
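The effect of ignoring pads on the numbering can be illustrated with a short sketch. It assumes that pads are written as "*" and that a pad column is simply given no number; the function name is invented and this is not the editor's own code.

```python
def unpadded_positions(padded_seq, pad="*"):
    """Return, for each column, its unpadded base number (None for a pad)."""
    positions = []
    count = 0
    for base in padded_seq:
        if base == pad:
            positions.append(None)   # pads do not advance the numbering
        else:
            count += 1
            positions.append(count)
    return positions

print(unpadded_positions("AC*GT"))   # [1, 2, None, 3, 4]
```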
At the bottom of the "Settings" menu is an option to "Save settings". Use of this will
mean that the current configuration will be set automatically next time the editor is used
(and hence the steps just described only need to be performed once).

3.1.7 Using The Gap4 Editor With Mutation Data
The current version of the editor has a fixed width and a maximum height. If too many
sequences are present at any position a vertical scrollbar on the right edge can be used to
move them up and down. The CONSENSUS line will always be visible, but at present,
the reference sequence is scrolled along with all the other sequences and so may disappear.
Horizontal scrolling is achieved in the usual ways, plus by use of the >, >> and <, << buttons.
The reading names can be moved left and right using the scrollbar above them.
Configure the editor as described above.

The traces for readings (and their reverse) can be examined over their full length one at
a time by simply double clicking on them then scrolling along. Any mutations observed can
be labelled by right clicking on the base in the editor display and invoking the "Create tag"
option. This brings up a dialogue box. At the top is a button marked "Type:comment";
clicking on this will bring up another dialogue with a list of all the tag types; choose the
appropriate one ("Heterozygous" or "Mutation"). There are obviously many advantages to
examining the traces like this using gap4. However, if the automated mutation detection
methods are trusted, or used in a way that makes them trustworthy for the type of study
being undertaken, then there are quicker ways of examining the data.
The "Next Search" button at the top of the editor gives access to many types of
search, one of which is "tag type". If this is selected a button appears labelled "Tag
type COMM(Comment)". Clicking on this will bring up a dialogue showing all the available tag types. If the user selects, say "Mutation", each time the "Next Search" button is
used the program will position the editing cursor on the next mutation tag. Double clicking will automatically bring up the appropriate traces as shown in figures 1, 2 and 5 (see
Section 3.1 [Introduction to mutation detection], page 309). The user can view the traces
and if necessary alter the tag (eg delete it if it is a false positive).
Once all the data has been checked and all mutations and heterozygous bases have
been tagged a report can be generated using the "Report Mutations" option in the editor
"Commands" menu. Note that it is also possible to simply report all differences between
base calls and the reference, but the usual procedure is for the program to report all bases
tagged as "Mutation" or "Heterozygous". Example output is shown above in Figure 6 (see
Section 3.1 [Introduction to mutation detection], page 309). The report appears in the gap4
"Output window" which can be saved to disk by right clicking on the text and selecting
"Output to disk".

3.1.8 Processing Batches Of Mutation Data Trace Files
It is not clear which is the best way of organising the data for the simplest and most efficient
processing using the current programs, but for now we make the following suggestions.
We assume that the region of the DNA being studied has a standard set of forward and
reverse primer pairs covering all segments of interest and that a standard reference sequence
in EMBL format is available.
We recommend that batches of data from single primer pair combinations are processed
separately, using separate temporary gap4 databases. For example, exon 11 of BRCA1 can
be covered by five pairs of forward and reverse primers and we suggest that batches of traces
obtained from each of these primer pairs should be processed using five gap4 databases.
Each processing run should create a new database and should enter, not only the new
sets of patient data for that particular primer pair, but also the corresponding reference
sequence and reference traces.
Obviously when several primer pairs are needed to cover a given region of the DNA (eg
for BRCA1) the same reference sequence would be used for all the primer pairs.
An alternative to the above is to create a template database for each primer pair which
contains the data for the corresponding forward and reverse reference traces plus the fully
annotated reference sequence. These template databases are copied to create a temporary
database for each new batch of data for the given primer pair.
Whichever of these two strategies is adopted each batch of new data is processed, analysed and assembled into these temporary databases, inspected visually, and a mutation
report generated.
The use of separate temporary databases simplifies the assignment of reference traces
and the use of the report generation function.
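For sites adopting the template database strategy, the copying step could be scripted along the following lines. This is a minimal sketch under stated assumptions: we assume a gap4 database consists of a file named NAME.VERSION and a matching NAME.VERSION.aux file (see Section 2.16 [Gap Database Files], page 284), and the function name and directory layout are hypothetical.

```python
import shutil
from pathlib import Path

def copy_template_database(template_dir, name, batch_dir, version="0"):
    """Copy the files of a template database into a fresh batch directory."""
    batch = Path(batch_dir)
    batch.mkdir(parents=True, exist_ok=True)
    copied = []
    # Assumed file layout: NAME.VERSION plus NAME.VERSION.aux
    for suffix in ("", ".aux"):
        source = Path(template_dir) / f"{name}.{version}{suffix}"
        target = batch / source.name
        if target.exists():
            # Never overwrite an existing batch database.
            raise FileExistsError(f"{target} already exists")
        shutil.copy2(source, target)
        copied.append(target)
    return copied
```

A call such as copy_template_database("templates", "EXON11A", "batch_2002_07") would then give each new batch its own temporary database, leaving the template untouched. The template should not be open in gap4 while it is being copied.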

Figure 9. An overview of a database containing data for only one primer pair of BRCA1
For long term storage and to facilitate larger studies, the content of each of these temporary databases is then transferred to archive databases, after which the temporary databases
are no longer needed. The archive databases could be restricted to individual primer pairs
or could accommodate data covering the whole of the reference sequence.

3.1.9 Processing Batches Of Mutation Data Trace Files Using Pregap4
All the data processing other than visual inspection of traces and report generation is
handled by the program pregap4 (see Section 4.2 [Pregap4 introduction], page 326). Pregap4
achieves this by running a set of individual programs selected by the user.

Figure 10. The pregap4 Configure Modules window showing a typical list of mutation
data option selections.
The "Configure Modules" window shown in Figure 10 is used to select which programs
to apply to a batch of data, and to configure their usage. On the left is a list of programs
and options, with "x" showing the ones that have been selected. If the user clicks on an
option name its name is given a blue background and its configurable parameters are shown
in the right hand panel to enable the user to alter them. Here "Reference Traces" has been
selected which enables the user to set the reference traces and sequence.
The other selected options (marked with "x") are typical of the ones used for mutation
detection studies. Below we describe the use of each plus a few alternatives. All of the
options are described in more detail elsewhere in our documentation; our intention here is
to give an overview of their use during mutation studies.
Note that the window labelled "Files to Process" is used to tell the program which files
to process as a batch.

3.1.10 Configuration Of Pregap4 For Mutation Data
General Configuration
This option allows the user to select whether the trace names used for the
samples should be the same as their file names or should be the names stored
inside the files.
Phred
Phred is a base caller which also assigns confidence values to each base. Generally the data passed to pregap4 has already been base called. However not
all base callers assign confidence values and so it can be useful to apply phred
or ATQA (which does not base call but does assign confidence values). Alternatively "Estimate Base Accuracies" can be applied which is a simple program
for providing numerical values which reflect the signal to noise ratio for each
base, and which can be used instead of confidence values. (Note that if quality
clipping is used, its score thresholds depend on whether confidence values or
eba values are used.)
Trace Format Conversion
This option can be used to convert bulky files such as those of ABI to a compact
format such as SCF or ZTR without loss of the data required for trace display.
Initialise Experiment Files
The input to gap4 and several of the other programs used here is a data format known as Experiment file format. This step, which has no configurable
parameters, is essential for mutation data processing.
Augment Experiment Files
The section on Reference Traces outlined the use of "Naming Schemes" for
associating pairs of forward and reverse readings, and for assigning reference
traces. The naming scheme must be loaded from pregap4’s File menu. "Augment Experiment Files" must be activated in order for the naming scheme to
be applied. No parameters need be set.
Quality Clip
The reliability of the base calls varies with position along the sequence. Near
to both ends the data is less reliable. The "Quality Clip" option trims the ends
of the sequences by analysing their confidence values or accuracy estimates (if
present) or the density of unknown bases in the sequence. By observing these
"clip points" other processing programs will work more reliably.
Reference Traces
As explained above it is necessary to specify a reference trace (preferably one
for each strand of the data if processing data from both strands). The Reference
sequence can also be set here. Note that even if our suggestion to preload the
reference traces into the gap4 database is followed, it is still necessary to specify
them here for use by the mutation detection modules.
Trace Difference
This is the program which compares the patient and reference traces to search
for possible mutations. It adds data to the experiment files to mark each predicted mutation, and this data will appear as tags in the gap4 database. It
can also create a new trace file containing the difference of the reference and
the sample. The numerical parameters control the sensitivity of the algorithms,
and hence the ratio between the numbers of false positive and negative results.
Heterozygote Scanner
This is the program which compares the patient and reference traces to search
for possible heterozygous bases. It adds data to the experiment files to mark
each predicted heterozygous base, and this data will appear as tags in the gap4
database. The numerical parameters control the sensitivity of the algorithms,
and hence the ratio between the numbers of false positive and negative results.
Gap4 shotgun assembly
In order to be able to report the positions of mutations relative to the reference
sequence, and to be able to compare sets of samples from patients, it is necessary
to perform multiple sequence alignment on the data. This is termed "assembly"
and is usually performed by gap4, although other programs can be operated
via pregap4. If following the suggestion to preload the reference sequence to
a temporary database for each batch, supply the name of this database here.
Otherwise a new database should be named and created from this option. (If
this strategy is adopted make sure that the reference sequence and the reference
traces are assembled!) The parameters that control the assembly process
are described elsewhere.
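The idea behind the Trace Difference entry above, and the role of its sensitivity parameters, can be illustrated with a deliberately simplified sketch. This is not the algorithm used by the program: we assume a single channel, and traces that have already been aligned and scaled to one another, and the function names, threshold and amplitudes are all invented.

```python
def trace_difference(sample, reference):
    """Point-by-point difference of two aligned, equally scaled traces."""
    if len(sample) != len(reference):
        raise ValueError("traces must have the same number of points")
    return [s - r for s, r in zip(sample, reference)]

def candidate_points(difference, threshold):
    """Indices where the absolute difference reaches the threshold."""
    return [i for i, d in enumerate(difference) if abs(d) >= threshold]

# Invented amplitudes for a single channel.
reference = [0, 10, 80, 10, 0, 5, 60, 5]
sample = [0, 12, 78, 9, 0, 4, 20, 6]
diff = trace_difference(sample, reference)
print(candidate_points(diff, threshold=30))   # [6]
```

Lowering the threshold in this sketch flags more points (more false positives), and raising it flags fewer (more false negatives), which is the trade-off that the numerical parameters of the real modules control.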
Note that pregap4 has the facility to save its configuration and parameter settings. This
means that the current configuration will be set automatically next time the program is
used (and hence the steps just described only need to be performed once). In addition
pregap4 can be run non-interactively by typing a single line on the command line. Taken
together, these two capabilities mean that only one line need be typed in order to process all
subsequent batches of data (assuming the file names are reused, which is easy to arrange).

3.1.11 Discussion Of Mutation Data Processing Methods
At present pregap4 and gap4 clearly show their primary usage in the field of genome assembly, but versions tailored to mutation studies can be created once the requirements are
agreed. Ideally all processing should be controlled by a single program which once configured for any project should require users to provide only the project name - all other file
names and parameters could be preset, and all processing, including archiving and backup,
performed automatically, leaving the data ready for visual inspection.
The automatic mutation and heterozygote detection programs work well on all the test
data we have but now they require evaluation by external groups. Such analysis would
enable us to improve the algorithms and to tune their parameters. At present we know that
sometimes a base will be declared both as a mutation and as a heterozygous position when
visual inspection shows that it is one or the other.
There is still much that can be done overall to improve the methods, but the text above
summarises their status in July 2002. Although currently valuable for real scientific and
clinical work they should perhaps be viewed as prototypes.

4 Preparing readings for assembly using pregap4
4.1 Organisation of the Pregap4 Manual
Pregap4 is a relatively simple program to use. It is also very flexible and extendable, and so
much of the manual is taken up by explaining to programmers and system managers how
it can be configured. The average user need not be concerned with these details.
The Introductory section of the manual is meant to give an overview of the program:
what it is for, the files it uses and functions it performs, and how to use it. It is very
important for all users to have a basic understanding of the files used by pregap4 and the
processes through which it can pass their data (see Section 4.2.1 [Summary of the Files used
and the Processing Steps], page 326). The next section of the Introduction (see Section 4.2.3
[Pregap4 Menus], page 336) tabulates the program’s menus. This is followed by an overview
of the pregap4 user interface (see Section 4.2.2 [Introduction to the Pregap4 User Interface],
page 330) which should give a clear idea of how to actually use the program, and concludes
the introductory section.
More detail about how to define the set of files to be processed (see Section 4.3 [Specifying
Files to Process], page 337) is followed by a section showing how to run pregap4 and
giving examples of its use (see Section 4.4 [Running Pregap4], page 338). Next are sections
on configuring the pregap4 user interface (see Section 4.5 [Configuring the Pregap4 User
Interface], page 341).
The next part of the manual describes how to use the Configure Modules Window
to select the modules to apply and to set their parameters (see Section 4.6 [Configuring
Modules], page 342). This is one of the longest and most detailed parts of the manual in
that it describes how to configure all the current possible modules, many of which will not
be available at all sites, and several of which perform identical functions. Obviously, only
the entries which describe the functions that are available at a site are of interest.
One of the important tasks of pregap4 is to make sure that each reading’s Experiment file
contains all the information needed by gap4 to ensure the accuracy of the final consensus
sequence and to make the project proceed as efficiently as possible. Pregap4 provides
several methods for sourcing this information. One of these, as for example employed at
the Sanger Centre in the UK, is to encode some information about a reading in its reading
name. Pregap4 contains flexible mechanisms to enable a variety of the "Naming schemes"
or "Naming conventions" to be used as a source of information to augment the Experiment
files (see Section 4.8 [Pregap4 Naming Schemes], page 366). Alternatively pregap4 can
use simple text databases as an information source (see Section 4.10 [Information Sources],
page 371), or the user can set up some Experiment file record types for use with a batch of
readings (see Section 11.3.1 [Experiment file format record types], page 552).
The rest of the manual deals with increasingly complicated matters, and the average user
should never need to consult these sections. First there is a section on adding and removing
modules (see Section 4.11 [Adding and Removing Modules], page 375). This describes how
to control the list of modules which appear in the Configure Modules Window. The package
is usually shipped with this list set to contain more modules than are likely to be available
at any one site and so it might be found useful to remove those that are not available.

The next two sections, as their names imply, are for programmers only (see Section 4.12
[Low Level Pregap4 Configuration], page 377) and (see Section 4.13 [Writing New Modules],
page 395).

4.2 Introduction
Before entry into a gap4 database the raw data from sequencing instruments needs to be
passed through several processes, such as screening for vectors, quality evaluation, and
conversion of data formats. Pregap4 is used to pass a batch of readings through these
steps in an automatic way. It provides an interface for setting up and configuring the
processing and for controlling the passage of the readings through each stage. The separate
tasks are termed "modules" and each module is typically managed by a dedicated program.
Pregap4 wraps all of these modules into a single easy to use environment, whilst maintaining
the flexibility to select and extend the processing modules. It is an, as yet, unpublished
replacement for the program pregap (Bonfield, J.K. and Staden, R. Experiment files and their
application during large-scale sequencing projects. DNA Sequence 6, 109-117 (1996)).

4.2.1 Summary of the Files used and the Processing Steps
Gap4 stores the data for an assembly project in a gap4 database. Before being entered into
the gap4 database the data must be passed through several steps via pregap4. The range of
tasks that can be performed using pregap4 is shown schematically in the following figure.

The package can handle data produced by a variety of sequencing instruments, and also
data entered using digitisers or that has been typed in by hand. One of the first steps is to
convert trace files, such as those of ABI, which are in proprietary format, to SCF files (see
Section 11.1 [SCF introduction], page 533).
Next, if they are not already included in the files, base call confidence values are calculated,
and are normally stored in the reading’s SCF file (see Section 2.2.5 [The use of numerical estimates of base calling accuracy], page 118). This approach was originally put forward in Bonfield, J.K. and Staden, R.
The application of numerical estimates of base calling accuracy to DNA sequencing projects. Nucleic Acids Research 23, 1406-1410 (1995).
Next the base calls are copied from the trace files to text files known as Experiment files
(see Section 11.3 [Experiment files], page 552).
Note it is also possible to enter sequence readings in the form of FASTA files for use at this
stage of the processing, in which case they will be automatically converted to Experiment
file format.
All the subsequent processes operate on the Experiment files.
Experiment file format is similar to that of EMBL sequence entries in that each record
starts with a two letter identifier, but we have invented new records specific to sequencing
experiments. Gap4 can make use of information about readings which may not be contained
within the raw data files, such as sequencing chemistry and whether it is a forward or
reverse reading. Gap4 will work without this information, but at a reduced level. For
instance knowing which forward and reverse readings belong together allows gap4 to check
the validity of assembly and for automatic ordering of contigs.
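The line structure just described (a two letter identifier followed by its value) is simple to read from a script. The sketch below is only an illustration: the function name is invented, the sample values are made up, and the identifiers used in the sample are assumed examples that should be checked against the list of record types (see Section 11.3.1 [Experiment file format record types], page 552). Records that span several lines, such as the sequence itself, are deliberately not handled.

```python
def parse_records(text):
    """Split Experiment-file-style text into (identifier, value) pairs."""
    records = []
    for line in text.splitlines():
        ident = line[:2]
        # Skip blank lines, continuation lines and anything else that
        # does not begin with a two letter upper case identifier.
        if len(ident) < 2 or not (ident.isalpha() and ident.isupper()):
            continue
        records.append((ident, line[2:].strip()))
    return records

# Invented sample; identifiers and values are for illustration only.
sample = """ID   example01
LN   example01.scf
LT   SCF
QL   15
QR   350
"""
for ident, value in parse_records(sample):
    print(ident, value)
```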
One of pregap4’s next tasks is to augment the Experiment files to include data about
the chemistry, vectors, primers and templates used in the production of each reading, and
if necessary it can extract this information from external databases (see Section 4.10 [Information Sources], page 371), or via local reading name conventions (see Section 4.8 [Pregap4
Naming Schemes], page 366). Once the Experiment file for a reading contains all the necessary information the remaining processing programs can be used in turn to analyse the
data.
First the reading is marked at both ends to define the range of reasonable quality base
calls (see Section 4.6.8 [Quality Clip], page 347).
Then the reading is searched for the presence of sequencing vector at the 5’ and 3’ ends
(see Section 4.6.9 [Sequencing Vector Clip], page 348).
Next the sequence is checked for the presence of "cloning" vector, i.e. non-sequencing
vectors, such as those of BACs (see Section 4.6.11 [Cloning Vector Clip], page 350).
The final check of this type is to screen the reading for any vector that may have been
missed in the previous searches (see Section 4.6.12 [Screen for Unclipped Vector], page 351).
The next check is to screen the reading for any set of sequences which it may be contaminated by, such as E. coli (see Section 4.6.13 [Screen Sequences], page 351).
Note that vector sequence files are normally stored in the package vectors directory/folder. If a file of vector file names is used the vector sequences can also be stored in
its directory/folder. Files of file names and vector-primer files can also contain environment
variables to define the location of vector files.
Vector primer files, vector sequence files and files of file names must be plain
text files (see Section 11.5 [Vector primer Files], page 567), (see Section 11.6 [Vector sequence format], page 568).
Pregap4 is usually used non-interactively once the modules have been configured, but
some groups prefer (or have the time) to check the data by eye using the program trev (see
Section 8.1 [Trev], page 417) at this stage.
Another option is to search the readings for families of known repeats (see Section 4.6.18
[Tag Repeats], page 354). This will tag any regions which are found to match known repeats.
Some groups are using the package for mutation studies and the final pregap4 option,
prior to assembly, is to use the mutation scanner program (see Section 3.1 [Introduction
to mutation detection], page 309) to search the readings for mutations (see Section 4.6.22
[Mutation Scanner], page 359).
Pregap4 can also be used to assemble the readings into a gap4 database (see
Section 4.6.23 [Gap4 Shotgun Assembly], page 361), or to assemble the readings using an
external assembly engine such as FAKII (see Section 4.6.26 [FakII Assembly], page 362),
and then to enter that assembly into a gap4 database (see Section 4.6.28 [Enter Assembly
into Gap4], page 364).
It is unlikely that any particular user will want to employ all of these options and one
of pregap4’s modes of use is to enable users to configure the program for their work (see
Section 4.6 [Configuring Modules], page 342). Not only can they select which tasks should
be performed, and which of the alternative programs ("modules") should be used for them,
but also the order in which they are applied. Although it is very rarely a problem, this
high level of flexibility comes at a price in the current version of pregap4: pregap4 does not
include code to check on the logicality of the configuration set by a user and will attempt to
execute the modules in the order given. There are some users, who having read this section,
will configure pregap4 to perform assembly before creating the Experiment files from the
trace files. Pregap4 will attempt to do this and no data will be assembled as the files given
to the assembly engine will be in the wrong format. This is just something to be aware of.
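Since pregap4 does not perform this check itself, a site could check a proposed module order before running a batch. The sketch below shows one way of doing so. The table of prerequisites is entirely our assumption (pregap4 holds no such table) and the function name is invented; only the module names are taken from the Configure Modules window.

```python
def order_problems(enabled_modules, prerequisites):
    """List (module, prerequisite) pairs where the prerequisite does not run first."""
    seen = set()
    problems = []
    for module in enabled_modules:
        for needed in prerequisites.get(module, ()):
            if needed not in seen:
                problems.append((module, needed))
        seen.add(module)
    return problems

# Assumed dependency table, for illustration only.
PREREQUISITES = {
    "Quality Clip": ["Initialise Experiment Files"],
    "Gap4 Shotgun Assembly": ["Initialise Experiment Files"],
}

bad_order = ["Gap4 Shotgun Assembly", "Initialise Experiment Files", "Quality Clip"]
print(order_problems(bad_order, PREREQUISITES))
# [('Gap4 Shotgun Assembly', 'Initialise Experiment Files')]
```

Note that such a table would need relaxing when the input files are already in Experiment file format, as the initialisation step is then not a true prerequisite.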
Pregap4 uses configuration files to remember the setup for each user or project.
These files define which modules are activated and what their parameter settings are
(see Section 4.7 [Using Config Files], page 366). These files, which can obviously save
considerable amounts of time, are created automatically and can be saved from the
Configure Modules Window once the configuration is complete.
The trace files are not altered, but are kept as archival data so that it is always possible
to check the original base calls and traces. The trace files are used by gap4 to display traces
and to compare the final consensus sequence with the original data, therefore they must be
kept online for the lifetime of the project. To save disk space it is best to use SCF files and,
if they were derived from a proprietary format such as that of ABI, to remove the originals.
Any changes to the data prior to assembly (and we recommend that none are made until
readings can be viewed aligned with others) are made to the copy of the sequence in the
Experiment file. For example the results of all the searching procedures outlined above are
added as new records to each reading’s Experiment file. The reading data, in Experiment
file format, is entered into the project database (see Section 2.16 [Gap Database Files],
page 284), usually via one of the assembly engines. All the changes to the data made by
gap4 are made to the copies of the data in the project database. Once the data has been
copied into the gap4 database the Experiment files are no longer required.

During processing pregap4 uses temporary files. The number and nature of these files
depends on the modules used. At the very least pregap4 will produce files containing the
names of the input files and the result of their processing. Those that were processed
successfully will be stored in a file with a name ending ".passed" and those that failed in
one ending ".failed". The ".passed" file can be used as a file of input file names for assembly
into gap4 (assuming that a pregap4 assembly module has not already been used).
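When pregap4 is run non-interactively it can be convenient to summarise these two files from a script. The following sketch assumes that each line of the ".passed" and ".failed" files holds a single name and nothing else; if a line carries additional text the parsing would need adjusting. The function names are invented.

```python
from pathlib import Path

def read_name_list(path):
    """Read one name per line, ignoring blank lines; a missing file gives []."""
    path = Path(path)
    if not path.exists():
        return []
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]

def batch_summary(output_dir, prefix):
    """Return the passed and failed names recorded for a pregap4 run."""
    out = Path(output_dir)
    return {
        "passed": read_name_list(out / f"{prefix}.passed"),
        "failed": read_name_list(out / f"{prefix}.failed"),
    }
```

For a run with output directory "out" and prefix "batch1", batch_summary("out", "batch1") returns both lists, so that a script can, for example, stop before assembly if any reading has failed.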

While it is running, pregap4 will create files with a file name prefix defined by the user,
and store them in an output directory of the user’s choice (see Section 4.3 [Specifying Files
to Process], page 337).

When processing has finished pregap4 will produce a report containing information from
each module and the final list of passed and failed sequences.

4.2.2 Introduction to the Pregap4 User Interface
Pregap4 provides interfaces to define the batch of data files to be processed, to select which modules
are to be applied to them, to configure the modules, and to start the processing. It also
provides mechanisms for adding and removing modules, but this facility will be used far
less often than the others.

Pregap4 supports two styles of windowing. The default method is a compact mode, with
the alternative being "separate" mode - similar to gap4 and spin.

This is the "separate" window style. Here the main window is always visible, with
commands in the main window bringing up new windows. In the picture above the configure
window can be seen on top of the main window.

The second style is "compact" mode.

In the compact picture above the most common top level windows are "pages" in a
tabbed notebook. The benefit is greatly reduced screen space and quicker controls, but the
text output window is no longer permanently visible. The Window Style can be changed
using the options menu (see Section 4.5.2 [Window Styles], page 341).

4.2.2.1 Introduction to the Files to Process Window
Pregap4 operates on batches of files. These files can be binary trace files (in ABI, ALF or
SCF format), Experiment Files, or plain text, and do not need to all be in the same format.
The Files to Process Window is used to define which files are to be processed. The "Files
to Process" dialogue (see below) can be brought up from the File menu, or by pressing the
appropriate tab when in compact_win mode.

On the left hand side of the figure is the current list of files to process. This list can be
edited simply by clicking with the mouse and typing.
On the right side of the panel is the pregap4 output filename prefix, the output directory
name, and several buttons. The filename prefix is used when pregap4 needs to create files.
For example after processing there may be prefix.passed and prefix.failed files. All files will be
created within the output directory.
The buttons allow selection of the files to process. The "Add files" button will bring
up a file browser, which will allow one or more files to be selected. Pressing Ok on the file
browser will then add the selected files to the "List of files to process" panel on the left side
of the pregap4 window.
The "Add file of filenames" button may be used to select a list of files whose filenames
have been written to a ‘file of filenames’.
The "Clear current list" button will remove all filenames from the list.
Both the "Add files" and "Add file of filenames" buttons append their selections to the
list of files to process, so to replace the current list the "Clear current list" button must
first be used.

The "Save current list to..." button may be used to produce a new file of filenames,
containing the combined list of files to process.

4.2.2.2 Introduction to the Configure Modules Window
The "Configure Modules" dialogue is available from the Modules menu or, when using the
compact window style, by pressing the Configure Modules tab.
As can be seen in the figure below, the left side of the display contains a list of the
currently loaded modules. One module in this list will be highlighted. The right side of the
display shows the configuration panel for this highlighted module and is module specific.

The module list shown on the left consists of a series of module names, each with its
status. The tick or cross at the left of the name indicates whether this module is enabled,
and is termed the "enable status". The text to the right of the module name indicates
whether the module has been given all the parameters needed for it to run. This will be one
of "ok" (all configuration options have been filled in), "-" (no configuration options exist
for this module), "edit" (further configuration is required) or blank (this module is disabled).
The "enable status" can be toggled by left clicking on the tick/cross to the left of the
module name. The enable status can be written to the current Pregap4 configuration file
using the "Save Module List" or "Save All Parameters" commands in the Modules menu.
Left clicking anywhere on a module name in the module list will switch the pane on the right
side of the window to display any available parameters for this module. Not all modules
will have parameters to configure.

For modules that do have parameters, the top line of the configuration panel will contain
two buttons labelled "Select params to save" and "Save these parameters". The "Select
params to save" button will add check boxes next to each parameter. Clicking on these check
boxes allows selection of individual parameters to save for this module. Once these have
been selected, pressing the "Save" button will save only those selected to the pregap4 configuration
file. Pressing the "Save these parameters" button will save all parameters for this module
to the configuration file.
The bottom strip of the window is an "Information Line".

4.2.2.3 Introduction to the Textual Output Window
Pregap4 has a main text output window identical to that of gap4 and spin. It is used for
showing textual results in the top section and error messages in the lower part. Full details
of the user interface are given elsewhere (see [User Interface], page 523), but an example of
the Text Output Window is given below.

4.2.2.4 Introduction to Running Pregap4
When pregap4 is started the user first needs to select the files to process. This is done using
the "Files to Process" command (from the File menu).
The "Configure Modules" tab allows for the currently available modules to be enabled
or disabled, and the module parameters edited accordingly.

Once all modules have been configured (so that none have "edit" listed next to their
name) pregap4 is ready to begin processing. This is started by pressing "Run" or by
selecting "Run" from the File menu.
When pregap4 has a setup that would be useful in the future, "Save All Parameters (in
all modules)" from the Modules menu can be used, and pregap4 will store all the module
parameters to a configuration file ready for subsequent runs.
When processing has finished pregap4 will produce a report containing information from
each module and the final list of passed and failed sequences.
If for any reason pregap4 fails a particular step in the processing, users are strongly
recommended to correct whatever has caused the module to fail, clean up any files it has
created, and then repeat the whole process. That is, until users have a good understanding
of what happens at each stage of processing, it is better to repeat all the steps with the
original list of files, than to try to guess which step to continue from.

4.2.3 Pregap4 Menus
The main window of pregap4 contains File, Modules, Information source and Options
menus.

4.2.3.1 Pregap4 File menu
The File menu includes functions for setting the files to process, loading configuration files
and naming schemes, including configuration components, starting processing and exiting.
• Set Files to Process (see Section 4.3 [Specifying Files to Process], page 337)
• Load New Config File (see Section 4.7 [Using Config Files], page 366)
• Load Naming Scheme (see Section 4.8 [Pregap4 Naming Schemes], page 366)
• Include Config Component (see Section 4.9 [Pregap4 Components], page 371)
• Save All Parameters (in all modules) (see Section 4.6 [Configuring Modules], page 342)
• Save All Parameters (in all modules) to: (see Section 4.6 [Configuring Modules], page 342)
• Save Module List (see Section 4.6 [Configuring Modules], page 342)
• Exit

4.2.3.2 Pregap4 Modules menu
The pregap4 Modules menu contains options for adding and configuring modules, and running pregap4.
• Add/Remove Modules (see Section 4.11 [Adding and Removing Modules], page 375)
• Configure Modules (see Section 4.6 [Configuring Modules], page 342)
• Select all modules
• Deselect all modules

4.2.3.3 Pregap4 Information source menu
The Information source menu contains options for specifying how the information required
for the experiment files is to be obtained. These menu options can also be entered from the
"Augment Experiment Files" module.

• Simple Text Database (see Section 4.10.1 [Simple text Database], page 371)
• Experiment File Line Types (see Section 4.10.2 [Experiment File Line Types], page 373)

4.2.3.4 Pregap4 Options menu
The Options menu contains options for setting fonts and colours and defining the style of
the user interface.
• Set Fonts (see Section 4.5.1 [Fonts and Colours], page 341)
• Set Colours (see Section 4.5.1 [Fonts and Colours], page 341)
• Compact Window Style (see Section 4.5.2 [Window Styles], page 341)
• Separate Window Style (see Section 4.5.2 [Window Styles], page 341)

4.3 Specifying Files to Process
Pregap4 needs to be given a list of files to process. These files can be binary trace files
(in ABI, ALF, SCF, CTF or ZTR format), Experiment Files, FASTA, or plain text. The
files to process do not need to all be in the same format. FASTA files will be converted to
Experiment files.

Referring to the figure above, the "Files to Process" dialogue can be brought up from the
File menu, or just by pressing the appropriate tab when in compact_win mode.
On the left hand side is the current list of files to process. This list can be
edited simply by clicking with the mouse and typing as normal. This only edits Pregap4’s
temporary copy of this list and does not modify the contents of any file of filenames that
the list was obtained from.

On the right side of the panel is the pregap4 output filename prefix, the output directory
name, and several buttons. The filename prefix is used when Pregap4 needs to create files
for its own use, whether temporary or longer lived. For example, after processing
there may be prefix.passed and prefix.failed files. The prefix defaults to ‘pregap’ until a file of
filenames is loaded, in which case it switches to the name of the last used file of filenames. All files
will be created within the output directory, regardless of where the input files reside. The
output directory defaults to the current directory or to the last used input directory.

The buttons allow selection of the files to process. The "Add files" button will bring
up a file browser, which will allow one or more files to be selected. Pressing Ok on the file
browser will then add the selected files to the "List of files to process" panel on the left side
of the pregap4 window. The "Add file of filenames" button may be used to select a list of
files whose filenames have been written to a ‘file of filenames’. The list of files to process
may be edited within pregap4, allowing filenames to be added or removed. The "Clear
current list" button will remove all filenames from the list. Both the "Add files" and "Add file of
filenames" buttons append their selections to the list of files to process, so to replace the
current list the "Clear current list" button must first be used. Finally the "Save current
list to..." button may be used to produce a new file of filenames, containing the combined
list of files to process.
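The append, clear and save behaviour of these buttons can be modelled in a few lines of
Python. This is only an illustrative sketch and is not pregap4 code: the FileList class
and its method names are hypothetical, and a file of filenames is assumed to hold one
filename per line.

```python
from pathlib import Path


class FileList:
    """Hypothetical in-memory model of the "List of files to process"."""

    def __init__(self):
        self.names = []

    def add_files(self, names):
        # "Add files" appends to whatever is already listed.
        self.names.extend(names)

    def add_file_of_filenames(self, fofn_path):
        # "Add file of filenames" also appends. One name per line is assumed;
        # blank lines are skipped.
        text = Path(fofn_path).read_text()
        self.names.extend(line.strip() for line in text.splitlines() if line.strip())

    def clear(self):
        # "Clear current list" is the only operation that empties the list.
        self.names.clear()

    def save(self, fofn_path):
        # "Save current list to..." writes the combined list as a new file
        # of filenames. The source files of filenames are never modified.
        Path(fofn_path).write_text("".join(name + "\n" for name in self.names))
```

To replace the list rather than extend it, clear() would be called before either of the
add methods, mirroring the use of the "Clear current list" button.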

4.4 Running Pregap4
When the Run button or Run command (File menu) is used, pregap4 starts processing the
files using the selected modules and their configurations. If the configuration is invalid an
error message will be produced. For example the following may be written to the error
window, and the configure modules panel will be selected with the problematic module
automatically highlighted.

Fri 10 Jul 10:04:25 1998 Run: Module sequence_vector_clip needs configuring

Assuming that the configuration is correct, the processing will start and output will be
sent to the output window as progress is made. The progress within each module is shown
by a series of full stops (.), one for each correctly processed sequence, and an exclamation mark
(!) for each failed sequence.

The text output window above shows the early processing stages of 20 sequences. When
finished pregap4 will produce a report containing information from each module and the
final list of passed and failed sequences. For example:
- Report Production
Passed files:
xb54a3.s1.exp (xb54a3.s1SCF.gz) : type EXP
xb54b12.r1L.exp (xb54b12.r1LSCF.gz) : type EXP
xb54b12.r1.exp (xb54b12.r1SCF.gz) : type EXP
xb54b12.s1.exp (xb54b12.s1SCF.gz) : type EXP
xb54c3.s1.exp (xb54c3.s1SCF.gz) : type EXP
Failed files:
xb54g5.s1.exp (xb54g5.s1SCF.gz) ’screen_vector_clip: sequence too short’

- Report from ’Augment Experiment Files’
xb54a3.s1.exp : added fields SF CF SC SP TN ST PR SI CH.
xb54b12.r1L.exp : added fields SF CF SC SP TN ST PR SI CH.
xb54b12.r1.exp : added fields SF CF SC SP TN ST PR SI CH.
xb54b12.s1.exp : added fields SF CF SC SP TN ST PR SI CH.
xb54g5.s1.exp : added fields SF CF SC SP TN ST PR SI CH.
xb54c3.s1.exp : added fields SF CF SC SP TN ST PR SI CH.

- Report from ’Tag Repeats’
xb54a3.s1.exp : no repeat found.
xb54b12.r1L.exp : no repeat found.
xb54b12.r1.exp : no repeat found.
xb54b12.s1.exp : no repeat found.
xb54c3.s1.exp : no repeat found.

*** Processing finished ***

The lists of passed and failed files are written to prefix.passed and prefix.failed, where
prefix is the output filename prefix specified in the "Files to Process" panel. The reports
are written to prefix.report. The passed and failed files contain the most recent filenames
associated with each sequence. So if a sequence fails early on it could be listed as something
like xb54a3.s1SCF.gz and if it fails later it will be listed like xb54a3.s1.exp. This is
because it is the final filename which is important for later processing, such as for assembly
into gap4.
A prefix.log file is also created containing a list of passed files, failed files, and the
filename history for each file (the intermediates will still exist). The format of the passed
section is "filename (file type) PASSED". The format of the failed section is "filename
(file type) ERROR: error message". The format of the file history lines is a series of
"filename (file type)" segments separated by "<-", with the original filename listed to the
right. Filenames containing Tcl meta-characters may be ‘escaped’ using curly braces or
backslashes. (The Tcl subst command may be used to generate the original name.) An
example of a log file follows. This was produced with the command line
‘pregap4 "Sample 671" WT5.exp zf89a2.s1.scf xb56e5.s1.scf’.
[passed files]
ha59a6.s1.exp (EXP) PASSED
WT5.exp (EXP) PASSED
xb56e5.s1.exp (EXP) PASSED
[failed files]
zf89a2.s1.exp (UNK) ERROR: screen_vector_clip: sequence too short

[passed file history]
ha59a6.s1.exp (EXP) <- ha59a6.s1.scf (SCF) <- {Sample 671} (ABI)
WT5.exp (EXP)
xb56e5.s1.exp (EXP) <- xb56e5.s1.scf
[failed file history]
zf89a2.s1.exp (UNK) <- zf89a2.s1.scf
Some modules may also keep their own separate records, such as an assembly log. Where
this is the case, it will be explained in the help specific to that module.
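For scripts that need to follow up a run, the log format described above is simple enough
to read with a short parser. The following Python sketch is an illustration only and is
not part of the package. It assumes that section headers appear in square brackets as in
the example, that "<-" never occurs inside a filename, and that only the curly brace form
of escaping needs unwrapping (backslash escapes are left untouched). The function names
parse_entry and parse_log are hypothetical.

```python
import re

# "name (TYPE)", where the "(TYPE)" part is optional because some history
# segments in the example log omit it. Brace-escaped names such as
# {Sample 671} are unwrapped.
ENTRY = re.compile(
    r"^(?:\{(?P<braced>.*)\}|(?P<plain>.*?))(?:\s+\((?P<type>\w+)\))?$"
)


def parse_entry(text):
    """Split one 'filename (file type)' segment into (name, type or None)."""
    m = ENTRY.match(text.strip())
    name = m.group("braced") if m.group("braced") is not None else m.group("plain")
    return name, m.group("type")


def parse_log(text):
    """Return a dict mapping each section name to its list of entries."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections[current] = []
        elif current is not None:
            if current.endswith("history"):
                # Newest filename first, original filename last.
                sections[current].append(
                    [parse_entry(part) for part in line.split("<-")]
                )
            else:
                # Passed and failed lines are kept as raw text.
                sections[current].append(line)
    return sections
```

With the example log above, the first "passed file history" entry would be returned as a
chain of three (name, type) pairs ending with the original ABI file, Sample 671.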

After running pregap4 it is time to either assemble the data (if this was not done using
pregap4) or to edit it. If the data has already been assembled with Pregap4 then you
will need to start up gap4 and use ‘Open Database’. Otherwise one of the gap4 assembly
functions should be used, with the prefix.passed file as input. For more information on
this see the Gap4 manual.

4.5 Configuring the Pregap4 User Interface
4.5.1 Fonts and Colours
The pregap4 Options menu contains options for modifying the fonts and colours used. These
options are common to many programs and so are documented elsewhere. See Section 10.8
[Font Selection], page 531. See Section 10.6 [Colour Selector], page 529.

4.5.2 Window Styles
Pregap4 supports two styles of windowing. The default method is a compact mode, with
the alternative being "separate" mode - similar to gap4 and spin.

The second style is "compact" mode.

In the compact picture above the most common top level windows are "pages" in a
tabbed notebook.
The benefit is that much less screen space is needed and the controls are quicker to reach,
but the text output window is no longer permanently visible.
To switch styles select the "Compact Window Style" or "Separate Window Style"
command from the Options menu.

4.6 Configuring Modules
The "Configure Modules" dialogue is available from the Modules menu or, when using the
compact window style, by pressing the Configure Modules tab.
This dialogue contains the main interface through which most of the user’s interaction
with pregap4 will be performed. The left side of the display contains a list of the currently
loaded modules. One module in this list will be highlighted. The right side of the display
shows the configuration panel for this highlighted module.

The module list shown on the left consists of a series of module names, each with its
status. The [ ] and [x] strings at the left of the name indicate whether this module is
enabled, and are termed the "enable status"; crossed boxes are enabled modules. The highlighting is
another indication of whether the module is enabled. The "General Configuration" module
is mandatory and cannot be disabled. The text to the right of the module name indicates
whether the module has been given all the parameters needed for it to run. This will
be one of "ok" (all configuration options have been filled in), "-" (no configuration options
exist for this module), "edit" (further configuration is required) or blank (this module is
disabled).
The "enable status" can be toggled by left clicking on the "[ ]" to the left of the module
name. The enable status can be written to the current Pregap4 configuration file using the
"Save Module List" or "Save All Parameters" commands in the Modules menu. Left clicking
anywhere on a module name in the module list will switch the pane on the right side of
the window to display any available parameters for this module. Not all modules will have
parameters to configure.
For modules that do have parameters, the top line of the configuration panel will contain
a button labelled "Save these parameters". This button will save all parameters for this
module to the configuration file. Note that this is not the same as the "Save All Parameters"
option in the main Modules menu, which saves all parameters in all modules.
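The relationship between the enable status and the text shown beside each module name can
be summarised as a small decision rule. The Python sketch below is purely illustrative:
the Module class, its fields and the ready_to_run function are hypothetical names, and it
assumes that every parameter a module has must be filled in before the module counts as
"ok".

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    name: str
    params: tuple = ()          # names of this module's parameters (hypothetical)
    enabled: bool = True
    values: dict = field(default_factory=dict)

    def status(self):
        """Return the text shown to the right of the module name."""
        if not self.enabled:
            return ""           # disabled modules show no status text
        if not self.params:
            return "-"          # no configuration options exist
        filled = all(self.values.get(p) not in (None, "") for p in self.params)
        return "ok" if filled else "edit"


def ready_to_run(modules):
    # Processing can begin once no module is listed as "edit".
    return all(m.status() != "edit" for m in modules)
```

A disabled module never blocks a run in this model, because its status is blank rather
than "edit".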

4.6.1 General Configuration
Description
This is a mandatory module. It is always the first module executed and will
not appear in the "Add/Remove Modules" list. Its purpose is to set general
parameters which affect several other modules. At present it contains just two
items.
Option: Get entry names from trace files
Many trace formats include storage for a sequence "sample name". This option
controls whether or not the sample name should be used instead of deriving the
name from the filename. If "No" is answered to this question then the sequence
sample name will be generated by removing the filename suffix; for example
xb55a2.s1.ztr will become xb55a2.s1.
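As a minimal sketch of the "No" case, the following Python function removes the final
filename suffix. The function name is hypothetical, and stripping any leading directory
is an assumption made here for illustration rather than documented pregap4 behaviour.

```python
import os


def entry_name_from_filename(filename):
    """Derive a sample name by removing the last filename suffix."""
    base = os.path.basename(filename)      # assumption: directory part is ignored
    stem, _suffix = os.path.splitext(base)
    return stem
```

Only the last suffix is removed, so xb55a2.s1.ztr gives xb55a2.s1 as in the example above.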

4.6.2 Estimate Base Accuracies
Description
This module analyses the traces at each base call to estimate a confidence value
for the called base. It does this by simply looking at the area underneath the
trace for the called base and dividing this by the highest area under the trace
for the three uncalled bases. This is a very simplistic statistic which should
ideally only be used for measuring the average reliability of the entire sequence
rather than any individual base. If another program (e.g. Phred, or ATQA) is
available then this should be used in preference. From the 2002 release the eba
values are normalised to the phred scale (this was achieved by comparing the
original eba values and phred values for 4.6 million base calls of Sanger Centre
data).
There are no adjustable parameters for this module.
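The raw statistic described above can be written down directly. The Python sketch below
computes only the unnormalised area ratio for one base call; the mapping of that ratio
onto the phred scale is not described here and is not attempted. The function name, the
dictionary input and the handling of a zero denominator are assumptions made for the
sake of the example.

```python
def area_ratio(areas, called):
    """Ratio of the called base's trace area to the largest uncalled area.

    areas  -- dict mapping each of the four bases to the area under its trace
              at this base call (hypothetical input format)
    called -- the base that was called
    """
    highest_other = max(area for base, area in areas.items() if base != called)
    if highest_other == 0:
        # Assumption: no competing signal at all is treated as unbounded.
        return float("inf")
    return areas[called] / highest_other
```

For example, areas of 300 for the called base and 20, 60 and 10 for the other three give
a ratio of 300 / 60 = 5.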

4.6.3 Phred
Phred is not included as part of the Staden Package. It is available from Phil Green.
http://www.genome.washington.edu/UWGC/analysistools/phred.htm
Description
Phred is an ABI base caller (Ewing, B. and Green, P. 1998. Base-Calling of
Automated Sequencer Traces Using Phred. II. Error Probabilities. Genome
Res. 8, 186-194). It will analyse the chromatogram data to produce new base
calls. For each base it assigns a confidence value indicating how likely this base
call is to be correct. These confidence values are significantly more reliable
than those produced by eba and they are compatible with the Phrap assembly
program and the gap4 consensus algorithm.
Phred can process either ABI or SCF files, but pregap4 will automatically
convert all input to SCF format first. This means that the phred pregap4
module will be able to process any supported trace format.

There are no adjustable parameters for this module.

4.6.4 ATQA
ATQA is not included as part of the Staden Package. It is available from its developers,
Daniel H. Wagner, Associates, at
http://www.wagner.com/
Description
The ATQA program estimates confidence values for each called base in a lane
file. A confidence value corresponds to the probability that the associated base
call is incorrect by the formula
score = -10*log10(probability of error).
(This is the same log scale used by Phred.) In fact, the ATQA program computes four confidence values for each called base. The first three values correspond to the probabilities of substitution, insertion, and deletion errors, respectively. The fourth value is a combined score representing the probability that
the called base is an error of any sort. Currently, only the combined confidence
value is used by Staden package software.
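The formula above, and its inverse, can be checked with a couple of lines of Python. The
function names are hypothetical and are used only for this illustration.

```python
import math


def error_probability_to_score(p_error):
    """score = -10 * log10(probability of error)."""
    return -10.0 * math.log10(p_error)


def score_to_error_probability(score):
    """Inverse of the formula above."""
    return 10.0 ** (-score / 10.0)
```

On this scale an error probability of 1 in 100 corresponds to a score of 20, and a score
of 30 corresponds to an error probability of 1 in 1000.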
Unlike Phred, the ATQA program does not produce base calls. Rather, it
assigns confidence values to each base call in a lane file based on features of the
trace data. The current version of the ATQA program is tuned to base calls
made by the ABI base caller and to trace data from the ABI 377 sequencer.
Although ATQA can read ABI files, it will not create SCF files in such circumstances.
However, pregap4 will always convert any non-SCF trace files into SCF
format before running ATQA, so an explicit conversion is not required.

4.6.5 Trace Format Conversion
Description
This converts files between the various supported trace formats. At present it
can read ABI, ALF, SCF, CTF and ZTR formats, and can write SCF, CTF
and ZTR. Of these formats, ZTR typically represents the smallest size and is
fast due to its own internal compression routines.
The Trace Format Conversion may also be used to apply some simple editing methods to the traces. These include down-scaling (to reduce file size),
background subtraction, and amplitude normalisation.
Option: Output format
This selects the format for the output trace files. If the output format is the
same as the input format then the input files will not be overwritten. Instead
new files will be produced with names based on the input names, generated by
replacing (for example) ".scf" with "..scf". The available output format choices
are ZTR, CTF and SCF.

Option: Downscale sample range
Option: Range
These select whether to reduce the scale used to store the amplitudes, and if so
to what range. ABI files typically range from 0 to 1600 (which is approximately
11-bit data). Shrinking this down to 0 to 255 (8-bit) will usually be visually
comparable as the trace displays in Gap4 and Trev are typically smaller than
255 pixels high, although if the Y scale is increased differences will still be
detectable. The purpose of this is to further reduce file size.
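To give a feel for what down-scaling involves, here is a minimal Python sketch that
linearly rescales a non-empty list of non-negative integer amplitudes so that the largest
becomes the top of the new range. This is an assumption about the general approach, not a
description of the algorithm actually used by the Trace Format Conversion module; the
function name and the choice of integer rounding are hypothetical.

```python
def downscale(samples, new_max=255):
    """Linearly rescale amplitudes so that the largest value maps to new_max."""
    old_max = max(samples)
    if old_max <= new_max:
        return list(samples)    # already within the requested range
    # Integer arithmetic, rounding down.
    return [s * new_max // old_max for s in samples]
```

With a range of 0 to 1600, an amplitude of 800 becomes 127 and 1600 becomes 255.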
Option: Subtract background
This attempts to eliminate the trace background by a simple technique of deducting the lowest of the four amplitudes from all of the four amplitudes. This
is an overly crude method which should only be used when the preprocessing
software included on the sequencing manufacturer’s instruments has not been
used.
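The background subtraction described above amounts to the following, shown as a short
Python sketch. The function name and the use of four parallel lists (one per base) are
assumptions made for illustration.

```python
def subtract_background(a, c, g, t):
    """At each sample point, deduct the lowest of the four amplitudes from all four."""
    out = ([], [], [], [])
    for sample in zip(a, c, g, t):
        low = min(sample)
        for channel, value in zip(out, sample):
            channel.append(value - low)
    return out
```

After this step at least one of the four traces is zero at every sample point.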
Option: Normalise amplitudes
This uses a sliding window to compute the average signal strengths. From this
it scales the data to try and provide, on average, more uniform peak heights
along the trace. Again this is a very simplistic method and so it is not advisable
unless there is a problem with the sequencing manufacturer’s own software.
Option: Delete temporary files
When pregap4 can determine that a trace file is neither the original input nor
the final output then it is considered to be a temporary file which may be
suitable for deletion. An example would be using Phred with ABI files and
then converting to ZTR. Phred produces SCF files and so we have ABI to SCF
to ZTR, in which the SCF files may be safely deleted.

4.6.6 Initialise Experiment Files
Description
This module creates an Experiment File from a trace file (of any format). It
uses the init_exp program to write ID, EN, LN, LT, AQ and SQ Experiment File
line types. This module is mandatory for many subsequent modules, such as
vector clipping/screening and assembly.
There are no adjustable parameters for this module.

4.6.7 Augment Experiment Files
Description
This module adds further data to the Experiment File, with the additional
information typically obtained from external sources. Such information could
be the data required by the vector clipping program, or template information
needed by gap4.
The parameters for this module may be configured by using the "Simple Text
Database" (see Section 4.10.1 [Simple text Database], page 371) or "Experiment
File Line Types" (see Section 4.10.2 [Experiment File Line Types], page 373)
dialogues. These both allow setting of the Experiment File records to be written
during the Augment stage.

4.6.8 Quality Clip
Description
This module determines where the sequence quality is too poor to use for reliable
assembly. It supersedes the Uncalled Base Clip module. It uses the
qclip program, which reads and writes Experiment Files. Its default quality
evaluation is based on the range of values produced by the Estimate Base Accuracies
module (quality value 70, averaged over 100 bases). For use with phred,
try lower values such as quality value 15 averaged over 50 bases. When quality
values are not available it will use the same method as the Uncalled Base Clip
module: analysing the base calls and counting the number of undetermined bases
within a given window of sequence. Both 5’ and 3’ ends may be quality clipped.
For the confidence mode of clipping the method starts from the point of highest
average quality, and then steps outwards in both directions until the average
quality is below a defined threshold.
For the sequence mode of clipping the method starts from a defined position
and steps outwards in both directions until the number of uncalled bases within
a given window length exceeds a predefined threshold. For more details see the
qclip documentation (see Section 12.19 [qclip], page 597).
Note that the Phrap assembly algorithm works best without quality clipping
and it can make use of the full length of readings (due to the use of the Phred
confidence values).
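The confidence mode of clipping can be pictured with the following Python sketch. It is
an illustration of the idea only and should not be read as the qclip implementation: the
function name, the use of whole windows, the tie-breaking rule and the returned half-open
range are all assumptions. The defaults follow the values suggested above for phred data.

```python
def clip_by_confidence(conf, window=50, threshold=15.0):
    """Return the (left, right) half-open good-quality range, or None.

    conf      -- list of per-base confidence values
    window    -- number of bases over which confidence is averaged
    threshold -- minimum acceptable average confidence
    """
    n = len(conf)
    if n < window:
        return None
    # Average confidence of every full window, indexed by window start.
    total = sum(conf[:window])
    averages = [total / window]
    for i in range(window, n):
        total += conf[i] - conf[i - window]
        averages.append(total / window)
    # Start from the window of highest average quality (first one on ties).
    best = max(range(len(averages)), key=averages.__getitem__)
    if averages[best] < threshold:
        return None
    # Step outwards in both directions while the average stays acceptable.
    first = best
    while first > 0 and averages[first - 1] >= threshold:
        first -= 1
    last = best
    while last < len(averages) - 1 and averages[last + 1] >= threshold:
        last += 1
    return first, last + window
```

Because whole windows are accepted or rejected, a few low confidence bases can remain at
each end of the returned range in this simplified version.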
Option: Clip mode
This may be one of "by sequence" or "by confidence". The "by sequence"
mode is equivalent to the Uncalled Base Clip module. The "by confidence" mode
uses Phred-scaled confidence values to determine the quality for clipping. This
does not work with eba confidence values.
Option: Minimum extent
The lowest allowable 5’ clip position.
Option: Maximum extent
The largest allowable 3’ clip position.
Option: Minimum length
If after quality clipping the good portion of a sequence is shorter than the
specified length, then this file will be rejected with the message "qclip: Sequence
too short".
Option: Window length
The window length over which the confidence will be averaged. This option is
only relevant for the "clip by confidence" mode.

Option: Average confidence
The minimum average confidence (over ‘window length’ bases) for the sequence
to be accepted as good quality. This option is only relevant for the "clip by
confidence" mode.
Option: Start offset
The base number to start the 5’ and 3’ good quality searches from. This option
is only relevant for the "clip by sequence" mode.
Option: 3’ window length
The window length in which to count uncalled bases. This option is only relevant for the "clip by sequence" mode.
Option: 3’ number of uncalled bases
The maximum allowed count of uncalled bases in a single window length. This
option is only relevant for the "clip by sequence" mode.
Option: 5’ window length
The window length in which to count uncalled bases. This option is only relevant for the "clip by sequence" mode.
Option: 5’ number of uncalled bases
The maximum allowed count of uncalled bases in a single window length. This
option is only relevant for the "clip by sequence" mode.
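The way the "clip by sequence" options interact can be sketched as follows. This Python
fragment is a simplified illustration, not the qclip algorithm: the function names, the
characters treated as uncalled ("-", "N" and "n"), and the choice to restrict each window
to bases between the start offset and the base being tested are all assumptions.

```python
UNCALLED = set("-Nn")   # assumption: symbols treated as uncalled bases


def count_uncalled(bases):
    return sum(1 for b in bases if b in UNCALLED)


def clip_by_sequence(seq, start, win5, max5, win3, max3):
    """Return the (left, right) half-open good range found from 'start'.

    win5/max5 -- 5' window length and maximum uncalled bases per window
    win3/max3 -- 3' window length and maximum uncalled bases per window
    """
    left = right = start
    # Step towards the 3' end, testing the last win3 bases each time.
    while right < len(seq):
        window = seq[max(start, right - win3 + 1):right + 1]
        if count_uncalled(window) > max3:
            break
        right += 1
    # Step towards the 5' end, testing the first win5 bases each time.
    while left > 0:
        window = seq[left - 1:min(start, left - 1 + win5)]
        if count_uncalled(window) > max5:
            break
        left -= 1
    return left, right
```

Allowing more uncalled bases per window lets the good region extend further into the
poorer data at each end.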

4.6.9 Sequencing Vector Clip
Description
This module uses the vector_clip program to identify and mark the sequencing
vector (the vector used to produce templates for sequencing, e.g. m13mp18 or
puc18). To achieve this task it needs to know information about the vector
including the cut site position and the position of the primer site relative to the
cut site. See Section 6.8 [Defining the Positions of Cloning and Primer Sites for
Vector Clip], page 408.
Option: Use Vector-primer file
Vector clip may be told to search through a series of vectors and primers held
within an external file. Alternatively we can request that it looks only at one
specific, known, vector. This question is to determine which of the two mutually
exclusive methods to use. In general it is still important for the Experiment
File to contain primer and template data. The Vector-primer module can be
used to add the primer and sequencing vector information to the Experiment
File but not the template name.
Option: Vector-primer filename
This is only used if the "Use Vector-primer file" question was answered with
"Yes". Each input sequence will be compared against each vector-primer pair
to find the best match. This provides a simple way of comparing against multiple
vectors or comparing against both forward and reverse primers of a single
vector. For further details on creating this vector-primer file, see Section 6.6
[Vector Primer file format], page 407.
Option: Select vector-primer subset
This is used in conjunction with the vector-primer filename to indicate which of
the vector-primer pairs listed in this file should be used. Initially this is set
to all vector-primer pairs, but efficiency will be greatly increased if just the
required subset is selected. (Internally pregap4 will then temporarily produce
a new vector-primer filename each time vector_clip requires one, containing
just the selected items.) To select more than one vector-primer pair use the
standard listbox mouse bindings: single left click to pick an item; click and drag
to select a range; and control left click to toggle a single item. The selected list
will be saved to the pregap4 configuration file whenever all the parameters for
this module are saved.
Option: Max primer to cut-site length
This parameter is only used when a vector-primer file is defined. The sequence
stored in the vector-primer file may be considerably longer than we expect to
see at the start of the sequences being analysed. By defining the maximum
length of sequence we expect to see, vector_clip may be more sensitive and
slightly faster.
Option: Vector file name
This, and the following two options, are only used if the "Use Vector-primer
file" question was answered with "No". The vector file name should be the
name of a file containing just the vector bases or white space, in a plain text
format.
Option: Cut site
The cut site specified as a base count from the start of the vector file.
Option: Primer site
The primer site specified as a base offset from the cut site. For example, for m13mp18
forward primers the value is 41. If, instead of the usual single value, two values
are specified separated by a slash, then this gives the values for the universal
forward and reverse primers (for example "41/-24"). Only use this format if
the PR (primer type) experiment file line type is known AND will be specified
in the experiment file. If the PR record is not specified in the experiment file,
the primer site position will be set to zero, and the vector clipping is unlikely
to work correctly. (PR values do not have to be known if they can be derived
using naming schemes such as those used by the Sanger Centre). If the primer
site indicates a custom primer sequence then the primer site is taken to be 0.
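The rules for the two-value form can be summarised in a short Python sketch. This is an
interpretation for illustration only. The function name is hypothetical, and the PR codes
used (1 for the universal forward primer, 2 for the universal reverse primer, any other
value treated as custom or unknown) are an assumption that should be checked against the
Experiment File format documentation.

```python
def primer_site_for(primer_site, pr=None):
    """Choose the primer site offset for one reading.

    primer_site -- option text, e.g. "41" or "41/-24"
    pr          -- PR value from the Experiment File, or None if absent
                   (assumed: 1 = universal forward, 2 = universal reverse)
    """
    if "/" not in primer_site:
        return int(primer_site)     # single value: used as given
    forward, reverse = (int(v) for v in primer_site.split("/"))
    if pr == 1:
        return forward
    if pr == 2:
        return reverse
    return 0                        # PR missing, or a custom primer
```

This makes the warning above concrete: with "41/-24" and no PR record the offset falls
back to zero, and vector clipping is then unlikely to work correctly.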

Option: Percentage minimum 5’ match
Option: Percentage minimum 3’ match
Both ends of the sequence are checked using a dynamic programming algorithm
to find the optimal alignment. An end is marked as vector if the percentage
match is at least as high as this supplied parameter.
Option: Default 5’ position
This specifies the value to use for marking the 5’ sequencing vector if none is
detected. Specifying this as -1 will cause the absolute value given for the primer
site (which is specified as relative to the cut site) to be used.

4.6.10 Cross match
Cross match is not included as part of the Staden Package. It is available from Phil Green.
http://www.genome.washington.edu/UWGC/analysistools/swat.htm
Description
This uses the cross_match program to search for sequencing vector. (Future
versions may also check for other cloning vectors.) This allows for searching
of multiple vector files. However as cross match does not make use of primer
and cut site information the vector detection is inherently less sensitive than
vector_clip (see Chapter 6 [Screening against Vector Sequences], page 401).
Option: FASTA vector file name
This specifies a FASTA format file of one or more sequencing vector sequences.
Option: Minimum match length
Minimum length of matching word for SWAT comparison.
Option: Minimum score
Minimum SWAT score.

4.6.11 Cloning Vector Clip
Description
This module searches for non "sequencing" vectors used in the shotgunning
process, e.g. Cosmid or YAC vectors. Any fragment in any orientation of this vector
could be present so there is no need for the cut sites to be known. The vector_clip
program is used for this task (see Chapter 6 [Screening against Vector
Sequences], page 401).
Option: Vector file name
The filename containing the vector sequence. At present this should be a file
containing a single plain text sequence containing just the bases or white space.

Option: Max probability
For each match its probability of occurring by chance is calculated. Any match
with a probability lower than ‘Max probability’ is accepted.

4.6.12 Screen for Unclipped Vector
Description
This module may be used to identify undetected segments of sequencing vector
or to detect recombinations. After searching and marking sequencing vector,
any further strong matches to the sequencing vector indicate a possible problem.
This module uses the vector_clip program (see Chapter 6 [Screening against
Vector Sequences], page 401).
Note that this module requires the Sequencing Vector Clip module to be used
before screening, otherwise all sequences containing unclipped vector will be
falsely rejected.
Option: Minimum length of match
If a match of at least this length is found then the sequence currently being
processed will be rejected.

4.6.13 Screen Sequences
Description
This module can perform very fast matches between the sequences to process
and one or more screen sequences. Any sequence containing a significant match
is rejected. An example of use for this module is to reject sequences prior to
assembly that appear to be contaminated with E. coli. This uses the screen_seq
program (see Section 12.20 [Screen seq], page 600).
Option: Screen single sequence
This is a yes/no question used to determine whether the screen sequence filename
is the filename of a single sequence or of a file containing a series of
sequence filenames. To compare against just one file select "Yes".
Option: Screen sequence file (of filenames)
This is either the filename of a single sequence or the filename of a file of
filenames, depending on the answer to the previous question. The sequence
files must be in plain text format containing just the bases or white space.
Option: Maximum screen sequence length
The maximum length of any individual screen sequence.
Option: Minimum match length
Any fragment containing an exact match longer than this length will be rejected.
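The rejection rule for the Minimum match length option can be illustrated with a short Python sketch. This is not the screen_seq implementation: the function name has_exact_match is hypothetical, only the forward strand is compared, and the word-set lookup is simply one convenient way to find a shared exact match.

```python
def has_exact_match(reading: str, screen: str, min_match: int) -> bool:
    """Return True if reading and screen share an exact match that is
    longer than min_match bases (forward strand only, an assumption)."""
    if min_match < 0:
        raise ValueError("min_match must not be negative")
    k = min_match + 1  # "longer than" the minimum means at least one base more
    if k > len(reading) or k > len(screen):
        return False
    reading = reading.upper()
    screen = screen.upper()
    # Collect every word of length k in the screen sequence.
    words = {screen[i:i + k] for i in range(len(screen) - k + 1)}
    # The reading is rejected if any of its words of length k is in that set.
    return any(reading[i:i + k] in words for i in range(len(reading) - k + 1))
```

Here the two sequences in has_exact_match("TTTTACGTACGTAAGG", "GGGACGTACGTCCC", 7) share the 8 base word ACGTACGT, so the reading would be rejected with a minimum match length of 7 but accepted with a minimum match length of 8.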

4.6.14 Blast Screen
Description
This module uses the blastall program to compare all the input sequences
against a prebuilt blast database of screen sequences. It is not possible to
compare against a subset of the database - to do this build a new blast database
using formatdb. This module is an alternative to the Screen Sequences module
which uses the screen_seq program.
Blast may be used for either completely rejecting sequences or for simply tagging
the matching segments, or for both. If you wish to tag with several tag types,
then several instances of the Blast screen module need to be used.
Blast is not included as part of the Staden Package. It is available from the
NCBI.
Option: BLAST database
This is the filename of the BLAST database to screen against, with the ‘.nhr’,
‘.nin’ and ‘.nsq’ suffixes removed.
Option: E value
This specifies the ‘E value’ used by blast when determining which hits should
be considered as real.
Option: Match fraction
This is the total fraction of the sequence which must have a blast match
somewhere in the BLAST database searched in order to reject this sequence.
Segments of the input sequence that match multiple components in the BLAST
database are only counted once when computing this fraction, but the locations
of the matches in the BLAST database do not need to be consecutive.
If you wish to accept everything, but still want to tag the matches, then set the
match fraction to greater than 1.0.
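The way overlapping matches are counted once can be sketched in Python as follows. This is only an illustration of the rule described above, not the pregap4 code: the function names are hypothetical, and hits are assumed to be given as 1-based inclusive (start, end) positions on the input sequence.

```python
def matched_fraction(hits, seq_length):
    """Fraction of the input sequence covered by at least one hit.

    hits is a list of (start, end) pairs, 1-based and inclusive, on the
    input sequence. Overlapping hits are merged so that each base is
    counted only once.
    """
    if seq_length <= 0:
        raise ValueError("seq_length must be positive")
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(hits):
        if cur_end is None or start > cur_end:
            # No overlap with the current segment: close it and start anew.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlap: extend the current segment if this hit reaches further.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / seq_length


def should_reject(hits, seq_length, match_fraction):
    """Reject when the covered fraction reaches the configured threshold."""
    return matched_fraction(hits, seq_length) >= match_fraction
```

With hits at (1, 50), (30, 80) and (200, 220) on a 400 base reading, 101 bases are covered, giving a fraction of 0.2525. Because the covered fraction can never exceed 1.0, a threshold above 1.0 never rejects anything, which is why that setting accepts every sequence.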
Option: Tag type
The default for this is which indicates no tagging is required. Otherwise
this should be a 4 letter tag type (such as REPT) known to gap4.

4.6.15 Interactive Clipping
Description
This module invokes the trev program to view the raw chromatogram files.
The user can then adjust the quality and vector clip positions if desired. The
trev window will contain Next and Previous buttons to skip from trace to trace.
The Reject button allows a trace to be rejected, in which case it is added to
the failure file with the message "interactive clip: manually rejected".
There are no adjustable parameters for this module.

4.6.16 Extract Sequence
Description
This module uses the extract_seq program to extract the sequence information
from binary trace files, Experiment files, or from the old Staden format plain
files. The output contains the sequences split onto lines of at most 60 characters
each, in plain or fasta format. The input files are passed unchanged onto
subsequent modules.
Option: Output only the good sequence
When reading an experiment file or trace file containing clip marks, output only
the good sequence which is contained within the boundaries marked by the QL,
QR, SL, SR, CL, CR and CS line types.
Option: Consider cosmid as good sequence
When the Output only the good sequence option is specified this controls
whether the cosmid sequence should be considered good.
Option: Output in fasta format
Specifies that the output should be in fasta format rather than plain text.
Option: Output in one file only
If this option is selected then the output from every sequence is sent to one
file. This is best used with the Output in fasta format option selected, and
is useful for feeding into BLAST searches, for example. The file to write to is
specified in the File name field.
If this option is unselected then the output is sent to separate files, one per
sequence. The output files have the same name as the input files, except with
an extra suffix specified in the File name suffix field.
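The output layout described for this module (lines of at most 60 characters, in plain or fasta format) can be sketched in Python. This is not the extract_seq source: the function name format_sequence is hypothetical, and the fasta header is assumed to consist only of ">" followed by the sequence name.

```python
def format_sequence(name, bases, fasta=True, width=60):
    """Split bases onto lines of at most width characters.

    When fasta is True a ">name" header line is written first (the exact
    header written by extract_seq is an assumption here); otherwise the
    output is plain text containing just the bases.
    """
    lines = [bases[i:i + width] for i in range(0, len(bases), width)]
    if fasta:
        lines.insert(0, ">" + name)
    return "\n".join(lines) + "\n"
```

A 160 base sequence is therefore written as two lines of 60 bases followed by one line of 40 bases, preceded by the header line when fasta output is selected.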

4.6.17 RepeatMasker
RepeatMasker is not included as part of the Staden Package. It is available from Arian
Smit.
http://ftp.genome.washington.edu/RM/RepeatMasker.html
Description
This module uses the RepeatMasker program. This is a program which searches
for a comprehensive set of repeat sequences. Any matches which are found will
be tagged with a comment indicating the type of repeat. These tags will then
be visible from within gap4. Full documentation is available from the author
of RepeatMasker, or from typing RepeatMasker -h.
Option: Repeat library
This specifies the directory containing the library of repeat sequences. Only
one library directory may be specified. The library "" will let RepeatMasker use its own default library.

Option: RepeatMasker cutoff
This specifies the cutoff score for RepeatMasker. The documentation with
RepeatMasker states that a cutoff of 250 will guarantee no false positives.
Option: Gap4 tag type
When a repeat is found a tag will be added to the Experiment File. This
specifies the tag type to use. It should be one of the tag types available to
Gap4, but other tag types may be used if desired (they will be coloured like
COMMent tags in gap4).
Option: Types of repeat to screen against
The default setting of RepeatMasker is to search for primate repeats, however
it may be told to search for other repeat families or to restrict its search to only
ALU primate repeats. The full list of options here is Alu only, Rodent only,
Simple only, Mammalian excluding primate/rodent, and no low complexity.
These are as defined in the RepeatMasker documentation. It is not known
what effect enabling mutually exclusive options will have.

4.6.18 Tag Repeats
Description
This module uses the repe program to identify and mark known repetitive
elements within the sequences. An example usage is to tag all ALU fragments.
This information may be used by the gap4 assembly algorithm to improve
the assembly by initially ignoring matches between two ALU fragments which
may otherwise produce incorrect assemblies. If available, we recommend using
RepeatMasker instead of this module.
Option: Repeat file name
This is the filename of a file of filenames, each of which contains a single repeat
to search for. The format of these individual files is plain text consisting of just
the nucleotides and white space.
Option: Repeat score
This is the minimum score for classifying a matched segment as a repeat.
Option: Tag type
This is the gap4 tag type to use for identifying this repeat segment. It is not
possible to choose different tag types for different repeats, although the tag
comments contain the match score and match filename.

4.6.19 Mutation Detection
Description
Superseded by the newer modules: (see Section 4.6.21 [Trace Difference],
page 357) and (see Section 4.6.22 [Mutation Scanner], page 359).

This module compares each sequence chromatogram against a "wild type" or
reference chromatogram to detect point mutations. The mutations are detected
by aligning and subtracting each trace from the wild type trace to produce a
"difference trace". The difference trace is then analysed to identify point
mutations which are written back to the Experiment File as MUTN tags. This uses
the trace_diff program (Bonfield, J.K., Rada, C. and Staden, R. Automated
detection of point mutations using fluorescent sequence trace subtraction.
Nucleic Acids Res. 26, 3404-3409 (1998)).
The reference trace should be as similar as possible to the ones
being compared against it. It should be prepared by sequencing the wild type
from the same primer, and using the same chemistry as the readings being
screened. One good way to produce a reference trace is to run the wild type
sequence on the gel along with the other samples. It is also possible to get
gap4 to produce a consensus trace. This requires using pregap4 twice. Firstly
process the sequences through pregap4 with all the appropriate options except
with the mutation detection module disabled. Assemble these sequences into
gap4. Within gap4, for each contig start up the Contig Editor and select Save
Consensus Trace from the command menu. This will produce a trace which is
the average of the traces in that contig. Then delete the gap4 database and
reprocess the sequences using Pregap4, this time using mutation detection to
compare against the consensus trace.
Option: Wild type file (+ve strand)
Option: Wild type file (-ve strand)
These are the filenames of the chromatogram for the wild type sequence on
each strand. These may be in any allowable trace format (SCF, ZTR, ABI, CTF
or ALF). In the augment stage, these are represented in the WT line type using
plus filename|minus filename notation.
Option: Start position
Option: End position
These define the range within each sequence in which to identify mutations. The
algorithm works better on good quality data so including very bad sequence may
give errors.
Option: Score
This is a threshold used to determine when a peak in the difference trace is
considered to be a mutation. The higher the value the more stringent the test.
Option: Alignment band width
The trace alignment is performed by firstly doing a sequence alignment on
the text sequences contained in the two files. This parameter specifies the band
width for this alignment. Smaller values give quicker alignments, but only work
if the alignment is sufficiently close to the main diagonal.

Option: Other arguments
This allows for any other arguments to be passed to the trace_diff program.
See the trace_diff documentation for more details.
The module above is superseded by the newer modules: (see Section 4.6.21 [Trace
Difference], page 357) and (see Section 4.6.22 [Mutation Scanner], page 359).

4.6.20 Reference Traces and Reference Sequences
Description
This module specifies the reference traces and reference sequences used by the
two mutation detection modules (see Section 4.6.21 [Trace Difference], page 357
and see Section 4.6.22 [Mutation Scanner], page 359). The left and right clip
points for each trace can also be specified.
A reference trace should be as similar as possible to the ones being compared
against. It should be prepared by sequencing the wild type from the same
primer and using the same chemistry as the readings being screened. One good
way to produce a reference trace is to run the wild type sequence on the gel
along with the other samples.
If the input files have been sequenced from both strands, reference traces from
each strand may be specified here.
NOTE: In order for pregap4 to choose the appropriate wild type trace it needs
to know the strand for each input sequence. This is specified by the PR record in
the experiment file which is typically generated using a naming convention (see
Section 4.8 [Pregap4 Naming Schemes], page 366). If pregap4 cannot determine
the strand, or if only one reference trace is specified, then each input sequence
will be compared against the +ve strand reference trace.
The reference data supplied in this module, when entered with gap4 shotgun
assembly, will add REFS and REFT notes (see Section 2.15 [Notes], page 281) to
the gap4 database. A reference sequence is used to number bases in the Contig
Editor (see Section 2.6.12 [Reference sequences and traces], page 191) and in
reporting the positions of mutations (see Section 2.6.7.9 [Report Mutations],
page 178).
Option: Reference Trace (+ve strand)
Option: Reference Trace (-ve strand)
These are the filenames of the chromatogram for the reference trace on each
strand. These may be in any allowable trace format (ZTR, SCF, ABI, CTF or
ALF). The filenames are entered into the experiment file as WT records by the
"Augment Database" phase of pregap4, so this module must also be enabled.
Option: Clip left
Option: Clip right
These values determine which region of the reference trace (in bases) is used
for mutation detection. This can be used to exclude poor quality regions, or
restrict the range over which mutation detection occurs. Restricting the range
will also speed up the algorithms. If you specify -1 for any value, mutscan
will use the clip point QL/QR records within the reference trace experiment
file (provided they exist). If they don’t exist, then the entire reference trace is
used, i.e. no clipping occurs. If the range specified is too small, the mutation
detection algorithms may report an error, since there must be a useful overlap
between the sequences in order to process them.
Option: Reference Sequence
This specifies the reference sequence, which is typically an annotated EMBL
entry. This field is optional.
Option: Start base number
If a reference sequence was specified this indicates which base number it will
start counting from within Gap4’s contig editor. It also defines the positions
of mutations, as output by the Report mutations function of gap4 (see
Section 2.6.7.9 [Report Mutations], page 178).
Option: Circular
Option: Sequence length
If the reference sequence is defined to be circular then the length needs to be
known too. When the base number reaches the sequence length the next base
in the sequence will be renumbered to base 1. This may be useful if the circular reference sequence needs to be chopped to form a linear sequence at a
different position than the standard numbering. (For example this is typical
when sequencing the mitochondrial variable loop, which by standard conventions contains base number 1.)
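The renumbering rule for circular reference sequences can be sketched in Python. This is an illustration only, not gap4 code: the function name base_number and its arguments are hypothetical, and the values used in the example are purely illustrative.

```python
def base_number(offset, start_base, circular=False, seq_length=None):
    """Base number shown for the base at the given 0-based offset.

    offset is the position within the linear reference sequence, counting
    from 0, and start_base is the Start base number option. For a circular
    sequence, the base after number seq_length is renumbered to 1.
    """
    number = start_base + offset
    if circular:
        if seq_length is None or seq_length < 1:
            raise ValueError("a circular sequence needs a positive length")
        number = (number - 1) % seq_length + 1
    return number
```

For instance, with a start base number of 16000 and a circular sequence length of 16569, offset 569 is numbered 16569 and offset 570 is numbered 1. With Circular unset, offset 570 would simply be numbered 16570.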
Note that it is possible (though no longer recommended) to use gap4 to produce a consensus trace. This requires using pregap4 twice. Firstly process the sequences through pregap4
with all the appropriate options except with the mutation detection modules disabled. Assemble these sequences into gap4. Within gap4, for each contig start up the Contig Editor
and select Save Consensus Trace from the command menu (available only in expert mode).
This will produce a trace which is the average of the traces in that contig. Then delete
the gap4 database and reprocess the sequences using Pregap4, this time using mutation
detection to compare against the consensus trace. Best results are usually obtained by first
deleting pads in the consensus sequence. You should inspect the resulting consensus trace
carefully to ensure there are no discontinuities introduced as a result of the pad deletions.

4.6.21 Trace Difference
Description
This module compares each sequence chromatogram against a "wild type" or
reference chromatogram to detect point mutations. The mutations are detected
by aligning and subtracting each trace from the wild type trace to produce a
"difference trace". The difference trace is then analysed to identify point mutations which are written back to the Experiment File as MUTA tags. The basic
idea is explained in the paper Bonfield, J.K., Rada, C. and Staden, R. Automated detection of point mutations using fluorescent sequence trace subtraction.
Nucleic Acids Res. 26, 3404-3409 (1998).

This implementation is the second version of the algorithm. The previous version used basecalls to do trace alignment. This led to problems when bases were
called in error (often the case around mutations). The new algorithm ignores
the basecalls completely and aligns the trace signals themselves, avoiding such
problems. This is much more computationally intensive, but it has proved to
be fast enough for interactive use.
If the input files have been sequenced from both strands then two wild type
sequences may be given. In order for pregap4 to choose the appropriate wild
type trace it needs to know the strand for each input sequence, which is
typically generated using the naming convention. A simple naming scheme is
provided with pregap4 (in the lib/pregap4/naming_schemes directory) called
"mutation_detection.p4t". This can be loaded from the pregap4 file menu. It
assumes that trace names have an ’f’ or ’r’ suffix, denoting the forward and
reverse strands respectively. If you need something more complex, then you’ll
have to create and load your own naming scheme. If pregap4 cannot determine
the strand, or if only one wild type is specified, then each input sequence will
be compared against the +ve strand wild type.
The reference or wild type traces for tracediff are specified in the Reference
Traces module (see Section 4.6.20 [Reference Traces module], page 356).
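The strand selection and fallback behaviour described above can be sketched in Python. This is not the naming scheme file itself: the function names are hypothetical, and it is assumed that the 'f' or 'r' suffix is the last character of the trace name once any file extension has been removed.

```python
import os


def strand_from_name(trace_name):
    """Return "+" or "-" from an 'f' or 'r' name suffix, or None if unknown.

    Assumes the suffix is the last character before the file extension.
    """
    stem = os.path.splitext(os.path.basename(trace_name))[0]
    if stem.endswith("f"):
        return "+"
    if stem.endswith("r"):
        return "-"
    return None


def choose_wild_type(trace_name, plus_wild_type, minus_wild_type=None):
    """Pick the wild type trace to compare a reading against.

    The +ve strand wild type is used when the strand cannot be determined
    or when only one wild type has been specified.
    """
    if strand_from_name(trace_name) == "-" and minus_wild_type is not None:
        return minus_wild_type
    return plus_wild_type
```

Under these assumptions a reading named a01r.ztr is compared against the -ve strand wild type when one has been given, while a01f.ztr and any name without a recognised suffix are compared against the +ve strand wild type.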
Option: Sensitivity
This threshold is used to determine when an above/below baseline double peak
in the difference trace is considered to be a mutation. It is specified in standard
deviations from the mean over the analysis window. The higher the value, the
more stringent the test. This value is reduced dynamically by the algorithm in
the presence of mutations since small mutations near larger ones can often be
missed with a uniform sensitivity setting. It’s likely that some experimentation
with this parameter will be required for optimal mutation detection in your
data.
Option: Noise threshold
This threshold is used to filter out low level noise during the analysis phase. It
is specified as a percentage of the maximum peak-to-peak trace difference value.
A high threshold will lead to fewer false positives but you run the additional
risk of missing low level mutations.
Option: Analysis window length
Analysis of the trace difference is done over a local region to counter the effects
of non-stationarity in the trace signal. The analysis region is defined by a short
window whose length is specified in bases. The window is asymmetric in that
it’s located to the left of the base it’s positioned on. This avoids measurement
problems when mutations are encountered. The window size is a tradeoff. If
it’s too big, low level mutations may be missed. If it’s too small, there may be
insufficient data to give unbiased measurements leading to many false positives.
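The interaction between the Sensitivity and Analysis window length options can be illustrated with a deliberately simplified Python sketch. It is not the tracediff algorithm: the function name is hypothetical, the difference trace is reduced to one value per base, and the dynamic reduction of the sensitivity, the noise threshold and the double peak checks are all omitted.

```python
from statistics import mean, pstdev


def exceeds_sensitivity(diff, index, window, sensitivity):
    """Test one position of a (simplified) difference trace.

    diff holds one difference value per base. The analysis window covers
    the window bases immediately to the left of index, so the position
    being tested does not influence its own mean and standard deviation.
    Returns True when the value lies more than sensitivity standard
    deviations from the window mean.
    """
    left = diff[max(0, index - window):index]
    if len(left) < 2:
        return False  # too little data to measure the local statistics
    mu = mean(left)
    sigma = pstdev(left)
    return abs(diff[index] - mu) > sensitivity * sigma
```

The sketch also shows the tradeoff mentioned above: a very short window gives unreliable estimates of the mean and standard deviation, while a long window averages over regions where the signal level has changed.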

Option: Maximum peak alignment deviation
The centres of each individual half-peak of a double peak above and below
the baseline must align reasonably well for them to be considered to be real
mutations. The amount of half-peak alignment deviation allowable is specified
in bases by this parameter, usually as a fraction of one base.
Option: Maximum peak width
During analysis, the width of each peak is measured to avoid problems caused
by gel artifacts. These often appear as broad peaks that overlay many bases.
The maximum peak width is specified in bases. A lower value will lead to fewer
false positives, but you run the additional risk of missing smeared mutations
towards the end of a trace.
Option: Complement bases on reverse strand tags
After mutation detection and after readings have been assembled into a GAP4
database, GAP4 displays both forward and reverse readings in a single direction
in the contig editor. This makes it much easier to compare sequences and traces
in both directions simultaneously. When the corresponding traces are displayed,
any reverse strand traces are complemented automatically such that the bases
are interchanged. In this case, the original mutation tag generated by tracediff
will then be of the wrong sense, so if checked, this option complements the tag
base labels to match the complemented trace displayed by GAP4.
Option: Write difference traces out to disk
After trace difference analysis, the generated traces are normally discarded and
not written to disk. Checking this option lets you save the trace difference files
to the same directory as the original traces. The .ZTR trace format is used
for this purpose. The original filename is retained and a "_diff.ztr" suffix is
appended.

4.6.22 Mutation Scanner
Description
This module compares each input sequence chromatogram against a reference
chromatogram (or trace) to detect mutations. The reference traces are specified
in the Reference Traces module (see Section 4.6.20 [Reference Traces module],
page 356). Using this
method it is possible to detect both base-change mutations and heterozygous
mutations.
It works by aligning the reference trace with the input trace and then examining
the peak pairs for each individual base separately. It does not use basecalls
as these are prone to error and their use generates too many false positives.
After normalisation, the amplitude ratios of peak pairs which are abnormal are
analysed more closely. For heterozygotes, a drop in peak height with respect to
the reference of about 50% is expected. Each of the final candidate mutations is
validated against a difference trace to ensure that the trace contains a double
peak at that location, thus confirming the mutation to be real. After
chromatogram analysis has been completed, mutation tags are written back to the
Experiment File as HETE and MUTA tags.
Option: Adaptive Noise Floor
Traces are very noisy signals which are difficult to process. To find valid peaks in a trace
an adaptive noise threshold based on envelope height is used to eliminate all low
level noise from consideration. The effect of this parameter can be seen in the
trace below. By default this parameter is set to 25% of envelope peak height.
If set lower, too much noise is picked up; if set higher, low level mutations may
be missed.

Option: Upper and Lower Peak Drop Thresholds
For heterozygote mutations, the peak height of the mutant drops by 50% with
respect to the normalised reference trace as shown in the trace below. For
accurate detection, we use this information to validate potential mutations.
Due to overzealous preprocessing done by sequencing machine software, the
peak height drops are often not 50%, but typically hover between 20% and
70% of reference peak height. Any potential heterozygote whose peak height
drop with respect to the normalised reference trace lies within this range
is considered to be a real mutation.

Option: Peak Alignment Search Window Size
In an ideal world, heterozygote peaks in a trace would be perfectly aligned
on top of each other. In practice however, they can often be skewed due to
gel chemistry problems or inaccurate mobility correction as shown in the trace
below. When mutscan looks for peak pairs, it allows for this skew by looking
either side of the current position for nearby peaks. This parameter is the
distance, in bases, that mutscan looks around each candidate position.

Option: Heterozygote SNR Threshold
For a normal trace containing normal bases, the signal-to-noise ratio (SNR) is
the ratio of the highest base peak to the second highest trace level. Mutscan
computes this value in decibels (dB) as 20*log10(S/N). For normal bases, this
is usually in the region of 20-30dB or higher. However, for heterozygotes, the
SNR as defined by this measure degrades significantly to around 2-5dB. This
is the mechanism mutscan uses to accurately determine the mutation tag type.
If the candidate mutation’s SNR is equal to or below this threshold, mutscan
designates it to be heterozygous, otherwise it’s considered to be a normal
base-change mutation.
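The SNR calculation and the resulting choice of tag type can be sketched in Python. This is an illustration of the formula above rather than the mutscan source: the function names are hypothetical, the threshold used in the example is not a documented default, and the mapping of heterozygotes to HETE tags and base-change mutations to MUTA tags is inferred from the tag names.

```python
import math


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels, computed as 20*log10(S/N).

    signal is the highest base peak and noise is the second highest
    trace level at the same position.
    """
    if signal <= 0 or noise <= 0:
        raise ValueError("signal and noise must be positive")
    return 20.0 * math.log10(signal / noise)


def tag_type(signal, noise, het_threshold_db):
    """Choose the tag type for a candidate mutation.

    An SNR equal to or below the threshold is treated as heterozygous
    (assumed here to mean a HETE tag), anything above as a normal
    base-change mutation (assumed to mean a MUTA tag).
    """
    if snr_db(signal, noise) <= het_threshold_db:
        return "HETE"
    return "MUTA"
```

A peak of height 1500 over a second trace level of 1000 gives about 3.5dB, which falls in the heterozygote range quoted above, whereas 1000 over 50 gives about 26dB, which is typical of a normal base.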
Option: Trace Alignment Failure Threshold
Mutscan works by aligning a mutant trace against a reference trace and comparing the peaks. However, if the traces are too different, the alignment may
fail and as a consequence, large numbers of false positive mutation tags are
generated. Typically, within each trace there are only one or two mutations, so
if we find 15 mutations, then we can confidently predict that things have gone
badly wrong! This parameter sets a threshold, beyond which an alignment failure error message is printed, rather than outputting large numbers of invalid
mutation tags.
Option: Complement Bases on Reverse Strand Tags
After mutation detection and after readings have been assembled into a GAP4
database, GAP4 displays both forward and reverse readings in a single direction
in the contig editor. This makes it much easier to compare sequences and traces
in both directions simultaneously. When the corresponding traces are displayed,
any reverse strand traces are complemented automatically such that the bases
are interchanged. In this case, the original mutation tag generated by mutscan
will then be of the wrong sense. If checked, this option complements the tag
base labels to match the complemented trace displayed by GAP4.

4.6.23 Gap4 Shotgun Assembly
Description
This module assembles the processed sequences into gap4 using gap4’s own
assembly engine. Note that this is incompatible with use of "Enter assembly
into Gap4", which should only be used for external (to gap4) assembly engines.

Option: Gap4 database name
Option: Gap4 database version
The name and version of the database to assemble into.
Option: Create new database
This is a toggle to define whether the specified gap4 database should be created or appended to. Be warned that at present creating a new database will
overwrite an existing one of the same name, in the same directory, without any
warnings.
Option: Minimum exact match
Option: Maximum number of pads
Option: Maximum percentage mismatch
These control the main assembly parameters within gap4. For more details see
Section 2.7.1 [Normal shotgun assembly], page 205.

4.6.24 Cap2 Assembly
Cap2 is not included as part of the Staden Package. It is available from Xiaoqiu Huang.
Description
This module uses the cap2 program to perform shotgun assembly. Output
will be placed in the fofn.assembly directory, where fofn is the filename prefix
listed in the "Files to Process" panel. The output is in a format suitable for
directed assembly within gap4. This can also be performed by using the "Enter
Assembly into Gap4" module.
There are no adjustable parameters for this module.

4.6.25 Cap3 Assembly
Cap3 is not included as part of the Staden Package. It is available from Xiaoqiu Huang.
Description
This module uses the cap3 program to perform shotgun assembly. Output
will be placed in the fofn.assembly directory, where fofn is the filename prefix
listed in the "Files to Process" panel. The output is in a format suitable for
directed assembly within gap4. This can also be performed by using the "Enter
Assembly into Gap4" module. Cap3 differs from Cap2 in that it can make use
of confidence values (in the range supplied from phred) and constraints.
Option: Auto-generate constraints
When enabled, this uses the reading direction (forward / reverse primers), the
template name and the insert size, to produce a file containing data to constrain
how the readings may be assembled.

4.6.26 FakII Assembly
FakII is not included as part of the Staden Package. It is available from Susan Miller.
Description
This module uses the FakII suite of programs to perform shotgun assembly.
Output will be placed in the fofn.assembly directory, where fofn is the filename
prefix listed in the "Files to Process" panel. The output is in a format suitable
for directed assembly within gap4. This can also be performed by using the
"Enter Assembly into Gap4" module.
Option: E limit
Option: D limit
Option: O threshold
These parameters control the graph component of FakII, which is used to find
the initial overlaps between sequences.
Option: Auto-generate constraints
When enabled, this uses the reading direction (forward / reverse primers), the
template name and the insert size, to produce a file containing data to constrain
how the readings may be assembled.
Option: E rate
Option: O threshold
Option: D threshold
These parameters control the assemble component of FakII, which is used for
determining the best construction of sequences from the overlap graph.
Option: Assembly number
This allows for non-optimal assemblies to be chosen. The optimum assembly
is assembly number 1, with the next optimum being number 2, and so on.

4.6.27 Phrap Assembly
Phrap is not included as part of the Staden Package. It is available from Phil Green.
http://www.genome.washington.edu/UWGC/analysistools/phrap.htm
Description
This module uses the phrap program to perform shotgun assembly. Output will
be placed in the fofn.assembly directory, where fofn is the filename prefix listed
in the "Files to Process" panel. The output is in a format suitable for directed
assembly within gap4. This can also be performed by using the "Enter Assembly
into Gap4" module. Phrap can make use of the confidence value information
written by the phred program to produce better assemblies. Phrap also uses
the full length of the sequence and will ignore any quality clipping. It is still
necessary to clip sequencing vector.
Option: Minimum exact match
Minimum length of matching word for SWAT comparison.

Option: Minimum SWAT score
Minimum SWAT score.
Option: Other phrap arguments
Any other phrap command line arguments.

4.6.28 Enter Assembly into Gap4
Description
This module is used to enter assemblies into gap4 which have been generated
externally to gap4 (i.e. all assembly engines except "Gap4 shotgun assembly").
This is achieved by using the gap4 "Directed Assembly" function. The assembly
is read from the fofn.assembly directory, where fofn is the filename prefix listed
in the "Files to Process" panel.
Option: Gap4 database name
Option: Gap4 database version
The name and version of the database to assemble into.
Option: Create new database
This is a toggle to define whether the specified gap4 database should be created or appended to. Be warned that at present creating a new database will
overwrite an existing one of the same name, in the same directory, without any
warnings.
Option: Post-assembly quality clipping
Option: Lowest (average) quality to use
This can be used to direct gap4 to run the "Quality Clip" function after entering
the assembly. This performs quality clipping by identifying segments where the
average quality is below a particular threshold. This should only be necessary
if quality clipping was not performed earlier (e.g. because Phrap was used for
assembly), and even then it is usually better to use difference clipping instead.
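
As a rough illustration of clipping on average quality, the sketch below slides a fixed-width window inwards from each end of a reading and stops at the first window whose mean quality reaches the threshold. It is not gap4's "Quality Clip" implementation: the function name, the window length and the end-inward scanning strategy are all assumptions made for the example.

```python
def quality_clip(quals, min_avg, window=5):
    """Return (start, end) of the retained region, end exclusive.

    Hypothetical helper, not part of gap4. A window is moved inwards from
    each end until its mean quality is at least min_avg.
    """
    n = len(quals)
    if n < window:
        return (0, 0)

    def mean(i):
        # Mean quality of the window starting at index i.
        return sum(quals[i:i + window]) / window

    # Trim from the left until a window reaches the threshold.
    start = 0
    while start + window <= n and mean(start) < min_avg:
        start += 1
    if start + window > n:
        return (0, 0)  # no window reaches the threshold

    # Trim from the right in the same way.
    end = n
    while end - window >= start and mean(end - window) < min_avg:
        end -= 1
    return (start, end)


quals = [3, 4, 5, 20, 30, 30, 30, 30, 25, 20, 6, 4, 2]
print(quality_clip(quals, min_avg=15, window=3))
```

Here min_avg plays the role of the "Lowest (average) quality to use" option; a reading in which no window reaches the threshold is clipped entirely.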
Option: Post-assembly difference clipping
This can be used to direct gap4 to run the "Difference Clip" function after
entering the assembly. This identifies ends of readings where the alignment between readings and consensus is bad and marks these ends as hidden data. This
is primarily designed for use after the Phrap assembly engine, which sometimes
leaves poorly aligned end fragments.

4.6.29 Email
Description
This module can be used to send an E-mail indicating that the processing of
Pregap4 has reached a given point. This may be of use when running pregap4 in
batch mode, where the GUI is not visible. Typically the email module is placed
at the end of the module list to indicate that pregap4 has (almost) finished;
however, it may be used elsewhere in the module list if desired.
Option: Email address
The email address to send a message to.
Option: Email program
The mail agent used for sending the message.
Option: Program arguments
The arguments (except for the email address) to the mail agent. These could
include options for setting the email subject.
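
The three options can be thought of as the parts of a single command line. The sketch below shows one way they might be combined; it is an assumption that the address is appended after the other arguments, and the mailx program, its -s subject flag and the address are only example values, not pregap4 defaults.

```python
import shlex


def build_mail_command(program, arguments, address):
    """Return the argv list: program, its split arguments, then the address.

    Hypothetical helper for illustration; pregap4 may assemble the
    command differently.
    """
    return [program, *shlex.split(arguments), address]


cmd = build_mail_command("mailx", '-s "pregap4 finished"', "user@example.org")
print(cmd)
```

Using shlex.split keeps a quoted subject such as "pregap4 finished" together as one argument.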

4.6.30 Old Cloning Vector Clip - Obsolete
Description
This is an older version of the Cloning Vector Clip module. It still uses the
vector_clip program to perform this task, but does not use the newer probabilistic model for analysing matches. It is still present as an option for people
who have tuned the parameters for their data and are happy with this. The
probability mode is recommended (see Chapter 6 [Screening against Vector
Sequences], page 401).
Option: Vector file name
The filename containing the vector sequence. At present this should be a file
containing a single plain text sequence containing just the bases or white space.
Option: Word length
Option: Number of diagonals
Option: Diagonal score
The searching method involves hashing words to quickly identify matches and
then combining these words along the best and neighbouring diagonals to produce an overall score which is compared against the diagonal score to determine
whether this is vector sequence. The score is normalised from 0 (no match) to
1.0 (perfect match). For full details on this see the vector_clip manual.
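
The general idea of hashing words and scoring diagonals can be sketched as follows. This is a simplified illustration and not the algorithm used by vector_clip: the function name, the way neighbouring diagonals are summed and the normalisation by the number of words in the shorter sequence are assumptions. The parameters word_len and num_diags loosely correspond to the "Word length" and "Number of diagonals" options.

```python
from collections import defaultdict


def diagonal_score(read, vector, word_len=4, num_diags=1):
    """Score in [0, 1] for the best band of diagonals between read and vector.

    Hypothetical sketch; not the vector_clip implementation.
    """
    if len(read) < word_len or len(vector) < word_len:
        return 0.0

    # Hash every word in the vector to the positions where it starts.
    index = defaultdict(list)
    for j in range(len(vector) - word_len + 1):
        index[vector[j:j + word_len]].append(j)

    # Count matching words on each diagonal (read position - vector position).
    hits = defaultdict(int)
    for i in range(len(read) - word_len + 1):
        for j in index.get(read[i:i + word_len], ()):
            hits[i - j] += 1
    if not hits:
        return 0.0

    # Combine each diagonal with its neighbours and keep the best band.
    best = max(
        sum(hits.get(d + k, 0) for k in range(-num_diags, num_diags + 1))
        for d in hits
    )

    # Normalise by the number of words in the shorter sequence, capped at 1.0.
    max_words = min(len(read), len(vector)) - word_len + 1
    return min(1.0, best / max_words)


score = diagonal_score("ACGTTGCAAGCT", "ACGTTGCAAGCT")
print(score)
```

A reading would then be treated as vector when the score is at least the chosen "Diagonal score" cutoff, for example score >= 0.35 (an arbitrary value used only for illustration).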

4.6.31 ALF/ABI to SCF Conversion - Obsolete
Description
This module converts ABI and ALF files to SCF format using the makeSCF
program. SCF format is not required by programs such as gap4, but it is
considerably smaller and has been designed to give high compression ratios.
Option: SCF bit size
This selects the data size for the chromatogram data. An 8 bit value can
store 256 possible values, which is typically good enough for display purposes.
If Y scaling is required (for instance because the signal strength diminishes
significantly along the length of the trace), or further computational analysis
of the trace is required, a 16 bit data size should be chosen. As the majority of
the trace file is the sample data, using 8 bit data typically saves about half of
the disk space.
This module may also be used for converting 16-bit SCF files to 8-bit SCF files.
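
The trade-off between the two bit sizes can be illustrated with a small sketch. It is not how makeSCF performs the conversion: rescaling by the peak sample value is an assumption, and both function names are hypothetical. The storage estimate relies only on the fact that a trace holds four sample channels (A, C, G and T).

```python
def to_8bit(samples):
    """Rescale 16-bit trace samples (0..65535) into the 8-bit range (0..255).

    Hypothetical helper; makeSCF may scale differently.
    """
    peak = max(samples, default=0)
    if peak <= 255:
        return list(samples)  # already fits in one byte per sample
    return [s * 255 // peak for s in samples]


def sample_bytes(num_points, bits):
    """Bytes needed for the four base channels (A, C, G, T) of sample data."""
    return num_points * 4 * (bits // 8)


print(to_8bit([0, 1000, 2000]))
print(sample_bytes(10000, 16), sample_bytes(10000, 8))
```

The rescaling discards fine differences in signal height, which is why 16 bit data is preferable when the trace will be rescaled or analysed further.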

4.7 Using Config Files
Pregap4 uses configuration files to remember the setup for each user or project. These
files define which modules are activated and what their parameter settings are. The files,
which can obviously save considerable amounts of time, are created automatically and can
be saved from the Configure Modules Window once the configuration is complete.
The "Load New Config File" option, available from the File menu, may be used to
switch to a new (existing) configuration file. Pregap4 will display a file browser window to
enable selection of another configuration file. Once chosen, Pregap4 will discard the existing
configuration and use the new one. From this point onwards, any modifications and saving
in Pregap4 will be to the new configuration file.

4.8 Pregap4 Naming Schemes
The "Load Naming Scheme" command is in the File menu. It will bring up a dialogue requesting the pathname of a naming scheme file. The browse button will automatically bring
up the file browser in the pregap4 naming scheme directory; however, naming schemes can
be loaded from elsewhere if desired. The "Save to config file" query determines whether the
naming scheme is also copied to the current pregap4 configuration file, making it
the default for subsequent pregap4 runs.
The use of naming schemes within pregap4 is specifically for extracting information from
a reading name in order to supply parameters to other pregap4 modules or to gap4. For
example a naming scheme may be used to indicate where both the forward and reverse
primers have been used to generate two sequences, which gap4 can then use for checking
assembly and suggesting possible contig joins.
Currently only two naming schemes are supplied with pregap4, both of which are from
the Sanger Centre. To create your own naming schemes please see Section 4.8.4 [Writing
Your Own Naming Scheme], page 369.

4.8.1 Mutation Detection Naming Scheme
Filename

mutation_detection.p4t

Description
This naming scheme can be used for other purposes too, but its primary goal is
to provide the simplest scheme possible suitable for handling pairs of sequences
for the mutation detection module.
Any sequence with a name ending with f or F is assumed to be a forward reading
and any sequence with a name ending with r or R is assumed to be a reverse
reading. The rest of the name (i.e. everything except the last character) is used
as the template name and so needs to exactly match between the forward and
reverse reading pair.
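
The rule described above can be expressed in a few lines. The sketch below is written in Python purely for illustration; the naming scheme itself is defined in the .p4t file, and the function names and example reading names here are hypothetical.

```python
def parse_reading_name(name):
    """Return (template, direction), or (name, None) without an f/F/r/R suffix."""
    if len(name) > 1 and name[-1] in "fF":
        return name[:-1], "forward"
    if len(name) > 1 and name[-1] in "rR":
        return name[:-1], "reverse"
    return name, None


def pair_readings(names):
    """Group reading names by template; keep templates having both directions."""
    by_template = {}
    for name in names:
        template, direction = parse_reading_name(name)
        if direction is not None:
            by_template.setdefault(template, {})[direction] = name
    return {t: d for t, d in by_template.items() if len(d) == 2}


print(pair_readings(["ex1f", "ex1R", "ex2f", "EX2r"]))
```

In this example only ex1 forms a pair: ex2f and EX2r give the template names ex2 and EX2, which do not match exactly.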
Configuration section
[naming_scheme]
Configuration elements
PR_com, TN_com.

4.8.2 Old Sanger Centre Naming Scheme
Filename

sanger_names_old.p4t

Description
This scheme extracts information from sequence names by assuming that they
adhere to the old-style Sanger Centre naming scheme. The information extracted consists of the template name, primer type and chemistry information.
The format of a reading name is as follows.
