Preface
Next generation assembly editing with Gap5
- Gap5 Databases
- Contig Selector / Comparator
  - Contig Selector
  - Contig Comparator
    - Examining Results and Using Them to Select Commands
    - Automatic Match Navigation
- Template Display
- Editing in Gap5
- Importing and Exporting Data
- Finding Sequence Matches
- Checking Assemblies and Removing Readings
  - Checking Assemblies
- Removing Readings and Breaking Contigs
- Tidying up alignments
- Calculating Consensus Sequences
Other Miscellany
- List Libraries
- Results Manager
- Lists
Sequence assembly and finishing using Gap4
- Organisation of the gap4 Manual
- Introduction
  - Summary of the Files used and the Preprocessing Steps
  - Summary of Gap4's Functions
  - Introduction to the gap4 User Interface
  - Gap4 Menus
  - The use of numerical estimates of base calling accuracy
  - Use of the "hidden" poor quality data
  - Annotating and masking readings and contigs
    - Standard tag types
    - Active tags and masking
- Contig Selector
  - Selecting Contigs
  - Changing the Contig Order
  - The Contig Selector Menus
- Contig Comparator
  - Examining Results and Using Them to Select Commands
  - Automatic Match Navigation
- Contig Overviews
  - Template Display
  - Consistency Display
  - SNP Candidates
  - Plotting Consensus Quality
    - Examining the Quality Plot
  - Plotting Stop Codons
    - Examining the Plot
    - Updating the Plot
  - Plotting Restriction Enzymes
- Editing in Gap4
  - Moving the visible segment of the contig
  - Names
  - Editing
  - Selections
  - Annotations
  - Searching
  - The Commands Menu
  - The Settings Menu
  - Removing Readings
  - Primer Selection
    - Parameters
    - Template selection
  - Traces
  - Reference Sequence and Traces
    - Reference sequences
    - Reference traces
  - Template Status Codes
  - The Editor Information Line
  - The Join Editor
  - Using Several Editors at Once
  - Quitting the Editor
  - Editing Techniques
  - Summary
- Assembling and Adding Readings to a Database
  - Normal Shotgun Assembly
  - Directed Assembly
  - Screen Only
  - General Comments and Tips on Assembly
  - Assembly Failure Codes
- Ordering and Joining Contigs
  - Order contigs
  - Find Read Pairs
  - Find Internal Joins
    - Find Internal Joins Dialogue
  - Find Repeats
- Checking Assemblies and Removing Readings
  - Checking Assemblies
- Removing Readings and Breaking Contigs
  - Breaking Contigs
  - Disassembling Readings
- Finishing Experiments
  - Double Stranding
  - Suggest Primers
  - Suggest Long Readings
  - Compressions and Stops
  - Suggest Probes
- Calculating Consensus Sequences
  - Normal Consensus Output
  - Extended Consensus Output
  - Unfinished Consensus Output
  - Quality Consensus Output
  - The Consensus Algorithms
  - List Consensus Confidence
  - List Base Confidence
- Miscellaneous functions
  - Complement a Contig
  - Enter Tags
  - Shuffle Pads
  - Show Relationships
  - Contig Navigation
  - Sequence Search
  - Extract Readings
  - Automatic Clipping by Quality and Sequence Similarity
- Results Manager
- Lists
  - Special List Names
  - Basic List Commands
  - Contigs To Readings Command
  - Minimal Coverage Command
  - Unattached Readings Command
  - Highlight Readings List
  - Search Sequence Names
  - Search Template Names
  - Search Annotation Contents
- Notes
  - Selecting Notes
  - Editing Notes
  - Special Note Types
- Gap4 Database Files
  - Directories
  - Opening a New Database
  - Opening an Existing Database
  - Making Backups of Databases
  - Reading and Contig Names and Numbers
- Copy Readings
  - Introduction
    - Copy Reads Dialogue
- Check Database
  - Database Checks
  - Contig Checks
  - Reading Checks
  - Annotation Checks
  - Note Checks
  - Template Checks
  - Vector Checks
  - Clone Checks
- Doctor Database
  - Structures Menu
  - Ignoring Check Database
  - Extending Structures
  - Listing and Removing Annotations
  - Shift Readings
  - Delete Contig
  - Reset Contig Order
- Configuring
  - Introduction
  - Consensus Algorithm
  - Set Maxseq/Maxdb
  - Set Fonts
  - Configuring Menus
  - Set Genetic Code
  - Alignment Scores
  - Trace File Location
  - The Tag Selector
  - The GTAGDB File
  - Template Status
Command Line Arguments
Searching for point mutations using pregap4 and gap4
- Introduction to mutation detection
Preparing readings for assembly using pregap4
- Organisation of the Pregap4 Manual
- Introduction
- Specifying Files to Process
- Running Pregap4
- Configuring the Pregap4 User Interface
  - Fonts and Colours
  - Window Styles
- Configuring Modules
- Using Config Files
- Pregap4 Naming Schemes
- Pregap4 Components
- Information Sources
  - Simple Text Database
  - Experiment File Line Types
- Adding and Removing Modules
- Low Level Pregap4 Configuration
- Writing New Modules
Marking poor quality and vector segments of readings
- Introduction to read clipping
Screening Against Vector Sequences
- Algorithms
- Options
- Parameters (defaults in brackets)
- Error codes
- Examples
- Vector_Primer file format
- Vector_Primer File Notes
- Defining Cloning and Primer Sites for Vector_Clip
- Finding the Cloning and Primer Sites
Screening Readings for Contaminant Sequences
- Parameters
- Limits
- Error codes
- Examples
Viewing and editing trace data using trev
- Introduction
- Opening trace files
  - Opening a trace file from the command line
  - Opening a trace file from within Trev
- Viewing the trace
  - Searching
  - Information
- Editing
- Saving a trace file
- Processing multiple files
- Printing a trace
- Quitting
Analysing and comparing sequences using spin
- Organisation of the Spin Manual
- Introduction
- Spin's Analytical Functions
- Spin Comparison Functions
- Controlling and Managing Results
- The Spin User Interface
- Controlling and Managing Results
  - Result manager
- Reading and Managing Sequences
User Interface
- Introduction
- Basic Interface Controls
- Standard Mouse Operations
- The Output and Error Windows
- Graphics Window
  - Zooming
- Colour Selector
- File Browser
  - Directories and Files
  - Filters
- Font Selection
File Formats
- SCF
- ZTR
- Experiment File
- Restriction Enzyme File
- Vector_primer File
- Vector Sequence Format
Man Pages
- Convert_trace
  - NAME
  - SYNOPSIS
  - DESCRIPTION
  - OPTIONS
  - EXAMPLES
  - NOTES
  - SEE ALSO
- Copy_db
  - NAME
  - SYNOPSIS
  - DESCRIPTION
  - OPTIONS
  - EXAMPLES
  - NOTES
- Copy_reads
  - NAME
  - SYNOPSIS
  - DESCRIPTION
  - OPTIONS
  - EXAMPLE
- Eba
  - NAME
  - SYNOPSIS
  - DESCRIPTION
  - EXAMPLES
  - SEE ALSO
- Extract_seq
  - NAME
  - SYNOPSIS
  - DESCRIPTION
  - OPTIONS
  - SEE ALSO
- Extract_fastq
  - NAME
  - SYNOPSIS
  - DESCRIPTION
  - OPTIONS
  - SEE ALSO
- Find_renz
  - NAME
  - SYNOPSIS
  - DESCRIPTION
  - OPTIONS
  - SEE ALSO
- GetABIfield
  - NAME
  - SYNOPSIS
  - DESCRIPTION
  - OPTIONS
  - EXAMPLES
  - SEE ALSO
- Get_comment
  - NAME
  - SYNOPSIS
  - DESCRIPTION
  - OPTIONS
  - SEE ALSO
- Get_scf_field
  - NAME
  - SYNOPSIS
  - DESCRIPTION
  - OPTIONS
  - SEE ALSO
- Hash_exp
  - NAME
  - SYNOPSIS
  - DESCRIPTION
  - SEE ALSO
- Hash_extract
  - NAME
  - SYNOPSIS
  - DESCRIPTION
  - OPTIONS
  - SEE ALSO
- Hash_list
  - NAME
  - SYNOPSIS
  - DESCRIPTION
  - OPTIONS
  - SEE ALSO
- Hash_tar
  - NAME
  - SYNOPSIS
  - DESCRIPTION
  - OPTIONS
  - EXAMPLES
  - SEE ALSO
- Init_exp
  - NAME
  - SYNOPSIS
  - DESCRIPTION
  - OPTIONS
  - NOTES
  - SEE ALSO
- MakeSCF
  - NAME
  - SYNOPSIS
  - DESCRIPTION
  - OPTIONS
  - EXAMPLES
  - NOTES
  - SEE ALSO
- Make_weights
  - NAME
  - SYNOPSIS
  - DESCRIPTION
  - OPTIONS
  - EXAMPLE
  - SEE ALSO
- PolyA_clip
  - NAME
  - SYNOPSIS
  - OPTIONS
  - DESCRIPTION
  - SEE ALSO
- Qclip
  - NAME
  - SYNOPSIS
  - DESCRIPTION
  - OPTIONS
  - EXAMPLE
  - SEE ALSO
- Screen_seq
  - NAME
  - SYNOPSIS
  - DESCRIPTION
  - OPTIONS
  - EXAMPLES
  - NOTES
  - SEE ALSO
- TraceDiff
  - NAME
  - SYNOPSIS
  - DESCRIPTION
  - OPTIONS
- Trace_dump
  - NAME
  - SYNOPSIS
  - DESCRIPTION
  - SEE ALSO
- Vector_clip
  - NAME
  - SYNOPSIS
  - DESCRIPTION
  - OPTIONS
  - EXAMPLES
  - NOTES
  - SEE ALSO
References
- Publications
General Index
File Index
Variable Index
Function Index

The Staden Package Manual

Last update on 25 April 2016

James Bonfield, Kathryn Beal, Mark Jordan, Yaping Cheng and Rodger Staden

1999-2002, Medical Research Council, Laboratory of Molecular Biology. Made

available under the standard BSD licence.

2002-2006, Genome Research Limited (GRL). Made available under the stan-

dard BSD licence.

Portions of this code are derived from a modified Primer3 library. This bears the following copyright notice:

Copyright (c) 1996,1997,1998 Whitehead Institute for Biomedical Research. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. Redistributions of source code must also reproduce this information in the source code itself.

2. If the program is modified, redistributions must include a notice (in the same places as above) indicating that the redistributed program is not identical to the version distributed by Whitehead Institute.

3. All advertising materials mentioning features or use of this software must display the following acknowledgment: This product includes software developed by the Whitehead Institute for Biomedical Research.

4. The name of the Whitehead Institute may not be used to endorse or promote products derived from this software without specific prior written permission.

We also request that use of this software be cited in publications as

Steve Rozen, Helen J. Skaletsky (1996,1997,1998) Primer3. Code available at http://www-genome.wi.mit.edu/genome_software/other/primer3.html

THIS SOFTWARE IS PROVIDED BY THE WHITEHEAD INSTITUTE “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE WHITEHEAD INSTITUTE BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Permission is given to duplicate this manual in both paper and electronic forms.

Short Contents

1 Next generation assembly editing with Gap5

2 Sequence assembly and finishing using Gap4

3 Searching for point mutations using pregap4 and gap4

4 Preparing readings for assembly using pregap4

5 Marking poor quality and vector segments of readings

6 Screening Against Vector Sequences

7 Screening Readings for Contaminant Sequences

8 Viewing and editing trace data using trev

9 Analysing and comparing sequences using spin

10 User Interface

11 File Formats

12 Man Pages

References

General Index

File Index

Variable Index

Function Index

Table of Contents

Preface

1 Next generation assembly editing with Gap5

1.1 Gap5 Databases

1.1.1 Creating databases

1.1.2 Opening/closing databases

1.1.3 Changing directories

1.1.4 Check Database

1.2 Contig Selector / Comparator

1.2.1 Contig Selector

1.2.1.1 Selecting Contigs

1.2.1.2 Changing the Contig Order

1.2.1.3 The Contig Selector Menus

1.2.2 Contig Comparator

1.2.2.1 Examining Results and Using Them to Select Commands

1.2.2.2 Automatic Match Navigation

1.3 Template Display

1.3.1 Filtering data

1.3.2 Template plot

1.3.2.1 Controlling The Y Layout

1.3.3 Depth / Coverage Plot

1.4 Editing in Gap5

1.4.1 Moving the visible segment of the contig

1.4.2 Names

1.4.3 Editing

1.4.3.1 Moving the editing cursor

1.4.3.2 Adjusting the Quality Values

1.4.3.3 Adjusting the alignment coordinates

1.4.3.4 Adjusting the Cutoff Data

1.4.3.5 Summary of Editing Commands

1.4.4 Cut and Paste Control of Sequence

1.4.5 Selecting Sequences

1.4.6 Annotations

1.4.6.1 Annotation Macros

1.4.7 Searching

1.4.7.1 Search by Annotation Comments

1.4.7.2 Search by Tag Type

1.4.7.3 Search by Padded Position

1.4.7.4 Search by Unpadded Position

1.4.7.5 Search by Sequence

1.4.7.6 Search by Reading Name

1.4.7.7 Search by Reference InDel

1.4.7.8 Search by Consensus Quality

1.4.7.9 Search by Consensus Discrepancy

1.4.7.10 Search by Consensus Heterozygosity

1.4.7.11 Search by Low Coverage

1.4.7.12 Search by High Coverage

1.4.8 The Settings Menu

1.4.8.1 Group Readings

1.4.8.2 Highlight Disagreements

1.4.8.3 Pack Sequences

1.4.8.4 Hide Annotations

1.4.9 Primer Selection

1.4.10 Traces

1.4.11 The Editor Information Line

1.4.11.1 Reading Information

1.4.11.2 Contig Information

1.4.11.3 Tag Information

1.4.12 The Join Editor

1.4.13 Using Several Editors at Once

1.4.14 Quitting the Editor

1.4.15 Summary

1.4.15.1 Keyboard summary for editing window

1.4.15.2 Mouse summary for editing window

1.4.15.3 Mouse summary for names window

1.4.16 Plotting Restriction Enzymes

1.4.16.1 Selecting Enzymes

1.4.16.2 Examining the Plot

1.4.16.3 Reconfiguring the Plot

1.4.16.4 Textual Outputs

1.5 Importing and Exporting Data

1.5.1 Assembly

1.5.1.1 Importing with tg_index

1.5.1.2 Importing fasta/fastq files

1.5.1.3 Mapped assembly by bwa aln

1.5.1.4 Mapped assembly by bwa dbwtsw

1.5.2 Importing GFF

1.5.3 Export Tags

1.5.4 Export Sequences

1.6 Finding Sequence Matches

1.6.1 Find Internal Joins

1.6.1.1 Find Internal Joins Dialogue

1.6.2 Find Repeats

1.6.3 Find Read Pairs

1.6.3.1 Find Read Pairs Graphical Output

1.6.4 Sequence Search

1.7 Checking Assemblies and Removing Readings

1.7.0.1 Checking Assemblies

1.7.1 Removing Readings and Breaking Contigs

1.7.1.1 Breaking Contigs

1.7.1.2 Disassembling Readings

1.7.1.3 Delete Contigs

1.8 Tidying up alignments

1.8.1 Shuffle Pads

1.8.2 Remove Pad Columns

1.8.3 Remove Contig Holes

1.9 Calculating Consensus Sequences

1.9.1 Normal Consensus Output

1.9.2 The Consensus Algorithms

1.9.2.1 Consensus Calculation Using Base Frequencies

1.9.2.2 Consensus Calculation Using Weighted Base Frequencies

1.9.2.3 Consensus Calculation Using Confidence values

1.9.2.4 The Quality Calculation

1.9.3 List Consensus Confidence

1.9.4 List Base Confidence

1.10 Other Miscellany

1.10.1 List Libraries

1.10.2 Results Manager

1.10.3 Lists

1.10.3.1 Special List Names

1.10.3.2 Basic List Commands

1.10.3.3 Contigs To Readings Command

1.10.3.4 Search Sequence Names

2 Sequence assembly and finishing using Gap4

2.1 Organisation of the gap4 Manual

2.2 Introduction

2.2.1 Summary of the Files used and the Preprocessing Steps

2.2.2 Summary of Gap4’s Functions

2.2.3 Introduction to the gap4 User Interface

2.2.3.1 Introduction to the Contig Selector

2.2.3.2 Introduction to the Contig Comparator

2.2.3.3 Introduction to the Template Display

2.2.3.4 Introduction to the Consistency Display

2.2.3.5 Introduction to the Restriction Enzyme Map

2.2.3.6 Introduction to the Stop Codon Map

2.2.3.7 Introduction to the Contig Editor

2.2.3.8 Introduction to the Contig Joining Editor

2.2.4 Gap4 Menus

2.2.4.1 Gap4 File menu

2.2.4.2 Gap4 Edit menu

2.2.4.3 Gap4 View menu

2.2.4.4 Gap4 Options menu

2.2.4.5 Gap4 Experiments menu

2.2.4.6 Gap4 Lists menu

2.2.4.7 Gap4 Assembly menu

2.2.5 The use of numerical estimates of base calling accuracy

2.2.6 Use of the "hidden" poor quality data

2.2.7 Annotating and masking readings and contigs

2.2.7.1 Standard tag types

2.2.7.2 Active tags and masking

2.3 Contig Selector

2.3.1 Selecting Contigs

2.3.2 Changing the Contig Order

2.3.3 The Contig Selector Menus

2.4 Contig Comparator

2.4.1 Examining Results and Using Them to Select Commands

2.4.2 Automatic Match Navigation

2.5 Contig Overviews

2.5.1 Template Display

2.5.1.1 Reading and Template Plot

2.5.1.2 Reading and Template Plot Display

2.5.1.3 Reading and Template Plot Options

2.5.1.4 Reading and Template Plot Operations

2.5.1.5 Quality Plot

2.5.1.6 Restriction Enzyme Plot

2.5.2 Consistency Display

2.5.2.1 Confidence Values Graph

2.5.2.2 Reading Coverage Histogram

2.5.2.3 Read-Pair Coverage Histogram

2.5.2.4 Strand Coverage

2.5.2.5 2nd-Highest Confidence

2.5.2.6 Diploid Graph

2.5.3 SNP Candidates

2.5.4 Plotting Consensus Quality

2.5.4.1 Examining the Quality Plot

2.5.5 Plotting Stop Codons

2.5.5.1 Examining the Plot

2.5.5.2 Updating the Plot

2.5.6 Plotting Restriction Enzymes

2.5.6.1 Selecting Enzymes

2.5.6.2 Examining the Plot

2.5.6.3 Reconfiguring the Plot

2.5.6.4 Creating Tags for Cut Sites

2.5.6.5 Textual Outputs

2.6 Editing in Gap4

2.6.1 Moving the visible segment of the contig

2.6.2 Names

2.6.3 Editing

2.6.3.1 Moving the editing cursor

2.6.3.2 Editing Modes

2.6.3.3 Adjusting the Quality Values

2.6.3.4 Adjusting the Cutoff Data

2.6.3.5 Summary of Editing Commands

2.6.4 Selections

2.6.5 Annotations

2.6.6 Searching

2.6.6.1 Search by Position

2.6.6.2 Search by Problem

2.6.6.3 Search by Annotation Comments

2.6.6.4 Search by Tag Type

2.6.6.5 Search by Sequence

2.6.6.6 Search by Quality

2.6.6.7 Search by Consensus Quality

2.6.6.8 Search by file

2.6.6.9 Search by Reading Name

2.6.6.10 Search by Edit

2.6.6.11 Search by Evidence for Edit (1)

2.6.6.12 Search by Evidence for Edit (2)

2.6.6.13 Search by Discrepancies

2.6.6.14 Search by Consensus Discrepancies

2.6.7 The Commands Menu

2.6.7.1 Search

2.6.7.2 Create Tag

2.6.7.3 Edit Tag

2.6.7.4 Delete Tag

2.6.7.5 Save Contig

2.6.7.6 Dump Contig to File

2.6.7.7 Save Consensus Trace

2.6.7.8 List Confidence

2.6.7.9 Report Mutations

2.6.7.10 Select Primer

2.6.7.11 Align

2.6.7.12 Remove Reading

2.6.7.13 Break Contig

2.6.8 The Settings Menu

2.6.8.1 Status Line

2.6.8.2 Trace Display

2.6.8.3 Auto-display Traces

2.6.8.4 Show Read-pair Traces

2.6.8.5 Auto-diff Traces

2.6.8.6 Y scale differences

2.6.8.7 Consensus Algorithm

2.6.8.8 Group Readings

2.6.8.9 Highlight Disagreements

2.6.8.10 Compare Strands

2.6.8.11 Toggle auto-save

2.6.8.12 3 Character Amino Acids

2.6.8.13 Show Reading and Consensus Quality

2.6.8.14 Show edits

2.6.8.15 Show Unpadded Positions

2.6.8.16 Show Template Names ............................ 185

2.6.8.17 Set Active Tags ................................... 186

2.6.8.18 Set Output List ................................... 186

2.6.8.19 Set Default Conﬁdences ........................... 186

2.6.8.20 Set or unset saving of undo........................ 186

2.6.9 Removing Readings ..................................... 186

2.6.10 Primer Selection ....................................... 187

2.6.10.1 Parameters........................................ 188

2.6.10.2 Template selection ................................ 188

2.6.11 Traces ................................................. 188

2.6.12 Reference Sequence and Traces ......................... 191

2.6.12.1 Reference sequences ............................... 191

2.6.12.2 Reference traces................................... 191

2.6.13 Template Status Codes................................. 192

2.6.14 The Editor Information Line ........................... 193

2.6.14.1 Reading Information .............................. 194

2.6.14.2 Contig Information................................ 195

2.6.14.3 Tag Information................................... 195

2.6.14.4 Base Information.................................. 195

2.6.15 The Join Editor........................................ 196

2.6.16 Using Several Editors at Once .......................... 197

2.6.17 Quitting the Editor .................................... 197

2.6.18 Editing Techniques..................................... 197

2.6.18.1 Consensus and Quality Cutoﬀs .................... 198

2.6.18.2 Editing by Base Change or Conﬁdence ............ 199

2.6.18.3 Base Overcalls .................................... 199

2.6.18.4 Base Undercalls ................................... 200

2.6.18.5 Multiple Base Disagreements ...................... 200

2.6.18.6 Poor Quality ...................................... 201

2.6.18.7 Checking for Errors ............................... 201

2.6.19 Summary .............................................. 202

2.6.19.1 Keyboard summary for editing window ............ 202

2.6.19.2 Mouse summary for editing window ............... 203

2.6.19.3 Mouse summary for names window ................ 203

2.6.19.4 Mouse summary for scrollbar ...................... 204

2.7 Assembling and Adding Readings to a Database .............. 205

2.7.1 Normal Shotgun Assembly .............................. 205

2.7.1.1 Assemble Independently ............................ 209

2.7.1.2 Assemble Into Single Stranded Regions ............. 209

2.7.1.3 Stack Readings ..................................... 210

2.7.1.4 Put All Readings In Separate Contigs .............. 211

2.7.2 Directed Assembly ...................................... 211

2.7.3 Screen Only............................................. 213

2.7.4 General Comments and Tips on Assembly ............... 215

2.7.5 Assembly Failure Codes ................................. 216

2.8 Ordering and Joining Contigs ................................ 217

2.8.1 Order contigs ........................................... 219

2.8.2 Find Read Pairs ........................................ 222

2.8.2.1 Find Read Pairs Graphical Output ................. 222

2.8.2.2 Find Read Pairs Text Output ...................... 224

2.8.2.3 The Template Lines ................................ 224

2.8.2.4 The Reading Lines ................................. 225

2.8.3 Find Internal Joins ...................................... 227

2.8.3.1 Find Internal Joins Dialogue........................ 230

2.8.4 Find Repeats ........................................... 233

2.9 Checking Assemblies and Removing Readings ................ 235

2.9.0.1 Checking Assemblies ............................... 236

2.9.1 Removing Readings and Breaking Contigs ............... 238

2.9.1.1 Breaking Contigs ................................... 239

2.9.1.2 Disassembling Readings ............................ 240

2.10 Finishing Experiments ...................................... 241

2.10.1 Double Stranding ...................................... 241

2.10.2 Suggest Primers........................................ 243

2.10.3 Suggest Long Readings................................. 245

2.10.4 Compressions and Stops................................ 247

2.10.5 Suggest Probes ........................................ 249

2.11 Calculating Consensus Sequences............................ 251

2.11.1 Normal Consensus Output ............................. 252

2.11.2 Extended Consensus Output ........................... 253

2.11.3 Unfinished Consensus Output .......................... 255

2.11.4 Quality Consensus Output ............................. 255

2.11.5 The Consensus Algorithms ............................. 257

2.11.5.1 Consensus Calculation Using Base Frequencies..... 258

2.11.5.2 Consensus Calculation Using Weighted Base Frequencies ..... 259

2.11.5.3 Consensus Calculation Using Confidence values ..... 259

2.11.5.4 The Quality Calculation........................... 261

2.11.6 List Consensus Confidence ............................. 261

2.11.7 List Base Confidence ................................... 263

2.12 Miscellaneous functions ..................................... 265

2.12.1 Complement a Contig .................................. 265

2.12.2 Enter Tags............................................. 265

2.12.3 Shuffle Pads ........................................... 265

2.12.4 Show Relationships .................................... 267

2.12.5 Contig Navigation ..................................... 269

2.12.6 Sequence Search ....................................... 271

2.12.7 Extract Readings ...................................... 273

2.12.8 Automatic Clipping by Quality and Sequence Similarity ..... 274

2.12.8.1 Difference Clipping ................................ 274

2.12.8.2 Quality Clipping .................................. 275

2.12.8.3 Quality Clip Ends ................................. 275

2.12.8.4 N-Base Clipping .................................. 276

2.13 Results Manager ............................................ 277

2.14 Lists........................................................ 278

2.14.1 Special List Names..................................... 278

2.14.2 Basic List Commands .................................. 278

2.14.3 Contigs To Readings Command ........................ 279

2.14.4 Minimal Coverage Command ........................... 279

2.14.5 Unattached Readings Command........................ 279

2.14.6 Highlight Readings List ................................ 279

2.14.7 Search Sequence Names ................................ 279

2.14.8 Search Template Names ................................ 280

2.14.9 Search Annotation Contents............................ 280

2.15 Notes ....................................................... 281

2.15.1 Selecting Notes ........................................ 281

2.15.2 Editing Notes .......................................... 282

2.15.3 Special Note Types .................................... 282

2.16 Gap4 Database Files ........................................ 284

2.16.1 Directories ............................................. 284

2.16.2 Opening a New Database .............................. 285

2.16.3 Opening an Existing Database ......................... 285

2.16.4 Making Backups of Databases .......................... 285

2.16.5 Reading and Contig Names and Numbers .............. 286

2.17 Copy Readings ............................................. 287

2.17.1 Introduction ........................................... 287

2.17.1.1 Copy Reads Dialogue ............................. 287

2.18 Check Database ............................................ 290

2.18.1 Database Checks ....................................... 290

2.18.2 Contig Checks ......................................... 290

2.18.3 Reading Checks ........................................ 291

2.18.4 Annotation Checks..................................... 292

2.18.5 Note Checks ........................................... 292

2.18.6 Template Checks....................................... 292

2.18.7 Vector Checks ......................................... 292

2.18.8 Clone Checks .......................................... 292

2.19 Doctor Database............................................ 293

2.19.1 Structures Menu ....................................... 294

2.19.1.1 Database Structure................................ 295

2.19.1.2 Reading Structure................................. 295

2.19.1.3 Contig Structure .................................. 295

2.19.1.4 Annotation Structure ............................. 296

2.19.1.5 Template Structure ............................... 296

2.19.1.6 Original Clone Structure .......................... 296

2.19.1.7 Note Structure .................................... 296

2.19.2 Ignoring Check Database............................... 296

2.19.3 Extending Structures .................................. 296

2.19.4 Listing and Removing Annotations ..................... 296

2.19.5 Shift Readings ......................................... 297

2.19.6 Delete Contig .......................................... 297

2.19.7 Reset Contig Order .................................... 297

2.20 Configuring ................................................. 298

2.20.1 Introduction ........................................... 298

2.20.2 Consensus Algorithm .................................. 299

2.20.3 Set Maxseq/Maxdb .................................... 299

2.20.4 Set Fonts .............................................. 299

2.20.5 Configuring Menus ..................................... 300

2.20.6 Set Genetic Code ...................................... 300

2.20.7 Alignment Scores ...................................... 301

2.20.8 Trace File Location .................................... 302

2.20.9 The Tag Selector....................................... 304

2.20.10 The GTAGDB File ................................... 304

2.20.11 Template Status ...................................... 305

2.21 Command Line Arguments.................................. 306

3 Searching for point mutations using pregap4 and gap4 ..... 309

3.1 Introduction to mutation detection ........................... 309

3.1.1 Mutation Detection Programs ........................... 314

3.1.2 Mutation Detection Reference Data ..................... 314

3.1.3 Reference Sequences..................................... 314

3.1.4 Reference Traces ........................................ 315

3.1.5 Using The Template Display With Mutation Data ....... 317

3.1.6 Configuring The Gap4 Editor For Mutation Data ........ 318

3.1.7 Using The Gap4 Editor With Mutation Data ............ 319

3.1.8 Processing Batches Of Mutation Data Trace Files ........ 320

3.1.9 Processing Batches Of Mutation Data Trace Files Using Pregap4 ..... 322

3.1.10 Configuration Of Pregap4 For Mutation Data .......... 323

3.1.11 Discussion Of Mutation Data Processing Methods ...... 324

4 Preparing readings for assembly using pregap4 ..... 325

4.1 Organisation of the Pregap4 Manual ......................... 325

4.2 Introduction ................................................. 326

4.2.1 Summary of the Files used and the Processing Steps ..... 326

4.2.2 Introduction to the Pregap4 User Interface .............. 330

4.2.2.1 Introduction to the Files to Process Window ........ 332

4.2.2.2 Introduction to the Configure Modules Window..... 334

4.2.2.3 Introduction to the Textual Output Window ........ 335

4.2.2.4 Introduction to Running Pregap4 ................... 335

4.2.3 Pregap4 Menus ......................................... 336

4.2.3.1 Pregap4 File menu ................................. 336

4.2.3.2 Pregap4 Modules menu............................. 336

4.2.3.3 Pregap4 Information source menu .................. 336

4.2.3.4 Pregap4 Options menu ............................. 337

4.3 Specifying Files to Process ................................... 337

4.4 Running Pregap4 ............................................ 338

4.5 Configuring the Pregap4 User Interface ....................... 341

4.5.1 Fonts and Colours....................................... 341

4.5.2 Window Styles .......................................... 341

4.6 Configuring Modules ......................................... 342

4.6.1 General Configuration................................... 344

4.6.2 Estimate Base Accuracies ............................... 344

4.6.3 Phred................................................... 344

4.6.4 ATQA .................................................. 345

4.6.5 Trace Format Conversion................................ 345

4.6.6 Initialise Experiment Files............................... 346

4.6.7 Augment Experiment Files .............................. 346

4.6.8 Quality Clip ............................................ 347

4.6.9 Sequencing Vector Clip.................................. 348

4.6.10 Cross match ........................................... 350

4.6.11 Cloning Vector Clip .................................... 350

4.6.12 Screen for Unclipped Vector ............................ 351

4.6.13 Screen Sequences....................................... 351

4.6.14 Blast Screen ........................................... 352

4.6.15 Interactive Clipping .................................... 352

4.6.16 Extract Sequence ...................................... 353

4.6.17 RepeatMasker ......................................... 353

4.6.18 Tag Repeats ........................................... 354

4.6.19 Mutation Detection .................................... 354

4.6.20 Reference Traces and Reference Sequences .............. 356

4.6.21 Trace Difference ....................................... 357

4.6.22 Mutation Scanner ...................................... 359

4.6.23 Gap4 Shotgun Assembly ............................... 361

4.6.24 Cap2 Assembly ........................................ 362

4.6.25 Cap3 Assembly ........................................ 362

4.6.26 FakII Assembly ........................................ 362

4.6.27 Phrap Assembly ....................................... 363

4.6.28 Enter Assembly into Gap4 ............................. 364

4.6.29 Email.................................................. 364

4.6.30 Old Cloning Vector Clip - Obsolete ..................... 365

4.6.31 ALF/ABI to SCF Conversion - Obsolete ............... 365

4.7 Using Config Files ........................................... 366

4.8 Pregap4 Naming Schemes .................................... 366

4.8.1 Mutation Detection Naming Scheme..................... 366

4.8.2 Old Sanger Centre Naming Scheme ...................... 367

4.8.3 New Sanger Centre Naming Scheme ..................... 368

4.8.4 Writing Your Own Naming Schemes ..................... 369

4.9 Pregap4 Components ........................................ 371

4.10 Information Sources......................................... 371

4.10.1 Simple Text Database .................................. 371

4.10.2 Experiment File Line Types ............................ 373

4.11 Adding and Removing Modules ............................. 375

4.12 Low Level Pregap4 Configuration ........................... 377

4.12.1 Low Level Global Configuration ........................ 377

4.12.2 Low Level Component Configuration ................... 378

4.12.3 Low Level Module Configuration ....................... 378

4.12.3.1 General Configuration ............................. 379

4.12.3.2 ALF/ABI to SCF Conversion ..................... 379

4.12.3.3 Estimate Base Accuracies ......................... 380

4.12.3.4 Phred ............................................. 380

4.12.3.5 ATQA ............................................ 380

4.12.3.6 Trace Format Conversion .......................... 381

4.12.3.7 Initialise Experiment Files......................... 381

4.12.3.8 Augment Experiment Files ........................ 382

4.12.3.9 Uncalled Base Clip ................................ 382

4.12.3.10 Quality Clip ..................................... 382

4.12.3.11 Sequencing Vector Clip........................... 383

4.12.3.12 Cross match ..................................... 384

4.12.3.13 Cloning Vector Clip .............................. 385

4.12.3.14 Old Cloning Vector Clip ......................... 385

4.12.3.15 Screen for Unclipped Vector ...................... 386

4.12.3.16 Screen Sequences................................. 387

4.12.3.17 Blast Screen ..................................... 387

4.12.3.18 Interactive Clipping .............................. 388

4.12.3.19 Extract Sequence ................................ 388

4.12.3.20 Tag Repeats ..................................... 388

4.12.3.21 RepeatMasker ................................... 389

4.12.3.22 Mutation Detection .............................. 389

4.12.3.23 Gap4 Shotgun Assembly ......................... 390

4.12.3.24 Cap2 Assembly .................................. 391

4.12.3.25 Cap3 Assembly .................................. 391

4.12.3.26 FakII Assembly .................................. 392

4.12.3.27 Phrap Assembly ................................. 393

4.12.3.28 Enter Assembly into Gap4 ....................... 393

4.12.3.29 Email ............................................ 394

4.12.3.30 Shutdown ........................................ 394

4.13 Writing New Modules ....................................... 395

4.13.1 An Overview of a Module .............................. 395

4.13.2 Functions .............................................. 395

4.13.3 Module Variables ...................................... 397

4.13.4 Global Variables ....................................... 397

4.13.5 Builtin Functions ...................................... 398

4.13.6 An Example Module ................................... 398

5 Marking poor quality and vector segments of readings ..... 399

Introduction to read clipping ...................................... 399

6 Screening Against Vector Sequences........ 401

6.1 Algorithms .................................................. 402

6.2 Options...................................................... 404

6.3 Parameters (defaults in brackets)............................. 404

6.4 Error codes .................................................. 405

6.5 Examples .................................................... 406

6.6 Vector Primer file format .................................... 407

6.7 Vector Primer File Notes .................................... 408

6.8 Defining Cloning and Primer Sites for Vector Clip ............ 408

6.9 Finding the Cloning and Primer Sites ........................ 410

7 Screening Readings for Contaminant Sequences ..... 413

7.1 Parameters .................................................. 413

7.2 Limits ....................................................... 414

7.3 Error codes .................................................. 414

7.4 Examples .................................................... 415

8 Viewing and editing trace data using trev ..... 417

8.1 Introduction ................................................. 417

8.2 Opening trace files ........................................... 420

8.2.1 Opening a trace file from the command line ............. 420

8.2.2 Opening a trace file from within Trev.................... 421

8.3 Viewing the trace ............................................ 421

8.3.1 Searching ............................................... 422

8.3.2 Information ............................................. 422

8.4 Editing ...................................................... 422

8.4.1 Setting the left and right cutoffs ......................... 422

8.4.2 Editing the sequence .................................... 423

8.4.3 Undoing clip edits....................................... 423

8.5 Saving a trace file ............................................ 423

8.6 Processing multiple files ...................................... 423

8.7 Printing a trace .............................................. 424

8.7.1 Page options ............................................ 424

8.7.1.1 Paper options ...................................... 424

8.7.1.2 Panels ............................................. 425

8.7.1.3 Fonts .............................................. 425

8.7.2 Trace options ........................................... 425

8.7.2.1 Title ............................................... 425

8.7.2.2 Line width and colour .............................. 425

8.7.2.3 Dash pattern ....................................... 426

8.7.2.4 Print bases ......................................... 426

8.7.2.5 Print magnification................................. 426

8.7.3 Example ................................................ 427

8.8 Quitting ..................................................... 427

9 Analysing and comparing sequences using spin ..... 429

9.1 Organisation of the Spin Manual ............................. 429

9.2 Introduction ................................................. 429

9.2.1 Summary of the Spin Single Sequence Functions ......... 429

9.2.2 Summary of the Spin Comparison Functions ............. 430

9.2.3 Introduction to the Spin User Interface .................. 431

9.2.3.1 Introduction to the Spin Plot ....................... 432

9.2.3.2 Introduction to the Spin Sequence Display .......... 437

9.2.3.3 Introduction to the Spin Sequence Comparison Plot ..... 439

9.2.3.4 Introduction to the Spin Sequence Comparison Display ..... 443

9.2.4 Spin Menus ............................................. 444

9.2.4.1 Spin File Menu..................................... 444

9.2.4.2 Spin View Menu ................................... 444

9.2.4.3 Spin Options Menu................................. 444

9.2.4.4 Spin Sequences Menu............................... 445

9.2.4.5 Spin Statistics Menu ............................... 445

9.2.4.6 Spin Translation Menu ............................. 445

9.2.4.7 Spin Search Menu .................................. 446

9.2.4.8 Spin Comparison Menu............................. 446

9.3 Spin’s Analytical Functions .................................. 446

9.3.1 Count Sequence Composition............................ 446

9.3.2 Count Dinucleotide Frequencies ......................... 447

9.3.3 Plot base composition ................................... 447

9.3.4 Calculate codon usage................................... 448

9.3.5 Set genetic code......................................... 451

9.3.6 Translation - general .................................... 453

9.3.7 Find open reading frames ............................... 454

9.3.8 Restriction enzyme search ............................... 456

9.3.8.1 Selecting Enzymes.................................. 457

9.3.8.2 Examining the Plot ................................ 457

9.3.8.3 Reconfiguring the Plot ............................. 458

9.3.8.4 Printing the sites ................................... 458

9.3.9 Subsequence search ..................................... 460

9.3.10 Motif search ........................................... 462

9.3.11 Gene finding ........................................... 463

9.3.11.1 Start codon search ................................ 464

9.3.11.2 Stop codon search ................................. 464

9.3.11.3 Codon usage method .............................. 466

9.3.11.4 Positional base preferences ........................ 472

9.3.11.5 Author test ....................................... 473

9.3.11.6 Uneven positional base preferences ................ 477

9.3.11.7 Splice site search .................................. 478

9.3.11.8 tRNA search ...................................... 480

9.4 Spin Comparison Functions .................................. 482

9.4.1 Finding Similar Spans ................................... 483

9.4.2 Finding Matching Words ................................ 485

9.4.3 Finding the Best Diagonals.............................. 487

9.4.4 Aligning Sequences Globally............................. 489

9.4.5 Aligning Sequences Locally .............................. 493

9.5 Controlling and Managing Results............................ 497

9.5.1 Probabilities and expected numbers of matches .......... 498

9.5.2 Changing the maximum number of matches ............. 498

9.5.3 Changing the default number of matches ................ 498

9.5.4 Hide duplicate matches.................................. 499

9.5.5 Changing the score matrix .............................. 499

9.5.6 Set protein alignment symbols........................... 501

9.6 The Spin User Interface ...................................... 501

9.6.1 SPIN Sequence Plot ..................................... 502

9.6.1.1 Cursors ............................................ 505

9.6.1.2 Crosshairs.......................................... 506

9.6.1.3 Zoom .............................................. 506

9.6.1.4 Drag and drop ..................................... 506

9.6.2 Sequence display ........................................ 509

9.6.2.1 Search ............................................. 510

9.6.2.2 Save ............................................... 511

9.6.3 SPIN Sequence Comparison Plot ........................ 512

9.6.3.1 Cursors ............................................ 513

9.6.3.2 Crosshairs.......................................... 513

9.6.3.3 Zoom .............................................. 513

9.6.3.4 Drag and drop ..................................... 514

9.6.4 Sequence Comparison Display ........................... 514

9.7 Controlling and Managing Results............................ 515

9.7.1 Result manager ......................................... 515

9.7.1.1 Information ........................................ 517

9.7.1.2 List ................................................ 517

9.7.1.3 Configure .......................................... 517

9.7.1.4 Hide ............................................... 517

9.7.1.5 Reveal ............................................. 517

9.7.1.6 Remove ............................................ 517

9.8 Reading and Managing Sequences ............................ 517

9.8.1 Use of feature tables in spin ............................. 517

9.8.2 Reading in sequences .................................... 518

9.8.2.1 Simple search ...................................... 518

9.8.2.2 Extracting a sequence from a personal archive file ... 519

9.8.3 Sequence manager....................................... 519

9.8.3.1 Change the active sequence......................... 520

9.8.3.2 Set the range....................................... 520

9.8.3.3 Copy Sequence ..................................... 520

9.8.3.4 Sequence type ...................................... 520

9.8.3.5 Complement sequence .............................. 520

9.8.3.6 Interconvert t and u ................................ 520

9.8.3.7 Translate sequence ................................. 520

9.8.3.8 Scramble sequence ................................. 521

9.8.3.9 Rotate sequence .................................... 521

9.8.3.10 Save sequence ..................................... 521

9.8.3.11 Delete sequence ................................... 522

9.8.4 Selecting a sequence..................................... 522

10 User Interface ............................... 523

Introduction ...................................................... 523

10.2 Basic Interface Controls..................................... 523

10.2.1 Buttons................................................ 523

10.2.2 Menus ................................................. 524

10.2.3 Text Windows ......................................... 524

10.2.4 Text Entry Boxes ...................................... 525

10.3 Standard Mouse Operations................................. 526

10.4 The Output and Error Windows ............................ 526

10.5 Graphics Window........................................... 528

10.5.1 Zooming ............................................... 528

10.6 Colour Selector ............................................. 529

10.7 File Browser ................................................ 529

10.7.1 Directories and Files ................................... 530

10.7.2 Filters ................................................. 530

10.8 Font Selection .............................................. 531

11 File Formats ................................. 533

11.1 SCF ........................................................ 533

11.1.1 Header Record ......................................... 533

11.1.2 Sample Points. ......................................... 535

11.1.3 Sequence Information................................... 536

11.1.4 Comments. ............................................ 537

11.1.5 Private data............................................ 537

11.1.6 File structure. ......................................... 538

11.1.7 Notes .................................................. 538

11.1.7.1 Byte ordering and integer representation. .......... 538

11.1.7.2 Compression of SCF Files ......................... 539

11.2 ZTR........................................................ 540

11.2.1 Header................................................. 540

11.2.2 Chunk Format ......................................... 540

11.2.2.1 Data format 0 - Raw .............................. 541

11.2.2.2 Data format 1 - Run Length Encoding............. 541

11.2.2.3 Data format 2 - ZLIB ............................. 542

11.2.2.4 Data format 64/0x40 - 8-bit delta ................. 542

11.2.2.5 Data format 65/0x41 - 16-bit delta ................ 542

11.2.2.6 Data format 66/0x42 - 32-bit delta ................ 543

11.2.2.7 Data format 67-69/0x43-0x45 - reserved ........... 543

11.2.2.8 Data format 70/0x46 - 16 to 8 bit conversion ...... 543

11.2.2.9 Data format 71/0x47 - 32 to 8 bit conversion ...... 543

11.2.2.10 Data format 72/0x48 - "follow" predictor......... 544

11.2.2.11 Data format 73/0x49 - floating point 16-bit chebyshev polynomial predictor ..... 544

11.2.2.12 Data format 74/0x4A - integer based 16-bit chebyshev polynomial predictor ..... 544

11.2.3 Chunk Types .......................................... 545

11.2.3.1 SAMP ............................................ 545

11.2.3.2 SMP4............................................. 546

11.2.3.3 BASE............................................. 546

11.2.3.4 BPOS............................................. 547

11.2.3.5 CNF4 ............................................. 547

11.2.3.6 TEXT ............................................ 548

11.2.3.7 CLIP ............................................. 548

11.2.3.8 CR32 ............................................. 548

11.2.3.9 COMM ........................................... 549

11.2.4 Text Identifiers ........................................ 549

11.2.5 References ............................................. 551

11.3 Experiment File ............................................ 552

11.3.1 Records................................................ 552

11.3.2 Explanation of Records ................................ 554

11.3.3 Example ............................................... 563

11.3.4 Unsupported Additions (From LaDeana Hillier) ........ 564

11.4 Restriction Enzyme File .................................... 566

11.5 Vector primer File .......................................... 567

11.6 Vector Sequence Format .................................... 568

12 Man Pages................................... 569

12.1 Convert_trace............................................... 570

NAME ........................................................ 570

SYNOPSIS .................................................... 570

DESCRIPTION ............................................... 570

OPTIONS ..................................................... 570

EXAMPLES ................................................... 571

NOTES........................................................ 571

SEE ALSO .................................................... 572

12.2 Copy_db.................................................... 573

NAME ........................................................ 573

SYNOPSIS .................................................... 573

DESCRIPTION ............................................... 573

OPTIONS ..................................................... 573

EXAMPLES ................................................... 573

NOTES........................................................ 573

12.3 Copy_reads ................................................. 574

NAME ........................................................ 574

SYNOPSIS .................................................... 574

DESCRIPTION ............................................... 574

OPTIONS ..................................................... 574

EXAMPLE .................................................... 576

12.4 Eba ........................................................ 577

NAME ........................................................ 577

SYNOPSIS .................................................... 577

DESCRIPTION ............................................... 577

EXAMPLES ................................................... 577

SEE ALSO .................................................... 577

12.5 Extract_seq ................................................. 578

NAME ........................................................ 578

SYNOPSIS .................................................... 578

DESCRIPTION ............................................... 578

OPTIONS ..................................................... 578

SEE ALSO .................................................... 578

12.6 Extract_fastq ............................................... 579

NAME ........................................................ 579

SYNOPSIS .................................................... 579

DESCRIPTION ............................................... 579

OPTIONS ..................................................... 579

SEE ALSO .................................................... 579

12.7 Find_renz................................................... 580

NAME ........................................................ 580

SYNOPSIS .................................................... 580

DESCRIPTION ............................................... 580

OPTIONS ..................................................... 580

SEE ALSO .................................................... 580

12.8 GetABIfield ................................................ 581

NAME ........................................................ 581

SYNOPSIS .................................................... 581

DESCRIPTION ............................................... 581

OPTIONS ..................................................... 581

EXAMPLES ................................................... 582

SEE ALSO .................................................... 582

12.9 Get_comment ............................................... 583

NAME ........................................................ 583

SYNOPSIS .................................................... 583

DESCRIPTION ............................................... 583

OPTIONS ..................................................... 583

SEE ALSO .................................................... 583

12.10 Get_scf_field ............................................... 584

NAME ........................................................ 584

SYNOPSIS .................................................... 584

DESCRIPTION ............................................... 584

OPTIONS ..................................................... 584

SEE ALSO .................................................... 584

12.11 Hash exp .................................................. 585

NAME ........................................................ 585

SYNOPSIS .................................................... 585

DESCRIPTION ............................................... 585

SEE ALSO .................................................... 585

12.12 Hash extract .............................................. 586

NAME ........................................................ 586

SYNOPSIS .................................................... 586

DESCRIPTION ............................................... 586

OPTIONS ..................................................... 586

SEE ALSO .................................................... 586

12.13 Hash list .................................................. 587

NAME ........................................................ 587

SYNOPSIS .................................................... 587

DESCRIPTION ............................................... 587

OPTIONS ..................................................... 587

SEE ALSO .................................................... 587

12.14 Hash tar .................................................. 587

NAME ........................................................ 587

SYNOPSIS .................................................... 587

DESCRIPTION ............................................... 587

OPTIONS ..................................................... 588

EXAMPLES ................................................... 588

SEE ALSO .................................................... 589

12.15 Init exp ................................................... 590

NAME ........................................................ 590

SYNOPSIS .................................................... 590

DESCRIPTION ............................................... 590

OPTIONS ..................................................... 590

NOTES........................................................ 590

SEE ALSO .................................................... 590

12.16 MakeSCF.................................................. 591

NAME ........................................................ 591

SYNOPSIS .................................................... 591

DESCRIPTION ............................................... 591

OPTIONS ..................................................... 591

EXAMPLES ................................................... 591

NOTES........................................................ 592

SEE ALSO .................................................... 592

12.17 Make weights.............................................. 593

NAME ........................................................ 593

SYNOPSIS .................................................... 593

DESCRIPTION ............................................... 593

OPTIONS ..................................................... 595

EXAMPLE .................................................... 596

SEE ALSO .................................................... 596

12.18 PolyA clip................................................. 597

NAME ........................................................ 597

SYNOPSIS .................................................... 597

OPTIONS ..................................................... 597

DESCRIPTION ............................................... 597

SEE ALSO .................................................... 597

12.19 Qclip ...................................................... 597

NAME ........................................................ 597

SYNOPSIS .................................................... 597

DESCRIPTION ............................................... 598

OPTIONS ..................................................... 598

EXAMPLE .................................................... 599

SEE ALSO .................................................... 599

12.20 Screen seq ................................................. 600

NAME ........................................................ 600

SYNOPSIS .................................................... 600

DESCRIPTION ............................................... 600

OPTIONS ..................................................... 600

EXAMPLES ................................................... 601

NOTES........................................................ 601

SEE ALSO .................................................... 602

12.21 TraceDiﬀ .................................................. 603

NAME ........................................................ 603

SYNOPSIS .................................................... 603

DESCRIPTION ............................................... 603

OPTIONS ..................................................... 603

12.22 Trace dump ............................................... 606

NAME ........................................................ 606

SYNOPSIS .................................................... 606

DESCRIPTION ............................................... 606

SEE ALSO .................................................... 606

12.23 Vector clip ................................................ 607

NAME ........................................................ 607

SYNOPSIS .................................................... 607

DESCRIPTION ............................................... 607

OPTIONS ..................................................... 607

EXAMPLES ................................................... 608

NOTES........................................................ 610

SEE ALSO .................................................... 610

References ........................................ 611

Publications ...................................................... 611

General Index .................................... 613

File Index......................................... 625

Variable Index.................................... 627

Function Index ................................... 629

Preface

This manual describes the sequence handling and analysis software developed at the Medical

Research Council Laboratory of Molecular Biology, Cambridge, UK, which has come to be

known as the Staden Package.

The vast bulk of work on the package was done at LMB within Rodger Staden’s group,

which over time has consisted of Tim Gleeson, Simon Dear, James Bonﬁeld, Kathryn Beal,

Mark Jordan and Yaping Cheng. Besides the group members, a number of people have

made important contributions, most notably David Judge (feedback and tutorials) and

John Taylor (developing the Windows release).

Since mid-2003 the group in LMB no longer exists. The package became “open source”

and moved onto SourceForge in early 2004. The only active maintainer (James Bonﬁeld)

now works at the Wellcome Trust Sanger Institute. The new package homepage may be

found at

http://staden.sourceforge.net/ and the SourceForge project page is at

https://sourceforge.net/projects/staden/ .

The focus of the development since 1990 has been to produce improved methods for

processing the data for large scale sequencing projects, and this is reﬂected in the scope of

the package: the most advanced components (trev, preﬁnish, pregap4 and gap4) are those

used in that area. Nevertheless the package also contains a program (spin) for the analysis

and comparison of ﬁnished sequences. The latter also provides a graphical user interface to

EMBOSS.

Since the LMB group disbanded it has become necessary to reduce the scope of further

development, so active work is primarily being directed to the Gap4 program.

Gap4 performs sequence assembly, contig ordering based on read pair data, contig joining

based on sequence comparisons, assembly checking, repeat searching, experiment suggestion,

read pair analysis and contig editing. It has graphical views of contigs, templates,

readings and traces which all scroll in register. Contig editor searches and experiment

suggestion routines use confidence values to calculate the confidence of the consensus sequence

and hence identify only places requiring visual trace inspection or extra data. The result is

extremely rapid ﬁnishing and a consensus of known accuracy.

Pregap4 provides a graphical user interface to set up the processing required to prepare

trace data for assembly or analysis. It also automates these processes. The possible

processes which can be set up and automated include trace format conversion, quality analysis,

vector clipping, contaminant screening, repeat searching and mutation detection.

Trev is a rapid and ﬂexible viewer and editor for ABI, ALF, SCF and ZTR trace ﬁles.

Preﬁnish analyses partially completed sequence assemblies and suggests the most eﬃcient

set of experiments to help ﬁnish the project.

Tracediﬀ and hetscan automatically locate mutations by comparing trace data against

reference traces. They annotate the mutations found ready for viewing in gap4.

Spin analyses nucleotide sequences to ﬁnd genes, restriction sites, motifs, etc. It can

perform translations, find open reading frames, count codons, etc. Many results are

presented graphically and a sliding sequence window is linked to the graphics cursor. Spin also

compares pairs of sequences in many ways. It has very rapid dot matrix analysis, global and

local alignment algorithms, plus a sliding sequence window linked to the graphical plots. It

can compare nucleic acid against nucleic acid, protein against protein, and protein against

nucleic acid.

The manual describes, in turn, each of the main programs in the package: gap4, and

then pregap4 and its associated programs such as trev, and then spin. This is followed by a

description of the graphical user interface, the ZTR, SCF and Experiment ﬁle formats used

by our software, UNIX manpages for several of the smaller programs, and ﬁnally a list of

papers published about the software. The description for each of the programs includes an

introductory section which is intended to be suﬃcient to enable people to start using them,

although in order to get the most from the programs, and to ﬁnd the most eﬃcient ways of

using them we recommend that the whole manual is read once. The mini-manual is made

up from the introductory sections for each of the main programs.

1 Next generation assembly editing with Gap5

1.1 Gap5 Databases

1.1.1 Creating databases

Gap5 cannot work directly on assembly files in their native formats. This is a substantial

difference from tools such as BAM file viewers, but the reason is simply that those formats

do not structure their data in a manner suitable for in-place editing. Gap5 is first

and foremost an assembly editor.

Gap5 databases are currently created external to Gap5 using a command-line program

named tg_index.

tg_index [options] input_file ...

The most general usage is simply to specify one or more data files (it accepts SAM/BAM,

CAF, ACE, BAF, MAQ and, in a more limited fashion, FASTA/FASTQ), optionally specifying

the output database with -o database_name. This will then create a database suitable for

editing by Gap5.

Valid options are:

-m Input is MAQ format

-M Input is MAQ-long format

-A Input is ACE format

-B Input is BAF format

-C Input is CAF format

-f Input is FASTA format

-F Input is FASTQ format

-b Input is BAM format

-s Input is SAM format (with @SQ headers)

-u Also store unmapped reads (SAM/BAM only)

-x Also store auxiliary records (SAM/BAM only)

-r Store reference-position data (on) (SAM/BAM only)

-R Don’t store reference-position data (SAM/BAM only)

-D Do not remove duplicates (SAM/BAM only)

-p Link read-pairs together (default on)

-P Do not link read-pairs together

-q value Number of reads to queue in memory while waiting for pairing. Use to reduce

memory requirements for assemblies with lots of single reads at the expense of

running time. 0 for all in memory, suggest 1000000 if used (default 0).

-a Append to existing db

-n New contigs always (relevant if appending)

-g When appending to an existing db, assume the alignment was performed against

an ungapped copy of the existing consensus. Add gaps back in to reads and/or

consensus as needed.

-t Index sequence names (default)

-T Do not index sequence names

-z value Specify minimum bin size (default is ’4k’)

-f Fast mode: read-pair links are unidirectional. Intended for large databases, e.g.

more than 100 million sequences.

-d data_types

Only copy over certain data types. This is a comma separated list containing

one or more words from: seq, qual, anno, name, all or none

-c method

Specifies the compression method. This should be one of ’none’, ’zlib’ or ’lzma’.

Zlib is the default.

-[1-9] Use a ﬁxed compression level from 1 to 9

-v version_num

Request a specific database format version
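
As an illustration of how these options combine, the following Python sketch assembles a tg_index argument list from flags in the list above. The helper name tg_index_command, its keyword arguments and the file names are hypothetical and are not part of the Staden Package; only the flags themselves are taken from the option list.

```python
import shlex

# Input-format flags exactly as given in the option list above.
FORMAT_FLAGS = {
    "maq": "-m",
    "maq-long": "-M",
    "ace": "-A",
    "baf": "-B",
    "caf": "-C",
    "fasta": "-f",
    "fastq": "-F",
    "bam": "-b",
    "sam": "-s",
}


def tg_index_command(inputs, database=None, fmt=None, append=False,
                     store_unmapped=False, link_pairs=True, queue=None):
    """Build a tg_index argument list (hypothetical helper, not a Staden API)."""
    if not inputs:
        raise ValueError("at least one input file is required")
    cmd = ["tg_index"]
    if database is not None:
        cmd += ["-o", database]          # output database name
    if fmt is not None:
        cmd.append(FORMAT_FLAGS[fmt])    # explicit input format
    if append:
        cmd.append("-a")                 # append to an existing database
    if store_unmapped:
        cmd.append("-u")                 # SAM/BAM only
    if not link_pairs:
        cmd.append("-P")                 # pairing is on by default
    if queue is not None:
        cmd += ["-q", str(queue)]        # reads queued while awaiting pairing
    return cmd + list(inputs)


# Example file and database names are placeholders.
print(shlex.join(tg_index_command(["reads.bam"], database="Egu.0", fmt="bam")))
```

The returned list can be passed to subprocess.run() unchanged; building a list rather than a single string avoids shell quoting problems with unusual file names.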

To merge existing gap5 databases you will need to export either one or both into an

intermediate format (we suggest SAM) and then use tg_index to import the data again.

1.1.2 Opening/closing databases

The Open menu item is in the main gap5 File menu. It brings up a ﬁle browser allowing

selection of the gap5 database name. Databases consist of two ﬁles - a main data block

(.g5d) and a data index (.g5x). It does not matter which you choose as gap5 will open both.
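
For scripts that need to locate both files, a minimal Python sketch of the naming rule described above follows. The function name database_files is hypothetical; the only facts taken from the text are the .g5d and .g5x suffixes and that either file identifies the database.

```python
import os

GAP5_SUFFIXES = (".g5d", ".g5x")


def database_files(path):
    """Return (data_file, index_file) for a database name or either of its files."""
    base, ext = os.path.splitext(path)
    if ext not in GAP5_SUFFIXES:
        # Already a bare database name such as "Egu.0"; keep it whole.
        base = path
    return base + ".g5d", base + ".g5x"


def database_is_complete(path):
    """True when both files of the database exist on disk."""
    return all(os.path.isfile(f) for f in database_files(path))
```

Checking for both files before launching gap5 is a cheap way to catch a database that was copied without its index.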

Alternatively you can specify the database name on the command line when launching

gap5. Additionally this supports read-only access if you specify the -ro ﬂag. For example

to open a database named Egu.0 (the old Gap4 convention implying version 0) in read-only

mode we would type:

gap5 -ro Egu.0 &

1.1.3 Changing directories

By default gap5 changes to the directory containing the database you have open. All local

output ﬁles speciﬁed (for example Save Consensus or Export Sequences) will be relative

to that location unless you use a full pathname. The current working directory may be

changed by using the Change Directory dialogue, found in the main File menu.

1.1.4 Check Database

This function (which is available from the Gap5 File menu) is used to perform a check on

the logical consistency of the database. No user intervention is required. If the checks are

passed the program will report zero errors. Otherwise a report of each error is displayed.

On a large database these checks can take a considerable amount of time. The default

is a thorough, but slow, check. However a faster mode is available which only performs

gross contig and contig-binning level checks, omitting the per sequence and per annotation

validation.

The dialogue also oﬀers the choice of attempting to ﬁx any problems that are found.

It is strongly recommended that you back up the gap5 database before performing fixes,

as, depending on the nature of the corruption, the choices made may not necessarily be an

improvement. Note also that this may not fix every problem that is found, and the fixes

themselves may cause other errors to be found, so it is best to run the check again afterwards.

1.2 Contig Selector / Comparator

1.2.1 Contig Selector

The gap5 Contig Selector is used to display, select and reorder contigs. It can be invoked

from the gap5 View menu, but will automatically appear when a database is opened.

In the Contig Selector all contigs are shown as colinear horizontal lines separated by short

vertical lines. The length of the horizontal lines is proportional to the length of the contigs

and their left to right order represents the current ordering of the contigs. This Contig Order

is stored in the gap database and users can change it by dragging the lines representing the

contigs in the display. The Contig Selector can also be used to select contigs for processing.

Unlike gap4, gap5 does not display annotations within the Contig Selector window.

The ﬁgure shows a typical display from the Contig Selector. At the top are the File, View

and Results menus. Below that are buttons for zooming and for displaying the crosshair.

The four boxes to the right are used to display the X and Y coordinates of the crosshair.

The rightmost two display the Y coordinates when the contig selector is transformed into

the contig comparator (see Section 2.4 [Contig Comparator], page 126). The two leftmost

boxes display the X coordinates: the leftmost is the position in the contig and the other is

the position in the overall consensus. The crosshair is the vertical line spanning the panel

below.

This panel shows the lines that represent the contigs and the currently active tags. Those

tags shown above the contig lines are on readings and those below are on the consensus.

Right clicking on a tag gives a menu containing “information” (to see the tag contents) and

“Edit contig at tag” which invokes the contig editor centred on the selected tag.

The information line shows data for the contig that is currently under the crosshair.

1.2.1.1 Selecting Contigs

Contigs can be selected by either clicking with the left mouse button on the line representing

the required contig in the contig selector window or alternatively by choosing the "List

contigs" option from the "View" menu. This option invokes a "Contig List" list box where

the contig names and numbers are listed in the same order as they appear in the contig

selector window.

Within this list box the contig names can be sorted alphabetically on contig name or

numerically on contig number. This is done by selecting the corresponding item from the

sort menu at the top of the list box. Clicking on a name within the list box is equivalent

to clicking on the corresponding contig in the contig selector. More than one contig can

be selected by dragging out a region with the left mouse button. Dragging the mouse oﬀ

the bottom of the list will scroll it to allow selection of a range larger than the displayed

section of the list. When the left button is pressed any existing selection is cleared. To

select several disjoint entries in the list press control and the left mouse button. The “Copy”

button copies the current selection to the paste buﬀer.

Most commands require a contig identiﬁer, which can be the contig name itself or the

name/number of any reading within that contig. Gap5 always knows reading record

numbers, but depending on the options used in tg_index when creating the assembly database

the reading names may not be indexed. To specify a reading by record number, precede

it by a # character, e.g. “#10000” means reading record number 10000, but “10000” means

the contig or reading with name 10000.
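
The distinction between record numbers and names can be sketched in Python as follows. The function parse_identifier and its return values are hypothetical and only illustrate the # convention described above; they do not reflect how gap5 resolves identifiers internally.

```python
def parse_identifier(text):
    """Classify an identifier as a reading record number or a name.

    "#10000" -> ("record", 10000)   reading record number 10000
    "10000"  -> ("name", "10000")   contig or reading named 10000
    """
    text = text.strip()
    if text.startswith("#"):
        digits = text[1:]
        if not digits.isdigit():
            raise ValueError("record number expected after '#': %r" % text)
        return ("record", int(digits))
    return ("name", text)
```

Note that a purely numeric identifier without the leading # is still treated as a name.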

Also any currently active dialogue boxes that require a contig to be selected can be

updated simply by clicking on a contig in the contig selector or clicking on an entry in

the "Contig Names" list box. For example, if the Edit contig command is selected from

the Edit menu it will bring up a dialogue requesting the identity of the contig to edit. If

the user clicks the left mouse button on a contig in the contig selector window, the contig

editor dialogue will automatically change to contain the name of the selected contig. Some

commands, such as the Contig Editor, can be selected from a popup menu that is activated

by clicking the right mouse button on the contig line in the Contig Selector or clicking the

right mouse button on the corresponding name within the "Contig List" list box. This

simultaneously deﬁnes the contig to operate on and so the command starts up without

dialogue.

Several contigs can be selected at once by either clicking on each contig with the left

mouse button or dragging out a selection rectangle by holding the left mouse button down.

Contigs which are entirely enclosed within the rectangle will be selected. Alternatively,

selecting several contigs from the "Contig Names" list box will also result in each contig

being selected. Selected contigs are highlighted in bold. Selecting the same contig again

will unselect it.

The currently selected contigs are also kept in a ’list’ named contigs.

1.2.1.2 Changing the Contig Order

The order of contigs is shown by the order of the lines representing them within the Contig

Selector. The order of contigs can be changed by moving these lines using the middle

mouse button, or Alt left mouse button. Several contigs may be moved at once by selecting

several contigs using the above method. After selection, move the contigs with the middle

mouse button, or Alt left mouse button, and position the mouse cursor where you want the

selection to be moved to. Upon release of the mouse button the contigs will be shuﬄed to

reﬂect their new order. The separator line at the point the contig was moved from increases

in height.
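
The effect of moving a selection on the stored order can be sketched in Python. This is only a model of the reordering, under the assumption that the moved contigs keep their relative order and are dropped immediately before a target contig; the function name move_contigs and the contig names are hypothetical.

```python
def move_contigs(order, selected, target=None):
    """Return a new contig order with `selected` moved before `target`.

    order    -- list of contig names in their current left-to-right order
    selected -- set of contig names being dragged
    target   -- contig the selection is dropped in front of, or None to
                move the selection to the right-hand end
    """
    moving = [c for c in order if c in selected]
    remaining = [c for c in order if c not in selected]
    # A target that is itself selected is not a valid drop point.
    idx = len(remaining) if target is None else remaining.index(target)
    return remaining[:idx] + moving + remaining[idx:]
```

For example, dragging c1 and c2 in front of c4 in the order c1, c2, c3, c4 gives c3, c1, c2, c4.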

The contig order is saved automatically whenever a contig is created or removed (eg auto

assemble), including operations like disassemble which temporarily create contigs. The order

can be saved manually using the Save Contig Order option on the File menu.

1.2.1.3 The Contig Selector Menus

The File menu contains only one command; "Exit". This simply quits the contig selector

display.

The View menu gives access to the Results Manager (see Section 2.13 [Results Manager],

page 277), allows contigs to be selected using a list box containing the contig names (See

Section 2.3.1 [Selecting Contigs], page 123), and the list of selected contigs to be cleared.

The Results menu is updated on the ﬂy to contain cascading menus for each of the

plots shown when the contig selector is in its 2D Contig Comparator mode (see Section 2.4

[Contig Comparator], page 126). The contents of these cascading menus are identical to

the pulldown menus available from within the Results Manager.

1.2.2 Contig Comparator

Gap5 commands such as Find Internal Joins (see Section 2.8.3 [Find Internal Joins],

page 227) and Find Repeats (see Section 2.8.4 [Find Repeats], page 233) automatically

transform the Contig Selector (see Section 2.3 [Contig Selector], page 123) to produce the

Contig Comparator. To produce this transformation a copy of the Contig Selector is added

at right angles to the original window to create a two dimensional rectangular surface

on which to display the results of comparing or checking contigs. Each of the functions

plots its results as diagonal lines of diﬀerent colours. If the plotted points are close to the

main diagonal they represent results from pairs of contigs that are in the correct relative

order. Lines parallel to the main diagonal represent contigs that are in the correct relative

orientation to one another. Those perpendicular to the main diagonal show results for which

one contig would need to be reversed before the pair could be joined. The manual contig

dragging procedure can be used to change the relative positions of contigs. See Section 2.3.2

[Changing the Contig Order], page 125. As the contigs are dragged the plotted results will

be automatically moved to their corresponding new positions. This means that if users

drag the contigs to move their plotted results close to the main diagonal they will be

simultaneously putting their contigs into the correct relative positions.

By use of popup menus the plotted results can be used to invoke a subset of commands.

For example if the user clicks the right mouse button over a result from Find Internal Joins

a menu containing Invoke Join Editor (see Section 2.6.15 [The Join Editor], page 196) and

Invoke Contig Editors (see Section 2.6 [Editing in Gap4], page 160) will pop up. If

the user selects Invoke Join Editor the Join Editor will be started with the two contigs

aligned at the match position contained in the result. If required one of the contigs will be

complemented to allow their alignment.

A typical display from the Contig Comparator is shown below. It includes results for

Find Internal Joins in black, Find Repeats in red and Sequence Search in green. The

currently highlighted item is shown in pink with a summary at the bottom of the screen.

The orientation of this is from top-left to bottom-right indicating that the match is in the

same orientation within both contigs (we can see some in the opposite orientation indicating

that we need to reverse complement either of the two contigs before attempting any joins,

although this will happen automatically). The crosshairs show the positions for a pair of

contigs. The vertical line continues into the Contig Selector part of the display, and the

position represented by the horizontal line is also duplicated there.

1.2.2.1 Examining Results and Using Them to Select Commands

Moving the cursor over plotted results highlights them, and the information line gives a

brief description of the currently highlighted match. This is in the form:

match name:contig1 number@position in contig1, with contig2 number@position in contig2,

length of the match

For Find Internal Joins the percentage mismatch is also displayed.

Several operations can be performed on each match. Pressing the right mouse button

over a match invokes a popup menu. This menu will contain a set of options which depends

on the type of result to which the match corresponds. The following is a complete list, but

not all will appear for each type of result.

Information

Sends a textual description of the match to the Output Window.

Hide Removes the match from the Contig Comparator. The match can be revealed

again by using "Reveal all" within the Results Manager.

Invoke contig editors

Invoke join editors

When invoked these options bring up their respective displays to show the

match in greater detail.

Remove Removes the match from the Contig Comparator. The match cannot be

revealed again by using "Reveal all" within the Results Manager.

One of the items in the popup menu may have an asterisk next to it. This is the default

operation which can also be performed by double clicking the left mouse button on the

match. For Repeat or Find Internal Joins matches this will normally be the Join Editor,

or two Contig Editors when the match is between two points in the same contig.

The crosshairs can be toggled on and oﬀ and a diagonal line going from top left to bottom

right of the plot can also be displayed if required. This is useful as a guide for moving the

contigs such that their matches lie upon the diagonal line.

The "Results" menu on the contig selector window provides a similar mechanism of

accessing results, but at the level of all matches in a particular search. This is simply a

menu driven interface to the Results Manager window (see Section 2.13 [Results Manager],

page 277), but containing only the results relevant to the contig comparator window.

1.2.2.2 Automatic Match Navigation

The "Next" button of the contig comparator window automatically invokes the default

operation on the next match from the current active result. This provides a mechanism to

step through each match in turn ensuring that no matches have been missed.

With a single result (set of matches) plotted, the "Next" button simply steps through

each match in turn until all have been seen. Moving the mouse above the "Next" button,

without pressing it, highlights the next match and displays brief information about it in

the status line at the bottom of the window. To step through the matches in "best first"

order, select the "Sort Matches" option from the relevant name in the Results menu. The

exact order is dependent on the result in question, but is generally arranged to be the most

interesting ones ﬁrst.

Bringing up another result now directs "Next" to step through each of the new matches.

To change the result that "Next" operates on, use the Results menu to select the "Use for

’Next’" option in the desired result. Alternatively, double clicking on a match also causes

"Next" to process the list starting from the selected result.

The "Next" scheme remembers any matches that have been previously examined, either

by itself or by manually double clicking, and will skip these. To clear this ’visited’

information select "Reset ’Next’" in the Results Manager.
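
The behaviour of "Next" can be modelled with a short Python sketch. The class MatchNavigator and its method names are hypothetical, and the scoring function used for "best first" ordering is an assumption, since the real order depends on the type of result.

```python
class MatchNavigator:
    """Step through matches once each, skipping any already visited."""

    def __init__(self, matches):
        self._matches = list(matches)   # matches must be hashable
        self._visited = set()

    def sort_best_first(self, score):
        """Order matches so the highest-scoring ones come first."""
        self._matches.sort(key=score, reverse=True)

    def visit(self, match):
        """Record a match examined manually (e.g. by double clicking)."""
        self._visited.add(match)

    def next(self):
        """Return the next unvisited match, or None when all are seen."""
        for match in self._matches:
            if match not in self._visited:
                self._visited.add(match)
                return match
        return None

    def reset(self):
        """Forget the 'visited' information, as "Reset 'Next'" does."""
        self._visited.clear()
```

Keeping the visited set separate from the match list means that re-sorting the matches does not lose track of which ones have already been examined.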

1.3 Template Display

The template display is a graphical overview of a single contig. It allows us to see how much

data we have, how long the fragments are and how they relate to each other (whether they

are forming valid pairs).

The window consists of one or more tracks, by default showing the reading template

layout at the top and a sequence / read-pair coverage plot at the bottom. The Tracks menu

allows us to turn these on and oﬀ.

Below the main menu bar is a series of buttons that bring up new dialogues for controlling

how the data is to be displayed and what is to be displayed.

Then comes a graphical plot per track. A cross-hair automatically tracks the cursor, indicating

the X and Y coordinates (in appropriate units) in the status line at the bottom of

the window. The track displays can be moved by either using the horizontal and vertical

scrollbars at the bottom and right hand edges of the window, or by clicking and dragging

the contents of the window. While dragging, the display will not update to show newly

visible regions of a contig until the left mouse button is released.

Finally the bottom contains a scrollbar and ruler for positioning and a series of controls.

The X scale simply controls how many base-pairs of the contig are covered by the window.

The X scale number is arbitrary, but is interpreted in an exponential manner so it is easy

to rapidly zoom in or zoom out. All other controls in the bottom panel do not aﬀect the

reading coverage track, so they are covered in the template track section below.

1.3.1 Filtering data

By default all templates are used for drawing the tracks, but there are times when we may

wish to focus on speciﬁc problem data or to exclude it from our graphics.

The Filter button at the top of the Template Display brings up the dialogue shown

above. Changes made in this dialogue either take effect on the display immediately (when

“Auto update” is enabled) or only when we hit Apply, or OK to dismiss the dialogue.

The Pairs: section allows us to select either reads on all templates, reads that are the sole

read for that template, or reads that are paired on a template. Note that the deﬁnition of a

pair here is strictly dependent on how many reads for a template are in the gap5 database

rather than the library preparation strategy. So a paired-end template for which only one

read is in the gap5 database (perhaps due to failure to map) is classiﬁed as “single”.

The Consistency section can be used to select all, consistent only or inconsistent only

data. This requires read-paired data (single reads cannot be inconsistent and so are considered

as consistent). The interpretation of inconsistent currently is that the two reads of a pair

do not point towards one another, but in future releases this is planned to check the correct

orientation for that library type as for some constructions it is normal to have reads pointing

in the same orientation.

The Spanning section governs whether to display read pairs with one read in this contig

and the other read in another contig. Handling templates with more than two reads is still

on-going work, but when ﬁnished a spanning read-pair will be one with any read not in this

contig.

Underneath these are two sliders applied in addition to the above ﬁlters. They allow

removal of any read or read-pair (depending on the type of data being plotted) with a

mapping quality outside the selected range.
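The filter rules described in this section can be sketched in Python as follows. The Read and Template classes, the field names and the use of the average mapping quality for a read-pair are assumptions made for illustration and do not mirror Gap5's internal data structures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Read:
    start: int                   # left-most base in the contig
    end: int                     # right-most base in the contig
    forward: bool                # True if on the forward strand
    mapq: int                    # mapping quality
    in_this_contig: bool = True  # False if the read lies in another contig


@dataclass
class Template:
    reads: List[Read]


def is_paired(t):
    # "Paired" depends only on how many reads are in the database,
    # not on the library preparation strategy.
    return len(t.reads) >= 2


def is_consistent(t):
    # Single reads cannot be inconsistent, so they count as consistent.
    local = sorted((r for r in t.reads if r.in_this_contig),
                   key=lambda r: r.start)
    if len(local) < 2:
        return True
    # Consistent means the two ends point towards one another.
    return local[0].forward and not local[-1].forward


def is_spanning(t):
    return any(not r.in_this_contig for r in t.reads)


def keep(t, pairs="all", consistency="all", spanning=True,
         min_mapq=0, max_mapq=255):
    """Return True if template t passes all of the selected filters."""
    if pairs == "single" and is_paired(t):
        return False
    if pairs == "paired" and not is_paired(t):
        return False
    if consistency == "consistent" and not is_consistent(t):
        return False
    if consistency == "inconsistent" and is_consistent(t):
        return False
    if not spanning and is_spanning(t):
        return False
    # Assumption: a read-pair is judged on its average mapping quality.
    mapq = sum(r.mapq for r in t.reads) / len(t.reads)
    return min_mapq <= mapq <= max_mapq
```

Note that the mapping quality sliders are applied last, in addition to the Pairs, Consistency and Spanning choices.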

1.3.2 Template plot

This is the main body of the template display window. The default plot will be showing read-

pairs, mainly coloured by mapping quality with the insert size governing the Y coordinate.

Larger inserts are at the bottom of the track while shorter ones are at the top.

16 The Staden Package Manual

The colours used are as follows:

blue This is a template with only one reading present. It could be either a pair with

one end not in this assembly, or a true single-ended sequencing experiment. The

horizontal size of the line is now the length of the individual sequence rather

than the computed length of the insert.

orange This is a template with one reading present in another contig. The size of the

line is derived from the size of the data in this contig (typically a single reading).

red This template is considered as inconsistent in some manner, typically due to

the relative position and orientation of the forward and reverse sequences being

incorrect.

grey (variety of)

Any consistent read-pair is coloured by the mapping quality, by default using the

average of the individual sequence mapping qualities. Lighter shades represent

higher mapping qualities.
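The colour rules listed above amount to a small decision procedure, sketched here in Python. The function signature, the order in which the tests are applied and the maximum mapping quality of 60 used to scale the grey level are assumptions for this example only.

```python
def template_colour(n_reads_here, other_end_elsewhere, consistent,
                    mapqs, max_mapq=60):
    """Return a colour name, or a grey level as a hex string.

    n_reads_here        number of reads of the template in this contig
    other_end_elsewhere True if another read is in a different contig
    consistent          result of the consistency check for a pair
    mapqs               mapping qualities of the reads in this contig
    """
    if other_end_elsewhere:
        return "orange"
    if n_reads_here == 1:
        return "blue"
    if not consistent:
        return "red"
    # Consistent pair: grey, lighter for higher average mapping quality.
    mean = sum(mapqs) / len(mapqs)
    level = round(255 * min(mean, max_mapq) / max_mapq)
    return "#{0:02x}{0:02x}{0:02x}".format(level)


if __name__ == "__main__":
    print(template_colour(2, False, True, [60, 20]))
```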

The row of scale bars at the bottom of the window control how data is to be plotted.

They are:

X Scale Controls how many base-pairs in the contig to plot. Higher values indicate more

base pairs, but with an exponentially growing scale.

Y Magniﬁcation

Governs the amount of vertical space consumed by the template track. This

has no impact on the depth track.

Y Oﬀset Adds a small shift to the Y position of data prior to plotting. This is of little

use unless Separate Strands has also been selected, whereupon this allows the

two halves of the plot to be brought closer together. (Effectively this means the

plot can go from -1000 to -100 and +100 to +1000 instead of -1000 to +1000

with a blank area in the middle if our sequences are a minimum of 100 bases

long.)

Stacking Y Size

Only of use in Stacking Y-Position mode. This vertically groups together data

of similar length, allowing a basic approach of separating short-read and long-

read technologies. The Y layout is performed in steps of “Stacking Y Size”. To

pack reads tightly together regardless of length, set this to the maximum value

possible.

Y Spread This adds a small perturbation to the computed Y coordinates of lines in the

template track. When the Y coordinate is derived based on the insert size of

the read-pair it is not always clear whether a line represents a single item or

many items stacked perfectly on top of one another. The Y spread control

compensates for this.

Template track with Y spread of 0.

Template track with Y spread of 50.
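The effect of the Y Spread control can be imitated with a small random perturbation, as in the following Python sketch. Treating the control value as the total width of the perturbation, and using a uniform distribution, are assumptions; the manual does not state how Gap5 computes the offset.

```python
import random


def spread_y(y_values, y_spread, seed=0):
    """Add a uniform jitter of total width y_spread to each Y coordinate.

    With y_spread = 0 the coordinates are returned unchanged, so items of
    identical insert size are drawn exactly on top of one another.
    """
    rng = random.Random(seed)   # seeded so the plot is reproducible
    half = y_spread / 2.0
    return [y + rng.uniform(-half, half) for y in y_values]


if __name__ == "__main__":
    print(spread_y([300, 300, 300], 0))
    print(spread_y([300, 300, 300], 50))
```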

1.3.2.1 Controlling The Y Layout.

The layout and type of data in the template track can be controlled using the Template

button at the top of the main template display window.

The Y Position section controls how the Y coordinates are computed when plotting data

(with X being tied to the position in the assembly or reference). It can be one of three

settings.

Template size

The default mode. The size of an object is deﬁned to be the number of bases it

spans. This is normally the size of a read-pair, or if the pair spans contigs or if

only readings are shown it is the size of a single reading instead. Larger objects

are at the bottom of the window. This Y method very clearly reveals indels in

a mapped assembly. It sometimes also reveals misassemblies.

Given that items of identical size will stack on top of one another, of particular

use to this display mode is the Y Spread control in the main window.

Stacking

A more traditional view - each and every item is allocated its own

non-overlapping Y coordinate (although low Y magniﬁcations may imply these

are drawn at the same Y pixel).

It is still possible to partially group items by their insert size using the “Stacking

Y Size” control in the main window.
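One simple way to obtain a stacking layout of this kind is a greedy first-fit row allocation, optionally applied separately to groups of similar size. The Python sketch below is an illustration under that assumption; the function names are hypothetical and Gap5 may use a different algorithm.

```python
def stack_rows(items):
    """Give each (start, end) item a row so that no two items in the
    same row overlap. Returns one row index per item, in input order."""
    row_ends = []                     # right-most end used in each row
    rows = [0] * len(items)
    for i in sorted(range(len(items)), key=lambda i: items[i][0]):
        start, end = items[i]
        for r, last_end in enumerate(row_ends):
            if last_end < start:      # first row with free space
                row_ends[r] = end
                rows[i] = r
                break
        else:                         # no row had space: open a new one
            row_ends.append(end)
            rows[i] = len(row_ends) - 1
    return rows


def stack_by_size(items, stacking_y_size):
    """Group items into size bins of width stacking_y_size, then stack
    each bin in its own block of rows, smaller sizes first."""
    bins = {}
    for i, (start, end) in enumerate(items):
        bins.setdefault((end - start + 1) // stacking_y_size, []).append(i)
    rows = [0] * len(items)
    offset = 0
    for key in sorted(bins):
        members = bins[key]
        local = stack_rows([items[i] for i in members])
        for i, r in zip(members, local):
            rows[i] = offset + r
        offset += max(local) + 1
    return rows
```

With a very large bin width every item falls into the same group, which corresponds to packing reads tightly together regardless of length.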

Mapping Quality

Finally we can display data collated by the mapping score. This is typically only

available for mapped assemblies. This plot sometimes helps to reveal regions

where all the data present is of poor mapping quality, indicating a likely repeat.

Adjacent to the Y Position frame is the Colour frame. This controls the colour of the

lines drawn in the template display rather than their location.

Combined mapping quality

Minimum mapping quality

Maximum mapping quality

For templates with multiple reads visible, we have a variety of mapping qualities.

Often these individual sequence mapping qualities will diﬀer, but we wish to

draw a single line for the template with a single colour. These three methods

control whether we take the average, minimum or maximum values from the

individual sequences on this template.

Reads The line typically represents the entire span of the insert, but we may not have

sequence data for all of the template. This colour mode will also draw the

portions of the template that we have known sequence for, in green for forward

strand sequences and magenta for reverse strand sequences. Any remaining

portion of template between the reads is drawn using the combined mapping

quality.
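The three mapping quality colour modes differ only in how the per-read values are reduced to a single value for the template. A minimal Python sketch, with a hypothetical function name and mode strings:

```python
def template_mapq(mapqs, mode="combined"):
    """Reduce the mapping qualities of a template's reads to one value."""
    if mode == "combined":
        return sum(mapqs) / len(mapqs)   # average of the reads
    if mode == "minimum":
        return min(mapqs)
    if mode == "maximum":
        return max(mapqs)
    raise ValueError("unknown mode: " + mode)


if __name__ == "__main__":
    for mode in ("combined", "minimum", "maximum"):
        print(mode, template_mapq([60, 20], mode))
```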

At the bottom of this dialogue is a row of check buttons.

“>>Acc” enables accurate mode, but be warned this can be very slow. When the template

display is drawn it fetches all data within the visible portion plus a little bit either side.

From this reads from the same template are paired up. However when a template spans

a substantially larger range than is shown we may only have fetched one read for this

template. We do know that such a template forms a pair, but we do not know the exact

location of the other end or even whether it is in this contig. The assumption is that it is

not, and the template is drawn in orange. Enabling accurate mode will work out the precise

location of the other end and if it is present elsewhere within this contig then the insert size

will be correctly determined and the plot adjusted accordingly.

The “Reads” checkbutton (not to be confused with the Reads colour selector) disables

all drawing of read-pairing and template lines, instead drawing lines to represent the known

DNA sequence.

“Y-log scale” controls whether we plot our Y values using log or linear scales.

“Separate strands” attempts to classify all templates as coming from the top or bottom

strand of DNA (based on the orientation of the sequences on that template, although

sometimes these are conﬂicting). It then splits the plot in two, forming an approximate

mirror image. This may be of use in some transcriptome sequencing experiments.

1.3.3 Depth / Coverage Plot

The depth track shows coverage of both individual readings and read-pairs, where a read-

pair counts as +1 coverage over the entire length it spans rather than just the portion

directly sequenced.

The ﬁlter options for (in)consistent read pairs also apply here, giving the option to only

show depth of consistent pairs.
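A depth calculation of this kind can be sketched in Python using a difference array. Coordinates are assumed to be 1-based and inclusive, and every span is assumed to lie within the contig; the function name is hypothetical.

```python
def depth(contig_length, spans):
    """Return the coverage at each position 1..contig_length.

    Each span is (start, end). For read-pair depth, pass one span per
    pair covering the whole insert, so the unsequenced middle of the
    template also counts as +1.
    """
    delta = [0] * (contig_length + 2)
    for start, end in spans:
        delta[start] += 1        # coverage rises at the start
        delta[end + 1] -= 1      # and falls just after the end
    coverage, running = [], 0
    for pos in range(1, contig_length + 1):
        running += delta[pos]
        coverage.append(running)
    return coverage


if __name__ == "__main__":
    # Reads at 2..4 and 7..9 forming one pair are counted as span 2..9.
    print(depth(10, [(2, 9)]))
```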

1.4 Editing in Gap5

The Gap5 Contig Editor is designed to allow rapid checking and editing of characters

in assembled readings. Very large savings in time can be achieved by its sophisticated

problem ﬁnding procedures which automatically direct the user only to the bases that

require attention. The following is a selection of screenshots to give an overview of its use.

The ﬁgure above shows a screendump from the Contig Editor showing the consensus for

a small region of a contig and the aligned reads. The main components are, top-most menu

bar; common buttons and controls beneath this; the main name and sequence panels to the

left and right; scrollbars and jog-control; a status text line at the bottom.

The names panel on the left can show either reading names or a small ASCII diagram

representing their position, orientation and mapping quality as a grey-scale. The sequences

to the right in the screenshot have base quality shown in grey (dark being poor, light being

good) with disagreements to the consensus at the top shown in blue. The consensus line

also shows base qualities. You may notice we have a mixture of long and short sequences,

with the longer ones being at the top. This screenshot is from a mixed assembly of Illumina

short-read data and ABI Sanger-method capillary sequences.

One base is drawn in inverse video (a “G”). This is the current location of the editing

cursor. We can move this with the arrow keys or by clicking with the left mouse button. It behaves

much like the editing cursor in a word processor and need not be visible in the portion of

the contig we are viewing.

Also visible is a set of bases coloured yellow. These are an OLIGO annotation. Gap5

supports a wide variety of annotation types (often also referred to as “tags”). These are

covered later in more detail.

This ﬁgure is an example of the Trace Display showing three capillary traces and an

Illumina trace from readings in the previous Contig Editor screendumps. Note that this

demonstrates the possibility of showing the raw trace data for new short-read sequencing

technologies, but typically this is not available due to the high storage size.

1.4.1 Moving the visible segment of the contig

The contig editor displays only one segment of the entire contig, although several contig

editors can be in use at once. Below the sequence is a scrollbar and below that a “jog”

control. The scrollbar behaves as expected, allowing rapid positioning anywhere within the

contig using the middle mouse button or left-clicking and dragging the slider. However with

extremely long contigs (for example 100Mb) it can become tricky to move by the desired

amount. Each pixel on the scrollbar may represent 100Kb worth of data, so dragging the

scrollbar is only approximate positioning. Equally so clicking in the trough to move a

screen-full at a time can be too small. This is where the jog-control can be of use.

By default this is always centred. Clicking and dragging this left or right starts to scroll

the editor, at a speed proportional to how far away from the centre the jog is dragged.

Releasing the mouse button stops automatically scrolling and recentres the jog control.

The ﬁnal, more precise, manner of positioning the editor view is with the text entry box

in the bottom left corner. Type in any coordinate here and press return to jump straight

to that location. Note however that Gap5’s coordinates are currently always in padded

form; that is to say that a gap in the consensus caused by an insertion in one of the aligned

sequences is still counted as a base position.

For particularly deep displays the vertical scrollbar on the right edge of the window will

also be useful. While scrolling in X, the editor attempts to keep the same sequences visible

on screen. To do this it may automatically adjust the Y scrollbar for you due to changing

layout of sequences. (By default the top-most sequence is always the sequence that starts

furthest left and the bottom most is the sequence starting furthest right.)

If you have a mouse wheel, this may also be used for small scrolling. By itself it scrolls

in Y one sequence at a time. With the Control key held down it scrolls in larger incre-

ments. Using the Shift key in conjunction with the mouse wheel scrolls in X instead, with

Shift+Control to scroll in larger increments.

The displayed portion of the contig is separate from the current location of the editing

cursor. This is displayed as a black rectangle with typically a light coloured letter inside it.

Any editing keys operate on the base underneath this or to the base immediately preceding

it for Delete. We cover the topic of editing later (see Section 2.6.3 [Editing], page 165),

however moving the editing cursor is also another way of scrolling the editor.

Finally the Page Up and Page Down keys scroll the editor left or right by 90% of current

screen width. Used with Shift it moves in increments of 1Kb, with Control in increments

of 10Kb and with both Shift and Control in increments of 100Kb. The Home and End

keys jump to the start or end of the current item underneath the editing cursor - either a

sequence or the consensus.
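The Page Up and Page Down step sizes can be summarised as a small Python function. This is only a restatement of the rules above; the function name is hypothetical and 1Kb is assumed to mean 1000 bases.

```python
def page_step(screen_width, shift=False, control=False):
    """Number of bases scrolled by one Page Up or Page Down key press."""
    if shift and control:
        return 100_000
    if control:
        return 10_000
    if shift:
        return 1_000
    # No modifier: 90% of the current screen width.
    return screen_width * 9 // 10


if __name__ == "__main__":
    print(page_step(80), page_step(80, shift=True))
```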

1.4.2 Names

At the left side of the editor window is the “names panel”. This either displays an ASCII

pictorial summary of the sequence layout or the actual sequence names themselves depend-

ing on the settings in use. Between the names panel and the sequences panel is a vertical

line, visible at the right edge of the above image. This can be dragged left and right to

adjust the proportion of display dedicated to the names and sequence panels.

The default name display looks like this:

This plot is a mini diagram of the way the sequences overlap. Here the > and < symbols

represent the start of sequences, assembled on either the forward or reverse strand, with

the ... sections reﬂecting their relative lengths. The background shading indicates the

mapping quality of the sequence (which may not be available in many cases, depending on

how the assembly was derived). This should indicate the likelihood that the sequence has

been assembled to the correct point. Sequence that appears to map elsewhere, e.g. due to

a repeat, will be dark grey while unique sequence will be light grey or white. Moving the

mouse cursor over a sequence will tell you the precise mapping quality along with additional

information such as the sequence name, the technology used (Sanger, Illumina, 454, etc),

and whether it is part of a pair of sequences.

In the editor Settings menu is a checkbox labelled “Pack Sequences”. When checked we

permit multiple sequences to be drawn in the same row. Unchecking this reverts to the

Gap4 style of display where each sequence has its own dedicated row. This also has an

effect on the names panel, which switches to showing the sequence names, as below.

This still uses the > and < symbols to reflect strand and grey scales for representing the

mapping quality. The > and < are now also coloured independently.

• light blue   The read is not paired
• white        Forms a consistent pair
• grey         Paired, but the insert size is too large or too small
• red          Paired, but in an invalid orientation
• orange       Paired, but the other end is in another contig

At the bottom of the names panel is an editable text ﬁeld containing the current display

position. Adjacent to this is a small “P” indicating these coordinates are “padded”. Clicking

this will alternate with “R” to indicate reference coordinates, although these may not be

available in all situations. Note that currently, for speed reasons, it cannot directly display

unpadded coordinates.

Typing into this position entry-box allows us to direct the editor to a speciﬁc location. If

we end the number with “u” it performs an unpadded to padded conversion before jumping

to this location.
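The unpadded to padded conversion triggered by the “u” suffix can be illustrated with the following Python sketch. It assumes that pads are written as “*” in the consensus and that positions are 1-based; the function names are hypothetical, and the linear scan shown here is only practical for short examples.

```python
def unpadded_to_padded(consensus, unpadded_pos):
    """Convert a 1-based unpadded position to a 1-based padded position.

    Pads ('*') count as positions in padded coordinates but are skipped
    when counting unpadded coordinates.
    """
    seen = 0
    for padded_pos, base in enumerate(consensus, start=1):
        if base != "*":
            seen += 1
            if seen == unpadded_pos:
                return padded_pos
    raise ValueError("position beyond end of consensus")


def parse_position(text, consensus):
    """Interpret the text typed into the position entry-box."""
    text = text.strip()
    if text.endswith("u"):
        return unpadded_to_padded(consensus, int(text[:-1]))
    return int(text)   # already a padded coordinate


if __name__ == "__main__":
    print(parse_position("3u", "AC*GT**A"))
```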

Left clicking on a name will toggle the background between the current grey to a shade

of blue (with luminosity once again reﬂecting mapping quality). This indicates that the

sequence name has been added to the “readings” list. Multiple names may be selected and

deselected by pressing and holding the left mouse button while moving the mouse cursor.

In both display modes, pressing the right mouse button brings up a context sensitive

menu containing operations relevant to that speciﬁc sequence. This may contain the fol-

lowing commands.

Copy name to clipboard

Copy #number to clipboard

These copy the sequence name or the record number to the clipboard for use

in a subsequent paste operation. Note that there is no visual cue that this

has happened. The same function may also be achieved by left-clicking and

dragging the mouse horizontally, as if attempting to highlight a region of text.

These two items are also available when right clicking on the Consensus label,

but in this case it copies the contig name or number to the clipboard instead.

Goto... This lists other sequences sharing the same template, such as the other end of

a read-pair. Selecting this command will jump the editor to the left-most base

in that sequence. If the sequence is in another contig then a new editor will be

created, unless one already exists for that contig in which case that other editor

will be moved accordingly.

Join to... In the case of read-pairs that span contigs, the join to function will bring up

the join editor for both contigs involved, automatically complementing the other

contig if appropriate based on the library pair orientation statistics.

Right clicking on the contig name also pops up a menu. In here are options to change

the contig name or the starting coordinate. These options are also available in the editor

Commands menu.

1.4.3 Editing

Editing can take up a signiﬁcant portion of the time taken to ﬁnish a sequencing project.

Gap5 has a selection of searches (see Section 2.6.6 [Searching], page 174) designed to speed

up this process. The problems that require most attention are conﬂicts between good bases.

Where base conﬁdence values are present it should be unnecessary to edit all conﬂicting

bases as, generally, this will amount to adjusting poor quality data to agree with good

quality data in which case the consensus sequence should be correct anyway.

Pads in the consensus should not be considered a problem requiring edits because it

is possible to output the consensus sequence (from the main Gap5 File menu) with pads

stripped out. Obviously poorly deﬁned pads (a mixture of several alignment padding char-

acters and real bases) require checking in the same manner as other poorly deﬁned consensus

bases.

To change a base simply overtype with a new base call, one of a,c,g or t in lowercase.

Alternatively a base can be changed to an alignment padding character by pressing “*”.

These new bases and pads automatically get given a quality value of 100, but see below for

how to adjust this. The consensus cannot be edited in this manner.

To insert a gap into sequence press “i” or the Insert key. At present only alignment pads

can be inserted, not bases, although the pads can subsequently be edited to turn them into

bases. The “i” and Insert keys also permit insertions of gaps into the consensus, which it

achieves by inserting into every sequence aligned at that position.

Bases may be deleted by pressing the Delete or Backspace key. This deletes the base

immediately to the left of the current editing cursor. Note that if Delete or Backspace is

pressed with the editing cursor on the consensus this removes an entire column of data.

Deleting anything other than alignment padding characters (either in sequences or the

consensus) is a dangerous operation needing careful thought. To prevent accidental removal

of data therefore, to delete anything other than “*” you must press Control in conjunction

with Delete or Backspace.

1.4.3.1 Moving the editing cursor

Nearly all editing operations happen at the location of the editing cursor. This cursor

appears as a black block containing the base in a light colour, instead of the usual black

base on a light background.

The simplest mechanism of moving the cursor is using the left mouse button. Alterna-

tively the following keys can be used.

Left arrow or Control b     Move left one base
Right arrow or Control f    Move right one base
Up arrow or Control p       Move up one base
Down arrow or Control n     Move down one base
Control a                   Move editing cursor to start of sequence
Control e                   Move editing cursor to end of sequence
Home                        Move editing cursor to start of sequence
End                         Move editing cursor to end of sequence
Meta or Alt <               Move editing cursor to start of contig
Meta or Alt >               Move editing cursor to end of contig

If any of these move the editing cursor outside of the visible region, the editor will scroll

to accommodate. Control-a and Control-e with the editor on the consensus line will also

jump to the start and end of the contig.

If “Cutoﬀs” are shown (see Section 2.6.3.4 [Adjust the Cutoﬀ Data], page 169) the cursor

may be placed in the cutoﬀ data too. Note that turning oﬀ displaying cutoﬀ data would

then leave the editor on an invisible base, so it is moved to the consensus line instead.

1.4.3.2 Adjusting the Quality Values

Each base has its own quality value. Assembly will allow only values between 1 and 99

inclusive. A quality value of 0 means that this base should be ignored. A quality value of

100 means that this base is deﬁnitely correct and the consensus will be forced to be the

same base type and will be given a consensus conﬁdence of 100. If two conﬂicting bases

both have a quality of 100 the consensus will be a dash with a conﬁdence of 0.

Newly added bases or replaced bases are assigned a quality of 100.
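The special handling of quality 100 can be expressed as a short Python sketch. It covers only the forcing rule described above; the function name is hypothetical, and the normal consensus calculation for qualities between 1 and 99 is deliberately left out.

```python
def forced_consensus(column):
    """Apply the quality 100 rule to one column of the alignment.

    column is a list of (base, quality) pairs. Returns (base, confidence)
    when the rule decides the call, or None when no base has quality 100
    and the normal consensus algorithm should be used instead.
    """
    forced = {base for base, quality in column if quality == 100}
    if not forced:
        return None
    if len(forced) == 1:
        return (forced.pop(), 100)   # consensus forced to this base
    return ("-", 0)                  # conflicting forced bases


if __name__ == "__main__":
    print(forced_consensus([("A", 100), ("C", 40)]))
    print(forced_consensus([("A", 100), ("C", 100)]))
```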

Several keyboard commands are available to edit the quality value of an individual base.

[                     Set quality to 0 and move cursor right
]                     Set quality to 100 and move cursor right
Shift Up-Arrow        Increment quality by 1
Control Up-Arrow      Increment quality by 10
Shift Down-Arrow      Decrement quality by 1
Control Down-Arrow    Decrement quality by 10

Finally note that quality values can also be made visible by clicking on the “Quality”

checkbutton at the top of the editor. This shows the quality by use of a grey scale.

1.4.3.3 Adjusting the alignment coordinates

On rare occasions we may need to move an entire sequence a small amount to achieve an

optimal alignment, rather than simply inserting or deleting pads.

This is achieved by using Control plus the left and right arrow keys while the editing

cursor is anywhere on the sequence.

Control Left-Arrow Shift sequence left

Control Right-Arrow Shift sequence right

1.4.3.4 Adjusting the Cutoﬀ Data

Sequences typically consist of a good quality “used” portion and poor quality “clipped”

or “cutoff” portions at the 5’ and 3’ ends of the sequence. The ends are clipped because

low quality data may contain so many errors that the sequence alignment algorithms are

no longer confident they have the correct bases aligned, or even that the sequence simply

disagrees too much. For short-read sequencing technologies, however, it is quite likely

that we have no cutoff data at all.

By default these are not shown, although you may see blank lines in the display as room

is left for this sequence even when it is not visible. The cutoﬀ data may be displayed by

pressing the “Cutoﬀs” check-button at the top of the editor. The cutoﬀ sequence will then

be displayed in grey. We call the boundary between the cutoﬀ data and the used data the

cutoﬀ position. These positions can be adjusted by pressing the “<” (left cutoﬀ) or “>”

(right cutoﬀ) keys. In both cases the cutoﬀ point is between the base with the editing

cursor and the base to the left of the editing cursor.

Using the “<” and “>” keys with the editing cursor in the consensus performs bulk

versions of these edits by clipping every single sequence to that point. One small difference

here though is that the bulk versions only ever shrink cutoﬀ data and do not grow it.
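For a single sequence, the position of the cutoff relative to the editing cursor can be sketched in Python as below. The Reading class and its fields are assumptions made for this example, positions are 1-based within the reading, and the bulk consensus versions are not modelled.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    length: int
    left: int     # first used base
    right: int    # last used base


def set_left_cutoff(reading, cursor):
    # The boundary lies between the base left of the cursor and the
    # cursor base, so the used data now starts at the cursor base.
    reading.left = cursor


def set_right_cutoff(reading, cursor):
    # The same boundary, so the used data now ends one base to the
    # left of the cursor.
    reading.right = cursor - 1


if __name__ == "__main__":
    r = Reading(length=100, left=1, right=100)
    set_left_cutoff(r, 11)
    set_right_cutoff(r, 91)
    print(r)
```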

< In sequence: set left cutoff position

> In sequence: set right cutoff position

< In consensus: bulk clip left cutoff

> In consensus: bulk clip right cutoff

1.4.3.5 Summary of Editing Commands

A brief summary of these editing operations can be seen below:

Key             Location    Action
--------------  ----------  -----------------------
a,c,g,t,*       Reading     Change base
i, Insert       Reading     Insert pad
Delete          Reading     Delete * to left
Ctrl Delete     Reading     Delete any base to left
Control Left    Reading     Move reading left
Control Right   Reading     Move reading right
[               Reading     Set quality to 0
]               Reading     Set quality to 100
Shift Up        Reading     Incr. quality by 1
Shift Down      Reading     Decr. quality by 1
Ctrl Up         Reading     Incr. quality by 10
Ctrl Down       Reading     Decr. quality by 10
<               Reading     Set left cutoff
>               Reading     Set right cutoff
i, Insert       Consensus   Insert column of pads
Delete          Consensus   Delete * to left
Ctrl Delete     Consensus   Delete any base to left
<               Consensus   Bulk clip left cutoff
>               Consensus   Bulk clip right cutoff

1.4.4 Cut and Paste Control of Sequence

It is possible to highlight an area of a reading or the consensus sequence in preparation for

performing some further action upon it. Examples of such actions are: creating annotations

and pasting into a new window. We call these highlighted areas “selections”. They are

displayed as an underlined region.

The simplest way to make a selection is using the left mouse button. Pressing the mouse

button marks the base beneath the cursor as the start of the selection. Then, without

releasing the button, moving the mouse cursor adjusts the end of the selection. Finally

releasing the button will allow normal use of the mouse again. If while marking a selection

we reach the edge of the window then the editor will automatically start scrolling for us.

Sometimes we may wish to make a particularly long selection, or just extend an existing

selection after we’ve already released the mouse button. This can be done by using shift

left mouse button to adjust the end of the selection. Hence we can mark the start of the

selection using the left button, scroll along the contig to the desired position, and set the

end using the shift left button.

The selection is stored in the “clipboard”. This allows for the usual “cut and paste”

operations between applications, although the contig editor only supports this in one direc-

tion (as it is not possible to “paste” into the window). The mechanism employed for this

follows the usual X Windows standard of using the middle mouse button.

A quick summary of the mouse selection commands follows.

Left button                          Position editing cursor to mouse cursor
Left button (drag)                   Mark start and end of selection
Shift left button                    Adjust end of selection
Middle button (in another window)    Copy selected sequence

1.4.5 Selecting Sequences

The list named “readings” is used for all sequences selected in all editors. This is automat-

ically updated whenever a sequence is selected or deselected.

Individual sequence names can be (de)selected by clicking on them with the left mouse

button, or clicking and dragging out a region. This works well for a few sequences.

If you need to select all readings overlapping a speciﬁc consensus base or a region of

consensus bases, mark the range of the consensus you wish to select over by pressing and

dragging the left mouse button (as if you were going to create an annotation) and then

either right click in the consensus or use the Commands menu to choose Select Reads.

When using the Commands menu you get a dialogue asking for conﬁrmation of the start

and end positions and the option of whether to select sequences that overlap this range

or only those which are entirely contained within that range. When using the right-click

popup on the consensus it simply takes the defaults (overlapping sequences).

Deselection follows the same procedure.
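The difference between the two Select Reads options is the interval test applied to each sequence, as the following Python sketch shows. The function name and the dictionary of read coordinates are assumptions made for illustration; coordinates are inclusive.

```python
def select_reads(reads, start, end, contained_only=False):
    """Return the names of reads selected for the range start..end.

    reads maps a read name to its (read_start, read_end) in the contig.
    By default any overlap selects the read; with contained_only the
    read must lie entirely within the range.
    """
    selected = []
    for name, (read_start, read_end) in reads.items():
        if contained_only:
            hit = start <= read_start and read_end <= end
        else:
            hit = read_start <= end and read_end >= start
        if hit:
            selected.append(name)
    return selected


if __name__ == "__main__":
    reads = {"r1": (1, 50), "r2": (40, 120), "r3": (60, 90), "r4": (130, 200)}
    print(select_reads(reads, 55, 100))
    print(select_reads(reads, 55, 100, contained_only=True))
```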

1.4.6 Annotations

Annotations (or tags) can be placed at any position on readings or on the consensus. They

are usually used to record positions of primers for walking, or to mark sites, such as repeats

or compressions, that have caused problems during sequencing. Each annotation has a type

such as “primer”, a position, a length, a strand (forward, reverse or both) and an optional

comment. Each type and strand has an associated colour that will be shown on the display.

For information on searching for annotations see Section 2.6.6.4 [Searching by Tag Type],

page 175, and Section 2.6.6.3 [Searching by Annotation Comments], page 175.

FIXME: not all of the tag editor features are supported yet; speciﬁcally the Move/Copy

functionality is currently missing.

To create an annotation, make a selection and then select “Create Tag” from the contig

editor commands menu at the top of the editor or by pressing the right mouse button.

See Section 2.6.7 [The Commands Menu], page 177. This will bring up a further window;

the “tag editor” (shown above). The “Type:” button at the top of the editor invokes a

selectable list from which tag types can be chosen. See below.

Use this to select the desired type of annotation.

Next the strand of the annotation can be selected. This will be displayed as one of

“<---->”, “<----”, “---->” and “?----?” indicating both strands, top strand only, bottom

strand only, and stranded but unknown strand respectively. These mirror the GFF strand

deﬁnitions. The comment (the box beneath the buttons) can be edited using the usual

combination of keyboard input and arrow keys. The “Save” button will exit the tag editor

and create the annotation. To abandon editing without creating the annotation use the

“Cancel” button.

To edit an existing annotation, position the editing cursor within an annotation and select

“Edit Tag” from the commands menu. This will be a cascading menu, typically showing

one tag. If multiple tags coincide at the same sequence position you will be able to choose

which tag to edit. Once again the tag editor will be invoked and operates as before. The

F11 key is also a shortcut for editing the top-most tag underneath the editor cursor. When

editing, “Save” will save the changes and “Cancel” will abandon them.

Removing an annotation involves positioning the editing cursor within an annotation and

selecting “Delete Tag” from the commands menu. As with “Edit Tag” this is a cascading

menu to allow you to choose which tag at a specific point to delete. The F12 key is a shortcut

to remove the top-most tag underneath the editor cursor.

As usual, “undo” can be used to undo any of these annotation creations, edits and

removals.

Some tags may contain graphical controls instead of the usual text panel. These are

encoded with the master gap4/5 tag database (GTAGDB) by specifying the default tag

text to be a piece of “ACD” code. A full description of the (modiﬁed for gap4/5) ACD

syntax is not available currently, but it is strongly modelled on the EMBOSS ACD

syntax which has documentation at

http://www.emboss.org/Acd/index.html .

It is possible to add your own tag types by modifying either the system GTAGDB ﬁle

or creating your own GTAGDB ﬁle in your home directory (for all your databases) or the

current directory (for just those in that directory).

For rapid editing and deleting the F11 and F12 keys may be used. These edit and

delete the top-most tag underneath the editing cursor. If you wish to edit or delete the

tag underneath the mouse cursor instead (and hence save a mouse click) use Shift F11 and

Shift F12 for edit and delete.

The Control-Q key sequence may be used to toggle the displaying of tags. Pressing it

once will prevent all tags from being displayed in the editor. This is sometimes useful to see

any colouring information underneath the tag. Pressing Control-Q once more will redisplay

them.

1.4.6.1 Annotation Macros

For rapid annotation, a series of 10 macros may be programmed. Press Shift and a

function key between F1 and F10 to bring up the macro editor. This looks much like the

normal tag editor except that Save is replaced with Save Macro and saving does not actually

create a tag on the sequence. To use the macro, highlight the bases you wish and press the

function key corresponding to that macro - F1 to F10. For a single base pair tag you do

not need to underline a region as the tag will automatically cover the base underneath the

editing cursor. To remember these permanently use the “Save Tag Macros” option in the

“Settings” menu.

If you have an existing tag you wish to rapidly duplicate to many places, use Control

plus a function key to copy the tag underneath the editing cursor to that numbered tag

macro. This is simply a short cut for Shift and the function key, but without needing to

manually replicate the tag type and textual comment.

You may ﬁnd that some function keys are already programmed to do other things (such

as raise or lower windows), depending on the windowing environment in use. If this is the

case either modify the conﬁguration of your windowing system or simply use another macro

key.

Shift F1-F10      Create a tag macro via a dialogue window
Control F1-F10    Create a tag macro from tag at editor cursor
F1-F10            Apply a tag macro (create a real tag)

1.4.7 Searching

The contig editor’s searching ability and its links to the consensus calculation algorithm

are crucial in determining the eﬃciency with which contigs can be checked and corrected.

The consensus is calculated “on the ﬂy” and changes in response to edits. For editing, the

most important search functions are those which reveal problems in the consensus whilst

ignoring all bases that are adequately well determined. The standard search type is therefore

by consensus quality. By default this is done in the forward direction and for a quality value

of 30, although this is configurable by changing the following lines in the gap5rc file.

set_def CONTIG_EDITOR.SEARCH.DEFAULT_TYPE consquality

set_def CONTIG_EDITOR.SEARCH.DEFAULT_DIRECTION forward

set_def CONTIG_EDITOR.SEARCH.CONSQUALITY_DEF 30

Pressing the “Search” button brings up a separate search window. This allows the user

to select the direction of search, the type of search, and a value to search on. The value

is entered into a value text box, then pressing the “search” button performs the search. If

successful, the cursor is positioned accordingly.

The Control-s and Control-r key bindings in the editor are equivalent to searching for

the next or previous match. Both key bindings will bring up the search window if it is not

currently displayed (without performing a search); otherwise they perform the search currently selected

in that window. Additionally, when the mouse focus is in the search dialogue window, the Page

Up and Page Down keys perform the previous and next search respectively.

As is described below, there are several search modes.

1.4.7.1 Search by Annotation Comments

This positions the cursor at the start of the next tag which has a comment containing the

string speciﬁed in the value box. The search performed is a regular expression search, and

certain characters have special meaning. Be careful when your string contains “.”, “*”, “[”,

“]”, “\”, “^” or “$”. The search can be performed either forwards or backwards from the

current cursor position. Searching with an empty value will ﬁnd all tags.
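
The effect of those special characters can be illustrated with a small sketch. It uses Python's re module rather than Gap5's own regular expression engine (whose dialect may differ in detail), and the tag comments are invented for the example.

```python
import re

# Hypothetical tag comments, for illustration only.
comments = ["possible SNP (A/G)", "primer v1.2", "primer v112", "compression?"]

def find_comments(pattern, items):
    """Return the comments that contain a match for the regular expression."""
    rx = re.compile(pattern)
    return [c for c in items if rx.search(c)]

# Unescaped: "." matches any character, so both primer comments are found.
loose = find_comments("v1.2", comments)

# Escaped: the "." is now literal, so only "primer v1.2" is found.
strict = find_comments(re.escape("v1.2"), comments)

# An empty pattern matches every comment, mirroring "find all tags".
everything = find_comments("", comments)
```

In the search dialogue the equivalent of escaping is normally to prefix the special character with a backslash.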

1.4.7.2 Search by Tag Type

This positions the cursor at the start of the next tag of the speciﬁed type. To change the

type, click on the currently listed tag type, which displays a tag type selection dialogue.

The search can be performed either forwards or backwards from the current cursor position.

To ﬁnd all tags, use “Search by Annotation Comments”, with an empty text box.

1.4.7.3 Search by Padded Position

This jumps to a padded location in the editor and is directly equivalent to typing a number

into the position entry box in the bottom left corner of the editor followed by “p”.

It is also possible to do relative searches by prefixing the location with + or -. So +100

will skip ahead 100 bases.
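
The absolute and relative forms can be summarised by the following Python sketch. The function name and arguments are hypothetical; it only illustrates the rule described above and is not Gap5's own code.

```python
def resolve_position(value, current):
    """Turn an entry such as "1500", "+100" or "-25" into a coordinate.

    A leading + or - makes the value relative to the current cursor
    position; otherwise it is taken as an absolute padded position.
    """
    text = value.strip()
    if text[:1] in ("+", "-"):
        return current + int(text)
    return int(text)
```

With the cursor at position 2000, "+100" gives 2100 and "-25" gives 1975, while "1500" jumps directly to 1500.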

1.4.7.4 Search by Unpadded Position

As per the padded search, but this jumps to an unpadded coordinate - essentially the number

of non-* bases since the start of the contig, regardless of whether the ﬁrst consensus base

is labelled as base 1.
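
The relationship between the two coordinate systems can be sketched in Python as below. The sketch assumes the consensus is available as a plain string with "*" for pads and counts columns from 1 at the start of that string; the function names are hypothetical.

```python
def padded_to_unpadded(consensus, padded_pos):
    """Count the non-pad bases in the first padded_pos columns."""
    return sum(1 for base in consensus[:padded_pos] if base != "*")

def unpadded_to_padded(consensus, unpadded_pos):
    """Return the 1-based padded column holding the n-th non-pad base."""
    seen = 0
    for column, base in enumerate(consensus, start=1):
        if base != "*":
            seen += 1
            if seen == unpadded_pos:
                return column
    raise ValueError("unpadded position is beyond the end of the consensus")

consensus = "AC*GT**A"
```

For this consensus, padded column 4 (the G) is unpadded base 3, and unpadded base 5 (the final A) sits at padded column 8.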

1.4.7.5 Search by Sequence

This positions the cursor at the start of the next segment of sequence that matches the

value speciﬁed in the text box. The search is case insensitive, ignores pads, and can allow

a speciﬁed number of mismatches. Unlike Gap4, Gap5’s sequence search only looks in the

consensus sequence. It also operates either forwards or backwards from the current editing

cursor position.
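
A minimal Python sketch of such a search is shown below: case insensitive, skipping pads, and tolerating a given number of mismatches. It is a simple scan written for illustration only, searching forwards; Gap5's own matching code is not described in this manual and may work differently.

```python
def find_sequence(consensus, query, start=0, max_mismatches=0):
    """Return the padded index (0-based) of the next match, or -1.

    Pads ("*") in the consensus are skipped and case is ignored.
    """
    if not query:
        return -1
    # Keep (padded index, base) pairs for the non-pad bases only.
    bases = [(i, b.upper()) for i, b in enumerate(consensus) if b != "*"]
    query = query.upper()
    for k in range(len(bases) - len(query) + 1):
        if bases[k][0] < start:
            continue
        mismatches = sum(1 for j, q in enumerate(query) if bases[k + j][1] != q)
        if mismatches <= max_mismatches:
            return bases[k][0]
    return -1

consensus = "acg*TAC*GGT"
```

Here "tacg" is found at padded index 4 even though a pad interrupts the match, and "TTCG" is found at the same place only when one mismatch is allowed.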

1.4.7.6 Search by Reading Name

This positions the cursor at the left end of the reading speciﬁed in the value text box.

Note that not all reading names may be indexed by Gap5 and that the search will not

ﬁnd unindexed names. See tg_index -t for information on creating Gap5 databases with

reading name indices.

The reading name has to be an exact match and so currently does not ﬁnd preﬁx strings.

If multiple sequences exist with the same name (which should be strongly discouraged) then

it is undeﬁned which will be found ﬁrst.

1.4.7.7 Search by Reference InDel

Note: this information may not be available in all scenarios. If you imported the gap5

database from a SAM or BAM ﬁle there is an implicit set of reference coordinates used within

SAM/BAM. Gap5 can keep track of the relationship between gap5’s padded coordinate

system and the reference coordinates. This function uses this data to search for the next

or previous reference insertion or deletion.

1.4.7.8 Search by Consensus Quality

This positions the cursor on the consensus at the next position where the quality of the

consensus is below a given threshold. The quality threshold should be entered into the value

box and should be within the range of 0 to 100 inclusive.

1.4.7.9 Search by Consensus Discrepancy

The consensus algorithm can keep track of the expected number of differences to the consensus

given sequence depth and sequence quality values. This search looks for locations where

the actual number of diﬀerences exceeds the expected amount by more than a speciﬁed

factor.
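
The manual does not give the formula used, so the following Python sketch is only one plausible model: it assumes phred-style qualities, so that a base of quality Q is wrong with probability 10^(-Q/10), and takes the expected number of differences in a column to be the sum of those probabilities. The function names are hypothetical.

```python
def expected_differences(quals):
    """Expected number of miscalled bases in a column (assumed phred model)."""
    return sum(10 ** (-q / 10) for q in quals)

def is_discrepant(column_bases, column_quals, consensus_base, factor):
    """True if the observed differences exceed factor times the expectation."""
    observed = sum(1 for b in column_bases if b != consensus_base)
    return observed > factor * expected_differences(column_quals)
```

Under this model four bases of quality 20 give an expectation of 0.04 differences, so a single disagreeing base in that column is flagged for any modest factor.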

1.4.7.10 Search by Consensus Heterozygosity

The consensus algorithm has a simple heterozygous calling method. Rather than simply

weighing up the evidence for the base being A, C, G, T or a pad it also considers that it

may be a combination of any two of these values. The consensus scores for the individual

bases as well as the highest scoring consensus base can be seen in the editor information

line when the mouse cursor is moved over a consensus base.

This search is looking for consensus bases where the best heterozygous score is greater

than or equal to the speciﬁed value.

1.4.7.11 Search by Low Coverage

This jumps to the next or previous location where the sequence coverage drops below a

speciﬁed value.

1.4.7.12 Search by High Coverage

This jumps to the next or previous location where the sequence coverage is higher than a

specified value. Regions of extreme depth are often an indication of misassemblies.
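
As an illustration of both coverage searches, the Python sketch below computes the depth at each position from a list of read extents and then looks for the next position satisfying a condition. The read coordinates are invented, positions are 0-based and inclusive, and the search is assumed to start one base after the cursor.

```python
def coverage(reads, length):
    """Per-position depth for reads given as (start, end) pairs."""
    delta = [0] * (length + 1)
    for start, end in reads:
        delta[start] += 1       # depth rises where a read starts
        delta[end + 1] -= 1     # and falls just after it ends
    depth, running = [], 0
    for change in delta[:length]:
        running += change
        depth.append(running)
    return depth

def next_where(depth, cursor, predicate):
    """First index after cursor whose depth satisfies predicate, else None."""
    for pos in range(cursor + 1, len(depth)):
        if predicate(depth[pos]):
            return pos
    return None

depth = coverage([(0, 4), (2, 7), (3, 5), (9, 9)], 10)
low = next_where(depth, 0, lambda d: d < 1)    # low coverage search
high = next_where(depth, 0, lambda d: d > 2)   # high coverage search
```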

1.4.8 The Settings Menu

The purpose of this menu is to conﬁgure the operation of the contig editor. Settings can

be saved using the “Save settings” button, which also saves preferences for the editor width

and height and the location of the divider between the names and sequence panels. It does

not save tag macros though; these may be saved separately using the “Save tag macros” option.

Settings for the following options can be changed.

•Group Readings

•Highlight Disagreements

•By dots

•By foreground colour

•By background colour

•Case sensitive

•Set quality threshold

•Pack sequences

•Hide annotations

•Background stripes

•Show Mapping Quality

•Show Template Status

•Padded coordinates

•Reference coordinates

•Save tag macros

•Save settings

1.4.8.1 Group Readings

Sequences have an “X” location in the editor deﬁned by the location within the contig that

they align to. The “Y” location though is determined by the sequence layout algorithm,

governed by the Pack Sequences setting and Group Readings options.

By default sequences are grouped into distinct technologies, typically with longer sequences

up the top (capillary) and shorter ones at the bottom (Illumina, SOLiD). Within

these technology groups the sequences are then sorted by their start location, so the top-most

sequences start earlier and the bottom-most sequences start later.

The Group Readings menu allows user control over these primary and secondary collating

orders. The sorting methods are deﬁned below.

By technology

Sorted in order of unknown, Sanger (capillary), Illumina, SOLiD, 454.

By clipped start

Sorted by the visible (non-cutoﬀ) start position.

By start

Sorted by the start position, regardless of whether the base is in cutoff data or not.

By template

Sorted by template name. In Gap5 this is always deﬁned to be a preﬁx of

the sequence name, or optionally the same as the sequence name. The sort uses a

simple ASCII collation order.

By strand

Sorts data into the top strand first followed by the bottom strand data.

By base

This sort order is different from all others in that it depends on the location of

the editor cursor.

Sorts sequences by the base type overlapping the last editor cursor location in

the consensus. The collation order is A, C, G, T, N and *. Sequences that

do not overlap that consensus location or those that only overlap in the cutoﬀ

portion are not sorted by this method. If this is used as the primary sort then

these other sequences will be sorted using the secondary sort. If the secondary

sort is By Base then an implicit tertiary sort order of By Start is used.

Note that moving the editing cursor around sequences will not update the Y

order. Only placement of the editing cursor on the consensus will update this.
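
The idea of a primary and a secondary collating order can be sketched in Python as a sort on a pair of keys. The record fields, the read names and the rule used to derive a template name are all invented for the example, and the cursor-dependent By base method is left out.

```python
TECH_ORDER = {"unknown": 0, "sanger": 1, "illumina": 2, "solid": 3, "454": 4}

METHODS = {
    "technology": lambda r: TECH_ORDER[r["tech"]],
    "clipped_start": lambda r: r["clip_start"],
    "start": lambda r: r["start"],
    "template": lambda r: r["template"],        # plain string (ASCII) order
    "strand": lambda r: 0 if r["strand"] == "+" else 1,
}

def group_readings(reads, primary, secondary):
    """Sort reads by the primary method, breaking ties with the secondary."""
    first, second = METHODS[primary], METHODS[secondary]
    return sorted(reads, key=lambda r: (first(r), second(r)))

def read(name, tech, start, clip_start, strand):
    """Build a hypothetical read record; the template is assumed to be
    the part of the name before the first full stop."""
    return {"name": name, "tech": tech, "start": start,
            "clip_start": clip_start, "strand": strand,
            "template": name.split(".")[0]}

reads = [
    read("il_b.1", "illumina", 40, 42, "-"),
    read("cap_a.1", "sanger", 120, 125, "+"),
    read("il_a.1", "illumina", 10, 10, "+"),
    read("cap_b.1", "sanger", 15, 30, "-"),
]
by_tech = [r["name"] for r in group_readings(reads, "technology", "start")]
by_strand = [r["name"] for r in group_readings(reads, "strand", "start")]
```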

1.4.8.2 Highlight Disagreements

This toggles between the normal sequence display (showing the current base assignments)

and one in which those assignments that diﬀer from the consensus are highlighted. It makes

scanning for problems by eye much easier.

Several modes of highlighting are available: “By dots” will only display the bases that

diﬀer from the consensus, displaying all other bases as full stops if they match or colons if

they mismatch but are poor quality. The deﬁnition of poor quality here can be adjusted

using the “Set quality threshold” option of the Settings menu. The base colours are as

normal (i.e. reflecting tags and quality).

Highlight disagreements “By foreground colour” and “By background colour” display

all base characters, but colour those that differ from the consensus. Bases which differ

but are below the difference quality threshold are shaded in light blue while high quality

diﬀerences are dark blue. This allows easier visual scanning of the context that a diﬀerence

occurs in, but it may be wise to disable the displaying of tags (hint: control-Q toggles tags

on and oﬀ).

Finally the “Case sensitive” toggle controls whether upper and lower case bases of the

same base type should be considered as diﬀerences.
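
The “By dots” mode and the “Case sensitive” toggle can be mimicked with the short Python sketch below. It assumes that "poor quality" means a quality strictly below the threshold, which the manual does not state, and it is not the editor's actual drawing code.

```python
def by_dots(read, quals, consensus, threshold, case_sensitive=False):
    """Render a read against the consensus in the style of "By dots".

    Matching bases become ".", low quality mismatches become ":" and
    high quality mismatches are shown as the base itself.
    """
    out = []
    for base, qual, cons in zip(read, quals, consensus):
        if case_sensitive:
            same = base == cons
        else:
            same = base.upper() == cons.upper()
        if same:
            out.append(".")
        elif qual < threshold:
            out.append(":")
        else:
            out.append(base)
    return "".join(out)

plain = by_dots("ACgTA", [40, 40, 40, 5, 40], "ACGAT", 30)
strict = by_dots("ACgTA", [40, 40, 40, 5, 40], "ACGAT", 30, case_sensitive=True)
```

The first call gives "...:A". With case sensitivity enabled the lower case g is also reported as a difference, giving "..g:A".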

1.4.8.3 Pack Sequences

This controls whether the editor allocates one row per sequence or whether it is permitted

to pack multiple sequences onto a single row, assuming they do not overlap.

The latter allows for a more compact plot which is desirable when dealing with short

sequences, however it has the side eﬀect that the reading names can no longer be listed in

the names panel to the left.
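
One simple way to pack non-overlapping sequences onto shared rows is a greedy first-fit over reads sorted by start position, sketched in Python below. This is an illustration of the general idea only; the layout algorithm Gap5 actually uses is not described here and may differ.

```python
def pack_rows(reads):
    """Assign each (name, start, end) read to the first row where it fits."""
    rows = []      # rows[i] is the list of read names placed on row i
    row_end = []   # row_end[i] is the end coordinate of the last read on row i
    for name, start, end in sorted(reads, key=lambda r: r[1]):
        for i, last_end in enumerate(row_end):
            if start > last_end:          # no overlap with this row
                rows[i].append(name)
                row_end[i] = end
                break
        else:                             # no existing row had room
            rows.append([name])
            row_end.append(end)
    return rows

rows = pack_rows([("a", 1, 50), ("b", 10, 60), ("c", 55, 100),
                  ("d", 61, 90), ("e", 95, 120)])
```

The five reads in this example fit onto two rows instead of five.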

1.4.8.4 Hide Annotations

Sometimes we need to see the background shading underneath an annotation, for example to

see the base quality or if we have Highlight Disagreements turned on using the “By background

colour” mode. This option simply hides all annotations from display until it is selected again

to reveal them once more.

The Control-Q keyboard shortcut has the same eﬀect.

1.4.9 Primer Selection

The “Find Primer Walk” function from the Commands menu is an interface to the Primer3

program (built in to Gap5, so it does not need an external installation). Currently it only

allows for selection of a single internal oligo suitable for “walking” along a template. It is

designed for manual ﬁnishing work and is not appropriate for automatic ﬁnishing. Future

plans are to add PCR support.

The command brings up its own dialogue window.

The top portion of this window controls where to look for primers. By default it will be

either side of the editing cursor location. We also specify here what strand we wish to run

our experiment on.

Below this are a series of Primer3 parameters. Please see the Primer3 documentation

for a full description of these.

Upon hitting OK, and assuming that some primers can be found, a new window showing

the available choices is presented.

The primers shown are sorted by Primer3 score, with lower being better. Clicking on any

of the other headings in the table allows the data to be re-sorted by that column. Clicking

the left mouse button on any line will show the location of this primer in the main editor

window as an underlined region. It also updates the bottom half of the Oligos window with

further details.

At the bottom of the window are two editable selections. The leftmost, labelled “Seq.

name to tag”, allows us to pick a sequence on which to place an oligo (OLIG) annotation;

it defaults to the consensus sequence. The right selection box, labelled “Template name”,

is a list of identified templates at this region. This list is not necessarily exhaustive, as

it only includes the sequences at this position and may miss some read-pairs that span this

region. If you have a specific template in mind you can also type its name here.

Pressing the “Add annotation” button then creates an oligo annotation. The text associated

with the annotation will depend on the primer chosen, but an example follows.

Sequence AACACATGGTAAAGCAGATG

Template zDH64-714h06

GC 40.0

Temperature 53.45

Score 1.54377204143

Date_picked Thu Aug 12 17:31:18 BST 2010

Oligoname ??
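
Annotation text of this form is easy to process outside Gap5. The Python sketch below assumes, from the example above, that each line holds a field name followed by its value; it parses the text and recomputes the GC percentage from the primer sequence as a check. The function names are hypothetical.

```python
def parse_oligo_comment(text):
    """Split "Name value" lines of an oligo annotation into a dictionary."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if line:
            key, _, value = line.partition(" ")
            fields[key] = value.strip()
    return fields

def gc_percent(seq):
    """Percentage of G and C bases in a sequence."""
    seq = seq.upper()
    return 100.0 * sum(seq.count(b) for b in "GC") / len(seq)

example = """Sequence AACACATGGTAAAGCAGATG
Template zDH64-714h06
GC 40.0
Temperature 53.45
Score 1.54377204143
Date_picked Thu Aug 12 17:31:18 BST 2010
Oligoname ??"""

fields = parse_oligo_comment(example)
```

For the example primer, 8 of the 20 bases are G or C, which agrees with the GC value of 40.0 stored in the annotation.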

1.4.10 Traces

The original trace data from which the readings were derived can be displayed by double

clicking (two quick clicks) with the left or middle mouse button on the area of interest.

Control-t has the same eﬀect. The trace will be displayed centred around the base clicked

upon and the name of the reading in the contig editor will be highlighted. Double clicking

on the consensus displays traces for all the readings covering that position.

Moving the mouse pointer over a trace base causes the display of an information line at

the bottom of the window. This gives the base type, its position in the sequence, and its

conﬁdence value.

There are two forms of trace display which are selected using the “Compact” button at

the top of the Trace display. The compact form diﬀers by not showing the Info, Diﬀ, Comp.

and Cancel buttons at the left of each trace.

Note that Gap5 does not store the trace ﬁles in the project database: it stores only their

names and reads them when required. By default it will attempt to look for them in the

current working directory (likely the same directory as the gap database). However this

can be adjusted to look in other directories or via URLs using “Trace ﬁle location” in the

main Gap5 conﬁgure menu (see Section 2.20.8 [Trace File Location], page 302).

This ﬁgure is an example of the Trace Display showing three capillary traces and an

Illumina trace. On the top line, the Lock checkbutton keeps the trace data in sync with the

editor cursor position. The layout is controlled by the Columns and Rows selectors at the

top of the window; 2 columns by up to 3 rows in the above screenshot. Show confidence draws

coloured bars and a numerical value representing the quality of each individual base-call.

The main trace panels each have the sequence name displayed in the top left corner.

Below this are X and Y zoom controls on the left and the actual trace data on the right.

The style of this will depend on the type of trace. Sanger chromatograms take multiple

samples per base and are subsequently analysed (base-called) to identify the peaks and

the number/type of bases represented by that peak. These are drawn using smooth lines,

examples of which can be seen in the top row of the image above. Illumina GA instruments

are “clocked” in that each and every measurement corresponds to one base. These are

drawn using a stick plot, as seen in the bottom row of the screen-shot. Note that it is quite

likely you will not have the processed trace data available for Illumina GA sequences due

to size constraints, so the above is simply an example of what could be viewed rather than

a typical example.

454 instruments use pyro-sequencing and so produce a variable number of bases per measurement,

with each measurement being clocked to a specific cycle (flow) on the sequencing

instrument. Hence 454 data is also drawn using a stick plot, although with potentially

multiple bases per measurement. An example is visible below.

The horizontal rulers in this plot correspond to normalised peak intensities for 1.0, 2.0

and so on to indicate 1, 2, 3... bases per ﬂow. Clearly visible are ﬂows of approximate

height 1 (C T A G T on the left), 2 (the following AA) and 0 (the G between the left most

C and T). Above these the conﬁdence bars are visible.
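
How normalised flow intensities relate to bases can be shown with a deliberately naive Python sketch that rounds each intensity to the nearest whole number of bases. Real 454 base-callers are considerably more sophisticated, and both the flow order and the intensities used here are assumptions made for the example.

```python
def call_flows(flow_order, intensities):
    """Round each normalised flow intensity to a base count.

    flow_order is the repeating cycle of nucleotides flowed across the
    plate; a flow of height near 2 yields two copies of its base and a
    flow of height near 0 yields none.
    """
    called = []
    for i, signal in enumerate(intensities):
        base = flow_order[i % len(flow_order)]
        count = max(0, int(signal + 0.5))
        called.append(base * count)
    return "".join(called)

sequence = call_flows("TACG", [1.02, 0.05, 0.97, 0.1, 0.0, 2.1, 0.0, 1.04])
```

Here the eight flows yield the five bases TCAAG: three flows of height about 1, one of height about 2 and four empty flows.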

Right clicking on a trace will bring up a popup menu containing the following options.

Information

Displays some basic textual information about the trace. The information avail-

able will vary by trace type, but it may include details such as the length,

instrument and run-date.

Save

Saves the trace in ZTR format to a local file on disk. This can be useful

when you are using a remote service for fetching traces or extracting them from

an archive such as an .sff or .srf file.

Complement

Reverse complements the trace display. This does not modify data in any way,

but simply adjusts how it is drawn.

Quit

Removes this trace from the trace window. If it is the last displayed trace then

the window will be removed too.

1.4.11 The Editor Information Line

The very bottom line of the editor display is a text line used by the editor to display pieces

of useful information. Currently this gives information on individual bases, readings, the

contig, and tags, as the mouse is moved over the appropriate object. Each type of object

we move the mouse pointer over (sequence base, consensus base, sequence name panel,

annotation) has its own list of information to display which can be conﬁgured using a

format string stored in your $HOME/.gap5rc ﬁle.

Typically you will not need to modify these, but if you choose to do so the default values

to start from are shown below.

# Mouse-over a sequence in the reading name panel

set_def READ_BRIEF_FORMAT \

Reading:%n(#%Rn) Tech:%V Length:%l(%L) MappingQ:%m%**/%*m Pos:%S%p / %*S%*p

# Mouse-over the "Consensus" label in the name panel

set_def CONTIG_BRIEF_FORMAT \

Contig:%n(#%Rn) Length:%l Start:%s End:%e

# Mouse-over a base in a sequence

set_def BASE_BRIEF_FORMAT1 \

Base %b confidence:%4.1c (Prob. %Rc, raw %4.1A %4.1C %4.1G %4.1T) Position %Rp %n

# Mouse-over a base in the consensus

set_def BASE_BRIEF_FORMAT2 \

Base confidence:%4.1c (Prob. %Rc) A=%4.1A C=%4.1C G=%4.1G T=%4.1T *=%4.1* Position %p

# Mouse-over an annotation

set_def TAG_BRIEF_FORMAT \

Tag type:%t Comment:"%.100c"

The text output is as listed above, but replacing percent-code strings with a relevant

piece of text. In many cases a capital R indicates raw mode to display a numerical value

instead of a string. For example %n in READ_BRIEF_FORMAT will be replaced by the

sequence name while %Rn will be replaced by the sequence record number. The full syntax

of percent expansion is as follows:

•A percent sign.

•An optional minus sign to request left alignment of the information. When displaying

information in a field where the data does not fill the entire space allowed,

the information will, by default, be right justified. Adding a minus character here

requests left justification.

•An optional minimum ﬁeld width. This is a decimal number indicating how much space

to leave for this information.

•An optional precision for numbers or maximum ﬁeld width for strings. This is given

as a fullstop followed by a decimal number.

•An optional ’R’ to specify Raw mode. This changes the meaning of many (but not all)

of the expansion requests to give a numerical representation of the data. For example

%n is a reading name and %Rn is a reading number.

•The expansion type itself. This is either one or two letters. See below for full details of

their meanings.

To programmers this syntax may seem very similar to printf. This is intentional, but

do not assume it is the same. Specifically the printf syntax of %#, %+ and %0 will not work.
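
The syntax above can be illustrated with a small Python sketch that parses and expands such format strings. It is a simplified model and not Gap5's implementation: it handles only one-character expansion types, and the values are supplied through a hypothetical dictionary keyed by the type and whether raw mode was requested.

```python
import re

# %, optional "-", optional width, optional .precision, optional R, type.
SPEC = re.compile(r"%(-?)(\d*)(?:\.(\d+))?(R?)([A-Za-z#*%])")

def expand(fmt, values):
    """Expand percent codes in fmt using values[(type, raw)] lookups."""
    def replace(match):
        left, width, precision, raw, kind = match.groups()
        if kind == "%":
            return "%"
        value = values[(kind, bool(raw))]
        if isinstance(value, (int, float)):
            # Precision gives the number of decimal places for numbers.
            text = f"{value:.{int(precision)}f}" if precision else str(value)
        else:
            # Precision gives the maximum field width for strings.
            text = str(value)[: int(precision)] if precision else str(value)
        if width:
            text = text.ljust(int(width)) if left else text.rjust(int(width))
        return text
    return SPEC.sub(replace, fmt)

values = {("n", False): "xc04a1.s1", ("n", True): 74,
          ("l", False): 295, ("L", False): 474, ("c", False): 9.5}
brief = expand("Reading:%n(#%Rn) Length:%l(%L)", values)
padded = expand("%6.2c|%-6.3n|%%", values)
```

The first call produces "Reading:xc04a1.s1(#74) Length:295(474)". The second shows right justification of a number, left justification of a truncated string and a literal percent sign.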

1.4.11.1 Reading Information

Used when we move the mouse over a sequence name in the names panel or a sequence

base-call. Example output is Reading:xc04a1.s1(#74) Tech:Sanger Length:295(474) MappingQ:50.

Note that not all expansions make sense when used in the names panel as no

cursor X position is available.

%% A single % sign

%n Reading name. Raw mode: record number

%# Reading record number

%p Position in sequence. Raw mode: position in contig.

%l Clipped sequence length

%L Unclipped sequence length

%s Start of clip

%e End of clip

%S Sense (whether complemented) - “<<” or “>>”. Raw mode: 0/1

%d Strand - “+” or “-”. Raw mode: 0/1

%b Base call

%c Conﬁdence value of called base (phred style). Raw mode: probability

%T Individual conﬁdence (phred style) of A,C,G,T component in log-odds form.

Raw mode: probability value.

%m Mapping Quality. Raw mode: probability of correctly mapped.

%V Instrument type - Sanger, Illumina, SOLiD, 454 or Unknown.

1.4.11.2 Contig Information

For the CONTIG_BRIEF_FORMAT and BASE_BRIEF_FORMAT2 the following expansions

apply. These operate on contigs and the consensus sequence.

%% Single % sign

%n Contig name. Raw mode: contig record number.

%# Contig record number

%p Position in contig

%l Length of contig

%s Contig start coordinate

%e Contig end coordinate

%b Called consensus base

%c Score for called consensus base. Raw mode: probability value

%* Individual conﬁdence for A,C,G,T,* base types in log-odds form. Raw mode:

as a probability value.

1.4.11.3 Tag Information

The TAG_BRIEF_FORMAT string is used to display annotation summaries. The possible

percent encodings are as follows.

%% Single % sign

%p Tag position

%t Tag type (always 4 characters)

%l Tag length

%# Tag number (0 if unknown)

%c Tag comment

1.4.12 The Join Editor

Contigs are joined interactively using the Join Editor. This is simply a pair of contig editor

displays stacked above one another. The top editor is ﬂipped in Y so that the consensus

appears at the bottom. This allows the two consensus sequences to be adjacent to one

another, separated only by a “diﬀerences” line. Note that it is essential to align the contigs

over the full length of their overlap. It is much more diﬃcult to achieve this after a join has

been made, and until the alignment is correct, the consensus sequence will be nonsense.

The few diﬀerences between the Join Editor and the Contig Editor can be seen in the

ﬁgure below. Otherwise all the commands and operations are the same as those for the

Contig Editor.

One diﬀerence is the Lock button. When set (as it is in the illustration) scrolling either

contig will also scroll the other contig.

The Align button aligns the overlapping consensus sequences and adds pads as necessary.

The alignment routine assumes that the two contigs are already in approximately the right

relative position (as they are immediately after the Join Editor has been invoked from Find

Internal Joins, or Find Repeats). If they are not you may get better results by manually

positioning them beforehand.

The “<” and “>” buttons either side of the “Align” button perform the alignment from

the editing cursor to the start of the contig and from the cursor to the end of the contig

only. Alignment end-gaps are penalised at the cursor position but not for the alignment

end at the contig start/end position. These buttons are useful for when multiple alignment

positions may be valid, such as is the case with an overlap consisting entirely of a short

tandem repeat.

It should be noted that each of the pair of editors comprising the Join Editor maintains

its own undo history, and using Align is likely to add to both undo histories. There is only

one Undo button, but it applies to the editor last clicked within. A hint is given as to which

of the two editors this is by highlighting the editor in a red border when the mouse is moved

over the Undo button.

Pressing the Join button will display a small dialogue box informing you of the length

and percentage match of the overlap between the two contigs. At this point you can decide

to make the join, to not make the join (both of which remove the editors from the screen)

or to cancel, which leaves the Join Editor visible to permit further editing.
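
The two figures reported in that dialogue can be approximated with the Python sketch below, which compares the overlapping portions of two consensus strings column by column. How Gap5 itself counts pads and ambiguous bases in the percentage is not stated in this manual, so the sketch simply treats every column equally; the function name and the offset argument are hypothetical.

```python
def overlap_stats(top, bottom, offset):
    """Length and percentage identity of the overlap of two consensuses.

    offset is the 0-based column of top at which bottom starts
    (assumed to be zero or positive).
    """
    length = min(len(top) - offset, len(bottom))
    if length <= 0:
        return 0, 0.0
    matches = sum(
        1 for a, b in zip(top[offset:offset + length], bottom[:length])
        if a.upper() == b.upper()
    )
    return length, 100.0 * matches / length

length, percent = overlap_stats("ACGTACGTAC", "CGTTCAAA", 5)
```

In this example the overlap is 5 columns long with 4 matching columns, giving 80 percent.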

1.4.13 Using Several Editors at Once

Several editors can be used simultaneously, even on the same contig. In the latter case, it

is useful to understand the diﬀerence between the data and the view of the data.

Each operating Contig Editor is a view of the data for a particular contig. With two

editors viewing the same contig, making changes in either will modify the data that both

are viewing, hence the change will be visible in both editors. Similarly, using Undo in either

will undo the changes to both.

Interaction between Contig Editors and Join Editors is more complicated and generally

isn’t advised. However such interactions work consistently with the notion of views of

contigs. For example, suppose there are two Contig Editors open on two separate contigs,

and in addition to these a Join Editor displaying both contigs. Making the join in the Join

Editor will update the two stand-alone Contig Editors so that they are each viewing the

correct positions in the new contig, even though they’re both now viewing the same contig.

1.4.14 Quitting the Editor

The Exit operation in the File menu quits the editor. If changes have been made since the

last save you will be asked whether you wish to save these changes. Answering “Cancel”

abandons the exit process and provides control of the editor again, otherwise the appropriate

action will be taken and the editor will close.

1.4.15 Summary

1.4.15.1 Keyboard summary for editing window

(“Left”, “Right”, “Up”, “Down” refer to the appropriate arrow keys.)

Page Up                         Scroll left by 1Kb
Shift-Page Up                   Scroll left by 10Kb
Control-Page Up                 Scroll left by 100Kb
Shift-Control-Page Up           Scroll left by 1Mb
Page Down                       Scroll right by 1Kb
Shift-Page Down                 Scroll right by 10Kb
Control-Page Down               Scroll right by 100Kb
Shift-Control-Page Down         Scroll right by 1Mb
Left arrow or Control-b         Move editing cursor left one base
Right arrow or Control-f        Move editing cursor right one base
Up arrow or Control-p           Move editing cursor up one base
Down arrow or Control-n         Move editing cursor down one base
Control-a or Home               Move editing cursor to start of sequence
Control-e or End                Move editing cursor to end of sequence
Alt-comma                       Move editing cursor to start of contig
Alt-fullstop                    Move editing cursor to end of contig
Control-t                       Display trace
Control-s                       Search forward
Control-r                       Search backwards
Control-q                       Toggle tag display
<                               Set left cutoff clip point (in sequence)
>                               Set right cutoff clip point (in sequence)
<                               Bulk clip left cutoff (in consensus)
>                               Bulk clip right cutoff (in consensus)
[                               Set confidence to 0
]                               Set confidence to 100
Shift Up                        Increase confidence of base by 1
Shift Down                      Decrease confidence of base by 1
Control Up                      Increase confidence of base by 10
Control Down                    Decrease confidence of base by 10
a, c, g, t or *                 Overwrite base with a new call
i or Insert                     Insert pad (or column if in consensus)
Backspace or Delete             Delete padding character
Ctrl-Backspace or Ctrl-Delete   Delete base (any base type)
Control-right arrow             Move sequence right 1 base-pair
Control-left arrow              Move sequence left 1 base-pair
F11                             Edit tag under editing cursor
F12                             Delete tag under editing cursor
Shift F1 to Shift F10           Edit tag macro 1 to 10
Control F1 to Control F10       Copy tag at editing cursor to macro 1 to 10
F1 to F10                       Create tag from macro 1 to 10

1.4.15.2 Mouse summary for editing window

Left button                   Position editing cursor to mouse cursor
Left button (drag)            Mark start and end of selection
Shift left button             Adjust end of selection
Left button (double click)    Display trace
Right button                  Display commands menu
Mouse-wheel                   Vertically scroll the editor
Control mouse-wheel           Vertically scroll the editor, fast
Shift mouse-wheel             Horizontally scroll the editor
Shift Control mouse-wheel     Horizontally scroll the editor, fast

1.4.15.3 Mouse summary for names window

Left button + drag            Copy sequence name to clip-board
Right button                  Display popup menu
Mouse-wheel                   Vertically scroll the editor
Control mouse-wheel           Vertically scroll the editor, fast

1.4.16 Plotting Restriction Enzymes

The restriction enzyme map function ﬁnds and displays restriction sites within a speciﬁed

region of a contig. It is invoked from the gap4 View menu. Users can select the enzyme

types to search for and can save the sites found as tags within the database.

This ﬁgure shows a typical view of the Restriction Enzyme Map in which the results for

each enzyme type have been conﬁgured by the user to be drawn in diﬀerent colours. On

the left of the display the enzyme names are shown adjacent to the lines of plotted results.

If no result is found for any particular enzyme (e.g. APAI here), the line will still be drawn

so that zero cutters can be identified. Three of the enzyme types have been selected and

are shown highlighted. The results can be scrolled vertically (and horizontally if the plot is

zoomed in). A ruler is shown along the base and the current cursor position (the vertical

black line) is shown in the left hand box near the top right of the display. If the user clicks,

in turn, on two restriction sites their separation in base pairs will appear in the top right

hand box. Information about the last site touched is shown in the Information line at the

bottom of the display. At the top the Edit menu is shown torn off and can be used to create

tags for highlighted enzyme types.

1.4.16.1 Selecting Enzymes

Restriction enzyme names and their cut sites are stored in files on disk. For the format

of these ﬁles and notes about creating new ones see Section 11.4 [Restriction enzyme ﬁles],

page 566.

When the ﬁle is read, the list of enzymes is displayed in a scrolling window. To select

enzymes press and drag the left mouse button within the list. Dragging the mouse oﬀ the

bottom of the list will scroll it to allow selection of a range larger than the displayed section

of the list. When the left button is pressed any existing selection is cleared. To select several

disjoint entries in the list press control and the left mouse button. Once the enzymes have

been chosen, pressing OK will create the plot.

1.4.16.2 Examining the Plot

Positioning the cursor over a match will cause its name and cut position to appear in

the information line. If the right mouse button is pressed over a match, a popup menu

containing Information and Conﬁgure will appear. The Information function in this menu

will display the data for this cut site and enzyme in the Output Window.

It is possible to ﬁnd the distance between any two cut sites. Pressing the left mouse

button on a match will display "Select another cut" at the bottom of the window. Then,

pressing the left button on another match will display the distance, in bases, between the

two sites. This is shown in a box located at the top right corner of the window.

1.4.16.3 Reconﬁguring the Plot

The plot displays the results for each restriction enzyme on a separate line. Enzymes with

no sites are also shown. The order of these lines may be changed by pressing and dragging

the middle mouse button, or Alt + left mouse button, on one of the displayed names at the

left side of the screen.

The results are plotted as black lines but users can select colours for each enzyme type

by pressing the right button on any of its matches. A menu containing Information and

Conﬁgure will pop up. Conﬁgure will display a colour selection dialogue. Adjusting the

colour here will adjust the colour for all matches for this restriction enzyme.

1.4.16.4 Textual Outputs

The Results menu of the plot contains options to list the restriction enzyme sites found.

One option sorts the results by enzyme name and the other by the positions of the matches.

The output below shows the textual output from "Output enzyme by enzyme". The

Fragment column gives the size of the fragments between each of the cut sites. The Lengths

column contains the fragment sizes sorted on size.

Contig zf98g12.r1 (#801)
Number of enzymes = 3
Number of matches = 7

Matches found= 1
     Name    Sequence    Position  Fragment  lengths
   1 AATII   GACGT'C         7130      7129      556
                                        556     7129

Matches found= 5
     Name    Sequence    Position  Fragment  lengths
   1 ACCI    GT'CGAC          414       413      189
   2 ACCI    GT'CTAC         1296       882      413
   3 ACCI    GT'CTAC         3871      2575      882
   4 ACCI    GT'CTAC         5816      1945     1681
   5 ACCI    GT'CGAC         7497      1681     1945
                                        189     2575

Matches found= 1
     Name    Sequence    Position  Fragment  lengths
   1 AHAII   GA'CGTC         7127      7126      559
                                        559     7126

The output below shows the textual output from "Output ordered on position".

Contig zf98g12.r1 (#801)
Number of enzymes = 3
Number of matches = 7

     Name    Sequence    Position  Fragment  lengths
   1 ACCI    GT'CGAC          414       413        3
   2 ACCI    GT'CTAC         1296       882      189
   3 ACCI    GT'CTAC         3871      2575      367
   4 ACCI    GT'CTAC         5816      1945      413
   5 AHAII   GA'CGTC         7127      1311      882
   6 AATII   GACGT'C         7130         3     1311
   7 ACCI    GT'CGAC         7497       367     1945
                                        189     2575
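The Fragment and Lengths columns can be reproduced from the cut positions alone. The following Python sketch is an illustration, not Gap5's own code: the function name fragment_lengths is hypothetical, and the contig length of 7685 is not stated in the output but inferred from the listings above, where the fragments for each enzyme sum to that value.

```python
def fragment_lengths(cut_positions, contig_length):
    """Return fragment sizes in positional order for 1-based cut positions."""
    cuts = sorted(cut_positions)
    # Fragment boundaries: base 1, each cut position, and one past the last base.
    bounds = [1] + cuts + [contig_length + 1]
    return [right - left for left, right in zip(bounds, bounds[1:])]


# ACCI cut positions taken from the listing; 7685 is an inferred contig length.
acci = fragment_lengths([414, 1296, 3871, 5816, 7497], 7685)
print(acci)          # Fragment column, in positional order
print(sorted(acci))  # Lengths column, sorted on size
```

Sorting the same list gives the Lengths column, which is why the two columns always contain the same numbers in a different order.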

1.5 Importing and Exporting Data

1.5.1 Assembly

There are two main types of assembly, de novo and mapped, with the latter not really
being a true assembly at all.

De novo assembly assembles DNA fragments, typically without knowing any of the target
sequence. Hence it compares sequence fragments against each other in order to form
contigs. Mapped assembly makes use of a known reference sequence and compares all
sequence fragments against the reference, which is a far simpler and faster process
than de novo assembly.

Gap5, however, has neither de novo nor mapped assembly built in. Instead it relies on
running standard external command-line tools. At present this consists purely of using
bwa for a mapped assembly, but in future this will be expanded upon.

This means that the Assembly menu currently contains only a “Map Reads” sub-menu,
which in turn has multiple choices for bwa usage. You will not be able to join contigs
or fill holes in a contig directly using these facilities, although this is possible by
manually following some of the steps outlined below and using an alternate step for
generating the SAM file.

1.5.1.1 Importing with tg_index

To enable eﬃcient editing of data, Gap5 needs its own database format for storing sequence

assemblies. Formats such as BAM are good at random access for read-only viewing, but

are not at all amenable to actions such as reverse complementing a contig and joining it to

another.

Hence we need a tool that can take existing assembly formats and convert them to a

form suitable for Gap5. The tg_index program performs this task. It is strictly a command

line tool, although in some speciﬁc cases Gap5 has basic GUI dialogues to wrap it up.

One or more input ﬁles may be speciﬁed. The general form is:

tg_index [options] -o gap5_db_name input_file_name ...

An example usage is:

tg_index -z 16384 -o test_data.g5 test_data.bam

gap5 test_data.g5 &

File formats supported are SAM, BAM, ACE, MAQ (both short and long variants), CAF,
BAF, Fasta and Fastq. The latter two have no assembly or alignment information, so
they are simply loaded as single-read contigs. Tg_index normally detects the file
type automatically, but in rare cases you may need to state the input file type explicitly.

Tg_index options:

-o filename
Creates a gap5 database named filename and filename.aux. If not specified the
default is “g_db”.

-a Append to an existing database, instead of creating a new one (which is the

default action).

-n When appending, the default behaviour is to add reads to existing contigs if

contigs with the appropriate names already exist. This option always forces

creation of new contigs instead.

-g When appending to an existing database, assume that the alignment has been

performed against an ungapped copy of the consensus exported from this data-

base. (This is internally used when performing mapped assemblies as they

consist of exporting the consensus, running the external mapped alignment

tool, and then importing the newly generated alignments.)

-m
-M Forces the input to be treated as MAQ; both short (-m) and long (-M) formats
are supported. By default the file format is automatically detected.

-A Forces the input to be treated as ACE format.

-B Forces the input to be treated as BAF format.

-C Forces the input to be treated as CAF format.

-b
-s Forces the input to be treated as BAM (-b) or SAM (-s) format. SAM must
have @SQ headers present. Both need to be sorted by position.

-z bin_size Modifies the size of the smallest allowable contig bin. Large contigs will contain
child bins, each of which will contain smaller bins, recursing down to a minimum
bin size. Sequences are then placed in the smallest bin they entirely fit
within. The default minimum bin size is 4096 bytes. For very shallow assemblies
increasing this will improve performance and decrease the disk space used.
As a rough guide, aim for 5,000 to 10,000 sequences per bin.

-u Store unmapped reads only (from SAM/BAM only)

-x Store SAM/BAM auxiliary key:value records too.

-p

-P Enable (-p) or disable (-P) read-pairing. By default this is enabled. The purpose

of this is to link sequences from the same template to each other such that gap5

knows the insert size and read-pairings. Generally this is desirable, but it adds

extra time and memory to identify the pairs. Hence for single-ended runs the

option exists to disable attempts at read-pairing.

-f Attempt a faster form of read-pairing. In this mode we link the second occur-

rence of a template to the ﬁrst occurrence, but not vice versa. This is suﬃcient

for the template display graphical views to work, but will cause other parts of

the program to behave inconsistently. For example the contig editor “goto...”

popup menu will sometimes be missing.

-t

-T Controls whether to index (-t) or not (-T) the sequence names. By default this

is disabled. Adding a sequence name index permits us to search by sequence

name or to use a sequence name in any dialogue that requires a contig identiﬁer.

However it consumes more disc space to store this index and it can be time

consuming to construct it.

-r nseq Reserves space for at least nseq sequences. This generally isn’t necessary, but

if the total number of records extends above 2 million (equivalent to 2 billion

sequences, or less if we have lots of contigs, bins and annotation records to write)

then we run out of suitable sequence record numbers. This option preallocates

the lower record numbers and reserves them solely for sequence records.

-c compression_method

Speciﬁes an alternate compression method. This defaults to zlib, but can be set

to either none for fastest speed or lzma for best compression.
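The bin placement described under -z can be pictured with a short Python sketch. It is an illustration under stated assumptions, not tg_index's implementation: it assumes that each bin splits into two equal halves and that coordinates are 0-based half-open intervals. The function name smallest_enclosing_bin is hypothetical.

```python
def smallest_enclosing_bin(start, end, contig_length, min_bin=4096):
    """Return (bin_start, bin_end) of the smallest bin holding [start, end)."""
    # Assumed root bin: the minimum bin size doubled until it covers the contig.
    size = min_bin
    while size < contig_length:
        size *= 2
    lo, hi = 0, size
    # Descend while the sequence fits entirely inside one half of the bin.
    while hi - lo > min_bin:
        mid = (lo + hi) // 2
        if end <= mid:
            hi = mid
        elif start >= mid:
            lo = mid
        else:
            break  # the sequence straddles the midpoint, so it stays here
    return lo, hi


print(smallest_enclosing_bin(100, 200, 100000))    # fits in a minimum-size bin
print(smallest_enclosing_bin(4000, 4200, 100000))  # straddles a bin boundary
```

Under these assumptions a larger minimum bin size means fewer, fuller bins, which is the trade-off the -z option controls.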

1.5.1.2 Importing fasta/fastq ﬁles

Sometimes we have a few individual sequences we wish to import as single-read contigs.

That is we won’t align them against each other or against existing data, but just load them

into our gap5 database so we can then run tools such as Find Repeats or Find Internal

Joins on them. (This can be ideal for importing consensus sequences.)

The “Import Fasta/Fastq as single-read contigs” function is designed for this purpose.

Behind the scenes it is nothing more than running tg_index -a to add a fasta or fastq ﬁle.

1.5.1.3 Mapped assembly by bwa aln

This function runs the bwa program using the “aln” method for aligning sequences. It is

appropriate for matching most types of short-read data.

The GUI is little more than a wrapper around command line tools, which can essentially

be repeated manually as follows.

1. Calculate and save the consensus for all contigs in the database in fastq format.

2. Index the consensus sequence using “bwa index”.

3. Map our input data against the bwa index using “bwa aln”. Repeat for reverse matches

too.

4. Generate SAM format from the alignments using “bwa samse” or “bwa sampe”.

5. Convert to BAM and sort by position.

6. Import the BAM ﬁle, appending to the existing gap5 database (equivalent to tg_index

-a).
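Steps 2 to 6 above can be written down as a command sequence. The Python sketch below only builds and prints the commands; it does not run them. It covers the single-ended case (bwa samse), and it assumes samtools 1.x option syntax for the conversion and sorting step. All file names are hypothetical, and step 1 (saving the consensus) is assumed to have been done from within Gap5 already.

```python
import shlex


def bwa_aln_commands(consensus_fastq, reads_fastq, gap5_db, prefix="mapped"):
    """Build (argv, stdout_file) pairs for steps 2 to 6, single-ended reads."""
    sai, sam = prefix + ".sai", prefix + ".sam"
    bam, sorted_bam = prefix + ".bam", prefix + ".sorted.bam"
    return [
        # Step 2: index the consensus.
        (["bwa", "index", consensus_fastq], None),
        # Step 3: align the reads; bwa aln writes its result to stdout.
        (["bwa", "aln", consensus_fastq, reads_fastq], sai),
        # Step 4: generate SAM from the alignments.
        (["bwa", "samse", consensus_fastq, sai, reads_fastq], sam),
        # Step 5: convert to BAM and sort by position (samtools 1.x syntax assumed).
        (["samtools", "view", "-b", "-o", bam, sam], None),
        (["samtools", "sort", "-o", sorted_bam, bam], None),
        # Step 6: append to the existing Gap5 database.
        (["tg_index", "-a", "-o", gap5_db, sorted_bam], None),
    ]


for argv, stdout_file in bwa_aln_commands("consensus.fastq", "reads.fastq", "test_data.g5"):
    line = " ".join(shlex.quote(arg) for arg in argv)
    print(line + (" > " + stdout_file if stdout_file else ""))
```

For paired-end data the alignment step would be run once per read file and "bwa sampe" used in place of "bwa samse".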

1.5.1.4 Mapped assembly by bwa dbwtsw

This function runs the bwa program using the “dbwtsw” method for aligning sequences.

This should be used when attempting to align longer sequences or data with lots of indels.

The GUI is little more than a wrapper around command line tools, which can essentially

be repeated manually as follows.

1. Calculate and save the consensus for all contigs in the database in fastq format.

2. Index the consensus sequence using “bwa index”.

3. Map our input data against the bwa index using “bwa dbwtsw”.

4. Convert to BAM and sort by position.

5. Import the BAM ﬁle, appending to the existing gap5 database (equivalent to tg_index

-a).

1.5.2 Importing GFF

Annotations within GFF ﬁles can be imported to Gap5 as annotations (sometimes referred

to as tags). The “Import GFF Annotations” function in the main File menu performs

this task. Note that in order for this to work the contigs should not have been edited or

complemented since the GFF ﬁle was created, otherwise the coordinates in the GFF ﬁle

will not match.

One caveat to this relates to sequence gaps. By default consensus gaps/padding char-

acters are excluded from the contig consensus sequences when counting GFF sequence

coordinates. In some cases we may wish to support annotations in a gapped sequence, so

the “GFF coordinates are already padded” checkbox may be used to disable this coordinate

de-padding process.

1.5.3 Export Tags

This dialogue allows annotations (“tags”) to be written to disk as a GFF version 3 ﬁle.

Currently this uses only the GFF “remark” type, but there are plans to support a wider variety of GFF types.

By default the coordinates generated are de-padded, such that “*”s in the consensus

sequence are not counted when identifying the coordinate of an annotation. This may be

disabled by deselecting the “Unpadded coordinates” checkbox.
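The relationship between padded and de-padded coordinates can be sketched in Python as follows. This is a minimal illustration assuming 1-based coordinates and "*" as the pad character; the function names are hypothetical and the code is not taken from Gap5. How Gap5 treats a coordinate that falls exactly on a pad is not stated here, so the sketch simply maps it to the preceding base.

```python
def unpadded_position(consensus, padded_pos, pad="*"):
    """Convert a 1-based padded coordinate to its unpadded equivalent."""
    # Count the non-pad characters up to and including padded_pos.
    return sum(1 for base in consensus[:padded_pos] if base != pad)


def padded_position(consensus, unpadded_pos, pad="*"):
    """Convert a 1-based unpadded coordinate back to a padded one."""
    seen = 0
    for index, base in enumerate(consensus, start=1):
        if base != pad:
            seen += 1
            if seen == unpadded_pos:
                return index
    raise ValueError("unpadded position lies beyond the consensus")


consensus = "AC*GT**A"
print(unpadded_position(consensus, 4))  # the G at padded position 4
print(padded_position(consensus, 5))    # the final A, fifth real base
```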

The object a tag is attached to is typically the contig it is within, with the contig name

being used in the first column of the GFF file. This applies even for annotations placed on

a sequence rather than the consensus. This feature may also be disabled by deselecting the

“Map sequence tags to consensus” checkbox.

Example GFF output follows, with “...” to denote lines truncated for illustrative pur-

poses.

Contig6 gap5 remark 4745 4745 . . . type=COMM;Note=Possible SNP?

Contig2 gap5 remark 3178 3196 . . . type=OLIG;Note=Template%09xb63f10%0AOligoname%09??%0A...

Note we can see URL style percent encoding being used to avoid GFF format metachar-

acters, as per the GFFv3 speciﬁcation.
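The percent encoding seen in the Note attribute can be reproduced with Python's standard library. This is a sketch only: the exact set of characters Gap5 escapes is not documented here, so the choice to leave spaces and "?" unescaped is an assumption made to mirror the example output, and encode_gff_value is a hypothetical name.

```python
from urllib.parse import quote, unquote


def encode_gff_value(text):
    """Percent-encode a GFF attribute value, keeping spaces and '?' readable."""
    # Assumption: only letters, digits, "_.-~", spaces and "?" stay unescaped.
    return quote(text, safe=" ?")


note = "Template\txb63f10\nOligoname\t??\n"
encoded = encode_gff_value(note)
print(encoded)
# Decoding recovers the original tab and newline characters.
assert unquote(encoded) == note
```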

1.5.4 Export Sequences

This function exports sequence and annotation data from a Gap5 database to a variety of

assembly formats.

The fasta and fastq formats are basic sequence-only or sequence plus quality, with no

support for contigs or alignments. The BAF, CAF, ACE and SAM formats all hold assem-

bly data and so are reasonably complete representatives of data within Gap5. Note that

ACE does not directly support quality values and this export function does not create the

associated phdball ﬁle that houses this data.

There is also no direct support for BAM, however command line tools like samtools or

picard can convert the SAM ﬁle into BAM format. The SAM ﬁle should already be sorted

by position.

For SAM only there are additional options: whether to ﬁx mate-pair information and

whether to use depadded coordinates. The former will ensure that the MRNM (Mate

Reference Name), MPOS and ISIZE ﬁelds are ﬁlled out. Note that this considerably slows

down the speed of exporting, so it is disabled by default.

1.6 Finding Sequence Matches

1.6.1 Find Internal Joins

The purpose of this function (which is invoked from the Gap5 View menu) is to use sequences

already in the database to ﬁnd possible joins between contigs. Generally these will be joins

that were missed or judged to be unsafe during assembly and this function allows users to

examine the overlaps and decide if they should be made. During assembly joins may have

been missed because of poor data, or not been made because the sequence was repetitive.

Also it may be possible to ﬁnd potential joins by extending the consensus sequences with

the data from the 3’ ends of readings which was considered to be too unreliable to align

during assembly i.e. we can search in the "hidden data".

If it has not already occurred, use of this function will automatically transform the

Contig Selector into the Contig Comparator. Each match found is plotted as a diagonal

line in the Contig Comparator, and is written as an alignment in the Output Window. The

length of the diagonal line is proportional to the length of the aligned region. If the match

is for two contigs in the same orientation, the diagonal will be parallel to the main diagonal;

if they are not in the same orientation, the line will be perpendicular to the main diagonal.

The matches displayed in the Contig Comparator can be used to invoke the Join Editor (see

Section 2.6.15 [The Join Editor], page 196) or Contig Editor. See Section 2.6 [Editing in

gap5], page 160. Alternatively, the "Next" button at the top left of the Contig Comparator

can be used to select each result in turn, starting with the best, and ending with the worst.

When this is in use, users can ﬁnd the match in the Contig Comparator which corresponds

to the next result by placing the cursor over the Next button. The plotted match and the

contigs involved will turn white.

A typical display from the Contig Comparator is shown in the ﬁgure above.

To deﬁne the match all numbering is relative to base number one in the contig: matches

to the left (i.e. in the hidden data) have negative positions, matches oﬀ the right end of

the contig (i.e. in the hidden data) have positions greater than that of the contig length.

The convention for reporting the positions of overlaps is as follows: if neither contig needs

to be complemented the positions are as shown. If the program says "contig x in the -

sense" then the positions shown assume contig x has been complemented. For example, in

the results given below the positions for the ﬁrst overlap are as reported, but those for the

second assume that the contig in the minus sense (i.e. 443) has been complemented.

Possible join between contig 445 in the + sense and contig 405
Percentage mismatch after alignment = 4.9

               412        422        432        442        452        462
    405 TTTCCCGACT GGAAAGCGGG CAGTGAGCGC AACGCAATTA ATGTGAG,TT AGCTCACTCA
         ::::::::: : ::::::::  ::::: ::: :::::::::: :::::::::: ::::::::::
    445 *TTCCCGACT G,AAAGCGGG TAGTGA,CGC AACGCAATTA ATGTGAG*TT AGCTCACTCA
              -127       -117       -107        -97        -87        -77

               472        482        492        502        512
    405 TTAGGCACCC CAGGCTTTAC ACTTTATGCT TCCGGCTCGT AT
        :::::::::: :::::::::: :::::::::: :::::::::: ::
    445 TTAGGCACCC CAGGCTTTAC ACTTTATGCT TCCGGCTCGT AT
               -67        -57        -47        -37        -27

Possible join between contig 443 in the - sense and contig 423
Percentage mismatch after alignment = 10.4

                64         74         84         94        104        114
    423 ATCGAAGAAA GAAAAGGAGG AGAAGATGAT TTTAAAAATG AAACG*CGAT GTCAGATGGG
        :::: ::::: :::::::::: :::::::::: ::::::  :: ::::: :::: :::::::::
    443 ATCG,AGAAA GAAAAGGAGG AGAAGATGAT TTTAAA,,TG AAACGACGAT GTCAGATGG,
              3610       3620       3630       3640       3650       3660

               124        134        144        154        164
    423 TTG*ATGAAG TAGAAGTAGG AG*AGGTGGA AGAGAAGAGA GTGGGA
        ::: :::::: :::::::::: :: :::::::  ::: ::::: :: ::
    443 TTGGATGAAG TAGAAGTAGG AGGAGGTGGA ,GAG,AGAGA GTTGG*
              3670       3680       3690       3700       3710

1.6.1.1 Find Internal Joins Dialogue

The contigs to use in the search can be defined as "all contigs", a list of contigs in a file
("file"), or a list of contigs in a list ("list"). If "file" or "list" is selected the browse button
is activated and gives access to file or list browsers. Two types of search can be selected:
"Probe all against all" compares all the defined contigs against one another, whereas
"Probe with single contig" compares one contig against all the contigs in the list. If the
latter option is selected the Contig identifier panel in the dialogue box is ungreyed. Both
senses of the sequences are compared.

If users elect not to "Use standard consensus" they can either "Mark active tags" or
"Mask active tags", in which case the "Select tags" button will be activated. Clicking on
this button will bring up a check box dialogue to enable the user to select the tag types
they wish to activate. Masking the active tags means that all segments covered by tags that
are "active" will not be used by the matching algorithms. A typical use of this mode is to
avoid finding matches in segments covered by tags of type ALUS (i.e. segments thought to
be Alu sequence) or REPT (i.e. segments that are known to be repeated elsewhere in the data)
(see Section 2.2.7.1 [Tag types], page 121). "Marking" is of less use: matches will be found
in marked segments during searching, but in the alignment shown in the Output Window,
marked segments will be shown in lower case.

Some alignments may be very large. For speed and ease of scrolling Gap5 does not

display the textual form of the longest alignments, although they are still visible within the

contig comparator window. The maximum length of alignment to print is controlled

by the “Maximum alignment length to list (bp)” control.

The default setting for the consensus is to "Use hidden data", which means that where
possible the contigs are extended using the poor quality data from the readings near their
ends. To ensure that this additional data is not so poor that matches will be missed, the
program uses algorithms which can be configured from the "Edit hidden data parameters"
dialogue. Two algorithms are available. Both slide a window along the reading until a set
criterion is met. By default an algorithm which sums confidence values within the window is
used. It stops when a window with an average confidence below "Minimum average confidence"
is found. The other algorithm counts the number of uncalled bases in the window and stops
when the total reaches "Max number of uncalled bases in window". The selected algorithm is
applied to all the readings near the ends of contigs and the data that extends the contig
the furthest is added to its consensus sequence.
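The two sliding-window stopping rules can be sketched in Python. This is an illustration of the idea rather than Gap5's code: the window size, the default thresholds, the characters treated as uncalled, and the choice to cut at the start of the offending window are all assumptions, and both function names are hypothetical.

```python
def extent_by_confidence(confidences, window=10, min_average=20.0):
    """Number of leading bases kept before the first low-confidence window."""
    for start in range(len(confidences) - window + 1):
        total = sum(confidences[start:start + window])
        # Comparing the sum with min_average * window tests the window average.
        if total < min_average * window:
            return start
    return len(confidences)


def extent_by_uncalled(bases, window=10, max_uncalled=3, uncalled="N-"):
    """Number of leading bases kept before uncalled bases reach the limit."""
    for start in range(len(bases) - window + 1):
        count = sum(1 for base in bases[start:start + window] if base in uncalled)
        if count >= max_uncalled:
            return start
    return len(bases)


# Hidden data whose quality collapses after 20 bases.
print(extent_by_confidence([30] * 20 + [5] * 20))
print(extent_by_uncalled("ACGT" * 5 + "N" * 10))
```

Inputs shorter than the window are kept whole in this sketch, which is a simplification.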

If your total consensus sequence length (including a 20 character header for each contig

that is used internally by the program) plus any hidden data at the ends of contigs is greater

than the current value of a parameter called maxseq, Find Internal Joins may produce an

error message advising you to increase maxseq. Maxseq can be set on the command line

(see Section 2.21 [Command line arguments], page 306) or by using the options menu (see

Section 2.20.3 [Set Maxseq], page 299).

The search algorithms first find matching words of length "Word length", and only
consider overlaps of length at least "Minimum overlap". Only alignments better than
"Maximum percent mismatches" will be reported.

There are three search algorithms: “Sensitive”, “Quick” and “Fastest”. The quick or

fastest algorithm should be applied ﬁrst, and then the sensitive one employed to ﬁnd any

less obvious overlaps.

The sensitive algorithm sums the lengths of the matching words of length "Word length"
on each diagonal. It then finds the centre of gravity of the most significant diagonals.
Significant diagonals are those whose probability of occurrence is < "Diagonal threshold". It
then uses a dynamic programming algorithm to align around the centre of gravity, using a
band size of "Alignment band size (percent)". For example, if the overlap was 1000 bases
long and the percentage set at 5, the aligner would only consider alignments within 50 bases
either side of the centre of gravity. The larger the percentage and the overlap,
the slower the alignment.
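The first stage of the sensitive algorithm, summing matching word lengths per diagonal, can be sketched in Python together with the band-width arithmetic from the example. This is a simplified illustration: it omits the significance test and the centre of gravity calculation, and the function names are hypothetical.

```python
from collections import defaultdict


def diagonal_scores(seq_a, seq_b, word_length=4):
    """Sum the lengths of matching words on each diagonal (i - j)."""
    # Index every word of seq_a by its start position.
    index = defaultdict(list)
    for i in range(len(seq_a) - word_length + 1):
        index[seq_a[i:i + word_length]].append(i)
    # Look up each word of seq_b and credit the diagonal it falls on.
    scores = defaultdict(int)
    for j in range(len(seq_b) - word_length + 1):
        for i in index.get(seq_b[j:j + word_length], ()):
            scores[i - j] += word_length
    return dict(scores)


def band_half_width(overlap_length, band_percent):
    """Bases considered either side of the centre of gravity."""
    return overlap_length * band_percent // 100


scores = diagonal_scores("AAAACGTGCATTCC", "GGCGTGCATTAA")
print(max(scores, key=scores.get), scores)
print(band_half_width(1000, 5))
```

A diagonal with a high score marks a candidate overlap; the banded alignment is then restricted to the computed number of bases either side of it.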

The fastest and quick algorithms can find overlaps and align 100,000 base sequences in a
few seconds by considering, in their initial phase, only matching segments of length "Minimum
initial match length". However, they perform a dynamic programming alignment of all the chunks
between the matching segments, and so produce an optimal alignment. Again a banded
dynamic programming algorithm can be selected, but as this only applies to the chunks between
matching segments, which for good alignments will be very short, it should make little
difference to the speed. The fastest and quick methods differ only in how aggressively they
prune potential alignments before entering the dynamic programming phase.

After the search the results will be sorted so that the best matches are at the top of a list,
where best is defined as a combination of alignment length and alignment percent identity.
This list can be stepped through, one result at a time using the Contig Joining Editor, by
clicking on the "Next" button at the top left of the Contig Comparator.

1.6.2 Find Repeats

The purpose of this function (which is invoked from the Gap5 View menu) is to ﬁnd exact

repeats in contig consensus sequences. An exact repeat is deﬁned as a run of consecutive

identical ACGT characters; no mismatches or gaps are permitted.

If it has not already occurred, selection of this function will automatically transform

the Contig Selector into the Contig Comparator. See Section 2.4 [Contig Comparator],

page 126. Each match found is plotted as a diagonal line in the Contig Comparator. The

length of the diagonal line is proportional to the length of the match.

If the match is for two contigs in the same orientation the diagonal will be parallel to

the main diagonal; if they are not, the line will be perpendicular to the main diagonal. The

matches displayed in the Contig Comparator can be used to invoke the Join Editor (see

Section 2.6.15 [The Join Editor], page 196) or Contig Editors (see Section 2.6 [Editing in

Gap5], page 160), and an Information button will display data about the match in the

Output window. e.g.

Repeat match

From contig xb54a3.s1(#26) at 78

With contig xb62h3.s1(#3) at 1

Length 37

This means that position 78 in the contig with xb54a3.s1 (reading number 26) at its left

end matches 37 bases at position 1 in the contig with xb62h3.s1 (number 3) at its left end.

Users can elect to search a "single" contig, or compare "all contigs", or a subset of
contigs defined in a list or a file. If "file" or "list" is selected the browse button is activated
and gives access to file or list browsers. If they choose to analyse a single contig the
dialogue concerned with selecting the contig and the region to search becomes activated.

The "Minimum Repeat" defines the smallest match that the algorithm will report. The
algorithm will search only for repeats in the forward direction ("Find direct repeats"),
only for those in the reverse direction ("Find inverted repeats"), or for both ("Find both").
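An exact repeat search of this kind can be sketched in Python. The version below is deliberately naive (it compares every pair of positions) and is not the algorithm Gap5 uses; the function names are hypothetical. Positions are reported 1-based, as in the Output Window example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an ACGT sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def exact_repeats(seq_a, seq_b, min_repeat):
    """Return (pos_a, pos_b, length) for maximal exact runs of ACGT."""
    def same(i, j):
        return seq_a[i] == seq_b[j] and seq_a[i] in "ACGT"

    hits = []
    for i in range(len(seq_a)):
        for j in range(len(seq_b)):
            if not same(i, j):
                continue
            if i > 0 and j > 0 and same(i - 1, j - 1):
                continue  # inside a run already measured from its start
            length = 0
            while (i + length < len(seq_a) and j + length < len(seq_b)
                   and same(i + length, j + length)):
                length += 1
            if length >= min_repeat:
                hits.append((i + 1, j + 1, length))
    return hits


a, b = "TTTACGTACGGA", "CCACGTACGTT"
print(exact_repeats(a, b, 6))                      # direct repeats
print(exact_repeats(a, reverse_complement(b), 6))  # inverted repeats
```

In the inverted case the second position is relative to the complemented sequence.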

If "Mask active tags" is selected the "Select tags" button is activated. Clicking on this
button will bring up a check box dialogue to enable the user to select the tag types they
wish to activate. Masking the active tags means that all segments covered by tags that are
"active" will not be used in the matching algorithm. A typical use of this mode is to avoid
finding matches in segments covered by tags of type ALUS (i.e. segments thought to be Alu
sequence) or those already covered by REPT tags. See Section 2.2.7.1 [Tag types], page 121.

After the search is complete clicking on "Yes" in the "Save tags to file" panel will activate
the "File name" box and all repeats on the list will be written to a file. This file can be
used with "Enter tags" (see Section 2.12.2 [Enter Tags], page 265) to create REPT tags
for all the repeats found. Note that "Enter tags" will remove all the results plotted in the
contig comparator.

Note that the current version of Find Repeats has a limit to the number of repeats it
can store. The limit depends on the current maximum consensus length, so if you want to
increase the limit, reset the maximum consensus length. This can be done using the "Set
maxseq" item in the "Options" menu.

1.6.3 Find Read Pairs

This function is used to check the positions and orientations of readings taken from the

same templates. It is invoked from the gap5 View menu.

For each template the relative position of its readings and the contigs they are in are

examined. This analysis can give information about the relative order, separation and

orientations of contigs and also show possible problems in the data. The search can be

over the whole database or a subset of contigs named in a list (see Section 2.14 [Lists],

page 278) or ﬁle of ﬁle names. The results are written to the Output Window and plotted

in the Contig Comparator (See Section 2.4 [Contig Comparator], page 126.). Read pair

information is also used to colour code the results displayed in the Template Display (see

Section 2.5.1 [Template Display], page 130).

Note that during assembly the template names and lengths are copied from the exper-

iment ﬁles into the gap database. See Section 11.3 [Experiment Files], page 552. The

accuracy of the lengths will depend upon some size selection being performed during the

cloning procedures.

Users choose to process "all contigs" or a subset selected from a file of file names ("file")
or a list ("list"). If either of the subset options is selected the "browse" button will be
activated and can be clicked on to call up a file or list browser dialogue.

1.6.3.1 Find Read Pairs Graphical Output

The contig comparator is used to plot all templates with readings that span contigs. That

is, the lines drawn on the contig comparator are a visual representation of the relationship

(orientation and overlap) between contigs. When a template spans more than two contigs,

all the combinations of pairs of contigs are plotted. However such cases are uncommon.

The ﬁgure above shows a typical Contig Comparator plot which includes several types

of result in addition to those from Read Pair analysis.

The lines for the read-pairs are, by default, shown in blue. The length of the line is the

average length of the two readings within the pair. The slope of the line represents the

relative orientation of the two readings. If they are both the same orientation (including

both complemented) the line is drawn from top left to bottom right, otherwise the line is

drawn from top right to bottom left.

Clicking with the right mouse button on a read pair line brings up a menu containing,

amongst other things, "Invoke join editor" (see Section 2.6.15 [The Join Editor], page 196).

This will bring up the Join Editor with the two contigs shown end to end.

1.6.4 Sequence Search

The purpose of this function (which is available from the gap5 View menu) is to find

matches between the consensus sequence and short segments of sequence deﬁned by the

user. The segments of sequence (or "strings") can be typed into the dialogue provided

or can be the sequences covered by consensus tag types (see Section 2.2.7.1 [Tag types],

page 121) selected by the user. The latter mode hence provides a way of checking to see

if a tagged segment of the sequence occurs elsewhere in the consensus. The function was

previously known as "Find Oligos".

Users can elect to search against a "single" contig, "all contigs", or a subset of contigs
defined in a list (see Section 2.14 [Lists], page 278) or a file. If "file" or "list" is selected
the browse button is activated and gives access to file or list browsers. If they choose to
analyse a single contig the dialogue concerned with selecting the contig and the region to
search becomes activated.

Both strands of the consensus are scanned using a very simple algorithm: insertions and

deletions are not allowed, but mismatches are. The "Minimum percent match" defines the

smallest percentage match which will be reported by the algorithm. A value of 75 means

that at least 75% of the bases must match the target sequence.
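The matching rule described here (no insertions or deletions, mismatches allowed, both strands scanned) can be sketched in Python. This is a minimal illustration rather than Gap5's code; the function name search is hypothetical, and treating U as T is an assumption based on the characters the dialogue accepts.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def search(consensus, query, min_percent_match=75.0):
    """Return (position, strand, percent) hits; positions are 1-based."""
    query = query.upper().replace("U", "T")
    if not query:
        return []
    # Scanning with the reverse-complemented query covers the other strand.
    probes = (("+", query), ("-", query.translate(COMPLEMENT)[::-1]))
    hits = []
    for start in range(len(consensus) - len(query) + 1):
        window = consensus[start:start + len(query)]
        for strand, probe in probes:
            matches = sum(1 for x, y in zip(window, probe) if x == y)
            percent = 100.0 * matches / len(query)
            if percent >= min_percent_match:
                hits.append((start + 1, strand, percent))
    return hits


# Seven of the eight bases match at position 4, so the hit passes a 75% cutoff.
for hit in search("GGGACGTACGTTT", "ACGAACGT", 75):
    print(hit)
```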

The user can elect to use tags or to specify their own sequences for the search. Selecting
"Use tags" will activate the "Select tags" browse button. Clicking on this button will bring
up a check box dialogue to enable the user to select the tag types they wish to activate.
Alternatively, selecting "Enter sequence" will activate a text entry box in which the user can
enter a string of characters. Only the characters ACGTU are allowed and there is no limit
to the length of the string.

If it has not already occurred, selection of this function will automatically transform

the Contig Selector into the Contig Comparator. See Section 2.4 [Contig Comparator],

page 126. Each match found is plotted as a diagonal line in the Contig Comparator. The

length of the diagonal line is proportional to the length of the search string. Self matches

from the tag search are not reported.


If the match between the search string and the contig is in the same orientation,
the diagonal match line will be parallel to the main diagonal, otherwise the line will be
perpendicular to the main diagonal. Matches found between a tag and a contig can be used
to invoke the Join Editor (see Section 2.6.15 [The Join Editor], page 196) or Contig Editors
(see Section 2.6 [Editing in prog ], page 160). Matches between a specified sequence and
a contig will only invoke the Contig Editor. All of the matches found are displayed in the
Output Window, e.g.

Match found between tag on contig 315 in the + sense and contig 495
Percentage mismatch 16.7

   957       967       977       987       997
315 CATAAGGATTTCCAATATTTTATTCCAGTTGGGCATCCTAGT
     ::  ::::::::::: :::::::::::::::::: ::::
495 GATTGGGATTTCCAATGTTTTATTCCAGTTGGGCACCCTAAG
     2        12        22        32        42


1.7 Checking Assemblies and Removing Readings

After assembly, and prior to editing, it can be useful to examine the quality of the alignments

between individual readings and the sections of the consensus which they overlap. This may

reveal doubtful joins between sections of contigs, poorly aligned readings, or readings that

have been misplaced. By using this analysis in combination with other gap5 functions such

as Find internal joins (see Section 2.8.3 [Find Internal Joins], page 227) and Find repeats

(see Section 2.8.4 [Find Repeats], page 233), it is also possible to discover if readings have

been positioned in the wrong copies of repeat elements.

If readings are found to be misplaced or need removing for other reasons, gap5 has func-

tions for breaking contigs (see Section 2.9.1.1 [Breaking Contigs], page 239), and removing

readings (see Section 2.9.1.2 [Disassembling Readings], page 240). These functions can be

accessed through the main gap5 Edit menu or from within the Contig Editor.

If readings are removed from contigs to start new contigs of one reading, these contigs can

then be processed by Find internal joins (see Section 2.8.3 [Find Internal Joins], page 227)

and the Join editor (see Section 2.6.15 [The Join Editor], page 196), which should reveal all

the other positions at which the reading matches.


1.7.0.1 Checking Assemblies

The Check Assembly routine (which is invoked from the gap5 View menu) is used to check

contigs for potentially misassembled readings by comparing them against the segment of

the consensus which they overlap. It simply slides a small window along the sequence

identifying regions of high disagreement between that portion of sequence and the consensus.

Results are displayed in the Output Window and plotted on the main diagonal in the Contig

Comparator. See Section 2.4 [Contig Comparator], page 126.

From the Contig Comparator the user can invoke the Contig Editor to examine the
alignment of any problem reading. See Section 2.6 [Editing in gap5], page 160. If the
reading appears to be correctly positioned the user can edit it; otherwise they can select
its name to add it to the “readings” list for subsequent disassembly or removal.

Users select either to search only one contig ("single"), all contigs ("all contigs"), or a
subset of contigs contained in a "file" or a "list". If "file" or "list" is selected the "browse"
button will be activated and clicking on it will invoke a file or list browser. If a single contig
is selected the "Contig identifier" dialogue will be activated and users should enter a contig
name.

Both the percentage disagreement and the size of the window over which it is measured
are configurable parameters. Additionally there is a parameter to control whether N bases
in the sequence should be considered as disagreements or not. The choice will depend on
whether you are looking for sequences that appear to be in the wrong place (ignore Ns) or
simply sequences that appear to have a large number of incorrect base calls (keep Ns).
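The sliding window check can be pictured with the short sketch below. It is an illustration only: the function name, the default window size and the default threshold are assumptions rather than gap5's own values, and the reading is assumed to be already aligned, base for base, against the consensus segment it overlaps.

```python
def high_disagreement_windows(read, consensus, window=30,
                              max_percent=15.0, ignore_n=True):
    """Return (start, percent) for each window in which the reading
    disagrees with the consensus by more than max_percent.

    read and consensus are equal-length aligned strings. When ignore_n
    is true, N bases in the reading are not counted as disagreements.
    """
    read = read.upper()
    consensus = consensus.upper()
    if len(read) != len(consensus):
        raise ValueError("read and consensus must be aligned to equal length")
    flagged = []
    for start in range(len(read) - window + 1):
        bad = 0
        for r, c in zip(read[start:start + window],
                        consensus[start:start + window]):
            if r == "N" and ignore_n:
                continue
            if r != c:
                bad += 1
        percent = 100.0 * bad / window
        if percent > max_percent:
            flagged.append((start, percent))
    return flagged
```

Running the same reading with ignore_n set to False shows how a stretch of Ns alone can push a window over the threshold.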

The "Information" window, produced by selecting "Information" from the Contig
Comparator "Results" menu, gives a summary of the results sorted in order of percentage
mismatch.

By clicking with the right mouse button on results plotted in the Contig Comparator

a pop-up menu is revealed which can be used to invoke the Contig Editor (see Section 2.6

[Editing in gap4], page 160). The editor will start up with the cursor positioned on the

problem reading. If the reading is found to be misplaced it can be marked for removal

from within the Editor (see Section 2.6.7.12 [Remove Reading], page 179). However, prior

to this it may be beneﬁcial to use some of the other analyses such as Find internal joins

(see Section 2.8.3 [Find Internal Joins], page 227) and Find repeats (see Section 2.8.4 [Find

Repeats], page 233), which may help to ﬁnd its correct location. Both of these functions


produce results plotted in the Contig Comparator (see Section 2.4 [Contig Comparator],

page 126) and any alternative locations will give matches on the same vertical or horizontal

projection as the problem reading.


1.7.1 Removing Readings and Breaking Contigs

Occasionally contigs require more drastic changes than simple basecall edits. Sometimes it

is necessary to remove readings that have been put in the wrong place, or to break contigs

that should not have been joined. Gap5 contains functions to help with these problems,

and two types of interface.

If a contig needs to be broken cleanly into two new contigs, with all the readings, other

than the two at the incorrect join, still linked together, then Break Contig (see Section 2.9.1.1

[Breaking Contigs], page 239), or (see Section 2.6.7.13 [Break Contig], page 179) should be

used. The former interface is available via the main gap5 Edit menu, and the latter as an

option in the Contig Editor.

If one or more readings need removing from contig(s), even if their removal will
break the contiguity of a contig, then Disassemble Readings (see Section 2.9.1.2
[Disassemble Readings], page 240), or Remove Reading (see Section 2.6.7.12 [Remove
Reading], page 179) should be used. The former interface is available via the main gap5
Edit menu, and the latter as an option in the Contig Editor. Readings can be removed from
the database completely, or moved to start individual new contigs, one for each reading.


1.7.1.1 Breaking Contigs

The Break Contig function (which is available from the gap5 Edit menu) enables contigs to

be broken by removing the link between two adjacent readings. The user deﬁnes the contig

coordinate to break at. All sequences starting to the right of that position will be placed

into a new contig.

Breaking a contig can sometimes cause more holes to be created. The “Remove contig
holes” option will also cause further breaks to be made at these holes, producing more than
one additional contig. If we have aligned against a reference and expect regions of zero
coverage then this option should be disabled.
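A small sketch may help to show why breaking a contig can expose holes. It is not the gap5 code: readings are represented here as hypothetical (name, start, end) tuples in inclusive contig coordinates, and "starting to the right" is taken to mean a start strictly greater than the break position.

```python
def split_at_holes(reads):
    """Group reads into runs that have no zero-coverage gap between them."""
    groups = []
    reach = None  # rightmost coordinate covered by the current group
    for read in sorted(reads, key=lambda r: r[1]):
        if reach is None or read[1] > reach + 1:
            groups.append([])  # a hole precedes this read: start a new piece
            reach = read[2]
        else:
            reach = max(reach, read[2])
        groups[-1].append(read)
    return groups


def break_contig(reads, position, remove_holes=True):
    """Split reads at a contig coordinate; optionally break again at holes."""
    left = [r for r in reads if r[1] <= position]
    right = [r for r in reads if r[1] > position]
    pieces = [left, right]
    if remove_holes:
        pieces = [g for piece in pieces for g in split_at_holes(piece)]
    return [p for p in pieces if p]
```

In the example below reading r2 bridges r3 and r4. Once r2 stays in the left-hand contig, the right-hand side contains a hole and, with hole removal enabled, becomes two contigs.

```python
reads = [("r1", 1, 50), ("r2", 40, 100), ("r3", 45, 70), ("r4", 90, 150)]
for piece in break_contig(reads, 42):
    print([name for name, start, end in piece])
```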


1.7.1.2 Disassembling Readings

This function is used to remove readings from a database or move readings to new contigs.

If readings are removed from the database all reference to them is deleted. If a reading

is moved to a “single-read contig” a new contig will be created containing this one single

reading, which may then be re-processed by Find Internal Joins (see Section 2.8.3 [Find In-

ternal Joins], page 227) and the Join editor (see Section 2.6.15 [The Join Editor], page 196),

which should reveal all the other positions at which the reading matches.

More useful is the general “Move readings to new contigs”. This will keep any assembly

relationships intact between the set of readings to be disassembled. For example if three

readings overlap then when disassembled all three will end up in a single new contig. This

function is particularly useful for pulling apart false joins or repeats.

The set of readings to be processed can be read from a “file” or a “list” and clicking on
the “browse” button will invoke an appropriate browser. If just a single reading is to be
disassembled choose “single” and enter the reading name instead of the file or list name.

Removal via a “list” is a particularly powerful option when controlled via the list gen-

eration functions within the contig editor. For example break contig could be viewed as

disassembling a list of readings selected using “Select this reading and all to right”.

Unlike gap4, gap5 can cope with having holes in contigs. (This is obviously a requirement

when dealing with mapped alignments.) Hence gap5 gives us a choice whether to break

contigs into two (or more) pieces when removing sequences produces holes in the contigs.

By default this is enabled.


1.7.1.3 Delete Contigs

While Disassemble Readings is capable of removing entire contigs, it is ineﬃcient for this

task as it has a lot of additional house-keeping to perform.

Delete Contigs should be used when we wish to remove entire contigs. Be careful not

to accidentally choose this over disassemble readings as even when giving a single sequence

name, this function will interpret it as a request for removing all other sequences in that

contig too.

There is no Undo feature, so backups are advised beforehand.


1.8 Tidying up alignments

The Shuffle Pads, Remove Pad Columns and Remove Contig Holes functions all share a
common goal of tidying up sequence alignments, possibly also breaking the contig up.

1.8.1 Shuffle Pads

This function is an implementation of the Anson and Myers “ReAligner” algorithm. It anal-

yses multiple sequence alignments to detect locations where the number of disagreements

to the consensus could be reduced by realignment of sequences, possibly also correcting the

consensus in the process. For example:

Sequence1: GATTCAAAGAC
Sequence2:   TTCAA*GACGG
Sequence3:    TC*AAGAC
Consensus: GATTCAAAGACGGATC

The consensus contains AAA, but the corrected alignment only has two As:

Sequence1: GATTCAAAGAC
Sequence2:   TTC*AAGACGG
Sequence3:    TC*AAGAC
Consensus: GATTC*AAGACGGATC

For speed we acknowledge that the new alignment will only deviate slightly from the old
one and so a narrow “band size” is used. This parameter may be adjusted if required, but
at the expense of speed.


1.8.2 Remove Pad Columns

There are cases where we may have multiple alignments where every single sequence has a

padding character such that the complete column is “*”. This can occur when disassembling

data from a falsely made join.

The Shuﬄe Pads algorithm will remove entire columns of pads when it ﬁnds them, but

it is time consuming and it may also edit alignments elsewhere. The Remove Pad Columns

function is a faster, more speciﬁc solution to this problem.

By default the function will only ever delete columns where 100% of the sequences have a

pad/gap. However with appropriate due care it is possible to reduce this and allow removal

of columns where a few sequences have a real base provided the overall percentage is still

high. This is achieved by reducing the “Percentage pad needed” parameter.

Reducing from 100% is not recommended though, as it removes data purely for
tidiness' sake, while the consensus algorithm will automatically find the correct solution.
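The effect of the “Percentage pad needed” parameter can be illustrated with the sketch below. It is a simplified model and not the gap5 routine: the function name is hypothetical, the alignment is passed as equal-length strings, "*" is a pad and a space marks a column that a sequence does not cover.

```python
def remove_pad_columns(rows, percent_needed=100.0):
    """Drop every column in which at least percent_needed percent of the
    sequences covering that column hold a pad ('*')."""
    width = len(rows[0])
    keep = []
    for col in range(width):
        column = [row[col] for row in rows if row[col] != " "]
        pads = sum(1 for ch in column if ch == "*")
        if column and 100.0 * pads / len(column) >= percent_needed:
            continue  # column is (almost) all pads: remove it
        keep.append(col)
    return ["".join(row[c] for c in keep) for row in rows]
```

At the default of 100 a column is only removed when every covering sequence has a pad. Lowering the value removes the column below even though one sequence has a real base there, which is the loss of data the text warns about.

```python
rows = ["AC*GT", "AC*GT", "ACTGT"]
print(remove_pad_columns(rows))        # unchanged
print(remove_pad_columns(rows, 60.0))  # the T in the third row is lost
```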


1.8.3 Remove Contig Holes

Unlike Gap4, Gap5 permits contig regions with zero coverage. These can naturally occur
when using sequence mapping to known references. However in a de novo assembly context
they are not desirable.

Some algorithms have check boxes querying whether you wish holes to be removed by

breaking contigs up, but this dialogue oﬀers a choice of ﬁxing the holes at a later stage.

It identiﬁes all regions of zero coverage and will break the contig into multiple fragments.
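Identifying the regions of zero coverage amounts to a single pass over the readings sorted by start position. The sketch below is an assumption about how this might be done, not the gap5 code; readings are hypothetical (start, end) pairs in inclusive contig coordinates.

```python
def zero_coverage_regions(reads, contig_start, contig_end):
    """Return the inclusive (start, end) regions covered by no reading."""
    holes = []
    next_uncovered = contig_start  # leftmost position not yet known covered
    for start, end in sorted(reads):
        if start > next_uncovered:
            holes.append((next_uncovered, start - 1))
        next_uncovered = max(next_uncovered, end + 1)
    if next_uncovered <= contig_end:
        holes.append((next_uncovered, contig_end))
    return holes
```

Each region returned is a place where the contig would be broken, so two holes give three fragments.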


1.9 Calculating Consensus Sequences

In this section we describe the types of consensus which gap4 can produce, the formats they
can be written in, and the algorithms that can be used. The algorithms are not only used
to produce consensus sequence files, but in many other places throughout gap4 where an
analysis of the current quality of the data is required. One important place is inside the
Contig Editor (see Section 2.6 [Editing in gap4], page 160) where they are used to produce
an "on-the-fly" consensus, responding to every edit made by the user.

The currently active consensus algorithm is selected from the "Consensus algorithm"
dialogue in the main gap4 Options menu (see Section 2.20.2 [Consensus Algorithm], page 299).

There are four main types of consensus sequence file that can be produced by the
program: Normal, Extended, Unfinished, and Quality. They are all invoked from the File
menu.

"Normal" is the type of consensus file that would be expected: a consensus from the
non-hidden parts of a contig. "Extended" is the same as "Normal" but the consensus is
extended by inclusion of the hidden, non-vector sequence, from the ends of the contig.

"Unfinished" is the same as "Normal" except that any position where the consensus
does not have good data for both strands is written using A,C,G,T characters, and the rest
(which has good data for both strands) is written using a different set of symbols. This
sequence can be used for screening against new readings: only the regions needing more
readings will produce matches. By screening readings in this way, prior to assembly, users
can avoid entering readings which will not help finish the project, and which may require
further editing work to be performed.

"Quality" produces a sequence of characters of the same length as the consensus, but
they instead encode the reliability of the consensus at each point.

Consensus sequence ﬁles can also encode the positions of the currently active tag types

by changing the case of the tagged characters (marking) or writing them in a diﬀerent

character set (masking) (see Section 2.2.7.2 [Active tags and masking], page 121).

The consensus algorithms are usually conﬁgured to produce only the characters A,C,G,T

and "-", but it is possible to set them to produce the complete set of IUB codes. This mode

is useful for some types of work and allows the range of observed base types at any position

to be coded in the consensus. How the IUB codes are chosen is described in the introduction

to the consensus algorithms (see Section 2.11.5 [The Consensus Algorithms], page 257).

Depending on the type of consensus produced, the consensus sequence files can be written
in three different formats: Experiment files (see Section 11.3 [Experiment File], page 552),
FASTA (Pearson, W.R. Using the FASTA program to search protein and DNA sequence
databases. Methods in Molecular Biology. 25, 365-389 (1994)) or staden formats. If
experiment file format is selected a further menu appears that allows users to select for the
inclusion of tag data in the output file. For FASTA format the sequence headers include the
contig identifier as the sequence name, and the project database name, version number and
the number of the leftmost reading in the contig as comments, e.g. ">xyzzy.s1 B0334.0.274"
is database B0334, copy 0, and the leftmost reading for the contig is number 274, which has
a name of xyzzy.s1. For staden format the headers include the project database name and
the number of the leftmost reading in the contig, e.g. "<B0334.00274------->" is database
B0334 and the leftmost reading for the contig is number 274. Staden format is maintained
only for historical reasons, i.e. there may still be a few unfortunate people using it.
Obviously Experiment file format can contain much more information, and can serve as the
basis of a submission to the sequence library.

1.9.1 Normal Consensus Output

This is the usual consensus type that will be calculated (and is available from the gap4
File menu). The currently active consensus algorithm is selected from the "Consensus
algorithm" dialogue in the main gap4 Options menu (see Section 2.20.2 [Consensus Algorithm],
page 299).

Contigs can be selected from a ﬁle of ﬁle names or a list. In addition, tagged regions can

be masked or marked (see Section 2.2.7.2 [Active tags and masking], page 121), and output

can be in Experiment ﬁle, fasta or staden formats. If experiment ﬁle format is selected a

further menu appears that allows users to select for the inclusion of tag data in the output

ﬁle.

The contigs for which to calculate a consensus can be a particular "single" contig, "all
contigs", or a subset of contigs whose names are stored in a "file" or a "list". If a file or list
is selected the browse button will be activated, and if it is clicked, an appropriate browser
will be invoked. If the user selects "single" then the dialogue for choosing the contig, and
the section to process, becomes active.

If the user selects either "mask active tags" or "mark active tags" the "Select tags"
button is activated, and if it is clicked, a dialogue panel appears to enable the user to select
which tag types should be used in these processes. If "mask" is selected all segments covered
by the tag types chosen will not be written as ACGT but as the symbols "defi". If "mark" is
selected the tagged segments will be written in lowercase characters. Masking is useful for
producing a sequence to screen against other sequences: only the unmasked segments will
produce hits.
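The difference between masking and marking can be shown with a short sketch. The function name and the 0-based, half-open tag coordinates are hypothetical, and the one-to-one mapping of A, C, G, T onto d, e, f, i in that order is an assumption made for illustration.

```python
# Assumed mapping of unmasked to masked symbols, for illustration only.
_MASK = str.maketrans("ACGT", "defi")


def apply_tags(consensus, tags, mode):
    """Rewrite the tagged segments of a consensus.

    tags is a list of (start, end) pairs, 0-based and half-open.
    mode is "mask" (use the alternative symbol set) or "mark" (lowercase).
    """
    if mode not in ("mask", "mark"):
        raise ValueError("mode must be 'mask' or 'mark'")
    chars = list(consensus)
    for start, end in tags:
        for i in range(start, min(end, len(chars))):
            if mode == "mask":
                chars[i] = chars[i].translate(_MASK)
            else:
                chars[i] = chars[i].lower()
    return "".join(chars)
```

A search program that only recognises ACGT will find nothing in the masked segment, whereas a marked segment remains searchable by any case-insensitive comparison.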

The "strip pads" option will remove pads ("*"s) from the consensus sequence. In the
case of experiment files this will also automatically adjust the position and length of the
annotations to ensure that they still mark the correct segment of sequence.
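The annotation adjustment can be understood as counting the pads that precede each end of an annotation. The following is a sketch under stated assumptions, not the gap4 code: annotations are hypothetical (start, length) pairs with 1-based padded positions.

```python
def strip_pads(consensus, annotations):
    """Remove '*' pads and shift annotations so that they still cover
    the same bases. Returns (unpadded_sequence, adjusted_annotations)."""
    # pads_before[i] is the number of pads in consensus[:i]
    pads_before = [0]
    for ch in consensus:
        pads_before.append(pads_before[-1] + (ch == "*"))
    adjusted = []
    for start, length in annotations:
        lo = start - 1       # 0-based start in padded coordinates
        hi = lo + length     # 0-based end (exclusive) in padded coordinates
        new_lo = lo - pads_before[lo]
        new_hi = hi - pads_before[hi]
        adjusted.append((new_lo + 1, new_hi - new_lo))
    return consensus.replace("*", ""), adjusted
```

For the consensus AC*GT*A, an annotation covering padded positions 2 to 5 (C*GT) becomes one of length 3 starting at position 2 (CGT) once the pads are stripped.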

Normally the consensus sequences are named after the left-most reading in each contig.
For the purposes of single-template based sequencing projects (e.g. cDNA assemblies) the
option exists to “Name consensus by left-most template” instead of by left-most reading.

The routine can write its consensus sequence (plus extra data for experiment files) in
"experiment file", "fasta" and "staden" formats. The output file can be chosen with the
aid of a file browser. If experiment file format is selected the user can choose whether or not
to have "all annotations", "annotations except in hidden", or "no annotations" written out
with the sequence. If the user elects to include annotations the "select tags" button will
become active, and if it is clicked, a dialogue for selecting the types to include will appear.

1.9.2 The Consensus Algorithms

The consensus calculation is a very important component of gap4. It is used to produce
an "on-the-fly" consensus, responding to every individual change in the Contig Editor (see
Section 2.6 [Editing in gap4], page 160) and is used to produce the final sequence for
submission to the sequence libraries. Some years ago (Bonfield, J.K. and Staden, R. The
application of numerical estimates of base calling accuracy to DNA sequencing projects.
Nucleic Acids Res. 23, 1406-1410 (1995)) we put forward the idea of using base call accuracy
estimates in sequencing projects, and this has been partially realised with the values from
the Phred program (Ewing, B. and Green, P. Base-Calling of Automated Sequencer Traces
Using Phred. II. Error Probabilities. Genome Research. Vol 8 no 3. 186-194 (1998)).
These values are widely used and have defined a decibel type scale for base call confidence
values, and gap4 is currently set to use confidence values defined on this scale. An overview
of our use of confidence values is contained in the introductory sections of the manual (see
Section 2.2.5 [The use of numerical estimates of base calling accuracy], page 118).

As is described elsewhere (see Section 2.11.6 [List Consensus Conﬁdence], page 261)

being able to calculate the conﬁdence for each base in the consensus sequence makes it

possible to estimate the number of errors it contains, and hence the number of errors that

will be removed if particular bases are checked and, if necessary, edited.

Gap4 caters for base calls with and without conﬁdence values and hence provides a

choice of algorithms. There are currently three consensus algorithms that may be used.

The choice of the best algorithm will depend on the data that you have available and the

purpose for which you are using gap4.

The currently active consensus algorithm is selected from the "Consensus algorithm"
dialogue in the main gap4 Options menu (see Section 2.20.2 [Consensus Algorithm], page 299).

The only way to produce a consensus sequence for which the reliability of each base is
known is to use reading data with base call confidence values. Their use, in combination
with the Confidence Value algorithm (see Section 2.11.5.3 [Consensus Calculation Using
Confidence Values], page 259), is strongly recommended.

For base calls without confidence values use the Base Frequencies algorithm (see
Section 2.11.5.1 [Consensus Calculation Using Base Frequencies], page 258). This is also
a fast algorithm so it may be appropriate for very high depth assemblies such as those for
mutation studies.

For data with simple base call accuracy estimates rather than those on the decibel scale,

the Weighted Base Frequencies algorithm should be used (see Section 2.11.5.2 [Consensus

Calculation Using Weighted Base Frequencies], page 259).

All confidence values lie in the range 0 to 100. When readings are entered into a
database, gap4 assigns a confidence of 99 to all bases without confidence values. For all three
algorithms, a base with confidence of 100 is used to force the consensus base to that base
type and to have a confidence of 100. However, if two or more base types at any position
have confidence 100, the consensus will be set to "unknown", i.e. "-", and will have a
confidence of 0. Note that dash ("-") is our preferred symbol for "unknown" as, within a
sequence, it is more easily distinguished from A,C,G,T than "N".

The consensus sequence is also assigned a conﬁdence, even when base call conﬁdence

values are not used to calculate it. The scale and meaning of the consensus conﬁdence

changes between consensus algorithms. However the consensus cutoﬀ parameter always has

the same meaning. A consensus base with a conﬁdence ’X’ will be called as a dash when

’X’ is lower than the consensus cutoﬀ, otherwise it is the determined base type.

Both the consensus cutoff and quality cutoff values can be set by using the "Configure
cutoffs" command in the "Consensus algorithm" dialogue in the main gap4 Options
menu (see Section 2.20.2 [Consensus Algorithm], page 299). Within the Contig Editor (see
Section 2.6 [Editing in gap4], page 160) these values can be adjusted by clicking on the "<"
and ">" symbols adjacent to the "C:" (consensus cutoff) and "Q:" (quality cutoff) displays
in the top left corner of the editor. These buttons are repeating buttons: the values will
adjust for as long as the left mouse button is held down. Changing these values lasts only
as long as that invocation of the contig editor.

The consensus algorithms are usually conﬁgured to produce only the characters

A,C,G,T,* and "-", but it is possible to set them to produce the complete set of IUB

codes. This mode is useful for some types of work and allows the range of observed base

types at any position to be coded in the consensus. The IUB code at any position is

determined in the following way.

We assume that the user wants to know which base types have occurred at any point,
but may want some control over the quality and relative frequency of those that are used to
calculate the "consensus". For the simplest consensus algorithm there is no control over the
quality of the base calls that are included, but the Consensus Cutoff can be used to control
how the relative frequency affects the chosen IUB code. All base types whose computed
"confidence" exceeds the Consensus Cutoff will be included in the selection of the IUB code.
For example if only base type T reaches the Consensus Cutoff the IUB code will be T; if both
T and C reach the cutoff the code will be Y; if A, C and T each reach the cutoff the code
will be H; if A, C, G and T all reach the cutoff the code will be "N". For the Confidence
Value algorithm the Quality Cutoff can be used to exclude base calls of low quality, so that
all those that do not reach the Quality Cutoff are excluded from the IUB code calculation.
Otherwise the logic of the code selection is the same as for the two simpler algorithms.
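The selection of an IUB code from the per-base confidences can be sketched as a table lookup. The code table is the standard IUB set; the function name is hypothetical, "reaching" the cutoff is taken to mean greater than or equal to it, and returning "-" when no base type qualifies is an assumption.

```python
# Standard IUB ambiguity codes, keyed by the sorted set of base types.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B",
    "ACGT": "N",
}


def iub_code(confidences, consensus_cutoff):
    """confidences maps each base type to its computed confidence.
    Every base type reaching the cutoff contributes to the code."""
    chosen = "".join(b for b in "ACGT"
                     if confidences.get(b, 0) >= consensus_cutoff)
    return IUB.get(chosen, "-")
```

For example, with confidences of 60 for T and 40 for C, a cutoff of 30 gives Y while a cutoff of 50 gives T.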

Both the consensus cutoff and quality cutoff values can be set by using the "Configure
cutoffs" command in the "Consensus algorithm" dialogue in the main gap4 Options menu
(see Section 2.20.2 [Consensus Algorithm], page 299).

The algorithms are explained below.

1.9.2.1 Consensus Calculation Using Base Frequencies

This algorithm can be used for any data, with or without conﬁdence values. Each standard

base type is given the same weight. The consensus will be the most frequent base type in a

given column provided that the consensus cutoﬀ parameter is low enough. All unrecognised

base types, including IUB codes, are treated as dashes. Dashes are given a weight of

1/10th that of recognised base types. Pads are given a weight which is the average of their

neighbouring bases.

The conﬁdence of a consensus base for this method is expressed as a percentage. So for

example a column of bases of A, A, A and T will give a consensus base of A and a conﬁdence

of 75. Therefore a consensus cutoﬀ of 76 or higher will give a consensus base of "-".

In the event that more than one base type is calculated to have the same conﬁdence, and

this exceeds the consensus cutoﬀ, the bases are assigned in descending order of precedence:

A, C, G and T.

The quality cutoﬀ parameter (Q in the Contig Editor) has no eﬀect on this algorithm.
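As a rough illustration of the Base Frequencies method, the sketch below counts base types in a single column. It is not the gap4 code: the function name is hypothetical, every unrecognised symbol simply adds a tenth of a base to the total, and the special weighting of pads by their neighbouring bases is not modelled, so columns containing "*" should not be passed in.

```python
def base_frequency_consensus(column, consensus_cutoff):
    """Return (consensus_base, confidence) for one alignment column.

    column is a string of base calls. Confidence is a percentage; the
    result is '-' when it falls below consensus_cutoff.
    """
    weights = dict.fromkeys("ACGT", 0.0)
    total = 0.0
    for call in column.upper():
        if call in weights:
            weights[call] += 1.0
            total += 1.0
        else:
            total += 0.1  # dashes and IUB codes: 1/10th of a base
    if total == 0.0:
        return "-", 0.0
    # max() keeps the first maximum, giving the A, C, G, T precedence
    best = max("ACGT", key=lambda b: weights[b])
    confidence = 100.0 * weights[best] / total
    if confidence < consensus_cutoff:
        return "-", confidence
    return best, confidence
```

The column A, A, A, T gives A with confidence 75 at a cutoff of 75, and "-" at a cutoff of 76, as in the example above.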

1.9.2.2 Consensus Calculation Using Weighted Base Frequencies

This method can be used when simple, unquantified, base call quality values are available.
Instead of simply counting base type frequencies it sums the quality values. Hence a column
of 4 bases A, A, A and T with confidence values 10, 10, 10 and 50 would give combined
totals of 30/80 for A and 50/80 for T (compared to 3/4 for A and 1/4 for T when using
frequencies). As with the unweighted frequency method this sets the confidence value of
the consensus base to be the fraction of the chosen base type weights over the total
weights (62.5 in the above example).

The quality cutoﬀ parameter controls which bases are used in the calculation. Only bases

with quality values greater than or equal to the quality cutoﬀ are used, otherwise they are

completely ignored and have no eﬀect on either the base type chosen for the consensus or

the consensus conﬁdence value. In the above example setting the quality cutoﬀ to 20 would

give a T with conﬁdence 100 (100 * 50/50).

In the event that more than one base type is calculated to have the same weight, and

this exceeds the consensus cutoﬀ, the bases are assigned in descending order of precedence:

A, C, G and T.
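The worked figures above can be reproduced with a few lines of code. This is a sketch only, with a hypothetical function name; base calls other than A, C, G and T are ignored here, which is a simplification.

```python
def weighted_consensus(calls, consensus_cutoff, quality_cutoff=0):
    """calls is a list of (base, quality) pairs for one column.
    Returns (consensus_base, confidence) with confidence as a percentage."""
    weights = dict.fromkeys("ACGT", 0.0)
    for base, quality in calls:
        # calls below the quality cutoff take no part in the calculation
        if base in weights and quality >= quality_cutoff:
            weights[base] += quality
    total = sum(weights.values())
    if total == 0.0:
        return "-", 0.0
    # max() keeps the first maximum, giving the A, C, G, T precedence
    best = max("ACGT", key=lambda b: weights[b])
    confidence = 100.0 * weights[best] / total
    if confidence < consensus_cutoff:
        return "-", confidence
    return best, confidence
```

```python
column = [("A", 10), ("A", 10), ("A", 10), ("T", 50)]
print(weighted_consensus(column, 50))                     # T, 62.5
print(weighted_consensus(column, 50, quality_cutoff=20))  # T, 100.0
```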

This is Rule IV of Bonﬁeld,J.K. and Staden,R. The application of numerical estimates

of base calling accuracy to DNA sequencing projects. Nucleic Acids Research 23, 1406-1410

(1995).


1.9.2.3 Consensus Calculation Using Conﬁdence values

This is the preferred consensus algorithm for reading data with Phred decibel scale confidence
values. As will become clear from the following description, it is more complicated than the
other algorithms, but produces a much more useful result.

A diﬃculty in designing an algorithm to calculate the conﬁdence for a consensus derived

from several readings, possibly using diﬀerent chemistries, and hopefully from both strands

of the DNA, is knowing the level of independence of the results from diﬀerent experiments

- namely the readings. Given that sequencing traces are sequence dependent, we do not

regard readings as wholly independent, but at the same time, repeated readings which

conﬁrm base calls may give us more conﬁdence in their accuracy. In addition, if we get a

particularly good sequencing run, with consequently high base call conﬁdence values, we

are more likely to believe its base call and conﬁdence value assignments. The ﬁnal point in

this preamble is that the Phred conﬁdence values refer only to the probability for the called

base, and they tell us nothing about the relative likelihood of each of the other 3 base types

appearing at the same position. These diﬃculties are taken into account by our algorithm,

which is described below.

In what follows, a particular position in an alignment of readings is referred to as a
"column". The base calls in a column are classified by their chemistry and strand. We
currently group them into "top strand dye primer", "top strand dye terminator", "bottom
strand dye primer" and "bottom strand dye terminator" classes.

Within each class there may be zero or many base calls. For each class we check for

multiple occurrences of the same base type. For each base type we ﬁnd the highest conﬁdence

value, and then increase it by an amount dependent on the number of conﬁrming reads.

Then Bayes formula is used to derive the probabilities and hence the conﬁdence values for

each base type.

To further describe the method it is easiest to work through an example. Suppose we

have 5 readings with the following characteristics covering a particular column.

Dye primer, top strand, ’A’, confidence 20

Dye primer, top strand, ’A’, confidence 10

Dye primer, top strand, ’T’, confidence 20

Dye terminator, top strand, ’T’, confidence 10

Dye primer, bottom strand, ’A’, confidence 5

Hence there are three possible classes.

Examining the "dye primer top strand" class we see there are three readings (A, A and
T). The highest A is 20. We add to this a fixed quantity to indicate one other occurrence
of an A in this set. For this example we add 5. Now we have an adjusted confidence of
25 for A and 20 for T. This is equivalent to a .997 probability of A being correct and .99
probability of T being correct. To use Bayes we split the remaining probabilities evenly. A
has a probability of .997 and so the remaining .003 is spread amongst the other base types.
Similarly for the .01 of the T. The result is shown in the table below.

  |  A      C      G      T
--+---------------------------
A | .997   .001   .001   .001
T | .0033  .0033  .0033  .990
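The entries in this table follow from the decibel scale, on which a confidence Q corresponds to an error probability of 10^(-Q/10). The sketch below reproduces them. The function names are hypothetical, and adding a fixed bonus for each additional confirming read is an assumption based on the single case worked through above.

```python
def adjusted_confidence(confidences, bonus=5):
    """Highest confidence for one base type within a class, raised by a
    fixed amount for each additional confirming read."""
    return max(confidences) + bonus * (len(confidences) - 1)


def phred_to_probability(confidence):
    """Probability that a call with this decibel-scale confidence is correct."""
    return 1.0 - 10.0 ** (-confidence / 10.0)


def spread_row(called_base, confidence, alphabet="ACGT"):
    """One table row: the called base keeps its probability and the
    remainder is shared evenly amongst the other base types."""
    p = phred_to_probability(confidence)
    rest = (1.0 - p) / (len(alphabet) - 1)
    return {b: (p if b == called_base else rest) for b in alphabet}
```

```python
a_row = spread_row("A", adjusted_confidence([20, 10]))  # confidence 25
t_row = spread_row("T", adjusted_confidence([20]))      # confidence 20
```

Passing a five letter alphabet that includes the pad character divides the remainder by 4 instead of 3.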

Bayesian calculations on this table then give us probabilities of approximately .766 for

A, .00154 for C, .00154 for G and .231 for T.

The other classes give probabilities of .033 for each of A, C and G and .9 for T (dye
terminator, top strand), and .316 for A and .228 for each of C, G and T (dye primer,
bottom strand).

To combine the values for each class we produce a table for a further Bayesian calculation.

Once again we ﬁll in the probabilities and spread the remainder evenly amongst the other

base types.

           |  A      C       G       T
-----------+-----------------------------
Primer Top | .766   .00154  .00154  .231
Term Top   | .0333  .0333   .0333   .9
Primer Bot | .316   .228    .228    .228

From this Bayes gives the ﬁnal probabilities of .135 for A, .0002 for C, .0002 for G and

.854 for T. This is what would be expected intuitively: the T signal was present in both

dye primer and dye terminator experiments with 1/100 and 1/10 error rates whilst the A

signal was present on both strands with 1/100 and 1/3 error rates. Hence the consensus

base is T with conﬁdence 8.4 (-10*log10(1-.854)).
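One way to picture the final combination is to multiply the class probabilities for each base type and renormalise so that the four values sum to one. The sketch below does this for the table of class probabilities above. It assumes equal prior probabilities for the four base types and independence between classes, and it is an illustration rather than the gap4 code; the function names are hypothetical.

```python
import math


def combine(rows, bases="ACGT"):
    """Multiply the per-class probabilities for each base type and
    renormalise (equal priors assumed)."""
    product = {b: math.prod(row[b] for row in rows) for b in bases}
    total = sum(product.values())
    return {b: product[b] / total for b in bases}


def to_confidence(probability):
    """Convert a probability of being correct to the decibel scale."""
    return -10.0 * math.log10(1.0 - probability)


classes = [
    {"A": 0.766, "C": 0.00154, "G": 0.00154, "T": 0.231},   # Primer Top
    {"A": 0.0333, "C": 0.0333, "G": 0.0333, "T": 0.9},      # Term Top
    {"A": 0.316, "C": 0.228, "G": 0.228, "T": 0.228},       # Primer Bot
]
final = combine(classes)
best = max(final, key=final.get)
print(best, round(final[best], 3), round(to_confidence(final[best]), 1))
```

Under these assumptions T comes out at about .854, corresponding to a confidence of about 8.4, in agreement with the figures above. The value for A comes out nearer .145 than the .135 quoted, so the sketch should not be read as an exact reproduction of the calculation.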

If a padding character is present in a column we consider the pad as a separate base

type and then evenly divide the remaining probabilities by 4 instead of 3.
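
The final combination step of the worked example can be sketched in a few lines of Python. This is only an illustration: it assumes that the "Bayesian calculation" amounts to multiplying the per-class probabilities for each base type and renormalising so that they sum to one, which the text does not spell out. The function name `combine` and the variable names are made up for this sketch and are not part of gap4.

```python
import math

BASES = "ACGT"

def combine(rows):
    """Multiply the probabilities for each base down the rows, then renormalise.

    Assumption: this product-and-normalise rule stands in for the
    "Bayesian calculation" described in the text.
    """
    products = [math.prod(row[i] for row in rows) for i in range(len(BASES))]
    total = sum(products)
    return {base: p / total for base, p in zip(BASES, products)}

# Per-class probabilities copied from the table (Primer Top, Term Top, Primer Bot).
class_table = [
    (0.766, 0.00154, 0.00154, 0.231),
    (0.0333, 0.0333, 0.0333, 0.9),
    (0.316, 0.228, 0.228, 0.228),
]

posterior = combine(class_table)
best = max(posterior, key=posterior.get)

# Phred-style confidence of the chosen consensus base.
confidence = -10 * math.log10(1 - posterior[best])
print(best, round(posterior[best], 3), round(confidence, 1))
```

Under that assumption the sketch selects T with a probability close to .854 and a confidence of about 8.4, in line with the figures quoted for T.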

1.9.2.4 The Quality Calculation

The Quality Calculation described here (which is available from the gap4 File menu) applies

either of the two simple consensus calculations (see Section 2.11.5.1 [Consensus Calculation

Using Base Frequencies], page 258) and (see Section 2.11.5.2 [Consensus Calculation Using

Weighted Base Frequencies], page 259) to the data for each strand of the DNA separately.

It produces, not a consensus sequence, but an encoding of the "quality" of the data which

deﬁnes whether it has been determined on both strands, and whether the strands agree.

This quality is used as the basis for problem searches, such as ﬁnd next problem, and the

Quality Display within the Template Display (see Section 2.5.1.5 [Quality Plot], page 137).

The categories of data and the codes produced are shown in the table. For example ’c’

means bad data on one strand is aligned with good data on the other.

Code  +Strand  -Strand
a     Good     Good (in agreement)
b     Good     Bad
c     Bad      Good
d     Good     None
e     None     Good
f     Bad      Bad
g     Bad      None
h     None     Bad
i     Good     Good (disagree)
j     None     None

the "Conﬁgure cutoﬀs"command in the

In the "Consensus algorithm" dialogue in the main gap4 Options menu (see
Section 2.20.2 [Consensus Algorithm], page 299), the setting that controls how readings
flagged using the "Special Chemistry" Experiment File line (CH field) (see Section 11.3
[Experiment File], page 552) are treated affects this calculation. When set, such a reading
counts for both strands in the Consensus and Quality Calculations, and hence is equivalent
to having data on both strands.
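
The category codes tabulated above can be expressed as a simple lookup. The sketch below is illustrative only: the state names "good", "bad" and "none", the `strands_agree` flag and the function `quality_code` are assumptions made for the example, and it does not model how gap4 decides whether the data on a strand is good or bad.

```python
# Codes for every combination except Good/Good, which depends on agreement.
_CODES = {
    ("good", "bad"): "b",
    ("bad", "good"): "c",
    ("good", "none"): "d",
    ("none", "good"): "e",
    ("bad", "bad"): "f",
    ("bad", "none"): "g",
    ("none", "bad"): "h",
    ("none", "none"): "j",
}

def quality_code(plus, minus, strands_agree=True):
    """Return the quality code letter for the +strand and -strand states."""
    if plus == "good" and minus == "good":
        # Good data on both strands: 'a' if they agree, 'i' if they disagree.
        return "a" if strands_agree else "i"
    return _CODES[(plus, minus)]

print(quality_code("bad", "good"))
print(quality_code("good", "good", strands_agree=False))
```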

1.9.3 List Consensus Conﬁdence

The Conﬁdence Value consensus algorithm (see Section 2.11.5.3 [Consensus Calculation

Using Conﬁdence Values], page 259) produces a consensus sequence for which the expected

error rate for each base is known. The option described here (which is available from the

gap4 View menu) uses this information to calculate the expected number of errors in a

particular consensus sequence and to tabulate them.

The decibel type scale introduced in the Phred program uses the formula

-10*log10(error rate) to produce confidence values for the base calls. A confidence value of

10 corresponds to an error rate of 1/10; 20 to 1/100; 30 to 1/1000; etc.

So for example, if 50 bases in the consensus had conﬁdence 10, we would expect those 50

bases (with an error rate of 1/10) to contain 5 errors; and if 200 bases had conﬁdence 20, we

would expect them to contain 2 errors. If these 50 bases with conﬁdence 10, and 200 bases

with conﬁdence 20 were the least accurate parts of the consensus, they are the bases which

we should check and edit ﬁrst. In so doing we would be dealing with the places most likely

to be wrong, and would raise the conﬁdence of the whole consensus. The output produced

by List Conﬁdence shows the eﬀect of working through all the lowest quality bases ﬁrst,

until the desired level of accuracy is reached. To do this it shows the cumulative number

of errors that would be ﬁxed by checking every consensus base with a conﬁdence value less

than a particular threshold.

The List Conﬁdence option is available from within the Commands menu of the Contig

Editor and the main gap4 View menu. From the main menu the dialogue simply allows

selection of one or more contigs. Pressing OK then produces a table similar to the following:

Sequence length = 164068 bases.
Expected errors = 168.80 bases (1/971 error rate).

Value  Frequencies  Expected   Cumulative  Cumulative  Cumulative
                      errors  frequencies      errors  error rate
--------------------------------------------------------------------------
    0            0      0.00            0        0.00       1/971
    1            1      0.79            1        0.79       1/976
    2            0      0.00            1        0.79       1/976
    3            3      1.50            4        2.30       1/985
    4           30     11.94           34       14.24      1/1061
    5            2      0.63           36       14.87      1/1065
    6          263     66.06          299       80.94      1/1867
    7          151     30.13          450      111.06      1/2841
    8          164     25.99          614      137.06      1/5168
    9           96     12.09          710      149.14      1/8344
   10           80      8.00          790      157.14     1/14069

The output above states that there are 164068 bases in the consensus sequence with an

expected 169 errors (giving an average error rate of one in 971). Next it lists each conﬁdence

value along with its frequency of occurrence and the expected number of errors (as explained

above, frequency x error rate). For any particular conﬁdence value the cumulative columns

state: how many bases in the sequence have the same or lower conﬁdence, how many errors

are expected in those bases, and the new error rate if all these bases were checked and all

the errors ﬁxed.
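
The columns of this table can be reproduced from the frequencies alone. The following Python sketch is an illustration rather than the gap4 implementation: the function name `confidence_table` and its arguments are made up, and it takes the total of 168.80 expected errors from the output above as given. Because that total is rounded to two decimal places, the last digits of the final error rates can differ slightly from the printed ones.

```python
def confidence_table(frequencies, seq_length, total_expected_errors):
    """Build rows of (value, freq, expected, cum_freq, cum_errors, one_in).

    'one_in' is N in the "1/N" error rate that would remain if every base
    with this confidence value or lower were checked and corrected.
    """
    rows = []
    cum_freq = 0
    cum_errors = 0.0
    for value in sorted(frequencies):
        freq = frequencies[value]
        expected = freq * 10 ** (-value / 10)   # frequency x error rate
        cum_freq += freq
        cum_errors += expected
        remaining = total_expected_errors - cum_errors
        one_in = int(seq_length / remaining) if remaining > 0 else None
        rows.append((value, freq, expected, cum_freq, cum_errors, one_in))
    return rows

# Frequencies for confidence values 0 to 10, copied from the table above.
freqs = {0: 0, 1: 1, 2: 0, 3: 3, 4: 30, 5: 2, 6: 263, 7: 151, 8: 164, 9: 96, 10: 80}
rows = confidence_table(freqs, seq_length=164068, total_expected_errors=168.80)

for value, freq, expected, cum_freq, cum_errors, one_in in rows:
    print(f"{value:5d} {freq:6d} {expected:8.2f} {cum_freq:6d} {cum_errors:8.2f}   1/{one_in}")
```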

Above it states that there are 790 bases with conﬁdence values of 10 or less, and estimates

there to be 157 errors in those 790 bases. As we expect there to be about 169 errors in the

whole consensus this implies that manually checking those 790 bases would leave only 12

undetected errors. Given that the sequence length is 164068 bases this means an average

error rate of 1 in 14069. It is important to note that by using this editing strategy, this

error rate would be achieved by checking only 0.48% of the total number of consensus bases.

This strategy is realised by use of the consensus quality search in the gap4 Contig Editor

(see Section 2.6.6.7 [Search by Consensus Quality], page 175).

1.9.4 List Base Conﬁdence

The various base-callers may produce a conﬁdence value for each base call. Previous sections

describe how this may be used to produce a consensus sequence along with a consensus

conﬁdence.

This function tabulates the frequency of each base confidence value along with a count
of how many times it matches or mismatches the consensus. Given that the standard scale
for confidence values follows the -10*log10(probability of error) formula we can determine
what the expected frequency of mismatches should be for any particular confidence value.
By comparing this with our observed frequencies we then have a powerful summary of the
amount of misassembled data.

Total bases considered : 45270
Problem score          : 1.337130

Conf.   Match  Mismatch  Expected  Over-
value    freq      freq      freq  representation
---------------------------------------------------------------------
    0       0         0      0.00  0.00
    1       0         0      0.00  0.00
    2       0         0      0.00  0.00
    3       0         0      0.00  0.00
    4      37        22     23.49  0.94
    5       0         0      0.00  0.00
    6      89        46     33.91  1.36
    7     119        26     28.93  0.90
    8     256        37     46.44  0.80
    9     368        30     50.11  0.60
   10     669        31     70.00  0.44
  ...

In the above example we see that there are 59 sequence bases with conﬁdence 4, of which

37 match the consensus and 22 do not. If we work on the assumption that the consensus

is correct then we would expect approximately 40% of these to be incorrect, but we have

measured 37% to be incorrect (22/59) giving 0.94 fraction of the expected amount.
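
The "Expected freq" and "Over-representation" columns follow directly from the match and mismatch counts. A minimal Python sketch, with the hypothetical helper name `over_representation`, assuming the expected mismatch count is the total number of bases multiplied by the error rate implied by the confidence value:

```python
def over_representation(confidence, match, mismatch):
    """Return (expected mismatches, observed / expected) for one confidence value."""
    total = match + mismatch
    expected = total * 10 ** (-confidence / 10)
    ratio = mismatch / expected if expected else 0.0
    return expected, ratio

# The confidence 4 row: 37 matches and 22 mismatches.
expected, ratio = over_representation(4, match=37, mismatch=22)
print(f"{expected:.2f} {ratio:.2f}")
```

For the confidence 4 row this gives an expected frequency of 23.49 and an over-representation of 0.94, as in the table.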

For a more problematic assembly, we may see a section of output like this:

Total bases considered : 1617511
Problem score          : 311.591358

Conf.   Match  Mismatch  Expected  Over-
value    freq      freq      freq  representation
---------------------------------------------------------------------
  ...
   20   13432       384    138.16  2.78
   21   23384       851    192.51  4.42
   22   18763       487    121.46  4.01
   23   13712       300     70.23  4.27
   24   21182       363     85.77  4.23
   25   20466       218     65.41  3.33
   26    9752       123     24.80  4.96
   27   23071       282     46.60  6.05
   28   13816       158     22.15  7.13
   29   27514       166     34.85  4.76
   30   15664       140     15.80  8.86
  ...

We can see here that the observed mismatch frequency is much greater than the expected
number. This indicates the number of misassemblies (or SNPs in the case of mixed samples)
within this project and is reflected by the combined "Problem score". This score is simply
the sum of the final column (or 1 over that column for values less than 1.0).

1.10 Other Miscellany

1.10.1 List Libraries

The List Libraries window is perhaps misnamed as it handles arbitrary groups of reads,
which may reflect the use of multiple libraries, multiple instrument types or simply multiple
lanes on a single instrument. For SAM/BAM files this information comes from the @RG
header lines. For other formats Gap5 typically makes use of the input filename to group
data together.

The basic plot shows a list of library names and how frequently read pairs have been

identiﬁed as matching to the same contig. This is computed at the time of import via

tg_index and so will not be updated on contig joining or breakage. The Type field indicates

the instrument platform type (for example Illumina or 454), although this is often absent

from the input BAM ﬁles.

The Insert size and standard deviation (s.d.) are derived from the sequence alignments,

with assumptions of an approximately Gaussian distribution. While not entirely accurate

this is typically suﬃcient for most libraries when viewed in a summary table. Finally the

Orientation ﬁeld indicates the relative orientation in which most of the read-pairs have been

assembled. This will be one of “-> <-”, “<- ->” or “-> -> / <- <-” to indicate the relative

orientations of the read-pair. Whether the observed orientation is correct will depend on

the particular sequencing strategy used.

Underneath the list is a histogram of observed insert sizes for the currently selected

library. The graph is currently very rudimentary with no controls, but it will auto-scale

to ﬁt the data. The example shown above is an Illumina large insert library showing two

distinct distributions with the smaller being where the biotin enrichment failed and short

templates were included in the library. (Note in this example the sequence orientations

have been ﬂipped so the bulk of the data is in the orientation expected by other tools.)

1.10.2 Results Manager

Some commands within Gap5 produce "results" that are updated automatically as data

is edited. The Results Manager provides a way to list these results, and to interact with

them.

A result is an abstract term used to deﬁne any collection of data. Typically this data can

be displayed, manipulated and is usually updated automatically when changes are made that

aﬀect it. Each set of matches from a particular search plotted on the Contig Comparator

(see Section 2.4 [Contig Comparator], page 126) is a result, as are entire displays such as

the Template Display.

The "results" window, shown above, can be invoked either from the View menu in the

main display or from the View menu of the Contig Comparator. Each result is listed in the

window on a separate line containing the time that the result was created (which may not

be the same as when it was last updated), the name of the function that created the result,

and the result number. The number is simply a unique identiﬁer to help distinguish two

results produced by the same function.

Each item in the list is consuming memory on your computer. Running functions over

and over again without removing the previous results will slow down your machine and it

will, eventually, run out of memory. Removing items from the list solves this.

Pressing the right mouse button over a listed item will display a popup menu of
operations that can be performed on this result. The operations available will always contain
"Remove", which will delete this result and shut down any associated window, but others
listed will depend on the result selected. In the illustration above the popup menu for the
"Repeat search" can be seen. Here the operations relate to a set of repeat matches currently
being displayed in the Contig Comparator (not shown).

The Contig Comparator functions ("Find internal joins", "Find read pairs", "Find
repeats", "Check assembly" and "Find Sequences") are all listed in the Results Manager
once per usage of the function. It is worth remembering that the only ways to completely
remove the plots from one of these functions are to use the "Remove" command within the
Results Manager or to use the "Clear" button within the Contig Comparator to remove all
plots.

1.10.3 Lists

For many operations it is convenient to be able to process sets of data together - for example
to calculate a consensus sequence for a subset of the contigs. To facilitate this Gap5 uses
lists.

Most Gap5 commands dealing with batches of files or sets of readings or contigs can
use either files of filenames or lists. When selecting list names from within dialogues the
"browse" button will display a window containing all the currently existing lists. To select
a list simply double click on the list name. Alternatively the name may simply be typed in.

The List menu on the main menubar contains commands to Edit, Create, Delete, Copy,

Load, and Save lists. Some of these display a list editor. This is simply a scrollable text

window supporting simple editing facilities (see Section 10.2.3 [Text Windows], page 524).

The "Clear" button clears the list. The "Ok" button removes the list editor window.

It is not necessary to use "Ok" here before supplying the list name for input to another

option.

1.10.3.1 Special List Names

Some lists are automatically updated or are generated on-the-fly as needed. The lists named
"contigs" and "readings" correspond to the currently selected contigs in the contig selector
window and the currently selected readings in the template displays. Note that lists (with
any names) can also be created from selected items in the contig editor. See Section 2.6.8.18
[Set Output List], page 186. The "allcontigs" and "allreadings" lists are created as needed
and always contain an identifier for every contig and every reading, respectively.

Because of the way the lists are implemented, as is outlined below, there are some useful

"tricks" that can be employed. A list name consisting of a contig identifier surrounded by

square brackets (’[’ and ’]’) will cause the creation of a list containing all of the readings

within that contig. For example, to use the Extract Readings option (see Section 2.12.7

[Extract Readings], page 273) to extract all the readings from contig ’xb54f8.s1’, the list

name given in the Extract Readings dialogue would be ’[xb54f8.s1]’.

A list name surrounded by curly brackets (’{’ and ’}’) will cause the creation of a list

containing all of the readings in the contigs named in the speciﬁed list name. So ’{contigs}’

is equivalent to all the readings in the contigs contained in the ’contigs’ list. Hence the

’allreadings’ list is identical to ’{allcontigs}’.

These tricks can be used anywhere where a list name is required except for editing and

deletion of lists. As a ﬁnal example, to produce a ﬁle of ﬁlenames for the currently selected

contigs, save the list named ’{contigs}’ to a ﬁle.
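
The bracket conventions described above can be sketched as a small name-resolution routine. This is a hypothetical model, not the program's own code: the dictionaries `lists` and `contig_readings`, the function `resolve_list`, and every reading name other than the contig identifier 'xb54f8.s1' are invented for illustration.

```python
def resolve_list(name, lists, contig_readings):
    """Expand a list name, honouring the [contig] and {list} forms.

    lists           -- dict mapping list name to its items (hypothetical store)
    contig_readings -- dict mapping contig identifier to its reading names
    """
    if name.startswith("[") and name.endswith("]"):
        # [contig]: all readings within that single contig.
        return list(contig_readings[name[1:-1]])
    if name.startswith("{") and name.endswith("}"):
        # {list}: all readings in every contig named in that list.
        readings = []
        for contig in lists[name[1:-1]]:
            readings.extend(contig_readings[contig])
        return readings
    return list(lists[name])

# Invented example data: two contigs and the readings they contain.
contig_readings = {
    "xb54f8.s1": ["xb54f8.s1", "xb56b6.s1"],
    "xb62h3.s1": ["xb62h3.s1"],
}
lists = {"contigs": ["xb54f8.s1"], "allcontigs": list(contig_readings)}
lists["allreadings"] = resolve_list("{allcontigs}", lists, contig_readings)

print(resolve_list("[xb54f8.s1]", lists, contig_readings))
print(resolve_list("{contigs}", lists, contig_readings))
```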

1.10.3.2 Basic List Commands

The basic operations that can be performed on lists include copying, loading, saving, editing,

creation and deletion. Joining and splitting can only be performed using the list editors

and using cut and paste between windows.

The Load and Save commands require a list name and a ﬁle name. If only the name of

the ﬁle is given the list is assumed to have the same name. If it is desired to load or save

a list from/to a ﬁle of a diﬀerent name then both should be speciﬁed. Creating a list that

already exists (or loading a ﬁle into an already existing list) is allowed, but will produce a

warning message.

The “Reading list” option controls whether the list to be loaded is a list of reading names

(which is normally the case). This will then turn on hyperlinking in any text views of this

list. Double-left clicking on an underlined reading name will bring up the contig editor

while right-clicking will bring up a command menu.

1.10.3.3 Contigs To Readings Command

This command produces a list or ﬁle of reading names for a single contig or for a set of

contigs. The user interface provides a dialogue to select the contigs and to select a list name

or ﬁlename.

1.10.3.4 Search Sequence Names

This command allows searching for sequences matching a preﬁx. The function produces both

a list in the text output window and a Gap5 "list" of reading names. The highlighted

output is clickable, with the left mouse button invoking the contig editor and the right mouse

button displaying a popup-menu allowing additional operations (contig editor, template

display, reading notes and contig notes).

All searches are case sensitive and preﬁx only.

2 Sequence assembly and ﬁnishing using Gap4

2.1 Organisation of the gap4 Manual

The main body of the gap4 manual is divided, where possible, into sections covering related

topics. If appropriate, these sections commence with an overview of the functions they

contain. After the Introduction, the manual contains chapters on some important components

of the user interface: the Contig Selector (see Section 2.3 [Contig Selector], page 123),

the Contig Comparator (see Section 2.4 [Contig Comparator], page 126), and then, in the

chapter on Contig Overviews (see Section 2.5 [Contig Overviews], page 130) we describe the

Template Display (see Section 2.5.1 [Template Display], page 130), and its subcomponents

the Stop Codon Plot (see Section 2.5.5 [Stop Codon Map], page 156), and the Restriction

Enzyme Plot (see Section 2.5.6 [Restriction Enzyme Search], page 157).

Then there is a long chapter on the powerful Contig Editor (see Section 2.6 [Editor

introduction], page 160), followed by a chapter describing the many assembly engines and

assembly modes which gap4 can oﬀer (see Section 2.7 [Assembly Introduction], page 205).

Gap4 contains functions to use the data in an assembly database to ﬁnd the left to

right order of contigs, and to compare their consensus sequences to look for joins that

may have been missed during assembly. A "read-pair" is obtained by sequencing a DNA

template (or "insert") from both ends: we then know the relative orientations of the two

readings, and if we know the approximate template length, we know how far apart they

should be after assembly. The next chapter is on the use of read-pair data for ordering

contigs and checking assemblies and on the use of consensus comparisons for ﬁnding joins

(see Section 2.8 [Ordering and Joining Contigs], page 217).

The next chapter is on checking assemblies and removing readings (see Section 2.9

[Checking Assemblies and Removing Readings], page 235). The following chapter describes

gap4’s methods for suggesting experiments for helping to ﬁnish a sequencing project (see

Section 2.10 [Finishing Experiments], page 241). Then we describe the various consensus

calculation algorithms, and the options for creating consensus sequence files (see Section 2.11.5

[The Consensus Calculation], page 257). Next is the description of a set of miscellaneous

functions (see Section 2.12 [Miscellaneous functions], page 265), followed by chapters on

the Results Manager (see Section 2.13 [Results Manager], page 277), Lists (see Section 2.14

[Lists Introduction], page 278), Notes (see Section 2.15 [Notes], page 281), Conﬁguring gap4

(see Section 2.20.1 [Options Menu], page 298), gap4 Database Files (see Section 2.16 [Gap

Database Files], page 284), Checking Databases for corruptions (see Section 2.18 [Check

Database], page 290) and Doctoring corrupted databases (see Section 2.19 [Doctor

database], page 293).

2.2 Introduction

Gap4 is a Genome Assembly Program. The program contains all the tools that would

be expected from an assembly program plus many unique features and a very easily used

interface. The original version was described in Bonfield,J.K., Smith,K.F. and Staden,R. A

new DNA sequence assembly program. Nucleic Acids Res. 23, 4992-4999 (1995).

Gap4 is very big and powerful. Everybody employs a subset of options and has their

favourite way of accessing and using them. Although there is a lot of it, users are encouraged

to go through the whole of the documentation once, just to discover what is possible, and

the way that best suits their own work. At the very least, the whole of this introductory

chapter should be read, as in the long run, it will save time.

This chapter serves as a cross reference point, to give an overview of the program and to

introduce some of the important ideas which it uses. The main topics that are introduced

are listed in the current section. We introduce the use of base call accuracy values for

speeding up sequencing projects (see Section 2.2.5 [The use of numerical estimates of base

calling accuracy], page 118). The ability to annotate segments of readings and the consensus

can be very convenient (see Section 2.2.7 [Annotating and masking readings and contigs],

page 121). Generally the 3’ ends of readings from sequencing instruments are of too low

a quality to be used to create reliable consensus, but they can be useful, for example, for

finding joins between contigs (see Section 2.2.6 [Use of the "hidden" poor quality data],

page 120).

One of the most powerful features of gap4 is its graphical user interface which enables

the data to be viewed and manipulated at several levels of resolution. The displays which

provide these diﬀerent views are introduced, with several screenshots (see Section 2.2.3

[Introduction to the gap4 User Interface], page 101).

It is important to understand the diﬀerent ﬁles used by our sequence assembly software,

and how the data is processed before it reaches gap4 (see Section 2.2.1 [Summary of the

Files used and the Preprocessing Steps], page 97).

Note that gap4 is a very ﬂexible program, and is designed so that it can easily be

conﬁgured to suit diﬀerent purposes and ways of working. For example it is easy to create

a beginner's version of gap4 which has only a subset of functions. What is described in this

manual is the full version, and so is likely to contain some perhaps more esoteric options

that few people will need to use. This introductory section also contains a complete list of

the options in the gap4 main menus (see Section 2.2.4 [Gap4 Menus], page 115).

In addition to sequence assembly, gap4 can be used for managing mutation study data

and for helping to discover and check for mutations (see Section 3.1 [Introduction to

Searching for Mutations], page 309).

Two further useful facilities of gap4 are "Lists" and "Notes". For many operations it is
convenient to be able to process sets of data together - for example to calculate a consensus
sequence for a subset of the contigs. To facilitate this gap4 uses lists (see Section 2.14 [Lists
Introduction], page 278). A 'Note' (see Section 2.15 [Notes], page 281) is an arbitrary piece
of text which can be attached to any reading, any contig, or to the database in general.

2.2.1 Summary of the Files used and the Preprocessing Steps

Gap4 stores the data for an assembly project in a gap4 database. Before being entered into

the gap4 database the data must be passed through several preassembly steps, usually via

pregap4 (see Section 4.2 [Pregap4 introduction], page 326). These steps are outlined below.

The programs can handle data produced by a variety of sequencing instruments. They

can also handle data entered using digitisers or that has been typed in by hand. Usually

the trace ﬁles in proprietary format, such as those of ABI, are converted to SCF ﬁles (see

Section 11.1 [SCF introduction], page 533) or ZTR files. Gap4 makes important use of
basecall confidence values (see Section 2.2.5 [The use of numerical estimates of base calling
accuracy], page 118), which are normally stored in the reading's SCF file. This approach was
originally put forward in Bonfield,J.K. and Staden,R. The application of numerical estimates
of base calling accuracy to DNA sequencing projects. Nucleic Acids Research 23, 1406-1410
(1995).

One of the ﬁrst steps in the preprocessing is to copy the base calls from the trace ﬁles

to text ﬁles known as Experiment ﬁles (see Section 11.3 [Experiment ﬁles], page 552). All

the subsequent processes operate on the Experiment ﬁles. Other preassembly steps include

quality and vector clipping. Each step is performed by a speciﬁc program controlled by the

program pregap4 (see Section 4.2 [Pregap4 introduction], page 326).

Experiment ﬁle format is similar to that of EMBL sequence entries in that each record

starts with a two letter identiﬁer, but we have invented new records speciﬁc to sequencing

experiments. One of pregap4’s tasks is to augment the Experiment ﬁles to include data

about the vectors, primers and templates used in the production of each reading, and if

necessary it can extract this information from external databases. Some of the information

is needed by pregap4 and some by gap4. (Note that in order to get the most from gap4 it is

essential to make sure that it is supplied, via the Experiment ﬁles, with all the information

it needs.)

The trace ﬁles are not altered, but are kept as archival data so that it is always possible

to check the original base calls and traces. Any changes to the data prior to assembly (and

we recommend that none are made until readings can be viewed aligned with others) are

made to the copy of the sequence in the Experiment ﬁle.

The reading data, in Experiment ﬁle format, is entered into the project database (see

Section 2.16 [Gap Database Files], page 284), usually via one of the assembly engines.

Because Experiment ﬁle format was based on EMBL ﬁle format, EMBL ﬁles can also be

entered and their feature tables will be converted to tags. There is no limit to the length of

readings which can be entered.

All the changes to the data made by gap4 are made to the copies of the data in the

project database. Once the data has been copied into the gap4 database the Experiment

ﬁles are no longer required.

Gap4 uses the trace ﬁles to display the traces (see Section 2.6.11 [Traces], page 188),

and to compare the edited bases with the original base calls (see Section 2.6.6.11 [Search by

Evidence for Edit (1)], page 176), (see Section 2.6.6.12 [Search by Evidence for Edit (2)],

page 176). However gap4 databases do not store trace ﬁles: they record only the names

of the trace ﬁles (which are copied from the readings’ Experiment ﬁles). This means that

if the trace ﬁles for a project are not in the same directory/folder as the gap4 database,

gap4 needs to be told where they are, otherwise it cannot use them. Ideally, all the trace

ﬁles for a project should be stored in one directory. To tell gap4 where they are the "Trace

file location" command in the Options menu should be used (see Section 2.20.8 [Trace File

Location], page 302).

Gap4 databases have a number of size constraints, some of which can be altered by users

and others which are ﬁxed.

While gap4 is running it often needs to calculate a consensus. The maximum size of

this sequence is controlled by a variable "maxseq". Most routines are able to automatically

increase the value of maxseq while they are running, but some of the older functions,

including some of the original assembly engines, are not. This means that it is important

for users to set maxseq to a sufficiently high value before running these older routines.

By default maxseq is currently set to 100000, but users can set it on the command line or

from within the Options menu.

Gap4 databases contain one record for each reading and one for each contig. The sum of

these two sets of records is the "database size", and the maximum value that database size

is permitted to reach is "maxdb". When databases are initialised maxdb is set, by default,

to 8000. Users can alter this value on the command line or from within the Options menu

of gap4.

Gap4 databases also limit the number and names of readings so that various output

routines know how many character positions are required: the maximum number imposed

in this way is 99,999,999, and the maximum reading name length is 40.
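
The limits described above can be summarised as a small pre-flight check. The sketch below is only an illustration of the stated rules (database size is the number of readings plus the number of contigs, compared against maxdb, and reading names are limited to 40 characters); the function `check_limits` and its arguments are hypothetical and are not part of gap4.

```python
MAX_NAME_LENGTH = 40   # maximum reading name length quoted above

def check_limits(n_readings, n_contigs, reading_names, maxdb=8000):
    """Return a list of human-readable problems; an empty list means all is well."""
    problems = []
    database_size = n_readings + n_contigs   # one record per reading and per contig
    if database_size > maxdb:
        problems.append(f"database size {database_size} exceeds maxdb {maxdb}")
    for name in reading_names:
        if len(name) > MAX_NAME_LENGTH:
            problems.append(f"reading name longer than {MAX_NAME_LENGTH} characters: {name}")
    return problems

# 7990 readings in 20 contigs would need maxdb to be raised above the default.
print(check_limits(7990, 20, ["xb54f8.s1"]))
```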

Currently we have sites with single gap4 databases containing over 200,000 readings with

consensus sequences in excess of 7,000,000 bases.

A gap4 database can be used by several users simultaneously, but only one is allowed
to change the contents of the database, and the others are given "readonly" access. As
part of its mechanism to prevent more than one person editing a database at once gap4
uses a "BUSY" file to signify that the database is opened for writing. Before opening a
database for writing, gap4 checks to see if the BUSY file for that database exists. If it does,
the database is opened only for reading; if not, gap4 creates the file, so that any additional
attempts to open the database for writing will be blocked. When the user with write access
closes the database, the BUSY file is deleted, allowing the database to be opened for
changes again. It is worth remembering that a side effect of this mechanism is that in the
event of a program or system crash the BUSY file will be left on the disk, even though the
database is not being used. In this case users must remove the BUSY file before using the
database (see Section 2.16 [Gap4 Database Files], page 284).
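
The BUSY file mechanism is a conventional lock-file scheme, which can be sketched as follows. This is not gap4's code: the function names and the idea of passing the BUSY file path in directly are assumptions for the example. Note also that the sketch creates the file with an atomic "create only if absent" call rather than a separate check followed by a create, which is a common way to implement the same idea.

```python
import os

def open_database(busy_path):
    """Return "write" if we created the BUSY file, otherwise "readonly"."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(busy_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return "readonly"
    os.close(fd)
    return "write"

def close_database(busy_path, mode):
    """Release write access by deleting the BUSY file we created."""
    if mode == "write":
        os.remove(busy_path)
```

A BUSY file left behind by a crash makes every later call return "readonly" until the file is removed by hand, which is the side effect described above.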

The ﬁnal result from a sequencing project is a consensus sequence (see Section 2.11.5

[The Consensus Calculation], page 257) and gap4 can write these in Experiment ﬁle format,

fasta format or staden format. Of course the whole database and all the trace ﬁles are also

useful for future reference as they allow any queries about the accuracy of the sequence to

be answered.

2.2.2 Summary of Gap4’s Functions

The tasks which gap4 can perform can be roughly divided into assembly (see Section 2.7

[Assembly Introduction], page 205), ﬁnishing (see Section 2.10 [Finishing Experiments],

page 241), and editing (see Section 2.6 [Editor introduction], page 160). But gap4 contains

many other functions which can help to complete a sequencing project with the minimum

amount of eﬀort, and some of these are listed below.

Readings are entered into the gap4 database using the assembly algorithms (see

Section 2.7 [Assembly Introduction], page 205). In general these algorithms will build

the largest contigs they can by ﬁnding overlaps between the readings, however some,

perhaps more doubtful, joins between contigs may be missed, and these can be discovered,

checked and made using Find Internal Joins (see Section 2.8.3 [Find Internal Joins],

page 227), Find repeats (see Section 2.8.4 [Find repeats], page 233) and Join Contigs

(see Section 2.6.15 [The Join Editor], page 196). Find Internal Joins compares the ends

of contigs to see if there are possible overlaps and then presents the overlap in the Contig

Joining Editor, from where the user can view the traces, make edits and join the contigs.

Find Repeats can be used in a similar way, but unlike Find Internal Joins it does not

require the matches it ﬁnds to continue to the ends of contigs.

Read-pair data can be used to automatically put contigs into the correct order (see

Section 2.8.1 [Ordering Contigs], page 219), and information about contigs which share

templates can be plotted out (see Section 2.8.2 [Find Read Pairs], page 222). The

relationships of readings and templates, within and between contigs, can also be shown by

the Template Display (see Section 2.5.1 [Template Display], page 130) which has a wide

selection of display modes and uses.

Problems with the assembly can be revealed by use of Check Assembly (see Section 2.9

[Checking Assemblies], page 236), Find repeats (see Section 2.8.4 [Find repeats], page 233),

and Restriction Enzyme mapping (see Section 2.5.6 [Plotting Restriction Enzymes],

page 157). Check Assembly compares every reading with the segment of the consensus it

overlaps to see how well it aligns. Those that align poorly are plotted out in the Contig

Comparator. Find Repeats also presents its results in the Contig Comparator, so if used in

conjunction with Check Assembly, it can show cases where readings have been assembled

into the wrong copy of a repeated element. At the end of a project the Restriction Enzyme

map function can be used to compare the consensus sequence with a restriction digest of

the target sequence. Problems can also be found by use of the various Coverage Plots

available in the Consistency Display (see Section 2.5.2 [Consistency Display], page 140).

These plots will show regions of low or high reading coverage (see Section 2.5.2.2 [Reading

Coverage Histogram], page 142), places with data for only one strand (see Section 2.5.2.4

[Strand Coverage], page 144), or where there is no read-pair coverage (see Section 2.5.2.3

[Read-Pair Coverage Histogram], page 143). Errors can be corrected by Disassemble

Readings (see Section 2.9.1.2 [Disassembling Readings], page 240) and Break Contig (see

Section 2.9.1.1 [Breaking Contigs], page 239) which can remove readings from contigs or

databases or can break contigs.
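
The idea behind the reading and strand coverage plots can be sketched in a few lines. This is only an illustration, not gap4's implementation: the function names, the tuple layout `(start, end, strand)` and the 1-based inclusive coordinates are all assumptions made for the example.

```python
def coverage_tracks(reads, contig_length):
    """Count readings covering each base, separately for each strand.

    reads: iterable of (start, end, strand) tuples, with 1-based inclusive
    coordinates and strand given as "+" or "-" (a hypothetical layout).
    Returns (depth, plus, minus), three lists of length contig_length.
    """
    plus = [0] * contig_length
    minus = [0] * contig_length
    for start, end, strand in reads:
        track = plus if strand == "+" else minus
        for pos in range(start - 1, end):
            track[pos] += 1
    depth = [p + m for p, m in zip(plus, minus)]
    return depth, plus, minus


def single_strand_positions(plus, minus):
    """Return 1-based positions covered by data from exactly one strand."""
    return [
        i + 1
        for i, (p, m) in enumerate(zip(plus, minus))
        if (p == 0) != (m == 0)
    ]


if __name__ == "__main__":
    depth, plus, minus = coverage_tracks([(1, 5, "+"), (3, 8, "-")], 10)
    print(depth)
    print(single_strand_positions(plus, minus))
```

Positions where `depth` is zero have no reading coverage at all; positions returned by `single_strand_positions` are the kind of place a strand coverage plot would draw attention to.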

The general level of completeness of the consensus sequence can be seen diagrammatically

using the Quality Plot (see Section 2.5.1.5 [Quality Plot], page 137), and the conﬁdence

100 The Staden Package Manual

values for each base in the consensus sequence can be plotted (see Section 2.5.2.1 [Conﬁdence

Values Graph], page 142).

The most powerful component of gap4 is its Contig Editor (see Section 2.6 [Editor

introduction], page 160), which has many display modes and search facilities to enable very

rapid discovery and ﬁxing of base call errors.

If working on a protein coding sequence, the consensus can be analysed using the Stop

Codon Map (see Section 2.5.5 [Stop Codon Map], page 156), and its translation viewed

using the Contig Editor (see Section 2.6.8.1 [Status Line], page 180).

The ﬁnal result from a sequencing project is a consensus sequence (see Section 2.11.5

[The Consensus Calculation], page 257).

Chapter 2: Sequence assembly and ﬁnishing using Gap4 101

2.2.3 Introduction to the gap4 User Interface

Gap4 has a main window from which all the main options are selected from menus. When a

database is open it also has a Contig Selector which will transform into a Contig Comparator

whenever needed. In addition many of the gap4 functions, such as the Contig Editor or the

Template Display will create their own windows when they are activated. All the graphical

displays and the Contig Editor can be scrolled in register. The base of the graphical display

windows usually contains an Information Line for showing short textual data about results

or items touched by the mouse cursor. Gap4 is best operated using a three button mouse,

but alternative keybindings are available. Full details of the user interface are described

elsewhere (see [User Interface], page 523), and here we give an introduction based around

a series of screenshots.

The main window (shown below) contains an Output window for textual results, an Error

window for error messages, and a series of menus arranged along the top. The contents of

the two text windows can be searched, edited and saved. Each set of results is preceded by

a header containing the time and date when it was generated.

Some of the text will be underlined and shaded differently. These are hyperlinks which

perform an operation when clicked with the left mouse button, typically invoking a

graphical display such as the Contig Editor. Clicking on these with the right mouse button

will bring up a menu of additional operations. At present only a few commands (Show

Relationships and the Search functions) produce hypertext, but if there is suﬃcient interest

this may be expanded on.

2.2.3.1 Introduction to the Contig Selector

The gap4 Contig Selector is used to display, select and reorder contigs. In the Contig

Selector all contigs are shown as colinear horizontal lines separated by short vertical lines.

The length of the horizontal lines is proportional to the length of the contigs and their

left to right order represents the current ordering of the contigs. Users can change the

contig order by dragging the lines representing the contigs. This is done by clicking and

holding the middle mouse button, or Alt left mouse button, on a line and then moving

the mouse cursor. The Contig Selector can also be used to select contigs for processing.

For example, clicking with the right mouse button on the line representing a contig will

invoke a menu containing the commands which can be performed on that contig. There

are several alternative ways of specifying which contig an operation should be performed

on. Contigs are identiﬁed by the name or number of any reading they contain. When a

dialogue is requesting a contig name, using the left mouse button to click on the contig in

the Contig Selector will transfer its name to the dialogue box. Other methods are available

(see Section 2.3.1 [Selecting Contigs], page 123).

As the mouse is moved over a contig, it is highlighted and the contig name (left most

reading name) and length are displayed in the Information Line. The number in brackets

is the contig number (actually the number of its leftmost reading). Tags or annotations

(see Section 2.2.7 [Annotating and masking readings and contigs], page 121) can also be

displayed in the Contig Selector window.

The ﬁgure shows a typical display from the Contig Selector. At the top are the File, View

and Results menus. Below that are buttons for zooming and for displaying the crosshair.

The four boxes to the right are used to display the X and Y coordinates of the crosshair. The

rightmost two display the Y coordinates when the contig selector is transformed into the

Contig Comparator. The two leftmost boxes display the X coordinates: the leftmost is the

position in the contig and the other is the position in the overall consensus. The crosshair

is the vertical line spanning the panel below. Tags are shown as coloured rectangles above

and below the lines (see Section 2.3 [Contig Selector], page 123).

2.2.3.2 Introduction to the Contig Comparator

Gap4 commands such as Find Internal Joins (see Section 2.8.3 [Find Internal Joins],

page 227), Find Repeats (see Section 2.8.4 [Find Repeats], page 233), Check Assembly

(see Section 2.9 [Check Assembly], page 236), and Find Read Pairs (see Section 2.8.2 [Find

Read Pairs], page 222) automatically transform the Contig Selector (see Section 2.3 [Contig

Selector], page 123) to produce the Contig Comparator. To produce this transformation a

copy of the Contig Selector is added at right angles to the original window to create a two

dimensional rectangular surface on which to display the results of comparing or checking

contigs.

Each of the functions plots its results as diagonal lines of diﬀerent colours. In general, if

the plotted points are close to the main diagonal they represent results from pairs of contigs

that are in the correct relative order. Lines parallel to the main diagonal represent contigs

that are in the correct relative orientation to one another. Those perpendicular to the

main diagonal show results for which one contig would need to be reversed before the pair

could be joined. The manual contig dragging procedure can be used to change the relative

positions of contigs. See Section 2.3.2 [Changing the Contig Order], page 125. As the

contigs are dragged the plotted results will automatically be moved to their corresponding

new positions. This means that, in general, if users drag the contigs to move their plotted

results close to the main diagonal they will simultaneously be putting their contigs into the

correct relative positions.
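
The effect of contig order on the plot can be illustrated with a small sketch. It assumes, for illustration only, that each result is plotted at the position of its match within the overall consensus, which is the concatenation of the contigs in their current order. The function names and the example contig lengths are hypothetical and are not taken from gap4.

```python
def contig_offsets(order, lengths):
    """Return the start offset of each contig in the concatenated consensus."""
    offsets = {}
    total = 0
    for name in order:
        offsets[name] = total
        total += lengths[name]
    return offsets


def plot_point(offsets, contig_x, pos_x, contig_y, pos_y):
    """Map a match between two contigs to (x, y) plot coordinates."""
    return offsets[contig_x] + pos_x, offsets[contig_y] + pos_y


def distance_from_diagonal(point):
    """Horizontal distance of a plotted point from the main diagonal."""
    x, y = point
    return abs(x - y)


if __name__ == "__main__":
    lengths = {"A": 1000, "B": 500, "C": 2000}
    # A match between the right end of A and the left end of C.
    for order in (["A", "B", "C"], ["A", "C", "B"]):
        point = plot_point(contig_offsets(order, lengths), "A", 990, "C", 10)
        print(order, point, distance_from_diagonal(point))
```

With the order A, B, C the match lies 520 bases from the diagonal; after dragging C next to A it lies only 20 bases away, which is the behaviour described above.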

This plot can simultaneously show the results of independent types of search, making it

easy for users to see if diﬀerent analyses produce corroborating evidence for the ordering of

contigs. Indications that a reading may have been assembled in an incorrect position can

also be seen - if for example a result from Check Assembly lies on the same horizontal or

vertical projection as a result from Find Repeats, users can see the alternative position to

place the doubtful reading.

The plotted results can be used to invoke a subset of commands by the use of pop-up

menus. For example if the user clicks the right mouse button over a result from Find

Internal Joins a menu containing Invoke Join Editor (see Section 2.6.15 [The Join Editor],

page 196) and Invoke Contig Editors (see Section 2.6 [Editing in gap4], page 160) will pop

up. If the user selects Invoke Join Editor the Join Editor will be started with the two contigs

aligned at the match position contained in the result. If required one of the contigs will be

complemented to allow their alignment.

A typical display from the Contig Comparator is shown above. It includes results for

Find Internal Joins in black, Find Repeats in red, Check Assembly in green, and Find Read

Pairs in blue. Notice that there are several internal joins, read pairs and repeats close to the

main diagonal near the top left of the display. This indicates that the contigs represented in

that area are likely to be in the correct positions relative to one another. In the middle of

the bottom right quadrant there is a blue diagonal line perpendicular to the main diagonal.

This indicates a pair of contigs that are in the wrong relative orientation. The crosshairs

show the positions for a pair of contigs. The vertical line continues into the Contig Selector

part of the display, and the position represented by the horizontal line is also duplicated

there (see Section 2.4 [Contig Comparator], page 126).

2.2.3.3 Introduction to the Template Display

The Template Display can show schematic plots of readings, templates, tags, restriction

enzyme sites and the consensus quality. Colour coding distinguishes reading, primer and

template types. The Template Display can also be used to reorder contigs and to invoke

the Contig Editor.

An example showing all these information types can be seen in the Figure below.

The large top section contains lines and arrows representing readings and templates.

Beneath this are rulers; one for each contig, and below those is the quality plot. The

template and reading section of the display is in two parts. The top part contains the

templates which have been sequenced from both ends but which are in some way inconsistent

- for example given the current relative positions of their readings, they may have a length

that is larger or smaller than that expected, or the two readings may, as it were, face

away from one another. Colour coding is used to distinguish between diﬀerent types of

inconsistency, and whether or not the inconsistency involves readings within or between

contigs. For example, most of the problems shown in the screendump above are coloured

dark yellow, indicating an inconsistency between a pair of contigs. The rest of the data,

(mostly dark blue indicating templates sequenced from only one end), is plotted below the

data for the inconsistent templates. Forward readings are light blue and reverse readings

are orange. Templates in bright yellow have been sequenced from both ends, are consistent

and span a pair of contigs (and so indicate the relative orientation and separation of the

contigs).
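
The two kinds of inconsistency mentioned above (an unexpected template length, and readings that face away from one another) can be sketched as a simple check. This is a minimal illustration for two readings in the same contig; the `Reading` class, the function name and the size limits are assumptions made for the example, and gap4's own template status codes are more detailed.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    start: int      # leftmost base position in the contig
    end: int        # rightmost base position in the contig
    forward: bool   # True if the reading runs left to right


def template_problems(a, b, min_len, max_len):
    """List the inconsistencies implied by two readings from one template."""
    problems = []
    left, right = (a, b) if a.start <= b.start else (b, a)
    if left.forward == right.forward:
        problems.append("readings have the same orientation")
    elif not left.forward:
        # The left reading points left and the right reading points right.
        problems.append("readings face away from each other")
    length = max(a.end, b.end) - min(a.start, b.start) + 1
    if length < min_len:
        problems.append("template shorter than expected")
    elif length > max_len:
        problems.append("template longer than expected")
    return problems


if __name__ == "__main__":
    fwd = Reading(100, 500, True)
    rev = Reading(1800, 2200, False)
    print(template_problems(fwd, rev, 1000, 3000))
    print(template_problems(fwd, rev, 1000, 1500))
```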

At the bottom is the restriction enzyme plot. The coloured blocks immediately above

and below the ruler are tags. Those above the ruler can also be seen on their corresponding

readings in the large top section. The display can be zoomed. The position of a crosshair

is shown in the two left most boxes in the top right hand corner. The leftmost shows the

distance in bases between the crosshair and the start of the contig underneath the crosshair.

The middle box shows the distance between the crosshair and the start of the ﬁrst contig.

The right box shows the distance between two selected cut sites in the restriction enzyme

plots (see Section 2.5.1 [Template Display], page 130).

2.2.3.4 Introduction to the Consistency Display

The Consistency Display provides plots designed to highlight potential problems in contigs.

It is invoked from the main gap4 View menu by selecting any of its plots. Once a plot has

been displayed, any of the other types of consistency plot can be displayed within the same

frame from the View menu of the Consistency Display.

An example showing the Confidence Values Graph and the corresponding Reading

Coverage Histogram, Read-Pair Coverage Histogram and Strand Coverage is shown below.

If more than one contig is displayed, the contigs are drawn immediately after one another

but are staggered in the y direction.

The ruler ticks can be turned on or off from the View menu of the Consistency

Display. The plots can be enlarged or reduced using the standard zooming mechanism. See

Section 10.5.1 [Zooming], page 528.

The crosshair toggle button controls whether the crosshair is visible. This is shown as a

black vertical and horizontal line. The position of the crosshair is shown in the 3 boxes to

the right of the crosshair toggle. The ﬁrst box indicates the cursor position in the current

contig. The second box indicates the overall position of the cursor in the consensus. The

last box shows the y position of the crosshair (see Section 2.5.2 [Consistency Display],

page 140).

2.2.3.5 Introduction to the Restriction Enzyme Map

The restriction enzyme map function ﬁnds and displays restriction sites within a speciﬁed

region of a contig. Users can select the enzyme types to search for and can save the sites

found as tags within the database.
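
As a rough illustration of what such a search involves, the sketch below looks for exact occurrences of recognition sequences in a stretch of sequence. It is not the gap4 algorithm: the function names are hypothetical, only exact unambiguous recognition sequences are handled, and the reported positions are 0-based starts of the recognition site rather than cut positions. The two recognition sequences used (GAATTC for EcoRI, GGGCCC for ApaI) are palindromic, so searching one strand is sufficient for them.

```python
def find_sites(sequence, recognition):
    """Return 0-based start positions of every occurrence of a site."""
    sequence = sequence.upper()
    recognition = recognition.upper()
    positions = []
    start = sequence.find(recognition)
    while start != -1:
        positions.append(start)
        # Step by one base so that overlapping sites are also found.
        start = sequence.find(recognition, start + 1)
    return positions


def restriction_map(sequence, enzymes):
    """Map each enzyme name to its site positions (empty for zero cutters)."""
    return {name: find_sites(sequence, site) for name, site in enzymes.items()}


if __name__ == "__main__":
    enzymes = {"EcoRI": "GAATTC", "ApaI": "GGGCCC"}
    sites = restriction_map("AAGAATTCTTTTGAATTCAA", enzymes)
    print(sites)
    # Separation between the two EcoRI sites, in bases.
    print(sites["EcoRI"][1] - sites["EcoRI"][0])
```

Enzymes with no sites keep an empty list in the result, in the same spirit as the empty rows kept for zero cutters in the display.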

This ﬁgure shows a typical view of the Restriction Enzyme Map in which the results for

each enzyme type have been conﬁgured by the user to be drawn in diﬀerent colours. On the

left of the display the enzyme names are shown adjacent to their rows of plotted results.

If no result is found for any particular enzyme (e.g. here APAI), the row will still be shown

so that zero cutters can be identified. Three of the enzyme types have been selected and

are shown highlighted. The results can be scrolled vertically (and horizontally if the plot is

zoomed in). A ruler is shown along the base and the current cursor position (the vertical

black line) is shown in the left hand box near the top right of the display. If the user clicks,

in turn, on two restriction sites their separation in base pairs will appear in the top right

hand box. Information about the last site touched is shown in the Information line at the

bottom of the display. At the top the edit menu is shown and can be used to create tags

for highlighted enzyme types (see Section 2.5.6 [Restriction Enzyme Search], page 157).

2.2.3.6 Introduction to the Stop Codon Map

The Stop Codon Map plots the positions of all the stop codons on one or both strands of

a contig consensus sequence. If the Contig Editor is being used on the same contig, the

Refresh button will be enabled, and if used, will fetch the current consensus from the editor,

repeat the search and replot the stop codons.
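
The search itself is simple to sketch. The example below assumes the standard stop codons (TAA, TAG and TGA), whereas gap4 allows the genetic code to be changed; the function names and the frame labels ("+1" to "-3") are hypothetical. Positions are 0-based and, for the reverse strand, are reported as the leftmost base of the codon in forward-strand coordinates.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse and complement a DNA sequence (only A, C, G, T are mapped)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def stop_codons_in_frame(seq, frame):
    """Return 0-based start positions of stop codons in frame 0, 1 or 2."""
    seq = seq.upper()
    return [
        i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in STOP_CODONS
    ]


def six_frame_stops(seq):
    """Return stop codon positions for all six reading frames."""
    result = {}
    rc = reverse_complement(seq)
    n = len(seq)
    for frame in range(3):
        result[f"+{frame + 1}"] = stop_codons_in_frame(seq, frame)
        # A codon starting at i on the reverse strand covers forward
        # positions n - i - 3 to n - i - 1.
        result[f"-{frame + 1}"] = sorted(
            n - i - 3 for i in stop_codons_in_frame(rc, frame)
        )
    return result


if __name__ == "__main__":
    for label, positions in six_frame_stops("ATGTAACCC").items():
        print(label, positions)
```

Re-running `six_frame_stops` on an updated consensus corresponds to what the Refresh button does after edits have been made.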

The ﬁgure shows a typical zoomed in view of the Stop Codon Map display. The positions

for the stop codons in each reading frame (here all six frames are shown) are displayed in

horizontal strips. Along the top are buttons for zooming, the crosshair toggle, a refresh

button and two boxes for showing the crosshair position. The left box shows the current

position and the right-hand box the separation of the last two stop codons selected by the

user. Below the display of stop codons is a ruler and a horizontal scrollbar. The information

line is showing the data for the last stop codon the user has touched with the cursor. Also

shown on the left is the View menu which is used to select the reading frames to display

(see Section 2.5.5 [Stop Codon Map], page 156).

2.2.3.7 Introduction to the Contig Editor

The gap4 Contig Editor is designed to allow rapid checking and editing of characters in

assembled readings. Very large savings in time can be achieved by its sophisticated

problem-finding procedures which automatically direct the user only to the bases that require

attention. The following is a selection of screenshots to give an overview of its use.

The ﬁgure above shows a screendump from the Contig Editor which contains segments

of aligned readings, their consensus and a six phase translation. The Commands menu is

also shown. The main components are: the controls at the top; reading names on the left;

sequences to their right; and status lines at the bottom. Some of the reading names are

written in light grey which indicates that their traces/chromatograms are being displayed

(in another window, see below).

One reading name is written with inverse colours, which indicates that it has been

selected by the user. To the left of each reading name is the reading number, which is

negative for readings which have been reversed and complemented. The ﬁrst of the status

lines, labelled "Strands", is showing a summary of strand coverage. The left half of the

segment of sequence being displayed is covered only by readings from one strand of the

DNA, but the right half contains data from both strands.

Along the top of the editor window is a row of command buttons and menus. The

rightmost pair of buttons provide help and exit. To their left are two menus, one of which

is currently in use. To the left of this is a button which initially displays a search dialogue

and, when pressed again, performs the selected search. Further left is the undo button:

each time the user clicks on this box the program reverses the previous edit command. The

next button, labelled "Cutoffs", is used to toggle between showing and hiding the reading

data that is of poor quality or is vector sequence. In this ﬁgure it has been activated,

showing the poor quality data in light grey. Within this, sequencing vector is displayed in

lilac. The next button to the left is the Edit Modes menu which allows users to select which

editing commands are enabled. The next command toggles between insert and replace and

so governs the eﬀect of typing in the edit window.

One of the readings contains a yellow tag, and elsewhere some bases are coloured red,

which indicates they are of poor quality. The Information Line at the bottom of the window

can show information about readings, annotations and base calls. In this case it is showing

information about the reliability of the base beneath the editing cursor.

A better way of displaying the accuracy of bases is to shade their surroundings so that

the lighter the background the better the data. In the ﬁgure above, this grey scale encoding

of the base accuracy or confidence has been activated for bases in the readings and the

consensus. This screenshot also shows the Contig Editor displaying disagreements and edits.

Disagreements between the consensus and individual base calls are shown in dark green.

Notice that these disagreements are in poor quality base calls. Edits (here they are all pads)

are shown with a light green background. When they are present, replacements/insertions

are shown in pink, deletions in red and conﬁdence value changes in purple. The consensus

confidence takes into account several factors, including individual base confidences,

sequencing chemistry, and strand coverage. It can be seen that the consensus for the section covered

by data from only one strand has been calculated to be of lower conﬁdence than the rest.

The Status Line includes two positions marked with exclamation marks (!) which means

that the sequence is covered by data from both strands, but that the consensus for each of

the two strands is diﬀerent. The Information Line at the bottom of the window is showing

information about the reading under the cursor: its name, number, clipped length, full

length, sequencing vector and BAC clone name.

The Contig Editor can rapidly display the traces for any reading or set of readings. The

number of rows and columns of traces displayed can be set by the user. The traces scroll in

register with the Contig Editor cursor, and the Editor cursor can be scrolled by the trace cursor. A typical view is shown above.

This ﬁgure is an example of the Trace Display showing three traces from readings in

the previous two Contig Editor screendumps. These are the best trace from each

strand plus a trace from a reading which contains a disagreement with the consensus. The

program can be conﬁgured to automatically bring up this combination of traces for each

problem located by the "Next search" option. The histogram or vertical bars plotted top

down show the conﬁdence value for each base call. The reading number, together with the

direction of the reading (+ or -) and the chemistry by which it was determined, is given

at the top left of each sub window. There are three buttons (’Info’, ’Diﬀ’, and ’Quit’)

arranged vertically with X and Y scale bars to their right. The Info button produces a

window like the one shown in the bottom right hand corner. The Diﬀ button is mostly used

for mutation detection, and causes a pair of traces to be subtracted from one another and

the result plotted, hence revealing their differences (see Section 2.6.11 [Traces], page 188).

2.2.3.8 Introduction to the Contig Joining Editor

Contigs are joined interactively using the Join Editor. This is simply a pair of contig editor

displays stacked one above the other with a "differences" line in between. The Contig Join

Editor is usually invoked by clicking on a Find Internal Joins or Find Repeats result in the

Contig Comparator, in which case the two contigs will appear with the match found by

these searches displayed.

The few diﬀerences between the Join Editor and the Contig Editor can be seen in the

ﬁgure below. Otherwise all the commands and operations are the same as those for the

Contig Editor.

In this ﬁgure the Cutoﬀ or Hidden data is being displayed for the right hand contig. One

diﬀerence between the Contig Editor and the Join Editor is the Lock button. When set

(as it is in the illustration) the two contigs scroll in register, otherwise they can be scrolled

independently.

The Align button aligns the overlapping consensus sequences (see Section 2.6.15 [Editor

joining], page 196).

2.2.4 Gap4 Menus

The main window for gap4 contains File, Edit, View, Options, Experiments, Lists and

Assembly menus.

2.2.4.1 Gap4 File menu

The File menu includes database opening and copying functions and consensus calculation

options.

•Change Directory (see Section 2.16.1 [Directories], page 284)

•Check Database (see Section 2.18 [Check Database], page 290)

•New (see Section 2.16.2 [Opening a New Database], page 285)

•Open (see Section 2.16.3 [Opening an Existing Database], page 285)

•Copy Database (see Section 2.16.4 [Making Backups of Databases], page 285)

•Copy Readings (see Section 2.17.1 [Copying Readings], page 287)

•Save Consensus (see Section 2.11.5 [The Consensus Calculation], page 257)

•Extract Readings (see Section 2.12.7 [Extract Readings], page 273)

2.2.4.2 Gap4 Edit menu

The Edit menu contains options that alter the contents of the database.

•Edit Contig (see Section 2.6 [Editor introduction], page 160)

•Join Contigs (see Section 2.6.15 [Editor joining], page 196)

•Save Contig Order (see Section 2.8.1 [Order Contigs], page 219)

•Break Contig (see Section 2.9.1.1 [Break Contig], page 239)

•Complement a Contig (see Section 2.12.1 [Complement a Contig], page 265)

•Order Contigs (see Section 2.8.1 [Order Contigs], page 219)

•Quality Clip (see Section 2.12.8.2 [Quality Clipping], page 275)

•Quality Clip Ends (see Section 2.12.8.3 [Quality Clip Ends], page 275)

•Diﬀerence Clip (see Section 2.12.8.1 [Diﬀerence Clipping], page 274)

•N-Base Clip (see Section 2.12.8.4 [N-Base Clipping], page 276)

•Double Strand (see Section 2.10.1 [Double Strand], page 241)

•Disassemble Readings (see Section 2.9.1.2 [Disassembling Readings], page 240)

•Enter Tags (see Section 2.12.2 [Enter Tags], page 265)

•Edit Notebooks (see Section 2.15 [Notes], page 281)

•Doctor Database (see Section 2.19 [Doctor database], page 293)

2.2.4.3 Gap4 View menu

The View menu contains options to look at the data at several levels of detail, and analytic

functions which present their results graphically.

•Contig Selector (see Section 2.3 [Contig Selector], page 123)

•Results Manager (see Section 2.13 [Results Manager], page 277)

•Find Internal Joins (see Section 2.8.3 [Find Internal Joins], page 227)

•Find Read Pairs (see Section 2.8.2 [Find Read Pairs], page 222)

•Find Repeats (see Section 2.8.4 [Find repeats], page 233)

•Check Assembly (see Section 2.9 [Check Assembly], page 236)

•Sequence Search (see Section 2.12.6 [Find Oligos], page 271)

•Template Display (see Section 2.5.1 [Template Display], page 130)

•Show Relationships (see Section 2.12.4 [Show Relationships], page 267)

•Restriction Enzyme map (see Section 2.5.6 [Restriction Enzyme Search], page 157)

•Stop Codon Map (see Section 2.5.5 [Stop Codon Map], page 156)

•Quality Plot (see Section 2.5.1.5 [Quality Plot], page 137)

•List Conﬁdence (see Section 2.11.6 [List Conﬁdence], page 261)

•Reading Coverage Histogram (see Section 2.5.2.2 [Reading Coverage Histogram],

page 142)

•Read-Pair Coverage Histogram (see Section 2.5.2.3 [Read-Pair Coverage Histogram],

page 143)

•Strand Coverage (see Section 2.5.2.4 [Strand Coverage], page 144)

•Conﬁdence Values Graph (see Section 2.5.2.1 [Conﬁdence Values Graph], page 142)

2.2.4.4 Gap4 Options menu

The Options menu contains options for conﬁguring gap4.

•Consensus Algorithm (see Section 2.20.2 [Consensus Algorithm], page 299)

•Set Maxseq (see Section 2.20.3 [Set Maxseq], page 299)

•Set Fonts (see Section 2.20.4 [Set Fonts], page 299)

•Conﬁgure Menus (see Section 2.20.5 [Conﬁguring Menus], page 300)

•Set Genetic Code (see Section 2.20.6 [Set Genetic Code], page 300)

•Alignment Scores (see Section 2.20.7 [Alignment Scores], page 301)

•Trace File Location (see Section 2.20.8 [Trace File Location], page 302)

2.2.4.5 Gap4 Experiments menu

The Experiments menu contains options to analyse the contigs and to suggest experimental

solutions to problems.

•Suggest Long Readings (see Section 2.10.3 [Suggest Long Readings], page 245)

•Suggest Primers (see Section 2.10.2 [Suggest Primers], page 243)

•Compressions and Stops (see Section 2.10.4 [Compressions and Stops], page 247)

•Suggest Probes (see Section 2.10.5 [Suggest Probes], page 249)

2.2.4.6 Gap4 Lists menu

The Lists menu contains a set of options for creating and editing lists for use in various

parts of the program.

•Creation and Editing (see Section 2.14 [Lists Introduction], page 278)

•Contigs To Readings (see Section 2.14.3 [Contigs To Readings Command], page 279)

•Minimal Coverage (see Section 2.14.4 [Minimum Coverage], page 279)

•Unattached Readings (see Section 2.14.5 [Unattached Readings], page 279)

•Highlight Readings List (see Section 2.14.6 [Highlight Readings List], page 279)

•Search Sequence Names (see Section 2.14.7 [Search Sequence Names], page 279)

•Search Template Names (see Section 2.14.8 [Search Template Names], page 280)

•Search Annotation Contents (see Section 2.14.9 [Search Annotation Contents],

page 280)

2.2.4.7 Gap4 Assembly menu

The Assembly menu contains various assembly and data entry methods.

•Normal Shotgun Assembly (see Section 2.7.1 [Normal Shotgun Assembly], page 205)

•Directed Assembly (see Section 2.7.2 [Directed Assembly], page 211)

•Screen Only (see Section 2.7.3 [Assembly Screen Only], page 213)

•Assembly Independently (see Section 2.7.1.1 [Assembly Independently], page 209)

2.2.5 The use of numerical estimates of base calling accuracy

In this section we give an overview of our use, when available, of base call accuracy estimates

or conﬁdence values. We also explain the importance of the consensus calculations used by

gap4, and their role in minimising the work needed to complete sequencing projects.

We first put forward the idea of using numerical estimates of base calling accuracy in

our paper describing SCF format (Dear, S. and Staden, R. A standard file format

for data from DNA sequencing instruments. DNA Sequence 3, 107-110 (1992)) and then expanded

on their use for editing and assembly in Bonfield, J.K. and Staden, R. The application of

numerical estimates of base calling accuracy to DNA sequencing projects. Nucleic Acids

Res. 23, 1406-1410 (1995).

In Bonfield and Staden (1995), we stated "...the most useful outcome of having a

sequence reading determined by a computer-controlled instrument would be that each base

was assigned a numerical estimate of its probability of having been called correctly... having

numerical estimates of base accuracy is the key to further automation of data handling for

sequencing projects. ... The simple procedure we propose in this paper is a method of using

the numerical estimates of base calling accuracy to obviate much of the tedious and time

consuming trace checking currently performed during a sequencing project. In summary

we propose that the numerical estimates of base accuracy should be used by software to

decide if conﬂicts between readings require human expertise to help adjudicate. We argue

that if the accuracy estimates are reasonably reliable then the majority of conﬂicts can be

ignored... and so the time taken to check and edit a contig will be greatly reduced."

This has been achieved by making the consensus calculations (see Section 2.11.5 [The

Consensus Calculation], page 257) central to gap4, and by providing calculations which

make use of base call accuracy estimates to give each consensus base a quality measure.

The consensus is not stored in the gap4 database but is calculated when required by each

function that needs it, and hence always takes into account the current data. In the Contig

Editor the consensus is updated instantly to reﬂect any change made by the user.

In 1998 the ﬁrst useable probability values became available through the program Phred

(Ewing, B. and Green, P. Base-Calling of Automated Sequencer Traces Using Phred. II.

Error Probabilities. Genome Research. Vol 8 no 3. 186-194 (1998)). Phred produces a

conﬁdence value that deﬁnes the probability that the base call is correct. This was an

important step forward and these values are widely used and have deﬁned a decibel type

scale for base call conﬁdence values. Gap4 is currently set to use conﬁdence values deﬁned

on this scale.

The conﬁdence value is given by the formula

C_value = -10*log10(probability of error)

A conﬁdence value of 10 corresponds to an error rate of 1/10; 20 to 1/100; 30 to 1/1000;

and so on. Using the main gap4 consensus algorithm they enable the production of a

consensus sequence for which the expected error rate for each base is known.
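
The relationship between confidence values and error rates can be checked with a short sketch. It simply applies the formula above in both directions; the function names are chosen for illustration and are not part of gap4.

```python
import math


def confidence_to_error_probability(confidence):
    """Invert C_value = -10*log10(probability of error)."""
    return 10.0 ** (-confidence / 10.0)


def error_probability_to_confidence(p_error):
    """Apply C_value = -10*log10(probability of error)."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be greater than 0 and at most 1")
    return -10.0 * math.log10(p_error)


if __name__ == "__main__":
    for confidence in (10, 20, 30):
        print(confidence, confidence_to_error_probability(confidence))
```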

As is described elsewhere (see Section 2.11.6 [List Consensus Conﬁdence], page 261)

being able to calculate the conﬁdence for each base in the consensus sequence makes it

possible to estimate the number of errors it contains, and hence the number of errors that

will be removed if particular bases are checked and, if necessary, edited. For example, if

1000 bases in the consensus had conﬁdence 20, we would expect those 1000 bases (with an

error rate of 1/100) to contain 10 errors.
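
The arithmetic in this example can be written out as a short Python sketch. It assumes only the decibel formula given earlier; the function names are hypothetical, and the second function further assumes that checking a base always resolves any error at that position.

```python
def expected_errors(confidences):
    """Sum of per-base error probabilities, 10 ** (-C / 10) for each base."""
    return sum(10.0 ** (-c / 10.0) for c in confidences)


def errors_removed_by_checking(confidences, threshold):
    """Expected errors among the bases whose confidence is below threshold.

    Assumption: every base that is checked ends up correct.
    """
    return sum(10.0 ** (-c / 10.0) for c in confidences if c < threshold)


# 1000 consensus bases, each with confidence 20 (error rate 1/100).
print(round(expected_errors([20] * 1000), 6))
```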

Another program which produces decibel scale confidence values for ABI 377 data is

ATQA, from Daniel H. Wagner Associates, at http://www.wagner.com/.

For gap4 the conﬁdence values are expected to lie in the range 1 to 99, with 0 and 100

having special meanings to the program.

The conﬁdence values are stored in SCF or Experiment ﬁles and copied into gap4 data-

bases during assembly or data entry.

The searches provided by the Contig Editor (see Section 2.6.6 [Searching], page 174)

are one of gap4’s most important time saving features. The user selects a search type, for

example to ﬁnd places where the conﬁdence for the consensus falls below a given threshold,

and the search automatically moves the cursor to the next such position in the consensus.

The Contig Editor locates the next problem by applying the consensus calculation to the

contig. To edit a contig the user selects "Search" repeatedly, knowing that it will only

move to places where there is a conﬂict between good data or where the data is poor. Note

that the program is usually conﬁgured to automatically display the relevant traces for each

position located by the search option.
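
The idea behind such a search can be sketched as follows. This is a simplified illustration, not the Contig Editor's implementation: it assumes the consensus confidences have already been calculated and are held in a plain list, whereas gap4 recalculates the consensus from the current data. The function name and the 0-based positions are our own choices.

```python
def next_low_confidence(consensus_conf, cursor, threshold):
    """Return the first position after cursor with confidence below threshold.

    Returns None when no further position qualifies.
    """
    for pos in range(cursor + 1, len(consensus_conf)):
        if consensus_conf[pos] < threshold:
            return pos
    return None


# Repeated "Search": step through every position that needs attention.
conf = [40, 35, 12, 50, 8, 45]
problems = []
cursor = next_low_confidence(conf, -1, 20)
while cursor is not None:
    problems.append(cursor)
    cursor = next_low_confidence(conf, cursor, 20)
print(problems)
```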

The main result is that far fewer disagreements between data are brought to the attention

of the user and fewer traces have to be inspected by eye, and so the whole process is faster.

Another consequence of the strategy is that, as fewer bases need changing to produce the

correct consensus, most of what appears on the screen will be the original base calls. Indeed

we have taken this a step further and suggest that if a base needs changing because it has

a high accuracy estimate, and is conﬂicting with other good data, then rather than change

the character shown on the screen, the user should lower its accuracy value. By so doing

more of the original base calls are left unchanged and hence are visible to the user. There

is a function within the contig editor to reset the accuracy value for the current base to 0.

Alternatively the accuracy value for the base that is thought to be correct can be set within

the contig editor to 100.

2.2.6 Use of the "hidden" poor quality data

In general sequences obtained from machines contain segments such as vector sequence and

poor quality data that need either to be removed or ignored during assembly and editing.

In our package we do not remove such segments but instead we mark them so that the

programs can deal with them appropriately. In gap4 such data is referred to as "hidden".

The positions to hide are determined initially by preprocessing programs such as vector_clip

(see Chapter 6 [Screening Against Vector Sequences], page 401) and qclip (see Section 12.19

[qclip], page 597).

The hidden data can be revealed in the Contig Editor by toggling the Cutoﬀs button (see

Section 2.6.3.4 [Adjusting the Cutoﬀ data], page 169); can be used to search for possible

joins between contigs (see Section 2.8.3 [Find Internal Joins], page 227), and can be included

in the consensus sequence (see Section 2.11.2 [Extended consensus], page 253) to be used

by external screening programs. For these cases the program can distinguish data that is

hidden because it is vector and data that is hidden because it is of poor quality: only poor

quality data is included.

The position of hidden data can be changed interactively in the Contig Editor. In

addition the Double Strand function (see Section 2.10.1 [Double stranding], page 241) will

reduce the amount of hidden data for readings that cover single stranded regions of contigs,

if the data aligns well with that on the other strand.
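
One way to picture hidden data is as a pair of cutoff positions stored alongside the full reading, so that nothing is ever deleted. The sketch below is only a model of that idea; the class and field names are hypothetical and do not describe the gap4 database format.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    bases: str         # the complete sequence, including hidden data
    left_cutoff: int   # hypothetical: number of hidden bases at the left end
    right_cutoff: int  # hypothetical: index one past the last visible base

    def visible(self) -> str:
        """The part of the reading used for assembly and editing."""
        return self.bases[self.left_cutoff:self.right_cutoff]

    def segments(self):
        """Return (hidden left, visible, hidden right) without discarding data."""
        return (
            self.bases[:self.left_cutoff],
            self.visible(),
            self.bases[self.right_cutoff:],
        )


r = Reading("NNACGTACGTNN", left_cutoff=2, right_cutoff=10)
print(r.visible())

# Moving a cutoff changes what is hidden; the stored bases are untouched.
r.right_cutoff = 11
print(r.segments())
```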

2.2.7 Annotating and masking readings and contigs

Gap4 can label segments of readings and contigs using "tags" (see Section 2.6.5 [Create

Tag], page 171). The program recognises a set of standard tag types and users can also

invent their own. Each tag type has a unique four character identiﬁer, a name, a direction,

a colour and a text string for recording notes. Tags can be created, edited and removed

by users and by internal routines. Tags can also be input along with readings. This is

important when reference sequences are used during mutation detection (see Section 3.1.3

[Reference sequences], page 314).
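
As a rough mental model, a tag type and a tag placed on a sequence could be represented as below. This is a sketch under our own assumptions: the class names, field names and the "+"/"-" direction encoding are hypothetical and are not taken from gap4's internal structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagType:
    code: str    # unique four character identifier, e.g. "REPT"
    name: str    # descriptive name, e.g. "Repeat"
    colour: str  # display colour

    def __post_init__(self):
        if len(self.code) != 4:
            raise ValueError("tag type identifiers have exactly four characters")


@dataclass
class Tag:
    tag_type: TagType
    start: int            # first base covered by the tag
    end: int              # last base covered by the tag
    direction: str = "+"  # hypothetical encoding of the tag direction
    note: str = ""        # free text for recording notes

    def __post_init__(self):
        if self.start > self.end:
            raise ValueError("start must not be after end")


rept = TagType("REPT", "Repeat", "green")
tag = Tag(rept, start=120, end=480, note="checked: genuine repeat")
print(tag.tag_type.code, tag.start, tag.end)
```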

2.2.7.1 Standard tag types

The standard tag types include those shown below plus the FT records from EMBL sequence

ﬁle entries. Users can also invent their own and add them to their personal GTAGDB. This is

a ﬁle that describes the available tag types and their colours (see Section 2.20.10 [Conﬁgure

the tag database], page 304).

Code  Function
COMM  Comment
COMP  Compression
RCMP  Resolved compression
STOP  Stop
OLIG  Oligo (primer)
REPT  Repeat
ALUS  Alu sequence
SVEC  Sequencing vector
CVEC  Cloning vector
MASK  Mask me
FNSH  Finished segment
ENZ0  Restriction enzyme 0
ENZ9  Restriction enzyme 9
MUTN  Mutation
DIFF  Sequence different to consensus
HETE  Heterozygous mutation
HET+  Heterozygous mutation False +ve
HET-  Heterozygous mutation False -ve
HOM+  Homozygous mutation False +ve
HOM-  Homozygous mutation False -ve
FCDS  FEATURE: CDS
F***  All other (60) EMBL FT record types

2.2.7.2 Active tags and masking

Tags are used for a variety of purposes and for each function in the program the user can

choose which tag types are currently "active". Where they are being used to provide visual

clues this will determine which tag types appear in the displays, but for other functions

they can be used to control which parts of the sequence are omitted from processing. This

mode of tag use is called "masking". For example the program contains a routine to search

for repeats, and if any are found, the user needs to know if such sequence duplications

are caused by incorrect assembly or are genuine repeats. Once the user has checked a

duplication reported by the program and found it to be a repeat, it can be labelled with a

REPT tag. If the repeat routine is run in masking mode and with REPT tags active, any

segment covered by a REPT tag will not be reported as a match. So once the "problem"

has been dealt with it can be labelled so it is not reported on subsequent searches. In

addition the tag is available to provide annotation for the completed sequence when it is

sent to the data libraries.
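
The effect of masking on the repeat search can be illustrated with a small filter. This is a sketch of the behaviour described above, not the gap4 routine; matches and tags are represented as simple tuples of our own design, with inclusive start and end positions.

```python
def report_matches(matches, tags, active_types):
    """Drop every match that lies wholly inside a tag of an active type.

    matches: list of (start, end) positions
    tags:    list of (tag_type, start, end)
    """
    masked = [(s, e) for tag_type, s, e in tags if tag_type in active_types]

    def covered(match):
        return any(s <= match[0] and match[1] <= e for s, e in masked)

    return [m for m in matches if not covered(m)]


matches = [(100, 150), (400, 460)]
tags = [("REPT", 90, 160), ("COMM", 390, 470)]

# With REPT active, the match already labelled as a repeat is not reported.
print(report_matches(matches, tags, active_types={"REPT"}))
```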

A more complicated application of masking is available for two of the other search

procedures in the program: (see Section 2.7.1 [Shotgun assembly], page 205) and (see

Section 2.8.3 [Find Internal Joins], page 227). The former is the general assembly function

and the latter is used to ﬁnd potential joins between contigs in the database. Below we

describe how masking can be used during assembly and similar comments apply to Find

Internal Joins.

In the assembly function the user can choose to employ masking and then select the types

of tags to be used as masks. Readings are compared in two stages: ﬁrst the program looks

for exact matches of some minimum length and then for each possible overlap it performs

an alignment. If the masking mode is selected the masked regions are not used during the

search for exact matches, but they are used during alignment. The eﬀect of this is that

new readings that would lie entirely inside masked regions will not produce exact matches

and so will not be entered. However readings that have suﬃcient data outside of masked

segments can produce matches and will be correctly aligned even if they overlap the masked

data. A common use for masking during assembly or Find Internal Joins is to avoid ﬁnding

matches that are entirely contained in Alu segments.
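
The first stage of this two-stage comparison can be sketched as an exact word search that ignores masked regions. The code below is an illustration under stated assumptions, not the gap4 assembly algorithm: it uses a simple word index, half-open masked intervals, and omits the alignment stage entirely. All names are hypothetical.

```python
def seed_matches(reading, consensus, masked, k):
    """Exact k-base matches whose consensus word lies outside all masks.

    masked: list of half-open (start, end) intervals on the consensus.
    Returns a list of (reading position, consensus position) pairs.
    """
    def overlaps_mask(pos):
        return any(s < pos + k and pos < e for s, e in masked)

    # Index only the consensus words that avoid the masked regions.
    index = {}
    for pos in range(len(consensus) - k + 1):
        if not overlaps_mask(pos):
            index.setdefault(consensus[pos:pos + k], []).append(pos)

    return [
        (i, pos)
        for i in range(len(reading) - k + 1)
        for pos in index.get(reading[i:i + k], [])
    ]


consensus = "ACGTACCA" + "TTTTGGGG" + "CATCGATC"
masked = [(8, 16)]  # the middle block is covered by an active tag

inside = "TTTGGG"        # lies entirely within the masked region
spanning = "TACCATTTTG"  # starts outside the mask and runs into it

print(seed_matches(inside, consensus, masked, 4))
print(seed_matches(spanning, consensus, masked, 4))
```

In this sketch the reading that lies wholly inside the masked region produces no exact matches, while the spanning reading is seeded by its unmasked end; a subsequent alignment step (not shown) would then be free to use the masked bases.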

A further mode related to masking is "marking". Marking is available for the consensus

calculation (see Section 2.11 [Consensus calculation], page 251) and for Find Internal Joins

(see Section 2.8.3 [Find Internal Joins], page 227). Instead of masking the regions covered by

active tags these routines simply write these sections of the consensus sequence in lowercase

letters. That is they make it easy for users to see where the tagged segments are. Marking

has no other eﬀect.
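
Marking amounts to a change of letter case, as the following sketch shows. It is an illustration only; the function name and the tuple representation of tags, with 0-based half-open positions, are our own assumptions.

```python
def mark_consensus(consensus, tags, active_types):
    """Write regions covered by active tags in lowercase, the rest in uppercase.

    tags: list of (tag_type, start, end) with 0-based half-open positions.
    """
    out = list(consensus.upper())
    for tag_type, start, end in tags:
        if tag_type in active_types:
            for i in range(max(start, 0), min(end, len(out))):
                out[i] = out[i].lower()
    return "".join(out)


tags = [("REPT", 2, 5), ("COMM", 7, 9)]
print(mark_consensus("ACGTACGTAC", tags, active_types={"REPT"}))
```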

2.3 Contig Selector

The gap4 Contig Selector is used to display, select and reorder contigs. It can be invoked

from the gap4 View menu, but will automatically appear when a database is opened.

In the Contig Selector all contigs are shown as colinear horizontal lines separated by short

vertical lines. The length of the horizontal lines is proportional to the length of the contigs

and their left to right order represents the current ordering of the contigs. This Contig Order

is stored in the gap database and users can change it by dragging the lines representing the

contigs in the display. The Contig Selector can also be used to select contigs for processing.

Tags (see Section 2.2.7 [Annotating and masking readings and contigs], page 121) can

also be displayed in the Contig Selector window. As the mouse is moved over a contig, it

is highlighted and the contig name (left most reading name) and length are displayed in

the status line. The number in brackets is the contig number. Unlike gap4, gap5 does not

display annotations within the Contig Selector window.

The ﬁgure shows a typical display from the Contig Selector. At the top are the File, View

and Results menus. Below that are buttons for zooming and for displaying the crosshair.

The four boxes to the right are used to display the X and Y coordinates of the crosshair.

The rightmost two display the Y coordinates when the contig selector is transformed into

the contig comparator (see Section 2.4 [Contig Comparator], page 126). The two leftmost

boxes display the X coordinates: the leftmost is the position in the contig and the other is

the position in the overall consensus. The crosshair is the vertical line spanning the panel

below.
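
The relationship between the two X coordinates can be sketched as follows, assuming that the contigs are laid end to end in their current order, that positions are counted from 1, and that the separators between contigs occupy no bases. The function name is hypothetical and this is not the program's own code.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(overall_pos, contig_lengths):
    """Map a position in the overall consensus to (contig index, position).

    contig_lengths lists the contig lengths in their current left to right
    order. The returned contig index is 0-based; positions are 1-based.
    """
    ends = list(accumulate(contig_lengths))  # cumulative end of each contig
    if not ends or not 1 <= overall_pos <= ends[-1]:
        raise ValueError("position lies outside the overall consensus")
    idx = bisect_right(ends, overall_pos - 1)
    start = ends[idx - 1] if idx else 0
    return idx, overall_pos - start


lengths = [1000, 500, 2000]
print(locate(1001, lengths))
```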

This panel shows the lines that represent the contigs and the currently active tags. Those

tags shown above the contig lines are on readings and those below are on the consensus.

Right clicking on a tag gives a menu containing “information” (to see the tag contents) and

“Edit contig at tag” which invokes the contig editor centred on the selected tag.

The information line is showing data for the contig that is currently under the crosshair.

2.3.1 Selecting Contigs

Contigs can be selected by either clicking with the left mouse button on the line representing

the required contig in the contig selector window or alternatively by choosing the "List

contigs" option from the "View" menu. This option invokes a "Contig List" list box where
