EK SEDRR RF 001 Error Detection, Recovery And Reporting Reference Manual

ERROR DETECTION, RECOVERY AND REPORTING REFERENCE MANUAL

Order No. EK-SEDRR-RF-001

digital equipment corporation · maynard, massachusetts

First Printing, February 1976

The information in this document is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in this document.

The software described in this document is furnished under a license and may be used or copied only in accordance with the terms of such license.

Digital Equipment Corporation assumes no responsibility for the use or reliability of its software on equipment that is not supplied by DIGITAL.

Copyright © 1976 by Digital Equipment Corporation

The postage prepaid READER'S COMMENTS form on the last page of this document requests the user's critical evaluation to assist us in preparing future documentation.

The following are trademarks of Digital Equipment Corporation:

DIGITAL
DEC
PDP
DECUS
UNIBUS
COMPUTER LABS
COMTEX
DDT
DECCOMM
DECsystem-10
DECtape
DIBOL
EDUSYSTEM
FLIP CHIP
FOCAL
INDAC
LAB-8
DECsystem-20
MASSBUS
OMNIBUS
OS/8
PHA
RSTS
RSX
TYPESET-8
TYPESET-10
TYPESET-11


CHAPTER 1
INTRODUCTION AND OVERVIEW

1.1 BACKGROUND

1.2 ERROR CATEGORIES

In any discussion about error recovery, a brief description of the errors likely to be seen is first necessary. The errors to be detected by the system can generally be divided into three basic types:

1. User programming errors
2. Operating system or monitor errors
3. Machine failures

1.3 USER PROGRAMMING ERRORS

Although the identification and repair of user programming errors is not the responsibility of field service engineers, you should be aware that these errors occur and understand how they affect the entire system. For this reason, only brief descriptions of these errors and their general handling procedure are included here.

1.3.1 Violation of System Architecture

There are several types of user programming errors. One is the violation of the architecture or the basic design of the system. The programmer must follow the rules set by the designers of the system if he is to use the system to get his job done. The best example of this basic system design or architecture is the machine's instruction set. Certain instructions may not be executed by a user program but are reserved for the programs which both serve and control the users. These programs are called the operating system or monitor, and the reserved instructions may perform I/O functions or control the allocation of core memory. If a user program attempts to execute one of these reserved instructions or attempts to execute an instruction the machine doesn't understand, the user has violated the system architecture and the error will be caught.

1.3.2 Data Programming Errors

Another type of user-programming error involves data programming, which may include either using the wrong format for data or incorrectly handling the input/output resources. An example of wrong data format might be incorrect specification of record length or operations on data using the wrong assumptions about the data's layout. Incorrect handling of the input/output devices might result from trying to do data input from a device, such as a paper-tape punch or line printer, that can only perform output.

1.3.3 Storage Allocation Errors

Still another type of user-programming error is an attempt to use or address more storage space than has been allocated or reserved for the program. This might even include trying to access more storage than is physically present on the system. This would usually occur only in programs which attempt to increase their storage capacity after the program has started its run. This class of error may also include attempting to access I/O devices which the program has not reserved or which may not exist on the system.


1.3.4 Control and Recovery of User Programming Errors

Control and recovery for user programming errors is accomplished by both the system hardware and operating system software. In the case of violation of system architecture, the CPU will most likely notice the problem and alert the system software via an interrupt or perhaps a trap. In any case, the user program is prevented from violating the rules and the offending job is automatically stopped and given an error message describing the violation.

Data programming errors are usually caught by the data management section of the operating system as it performs the input/output request. Here the monitor may detect that the wrong type of device has been selected, that the requested data is not present, or that the data is different from that which the user expects. In cases where data for several different users resides on the same or similar devices (such as disk packs), the monitor will also check that the user is accessing his own data and is not trying to modify someone else's data. Usually errors of this type will also cause early termination of the user's job. The user's program may, however, include special routines to attempt to figure out what went wrong and correct the problem.

In the event the operating system or the hardware or both detect a user trying to access storage space or devices not allocated to him, control is immediately transferred to the monitor and the user's job is stopped. The monitor may, in some cases, attempt to allocate the additional storage the user tried to access. If the monitor is successful, the user's program will be continued; otherwise the appropriate error message will be given and the user's job will be aborted.

Almost all user programming errors will result in that user's job being stopped. The rest of the users continue to run, and in this manner the system minimizes the effect of these errors.

1.4 MONITOR ERRORS

The second cause of errors to be considered is the operating system or monitor. Here full identification of the error is sometimes much more complex than identification of user programming errors. Again the correction of pure monitor errors is not the responsibility of field service engineers, but they must be aware of much more information about this class of errors because many of the errors may be caused by intermittent hardware problems. Field service engineers should be capable of discussing monitor errors with software specialists to determine if software or hardware is the cause of the failure.

Monitor or system software errors can be attributed to three major causes:

1. Bad programming initially
2. Unexpected error combinations
3. Undetected errors or outside errors

1.4.1 Bad Programming

Operating systems or monitors are developed by highly competent systems programmers who have an intimate knowledge of both the system hardware and software architecture; but they are still human and sometimes make mistakes. Most of these mistakes occur when the system designer doesn't consider each possible eventuality when developing the system. In some cases the programmer may consider how a system may arrive at a given point but he may not know how to get out of the point. This may occur when three or four or more situations occur at the same time with each situation interacting with the others to make the point more complex. Without extremely careful consideration by the developer, the monitor may take an erroneous path out of the situation and cause some violation to be detected several hundred instructions later. In cases such as this (which occur very seldom) the system may not be able to recover at all and must stop all jobs.

1.4.2 Unexpected Combinations

Indeed, some of these interacting situations which are seldom encountered are in fact unforeseeable by the programmer when he is developing the monitor. It is not easy, if possible at all, in most cases to determine whether the real error is a programmer's mistake or a combination of unexpected events. Usually some method of feedback is provided to inform the system developers of the failure so that it may be prevented from happening again if possible. The effect of this type of error can be serious if outside means of protection are not employed as discussed later in this chapter.

1.4.3 Outside Errors

Other monitor errors may occur if another part of the system goes awry without being detected. The most frequent of these errors causes either a user program or the monitor to use bad data, make a wrong decision, or otherwise get itself in trouble. The results are the same.

1.4.4 Monitor Self Checks

Some errors detected by the monitor are the results of checks made by the monitor on its own integrity. Frequently, different sections of a monitor are used to perform some specific function using data supplied when the function is needed. Although the exact data is not known when the section is written, the data can often be described to be within a certain range of values or in a specific format. In most cases like this, the monitor checks these parameters of the data before the numerical values are used. For example, an argument (data) for a subroutine may always have to be between 1 and 10. If the subroutine always checks that the argument is within range, and an earlier undetected error has caused the value to be 15, the monitor will detect the error. These forms of checking are sometimes called range checks or consistency checks.
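The range check described above can be sketched as follows. This is a minimal illustration in Python, not monitor code; the names `ConsistencyError` and `dispatch` and the returned handler string are hypothetical and chosen only for the example.

```python
class ConsistencyError(Exception):
    """Raised when an argument falls outside the range its callers guarantee."""


def dispatch(function_code):
    # Hypothetical subroutine: callers are supposed to pass a code from 1 to 10.
    # Checking before use catches damage done by an earlier, undetected error.
    if not 1 <= function_code <= 10:
        raise ConsistencyError(
            f"function code {function_code} outside expected range 1..10"
        )
    return f"handler-{function_code}"
```

A value such as 15 is rejected at the point of use, so the fault is reported close to where it first becomes visible instead of several hundred instructions later.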

The recovery depends on the seriousness of the function. If the function is called to support only one job at a time and is not capable of being retried, only that job would be stopped. However, if the function affects all users or the integrity of the operating system, then the monitor will stop all jobs on the system. This is often called "crashing" either a single job or the entire system.

In some cases the system software may arrive at a point or condition that the programmer did not believe possible but coded for the eventuality anyway. Usually the error detected here is minor and affects no jobs. In this instance only a warning is usually given to the system operator and the monitor continues.


1.4.5 Recovery of Monitor Errors

Most errors detected by software, either user or monitor, are considered more serious than any other errors. These errors usually cannot be recovered by restarting the function in progress. For this reason almost all of these errors cause the abnormal termination or crash of at least one job on the system. These errors are detected only by the software without any indication of trouble from the hardware. These errors are serious because software or programs do not usually go bad with age. After the program is initially debugged it does not change or degrade because of heat as hardware does. If an error is later detected by the system, it is considered to be caused by an event or eventuality the programmer did not consider, and usually there is no program or function provided to correctly handle the situation.

1.5 HARDWARE ERRORS

The third type of error detected by the system is caused by the hardware or the machine itself. This is the most frequent type of error and the responsibility for identification and correction of these failures falls directly on field service engineers. The system hardware can age and cause intermittent failures. These failures are not permanent, i.e., the failure may not occur during two sequential attempts at the same operation. For this reason the operating systems of today expend a lot of effort to recover from this type of failure. These hardware errors can be divided into three categories:

1. CPU-instruction and addressing failures
2. Controller and channel failures
3. I/O errors

Because the system hardware cannot be expected to operate continuously without failure, producers of the hardware include facilities to check the hardware operation. The most frequently used error checking scheme is any one of several types of parity networks, although many other schemes are available. Once the hardware has detected an error it may either signal the CPU and system software that an error has occurred or attempt to recover from the error and notify the software if it cannot recover successfully.

1.5.1 CPU

Failures occurring in the CPU and main storage section of the system are perhaps the most difficult to handle correctly. These failures can easily modify either the operating system software or a user program or cause instructions to be incorrectly executed. A failure in an addressing section may cause the system to operate with wrong data or unknowingly modify some other job's program or data. For these reasons CPU errors will ordinarily cause the crash of a job or the entire system regardless of whether a user or the monitor is in control.

The most recently developed CPUs have attacked this problem by adding more checking circuits specifically designed to stop the bad effects of an error once it has been detected. For example, if a word (either an instruction or data) sent to the CPU by the memory fails a parity check, the operation in progress is stopped before the bad data is used. In this way, the system localizes the effect of the error and the impact of the failure is reduced. The operating system may crash all jobs as a result but the system's data base (users' data files and programs) will not be affected. In other instances the operating system may be able to retry the failing instruction or memory reference successfully and not have to crash any users at all.
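The memory parity check mentioned above can be illustrated with a short sketch. It assumes a single even-parity bit stored alongside each word; the function names `parity_bit`, `store`, and `fetch` are hypothetical, and real hardware performs this check in circuitry, not in software.

```python
def parity_bit(word):
    """Return the bit that makes the total number of 1 bits even (even parity)."""
    return bin(word).count("1") % 2


def store(word):
    # Memory keeps the data word together with its parity bit.
    return (word, parity_bit(word))


def fetch(stored):
    word, parity = stored
    # Recompute parity on the way to the CPU; a mismatch means a bit changed.
    if parity_bit(word) != parity:
        raise RuntimeError("memory parity error: word not delivered to the CPU")
    return word
```

A single parity bit detects any odd number of changed bits in a word; an even number of changed bits passes unnoticed, which is one reason other checking schemes also exist.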

1.5.2 Controllers and Channels

The second major section of the system is that section composed of the various controllers and channels. The system controllers monitor and control several I/O devices of the same type, and the channels of various types connect the CPU and/or main storage units with the I/O controllers or devices. Failures in this portion of the system can usually depend on rather extensive recovery procedures to overcome a problem. However, these errors are likely to affect several jobs or users because each controller or channel can handle several I/O devices being used by many jobs.

The checking circuits employed here are of the same type and perform the same function: they ensure that the device is correctly performing the requested operation and that it is transferring the requested data correctly. Detected errors are signaled to the CPU and monitor and may stop the current operation if the error is serious. An example here might be a controller's parity check of a command issued by the CPU. If this parity check fails, the command would not be performed and the error would be signaled back to the CPU. The recovery procedure invoked by the operating system may be as simple as retrying the failing operation a number of times or as elaborate as finding another path to the same point, such as using another controller attached to the same group of devices.

Some of the controller/channel errors are concerned with data errors. Here the recovery procedure may include correcting the data after it is in main storage using error information provided by the controller or channel.

1.5.3 I/O

Errors detected by a single I/O device are recovered in the same manner as channel or controller failures but usually the error will affect only one job or task. The most frequently used form of error recovery is the simple retrying of the failing operation. If the failure continues for a specified number of consecutive retries, the job or task is crashed. These retry procedures may include other steps every so often during the recovery operation. These steps may include such action as repositioning the heads of a disk drive before every 5th retry or moving a magnetic tape over a tape cleaner mechanism before every 4th attempt to recover.
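The retry procedure described above can be sketched as a simple loop. This is an illustration only: the function `retry_io`, its parameters, and the default limits are assumptions for the example, and `operation` and `reposition` stand in for whatever the device driver would actually do.

```python
def retry_io(operation, reposition, max_retries=10, reposition_every=5):
    """Attempt operation; return True if it succeeds at first or on a retry."""
    if operation():
        return True
    for retry in range(1, max_retries + 1):
        # Extra corrective step before every Nth retry,
        # for example repositioning the heads of a disk drive.
        if retry % reposition_every == 0:
            reposition()
        if operation():
            return True
    return False  # retries exhausted: the caller would crash the job or task
```

Keeping the corrective step separate from the retry count lets the same loop serve a disk (reposition before every 5th retry) or a tape (clean before every 4th) by changing only the arguments.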

Other forms of I/O error recovery may include moving the data media to a different unit if possible. For example, a reel of magnetic tape (the media) may be moved to a different tape drive and the operation started again.

So far we have seen that there are several methods which may be employed to attempt to recover from errors after they have been detected. Those detected by software alone are more difficult to recover and have more severe impact on the system. Those errors detected by the system hardware vary in both system impact and recoverability. The impact and recoverability of the errors depend basically on the logical distance away from the CPU, as summarized in Figure 1-1.

[Figure 1-1 (not reproduced). Labels: SOFTWARE (MONITOR, USER PROGRAMS); HARDWARE (CPU, MAIN STORAGE, CONTROLLERS/CHANNELS, ON-LINE STORAGE, I/O DEVICES: CARDS, PAPER TAPE). Arrows: "Errors have increasing effect on system"; "Increasing capability to recover from errors".]

The methods already described only cover the initial attempts made to recover from the error. Many of the errors still result in the abnormal termination or crash of a job or the whole system. Recent hardware and software development in the area of error detection and recovery has reduced the number of errors which result in crashing the entire system. This helps to achieve the goal of localizing the impact of errors. Additionally, new hardware and software have been designed to be more reliable and to fail less often. This all has the effect of increasing system availability, which is a measure of a system's continuing ability to handle requests for computation.

1.6 CHECKPOINT/RESTART AND BACKUP

Another aspect of error recovery is the effort involved to get back to the point of processing the job just before the error occurred. Consider, for example, a job that requires eight hours to process. If an error occurs during the 6th hour of the job's run, one of two events will occur: either the system will recover from the error successfully and the job will continue (possibly without even knowledge of the error), or the job will be crashed while other jobs continue to run. If the job was crashed, the recovery cycle would not be complete until the job was back at the point six hours into the "rerun". The recovery time would include the six hours of rerun time plus any additional time needed to recover the original data.

In order to reduce this rerun time and help increase availability, features generally known as checkpoint/restart are included in most systems. This technique is simply the stopping of a job at regular intervals and saving in auxiliary storage the current state of the job and any program data, then continuing processing of the job. If a fatal error occurs, the subsequent rerun of the job may start at the last checkpoint instead of at the beginning. In our example, if checkpoints were taken every 1/2 hour, the rerun time would be no longer than 1/2 hour instead of six hours or more.
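The arithmetic behind this example can be written out as a small sketch. The function name `rerun_minutes` and the choice of minutes as the unit are assumptions for illustration; the time needed to restore the checkpoint itself is ignored.

```python
def rerun_minutes(failure_minute, checkpoint_interval=None):
    """Minutes of work that must be repeated after a crash at failure_minute."""
    if checkpoint_interval is None:
        return failure_minute  # no checkpoints: start again from the beginning
    # Only the work done since the most recent checkpoint is lost.
    return failure_minute % checkpoint_interval
```

For a crash 350 minutes into the run, `rerun_minutes(350)` gives 350 minutes of rerun, while `rerun_minutes(350, 30)` gives 20 minutes. With a 30-minute interval the result is always less than 30, which matches the half-hour bound stated above.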

The advantage of this facility is obvious, and it is always employed in any system environment where jobs are processed on a tight schedule or the output must meet a deadline. The disadvantage is the requirement of the additional auxiliary storage needed to hold the checkpoint.

This same general procedure is also employed to back up the entire system data set. In most computer systems all of the system's data base, both programs and data files, is saved on magnetic tape and stored in special areas such as vaults. By using this facility the system's data is never totally lost in case of a major disaster such as a fire in the computer center. This method of backing up the data may be done monthly, weekly, or even daily depending on the consequence of losing the most current data. Some computer centers may even back up their data to fire storage each time the data is changed.

1.7 OPERATOR MESSAGES AND SYSTEM RECONFIGURATION

The second goal of error detection and recovery is to localize the effects of every detected error. This can also be accomplished by reducing the number of times the error occurs. If the error can be prevented from repeating itself, the effects are limited.

In most cases of monitor and hardware errors a message is sent to the operator's teletypewriter or display. Also, special programs have been developed to report the status of the system, including error counts, etc., to the system operator. In this manner the system operator is knowledgeable of what the system is currently doing. If errors start to occur, the operator may seek assistance to determine the cause and find a solution for the errors.


If several errors can be traced to a single unit, the operator may attempt to recover whatever data was lost and then switch the units to a duplicate device and inform the monitor that the defective piece of hardware or software is no longer available. This process is called reconfiguration. After reconfiguration the system may operate more slowly or at a reduced efficiency rate, but it will at least still operate until the faulty device can be fixed. In some cases backup units may be used to keep the system running at the same level. This method is rather expensive in terms of additional hardware but may be required in critical applications.

Several of the error messages for the operator will also include directions for corrective action or steps to be performed as part of the recovery sequence. For example, if a deck of cards being read by the system has an error, the message to the operator would state the error and tell the operator to put the deck back into the input hopper of the card reader and restart the job. In another case the operator might be notified of several non-recoverable errors while reading a magnetic tape and the operator may be asked to move the tape to a different unit and try the job again.

More sophisticated operating systems may even mark a device or unit unavailable to itself after the error rate has crossed a specified threshold. In this case, the operator would be notified after the fact and may even be directed to contact the system maintainers about the faulty device.
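The threshold behavior described above can be sketched as follows. For simplicity the sketch uses a plain error count where a real system would more likely track a rate over time; the class `Device`, its fields, and the message text are all hypothetical.

```python
class Device:
    def __init__(self, name, threshold):
        self.name = name
        self.threshold = threshold  # hypothetical limit on recorded errors
        self.errors = 0
        self.available = True

    def record_error(self):
        """Count one error; take the device out of service past the threshold."""
        self.errors += 1
        if self.available and self.errors > self.threshold:
            self.available = False
            # The operator is told after the fact.
            return f"{self.name} marked unavailable after {self.errors} errors"
        return None  # nothing to tell the operator yet
```

With a threshold of 3, the first three errors produce no message, the fourth removes the device from the configuration and returns the operator message, and later errors on the already unavailable device stay silent.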

All of the methods and procedures discussed so far have dealt with detecting errors and controlling the effects of these errors. Any computer system which incorporates all or several of these functions will provide more data integrity for its users. When errors occur, the operating system and hardware will detect them and either attempt to recover or crash the appropriate job to prevent damage to the user, his data, or other jobs on the system. If several errors occur in a non-critical section of the system, the faulty device may be taken out of the system configuration until it can be repaired.

1.8 ERROR REPORTING TO FIELD SERVICE

In addition to providing more data integrity, more recent operating systems help field service repair faulty devices or systems by acting as a form of diagnostic when errors occur. This aspect of an operating system or monitor is called error recording and reporting. This capability has proven itself to be one of the most valuable tools available to field service engineers. This facility has eliminated many hours from system or device repair times, making the field service engineer's job easier and increasing system availability to the customer at the same time.

For the purpose of this discussion, a diagnostic may be considered to consist of only two basic sections. The first section is an exerciser which creates activity (perhaps of a closely controlled type) on some portion, if not all, of a system. Once an error is detected, the second section of the diagnostic generates and presents to its user information concerning the failure. This information either directly identifies the failing component or provides enough information for the user of the diagnostic to determine the failing component. This information is presented to the user, usually field service, in a manner and form that is easily understood by him.


Because a monitor drives all of the system hardware in an interactive manner for long periods of time, it can be considered one of the best exercisers available. Once an error occurs, the error recording sections of the monitor gather all of the available hardware and software information concerning the error and preserve this information in auxiliary storage for later reporting to field service. The reporting section of the package, upon command, presents this information to field service in a manner that is understandable and useful in identifying the failing section or component of the system.
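The split between a recording section and a reporting section can be sketched in a few lines. This is only an illustration of the idea: the entry fields, the JSON encoding, the device names, and the functions `record_error` and `report` are assumptions and do not describe the actual error file format or the SYSERR program.

```python
import json
from collections import Counter


def record_error(log, device, kind, detail):
    # Recording side: preserve one entry per error for later analysis.
    log.append(json.dumps({"device": device, "kind": kind, "detail": detail}))


def report(log):
    # Reporting side: on command, summarize the preserved entries per device,
    # most error-prone device first.
    counts = Counter(json.loads(entry)["device"] for entry in log)
    return counts.most_common()
```

Here `log` is an ordinary list standing in for auxiliary storage. A field service engineer reading the report sees which device accounts for the most recorded errors without having to recreate the failure.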

By using this capability, long-term, exhaustive diagnostics do not have to be run to recreate the errors and provide error information after the error was originally detected by the monitor. Field service engineers need only collect and analyze this data preserved by the operating system to determine which devices are detecting errors. Using the detailed information concerning these devices, field service can then determine the most efficient method to accomplish the repair, usually not requiring any diagnostic runs at all.

In addition to preserving information regarding hardware and software errors, the monitor may also use this method to save information regarding significant operational events such as system reloads and system activity rates to help in determining overall system performance and error rates for system devices.

This tool for field service, built into operating systems, coupled with a functional level of understanding of monitor error detection and recovery procedures, can enable field service engineers to effectively maintain systems in a professional manner with minimum interference to customers' operating schedules.

1.9 SUMMARY

As the usefulness and complexity of computer systems increased with development, so did the dependency on the computer's output. This dependency was on both turnaround time and accuracy of information, which are affected by the error detection, recovery, and reporting capabilities of the computer system.

The errors possible from a system are basically:

1. User programming errors
2. Monitor errors
3. Hardware errors

Those errors occurring in either a user's program or the monitor usually have a more serious effect on either the user's job or the entire operating system. Such errors are difficult to recover from because of their complexity, but recent developments have helped to reduce significantly the overall effects of these errors.

Hardware detected errors vary in recovery capabilities depending on where the errors are detected. Recovery procedures may vary from simple retry to more elaborate alternate path methods. The effects of non-recoverable errors have been reduced through the use of checkpoint/restart and backup procedures. Recurring errors may sometimes be prevented by system reconfiguration techniques or backup devices.


All errors are usually reported to the operator of the system and detailed information about errors may be preserved in auxiliary storage for field service analysis. Effective use of these recording facilities of the monitor can enable field service to diagnose system malfunctions without running long duration diagnostics to recreate the problem. This tool may be the most used and most helpful tool in the field service engineer's toolbox.


CHAPTER

2

TOPS20

ERROR

DETECTION

AND

RECOVERY

(to

be

supplied)


CHAPTER

3

HOW

TO

RUN

SYSERR

3.1

INTRODUCTION

SYSERR

is

a

user

program

to

list

the

contents

of

the

system

error

file.

To

run

the

program

you

must

be

"logged

in"

and

have

maintenance

privileges.

If

you

are

not

familiar

with

how

to

"log

in"

to

the

system,

refer

to

the

manual

GETTING

STARTED

WITH

TOPS20.

3.2

BEFORE

RUNNING

SYSERR

After

logging

in

and

before

running

SYSERR

you

must

have

two

special

areas

defined

for

you

to

access.

These

two

areas

are

<SYSTEM>,

where

the

error

file

exists;

and

<SUBSYS>,

where

the

compatibility

package

exists.

To

do

this,

type

the

following

on

your

TTY:

@DEFINE(SPACE)SYS:(SPACE)<SYSTEM>,<SUBSYS>(CR)

NOTE

All

commands

which

are

input

to

the

system

are

underlined,

"(CR)"

means

carriage

return

and

"(SPACE)"

means

a

single

space.

If

you

have

already

defined

the

logical

name

SYS,

redefine

it

to

include <SYSTEM> and <SUBSYS>.

To

check

your

logical

name-assignment,

type

@INF(SPACE)LOG(CR)

NOTE

As

described

in

Appendix

C,

the

SYSERR

package

consists

of

3

modules,

SYSERR,

SYSERD,

and

SYSERS.

All

3

must

reside

in

the

same

directory.

Normally

this

directory

is

<SUBSYS>;

however,

if

the

3

modules

are

in

your

own

directory

the

logical

name

SYS

must

be

defined

to

include

"DSK:"

as

the

first

logical

name

in

the

assignment

for

the

package

to

work

correctly.

The

correct

command

for

this

is

@DEFINE(SPACE)SYS: (SPACE)DSK:,<SYSTEM>,<SUBSYS>(CR)


To

call

the

program

SYSERR,

type

on

your

TTY:

@SYSERR

(CR)

and

the

program

will

respond

with

*

indicating

it

is

ready

to

accept

your

commands.

3.3

GENERAL

COMMAND

STRING

The

general

form

of

a command

string

to

SYSERR

is:

*

ODEV:OFILE.TYP=IDEV:IFILE.TYP/SWITCH/SWITCH

...

(CR)

where:

ODEV:       The output device where you want the listing file. May be any device which can perform output.

            NOTE
            If "ODEV:" is "LPT:" you will get automatic spooling; if "ODEV:" is "PLPT0:" output will go to physical line printer 0.

OFILE.TYP   The name and type of the listing file.

IDEV:       The input device where the system error file resides. May be any device which can perform input.

IFILE.TYP   The name and type of the input file.

/SWITCH:    The control switches which tell SYSERR what types of errors or listings you desire.

It

is

not

necessary

to

type

a

full

command

to

SYSERR

because

certain

portions

have

a

default

value

which

SYSERR

uses

if

you

have

not

specified

that

portion

of

the

command

string.

The

default

values

used

by

SYSERR

are:

COMMAND PORTION   DEFAULT

ODEV:             DSK: in your own area

OFILE.TYP         The default is the listing control switch specified, such as /MASALL. Output file would be MASALL.LST.

IDEV:             SYS:

IFILE.TYP         ERROR.SYS

/SWITCH           /ALLSUM - If this default is used the output file name default is ERROR.LST.
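The defaulting rules above can be sketched in a few lines of Python. This is only an illustration, not SYSERR's own parser: the function names `split_spec` and `expand_command` are hypothetical, the prompt character is assumed to be stripped already, and the rule that /ALLSUM is supplied only when neither a listing switch nor /DEV is present is an assumption drawn from the examples in this chapter.

```python
# Listing control switches named in this chapter.
LISTING_SWITCHES = {
    "ALL", "ALLPAR", "ALLPER", "ALLSUM",
    "CPUALL", "CPUPAR", "CPUPER", "CPUSUM",
    "MASALL", "MASPAR", "MASNXM", "MASSUM",
}


def split_spec(spec):
    """Split 'DEV:FILE.TYP' into ('DEV:', 'FILE.TYP'); either part may be empty."""
    spec = spec.strip()
    if ":" in spec:
        dev, _, name = spec.partition(":")
        return dev.strip() + ":", name.strip()
    return "", spec


def expand_command(command):
    """Fill in the documented defaults for a command of the form OUT=IN/SWITCH..."""
    out_spec, sep, rest = command.strip().partition("=")
    if not sep:
        raise ValueError("expected ODEV:OFILE.TYP=IDEV:IFILE.TYP/SWITCH...")
    in_spec, *raw = rest.split("/")
    switches = [s.strip() for s in raw if s.strip()]
    names = [s.partition(":")[0].upper() for s in switches]
    listing = [n for n in names if n in LISTING_SWITCHES]
    # Assumption: /ALLSUM is implied only when no listing switch and no /DEV is given.
    if not listing and "DEV" not in names:
        switches.append("ALLSUM")
    odev, ofile = split_spec(out_spec)
    idev, ifile = split_spec(in_spec)
    return {
        "odev": odev or "DSK:",
        # Assumption: the first listing switch names the output file.
        "ofile": ofile or (listing[0] if listing else "ERROR") + ".LST",
        "idev": idev or "SYS:",
        "ifile": ifile or "ERROR.SYS",
        "switches": switches,
    }
```

For instance, `expand_command("TTY:=")` fills in ERROR.LST, SYS:, ERROR.SYS and /ALLSUM, which matches the fully expanded command shown with Example #1.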


The

listing

control

switches

available

for

use

include:

/ALL      LIST ALL ENTRIES
/ALLPAR   LIST ALL THOSE CAUSED BY PARITY ERRORS
/ALLPER   LIST ALL PERFORMANCE ENTRIES
/ALLSUM   GIVE ALL DEVICE SUMMARY
/CPUALL   LIST ALL PROCESSOR RELATED ENTRIES
/CPUPAR   LIST THOSE CAUSED BY PARITY ERRORS
/CPUPER   LIST ALL CPU PERFORMANCE ENTRIES
/CPUSUM   GIVE PROCESSOR SUMMARY
/MASALL   LIST ALL ENTRIES CONCERNING MASSBUS DEVICES (TU16, TU45 & RP04)
/MASPAR   LIST ONLY THOSE CAUSED BY PARITY ERRORS
/MASNXM   LIST THOSE CAUSED BY NXM
/MASSUM   LIST SUMMARY INFORMATION

3.4

OTHER

CONTROL

SWITCHES

Other

control

switches

are

also

available

to

further

control

the

listing.

This

type

of

switch

is

used

to

select

a

particular

device,

group

of

devices,

or

only

errors

occurring

during

a

specific

date/time

period.

Switches

of

this

type

include:

/BEGIN:MM-DD-YY:HH:MM:SS
          Begin listing of entries logged on date specified by MM-DD-YY. Other date formats such as DD-MM-YY and JAN-16-1976 are acceptable.

/END:MM-DD-YY:HH:MM:SS
          End listing of entries on the date specified. The same formats are acceptable.

/DEV:name
/DEV:type
          Select for listing only those entries which involve the device specified by name or type. Available device types include KLCPU, 11CPU, LP20, CD20, DH11, TU45, TU16, and RP04. To indicate a specific disk drive (DP) or magtape drive (MT) by /DEV:name, you must use the form DPabc or MTabc, where

          a = the logical controller address.
          b = the logical MASSBUS address.
          c = the logical slave address for MT and 0 for DP.

          NOTE
          You will find these logical addresses by generating the first summary listing.

          If /DEV:name is used, the listing control switch, such as /MASALL, must be used. For TOPS20 systems, using only /DEV:type listed above without a listing switch, such as /MASALL, causes SYSERR to examine each entry and force listings for those entries whose device type matches that specified.

/DETAIL:  List all information for Massbus and magtape instead of brief listing. May be abbreviated to "/DET".

/RETRY:   List only those entries whose retry count is greater than the value specified.

3.5

EXAMPLES

Following

are

several

examples

of

command

strings

and

explanations

of

how

they

are

interpreted

by

SYSERR.

EXAMPLE

#1

* TTY: = (CR)

This

is

the

first

command

which

should

be

given

to

SYSERR.

It

will

list

summary

information

about

the

entire

contents

of

the

error

file

on

your

TTY. By

examining

this

printout

you

may

determine

those

portions

of

the

system

which

are

of

interest

to

you

and

give

further

commands

to

SYSERR

to

list

only

the

desired

reports.

Note

that

this

command

used

several

default

values.

The

values

which

were

defaulted

are

enclosed

in

[ ]

and

if

the

whole

command

were

typed

it

would

look

like:

*TTY: [ERROR.LST] = [SYS:ERROR.SYS/ALLSUM] (CR)

EXAMPLE

#2

*TTY:=/BEGIN:-1D(CR)


This

is

basically

the

same

command

as

Example

#1.

However,

this

time

only

those

errors

which

have

occurred

in

the

last

24

hours

are

considered.

The

value

specified

in

the

/BEGIN:

switch

may

be

as

shown,

or

changed

to

increase

number

of

days

(-7D

for

1

week),

or

a

specific

date

included

as

described

under

/BEGIN:
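The relative form of the /BEGIN: and /END: value can be illustrated with a short Python sketch. It assumes only the "-nD" form shown in this example (n days before the present); the function name `resolve_relative_days` is hypothetical and no other relative units are implied.

```python
import re
from datetime import datetime, timedelta


def resolve_relative_days(value, now):
    """Turn a value such as '-1D' or '-7D' into an absolute date/time before `now`."""
    match = re.fullmatch(r"-(\d+)D", value.strip().upper())
    if match is None:
        raise ValueError("expected a value of the form -nD, for example -7D")
    return now - timedelta(days=int(match.group(1)))
```

With `now` set to noon on 13 January 1976, "-1D" resolves to noon on 12 January, so only entries from the last 24 hours are considered.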

EXAMPLE

#3

*

=/DEV:RP04/DETAIL

(CR)

This

command

tells

SYSERR

to

provide

complete,

detailed

reports

for

all

entries

which

concerned

any

RP04

in

an

output

file

called

ERROR.LST

on

your

disk

area.

Again

several

defaults

were

used.

EXAMPLE

#4

* = (CR)

This

is

the

easiest

command

to

type

to

SYSERR

and

it

uses

all

of

the

default

values.

It

is

identical

to

the

action

of

Example

#1

except

that

the

listing

file,

called

ERROR.LST

is

generated

on

your

disk

area.

EXAMPLE

#5

*

=/MASALL/BEGIN:JAN-1-76:13:00/END:JAN-7-76:13:00/DEV:DP030/DETAIL(CR)

This

command

will

tell

SYSERR

to

create

a

file

named

MASALL.LST

on

your

disk

area

which

contains

detailed

information

about

all

the

errors

detected

by

device

DP030

between

the

period

from

1

PM

on

Jan.

1,

1976

to

1

PM

Jan.

7,

1976.

EXAMPLE

#6

*MTA1:=MTA2:/MASALL/CPUALL/BEGIN:-30D/END:-3D(CR)

This

command

tells

SYSERR

to

create

the

list

file

on

MTA1

and

read

the

error

file

from

MTA2.

All

entries

concerning

either

a

massbus

device

or

the

CPU

during

a

period

beginning

30

days

ago and

ending

3

days

ago

will

be

listed.

This

example

points

out

that

SYSERR

can

process

multiple

commands

from

one

command

string

and

does

not

need

to

always

have

the

input

and

output

files

on

disk.

EXAMPLE

#7

*

/HELP

(CR)

This

command

tells

SYSERR

you

have

forgotten

how

to

give

SYSERR

commands.

The

program

will

list

a HELP

file

on

your

TTY

which

gives

abbreviated

information

on

how

to

run

the

SYSERR

program.

After

each

command

is

processed

by

SYSERR,

the

program

gives

a

carriage

return,

line

feed

and

another

prompt

character

(*)

indicating

it

is

ready

for

another

command.

If

no

more

commands

are

required,

type

*

^C

(control

C)

and

the

program

will

exit

back

to

monitor.

If

the

output

files

were

created

on

your

disk

area

they

may

be

listed

on

the

line

printer

with

the

monitor

command

@ PRINT XXX.LST (CR)

where

XXX

is

the

name

of

the

file

you

want

to

list.


3.6

OTHER

COMMANDS

If

you

are

running

SYSERR

on

an

LA36

with

wide

paper,

such

as

from

the

local

DEC

office,

an

additional

command

to

the

monitor

before

you

call

SYSERR

will

allow

the

full

width

of

the

paper

to

be

used

when

summary

listings

are

printed

on

this

TTY.

The

command

is:

@

TERMINAL

WIDTH

132

(CR)

3.7

INDIRECT

COMMANDS

SYSERR

has

the

capability

of

processing

commands

from

a

disk

file

as

well

as

from

your

TTY.

This

is

called

indirect

command

files

and

is

useful

if

you

have

several

"favorite"

commands

to

use

in

succession.

To

use

this

function

create

a

file

of

commands

just

as

you

would

type

them

on

your

TTY.

NOTE

SYSERR

does

not

support

line-sequence

numbers.

To

tell

SYSERR

to

use

this

file

the

command

is:

* @ DEV:NAME.TYP (CR)

where

DEV:

is

the

location

of

the

file

(DEFAULT

is

DSK:)

and

NAME.TYP

is

the

name

of

the

command

file.


CHAPTER 4

SYSERR REPORT

FORMATS

4.1

INTRODUCTION

This

chapter

describes

each

of

the

reports

generated

by

SYSERR.

It

is

the

intent

of

SYSERR

to

make

each

report

self-explanatory

for

those

people

who

are

knowledgeable

of

the

system.

This

chapter

is

included

to

provide

information

for

those

who

are

not

familiar

with

the

system

or

who

are

inexperienced

with

SYSERR.


4.2

REPORTING

CONVENTIONS

USED

IN

SYSERR

All

numbers

output

by

SYSERR

are

either

octal,

decimal,

or

otherwise

noted.

All

decimal

values

are

followed

with

a

period

(.)

to

indicate

that

they

are

decimal.

All

other

values

are

octal.

Values

printed

in

half-word

format

have

leading

zeros

suppressed

in

each

half

of

the

word

and

the

halves

are

separated

with

a

comma

(,).
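The numeric conventions above can be sketched as follows. This is a minimal illustration, assuming the 36-bit word with two 18-bit halves used by this processor family; the function names `parse_value` and `parse_halfword` are hypothetical and not part of SYSERR.

```python
def parse_value(text):
    """A trailing period marks a decimal value; anything else is read as octal."""
    text = text.strip()
    if text.endswith("."):
        return int(text[:-1], 10)
    return int(text, 8)


def parse_halfword(text):
    """Combine an octal 'left,right' pair into one 36-bit word."""
    left_text, right_text = text.split(",")
    left, right = int(left_text, 8), int(right_text, 8)
    if left >= 1 << 18 or right >= 1 << 18:
        raise ValueError("each half must fit in 18 bits")
    return (left << 18) | right
```

For example, `parse_value("1031.")` is read as decimal 1031, while `parse_value("10235")` is read as an octal number.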

All

register

values

which

are

translated

to

text,

such

as

a

CONI

value,

have

text

translations

only

for

bits

or

bytes

of

interest

and

the

whole

value

is

dumped.

For

example,

the

CONI

value

listed

might

include

a

DONE

bit

and

a

PI

assignment,

but

these

bits

are

not

translated

to

text.

All

dates

and

times

used

by

SYSERR,

both

in

command

strings

and

report

listings

are

local

time

unless

otherwise

stated.

The

internal

day/time

maintained

by

the

TOPS20

monitor

and

all

day/time

values

stored

in

the

error

file

are

recorded

as

GMT.

4.3 HEADER FORMAT

The top portion of each report is the header. It describes the entry type, when the entry was recorded by the monitor, the monitor run time, or uptime, at the time the entry was recorded, and the serial number of the CPU where the error was detected.

***********************************************
TOPS20 SYSTEM RELOADED(CODE 101)
LOGGED ON MON 12 JAN 76 9:15:05PM    MONITOR UPTIME WAS 0:00:12
DETECTED ON SYSTEM # 1011.
***********************************************

The

code

number

in

parenthesis

after

the

report

name

is

the

event

type

number,

as

described

in

Appendix

B,

and

is

used

by

SYSERR

to

determine

how

to

list

this

entry.


4.4

TOPS20

SYSTEM

RELOAD

SAMPLE:

***********************************************
TOPS20 SYSTEM RELOADED(CODE 101)
LOGGED ON TUE 13 JAN 76 6:15:03PM    MONITOR UPTIME WAS 0:00:22
DETECTED ON SYSTEM # 1031.
***********************************************
CONFIGURATION INFORMATION
    SYSTEM NAME:       V 1.02.35, TOPS-20 DEVELOPMENT SYSTEM #1031
    MONITOR BUILT ON:  FRI 9 JAN 76 7:36:27PM
    CPU SERIAL #:      1031.
    MONITOR VERSION:   10235.
RELOAD BREAKDOWN:
    WHY RELOAD:        SA

This

entry

is

created

each

time

the

TOPS20

monitor

is

loaded.

The

configuration

information

section

includes

the

system

name

specified

at

the

time

the

monitor

was

built,

the

version

number

and

the

date

the

monitor

was

built.

4.4.1

Reload

Breakdown

This

section

explains

why

the

monitor

was

reloaded.

If

a

BUGHLT

occurred

and

the

system

was

set

for

auto-reload,

the

BUGHLT

address

will

be

listed

and

a

Code

102

(BUGHLT/BUGCHK

Report)

entry

will

provide

information

about

the

BUGHLT

which

caused

the

reload.

If

the

reload

was

other

than

an

auto-reload

caused

by

a BUGHLT,

this

section

will

list

the

operator's

answer

to

the

"WHY

RELOAD"

question

asked

by

the

system

software

at

startup.

There

are

no

restrictions

on

what

the

operator

may

say;

however,

the

answer

should

describe

either

what

happened

to

cause

the

reload,

such

as

"BAD

MICROCODE"

or

the

expected

future

status

of

the

system

such

as

"NEW

VERSION,"

or

"SCHEDULED."


4.5

TOPS20 BUGHLT-BUGCHK

SAMPLE:

***********************************************
TOPS20 BUGHLT-BUGCHK(CODE 102)
LOGGED ON TUE 13 JAN 76 1:35:16PM    MONITOR UPTIME WAS 2:36:29
DETECTED ON SYSTEM # 1031.
***********************************************
SYSTEM NAME:       V 1.02.35, TOPS-20 DEVELOPMENT SYSTEM #1031
SYSTEM SERIAL #:   1031.
MONITOR BUILT ON:  FRI 9 JAN 76 7:36:27PM
MONITOR VERSION:   10235.

ERROR INFORMATION:
    DATE-TIME OF ERROR:  TUE 13 JAN 76 1:35:10PM
    # OF ERRORS SINCE RELOAD:  1.
    FORK #, JOB #:  51,12
    USER'S CONNECTED DIR, LOGGED IN DIR:  TPORADA , MCKIE
    PROGRAM NAME:  EXEC
    ERROR:  BUGCHK
    ADDRESS OF ERROR:  52211
    NAME:  ILLUUO
    DESCRIPTION:  KIBADU: ILLEGAL UUO FROM MONITOR CONTEXT

    CONTENTS OF AC'S:
         0:  0,0
         1:  0,215365
         2:  0,303770
         3:  0,30
         4:  777777,13
         5:  40000,0
         6:  0,100000
         7:  0,51
        10:  3,0
        11:  0,777777
        12:  0,370
        13:  22,356774
        14:  0,0
        15:  260740,301107
        16:  0,0
        17:  777642,777541

    PI STATUS:  0,177
    SELECTED VALUES:  2
        0,0    4000,1    0,0    0,0

This

report

is

generated

each

time

the

TOPS20

monitor

detects

any one

of

three

general

types

of

monitor

software

errors:

BUGHLT,

BUGCHK,

or

BUGINF.

The

most

serious

of

these

is

BUGHLT

which

will

always

crash

the

system.

At

this

point

something

is

very

seriously

wrong

and

the

monitor

doesn't

have

enough

integrity

to

attempt

any

further

error

recovery.

The

monitor

will,

however,

collect

pertinent

information

for

error

recording.

When

the

monitor

is

reloaded,

this

information

will

be

extracted

from

the

crash

dump

file,

if

present,

and

transferred

to

ERROR.SYS.

BUGCHK

and

BUGINF

are

less

serious,

perhaps

correctable

monitor-detected

errors

which

may

only

affect

particular

users

instead

of

the

entire

system.

These

errors

may or

may

not

crash

the

user

depending

on

the

error

which

occurred.

For

a

more

complete

description

of

these

types

of

errors,

refer

to

Chapter

2.


4.5.1

Report

Contents

The

upper

section

of

this

report

describes

the

version

and

name

of

the

running

monitor

and

is

identical

to

the

same

section

of

the

system

reload

report.

The

ERROR

INFORMATION

section

contains

the

majority

of

information

for

this

error.

The

date

and

time

of

the

error

are

included

primarily

to

cover

the

situation

of

a

BUGHLT

finally

being

reported

some

length

of

time

after

it

occurred.

The

number

of

errors

since

reload

are

listed

because

only

5

occurrences

of

this

type

error

entry

are

allowed

in

the

monitor's

error

recording

buffer

at

any one

time.

In

the

case

of

an

error

occurring

in

a

tight

loop,

more

than

5

entries

could

overflow

the

buffer

and

the

information

for

the

first

(and

usually

most

interesting)

occurrence

might

be

lost.

These

numbers

should

increment

by

one

for

each

report

listing;

however,

if

the

sequence

is

broken,

it

is

an

indication

that

more

than

5

entries

occurred

before

the

error

logger

module

in

the

monitor

could

empty

the

buffer.
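The check described above, looking for a broken sequence in the "# OF ERRORS SINCE RELOAD" values, can be sketched in Python. The function name `find_sequence_breaks` is hypothetical, and the sketch assumes the counts are taken in listing order from reports belonging to a single reload.

```python
def find_sequence_breaks(counts):
    """Return (previous, current) pairs where the count did not advance by exactly one."""
    breaks = []
    for previous, current in zip(counts, counts[1:]):
        if current != previous + 1:
            breaks.append((previous, current))
    return breaks
```

A result such as `[(3, 9)]` would suggest that entries 4 through 8 were lost before the error logger could empty the buffer.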

The

FORK

#

and

JOB #

are

the

numbers

associated

with

the

current

user

at

the

time

of

the

error.

A

value

of

-1 or

777777

indicates

that

the

monitor

was

performing

an

overhead

function

(such

as

scheduling)

and

there

was

no

current

user.

Note

that

the

FORK

#

and

JOB #

indicate

the

current

user

and

not

necessarily

the

user

being

serviced

by

the

monitor

interrupt

level

routines

(e.g.,

BUGCHK

detected

at

interrupt

level

during

I/O

for

a

different

user).

The

user's

connected

directory

and

logged

in

directory

are

also

for

the

current

user

and

are

listed

along

with

the

user's

program

name

to

aid

in

identifying

the

person

running

at

the

time

of

the

error.

If

several

reports

indicate

the

same

user

and/or

program,

talking

with

that

user

or

examining

that

program

should

help

in

identifying

the

source

of

the

problem.

Following

the

user

identification

is

information

specifically

identifying

the

name

and

description

of

the

error.

If

the

"/DETAIL"

switch

is

used

with

the

SYSERR

command

string,

more

information

will

be

listed

which

is

useful

for

further

analysis

of

the

error.

Included

are

the

contents

of

the

monitor's

block

of

AC's

and

the

PI

system

status.

Some

particular

errors

will

also

include

"SELECTED

VALUES."

A

maximum

of

4

values

may

be

preserved

in

the

error

file.

Description

of

these

values

is

dependent

on

the

type

of

error

which

occurred

and

may

be

obtained

from

the

monitor

listings.


4.6

MASSBUS

DEVICE

ERRORS

This

entry

is

recorded

in

the

ERROR.SYS

file

by

the

monitor

each

time

an

error

is

detected

in

the

Massbus

System

including

the

Massbus

devices

(RP04,

TU45,

and

TU16),

the

RH20

controller,

and

certain

errors

occurring

in

the

channel

logic.

4.6.1

Sample

Report

The following pages

show

sample

Massbus

device

error

reports.


***********************************************

MASSBUS

DEVICE

ERROR(CODE

III'

LOGGED

ON

TUE

13

JAN

76

12115103PM

MONITOR

UPTIME

WAS

1116117

DETECTED

ON

SYSTEM

I

1031.

***********************************************

UNIT

NAMEI

UNIT

TYPEI

VOLU"'E

101

DPI10

RP04

LBNI

59184.

CYL.

155.

SURF.

OPERATION

AT

ERRORI

USFRiS

CONNECTED

DIR,

LOGGED

IN

OIPIUNKNOWN

USER'S

PG"'I

USER'S FILEI

•

14.

SECT.

4.

DEV.AVAIL.,

GO

+

,

UNKNOWN

200000,7

2.

FINAL

ERROR

STATUSI

RETRIES

PERFOR"'ED'

ERROR.

RECOVERARLE

DRIVE

EXCEPTION,

CONTROLLER

INFOR"'ATION.

RH20

I 1

READ

DATA(70)

CONTROLLER.

CONI

AT

EPRORI

COtJl

AT

ENOl

nATAl

PTCR

AT

DATU

PTCR

AT

DATAl

PBAR

AT

DATAl

PBAR

AT

0,202415

•

DRIVE

EXCEPTION,

0,2405.

NO

ERPOR

BITS

DETECTED

ERROR.

732201,177471

END.

732201,177771

ERROR.

720001,7004

END.

720001,7007

CHANNEL

INFORMATION.

CHAN

STATUS

WD

0.

cw

II

620000,

'73 1000

CHN

STATUS

WD

II

200000,200374

CW2.

0,0

!l40100,200375

604000,73161!0 •

NOT

saus

ERR,NOT

we

•

0,LONG

WC

ERR,

CHtJ

STATUS

WD

2.

DEvICE

REGISTER

INFORMATION.

AT

END

OIry.

TEXT

CR(00)

SR

(01)

ER(02)

MR(03)

nA(A!)

OT(06)

LA

(Ql7)

OF

(11 )

DC(12)

CC

(13)

EP

(!6)

PL(

171

AT

ERROR

4070 4010 60 DEV.AVAIL.,

READ

DATA(70)

50700

100000

400

7007

24020

1740

I

0001/!0

233

5432

2000

DEVICE

STATISTICS

AT

TIME

OF

•

OF

READS.

94212.

•

or

I

SOFT

READ

ERRORSI

1.

I

HARD

READ

ERRORS.

0,

• soFT POSITIONING

ERRORS.

I

OF

MPEI

0.

I

or

NXMI

0.

10700 40000 ERR,MOL,DPR,DRY,VV,

0 100000

DCK,

400 0

7010

17

D.

TRK

•

16,

D,SECT, • 7

24020 0

1320 460

100000 0

AT

END.

OFFSET

•

NnNE

233 0

155,

233 0 15!!.

0 5412

0 2000

ERRORI

WRITESI

87776,

•

or

SEEKS

•

20330,

I

SOFT

WR

ITE

ERRORS.

0.

I

HARD

WR

ITE

ERRnAll.

0.

I

HARD

POSITIONING

ERRnRS.

0 •

I

or

OVERRUNS.

0.


•••••••••••••••••••••••••••••••••••••••••••••••

MASSBUS

DEVICE

ERRORCCODE

1111

~OGGED

ON

TUE

11

JAN'6

114111RPM

MONITOR

UPTI~E

WAS

2142111

DETECTED

ON

SYSTEM.

1911,

•••••••••••••••••••••••••••••••••••••••••••••••

UNIT

NAME

I

UNIT

TYPEI

UNIT

SERIAL

II

VOLUME

101

MT952

TU45

0024

664,

OF

rILE

•

0,

LOCATIONI

RECORD

OPERATION

AT

ERROR

I

USER'S

CONNECTED

DIR, DEV,AVAIL,

GO

+

READ

FWD,

C101

~OGGED

IN

DIRIDEMO-I

USER'S

PGMI

AEGIS

USER'S FILEI

FINAL

ERROR

STATUS

I

RETRIES

PERFORMED

1

ERRORI

NON-RECOVERABLE

,

DEMO-I

II!

,I

31,

DRIVE

LXCEPTION,

CONTROLLER

INFORMATION

I

CONTROLLER

I

CONI

AT

ERPORI

CONI

AT

ENOl

DATAl

PTCR

AT

DATAl

PTCR

AT

DATAl

PBAP

AT

DATAl

PBAR

AT

RH2111

• 0

TM02'1

5

0,202415

•

DRIVE

EXCEPTION,

0,202415

•

DRIVE

EXCEPTION,

ERROR

I

712205,17'771

E~DI

73221115,177'71

E~RORI

720005,111

ENOl

720005,0

CHANNEL

INFORMATION

I

CHAN

STATUS

WD

01

20000~,

43572

CWII

400120,546'63

CHN

STATUS

WD

II

CHN

STATUS

WD

21

cw21

615520,257000

51110000,435'4 •

NOT

SBUS

60111111110,25"65

DEVICE

REGISTER

INFORMATION

I

CRC0011

SRC0111

ER(02)

I

MP(03)1

'C(05)1

DT(06)1

CK

(07)1

SN

C

1011

TC

(1)

I

AT

ERROR

4970

54629

10931110

45600

3410

142012

"4

44

191062

AT

END

4070

54620

IIII0llH"

45600

1410

142012

774

44

101062

OEVICE

STATISTICS

AT

TIME

or

EPRnRI

Dlr"

o

ERR,

TUT

DEV,AVAIL,

READ

FWD,(71)

ERR,MOL,WRL,DPP,DRY,SDWN,

COR/CRC,PE"LRC,INC/VPE,

ACCL,

.008PI

NRI

10

CO~PATIBLE

SLAVE

.2

•

or

READS

I

2754,'

or

WRITESI

0,

•

or

SEE~SI

I,

sorT

READ

ERRORS

I

4,

•

SOFT

WRITE

ERRORS

I

III,

HARD

READ

ERRORSI

.,

•

HARD

WRITE

ERRORS

I

III,

sorT

POSITIONING

ERRORSI

0,

•

HARD

POSITIONING

ERRORS

I

0,

•

OF

MPEI

III,

•

or

NXMI

0,

•

or

OVERRUNS

I

0,


4.6.2

Report

Description

The

UNIT

NAME

refers

to

the

physical

Massbus

unit

active

at

the

time

of

the

error.

This

is

a 5

character

name

of

the

format

XXABC

where:

XX   is the device type:
         DP   disk drive (RP04)
         MT   mag tape (TU45 and TU16)

A    is the logical address of the RH20 controller for this device (0-7).

B    is the logical Massbus address for this device (0-7). For magtape units this is the TM02 address on the Massbus.

C    is the slave number of this magtape unit. For RP04 devices this number is always 0.
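The XXABC naming scheme lends itself to a small decoder. The following Python sketch is illustrative only: the function name `decode_unit_name` and the dictionary keys are hypothetical, and the range checks cover only the limits stated above (0-7 for A and B, and 0 for C on disk drives).

```python
DEVICE_TYPES = {"DP": "disk drive (RP04)", "MT": "mag tape (TU45 or TU16)"}


def decode_unit_name(name):
    """Split a five-character unit name such as DP030 into its documented fields."""
    name = name.strip().upper()
    if len(name) != 5 or name[:2] not in DEVICE_TYPES or not name[2:].isdigit():
        raise ValueError("expected a name of the form DPabc or MTabc")
    controller, massbus, slave = (int(digit) for digit in name[2:])
    if controller > 7 or massbus > 7:
        raise ValueError("controller and Massbus addresses must be 0-7")
    if name[:2] == "DP" and slave != 0:
        raise ValueError("the slave number is always 0 for RP04 drives")
    return {
        "type": DEVICE_TYPES[name[:2]],
        "controller": controller,
        "massbus": massbus,
        "slave": slave,
    }
```

Applied to the DP030 unit used in Example #5 of Chapter 3, this yields controller 0, Massbus address 3, and slave 0.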

The

LBN

listed

for

RP04

reports

refers

to

the

Logical

Block

Number

of

the

pack

being

addressed

when

the

error

occurred

and

is

translated

to

cylinder,

surface,

and

sector

to

provide

physical

location.

For

magtapes,

the

record

and

file

number

are

listed

to

show

location.

The

OPERATION

AT

ERROR

is

a

decode

of

the

last

command

issued

to

the

device

before

the

error

occurred.

If

this

command

does

not

agree

with

the

listed

contents

of

the

device's

control

register

(with

the

exception

of

the

GO

bit)

an

error

in

the

control

bus

may

have

occurred.

The

user's

connected

and

logged

in

directory,

and

program

are

listed

for

magtape

units

to

aid

in

finding

bad

tapes

which

may

be

causing

errors.

The

ERROR

statement

is

a

text

translation

of

the

RH20

CONI

at

error.

If

the

error

was

non-recoverable,

the

monitor

sets

bit

2

in

the

IORB

status

word

(MB%IRS)

and

SYSERR

states

the

error

is

NON-RECOVERABLE.

The

CONTROLLER

INFORMATION

lists

the

controller

type

and

logical

address;

and

(for

magtapes

only)

the

TM02

logical

address.

This

section

also

lists

the

CONI's

and

DATAI's.

PTCR

is

Primary

Transfer

Control

Register

and

PBAR

is

Primary

Block

Address

Register.

The

CHANNEL

INFORMATION

lists

the

contents

of

the

channel's

status

and

logout

area.

The

values

listed

for

CW1

and

CW2

are

the

contents

of

the

address

pointed

to

in

the

right

half

of

Channel

Status

Word 0

and

the

contents

of

the

address

pointed

to

+1.

The

DEVICE REGISTER INFORMATION

lists

the

contents

of

the

device's

registers

at

the

time

of

error

and

after

the

last

retry,

the

XOR

difference

and

the

text

translation

for

the

value

at

error.

The

only

exception

to

this

is

the

RP04 OFFSET

register.

The

text

translation

for

this

register

is

the

value

at

end

and

is

noted

by

"AT

END:".

If

both

the

AT

ERROR

and

AT

END

values

for

any

register

are

zero,

that

register

is

not

listed.
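The AT ERROR, AT END and XOR difference columns can be reproduced with a few lines of Python. This is a sketch, not SYSERR's listing code: the function name `register_differences` is hypothetical, and pairing the register abbreviations SR, DA and LA with the octal values below is an assumption made for illustration.

```python
def register_differences(at_error, at_end):
    """Return (name, at_error, at_end, xor) rows, omitting registers that are zero in both."""
    rows = []
    for name, error_value in at_error.items():
        end_value = at_end.get(name, 0)
        if error_value == 0 and end_value == 0:
            continue  # a register that is zero at both times is not listed
        rows.append((name, error_value, end_value, error_value ^ end_value))
    return rows


# Illustrative values only; the register-to-value pairing is assumed.
at_error = {"SR": 0o50700, "DA": 0o7007, "LA": 0o1740, "OF": 0}
at_end = {"SR": 0o10700, "DA": 0o7010, "LA": 0o1320, "OF": 0}
for name, err, end, diff in register_differences(at_error, at_end):
    print(f"{name}  {err:o}  {end:o}  {diff:o}")
```

A set bit in the XOR column marks a bit that changed between the original error and the last retry, which is why that column is useful for spotting status that cleared or appeared during recovery.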

The

DEVICE

STATISTICS

provide

an

indication

of

the

error

rate.

The

number

of

reads

and

writes

for

magtape

indicate

frames

of

tape

transferred

and

for

disks

this

is

the

number

of

blocks

transferred.
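As a rough illustration of how the statistics indicate an error rate, the counts can be divided as below. The function name `soft_error_rate` is hypothetical, and the manual does not prescribe any particular formula; this is simply one plausible reading.

```python
def soft_error_rate(operations, soft_errors):
    """Fraction of operations (blocks for disk, frames for magtape) that had a soft error."""
    if operations <= 0:
        raise ValueError("at least one operation is needed to compute a rate")
    return soft_errors / operations
```

For a drive reporting 94212 reads and 1 soft read error, the rate is roughly one soft error per 94,000 blocks read.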


4.7

FRONT

END

DEVICE

ERRORS

These

types

of

entries

are

recorded

in

the

system

error

file

by

the

monitor

as

a

result

of

a

request

for

error

logging

from

the

front

end.

These

errors

are

detected

by

the

front

end

and

it

gathers

the

error

information

and

passes

the

packet

to

the

monitor

across

the

DTE-20

for

error

logging.

The

errors

detected

by

the

front

end

fall

into

two

basic

types:

those

concerning

the

front

end

hardware

and

software;

and

those

concerning

the

KL

CPU

hardware

and

software.

Descriptions

of

the

error

detection,

recovery

and

reporting

for

the

front

end

may

be

found

in

Chapter

2.

Currently,

reports

are

created

for

the

following "devices": LP20, CD20, DH11,

KLCPU,

and

KLERROR.

4.7.1

Report

Description

The

top

section

for

all

reports

is

basically

the

same

and

includes

the

DTE-20

logical

address

for

this

front

end,

the

version

number

of

the

front

end

Software,

the

FORK

#

and

JOB #

associated

with

this

error.

If

the

FORK

#

and

JOB #

are

777777, 777777,

this

is

an

indication

that

the

TOPS20

monitor

knows

of

this

device

but

it

is

not

currently

assigned

to

any

fork

or

job.

777776,

777776

indicates

the

TOPS20

doesn't

know

anything

about

this

device.
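The two special FORK #, JOB # pairs can be told apart with a small helper. This Python sketch is illustrative; the function name `describe_owner` and the constant names are hypothetical.

```python
NOT_ASSIGNED = 0o777777    # monitor knows the device, but no fork or job owns it
UNKNOWN_DEVICE = 0o777776  # monitor knows nothing about the device


def describe_owner(fork, job):
    """Interpret the FORK # and JOB # fields of a front end device report."""
    if fork == NOT_ASSIGNED and job == NOT_ASSIGNED:
        return "device known to the monitor but not assigned to any fork or job"
    if fork == UNKNOWN_DEVICE and job == UNKNOWN_DEVICE:
        return "device unknown to the monitor"
    return f"fork {fork:o}, job {job:o}"
```

Any other pair is taken as an ordinary fork and job number and is printed in octal, following the reporting conventions of section 4.2.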

The

upper

section

of

these

reports

also

includes

the

user's

connected

and

logged

in

directories

and

program

name

as

well

as

the

device

name

and

logical

address.

It

also

lists

the

octal

value

and

text

translation

for

the

standard

status

word

generated

by

the

front

end

for

each

transfer

across

the

DTE-20.

It

is

the

ERROR

LOG

REQUEST

bit

in

this

word

which

causes

the

packet

to

be

recorded

into

the

error

file.
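The role of the ERROR LOG REQUEST bit can be sketched as a simple mask test. The bit positions below are not taken from a register definition; they are inferred from the sample listings in this chapter (100 listed as ERROR LOG REQUEST, 1100 as DEV HUNG plus ERROR LOG REQUEST, 2100 as LOST INTERRUPT plus ERROR LOG REQUEST) and should be treated as assumptions. The function names are hypothetical.

```python
# Assumed masks, inferred from the sample reports rather than from hardware documentation.
STATUS_BITS = {
    0o2000: "LOST INTERRUPT",
    0o1000: "DEV HUNG",
    0o100: "ERROR LOG REQUEST",
}


def decode_std_status(value):
    """List the names of the known bits that are set in the standard status word."""
    return [name for mask, name in STATUS_BITS.items() if value & mask]


def should_log(value):
    """True when the ERROR LOG REQUEST bit asks for the packet to be recorded."""
    return bool(value & 0o100)
```

Under these assumptions a status of 1100 (octal) decodes to DEV HUNG and ERROR LOG REQUEST, and the packet would be recorded in the error file.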

The remainder of each device report is dependent on the type of device being reported. If SYSERR does not know how to list a device, this fact will be stated in the report and the entry will be listed in octal.

4.7.2

LP20

Report

Description

***********************************************

FRONT

END

DEVICE

EPRO~(CODE

t30)

LnGGF.D

ON

WED

7

JAN

76 12138120PM

MONITOR

UPTIME

WAS

0121145

DETECTED

ON

SYSTEM.

1031.

***********************************************

DTE20

"

0.

FE

SnFTWARE

VERI

0.

FORK

."JOell

35,t5

USER'S

CONNECTED

DIR,

LOGGED

IN

DIPIPORCHER

,

POPCHER

USER'S

PROGRAMI

TEeO

DEVICE. LP20 •

0,

STD.

STATUS

I 120 =

ERROR

LOG

REQUEST,

LP2~

GEN

STATUS.

0.

NO

ERROR

BITS

DETECT~D

LP20 DEVICr

REGISTERS

LPCSRAI

tt4102

•

ERROR,DAVFU,ON

LINE,INT

ENB,PAR

ENS,

LPCSRBI

10~e3

•

LPT

DATA

PAR,

DEMAND

TIMEOUT,GO

ERRnR,

LPBSADI

1~4504

LPRrTRI 354

LPPCTRI

7734

LP~A~OI

t0V.0~

LPCCT~I

1~2

COL.

CNTR.

D 0

CHAR.

BUF

• 152

LPTDATI

1554t2

CHKSU~.

333

LPT

DATA.

12

The

device-specific

section

of

this

report

includes

the

LP20

GENERAL

STATUS

created

by

the

front

end

software

and

contents

of

the

various

LP20

controller

registers.

Text

translations

are

also

included

for

both

status

registers

and

LPCCTR

and

LPTDAT.

If

the

contents

of

a

register

are

zero

in

the

error

file,

that

register

is

not

listed.

4.7.3

CD20

Report

Description

•••••••••••••••••

*

••

**

••••

*

••

*********

••

******.

r~ONT

END

DEVICE

ERRORCCODE

130)

LOGGED

ON

THU

8

JAN

76

111~8149AM

MONITOR

UPTIME

WAS

2102150

DETECTED

ON

SYSTEM.

1031,

**.********.***.*

••••

*

••••••

*

••••

*.**

••••••

**

••

D1E20

'1

0.

rE

SOFTWARE

VERI

0.

FORK

."JOB'I

777777,777777

USER'S

CONNECTED

DIR,

LOGGED

IN

DIRIUNKNOWN

,

UNKNOWN

USER'S

PROGRAM

I

DEVICE

I

CD20

•

0.

STD,

STATUS

I 106 •

ERROR

LOG

REQUEST,HDWR

ERR

OPR

REQ'O,OFF LINY,

CD20

GEN

STATUS

I

17

•

HOPPER

EMPTY,STACK

CHECK,PICK

CHECK,REA~

CHECK,

C020

DEVICE

REGISTERS

CDllST 10304 • OFF.LINE,REAOY,INT

£NB,HOPPER

CHECK,

COllDBI 177777


As

with

the

LP20

the

remainder

of

this

report

includes

the

CD20

GENERAL

STATUS

word

maintained

by

the

front

end

software

and

the

octal

contents

and

text

translation

of

the

CD11

device

controller's

registers.

4.7.4

DH11

This

entry

is

created

by

the

front

end

each

time

it

detects

one

of

two

errors

associated

with

a

DH11.

These

two

errors

are

DEVICE

HUNG

and

LOST INTERRUPT.

Samples

of

each

report

are

shown

below.

***********************************************

FRONT

END

DEVICE

ERRO~CCODE

130)

LOGGED

ON

MON

5

JAN

76 12152155AM

MONITOR

UPTIME

WAS

~106106

DETECTED

ON

SYSTEM

I

1031.

***********************************************

OTE20

II

0.

FE

SOFTWARE

VERI

0.

FOPK

."JOB'I

777776,777776

USF.R'S

CONNECTED

DIR,

LOGGED

IN

OIRIUNK~OWN

,

UNKNOWN

USEP'S

PROGRAMI

DEVICEI

DH1t

•

0.

STD.

STATUSI

1100

•

DEV

HUNG,

ERROR

LOG

REQUEST,

CONTENTS

OF

COUNTERS

I 24

36

40

***********************************************
FRONT END DEVICE ERROR(CODE 130)
LOGGED ON MON 5 JAN 76 12:48:58AM MONITOR UPTIME WAS 0:02:09
DETECTED ON SYSTEM # 1031.
***********************************************
DTE20 #: 0.
FE SOFTWARE VER: 0.
FORK #,,JOB #: 777776,777776
USER'S CONNECTED DIR, LOGGED IN DIR: UNKNOWN , UNKNOWN
USER'S PROGRAM:
DEVICE: DH11 # 0.
STD. STATUS: 2100 = LOST INTERRUPT,ERROR LOG REQUEST,
PAGE ADDR OF DH WHICH FAILED: 160020


4.7.5 KLCPU

This entry is created each time the front end reloads the CPU without being reloaded itself. If both are reloaded, the generated entry is SYSTEM RELOADED (CODE 101). If the KL reloads the front end, a FRONT END RELOADED (CODE 131) entry is created. If the front end reloads the KL, this report is created. The report includes the reason for the reload as determined by the front end software.

***********************************************
FRONT END DEVICE ERROR(CODE 130)
LOGGED ON TUE 13 JAN 76 10:59:45AM MONITOR UPTIME WAS 0:00:58
DETECTED ON SYSTEM # 1031.
***********************************************
DTE20 #: 0.
FE SOFTWARE VER: 0.
FORK #,,JOB #: 777776,777776
USER'S CONNECTED DIR, LOGGED IN DIR: UNKNOWN , UNKNOWN
USER'S PROGRAM:
DEVICE: KLCPU
STD. STATUS: 100 = ERROR LOG REQUEST,
KL RELOAD STATUS FROM FRONT END: 20 = KEEP ALIVE STOPPED,

4.7.6 KLERROR

This report is perhaps the most complex of the error file entries. If the KL clock stops for any of several errors (FAST MEMORY PARITY ERROR, CRAM PARITY ERROR, DRAM PARITY ERROR, or FIELD SERVICE STOP), a software routine in the front end is called to gather a snapshot of the KLCPU. This routine creates an output file on either the dual-ported RP04 or the floppy disk. When the system is reloaded, the front end passes the contents of this file across the DTE20 and into the system error file in several entries. SYSERR pieces this file back together in core and then lists its contents. For a complete description of this portion of error detection and recording, refer to Chapter 2.

SAMPLE REPORT

The following page shows an example of the report for the first record. The listing format of the 2nd record is identical.


***********************************************
FRONT END DEVICE ERROR(CODE 130)
LOGGED ON TUE 13 JAN 76 10:59:46AM MONITOR UPTIME WAS 0:00:59
DETECTED ON SYSTEM # 1031.
***********************************************
DTE20 #: 0.
FE SOFTWARE VER: 0.
FORK #,,JOB #: 777776,777776
USER'S CONNECTED DIR, LOGGED IN DIR: UNKNOWN , UNKNOWN
USER'S PROGRAM:
DEVICE: KLCPU
CONTENTS OF KLERROR FILE:
CONTENTS OF RECORD #1 SEEN AT FIRST ERROR

CREATED.

1.13.76

AT.

10.131\0

FILr.

U-CODE

VEP.

0.

FE SEP

••

FOR~AT

VERSION. 1

0,

KL

SER "

PECORO

LENGTH.

1000

BYTES

0.

MONITOR

VERI

0.

FE

SOFTWARE

VER.

0,

ERROR

CODE. 0 •

NONE

OTE

DIAG STATUSI

2104

DTE

OIAG

11

1400

DTE

DIAG 21

PEADS

1021

1101

1161

1241

1321

1401

1461

1541

1621

170'

1761

o

DTE

[lUG

31

21000

VALUES

RETURNED

fROM

DIAGNOSTIC FUNCTION

\001

000377602664

1011

000000002600

1061

00000~642000

107.

000000614640

1141

000000005700

1151

000033702\57

1221

000000000001

1231

000072000000

\301

32006p204000

1311

121210021207

\361

000000042004

1371

000000002004

1441

000~0e050005

1451

000000000125

\521

000001044014

1531

600001044031

1601

051600035230

1611

700002235322

1661

000000000000

1671

000000000000

1741

000000000000

1751

000000000000

00('10\

3400222

0P1107067462

000000101'1000

0700541'160000

5305040""303

00000004001(/14

211003016024

01/10(11Q'1044031

146010137664

"'13"'630000010

000000100"'000

103.

000024252024

1111

001500002O03

1171

70031'1000571'10

1251

0141'160720000

1311

000201000001

1411

00000000'''104

1471

211006276700

1551

100021/1000204

1631

265040125355

1711

"'00011/10010000P

1771

00000000100100

1041

000000032422

112'

00'10~000000

1'01

1'101'10001'100000

1261

000000414000

1341

500000001010

1421

0~"'0000!2405

150.

21100620611t14

1561

360000726722

1~41

020001'1337365

1721

~00000"'00"'00

1051

000000002421

113.

000000000000

121'

0000000000'-1

1271

1

]0046404000

115.

001210021207

141.

~0000000240!

151'

000001044000

1571

020100635722

1651

340000!

33305

173.

00100000000010


4.7.6.1 Report Description - The date and time in the header of this report indicate when the last packet of the KLERROR file was entered in the system error file. The date and time following CREATED are the date and time of the error as recorded by the front end. The format is MM:DD:YY AT: HH:MM:SS. The RECORD LENGTH is the length of this record in PDP-11 bytes (8 bits = 1 byte). The ERROR CODE is the octal value and text translation of the error that was seen (CRAM PARITY, DRAM PARITY, etc.) when the snapshot was taken. The DTE DIAG words are the values read by the front end from these registers in the DTE-20. The values returned from the diagnostic function reads are identified by the FR number; each value listed is 12 octal characters (36 bits) wide, in the same format as that listed on the CTY when performing a function read command.
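As an illustration of the "12 octal characters (36 bits)" format, the following minimal Python sketch parses one listed value and splits it into 18-bit halves. The function names and the `LH,,RH` display are assumptions made for this example; they are not part of SYSERR.

```python
def parse_word36(text):
    """Parse a 12-digit octal string into a 36-bit integer."""
    if len(text) != 12 or any(ch not in "01234567" for ch in text):
        raise ValueError("not a 12-digit octal word: %r" % text)
    return int(text, 8)


def halves(word):
    """Split a 36-bit word into its left and right 18-bit halves."""
    return (word >> 18) & 0o777777, word & 0o777777


# Value taken from the sample listing as it appears in this copy.
left, right = halves(parse_word36("000377602664"))
print("%06o,,%06o" % (left, right))  # prints 000377,,602664
```

Rejecting any string that is not exactly 12 octal digits is a deliberate choice here, so that a damaged listing line is reported rather than silently misread.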

If the front end has problems gathering and recording this snapshot, an error code is included in the file, and the SYSERR listing will state that the data may not be valid and list the 3-character error code stored in the error file. These error codes are listed here in alphabetical order with an expanded description of the error:

FRONT END ERROR CODES

Code    Description

APC
AMP
APN
APP
APR
APS
BAE
BUG
CAE
CCC
CCR
CCS
CES
CFH
CSC
CSR
DAE
DMF
DNP
DSF

Cache

directory specified by user. The user should check for disk quota exceeded or file protection failures.

?ERROR DURING OUTPUT

SYSERR detected an output error while writing the list file.

?CAN'T OPEN INPUT DEVICE

This implies that SYSERR was unable to perform an open on the input device specified by the user.

?CAN'T OPEN OUTPUT DEVICE

This implies that SYSERR was unable to perform an open on the output device specified.

?LOOKUP ERROR ON INPUT FILE

This implies that SYSERR was unable to look up the input file specified by the user. Check to see that the input file is on the input device and is not read-protected.

?SYSERR TRYING TO DO LISTING

This error message indicates that the first high segment, SYSERR, is about to try to produce a listing; however, none of the known entry codes are processed by this segment.

The following warning messages are output by SYSERR:


%EOF MARKER FOUND IN BODY OF SYSTEM ERROR FILE

SYSERR has seen an EOF word written by the monitor in the body of the error file. This is normal if error files are combined.

%DUMPING UNKNOWN ERROR TYPE IN OCTAL

SYSERR detected an entry whose error code did not match any of the known error types.

%EXCEEDED PAGE LIMIT. PERFORMING SUMMARY

SYSERR has output more than the allowable pages of listing, currently defaulted to 1000, and is now terminating the listing and performing the summary. The reason for this limit is that 1000 pages of report is more than anyone can absorb. A repeated error can generate a lot of repeated output. The user should examine the summary and select the subset, by date and device, that is of interest.

%ENTRY WITH ZERO LENGTH HEADER SPECIFIED
%ENTRY WITH ZERO LENGTH BODY SPECIFIED

Both of these messages usually indicate that SYSERR has lost sync in the error file. Recovery is attempted as stated in Appendix B.

%SYRERI: FATAL ERROR READING INPUT FILE

SYSERR has encountered a checksum or parity error while reading the current input file. The package will look for any other input files, process them, and then generate the summary listings.

%UNKNOWN DEVICE NAME FOUND IN ENTRY

SYSERR has found a device name in the error file that it doesn't recognize, such as DPA7 when SYSERR's configuration only knows about 6 DPA's. SYRUNV should be changed to reflect your system configuration. See Appendix C for instructions concerning compiling and loading.

%EXPECTED ERROR CODE NOT FOUND ON TABLE OF SUBJECT ERROR CODES

An event code has been found in the error file within the range of those codes eligible for SYSERR processing (see Appendix B), but none of the SYSERR modules has the ability to process it. The entry will be dumped in octal in the output file.

%SYRRNR: RESTARTING IN THE OF ERROR FILE

SYSERR has encountered problems and has lost sync in the current block. It has gotten the of the file, found the offset and has started processing again with the first entry in this block.


%SYRCNR: CANNOT RE-SYNC, TRYING

As above, SYSERR has lost sync and gotten the of the file but there is no pointer word to the start of the first entry in this block. SYSERR will look at each block until either it finds a valid pointer word or end-of-file is encountered.

%DUMPING PARTIAL CONTENTS OF KLERROR FILE IN OCTAL

SYSERR was building the file in core and one of two events occurred: 1) an inconsistency was detected in the system error file, or 2) SYSERR detected the start of another KLERROR file.


APPENDIX B

ERROR FILE DESCRIPTIONS

This appendix contains descriptions of the format of the error file and the contents of each type of entry. The file is created and appended to by a portion of the monitor and read by SYSERR. Each entry is considered a separate entity by SYSERR and is treated separately. The recording program also considers each entry or record separately and appends each to the end of the file. The only exception to this policy is the synchronization word at the start of each block (128 words). This word is a pointer or offset to the start of the first entry in the current block and is used by SYSERR to get back in sync in case of trouble. This word is required because entries may cross block boundaries to conserve disk space. The use of this resync word is described in Appendix A, SYSERR ERROR MESSAGES, and the following diagram shows the typical layout of entries across a block boundary.

BLOCK X    BLOCK Y

The pointer in BLOCK X points to the start of the first entry in this block. The last entry in this block (ENTRY A) crosses into BLOCK Y. The pointer in BLOCK Y points to the start of ENTRY B.
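The resynchronization idea described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the file is modeled as a list of 36-bit integers, the offset in the pointer word is taken to be relative to the start of its block, a zero or out-of-range offset is treated as "no entry starts in this block", and the search begins with the block after the one in which sync was lost. None of these details are confirmed by this manual, and the names `resync`, `entry_type`, and `BLOCK_WORDS` are hypothetical.

```python
BLOCK_WORDS = 128      # block size stated in the text
SYNC_CODE = 0o775      # event code of the offset word (see Event Codes)


def entry_type(word):
    """Bits 0-8 (the top 9 bits of a 36-bit word) hold the event code."""
    return (word >> 27) & 0o777


def resync(words, lost_at):
    """Return the index of the first entry in the next usable block, or None."""
    block = lost_at // BLOCK_WORDS + 1
    while block * BLOCK_WORDS < len(words):
        base = block * BLOCK_WORDS
        first = words[base]
        offset = first & 0o777777          # RH of the offset word
        # Assumption: offset is relative to the block start; 0 or an
        # out-of-range value means no entry starts in this block.
        if entry_type(first) == SYNC_CODE and 0 < offset < BLOCK_WORDS:
            return base + offset
        block += 1                          # keep looking, block by block
    return None                             # reached end of file
```

Returning `None` at end of file mirrors the behavior described for the %SYRCNR warning, where SYSERR examines each block until it finds a valid pointer word or reaches end-of-file.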


Each entry or record in the error file is composed of two sections, a header section and a body section. The header section contains the entry type, the date and time the event was recorded, the serial number of the processor which detected the error, and the length of the header and body sections. The body section contains the various data items which make up the entry. The format of the header section is constant for each version of the header section regardless of the entry type and is described below. The format of the body section for each entry type is described on succeeding pages.

ENTRY HEADER FORMAT

This header is used to describe the contents of each entry.

HDRCOD    *
HDRDAT    Date & time of entry in Universal Format
HDRUPT    System uptime at entry. LH = #days, RH = fraction of day
HDRPSN    Processor serial # where entry was recorded

*HDRCOD

BITS      DESCRIPTION
0-8       Entry type, tells program how to process this entry. See below for range of event codes.
9-16      Reserved.
17        This entry recorded by TOPS20.
18-23     Header Format Version, presently 1.
24-26     Header Length, presently = 4.
27-35     Entry Length excluding header, maximum 777.
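The HDRCOD bit layout can be illustrated with a short Python sketch. Bits are numbered from 0 at the most significant end of the 36-bit word, as in the table above. The function and key names are hypothetical, and the uptime conversion assumes that the right half of HDRUPT counts units of 1/2**18 of a day, which the manual does not state explicitly.

```python
def field(word, first_bit, last_bit):
    """Extract bits first_bit..last_bit, with bit 0 as the most significant of 36."""
    width = last_bit - first_bit + 1
    return (word >> (35 - last_bit)) & ((1 << width) - 1)


def decode_hdrcod(word):
    """Split the HDRCOD word into the fields listed in the table."""
    return {
        "entry_type": field(word, 0, 8),
        "tops20": bool(field(word, 17, 17)),
        "header_version": field(word, 18, 23),
        "header_length": field(word, 24, 26),
        "body_length": field(word, 27, 35),
    }


def uptime_hms(hdrupt):
    """Convert HDRUPT (LH = days, RH = fraction of day) to (days, h, m, s)."""
    days = (hdrupt >> 18) & 0o777777
    fraction = hdrupt & 0o777777            # assumed unit: 1/2**18 day
    seconds = fraction * 86400 // (1 << 18)
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return days, hours, minutes, secs


# Hypothetical header: event code 130, recorded by TOPS20, version 1,
# header length 4, body length 7.
print(decode_hdrcod(0o130001014007))
```

Keeping the bit arithmetic in one `field` helper makes each line of `decode_hdrcod` read like a row of the BITS table, which makes it easier to check against the manual.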


Event Codes

All event codes are in the range of 0 to 777 with reservations as described:

000        Illegal
1-376      Reserved by DEC for use with SYSERR.
400-477    Reserved for customer use with SYSERR.
500-577    Reserved by DEC for use with programs other than SYSERR.
600-677    Reserved for customer use with programs other than SYSERR.
700-777    Reserved for all for error file control.

In some cases (mostly for error file control) only the first word (HDRCOD) is included. This is the minimum required for any entry. The current event codes used for file control are as follows:

377        The recording program has detected an error in the file and has started using the error file. See Appendix A of this manual.
775        Offset word in the first word of each block of the error file. RH points to start of the first entry in this block.
777        End of File, tells SYSERR to look for other input files, or to start summary listings if no other files are found.
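The reservations above amount to a range lookup, sketched here in Python. The function name and the returned strings are illustrative only. Code 377 is not covered by any of the listed ranges but appears among the file control codes, so this sketch treats it as file control; that handling is an assumption.

```python
# (low, high, meaning) in octal, following the reservation table above.
EVENT_CODE_RANGES = [
    (0o000, 0o000, "illegal"),
    (0o001, 0o376, "DEC, for use with SYSERR"),
    (0o377, 0o377, "error file control"),   # assumption, see lead-in
    (0o400, 0o477, "customer, for use with SYSERR"),
    (0o500, 0o577, "DEC, for programs other than SYSERR"),
    (0o600, 0o677, "customer, for programs other than SYSERR"),
    (0o700, 0o777, "error file control"),
]


def classify_event_code(code):
    """Return the reservation class for a 9-bit event code."""
    for low, high, meaning in EVENT_CODE_RANGES:
        if low <= code <= high:
            return meaning
    raise ValueError("event code out of range 0-777 octal: %o" % code)


print(classify_event_code(0o130))  # prints DEC, for use with SYSERR
```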


SYSTEM RELOADED EVENT CODE 101

SEC%RL==101             ;Event code

RL%SVN==0               ;System name (ASCIZ PTR)
RL%STD==1               ;Time of system build (universal FMT)
RL%VER==2               ;System version number
RL%SER==3               ;APR serial number
RL%OPR==4               ;Operator answer to why reload (ASC PTR)
RL%HLT==5               ;Bughlt address (if auto reload)
RL%FLG==6               ;Flags
RL%SIZ==7               ;Size of data block
RL%LEN==RL%SIZ+30       ;Size of whole block (incl 2 strings)


BUGHLT/BUGCHK EVENT CODE 102

SEC%BG==102             ;Event code

BG%SVN==0               ;System name (ASCIZ)
BG%SER==1               ;APR serial number
BG%VER==2               ;Monitor version
BG%SDT==3               ;TAD of Monitor build
BG%FLG==4               ;Flags
  BG%CHK==1B1           ;BUGCHK type code
  BG%INF==1B2           ;BUGINF type code
  BG%HLT==1B3           ;BUGHLT type code
BG%ADR==5               ;Address of HLT/CHK
BG%JOB==6               ;FORKX,,job number
BG%USR==7               ;User number
BG%PNM==10              ;Program name (sixbit)
BG%MSG==11              ;Message (ASCIZ)
BG%ACS==12              ;ACS
BG%PIS==32              ;PI status
BG%RCT==33              ;Register count
BG%REG==34              ;Registers (maximum of 4)
BG%NAM==40              ;Sixbit name of check
BG%DAT==41              ;Time and date of BUGHLT/BUGCHK
BG%CNT==42              ;Number of bug checks since startup
BG%SIZ==43              ;Size of data block
BG%LEN==BG%SIZ+30       ;Length of total block, incl 2 strings


MASSBUS DEVICE ERROR EVENT CODE 111

SEC%MB==111             ;Event code

MB%NAM==0               ;Device name (if available)
MB%VID==1               ;Volume ID (sixbit)
MB%TYP==2               ;Channel,,device type - see PHYPAR
MB%LOC==3               ;Location of error - sector or file,,record
MB%FES==4               ;Final error state - device dependent
MB%CNI==5               ;CONI initial
MB%CIF==6               ;CONI final
MB%SEK==7               ;Number of seeks
MB%RED==10              ;Number of blocks/frames read
MB%WRT==11              ;Number of blocks/frames written
MB%UAD==42              ;Unit address
MB%SPE==43              ;Soft positioning errors
MB%HPE==44              ;Hard positioning errors
MB%OVR==45              ;Overruns
MB%ICR==46              ;Initial TCR

;The following locations are the units Massbus registers in order
;Final contents,,initial error contents

MB%REG==47
MB%SIZ==MB%REG+20       ;Size of data block
MB%LEN==MB%SIZ          ;Total length, currently no strings reported


FRONT END ERRORS EVENT CODE 130

SEC%FE==130             ;Event code

FE%FJB==0               ;Fork number,,job number
FE%DIR==1               ;Directory numbers
FE%ID==2                ;Front end software version
FE%PGM==3               ;Sixbit name of program
FE%COD==4               ;Protocol device code (1B0=unknown)
FE%PRT==5               ;-Length of data,,start of data
FE%DTE==6               ;DTE number
FE%INF==7               ;Start of error information
FE%SIZ==7               ;Size of data block (header)
FE%LEN==FE%SIZ          ;Minimum block to allocate


FRONT END RELOAD ENTRY. GIVES -11 REBOOT INFORMATION EVENT CODE 131

SEC%11==131             ;-11 Reload

R1%NUM==0               ;-11 number
R1%STS==1               ;Reload status bits
  .R1GTF==1B0           ;GTJFN failed for dump file
  .R1OPF==1B1           ;OPENF failed for dump file
  .R1DPF==1B2           ;Dump failed
  .R110E==1B3           ;To -10 error on dump
  .R111E==1B4           ;To -11 error on boot
  .R1ASF==1B5           ;ASGPAG failed on dump
  .R1RLF==1B6           ;Reload failed
  .R1DPF==1B7           ;-11 didn't power down
  .R1PUF==1B8           ;-11 didn't power up
  .R1RMF==1B9           ;ROM did not ack the -10
  .R1BSF==1B10          ;-11 boot program didn't make it to the -11
  .R1NRL==1B11          ;-11 took more than 1 minute to reload.
                        ;will cause a retry
  .R1RTC==6B35          ;Retry count
R1%FNM==2               ;File name pointer
R1%SIZ==3               ;Number of entries
R1%LEN==R1%SIZ+^D20     ;Allow long string
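As a reading aid for the reload status bits, the following Python sketch turns a status word into the list of conditions it reports. In the `1Bn` notation used above, the value 1 is placed at bit n, counting bit 0 as the leftmost bit of the 36-bit word. The table and function names are hypothetical, the descriptions are copied from the comments in the listing, and the retry count field is deliberately left out because its exact width is not clear from this copy.

```python
# Bit number (0 = leftmost of 36) -> condition, per the listing above.
RELOAD_STATUS_BITS = {
    0: "GTJFN failed for dump file",
    1: "OPENF failed for dump file",
    2: "Dump failed",
    3: "To -10 error on dump",
    4: "To -11 error on boot",
    5: "ASGPAG failed on dump",
    6: "Reload failed",
    7: "-11 didn't power down",
    8: "-11 didn't power up",
    9: "ROM did not ack the -10",
    10: "-11 boot program didn't make it to the -11",
    11: "-11 took more than 1 minute to reload; will cause a retry",
}


def decode_reload_status(word):
    """Return the conditions whose bits are set in a 36-bit status word."""
    return [
        text
        for bit, text in sorted(RELOAD_STATUS_BITS.items())
        if word & (1 << (35 - bit))
    ]


# Hypothetical status word with bits 0 and 6 set.
for line in decode_reload_status((1 << 35) | (1 << 29)):
    print(line)
```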


PROCESSOR PARITY TRAP EVENT CODE 160

SEC%PT==160             ;Event code

PT%PFW==0               ;Page fail word
PT%BDW==1               ;Bad data word
PT%GDW==2               ;Good data word
PT%USR==3               ;User number
PT%JOB==4               ;FORKX,,JOBN
PT%PGM==5               ;Program name (sixbit)
PT%PMA==6               ;Physical memory address
PT%TRY==7               ;Flags,,retry count
  PT%HRD==1B1           ;Hard error
  PT%CCF==1B2           ;Cache failure
  PT%CCH==1B3           ;Cache in use
  PT%ESW==1B4           ;Errors on sweep to core
PT%SIZ==10              ;Size of data block
PT%LEN==PT%SIZ          ;Length of total block


PROCESSOR PARITY INTERRUPT EVENT CODE 161

SEC%PI==161             ;Event code

PI%CNI==0               ;CONI APR
PI%ERA==1               ;ERA
PI%FP2==2               ;PC
PI%SWP==3               ;Number of errors this sweep
PI%AAD==4               ;Logical and of bad addresses
PI%OAD==5               ;Logical or of bad addresses
PI%ADA==6               ;Logical and of bad data
PI%ODA==7               ;Logical or of bad data
PI%SBD==10              ;Logical SBus diag function data (10 wds)
PI%NSD==^D10            ;Number of SBus diag fn words
PI%ADD==22              ;First 10. bad addresses
PI%DAT==34              ;First 10. bad data words
PI%NBW==^D10            ;Number of bad words
PI%SIZ==46              ;Size of data block
PI%LEN==PI%SIZ          ;Length of total block

APPENDIX C

ASSEMBLY INSTRUCTIONS FOR SYSERR PACKAGE

The SYSERR package for DECsystem-20 comprises 4 source modules:

SYRUNV.MAC    Universal file containing revision history, macro definitions, and low segment data area definitions and storage locations, etc. This module is only used during the assembly process and is not used at run time.
SYSERR.MAC    Routines for file initialization and command parsing.
SYSERD.MAC    PROCSD routines used for listing DECsystem-20 entries.
SYSERS.MAC    Summary listing routines for all entries.

Additional routines required to load with the package include:

SCAN.REL      Command scanner, general utility and output routines.
HELPER.REL    Finds and lists the HELP file.

A separate file which gives brief instructions for running SYSERR is SYSERR.HLP and should be located in the same directory as the SYSERR package.

The easiest method to compile, load, and save the SYSERR package is to use the batch control file distributed with the package. To submit the job to batch, the command is:

@ SUBMIT SYSERR/RESTART:1/TIME:20:00/UNIQ:0

If CREF listings are desired, add /TAG:CREF to the command. The control file may also be listed and the commands typed on your terminal as they appear in the file if you don't wish to use batch. The package should always be loaded with local symbols to allow debugging if required without having to re-compile or re-load the package.


