ILLIAC_IV_System_Study_Progress_Report_No_2_Oct66 ILLIAC IV System Study Progress Report No 2 Oct66

OCTOBER 13,

1966

illlAC IV

SYSTEM

STUDY

PROGRESS

REPORT

NO. 2

SUBMITTED

TO

UNIVERSiTY OF ILLINOIS

UNIVERSITY

OF

ILLINOIS

PURCHASE

ORDER

NO.

09852-B

SECTION

I

SECTION

II

SECTION

III

CONTENTS

INTRODUCTION

•

THE

ILLlAC

IV

SYSTEM

ROUTING

Shifting

at

the

PE

Level.

Shifting

at

the

Quadrant

Level.

Inter-Quadrant

Transfers

•

Machine

Instruction

•

THE

MULTIPLY

ALGORITHM

EIGHT-BIT

WORD

LENGTHS

COMMON

DATA

PATHS.

Input-Output

Buffering

•.

•

From

Control

Unit

to

Proces

sing

Element.

From

Processing

Element

to

Control

Unit.

Evaluation

.

MODE.

Introductory

Considerations

The

Problem

Back-up

Storage

in

the

Control

Unit.

Back-up

Storage

in

Processing

Element

Functional

Capability.

Comparison.

REVISED

PE

LOGIC

DESIGN

•

PROCESSING

ELEMENT

DESCRIPTION

•

MEMORY

DA

TA

REGISTER

(MDR)

•

OPERAND

SELECT

GATES

(OSG) •

A-B

REGISTER

(ACCUNIULA

TOR)

LOOIC

UNIT

BARREL

SWITCH.

LEADING

ONES

DETECTOR

•

MULTIPLJCAND

SELECT

GATES

Page

1-1

2-1

2-3

2-7

2-8

2-12

2-14

2-15

2-16

2-17

2-18

2-19

2-20

3-1

3-4

3-5

-

iii-

CONTENTS

(Cont.

)

Page

SECTION

III

(Continued)

PSEUDOADDER

TREE

3 -5

CARRY

PROPAGATE

ADDER

(CPA)

3

-5

MODE

REGISTER.

3-6

ADDRESS

REGISTER.

3-6

PE

INDEX

REGISTER

3-6

MEMORY

ADDRESS

REGISTER

(MAR)

.

3-7

SPECIAL

REGISTER

(SR)

3-7

SECTION

IV

THE

I/O

SUBSYSTEM

4-1

GENERAL

CONSIDERATIONS.

4-1

DISK

STORAGE.

4-2

SECTION

V

PROGRAMMING

5 -1

REVISED

PE

INSTRUCTIONS

. 5 -1

CONTROL

UNIT

INSTRUCTIONS.

5-4

Elements

of

the

Control

Unit

5-6

Discussion

of

the

Instructions.

5-6

SECTION

VI

ILLIAC

IV

APPLICATIONS

STUDY.

6-1

DESCRIPTION

OF

THE

COOLEY-TUKEY

ALGORITHM

6-1

MACHINE

I:=\lPLEMENTA

TION

6-3

IMPLEMENTATION

ON

ILLIAC

IV

.

6-4

STORAGE

OF

THE

COEFFICIENTS

A(k)

.

6-6

COMPUTATION

AND

STORAGE

OF

INTERMEDIATE

RESULTS.

6-6

COMPUTA

TIONS

REQUIRED

6-10

SECTION

VII

CIRCUIT

DESIGN

-

THE

ECL

CIRCUIT

7

-1

-iv-

Figure

2-1

2-2

2-3

2-4

2-5

2-6

3-1

6-1

6-2

6-3

7-1

7-2

7-3

7-4

7-5

Table

2-1

2-2

LIST

OF

ILLUSTRA

TIONS

PE

Level

Shifting

•

Quadrant

Nearest

Neighbor

Connections.,

Pseudo

Physical

Layout

of

a

Quadrant

•

"Nearest

Neighbors"

Arrangement

of

Full

Array,

Showing

Quadrant

Subarrays

•

Logic

Elements

Involved

in

Multiply

Algorithm

•

System

for

Handling

Data

From

Control

Unit

to

Processing

Element,

Block

Diagram

•

Revised

PE

Logic

Design,

Block

Diagram.

Interpretation

of

k

as

a

Binary

Number

to

Define

Storage

Location

of

A(k}

The

Numbering

of

the

PEls

Within

a

Quadrant

The

Dis

tribution

of

the

Coefficients

A(k}

in

the

Array

Initially

•

ECL

Gate,

Schematic

Diagram

Driver,

Schematic

Diagram.

Receiver,

Schematic

Diagram.

ECL

Driver-Receiver,

Balanced

Signals,

Schematic

Diagram

•

Use

of

Drivers

and

Receivers

for

Distributing

Control

.

Signals

from

CU

to

PEls

Shift

Table,

Showing

Shifts

by

One's

and

Shifts

by

Eight's.

Relationship

of

Gate

Size

to

Multiplier

Size

Page

2-2

2-6

2-10

3-2

6-5

6-6

7-2

7-4

7-7

2-5

·2-11

-v-

-vi-

Table

3-1

4-1

6-1

6-2

7-1

LIST

OF

ILLUSTRA

TIONS

(Cont'd)

PE

Logic

Requirements

512-Bitl

Parallel

Organizations

for

the

Librascope

4802

Shifts

Required

for

Combinations

of

Bits

j 2

and

k •

r-

m-r

Shifting

Required

for

the

Final

Eight

Iterations.

Signals

Requiring

Driver

and

Receiver

Page

3-3

4-6

6-8

6-9

7-6

SECTION

I

INTRODUCTION

This

report

is

the

second

of

three

reports

to

be

submitted

during

the

Phase

I

effort

of

the

ILLIAC

IV

Program.

At

this

measured

milestone

in

the

development

of

the

ILLIAC

IV

system

real

progress

is

indeed

evident.

There

exists

now

a

detailed

specification

of

the

Processing

Element

for

which

a

fully

compatible

and

complete

instruction

repertoire

has

been

developed.

In

addition,

the

data

transfer

paths

establishing

the

links

between

the

Input-Output

and

the

Array,

between

the

Control

Unit

and

the

Array,

and

between

the

elements

of

the

Array

have

been

specified.

Progress

in

defining

the

system

hardware

has

also

been

made.

A

family

of

logic

circuits

has

been

selected

and

is

currently

undergoing

an

intense

packaging

effort.

The

size,

type

and

speed

of

the

memory

system

for

the

Array

ha's

been

selected.

,Progress

in

defining

the

applications

areas,

particularly

that

progress

made

at

the

University

of

Illinois,

is

especially

evident.

The

contents

of

this

report

in

conjunction

with

the

first

report

contain

much

of

the

rationale

for

the

present

system

definition.

For

the

remainder

of

Phase

I,

much

work

remains.

Although

most

of

the

functions

and

specifications

of

the

110

and

the

Control

Units

have

been

identified

some

additional

work

is

needed.

In

the

area

of

packaging,

both

at

the

cabinet

level

and

the

logic

level,

detail

design

remains.

Such

items

as

power

distribution,

system

cooling

and

general

installation

detail

must

be

specified.

Finally

the

entire

procurement

must

be

matched

to

a

comprehensive

program

plan

of

schedule

and

delivery

which

is

mutually

acceptable.

1-1

SECTION

II

THE

ILLIAC

IV

SYSTEl'vl

This

section

contains

discussions

of

specific

systerns

problems

which

are

con-

sidered

to

be

of

major

importance

in

the

design

of

the

ILLIA

C

IV

System.

ROUTING

One

of

the

instructions

in

ILLiAC

IV

is

to

transfer

data

residing

in

the

PE

array

to

new

locations

in

that

array.

This

is

called

the

"routing"

instruction.

One

version

of

this

instruction

will

take

the

data

from

the

nth

PE

and

transfer

it

to

the

(n

+

m)th

PE~

where

m

is

an

indexable

variant

contained

within

the

instructionJ

and

n

runs

over

all

PEr

s.

Disabled

PEr

s

are

not

to

receive

any

new

data

or

lose

any

of

the

data

they

are

currently

holding

as

a

result

of

this

instruction.

The

immediate

reaction

to

such

a

requirement

is

to

implement

all

the

required

various

paths

by

brute

force.

Such

a

solution

is

not

only

expensive~

but

unneces-

sary,

and

of

inferior

performance.

To

find

a

superior

method

of

inlplementing

routing,

it

will

help

to

consider

individual

properties

of

the

routing

process.

From

a

functional

viewpoint

the

four

properties

of

concern

for

routing

are

the

level~

the

timing,

the

modulo~

and

the

increment.

In

the

array

of

256

PE's

the

levels

of

shifting

with'which

we

are

concerned

may

be

divided

into

shifts

between

quadrants,

shifts

between

the

same

elements

of

a

quadrant

and

shifts

within

a

Processing

Element.

The

timing

refers

to

the

execution

time

of

the

shift

and

its

concurrency

with

other

array

instructions.

The

modulo

is

the

end-around

size

of

the

shift

which

can

be

variable

up

to

a

given

maximum

size.

For

instance

a

64-

bit

shifter

may

be

designed

to

shift

64

bits

end-around

or

eight

8-bit

bytes

end-around

depending

on

the

modulo

control.

The

increment

is

the

smallest

shifting

amount

of

which

all

shifts

are

a

multiple.

Bit

shifts

would

have

an

increment

of

1"

byte

shifts

would

require

8,

etc.

2-1

2-2

BYT

CON

PE 1

A

E

TRO~

BARREL

SWITCH

ORF

UNCTION

PE

1

PE

9

PE

57

B

PE 2

PE

3

A A

I I

-

..

I

I I

I

t I

t I t

I ,

BA

RREL

SWITCH

I

BARREL

SWITCH

I

, B I B

t

~

'---

--I

I

Figure

2-1.

PE

Level

Shifting

PE

2

NORTH

SOUTH

EAST

WEST

- - - -

----

--

----+---.!

QUADRANT

GATE

COUNT

PE

8

4

><

64

X 64 =

16.000

Gat

s/Quadr

nt

Gates/Bit

X

Bits/WD

><

PE

s/Q

PE

58

14---

-

---

-

--

- -

---

-~~

PE

64

Figure

2-2.

Quadrant

Nearest

Neighbor

Connections

The

variations

on

these

properties

are

many,

and

since

the

function

involves

a

very

large

number

of

bits

the

different

variations

involve

substantial

differences

in

cost.

Shifting

at

the

PE

Level

The

first

level

of

consideration

is

the

shifting

of

operands

within

the

PEe

Here

a

barrel

switch

is

provided

which

will

shift

in

increments

of

1-bit-multiples

at

the

64-

bit-operand

level.

A

modulo

control

may

be

employed

by

adding

an

additional

logic

input

to

the

first

level

of

gating

in

the

barrel

switch

and

an

additional

gate

to

the

receiving

re-

gister

input.

Such

modulo

capability

may

include

8-,

32-,

and

64-

bit

bytes

and

be

extended

to

link

PEts

by

a

full

word

transfer

at

the

end

of

the

switching

cycle

(figure

2

-1).

The

execution

time

of

this

shifting

will

require

one

pass

through

the

barrel

switch

for

modulo

64

shifts.

two

passes

through

the

barrel

switch

for

end

off

switching.

and

four

passes

for

end-

around.

Otherwise

a

transfer

of

Register

B

to

the

neighboring

Register

A

will

extend

this

capability

between

PE's.

For

the

most

part

this

logic

capability

exists

within.a

PE,

and

there

is

no

con-

currency

of

execution

at

this

level.

'Shifting

at

the

Quadrant

Level

At

this

level

of

shifting,

considering

the

variety

of

desirable

shifting

properties

available,

there

are

many

possibilities.

Two

seemingly

practical

forms

at

the

Quadrant

level

are

nearest

neighbor

connections

and

the

quadrant

barrel

switch.

These

two

shifting

approaches

represent

the

practical

minimum

and

maximum

cost

for

the

ILLIA

C

IV

System.

NEAREST

NEIGHBOR

CONNECTIONS

-Figure

2-2

shows

the

necessary

data

paths

and

gates

to

implement

this

shift

approach.

It

has

the

capability

of

shift-

ing

left

or

right

4096

bits

in

increments

of

64

and

512

bits.

Increments

smaller

than

this

may

be

shifted

with

the

local

barrel

switch

if

desirable

since

con-

currency

is

not

possible

here.

All

shifting

must

be

done

in

sequence

with

all

arithmetic

and

logic

instructions;

all

shifted

amounts

are

combinations

of

the

basic

two

increments.

Shifting

of

a

modulo

size

smaller

than

the

whole

quadrant

is

done

by

mode

control

of

the

PEts.

THE

QUADRANT

BA

RREL

SWITCH

-

The

main

advantage

of

the

quadrant

barrel

switch

is

its

ability

to

execute

any

shifted

amount

(in

increments

of

8)

in

two,

passes

through

it.

Each

shift

can

be

executed

concurrently

with

other

instructions

2-3

2-4

in

the

PEe

Modulo

control

is

accomplished

by

controlling

the

contents

of

the

Routing

Register

between

partial

shifts.

A

COMPARISON

OF

THE

TWO

APPROACHES

-

1.

The

two

method

s

can

produce

identical

results

with

differences

in

execution

time

and

cost.

Ignoring

time

and

cost

the

methods

are

functionally

equivalent.

2.

There

is

an

approximate

cost

difference

of

six

to

one.

If

we

ignore

the

requirement

for

special

circuits

to

transmit

signals

long

distances,

the

simple

gate

count

for

the

quadrant

barrel

amounts

to

200/0

of

the

total

system.

3.

Physical

distance

is

an

important

consideration.

Figure

2-

3

depicts

a

psuedo

physical

layout

of

the

quadrant

in

which

the

interconnected

PE

for

the

Nearest

Neighbor

Connections

may

be

reasonably

close.

Perhaps

a

maximum

distance

of

6

feet

may

be

realized.

With

the

quadrant

barrel

however

some

paths

are

long.

Considering

a

centrally

placed

barrel

switch

in

the

middle

of

a 27

foot

quadrant

cabinet

row,

one

row

per

quadra~t,

the

worst-

case

distance

would

be

about

30

feet.

This

means

that

every

shift,

however

short,

will

involve

30

feet

of

cable

delay

plus

logic.

The

following

computation

shows

anticipated

delay

times.

Barrel

Switch

Cable:

30

1 X

1.

7

nsecsj

ft.

= 51. 0

nsecs.

Logic:

8

gates

X 3

nsecsj

levels

= 24. 0

75. 0

nsecj

shift

Nearest

Neighbor

Cable:

61 X 1. 7

nsecsj

ft. =

10.2

nse·cs.

Logic:

3

gates

X 3

nsecsj

gate

= 9. 0

19.2

nsecsj

shift

Avera~e

Barrel

Switch

Time

=

75.0

nsecs.

Avera~e

Nearest

Neighbor

Time

= *

4 X 19. 2

=76.

8

nsecs.

*

Refer

to

table

2-1,

page

2-

5.

Table

2-1.

Shift

Table~

Showing

Shifts

by

One's

and

Shifts

by

Eight's

Desired

Shift

Desired

Shift Shift

Shift

by

One's

by

Eight's

Total

Shift

by

One's

by

Eight's

Total

0 0 0 0

32

0 4 4

1 +1 0 1

33

+1 4 5

2 +2 0 2

34

+2 4 6

3 +3 0 3

35

+3 4 7

4

+4

0 4

36

-4

-3

7

5

-3

+1

4 37

-3

6

-2

+1

3

38

-2

-3

5

7

-1

+1

2

39

-1 -3

4

8 0

+1

1

40

0

-3

3

9

+1

2 41 +1

-3

4

10

+2 +1

3 42

+2

-3

5

11

+3

+1

4 43

+3

-3

6

12

+4

+1

5

44

-4 -2

6

13

-3

+2

5

45

-3

-2

5

14

-2

+2

4

46

-2 -2

4

15

-1

+2

3

47

-1

-2

3

16 0

+2

2 48 0

-2

2

17 +1

+2

3

49

+1

-2

3

18

+2 +2

4

50

+2

-2

4

19

+3

+2

5 51 +3

-2

5

20

+4

+2

6 52

-4

-1

5

21

-3

+3

6 53

-3

-1

4

22

-2

+3

5

54

-2

-1

3

23

-1

+3

4

55

-1

2

24

0

+3

3

56

0

-1

1

25

+1

+3

4 57 +1

-1

2

26

+2

+3

5 58 +2

-1

3

27 +3

+3

6

59

+3

-1

4

28

+4

+3

7

60

-4

0 4

29

-3

4 7 61

-3

0 3

30

-2

4 6 62

-2

0 2

31

-1

4 5 63

-1

0 1

Average

number

of

shifts

=

2~~

= 4

4.

However,

real

problems

are

not

random

with

average

shifts.

The

data

to

be

shifted

often

falls

into

patterns

which,

with

proper

programming,

can

be

made

to

somewhat

fit

the

available

shifting

patterns

of

the

partial

shifting

scheme,

and

save

time

over

the

average

time

taken

for

transfer

of

randomly

placed

data.

Some

examples

of

such

savings

follow.

(a)

The

Cooley-

Tukey

algorithm

for

spectrum

analysis

in

one

form

involves

swapping

data

between

PE's

which

are

half

an

array

apart

at

the

first

swap,

a

quarter

of

an

array

apart

at

the

second

swap,

an

eighth

at

the

third,

and

so

on.

Over

each

64

-PE

quadrant,

these

transfers

are

accomplished

with

four,

two,

and

one

partial

shifts

respectively,

or

an

average

of

2.

333

partial

shifts

for

each

actual

data

transfer.

This

is

considerably

faster

than

the

four

partial

shifts

required

for

a

random

data

transfer,

and

is

faster

than

the

data

transfer

effected

through

the

a

11-

possible-paths

scheme.

(b)

Another

use

for

the

data

transfer

paths

which

cover

the

array

is

in

the

computation

of

global

variables

("Global"

here

means

a

variable

which

acquires

its

definition

from

data

which

covers

all

the

PE's).

Maximum,

minimum,

logical

AND~

across-word

parity,

sum,

product,

are

examples

of

functions

which

one

might

want

to

perform

on

corresponding·

words

in

all

PE's,

more

or

less

in

parallel

acro"ss

the

array,

producing

a

one-word

result.

Since

PE's

are

only

capable

of

two-operand

operations,

a

combination

of

the·

64

corresponding

variables

from

the

64

PEs

of

one

quadrant

into

one

resulting

variable

must

take

at

least

six

operational

steps,

·consisting

of

32

operations

on

the

variables

by'

pairs,

then

a

pairwise

combining

of

the

16

resulting

pairs,

and

so

on.

Between

2-5

Figure

2-3.

Pseudo

Physical

Layout

ofa

Quadrant

0 17 16 2 15 3 14 4 13 5

12

6

11

7 10

360 377 361 376 362 375 363 374 364 373 365 372 366 371 367

370

20 37

21

36 22 35

23

34

24 33 25 32 26 31

27

30

340

357 341 356 342 355 343 354 344 353 345 352 346 351 347 350

40

57

41

56 42 55

43

54

44

53

45

52

46

51

47

50

320 337 321 336 322 335 323 334 324 333 325 332

326

331 327 330

60 77

61

76

62

75

63

741 64 73 65 72

66

71

67

70

1

300 317 301 316 302 315 303

3141304

313 305 312

306

311 307 310

---------------r-----~--------

100 117 101 116 102 115 103 1141 104 113 105 112 106 111 107 110

1

260 277 261 276 262 275 263

2741264

273 265 272 266 271 267 270

1

120 137 121 136 122 135 123

134/

124 133 125 132 126 131 127 130

1

240 257 241 256 242 255 243

254:

244 253 245 252 246 251 247 250

140 157 141 156 142 155 143

154:

144 153 145 152 146 151 147 150

220 237 221 236 222 235 223 ,

234

1224

233 225 232 226 231 227 230

160 177 161 176 162 175 163

174:164

173 165 172 166 171 167 170

200 217 201 216 202 215 203

214:204

2

13

205 2

12

206 2

11

207 210

Figure

2-4.

"Nearest

Neighbors"

Arrangement

of

Full

Array,

Showing

Quadrant

Subarrays

2-6

each

operation

a

transfer

of

data

occurs

to

place

it

in

position

for

the

tion.

Thus

a

global

function

takes

six

binary

operations

and

five

data

transfer

times

when

implemented

within

the

system

which

has

all

possible

paths.

The

above

description

of

a

sequence

of

steps

is

correct

even

if

a

single

instruction

calls

forth

the

whole

sequence.

Global

functions

are

in

the

same

category

as

the

Cooley-

Tukey

algorithm

as

far

as

data

transfer

is

concerned.

The

data

is

shifted

by

2

and

combined

with

an

unshifted

copy,

and

so

on

up

to

a

shift

of

32#

for

the

64-

element

subarray.

For

the

nearest

neighbors

scheme~

an

average

of

2. 33

partial

shifts

per

actual

shift

is

called

for.

The

nearest

neighbor

connections

would

thus

seem

to

have

a

slight

advantage

in

speed

over

the

quadrant

barrel

in

executing

global

functions

such

as

maximum,

minimum,

global

AND,

global

OR,

and

across-word

parity.

5.

The

conclusion

is

that

we

do

not

buy

anything

by

implementing

schemes

which

tranfer

data

across

from

one

side

of

the

array

to

the

other

·with

a

single

step.

The

time

it

takes

for

the

data

to

travel

across

the

array

is

suf-

ficient

to

permit

several

logical

operations.

It

is

possible

to

find

a

set

of

partial

data

transfer

paths

which

result

in

simpler

mechanization

and

lesser

wiring,

and

which

do

not

degrade

performance.

Furthermore,

by

avoiding

the

barrel,

we

also

avoid

the

necessity

for

allowing

maximum

delay

on

each

transfer,

and

when

the

data

is

ordered,

or

partially

ordered,

a

speedup

{s

accomplished

which

would

be

impossible

in

the

more

expensive

system.

Inter-Quadrant

Transfers

Shifting

between

quadrants

is

dependent

on

the

type

of

intraquadrant

shifting

selected,

but

the

considerations

at

the

interface

are

similar.

The

choice

of

transferring

all

words

of

a

quadrant

as

a

column

(row)

may

be

made

for

either

case.

Here

also

a

similar

layout

of

PE

~s

was

done

for

the

quadrant

(figure

2-3)

can

make

edge

switching

faster

than

quadrant

switching.

Figure

2-4

shows

an

arrangement

of

the

whole

array,

256

PE's,

with

the

same

properties

for

the

whole

array

that

figure

2-3

has

for

the

64

PEl

sof

the

single

quadrant.

PE's

which

differ

in

number

by

±1

or

±16

from

a

subject

PE

are

physica.l

neighbors

of

it,

just

as

in

figure

2-3,

where

PE's

which

differ

in

number

by

±1

or

±8

of

a

given

PE

are

physical

neighbors

thereof.

Furthermore,

each

quarter

of

the

diagram

of

figure

2-

4,

as

indicated

by

the

dotted

lines,

is

a

copy

of

figure

2~

3.

To

allow

one

to

see

this

latter

point,

the

PE's

in

figure

2-4

are

numbered

from

o

to

377

in

octal.

To

translate

from

an

entire

array

PE

numbering,

as

shown

in

figure

2-4,

to

quadrant

numbering,

one

may

suppress

the

first

and

fifth

bit

of

the

binary

equivalent

of

the

octa.l

PE

number.

When

this

is

done,

each

quarter

of

figure2-4

is

like

figure

2-3,

with

those

on

the

right

reflected

about

the

verti-

cal

axis,

and

those

on

the

bottom

reflected

about

the

horizontal

axis.

The

lower

right

quadrant,

for

example,is

like

figure

2-3

both

upside

down

and

backwards.

2-7

2-8

Machine

Instruction

The

machine

instruction

to

use

this

equipment

thus

req

uires

considerable

equip-

ment

to

translate

the

instruction

into

the

var

ious

shifts

required.

The

number

of

shifts

will

differ,

depending

upon

how

far

around

a

given

quadrant,

or

whole

array,

the

data

is

to

be

shifted.

Variants

of

the

shift

instruction

will

be

able

to

call

upon

the

intraquadrant

shift

even

when

the

entire

array

is

being

used.

This

is

of

advantage

during

one

possible

implementation

of

the

Cooley-

Tukey

algorithm,

for

example.

A

shift

of

one

in

the

32-bit-word

mode

also

involves

a

transfer

of

half-words

betwee'ri.'neighboring

PE's,

since

each

one

is

now

acting

as

two

half-

word

PEls.

The

shift

instruction

therefore

calls

up,

possibly

an

internal

barrel

shift

in

the

PE,

possibly

a

half-word

transfer

between

neighboring

PE's,

a

variable

number

of

east-west

neighbor

shifts,

and

a

variable

number

of

north-

south

shifts.

To

preserve

the

contents

of

the

A

and

B

registers

of

the

disabled

PE's,

intervening

steps

must

avoid

the

A

and

B

registers,

which

may

be

only

an

initial

source

of

data,

or

the

destination

at

the

final

shift.

Either

the

MDR

or

the

S

register,

therefore,

is

involved

in

routing,

with

strong

preference

to

the

MDR,

since

it

is

desirable,

even

though

maybe

not

as

absolutely

necessary,

to

save

the

S

register

as

well

as

the

A

and

B.

THE

MULTIPLY

ALGORITHM

The

most

important

single

logic

function

from

the

standpoint

of

both

performance

and

cost

is

the

multiply.

The

emphasis

placed

on

this

instruction

in

its

design

and

application

singles

it

out

for

special

discussion.

When

first

considering

the

many

different

ways

to

implement

the

multiply

the

ILLIA

C

array

itself

offers

the

first

direction.

There

is

a

class

of

algorithms

which

takes

advantage

of

the

statistical

nature

of

the

ONE

and

ZERO

trains

in

the

multiplier.

The

average

execution

time

of

such

a

multiply

is

always

less

than

a

worst-case

pattern

of

ONE

and

ZERO

in

the

multiplier

and,

therefore,

in

the

course

of

a

program

run,

the

multiply

time

is

the

average

multiply

time.

However,

because

of

the

lock-

step

synchronous

operation

of

the

Array

which

handles

up

to

256

pairs

of

operands

simultaneously,

the

average

execution

time

becomes

the

worst-case

time.

Such

methods

therefore

are

not

applicable

to

the

ILLIA

C

system.

Following

the

above,

the

to

be

made

is

how

many

bits

of

the

multi-

plier

are

to

be

examined

and

disposed

of

simultaneously

during

a

single

step

in

the"

cyclic

multiply

sequence.

Apart

from

circuit

speeds,

this

is

the

single

basis

for

determining

the

speed

of

the

multiply.

Two

seoarate,

yet

interdeoendent,

techniaues,

are

available

to

do

this.

One

technique

is

to

encode

fields

of

the

m

ultiDlier

into

a

larger

number

base*

;

and

the

second

techniaue

is

to

combine

manifold

summands

selected

by

the

conditions

of

the

multiolier

bits.

In

practice

these

two

techniaues

are

combined

into

a

single

comprehensive

design.

The

following

general

relation

is

useful

in

dete"rmining

the

siz

e

of

the

partial

multiplier:

Where:

Execution

time:

(MB

+3)

G·

6 t

XB

MB

Nunlber

of

bits

in

the

mantissia

XB

Number

of

bits

in

the

partial

multiplier

G

Number

of

gates

in

typical

delay

chain

6t

Nominal

gate

delay

including

loading,

wire

length,

etc.

Reasonable

values

for

the

above

paramete

rare:

MB

48

bits

G

10

gates

selected

as

typical

PE

logic

chain

6. t = 3

nanoseconds/

gate

Execution

time

=

memory

cycle

time

=

250

nanoseconds.

Therefore:

XB

= 8

bits.

Table

2-2

shows

the

relative

speed-cost

relation

for

the

range

of

possible

partial

multiply

sizes.

The

important

feature

in

this

table

is

the

gate

count

differences

for

the

different

selected

sizes.

In

terms

of

a

10K-gate

Processing

Element

the

8-

bit

selection

appears

to

be

a

reasonable

maximum

hardware

investment.

*

MacSorley,

O.

L.,

"High-

Speed

Arithmetic

in

Binary

Computer.

"

Proceedings

of

the

IRE,

pp.

67

-71,

January

1961.

Wallace,

C.

S.,

"A

Suggestion

for

a

Fast

Multiplier,

"IEEE

Transactions

on

Electronic

Computers,

V.

EC-13,

February

1964.

2-9

2-10'

.MEMORY

DATA

REGISTER

1

~

OPERAND

SELECT

REG.

OPERAND

SELECT

GATE

~

t

t !

MULTIPLIER

DECODE

CARRY

SA

VE

TREE

1

l

~

B

REGISTER

FULL

ADDER

I--

t 1

+

BARREL

SWITCH

A

REGISTER

t

J

Figure

2-5.

Logic

Elements

Involved

in

Multiply

Algorithm

Memory

Access

Memory

Access

Buffer

(512

bits)

Buffer

(512

bits)

, 1 8

words

1 8

words

,

Control

P.E. P.E.

Control

Unit

Array

Unit

1 1 2 2

INTER

-QUADRAN~

EXCHANGE

Memory

Access

Memory

Access

Buffer

(512

bits)

-I

Buffer

(512

bits)

t 8

words

18

words

Control

P.E.

Control

Unit

Array

Unit

3 3 4 4

Figure

2-6.

SystelTI

for

Handling

Data

From

Control

Unit

to

Processing

Element

l

Block

Diagram

Table·2-2.

Relationship

of

Gate

Size

to

Multiplier

Size

Number

of

Logic

Gate

Count

for

Estimated

Execution

Time

Partial

Multiplier

Bits

Multiply

Only

(nanoseconds)

6

2,000

330

8

2,800

270

12

4,800

210

Having

decided

what

constitutes

a

reasonable

number

of

multiplier

bits

to

handle

in

a

single

cycle

(in

this

case

8

bits),

the

is

to

determine

how

to

maintain

a

10-

gate

delay

for

the

worst-

case

logic

chain.

To

evaluate

this

criteria

the

following

substeps

in

the

multiply

must

be

completed

within

a

10-gate

chain:

1.

Decode

the

partial

multiplier.

2.

Add

the

to

the

partial

product.

3.

Shift

the

multiplier

and

the

partial

product.

4.

Normalize

the

result.

Figure

2-

5

is

a

block

diagram

of

the

logic

elements

involved

in

the

execution

of

this

algorithm.

The

delay

chain

involved

is

the

time

to

go

from

register

to

register.

The

multiply

algorithm

selected

first

decodes

the

8

bits

of

the

multiplier

into

a

base-

4

representation

into

which

0, +

I,

-1,

and

+2

values

of

the

operand

are

selected.

This

decode

is

stored

in

the

Operand

Select

Register

(OSR).

The

OSR

is

12

bits

long

for

storing

the

fully

decoded

four

conditions

for

each

of

four

operands.

Because

the

8

bits

of

the

multiplier

are

taken

as

bit

pairs,

four

summands

must

be

added

to

the

partial

product

in

a

single

cycle.

The

carry

save

tree,

which

contains

three

full

length

carry

save

adders

(2

less

than

the

total

number

of

summands)

combines

four

summands

plus

the

partial

product

into

a

new

sum

and

carry

value.

The

full

adder

now

combine

s

this

sum

and

carry

into

a

new

partial

product.

In

terms

of

worst-case

delay,

the

logic

chain

through

the

operand

select

gate,

.

the

carry

save

tree,

and

the

full

adder

represents

the

worst-case

delay.

Detailed

logic

design

analysis

has

shown

that

a

nominal

30-nanosecond

delay

through

the

chains

is

possible.

A

breadboard

of

the

actual

hardware

must

be

built

and

operated

to

obtain

a

true

final

result.

2-11

2-12

EIGHT-BIT

WORD

LENGTHS

The

ILLIAC

IV

is

to

be

built

with

several

different

word

lengths.

Each

64-bit

PE

is

capable

of

being

split,

effectively,

into

at

least

two

32-bit

PE's.

Another

desirable

word-length

breakpoint

is

at

eight

8-

bit

effective

PEl

s

within

any

given

PEe

To

evaluate

this

possibility

we

require,

first;

to

know

what

it

costs

to

expand

PE

capability

to

handle

eight

independent

8-bit

words,

and

second,

we

need

to

estimate

the

worth

of

the

extra

capability

produced

by

such

an

expansion

over

and

above

the

capability

already

inherent

in

the

64-

bit

PE

for

handling

8-

bit

~

pieces.

The

conclusion

is

that

all

logical

operations,

probably

even

add

and

.

subtract,

are

easily

programmed

for

the

64-

bit

PE

so

that

they

make

effective

use

of

the

parallelism

of

the

64-

bit

machine.

The

only

feature

that

is

lacking

is

the

independent

mode

control

of

the

8-bit

sections

of

th~

64-bit

PEe

The

price

for

8-bit

words

is

mainly

in

the

fragmentation

of

the

controls

which

results.

Existing

circuit

design

contemplates

a

buffer

capable

of

driving

24

loads.

With

independent

mode

control

on

every

8-

bit

slice,

each

8

bits

of

a

64-gate

transfer

command

would

have

to

be

independently

controlled,

thus

re-

quiring

8

gates

instead

of

the

3

gates

required

by

straightforward

implenlentation

of

the

64-

bit

PEe

Since

there

are

estimated

to

be

150

command

lines,

and

the

64-gate

transfer

command

is

typical

of

them,

we

estimate

that

750

gates

per

PE

are

added

by

the

fragmentation

of

the

command

lines.

In

addition,

the

barrel

controls

multiply,

from

the

three

control

signals

per

level

required

by

only

one

word

size,

to

24

control

signals

per

level.

However,

only

two

levels

of

control

are

used

in

the

module

eight

barrel,

so

that

42

additional

gates

are

added

from

this

account.

The

extra

mode

register

bits

for

8-

bit

operation

imply

extra

lines

to

the

control

unit,

and

extra

drivers

and

receivers

for

them

Twenty-

eight

such

bits

per

PE

will

require

28

drivers

and

28

receivers

in

the

PE

itself,

and

3,

584

extra

drivers

and

receivers

in

the

control

unit.

The

design

difficulties

of

cable

bundles

of

this

bulkiness

are

considerable

at

the

speeds

under

consideration.

The

8-bit

operation

contemplated

here

assumes

we

still

have

no

more

than

one

index

register

per

PEe

To

get

eight

8-

bit

words

out

of

independently

specified

memory

addresses

requires

eight

memory

accesses,

a

situation

which

is

exactly

the

same

as

though

8-

bit

fields

were

programmatically

extracted

from64-

bit

words

in

64-

bit

operation.

The

most

promising

design

of

the

memory

currently

contemplated

for

the

ILLIA

C

IV,

representing

the

best

compromise

between

cost

and

cycle

time,

is

a

nonde-

structively

read

memory.

W~iting

a

small

field,

such

as

8

bits,

into

a

memory

word

will

require

two

memory

cycles,

one

to

read

those

portions

of

the

word

which

are

not

to'be

changed,

and

one

to

write

back

the

word,·

one

8-bit

field

of

which

has

been

changed.

A

store

instruction

in

8-

bit

operation,

if.

independent

addressing

is

called

for,

on

each

8-bit

word,

will

require,

therefore)

16

memory

cycles

..

The

above

hardware

operations

are

hardly

better

than

programmatic

implementation

in

64-

bit

mode.

A

pparently,

the

chief

virtue

of

implementing

8-

bit

operations

in

the

PE

hardware

is

that

mode

register

control

over

the

individual

operations

can

be

achieved.

All

the

above

circuitry

amounts

to

an

additional

800

gates

in

the

PE

for

8-

bit

operation.

Probably

1000

gates

is

a

better

estimate.

From

this

we

conclude

that

adding

8-bit

operation

to

the

64-

bit

PE

add

s

about

10o/c

to

the

equivalent

gate

count,

and

presumably

also

adds

10o/c

to

the

estimated

cost

of

the

PE

exclusive

of

memory.

In

addition

to

the

direct

cost,

there

is

degradation

of

the

performance

in

normal

64-bit

mode.

Were

the

equipment

packaged

as

a

solid

volu.me,

lengths

would

be

increased

by

3.

2o/c

as

a

result

of

the

10o/c

increase

in

components.

In

a

well-

matched

system,

equally

sensitive

to

memory

and

logic

speeds,

the

result

would

be

a

net

1.

6o/c

slowdown.

Power

and

coollng

requirements

also

follow

the

component

count.

The

opposite

question

is

to

enquire

to

what

extent

can

8-

bit

operations

1::>e

carried

on

in

parallel

in

a

64-

bit

PE

without

the

assistance

of

specific

8-

bit

operations

in

the

hardware.

Certain

operations"

which

are

expected

to

be

frequent

operations

in

8-bit

pro-

gramming,

are

independent

of

word

length,

such

as

all

bit-

by-

bit

Boolean

operations"

compare

all

words

against

zero,

set

to

specified

value,

read

from

memory,

store

to

memory,

and

others.

Certain

operations

are

programmable

in

such

a

way

as

to

rnake

use

of

much

of

the

parallelism

of

the

64-

bit

machine

even

though

the

entities

being

manipUlated

are

only

8

bits

long.

Add,

subtract,

and

routing,

are

examples

of

this

class

of

operation.

Add,

for

example,

on

words

of

7

bits

plus

sign

each,

can

be

handled

by

removing

the

sign

bit

and

using

the

resulting

space

for

overflow

control,

thus

allowing

as

many

adds

in

parallel

as

the

adder

is

long.

Routing

can

be

partially

implemented

on

all

eight

8-bit

words

at

once

by

means

of

the

barrel.

End-around

shifts

can

be

implemented

by

two

shifts,

which

are

then

partially

blanked

in

accordance

with

a

mask

which

could

be

held

in

the

S

register,

and

the

two

shifted

words

then

ORed

together.

The

only

operations

of

any

signiflgance

which

are

not

included

in

the

above

paragraph

are

either

things

such

as

multiply

or

independent

indexibllity,

which

were

never

intended

for

8-

bit

operation

because

of

their

expense,

and

mode

control.

_

To

program

mode

control,

without

actually

having

it

in

the

hardware,

requires

manufacturing

masks"

which

blank

out

whole

8-

bit

segments

of

the

64-

bit

word,

in

response

to

decision

bits

which

usually

will

show

up

in

the

sign-

bit

position

of

the

8-

bit

words.

2-13

2-14

This

programming,

while

facilitated

by

the

one-

clock-time

barrel

l

by

the

com-

plete

set

of

Boolean

operations

l

and

by

the

fast

single-bit

instructions,

will

not

be

trivial.

COMMON

DATA

PATHS

This

section

describes

the

various

paths

by

which

information

is

transferred

between

the

PE

and

its

memory

both

to

the

input-output

interface,

and

to

the

·control

unit.

A s a

result

of

the

.conIlicting

requirements

on

transfer

rates,

frequency

of

use

l

and

destination,

there

are

three

segments

of

the

equipment

provided

for

data

transfer.

These

segments

are

an

input-

output

buffer

l a

memory

access

buffer

with

each

control

unit,

and

direct

lines

'from

the

control

unit

to

and

from

the

PE.

The

rest

of

this

section

describes

this

hardware

in

terms

of

its

use

in

each

class

of

signals

which

must

be

handled

into

and

out

of

each

PE.

!.nput-

Output

Buffering

Present

plans

are

to

provide

an

input-output

buffer

of

4096

bits

for

each

quad-

rant

of

the

array.

Each

bit

of

the

I/O

buffer

has

connection

to

one

bit

of

the

memory

data

register

of

one

PE.

When

data

is

to

be

transferred

to

or

from

a

PE

memory

from

the

I/O

buffer,

one

memory

cycle

is

taken

from

PE

opera-

tions

l

and

all

64

PE's

insert

one

data

word

into

their

memory

at

that

memory

cycle.

Data

is

transferred

across

the

external

interface

of

the

I/O

buffer

in

word

sizes

appropriate

to

the

device

found

at

that

interface.

The

external

device

may

be

a

buffer

memory

with

very

long

words

l

say

4096

bits

each.

In

this

case,

I/O

operations

are

accomplished

by

stealing

one

PE

memory

cycle

from

the

array

for

each

cycle

of

the

external

buffer

memory.

On

the

other

hand,

the

external

device

could

conceivably

be

a

device

of

lesser

word

size,

512/bits

per

word,

for

example,

in

which

case

an

independent

loaqing

and

unloading

control

is

required

to

assemble

the

shorter

external

words

into

the

4096-

bit

word

in

the

110

buffer.

This

independent

control

communicates

with

the

array

control

unit

both

to

steal

memory

cycles,

and

to

report

I/O

complete

when

a

whole

biock

of

data

has

been

successfully

transferred.

The

design

of

this

interface

is

almost

independent

of

the

rest

of

ILLIA

C IV.

It

is

not

the

intent

of

this

section

of

this

report

to

discuss

the

design

of

the

inde-

pendent

I/O

control

unit,

as

this

design

is

dependent

upon

matters

discussed

els.ewhere

in

this

report.

From

Control

Unit

to

Processing

Element

Data

for

the

use

of

the

PE,

data

to

be

stored

in

the

PE

memory

for

the

control

unit's

own

use,

and

commands

for

the

PE

are

transmitted

from

the

control

unit

to

the

processing

element.

Common

data

lines

from

the

control

unit

are

used

for

data

which

goes

from

the

control

unit

to

the

PE,

whether

this

data

is

for

PE

use

or

for

the

control

unit's

private

use.

To

avoid

continually

interrupting

PE

operations

for

data

fetches

and

stores

which

are

control

unit

purposes,

many

words

should

be

fetched

and

stored

in

parallel.

A

reasonable

compromise

between

amount

of

hardware

and

interference

with

PE

operation

appears

to

be

no

more

than

32

words

of

data

transmitted

to

the

PE

array

in

parallel.

Further

discussion

of

the

buffer

size

is

found

below.

We

plan

to

provide

32

words

of

buffering.

Of

these

32

words,

eight

are

assigned

to

each

quadrant.

A

block

diagram

of

the

system

for

handling

data

is

shown

in

figure

2-

6

(page

2-10).

It

is

simplest

to

describe

operation

by

starting

with

the

isolated

quadrant.

Eight

words

of

data

are

transmitted

to

the

memory

access

buffer

from

the

control

unit.

This

transmission

is

overlapped

with

PE

operations

provided

that

the

program

is

such

that

the

requirement

for

transmitting

data

can

be

recogni7ed

far

enough

ahead

of

time.

This

will

generally

always

be

recognizable

ahead

of

time

when

the

data

.

is

simply

to

be

stored

in

memory.

One

clock

time

per

word

is

expected

to

be

required,

since

the

data

paths

within

the

control

unit

are

expected

to

be

typically

one

word

wide.

Each

word

in

the

memory

access

buffer

has

access

to

a

column

in

the

array.

The

eight

words

have

simultaneous

access

to

one

element

in

every

column,

namely

a

row.

As

a

result,

the

eight

words

can

be

transmitted

in

paral-

lel

to

every

row

in

the

array_

If

data

is

to

be

stored

in

memory,

one

row

accepts

the

data.

If

data

is

being

broadcast,

rows

receive

the

data.

In

this

case,

the

eight

words

of

data

are

eight

copies

of

a

single

word

whicl}

originated

in

the

control

unit.

When

the

array

is

operating

as

a

coherent

whole,

all

four

control

units

operate

on

t~e

same

program

str

ing

and

data

in

parallel,

and

have

identical

internal

states.

Out

of

a

package

of

32

words

to

be

sent,

the

f,irst

eight

can

be

derived

from

information

supplied

by

the

fir

st

control

unit,

the

second

eight

can

be

derived

from

information

supplied

by

the

second

control

unit,

and

soon.

Like-

wise,

when

broadcasting

data,

the

four

control

units

will

have

four

identical

copies

of

the

word.

to

be

broadcast,

each

of

the

four

is

copied

eightfold

into

the

memory

access

buffers,

and

each

of

the

resulting

32

copies

is

then

used

to

drive

eight

PE's.

Also

transmitted

from

the

control

unit

to

all

PE's

in

parallel

are

control

signals

.

These

control

signals,

are

identical

for

all

PE's.

Retiming

considerations

will

demand

that

there

be

a

flip-flop

in

the

control

unit

for

each

such

line.

Present

planning

calls

for

a

receiver

per

cabinet

for

each

such

signal,

and

eight

PE's

per

cabinet.

2-15

2-16

Addresses

must

also

be

issued

from

the

control

unit

to

all

PE's.

These

will

use

the

least

significant

12

bits

of

each

word

in

the

memory

access

register,

and

the

broadcast

data

bus.

Issuing

of

addresses

and

broadcasting

of

data

require

a

fixed

delay

of

several

clock

times.

The

design

of

the

microsequences

as

manufactured

by

the

control

unit

is

such

as

to

allow

for

this

constant

delay.

From

Processing

Element

to

Control

Unit

When

data

is

fetched

from

PE

to

control

unit

in

the

isolated

quadrant,

the

mechanism

is

the

reverse

of

the

process

described

above

for

disseminating

data.

Namely,

the

information

is

collected

in

the

form

of

a

single

word

from

each

single

column

of

the

quadrant,

and

the

eight

words

from

each

row

are

then

transmitted

to

the

control

unit.

For

the

instruction"

store

to

broadcast

register,

"

the

eight

words

are

ORed

together

to

form

a

single

word

in

the

control

unit,

very

like

a

broadcast

in

reverse.

When

data,

or

program

string,

are

being

fetched

in

32-word

packages

from

the

united

array

an

additional

complication

sets

in,

since

each

one

of

the

memory

access

buffers

contains

only

a

quarter

(eight

words)

of

the

32-word

package

and

each

control

units

wants

all

32

word

s.

In

this

case

it

will

be

necessary

to

trans-

fer

data

around

among

the

memory

access

buffers

until

each

quarter

package

of

eight

words

has

shown

up

once

in

each

of

the

eight-

word

memory

access

buffers.

Single

word

fetching

makes

use

of

the

same

paths

by

disabling

all

but

one

of

the

words

being

received.

Also

entering

the

control

unit

are

lines

from

each

of

the

mode

register

flip-flops

in

the

PE'

s.

To

the

control

unit

of

one

quadrant,

'a

bit

in

the

mode

register

appears

as

a

64-bit

register

to

the

control

unit,

which

may

be

set,

read,

com-

pared

with

other

registers

available

to

the

control

unit,

etc,

When

operating

in

united

mode,

control

units

must

cooperate

in

the

sensing

of

the

state

or

mode

registers.

For

example,

a

"jump

on

any

bit

equal

to

ONE"

means

to

jump

if

any

bit

in

anyone

of

the

four

64-

bit

registers

involved

is

ONE.

Since

all

four

control

units

run

the

same

program

instructions

at

the

same

time,

only

twelve

wires,

one

to

communicate

between

each

pair

of

control

units

,are

required

to

secure

the

necessary

cooperation.

Evaluation

Some

of

the

above

described

data-transferring

procedures

take

more

time

than

one

might

at

first

expect.

One

250-nsec.

memory

cycle

is

required

to

load

the

memory

access

buffer.

This

memory

cycle

interferes

with

the

action

of

the

PE

only

insofar

as

the

PE

memory

or

memory

data

buffer

is

required.

The

memory

access

buffer

for

a

single

quadrant

can

be

unloaded

into

the

appropri-

ate

area

in

program

lookahead

or

local

data

buffer

in

eight

clock

times.

However,

these

potentially

interfere

with

other

used

of

program

lookahead

or

local

data

buffer.

With

proper

implementation

of

the

controls,

these

eight

clock

times

can

be

largely

hidden

by

being

taken

from

otherwise

idle

time

in

the

two

buffer

areas

in

the

control

unit.

When

the

array

is

in

united

operation,

one

must

count

not

only

32

clock

times

rather

than

8,

one

also

must

transfer

the

data

from

one

memory

access

buffer

to

another.

If

cables

of

up

to

30

or

40

feet

long

separate

the

memory.

access

buffers,

then

the

time

to

transfer

data

from

one

memory

access

buffer

to

another

may

well

be

three

clock

times,

and

three

transfers

are

needed.

The

conclusion

is

that

as

much

as

41

clock

times

are

needed

to

transfer

the

32

words,

read

in

one

memory

cycle

time,

into

the

appropriate

areas

of

the

four

control

units.

Seven

clock

cycles

in

each

memory

cycle

is

a

likely

design

choice.

In

this

case,

41

clock

cycles

represents

1.

46

microseconds.

This

1.

46

microseconds

is

over-

lapped

with

other

operations

as

long

as

a

reservoir

of

instructions

for

PE

operation

can

be

maintained

in

the

control

unit.

However,

this

1.

46

microseconds

often

gets

in

the

way

when

loading

the

program

lookahead.

It

is

a

penalty

to

be

taken

whenever

the

program

jumps

to

a

location

not

contained

in

program

lookahead.

Further,

the

coarser

block

size

means

that

loops

between

32

and

64

words

long

will

less

often

be

found

within

the

program

lookahead,

and

program

fetching

will

take

place

more

often

when

the

block

which

is

fetched

is

larger.

Large

block

size

also

interferes

with

broadcast

operations,

since

a

larger

delay

occurs

between

the

instruction

to

fetch

a

block

of

data

to

the

data

buffer

and

the

first

opportunity

to

broadcast

one

of

those

words,

when

blocks

are

larger.

Optimum

block

size

is

that

which

finds

the

best

tradeoff

between

interference

with

the

operation

of

the

PE's

in

the

array,

and

slowing

of

operations

in

the

~

control

unit.

Block

sizes

of

8,

16,

and

32

words

are,

easily

available

with

minor

modifications

of

the

structure

here

described,

and

a

choice

of

a

smaller

block

size

than

the

32

words

here

proposed

can

be

easily

made

during

the

early,

design

months

of

the

MODE

Introductory

Considerations

.

Each

PE

is

supplied

with

a

mode

register,

which

it

can

change

as

the

result

of

tests,

and

which

the

control

unit

can

set

at

will.

Instructions

will

be

executed

or

not

as

a

result

of

the

setting

of

the

mode

register.

This

arrangement

is

in

lieu

of

branching

at

the

PE

level,

since

all

PE's

must

share

the

common

instruction

stream,

and

therefore

cannot

independently

execute

transfers

of

control.

2-17

2-18

Modes

should

be

remembered

and

recoverable.

At

the

very

least,

each

routine

which

one

enters

must

be

supplied

with

fresh

capabiHty

for

setting

and

changing

its

mode,

while

remembering

the

mode

of

the

calling

sequence.

Actually,

it

seems

that

considerably

more

flexibility

is

desirable.

In

the

limit,

one

could

specify

a

particular

mode

register,

out

of

some

set

of

mode

registers,

for

each

instruction.

The

Problem

At

issue

is

the

question

of

the

location

of

the

backup

storage

of

the

noncurrent

modes.

The

choice

is

between

supplying

back-up

mode

register

storage

in

the

PE's,

or

of

supplying

the

control

unit

with

the

capability

of

reading,

storing,

and

restoring

the

PE

modes.

The

implementation

of

mode

control

is

con-

siderably

different,

depending

on

the

results

of

this

choice,

so'that

we

are

really

choosing

between

two

different

systems

for

controlling

PE

operations

in

response

to

mod

e.

Back-

Up

Storage

In

The

Control

Unit

The

first

system

to

discuss

is

that

with

back-up

storage

for

other

than

current

modes

in

the

control

unit.

In

this

system,

only

one

mode

register's

worth

of

flip-flops

are

in

the

PE,

and

the

setting

of

the

mode

register,

for

which

a

given

instruction

will

be

executed

in

a

specific

PE

is

known

at

the

time

of

that

instruction.

Either

four

bits

accompany

the

instruction,

specifying

which

of

the

possible

modes

perInit

the

execution

of

the

instruction,

or

else

a

pre-

existing

decision,

based

on

the

content

of

the

mode

register,

controls

the

execution

01'

nonexecution

of

instructions

in

the

PE.

Bit

economy

in

the

instruction

stream

favors

the

latter,

if

mode

is

not

to

change

at

every

instruction.

A

decision

based

on

speed

also

favors

separation

of

the

mode

decision

from

the

execution

of

the

instruction,

since

then

less

control

gating

is

involved

for

the

individual

instruction

at

the

PE.

One

of

the

savings

in

speed

of

the

ILLIA

C IV

type

of

computer

ought

to

lie

in

the

fact

that

the

subcommand

matrix

of

a

normal

computer

finds

itself

mostly

in

the

control

unit,

so

that

the

instruction

decoding

and

timing

wpich

consumes

one

clock

time

per

instruction

in

most

computers

can

be

spent

in

the

control

unit,

overlapped

with

useful

arithmetic

work

in

the

PE.

For

most

complete

overlap,

the

PE

has

an

"on-

off"

flip-flop

to

control

the

execution

or

nonexecution

of

the

Individual

instructions

in

such

a

scheme

will

never

wait

for

the

decoding

of

mode

information

before

they

can

start

and

noninterferin~

microseQuences

can

freely

overlap.

Occasionally

a

one-

clock-time

instruction

would

be

needed

to

change

the

setting

of

the"

on-

off"

flip-flop

in

response

to

some

new

interpretation

of

the

mode

register.

The

"on-off"

flip-flop

also

must

respond

(sometimes)

to

arithmetic

overflow

as

well

as

to

programmed

tests

which

change

the

mode

bits.

There

is

an

over-

flow

flip-flop

which

appears

to

the

control

unit

much

like

third

mode

bit.

In

this

system

there

are

therefore

four

flip-flops

per

effective

PE

with

mode-like

functions.

Thus

there

are

four

flip-flops

per

32-blt

word,

or

eight

flip-flops

per

actual

PE,

two

programmatically

specifiable

mode

bits,

the

overflow

flip-flop,

and

the

"enable-disable"

flip-flop.

There

are

PE

instructions

to

set

or

reset

each

mode

bit

in

response

to

programmed

tests.

Each

bit

in

the

PE

appears

to

the

control

unit

as

a

64-

bit

word,

since

there

are

64

PEl

s

per

control

unit.

There

are

instructions

to

read,

save,

and

set

these

word

s,

the

instructions

being

control

unit

instructions

rather

than

PE

instructions.

The

control

unit

is

provided

with

high-

speed

register

storage

for

such

saving.

It

is

also

pro-

vided

with

logical

instructions

to

manipulate

the

modes,

namely

AND,

OR,

and

COMPLEMENT

instructions

which

can

operate

on

the

words

formed

from

the

mode

register

bits.

Jump

instructions

in

the

control

unit

would

test

mode

bits

(either

"any

mode

bit"

or

"all

mode

bits

").

They

would

also

test

old

modes

stored

in

the

control

unit.

This

system

thus

has

a

reservoir

of

old

modes

which

can

be

reactivated

on

short

notice.

This

back-up

is

in

the

control

unit.

Response

to

old

mode

settings

is

effected

in

one

clock

time

by

setting

the

"on-off"

flip-flop

in

each.

,?E.

The

instruction

stream

has

no

mode

field

per

instruction,

but

does

have

mode-

mani-

pulating

instructions,

which

are

decoded

by

the

control

unit

in

parallel

with

the

issuing

of

instructions

to

the

PE.

Storage

of

old

modes

in

the

control

unit

is

backed

up

further

by

the

storage

of

words

from

old

mode

registers

back

to

memory.

It

is

assumed

that

the

control

unit

has

some

means

of

addreSSing

memory,

both

for

read

and

wr

ite.

Back-

Up

Storage

In

The

Processing

Element

The

second

system

we

discuss

is

that

with

back-up

storage

of

old

modes

assigned

to

the

PE's.

In

this

system,

where

several

modes

are

stored

in

the

PE,

a

method

must

be

chosen

for

choosing

the

applicable

mode

for

any

given

stream

of

in-

structions~

A

settable

pointer

could

be

used,

and

the

pointer

setting

changed

on

command

from

the

control

unit,

in

between

actual

processing

instructions.

Using

the

pointer,

the

operation

would

be

fully

equivalent

to

operation

with

back-up

storage

in

the

control

unit

except

for

the

location

of

the

stored

bits.

Whe,n a

mode

register

address

is

lissued

with

each

instruction,

more

flexibility

is

obtained.

A s

described

to

us,

the

system

was

used

with

six

mode

bits

with

each

instruction:

Two

bits

of

mode

register

address,

and

four

bits

to

interpret

the

contents

of

that

register.

Arithmetic

overflows

are

lost

in

the

implementation

unless

tested

for

immediately

by

means

of

a

jump

instruction

in

the

control

unit.

They

cannot

influence

mode

directly,

although,

clearly,

when

tested,

they

can

be

used

to

control

modes.

Even

with

back-up

s,torage

of

modes

in

the

PEl

s,

some

means

of

setting

new

modes

under

control

unit

control

is

required.

Whether

a

path

directly

from

control

unit

to

PE

for

loading

the

mode

registers

is

needed,

or

whether

indirect

methods

suffice,

has

not

yet

been

determined.

2-19

2-20

Each

instruction

in

the

PE

must

therefore

look

to

the

mode

register

for

condi-

tions

against

which

execution

is

made

conditionaL

Some

instructions

are

made

one

clock

time

longer

becuase

of

this.

Furthermore,

no

overlapping

of

non-interfering

microsequences

appears

possible.

Jump

instructions

are

made

dependent

on

a

bit,

one

per

PE,

returned

to

the

control

unit.

These

instructions

are,

like

any

other,

conditional

on

one

of

the

mode

registers,

so

that

jumps

can

easily

be

made

conditional

on

mode

register

settings.

This

capability

is

similar

to

that

achieved

in

the

competing

system

by

testing

old

modes

in

the

control

unit.

The

system

has

a

limited

reservoir

of

old

modes

which

can

be

reactivated

on

short

notice.

This

reservoir

is

limited

in

length,

but

very

fast

of

access.

The

instruction

stream

has

a

6-

bit

mode

field

per

instruction.

Storage

of

additional

modes

is

in

PE

memory.

If

they

are

to

be

packed

effcient-

ly

in

memory,

a

short

mod

e-

packing

routine

is

necessary.

Functional

Capability

Both

competing

systems

can

execute

the

same

functions.

A

partial

list

of

capabilities

follows

..

•

Quick

change

of

control.from

one

mode

register

to

another.

•

Storage

6f

several

different

mode

registers

per

PEe

•

Transfer

of

control

possible

in

response

to

mode

register

setting.

Comparison

A

list

of

features

in

which

the

suggested

methods

of

supplying

back-up

mode

registers

differs

reveals

none

in

which

the

functional

capabilities

made

avail-

able

by

the

one

cannot

also

be

made

available

by

the

other.

However,

there

are

other

differences.

"When

back-up

mode

storage

is

in

the

PE

every

microsequence

requires

the

operation

of

mode

gating

in

the

PE,

and

overlapping

of

microsequences

is

not

possible.

Six

bits

of

every

instruction

are

expended

on

mode

information.

"When

back-up

mode

storage

is

in

the

control

unit,

every

change

of

mode

requires

a

one

t-time

operation

interleaved

between

the

regular

arithmetic

operations.

Each

change

of

mode

takes

a

short

instruction,

say

16

bits.

What

is

needed,

to

compare

the

advantages

and

disadvantages

listed

above,

is

some

estimate

of

the

number

of

instructions,

on

the

average,

executed

with-

out

change

of

mode

setting,

or

executed

while

ignoring

mode.

Even

if

average

..

the

string

length

is

only

two

or

three.

The

speed

advantage

would

appear

to

lie

with

the

system

of

storing

modes

in

the

CD.

Another

difference

is

in

the

PE

hardware.

Considerable

savings

are

expected

by

placing

the

back

-up

mode

storage

in

the

CD,

whose

long

registers

are

much

cheaper

per

bit

than

PE

short

registers.

At

a

very

detailed

machine

language

level

of

coding

there

are

differences

in

the

progran1.ming

of

mode

control.

With

back-up

modes

stored

in

short

registers

in

the

PE,

a

different

mode

with

every

instruction

is

appropriate,

but

program-

ming

compl~xity

mounts

when

mode

registers

above

the

fourth

are

used.

With

control

unit

back-up

storage,

programming

complexity

remains

essentially

constant

for

any

depth

of

mode

storage,

although

producing'memory

interference

when

depth

of

storage

exceeds

that

available

in

the

high

speed

registers

of

the

control

unit.

The

same

assembly

language

could

be

used

in

either

case,

if

desired,

putting

the

above

differences

solely

into

the

assembly

program.

2

-21

SECTION

III

REVISED

PE

LOGIC

DESIGN

This

section

describes

the

mechanization

and

hardware

requirements

for

the

PE.

A

block

diagram

is

included

containing

the

data

paths

for

each

item

required

to

mechanize

a

PE.

The

logic

requirements

per

unit

are

tabulated

in

table

3-1.

PROCESSING

ELEMENT

DESCRIPTION

The

revised

PE

logic

design

is

shown

in

figure

3

-1.

A

PE

is

essentially

a

three-

register

system

which

can

execute

a

complete

general

purpose

computer

order

code

as

described

in

Section

VII

of

TR66-3

or

otherwise

modifieq

in

this

report.

A

fourth

register

was

added

to

store

intermediate

results.

Capabilities

of

some

units

may

be

increased

or

decreased

to

vary

operation

code

execution

times.

A

summary

of

each

unit

in

the

PE

follows.

MEMORY

DATA

REGISTER

(MDR)

This

unit

is

abuffer

register

between

a

PE

and

its

PE's

thin

film

stack

and

the

outside

world.

The

register

serves

as

an

intermediate

buffer

when

performing

array

shifts

or

memory

stores

and

fetches.

This

register

is

accessible

even

though

the

PE's

mode

status

is

disabled.

The

MDR

receives

operands

from

the

:

Common

Data

Bus,

A

Register,

Operand

Select

Gates,

Address

Adder

and

its

PE's

.memory.

The

register's

output

is

connected

to

the

Common

Data

Bus,

Address

Adder,

PE's

memory

and

Operand

Select

Gates

located

within

its

PE's

boundaries

or

four

nearest

neighbors.

3-1

CONTROL

UNIT

-

TO

I/O

REGISTERS

,_~1

__

,

COMMON

DA TA

BUS

1 N

SEW

3 1 3

2&3

t +

••

DRIVERs'rr

MODE

I

DRIVERS

& J l

DRIVERS

&

RECEIVERS

I I

RECEIVERS

REGISTER

RECEIVERS

: I

t t

r--=-

I I

1 , • + I I

I

I I

PE

MEMORY

DATA

REGISTER

I

PE

I

ENABLE

(MDR)

I

iMEMORY

I

SIGNALS

ADDRESS

I I

J

ADDER

I I

RECEIVERS

I I

OPERAND

I

MEMORY

I I

SELECT

GATES

PE

I

r-r--

r

INDEX

r

ADDRESS

I I

H S

REGISTERS

(OSG)

REGISTER

I I

I

REGISTER

(MAR)

I I

1---

1

----1

1

~

.-

j

I B

REGISTER

I

MULTIPLICAND

I

SELECT

GATES

(MSG) 2 & 3

-'-

~

I I

•

DRIVERS

I

PSEUOADDER

J

~

TREE

1

+ + • t

A

REGISTER

1 B

REGISTER

,J

t--

AND

OPERAND

N S E W

SELECT

GA

TES

t I

SELECT

GA

TES

I CARRY

NOTES:

PRO

PA GA

TE

ADDER

1.

The

blocks

iabeled

number

1

will

be

implemente

(CPA')

DDa

cabinet

or

half

Cabinet

BaSiS,

rather

than

a

1

PE

basis.

d

1 I

2.

The

blocks

labeled

number

2

are

implemented

to

r=:l

~

•

only

2

of

the

Signals

.

I l I

3.

The

blocks

labeled

number

3

will

be

implemente

LOGIC

as

required

on

a

PE

basis.

A

REGISTER

f

UNIT

d

1 I

t ,

I

LEADING

l I

ONEs

~

BARREL

DETECTOR

SWITCH

1

Figure

3-1.

Revised

PE

Logic

Design,

Block

Diagram

Table

3-l.

PE

Logic

Requirements

UNITS

FUNCTIONS

BIT

GATES/

TOTAL

FLIP

DRIVERS/

POSITIONS

BIT

GATES

FLOPS

RECEIVERS

Memory

Data

Register

Buffer

register

between

thin

film

stack

and

logic

64

6

384

64

Address

Adder

inputs,

bit

positions

52

through

63 12 1 12

Drivers

and

Receivers

to

and

from

Common

Data

Bus

64/64

Drivers

to

adjacent

PE

Operand

Select

Gates

64/00

Drivers

and

Receivers

to

Input

Output

Registers

64/64

Operand

Select

Gates

Selects

an

Operand

for

PE

processing

64 6

384

Receivers

from

two

adjacent

num-

bered

PE's

00/128

Multiplicand

Select

Selects

4-56

bit

words

based

on

8

multiplier

bits

56/

word

16 896

Control

word

enable

signal

gen-

eration,

some

fan

out

288

49

Carry

Propagate

Consists

of

six

adder

sections:

Adder

Receives

mantissa

field

inputs

from

A-B

Registers,

Pseudoadder

Tree

and

Operand

Select

Gates.

Gate

and

Flip-flop

requirement

for

six

sections

648

12

Input

Gates:

A

Register

48 2 96

B

Register

and

Operand

Select

Gates

48 3

144

Address

Adder

Consists

of

two

adder

sections:

Per-

form

address

modification,

exponent

summation

and

supplementary

sec-

tions

to

Carry

Propagate

Adder

Gate

and

Flip-flop

requirement

for

two

sections

216

4

Input

Gates:

Address

Modification

12 4 48

Exponent

Summation

16 5 80

Control

Inputs

16 3 48

Output

Gates:

16

3 48

PE's

Index

Register

Stores

Address

variable

12 12

Memory

Address

Buffer

register

between

thin

film

Register

stack

and

logic

12 12

A

Register

Main

operand

store

and

Overflow

flip-flop

64 7 448 65

B

Register

Auxiliary

operand

store

64 6

384

64

Address

Adder

input:,:

pOSitions

o

through

15 16 1 16

S

Register

Intermediate

operand

store

64 64

Pseudoadder

Tree

Converts

4

Words

+

Partial

Product

into

CPA

inputs

56

27

1512

Logic

Unit

Performs

logic

functions

between

A-B

Registers

64 6

396

Barrel

Switch

Performs

shifting

functions.

65 14 910 1

Barrel

Switch

Control

Logics

48

Mode

Register

Buffer

register

between

a

PE

and

the

Control

Unit

8 8

Mod.e

control

input

gates

38

Drivers

and

Receivers

to

Control

Unit

8/8

Leading

One's

Detects

position

of

leading

one,

Detector

mantissa

field

184

PE

Control

Receives

apprOximately

200

Signals

Signal

Rec.

frqm

the

Control

Unit

000/200

7228

355

200/264

A

ssurnptions:

1.

The

flip

flops

are

constructed

of

4

gates

per

element.

2.

The

Drivers

and

receivers

are

constructed

of

one

logic

gate.

Total

gates

used.

7228

1420

664

9312

3-3

3-4

OPERAND

SELECT

GATES

(OSG)

This

unit

consists

of

one

logic

level

of

decoding

gates.

The

enabled

gate

selects

an

incoming

operand

from

its

MDR

or

nearest

neighbor's

MDR

for

array

transfer

operations

or

PE

processing.

During

an

array

transfer

operation,

.

the

OSG

out-

put

is

stored

temporarily

in

the

MDR.

PE

processing

requirements

are

that

the

selected

operand

be

connected

to

the

B

Register

input

gates

and

Multiplicand

Select

Gates.

Depending

upon

the

instruction

an

operand,

or

multiple

thereof,

is

loaded

into

the

recipient

register.

A-B

REGISTER

(ACCUMULATOR)

The

A

Register

and

B

Register

provide

main

operand

store

and

auxiliary

operand

store,

respectively.

The

Barrel

Switch

output

is

connected

to

both

registers.

These

register

inputs

are

used

to

execute

operation

codes

that

require

main

or

auxiliary

operand

shifts,

or

logical

combinations

thereof.

Both

register

outputs

are

routed

to

the

Carry

Propagate

Adder

(CPA)

and

Logic

Unit.

In

addition

the

A

Register

receives

inputs

from

the

CPA

and

the

Modified

Address

Adder.

*

The

CPA

output

is

gated

into

the

A

Register's

mantissa

field.

This

output,

depending

on

the

algorithm,

could

be

a

partial

product

or

remainder,

a

summed

mantissa

or

just

an

operand

transfer

from

the

B

Register

or

the

OSG.

During

an

operand

transfer,

the

appropriate

exponent

is

gated

directly

into

the

register's

exponent

field.

The

Modified

Address

Adder

output

is

a

summed

value

loaded

into

the

exponent

field.

The

A

Register's

outputs

are

connected

to

the

Leading

ONEs

De

tector,

MDR,

Special

Register

(SR)

and

the

Modified

Address

Adder.

The

B

Register

receives

additional

inputs

from

the

CPA;

OSG,

and

SR.

The

CPA

input

is

a

completed

8

-bit

product

obtained

in

a

multiplication

micro-

sequence

step.

The

OSG

input

is

the

incoming

operand

from

this

PE's

or

an

adjacent

PE's

MDR.

The

SR

input

is

the

incoming

.operand

obtained

from

a

Other

B

Register's

outputs

are

connected

to

the

Multiplicand

Select

Gates

and

Modified

Address

Adder.

The

Multiplicand

Select

Gate

inputs

are

multiplier

bits

decoded

into

enable

signals

for

the

step.

The

Modified

Address

Adder

input

is

the

exponent

field

to

be

summed

in

the

adder.

LOGIC

UNIT

This

unit

performs

logic

functions

between

the

A

Register

and

B

Register.

The

unit

receives

outputs

from

both

accumulator

registers.

The

various

logic

functions

are

performed

as

specified

in

Section

VII

of

TR66

-3

or

otherwise

modified

in

this

report.

This

unit

provides

the

input

gating

function

to

the

Barrel

Switch

unit.

*

The

Modified

A.ddress

Adder

term

refers

to

the

Address

Adder

extended

4

bit

positions.

BARREL

SWITCH

The

Barrel

Switch

is

a

matrix

of

symmetrical

gates

which

shifts

a 64

-bit

parallel

input

any

number

of

places

to

the

left

or

right

either

end-off

or

end-around.

Oper-

ation

is

started

upon

receipt

of

a 64

-bit

parallel

input

and

appropriate

control

signals.

The

output

is

connected

to

both

the

A

and

B

Registers.

LEADING

ONES

DETECTOR

The

input

to

this

unit

consists

of

the

A

Register's

mantissa

bit

positions.

This

unit

detects

the

location

of

the

most

significant

set

bit,

decodes

its

position

as

a

radix

2

power

and

enables

appropriate

Barrel

Switch

displa'cement

control

signals.

The

unit

also

senses

the

absence

of

a

set

bit,

thereby

detecting

a

zero

value.

This

decoded

signal

is

used

to

zero

the

exponent

field

of

the

A

Register.

MULTIPLICAND

SELECT

GA

TES

These

gates

select

four

multiplicand

words

each

based

on

2

bits

of

the

multiplier

mantissa

only.

The

Multiplicand

Select

Gates

are

partitioned

into

two

parts,

one

which

receives

the

of

multiplier

bits

to

be

decoded

into

multiplicand

word

enable

signals

and

the

other

which

gates

the

appropriate

multiplicand

(word)

into

the

Pseudoadder

Tree.

During

multiplication,

four

2

-bit

pairs

of

multiplier

bits

are

received

and

decoded

in

advance

to

form

the

enable

signal

that

gates

the

required

words

into

the

Pseudoadder

Tree

for

the

step.

The

decoded

signal

may

select

multiplicand

multiples

of

times

one,

times

minus

one,

or

times

two.

This

word

selection

enable

signal

is

stored

in

flip

-flops

at

the

termination

of

the

current

microsequence

step.

During

an

operand

transfer,

the

word

selected'

will

enter

the

Pseudoadder

Tree

so

that

its

output

is

positioned

at

the

A

Register

mantissa

input

gates.

PSEUDOADDER

TREE

The

Pseudoadder

Tree

consists

of

three

carry

save

adders

per

bit

position.

During

multiplication,

this

unit

receives

inputs

consisting

of

four

words

from

the

multipli-

cand

Select

Gates

and

the

current

partial

product

from

the

A

Register.

The

carry

save

adders

reduce

these

five

inputs

to

two

outputs,

one

being

the

summand

sum

and

the

other

the

summand

carry.

These

outputs

are

gated

into

the

CPA

in

their

true

and

complemented

form.

CARRY

PROPAGATE

ADDER

(CPA)

Upon

receipt

of

an

input

signal

set

from

the

Pseudoadder

Tree

or

the

A

-B

Regis-

ters,

or

the

A

register

and

Operand

Select

Gates,

the

CPA

adds

the

inputs

in

a

3-5

3-6

microsequence

step.

When

executing

an

arithmetic

instruction,

the

extended

CPA's

output

bit

positions

0

thru

47

are

gated

into

the

A

Register

mantissa

field.

When

executing

the

multiplication

algorithm,

the

CPA's

output

positions

48

thru

53

are

gated

into

the

B

Registers

eight

most

significant

locations,

and

positions

0

thru

47

are

gated

into

the

A

Registers

mantissa

field.

During

multiplication

the

CPA

functions

in

two

distinct

modes:

group

carry

save

additions,

and

extended

mantissa

addition.

The

saved

group

carries

are

dis-

placed

8

bit

positions

to

the

right

and

re

-entered

into

that

particular

adder

group.

The

group

carry

save

addition

is

performed

on

those

adder

sections

whose

output

is

entered

into

the

A

Register's

mantissa

field.

Additional

adder

sections

are

required

to

perform

this

algorithm.

These

adder

sections

receive

their

inputs

from

the

most

significant

16

bit

positions

of

the

B

Register

and

group

carry

save

bits

from

the

step.

The

least

significant

section

performs

section

carry

save

addition.

The

saved

carry

is

re

-entered

into

this

adder

displaced

eight

bit

positions

to

the

right.

The

last

step

in

the

multi-

plication

algorithm

requiring

product

summation

is

performed

by

the

CPA

oper-

ate

mode

of

extended

mantissa

addition

..

The

additional

adder

sections

are

connected

to

the

CPA

to

form

the·

extended

tna,ntissa

adder.

This

microsequence

step

allows

the

carry

to

propagate

throughout

the

adder

to

form

the

completed

product.

MODE

REGISTER

The

Mode

Register

is

a

buffer

register

between

the

PE

and

the

Control

Unit.

The

PE

status

(operative

or

inoperative)

is

controlled

by

two

bits

of

this

register.

These

two

bits

are

controlled

exclusively

by

the

Control

Unit.

The

PE

controls

other

register

flip

-flops

when

executing

specific

instructions.

ADDRESS

ADDER

The

Address

Adder

inputs

are

received

from

the

Common

Data

Bus,

MDR,

PE's

Index

Register

and

the

Memory

Address

Register.

These

added

sums

are

con-

nected

to

the

input

gates

to

the

MDR,

PE's

Index

Register

and

the

Memory

Address

Register.

The

Modified

Address

Adder

inputs

are

received

from

the

exponent

bit

positions

of

the

A

-B

Registers

and

the

OSG.

This

added

output

is

either

decoded

to

form

Barrel

Switch

control

signals

or

entered

into

the

A

Register's

exponent

field.

PE

INDEX

REGISTER

This

unit

receives

inputs

from

the

Address

Adder.

Its

output

is

connected

to

the

Address

Adder

where

compare

operation

or

address

modifications

are

performed.

MEMORY

ADDRESS

REGISTER

(MAR)

The

MAR

is

a

buffer

register

between

a

PE

and

its

thin

film

stack.

The

MAR

out-

put

is

the

thin

film

stack

operand

address

location.

Depending

upon

the

instruction

the

MAR

can

be

loaded

or

modified

with

Address

Adder

inputs.

The

Address

Adder

input

is

gated

into

the

MAR

at

the

start

of

a

memory

cycle.

SPECIAL

REGISTER

(SR)

The

SR

Register

is

used

to

store

intermediate

results

obtained

during

PE

pro-

cessing.

The

operand

input

is

received

from

the

A

Register.

The

SR

register's

output

is

gated

into

the

B

Register,

thus

providing

a

buffer

'store

for

a

previously

computed

operand

without

a

memory

fetch.

3-7

SECTION

IV

THE

INPUT-OUTPUT

SUBSYSTEM

GENERAL

CONSIDERATIONS

The

following

design

considerations

are

pertinent

to

the

ILLIAC

IV

I/O

Subsystem.

1.

A

disk

storage

system

will

provide

the

principal

on-line

backup

storage

for

the

ILLIAC

IV

System.

Present

memory

state

of

art

singles

out

disk

file

stor-

age

as

the

only

medium

with

the

necessary

volume

and

cost

parameters

to

satisfy

the

ILLIAC

IV

requirements.

Requirements

for

the

disk

file

system

are

further

considered

at

the

end

of

this

section.

2.

I/O

word

transfers

to

and

from

the

PE

Array

will

be

in

the

form

of

a

4096

-bit

word.

This

capability

makes

maximum

use

of

the

interrupt

time

of

a

quadrant

and

keeps

I/O

interference

with

the

Array

program

to

a

minimum.

It

also

provides

a

separate

1/

a

path

to

each

PE

to

accommodate

applications

which

require

asynchronous

direct

inputs

to

the

PE

memories.

Some

real-time

problems

involving

large

array

sensor

systems

are

typical

of

this

application.

3.

Capability

to

perform

interlaced

I/O

transfers

from

simple

descriptor

operation

is

desirable.

Considering

the

variety

and

number

of

peripheral

devices

the

system

may

ultimately

incorporate,

the

capability

to

store

and

rapidly

fetch

at

least

64

I/O

descriptors

to

control

the

interlaced

I/O

transfers

is

necessary.

4.

Much

routine

I/O

processing

such

as

data

packing

and

unpacking,

descriptor

assembly

and

updating,

etc.,

must

be

performed

external

to

and

inde-

pendent

of

the

Array

or

Control

Unit.

Therefore,

a

processing

'device

similar

to

a

medium

scale

computer

module

would

be

most

suitable.

There

are,

however,

some

1/

a

programs

such

as

code

and

format

conversion

which

are

eminently

suitable

to

Array

processing.

4-1

4-2

5.

The

existence

of

an

economical

core

memory

system

separate

and

distinct

from

the

Array

memories

is

necessary

for

the

following

reasons:

a)

Transfers

from

tape

to

disk,

or

card

to

disk,

or

disk

to

printer,

must

have

a

buffer

area

available

to

collect

reasonable

block

sizes

before

making

trans-

fers

to

the'

disk

in

order

to

minimize

latency

time

overhead.

b)

Frequently

used

subroutines

which

overflow

array

storage

can

be

kept

in

random

access,

zero

latency

time

memory

to

avoid

millisecond

delays

in

processing.

c)

To

permit

I I 0

lookahead

transfers

from

disk

to

the

array

to

overlap

latency

time

with

processing,

a

separate

buffer

memory

is

desirable

as

a

staging

area

for

such

transfers.

It

is

possible

that

the

buffer

memory

required

to

interface

the

disk

mass

memory

and

the

PE

array

memory

may

be

very

large.

Anum

ber

of

core

memory

manu-

facturers

are

being

surveyed

to

identify

possible

candidates

for

core

memories

in

the

capacity

range

up

to

16

million

bits

(modules

independently

accessible

up

to

8

million

bits),

with

word-lengths

in

the

range

of

1024

to

4096

bits,

and

cycle

times

in

the

range

of

2

to

8

microseconds.

In

the

indicated

capacity

range

the

ferrite

core

memories

are

currently

dominant

from

the

cost

standpoint.

Memory

organizations

involving

only

2

wires

per

core,

such

as

linear

select

or

variations

of

"2

1/2

D,

"

are

generally

indicated

to

minimize

mat

fabrication

cost.

This

is

also

a

significant

consideration

here

but

in

this

case

it

is

likely

that

the

electronics

will

remain

a

very

significant

portion

of

the

cost.

For

the

longer

word

lengths

(2K

-4K

bits)

it

appears

that"

2 1 / 2

D"

would

offer

little

advantage

over

linear

select

and

the

large

number

of

sense

amplifiers

and

data

bit

drivers

would

reflect

heavily

on

the

cost

per

bit.

It

also

appears

that

practical

considerations

may

limit

word

drivers

to

something

in

the

order

of

1000

cores

per

line

and

hence

might

require

multiple

driver

and

switch

matrices

for

the

longer

word

lengths.

Neces-

sarily

tentative

extrapolations

of

fairly

recent

cost

data

suggests

that

it

may

be

difficult

to

obtain

costs

much

lower

than

3

cents

per

bit

for

a

core

memory

for

this

requirement.

At

this

time

several

core

memory

manufacturers,

including

the

Burroughs

Com-

ponent

Division,

have

been

contacted

about

this

requirement.

'

The

unusual

geometry

of

this

unit

precludes

the

use

of

an

existing

product

line

item.

This

fact

has

delayed

detailed

responses

beyond

publication

of

this

report.

DISK

STORAGE

Effort

to

determine

candidates

for

a

disk

file

mass

memory

offering

high

data

transfer

rates

and

economic

storage

in

capacities

in

the

range

of

300

million

to

1

billion

bits

has

continued.

Obtaining

the

high

data

transfer

rates

desired

in

this

applicat~on

(greater

than

300

megabits

/

second)

requires

a

high

degree

of

parallelism

in

the

transfer,

e.

g. a

large

data

storage

word.

The

parallel

mode

of

operation

increases

the

amount

of

read-write

electronics

required

and

it

is

desirable

to

keep

this

increase

to

a

minimum.

It

is

possible

to

reduce

the

number

of

read

-write

channels

below

one

per

bit

if

basic

bit

rate

capability

can

be

traded

to

effect.

the

reduction.

This

tradeoff

is

indeed

a

necessity

if

fewer

heads

are

simultaneously

accessible

than

the

number

of

bits

required

in

parallel.

The

reduction

of

the

number

of

required

read

-write

channels

is

effected

by

using

a

multiple

zone

storage

format

in

which

several

bits

per

word

time

are

recorded

serially

in

the

outer

zones.

In

general

a

two

-zone

format

allows

the

greatest

.

reduction

in

the

number

of

read-write

channels

for

a

given

sacrifice

of

maximum

bit

rate

while

formats

of

three

or

more

zones

afford

greater

capacity

utilization.

More

than

three

zones

are

seldom

warranted

since

the

required

increase

in

the

number

of

clock

rates

tends

to

cancel

the

attractiveness

of

the

small,

further

incre-

ments

to

utilization

efficiency.

The

following

relations

enable

specific

disk

organizations

to

be

evaluated.

For

a

three-zone

format

the

number

of

tracks

in

each

zone

must

be

follows:

2n

+ n = N

[nb

+ 2 '-

K]

1 2 T

~

and

for

a

two

-zone

format

where

n1 =

the

number

of

tracks

in

the

outer

zone

n2 =

the

number

of

tracks

in

the

!Y'tddle

zone

NT

=

total

number

of

tracks

per

disk

face

K =

number

of

bits

per

track

per

word

stored

serially

in

the

outer

zone.

N =

the

number

of

bits

per

word

per

face

nf

N =

the

number

of

bits

per

storage

word

nf =

the

number

of

disk

faces

used

per

word

4-3

The

number

of

read-write

channels

required

is:

n = n. n

c n f

where

nh

is

the

number

of

heads

per

face

which

are

simultaneously

selected.

For

a

reduction

of

read

-write

channels

below

one

per

bit

of

the

parallel

storage

word

it

is

required

that

In

order

that

the

available

capacity

of

the

disk

file

be

used

efficiently

it

is

also

required

that

nh

be

an

integral

submultiple

of

the

total

number

of

tracks

per

disk

face,

NT'

and

that

nf

shall

be

an

integral

submultiple

of

the

total

number

of

faces

per

file,

N(

For

the

usual

ranges

of

interest

the

maximum

packing

density

occurs

in

the

inner-

most

track

of

the

outer

zone.

It

is

at

this

critical

radius,

R ,

that

the

maximum

bit

packing

density,

B,

applies.

C

where.

R2

=

outermost

track

radius

T =

the

radial

track

density.

The

number.

of

storage

words

per

data

cylinder,

that

is,

the

number

of

words

accessible

in

one

disk

revolution

without

head

switching,

is

given

by:

2TT

RcB

~=

K

The

word

transfer

rate,

UW,

is

then

determined

in

terms

of

the

disk

rpm

as

follows:

• _

(RPMt

TT

RcB

(RPM)

~

-

nW

\60)-

30K

The

total

file

capacity

is;

bits/

file.

4-4

The

ideal

packing

efficiency#

a

measure

of

capacity

utilization

referred

to

the

maximum

possible

capacity

which

would

be

available

at

uniform

packing

density

in

all

tracks#

is

given

by:

where

R2

and

R1

are

the

radius

of

the

outermost

and

innermost

tracks

respectively.

A

factor

based

on

the

maximum

possible

capacity

at

a

constant

number

of

bits

per

track

for

all

tracks

is:

Note

that

this

factor

is

in

general

greater

than

unity

because

the

multizone

format

utilizes

available

disk

area

more

efficiently

than

would

be

possible

with

a

single

clock

rate

for

the

entire

disk

(as

would

be

required

for

a

straight

parallel

configuration).

In

the

multiple

zone

format

every

track

in

the

same

zone

contains

the

same

number

of

bits#

but

each

zone

requires

a

different

clock

rate.

The

clock

rates

for

a

two-

zone

format#

for

example,

would

be:

K

nW

in

the

outer

zone

(K

-1)

nW

in

the

inner

zone.

As

previously

reported

J

the

Librascope

4800

disk

file

series

is

potentially

attractive

because

of

the

large

number

of

heads

per

face.

The

model

4802

is

of

particular

interest

because

the

increased

packing

density

offers

higher

basic

bit

rates

and

more

capacity

in

the

same

basic

unit.

The

salient

characteristics

are

summarized

here.

No.

disk

per

file

-6

(48"

dia.

)

Tracks

per

face

-

432

Packing

density

-

2000

bits/inch

Track

density

-48

tracks

I

inch

RPM

-

900

(35

ms

avg.

access)

4-5

Table

4-1.

512-Bit,

Parallel

Organizations

for

the

Librascope

4802

No

Head

All

Selected

Heads

on

Modifications

One

Disk

One

Face

Bits Bits

3:2

4:3

Format

(Out

Z

Inner

Zone)

3:2

4:3

er

one:

Head

Groups/Head

Assembly

1 1 3 4

Bits/Face/Word

86

128

256

512

Heads/Sel.

/Face/Word

36 36

108

144

Heads/Group

12 12 4 3

Faces

Sel.

/Unit

6

of

12 4

of

12 2

of

12 1

of

12

Ideal

Packing

Efficiency

83%

86.

3%

83%

86.3%

No.

Stg.

Words/Data

Cylinder

83840

58000

84500

58000

32-Bit

Segments/Data

Cylinder

2620

1815

2640

1815

Word

Transfer

Rate,

MHz

1.

26

O.

87

1.

27

0.87

Read-Write

Channels

216

·144

216

144

Data

Cylinders

/Unit

24

36

24

36

Total

Stg.

Cap.

Words/File

(10

6)

2.0

2.09 2.02

2.09

Total

Stg.

Cap.

Bits/File

(10

6)

1024

1070

1030

1070

Clock

Rate,

Inner,

MHz

2. 52 2.

62

2.

54

2. 62

Clock

Rate,

Outer,

MHz

3. 78 3.

48

3.81

3.48

No.

Tracks,

Outer

Zone

168

240

160

240

No.

Tracks,

Inner

Zone

264

192

272

192

Head

Groups,

Outer

Zone

14

20

40

80

Head

Groups,

Inner

Zone

22 16 68

64

Head

Sticks,

Outer

Zone

14

20

13

+

1/3

20

Head

Sticks,

Inner

Zone

22

16

22

+

2/3

16

Clock

Channels

/Unit

12 8 4 2

4-6

The

large

number

of

accessible

heads

permit

a

number

of

possible

disk

organizations.

Without

sacrificing

capacity

the

following

might

represent

limiting

transfer

rate

capabilities.

1.

Up

to

128

bits

per

face

or

1024

bits

per

file

at

a

word

transfer

rate

of

approximately

o.

8

megaHertz

without

modifications

to

head

stick

wiring.

2.

Up

to

512

bits

per

face

or

2048

bits

per

file

at

a

word

transfer

rate

of

approximately

o.

8

megaHertz

if

head

stick

wiring

is

revised

to

enable

simultaneous

selection

of

four

heads

per

stick.

3.

Up

to

256

bits

per

face

or

1024

bits

per

file

at

a

word

transfer

rate

of

approximately

1.

2

megaHertz

if

head

wiring

is

revised

to

enable

simultaneous

selection

of

three

heads

per

stick.

Still

larger

word

lengths

would

be

possible

by

revising

head

stick

wiring

to

permit

simultaneous

access

to

all

heads

but

the

resulting

increase

in

the

nun1.ber

of

head

wires

from

15

to

39

would

likely

be

troublesome.

With

no

fewer

than

3

heads

per

group

the

number

of

wires

per

stick

can

be

held

to

23

for

completely

flexible

head-

track

sparing,

or

to

14

wires

per

head

stick

if

restriction

to

group

sparing

is

acceptable.

At

the

word

lengths

of

interest

it

is

necessary

to

either

subdivide

the

head

stick

wiring

or

concurrently

select

heads

over

several

diskfaces.

Both

have

some

potential

disadvantages;

the

latter

from

the

possible

necessity

for

skew

cor-

rection

due

to

relative

timing

errors

between

several

faces

and/

or

head

plates,

and

the

additional

clock

channels

required.

At

a

word

length

of

512

bits

either

approach

is

a

possibility.

Table

4-1

presents

a

summary

for

four

different

organ-

izations:

two

involve

no

head

stick

revisions,

and

two

require

head

revisions

but

involve

only

one

disk

or

one

disk

face,

respectively.

Two

different

two

-zone

for-

mats

(indicated

in

the

table

as

the

number

of

serial

bits

per

word

time

in

outer

zone:

the

number

of

serial

bits

per

word

time

in

the

inner

zone)

offer

words

trans-

fer

rates

of

O.

8

to

1.

2

megaHertz.

At

the

initial

contact

with

General

Precision,

Inc.,

they

indicated

a

tentative

preference

for

organizations

involving

concurrent

head

selection

on

a

single

face

because

they

did

not

think

they

had

adequate

information

at

that

time

with

respect

to

the

skew

problem.

4-7

SECTION

V

PROGRAMMING

PE

OPERA

TIONS

This

section

describes

modifications

and

additions

to

the

list

of

instructions

ex-

ecuted,

at

least

in

part,

by

the

logic

within

Processing

Elements.

A11

instruc-

tions

not

mentioned

here

are

functionally

unchanged

from

those

described

in

Sec-

tion

VII

of

Progress

Report

No.1

(TR-66-3;

August

26,

1966).

However,

some

changes

in

instruction

forma.ts

are

contemplated

to

accommodate

certain

new

in-

structions.

In

particular,

some

of

the

new

instructions

will

be

ma.de

var

iants

of

old

instructions

instead

of

new

op-

codes

and

these

old

instructions

will

have

their

formats

changed

to

incorporate

variant

fields.

In

32-bit

mode,

bits

0

and

32

are

the

sign-bits

of

the

two

operands

in

either

the

A

or

the

B

Register;

bits

1-7

and

33-39

are

the

exponents;

bits

8-31

and

40-

63

are

the

mantissa

magnitudes.

Fixed-

point

operations

always

involve

the

mantissa

magnitudes

and

the

mantissa

signs

only.

Thus,

in

32-

bit

mode

fixed-

point,

op-

erations

are

performed

on

24-bit-plus-sign

quantities.

When

an

operation

uses

-

operands

from

both

the

A

and

the

B

Registers,

corresponding

bits

from

the

two

registers

are

involved.

A s

an

example,

a

double-length

arithmetic

shift

(SHD)

treats

bits

8-

31

of

the

A

and

B

Registers

as

one

48-

bit

quantity

and

bits

40-

63

of

the

two

registers

as

another

48-bit

quantity.

Shlft

counts

are

interpreted

modulo

32

in

32-blt

mode.

The

PE

Index

Register

is

not

affected

by

the

word-length

mode.

For

example,

Load

Index

from

A

Register

(LXA)

always

transmits

bits

52-

63

of

the

A

Register

'to

the

Index

Register.

.

There

are

now

nine

mode-bits

per

PE,

designated

as

the

\V,

E,

E1,

F,

F1,

G,

I,

and

J

bits.

The

\\1'-

Bit

is

always

set

or

reset

by

the

Control

Unit

and

designates

the

word-length

(ZERO

implies

64-bit

operands,

ONE

implies

32-bit

operands').

5-1

5-2

In

64-bit

mode

the

E-Bit

enables

or

disables

the

changing

of

full

operand

registers

and

the

E1-

Bit

is

not

available

for

program

use.

In

32 -

bit

mode,

the

E-

Bit

con-

trols

the

changing

of

bits

0-

31

of

the

registers

and

the

E1-

Bit

controls

bits

32-

63:

In

test

instructions

executed

by

PEts,

any

register

bit

which

is

disabled

from

being

changed

is

also

disabled

from

being

tested

to

generate

a

change

in

any

"mode

bit.

In

64-bit

mode,

the

F-Bit

is

set

to

ONE

whenever

an

arithmetic

fault

occurs

in

the

PE

and

the

F1-Bit

is

not

available

for

program

use.

In

32-bit

mode,

the

F-Bit

is

set

to

ONE

whenever

a

fault

occurs

in

arithmetic

performance

on

bits

0-31

of

operands

and

the

F1-bit

is

set

to

ONE

whenever

a

fault

occurs

in

bits

32-63.

In

64-bit

mode,

the

instruction

list

is

augmented

to

permit

specification

of

anyone

of

the

G,

H,

I,

and

J

bits

to

hold

the

result

of

any

test.

For

example,

the

tests

for

the

A-Register

being

zero

now

include

GAZ

and

HAZ

(Set

G/H

if

A

Equals

Zero)

as

well

as

IAZ

and

JAZ.

The

Control

Unit

can

enable

or

disable

a

PE

as

a

result

of

the

condition

of

the

G

and

H

bits

in

a

manner

identical

to

the

with

the

I

and

J

bits.

In

32-bit

mode,

the

IAZ

instruction

sets

the

G-Bit

if

bits

0-31

of

the

A-Register

are

zero

and

sets

the

I-Bit

if

bits

32-63

of

the

A-Register

are

zero.

Of

course,

if

the

E-Bit

equals

ZERO,

the

G-Bit

is

unchanged

and

if

the

E1-Bit

equals

ZERO,

the

I-Bit

is

unchanged.

Similarly,

JAZ

affects

both

the

Hand

J

bits

in

3~-bit

mode.

The

operations

of

instructions

calling

for

setting

of

the

G

and

H

bits

(e.

g.

GAZ,

HAZ)

are

presently

undefined

for

32-bit

mode.

The

new

register

in

each

PE

is

designated

the

ItS

Register"

(for

Save

Operand

Register

or

Special

Register).

The

contents

of

the

S

Register

are

not

involved

or

altered

by

any

arithmetic

or

shift

instruction.

Three

instructions

have

been

defined

involving

the

S

Register:

SAS

Store

A

Register

in

S

Register

Copy

all

of

the

A

Register

into

the

S

Register.

If

in

64-

bit

mode

and

E =

ZERO,

do

not

change

the

S

Register.

If

in

32-bit

mode

and

E =

ZERO,

do

not

change

bits

0-31

of

the

S

Register.

If

in

32-

bit

mode

and

E1

=

ZERO,

do

not

change

bits

31-63

of

the

S

Register.

IBS

SWAPS

Load

B

Register

from

S

Register

Swap

A,

Band

S

Registers

A

Goes

to

S

B

Goes

to

A

S

Goes

to

B

There

is

now

a

set

of

instructions

which

permit

operand

transmissions

to

begin

and/

or

end

at

the

MDR

without

involving

the

A

and

B

Registers.

These

instruc-

tions

are

designed

so

that

intermediate

PE's

can

participate

in

multi-

PE

routing.

without

having

their

operand

registers

altered..

The

Control

Unit

has

a

single

instruction

which

causes

a

sequence

of

transfers

among

neighboring

PE's

to

accomplish

the

desired

multi-PE

routing.

The

instructions

transmitted

to

the

PE's

by

the

Control

Unit

in

response

to

the

single

instruction

executed

by

the·

Control

Unit

are

also

available

for

the

program

to

call

upon

individually.

These

instructions

include

MDR-to-MDR

transmission

with

direction

specified,

storage

from

the

A

or

the

B

Register

to

the

MDR

of

the

same

PE,

and

leading

of

the

A

or

B

Register

from

the

MDR

of

the

same

PEe

A

disabled

PE

does

not

execute

those

instructions

which

load

its

A

or

B

Register

but

does

participate

in

the

others.

This

control

of

which

PEls

finally

receive

multi-PE

transmissions,

coupled

with

the

programmed

control

of

the

path

distance

and

directions,

permit

the

single

Control

Unit

instruction

required.

In

effect,

all

PE'

s

originate

transmissions

but,

after

the

path

control

has

been

counted

down,

only

enabled

PE's

accept

the

result

of

the

transmission.

The

operands

which

arrive

at

disabled

PE'

s

are

retained

in

their

MDR'

s

and

may

be

discarded

by

the

causing

their

replacement.

However,

some

saving

of

transmission

time

may

be

accomplished,

for

certain

routing

patterns,

if

the

first

multi-PE

transmission

is

followed

by

a

new

transmission

that

starts

with

MDR-to-MDR

transmissions

and

which

terminates

with

a

different

set

of

PEls

being

enabled

to

accept

the

transmission

results

from

their

MDR's

into

their

A

and

B

Registers.

There

are

now

instructions

which

cause

the

clearing

of

the

exponent

field

of

the

A

or

the

B

Register.

Clearing

of

the

mantissa

field

has

always

been

available

as

an

end-off

arithmetic

shift

with

any

shift

count

exceeding

the

length

of

the

mantissa.

With

the

Barrel,

shifting

to

clear

a

field

takes

no

longer

than

the

execution

of

a

special

clear

field

instruction.

Obviously,

in

assembly

language

a

clear

mantissa

instruction

may

exist

which

would

assemble

as

a

shift.

There

is

now

a

round

variant

on

each

of

the

pertinent

arithmetic

instructions.

In

floating-point,

rounding

is

accomplished

before

normalization

to

preserve

the

prope

r

significance.

There

are

now

instructions

which

add

address

fields

to

the

of

the

Memory

Address

Register,

with

the

address

fields

being

optionally

provided

by

fields

within

the

instruction

or

the

PE

Index

Register

or

both.

This

permits

straightforward,

multi-level,

indirect

addressing

when

the

level

is

known.

There

is

now

an

instruction

which

stores

an

operand

from

the

A,

B

or

Memory

Data

Register

of

an

enabled

PE

to

the

Common

Data.

Bus.

When

only

one

PE

is

enabled,

this

provides

the

required

transmission

of

an

operand

from

a

PE

to

the

Control

Unit

without

requiring

memory

access.

When

two

or

more

PEls

are

enabled,

the

result

of

this

instruction

is

undefined.

There

are

now

instructions

which

carry

an

indexable-

6-bit

field

specifying

one

bit

within

either

the

A

or

the

B

Registers.

The

bit

number

is

interpreted

modulo

64

in

64-bit

mode..

In

32-bit

mode,

the

bit

number

is

interpreted

modulo

32

and

as

modulo

32

plus

32,

allowing

specification

of

corresponding

'bits

in

both

halves

of

the

Register.

The

operations

on

and

with

the

specified

hit

are:

•

Set

either

the

G,

H,

I

or

J

bit

to

equal

the

specified

bit.

In

32-bit

~ode,

only

I

and

J

are

defined.

)

5-3

5-4

•

Change

the

specified

bit.

•

Set

the

specified

bit

to

ZERO.

•

Set

the

specified

bit

to

ONE.

There

are

now

instructions

which

perform

arithmetic

on

the

magnitudes

of

the

operands.

There

are

now

instructions

which

compare

the

magnitude

of

the

A

Register

with

the

magnitude

of

the

B

Register.

Also,

the

magnitude

of

either

register

may

be

compared

with

the

Common

Value.

Comparison

with

the

magnitude

of

the

Common

Value

is

not

included

as

a

separate

PE

inst

ruction

since

this

may

be

accomplished

by

having

the

Control

Unit

broadcast

the

magnitude

only

of

the

Common

Value.

The

instructions

which

operate

upon

or

compare

the

value

of

the

PE

Index

Register

have

been

augmented

so

that

the

Index

Register

may

be

stored

in,

loaded

from,or

compared

with

the

least

significant

twelve

bits

of

the

B

Register

or

a

memory

word.

When

a

memory

word

is

involved

in

any

of

these

instruc-

tions,

the

address

in

memory

is

indexable.

CONTROL

UNIT

INSTRUCTIONS

In

the

following

numeric

list

of

instructions,

the

first

syllable

is

given

in

octal.

Op-code

"000"

is

interpreted

by

the

CU

as

a

halt.

Op-codes

"001-177"

are

ex-

ecuted

in

part

by

the

CU

and

part

by

the

PE

array.

Op

-codes"

200

-377"

are

interpreted

fully

by

the

CD

and

no

direct

PE

action

results.

In

ulti

-syllabic

instructions,

the

following

abbreviations

are

used

to

indicate

the

coding

of

the

syllables

other

than

the

first:

A

bit

which

does

not

affect

the

operation

of

the

instruction

being

described.

a A

bit

which

is

part

of

an

address

or

literal

field.

b A

bit

which

is

part

of

a

field

which

designates

a

bit-

number

within

a

register.

c A

bit

which

is

part

of

a

shift

count

(

in

some

instructions

a

bit-number).

C

An

eight-bit

syllable

used

to

designate

an

address

within

CU

local

memory.

d A

bit

which

designates

shift

direction.

e A

bit

which

distinguishes

between

end

-off

and

end

-around

shifts.

End

-around

shifts

are

mnemonically

referred

to

as

"Rotate.

II

i A

variant

bit

which

is

defined

as

ONE

for

the

specific

variaI1t

being

discussed.

o A

variant

bit

which

is

defined

as

ZERO

for

the

variant

being

discussed.

L

An

eight

-bit

syllable

which

is

part

of

a

literal

field.

M

An

eight

-bit

syllable

which

is

part

of

an

address

field

used

by

the

CD

to

address

IVlain

Memory.

v A

variant

bit

which

has

meaning

in

defining

a

variant

described

elsewhere.

Examples

of

this

are

given

following

this

list.

x A

variant

bit

which

controls

indexing,

by

the

PE

Index

Register,

of

the

address,

shift-count

or

bit-number

field

given

in

the

in-

struction.

As

an

example

of

the

use

of

these

abbreviations

in

the

instruction

list,

consider

the

instructions

which

transmit

data

between

neighboring

PE's.

Each

of

these

instructions

starts

with

an

op

-code

syllable

equal

to

octal

120

and

has

a

second

syllable

specifying

variants.

The

variants

permit

the

choice

of

the

A

Register,

the

B

Register

or

the

Memory

Data

Register

as

the

transmission

origin;

the

same

three

registers,

or

the

Memory

Address

Register,

may

be

the

destination;

the

transmission-direction

may

be

North,

East,

South,

or

West.

Two

bits

are

used

for

each

of

these

variants,

leaving

two

undefined

bits

in

the

variant

syllable.

The

two

most

significant

bits

specify

the

originating

register,

the

bits

specify

the

destination

register

and

the

two

least

Significant

bits

specify

the

direction.

Transmission

from

the

Memory

Data

Register

to

the

Memory

Data

Register

of

the

North

neighbor

is

indicated

by

coding"

10"

for

each

of

the

register

-designation

bits

and

"00"

for

the

direction:

1010

..

00,

where

the

periods

indicate

the

undefined

bits.

In

the

instruction

list

this

would

appear

"ioio

..

00".

However,

to

avoid

listing

each

possible

variant

separately,

the

list

contains

the

following

t.hree

entries:

120

iovv

..

vv

120

vvio

..

vv

120

vvvv

..

00

D-T-

-DT-

--TN

The

first

entry

denotes

the

coding

of

the

bits

that

specify

the

origin

and

indicates

that

the

other

variant

bits

have

no

affect

on

the

origin.

Similarly,

the

second

entry

denotes

destination

and

the

third

entry

denotes

direction.

The

mnemonic

abbre-

viations

show

the

characters

that

represent

the

variants

and

hyphens

are

used

for

character

positions

that

are

used

to

denote

other

variants.

In

the

specific

example

being

considered,

"DDTN"

means

MDR

-to-MDR

transmission

North

and

is

encoded

"ioio

..

00".

5-5

5-6

Elements

of

the

Control

Unit

The

control

unit

is

conveniently

described

in

terms

of

its

registers

and

other

functional

entities.

The

registers

consist

of:

•

Eight

mode

function

registers,

E

l'

E2, F

l'

F , G, H, I

and

J

although

physically

located

in

the

processing

e~ements,

are

addressed

by

program

as

though

they

were

registers

of

the

control

unit

• Index·

registers,

each

32

bits

long

•

64-bit

registers

in

high

speed

scratchpad

storage.

Sixty-four

of

these

can

be

used

as

the

broadcast

buffer.

•

Program

counter

•

Interrupt

register

and

mask,

or

interrupt

control,

register

•

Local

register

address

pointer,

8

bits

of

register

address.

The

above

registers

are

addressible

uniformly

using

an

8

-bit

register

address

in

the

instructions.

In

addition,

there

are

several

registers

which

are

not

addressible

by

the

8

-bit

field,

but

are

implicity

addressed

by

all

instructions

which

are

relevant

to

their

use.

They

are:

'.

Program

lookahead,

for

holding

a

reservoir

of

program

steps

independently

of

memory

accessing.

•

Address

register,

which

serves

as

an

accummulator

for

address-

sized

fields.

It

is

24

bits

long.

• A

ccumulator

(for

want

of

a

better

name),

a

common

register

referenced

by

all

data

manipulating

instructions.

It

is

64

bits

long.

• A

queue

of

instructions

and

q,ata,

which

decouples

the

operations

which

are

solely

within

the

CU

from

those

that

refer

to

the

PE's.

This

queue,

like

that

in

the

B8500,

has

no

effect

on

the

operations

of

the

instructions

except

to

permit

a

certain

amount

of

parallelism,

and

therefore

is

not

discussed

further.

Discussion

of

the

Instructions

All

registers

within

the

CU

are

uniformly

addressed

by

an

8

-bit

field,

which

is

in-

dexable.

The

operation

of

transferring

the

I

Register

to

the

E

Register,

which

is

"enable

those

PE's

which

had

a

true

result

on

the

most

recent

test

involving

the

I-Bit,

"

is

accomplished

by

the

same

sequence

of

instructions

as

transferring

register

number

24

to

register

number

42

which

merely

moves

data

around

the

scratchpad.

Similarly,

a

jump

is

accomplished

by

storing

a

new

value

in

the

numbered

register

which

is

the

program

counter.

A

bonus

which

comes

from

this

approach

is

that

capabili

ty

which

is

invented

for

the

use

of

one

special

case,

such

as

reading

a

PE

number

from

the

E

register,

can

be

used

on

any

data

within

the

CD,

thus

increasing

programming

flexibility.

The

CUAccummulator

is

central

to

most

CU

operations.

Its

use

is

implied

in

most

CD

data

transmission

instructions.

Whenever

an

instruction

is

transmitted

from

the

CD

to

the

PE

array,

the

CD

Accummulator

may

modify

the

PE

instruction

or

otherwise

participate

in

the

operation

of

the

PEt

For

example,

the

PE

in-

struction

"Load

A

Register

from

Common

Value"

(LAC)

transmits

the

contents

of

The

CD

Accummulator

to

the

A

Registers

of

all

enabled

PE's.

The

accummulator

also

receives

the

transmission

from

the

PE

in

the

"Store

A

Register

to

Common

Data

Bus"

Instruction

(SAC).

When

an

instruction

with

an

address,

shift-count,

or

bit-number

field

is

trans-

mitted

to

the

PE's,

the

contents

of

the

CU

Accumnlulator

are

added

to

this

field

before

transmission.

Thus,

multiple

indexing

with

CU

index

registers

is

accom-

plished

by

ordinary

addition

to

the

CD

Accummulator

before

issuance

of

the

PE

instruction.

The

CU

also

has_

one

special

index

register

for

its

own

use

in

addressing

w:ithin

its

local

memory.

The

act

of

placing

a

value

in

this

index

automatically

causes

the

addition

of

this

value

to

the

with

a

CD

register

number.

This

register

is

always

cleared

after

its

use.

000

HALT

Halt

All

Operations

002

CHSA

Change

Sign

of

A

Register

003

CHSB

Change

Sign

of

B

Register'

004

SAP

Set

A

Register

Positive

005

SBP

Set

B

Register

Positive

006

SAN

Set

A

Register

Negative

007

SBN

Set

B

Register

Negative

010

CEA

Clear

Exponent

of

A

Register

011

CEB

Clear

Exponent

of

B

Register

013

CMB

Complement

B

Register

015

CLB

Clear

B

Register

016

NORM

Normalize

A

Register

020

SAD

Store

from

A

Register

to

Memory

Data

Register

021

SBD

Store

from

i3

Register

to

Memory

Data

Register

022

LAD

Load

A

Register

From

Memory

Data

Register

5-7

5-8

023

024

025

026

027

.

030

031

032

033

034

035

036

037

040

041

042

043

044

045

046

047

050

051

052

053

054

055

06_0

061

062

063

064

LBD

SAC

SBC

LAC

LBC

LMA

LMB

LMC

LMD

LXA

LXB

LXC

LXD

SXA

SXB

SXC

SXD

ADAX

ADBX

ADCX

-ADDX

LDAM

LDBM

STAM

STBM

LDMM

LDDM

SAS

LBS

SWAP

SWS

DBA

Load

B

Register

from

Memory

Data

Register

Store

from

A

Register

to

Common

Data

Bus

Store

from

B

Register

to

Comn1on

Data

Bus

Load

A

Register

from

Common

Value

Load

B

Register

from

COlnmon

Value

Load

Memory

Address

Register

from

A

Register

Load

Memory

Address

Register

from

B

Register

Load

Memory

Address

Register

from

Common

Value

Load

Memory

Address

Register

from

Memory

Data

Register

Load

Index

Register

From

A

register

Load

Index

Register

from

B

Register

Load

Index

Register

from

Common

Value

Load

Index

Register

from

Memory

Data

Register

Store

Index

Register

in

A

Register

Sotre

Index

Register

in

B

Register

Store

Index

Register

to

Common

Data

Bus

Store

Index

Register

in

l\1emory

Data

Register

Add

A

Register

to

Index

Register

Add

B

Register

to

Index

Register

Add

Common

Value

to

Index

Register

Add

Memory

Data

Register

to

Index

Register

Load

A

Register

as

Designated

by

Memory

Address

Register

Load

B

Register

as

Designated

by

Memory

Address

Register

Store

A

Register

as

Designated

by

Memory

Address

Register

Store

B

Register

as

Designated

by

Memory

Address

Register

Load

Memory

Address

Register

as

Designated

by

Memory

Address

Register

Load

Memory

Data

Register

as

Designated

by

Memory

Address

Register

Store

A

Register

in

S

Register

Load

B

Register

from

S

Register

Swap

A

and

BRegisters

Swap

with

S

Register

(A

to

S: S

to

B:

B

to

A)

Duplicate

B

Register

from

A

Register

100

xdcc

cccc

SHA

Shift

A

Register

Mantissa

101

xdcc

cccc

SHB

Shift

B

Register

Mantissa

102

xdcc

cccc

SAL

Shift

A

Register

Logical

103

xdcc

cccc

SBL

Shift

B

Register

Logical

104

xdcc

cccc

RAL

Rotate

A

Register

Logical

105

xdcc

cccc

RBL

Rotate

B

Register

Logical

106

xdcc

cccc

SHD

Shift

Double

-Length

Mantissa

107

xdcc

cccc

SDL

Shift

Double

-Length

Logical

110 x.

bb

bbbb

CHBA

Change

Specified

Bit

of

A

Register

111 x.

bb

bbbb

CHBB

Change

Specified

Bit

of

B

Register

112

x.

bb

bbbb

SBA

Set

Specified

Bit

of

A

Register

113 x.

bb

bbbb

SBB

Set

Specified

Bit

of

B

Register

114 x.

bb

bbbb

CLBA

Clear

Specified

Bit

of

A

Register

115

x.

bb

bbbb

CLBB

Clear

Specified

Bit

of

B

Register

120

oovv

..

vv

A-T-

Inter

-PE

Transmission

frorn

A

Register

120

oivv

..

vv

B-T-

Inter

-PE

Transmission

from

B

Register

120

iovv

•.

vv

D-T-

Inter-PE

Transmission

from

Memory

Data

Register

120

vvoo

•.

vv

-AT-

Inter

-PE

Transmission

to

A

Register

120

vvoi

..

vv

-BT-

Inter-PE

Transmission

to

B

Register

120

vvio

..

vv

-DT-

Inter

-PE

Transmission

to

Memory

Data

Register

120

vvii

..

vv

-MT-

Inter

-

PE

Transmission

to

l\1emory

Addres

s

Re

gister

120

vvvv

••

00

--TN

Inter-PE

Transmission

North

120

vvvv

· .

oi

--TE

Inter

-

PE

Transmis

sion

East

120

vvvv

· .

io

--TS

Inter-PE

Transmission

South

120

vvvv

· .

ii

--TW

Inter-PE

Transmission

West

121

oioo

ioio

CLA

Clear

A

Register

BOOOO

Boolean

Function

0000

121

oHi

ioio

AND

A

AND

B;

Result

to

A

Register

BOO01

Boolean

Function

0001

121

oiii

ioii

NIMP

Not

(A

Implication

B);

Result

to

A

Register

B0010

Boolean

Function

0010

121·

oiii

iiio

NRIMP

Not(B

Implication

A);

Result

to

A

Register

BO

100

Boolean

Function

0100

5-9

5-10

121

oooi

ioio

121

ioii

ioio

121

ooii

ioio

121

o

iii

iiii

121

ioii

121

oooi

ioii

121

ooii

ioii

121

ooio

iiio

121

ooli

iiio

12

1

iiii

122

oovv

vvvv

122

oivv

vvvv

122

vvvv

vvvo

122

vvvv

vvvi

122

vvoo

ooov

122

vvoo

ooiv

122

vvoo

oiov

122

vvoo

oiiv

122

vvoo

ioov

122

vvoo

ioiv

122

.

vvoo

iiov

122

vvoo

iiiv

122

vvoi

ooiv

DAB

'B0101

EOR

BOlIO

OR

BOllI

NOR

BIOOO

MEQ

B100l

NOTB

B1010

RIMP

BI011

CMA

Bl100

IMP

BIlOl

NAND

B1ll0

Duplicate

A

Register

fromB

Register

Boolean

Function

0101

A

Exclusive

OR

B;

Result

to

A

Register

Boolean

Function

0110

A

OR

B;

Result

to

A

Register

Boolean

Function

0111

Not

(A

or

B);

Result

to

A

Register

Boolean

Function

1000

A

Material

Equivalence

B;

Result

to

B

Register

Boolean

Function

1001

Complement

of

B

Register

Transmitted

to

A

Register

Boolean

Function

1010

B

Implication

A;

Result

to

A

Register

Boolean

Function

1011

Complement

A

Register

Boolean

Function

1100

A

Implication

B;

Result

to

A

Register

Boolean

Function

1101

Not

(A

AND

B);

Result

to

A

Register

Boolean

Function

1110

Perform

A

rithmetic

on

Sign

and

Magnitude

----M

Perform

Arithmetic

on

Magnitude

Only

----R

No

Rounding

of

Result

----R

Round

Result

ADD

Fixed

Point

Add

A

to

B;

Result

to

A

Register

UFAD

Unnormalized

Floating

Add

A

to

B;

Result

to

A

Register

SUB

Fixed

Point

Subtract

B

from

A;

Result

to

A

Register

UFSU

Unnormalized

Floating

Subtract

B

from

A;

Result

to

A

Register

MUL

Fixed

Point

Multiply

A

by

B;

Result

to

A & B

UFMU

Unnormalized

Floating

Multiply

A

by

B;

Result

toA

& B

DIV

Fixed

Point

Divide

A

by

B;

Quotient

to

A~

Remainder

to

B

UFDV

Unnormalized

Floating

Divide

A

by

B;

Quotient

to

A~

Remainder

to

B

FAD

Float:ing

A

dd

A

to

B;

Result

to

A

122

:vvoi

oiiv

122

vvoi

ioiv

122

vvoi

iiiv

122

vvio

ooiv

122

vvio

oiiv

122

vvio

iiov

123

iiii

iiio

123

iiii

ioii

123

ioii ioii

130

oovv

vvvv

130

oivv

vvvv

130

iovv

vvvv

130

iivv

vvvv

130

vvoo vvvv

130

vvoi

vvvv

130

vvio

vvvv

130

vvvv

oovv

130

vvvv

oivv

130

vvvv

iovv

130

vvvv

vvoo

130

vvvv

vvoi

130

vvvv

vvio

131

oovv

vv

..

131

oivv

vv

..

FSU

FMU

FDV

EAD

ESU

IDV

GR8

LS8

EQ8

G---

H---

1---

J---

Floating

Subtract

B

from

A;

Result

to

A

Floating

Multiply

A

by

B;

Result

to

A & B

Floating

Divide

A

by

B;

Quotient

to

A,

Remainder,

to

B

Extend

Precision

After

Floating

Add;

Extension

of

Sum

to

A

Register

Extend

Precision

After

Floating

Subtract;

Extension

of

Difference

to

A

Register

Integer

Divide

A

by

B;

Quotient

to

A,

Remainder

to

B

Te

st

8-

bit

Bytes

for

A

Greater

than

B;

Result

to

A

Test

8-

Bit

Bytes

for

A

Less

than

B;

Result

to

A

Test

8-Bit

Bytes

for

A

Equal

to

B;

Result

to

A

Set

G-Bit

as

Result

of

Comparison

of

A

Register

with

B

Register

Set

H-Bit

as

Result

of

Comparison

of

A

Register

with

B

Register

Set

I-Bit

as

Result

of

Comparison

of

A

Register

with

B

Register

Set

J

-Bit

as

Result

of

Comparison

of

A

Register

with

B

Register

Compare

Sign

and

Magnitude

of

A

Register

with

Specified

State

of

B

Register

-

M-

- -

Compare

Magnitude

only

of

A

Register

with

Specified

State

of

B

Register

-

L---

Compare

Logical

Value

of

A

Register

with

Specified

State

of

B

Register

--

-

Compare

Sign

and

Magnitude

of

B

Register

with

Specified

State

of

A

Register

- -

--

M

Compare

Magnitude

Only

of

B

Register

with

Specified

State

of

A

Register

----L

Compare

Logical

Value

of

B

Register

with

Specified

State

of

A

Register

--

EQ-

Test

for

A

Equal

to

B

--

LS-

Test

for

A

Less

than

B

--GR-

Test

for

A

Greater

than

B

G---

Set

G-Bit

as

Result

of

Test

of

A

or

B

Register

H---

Set

H-Bit

as

Result

of

Test

of

A

or

B

Register

5-11

131

iovv

vv

.•

l---

Set

1-

Bit

as

Result

of

Test

of

A

or

B

Register

131

iivv

VV

••

J---

Set

J -

Bit

as

Result

of

Test

of

A

or

B

Register

131

vvov

vv

..

-A--

Test

A

Register

131

vviv

vv

..

-B--

Test

B

Register

131

vvoo

oi.

.

--Z

Test

for

Plus

or

Minus

Zero

131

vvvo

io

..

--PZ

Test

for

Plus

Zero

131

vvvo

ii.

.

--0

Test

for

ALL

ONES

(Minus

Full

Scale)

131

vvvi

00

••

--S

Copy

Sign-

Bit

to

Designated

Test

Bit

132

oovv

vv

..

GX--

Set

G-Bit

as

Result

of

Index-Register

Test

132

oivv

vv

..

HX··-

Set

H-Bit

as

Result

of

Index-Register

Test

132

iovv

vv

..

IX--

Set

I-Bit

as

Result

of

Index-

Register

Test

132

iivv

vv

..

JX--

Set

J

-Bit

as

Result

of

Index-

Register

Test

132

vvoo

vv

..

-XE-

Compare

Index

Register

for

Equality

with

Specified

Quantity

132

vvoi

vv

..

-XL-

Test

for

Index

Register

Less

than

Specified

Quantity

132

vvio

vv

..

-XG-

Test

for

Index

Register

Greater

than

Specified

Quantity

132

vvii

vv

..

-XZ

Test

for

Index

Zero

132

vvvv

00

••

-X-A

Compare

Index

Register

with

A

Register

132

vvvv

oi.

.

-X-B

Compare

Index

Register

with

B

Register

132

vvvv

io

..

-X-C

Compare

Index

Register

with

Common

Value

132

vvvv

ii

..

-X-D

Compare

Index

Regsiter

with

Memory

Data

Register

134

xobb

bbbb

GBA

Set

G-Bit

to

Designated

Bit

of

A

Register

134

xibb

bbbb

GBB

Set

G-Bit

to

Designated

Bit

of

B

Register

135

xobb

bbhb

HBA

Set

H-Bit

to

Designated

Bit

of

A

Register

135

xibb

bbbb

HBB

Set

H -

Bit

to

Designated

Bit

of

B

Register

136

xobb

bbbb

lBA

Set

I-Bit

to

Designated

Bit

of

A

Register

136

xibb

bbbb

lBB

Set

l-

Bit

to

Designated

Bit

of

B

Register

137

xobb

bbbb

JBA

Set

J -

Bit

to

Designated

Bit

of

A

Register

137

xibb

bbbb

JBB

Set

J

-Bit

to

Designated

Bit

of

B

Register

5-12

140

oovv

aaaa

GX-L

Set

G-Bit

as

Result

of

Comparison

of

Index

Register

with

Literal

140

oivv

aaaa

HX-L

Set

H-Bit

as

Result

of

Comparison

of

Index

Register

with

Literal

140

iovv

aaaa

aaaa aaaa

IX-L

Set

I-

Bit

as

Result

of

Comparison

of

Index

Register

with

Literal

140

iivv

aaaa

aaaa aaaa

JX-L

Set

J

-Bit

as

Result

of

Comparison

of

Index

Register

with

Literal

140

vvoo

aaaa

-XEL

Test

for

Index

Equal

to

Literal

140

vvoi

aaaa

-XLL

Test

for

Index

Less

than

Literal

140

vvio

aaaa

-XGL

Test

for

Index

Greater

than

Literal

141 ·

ovv

aaaa

aaaa aaaa

L-L

Load

Specified

Register

with

Literal

141 ·

ivy

aaaa

ADL-

Add

Literal

to

Specified

Register

141 ·

voo

aaaa aaaa

aaaa

LALj

ADLA

Specify

A

Register

141 ·

vol

aaaa

LBLj

ADLB

Specify

B

Register

141 •

via

aaaa

LXLj

ADLX

Specify

Index

Register

150

x.

00

aaaa aaaa aaaa

STA

Store

A

Register

150

x.oi

aaaa

aaaa aaaa

STB

Store

B

Register

150

x.

io

aaaa

STD

Store

Memory

Data

Register

150

x.ii

aaaa

STX

Sotre

Index

Register

.....

151

x.oo

aaaa

LDA

Load

A

Register

151

v.oi

aaaa

LDB

Load

B

Register

151

x.

io

aaaa

LDD

Load

:LyIemory

Data

Register

151

x.ii

aaaa aaaa aaaa

LDX

Load

Index

Register

152'

xooo

aaaa

aaaa aaaa

LDM

Load

Memory

Address

Register

152

xioo

aaaa

aaaa aaaa

ADM

Add

to

Memory

Address

Register

152

xiii

aaaa

ADMX

A

dd

from

Memory

to

Index

Register

200

DUP

Duplicate

Non-Zero

Half

of

CU

A

ccummulator

.201

FULL

Enter

Full-

Word

(64-bit)

Mode

202

HALF

Enter

Half-

Word

Mode

204

ZEROF

If

CU

Accumulator

Is

Not

All

ZEROS~

Skip

Until

the

Instruction

5-13

205

ZEROT

If

CU A

ccumulator

is

All

ZEROS,

Skip

Until

the

P"

Instruction

206

ONESF

If

CU

Accumulator

Is

Not

All

ONES,

Skip

Until

the

Instruction

207

ONEST

If

CU A

ccum

ulator

is

all

ONES,

Skip

Until

the

Instruction

210

SKIPF

If

the

Result

of

the

Last

Test

was

False,

Skip

Until

the

In-

struction

211

SKIPT

If

the

Result

of

the

Last

Test

was

True,

Skip

Until

the

In-

struction

212

UNSKIP

Resume

Executing

All

Instructions

220

LEA

DO

Find

Leading

ONE

in

the

CU

Accum-

ulator;

Put

Bit-

Number

in

CU

Accumulator

221

LEADZ

Find

Leading

ZERO

in

the

CU

Accum-

ulator;

Put

Bit-

Number

in

CU

Accumulator

230

CCL

Clear

CU

A

ccumulato!'

231

CCOM

Complement

CU

Accumulator

232

XCUA

Index

by

CU

Accumulator

240

aaaa aaaa

STL

Store

CU A

ccumulator

in

CU

Local

Memory

241

aaaa

STLC

Store

CU A

ccumulator

in

CU

Local

Memory,

Complemented

242

aaaa

LDL

Load

CU

f..

ccumulator

from

CU

Local

Memory

244

aaaa

EXCH

Exchange

CU. A

ccumulator

with

CU

Local

Memory

245

aaaa

EXCHC

Exchange

Complement

of

CU

Accumulator

with

CU

Local

Memory

246

aaaa

CADD

Add

to

CU

Accumulator

247

aaaa aaaa

CSUB

Substract

from

CU

Accumulator

250

aaaa

CAND

AND

to

CU

Accumulator

251

aaaa

COR

OR

to

CU

Accumulator

252

aaaa

CEOR

Exclusive

OR

to

CU

Accumulator

5-14

260

oobb

bbbb

CCLB

Clear

Designated

Bit

260

oibb

bbbb

CSBO

Set

Designated

Bit

to

ONE

260

iobb

bbbb

DDHB

Change

De

signated

Bit

261

edcc

cccc

CSHIFT

Shift

262

oobb

bbbb

CTSBZ

Test

Bit

for

ZERO

262

oibb

bbbb

CTSBO

Test

Bit

for

ONE

270

aaaa

EQUALT

271

aaaa aaaa

EQUALF

272

aaaa

GRTRT

273

aaaa aaaa

GRTRF

274

aaaa

LESST

275

aaaa

LESSF

276

aaaa aaaa

XADD

Add

to

CU

Index

277

aaaa

SUB

Subtract

from

CU

Index

30U

ooVY

nnnn

RTA-

Route

from

A

Registers

300

oivv

nnnn

RTB-

Route

from

B

Register

s

300

iovv

nnnn

RTD-

Route

from

Memory

Data

Registers

300

vvoo

nnnn nnnn

RT-A

Route

to

A

Registers

300

vvoi

nnnn

RT-B

Route

to

B

Registers

300

vvio

nnnn nnnn

RT-D

Route

to

Memory

Data

Registers

300

vvii

nnnn

RT-M

Route

to

Memory

Addr(;ss

Registers

310

LLL

SLIT

Short

(24-

bit)

Literal

to

CU

Accumulator

311

MMM

STO

Store

One

"Vord

from

CU

Accumulator

to

Main

Memory

312

MMM

LOAD

Load

One

Word

from

Main

Memory

to

CU

Accumulator

320

CMMM

BIN

Block

Transfer

into

CU

Memory

from

Main

Memory

321

CMMM

BOUT

Block

Transfer

Out

from

CU

Memory

to

Main

Memory

3"30

LLLLLLLL

LIT

Full-

Word

Literal

to

CU

Accumulator

5-15

SECTION

VI

ILLIAC

IV

APPLICATIONS

STUDY

The

implementation

on

ILLIAC

IV

of

the

Cooley-Tukey

algorithm

for

the

calculation

of

complex

Fourier

series

has

been

studied.

A

method

is

described

in

this

section.

DESCRIPTION

OF

THE

COOLEY-TUKEY

ALGORITHM

This

is

a

method

for

evaluating

the

function

X{j)

for

N

values

of

the

argument

(j

=

0,1,

...

,N-1)

when

we

are

given

the

N

complex

coefficients

A(k),

k =

0,1,

••.

,

N

-I,

appearing

in

the

Fourier

sum

that

is

used

to

define

the

function

X(j).

N-1

X

(j)

= I A

(k).

Wj k

k=O

j =

O.

1

•••.

,

N-1.

Here

W

is

defined

to

be

the

principal

N

-th

root

of

unity.

w = e

27Ti

/N

I

(i

=J-T).

(1)

(2)

rhe

inverse

problem

can

also

be

solved

by

the

same

method.

for

if

the

N

values

of

X(j)

corresponding

to

j =

0.1

•••.•

N-1

are

given.

then

the

Fourier

coefficients

appearing

in

equation

(1)

are

defin

ed

by

N-1

A(k)

=;

L

X(j)

W~jk

j=

0

which

is

similar

in

form

to

equation

(1).

k =

O.

1

•••••

N-1

(3)

6-1

6-2

The

following

discussion

is

concerned

with

the

problem

as

stated

in

the

form

given

in

equation

(1).

The

straightforward

use

of

equation

(1)

is

equivalent

to

pre-multiplying

the

N-

component

vector

A(k)

by

the

NxN

matrix

Wjk

to

obtain

the

N

-component

solution

vector

X(j).

This

is

easily

implemented

on

ILLIAC

IV

which

computes

the

com-

ponents

of

X

in

parallel.

The

total

number

of

operations

required

would

be

about

N2

where

an

operation

is

considered

to

be

a

complex

multiplication

followed

by

a

complex

addition.

The

algorithm

des

cribed

by

Cooley

and

Tukey*

can

achieve

the

same

result

with

much

less

computation.

The

number

of

operations,

in

the

most

favorable

case,

is

proportional

to

N.

log

N

rather

than

to

N2.

It

is

also

economical

in

st'orage

requirements.

These

features

make

it

a

highly

desirable

method

for

this

problem,

especially

for

large

values

of

N.

Cooley

and

Tukey*

showed

that

choosing

N

to

be

a

power

of

2

(N

=

2m)

has

par-

ticular

advantages

for

computation

on

a

binary

machine.

"Vith

this

choice,

the

algorithm

takes

the

form

of

generating

iteratively

a

sequence

of

m

N-component

vectors.

The

first

member

of

the

sequence

is

derived

by

iteration

on

the

vector

A(k)

and

the

final

member

is

the

required

vector

X(j).

It

is

assumed

throughout

what

follows

that

the

choice

N = 2m

has

been

made.

To

define

the

sequence

of

vectors

the

indices

are

written

in

binary

form.

The

equation

(1)

can

then

be

written

,

...

, L

k

m-1

(4a,

b)

By

evaluating

the

indicated

sums

sequentially,

the

following

definition

of

the

sequence

Ar

(r

= 1, 2,

•••

,

m)

is

obtained.

The

notation

used

is

that

of

Cooley

and

Tukey

except

that

the

iteration

parameter

is

represented

by

r

instead

of

Z,.

for

ease

of

typing.

*

J.

W.

Cooley

and

J.

W.

Tukey:

An

Algorithm

for

the

Machine

Calculation

of

Complex

Fourier

Series.

Mathematics

of

Computation,

Vol.

19

(1965).

pp.

297

-

301.

(5)

A

(jo'·

••

' j

l'

k

l'

• •

.,

ko

>:=

I

r

r-

m-r-

k

Pr

A 1

(jo'

• • • , j

2'

k , • • • , k O

>·

W

r-

m-r

where

p

:=

j • k • 2m-1

1 0

m-1

(6)

p

:=

(j

•

r

r-1

r

-1

+ . ) k • 2m - r

2

+,

•••

, JO

m-r

Writing

out

the

two

terms

of

the

sum

in

equation

(5)

we

obtain

A

(jo,

•.•

,j

l,k

1,···,k

O

):=A

l(jO,···,j

2:

0,

k

I'···'

kO) +

(7)

r

r-

m-r-

r-

m-r-

( - 1 ) j r

-1

A 1 {j

0'

• . •

,j

2'

1, k

l'

• • • , k

0>.

W

qr

r-

m-r-

r

:=

2, 3,

•••

, m

where

-

(J'

2r

-2

+ +

J'

).

2m-r

qr

- r - 2 • , . . . , 0

(8

)

The

desired

components

of

X

are

then

defined

by

the

last

menlber

of

the

sequence.

(9)

MACHINE

IMPLEMENTA

TION

It

was

the

suggestion

of

Cooley

and

Tukey

that

the

value

of

A

(j

0'

• •

.,

j

I'

k

l'

• • . , k O)

. r

r-

m-r-

calculated

by

means

of

equation

(7)

be

stored

in

a

location

whose

address

is

.•

2m-1

+

m-r

m-r-l

JO ' • • • t + j

1·

2 + k

1·

2 + ,

.•

q +

kO·

.

r-

m-r-

6-3

6-4

When

this

is

done

the

storage

requirements

are

minimized

and

the

last

array

cal-

culated

gives

the

desired

Fourier

sum

s,

equation

(9),

in

such

an

order

that

the

index

of

an

X

must

have

its

binary

bits

put

in.

reverse

order

to

yield

its

index

in

array

Am.

On

any

iteration,

the

components

of

Ar

may

be

computed

in

parallel

since

the

cal-

culations

defined

by

equation

(7)

may

be

carried

0 ut

with

all

values

of

jo'

. •

.,

j

r-

2

and

kO' •

•.

, k 1

simultaneously.

m-r-

IMPLEMENTATION

ON

ILLIAC

IV

In

order

to

perform

the

calculations

indicated

on

ILLIAC

IV,

it

is

necessary,

on

the

r-th

cycle,

to

have

the

values

of

both

A 1

(jo'

••

•

'.

j

2'

0,

k

l'

•••

,

kO)

r-

m-r-

and

q

available

in

the

same

PE

memory.

The

value

of

W r

must

also

be

available

to

the

PE.

One

then

computes,

according

to

equation

(7),

the

value

of

Ar

with

the

same

two

indices.

To

obtain

the

values

of

A

with

the

desired

pair

of

indices

in

the

same

PE

memory

it

may

be

necessary

to

slln'l

data

from

PE

to

PE

during

the

course

of

the

calculations.

The

constant

powers

of

W

required

by

each

PE,

during

the

entire

course

of

the

computation,

are

predetermined

by

the

method

in

which

the

original

coefficients

A{k)

are

distributed

within

the

PE

memories

and

by

the

scheme

adopted

for

shifting

data

between

PE

memories

as

the

compu

tation

proceeds.

These

details

will

now

be

discuss

ed.

STORAGE

OF

THE

COEFFICIENTS

A(k)

Use

is

made

of

the

fact

that

N

has

been

chosen

to

be

a

power

of

2:

N::

2m.

It

is

further

assumed

that

m

is

greater

than

8.

To

determine

the

location

in

which

a

particular

A(k)

is

stored,

the

representation

of

k

as

a

binary

number

as

in

equation

(4b)

is

used.

Let

the

last

eight

bits

(k

7,

•••

,kO)

of

this

binary

representa-

tion

of

k

define

the

PE

in

which

A(k)

is

stored.

The

las

t

two

of

these

bits

(k

1, kO)

d.etermine

the

number

o~

the

quadrant,

the

preceeding

six

bits

(k

7,

•••

,k

2)

deter-

mine

the

number

of

the

PE

within

the

quadrant.

The

remaini.ng

m - S

bits

(km-I,

•.•

, kS)

may

be

interpr

eted

as

the

add

res

s

of

a

storage

location

within

the

PE

memory.

To

allow

for

the

storage

of

both

real

and

imaginary

parts

of

A(k),

the

real

parts

can

be

stored

in

the

indicated

location

while

the

imaginary

part

can

be

stored

in

the

corresponding

location

of

another

block

of

m

emory,

congruent

to

the

block

in

which

the

real

parts

are

stored.

The

size

of

each

of

these

blocks

of

m

emory

will

be

2m-

a

words.

The

interpretation

of

the

binary

representation

of

the

index

k

is

thus

as

shown

in

figure

6-1.

!

Storage

Location

Withill

PE

(

PE

Number

Within

Quadrant

If"

Quadrant

I km-1

ka

k7

k2

I

kl

kO

I

m-S

bits

6

bits

2

bits

Figure

6-1.

Interpretation

of

k

as

a

Binary

Number

to

Define

Storage

Location

of

A(k).

The

quadrants

of

the

ILLIAC

IV

array

are

numbered

0, 1, 2, 3.

The

numbering

of

the

PE's

within

the

quadrant

is

shown

in

figure

6-

2.

If

p

is

the

PE

number

then

the

last

3

bits

of

the

binary

representation

of

p

are

the

column

and

the

first

3

bits

are

the

row

numbers.

When

the

initial

data

is

stored

as

described,

its

distribution

through-

out

the

arrays

is

shown

in

figure

6-3.

The

number

shown

are

k

mod

256.

~

Row

0 1 2 3 4 5 6 7

.

0 0 1 2 3 4 5 6 7

1 8 9

10

11

12 13

14 15

2

16

Row

I

Col

;

3

24

3

bits

3

bits

4

32

5

40

6 48

7

56

57 58

59

60

61.

62

63

-

Figure

6-2.

The

Numbering

of

the

PElS

withIn

a

Quadrant

6-5

6-6

0 4 8

12

16 20 24 28 1 5 9

13 17

21

25 29

32 36 40 33

37

41

64

68 65

69

o 96

97

1

/ 128

160

192

224 228 232 236 240 244 248 252 225 253 \

QUADRANT

NUMBERS

2 6 10 14 18 22 26 30

34

38

66 70

3 7 11

15

19

23 27

31

35 39

43

67

71

I

3

98 99

226 254 227 255

Figure

6-3.

The

.Distribution

of

the

Coefficients

A (k)

in

the

A

rray

Initially

COMPUTATION

AND

STORAGE

OF

INTERMEDIATE

RESULTS

The

calculations

fall

naturally

into

two

parts,

the

first

m-

8

iterations

and

the

final

8

iterations.

In

the

following

de

scr

iption,

the

binary

bit

s

(always

n~

in

number)

of

the

locations

referred

to

are

to

be

interpreted

in

the

same

manner

as

the

bi-

~ary

.representation

of

k

just

described.

The

first

m-S

cycles

r =

1,2,

...

,m-S.

1.

Calculate

A

(j,

...

, j

I'

k

l'

...

kO)'

by

use

of

equation

(7).

r 0

r-

m-

r-

2.

Store

the

result

in

location

(jo'

...

,j

l'

k .

1'"

.

,k

O

)'

i.

e.,

at

r-

m-r-

address

(j

"",

j

l'

k 1"

•.

, kS)

of

PE

(k

7,

...

, kO)'

Note

that

on

the

(nl-S

/

th

o .

r-

m-r-

cycle

it

is

address

(jo'

...

,

jm-

9)

of

PE

(k

7,

...

, kO)

that

is

meant.

The

calculation

is

performed

for

all

values

of

(jo'

...

,j

-1)

and

of

(k

m _ _

l'

...

, kO)'

It

is

seen

from

equation

(7)

that

for

the

computation

of

tbe

A

for

any

Inaex,

the

storage

locations

defined

for

the

quantities

on

both

the

left

~~d

right

hand

sides

of

the

equation

are

in

the

same

PE

memory

(the

last

S

bits

are

the

salne).

Hence

no

transfer

of

data

between

PE's

is

required.

Thus

all

PE'sScompute

in

parallel

and,

m-

on

each

cycle,

each

PE

computes

the

value

of

A

for

2

different

values

of

the

. d r

In

ex.

The

final

S

cycles

r =

m-7,

m-

6,

...

, n1.

1.

ShiftA

l(jO,···,j

2,k

,

...

,k

O)

from

location

(jo,

...

,j

10,j

2)

r-

m-

r .

m-

r-

in

PE

(j

9 I • • • I j

3'

k I • •

.,

kO>

to

location

(j

0'

...

, j

10'

k )

in

P~

m-

r-

m-r

m-

m-r

(j

9'

...

, j

2'

k

l'

...

, kO)'

Note

that

on

the

(m-7

)th

cycle

the

shift

meant

m-

r-

m-r-

is

from

location

(jo'

...

,

jm-

9)

in

PE

(k

7,

...

, kO)

to

location

(jo'

...

,

jm-l

0'

k7)

inPE

(jm-

9'

k6,

...

, kO)'

Note

also,

on

the

mth

cycle

the

shift

meant

is

from

location

(jo'

...

,

jm-lO'

jm-2)

in

PE

(jm-

9'

...

,

jm-3'

kO)

to

location

(jo'

...

,

jm-l

0'

kO)

in

PE

(j

9'

...

, j

2)'

m-

2.

Calculate

A

(jo"'"

j

l'

k

I'

...

, kO>

by

use

of

equation

(7).

r

r-

m-r-

Note

on

the

mth

cycle

it

is

A

(jo"'"

j 1)

that

is

calculated.

r

r-

3.

Store

the

result

in

location

(jo'

...

, j

10'

j

1)

in

PE

(j

9'

...

,

m-

r-

m-

j

2'

k

I'

...

,k

O

)'

Note

on

the

mth

cycle

it

is

location

(jo'

...

, j

10'

j 1 >

r-

m-r-

.

m-

in

PE

(j

9'

...

, j

2)

that

is

meant.

m- m-

Again

the

calculation

is

performed

for

all

values

of

(jo'

•••

'

jr-l)

and

of

(k

I" • • • I kO).

The

shift

called

for

in

step

1

is

designed

to

bring

together

in

.

m-r-

the

same

PE

the

values

of

A 1

with

the

two

indices.

They

differ

only

in

the

r-th

r-

bit

appearing

on

the

right-hand

side

of

equation

(7).

Once

the

shifts

have

been

accomplished

all

PEls

compute

in

parallel

and

each

computes

the

values

of

A

for

2

ffi

-S .. r

values

of

the

index.

6-7

6-8

The

shifts

required

in

step

1

are

determined

by

the

values

of

the

bits

j 2

and

k .

Since

only

four

possible

combinations

of

the

values

of

these

blt§-are

pos-

sillrl,

the

corresponding

shifts

may

be

tabulated.

Table

6-1.

Shifts

Required

for

Combinations

of

Bits

j 2

and

k

r-

m-r

Bit

Shift

jr-2

k

m-r

0 0

No

shift.

0 1

Shift

to

PE

of

lower

number

(_2m-r).

Increase

location

by

1.

1 0

Shift

to

PE

of

higher

number

(+2 m-

r).

Decrease

location

by

1.

1 1

No

shift.

'----.

Since

the

possible

combinations

of

bit

values

occur

with

equal

frequency

it

is

seen

from

the

above

table

that

on

any

cycle

(r)

precisely

half

the

data

has

to

be

shifted

by

inter-PE

shiftin£.

Of

the

data

that

is

shifted,

half

goes

to

a

PE

of

higher

number

(+2m-r)

and

half

goes

to

a

PE

of

lower

number

(_2m-r).

This

shifting

is

such

that

on

cycle

r,

two

PE's

whose

number

in

the

array

differ

only

in

the

r-

(m-

8)

bit

position

exchange

a

word

of

data,

for

each

two

word

s

they

contain.

The

required

shifting

of

data

between

PE

memories

is

accomplished

by

one

or

two

routing

instructions.

In

the

fir

st

cycle

requiring

such

a

shift

each

PE

in

the

block

of

128

PE's

(those

numbered

0-127)

sends

and

receives

words

from

the

corres-

ponding

PE

in

the

second

block

of

12~

PE's

(128-

255).

Examination

of

the

PE

numbering

system

of

figure

6-3

shows

that

the

first

four

rows

of

PE's

in

each

quadrant

exchange

words

with

the

second

four

rows.

The

end-around

cylindrical

connection

of

the

PE's

on

the

North

and

South

edges

of

each

quadrant

are

such

that

sending

and

receiving

can

be

accomplished

in

one

instruction,

since

all

PE's

in

a

quadrant

shift

a

word

4

rows

South,

end-

around,

simultaneously.

In

the

second

cycle

requiring

a

shift,

each

PE

in

2

blocks

of

64

PE's

(0-

63;

128-191)

exchanges

words

with

the

corresponding

PE

in

the

corresponding

block

of

the

re-

maining

2

blocks

of

64

PEls

(64-127;

192-255).

This

requires

the

first

two

rows

of

PEls

within

a

quadrant

to

exchange

words

with

the

second

two

rows

and

the

third

pair

of

rows

to

exchange

words

with

the

fourth

pair

of

rows.

In

this

case

the

end-

around

connection

cannot

be

used

and

the

exchange

takes

place

under

mode

control

in

the

two

parts.

First,

rows

I,

2,

5,

6

transmit

data

two

rows

South

to

rows

3,

4,

7, 8

respectively.

Second,

rows

3,

4,

7,

8

transmit

data

two

rows

North

to

rows

I,

2,

5,

6

respectively.

In

the

third

cycle

information

is

exchanged

between

adjacent

rows.

The

following

three

cycles

repeat

the

same

pattern

of

shifts

but

between

columns

rather

than

rows.

shift

of

8

rows

is

required.

This

involves

the

whole

array

as

data

moves

from

quadrant

to

quadrant

for

the

first

time.

Use

could

be

.

made

of

the

end-

around

cyclindinal

connection

of

the

whole

array,

as

in

the

first

shift

described.

The

final

shift

is

similar,

being

of

distaI1ce

8

between

columns.

This

pattern

of

shifts

is

listed

in

table

6-2.

Table

6-

2.

Shifting

Required

for

the

Final

Eight

Iterations

Block

s

of

PEl

s

Distance

of

Cycle

Shift

Required

Number

Size

Nearest

Number

m-7

2

128

*

4 NS

m-6

4

64

2

NS

m-5

8

32

1 NS

m-4

16

4*~~

WE

m-3

32

~

2

WE

m-2

64 4 1

WE

m-l

128

2 8*

NS

m

256

1

8**

WE

*

Denotes

that

end-

around

connectivity

may

be

used

to

allow

both

of

the

shifts

to

occur

simultaneously.

**

Shifts

can

also

use

end-

around

connectivity,

but

will

require

one

more

step

in

routing

than

"*"

shifts,

because

of

the

differences

in

edge

connectivity

patterns.

6-9

6-10

We

now

consider

the

powers

of

W

that

have

to

be

available

in

each

PE.

According

to

equation

(7)

the

computation

on

the

r-th

iteration,

of

A

(jo"'"

j

-1'

k _

-1"'"

k )

requires

the

use

of

W

qr

where

q

is

given

by

equation

f8).

With

1he

dJra

r

sPorage

scheme

described

the

addre§s

of

A

within

the

PE

determines

the

power

. r .

of

W

required

in

its

calculation.

According

to

equation

(8),

on

the

first

m-8

He

rations,

put

the

first

r-1

bits

of

the

address

in

reverse

and

m.ultiply

the

resul-

tant

number

by

2m-r•

This

gives

the

required

powerOof

W.

On

the

first

iteration,

no

bits

are

selected

by

this

rule--corresponding

to

W •

For

these

iterations

all

PEls

require

the

same

power

of

the

W

at

the

same

time.

Thus,

these

powers

should

be

broadcast,

and

not

stored

repetitively

in

each

PE.

On

the

final

8

iterations

one

has

Ar_1

(jo,

.•.

,

jr-2

km-s'

•.•

, kO>

stored

in

location

jo,".'

jnl-10'

km_r>

in

PE

(jm-9,""

jr-2'

km-r-1,

•.

·,

kO

)"

After

the

required

shifting

has

taken

place,

take

the

r-1

bits

that

have

been

underlined

(the

first

m-9

in

the

memory

address

followed

by

the

first

r-m

+ 8

in

the

PE

number),

invert

these

bits

and

multiply

the

number

obtained

by

2m-r.

This

is

the

power

of

W

required.

Since

these

powers

of

W

depend

on

the

PE

number

they

should

be

stored

within

the

PE

memories.

On

each

iteration

one

power

of

W

for

each

pair

of

indices

of

Ar

within

the

PE

is

required.

The

number

of

such

pairs

is

2m-9

and

the

number

of

iterations

in

this

mode

is

8.

Thus

2m-6

values

of

powers

of

Ware

required

by

each

PE.

A s

in

the

standard

algorithm,

at

the

completion

of

the

m-th

iteration,

to

f.ind

the

location

of

X(j)

in

the

A

array,

one

interprets

the

bits

of

j

in

reverse

as

a

lo-

cation

in

the

PE

memorTes.

However,

after

reversing

the

bits

of

j

it

is

necessary

to

shift

the

final

bit

(j

-1)

left

eight

plac

es

(to

between

j

-10

and

j -9>

before

terpreting

the

locatio!?as

in

figure

6-1.

m m

COMPUTA

TIONS

REQUIRED

To

give

some

idea

of

the

magnitudes

involved

some

figures

are

given

here

for

N =

4096

=

212.

(m

=

12).

1.

2.

3.

Number

of

iterations

required

(m)

= 12

Number

of

coefficients

A

(k)

in

each

PE(2

m-

8)

= 16

m-6

Number

of

values

of

W

stored

in

each

PE

(2 ) = 64

4.

Computation

required

for

each

pair

of

coefficients

(2 m-9 =

8)

per

P~,

is

indicated

by

equation

(7),

1

complex

multiplication

and

2

complex

additions.

This

in

terms

of

real

arithmetic,

amounts

to

4

multiplications

and

6

additions

for

each

pair

of

coefficients.

On

the

final

8

iterations

data

transfers

occur

and

2

words

(real

and

imaginary

part

of

one

of

the

pair)

are

transmitted

and

2

words

received

for

this

amount

of

calculation.

Since

there

are

8

pairs

of

coefficients

in

each

PE,

for

each

iteration

a

PE:

1.

Transmits

16

word

s

of

data

2.

Receives

16

words

of

data

3.

Performs

32

multiplications

4.

Performs

48

additions.

}

Not

required

on

first

4

of

the

12

iterations

During

the

first

4

iterations

one

is

effectively

solving

256

problems

in

parallel

(one

in

each

PE),

each

problem

being

of

size

N = 16.

Duripg

the

final

8

itera-

tions

one

is

solving

equivalently,

8

problems,

each

problem

being

of

size

N =

512.

These

last

problems

are

distributed

uniformly

throughout

the

array.

6-11

SECTION

VII

CIRCUIT

DESIGN

-

THE

ECL

CIRCUIT

The

basic

logical

circuit

for

ILLIAC

IV

is

the

current

switch,

or

ECL

circuit,

shown

in

figure

7 -1.

Logical

operations

can

be

performed

in

this

circuit

by

sup-

plying

input

transistors

in

parallel,

by

tying

collectors

together

into

a

common

collector

resistor,

and

by

connecting

together

the

output

ernitters.

Gating

on

the

input

transistors

appears

as

"not

OR"

for

positive

-going

signals

when

seen

at

the

inverting

output

"b,

"

or

as

OR

when

seen

at

the

noninverting

output,

Ila."

For

negative

-going

signals,

gating

at

the

input

transistors

appears

as

"not

AND"

and

AND,

respectively,

at

the

inverting

and

noninverting

outputs.

Logic

gating

can

also

be

done

by

sharing

a

single

resistor

among

otherwise

independent

collectors.

Such

sharing

produces

an

OR

for

negative

signals

or

an

AND

for

positive

signals,

and

would

appear

to

be

a

way

of

adding

another

gate

to

the

logic

without

adding

any

components

or

any

delay_

However,

one

must

take

care

of

the

case

that

two

col-

lectors

are

delivering

current

into

the

resistor

simultaneously,

either

by

clamping

the

voltage,

which

costs

components,

or

by

ensuring

in

the

logical

design

that

no

more

than

one

current

flows

at

anyone

time.

Mixing

outputs

by

tying

the

emitters

together

performs

exactly

the

same

function

as

mixing

inputs

of

the

in

the

input

transistor,

namely

negative

AND

or

positive

OR.

Implementation

of

any

logic

equation

using

such

gates

can

always

be

found

by

translating

the

AND

-

OR

description

implicit

in

the

19ical

equation

into

a

NAND-NAND

description,

level

for

level.

As

reference

to

the

logical

design

of

the

processing

element

elsewhere

in

this

report

will

show,

designs

can

often

be

simplified

considerably

from

the

directly

translated

version.

The

use

of

ECL

circuits

differs

depending

upon

the

amount

of

wiring

which

must

be

driven

and

the

speed

to

be

obtained.

When

integrating

within

the

semiconductor

array,

freedom

to

use

the

ECL

gates

just

described

is

virtually'

unlimited.

When

_

any

wiring

is

involved,

however,

the

low

impedance

point,

namely

the

emitter

outputis

used

as

the

source

when

signals

must

be

sent

along

conductors

from

one

circuit

package

to

another.

Further,

fanout

suffers

on

signals

which

must

leave

the

circuit

package

because

of

the

need

to

supply

damping

and

terminating

resistors

for

the

wiring.

7

-1

7-2

v

cc

IN

o---;--t

+1.

2v

In

From

Logic

0--+--1

V

ee

-0

(b)

Inverting

Output

(a)

Non-

Invertlng

Output

Figure

7-1.

EeL

Gate~

Schematic

Diagram

61

183

I

+3.5v

1

I

1

____

---.

I

200

Line

36

I

-3.

v I

o-----~------------I~--------~

--l---l

-2.

Ov

I

Note:

All

Resistance

Values

in

Ohms.

I~

From

Line

Figure

7-2.

Driver~

Schematic

Diagram

2K

I

__

..J

320

-3.5v

120

V

ref

Note:

All

Resistance

Values

in

Ohms.

Figure

7-3.

Receiver~

Schematic

Diagram

Equivalent

Load

+1.

2v

Out

to

Logic

A t

the

collector,

signal

levels

are

approximately

+1. 2v

and

+0. 4v.

The

swing

is

set

essentially

by

the

current

in

the

emitter,

it

is

to

V , V

and

resistor

ratios.

The

more

positive

level

departs

from

V

by

an

arrrcrunt

cc

determined

by

base

current

in

the

outp.1 t

transistor,

a

few

peF8ent

at

most

of

the

"on"

current.

At

the

II

output"

emitter

in

figure

7

-I,

the

voltage

is

downshifted

by

an

amount

equal

to

the

base

-to-emitter

voltage

of

the

output

transistor.

The

outp.1 t

therefore

swings

from

+0.

4v

approximately

to

-0.4v

approximately.

These

output

voltages

are

made

meaningful

with

respect

to

the

nominally

zero

volt

input

thresh-

old

whatever

gate

receives

the.

Signals

which

must travel

considerable

distance,

such

as

between

cabinets,

for

instance,

may

need

more

margin

than

that

supplied

in

the

signals

at

the

output

of

the

ECL

gates.

The

extra

margin

is

needed

to

overcome

distortions

in

the

signal

due

to

imperfect

impedance

matching

in

the

wiring,

unwanted

components

due

to

crosstalk,

noise

from

external

sources,

and

discrepancies

in

temperature

and

perhaps

even

in

"zero

volts"

between

the

two

cabinets.

The

requirements

for

these

non

-EeL

signals

appear

to

be

as

follows:

•

Compatibility

with

ECL

levels

at

the

input

of

the

driver

and

the

output

of

the

receiver

•

Larger

signal

swing

(3.

Ov

based

upon

experience)

f)

Drive

capability

for

several

transmission

line

characteris

impedances

in

parallel

(at

least

two;

or

36

ohms

load

.on

each

signal,

if

72

-ohm

line

is

used)

•

Fanout

of

64

(from

control

unit

to

all

64

P.

E'

s);

this

implies

an

input

inlpedance

of

over

2.

3k

ohms

per

receiver

if

32

receivers

are

to

be

attached

to

a

single

72

-ohm

line.

)

A

driver

circuit

which

satisfies

these

requirements

is

shown

in

figure

7

-2.

Note

that

the

portion

of

the

circuit

to

the

left

of

the

dotted

line

in

the

driver

circuit

is

identical

(except

for

a

somewhat

lower

"impedance

.level)

to

our

standard

EeL

gate.

When

this

driver

circuit

is

implemented

as

a

portion

of

a

large-scale

integrated

array,

the

portion

of

the

circuit

on

the

left,

which

is

"the

same"

as

the

standard

EeL

circuit,

is

·available

for

performing

some

logic

function.

Only

the

non-

inverted

output,

which

feeds

the

large

-swing

signal,

would

be

unavailable

for

normal

EeL

use.

A

receiver

circuit

which

satisfies

these

requirements

is

shown

in

figure

7

-3.

Note

that

the

portion

of

the

circuit

to

the

right

of

the

dotted

line

in

the

receiver

circuit

is

identical

to

our

standard

EeL

gate.

When

this

receiver

circuit

is

im-

plemented

as

a

portion

of

a

large

-scale

intregrated

array,

the

portion

of

the

circuit

on

the

right,

which

is

the

same

as

the

standard

EeL

circuit,

is

available

for

performing

logic

functions

within

the

array.

7

-3

+

1.2v

-3.5v

IN

LONG

LINE

-3.5v

DRIVER

RECEIVER

Figure

7-4.

EeL

Driver-Receiver,

Balanced

Signals,

Schematic

Diagram

'---+---o

OUT

~FALSE)

The

signal

across

the

interface,

in

this

system,

swings

from

+2.

5v

to

-0.

4v.

Threshold

is

at

1.

2v,

at

room

temperature,

and

has

a

slight

negative

temperature

coefficient.

Since

rise

times

no

faster

than

15

ns

are

of

interest,

the

gain

of

the

transistors

at

25

mc

or

thereabout

controls

the

input

impedance

of

the

receiver.

With

transistors

whose

cutoffs

are

in

the

hundreds

of

n'legacycles,

gains

of

30

are

easily

available.

The

+3.

5v

supply

can

be

disabled

to

control

intercabinet

transfers.

An

alternative

dirver

-receiver

circuit

has

been

suggested

in

which

each

signal

is

transmitted

balanced

with

respect

to

ground.

This

circuit

is

shown

in

figure

7

-4.

It

has

a

signal

swing

of

1.

6v,

the

differential

signal

between

the

two

outputs,

but

makes

up

for

lowered

signal

swing

with

increased

noise

immunity.

The

noise

immunity

is

almost

equal

to

half

the

minimum

signal

swing,

compared

to

a

noise

immunity

of

30%

to

350/0

of

the

signal

swing

in

the

system

exemplified

by

the

driver

and

receiver

of

figures

7

-2

and

7

-3.

This

scheme

has

a

further

advantage

in

rejecting

noise,

in

that

some

noise

sources

tend

to

induce

a

common

mode

com-

ponent

in

the

line.

This

single

-ended

scheme

rejects

common

mode

noise

essentially

perfectly

up

to

some

maximum

amplitude,

while

the

scheme

depends

on

the

coupling

between

the

two

conductors

of

the

line

to

induce

a

noise

com-

ponent

in

one

conductor

equal

to

the

noise

component

induced

on

the

other

by

some

external

source.

The

latter

scheme

is

extreluely

effective,

but

not

as

effective

as

balanced

signaJs.

An

advantage

of

the

balanced

signal

driver

and

receiver

is

that

they

are

identi-

cal

in

design

and

fabrication

with

standard

EeL

gates.

Disadvantages

of

the

balanced

signal

scheme

are

severe

for

certain

proposed

uses.

In

particular,

it

is

not

possible

to

put

more

than

one

driver

on

one

signal

wire.

For

the

connections

from

the

PE's

back

to

the

memory

access

buffer,

it

would

be

necessary

to

give

each

PE

its

own

expensive

wire

back

to

its

own

private

receiver

at

the

memory

access

buffer.

Using

the

unbalanced,

3.0

v

signal,

one

can

combine

all

eight

drivers

on

a

single

data

line.

This

defect

will

be

general,

whenever

several

data

sources

converge

on

a

common

destina-

tion.

A

second

defect

of

the

balanced

signal

scheme

arises

because

the

wiring

must

be

balanced

with

respect

to

ground.

Thus

one

must

use

pairs

of

wires,

either

twisted

pair,

or

shielded

twisted

pair.

Twisted

pair

is

inferior

to

coaxial

cable

or

ribbon

cable

in

terms

of

crosstalk;

shielded

twisted

pair

is

considerably

more

expensive

both

to

buy

and

to

install

than

coaxial

cable

and

ribbon

cable.

Even

unshielded

twisted

pair,

when

procurred

in

belted

form,

can

be

surprisingly

expensive.

A

third

limitation

of

the

balanced

signal

scheme

arises

because

of

the

difficulty

in

designing

for

low

.output

impedance.

As

described

to

us,

the

drive

capability

of

the

balanced

driver

was

only

sufficient

to

drive

a

single

transmission

line.

In

the

case

of

data

being

transmitted

from

one

PE

to

neighboring

PE's,

the

data

transmission

paths

go

in

both

directions

from

the

transmitting

PEa

For

optirEum

per.formance,

it

is

necessary

to

drive

two

transmission

lines,

one

going

in

each

7

-5

7-6

direction.

Therefore,

two

balanced

driver

circuits

are

required

to

handle

the

signal.

Greater

fan

-out

at

the

receiver

end

is

also

available

with

the

un-

balanced

design.

The

conclusion

is

that

either

driver

and

receiver

design

is

satisfactory

for

use

in

cables

up

to

some

maximum

length,

where

not

more

than

one

driver

per

sig-

nal

set

is

required.

Ordinary

twisted

pair

can

probably

be

driven

by

the

balanced

driver

design

in

medium

length

runs

of

up

to

10

or

30

feet

where

the

unbalanced

driver

would

require

coaxial

cable

or

the

equivalent.

Beyond

that,

in

either

case,

higher

quality

wire

is

required,

such

as

ribbon

cable

using

three

wires

per

signal,

coaxial

cable,

or

shielded

twisted

pair.

For

this

last

class

of

signals,

single

-ended

signals

would

be

more

economical

of

wiring,

would

be

directly

compatible

with

popular

types

of

"foreign"

logical

circuitry

which

is

likely

to

be

found

in

external

equipment,

and

would

be

more

compatible

with

any

require-

ment

such

as

r.

f.

filtering.

A

table

of

"long

lines"

signals

is

in

table

7-1.

Table

7

-1

contains

our

conclusions

on

the

implementation

of

such

"long

lines"

Signals.

Either

balanced

or

unbalanced

signals

will

be

acceptable

within

each

quadrant,

assuming

each

quadrant

to

be

packaged

within

a

single

unit,

a

set

of

Table

7

-1.

Signals

Requiring

Driver

and

Receiver

Estimated

Suitable

Class

of

Signals

Length

(feet)

Driver

Design

Pe

to

PE

(same

cabinet)

short

Pe

to

PE

(different

cabinets)"

6

either

PE

to

110

Buffer

30

either

PE

to

memory

access

buffer

30

single

ended

PE

from

CD,

control

signals

and

mode

control

I 30

either

CD

to

CD,

different

quadrants

80

single

ended

CD

to

peripheral

devices

200

single

ended

Comments

no

driver

needed,

stand-

ard

ECL

signals

suitable

----

--

drivers

must

be

aRable

-

---

economically

more

attractive

wiring

compatibility

with

foreign

signals.

cabinets

bolted

together.

For

longer

runs,

the

lower

price

of

coaxial

c;~::.,

..

the

greater

compatibility

of

unbalanced

signals

at

a

"foreign"

interface

~.,(,.'::,'

,

call

for

the

use

of

3.

Ov

unbalanced

signals

between

quadrants,

and

frGrn

th

..

:

array

processor

to

the

outside.

Figure

7

-5

shows

an

example

of

how

the

unbalanced

system

nlight

be

used

~('

distribute

control

signals

from

the

CU

to

the

64

PE's

assunling

tha

t

the

I'~:

:';: :,"

receiver

in

each

PE.

A

fanout

of

64,

as

shown,

may

be

beyond

the

C'~qnL~l:

~:'

of

either

scheme

at

the

required

speed,

especially

the

balanced

signal

SC!:(·~

....

However,

the

grouping

of

the

PE

s

is

such

that

one

receiver

should

do

for

;1<'.::'

or

eight

PE'

s.

Drive

r

(in

CU)

32

Receivers

(in

PE's)

r----4f-+-----1~---_t>_+-------

- - - - - -

----e---f---

r----*-----*-----+------------------------

...

-

...

-

...

-

..

32

Receivers

(in

PE's)

L.-...1f---Clf-+-----1~---~>_+------

- - - - - -

----.e---t--

L..

.......

_________

-----+----------.--------->-----~-."

..

'"

..

Figure

7-5.

Use

of

Drivers

and

Receivers

for

Distributing

COc.tI>,.)1

Signals

from

C U

to

PEl

S

ILLIAC_IV_System_Study_Progress_Report_No_2_Oct66 ILLIAC IV System Study Progress Report No 2 Oct66

Navigation menu

Versions of this User Manual:

Views

Navigation