ILLIAC_IV_System_Study_Progress_Report_No_2_Oct66 ILLIAC IV System Study Progress Report No 2 Oct66

User Manual: ILLIAC_IV_System_Study_Progress_Report_No_2_Oct66

Open the PDF directly: View PDF PDF.
Page Count: 74

OCTOBER 13,
1966
illlAC IV
SYSTEM
STUDY
PROGRESS
REPORT
NO. 2
SUBMITTED
TO
UNIVERSiTY OF ILLINOIS
UNIVERSITY
OF
ILLINOIS
PURCHASE
ORDER
NO.
09852-B
SECTION
I
SECTION
II
SECTION
III
CONTENTS
INTRODUCTION
THE
ILLlAC
IV
SYSTEM
ROUTING
Shifting
at
the
PE
Level.
Shifting
at
the
Quadrant
Level.
Inter-Quadrant
Transfers
Machine
Instruction
THE
MULTIPLY
ALGORITHM
EIGHT-BIT
WORD
LENGTHS
COMMON
DATA
PATHS.
Input-Output
Buffering
•.
From
Control
Unit
to
Proces
sing
Element.
From
Processing
Element
to
Control
Unit.
Evaluation
.
MODE.
Introductory
Considerations
The
Problem
Back-up
Storage
in
the
Control
Unit.
Back-up
Storage
in
Processing
Element
Functional
Capability.
Comparison.
REVISED
PE
LOGIC
DESIGN
PROCESSING
ELEMENT
DESCRIPTION
MEMORY
DA
TA
REGISTER
(MDR)
OPERAND
SELECT
GATES
(OSG)
A-B
REGISTER
(ACCUNIULA
TOR)
LOOIC
UNIT
BARREL
SWITCH.
LEADING
ONES
DETECTOR
MULTIPLJCAND
SELECT
GATES
Page
1-1
2-1
2-1
2-3
2-3
2-7
2-8
2-8
2-12
2-14
2-14
2-15
2-16
2-16
2-17
2-17
2-18
2-18
2-19
2-20
2-20
3-1
3-1
3-1
3-4
3-4
3-4
3-5
3-5
3-5
-
iii-
CONTENTS
(Cont.
)
Page
SECTION
III
(Continued)
PSEUDOADDER
TREE
3 -5
CARRY
PROPAGATE
ADDER
(CPA)
3
-5
MODE
REGISTER.
3-6
ADDRESS
REGISTER.
3-6
PE
INDEX
REGISTER
3-6
MEMORY
ADDRESS
REGISTER
(MAR)
.
3-7
SPECIAL
REGISTER
(SR)
3-7
SECTION
IV
THE
I/O
SUBSYSTEM
4-1
GENERAL
CONSIDERATIONS.
4-1
DISK
STORAGE.
4-2
SECTION
V
PROGRAMMING
5 -1
REVISED
PE
INSTRUCTIONS
. 5 -1
CONTROL
UNIT
INSTRUCTIONS.
5-4
Elements
of
the
Control
Unit
5-6
Discussion
of
the
Instructions.
5-6
SECTION
VI
ILLIAC
IV
APPLICATIONS
STUDY.
6-1
DESCRIPTION
OF
THE
COOLEY-TUKEY
ALGORITHM
6-1
MACHINE
I:=\lPLEMENTA
TION
6-3
IMPLEMENTATION
ON
ILLIAC
IV
.
6-4
STORAGE
OF
THE
COEFFICIENTS
A(k)
.
6-6
COMPUTATION
AND
STORAGE
OF
INTERMEDIATE
RESULTS.
6-6
COMPUTA
TIONS
REQUIRED
6-10
SECTION
VII
CIRCUIT
DESIGN
-
THE
ECL
CIRCUIT
7
-1
-iv-
Figure
2-1
2-2
2-3
2-4
2-5
2-6
3-1
6-1
6-2
6-3
7-1
7-2
7-3
7-4
7-5
Table
2-1
2-2
LIST
OF
ILLUSTRA
TIONS
PE
Level
Shifting
Quadrant
Nearest
Neighbor
Connections.,
Pseudo
Physical
Layout
of
a
Quadrant
"Nearest
Neighbors"
Arrangement
of
Full
Array,
Showing
Quadrant
Subarrays
Logic
Elements
Involved
in
Multiply
Algorithm
System
for
Handling
Data
From
Control
Unit
to
Processing
Element,
Block
Diagram
Revised
PE
Logic
Design,
Block
Diagram.
Interpretation
of
k
as
a
Binary
Number
to
Define
Storage
Location
of
A(k}
The
Numbering
of
the
PEls
Within
a
Quadrant
The
Dis
tribution
of
the
Coefficients
A(k}
in
the
Array
Initially
ECL
Gate,
Schematic
Diagram
Driver,
Schematic
Diagram.
Receiver,
Schematic
Diagram.
ECL
Driver-Receiver,
Balanced
Signals,
Schematic
Diagram
Use
of
Drivers
and
Receivers
for
Distributing
Control
.
Signals
from
CU
to
PEls
Shift
Table,
Showing
Shifts
by
One's
and
Shifts
by
Eight's.
Relationship
of
Gate
Size
to
Multiplier
Size
Page
2-2
2-2
2-6
2-6
2-10
2-10
3-2
6-5
6-5
6-6
7-2
7-2
7-2
7-4
7-7
2-5
·2-11
-v-
-vi-
Table
3-1
4-1
6-1
6-2
7-1
LIST
OF
ILLUSTRA
TIONS
(Cont'd)
PE
Logic
Requirements
512-Bitl
Parallel
Organizations
for
the
Librascope
4802
Shifts
Required
for
Combinations
of
Bits
j 2
and
k
r-
m-r
Shifting
Required
for
the
Final
Eight
Iterations.
Signals
Requiring
Driver
and
Receiver
Page
3-3
4-6
6-8
6-9
7-6
SECTION
I
INTRODUCTION
This
report
is
the
second
of
three
reports
to
be
submitted
during
the
Phase
I
effort
of
the
ILLIAC
IV
Program.
At
this
measured
milestone
in
the
development
of
the
ILLIAC
IV
system
real
progress
is
indeed
evident.
There
exists
now
a
detailed
specification
of
the
Processing
Element
for
which
a
fully
compatible
and
complete
instruction
repertoire
has
been
developed.
In
addition,
the
data
transfer
paths
establishing
the
links
between
the
Input-Output
and
the
Array,
between
the
Control
Unit
and
the
Array,
and
between
the
elements
of
the
Array
have
been
specified.
Progress
in
defining
the
system
hardware
has
also
been
made.
A
family
of
logic
circuits
has
been
selected
and
is
currently
undergoing
an
intense
packaging
effort.
The
size,
type
and
speed
of
the
memory
system
for
the
Array
ha's
been
selected.
,Progress
in
defining
the
applications
areas,
particularly
that
progress
made
at
the
University
of
Illinois,
is
especially
evident.
The
contents
of
this
report
in
conjunction
with
the
first
report
contain
much
of
the
rationale
for
the
present
system
definition.
For
the
remainder
of
Phase
I,
much
work
remains.
Although
most
of
the
functions
and
specifications
of
the
110
and
the
Control
Units
have
been
identified
some
additional
work
is
needed.
In
the
area
of
packaging,
both
at
the
cabinet
level
and
the
logic
level,
detail
design
remains.
Such
items
as
power
distribution,
system
cooling
and
general
installation
detail
must
be
specified.
Finally
the
entire
procurement
must
be
matched
to
a
comprehensive
program
plan
of
schedule
and
delivery
which
is
mutually
acceptable.
1-1
SECTION
II
THE
ILLIAC
IV
SYSTEl'vl
This
section
contains
discussions
of
specific
systerns
problems
which
are
con-
sidered
to
be
of
major
importance
in
the
design
of
the
ILLIA
C
IV
System.
ROUTING
One
of
the
instructions
in
ILLiAC
IV
is
to
transfer
data
residing
in
the
PE
array
to
new
locations
in
that
array.
This
is
called
the
"routing"
instruction.
One
version
of
this
instruction
will
take
the
data
from
the
nth
PE
and
transfer
it
to
the
(n
+
m)th
PE~
where
m
is
an
indexable
variant
contained
within
the
instructionJ
and
n
runs
over
all
PEr
s.
Disabled
PEr
s
are
not
to
receive
any
new
data
or
lose
any
of
the
data
they
are
currently
holding
as
a
result
of
this
instruction.
The
immediate
reaction
to
such
a
requirement
is
to
implement
all
the
required
various
paths
by
brute
force.
Such
a
solution
is
not
only
expensive~
but
unneces-
sary,
and
of
inferior
performance.
To
find
a
superior
method
of
inlplementing
routing,
it
will
help
to
consider
individual
properties
of
the
routing
process.
From
a
functional
viewpoint
the
four
properties
of
concern
for
routing
are
the
level~
the
timing,
the
modulo~
and
the
increment.
In
the
array
of
256
PE's
the
levels
of
shifting
with'which
we
are
concerned
may
be
divided
into
shifts
between
quadrants,
shifts
between
the
same
elements
of
a
quadrant
and
shifts
within
a
Processing
Element.
The
timing
refers
to
the
execution
time
of
the
shift
and
its
concurrency
with
other
array
instructions.
The
modulo
is
the
end-around
size
of
the
shift
which
can
be
variable
up
to
a
given
maximum
size.
For
instance
a
64-
bit
shifter
may
be
designed
to
shift
64
bits
end-around
or
eight
8-bit
bytes
end-around
depending
on
the
modulo
control.
The
increment
is
the
smallest
shifting
amount
of
which
all
shifts
are
a
multiple.
Bit
shifts
would
have
an
increment
of
1"
byte
shifts
would
require
8,
etc.
2-1
2-2
BYT
CON
PE 1
A
E
TRO~
BARREL
SWITCH
ORF
UNCTION
PE
1
PE
9
PE
57
B
PE 2
PE
3
A A
I I
-
..
I
I I
I
t I
t I t
I ,
BA
RREL
SWITCH
I
BARREL
SWITCH
I
I
I
, B I B
t
~
'---
--I
I
Figure
2-1.
PE
Level
Shifting
PE
2
NORTH
SOUTH
EAST
WEST
- - - -
----
--
--
----+---.!
QUADRANT
GATE
COUNT
PE
8
4
><
64
X 64 =
16.000
Gat
s/Quadr
nt
Gates/Bit
X
Bits/WD
><
PE
s/Q
PE
58
14---
-
---
-
--
- -
---
-~~
PE
64
Figure
2-2.
Quadrant
Nearest
Neighbor
Connections
The
variations
on
these
properties
are
many,
and
since
the
function
involves
a
very
large
number
of
bits
the
different
variations
involve
substantial
differences
in
cost.
Shifting
at
the
PE
Level
The
first
level
of
consideration
is
the
shifting
of
operands
within
the
PEe
Here
a
barrel
switch
is
provided
which
will
shift
in
increments
of
1-bit-multiples
at
the
64-
bit-operand
level.
A
modulo
control
may
be
employed
by
adding
an
additional
logic
input
to
the
first
level
of
gating
in
the
barrel
switch
and
an
additional
gate
to
the
receiving
re-
gister
input.
Such
modulo
capability
may
include
8-,
32-,
and
64-
bit
bytes
and
be
extended
to
link
PEts
by
a
full
word
transfer
at
the
end
of
the
switching
cycle
(figure
2
-1).
The
execution
time
of
this
shifting
will
require
one
pass
through
the
barrel
switch
for
modulo
64
shifts.
two
passes
through
the
barrel
switch
for
end
off
switching.
and
four
passes
for
end-
around.
Otherwise
a
transfer
of
Register
B
to
the
neighboring
Register
A
will
extend
this
capability
between
PE's.
For
the
most
part
this
logic
capability
exists
within.a
PE,
and
there
is
no
con-
currency
of
execution
at
this
level.
'Shifting
at
the
Quadrant
Level
At
this
level
of
shifting,
considering
the
variety
of
desirable
shifting
properties
available,
there
are
many
possibilities.
Two
seemingly
practical
forms
at
the
Quadrant
level
are
nearest
neighbor
connections
and
the
quadrant
barrel
switch.
These
two
shifting
approaches
represent
the
practical
minimum
and
maximum
cost
for
the
ILLIA
C
IV
System.
NEAREST
NEIGHBOR
CONNECTIONS
-Figure
2-2
shows
the
necessary
data
paths
and
gates
to
implement
this
shift
approach.
It
has
the
capability
of
shift-
ing
left
or
right
4096
bits
in
increments
of
64
and
512
bits.
Increments
smaller
than
this
may
be
shifted
with
the
local
barrel
switch
if
desirable
since
con-
currency
is
not
possible
here.
All
shifting
must
be
done
in
sequence
with
all
arithmetic
and
logic
instructions;
all
shifted
amounts
are
combinations
of
the
basic
two
increments.
Shifting
of
a
modulo
size
smaller
than
the
whole
quadrant
is
done
by
mode
control
of
the
PEts.
THE
QUADRANT
BA
RREL
SWITCH
-
The
main
advantage
of
the
quadrant
barrel
switch
is
its
ability
to
execute
any
shifted
amount
(in
increments
of
8)
in
two,
passes
through
it.
Each
shift
can
be
executed
concurrently
with
other
instructions
2-3
2-4
in
the
PEe
Modulo
control
is
accomplished
by
controlling
the
contents
of
the
Routing
Register
between
partial
shifts.
A
COMPARISON
OF
THE
TWO
APPROACHES
-
1.
The
two
method
s
can
produce
identical
results
with
differences
in
execution
time
and
cost.
Ignoring
time
and
cost
the
methods
are
functionally
equivalent.
2.
There
is
an
approximate
cost
difference
of
six
to
one.
If
we
ignore
the
requirement
for
special
circuits
to
transmit
signals
long
distances,
the
simple
gate
count
for
the
quadrant
barrel
amounts
to
200/0
of
the
total
system.
3.
Physical
distance
is
an
important
consideration.
Figure
2-
3
depicts
a
psuedo
physical
layout
of
the
quadrant
in
which
the
interconnected
PE
for
the
Nearest
Neighbor
Connections
may
be
reasonably
close.
Perhaps
a
maximum
distance
of
6
feet
may
be
realized.
With
the
quadrant
barrel
however
some
paths
are
long.
Considering
a
centrally
placed
barrel
switch
in
the
middle
of
a 27
foot
quadrant
cabinet
row,
one
row
per
quadra~t,
the
worst-
case
distance
would
be
about
30
feet.
This
means
that
every
shift,
however
short,
will
involve
30
feet
of
cable
delay
plus
logic.
The
following
computation
shows
anticipated
delay
times.
Barrel
Switch
Cable:
30
1 X
1.
7
nsecsj
ft.
= 51. 0
nsecs.
Logic:
8
gates
X 3
nsecsj
levels
= 24. 0
75. 0
nsecj
shift
Nearest
Neighbor
Cable:
61 X 1. 7
nsecsj
ft. =
10.2
nse·cs.
Logic:
3
gates
X 3
nsecsj
gate
= 9. 0
19.2
nsecsj
shift
Avera~e
Barrel
Switch
Time
=
75.0
nsecs.
Avera~e
Nearest
Neighbor
Time
= *
4 X 19. 2
=76.
8
nsecs.
*
Refer
to
table
2-1,
page
2-
5.
Table
2-1.
Shift
Table~
Showing
Shifts
by
One's
and
Shifts
by
Eight's
Desired
Shift
Shift
Desired
Shift Shift
Shift
by
One's
by
Eight's
Total
Shift
by
One's
by
Eight's
Total
0 0 0 0
32
0 4 4
1 +1 0 1
33
+1 4 5
2 +2 0 2
34
+2 4 6
3 +3 0 3
35
+3 4 7
4
+4
0 4
36
-4
-3
7
5
-3
+1
4 37
-3
-3
6
6
-2
+1
3
38
-2
-3
5
7
-1
+1
2
39
-1 -3
4
8 0
+1
1
40
0
-3
3
9
+1
+1
2 41 +1
-3
4
10
+2 +1
3 42
+2
-3
5
11
+3
+1
4 43
+3
-3
6
12
+4
+1
5
44
-4 -2
6
13
-3
+2
5
45
-3
-2
5
14
-2
+2
4
46
-2 -2
4
15
-1
+2
3
47
-1
-2
3
16 0
+2
2 48 0
-2
2
17 +1
+2
3
49
+1
-2
3
18
+2 +2
4
50
+2
-2
4
19
+3
+2
5 51 +3
-2
5
20
+4
+2
6 52
-4
-1
5
21
-3
+3
6 53
-3
-1
4
22
-2
+3
5
54
-2
-1
3
23
-1
+3
4
55
-1
-1
2
24
0
+3
3
56
0
-1
1
25
+1
+3
4 57 +1
-1
2
26
+2
+3
5 58 +2
-1
3
27 +3
+3
6
59
+3
-1
4
28
+4
+3
7
60
-4
0 4
29
-3
4 7 61
-3
0 3
30
-2
4 6 62
-2
0 2
31
-1
4 5 63
-1
0 1
Average
number
of
shifts
=
2~~
= 4
4.
However,
real
problems
are
not
random
with
average
shifts.
The
data
to
be
shifted
often
falls
into
patterns
which,
with
proper
programming,
can
be
made
to
somewhat
fit
the
available
shifting
patterns
of
the
partial
shifting
scheme,
and
save
time
over
the
average
time
taken
for
transfer
of
randomly
placed
data.
Some
examples
of
such
savings
follow.
(a)
The
Cooley-
Tukey
algorithm
for
spectrum
analysis
in
one
form
involves
swapping
data
between
PE's
which
are
half
an
array
apart
at
the
first
swap,
a
quarter
of
an
array
apart
at
the
second
swap,
an
eighth
at
the
third,
and
so
on.
Over
each
64
-PE
quadrant,
these
transfers
are
accomplished
with
four,
two,
and
one
partial
shifts
respectively,
or
an
average
of
2.
333
partial
shifts
for
each
actual
data
transfer.
This
is
considerably
faster
than
the
four
partial
shifts
required
for
a
random
data
transfer,
and
is
faster
than
the
data
transfer
effected
through
the
a
11-
possible-paths
scheme.
(b)
Another
use
for
the
data
transfer
paths
which
cover
the
array
is
in
the
computation
of
global
variables
("Global"
here
means
a
variable
which
acquires
its
definition
from
data
which
covers
all
the
PE's).
Maximum,
minimum,
logical
AND~
across-word
parity,
sum,
product,
are
examples
of
functions
which
one
might
want
to
perform
on
corresponding·
words
in
all
PE's,
more
or
less
in
parallel
acro"ss
the
array,
producing
a
one-word
result.
Since
PE's
are
only
capable
of
two-operand
operations,
a
combination
of
the·
64
corresponding
variables
from
the
64
PEs
of
one
quadrant
into
one
resulting
variable
must
take
at
least
six
operational
steps,
·consisting
of
32
operations
on
the
variables
by'
pairs,
then
a
pairwise
combining
of
the
16
resulting
pairs,
and
so
on.
Between
2-5
Figure
2-3.
Pseudo
Physical
Layout
ofa
Quadrant
0 17 16 2 15 3 14 4 13 5
12
6
11
7 10
360 377 361 376 362 375 363 374 364 373 365 372 366 371 367
370
20 37
21
36 22 35
23
34
24 33 25 32 26 31
27
30
340
357 341 356 342 355 343 354 344 353 345 352 346 351 347 350
40
57
41
56 42 55
43
54
44
53
45
52
46
51
47
50
320 337 321 336 322 335 323 334 324 333 325 332
326
331 327 330
60 77
61
76
62
75
63
741 64 73 65 72
66
71
67
70
1
300 317 301 316 302 315 303
3141304
313 305 312
306
311 307 310
---------------r-----~--------
100 117 101 116 102 115 103 1141 104 113 105 112 106 111 107 110
1
260 277 261 276 262 275 263
2741264
273 265 272 266 271 267 270
1
120 137 121 136 122 135 123
134/
124 133 125 132 126 131 127 130
1
240 257 241 256 242 255 243
254:
244 253 245 252 246 251 247 250
140 157 141 156 142 155 143
154:
144 153 145 152 146 151 147 150
220 237 221 236 222 235 223 ,
234
1224
233 225 232 226 231 227 230
160 177 161 176 162 175 163
174:164
173 165 172 166 171 167 170
200 217 201 216 202 215 203
214:204
2
13
205 2
12
206 2
11
207 210
Figure
2-4.
"Nearest
Neighbors"
Arrangement
of
Full
Array,
Showing
Quadrant
Subarrays
2-6
each
operation
a
transfer
of
data
occurs
to
place
it
in
position
for
the
next
opera-
tion.
Thus
a
global
function
takes
six
binary
operations
and
five
data
transfer
times
when
implemented
within
the
system
which
has
all
possible
paths.
The
above
description
of
a
sequence
of
steps
is
correct
even
if
a
single
instruction
calls
forth
the
whole
sequence.
Global
functions
are
in
the
same
category
as
the
Cooley-
Tukey
algorithm
as
far
as
data
transfer
is
concerned.
The
data
is
shifted
by
2
and
combined
with
an
unshifted
copy,
and
so
on
up
to
a
shift
of
32#
for
the
64-
element
subarray.
For
the
nearest
neighbors
scheme~
an
average
of
2. 33
partial
shifts
per
actual
shift
is
called
for.
The
nearest
neighbor
connections
would
thus
seem
to
have
a
slight
advantage
in
speed
over
the
quadrant
barrel
in
executing
global
functions
such
as
maximum,
minimum,
global
AND,
global
OR,
and
across-word
parity.
5.
The
conclusion
is
that
we
do
not
buy
anything
by
implementing
schemes
which
tranfer
data
across
from
one
side
of
the
array
to
the
other
·with
a
single
step.
The
time
it
takes
for
the
data
to
travel
across
the
array
is
suf-
ficient
to
permit
several
logical
operations.
It
is
possible
to
find
a
set
of
partial
data
transfer
paths
which
result
in
simpler
mechanization
and
lesser
wiring,
and
which
do
not
degrade
performance.
Furthermore,
by
avoiding
the
barrel,
we
also
avoid
the
necessity
for
allowing
maximum
delay
on
each
transfer,
and
when
the
data
is
ordered,
or
partially
ordered,
a
speedup
{s
accomplished
which
would
be
impossible
in
the
more
expensive
system.
Inter-Quadrant
Transfers
Shifting
between
quadrants
is
dependent
on
the
type
of
intraquadrant
shifting
selected,
but
the
considerations
at
the
interface
are
similar.
The
choice
of
transferring
all
words
of
a
quadrant
as
a
column
(row)
may
be
made
for
either
case.
Here
also
a
similar
layout
of
PE
~s
was
done
for
the
quadrant
(figure
2-3)
can
make
edge
switching
faster
than
quadrant
switching.
Figure
2-4
shows
an
arrangement
of
the
whole
array,
256
PE's,
with
the
same
properties
for
the
whole
array
that
figure
2-3
has
for
the
64
PEl
sof
the
single
quadrant.
PE's
which
differ
in
number
by
±1
or
±16
from
a
subject
PE
are
physica.l
neighbors
of
it,
just
as
in
figure
2-3,
where
PE's
which
differ
in
number
by
±1
or
±8
of
a
given
PE
are
physical
neighbors
thereof.
Furthermore,
each
quarter
of
the
diagram
of
figure
2-
4,
as
indicated
by
the
dotted
lines,
is
a
copy
of
figure
2~
3.
To
allow
one
to
see
this
latter
point,
the
PE's
in
figure
2-4
are
numbered
from
o
to
377
in
octal.
To
translate
from
an
entire
array
PE
numbering,
as
shown
in
figure
2-4,
to
quadrant
numbering,
one
may
suppress
the
first
and
fifth
bit
of
the
binary
equivalent
of
the
octa.l
PE
number.
When
this
is
done,
each
quarter
of
figure2-4
is
like
figure
2-3,
with
those
on
the
right
reflected
about
the
verti-
cal
axis,
and
those
on
the
bottom
reflected
about
the
horizontal
axis.
The
lower
right
quadrant,
for
example,is
like
figure
2-3
both
upside
down
and
backwards.
2-7
2-8
Machine
Instruction
The
machine
instruction
to
use
this
equipment
thus
req
uires
considerable
equip-
ment
to
translate
the
instruction
into
the
var
ious
shifts
required.
The
number
of
shifts
will
differ,
depending
upon
how
far
around
a
given
quadrant,
or
whole
array,
the
data
is
to
be
shifted.
Variants
of
the
shift
instruction
will
be
able
to
call
upon
the
intraquadrant
shift
even
when
the
entire
array
is
being
used.
This
is
of
advantage
during
one
possible
implementation
of
the
Cooley-
Tukey
algorithm,
for
example.
A
shift
of
one
in
the
32-bit-word
mode
also
involves
a
transfer
of
half-words
betwee'ri.'neighboring
PE's,
since
each
one
is
now
acting
as
two
half-
word
PEls.
The
shift
instruction
therefore
calls
up,
possibly
an
internal
barrel
shift
in
the
PE,
possibly
a
half-word
transfer
between
neighboring
PE's,
a
variable
number
of
east-west
neighbor
shifts,
and
a
variable
number
of
north-
south
shifts.
To
preserve
the
contents
of
the
A
and
B
registers
of
the
disabled
PE's,
intervening
steps
must
avoid
the
A
and
B
registers,
which
may
be
only
an
initial
source
of
data,
or
the
destination
at
the
final
shift.
Either
the
MDR
or
the
S
register,
therefore,
is
involved
in
routing,
with
strong
preference
to
the
MDR,
since
it
is
desirable,
even
though
maybe
not
as
absolutely
necessary,
to
save
the
S
register
as
well
as
the
A
and
B.
THE
MULTIPLY
ALGORITHM
The
most
important
single
logic
function
from
the
standpoint
of
both
performance
and
cost
is
the
multiply.
The
emphasis
placed
on
this
instruction
in
its
design
and
application
singles
it
out
for
special
discussion.
When
first
considering
the
many
different
ways
to
implement
the
multiply
the
ILLIA
C
array
itself
offers
the
first
direction.
There
is
a
class
of
algorithms
which
takes
advantage
of
the
statistical
nature
of
the
ONE
and
ZERO
trains
in
the
multiplier.
The
average
execution
time
of
such
a
multiply
is
always
less
than
a
worst-case
pattern
of
ONE
and
ZERO
in
the
multiplier
and,
therefore,
in
the
course
of
a
program
run,
the
multiply
time
is
the
average
multiply
time.
However,
because
of
the
lock-
step
synchronous
operation
of
the
Array
which
handles
up
to
256
pairs
of
operands
simultaneously,
the
average
execution
time
becomes
the
worst-case
time.
Such
methods
therefore
are
not
applicable
to
the
ILLIA
C
system.
Following
the
above,
the
next
decision
to
be
made
is
how
many
bits
of
the
multi-
plier
are
to
be
examined
and
disposed
of
simultaneously
during
a
single
step
in
the"
cyclic
multiply
sequence.
Apart
from
circuit
speeds,
this
is
the
single
basis
for
determining
the
speed
of
the
multiply.
Two
seoarate,
yet
interdeoendent,
techniaues,
are
available
to
do
this.
One
technique
is
to
encode
fields
of
the
m
ultiDlier
into
a
larger
number
base*
;
and
the
second
techniaue
is
to
combine
manifold
summands
selected
by
the
conditions
of
the
multiolier
bits.
In
practice
these
two
techniaues
are
combined
into
a
single
comprehensive
design.
The
following
general
relation
is
useful
in
dete"rmining
the
siz
e
of
the
partial
multiplier:
Where:
Execution
time:
(MB
+3)
6 t
XB
MB
Nunlber
of
bits
in
the
mantissia
XB
Number
of
bits
in
the
partial
multiplier
G
Number
of
gates
in
typical
delay
chain
6t
Nominal
gate
delay
including
loading,
wire
length,
etc.
Reasonable
values
for
the
above
paramete
rare:
MB
48
bits
G
10
gates
selected
as
typical
PE
logic
chain
6. t = 3
nanoseconds/
gate
Execution
time
=
memory
cycle
time
=
250
nanoseconds.
Therefore:
XB
= 8
bits.
Table
2-2
shows
the
relative
speed-cost
relation
for
the
range
of
possible
partial
multiply
sizes.
The
important
feature
in
this
table
is
the
gate
count
differences
for
the
different
selected
sizes.
In
terms
of
a
10K-gate
Processing
Element
the
8-
bit
selection
appears
to
be
a
reasonable
maximum
hardware
investment.
*
MacSorley,
O.
L.,
"High-
Speed
Arithmetic
in
Binary
Computer.
"
Proceedings
of
the
IRE,
pp.
67
-71,
January
1961.
Wallace,
C.
S.,
"A
Suggestion
for
a
Fast
Multiplier,
"IEEE
Transactions
on
Electronic
Computers,
V.
EC-13,
February
1964.
2-9
2-10'
.MEMORY
DATA
REGISTER
1
~
OPERAND
SELECT
REG.
OPERAND
SELECT
GATE
~
t
t !
MULTIPLIER
DECODE
CARRY
SA
VE
TREE
1
l
~
B
REGISTER
FULL
ADDER
I--
t 1
+
BARREL
SWITCH
A
REGISTER
t
J
Figure
2-5.
Logic
Elements
Involved
in
Multiply
Algorithm
Memory
Access
Memory
Access
Buffer
(512
bits)
Buffer
(512
bits)
, 1 8
words
1 8
words
,
Control
P.E. P.E.
Control
Unit
Array
Array
Unit
1 1 2 2
INTER
-QUADRAN~
EXCHANGE
Memory
Access
Memory
Access
Buffer
(512
bits)
-I
Buffer
(512
bits)
t 8
words
18
words
Control
P.E.
P.E.
Control
Unit
Array
Array
Unit
3 3 4 4
Figure
2-6.
SystelTI
for
Handling
Data
From
Control
Unit
to
Processing
Element
l
Block
Diagram
Table·2-2.
Relationship
of
Gate
Size
to
Multiplier
Size
Number
of
Logic
Gate
Count
for
Estimated
Execution
Time
Partial
Multiplier
Bits
Multiply
Only
(nanoseconds)
6
2,000
330
8
2,800
270
12
4,800
210
Having
decided
what
constitutes
a
reasonable
number
of
multiplier
bits
to
handle
in
a
single
cycle
(in
this
case
8
bits),
the
next
consideration
is
to
determine
how
to
maintain
a
10-
gate
delay
for
the
worst-
case
logic
chain.
To
evaluate
this
criteria
the
following
substeps
in
the
multiply
must
be
completed
within
a
10-gate
chain:
1.
Decode
the
partial
multiplier.
2.
Add
the
next
summand
to
the
partial
product.
3.
Shift
the
multiplier
and
the
partial
product.
4.
Normalize
the
result.
Figure
2-
5
is
a
block
diagram
of
the
logic
elements
involved
in
the
execution
of
this
algorithm.
The
delay
chain
involved
is
the
time
to
go
from
register
to
register.
The
multiply
algorithm
selected
first
decodes
the
8
bits
of
the
multiplier
into
a
base-
4
representation
into
which
0, +
I,
-1,
and
+2
values
of
the
operand
are
selected.
This
decode
is
stored
in
the
Operand
Select
Register
(OSR).
The
OSR
is
12
bits
long
for
storing
the
fully
decoded
four
conditions
for
each
of
four
operands.
Because
the
8
bits
of
the
multiplier
are
taken
as
bit
pairs,
four
summands
must
be
added
to
the
partial
product
in
a
single
cycle.
The
carry
save
tree,
which
contains
three
full
length
carry
save
adders
(2
less
than
the
total
number
of
summands)
combines
four
summands
plus
the
partial
product
into
a
new
sum
and
carry
value.
The
full
adder
now
combine
s
this
sum
and
carry
into
a
new
partial
product.
In
terms
of
worst-case
delay,
the
logic
chain
through
the
operand
select
gate,
.
the
carry
save
tree,
and
the
full
adder
represents
the
worst-case
delay.
Detailed
logic
design
analysis
has
shown
that
a
nominal
30-nanosecond
delay
through
the
chains
is
possible.
A
breadboard
of
the
actual
hardware
must
be
built
and
operated
to
obtain
a
true
final
result.
2-11
2-12
EIGHT-BIT
WORD
LENGTHS
The
ILLIAC
IV
is
to
be
built
with
several
different
word
lengths.
Each
64-bit
PE
is
capable
of
being
split,
effectively,
into
at
least
two
32-bit
PE's.
Another
desirable
word-length
breakpoint
is
at
eight
8-
bit
effective
PEl
s
within
any
given
PEe
To
evaluate
this
possibility
we
require,
first;
to
know
what
it
costs
to
expand
PE
capability
to
handle
eight
independent
8-bit
words,
and
second,
we
need
to
estimate
the
worth
of
the
extra
capability
produced
by
such
an
expansion
over
and
above
the
capability
already
inherent
in
the
64-
bit
PE
for
handling
8-
bit
~
pieces.
The
conclusion
is
that
all
logical
operations,
probably
even
add
and
.
subtract,
are
easily
programmed
for
the
64-
bit
PE
so
that
they
make
effective
use
of
the
parallelism
of
the
64-
bit
machine.
The
only
feature
that
is
lacking
is
the
independent
mode
control
of
the
8-bit
sections
of
th~
64-bit
PEe
The
price
for
8-bit
words
is
mainly
in
the
fragmentation
of
the
controls
which
results.
Existing
circuit
design
contemplates
a
buffer
capable
of
driving
24
loads.
With
independent
mode
control
on
every
8-
bit
slice,
each
8
bits
of
a
64-gate
transfer
command
would
have
to
be
independently
controlled,
thus
re-
quiring
8
gates
instead
of
the
3
gates
required
by
straightforward
implenlentation
of
the
64-
bit
PEe
Since
there
are
estimated
to
be
150
command
lines,
and
the
64-gate
transfer
command
is
typical
of
them,
we
estimate
that
750
gates
per
PE
are
added
by
the
fragmentation
of
the
command
lines.
In
addition,
the
barrel
controls
multiply,
from
the
three
control
signals
per
level
required
by
only
one
word
size,
to
24
control
signals
per
level.
However,
only
two
levels
of
control
are
used
in
the
module
eight
barrel,
so
that
42
additional
gates
are
added
from
this
account.
The
extra
mode
register
bits
for
8-
bit
operation
imply
extra
lines
to
the
control
unit,
and
extra
drivers
and
receivers
for
them
Twenty-
eight
such
bits
per
PE
will
require
28
drivers
and
28
receivers
in
the
PE
itself,
and
3,
584
extra
drivers
and
receivers
in
the
control
unit.
The
design
difficulties
of
cable
bundles
of
this
bulkiness
are
considerable
at
the
speeds
under
consideration.
The
8-bit
operation
contemplated
here
assumes
we
still
have
no
more
than
one
index
register
per
PEe
To
get
eight
8-
bit
words
out
of
independently
specified
memory
addresses
requires
eight
memory
accesses,
a
situation
which
is
exactly
the
same
as
though
8-
bit
fields
were
programmatically
extracted
from64-
bit
words
in
64-
bit
operation.
The
most
promising
design
of
the
memory
currently
contemplated
for
the
ILLIA
C
IV,
representing
the
best
compromise
between
cost
and
cycle
time,
is
a
nonde-
structively
read
memory.
W~iting
a
small
field,
such
as
8
bits,
into
a
memory
word
will
require
two
memory
cycles,
one
to
read
those
portions
of
the
word
which
are
not
to'be
changed,
and
one
to
write
back
the
word,·
one
8-bit
field
of
which
has
been
changed.
A
store
instruction
in
8-
bit
operation,
if.
independent
addressing
is
called
for,
on
each
8-bit
word,
will
require,
therefore)
16
memory
cycles
..
The
above
hardware
operations
are
hardly
better
than
programmatic
implementation
in
64-
bit
mode.
A
pparently,
the
chief
virtue
of
implementing
8-
bit
operations
in
the
PE
hardware
is
that
mode
register
control
over
the
individual
operations
can
be
achieved.
All
the
above
circuitry
amounts
to
an
additional
800
gates
in
the
PE
for
8-
bit
operation.
Probably
1000
gates
is
a
better
estimate.
From
this
we
conclude
that
adding
8-bit
operation
to
the
64-
bit
PE
add
s
about
10o/c
to
the
equivalent
gate
count,
and
presumably
also
adds
10o/c
to
the
estimated
cost
of
the
PE
exclusive
of
memory.
In
addition
to
the
direct
cost,
there
is
degradation
of
the
performance
in
normal
64-bit
mode.
Were
the
equipment
packaged
as
a
solid
volu.me,
lengths
would
be
increased
by
3.
2o/c
as
a
result
of
the
10o/c
increase
in
components.
In
a
well-
matched
system,
equally
sensitive
to
memory
and
logic
speeds,
the
result
would
be
a
net
1.
6o/c
slowdown.
Power
and
coollng
requirements
also
follow
the
component
count.
The
opposite
question
is
to
enquire
to
what
extent
can
8-
bit
operations
1::>e
carried
on
in
parallel
in
a
64-
bit
PE
without
the
assistance
of
specific
8-
bit
operations
in
the
hardware.
Certain
operations"
which
are
expected
to
be
frequent
operations
in
8-bit
pro-
gramming,
are
independent
of
word
length,
such
as
all
bit-
by-
bit
Boolean
operations"
compare
all
words
against
zero,
set
to
specified
value,
read
from
memory,
store
to
memory,
and
others.
Certain
operations
are
programmable
in
such
a
way
as
to
rnake
use
of
much
of
the
parallelism
of
the
64-
bit
machine
even
though
the
entities
being
manipUlated
are
only
8
bits
long.
Add,
subtract,
and
routing,
are
examples
of
this
class
of
operation.
Add,
for
example,
on
words
of
7
bits
plus
sign
each,
can
be
handled
by
removing
the
sign
bit
and
using
the
resulting
space
for
overflow
control,
thus
allowing
as
many
adds
in
parallel
as
the
adder
is
long.
Routing
can
be
partially
implemented
on
all
eight
8-bit
words
at
once
by
means
of
the
barrel.
End-around
shifts
can
be
implemented
by
two
shifts,
which
are
then
partially
blanked
in
accordance
with
a
mask
which
could
be
held
in
the
S
register,
and
the
two
shifted
words
then
ORed
together.
The
only
operations
of
any
signiflgance
which
are
not
included
in
the
above
paragraph
are
either
things
such
as
multiply
or
independent
indexibllity,
which
were
never
intended
for
8-
bit
operation
because
of
their
expense,
and
mode
control.
_
To
program
mode
control,
without
actually
having
it
in
the
hardware,
requires
manufacturing
masks"
which
blank
out
whole
8-
bit
segments
of
the
64-
bit
word,
in
response
to
decision
bits
which
usually
will
show
up
in
the
sign-
bit
position
of
the
8-
bit
words.
2-13
2-14
This
programming,
while
facilitated
by
the
one-
clock-time
barrel
l
by
the
com-
plete
set
of
Boolean
operations
l
and
by
the
fast
single-bit
instructions,
will
not
be
trivial.
COMMON
DATA
PATHS
This
section
describes
the
various
paths
by
which
information
is
transferred
between
the
PE
and
its
memory
both
to
the
input-output
interface,
and
to
the
·control
unit.
A s a
result
of
the
.conIlicting
requirements
on
transfer
rates,
frequency
of
use
l
and
destination,
there
are
three
segments
of
the
equipment
provided
for
data
transfer.
These
segments
are
an
input-
output
buffer
l a
memory
access
buffer
with
each
control
unit,
and
direct
lines
'from
the
control
unit
to
and
from
the
PE.
The
rest
of
this
section
describes
this
hardware
in
terms
of
its
use
in
each
class
of
signals
which
must
be
handled
into
and
out
of
each
PE.
!.nput-
Output
Buffering
Present
plans
are
to
provide
an
input-output
buffer
of
4096
bits
for
each
quad-
rant
of
the
array.
Each
bit
of
the
I/O
buffer
has
connection
to
one
bit
of
the
memory
data
register
of
one
PE.
When
data
is
to
be
transferred
to
or
from
a
PE
memory
from
the
I/O
buffer,
one
memory
cycle
is
taken
from
PE
opera-
tions
l
and
all
64
PE's
insert
one
data
word
into
their
memory
at
that
memory
cycle.
Data
is
transferred
across
the
external
interface
of
the
I/O
buffer
in
word
sizes
appropriate
to
the
device
found
at
that
interface.
The
external
device
may
be
a
buffer
memory
with
very
long
words
l
say
4096
bits
each.
In
this
case,
I/O
operations
are
accomplished
by
stealing
one
PE
memory
cycle
from
the
array
for
each
cycle
of
the
external
buffer
memory.
On
the
other
hand,
the
external
device
could
conceivably
be
a
device
of
lesser
word
size,
512/bits
per
word,
for
example,
in
which
case
an
independent
loaqing
and
unloading
control
is
required
to
assemble
the
shorter
external
words
into
the
4096-
bit
word
in
the
110
buffer.
This
independent
control
communicates
with
the
array
control
unit
both
to
steal
memory
cycles,
and
to
report
I/O
complete
when
a
whole
biock
of
data
has
been
successfully
transferred.
The
design
of
this
interface
is
almost
independent
of
the
rest
of
ILLIA
C IV.
It
is
not
the
intent
of
this
section
of
this
report
to
discuss
the
design
of
the
inde-
pendent
I/O
control
unit,
as
this
design
is
dependent
upon
matters
discussed
els.ewhere
in
this
report.
From
Control
Unit
to
Processing
Element
Data
for
the
use
of
the
PE,
data
to
be
stored
in
the
PE
memory
for
the
control
unit's
own
use,
and
commands
for
the
PE
are
transmitted
from
the
control
unit
to
the
processing
element.
Common
data
lines
from
the
control
unit
are
used
for
data
which
goes
from
the
control
unit
to
the
PE,
whether
this
data
is
for
PE
use
or
for
the
control
unit's
private
use.
To
avoid
continually
interrupting
PE
operations
for
data
fetches
and
stores
which
are
related
to
control
unit
purposes,
many
words
should
be
fetched
and
stored
in
parallel.
A
reasonable
compromise
between
amount
of
hardware
and
interference
with
PE
operation
appears
to
be
no
more
than
32
words
of
data
transmitted
to
the
PE
array
in
parallel.
Further
discussion
of
the
buffer
size
is
found
below.
We
plan
to
provide
32
words
of
buffering.
Of
these
32
words,
eight
are
assigned
to
each
quadrant.
A
block
diagram
of
the
system
for
handling
data
is
shown
in
figure
2-
6
(page
2-10).
It
is
simplest
to
describe
operation
by
starting
with
the
isolated
quadrant.
Eight
words
of
data
are
transmitted
to
the
memory
access
buffer
from
the
control
unit.
This
transmission
is
overlapped
with
PE
operations
provided
that
the
program
is
such
that
the
requirement
for
transmitting
data
can
be
recogni7ed
far
enough
ahead
of
time.
This
will
generally
always
be
recognizable
ahead
of
time
when
the
data
.
is
simply
to
be
stored
in
memory.
One
clock
time
per
word
is
expected
to
be
required,
since
the
data
paths
within
the
control
unit
are
expected
to
be
typically
one
word
wide.
Each
word
in
the
memory
access
buffer
has
access
to
a
column
in
the
array.
The
eight
words
have
simultaneous
access
to
one
element
in
every
column,
namely
a
row.
As
a
result,
the
eight
words
can
be
transmitted
in
paral-
lel
to
every
row
in
the
array_
If
data
is
to
be
stored
in
memory,
one
row
accepts
the
data.
If
data
is
being
broadcast,
rows
receive
the
data.
In
this
case,
the
eight
words
of
data
are
eight
copies
of
a
single
word
whicl}
originated
in
the
control
unit.
When
the
array
is
operating
as
a
coherent
whole,
all
four
control
units
operate
on
t~e
same
program
str
ing
and
data
in
parallel,
and
have
identical
internal
states.
Out
of
a
package
of
32
words
to
be
sent,
the
f,irst
eight
can
be
derived
from
information
supplied
by
the
fir
st
control
unit,
the
second
eight
can
be
derived
from
information
supplied
by
the
second
control
unit,
and
soon.
Like-
wise,
when
broadcasting
data,
the
four
control
units
will
have
four
identical
copies
of
the
word.
to
be
broadcast,
each
of
the
four
is
copied
eightfold
into
the
memory
access
buffers,
and
each
of
the
resulting
32
copies
is
then
used
to
drive
eight
PE's.
Also
transmitted
from
the
control
unit
to
all
PE's
in
parallel
are
control
signals
.
.
These
control
signals,
are
identical
for
all
PE's.
Retiming
considerations
will
demand
that
there
be
a
flip-flop
in
the
control
unit
for
each
such
line.
Present
planning
calls
for
a
receiver
per
cabinet
for
each
such
signal,
and
eight
PE's
per
cabinet.
2-15
2-16
Addresses
must
also
be
issued
from
the
control
unit
to
all
PE's.
These
will
use
the
least
significant
12
bits
of
each
word
in
the
memory
access
register,
and
the
broadcast
data
bus.
Issuing
of
addresses
and
broadcasting
of
data
require
a
fixed
delay
of
several
clock
times.
The
design
of
the
microsequences
as
manufactured
by
the
control
unit
is
such
as
to
allow
for
this
constant
delay.
From
Processing
Element
to
Control
Unit
When
data
is
fetched
from
PE
to
control
unit
in
the
isolated
quadrant,
the
mechanism
is
the
reverse
of
the
process
described
above
for
disseminating
data.
Namely,
the
information
is
collected
in
the
form
of
a
single
word
from
each
single
column
of
the
quadrant,
and
the
eight
words
from
each
row
are
then
transmitted
to
the
control
unit.
For
the
instruction"
store
to
broadcast
register,
"
the
eight
words
are
ORed
together
to
form
a
single
word
in
the
control
unit,
very
like
a
broadcast
in
reverse.
When
data,
or
program
string,
are
being
fetched
in
32-word
packages
from
the
united
array
an
additional
complication
sets
in,
since
each
one
of
the
memory
access
buffers
contains
only
a
quarter
(eight
words)
of
the
32-word
package
and
each
control
units
wants
all
32
word
s.
In
this
case
it
will
be
necessary
to
trans-
fer
data
around
among
the
memory
access
buffers
until
each
quarter
package
of
eight
words
has
shown
up
once
in
each
of
the
eight-
word
memory
access
buffers.
Single
word
fetching
makes
use
of
the
same
paths
by
disabling
all
but
one
of
the
words
being
received.
Also
entering
the
control
unit
are
lines
from
each
of
the
mode
register
flip-flops
in
the
PE'
s.
To
the
control
unit
of
one
quadrant,
'a
bit
in
the
mode
register
appears
as
a
64-bit
register
to
the
control
unit,
which
may
be
set,
read,
com-
pared
with
other
registers
available
to
the
control
unit,
etc,
When
operating
in
united
mode,
control
units
must
cooperate
in
the
sensing
of
the
state
or
mode
registers.
For
example,
a
"jump
on
any
bit
equal
to
ONE"
means
to
jump
if
any
bit
in
anyone
of
the
four
64-
bit
registers
involved
is
ONE.
Since
all
four
control
units
run
the
same
program
instructions
at
the
same
time,
only
twelve
wires,
one
to
communicate
between
each
pair
of
control
units
,are
required
to
secure
the
necessary
cooperation.
Evaluation
Some
of
the
above
described
data-transferring
procedures
take
more
time
than
one
might
at
first
expect.
One
250-nsec.
memory
cycle
is
required
to
load
the
memory
access
buffer.
This
memory
cycle
interferes
with
the
action
of
the
PE
only
insofar
as
the
PE
memory
or
memory
data
buffer
is
required.
The
memory
access
buffer
for
a
single
quadrant
can
be
unloaded
into
the
appropri-
ate
area
in
program
lookahead
or
local
data
buffer
in
eight
clock
times.
However,
these
potentially
interfere
with
other
used
of
program
lookahead
or
local
data
buffer.
With
proper
implementation
of
the
controls,
these
eight
clock
times
can
be
largely
hidden
by
being
taken
from
otherwise
idle
time
in
the
two
buffer
areas
in
the
control
unit.
When
the
array
is
in
united
operation,
one
must
count
not
only
32
clock
times
rather
than
8,
one
also
must
transfer
the
data
from
one
memory
access
buffer
to
another.
If
cables
of
up
to
30
or
40
feet
long
separate
the
memory.
access
buffers,
then
the
time
to
transfer
data
from
one
memory
access
buffer
to
another
may
well
be
three
clock
times,
and
three
transfers
are
needed.
The
conclusion
is
that
as
much
as
41
clock
times
are
needed
to
transfer
the
32
words,
read
in
one
memory
cycle
time,
into
the
appropriate
areas
of
the
four
control
units.
Seven
clock
cycles
in
each
memory
cycle
is
a
likely
design
choice.
In
this
case,
41
clock
cycles
represents
1.
46
microseconds.
This
1.
46
microseconds
is
over-
lapped
with
other
operations
as
long
as
a
reservoir
of
instructions
for
PE
operation
can
be
maintained
in
the
control
unit.
However,
this
1.
46
microseconds
often
gets
in
the
way
when
loading
the
program
lookahead.
It
is
a
penalty
to
be
taken
whenever
the
program
jumps
to
a
location
not
contained
in
program
lookahead.
Further,
the
coarser
block
size
means
that
loops
between
32
and
64
words
long
will
less
often
be
found
within
the
program
lookahead,
and
program
fetching
will
take
place
more
often
when
the
block
which
is
fetched
is
larger.
Large
block
size
also
interferes
with
broadcast
operations,
since
a
larger
delay
occurs
between
the
instruction
to
fetch
a
block
of
data
to
the
data
buffer
and
the
first
opportunity
to
broadcast
one
of
those
words,
when
blocks
are
larger.
Optimum
block
size
is
that
which
finds
the
best
tradeoff
between
interference
with
the
operation
of
the
PE's
in
the
array,
and
slowing
of
operations
in
the
~
control
unit.
Block
sizes
of
8,
16,
and
32
words
are,
easily
available
with
minor
modifications
of
the
structure
here
described,
and
a
choice
of
a
smaller
block
size
than
the
32
words
here
proposed
can
be
easily
made
during
the
early,
design
months
of
the
next
contract.
MODE
Introductory
Considerations
.
Each
PE
is
supplied
with
a
mode
register,
which
it
can
change
as
the
result
of
tests,
and
which
the
control
unit
can
set
at
will.
Instructions
will
be
executed
or
not
as
a
result
of
the
setting
of
the
mode
register.
This
arrangement
is
in
lieu
of
branching
at
the
PE
level,
since
all
PE's
must
share
the
common
instruction
stream,
and
therefore
cannot
independently
execute
transfers
of
control.
2-17
2-18
Modes
should
be
remembered
and
recoverable.
At
the
very
least,
each
routine
which
one
enters
must
be
supplied
with
fresh
capabiHty
for
setting
and
changing
its
mode,
while
remembering
the
mode
of
the
calling
sequence.
Actually,
it
seems
that
considerably
more
flexibility
is
desirable.
In
the
limit,
one
could
specify
a
particular
mode
register,
out
of
some
set
of
mode
registers,
for
each
instruction.
The
Problem
At
issue
is
the
question
of
the
location
of
the
backup
storage
of
the
noncurrent
modes.
The
choice
is
between
supplying
back-up
mode
register
storage
in
the
PE's,
or
of
supplying
the
control
unit
with
the
capability
of
reading,
storing,
and
restoring
the
PE
modes.
The
implementation
of
mode
control
is
con-
siderably
different,
depending
on
the
results
of
this
choice,
so'that
we
are
really
choosing
between
two
different
systems
for
controlling
PE
operations
in
response
to
mod
e.
Back-
Up
Storage
In
The
Control
Unit
The
first
system
to
discuss
is
that
with
back-up
storage
for
other
than
current
modes
in
the
control
unit.
In
this
system,
only
one
mode
register's
worth
of
flip-flops
are
in
the
PE,
and
the
setting
of
the
mode
register,
for
which
a
given
instruction
will
be
executed
in
a
specific
PE
is
known
at
the
time
of
that
instruction.
Either
four
bits
accompany
the
instruction,
specifying
which
of
the
possible
modes
perInit
the
execution
of
the
instruction,
or
else
a
pre-
existing
decision,
based
on
the
content
of
the
mode
register,
controls
the
execution
01'
nonexecution
of
instructions
in
the
PE.
Bit
economy
in
the
instruction
stream
favors
the
latter,
if
mode
is
not
to
change
at
every
instruction.
A
decision
based
on
speed
also
favors
separation
of
the
mode
decision
from
the
execution
of
the
instruction,
since
then
less
control
gating
is
involved
for
the
individual
instruction
at
the
PE.
One
of
the
savings
in
speed
of
the
ILLIA
C IV
type
of
computer
ought
to
lie
in
the
fact
that
the
subcommand
matrix
of
a
normal
computer
finds
itself
mostly
in
the
control
unit,
so
that
the
instruction
decoding
and
timing
wpich
consumes
one
clock
time
per
instruction
in
most
computers
can
be
spent
in
the
control
unit,
overlapped
with
useful
arithmetic
work
in
the
PE.
For
most
complete
overlap,
the
PE
has
an
"on-
off"
flip-flop
to
control
the
execution
or
nonexecution
of
the
next
instruction.
Individual
instructions
in
such
a
scheme
will
never
wait
for
the
decoding
of
mode
information
before
they
can
start
and
noninterferin~
microseQuences
can
freely
overlap.
Occasionally
a
one-
clock-time
instruction
would
be
needed
to
change
the
setting
of
the"
on-
off"
flip-flop
in
response
to
some
new
interpretation
of
the
mode
register.
The
"on-off"
flip-flop
also
must
respond
(sometimes)
to
arithmetic
overflow
as
well
as
to
programmed
tests
which
change
the
mode
bits.
There
is
an
over-
flow
flip-flop
which
appears
to
the
control
unit
much
like
third
mode
bit.
In
this
system
there
are
therefore
four
flip-flops
per
effective
PE
with
mode-like
functions.
Thus
there
are
four
flip-flops
per
32-blt
word,
or
eight
flip-flops
per
actual
PE,
two
programmatically
specifiable
mode
bits,
the
overflow
flip-flop,
and
the
"enable-disable"
flip-flop.
There
are
PE
instructions
to
set
or
reset
each
mode
bit
in
response
to
programmed
tests.
Each
bit
in
the
PE
appears
to
the
control
unit
as
a
64-
bit
word,
since
there
are
64
PEl
s
per
control
unit.
There
are
instructions
to
read,
save,
and
set
these
word
s,
the
instructions
being
control
unit
instructions
rather
than
PE
instructions.
The
control
unit
is
provided
with
high-
speed
register
storage
for
such
saving.
It
is
also
pro-
vided
with
logical
instructions
to
manipulate
the
modes,
namely
AND,
OR,
and
COMPLEMENT
instructions
which
can
operate
on
the
words
formed
from
the
mode
register
bits.
Jump
instructions
in
the
control
unit
would
test
mode
bits
(either
"any
mode
bit"
or
"all
mode
bits
").
They
would
also
test
old
modes
stored
in
the
control
unit.
This
system
thus
has
a
reservoir
of
old
modes
which
can
be
reactivated
on
short
notice.
This
back-up
is
in
the
control
unit.
Response
to
old
mode
settings
is
effected
in
one
clock
time
by
setting
the
"on-off"
flip-flop
in
each.
,?E.
The
instruction
stream
has
no
mode
field
per
instruction,
but
does
have
mode-
mani-
pulating
instructions,
which
are
decoded
by
the
control
unit
in
parallel
with
the
issuing
of
instructions
to
the
PE.
Storage
of
old
modes
in
the
control
unit
is
backed
up
further
by
the
storage
of
words
from
old
mode
registers
back
to
memory.
It
is
assumed
that
the
control
unit
has
some
means
of
addreSSing
memory,
both
for
read
and
wr
ite.
Back-
Up
Storage
In
The
Processing
Element
The
second
system
we
discuss
is
that
with
back-up
storage
of
old
modes
assigned
to
the
PE's.
In
this
system,
where
several
modes
are
stored
in
the
PE,
a
method
must
be
chosen
for
choosing
the
applicable
mode
for
any
given
stream
of
in-
structions~
A
settable
pointer
could
be
used,
and
the
pointer
setting
changed
on
command
from
the
control
unit,
in
between
actual
processing
instructions.
Using
the
pointer,
the
operation
would
be
fully
equivalent
to
operation
with
back-up
storage
in
the
control
unit
except
for
the
location
of
the
stored
bits.
Whe,n a
mode
register
address
is
lissued
with
each
instruction,
more
flexibility
is
obtained.
A s
described
to
us,
the
system
was
used
with
six
mode
bits
with
each
instruction:
Two
bits
of
mode
register
address,
and
four
bits
to
interpret
the
contents
of
that
register.
Arithmetic
overflows
are
lost
in
the
implementation
unless
tested
for
immediately
by
means
of
a
jump
instruction
in
the
control
unit.
They
cannot
influence
mode
directly,
although,
clearly,
when
tested,
they
can
be
used
to
control
modes.
Even
with
back-up
s,torage
of
modes
in
the
PEl
s,
some
means
of
setting
new
modes
under
control
unit
control
is
required.
Whether
a
path
directly
from
control
unit
to
PE
for
loading
the
mode
registers
is
needed,
or
whether
indirect
methods
suffice,
has
not
yet
been
determined.
2-19
2-20
Each
instruction
in
the
PE
must
therefore
look
to
the
mode
register
for
condi-
tions
against
which
execution
is
made
conditionaL
Some
instructions
are
made
one
clock
time
longer
becuase
of
this.
Furthermore,
no
overlapping
of
non-interfering
microsequences
appears
possible.
Jump
instructions
are
made
dependent
on
a
bit,
one
per
PE,
returned
to
the
control
unit.
These
instructions
are,
like
any
other,
conditional
on
one
of
the
mode
registers,
so
that
jumps
can
easily
be
made
conditional
on
mode
register
settings.
This
capability
is
similar
to
that
achieved
in
the
competing
system
by
testing
old
modes
in
the
control
unit.
The
system
has
a
limited
reservoir
of
old
modes
which
can
be
reactivated
on
short
notice.
This
reservoir
is
limited
in
length,
but
very
fast
of
access.
The
instruction
stream
has
a
6-
bit
mode
field
per
instruction.
Storage
of
additional
modes
is
in
PE
memory.
If
they
are
to
be
packed
effcient-
ly
in
memory,
a
short
mod
e-
packing
routine
is
necessary.
Functional
Capability
Both
competing
systems
can
execute
the
same
functions.
A
partial
list
of
capabilities
follows
..
Quick
change
of
control.from
one
mode
register
to
another.
Storage
6f
several
different
mode
registers
per
PEe
Transfer
of
control
possible
in
response
to
mode
register
setting.
Comparison
A
list
of
features
in
which
the
suggested
methods
of
supplying
back-up
mode
registers
differs
reveals
none
in
which
the
functional
capabilities
made
avail-
able
by
the
one
cannot
also
be
made
available
by
the
other.
However,
there
are
other
differences.
"When
back-up
mode
storage
is
in
the
PE
every
microsequence
requires
the
operation
of
mode
gating
in
the
PE,
and
overlapping
of
microsequences
is
not
possible.
Six
bits
of
every
instruction
are
expended
on
mode
information.
"When
back-up
mode
storage
is
in
the
control
unit,
every
change
of
mode
requires
a
one
t-time
operation
interleaved
between
the
regular
arithmetic
operations.
Each
change
of
mode
takes
a
short
instruction,
say
16
bits.
What
is
needed,
to
compare
the
advantages
and
disadvantages
listed
above,
is
some
estimate
of
the
number
of
instructions,
on
the
average,
executed
with-
out
change
of
mode
setting,
or
executed
while
ignoring
mode.
Even
if
average
..
the
string
length
is
only
two
or
three.
The
speed
advantage
would
appear
to
lie
with
the
system
of
storing
modes
in
the
CD.
Another
difference
is
in
the
PE
hardware.
Considerable
savings
are
expected
by
placing
the
back
-up
mode
storage
in
the
CD,
whose
long
registers
are
much
cheaper
per
bit
than
PE
short
registers.
At
a
very
detailed
machine
language
level
of
coding
there
are
differences
in
the
progran1.ming
of
mode
control.
With
back-up
modes
stored
in
short
registers
in
the
PE,
a
different
mode
with
every
instruction
is
appropriate,
but
program-
ming
compl~xity
mounts
when
mode
registers
above
the
fourth
are
used.
With
control
unit
back-up
storage,
programming
complexity
remains
essentially
constant
for
any
depth
of
mode
storage,
although
producing'memory
interference
when
depth
of
storage
exceeds
that
available
in
the
high
speed
registers
of
the
control
unit.
The
same
assembly
language
could
be
used
in
either
case,
if
desired,
putting
the
above
differences
solely
into
the
assembly
program.
2
-21
SECTION
III
REVISED
PE
LOGIC
DESIGN
This
section
describes
the
mechanization
and
hardware
requirements
for
the
PE.
A
block
diagram
is
included
containing
the
data
paths
for
each
item
required
to
mechanize
a
PE.
The
logic
requirements
per
unit
are
tabulated
in
table
3-1.
PROCESSING
ELEMENT
DESCRIPTION
The
revised
PE
logic
design
is
shown
in
figure
3
-1.
A
PE
is
essentially
a
three-
register
system
which
can
execute
a
complete
general
purpose
computer
order
code
as
described
in
Section
VII
of
TR66-3
or
otherwise
modifieq
in
this
report.
A
fourth
register
was
added
to
store
intermediate
results.
Capabilities
of
some
units
may
be
increased
or
decreased
to
vary
operation
code
execution
times.
A
summary
of
each
unit
in
the
PE
follows.
MEMORY
DATA
REGISTER
(MDR)
This
unit
is
abuffer
register
between
a
PE
and
its
PE's
thin
film
stack
and
the
outside
world.
The
register
serves
as
an
intermediate
buffer
when
performing
array
shifts
or
memory
stores
and
fetches.
This
register
is
accessible
even
though
the
PE's
mode
status
is
disabled.
The
MDR
receives
operands
from
the
:
Common
Data
Bus,
A
Register,
Operand
Select
Gates,
Address
Adder
and
its
PE's
.memory.
The
register's
output
is
connected
to
the
Common
Data
Bus,
Address
Adder,
PE's
memory
and
Operand
Select
Gates
located
within
its
PE's
boundaries
or
four
nearest
neighbors.
3-1
CONTROL
UNIT
-
TO
I/O
REGISTERS
,_~1
__
,
COMMON
DA TA
BUS
1 N
SEW
3 1 3
2&3
t +
••
DRIVERs'rr
MODE
I
DRIVERS
& J l
DRIVERS
&
RECEIVERS
I I
RECEIVERS
REGISTER
RECEIVERS
RECEIVERS
: I
t t
r--=-
I I
I I
1 , + I I
I
I I
PE
MEMORY
DATA
REGISTER
I
PE
I
ENABLE
(MDR)
I
iMEMORY
I
SIGNALS
ADDRESS
I I
J
ADDER
I I
RECEIVERS
I I
OPERAND
I
MEMORY
I I
SELECT
GATES
PE
I
r-r--
r
INDEX
r
ADDRESS
I I
H S
REGISTERS
(OSG)
REGISTER
I I
I
REGISTER
(MAR)
I I
I I
1---
1
----1
1
~
.-
j
I B
REGISTER
I
MULTIPLICAND
I
SELECT
GATES
(MSG) 2 & 3
-'-
~
I I
DRIVERS
I
PSEUOADDER
J
~
TREE
1
+ + t
A
REGISTER
1 B
REGISTER
,J
t--
AND
OPERAND
N S E W
SELECT
GA
TES
t I
SELECT
GA
TES
I CARRY
NOTES:
PRO
PA GA
TE
ADDER
1.
The
blocks
iabeled
number
1
will
be
implemente
(CPA')
DDa
cabinet
or
half
Cabinet
BaSiS,
rather
than
a
1
PE
basis.
d
1 I
2.
The
blocks
labeled
number
2
are
implemented
to
r=:l
~
only
2
of
the
Signals
.
I l I
3.
The
blocks
labeled
number
3
will
be
implemente
LOGIC
as
required
on
a
PE
basis.
A
REGISTER
f
UNIT
d
1 I
t ,
I
LEADING
l I
ONEs
~
BARREL
DETECTOR
SWITCH
1
Figure
3-1.
Revised
PE
Logic
Design,
Block
Diagram
Table
3-l.
PE
Logic
Requirements
UNITS
FUNCTIONS
BIT
GATES/
TOTAL
FLIP
DRIVERS/
POSITIONS
BIT
GATES
FLOPS
RECEIVERS
Memory
Data
Register
Buffer
register
between
thin
film
stack
and
logic
64
6
384
64
Address
Adder
inputs,
bit
positions
52
through
63 12 1 12
Drivers
and
Receivers
to
and
from
Common
Data
Bus
64/64
Drivers
to
adjacent
PE
Operand
Select
Gates
64/00
Drivers
and
Receivers
to
Input
Output
Registers
64/64
Operand
Select
Gates
Selects
an
Operand
for
PE
processing
64 6
384
Receivers
from
two
adjacent
num-
bered
PE's
00/128
Multiplicand
Select
Selects
4-56
bit
words
based
on
8
multiplier
bits
56/
word
16 896
Control
word
enable
signal
gen-
eration,
some
fan
out
288
49
Carry
Propagate
Consists
of
six
adder
sections:
Adder
Receives
mantissa
field
inputs
from
A-B
Registers,
Pseudoadder
Tree
and
Operand
Select
Gates.
Gate
and
Flip-flop
requirement
for
six
sections
648
12
Input
Gates:
A
Register
48 2 96
B
Register
and
Operand
Select
Gates
48 3
144
Address
Adder
Consists
of
two
adder
sections:
Per-
form
address
modification,
exponent
summation
and
supplementary
sec-
tions
to
Carry
Propagate
Adder
Gate
and
Flip-flop
requirement
for
two
sections
216
4
Input
Gates:
Address
Modification
12 4 48
Exponent
Summation
16 5 80
Control
Inputs
16 3 48
Output
Gates:
16
3 48
PE's
Index
Register
Stores
Address
variable
12 12
Memory
Address
Buffer
register
between
thin
film
Register
stack
and
logic
12 12
A
Register
Main
operand
store
and
Overflow
flip-flop
64 7 448 65
B
Register
Auxiliary
operand
store
64 6
384
64
Address
Adder
input:,:
pOSitions
o
through
15 16 1 16
S
Register
Intermediate
operand
store
64 64
Pseudoadder
Tree
Converts
4
Words
+
Partial
Product
into
CPA
inputs
56
27
1512
Logic
Unit
Performs
logic
functions
between
A-B
Registers
64 6
396
Barrel
Switch
Performs
shifting
functions.
65 14 910 1
Barrel
Switch
Control
Logics
48
Mode
Register
Buffer
register
between
a
PE
and
the
Control
Unit
8 8
Mod.e
control
input
gates
38
Drivers
and
Receivers
to
Control
Unit
8/8
Leading
One's
Detects
position
of
leading
one,
Detector
mantissa
field
184
PE
Control
Receives
apprOximately
200
Signals
Signal
Rec.
frqm
the
Control
Unit
000/200
7228
355
200/264
A
ssurnptions:
1.
The
flip
flops
are
constructed
of
4
gates
per
element.
2.
The
Drivers
and
receivers
are
constructed
of
one
logic
gate.
Total
gates
used.
7228
1420
664
9312
3-3
3-4
OPERAND
SELECT
GATES
(OSG)
This
unit
consists
of
one
logic
level
of
decoding
gates.
The
enabled
gate
selects
an
incoming
operand
from
its
MDR
or
nearest
neighbor's
MDR
for
array
transfer
operations
or
PE
processing.
During
an
array
transfer
operation,
.
the
OSG
out-
put
is
stored
temporarily
in
the
MDR.
PE
processing
requirements
are
that
the
selected
operand
be
connected
to
the
B
Register
input
gates
and
Multiplicand
Select
Gates.
Depending
upon
the
instruction
an
operand,
or
multiple
thereof,
is
loaded
into
the
recipient
register.
A-B
REGISTER
(ACCUMULATOR)
The
A
Register
and
B
Register
provide
main
operand
store
and
auxiliary
operand
store,
respectively.
The
Barrel
Switch
output
is
connected
to
both
registers.
These
register
inputs
are
used
to
execute
operation
codes
that
require
main
or
auxiliary
operand
shifts,
or
logical
combinations
thereof.
Both
register
outputs
are
routed
to
the
Carry
Propagate
Adder
(CPA)
and
Logic
Unit.
In
addition
the
A
Register
receives
inputs
from
the
CPA
and
the
Modified
Address
Adder.
*
The
CPA
output
is
gated
into
the
A
Register's
mantissa
field.
This
output,
depending
on
the
algorithm,
could
be
a
partial
product
or
remainder,
a
summed
mantissa
or
just
an
operand
transfer
from
the
B
Register
or
the
OSG.
During
an
operand
transfer,
the
appropriate
exponent
is
gated
directly
into
the
register's
exponent
field.
The
Modified
Address
Adder
output
is
a
summed
value
loaded
into
the
exponent
field.
The
A
Register's
outputs
are
connected
to
the
Leading
ONEs
De
tector,
MDR,
Special
Register
(SR)
and
the
Modified
Address
Adder.
The
B
Register
receives
additional
inputs
from
the
CPA;
OSG,
and
SR.
The
CPA
input
is
a
completed
8
-bit
product
obtained
in
a
multiplication
micro-
sequence
step.
The
OSG
input
is
the
incoming
operand
from
this
PE's
or
an
adjacent
PE's
MDR.
The
SR
input
is
the
incoming
.operand
obtained
from
a
previous
calculation.
Other
B
Register's
outputs
are
connected
to
the
Multiplicand
Select
Gates
and
Modified
Address
Adder.
The
Multiplicand
Select
Gate
inputs
are
multiplier
bits
decoded
into
enable
signals
for
the
next
multiply
step.
The
Modified
Address
Adder
input
is
the
exponent
field
to
be
summed
in
the
adder.
LOGIC
UNIT
This
unit
performs
logic
functions
between
the
A
Register
and
B
Register.
The
unit
receives
outputs
from
both
accumulator
registers.
The
various
logic
functions
are
performed
as
specified
in
Section
VII
of
TR66
-3
or
otherwise
modified
in
this
report.
This
unit
provides
the
input
gating
function
to
the
Barrel
Switch
unit.
*
The
Modified
A.ddress
Adder
term
refers
to
the
Address
Adder
extended
4
bit
positions.
BARREL
SWITCH
The
Barrel
Switch
is
a
matrix
of
symmetrical
gates
which
shifts
a 64
-bit
parallel
input
any
number
of
places
to
the
left
or
right
either
end-off
or
end-around.
Oper-
ation
is
started
upon
receipt
of
a 64
-bit
parallel
input
and
appropriate
control
signals.
The
output
is
connected
to
both
the
A
and
B
Registers.
LEADING
ONES
DETECTOR
The
input
to
this
unit
consists
of
the
A
Register's
mantissa
bit
positions.
This
unit
detects
the
location
of
the
most
significant
set
bit,
decodes
its
position
as
a
radix
2
power
and
enables
appropriate
Barrel
Switch
displa'cement
control
signals.
The
unit
also
senses
the
absence
of
a
set
bit,
thereby
detecting
a
zero
value.
This
decoded
signal
is
used
to
zero
the
exponent
field
of
the
A
Register.
MULTIPLICAND
SELECT
GA
TES
These
gates
select
four
multiplicand
words
each
based
on
2
bits
of
the
multiplier
mantissa
only.
The
Multiplicand
Select
Gates
are
partitioned
into
two
parts,
one
which
receives
the
next
set
of
multiplier
bits
to
be
decoded
into
multiplicand
word
enable
signals
and
the
other
which
gates
the
appropriate
multiplicand
(word)
into
the
Pseudoadder
Tree.
During
multiplication,
four
2
-bit
pairs
of
multiplier
bits
are
received
and
decoded
in
advance
to
form
the
enable
signal
that
gates
the
required
words
into
the
Pseudoadder
Tree
for
the
next
microsequence
step.
The
decoded
signal
may
select
multiplicand
multiples
of
times
one,
times
minus
one,
or
times
two.
This
word
selection
enable
signal
is
stored
in
flip
-flops
at
the
termination
of
the
current
microsequence
step.
During
an
operand
transfer,
the
word
selected'
will
enter
the
Pseudoadder
Tree
so
that
its
output
is
positioned
at
the
A
Register
mantissa
input
gates.
PSEUDOADDER
TREE
The
Pseudoadder
Tree
consists
of
three
carry
save
adders
per
bit
position.
During
multiplication,
this
unit
receives
inputs
consisting
of
four
words
from
the
multipli-
cand
Select
Gates
and
the
current
partial
product
from
the
A
Register.
The
carry
save
adders
reduce
these
five
inputs
to
two
outputs,
one
being
the
summand
sum
and
the
other
the
summand
carry.
These
outputs
are
gated
into
the
CPA
in
their
true
and
complemented
form.
CARRY
PROPAGATE
ADDER
(CPA)
Upon
receipt
of
an
input
signal
set
from
the
Pseudoadder
Tree
or
the
A
-B
Regis-
ters,
or
the
A
register
and
Operand
Select
Gates,
the
CPA
adds
the
inputs
in
a
3-5
3-6
microsequence
step.
When
executing
an
arithmetic
instruction,
the
extended
CPA's
output
bit
positions
0
thru
47
are
gated
into
the
A
Register
mantissa
field.
When
executing
the
multiplication
algorithm,
the
CPA's
output
positions
48
thru
53
are
gated
into
the
B
Registers
eight
most
significant
locations,
and
positions
0
thru
47
are
gated
into
the
A
Registers
mantissa
field.
During
multiplication
the
CPA
functions
in
two
distinct
modes:
group
carry
save
additions,
and
extended
mantissa
addition.
The
saved
group
carries
are
dis-
placed
8
bit
positions
to
the
right
and
re
-entered
into
that
particular
adder
group.
The
group
carry
save
addition
is
performed
on
those
adder
sections
whose
output
is
entered
into
the
A
Register's
mantissa
field.
Additional
adder
sections
are
required
to
perform
this
algorithm.
These
adder
sections
receive
their
inputs
from
the
most
significant
16
bit
positions
of
the
B
Register
and
group
carry
save
bits
from
the
previous
microsequence
step.
The
least
significant
section
performs
section
carry
save
addition.
The
saved
carry
is
re
-entered
into
this
adder
displaced
eight
bit
positions
to
the
right.
The
last
step
in
the
multi-
plication
algorithm
requiring
product
summation
is
performed
by
the
CPA
oper-
ate
mode
of
extended
mantissa
addition
..
The
additional
adder
sections
are
connected
to
the
CPA
to
form
the·
extended
tna,ntissa
adder.
This
microsequence
step
allows
the
carry
to
propagate
throughout
the
adder
to
form
the
completed
product.
MODE
REGISTER
The
Mode
Register
is
a
buffer
register
between
the
PE
and
the
Control
Unit.
The
PE
status
(operative
or
inoperative)
is
controlled
by
two
bits
of
this
register.
These
two
bits
are
controlled
exclusively
by
the
Control
Unit.
The
PE
controls
other
register
flip
-flops
when
executing
specific
instructions.
ADDRESS
ADDER
The
Address
Adder
inputs
are
received
from
the
Common
Data
Bus,
MDR,
PE's
Index
Register
and
the
Memory
Address
Register.
These
added
sums
are
con-
nected
to
the
input
gates
to
the
MDR,
PE's
Index
Register
and
the
Memory
Address
Register.
The
Modified
Address
Adder
inputs
are
received
from
the
exponent
bit
positions
of
the
A
-B
Registers
and
the
OSG.
This
added
output
is
either
decoded
to
form
Barrel
Switch
control
signals
or
entered
into
the
A
Register's
exponent
field.
PE
INDEX
REGISTER
This
unit
receives
inputs
from
the
Address
Adder.
Its
output
is
connected
to
the
Address
Adder
where
compare
operation
or
address
modifications
are
performed.
MEMORY
ADDRESS
REGISTER
(MAR)
The
MAR
is
a
buffer
register
between
a
PE
and
its
thin
film
stack.
The
MAR
out-
put
is
the
thin
film
stack
operand
address
location.
Depending
upon
the
instruction
the
MAR
can
be
loaded
or
modified
with
Address
Adder
inputs.
The
Address
Adder
input
is
gated
into
the
MAR
at
the
start
of
a
memory
cycle.
SPECIAL
REGISTER
(SR)
The
SR
Register
is
used
to
store
intermediate
results
obtained
during
PE
pro-
cessing.
The
operand
input
is
received
from
the
A
Register.
The
SR
register's
output
is
gated
into
the
B
Register,
thus
providing
a
buffer
'store
for
a
previously
computed
operand
without
a
memory
fetch.
3-7
SECTION
IV
THE
INPUT-OUTPUT
SUBSYSTEM
GENERAL
CONSIDERATIONS
The
following
design
considerations
are
pertinent
to
the
ILLIAC
IV
I/O
Subsystem.
1.
A
disk
storage
system
will
provide
the
principal
on-line
backup
storage
for
the
ILLIAC
IV
System.
Present
memory
state
of
art
singles
out
disk
file
stor-
age
as
the
only
medium
with
the
necessary
volume
and
cost
parameters
to
satisfy
the
ILLIAC
IV
requirements.
Requirements
for
the
disk
file
system
are
further
considered
at
the
end
of
this
section.
2.
I/O
word
transfers
to
and
from
the
PE
Array
will
be
in
the
form
of
a
4096
-bit
word.
This
capability
makes
maximum
use
of
the
interrupt
time
of
a
quadrant
and
keeps
I/O
interference
with
the
Array
program
to
a
minimum.
It
also
provides
a
separate
1/
a
path
to
each
PE
to
accommodate
applications
which
require
asynchronous
direct
inputs
to
the
PE
memories.
Some
real-time
problems
involving
large
array
sensor
systems
are
typical
of
this
application.
3.
Capability
to
perform
interlaced
I/O
transfers
from
simple
descriptor
operation
is
desirable.
Considering
the
variety
and
number
of
peripheral
devices
the
system
may
ultimately
incorporate,
the
capability
to
store
and
rapidly
fetch
at
least
64
I/O
descriptors
to
control
the
interlaced
I/O
transfers
is
necessary.
4.
Much
routine
I/O
processing
such
as
data
packing
and
unpacking,
descriptor
assembly
and
updating,
etc.,
must
be
performed
external
to
and
inde-
pendent
of
the
Array
or
Control
Unit.
Therefore,
a
processing
'device
similar
to
a
medium
scale
computer
module
would
be
most
suitable.
There
are,
however,
some
1/
a
programs
such
as
code
and
format
conversion
which
are
eminently
suitable
to
Array
processing.
4-1
4-2
5.
The
existence
of
an
economical
core
memory
system
separate
and
distinct
from
the
Array
memories
is
necessary
for
the
following
reasons:
a)
Transfers
from
tape
to
disk,
or
card
to
disk,
or
disk
to
printer,
must
have
a
buffer
area
available
to
collect
reasonable
block
sizes
before
making
trans-
fers
to
the'
disk
in
order
to
minimize
latency
time
overhead.
b)
Frequently
used
subroutines
which
overflow
array
storage
can
be
kept
in
random
access,
zero
latency
time
memory
to
avoid
millisecond
delays
in
processing.
c)
To
permit
I I 0
lookahead
transfers
from
disk
to
the
array
to
overlap
latency
time
with
processing,
a
separate
buffer
memory
is
desirable
as
a
staging
area
for
such
transfers.
It
is
possible
that
the
buffer
memory
required
to
interface
the
disk
mass
memory
and
the
PE
array
memory
may
be
very
large.
Anum
ber
of
core
memory
manu-
facturers
are
being
surveyed
to
identify
possible
candidates
for
core
memories
in
the
capacity
range
up
to
16
million
bits
(modules
independently
accessible
up
to
8
million
bits),
with
word-lengths
in
the
range
of
1024
to
4096
bits,
and
cycle
times
in
the
range
of
2
to
8
microseconds.
In
the
indicated
capacity
range
the
ferrite
core
memories
are
currently
dominant
from
the
cost
standpoint.
Memory
organizations
involving
only
2
wires
per
core,
such
as
linear
select
or
variations
of
"2
1/2
D,
"
are
generally
indicated
to
minimize
mat
fabrication
cost.
This
is
also
a
significant
consideration
here
but
in
this
case
it
is
likely
that
the
electronics
will
remain
a
very
significant
portion
of
the
cost.
For
the
longer
word
lengths
(2K
-4K
bits)
it
appears
that"
2 1 / 2
D"
would
offer
little
advantage
over
linear
select
and
the
large
number
of
sense
amplifiers
and
data
bit
drivers
would
reflect
heavily
on
the
cost
per
bit.
It
also
appears
that
practical
considerations
may
limit
word
drivers
to
something
in
the
order
of
1000
cores
per
line
and
hence
might
require
multiple
driver
and
switch
matrices
for
the
longer
word
lengths.
Neces-
sarily
tentative
extrapolations
of
fairly
recent
cost
data
suggests
that
it
may
be
difficult
to
obtain
costs
much
lower
than
3
cents
per
bit
for
a
core
memory
for
this
requirement.
At
this
time
several
core
memory
manufacturers,
including
the
Burroughs
Com-
ponent
Division,
have
been
contacted
about
this
requirement.
'
The
unusual
geometry
of
this
unit
precludes
the
use
of
an
existing
product
line
item.
This
fact
has
delayed
detailed
responses
beyond
publication
of
this
report.
DISK
STORAGE
Effort
to
determine
candidates
for
a
disk
file
mass
memory
offering
high
data
transfer
rates
and
economic
storage
in
capacities
in
the
range
of
300
million
to
1
billion
bits
has
continued.
Obtaining
the
high
data
transfer
rates
desired
in
this
applicat~on
(greater
than
300
megabits
/
second)
requires
a
high
degree
of
parallelism
in
the
transfer,
e.
g. a
large
data
storage
word.
The
parallel
mode
of
operation
increases
the
amount
of
read-write
electronics
required
and
it
is
desirable
to
keep
this
increase
to
a
minimum.
It
is
possible
to
reduce
the
number
of
read
-write
channels
below
one
per
bit
if
basic
bit
rate
capability
can
be
traded
to
effect.
the
reduction.
This
tradeoff
is
indeed
a
necessity
if
fewer
heads
are
simultaneously
accessible
than
the
number
of
bits
required
in
parallel.
The
reduction
of
the
number
of
required
read
-write
channels
is
effected
by
using
a
multiple
zone
storage
format
in
which
several
bits
per
word
time
are
recorded
serially
in
the
outer
zones.
In
general
a
two
-zone
format
allows
the
greatest
.
reduction
in
the
number
of
read-write
channels
for
a
given
sacrifice
of
maximum
bit
rate
while
formats
of
three
or
more
zones
afford
greater
capacity
utilization.
More
than
three
zones
are
seldom
warranted
since
the
required
increase
in
the
number
of
clock
rates
tends
to
cancel
the
attractiveness
of
the
small,
further
incre-
ments
to
utilization
efficiency.
The
following
relations
enable
specific
disk
organizations
to
be
evaluated.
For
a
three-zone
format
the
number
of
tracks
in
each
zone
must
be
related
as
follows:
2n
+ n = N
[nb
+ 2 '-
K]
1 2 T
~
and
for
a
two
-zone
format
where
n1 =
the
number
of
tracks
in
the
outer
zone
n2 =
the
number
of
tracks
in
the
!Y'tddle
zone
NT
=
total
number
of
tracks
per
disk
face
K =
number
of
bits
per
track
per
word
stored
serially
in
the
outer
zone.
N =
the
number
of
bits
per
word
per
face
nf
N =
the
number
of
bits
per
storage
word
nf =
the
number
of
disk
faces
used
per
word
4-3
The
number
of
read-write
channels
required
is:
n = n. n
c n f
where
nh
is
the
number
of
heads
per
face
which
are
simultaneously
selected.
For
a
reduction
of
read
-write
channels
below
one
per
bit
of
the
parallel
storage
word
it
is
required
that
In
order
that
the
available
capacity
of
the
disk
file
be
used
efficiently
it
is
also
required
that
nh
be
an
integral
submultiple
of
the
total
number
of
tracks
per
disk
face,
NT'
and
that
nf
shall
be
an
integral
submultiple
of
the
total
number
of
faces
per
file,
N(
For
the
usual
ranges
of
interest
the
maximum
packing
density
occurs
in
the
inner-
most
track
of
the
outer
zone.
It
is
at
this
critical
radius,
R ,
that
the
maximum
bit
packing
density,
B,
applies.
C
where.
R2
=
outermost
track
radius
T =
the
radial
track
density.
The
number.
of
storage
words
per
data
cylinder,
that
is,
the
number
of
words
accessible
in
one
disk
revolution
without
head
switching,
is
given
by:
2TT
RcB
~=
K
The
word
transfer
rate,
UW,
is
then
determined
in
terms
of
the
disk
rpm
as
follows:
_
(RPMt
TT
RcB
(RPM)
~
-
nW
\60)-
30K
The
total
file
capacity
is;
bits/
file.
4-4
The
ideal
packing
efficiency#
a
measure
of
capacity
utilization
referred
to
the
maximum
possible
capacity
which
would
be
available
at
uniform
packing
density
in
all
tracks#
is
given
by:
where
R2
and
R1
are
the
radius
of
the
outermost
and
innermost
tracks
respectively.
A
related
packing
factor
based
on
the
maximum
possible
capacity
at
a
constant
number
of
bits
per
track
for
all
tracks
is:
Note
that
this
factor
is
in
general
greater
than
unity
because
the
multizone
format
utilizes
available
disk
area
more
efficiently
than
would
be
possible
with
a
single
clock
rate
for
the
entire
disk
(as
would
be
required
for
a
straight
parallel
configuration).
In
the
multiple
zone
format
every
track
in
the
same
zone
contains
the
same
number
of
bits#
but
each
zone
requires
a
different
clock
rate.
The
clock
rates
for
a
two-
zone
format#
for
example,
would
be:
K
nW
in
the
outer
zone
(K
-1)
nW
in
the
inner
zone.
As
previously
reported
J
the
Librascope
4800
disk
file
series
is
potentially
attractive
because
of
the
large
number
of
heads
per
face.
The
model
4802
is
of
particular
interest
because
the
increased
packing
density
offers
higher
basic
bit
rates
and
more
capacity
in
the
same
basic
unit.
The
salient
characteristics
are
summarized
here.
No.
disk
per
file
-6
(48"
dia.
)
Tracks
per
face
-
432
Packing
density
-
2000
bits/inch
Track
density
-48
tracks
I
inch
RPM
-
900
(35
ms
avg.
access)
4-5
Table
4-1.
512-Bit,
Parallel
Organizations
for
the
Librascope
4802
No
Head
All
Selected
Heads
on
Modifications
One
Disk
One
Face
Bits Bits
3:2
4:3
Format
(Out
Z
Inner
Zone)
3:2
4:3
er
one:
Head
Groups/Head
Assembly
1 1 3 4
Bits/Face/Word
86
128
256
512
Heads/Sel.
/Face/Word
36 36
108
144
Heads/Group
12 12 4 3
Faces
Sel.
/Unit
6
of
12 4
of
12 2
of
12 1
of
12
Ideal
Packing
Efficiency
83%
86.
3%
83%
86.3%
No.
Stg.
Words/Data
Cylinder
83840
58000
84500
58000
32-Bit
Segments/Data
Cylinder
2620
1815
2640
1815
Word
Transfer
Rate,
MHz
1.
26
O.
87
1.
27
0.87
Read-Write
Channels
216
·144
216
144
Data
Cylinders
/Unit
24
36
24
36
Total
Stg.
Cap.
Words/File
(10
6)
2.0
2.09 2.02
2.09
Total
Stg.
Cap.
Bits/File
(10
6)
1024
1070
1030
1070
Clock
Rate,
Inner,
MHz
2. 52 2.
62
2.
54
2. 62
Clock
Rate,
Outer,
MHz
3. 78 3.
48
3.81
3.48
No.
Tracks,
Outer
Zone
168
240
160
240
No.
Tracks,
Inner
Zone
264
192
272
192
Head
Groups,
Outer
Zone
14
20
40
80
Head
Groups,
Inner
Zone
22 16 68
64
Head
Sticks,
Outer
Zone
14
20
13
+
1/3
20
Head
Sticks,
Inner
Zone
22
16
22
+
2/3
16
Clock
Channels
/Unit
12 8 4 2
4-6
The
large
number
of
accessible
heads
permit
a
number
of
possible
disk
organizations.
Without
sacrificing
capacity
the
following
might
represent
limiting
transfer
rate
capabilities.
1.
Up
to
128
bits
per
face
or
1024
bits
per
file
at
a
word
transfer
rate
of
approximately
o.
8
megaHertz
without
modifications
to
head
stick
wiring.
2.
Up
to
512
bits
per
face
or
2048
bits
per
file
at
a
word
transfer
rate
of
approximately
o.
8
megaHertz
if
head
stick
wiring
is
revised
to
enable
simultaneous
selection
of
four
heads
per
stick.
3.
Up
to
256
bits
per
face
or
1024
bits
per
file
at
a
word
transfer
rate
of
approximately
1.
2
megaHertz
if
head
wiring
is
revised
to
enable
simultaneous
selection
of
three
heads
per
stick.
Still
larger
word
lengths
would
be
possible
by
revising
head
stick
wiring
to
permit
simultaneous
access
to
all
heads
but
the
resulting
increase
in
the
nun1.ber
of
head
wires
from
15
to
39
would
likely
be
troublesome.
With
no
fewer
than
3
heads
per
group
the
number
of
wires
per
stick
can
be
held
to
23
for
completely
flexible
head-
track
sparing,
or
to
14
wires
per
head
stick
if
restriction
to
group
sparing
is
acceptable.
At
the
word
lengths
of
interest
it
is
necessary
to
either
subdivide
the
head
stick
wiring
or
concurrently
select
heads
over
several
diskfaces.
Both
have
some
potential
disadvantages;
the
latter
from
the
possible
necessity
for
skew
cor-
rection
due
to
relative
timing
errors
between
several
faces
and/
or
head
plates,
and
the
additional
clock
channels
required.
At
a
word
length
of
512
bits
either
approach
is
a
possibility.
Table
4-1
presents
a
summary
for
four
different
organ-
izations:
two
involve
no
head
stick
revisions,
and
two
require
head
revisions
but
involve
only
one
disk
or
one
disk
face,
respectively.
Two
different
two
-zone
for-
mats
(indicated
in
the
table
as
the
number
of
serial
bits
per
word
time
in
outer
zone:
the
number
of
serial
bits
per
word
time
in
the
inner
zone)
offer
words
trans-
fer
rates
of
O.
8
to
1.
2
megaHertz.
At
the
initial
contact
with
General
Precision,
Inc.,
they
indicated
a
tentative
preference
for
organizations
involving
concurrent
head
selection
on
a
single
face
because
they
did
not
think
they
had
adequate
information
at
that
time
with
respect
to
the
skew
problem.
4-7
SECTION
V
PROGRAMMING
PE
OPERA
TIONS
This
section
describes
modifications
and
additions
to
the
list
of
instructions
ex-
ecuted,
at
least
in
part,
by
the
logic
within
Processing
Elements.
A11
instruc-
tions
not
mentioned
here
are
functionally
unchanged
from
those
described
in
Sec-
tion
VII
of
Progress
Report
No.1
(TR-66-3;
August
26,
1966).
However,
some
changes
in
instruction
forma.ts
are
contemplated
to
accommodate
certain
new
in-
structions.
In
particular,
some
of
the
new
instructions
will
be
ma.de
var
iants
of
old
instructions
instead
of
new
op-
codes
and
these
old
instructions
will
have
their
formats
changed
to
incorporate
variant
fields.
In
32-bit
mode,
bits
0
and
32
are
the
sign-bits
of
the
two
operands
in
either
the
A
or
the
B
Register;
bits
1-7
and
33-39
are
the
exponents;
bits
8-31
and
40-
63
are
the
mantissa
magnitudes.
Fixed-
point
operations
always
involve
the
mantissa
magnitudes
and
the
mantissa
signs
only.
Thus,
in
32-
bit
mode
fixed-
point,
op-
erations
are
performed
on
24-bit-plus-sign
quantities.
When
an
operation
uses
-
operands
from
both
the
A
and
the
B
Registers,
corresponding
bits
from
the
two
registers
are
involved.
A s
an
example,
a
double-length
arithmetic
shift
(SHD)
treats
bits
8-
31
of
the
A
and
B
Registers
as
one
48-
bit
quantity
and
bits
40-
63
of
the
two
registers
as
another
48-bit
quantity.
Shlft
counts
are
interpreted
modulo
32
in
32-blt
mode.
The
PE
Index
Register
is
not
affected
by
the
word-length
mode.
For
example,
Load
Index
from
A
Register
(LXA)
always
transmits
bits
52-
63
of
the
A
Register
'to
the
Index
Register.
.
There
are
now
nine
mode-bits
per
PE,
designated
as
the
\V,
E,
E1,
F,
F1,
G,
I,
and
J
bits.
The
\\1'-
Bit
is
always
set
or
reset
by
the
Control
Unit
and
designates
the
word-length
(ZERO
implies
64-bit
operands,
ONE
implies
32-bit
operands').
5-1
5-2
In
64-bit
mode
the
E-Bit
enables
or
disables
the
changing
of
full
operand
registers
and
the
E1-
Bit
is
not
available
for
program
use.
In
32 -
bit
mode,
the
E-
Bit
con-
trols
the
changing
of
bits
0-
31
of
the
registers
and
the
E1-
Bit
controls
bits
32-
63:
In
test
instructions
executed
by
PEts,
any
register
bit
which
is
disabled
from
being
changed
is
also
disabled
from
being
tested
to
generate
a
change
in
any
"mode
bit.
In
64-bit
mode,
the
F-Bit
is
set
to
ONE
whenever
an
arithmetic
fault
occurs
in
the
PE
and
the
F1-Bit
is
not
available
for
program
use.
In
32-bit
mode,
the
F-Bit
is
set
to
ONE
whenever
a
fault
occurs
in
arithmetic
performance
on
bits
0-31
of
operands
and
the
F1-bit
is
set
to
ONE
whenever
a
fault
occurs
in
bits
32-63.
In
64-bit
mode,
the
instruction
list
is
augmented
to
permit
specification
of
anyone
of
the
G,
H,
I,
and
J
bits
to
hold
the
result
of
any
test.
For
example,
the
tests
for
the
A-Register
being
zero
now
include
GAZ
and
HAZ
(Set
G/H
if
A
Equals
Zero)
as
well
as
IAZ
and
JAZ.
The
Control
Unit
can
enable
or
disable
a
PE
as
a
result
of
the
condition
of
the
G
and
H
bits
in
a
manner
identical
to
the
previous
capability
with
the
I
and
J
bits.
In
32-bit
mode,
the
IAZ
instruction
sets
the
G-Bit
if
bits
0-31
of
the
A-Register
are
zero
and
sets
the
I-Bit
if
bits
32-63
of
the
A-Register
are
zero.
Of
course,
if
the
E-Bit
equals
ZERO,
the
G-Bit
is
unchanged
and
if
the
E1-Bit
equals
ZERO,
the
I-Bit
is
unchanged.
Similarly,
JAZ
affects
both
the
Hand
J
bits
in
3~-bit
mode.
The
operations
of
instructions
calling
for
setting
of
the
G
and
H
bits
(e.
g.
GAZ,
HAZ)
are
presently
undefined
for
32-bit
mode.
The
new
register
in
each
PE
is
designated
the
ItS
Register"
(for
Save
Operand
Register
or
Special
Register).
The
contents
of
the
S
Register
are
not
involved
or
altered
by
any
arithmetic
or
shift
instruction.
Three
instructions
have
been
defined
involving
the
S
Register:
SAS
Store
A
Register
in
S
Register
Copy
all
of
the
A
Register
into
the
S
Register.
If
in
64-
bit
mode
and
E =
ZERO,
do
not
change
the
S
Register.
If
in
32-bit
mode
and
E =
ZERO,
do
not
change
bits
0-31
of
the
S
Register.
If
in
32-
bit
mode
and
E1
=
ZERO,
do
not
change
bits
31-63
of
the
S
Register.
IBS
SWAPS
Load
B
Register
from
S
Register
Swap
A,
Band
S
Registers
A
Goes
to
S
B
Goes
to
A
S
Goes
to
B
There
is
now
a
set
of
instructions
which
permit
operand
transmissions
to
begin
and/
or
end
at
the
MDR
without
involving
the
A
and
B
Registers.
These
instruc-
tions
are
designed
so
that
intermediate
PE's
can
participate
in
multi-
PE
routing.
without
having
their
operand
registers
altered..
The
Control
Unit
has
a
single
instruction
which
causes
a
sequence
of
transfers
among
neighboring
PE's
to
accomplish
the
desired
multi-PE
routing.
The
instructions
transmitted
to
the
PE's
by
the
Control
Unit
in
response
to
the
single
instruction
executed
by
the·
Control
Unit
are
also
available
for
the
program
to
call
upon
individually.
These
instructions
include
MDR-to-MDR
transmission
with
direction
specified,
storage
from
the
A
or
the
B
Register
to
the
MDR
of
the
same
PE,
and
leading
of
the
A
or
B
Register
from
the
MDR
of
the
same
PEe
A
disabled
PE
does
not
execute
those
instructions
which
load
its
A
or
B
Register
but
does
participate
in
the
others.
This
control
of
which
PEls
finally
receive
multi-PE
transmissions,
coupled
with
the
programmed
control
of
the
path
distance
and
directions,
permit
the
single
Control
Unit
instruction
required.
In
effect,
all
PE'
s
originate
transmissions
but,
after
the
path
control
has
been
counted
down,
only
enabled
PE's
accept
the
result
of
the
transmission.
The
operands
which
arrive
at
disabled
PE'
s
are
retained
in
their
MDR'
s
and
may
be
discarded
by
the
next
instruction
causing
their
replacement.
However,
some
saving
of
transmission
time
may
be
accomplished,
for
certain
routing
patterns,
if
the
first
multi-PE
transmission
is
followed
by
a
new
transmission
that
starts
with
MDR-to-MDR
transmissions
and
which
terminates
with
a
different
set
of
PEls
being
enabled
to
accept
the
transmission
results
from
their
MDR's
into
their
A
and
B
Registers.
There
are
now
instructions
which
cause
the
clearing
of
the
exponent
field
of
the
A
or
the
B
Register.
Clearing
of
the
mantissa
field
has
always
been
available
as
an
end-off
arithmetic
shift
with
any
shift
count
exceeding
the
length
of
the
mantissa.
With
the
Barrel,
shifting
to
clear
a
field
takes
no
longer
than
the
execution
of
a
special
clear
field
instruction.
Obviously,
in
assembly
language
a
clear
mantissa
instruction
may
exist
which
would
assemble
as
a
shift.
There
is
now
a
round
variant
on
each
of
the
pertinent
arithmetic
instructions.
In
floating-point,
rounding
is
accomplished
before
normalization
to
preserve
the
prope
r
significance.
There
are
now
instructions
which
add
address
fields
to
the
previous
contents
of
the
Memory
Address
Register,
with
the
address
fields
being
optionally
provided
by
fields
within
the
instruction
or
the
PE
Index
Register
or
both.
This
permits
straightforward,
multi-level,
indirect
addressing
when
the
level
is
known.
There
is
now
an
instruction
which
stores
an
operand
from
the
A,
B
or
Memory
Data
Register
of
an
enabled
PE
to
the
Common
Data.
Bus.
When
only
one
PE
is
enabled,
this
provides
the
required
transmission
of
an
operand
from
a
PE
to
the
Control
Unit
without
requiring
memory
access.
When
two
or
more
PEls
are
enabled,
the
result
of
this
instruction
is
undefined.
There
are
now
instructions
which
carry
an
indexable-
6-bit
field
specifying
one
bit
within
either
the
A
or
the
B
Registers.
The
bit
number
is
interpreted
modulo
64
in
64-bit
mode..
In
32-bit
mode,
the
bit
number
is
interpreted
modulo
32
and
as
modulo
32
plus
32,
allowing
specification
of
corresponding
'bits
in
both
halves
of
the
Register.
The
operations
on
and
with
the
specified
hit
are:
Set
either
the
G,
H,
I
or
J
bit
to
equal
the
specified
bit.
In
32-bit
~ode,
only
I
and
J
are
defined.
)
5-3
5-4
Change
the
specified
bit.
Set
the
specified
bit
to
ZERO.
Set
the
specified
bit
to
ONE.
There
are
now
instructions
which
perform
arithmetic
on
the
magnitudes
of
the
operands.
There
are
now
instructions
which
compare
the
magnitude
of
the
A
Register
with
the
magnitude
of
the
B
Register.
Also,
the
magnitude
of
either
register
may
be
compared
with
the
Common
Value.
Comparison
with
the
magnitude
of
the
Common
Value
is
not
included
as
a
separate
PE
inst
ruction
since
this
may
be
accomplished
by
having
the
Control
Unit
broadcast
the
magnitude
only
of
the
Common
Value.
The
instructions
which
operate
upon
or
compare
the
value
of
the
PE
Index
Register
have
been
augmented
so
that
the
Index
Register
may
be
stored
in,
loaded
from,or
compared
with
the
least
significant
twelve
bits
of
the
B
Register
or
a
memory
word.
When
a
memory
word
is
involved
in
any
of
these
instruc-
tions,
the
address
in
memory
is
indexable.
CONTROL
UNIT
INSTRUCTIONS
In
the
following
numeric
list
of
instructions,
the
first
syllable
is
given
in
octal.
Op-code
"000"
is
interpreted
by
the
CU
as
a
halt.
Op-codes
"001-177"
are
ex-
ecuted
in
part
by
the
CU
and
part
by
the
PE
array.
Op
-codes"
200
-377"
are
interpreted
fully
by
the
CD
and
no
direct
PE
action
results.
In
ulti
-syllabic
instructions,
the
following
abbreviations
are
used
to
indicate
the
coding
of
the
syllables
other
than
the
first:
A
bit
which
does
not
affect
the
operation
of
the
instruction
being
described.
a A
bit
which
is
part
of
an
address
or
literal
field.
b A
bit
which
is
part
of
a
field
which
designates
a
bit-
number
within
a
register.
c A
bit
which
is
part
of
a
shift
count
(
in
some
instructions
a
bit-number).
C
An
eight-bit
syllable
used
to
designate
an
address
within
CU
local
memory.
d A
bit
which
designates
shift
direction.
e A
bit
which
distinguishes
between
end
-off
and
end
-around
shifts.
End
-around
shifts
are
mnemonically
referred
to
as
"Rotate.
II
i A
variant
bit
which
is
defined
as
ONE
for
the
specific
variaI1t
being
discussed.
o A
variant
bit
which
is
defined
as
ZERO
for
the
variant
being
discussed.
L
An
eight
-bit
syllable
which
is
part
of
a
literal
field.
M
An
eight
-bit
syllable
which
is
part
of
an
address
field
used
by
the
CD
to
address
IVlain
Memory.
v A
variant
bit
which
has
meaning
in
defining
a
variant
described
elsewhere.
Examples
of
this
are
given
following
this
list.
x A
variant
bit
which
controls
indexing,
by
the
PE
Index
Register,
of
the
address,
shift-count
or
bit-number
field
given
in
the
in-
struction.
As
an
example
of
the
use
of
these
abbreviations
in
the
instruction
list,
consider
the
instructions
which
transmit
data
between
neighboring
PE's.
Each
of
these
instructions
starts
with
an
op
-code
syllable
equal
to
octal
120
and
has
a
second
syllable
specifying
variants.
The
variants
permit
the
choice
of
the
A
Register,
the
B
Register
or
the
Memory
Data
Register
as
the
transmission
origin;
the
same
three
registers,
or
the
Memory
Address
Register,
may
be
the
destination;
the
transmission-direction
may
be
North,
East,
South,
or
West.
Two
bits
are
used
for
each
of
these
variants,
leaving
two
undefined
bits
in
the
variant
syllable.
The
two
most
significant
bits
specify
the
originating
register,
the
next
two
bits
specify
the
destination
register
and
the
two
least
Significant
bits
specify
the
direction.
Transmission
from
the
Memory
Data
Register
to
the
Memory
Data
Register
of
the
North
neighbor
is
indicated
by
coding"
10"
for
each
of
the
register
-designation
bits
and
"00"
for
the
direction:
1010
..
00,
where
the
periods
indicate
the
undefined
bits.
In
the
instruction
list
this
would
appear
"ioio
..
00".
However,
to
avoid
listing
each
possible
variant
separately,
the
list
contains
the
following
t.hree
entries:
120
iovv
..
vv
120
vvio
..
vv
120
vvvv
..
00
D-T-
-DT-
--TN
The
first
entry
denotes
the
coding
of
the
bits
that
specify
the
origin
and
indicates
that
the
other
variant
bits
have
no
affect
on
the
origin.
Similarly,
the
second
entry
denotes
destination
and
the
third
entry
denotes
direction.
The
mnemonic
abbre-
viations
show
the
characters
that
represent
the
variants
and
hyphens
are
used
for
character
positions
that
are
used
to
denote
other
variants.
In
the
specific
example
being
considered,
"DDTN"
means
MDR
-to-MDR
transmission
North
and
is
encoded
"ioio
..
00".
5-5
5-6
Elements
of
the
Control
Unit
The
control
unit
is
conveniently
described
in
terms
of
its
registers
and
other
functional
entities.
The
registers
consist
of:
Eight
mode
function
registers,
E
l'
E2, F
l'
F , G, H, I
and
J
although
physically
located
in
the
processing
e~ements,
are
addressed
by
program
as
though
they
were
registers
of
the
control
unit
Index·
registers,
each
32
bits
long
64-bit
registers
in
high
speed
scratchpad
storage.
Sixty-four
of
these
can
be
used
as
the
broadcast
buffer.
Program
counter
Interrupt
register
and
mask,
or
interrupt
control,
register
Local
register
address
pointer,
8
bits
of
register
address.
The
above
registers
are
addressible
uniformly
using
an
8
-bit
register
address
in
the
instructions.
In
addition,
there
are
several
registers
which
are
not
addressible
by
the
8
-bit
field,
but
are
implicity
addressed
by
all
instructions
which
are
relevant
to
their
use.
They
are:
'.
Program
lookahead,
for
holding
a
reservoir
of
program
steps
independently
of
memory
accessing.
Address
register,
which
serves
as
an
accummulator
for
address-
sized
fields.
It
is
24
bits
long.
A
ccumulator
(for
want
of
a
better
name),
a
common
register
referenced
by
all
data
manipulating
instructions.
It
is
64
bits
long.
A
queue
of
instructions
and
q,ata,
which
decouples
the
operations
which
are
solely
within
the
CU
from
those
that
refer
to
the
PE's.
This
queue,
like
that
in
the
B8500,
has
no
effect
on
the
operations
of
the
instructions
except
to
permit
a
certain
amount
of
parallelism,
and
therefore
is
not
discussed
further.
Discussion
of
the
Instructions
All
registers
within
the
CU
are
uniformly
addressed
by
an
8
-bit
field,
which
is
in-
dexable.
The
operation
of
transferring
the
I
Register
to
the
E
Register,
which
is
"enable
those
PE's
which
had
a
true
result
on
the
most
recent
test
involving
the
I-Bit,
"
is
accomplished
by
the
same
sequence
of
instructions
as
transferring
register
number
24
to
register
number
42
which
merely
moves
data
around
the
scratchpad.
Similarly,
a
jump
is
accomplished
by
storing
a
new
value
in
the
numbered
register
which
is
the
program
counter.
A
bonus
which
comes
from
this
approach
is
that
capabili
ty
which
is
invented
for
the
use
of
one
special
case,
such
as
reading
a
PE
number
from
the
E
register,
can
be
used
on
any
data
within
the
CD,
thus
increasing
programming
flexibility.
The
CUAccummulator
is
central
to
most
CU
operations.
Its
use
is
implied
in
most
CD
data
transmission
instructions.
Whenever
an
instruction
is
transmitted
from
the
CD
to
the
PE
array,
the
CD
Accummulator
may
modify
the
PE
instruction
or
otherwise
participate
in
the
operation
of
the
PEt
For
example,
the
PE
in-
struction
"Load
A
Register
from
Common
Value"
(LAC)
transmits
the
contents
of
The
CD
Accummulator
to
the
A
Registers
of
all
enabled
PE's.
The
accummulator
also
receives
the
transmission
from
the
PE
in
the
"Store
A
Register
to
Common
Data
Bus"
Instruction
(SAC).
When
an
instruction
with
an
address,
shift-count,
or
bit-number
field
is
trans-
mitted
to
the
PE's,
the
contents
of
the
CU
Accumnlulator
are
added
to
this
field
before
transmission.
Thus,
multiple
indexing
with
CU
index
registers
is
accom-
plished
by
ordinary
addition
to
the
CD
Accummulator
before
issuance
of
the
PE
instruction.
The
CU
also
has_
one
special
index
register
for
its
own
use
in
addressing
w:ithin
its
local
memory.
The
act
of
placing
a
value
in
this
index
automatically
causes
the
addition
of
this
value
to
the
next
instruction
with
a
CD
register
number.
This
register
is
always
cleared
after
its
use.
000
HALT
Halt
All
Operations
002
CHSA
Change
Sign
of
A
Register
003
CHSB
Change
Sign
of
B
Register'
004
SAP
Set
A
Register
Positive
005
SBP
Set
B
Register
Positive
006
SAN
Set
A
Register
Negative
007
SBN
Set
B
Register
Negative
010
CEA
Clear
Exponent
of
A
Register
011
CEB
Clear
Exponent
of
B
Register
013
CMB
Complement
B
Register
015
CLB
Clear
B
Register
016
NORM
Normalize
A
Register
020
SAD
Store
from
A
Register
to
Memory
Data
Register
021
SBD
Store
from
i3
Register
to
Memory
Data
Register
022
LAD
Load
A
Register
From
Memory
Data
Register
5-7
5-8
023
024
025
026
027
.
030
031
032
033
034
035
036
037
040
041
042
043
044
045
046
047
050
051
052
053
054
055
06_0
061
062
063
064
LBD
SAC
SBC
LAC
LBC
LMA
LMB
LMC
LMD
LXA
LXB
LXC
LXD
SXA
SXB
SXC
SXD
ADAX
ADBX
ADCX
-ADDX
LDAM
LDBM
STAM
STBM
LDMM
LDDM
SAS
LBS
SWAP
SWS
DBA
Load
B
Register
from
Memory
Data
Register
Store
from
A
Register
to
Common
Data
Bus
Store
from
B
Register
to
Comn1on
Data
Bus
Load
A
Register
from
Common
Value
Load
B
Register
from
COlnmon
Value
Load
Memory
Address
Register
from
A
Register
Load
Memory
Address
Register
from
B
Register
Load
Memory
Address
Register
from
Common
Value
Load
Memory
Address
Register
from
Memory
Data
Register
Load
Index
Register
From
A
register
Load
Index
Register
from
B
Register
Load
Index
Register
from
Common
Value
Load
Index
Register
from
Memory
Data
Register
Store
Index
Register
in
A
Register
Sotre
Index
Register
in
B
Register
Store
Index
Register
to
Common
Data
Bus
Store
Index
Register
in
l\1emory
Data
Register
Add
A
Register
to
Index
Register
Add
B
Register
to
Index
Register
Add
Common
Value
to
Index
Register
Add
Memory
Data
Register
to
Index
Register
Load
A
Register
as
Designated
by
Memory
Address
Register
Load
B
Register
as
Designated
by
Memory
Address
Register
Store
A
Register
as
Designated
by
Memory
Address
Register
Store
B
Register
as
Designated
by
Memory
Address
Register
Load
Memory
Address
Register
as
Designated
by
Memory
Address
Register
Load
Memory
Data
Register
as
Designated
by
Memory
Address
Register
Store
A
Register
in
S
Register
Load
B
Register
from
S
Register
Swap
A
and
BRegisters
Swap
with
S
Register
(A
to
S: S
to
B:
B
to
A)
Duplicate
B
Register
from
A
Register
100
xdcc
cccc
SHA
Shift
A
Register
Mantissa
101
xdcc
cccc
SHB
Shift
B
Register
Mantissa
102
xdcc
cccc
SAL
Shift
A
Register
Logical
103
xdcc
cccc
SBL
Shift
B
Register
Logical
104
xdcc
cccc
RAL
Rotate
A
Register
Logical
105
xdcc
cccc
RBL
Rotate
B
Register
Logical
106
xdcc
cccc
SHD
Shift
Double
-Length
Mantissa
107
xdcc
cccc
SDL
Shift
Double
-Length
Logical
110 x.
bb
bbbb
CHBA
Change
Specified
Bit
of
A
Register
111 x.
bb
bbbb
CHBB
Change
Specified
Bit
of
B
Register
112
x.
bb
bbbb
SBA
Set
Specified
Bit
of
A
Register
113 x.
bb
bbbb
SBB
Set
Specified
Bit
of
B
Register
114 x.
bb
bbbb
CLBA
Clear
Specified
Bit
of
A
Register
115
x.
bb
bbbb
CLBB
Clear
Specified
Bit
of
B
Register
120
oovv
..
vv
A-T-
Inter
-PE
Transmission
frorn
A
Register
120
oivv
..
vv
B-T-
Inter
-PE
Transmission
from
B
Register
120
iovv
•.
vv
D-T-
Inter-PE
Transmission
from
Memory
Data
Register
120
vvoo
•.
vv
-AT-
Inter
-PE
Transmission
to
A
Register
120
vvoi
..
vv
-BT-
Inter-PE
Transmission
to
B
Register
120
vvio
..
vv
-DT-
Inter
-PE
Transmission
to
Memory
Data
Register
120
vvii
..
vv
-MT-
Inter
-
PE
Transmission
to
l\1emory
Addres
s
Re
gister
120
vvvv
••
00
--TN
Inter-PE
Transmission
North
120
vvvv
· .
oi
--TE
Inter
-
PE
Transmis
sion
East
120
vvvv
· .
io
--TS
Inter-PE
Transmission
South
120
vvvv
· .
ii
--TW
Inter-PE
Transmission
West
121
oioo
ioio
CLA
Clear
A
Register
BOOOO
Boolean
Function
0000
121
oHi
ioio
AND
A
AND
B;
Result
to
A
Register
BOO01
Boolean
Function
0001
121
oiii
ioii
NIMP
Not
(A
Implication
B);
Result
to
A
Register
B0010
Boolean
Function
0010
121·
oiii
iiio
NRIMP
Not(B
Implication
A);
Result
to
A
Register
BO
100
Boolean
Function
0100
5-9
5-10
121
oooi
ioio
121
ioii
ioio
121
ooii
ioio
121
o
iii
iiii
121
ioii
ioii
121
oooi
ioii
121
ooii
ioii
121
ooio
iiio
121
ooli
iiio
12
1
iiii
iiii
122
oovv
vvvv
122
oivv
vvvv
122
vvvv
vvvo
122
vvvv
vvvi
122
vvoo
ooov
122
vvoo
ooiv
122
vvoo
oiov
122
vvoo
oiiv
122
vvoo
ioov
122
vvoo
ioiv
122
.
vvoo
iiov
122
vvoo
iiiv
122
vvoi
ooiv
DAB
'B0101
EOR
BOlIO
OR
BOllI
NOR
BIOOO
MEQ
B100l
NOTB
B1010
RIMP
BI011
CMA
Bl100
IMP
BIlOl
NAND
B1ll0
Duplicate
A
Register
fromB
Register
Boolean
Function
0101
A
Exclusive
OR
B;
Result
to
A
Register
Boolean
Function
0110
A
OR
B;
Result
to
A
Register
Boolean
Function
0111
Not
(A
or
B);
Result
to
A
Register
Boolean
Function
1000
A
Material
Equivalence
B;
Result
to
B
Register
Boolean
Function
1001
Complement
of
B
Register
Transmitted
to
A
Register
Boolean
Function
1010
B
Implication
A;
Result
to
A
Register
Boolean
Function
1011
Complement
A
Register
Boolean
Function
1100
A
Implication
B;
Result
to
A
Register
Boolean
Function
1101
Not
(A
AND
B);
Result
to
A
Register
Boolean
Function
1110
Perform
A
rithmetic
on
Sign
and
Magnitude
----M
Perform
Arithmetic
on
Magnitude
Only
----R
No
Rounding
of
Result
----R
Round
Result
ADD
Fixed
Point
Add
A
to
B;
Result
to
A
Register
UFAD
Unnormalized
Floating
Add
A
to
B;
Result
to
A
Register
SUB
Fixed
Point
Subtract
B
from
A;
Result
to
A
Register
UFSU
Unnormalized
Floating
Subtract
B
from
A;
Result
to
A
Register
MUL
Fixed
Point
Multiply
A
by
B;
Result
to
A & B
UFMU
Unnormalized
Floating
Multiply
A
by
B;
Result
toA
& B
DIV
Fixed
Point
Divide
A
by
B;
Quotient
to
A~
Remainder
to
B
UFDV
Unnormalized
Floating
Divide
A
by
B;
Quotient
to
A~
Remainder
to
B
FAD
Float:ing
A
dd
A
to
B;
Result
to
A
122
:vvoi
oiiv
122
vvoi
ioiv
122
vvoi
iiiv
122
vvio
ooiv
122
vvio
oiiv
122
vvio
iiov
123
iiii
iiio
123
iiii
ioii
123
ioii ioii
130
oovv
vvvv
130
oivv
vvvv
130
iovv
vvvv
130
iivv
vvvv
130
vvoo vvvv
130
vvoi
vvvv
130
vvio
vvvv
130
vvvv
oovv
130
vvvv
oivv
130
vvvv
iovv
130
vvvv
vvoo
130
vvvv
vvoi
130
vvvv
vvio
131
oovv
vv
..
131
oivv
vv
..
FSU
FMU
FDV
EAD
ESU
IDV
GR8
LS8
EQ8
G---
H---
1---
J---
Floating
Subtract
B
from
A;
Result
to
A
Floating
Multiply
A
by
B;
Result
to
A & B
Floating
Divide
A
by
B;
Quotient
to
A,
Remainder,
to
B
Extend
Precision
After
Floating
Add;
Extension
of
Sum
to
A
Register
Extend
Precision
After
Floating
Subtract;
Extension
of
Difference
to
A
Register
Integer
Divide
A
by
B;
Quotient
to
A,
Remainder
to
B
Te
st
8-
bit
Bytes
for
A
Greater
than
B;
Result
to
A
Test
8-
Bit
Bytes
for
A
Less
than
B;
Result
to
A
Test
8-Bit
Bytes
for
A
Equal
to
B;
Result
to
A
Set
G-Bit
as
Result
of
Comparison
of
A
Register
with
B
Register
Set
H-Bit
as
Result
of
Comparison
of
A
Register
with
B
Register
Set
I-Bit
as
Result
of
Comparison
of
A
Register
with
B
Register
Set
J
-Bit
as
Result
of
Comparison
of
A
Register
with
B
Register
Compare
Sign
and
Magnitude
of
A
Register
with
Specified
State
of
B
Register
-
M-
- -
Compare
Magnitude
only
of
A
Register
with
Specified
State
of
B
Register
-
L---
Compare
Logical
Value
of
A
Register
with
Specified
State
of
B
Register
--
--
-
Compare
Sign
and
Magnitude
of
B
Register
with
Specified
State
of
A
Register
- -
--
M
Compare
Magnitude
Only
of
B
Register
with
Specified
State
of
A
Register
----L
Compare
Logical
Value
of
B
Register
with
Specified
State
of
A
Register
--
EQ-
Test
for
A
Equal
to
B
--
LS-
Test
for
A
Less
than
B
--GR-
Test
for
A
Greater
than
B
G---
Set
G-Bit
as
Result
of
Test
of
A
or
B
Register
H---
Set
H-Bit
as
Result
of
Test
of
A
or
B
Register
5-11
131
iovv
vv
.•
l---
Set
1-
Bit
as
Result
of
Test
of
A
or
B
Register
131
iivv
VV
••
J---
Set
J -
Bit
as
Result
of
Test
of
A
or
B
Register
131
vvov
vv
..
-A--
Test
A
Register
131
vviv
vv
..
-B--
Test
B
Register
131
vvoo
oi.
.
--Z
Test
for
Plus
or
Minus
Zero
131
vvvo
io
..
--PZ
Test
for
Plus
Zero
131
vvvo
ii.
.
--0
Test
for
ALL
ONES
(Minus
Full
Scale)
131
vvvi
00
••
--S
Copy
Sign-
Bit
to
Designated
Test
Bit
132
oovv
vv
..
GX--
Set
G-Bit
as
Result
of
Index-Register
Test
132
oivv
vv
..
HX··-
Set
H-Bit
as
Result
of
Index-Register
Test
132
iovv
vv
..
IX--
Set
I-Bit
as
Result
of
Index-
Register
Test
132
iivv
vv
..
JX--
Set
J
-Bit
as
Result
of
Index-
Register
Test
132
vvoo
vv
..
-XE-
Compare
Index
Register
for
Equality
with
Specified
Quantity
132
vvoi
vv
..
-XL-
Test
for
Index
Register
Less
than
Specified
Quantity
132
vvio
vv
..
-XG-
Test
for
Index
Register
Greater
than
Specified
Quantity
132
vvii
vv
..
-XZ
Test
for
Index
Zero
132
vvvv
00
••
-X-A
Compare
Index
Register
with
A
Register
132
vvvv
oi.
.
-X-B
Compare
Index
Register
with
B
Register
132
vvvv
io
..
-X-C
Compare
Index
Register
with
Common
Value
132
vvvv
ii
..
-X-D
Compare
Index
Regsiter
with
Memory
Data
Register
134
xobb
bbbb
GBA
Set
G-Bit
to
Designated
Bit
of
A
Register
134
xibb
bbbb
GBB
Set
G-Bit
to
Designated
Bit
of
B
Register
135
xobb
bbhb
HBA
Set
H-Bit
to
Designated
Bit
of
A
Register
135
xibb
bbbb
HBB
Set
H -
Bit
to
Designated
Bit
of
B
Register
136
xobb
bbbb
lBA
Set
I-Bit
to
Designated
Bit
of
A
Register
136
xibb
bbbb
lBB
Set
l-
Bit
to
Designated
Bit
of
B
Register
137
xobb
bbbb
JBA
Set
J -
Bit
to
Designated
Bit
of
A
Register
137
xibb
bbbb
JBB
Set
J
-Bit
to
Designated
Bit
of
B
Register
5-12
140
oovv
aaaa
aaaa
aaaa
GX-L
Set
G-Bit
as
Result
of
Comparison
of
Index
Register
with
Literal
140
oivv
aaaa
aaaa
aaaa
HX-L
Set
H-Bit
as
Result
of
Comparison
of
Index
Register
with
Literal
140
iovv
aaaa
aaaa aaaa
IX-L
Set
I-
Bit
as
Result
of
Comparison
of
Index
Register
with
Literal
140
iivv
aaaa
aaaa aaaa
JX-L
Set
J
-Bit
as
Result
of
Comparison
of
Index
Register
with
Literal
140
vvoo
aaaa
aaaa
aaaa
-XEL
Test
for
Index
Equal
to
Literal
140
vvoi
aaaa
aaaa
aaaa
-XLL
Test
for
Index
Less
than
Literal
140
vvio
aaaa
aaaa
aaaa
-XGL
Test
for
Index
Greater
than
Literal
141 ·
ovv
aaaa
aaaa aaaa
L-L
Load
Specified
Register
with
Literal
141 ·
ivy
aaaa
aaaa
aaaa
ADL-
Add
Literal
to
Specified
Register
141 ·
voo
aaaa aaaa
aaaa
LALj
ADLA
Specify
A
Register
141 ·
vol
aaaa
aaaa
aaaa
LBLj
ADLB
Specify
B
Register
141
via
aaaa
aaaa
aaaa
LXLj
ADLX
Specify
Index
Register
150
x.
00
aaaa aaaa aaaa
STA
Store
A
Register
150
x.oi
aaaa
aaaa aaaa
STB
Store
B
Register
150
x.
io
aaaa
aaaa
aaaa
STD
Store
Memory
Data
Register
150
x.ii
aaaa
aaaa
aaaa
STX
Sotre
Index
Register
.....
151
x.oo
aaaa
aaaa
aaaa
LDA
Load
A
Register
151
v.oi
aaaa
aaaa
aaaa
LDB
Load
B
Register
151
x.
io
aaaa
aaaa
aaaa
LDD
Load
:LyIemory
Data
Register
151
x.ii
aaaa aaaa aaaa
LDX
Load
Index
Register
152'
xooo
aaaa
aaaa aaaa
LDM
Load
Memory
Address
Register
152
xioo
aaaa
aaaa aaaa
ADM
Add
to
Memory
Address
Register
152
xiii
aaaa
aaaa
aaaa
ADMX
A
dd
from
Memory
to
Index
Register
200
DUP
Duplicate
Non-Zero
Half
of
CU
A
ccummulator
.201
FULL
Enter
Full-
Word
(64-bit)
Mode
202
HALF
Enter
Half-
Word
Mode
204
ZEROF
If
CU
Accumulator
Is
Not
All
ZEROS~
Skip
Until
the
Next
"UNSKIpl!
Instruction
5-13
205
ZEROT
If
CU A
ccumulator
is
All
ZEROS,
Skip
Until
the
Next
"UNSKI
P"
Instruction
206
ONESF
If
CU
Accumulator
Is
Not
All
ONES,
Skip
Until
the
Next
"UNSKIP"
Instruction
207
ONEST
If
CU A
ccum
ulator
is
all
ONES,
Skip
Until
the
Next
"UNSKIP"
Instruction
210
SKIPF
If
the
Result
of
the
Last
Test
was
False,
Skip
Until
the
Next
"UNSKIP"
In-
struction
211
SKIPT
If
the
Result
of
the
Last
Test
was
True,
Skip
Until
the
Next
"UNSKIP"
In-
struction
212
UNSKIP
Resume
Executing
All
Instructions
220
LEA
DO
Find
Leading
ONE
in
the
CU
Accum-
ulator;
Put
Bit-
Number
in
CU
Accumulator
221
LEADZ
Find
Leading
ZERO
in
the
CU
Accum-
ulator;
Put
Bit-
Number
in
CU
Accumulator
230
CCL
Clear
CU
A
ccumulato!'
231
CCOM
Complement
CU
Accumulator
232
XCUA
Index
by
CU
Accumulator
240
aaaa aaaa
STL
Store
CU A
ccumulator
in
CU
Local
Memory
241
aaaa
aaaa
STLC
Store
CU A
ccumulator
in
CU
Local
Memory,
Complemented
242
aaaa
aaaa
LDL
Load
CU
f..
ccumulator
from
CU
Local
Memory
244
aaaa
aaaa
EXCH
Exchange
CU. A
ccumulator
with
CU
Local
Memory
245
aaaa
aaaa
EXCHC
Exchange
Complement
of
CU
Accumulator
with
CU
Local
Memory
246
aaaa
aaaa
CADD
Add
to
CU
Accumulator
247
aaaa aaaa
CSUB
Substract
from
CU
Accumulator
250
aaaa
aaaa
CAND
AND
to
CU
Accumulator
251
aaaa
aaaa
COR
OR
to
CU
Accumulator
252
aaaa
aaaa
CEOR
Exclusive
OR
to
CU
Accumulator
5-14
260
oobb
bbbb
CCLB
Clear
Designated
Bit
260
oibb
bbbb
CSBO
Set
Designated
Bit
to
ONE
260
iobb
bbbb
DDHB
Change
De
signated
Bit
261
edcc
cccc
CSHIFT
Shift
262
oobb
bbbb
CTSBZ
Test
Bit
for
ZERO
262
oibb
bbbb
CTSBO
Test
Bit
for
ONE
270
aaaa
aaaa
EQUALT
271
aaaa aaaa
EQUALF
272
aaaa
aaaa
GRTRT
273
aaaa aaaa
GRTRF
274
aaaa
aaaa
LESST
275
aaaa
aaaa
LESSF
276
aaaa aaaa
XADD
Add
to
CU
Index
277
aaaa
aaaa
SUB
Subtract
from
CU
Index
30U
ooVY
nnnn
nnnn
RTA-
Route
from
A
Registers
300
oivv
nnnn
nnnn
RTB-
Route
from
B
Register
s
300
iovv
nnnn
nnnn
RTD-
Route
from
Memory
Data
Registers
300
vvoo
nnnn nnnn
RT-A
Route
to
A
Registers
300
vvoi
nnnn
nnnn
RT-B
Route
to
B
Registers
300
vvio
nnnn nnnn
RT-D
Route
to
Memory
Data
Registers
300
vvii
nnnn
nnnn
RT-M
Route
to
Memory
Addr(;ss
Registers
310
LLL
SLIT
Short
(24-
bit)
Literal
to
CU
Accumulator
311
MMM
STO
Store
One
"Vord
from
CU
Accumulator
to
Main
Memory
312
MMM
LOAD
Load
One
Word
from
Main
Memory
to
CU
Accumulator
320
CMMM
BIN
Block
Transfer
into
CU
Memory
from
Main
Memory
321
CMMM
BOUT
Block
Transfer
Out
from
CU
Memory
to
Main
Memory
3"30
LLLLLLLL
LIT
Full-
Word
Literal
to
CU
Accumulator
5-15
SECTION
VI
ILLIAC
IV
APPLICATIONS
STUDY
The
implementation
on
ILLIAC
IV
of
the
Cooley-Tukey
algorithm
for
the
calculation
of
complex
Fourier
series
has
been
studied.
A
method
is
described
in
this
section.
DESCRIPTION
OF
THE
COOLEY-TUKEY
ALGORITHM
This
is
a
method
for
evaluating
the
function
X{j)
for
N
values
of
the
argument
(j
=
0,1,
...
,N-1)
when
we
are
given
the
N
complex
coefficients
A(k),
k =
0,1,
••.
,
N
-I,
appearing
in
the
Fourier
sum
that
is
used
to
define
the
function
X(j).
N-1
X
(j)
= I A
(k).
Wj k
k=O
j =
O.
1
•••.
,
N-1.
Here
W
is
defined
to
be
the
principal
N
-th
root
of
unity.
w = e
27Ti
/N
I
(i
=J-T).
(1)
(2)
rhe
inverse
problem
can
also
be
solved
by
the
same
method.
for
if
the
N
values
of
X(j)
corresponding
to
j =
0.1
•••.•
N-1
are
given.
then
the
Fourier
coefficients
appearing
in
equation
(1)
are
defin
ed
by
N-1
A(k)
=;
L
X(j)
W~jk
j=
0
which
is
similar
in
form
to
equation
(1).
k =
O.
1
•••••
N-1
(3)
6-1
6-2
The
following
discussion
is
concerned
with
the
problem
as
stated
in
the
form
given
in
equation
(1).
The
straightforward
use
of
equation
(1)
is
equivalent
to
pre-multiplying
the
N-
component
vector
A(k)
by
the
NxN
matrix
Wjk
to
obtain
the
N
-component
solution
vector
X(j).
This
is
easily
implemented
on
ILLIAC
IV
which
computes
the
com-
ponents
of
X
in
parallel.
The
total
number
of
operations
required
would
be
about
N2
where
an
operation
is
considered
to
be
a
complex
multiplication
followed
by
a
complex
addition.
The
algorithm
des
cribed
by
Cooley
and
Tukey*
can
achieve
the
same
result
with
much
less
computation.
The
number
of
operations,
in
the
most
favorable
case,
is
proportional
to
N.
log
N
rather
than
to
N2.
It
is
also
economical
in
st'orage
requirements.
These
features
make
it
a
highly
desirable
method
for
this
problem,
especially
for
large
values
of
N.
Cooley
and
Tukey*
showed
that
choosing
N
to
be
a
power
of
2
(N
=
2m)
has
par-
ticular
advantages
for
computation
on
a
binary
machine.
"Vith
this
choice,
the
algorithm
takes
the
form
of
generating
iteratively
a
sequence
of
m
N-component
vectors.
The
first
member
of
the
sequence
is
derived
by
iteration
on
the
vector
A(k)
and
the
final
member
is
the
required
vector
X(j).
It
is
assumed
throughout
what
follows
that
the
choice
N = 2m
has
been
made.
To
define
the
sequence
of
vectors
the
indices
are
written
in
binary
form.
The
equation
(1)
can
then
be
written
,
...
, L
k
m-1
(4a,
b)
By
evaluating
the
indicated
sums
sequentially,
the
following
definition
of
the
sequence
Ar
(r
= 1, 2,
•••
,
m)
is
obtained.
The
notation
used
is
that
of
Cooley
and
Tukey
except
that
the
iteration
parameter
is
represented
by
r
instead
of
Z,.
for
ease
of
typing.
*
J.
W.
Cooley
and
J.
W.
Tukey:
An
Algorithm
for
the
Machine
Calculation
of
Complex
Fourier
Series.
Mathematics
of
Computation,
Vol.
19
(1965).
pp.
297
-
301.
(5)
A
(jo'·
••
' j
l'
k
l'
• •
.,
ko
>:=
I
r
r-
m-r-
k
Pr
A 1
(jo'
• • , j
2'
k , • • , k O
W
r-
r-
m-r
m-r
where
p
:=
j k 2m-1
1 0
m-1
(6)
p
:=
(j
r
r-1
r
-1
+ . ) k 2m - r
2
+,
•••
, JO
m-r
Writing
out
the
two
terms
of
the
sum
in
equation
(5)
we
obtain
A
(jo,
•.•
,j
l,k
1,···,k
O
):=A
l(jO,···,j
2:
0,
k
I'···'
kO) +
(7)
r
r-
m-r-
r-
r-
m-r-
( - 1 ) j r
-1
A 1 {j
0'
.
,j
2'
1, k
l'
• • • , k
0>.
W
qr
r-
r-
m-r-
r
:=
2, 3,
•••
, m
where
-
(J'
2r
-2
+ +
J'
).
2m-r
qr
- r - 2 , . . . , 0
(8
)
The
desired
components
of
X
are
then
defined
by
the
last
menlber
of
the
sequence.
(9)
MACHINE
IMPLEMENTA
TION
It
was
the
suggestion
of
Cooley
and
Tukey
that
the
value
of
A
(j
0'
• •
.,
j
I'
k
l'
• . , k O)
. r
r-
m-r-
calculated
by
means
of
equation
(7)
be
stored
in
a
location
whose
address
is
.•
2m-1
+
m-r
m-r-l
JO ' • • t + j
2 + k
2 + ,
.•
q +
kO·
.
r-
m-r-
6-3
6-4
When
this
is
done
the
storage
requirements
are
minimized
and
the
last
array
cal-
culated
gives
the
desired
Fourier
sum
s,
equation
(9),
in
such
an
order
that
the
index
of
an
X
must
have
its
binary
bits
put
in.
reverse
order
to
yield
its
index
in
array
Am.
On
any
iteration,
the
components
of
Ar
may
be
computed
in
parallel
since
the
cal-
culations
defined
by
equation
(7)
may
be
carried
0 ut
with
all
values
of
jo'
.
.,
j
r-
2
and
kO'
•.
, k 1
simultaneously.
m-r-
IMPLEMENTATION
ON
ILLIAC
IV
In
order
to
perform
the
calculations
indicated
on
ILLIAC
IV,
it
is
necessary,
on
the
r-th
cycle,
to
have
the
values
of
both
A 1
(jo'
••
'.
j
2'
0,
k
l'
•••
,
kO)
r-
r-
m-r-
and
q
available
in
the
same
PE
memory.
The
value
of
W r
must
also
be
available
to
the
PE.
One
then
computes,
according
to
equation
(7),
the
value
of
Ar
with
the
same
two
indices.
To
obtain
the
values
of
A
with
the
desired
pair
of
indices
in
the
same
PE
memory
it
may
be
necessary
to
slln'l
data
from
PE
to
PE
during
the
course
of
the
calculations.
The
constant
powers
of
W
required
by
each
PE,
during
the
entire
course
of
the
computation,
are
predetermined
by
the
method
in
which
the
original
coefficients
A{k)
are
distributed
within
the
PE
memories
and
by
the
scheme
adopted
for
shifting
data
between
PE
memories
as
the
compu
tation
proceeds.
These
details
will
now
be
discuss
ed.
STORAGE
OF
THE
COEFFICIENTS
A(k)
Use
is
made
of
the
fact
that
N
has
been
chosen
to
be
a
power
of
2:
N::
2m.
It
is
further
assumed
that
m
is
greater
than
8.
To
determine
the
location
in
which
a
particular
A(k)
is
stored,
the
representation
of
k
as
a
binary
number
as
in
equation
(4b)
is
used.
Let
the
last
eight
bits
(k
7,
•••
,kO)
of
this
binary
representa-
tion
of
k
define
the
PE
in
which
A(k)
is
stored.
The
las
t
two
of
these
bits
(k
1, kO)
d.etermine
the
number
o~
the
quadrant,
the
preceeding
six
bits
(k
7,
•••
,k
2)
deter-
mine
the
number
of
the
PE
within
the
quadrant.
The
remaini.ng
m - S
bits
(km-I,
•.•
, kS)
may
be
interpr
eted
as
the
add
res
s
of
a
storage
location
within
the
PE
memory.
To
allow
for
the
storage
of
both
real
and
imaginary
parts
of
A(k),
the
real
parts
can
be
stored
in
the
indicated
location
while
the
imaginary
part
can
be
stored
in
the
corresponding
location
of
another
block
of
m
emory,
congruent
to
the
block
in
which
the
real
parts
are
stored.
The
size
of
each
of
these
blocks
of
m
emory
will
be
2m-
a
words.
The
interpretation
of
the
binary
representation
of
the
index
k
is
thus
as
shown
in
figure
6-1.
!
Storage
Location
Withill
PE
(
PE
Number
Within
Quadrant
If"
Quadrant
I km-1
ka
k7
k2
I
kl
kO
I
m-S
bits
6
bits
2
bits
Figure
6-1.
Interpretation
of
k
as
a
Binary
Number
to
Define
Storage
Location
of
A(k).
The
quadrants
of
the
ILLIAC
IV
array
are
numbered
0, 1, 2, 3.
The
numbering
of
the
PE's
within
the
quadrant
is
shown
in
figure
6-
2.
If
p
is
the
PE
number
then
the
last
3
bits
of
the
binary
representation
of
p
are
the
column
and
the
first
3
bits
are
the
row
numbers.
When
the
initial
data
is
stored
as
described,
its
distribution
through-
out
the
arrays
is
shown
in
figure
6-3.
The
number
shown
are
k
mod
256.
~
Row
0 1 2 3 4 5 6 7
.
0 0 1 2 3 4 5 6 7
1 8 9
10
11
12 13
14 15
2
16
Row
I
Col
;
3
24
3
bits
3
bits
4
32
5
40
6 48
7
56
57 58
59
60
61.
62
63
-
Figure
6-2.
The
Numbering
of
the
PElS
withIn
a
Quadrant
6-5
6-6
0 4 8
12
16 20 24 28 1 5 9
13 17
21
25 29
32 36 40 33
37
41
64
68 65
69
o 96
97
1
/ 128
160
192
224 228 232 236 240 244 248 252 225 253 \
QUADRANT
NUMBERS
2 6 10 14 18 22 26 30
34
38
66 70
3 7 11
15
19
23 27
31
35 39
43
67
71
I
3
98 99
226 254 227 255
Figure
6-3.
The
.Distribution
of
the
Coefficients
A (k)
in
the
A
rray
Initially
COMPUTATION
AND
STORAGE
OF
INTERMEDIATE
RESULTS
The
calculations
fall
naturally
into
two
parts,
the
first
m-
8
iterations
and
the
final
8
iterations.
In
the
following
de
scr
iption,
the
binary
bit
s
(always
n~
in
number)
of
the
locations
referred
to
are
to
be
interpreted
in
the
same
manner
as
the
bi-
~ary
.representation
of
k
just
described.
The
first
m-S
cycles
r =
1,2,
...
,m-S.
1.
Calculate
A
(j,
...
, j
I'
k
l'
...
kO)'
by
use
of
equation
(7).
r 0
r-
m-
r-
2.
Store
the
result
in
location
(jo'
...
,j
l'
k .
1'"
.
,k
O
)'
i.
e.,
at
r-
m-r-
address
(j
"",
j
l'
k 1"
•.
, kS)
of
PE
(k
7,
...
, kO)'
Note
that
on
the
(nl-S
/
th
o .
r-
m-r-
cycle
it
is
address
(jo'
...
,
jm-
9)
of
PE
(k
7,
...
, kO)
that
is
meant.
The
calculation
is
performed
for
all
values
of
(jo'
...
,j
-1)
and
of
(k
m _ _
l'
...
, kO)'
It
is
seen
from
equation
(7)
that
for
the
computation
of
tbe
A
for
any
Inaex,
the
storage
locations
defined
for
the
quantities
on
both
the
left
~~d
right
hand
sides
of
the
equation
are
in
the
same
PE
memory
(the
last
S
bits
are
the
salne).
Hence
no
transfer
of
data
between
PE's
is
required.
Thus
all
PE'sScompute
in
parallel
and,
m-
on
each
cycle,
each
PE
computes
the
value
of
A
for
2
different
values
of
the
. d r
In
ex.
The
final
S
cycles
r =
m-7,
m-
6,
...
, n1.
1.
ShiftA
l(jO,···,j
2,k
,
...
,k
O)
from
location
(jo,
...
,j
10,j
2)
r-
r-
m-
r .
m-
r-
in
PE
(j
9 I I j
3'
k I
.,
kO>
to
location
(j
0'
...
, j
10'
k )
in
P~
m-
r-
m-r
m-
m-r
(j
9'
...
, j
2'
k
l'
...
, kO)'
Note
that
on
the
(m-7
)th
cycle
the
shift
meant
m-
r-
m-r-
is
from
location
(jo'
...
,
jm-
9)
in
PE
(k
7,
...
, kO)
to
location
(jo'
...
,
jm-l
0'
k7)
inPE
(jm-
9'
k6,
...
, kO)'
Note
also,
on
the
mth
cycle
the
shift
meant
is
from
location
(jo'
...
,
jm-lO'
jm-2)
in
PE
(jm-
9'
...
,
jm-3'
kO)
to
location
(jo'
...
,
jm-l
0'
kO)
in
PE
(j
9'
...
, j
2)'
m-
m-
2.
Calculate
A
(jo"'"
j
l'
k
I'
...
, kO>
by
use
of
equation
(7).
r
r-
m-r-
Note
on
the
mth
cycle
it
is
A
(jo"'"
j 1)
that
is
calculated.
r
r-
3.
Store
the
result
in
location
(jo'
...
, j
10'
j
1)
in
PE
(j
9'
...
,
m-
r-
m-
j
2'
k
I'
...
,k
O
)'
Note
on
the
mth
cycle
it
is
location
(jo'
...
, j
10'
j 1 >
r-
m-r-
.
m-
m-
in
PE
(j
9'
...
, j
2)
that
is
meant.
m- m-
Again
the
calculation
is
performed
for
all
values
of
(jo'
•••
'
jr-l)
and
of
(k
I" • • I kO).
The
shift
called
for
in
step
1
is
designed
to
bring
together
in
.
m-r-
the
same
PE
the
values
of
A 1
with
the
two
indices.
They
differ
only
in
the
r-th
r-
bit
appearing
on
the
right-hand
side
of
equation
(7).
Once
the
shifts
have
been
accomplished
all
PEls
compute
in
parallel
and
each
computes
the
values
of
A
for
2
ffi
-S .. r
values
of
the
index.
6-7
6-8
The
shifts
required
in
step
1
are
determined
by
the
values
of
the
bits
j 2
and
k .
Since
only
four
possible
combinations
of
the
values
of
these
blt§-are
pos-
sillrl,
the
corresponding
shifts
may
be
tabulated.
Table
6-1.
Shifts
Required
for
Combinations
of
Bits
j 2
and
k
r-
m-r
Bit
Shift
jr-2
k
m-r
0 0
No
shift.
0 1
Shift
to
PE
of
lower
number
(_2m-r).
Increase
location
by
1.
1 0
Shift
to
PE
of
higher
number
(+2 m-
r).
Decrease
location
by
1.
1 1
No
shift.
'----.
Since
the
possible
combinations
of
bit
values
occur
with
equal
frequency
it
is
seen
from
the
above
table
that
on
any
cycle
(r)
precisely
half
the
data
has
to
be
shifted
by
inter-PE
shiftin£.
Of
the
data
that
is
shifted,
half
goes
to
a
PE
of
higher
number
(+2m-r)
and
half
goes
to
a
PE
of
lower
number
(_2m-r).
This
shifting
is
such
that
on
cycle
r,
two
PE's
whose
number
in
the
array
differ
only
in
the
r-
(m-
8)
bit
position
exchange
a
word
of
data,
for
each
two
word
s
they
contain.
The
required
shifting
of
data
between
PE
memories
is
accomplished
by
one
or
two
routing
instructions.
In
the
fir
st
cycle
requiring
such
a
shift
each
PE
in
the
block
of
128
PE's
(those
numbered
0-127)
sends
and
receives
words
from
the
corres-
ponding
PE
in
the
second
block
of
12~
PE's
(128-
255).
Examination
of
the
PE
numbering
system
of
figure
6-3
shows
that
the
first
four
rows
of
PE's
in
each
quadrant
exchange
words
with
the
second
four
rows.
The
end-around
cylindrical
connection
of
the
PE's
on
the
North
and
South
edges
of
each
quadrant
are
such
that
sending
and
receiving
can
be
accomplished
in
one
instruction,
since
all
PE's
in
a
quadrant
shift
a
word
4
rows
South,
end-
around,
simultaneously.
In
the
second
cycle
requiring
a
shift,
each
PE
in
2
blocks
of
64
PE's
(0-
63;
128-191)
exchanges
words
with
the
corresponding
PE
in
the
corresponding
block
of
the
re-
maining
2
blocks
of
64
PEls
(64-127;
192-255).
This
requires
the
first
two
rows
of
PEls
within
a
quadrant
to
exchange
words
with
the
second
two
rows
and
the
third
pair
of
rows
to
exchange
words
with
the
fourth
pair
of
rows.
In
this
case
the
end-
around
connection
cannot
be
used
and
the
exchange
takes
place
under
mode
control
in
the
two
parts.
First,
rows
I,
2,
5,
6
transmit
data
two
rows
South
to
rows
3,
4,
7, 8
respectively.
Second,
rows
3,
4,
7,
8
transmit
data
two
rows
North
to
rows
I,
2,
5,
6
respectively.
In
the
third
cycle
information
is
exchanged
between
adjacent
rows.
The
following
three
cycles
repeat
the
same
pattern
of
shifts
but
between
columns
rather
than
rows.
Next
a
shift
of
8
rows
is
required.
This
involves
the
whole
array
as
data
moves
from
quadrant
to
quadrant
for
the
first
time.
Use
could
be
.
made
of
the
end-
around
cyclindinal
connection
of
the
whole
array,
as
in
the
first
shift
described.
The
final
shift
is
similar,
being
of
distaI1ce
8
between
columns.
This
pattern
of
shifts
is
listed
in
table
6-2.
Table
6-
2.
Shifting
Required
for
the
Final
Eight
Iterations
Block
s
of
PEl
s
Distance
of
Cycle
Shift
Required
Number
Size
Nearest
Number
m-7
2
128
*
4 NS
m-6
4
64
2
NS
m-5
8
32
1 NS
m-4
16
16
4*~~
WE
m-3
32
~
2
WE
m-2
64 4 1
WE
m-l
128
2 8*
NS
m
256
1
8**
WE
*
Denotes
that
end-
around
connectivity
may
be
used
to
allow
both
of
the
shifts
to
occur
simultaneously.
**
Shifts
can
also
use
end-
around
connectivity,
but
will
require
one
more
step
in
routing
than
"*"
shifts,
because
of
the
differences
in
edge
connectivity
patterns.
6-9
6-10
We
now
consider
the
powers
of
W
that
have
to
be
available
in
each
PE.
According
to
equation
(7)
the
computation
on
the
r-th
iteration,
of
A
(jo"'"
j
-1'
k _
-1"'"
k )
requires
the
use
of
W
qr
where
q
is
given
by
equation
f8).
With
1he
dJra
r
sPorage
scheme
described
the
addre§s
of
A
within
the
PE
determines
the
power
. r .
of
W
required
in
its
calculation.
According
to
equation
(8),
on
the
first
m-8
He
rations,
put
the
first
r-1
bits
of
the
address
in
reverse
and
m.ultiply
the
resul-
tant
number
by
2m-r
This
gives
the
required
powerOof
W.
On
the
first
iteration,
no
bits
are
selected
by
this
rule--corresponding
to
W
For
these
iterations
all
PEls
require
the
same
power
of
the
W
at
the
same
time.
Thus,
these
powers
should
be
broadcast,
and
not
stored
repetitively
in
each
PE.
On
the
final
8
iterations
one
has
Ar_1
(jo,
.•.
,
jr-2
km-s'
•.•
, kO>
stored
in
location
jo,".'
jnl-10'
km_r>
in
PE
(jm-9,""
jr-2'
km-r-1,
•.
·,
kO
)"
After
the
required
shifting
has
taken
place,
take
the
r-1
bits
that
have
been
underlined
(the
first
m-9
in
the
memory
address
followed
by
the
first
r-m
+ 8
in
the
PE
number),
invert
these
bits
and
multiply
the
number
obtained
by
2m-r.
This
is
the
power
of
W
required.
Since
these
powers
of
W
depend
on
the
PE
number
they
should
be
stored
within
the
PE
memories.
On
each
iteration
one
power
of
W
for
each
pair
of
indices
of
Ar
within
the
PE
is
required.
The
number
of
such
pairs
is
2m-9
and
the
number
of
iterations
in
this
mode
is
8.
Thus
2m-6
values
of
powers
of
Ware
required
by
each
PE.
A s
in
the
standard
algorithm,
at
the
completion
of
the
m-th
iteration,
to
f.ind
the
location
of
X(j)
in
the
A
array,
one
interprets
the
bits
of
j
in
reverse
as
a
lo-
cation
in
the
PE
memorTes.
However,
after
reversing
the
bits
of
j
it
is
necessary
to
shift
the
final
bit
(j
-1)
left
eight
plac
es
(to
between
j
-10
and
j -9>
before
terpreting
the
locatio!?as
in
figure
6-1.
m m
COMPUTA
TIONS
REQUIRED
To
give
some
idea
of
the
magnitudes
involved
some
figures
are
given
here
for
N =
4096
=
212.
(m
=
12).
1.
2.
3.
Number
of
iterations
required
(m)
= 12
Number
of
coefficients
A
(k)
in
each
PE(2
m-
8)
= 16
m-6
Number
of
values
of
W
stored
in
each
PE
(2 ) = 64
4.
Computation
required
for
each
pair
of
coefficients
(2 m-9 =
8)
per
P~,
is
indicated
by
equation
(7),
1
complex
multiplication
and
2
complex
additions.
This
in
terms
of
real
arithmetic,
amounts
to
4
multiplications
and
6
additions
for
each
pair
of
coefficients.
On
the
final
8
iterations
data
transfers
occur
and
2
words
(real
and
imaginary
part
of
one
of
the
pair)
are
transmitted
and
2
words
received
for
this
amount
of
calculation.
Since
there
are
8
pairs
of
coefficients
in
each
PE,
for
each
iteration
a
PE:
1.
Transmits
16
word
s
of
data
2.
Receives
16
words
of
data
3.
Performs
32
multiplications
4.
Performs
48
additions.
}
Not
required
on
first
4
of
the
12
iterations
During
the
first
4
iterations
one
is
effectively
solving
256
problems
in
parallel
(one
in
each
PE),
each
problem
being
of
size
N = 16.
Duripg
the
final
8
itera-
tions
one
is
solving
equivalently,
8
problems,
each
problem
being
of
size
N =
512.
These
last
problems
are
distributed
uniformly
throughout
the
array.
6-11
SECTION
VII
CIRCUIT
DESIGN
-
THE
ECL
CIRCUIT
The
basic
logical
circuit
for
ILLIAC
IV
is
the
current
switch,
or
ECL
circuit,
shown
in
figure
7 -1.
Logical
operations
can
be
performed
in
this
circuit
by
sup-
plying
input
transistors
in
parallel,
by
tying
collectors
together
into
a
common
collector
resistor,
and
by
connecting
together
the
output
ernitters.
Gating
on
the
input
transistors
appears
as
"not
OR"
for
positive
-going
signals
when
seen
at
the
inverting
output
"b,
"
or
as
OR
when
seen
at
the
noninverting
output,
Ila."
For
negative
-going
signals,
gating
at
the
input
transistors
appears
as
"not
AND"
and
AND,
respectively,
at
the
inverting
and
noninverting
outputs.
Logic
gating
can
also
be
done
by
sharing
a
single
resistor
among
otherwise
independent
collectors.
Such
sharing
produces
an
OR
for
negative
signals
or
an
AND
for
positive
signals,
and
would
appear
to
be
a
way
of
adding
another
gate
to
the
logic
without
adding
any
components
or
any
delay_
However,
one
must
take
care
of
the
case
that
two
col-
lectors
are
delivering
current
into
the
resistor
simultaneously,
either
by
clamping
the
voltage,
which
costs
components,
or
by
ensuring
in
the
logical
design
that
no
more
than
one
current
flows
at
anyone
time.
Mixing
outputs
by
tying
the
emitters
together
performs
exactly
the
same
function
as
mixing
inputs
of
the
next
stage
in
the
input
transistor,
namely
negative
AND
or
positive
OR.
Implementation
of
any
logic
equation
using
such
gates
can
always
be
found
by
translating
the
AND
-
OR
description
implicit
in
the
19ical
equation
into
a
NAND-NAND
description,
level
for
level.
As
reference
to
the
logical
design
of
the
processing
element
elsewhere
in
this
report
will
show,
designs
can
often
be
simplified
considerably
from
the
directly
translated
version.
The
use
of
ECL
circuits
differs
depending
upon
the
amount
of
wiring
which
must
be
driven
and
the
speed
to
be
obtained.
When
integrating
within
the
semiconductor
array,
freedom
to
use
the
ECL
gates
just
described
is
virtually'
unlimited.
When
_
any
wiring
is
involved,
however,
the
low
impedance
point,
namely
the
emitter
outputis
used
as
the
source
when
signals
must
be
sent
along
conductors
from
one
circuit
package
to
another.
Further,
fanout
suffers
on
signals
which
must
leave
the
circuit
package
because
of
the
need
to
supply
damping
and
terminating
resistors
for
the
wiring.
7
-1
7-2
v
cc
IN
o---;--t
+1.
2v
In
From
Logic
0--+--1
V
ee
-0
(b)
Inverting
Output
(a)
Non-
Invertlng
Output
Figure
7-1.
EeL
Gate~
Schematic
Diagram
61
183
I
I
+3.5v
1
I
I
1
____
---.
I
200
Line
36
I
I
I
I
I
I
I
-3.
v I
o-----~------------I~--------~
--l---l
-2.
Ov
I
Note:
All
Resistance
Values
in
Ohms.
I~
From
Line
Figure
7-2.
Driver~
Schematic
Diagram
2K
I
I
__
..J
320
-3.5v
120
V
ref
Note:
All
Resistance
Values
in
Ohms.
Figure
7-3.
Receiver~
Schematic
Diagram
Equivalent
Load
+1.
2v
Out
to
Logic
A t
the
collector,
signal
levels
are
approximately
+1. 2v
and
+0. 4v.
The
swing
is
set
essentially
by
the
current
in
the
emitter,
it
is
related
therefore
to
V , V
and
resistor
ratios.
The
more
positive
level
departs
from
V
by
an
arrrcrunt
cc
determined
by
base
current
in
the
outp.1 t
transistor,
a
few
peF8ent
at
most
of
the
"on"
current.
At
the
II
output"
emitter
in
figure
7
-I,
the
voltage
is
downshifted
by
an
amount
equal
to
the
base
-to-emitter
voltage
of
the
output
transistor.
The
outp.1 t
therefore
swings
from
+0.
4v
approximately
to
-0.4v
approximately.
These
output
voltages
are
made
meaningful
with
respect
to
the
nominally
zero
volt
input
thresh-
old
whatever
gate
receives
the.
Signals
which
must travel
considerable
distance,
such
as
between
cabinets,
for
instance,
may
need
more
margin
than
that
supplied
in
the
signals
at
the
output
of
the
ECL
gates.
The
extra
margin
is
needed
to
overcome
distortions
in
the
signal
due
to
imperfect
impedance
matching
in
the
wiring,
unwanted
components
due
to
crosstalk,
noise
from
external
sources,
and
discrepancies
in
temperature
and
perhaps
even
in
"zero
volts"
between
the
two
cabinets.
The
requirements
for
these
non
-EeL
signals
appear
to
be
as
follows:
Compatibility
with
ECL
levels
at
the
input
of
the
driver
and
the
output
of
the
receiver
Larger
signal
swing
(3.
Ov
based
upon
previous
Burroughs
experience)
f)
Drive
capability
for
several
transmission
line
characteris
impedances
in
parallel
(at
least
two;
or
36
ohms
load
.on
each
signal,
if
72
-ohm
line
is
used)
Fanout
of
64
(from
control
unit
to
all
64
P.
E'
s);
this
implies
an
input
inlpedance
of
over
2.
3k
ohms
per
receiver
if
32
receivers
are
to
be
attached
to
a
single
72
-ohm
line.
)
A
driver
circuit
which
satisfies
these
requirements
is
shown
in
figure
7
-2.
Note
that
the
portion
of
the
circuit
to
the
left
of
the
dotted
line
in
the
driver
circuit
is
identical
(except
for
a
somewhat
lower
"impedance
.level)
to
our
standard
EeL
gate.
When
this
driver
circuit
is
implemented
as
a
portion
of
a
large-scale
integrated
array,
the
portion
of
the
circuit
on
the
left,
which
is
"the
same"
as
the
standard
EeL
circuit,
is
·available
for
performing
some
logic
function.
Only
the
non-
inverted
output,
which
feeds
the
large
-swing
signal,
would
be
unavailable
for
normal
EeL
use.
A
receiver
circuit
which
satisfies
these
requirements
is
shown
in
figure
7
-3.
Note
that
the
portion
of
the
circuit
to
the
right
of
the
dotted
line
in
the
receiver
circuit
is
identical
to
our
standard
EeL
gate.
When
this
receiver
circuit
is
im-
plemented
as
a
portion
of
a
large
-scale
intregrated
array,
the
portion
of
the
circuit
on
the
right,
which
is
the
same
as
the
standard
EeL
circuit,
is
available
for
performing
logic
functions
within
the
array.
7
-3
+
1.2v
-3.5v
IN
LONG
LINE
-3.5v
-3.5v
DRIVER
RECEIVER
Figure
7-4.
EeL
Driver-Receiver,
Balanced
Signals,
Schematic
Diagram
'---+---o
OUT
OUT
~FALSE)
The
signal
across
the
interface,
in
this
system,
swings
from
+2.
5v
to
-0.
4v.
Threshold
is
at
1.
2v,
at
room
temperature,
and
has
a
slight
negative
temperature
coefficient.
Since
rise
times
no
faster
than
15
ns
are
of
interest,
the
gain
of
the
transistors
at
25
mc
or
thereabout
controls
the
input
impedance
of
the
receiver.
With
transistors
whose
cutoffs
are
in
the
hundreds
of
n'legacycles,
gains
of
30
are
easily
available.
The
+3.
5v
supply
can
be
disabled
to
control
intercabinet
transfers.
An
alternative
dirver
-receiver
circuit
has
been
suggested
in
which
each
signal
is
transmitted
balanced
with
respect
to
ground.
This
circuit
is
shown
in
figure
7
-4.
It
has
a
signal
swing
of
1.
6v,
the
differential
signal
between
the
two
outputs,
but
makes
up
for
lowered
signal
swing
with
increased
noise
immunity.
The
noise
immunity
is
almost
equal
to
half
the
minimum
signal
swing,
compared
to
a
noise
immunity
of
30%
to
350/0
of
the
signal
swing
in
the
system
exemplified
by
the
driver
and
receiver
of
figures
7
-2
and
7
-3.
This
scheme
has
a
further
advantage
in
rejecting
noise,
in
that
some
noise
sources
tend
to
induce
a
common
mode
com-
ponent
in
the
line.
This
single
-ended
scheme
rejects
common
mode
noise
essentially
perfectly
up
to
some
maximum
amplitude,
while
the
scheme
depends
on
the
coupling
between
the
two
conductors
of
the
line
to
induce
a
noise
com-
ponent
in
one
conductor
equal
to
the
noise
component
induced
on
the
other
by
some
external
source.
The
latter
scheme
is
extreluely
effective,
but
not
as
effective
as
balanced
signaJs.
An
advantage
of
the
balanced
signal
driver
and
receiver
is
that
they
are
identi-
cal
in
design
and
fabrication
with
standard
EeL
gates.
Disadvantages
of
the
balanced
signal
scheme
are
severe
for
certain
proposed
uses.
In
particular,
it
is
not
possible
to
put
more
than
one
driver
on
one
signal
wire.
For
the
connections
from
the
PE's
back
to
the
memory
access
buffer,
it
would
be
necessary
to
give
each
PE
its
own
expensive
wire
back
to
its
own
private
receiver
at
the
memory
access
buffer.
Using
the
unbalanced,
3.0
v
signal,
one
can
combine
all
eight
drivers
on
a
single
data
line.
This
defect
will
be
general,
whenever
several
data
sources
converge
on
a
common
destina-
tion.
A
second
defect
of
the
balanced
signal
scheme
arises
because
the
wiring
must
be
balanced
with
respect
to
ground.
Thus
one
must
use
pairs
of
wires,
either
twisted
pair,
or
shielded
twisted
pair.
Twisted
pair
is
inferior
to
coaxial
cable
or
ribbon
cable
in
terms
of
crosstalk;
shielded
twisted
pair
is
considerably
more
expensive
both
to
buy
and
to
install
than
coaxial
cable
and
ribbon
cable.
Even
unshielded
twisted
pair,
when
procurred
in
belted
form,
can
be
surprisingly
expensive.
A
third
limitation
of
the
balanced
signal
scheme
arises
because
of
the
difficulty
in
designing
for
low
.output
impedance.
As
described
to
us,
the
drive
capability
of
the
balanced
driver
was
only
sufficient
to
drive
a
single
transmission
line.
In
the
case
of
data
being
transmitted
from
one
PE
to
neighboring
PE's,
the
data
transmission
paths
go
in
both
directions
from
the
transmitting
PEa
For
optirEum
per.formance,
it
is
necessary
to
drive
two
transmission
lines,
one
going
in
each
7
-5
7-6
direction.
Therefore,
two
balanced
driver
circuits
are
required
to
handle
the
signal.
Greater
fan
-out
at
the
receiver
end
is
also
available
with
the
un-
balanced
design.
The
conclusion
is
that
either
driver
and
receiver
design
is
satisfactory
for
use
in
cables
up
to
some
maximum
length,
where
not
more
than
one
driver
per
sig-
nal
set
is
required.
Ordinary
twisted
pair
can
probably
be
driven
by
the
balanced
driver
design
in
medium
length
runs
of
up
to
10
or
30
feet
where
the
unbalanced
driver
would
require
coaxial
cable
or
the
equivalent.
Beyond
that,
in
either
case,
higher
quality
wire
is
required,
such
as
ribbon
cable
using
three
wires
per
signal,
coaxial
cable,
or
shielded
twisted
pair.
For
this
last
class
of
signals,
single
-ended
signals
would
be
more
economical
of
wiring,
would
be
directly
compatible
with
popular
types
of
"foreign"
logical
circuitry
which
is
likely
to
be
found
in
external
equipment,
and
would
be
more
compatible
with
any
require-
ment
such
as
r.
f.
filtering.
A
table
of
"long
lines"
signals
is
in
table
7-1.
Table
7
-1
contains
our
conclusions
on
the
implementation
of
such
"long
lines"
Signals.
Either
balanced
or
unbalanced
signals
will
be
acceptable
within
each
quadrant,
assuming
each
quadrant
to
be
packaged
within
a
single
unit,
a
set
of
Table
7
-1.
Signals
Requiring
Driver
and
Receiver
Estimated
Suitable
Class
of
Signals
Length
(feet)
Driver
Design
Pe
to
PE
(same
cabinet)
short
Pe
to
PE
(different
cabinets)"
6
either
PE
to
110
Buffer
30
either
PE
to
memory
access
buffer
30
single
ended
PE
from
CD,
control
signals
and
mode
control
I 30
either
CD
to
CD,
different
quadrants
80
single
ended
CD
to
peripheral
devices
200
single
ended
Comments
no
driver
needed,
stand-
ard
ECL
signals
suitable
----
--
--
drivers
must
be
aRable
-
---
economically
more
attractive
wiring
compatibility
with
foreign
signals.
cabinets
bolted
together.
For
longer
runs,
the
lower
price
of
coaxial
c;~::.,
..
the
greater
compatibility
of
unbalanced
signals
at
a
"foreign"
interface
~.,(,.'::,'
,
call
for
the
use
of
3.
Ov
unbalanced
signals
between
quadrants,
and
frGrn
th
..
:
array
processor
to
the
outside.
Figure
7
-5
shows
an
example
of
how
the
unbalanced
system
nlight
be
used
~('
distribute
control
signals
from
the
CU
to
the
64
PE's
assunling
tha
t
the
I'~:
:';: :,"
receiver
in
each
PE.
A
fanout
of
64,
as
shown,
may
be
beyond
the
C'~qnL~l:
~:'
of
either
scheme
at
the
required
speed,
especially
the
balanced
signal
SC!:(·~
....
However,
the
grouping
of
the
PE
s
is
such
that
one
receiver
should
do
for
;1<'.::'
or
eight
PE'
s.
Drive
r
(in
CU)
32
Receivers
(in
PE's)
r----4f-+-----1~---_t>_+-------
- - - - - -
----e---f---
r----*-----*-----+------------------------
...
-
...
-
...
-
..
32
Receivers
(in
PE's)
L.-...1f---Clf-+-----1~---~>_+------
- - - - - -
----.e---t--
L..
.......
_________
-----+----------.--------->-----~-."
..
'"
..
Figure
7-5.
Use
of
Drivers
and
Receivers
for
Distributing
COc.tI>,.)1
Signals
from
C U
to
PEl
S

Navigation menu