Computer Organization Hamacher Instructor Manual Solution Chapter 8

User Manual:

Open the PDF directly: View PDF PDF.
Page Count: 8

DownloadComputer Organization Hamacher Instructor Manual Solution - Chapter 8
Open PDF In BrowserView PDF
Chapter 8 – Pipelining
8.1. ( ) The operation performed in each step and the operands involved are as given
in the figure below.
Clock cycle
Instruction
I1: Add

1

2

3

4

Fetch

Decode,
20, 2000

Add

R1←
2020

Fetch

Decode,
3, 50

Mul

R3←
150

Fetch

Decode,
$3A, 50

And

R4←
50

Fetch

Decode,
2000, 50

Add

I2: Mul

I3: And

I4: Add

5

6

7

R5←
2050



( )

Clock cycle

2

3

4

5

Buffer B1

Add instruction
(I  )

Mul instruction
(I )

And instruction
(I )

Add instruction
(I )

Buffer B2

Information
from a previous
instruction

Decoded I 
Source
operands:
20, 2000

Decoded I
Source
operands:
3, 50

Decoded I
Source
operands:
$3A, 50

Buffer B3

Information
from a previous
instruction

Information
from a previous
instruction

Result of I  :
2020
Destination 
R1

Result of I :
150
Destination 
R3

1

8.2. ( )
Clock cycle
Instruction
Add

1

2

3

4

Fetch

Decode,
20, 2000

Add

R1←
2020

Fetch

Decode,
3, 50

Mul

Mul

And

Fetch

6

7

Decode,
$3A, ? $3A, 2020

And

R4←
32

Fetch

Decode,
2000, 50

Add

Add

5

R3←
150

R5←
2050



( ) Cycles 2 to 4 are the same as in P8.1, but contents of R1 are not available
until cycle 5. In cycle 5, B1 and B2 have the same contents as in cycle 4. B3
contains the result of the multiply instruction.
8.3. Step D may be abandoned, to be repeated in cycle 5, as shown below. But,
instruction I  must remain in buffer B1. For I to proceed, buffer B1 must be
capable of holding two instructions. The decode step for I has to be delayed as
shown, assuming that only one instruction can be decoded at a time.
Clock cycle

1

2

3

4

F1

D1

E1

W1

F2

D2

5

6

7

D2

E2

W2

E3

W3

8

Instruction
I1 (Mul)
I2 (Add)
I3
I4

F3

D3
F4

2

D4

E4

W4

8.4. If all decode and execute stages can handle two instructions at a time, only instruction I is delayed, as shown below. In this case, all buffers must be capable
of holding information for two instructions. Note that completing instruction I
before I could cause problems. See Section 8.6.1.
Clock cycle

1

2

3

4

F1

D1

E1

W1

5

6

7

E2

W2

Instruction
I1 (Mul)

D2

F2

I2 (Add)

F3

I3
I4

D3

E3

W3

F4

D4

E4

W4

8.5. Execution proceeds as follows.
Clock cycle

1

2

3

4

5

6

7

F1

D1

E1

W1

8

D2

E2

W2

I3

F3

D3

E3

W3

I4

F4

D4

E4

9

Instruction
I1

F2

I2

W4

8.6. The instruction immediately preceding the branch should be placed after the
branch.

LOOP

Instruction 1

LOOP

 

Instruction 1
 

Instruction 
Instruction
Conditional Branch LOOP



Instruction 
Conditional Branch LOOP
Instruction

This reorganization is possible only if the branch instruction does not depend on
instruction .

3

8.7. The UltraSPARC arrangement is advantageous when the branch instruction is at
the end of the loop and it is possible to move one instruction from the body of
the loop into the delay slot. The alternative arrangement is advantageous when
the branch instruction is at the beginning of the loop.
8.8. The instruction executed on a speculative basis should be one that is likely to be
the correct choice most often. Thus, the conditional branch should be placed at
the end of the loop, with an instruction from the body of the loop moved to the
delay slot if possible. Alternatively, a copy of the first instruction in the loop
body can be placed in the delay slot and the branch address changed to that of
the second instruction in the loop.
8.9. The first branch (BLE) has to be followed by a NOP instruction in the delay slot,
because none of the instructions around it can be moved. The inner and outer
loop controls can be adjusted as shown below. The first instruction in the outer
loop is duplicated in the delay slot following BLE. It will be executed one more
time than in the original program, changing the value left in R3. However, this
should cause no difficulty provided the contents of R3 are not needed once the
sort is completed. The modified program is as follows:

OUTER
INNER

NEXT

ADD
ADD
SUB
SUB
LDUB
LDUB
SUB
BLE,pt
SUB
STUB
STUB
OR
BGE,pt,a
LDUB
SUB
BGT,pt
SUB

R0,LIST,R3
R0,N,R1
R1,1,R1
R1,1,R2
[R3+R1],R5
[R3+R2],R6
R6,R5,R0
NEXT
R2,1,R2
R5,[R3+R2]
R6,[R3+R1]
R0,R6,R5
INNER
[R3+R2],R6
R1,1,R1
OUTER
R1,1,R2

4

Get LIST(j)
Get LIST(k)

k

k 1

Get LIST(k)

8.10. Without conditional instructions:

Action2
Action1
Next

Compare
Branch  0
...
Branch
...
...

A,B
Action1
...
Next
...

Check A

B

One or more instructions
One or more instructions

If conditional instructions are available, we can use:

Next

Compare
...
...
...

A,B
...
...

Check A B
Action1 instruction(s), conditional
Action2 instruction(s), conditional

In the second case, all Action 1 and Action 2 instructions must be fetched and
decoded to determine whether they are to be executed. Hence, this approach is
beneficial only if each action consists of one or two instructions.
Without conditional instructions
Clock cycle

1

2

F1

E1

3

4

5

6

Instruction
Compare A,B

F2

Branch>0 Action1
Action2

…
Branch

Action1

…

Next

…

E2
F3

E3
F4

Next

E4

F6

With conditional instructions
Compare A,B

F1

F2

If >0 then action1
If ≤0 then action2
NEXT

E1
E2
F3

…

E3
F4

5

E4

E1

8.11. Buffer contents will be as shown below.
Clock
Cycle No.
ALU Operation
R3
RSLT

3

4

+
45
198

Shift
130
130

5
O3
260
260

8.12. Using Load and Store instructions, the program may be revised as follows:
INSERTION

HEAD

SEARCH
LOOP

INSERT
TAIL

Test
Branch  0
Move
Return
Load
Load
Compare
Branch  0
Store
Move
Return
Move
Load
Test
Branch=0
Load
Load
Compare
Branch  0
Move
Branch
Store
Store
Return

RHEAD
HEAD
RNEWREC,RHEAD
RTEMP1,(RHEAD)
RTEMP2,(RNEWREC)
RTEMP1,RTEMP2
SEARCH
RHEAD,4(RNEWREC)
RNEWREC,RHEAD
RHEAD,RCURRENT
RNEXT,4(RCURRENT)
RNEXT
TAIL
RTEMP1,(RNEXT)
RTEMP2,(RNEWREC)
RTEMP1,RTEMP2
INSERT
RNEXT,RCURRENT
LOOP
RNEXT,4(RNEWREC)
RNEWREC,4(RCURRENT)

This program contains many dependencies and branch instructions. There very
few possibilities for instruction reordering. The critical part where optimization
should be attempted is the loop. Given that no information is available on branch
behavior or delay slots, the only optimization possible is to separate instructions
that depend on each. This would reduce the probability of stalling the pipeline.
The loop may be reorganized as follows.

6

LOOP

INSERT
TAIL

Load
Load
Test
Load
Branch=0
Compare
Branch  0
Move
Branch
Store
Store
Return

RNEXT,4(RCURRENT)
RTEMP2,(RNEWREC)
RNEXT
RTEMP1,(RNEXT)
TAIL
RTEMP1,RTEMP2
INSERT
RNEXT,RCURRENT
LOOP
RNEXT,4(RNEWREC)
RNEWREC,4(RCURRENT)

Note that we have assumed that the Load instruction does not affect the condition
code flags.
8.13. Because of branch instructions, 120 clock cycles are needed to execute 100 program instructions when delay slots are not used. Using the delay slots will eliminate 0.85 of the idle cycles. Thus, the improvement is given by:

  


 

That is, instruction throughput will increase by 8.1%.
8.14. Number of cycles needed to execute 100 instructions:
Without optimization
With optimization ( !"#$   #%$   )

140
127

Thus, throughput improvement is !'&(*)+,   , or 10.2%
8.15. Throughput improvement due to pipelining is , where is the number of pipeline
stages.
Number of cycles needed
to execute one instruction:
4-stage:

    $  -,  !

6-stage:

 !#     .#    $  -

Throughput
4/1.04  3.85
 /

Thus, the 6-stage pipeline leads to higher performance.

7

6/1.19  5.04

8.16. For a “do while” loop, the termination condition is tested at the beginning of the
loop. A conditional branch at that location will be taken when exiting the loop.
Hence, it should be predicted not taken. That is, the state machine should be
started in the state LNT, unless the loop is not likely to be executed at all.
A “do until” loop is executed at least once, and the branch condition is tested
at the end of the loop. Assuming that the loop is likely to be executed several
times, the branch should be predicted taken. That is, the state machine should be
started in state LT.
8.17. An instruction fetched in cycle 0 reaches the head of the queue and enters the
decode stage in cycle 02143 . Assume that the instruction preceding I  is decoded
and instruction I5 is fetched in cycle 1. This leads to instructions I  to I5 being in
the queue at the beginning of cycle 2. Execution would then proceed as shown
below.
Note that the queue is always full, because at most one instruction is dispatched
and up to two instructions are fetched in any given cycle. Under these conditions,
the queue length would drop below 6 only in the case of a cache miss.

Clock cycle

1

2

3

4

5

6

7

8

9

Time
10

Queue length

6

6

6

6

6

6

6

6

6

6

…

D1

E1

E1

E1

W1

…

D2

I1
I2
I3
I4
I5 (Branch)

D5

I6

F6

Ik
Ik+1

E2

W2

D3

E3

W3

D4

E4

W4

Dk

Ek

Wk

Dk+1

Ek+1

X
Fk
Fk+1

8



Source Exif Data:
File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
PDF Version                     : 1.6
Linearized                      : Yes
Create Date                     : 2001:08:17 10:36:25Z
Modify Date                     : 2017:04:26 18:56:43+05:30
Has XFA                         : No
XMP Toolkit                     : Adobe XMP Core 5.6-c015 84.159810, 2016/09/10-02:41:30
Metadata Date                   : 2017:04:26 18:56:43+05:30
Producer                        : Aladdin Ghostscript 5.10
Document ID                     : uuid:ff3fa3a6-e1d2-48ed-9553-1268755c32be
Instance ID                     : uuid:8e59efd2-f0d3-44f8-af21-cd144e835a6d
Format                          : application/pdf
Page Count                      : 8
EXIF Metadata provided by EXIF.tools

Navigation menu