Computer Organization Hamacher Instructor Manual Solution Chapter 8

User Manual:

Open the PDF directly: View PDF .
Page Count: 8

Chapter 8 – Pipelining

8.1. ( ) The operation performed in each step and the operands involved are as given

in the ﬁgure below.

Fetch Decode,

20, 2000 Add

Fetch Decode,

3, 50 Mul

Fetch Decode,

$3A, 50 And

Fetch Decode,

2000, 50 Add

R1←

2020

R3←

150

R4←

R5←

2050

Clock cycle 1 2 3 4 5 6 7

I2: Mul

I3: And

I4: Add

I1: Add

Instruction

( )

Clock cycle 2 3 4 5

Buffer B1 Add instruction

(I ) Mul instruction

(I ) And instruction

(I ) Add instruction

(I )

Buffer B2 Information

from a previous

instruction

Decoded I

Source

operands:

20, 2000

Decoded I

Source

operands:

3, 50

Decoded I

Source

operands:

$3A, 50

Buffer B3 Information

from a previous

instruction

Information

from a previous

instruction

Result of I :

2020

Destination

Result of I :

150

Destination

8.2. ( )

Fetch Decode,

20, 2000 Add

Fetch Decode,

3, 50 Mul

Fetch Decode, $3A, 2020 And

Fetch Decode,

2000, 50 Add

R1←

2020

R3←

150

R4←

R5←

2050

Clock cycle 1 2 3 4 5 6 7

Mul

And

Add

Instruction

$3A, ?

( ) Cycles 2 to 4 are the same as in P8.1, but contents of R1 are not available

until cycle 5. In cycle 5, B1 and B2 have the same contents as in cycle 4. B3

contains the result of the multiply instruction.

8.3. Step D may be abandoned, to be repeated in cycle 5, as shown below. But,

instruction I must remain in buffer B1. For I to proceed, buffer B1 must be

capable of holding two instructions. The decode step for I has to be delayed as

shown, assuming that only one instruction can be decoded at a time.

F1E1

Clock cycle 1 2 3 4 5 6 7

I2 (Add)

I1 (Mul)

Instruction

D1W1

F2E2

D2W2

F3E3

D3W3

F4E4

D4W4

8.4. If all decode and execute stages can handle two instructions at a time, only in-

struction I is delayed, as shown below. In this case, all buffers must be capable

of holding information for two instructions. Note that completing instruction I

before I could cause problems. See Section 8.6.1.

F1E1

Clock cycle 1 2 3 4 5 6 7

I2 (Add)

I1 (Mul)

Instruction

D1W1

F2E2

D2W2

F3E3

D3W3

F4E4

D4W4

8.5. Execution proceeds as follows.

F1E1

Clock cycle 1 2 3 4 5 6 7

Instruction

D1W1

F2E2

D2W2

D3W3

F4E4

D4W4

8.6. The instruction immediately preceding the branch should be placed after the

branch.

LOOP Instruction 1 LOOP Instruction 1

Instruction Instruction

Instruction Conditional Branch LOOP

Conditional Branch LOOP Instruction

This reorganization is possible only if the branch instruction does not depend on

instruction .

8.7. The UltraSPARC arrangement is advantageous when the branch instruction is at

the end of the loop and it is possible to move one instruction from the body of

the loop into the delay slot. The alternative arrangement is advantageous when

the branch instruction is at the beginning of the loop.

8.8. The instruction executed on a speculative basis should be one that is likely to be

the correct choice most often. Thus, the conditional branch should be placed at

the end of the loop, with an instruction from the body of the loop moved to the

delay slot if possible. Alternatively, a copy of the ﬁrst instruction in the loop

body can be placed in the delay slot and the branch address changed to that of

the second instruction in the loop.

8.9. The ﬁrst branch (BLE) has to be followed by a NOP instruction in the delay slot,

because none of the instructions around it can be moved. The inner and outer

loop controls can be adjusted as shown below. The ﬁrst instruction in the outer

loop is duplicated in the delay slot following BLE. It will be executed one more

time than in the original program, changing the value left in R3. However, this

should cause no difﬁculty provided the contents of R3 are not needed once the

sort is completed. The modiﬁed program is as follows:

ADD R0,LIST,R3

ADD R0,N,R1

SUB R1,1,R1

SUB R1,1,R2

OUTER LDUB [R3+R1],R5 Get LIST(j)

LDUB [R3+R2],R6 Get LIST(k)

INNER SUB R6,R5,R0

BLE,pt NEXT

SUB R2,1,R2 k k 1

STUB R5,[R3+R2]

STUB R6,[R3+R1]

OR R0,R6,R5

NEXT BGE,pt,a INNER

LDUB [R3+R2],R6 Get LIST(k)

SUB R1,1,R1

BGT,pt OUTER

SUB R1,1,R2

8.10. Without conditional instructions:

Compare A,B Check A B

Branch 0 Action1

Action2 . .. ... One or more instructions

Branch Next

Action1 . .. ... One or more instructions

Next ...

If conditional instructions are available, we can use:

Compare A,B Check A B

... ... Action1 instruction(s), conditional

... ... Action2 instruction(s), conditional

Next ...

In the second case, all Action 1 and Action 2 instructions must be fetched and

decoded to determine whether they are to be executed. Hence, this approach is

beneﬁcial only if each action consists of one or two instructions.

Clock cycle 123456

Branch>0

…

Branch

Compare

Instruction

…

F2E2

F3E3

F4E4

F6E1

…

A,B

Action1

Without conditional instructions

If >0 then action1

If ≤0 then action2

Compare A,B

NEXT …

F1E1

F2E2

F3E3

F4E4

With conditional instructions

Action2

Action1

8.11. Buffer contents will be as shown below.

RSLT

Cycle No.

Clock

198 130 260

345

ALU Operation + Shift O3

R3 45 130 260

8.12. Using Load and Store instructions, the program may be revised as follows:

INSERTION Test RHEAD

Branch 0 HEAD

Move RNEWREC,RHEAD

Return

HEAD Load RTEMP1,(RHEAD)

Load RTEMP2,(RNEWREC)

Compare RTEMP1,RTEMP2

Branch 0 SEARCH

Store RHEAD,4(RNEWREC)

Move RNEWREC,RHEAD

Return

SEARCH Move RHEAD,RCURRENT

LOOP Load RNEXT,4(RCURRENT)

Test RNEXT

Branch=0 TAIL

Load RTEMP1,(RNEXT)

Load RTEMP2,(RNEWREC)

Compare RTEMP1,RTEMP2

Branch 0 INSERT

Move RNEXT,RCURRENT

Branch LOOP

INSERT Store RNEXT,4(RNEWREC)

TAIL Store RNEWREC,4(RCURRENT)

Return

This program contains many dependencies and branch instructions. There very

few possibilities for instruction reordering. The critical part where optimization

should be attempted is the loop. Given that no information is available on branch

behavior or delay slots, the only optimization possible is to separate instructions

that depend on each. This would reduce the probability of stalling the pipeline.

The loop may be reorganized as follows.

LOOP Load RNEXT,4(RCURRENT)

Load RTEMP2,(RNEWREC)

Test RNEXT

Load RTEMP1,(RNEXT)

Branch=0 TAIL

Compare RTEMP1,RTEMP2

Branch 0 INSERT

Move RNEXT,RCURRENT

Branch LOOP

INSERT Store RNEXT,4(RNEWREC)

TAIL Store RNEWREC,4(RCURRENT)

Return

Note that we have assumed that the Load instruction does not affect the condition

code ﬂags.

8.13. Because of branch instructions, 120 clock cycles are needed to execute 100 pro-

gram instructions when delay slots are not used. Using the delay slots will elim-

inate 0.85 of the idle cycles. Thus, the improvement is given by:

That is, instruction throughput will increase by 8.1%.

8.14. Number of cycles needed to execute 100 instructions:

Without optimization 140

With optimization ( ) 127

Thus, throughput improvement is , or 10.2%

8.15. Throughput improvement due to pipelining is , where is the number of pipeline

stages.

Number of cycles needed

to execute one instruction: Throughput

4-stage: 4/1.04 3.85

6-stage: 6/1.19 5.04

Thus, the 6-stage pipeline leads to higher performance.

8.16. For a “do while” loop, the termination condition is tested at the beginning of the

loop. A conditional branch at that location will be taken when exiting the loop.

Hence, it should be predicted not taken. That is, the state machine should be

started in the state LNT, unless the loop is not likely to be executed at all.

A “do until” loop is executed at least once, and the branch condition is tested

at the end of the loop. Assuming that the loop is likely to be executed several

times, the branch should be predicted taken. That is, the state machine should be

started in state LT.

8.17. An instruction fetched in cycle reaches the head of the queue and enters the

decode stage in cycle . Assume that the instruction preceding I is decoded

and instruction I is fetched in cycle 1. This leads to instructions I to I being in

the queue at the beginning of cycle 2. Execution would then proceed as shown

below.

Note that the queue is always full, because at most one instruction is dispatched

and up to two instructions are fetched in any given cycle. Under these conditions,

the queue length would drop below 6 only in the case of a cache miss.

D1E1E1E1W1

I5 (Branch)

1 2 3 4 5 6 7 8 9Clock cycle

E2W2

D4W4

FkDkEk

Fk+1Dk+1

Ik+1

Ek+1

6 6 6 6 6 6 6 6 6Queue length 6

Time

…

Computer Organization Hamacher Instructor Manual Solution Chapter 8

Navigation menu

Versions of this User Manual:

Views

Navigation